High Performance and Scalable Matching and Assembly of Biological Sequences

Bibliographic Details
Main Author:	Abu Doleh, Anas
Language:	English
Published:	The Ohio State University / OhioLINK 2016
Subjects:	Computer Engineering Bioinformatics bioinformatics sequence similarity indexing graphical processing unit Apache Spark de Bruijn graph de novo assembly metagenomics
Online Access:	http://rave.ohiolink.edu/etdc/view?acc_num=osu1469092998

id	ndltd-OhioLink-oai-etd.ohiolink.edu-osu1469092998
record_format	oai_dc
spelling	ndltd-OhioLink-oai-etd.ohiolink.edu-osu14690929982021-08-03T06:37:35Z High Performance and Scalable Matching and Assembly of Biological Sequences Abu Doleh, Anas Computer Engineering Bioinformatics bioinformatics sequence similarity indexing graphical processing unit Apache Spark de Bruijn graph de novo assembly metagenomics Next Generation Sequencing (NGS), the massive parallel and low-cost sequencing technology, is able to generate an enormous size of sequencing data. This facilitates the discovery of new genomic sequences and expands the biological and medical research. However, these big advancements in this technology also bring big computational challenges. In almost all NGS analysis pipelines, the most crucial and computationally intensive tasks are sequence similarity searching and de novo genome assembly. Thus, in this work, we introduced novel and efficient techniques to utilize the advancements in the High Performance Computing hardware and data computing platforms in order to accelerate these tasks while producing high quality results.For the sequence similarity search, we have studied utilizing the massively multithreaded architectures, such as Graphical Processing Unit (GPU), in accelerating and solving two important problems: reads mapping and maximal exact matching. Firstly, we introduced a new mapping tool, Masher, which processes long~(and short) reads efficiently and accurately. Masher employs a novel indexing technique that produces an index for huge genome, such as the human genome, with a small memory footprint such that it could be stored and efficiently accessed in a restricted-memory device such as a GPU. The results show that Masher is faster than state-of-the-art tools and obtains a good accuracy and sensitivity on sequencing data with various characteristics. Secondly, maximal exact matching problem has been studied because of its importance in detection and evaluating the similarity between sequences. We introduced a novel tool, GPUMEM, which efficiently utilizes GPU in building a lightweight indexing and finding maximal exact matches inside two genome sequences. The index construction is so fast that even by including its time, GPUMEM is faster in practice than state-of-the-art tools that use a pre-built index. De novo genome assembly is a crucial step in NGS analysis because of the novelty of discovered sequences. Firstly, we have studied parallelizing the de Bruijn graph based de novo genome assembly on distributed memory systems using Spark framework and GraphX API. We proposed a new tool, Spaler, which assembles short reads efficiently and accurately. Spaler starts with the de Bruijn graph construction. Then, it applies an iterative graph reduction and simplification techniques to generate contigs. After that, Spaler uses the reads mapping information to produce scaffolds. Spaler employs smart parallelism level tuning technique to improve the performance in each of these steps independently. The experiments show promising results in term of scalability, execution time and quality. Secondly, we addressed the problem of de novo metagenomics assembly. Spaler may not properly assemble the sequenced data extracted from environmental samples. This is because of the complexity and diversity of the living microbial communities. Thus, we introduced meta-Spaler, an extension of Spaler, to handle metagenomics dataset. meta-Spaler partitions the reads based on their expected coverage and applies an iterative assembly. The results show an improving in the assembly quality of meta-Spaler in comparison to the assembly of Spaler. 2016-12-21 English text The Ohio State University / OhioLINK http://rave.ohiolink.edu/etdc/view?acc_num=osu1469092998 http://rave.ohiolink.edu/etdc/view?acc_num=osu1469092998 unrestricted This thesis or dissertation is protected by copyright: all rights reserved. It may not be copied or redistributed beyond the terms of applicable copyright laws.
collection	NDLTD
language	English
sources	NDLTD
topic	Computer Engineering Bioinformatics bioinformatics sequence similarity indexing graphical processing unit Apache Spark de Bruijn graph de novo assembly metagenomics
spellingShingle	Computer Engineering Bioinformatics bioinformatics sequence similarity indexing graphical processing unit Apache Spark de Bruijn graph de novo assembly metagenomics Abu Doleh, Anas High Performance and Scalable Matching and Assembly of Biological Sequences
author	Abu Doleh, Anas
author_facet	Abu Doleh, Anas
author_sort	Abu Doleh, Anas
title	High Performance and Scalable Matching and Assembly of Biological Sequences
title_short	High Performance and Scalable Matching and Assembly of Biological Sequences
title_full	High Performance and Scalable Matching and Assembly of Biological Sequences
title_fullStr	High Performance and Scalable Matching and Assembly of Biological Sequences
title_full_unstemmed	High Performance and Scalable Matching and Assembly of Biological Sequences
title_sort	high performance and scalable matching and assembly of biological sequences
publisher	The Ohio State University / OhioLINK
publishDate	2016
url	http://rave.ohiolink.edu/etdc/view?acc_num=osu1469092998
work_keys_str_mv	AT abudolehanas highperformanceandscalablematchingandassemblyofbiologicalsequences
_version_	1719440491421892608

High Performance and Scalable Matching and Assembly of Biological Sequences

Similar Items