The Implementation and Performance Improvement of Genome Assembly on Hadoop MapReduce

Bibliographic Details
Main Authors: Chun-Yang Huang, 黃峻揚
Other Authors: Chien-Kang Huang
Format: Others
Language: zh-TW
Published: 2013
Online Access: http://ndltd.ncl.edu.tw/handle/62916499843042413787
Description
Summary: Master's === National Taiwan University === Graduate Institute of Engineering Science and Ocean Engineering === 101 === Genome assembly is the process of taking the reads and putting them back together to reproduce the original sequences. The process consumes large amounts of computing resources, which makes it hard to complete when assembling a large genome. Hadoop has been one of the most prominent topics of recent years. By constructing a distributed computing environment, Hadoop reduces local computation and avoids frequent data transfer between server and client. This thesis uses Contrail, an assembly tool developed by M. Schatz that combines genome assembly with Hadoop cloud computing. By exploiting distributed computation, Contrail addresses the difficulty most assemblers have in completing large genome assemblies. This thesis studies the changes to the Hadoop system architecture and API, and revises the Contrail code so that it runs on the current version of the Hadoop platform. Furthermore, we improve the performance of Contrail and compare its assembly results with those of Velvet and SOAPdenovo. We find that Contrail's results are similar to Velvet's for small genomes and closer to SOAPdenovo's for larger genomes. For large genomes, where Velvet and SOAPdenovo have difficulty completing the whole assembly process, Contrail completes it successfully.
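
To illustrate how an assembly step can be split into map and reduce phases, the following is a minimal single-machine sketch in Python. It is not Contrail's code and not the thesis's implementation. It assumes a de Bruijn graph formulation in which each k-mer contributes one edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and it ignores reverse complements, sequencing errors, and graph simplification. The function names (`map_read`, `shuffle`, `reduce_node`, `build_graph`) are hypothetical.

```python
from collections import defaultdict


def map_read(read, k):
    """Map step (illustrative): emit one (prefix, suffix) edge per k-mer."""
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        # Key is the (k-1)-mer prefix, value is the (k-1)-mer suffix.
        yield kmer[:-1], kmer[1:]


def shuffle(pairs):
    """Stand-in for the framework's shuffle: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_node(prefix, suffixes):
    """Reduce step (illustrative): count how often each successor appears."""
    counts = defaultdict(int)
    for suffix in suffixes:
        counts[suffix] += 1
    return prefix, dict(counts)


def build_graph(reads, k):
    """Run map, shuffle, and reduce in-process over all reads."""
    pairs = (pair for read in reads for pair in map_read(read, k))
    return dict(reduce_node(p, s) for p, s in shuffle(pairs).items())


if __name__ == "__main__":
    graph = build_graph(["ACGT", "CGTA"], k=3)
    for node, successors in sorted(graph.items()):
        print(node, successors)
```

In a real Hadoop job the shuffle is performed by the framework across machines, so only the map and reduce functions would be written by the user; the in-process `shuffle` here exists only to make the sketch runnable on its own.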