A scalable and accurate targeted gene assembly tool (SAT-Assembler) for next-generation sequencing data.

Gene assembly, which recovers gene segments from short reads, is an important step in functional analysis of next-generation sequencing data. Lacking quality reference genomes, de novo assembly is commonly used for RNA-Seq data of non-model organisms and metagenomic data. However, heterogeneous sequ...


Bibliographic Details
Main Authors: Yuan Zhang, Yanni Sun, James R Cole
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2014-08-01
Series: PLoS Computational Biology
Online Access: http://europepmc.org/articles/PMC4133164?pdf=render
id doaj-50715fdebba34b0697895eebe2437fad
record_format Article
spelling doaj-50715fdebba34b0697895eebe2437fad; 2020-11-25T01:34:04Z; eng; Public Library of Science (PLoS); PLoS Computational Biology; 1553-734X; 1553-7358; 2014-08-01; vol. 10, no. 8, e1003737; doi:10.1371/journal.pcbi.1003737
collection DOAJ
language English
format Article
sources DOAJ
author Yuan Zhang
Yanni Sun
James R Cole
title A scalable and accurate targeted gene assembly tool (SAT-Assembler) for next-generation sequencing data.
publisher Public Library of Science (PLoS)
series PLoS Computational Biology
issn 1553-734X
1553-7358
publishDate 2014-08-01
description Gene assembly, which recovers gene segments from short reads, is an important step in functional analysis of next-generation sequencing data. Because quality reference genomes are often lacking, de novo assembly is commonly used for RNA-Seq data of non-model organisms and for metagenomic data. However, heterogeneous sequence coverage caused by heterogeneous expression or species abundance, similarity between isoforms or homologous genes, and large data size all pose challenges to de novo assembly. As a result, existing assembly tools tend to output fragmented or chimeric contigs, or to have a high memory footprint. In this work, we introduce SAT-Assembler, a targeted gene assembly program that aims to recover gene families of particular interest to biologists. It addresses the above challenges by conducting family-specific homology search, homology-guided overlap graph construction, and careful graph traversal. It can be applied to both RNA-Seq and metagenomic data. Our experimental results on an Arabidopsis RNA-Seq data set and two metagenomic data sets show that, when assembling a set of genes from one or multiple pathways, SAT-Assembler has lower memory usage, comparable or better gene coverage, and a lower chimera rate than other assembly tools. Moreover, the family-specific design and rapid homology search make SAT-Assembler naturally compatible with parallel computing platforms. The source code of SAT-Assembler is available at https://sourceforge.net/projects/sat-assembler/. The data sets and experimental settings can be found in the supplementary material.
url http://europepmc.org/articles/PMC4133164?pdf=render