NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model
Abstract. Background: The PacBio sequencing platform offers longer read lengths than second-generation sequencing technologies. It has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. Due to its extremely wide range of application areas, fas...
Main Authors: | Ze-Gang Wei, Shao-Wu Zhang |
---|---|
Format: | Article |
Language: | English |
Published: | BMC 2018-05-01 |
Series: | BMC Bioinformatics |
Subjects: | Sequence simulator; Quality value; Continuous long reads; SMRT; PacBio |
Online Access: | http://link.springer.com/article/10.1186/s12859-018-2208-0 |
id |
doaj-b4ed667ced9d423491ce6f81e31415dd |
---|---|
record_format |
Article |
spelling |
doaj-b4ed667ced9d423491ce6f81e31415dd 2020-11-25T02:20:28Z eng BMC BMC Bioinformatics 1471-2105 2018-05-01 19119 10.1186/s12859-018-2208-0 NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model Ze-Gang Wei 0 Shao-Wu Zhang 1 Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University Abstract. Background: The PacBio sequencing platform offers longer read lengths than second-generation sequencing technologies. It has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. Because of its extremely wide range of application areas, fast, high-fidelity sequencing simulation systems are in great demand to facilitate the development and comparison of downstream analysis tools. Although several simulators (e.g., PBSIM, SimLoRD and FASTQSim) specifically target the generation of PacBio libraries, the error rate of their simulated sequences is not well matched to the quality values of raw PacBio datasets, especially for PacBio’s continuous long reads (CLR). Results: By analyzing the characteristic features of CLR data from PacBio SMRT (single molecule real time) sequencing, we developed a new PacBio sequencing simulator, NPBSS, for producing CLR reads. NPBSS first samples read sequences according to a log-normal read length distribution and chooses base quality values in different proportions. It then computes the overall error probability of each base in the read with an empirical model, and derives the deletion, substitution and insertion probabilities from that overall error probability to generate the PacBio CLR reads.
Alignment results demonstrate that NPBSS fits the error rate of PacBio CLR reads better than PBSIM and FASTQSim. In addition, assembly results show that the sequences simulated by NPBSS more closely resemble real PacBio CLR data. Conclusion: NPBSS is convenient to use, computationally efficient, and flexible in its parameter settings. The PacBio CLR reads it generates more closely resemble real PacBio datasets. http://link.springer.com/article/10.1186/s12859-018-2208-0 Sequence simulator Quality value Continuous long reads SMRT PacBio |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Ze-Gang Wei Shao-Wu Zhang |
author_sort |
Ze-Gang Wei |
title |
NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model |
title_sort |
npbss: a new pacbio sequencing simulator for generating the continuous long reads with an empirical model |
publisher |
BMC |
series |
BMC Bioinformatics |
issn |
1471-2105 |
publishDate |
2018-05-01 |
description |
Abstract. Background: The PacBio sequencing platform offers longer read lengths than second-generation sequencing technologies. It has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. Because of its extremely wide range of application areas, fast, high-fidelity sequencing simulation systems are in great demand to facilitate the development and comparison of downstream analysis tools. Although several simulators (e.g., PBSIM, SimLoRD and FASTQSim) specifically target the generation of PacBio libraries, the error rate of their simulated sequences is not well matched to the quality values of raw PacBio datasets, especially for PacBio’s continuous long reads (CLR). Results: By analyzing the characteristic features of CLR data from PacBio SMRT (single molecule real time) sequencing, we developed a new PacBio sequencing simulator, NPBSS, for producing CLR reads. NPBSS first samples read sequences according to a log-normal read length distribution and chooses base quality values in different proportions. It then computes the overall error probability of each base in the read with an empirical model, and derives the deletion, substitution and insertion probabilities from that overall error probability to generate the PacBio CLR reads. Alignment results demonstrate that NPBSS fits the error rate of PacBio CLR reads better than PBSIM and FASTQSim. In addition, assembly results show that the sequences simulated by NPBSS more closely resemble real PacBio CLR data. Conclusion: NPBSS is convenient to use, computationally efficient, and flexible in its parameter settings. The PacBio CLR reads it generates more closely resemble real PacBio datasets. |
topic |
Sequence simulator Quality value Continuous long reads SMRT PacBio |
url |
http://link.springer.com/article/10.1186/s12859-018-2208-0 |
work_keys_str_mv |
AT zegangwei npbssanewpacbiosequencingsimulatorforgeneratingthecontinuouslongreadswithanempiricalmodel AT shaowuzhang npbssanewpacbiosequencingsimulatorforgeneratingthecontinuouslongreadswithanempiricalmodel |
_version_ |
1724871011085582336 |
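
The pipeline summarized in the abstract (sample a log-normal read length, assign quality values in fixed proportions, turn each quality value into an error probability, then split that probability among deletions, substitutions and insertions) can be sketched roughly as follows. This is an illustrative sketch only, not the NPBSS implementation: the function names, every default parameter, and the error-type split are hypothetical placeholders, and the standard Phred relation P = 10^(-Q/10) is swapped in where NPBSS uses its own empirical model, which this record does not describe.

```python
import math
import random


def phred_to_error_prob(q):
    """Standard Phred relation P = 10^(-Q/10).

    Assumption: stands in for the NPBSS empirical model, which is not
    given in this record.
    """
    return 10 ** (-q / 10)


def simulate_read(reference, rng, mu=8.0, sigma=0.5,
                  qualities=(5, 10, 15), weights=(0.2, 0.5, 0.3),
                  error_split=(0.3, 0.1, 0.6)):
    """Return (noisy_read, error_free_template) for one simulated read.

    All defaults are placeholders, not values from the paper.
    error_split = (deletion, substitution, insertion) fractions of the
    per-base error probability; they are expected to sum to 1.
    """
    # 1. Sample a read length from a log-normal distribution, clamped
    #    to the range [1, len(reference)].
    length = int(rng.lognormvariate(mu, sigma))
    length = max(1, min(len(reference), length))

    # 2. Pick a uniformly random start position on the reference.
    start = rng.randrange(len(reference) - length + 1)
    template = reference[start:start + length]

    # 3. Draw one quality value per base with the given proportions.
    quals = rng.choices(qualities, weights=weights, k=length)

    del_frac, sub_frac, _ins_frac = error_split
    read = []
    for base, q in zip(template, quals):
        p = phred_to_error_prob(q)   # overall error probability
        r = rng.random()
        if r < p * del_frac:
            continue                 # deletion: drop the base
        if r < p * (del_frac + sub_frac):
            # substitution: replace with a different base
            read.append(rng.choice([b for b in "ACGT" if b != base]))
        elif r < p:
            # insertion: keep the base, then add a random extra base
            read.append(base)
            read.append(rng.choice("ACGT"))
        else:
            read.append(base)        # no error
    return "".join(read), template


if __name__ == "__main__":
    rng = random.Random(42)
    reference = "".join(rng.choice("ACGT") for _ in range(20000))
    read, template = simulate_read(reference, rng)
    print(len(template), len(read))
```

Passing an explicit `random.Random` instance keeps runs reproducible. A fuller simulator would also emit per-base quality strings (for example in FASTQ format) and fit `mu`, `sigma`, and the quality proportions to a real CLR dataset.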