Estimation of sequencing error rates in short reads

Abstract. Background: Short-read data from next-generation sequencing technologies are now being generated across a range of research projects. The fidelity of these data can be affected by several factors and it is important to have simple and reliable ap...

Bibliographic Details
Main Authors:	Victoria Xin, Blades Natalie, Ding Jie, Sultana Razvan, Parmigiani Giovanni
Format:	Article
Language:	English
Published:	BMC 2012-07-01
Series:	BMC Bioinformatics
Online Access:	http://www.biomedcentral.com/1471-2105/13/185

id	doaj-e63518ff5e7245e4bdeb7b3c4d37b28c
record_format	Article
spelling	doaj-e63518ff5e7245e4bdeb7b3c4d37b28c 2020-11-24T21:25:19Z eng BMC BMC Bioinformatics 1471-2105 2012-07-01 13 1 185 10.1186/1471-2105-13-185 Estimation of sequencing error rates in short reads Victoria Xin, Blades Natalie, Ding Jie, Sultana Razvan, Parmigiani Giovanni http://www.biomedcentral.com/1471-2105/13/185
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Victoria Xin Blades Natalie Ding Jie Sultana Razvan Parmigiani Giovanni
author_sort	Victoria Xin
title	Estimation of sequencing error rates in short reads
publisher	BMC
series	BMC Bioinformatics
issn	1471-2105
publishDate	2012-07-01
description	Abstract. Background: Short-read data from next-generation sequencing technologies are now being generated across a range of research projects. The fidelity of these data can be affected by several factors, and it is important to have simple and reliable approaches for monitoring it at the level of individual experiments. Results: We developed a fast, scalable and accurate approach to estimating error rates in short reads, which has the added advantage of not requiring a reference genome. We build on the fundamental observation that there is a linear relationship between the copy number for a given read and the number of erroneous reads that differ from the read of interest by one or two bases. The slope of this relationship can be transformed to give an estimate of the error rate, both by read and by position. We present simulation studies as well as analyses of real data sets illustrating the precision and accuracy of this method, and we show that it is more accurate than alternatives that count the differences between the sample of interest and a reference genome. We show how this methodology led to the detection of mutations in the genome of the PhiX strain used for calibration of Illumina data. The proposed method is implemented in an R package, which can be downloaded from http://bcb.dfci.harvard.edu/~vwang/shadowRegression.html. Conclusions: The proposed method can be used to monitor the quality of sequencing pipelines at the level of individual experiments without the use of reference genomes. Furthermore, having an estimate of the error rates gives one the opportunity to improve analyses and inferences in many applications of next-generation sequencing data.
url	http://www.biomedcentral.com/1471-2105/13/185
work_keys_str_mv	AT victoriaxin estimationofsequencingerrorratesinshortreads AT bladesnatalie estimationofsequencingerrorratesinshortreads AT dingjie estimationofsequencingerrorratesinshortreads AT sultanarazvan estimationofsequencingerrorratesinshortreads AT parmigianigiovanni estimationofsequencingerrorratesinshortreads
_version_	1725983443670532096
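
Illustrative sketch (not part of the catalogued record and not the authors' R package): the abstract describes regressing the number of erroneous "shadow" reads, those differing from a given read by one or two bases, on that read's copy number, and transforming the slope into an error rate. The Python below is a minimal toy version of that idea under stated assumptions. All function names, the `min_copies` threshold, the through-the-origin least-squares fit, and the transformation `slope / (1 + slope)` are assumptions made for illustration; the last follows from assuming that error-free copies scale with (1 - e) and shadows with e for a per-read error rate e, and may differ from the transformation used in the paper.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def shadow_points(reads, min_copies=20, max_mismatches=2):
    """Return (copy_number, shadow_count) pairs for abundant sequences.

    Assumption: a sequence seen at least `min_copies` times is treated as
    error-free, and every other read of the same length within
    `max_mismatches` mismatches of it is counted as one of its shadows.
    The pairwise comparison is quadratic in the number of distinct
    sequences, so this is only suitable for toy inputs.
    """
    counts = Counter(reads)
    points = []
    for seq, n in counts.items():
        if n < min_copies:
            continue
        shadows = sum(
            m
            for other, m in counts.items()
            if len(other) == len(seq)
            and 1 <= hamming(seq, other) <= max_mismatches
        )
        points.append((n, shadows))
    return points


def slope_through_origin(points):
    """Least-squares slope of shadow count on copy number, no intercept."""
    den = sum(n * n for n, _ in points)
    if den == 0:
        raise ValueError("no abundant sequences to regress on")
    return sum(n * s for n, s in points) / den


def read_error_rate(slope):
    """Assumed transformation: slope = e / (1 - e), so e = slope / (1 + slope)."""
    return slope / (1.0 + slope)


def per_base_error_rate(read_rate, read_length):
    """Per-base rate, assuming independent and identical errors per position."""
    return 1.0 - (1.0 - read_rate) ** (1.0 / read_length)
```

A typical call would pass a list of read strings to `shadow_points`, feed the result to `slope_through_origin`, and convert the slope with `read_error_rate`. No reference genome is involved at any step, which is the property the abstract emphasises.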

Similar Items