Analysis and correction of compositional bias in sparse sequencing count data

Abstract. Background: Count data derived from high-throughput deoxyribonucleic acid (DNA) sequencing is frequently used in quantitative molecular assays. Due to properties inherent to the sequencing process, unnormalized count data is compositional, measuring relative and not absolute abundances of t...

Bibliographic Details
Main Authors: M. Senthil Kumar, Eric V. Slud, Kwame Okrah, Stephanie C. Hicks, Sridhar Hannenhalli, Héctor Corrada Bravo
Format: Article
Language: English
Published: BMC 2018-11-01
Series: BMC Genomics
Subjects: Compositional bias; Normalization; Empirical Bayes; Data integration; Count data; Metagenomics
Online Access: http://link.springer.com/article/10.1186/s12864-018-5160-5
id doaj-07388e3a527746048a5877c7081c3f38
record_format Article
spelling doaj-07388e3a527746048a5877c7081c3f38; 2020-11-25T01:22:01Z; eng; BMC; BMC Genomics; 1471-2164; 2018-11-01; vol. 19, no. 1, pp. 1-23; 10.1186/s12864-018-5160-5
Analysis and correction of compositional bias in sparse sequencing count data
M. Senthil Kumar (Graduate Program in Bioinformatics, University of Maryland)
Eric V. Slud (Department of Mathematics, University of Maryland)
Kwame Okrah (GRED Oncology Biostatistics, Genentech)
Stephanie C. Hicks (Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Harvard University)
Sridhar Hannenhalli (Center for Bioinformatics and Computational Biology, University of Maryland)
Héctor Corrada Bravo (Center for Bioinformatics and Computational Biology, University of Maryland)
Abstract. Background: Count data derived from high-throughput deoxyribonucleic acid (DNA) sequencing is frequently used in quantitative molecular assays. Due to properties inherent to the sequencing process, unnormalized count data is compositional, measuring relative and not absolute abundances of the assayed features. This compositional bias confounds inference of absolute abundances. Commonly used count data normalization approaches like library size scaling/rarefaction/subsampling cannot correct for compositional or any other relevant technical bias that is uncorrelated with library size. Results: We demonstrate that existing techniques for estimating compositional bias fail with sparse metagenomic 16S count data and propose an empirical Bayes normalization approach to overcome this problem. In addition, we clarify the assumptions underlying frequently used scaling normalization methods in light of compositional bias, including scaling methods that were not designed directly to address it. Conclusions: Compositional bias, induced by the sequencing machine, confounds inferences of absolute abundances. We present a normalization technique for compositional bias correction in sparse sequencing count data, and demonstrate its improved performance in metagenomic 16S survey data. Based on the distribution of technical bias estimates arising from several publicly available large scale 16S count datasets, we argue that detailed experiments specifically addressing the influence of compositional bias in metagenomics are needed.
http://link.springer.com/article/10.1186/s12864-018-5160-5
Compositional bias; Normalization; Empirical Bayes; Data integration; Count data; Metagenomics
collection DOAJ
language English
format Article
sources DOAJ
author M. Senthil Kumar
Eric V. Slud
Kwame Okrah
Stephanie C. Hicks
Sridhar Hannenhalli
Héctor Corrada Bravo
title Analysis and correction of compositional bias in sparse sequencing count data
publisher BMC
series BMC Genomics
issn 1471-2164
publishDate 2018-11-01
description Abstract. Background: Count data derived from high-throughput deoxyribonucleic acid (DNA) sequencing is frequently used in quantitative molecular assays. Due to properties inherent to the sequencing process, unnormalized count data is compositional, measuring relative and not absolute abundances of the assayed features. This compositional bias confounds inference of absolute abundances. Commonly used count data normalization approaches like library size scaling/rarefaction/subsampling cannot correct for compositional or any other relevant technical bias that is uncorrelated with library size. Results: We demonstrate that existing techniques for estimating compositional bias fail with sparse metagenomic 16S count data and propose an empirical Bayes normalization approach to overcome this problem. In addition, we clarify the assumptions underlying frequently used scaling normalization methods in light of compositional bias, including scaling methods that were not designed directly to address it. Conclusions: Compositional bias, induced by the sequencing machine, confounds inferences of absolute abundances. We present a normalization technique for compositional bias correction in sparse sequencing count data, and demonstrate its improved performance in metagenomic 16S survey data. Based on the distribution of technical bias estimates arising from several publicly available large scale 16S count datasets, we argue that detailed experiments specifically addressing the influence of compositional bias in metagenomics are needed.
topic Compositional bias
Normalization
Empirical Bayes
Data integration
Count data
Metagenomics
url http://link.springer.com/article/10.1186/s12864-018-5160-5