Waste not, want not: why rarefying microbiome data is inadmissible.

Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these appr...

Bibliographic Details
Main Authors:	Paul J McMurdie, Susan Holmes
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2014-04-01
Series:	PLoS Computational Biology
Online Access:	http://europepmc.org/articles/PMC3974642?pdf=render

id	doaj-23df78e998f94d62ab52c12b001f49e5
record_format	Article
citation	PLoS Computational Biology 10(4): e1003531 (2014-04-01), doi:10.1371/journal.pcbi.1003531
collection	DOAJ
issn	1553-734X 1553-7358
description	Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.

Similar Items