Statistical analysis of co-occurrence patterns in microbial presence-absence datasets.
Drawing on a long history in macroecology, correlation analysis of microbiome datasets is becoming a common practice for identifying relationships or shared ecological niches among bacterial taxa. However, many of the statistical issues that plague such analyses in macroscale communities remain unre...
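Main Authors: | Kumar P Mainali, Sharon Bewick, Peter Thielen, Thomas Mehoke, Florian P Breitwieser, Shishir Paudel, Arjun Adhikari, Joshua Wolfe, Eric V Slud, David Karig, William F Fagan |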
---|---|
Format: | Article |
Language: | English |
Published: | Public Library of Science (PLoS), 2017-01-01 |
Series: | PLoS ONE |
Online Access: | http://europepmc.org/articles/PMC5689832?pdf=render |
id |
doaj-3e7bfcd1f299447ba023b8d9804aa7e6 |
---|---|
record_format |
Article |
spelling |
doaj-3e7bfcd1f299447ba023b8d9804aa7e6; 2020-11-25T01:42:51Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2017-01-01; 12; 11; e0187132; 10.1371/journal.pone.0187132; Statistical analysis of co-occurrence patterns in microbial presence-absence datasets.; Kumar P Mainali, Sharon Bewick, Peter Thielen, Thomas Mehoke, Florian P Breitwieser, Shishir Paudel, Arjun Adhikari, Joshua Wolfe, Eric V Slud, David Karig, William F Fagan; http://europepmc.org/articles/PMC5689832?pdf=render |
collection |
DOAJ |
sources |
DOAJ |
author |
Kumar P Mainali, Sharon Bewick, Peter Thielen, Thomas Mehoke, Florian P Breitwieser, Shishir Paudel, Arjun Adhikari, Joshua Wolfe, Eric V Slud, David Karig, William F Fagan |
title |
Statistical analysis of co-occurrence patterns in microbial presence-absence datasets. |
publisher |
Public Library of Science (PLoS) |
series |
PLoS ONE |
issn |
1932-6203 |
publishDate |
2017-01-01 |
description |
Drawing on a long history in macroecology, correlation analysis of microbiome datasets is becoming a common practice for identifying relationships or shared ecological niches among bacterial taxa. However, many of the statistical issues that plague such analyses in macroscale communities remain unresolved for microbial communities. Here, we discuss problems in the analysis of microbial species correlations based on presence-absence data. We focus on presence-absence data because this information is more readily obtainable from sequencing studies, especially for whole-genome sequencing, where abundance estimation is still in its infancy. First, we show how Pearson's correlation coefficient (r) and Jaccard's index (J), two of the most common metrics for correlation analysis of presence-absence data, can contradict each other when applied to a typical microbiome dataset. In our dataset, for example, 14% of species-pairs predicted to be significantly correlated by r were not predicted to be significantly correlated using J, while 37.4% of species-pairs predicted to be significantly correlated by J were not predicted to be significantly correlated using r. Mismatch was particularly common among species-pairs with at least one rare species (<10% prevalence), explaining why r and J might differ more strongly in microbiome datasets, where there are large numbers of rare taxa. Indeed, 74% of all species-pairs in our study had at least one rare species. Next, we show how Pearson's correlation coefficient can result in artificial inflation of positive taxon relationships and how this is a particular problem for microbiome studies. We then illustrate how Jaccard's index of similarity (J) can yield improvements over Pearson's correlation coefficient. However, the standard null model for Jaccard's index is flawed, and thus introduces its own set of spurious conclusions. We thus identify a better null model based on a hypergeometric distribution, which appropriately corrects for species prevalence. This model is available from recent statistics literature, and can be used for evaluating the significance of any value of an empirically observed Jaccard's index. The resulting simple yet effective method for handling correlation analysis of microbial presence-absence datasets provides a robust means of testing and finding relationships and/or shared environmental responses among microbial taxa. |
url |
http://europepmc.org/articles/PMC5689832?pdf=render |
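The abstract describes comparing Pearson's r with Jaccard's index J on presence-absence data and testing J against a hypergeometric null that conditions on species prevalence. The sketch below is one way such a test could look; it is not the authors' code. It assumes that, with the two prevalences held fixed, the number of shared samples follows a hypergeometric distribution, and that J increases with that count, so the upper tail for J equals the upper tail for the co-occurrence count. The function name `cooccurrence_stats` and the example vectors are hypothetical.

```python
from math import comb, sqrt


def cooccurrence_stats(a, b):
    """Return (jaccard, pearson_r, p_upper) for two 0/1 presence vectors.

    Hypothetical helper for illustration only; `a` and `b` hold one entry
    per sample (1 = taxon present, 0 = absent).
    """
    if len(a) != len(b):
        raise ValueError("presence vectors must have the same length")
    n = len(a)                      # number of samples
    m1, m2 = sum(a), sum(b)         # prevalence of each taxon
    k = sum(1 for x, y in zip(a, b) if x and y)  # samples with both taxa

    # Jaccard's index: shared samples over samples with either taxon.
    union = m1 + m2 - k
    jaccard = k / union if union else float("nan")

    # Pearson's r on binary data (the phi coefficient).
    denom = sqrt(m1 * (n - m1) * m2 * (n - m2))
    pearson_r = (n * k - m1 * m2) / denom if denom else float("nan")

    # Hypergeometric upper tail: probability of at least k co-occurrences
    # when m2 presences are placed at random among n samples, m1 of which
    # already contain the first taxon. math.comb returns 0 when the lower
    # argument exceeds the upper one, so impossible terms drop out.
    total = comb(n, m2)
    p_upper = sum(
        comb(m1, i) * comb(n - m1, m2 - i)
        for i in range(k, min(m1, m2) + 1)
    ) / total
    return jaccard, pearson_r, p_upper


# Two taxa, each present in 3 of 10 samples and sharing 2 of them.
taxon_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
taxon_b = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
j, r, p = cooccurrence_stats(taxon_a, taxon_b)
print(f"J = {j:.3f}, r = {r:.3f}, one-sided p = {p:.3f}")
```

Because the null is conditioned on both prevalences, a pair of rare taxa is compared only against arrangements with the same rarity, which is the prevalence correction the abstract refers to. A two-sided test or a multiple-testing adjustment across all species-pairs would need to be added separately.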