Statistical analysis of co-occurrence patterns in microbial presence-absence datasets.

Drawing on a long history in macroecology, correlation analysis of microbiome datasets is becoming a common practice for identifying relationships or shared ecological niches among bacterial taxa. However, many of the statistical issues that plague such analyses in macroscale communities remain unre...


Bibliographic Details
Main Authors: Kumar P Mainali, Sharon Bewick, Peter Thielen, Thomas Mehoke, Florian P Breitwieser, Shishir Paudel, Arjun Adhikari, Joshua Wolfe, Eric V Slud, David Karig, William F Fagan
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2017-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC5689832?pdf=render
id doaj-3e7bfcd1f299447ba023b8d9804aa7e6
record_format Article
spelling doaj-3e7bfcd1f299447ba023b8d9804aa7e6 | 2020-11-25T01:42:51Z | eng | Public Library of Science (PLoS) | PLoS ONE | 1932-6203 | 2017-01-01 | 12 | 11 | e0187132 | 10.1371/journal.pone.0187132 | Statistical analysis of co-occurrence patterns in microbial presence-absence datasets. | Kumar P Mainali; Sharon Bewick; Peter Thielen; Thomas Mehoke; Florian P Breitwieser; Shishir Paudel; Arjun Adhikari; Joshua Wolfe; Eric V Slud; David Karig; William F Fagan | http://europepmc.org/articles/PMC5689832?pdf=render
collection DOAJ
language English
format Article
sources DOAJ
author Kumar P Mainali
Sharon Bewick
Peter Thielen
Thomas Mehoke
Florian P Breitwieser
Shishir Paudel
Arjun Adhikari
Joshua Wolfe
Eric V Slud
David Karig
William F Fagan
title Statistical analysis of co-occurrence patterns in microbial presence-absence datasets.
publisher Public Library of Science (PLoS)
series PLoS ONE
issn 1932-6203
publishDate 2017-01-01
description Drawing on a long history in macroecology, correlation analysis of microbiome datasets is becoming a common practice for identifying relationships or shared ecological niches among bacterial taxa. However, many of the statistical issues that plague such analyses in macroscale communities remain unresolved for microbial communities. Here, we discuss problems in the analysis of microbial species correlations based on presence-absence data. We focus on presence-absence data because this information is more readily obtainable from sequencing studies, especially for whole-genome sequencing, where abundance estimation is still in its infancy. First, we show how Pearson's correlation coefficient (r) and Jaccard's index (J), two of the most common metrics for correlation analysis of presence-absence data, can contradict each other when applied to a typical microbiome dataset. In our dataset, for example, 14% of species-pairs predicted to be significantly correlated by r were not predicted to be significantly correlated using J, while 37.4% of species-pairs predicted to be significantly correlated by J were not predicted to be significantly correlated using r. Mismatch was particularly common among species-pairs with at least one rare species (<10% prevalence), explaining why r and J might differ more strongly in microbiome datasets, where there are large numbers of rare taxa. Indeed, 74% of all species-pairs in our study had at least one rare species. Next, we show how Pearson's correlation coefficient can result in artificial inflation of positive taxon relationships and how this is a particular problem for microbiome studies. We then illustrate how Jaccard's index of similarity (J) can yield improvements over Pearson's correlation coefficient. However, the standard null model for Jaccard's index is flawed, and thus introduces its own set of spurious conclusions. We thus identify a better null model based on a hypergeometric distribution, which appropriately corrects for species prevalence. This model is available from recent statistics literature, and can be used for evaluating the significance of any value of an empirically observed Jaccard's index. The resulting simple yet effective method for handling correlation analysis of microbial presence-absence datasets provides a robust means of testing and finding relationships and/or shared environmental responses among microbial taxa.
url http://europepmc.org/articles/PMC5689832?pdf=render
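The description above contrasts Pearson's r and Jaccard's J on presence-absence data and points to a hypergeometric null model that corrects for species prevalence. The following is a minimal sketch of those three quantities for a single species pair, not the authors' implementation. It assumes each species is a 0/1 vector over the same samples, and that under the null the number of shared samples follows a hypergeometric distribution with both prevalences held fixed. Because J increases with the number of shared samples when prevalences are fixed, the upper-tail probability of the shared count is also the upper-tail probability of J. All function names are hypothetical.

```python
from math import comb, sqrt


def contingency(x, y):
    """Count samples with both species (a), only x (b), only y (c), neither (d)."""
    if len(x) != len(y):
        raise ValueError("presence-absence vectors must have equal length")
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)
    c = sum(1 for xi, yi in zip(x, y) if not xi and yi)
    d = len(x) - a - b - c
    return a, b, c, d


def jaccard(x, y):
    """Jaccard's index: shared samples over samples containing either species."""
    a, b, c, _ = contingency(x, y)
    return a / (a + b + c) if (a + b + c) else float("nan")


def pearson_binary(x, y):
    """Pearson's r for two 0/1 vectors (equivalent to the phi coefficient)."""
    a, b, c, d = contingency(x, y)
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else float("nan")


def hypergeom_upper_tail(x, y):
    """P(shared count >= observed) with both prevalences fixed (assumed null)."""
    a, b, c, d = contingency(x, y)
    n_samples = a + b + c + d
    m = a + b  # prevalence of species x
    n = a + c  # prevalence of species y
    total = comb(n_samples, n)
    tail = sum(
        comb(m, k) * comb(n_samples - m, n - k)
        for k in range(a, min(m, n) + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Two rare species that always co-occur in a toy set of 10 samples.
    sp1 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    sp2 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    print(jaccard(sp1, sp2), pearson_binary(sp1, sp2), hypergeom_upper_tail(sp1, sp2))
```

The sketch uses only the standard library so the combinatorics stay visible. For large datasets a library routine such as SciPy's hypergeometric distribution would likely be preferable, and testing many species pairs would also call for a multiple-comparison correction, which is omitted here.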