Machine Learning Meta-analysis of Large Metagenomic Datasets: Tools and Biological Insights.

Bibliographic Details
Main Authors: Edoardo Pasolli, Duy Tin Truong, Faizan Malik, Levi Waldron, Nicola Segata
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2016-07-01
Series: PLoS Computational Biology
Online Access: http://europepmc.org/articles/PMC4939962?pdf=render
id doaj-878ed84a65bf4861b0625ddaba2cc5e9
record_format Article
spelling doaj-878ed84a65bf4861b0625ddaba2cc5e9; 2020-11-25T01:44:26Z; eng; Public Library of Science (PLoS); PLoS Computational Biology; ISSN 1553-734X, 1553-7358; 2016-07-01; vol. 12, no. 7, e1004977; doi:10.1371/journal.pcbi.1004977
collection DOAJ
language English
format Article
sources DOAJ
author Edoardo Pasolli
Duy Tin Truong
Faizan Malik
Levi Waldron
Nicola Segata
title Machine Learning Meta-analysis of Large Metagenomic Datasets: Tools and Biological Insights.
publisher Public Library of Science (PLoS)
series PLoS Computational Biology
issn 1553-734X
1553-7358
publishDate 2016-07-01
description Shotgun metagenomic analysis of the human-associated microbiome provides a rich set of microbial features for prediction and biomarker discovery in the context of human diseases and health conditions. However, the use of such high-resolution microbial features presents new challenges, and validated computational tools for learning tasks are lacking. Moreover, classification rules have scarcely been validated in independent studies, posing questions about the generality and generalization of disease-predictive models across cohorts. In this paper, we comprehensively assess approaches to metagenomics-based prediction tasks and to the quantitative assessment of the strength of potential microbiome-phenotype associations. We develop a computational framework for prediction tasks using quantitative microbiome profiles, including species-level relative abundances and presence of strain-specific markers. A comprehensive meta-analysis, with particular emphasis on generalization across cohorts, was performed on a collection of 2424 publicly available metagenomic samples from eight large-scale studies. Cross-validation revealed good disease-prediction capabilities, which were in general improved by feature selection and use of strain-specific markers instead of species-level taxonomic abundance. In cross-study analysis, models transferred between studies were in some cases less accurate than models tested by within-study cross-validation. Interestingly, the addition of healthy (control) samples from other studies to training sets improved disease-prediction capabilities. Some microbial species (most notably Streptococcus anginosus) seem to characterize general dysbiotic states of the microbiome rather than connections with a specific disease. Our results in modelling features of the "healthy" microbiome can be considered a first step toward defining general microbial dysbiosis. The software framework, microbiome profiles, and metadata for thousands of samples are publicly available at http://segatalab.cibio.unitn.it/tools/metaml.
url http://europepmc.org/articles/PMC4939962?pdf=render