Systematic Comparisons for Composition Profiles, Taxonomic Levels, and Machine Learning Methods for Microbiome-Based Disease Prediction

Microbiome composition profiles generated from 16S rRNA sequencing have been extensively studied for their usefulness in phenotype trait prediction, including for complex diseases such as diabetes and obesity. These microbiome compositions have typically been quantified in the form of Operational Ta...

Bibliographic Details
Main Authors:	Kuncheng Song, Fred A. Wright, Yi-Hui Zhou
Format:	Article
Language:	English
Published:	Frontiers Media S.A. 2020-12-01
Series:	Frontiers in Molecular Biosciences
Subjects:	phenotype prediction, machine learning method, k-mers, operational taxonomic unit (OTU), amplicon sequence variant (ASV), phylogenetic analysis
Online Access:	https://www.frontiersin.org/articles/10.3389/fmolb.2020.610845/full

id	doaj-ecfa8f06a7794a2b8128ef06cab5a628
record_format	Article
spelling	doaj-ecfa8f06a7794a2b8128ef06cab5a628; 2020-12-16T05:21:48Z; eng; Frontiers Media S.A.; Frontiers in Molecular Biosciences; ISSN 2296-889X; 2020-12-01; volume 7; DOI 10.3389/fmolb.2020.610845; article 610845; Kuncheng Song (Bioinformatics Research Center, North Carolina State University, Raleigh, NC, United States); Fred A. Wright (Departments of Statistics and Biological Sciences, North Carolina State University, Raleigh, NC, United States); Yi-Hui Zhou (Department of Biological Sciences, North Carolina State University, Raleigh, NC, United States)
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Kuncheng Song, Fred A. Wright, Yi-Hui Zhou
author_sort	Kuncheng Song
title	Systematic Comparisons for Composition Profiles, Taxonomic Levels, and Machine Learning Methods for Microbiome-Based Disease Prediction
publisher	Frontiers Media S.A.
series	Frontiers in Molecular Biosciences
issn	2296-889X
publishDate	2020-12-01
description	Microbiome composition profiles generated from 16S rRNA sequencing have been extensively studied for their usefulness in phenotype trait prediction, including for complex diseases such as diabetes and obesity. These microbiome compositions have typically been quantified in the form of Operational Taxonomic Unit (OTU) count matrices. However, alternate approaches such as Amplicon Sequence Variants (ASV) have been used, as well as the direct use of k-mer sequence counts. The overall effect of these different types of predictors when used in concert with various machine learning methods has been difficult to assess, due to varied combinations described in the literature. Here we provide an in-depth investigation of more than 1,000 combinations of these three clustering/counting methods, in combination with varied choices for normalization and filtering, grouping at various taxonomic levels, and the use of more than ten commonly used machine learning methods for phenotype prediction. The use of short k-mers, which have computational advantages and conceptual simplicity, is shown to be effective as a source for microbiome-based prediction. Among machine-learning approaches, tree-based methods show consistent, though modest, advantages in prediction accuracy. We describe the various advantages and disadvantages of combinations in analysis approaches, and provide general observations to serve as a useful guide for future trait-prediction explorations using microbiome data.
topic	phenotype prediction, machine learning method, k-mers, operational taxonomic unit (OTU), amplicon sequence variant (ASV), phylogenetic analysis
url	https://www.frontiersin.org/articles/10.3389/fmolb.2020.610845/full

Similar Items