Harvestman: a framework for hierarchical feature learning and selection from whole genome sequencing data

Abstract. Background: Supervised learning from high-throughput sequencing data presents many challenges. For one, the curse of dimensionality often leads to overfitting as well as issues with scalability. This can bring about inaccurate models or those that require extensive compute time and resources...


Bibliographic Details
Main Authors: Trevor S. Frisby, Shawn J. Baker, Guillaume Marçais, Quang Minh Hoang, Carl Kingsford, Christopher J. Langmead
Format: Article
Language: English
Published: BMC 2021-04-01
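Series: BMC Bioinformatics
Subjects: Feature selection; Hierarchical feature spaces; Knowledge graphs; Integer linear programming; Machine learning
Online Access: https://doi.org/10.1186/s12859-021-04096-6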
id doaj-1fc059f1c09743c3894a51417a32d3cc
record_format Article
spelling doaj-1fc059f1c09743c3894a51417a32d3cc; 2021-04-04T11:45:28Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2021-04-01; vol. 22, no. 1, pp. 1-19; 10.1186/s12859-021-04096-6
Trevor S. Frisby (Computational Biology Department, Carnegie Mellon University)
Shawn J. Baker (Computational Biology Department, Carnegie Mellon University)
Guillaume Marçais (Computational Biology Department, Carnegie Mellon University)
Quang Minh Hoang (Computer Science Department, Carnegie Mellon University)
Carl Kingsford (Computational Biology Department, Carnegie Mellon University)
Christopher J. Langmead (Computational Biology Department, Carnegie Mellon University)
collection DOAJ
language English
format Article
sources DOAJ
author Trevor S. Frisby
Shawn J. Baker
Guillaume Marçais
Quang Minh Hoang
Carl Kingsford
Christopher J. Langmead
title Harvestman: a framework for hierarchical feature learning and selection from whole genome sequencing data
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2021-04-01
description Abstract. Background: Supervised learning from high-throughput sequencing data presents many challenges. For one, the curse of dimensionality often leads to overfitting as well as issues with scalability. This can bring about inaccurate models or those that require extensive compute time and resources. Additionally, variant calls may not be the optimal encoding for a given learning task, which also contributes to poor predictive capabilities. To address these issues, we present Harvestman, a method that takes advantage of hierarchical relationships among the possible biological interpretations and representations of genomic variants to perform automatic feature learning, feature selection, and model building. Results: We demonstrate that Harvestman scales to thousands of genomes comprising more than 84 million variants by processing phase 3 data from the 1000 Genomes Project, one of the largest publicly available collections of whole genome sequences. Using breast cancer data from The Cancer Genome Atlas, we show that Harvestman selects a rich combination of representations that are adapted to the learning task, and performs better than a binary representation of SNPs alone. We compare Harvestman to existing feature selection methods and demonstrate that our method is more parsimonious: it selects smaller and less redundant feature subsets while maintaining the accuracy of the resulting classifier. Conclusion: Harvestman is a hierarchical feature selection approach for supervised model building from variant call data. By building a knowledge graph over genomic variants and solving an integer linear program, Harvestman automatically and optimally finds the right encoding for genomic variants. Compared to other hierarchical feature selection methods, Harvestman is faster and selects features more parsimoniously.
topic Feature selection
Hierarchical feature spaces
Knowledge graphs
Integer linear programming
Machine learning
url https://doi.org/10.1186/s12859-021-04096-6