An application of Random Forests to a genome-wide association dataset: Methodological considerations & new findings

Abstract. Background: As computational power improves, the application of more advanced machine learning techniques to the analysis of large genome-wide association (GWA) datasets becomes possible. While most traditional statistical methods can only eluci...

Bibliographic Details
Main Authors: Hubbard Alan E, Goldstein Benjamin A, Cutler Adele, Barcellos Lisa F
Format: Article
Language: English
Published: BMC 2010-06-01
Series: BMC Genetics
Online Access: http://www.biomedcentral.com/1471-2156/11/49
id doaj-c12c501b18ff47288297b0ef26cb1b64
record_format Article
spelling doaj-c12c501b18ff47288297b0ef26cb1b64 2020-11-25T03:07:17Z eng BMC BMC Genetics 1471-2156 2010-06-01 11 1 49 10.1186/1471-2156-11-49 An application of Random Forests to a genome-wide association dataset: Methodological considerations & new findings Hubbard Alan E Goldstein Benjamin A Cutler Adele Barcellos Lisa F http://www.biomedcentral.com/1471-2156/11/49
collection DOAJ
language English
format Article
sources DOAJ
author Hubbard Alan E
Goldstein Benjamin A
Cutler Adele
Barcellos Lisa F
title An application of Random Forests to a genome-wide association dataset: Methodological considerations & new findings
publisher BMC
series BMC Genetics
issn 1471-2156
publishDate 2010-06-01
description Abstract. Background: As computational power improves, the application of more advanced machine learning techniques to the analysis of large genome-wide association (GWA) datasets becomes possible. While most traditional statistical methods can only elucidate main effects of genetic variants on risk for disease, certain machine learning approaches are particularly suited to discover higher order and non-linear effects. One such approach is the Random Forests (RF) algorithm. The use of RF for SNP discovery related to human disease has grown in recent years; however, most work has focused on small datasets or simulation studies which are limited. Results: Using a multiple sclerosis (MS) case-control dataset comprised of 300 K SNP genotypes across the genome, we outline an approach and some considerations for optimally tuning the RF algorithm based on the empirical dataset. Importantly, results show that typical default parameter values are not appropriate for large GWA datasets. Furthermore, gains can be made by sub-sampling the data, pruning based on linkage disequilibrium (LD), and removing strong effects from RF analyses. The new RF results are compared to findings from the original MS GWA study and demonstrate overlap. In addition, four new interesting candidate MS genes are identified, MPHOSPH9, CTNNA3, PHACTR2 and IL7, by RF analysis and warrant further follow-up in independent studies. Conclusions: This study presents one of the first illustrations of successfully analyzing GWA data with a machine learning algorithm. It is shown that RF is computationally feasible for GWA data and the results obtained make biologic sense based on previous studies. More importantly, new genes were identified as potentially being associated with MS, suggesting new avenues of investigation for this complex disease.
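The abstract mentions pruning SNPs based on linkage disequilibrium (LD) before running RF, without giving details in this record. As a rough illustration only, the sketch below shows one common greedy form of LD pruning: a SNP is kept only if its squared correlation (r^2) with recently kept SNPs stays below a cutoff. The function names, the 0.8 threshold, the 50-SNP window, and the toy genotypes are all assumptions made for this example and are not taken from the article.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 allele counts)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated here as unlinked
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def ld_prune(genotypes, threshold=0.8, window=50):
    """Greedy LD pruning (hypothetical helper, assumed parameters).

    `genotypes` is a list of per-SNP vectors in chromosomal order.
    A SNP is kept only if its r^2 with each of the last `window`
    kept SNPs is below `threshold`. Returns the indices of kept SNPs.
    """
    kept = []
    for j, snp in enumerate(genotypes):
        recent = kept[-window:]
        if all(r_squared(snp, genotypes[k]) < threshold for k in recent):
            kept.append(j)
    return kept


# Toy data: SNP 1 is an exact copy of SNP 0 (perfect LD); SNP 2 is independent.
rng = random.Random(0)
base = [rng.choice([0, 1, 2]) for _ in range(200)]
copy = list(base)
indep = [rng.choice([0, 1, 2]) for _ in range(200)]

kept = ld_prune([base, copy, indep])
print(kept)
```

In this toy case the duplicated SNP is dropped and the other two are retained. Reducing redundant, correlated predictors in this way is one plausible reading of the "pruning based on LD" step; the thresholds actually used would have to come from the full article.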
url http://www.biomedcentral.com/1471-2156/11/49