A random forest approach to the detection of epistatic interactions in case-control studies

Bibliographic Details
Main Authors: Wu Xuebing, Tang Wanwan, Jiang Rui, Fu Wenhui
Format: Article
Language: English
Published: BMC 2009-01-01
Series: BMC Bioinformatics
id doaj-7cbeb7fb07354a508c39117b198ff8c1
volume 10
issue Suppl 1
pages S65
doi 10.1186/1471-2105-10-S1-S65
collection DOAJ
issn 1471-2105
description Abstract
Background: Although epistatic interactions between multiple genetic variants play key roles in the pathogenesis of complex diseases, detecting such interactions remains a great challenge in genome-wide association studies. Some existing multi-locus approaches have succeeded on small-scale case-control data, but the curse of "combination explosion" prohibits their application to genome-wide analysis. New methods are therefore needed that can reduce the search space for epistatic interactions from an astronomical number of possible combinations of genetic variants to a manageable set of candidates.
Results: We studied case-control data from the viewpoint of binary classification. More precisely, we treated single nucleotide polymorphism (SNP) markers as categorical features and used a random forest to discriminate cases from controls. On the basis of the Gini importance given by the random forest, we designed a sliding window sequential forward feature selection (SWSFS) algorithm to select a small set of candidate SNPs that minimizes the classification error, and then statistically tested up to three-way interactions among the candidates. We compared this approach with three existing methods on three simulated disease models and showed that it is comparable to, and sometimes more powerful than, the other methods. We applied our approach to a genome-wide case-control dataset for age-related macular degeneration (AMD) and successfully identified two SNPs that had been reported to be associated with this disease.
Conclusion: Beyond existing purely statistical approaches, we demonstrated the feasibility of incorporating machine learning methods into genome-wide case-control studies. The Gini importance offers another measure of the association between SNPs and complex diseases, complementing existing statistical measures to facilitate the identification of epistatic interactions and the understanding of epistasis in the pathogenesis of complex diseases.
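The selection step described in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact window rule, so the stopping criterion below (stop once the error has not improved for `window` consecutive additions) is an assumption, as are the function names `gini`, `gini_decrease`, and `sliding_window_forward_selection`. The per-SNP impurity decrease shown is the quantity a random forest accumulates over its splits to form the Gini importance; here it is computed for a single split only, and `error_fn` stands in for whatever classification error estimate is used.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_decrease(genotypes, labels):
    """Impurity decrease from splitting samples by one categorical SNP.

    A random forest sums decreases like this over all splits on a
    feature; this single-split version is only an illustration.
    """
    groups = {}
    for g, y in zip(genotypes, labels):
        groups.setdefault(g, []).append(y)
    n = len(labels)
    weighted = sum(len(ys) / n * gini(ys) for ys in groups.values())
    return gini(labels) - weighted


def sliding_window_forward_selection(ranked, error_fn, window=3):
    """Add features in ranked order and keep the best prefix.

    ASSUMPTION: stop once the error has not improved for `window`
    consecutive additions. Returns (best_subset, best_error).
    """
    best_k, best_err = 0, float("inf")
    for k in range(1, len(ranked) + 1):
        err = error_fn(ranked[:k])
        if err < best_err:
            best_k, best_err = k, err
        elif k - best_k >= window:
            break
    return ranked[:best_k], best_err


# Toy data (hypothetical): 1 = case, 0 = control; genotypes coded 0/1/2.
labels = [1, 1, 0, 0]
snps = {"snp_a": [2, 2, 0, 0], "snp_b": [0, 1, 0, 1]}

# Rank SNPs by impurity decrease, most informative first.
ranked = sorted(snps, key=lambda s: gini_decrease(snps[s], labels), reverse=True)
```

In practice `error_fn` would train a classifier on the chosen SNP subset and return an error estimate such as the out-of-bag error; the candidates that survive selection would then go on to the statistical tests of up to three-way interactions.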