SNP interaction detection with Random Forests in high-dimensional genetic data


Bibliographic Details
Main Authors: Winham Stacey J, Colby Colin L, Freimuth Robert R, Wang Xin, de Andrade Mariza, Huebner Marianne, Biernacka Joanna M
Format: Article
Language: English
Published: BMC 2012-07-01
Series: BMC Bioinformatics
Online Access: http://www.biomedcentral.com/1471-2105/13/164
id doaj-e8eb0f4bab274a02b3ce8f6cb4304984
record_format Article
spelling doaj-e8eb0f4bab274a02b3ce8f6cb4304984; 2020-11-24T21:53:02Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2012-07-01; 13; 1; 164; 10.1186/1471-2105-13-164
collection DOAJ
language English
format Article
sources DOAJ
author Winham Stacey J
Colby Colin L
Freimuth Robert R
Wang Xin
de Andrade Mariza
Huebner Marianne
Biernacka Joanna M
title SNP interaction detection with Random Forests in high-dimensional genetic data
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2012-07-01
description Abstract

Background: Identifying variants associated with complex human traits in high-dimensional data is a central goal of genome-wide association studies. However, complicated etiologies such as gene-gene interactions are ignored by the univariate analysis usually applied in these studies. Random Forests (RF) are a popular data-mining technique that can accommodate a large number of predictor variables and allow for complex models with interactions. RF analysis produces measures of variable importance that can be used to rank the predictor variables. Thus, single nucleotide polymorphism (SNP) analysis using RFs is gaining popularity as a potential filter approach that considers interactions in high-dimensional data. However, the impact of data dimensionality on the power of RF to identify interactions has not been thoroughly explored. We investigate the ability of rankings from variable importance measures to detect gene-gene interaction effects and their potential effectiveness as filters compared to p-values from univariate logistic regression, particularly as the data becomes increasingly high-dimensional.

Results: RF effectively identifies interactions in low dimensional data. As the total number of predictor variables increases, probability of detection declines more rapidly for interacting SNPs than for non-interacting SNPs, indicating that in high-dimensional data the RF variable importance measures are capturing marginal effects rather than capturing the effects of interactions.

Conclusions: While RF remains a promising data-mining technique that extends univariate methods to condition on multiple variables simultaneously, RF variable importance measures fail to detect interaction effects in high-dimensional data in the absence of a strong marginal component, and therefore may not be useful as a filter technique that allows for interaction effects in genome-wide data.
url http://www.biomedcentral.com/1471-2105/13/164
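
The abstract's distinction between a marginal effect and an interaction effect can be made concrete with a small worked example. The sketch below is not the authors' simulation design: the penetrance table, the carrier frequency, and all function names are hypothetical values chosen for illustration. It computes, exactly and without sampling, the disease risk for two independent SNPs under an XOR-like model, in which each SNP has no marginal effect even though the joint genotype strongly changes risk.

```python
from itertools import product

# Hypothetical penetrance table: P(case | carrier status at SNP A, SNP B).
# The XOR-like pattern is an illustrative assumption, not taken from the paper.
PENETRANCE = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.1}
CARRIER_FREQ = 0.5  # assumed carrier frequency at each SNP; SNPs independent


def genotype_prob(a, b, freq=CARRIER_FREQ):
    """Joint probability of carrier statuses (a, b) under independence."""
    pa = freq if a else 1 - freq
    pb = freq if b else 1 - freq
    return pa * pb


def marginal_risk(snp_index, status):
    """P(case | carrier status at one SNP), averaging over the other SNP."""
    num = den = 0.0
    for cell in product((0, 1), repeat=2):
        if cell[snp_index] != status:
            continue
        p = genotype_prob(*cell)
        num += p * PENETRANCE[cell]
        den += p
    return num / den


def marginal_effect(snp_index):
    """Risk difference between carriers and non-carriers at one SNP."""
    return marginal_risk(snp_index, 1) - marginal_risk(snp_index, 0)


def joint_effect():
    """Largest risk difference between any two joint genotype cells."""
    risks = PENETRANCE.values()
    return max(risks) - min(risks)


if __name__ == "__main__":
    print("marginal effect, SNP A:", marginal_effect(0))
    print("marginal effect, SNP B:", marginal_effect(1))
    print("joint effect:", joint_effect())
```

Under these assumed values, a test that looks at one SNP at a time sees the same risk in carriers and non-carriers, which is the situation the conclusions describe as an interaction "in the absence of a strong marginal component".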