Interaction-Based Learning for High-Dimensional Data with Continuous Predictors


Bibliographic Details
Main Author: Huang, Chien-Hsun
Language: English
Published: 2014
Online Access: https://doi.org/10.7916/D8X928CH
Collection: NDLTD
Subjects: Epistasis (Genetics); Instrumental variables (Statistics); Nonparametric statistics; Cluster analysis; Machine learning--Statistical methods; Statistics
Description: High-dimensional data, such as gene expression data from microarray experiments, may contain a substantial amount of useful information to be explored. However, the information, the relevant variables, and their joint interactions are usually diluted by noise from a large number of non-informative variables. Consequently, variable selection plays a pivotal role in learning for high-dimensional problems. Most traditional feature selection methods, such as Pearson's correlation between response and predictors, stepwise linear regression, and LASSO, are linear. These methods are effective in identifying linear marginal effects but are limited in detecting non-linear or higher-order interaction effects. It is well known that epistasis (gene-gene interaction) may play an important role in gene expression, where unknown functional forms are difficult to identify. In this thesis, we first propose a novel nonparametric measure for screening and feature selection based on information from nearest neighborhoods. The method is inspired by Lo and Zheng's earlier work (2002) on detecting interactions for discrete predictors. We apply a backward elimination algorithm based on this measure, which leads to the identification of many influential clusters of variables. The identified groups of variables can capture both marginal and interactive effects. Second, each identified cluster has the potential to perform predictions and classifications more accurately. We also study procedures for combining these groups of individual classifiers to form a final predictor. Simulation and real data analysis show that the proposed measure is capable of identifying important variable sets and patterns, including higher-order interaction sets. The proposed procedure outperforms existing methods on three different microarray datasets. Moreover, the nonparametric measure is quite flexible and can be easily extended and applied to other areas of high-dimensional data analysis.
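The backward elimination idea mentioned in the description can be illustrated with a minimal sketch. The abstract does not give the formula for the proposed nonparametric measure, so the code below substitutes leave-one-out k-nearest-neighbor accuracy as a stand-in score; it is not the thesis's measure. The function names `knn_loo_score` and `backward_eliminate`, the parameter `k`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def knn_loo_score(X, y, cols, k=5):
    """Stand-in score (an assumption, not the thesis's measure):
    leave-one-out k-NN accuracy using only the columns in `cols`."""
    Z = np.asarray(X, dtype=float)[:, cols]
    # Pairwise squared Euclidean distances between all samples.
    d = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest neighbors
    votes = y[nn].mean(axis=1)             # fraction of neighbors labeled 1
    pred = (votes > 0.5).astype(int)
    return float((pred == y).mean())

def backward_eliminate(X, y, cols, k=5):
    """Repeatedly drop the variable whose removal gives the highest score,
    down to one variable, and return the best-scoring subset on that path."""
    current = list(cols)
    best_cols, best_score = list(current), knn_loo_score(X, y, current, k)
    while len(current) > 1:
        candidates = [[c for c in current if c != drop] for drop in current]
        scores = [knn_loo_score(X, y, cand, k) for cand in candidates]
        i = int(np.argmax(scores))
        current = candidates[i]
        if scores[i] >= best_score:        # ties favor the smaller subset
            best_cols, best_score = list(current), scores[i]
    return best_cols, best_score

# Hypothetical toy data: column 0 separates the classes, column 1 is noise.
X_toy = np.array([[0.0, 0], [0.1, 5], [0.2, 0], [0.3, 5],
                  [10.0, 0], [10.1, 5], [10.2, 0], [10.3, 5]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

print(backward_eliminate(X_toy, y_toy, [0, 1], k=3))  # ([0], 1.0)
```

On the toy data the noise column is dropped and the informative column is retained. Any score computed on a subset of variables could be swapped in for `knn_loo_score` without changing the elimination loop.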