Variable Selection of Correlated Predictors in Logistic Regression: Investigating the Diet-Heart Hypothesis

Variable selection is an important aspect of modeling. Its aim is to distinguish between the authentic variables which are important in predicting outcome, and the noise variables which possess little to no predictive value. In other words, the goal is to find the variables that (collectively) best...

Bibliographic Details
Other Authors: Thompson, Warren R. (Warren Robert) (author)
Format: Others
Language: English
Published: Florida State University
Subjects: Statistics
Online Access: http://purl.flvc.org/fsu/fd/FSU_migr_etd-1360
id ndltd-fsu.edu-oai-fsu.digital.flvc.org-fsu_253957
record_format oai_dc
spelling ndltd-fsu.edu-oai-fsu.digital.flvc.org-fsu_253957 2020-06-19T03:09:50Z
Variable Selection of Correlated Predictors in Logistic Regression: Investigating the Diet-Heart Hypothesis
Thompson, Warren R. (Warren Robert) (author); McGee, Daniel (professor directing dissertation); Eberstein, Isaac (university representative); Huffer, Fred (committee member); Sinha, Debajyoti (committee member); She, Yiyuan (committee member); Department of Statistics (degree granting department); Florida State University (degree granting institution)
Text; Florida State University; English (eng); 1 online resource; computer; application/pdf
Variable selection is an important aspect of modeling. Its aim is to distinguish between the authentic variables, which are important in predicting the outcome, and the noise variables, which possess little to no predictive value. In other words, the goal is to find the variables that (collectively) best explain and predict changes in the outcome variable. The variable selection problem is exacerbated when correlated variables are included in the covariate set. This dissertation examines the variable selection problem in the context of logistic regression. Specifically, we investigated the merits of the bootstrap, ridge regression, the lasso and Bayesian model averaging (BMA) as variable selection techniques when highly correlated predictors and a dichotomous outcome are considered.
This dissertation also contributes to the literature on the diet-heart hypothesis. The diet-heart hypothesis has been around since the early twentieth century. Since then, researchers have attempted to isolate the nutrients in diet that promote coronary heart disease (CHD). After a century of research, there is still no consensus. In our current research, we used some of the more recent statistical methodologies (mentioned above) to investigate the effect of twenty dietary variables on the incidence of coronary heart disease. Logistic regression models were generated for the data from the Honolulu Heart Program, a study of CHD incidence in men of Japanese descent. Our results were largely method-specific. However, regardless of the method considered, there was strong evidence that alcohol consumption has a strong protective effect against coronary heart disease. Of the variables considered, dietary cholesterol and caffeine were the only ones that exhibited, at best, a moderately strong harmful association with CHD incidence. Further investigation that includes a broader array of food groups is recommended.
A Dissertation submitted to the Department of Statistics in partial fulfillment of the requirements for the degree of Doctor of Philosophy.
Fall Semester, 2009.
August 10, 2009.
Logistic Regression, Bootstrap, Lasso, Ridge Regression, Bayesian Model Averaging, Diet-Heart Hypothesis
Includes bibliographical references.
Daniel McGee, Professor Directing Dissertation; Isaac Eberstein, University Representative; Fred Huffer, Committee Member; Debajyoti Sinha, Committee Member; Yiyuan She, Committee Member.
Statistics
FSU_migr_etd-1360
http://purl.flvc.org/fsu/fd/FSU_migr_etd-1360
This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s). The copyright in theses and dissertations completed at Florida State University is held by the students who author them.
http://diginole.lib.fsu.edu/islandora/object/fsu%3A253957/datastream/TN/view/Variable%20Selection%20of%20Correlated%20Predictors%20in%20Logistic%20Regression.jpg
collection NDLTD
language English
format Others
sources NDLTD
topic Statistics
spellingShingle Statistics
Variable Selection of Correlated Predictors in Logistic Regression: Investigating the Diet-Heart Hypothesis
author2 Thompson, Warren R. (Warren Robert) (author)
author_facet Thompson, Warren R. (Warren Robert) (author)
title Variable Selection of Correlated Predictors in Logistic Regression: Investigating the Diet-Heart Hypothesis
publisher Florida State University
url http://purl.flvc.org/fsu/fd/FSU_migr_etd-1360
_version_ 1719322210549628928