Population Affiliation Prediction Based on Rare Variants and Using Lancaster Importance Estimator, Principal Component Analysis, and Random Forest

Bibliographic Details
Main Author: Wathen, Michael J.
Language: English
Published: University of Cincinnati / OhioLINK 2016
Subjects: Environmental Health
Online Access: http://rave.ohiolink.edu/etdc/view?acc_num=ucin1460730716
id ndltd-OhioLink-oai-etd.ohiolink.edu-ucin1460730716
record_format oai_dc
spelling ndltd-OhioLink-oai-etd.ohiolink.edu-ucin1460730716 2021-08-03T06:35:45Z Population Affiliation Prediction Based on Rare Variants and Using Lancaster Importance Estimator, Principal Component Analysis, and Random Forest Wathen, Michael J. Environmental Health In this thesis we introduce to population genetics a method of variable selection based on an estimator for the measure of independence, using the data (contingency table) collected on the joint distribution. We call our maximum likelihood estimator the Lancaster Independence Estimate (LIE). We compare this newly proposed method with two other methods of variable selection: Principal Component Analysis (PCA) and Random Forest (RF). We employed data from the 1000 Genomes Project as provided by the GAW17 mini-exome data set, which comprises seven populations: Caucasians from the United States (CEPH), Chinese from Denver (Denver), Chinese from Beijing (Han), Japanese from Tokyo (Japanese), Luhya from Kenya (Luhya), Tuscans from Italy (Tuscan), and Yoruba from Nigeria (Yoruba). The data were parsed to explore the 10,455 rare variants with minor allele frequencies less than 5%. These SNP values were recorded as categorical 0, 1. The LIE was used to assemble an - collection of SNPs associated with the seven populations. We also assembled same-size collections of SNPs using the variable importance measures of PCA and RF. We found that the LIE method performed better than expected in the predictive models when compared to the predictive models coming from PCA, but not as well as those from RF. We also developed a hybrid method (Piggyback) that improved the predictive accuracy of RF conditional on a substantially smaller set of SNPs coming from the LIE method. Additionally, we found that this hybrid method of RF built on the LIE dramatically reduced the computational time normally required for non-hybrid RF.
2016-06-28 English text University of Cincinnati / OhioLINK http://rave.ohiolink.edu/etdc/view?acc_num=ucin1460730716 http://rave.ohiolink.edu/etdc/view?acc_num=ucin1460730716 unrestricted This thesis or dissertation is protected by copyright: some rights reserved. It is licensed for use under a Creative Commons license. Specific terms and permissions are available from this document's record in the OhioLINK ETD Center.
collection NDLTD
language English
sources NDLTD
topic Environmental Health
author Wathen, Michael J.
title Population Affiliation Prediction Based on Rare Variants and Using Lancaster Importance Estimator, Principal Component Analysis, and Random Forest
publisher University of Cincinnati / OhioLINK
publishDate 2016
url http://rave.ohiolink.edu/etdc/view?acc_num=ucin1460730716