Application of Data Mining Techniques in Human Population Genetic Structure Analysis

Bibliographic Details
Main Author: Weng, Zhouyang
Language: English
Published: University of Cincinnati / OhioLINK 2017
Online Access: http://rave.ohiolink.edu/etdc/view?acc_num=ucin149035512350142
Summary: The success of a genome-wide association study (GWAS) depends on genotyping a large number of SNPs and determining which of them are significantly associated with the disease outcome. When studying these associations, it is important to account for effects caused by differences between ethnicities and population groups. The study of human population genetic structure focuses on analyzing genetic variation between populations and on assigning individuals to subpopulations based on the degree of that variation. Currently, the leading statistical method for uncovering population structure in GWAS is Principal Component Analysis (PCA). However, one major problem with applying PCA to SNP data is that the resulting principal components do not correspond to actual SNP variables, so a way is needed to map the principal components back to the SNPs and measure each SNP's importance in terms of ancestry information. To overcome this limitation, Sparse Principal Component Analysis (SPCA) has been proposed to identify a small set of structure-informative markers more efficiently: it adds a penalty term to the alternating regression formulation of PCA, which encourages SNPs with negligible loadings to vanish during optimization. Yet selecting a small subset of ancestry-informative SNPs via SPCA can still be computationally expensive, especially when a large number of non-zero loadings across multiple principal components are required for structure analysis.
Given these limitations, it is desirable to find methods that not only classify populations but also reduce the number of variables explicitly used and select actual SNP variables that are ancestry-informative markers in a cost-effective manner. This study therefore both examines the application of major data mining methods to human population genetic structure analysis and introduces a two-stage approach that combines two popular methods to improve efficiency and accuracy in population classification and variable selection. Specifically, the first stage uses SPCA to identify a subset of SNP markers that capture the major genetic variation between population groups; the second stage uses Random Forest (RF) to estimate population structure from the selected markers and to select ancestry-informative markers among them. The two-stage SPCA-RF approach was tested on empirical and simulated datasets. The empirical dataset came from the simulated next-generation sequence data provided for Genetic Analysis Workshop (GAW) 17, which were based on real exome sequence data from the 1000 Genomes Project. Results from the SPCA-RF algorithm suggest that higher population prediction accuracy is possible with relatively fewer markers. Compared with existing methods, the proposed SPCA-RF approach consistently gave similar or lower error rates and retained all important ancestry-informative variables. Moreover, all methods were implemented in the open-source R software, so the source code is available for future researchers to replicate and extend the research.
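Illustrative note: the two-stage idea described in the summary can be sketched in a few lines of code. The sketch below is not the thesis implementation (which was written in R and is not reproduced in this record). It assumes toy simulated genotypes, a rank-one sparse PCA computed by soft-thresholded power iteration in place of the penalized alternating regression, and a small forest of one-split trees in place of a full Random Forest. All function names and parameters (`simulate`, `sparse_pc_snps`, `shrink`, `mtry`, and so on) are hypothetical.

```python
import random
from collections import Counter

def simulate(n_per_pop=30, n_snps=20, n_informative=4, seed=0):
    """Toy genotypes (0/1/2) for two populations; only the first SNPs differ."""
    rng = random.Random(seed)
    X, y = [], []
    for pop, freq in ((0, 0.9), (1, 0.1)):
        for _ in range(n_per_pop):
            row = []
            for j in range(n_snps):
                p = freq if j < n_informative else 0.5
                row.append((rng.random() < p) + (rng.random() < p))
            X.append(row)
            y.append(pop)
    return X, y

def sparse_pc_snps(X, shrink=0.5, n_iter=20):
    """Stage 1 (assumed variant): SNPs with non-zero loading on one sparse PC."""
    n, p = len(X), len(X[0])
    means = [sum(r[j] for r in X) / n for j in range(p)]
    C = [[r[j] - means[j] for j in range(p)] for r in X]   # centered genotypes
    v = [1.0 / p ** 0.5] * p
    for _ in range(n_iter):
        u = [sum(c[j] * v[j] for j in range(p)) for c in C]             # scores
        w = [sum(C[i][j] * u[i] for i in range(n)) for j in range(p)]   # loadings
        lam = shrink * max(abs(x) for x in w)
        # Soft-thresholding drives small loadings to exactly zero.
        w = [max(abs(x) - lam, 0.0) * (1 if x > 0 else -1) for x in w]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return [j for j in range(p) if v[j] != 0.0]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def fit_stump(X, y, features):
    """Best single split (lowest weighted Gini impurity) over the given SNPs."""
    best = None
    for j in features:
        for t in (0.5, 1.5):   # the two possible cut points for 0/1/2 genotypes
            left = [y[i] for i in range(len(X)) if X[i][j] <= t]
            right = [y[i] for i in range(len(X)) if X[i][j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t, Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    return best

def random_forest(X, y, features, n_trees=50, mtry=2, seed=1):
    """Stage 2 (simplified): bootstrap samples plus random SNP subsets per tree."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        stump = fit_stump(Xb, yb, rng.sample(features, min(mtry, len(features))))
        if stump:
            trees.append(stump)
    return trees

def predict(trees, row):
    votes = Counter(lo if row[j] <= t else hi for _, j, t, lo, hi in trees)
    return votes.most_common(1)[0][0]

X, y = simulate()
selected = sparse_pc_snps(X)                    # stage 1: candidate SNPs
trees = random_forest(X, y, selected)           # stage 2: classify on them
X_test, y_test = simulate(seed=2)               # fresh individuals for evaluation
accuracy = sum(predict(trees, r) == t for r, t in zip(X_test, y_test)) / len(y_test)
importance = Counter(tree[1] for tree in trees) # how often each SNP was split on
print(selected, accuracy, importance)
```

In this toy setup the split counts in `importance` play the role of a variable-importance ranking. A real analysis would more likely rely on an established Random Forest implementation and its own importance measures.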
Date: 2017-10-27
Type: text
Rights: Unrestricted. This thesis or dissertation is protected by copyright: all rights reserved. It may not be copied or redistributed beyond the terms of applicable copyright laws.
Collection: NDLTD
Subjects: Biostatistics; Data Mining; Population Genetic Structure Analysis; Variable Selection; Population Classification