Statistical methods for detecting genetic risk factors of a disease with applications to genome-wide association studies

This thesis aims to develop statistical methods for analysing data derived from genome-wide association studies (GWAS). GWAS often involve genotyping individual human genetic variation, using high-throughput genome-wide single nucleotide polymorphism (SNP) arrays, in thousands of in...

Bibliographic Details
Main Author:	Ali, Fadhaa
Other Authors:	Zhang, Jian; Wang, Xue
Published:	University of Kent 2015
Subjects:	576.5 QA276 Mathematical statistics
Online Access:	https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.643535

id	ndltd-bl.uk-oai-ethos.bl.uk-643535
record_format	oai_dc
spelling	ndltd-bl.uk-oai-ethos.bl.uk-643535 2018-11-08T03:22:58Z Statistical methods for detecting genetic risk factors of a disease with applications to genome-wide association studies Ali, Fadhaa Zhang, Jian; Wang, Xue 2015 576.5 QA276 Mathematical statistics University of Kent https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.643535 https://kar.kent.ac.uk/47963/ Electronic Thesis or Dissertation
collection	NDLTD
sources	NDLTD
topic	576.5 QA276 Mathematical statistics
spellingShingle	576.5 QA276 Mathematical statistics Ali, Fadhaa Statistical methods for detecting genetic risk factors of a disease with applications to genome-wide association studies
description	This thesis aims to develop statistical methods for analysing data derived from genome-wide association studies (GWAS). GWAS often involve genotyping individual human genetic variation, using high-throughput genome-wide single nucleotide polymorphism (SNP) arrays, in thousands of individuals and testing for association between those variants and a given disease under the common disease/common variant assumption. Although GWAS have identified many potential genetic factors that affect the risk of complex diseases, much of the genetic heritability remains unexplained. The power to detect new genetic risk variants can be improved by considering multiple genetic variants simultaneously with novel statistical methods. Improving the analysis of GWAS data has received much attention from statisticians and other scientific researchers over the past decade. Several challenges arise in analysing GWAS data. First, determining the risk SNPs can be difficult because non-random correlation between SNPs can inflate type I and type II errors in statistical inference. Second, when a group of SNPs is considered together in the context of haplotypes/genotypes, the distribution of the haplotypes/genotypes is sparse, which makes it difficult to detect risk haplotypes/genotypes in terms of disease penetrance. In this work, we proposed four new methods to identify risk haplotypes/genotypes based on their frequency differences between cases and controls. To evaluate the performance of our methods, we simulated datasets under a wide range of scenarios according to both retrospective and prospective designs. In the first method, we reconstructed haplotypes from unphased genotypes and then clustered and thresholded the inferred haplotypes into risk and non-risk groups with a two-component binomial-mixture model. In this method, the parameters were estimated using a modified Expectation-Maximization algorithm, in which the maximisation step was replaced by posterior sampling of the component parameters. We also elucidated the relationships between risk and non-risk haplotypes under different modes of inheritance and genotypic relative risks. In the second method, we fitted a three-component mixture model to the genotype data directly, followed by odds-ratio thresholding. In the third method, we combined the existing haplotype reconstruction software PHASE with a permutation method to infer risk haplotypes. In the fourth method, we proposed a new way to score the genotypes by clustering and combined it with a logistic regression approach to infer risk haplotypes. The simulation studies showed that the first three methods outperformed the multiple testing method of Zhu (2010) in terms of average specificity and sensitivity (AVSS) in all scenarios considered. The logistic-regression-based method also outperformed the standard logistic regression method. We applied our methods to two GWAS datasets on coronary artery disease (CAD) and hypertension (HT), detecting several new risk haplotypes and recovering a number of the disease-associated genetic variants already reported in the literature.
author2	Zhang, Jian; Wang, Xue
author_facet	Zhang, Jian; Wang, Xue Ali, Fadhaa
author	Ali, Fadhaa
author_sort	Ali, Fadhaa
title	Statistical methods for detecting genetic risk factors of a disease with applications to genome-wide association studies
publisher	University of Kent
publishDate	2015
url	https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.643535
work_keys_str_mv	AT alifadhaa statisticalmethodsfordetectinggeneticriskfactorsofadiseasewithapplicationstogenomewideassociationstudies
_version_	1718790191038070784
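
Illustrative sketch (not taken from the thesis): the abstract describes clustering haplotypes into risk and non-risk groups with a two-component binomial-mixture model. The Python below shows one way such a mixture could be fitted, under stated assumptions. It uses the standard EM algorithm, not the thesis's modified version in which the maximisation step is replaced by posterior sampling. The function names, the input layout (copies of each haplotype seen in cases, out of its total copies in cases and controls), the 0.5 posterior cut-off, and the example counts are all hypothetical.

```python
import math


def binom_logpmf(k, n, p):
    """Log of the binomial pmf, with p clamped away from 0 and 1."""
    p = min(max(p, 1e-12), 1 - 1e-12)
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def fit_two_component_binomial(case_counts, totals, n_iter=200):
    """Fit a two-component binomial mixture with plain EM.

    case_counts[i]: copies of haplotype i observed in cases (assumed input).
    totals[i]:      copies of haplotype i in cases and controls combined.
    Returns (p_low, p_high, weight_high, resp), where resp[i] is the
    posterior probability that haplotype i belongs to the high component.
    """
    props = [k / n for k, n in zip(case_counts, totals)]
    p_low, p_high = min(props), max(props)  # crude initial values
    weight_high = 0.5
    resp = [0.5] * len(case_counts)
    for _ in range(n_iter):
        # E-step: posterior membership in the high-proportion component.
        resp = []
        for k, n in zip(case_counts, totals):
            log_hi = math.log(weight_high) + binom_logpmf(k, n, p_high)
            log_lo = math.log(1 - weight_high) + binom_logpmf(k, n, p_low)
            m = max(log_hi, log_lo)
            hi, lo = math.exp(log_hi - m), math.exp(log_lo - m)
            resp.append(hi / (hi + lo))
        # M-step: closed-form updates (the thesis replaces this step
        # with posterior sampling; that variant is not reproduced here).
        mean_resp = sum(resp) / len(resp)
        weight_high = min(max(mean_resp, 1e-6), 1 - 1e-6)
        hi_n = sum(r * n for r, n in zip(resp, totals))
        lo_n = sum((1 - r) * n for r, n in zip(resp, totals))
        p_high = sum(r * k for r, k in zip(resp, case_counts)) / max(hi_n, 1e-12)
        p_low = sum((1 - r) * k for r, k in zip(resp, case_counts)) / max(lo_n, 1e-12)
    if p_high < p_low:  # guard against label switching
        p_low, p_high = p_high, p_low
        weight_high = 1 - weight_high
        resp = [1 - r for r in resp]
    return p_low, p_high, weight_high, resp


def flag_risk(resp, cutoff=0.5):
    """Threshold posterior memberships into risk (True) / non-risk (False)."""
    return [r > cutoff for r in resp]


if __name__ == "__main__":
    # Made-up counts: ten haplotypes, each seen 100 times in total.
    cases = [50, 48, 52, 51, 49, 50, 47, 53, 80, 82]
    totals = [100] * 10
    p_low, p_high, weight_high, resp = fit_two_component_binomial(cases, totals)
    print(p_low, p_high, weight_high, flag_risk(resp))
```

Thresholding the posterior membership at 0.5 is only one possible rule; the abstract does not state which threshold the thesis uses.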

Similar Items