Shrinkage methods for variable selection and prediction with applications to genetic data

Bibliographic Details
Main Author: Cule, Erika
Other Authors: De Iorio, Maria ; Vineis, Paolo
Published: Imperial College London 2013
Online Access: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.592734
Description
Summary: Identifying genotypes using genetic material was at first a painstaking laboratory task. In the decades since the first gene was sequenced, techniques have progressed through milestones requiring massive international collaboration. Today’s genotype sequencing facilities use high-throughput technology to sequence entire genomes within days. Despite these technological improvements, and the resultant volume of genetic data, the identification of meaningful genotype-phenotype associations has not been as straightforward as was anticipated in the pre-genome era. The genetic architecture of many common diseases is complex, and heritability often cannot be explained when simple statistical tests are used. This thesis addresses a clinically important problem in statistical genetics: that of predicting disease risk based on genotype information. First, we review progress and current limitations in genetic risk prediction. We then introduce penalised regression. This thesis focusses on ridge regression, a penalised regression approach that has shown promise in risk prediction for high-dimensional data. The choice of the ridge parameter, which controls the amount of penalisation in ridge regression, has not been addressed in the literature with the specific aim of analysing genetic data. We present a method for automatically choosing the ridge parameter based on genome-wide SNP data. Software implementing the method is available to the community. We evaluate the method using simulation studies and a real data example. A ridge regression model does not indicate the strength of association of individual variants with the outcome, a property that is often of interest to geneticists. To this end, we extend a previously proposed test of significance in ridge regression models to high-dimensional data and to the logistic model, which commonly occurs in the biomedical context. This test is evaluated by comparison to a permutation test, which we view as a benchmark. This test is integrated into the software package mentioned above.
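
Illustrative note: the role of the ridge parameter described in the summary can be sketched with the standard closed-form ridge estimate, (X'X + lam*I)^{-1} X'y. This is a minimal sketch and not the parameter-selection method or the software from the thesis. The function name ridge_fit, the variable lam, the candidate values of lam, and the simulated genotype-like data (0/1/2 allele counts) are all assumptions made for illustration.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X'X + lam*I) beta = X'y.

    Hypothetical helper for illustration only; lam is the ridge parameter.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


# Simulated data (an assumption, not thesis data): more predictors than samples.
rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.integers(0, 3, size=(n, p)).astype(float)  # genotype-like codes 0, 1, 2
X -= X.mean(axis=0)                                # centre each predictor

beta_true = np.zeros(p)
beta_true[:5] = 1.0                                # only five predictors have an effect
y = X @ beta_true + rng.normal(size=n)
y -= y.mean()                                      # centre the outcome

# Larger lam means more penalisation, so the coefficient vector shrinks towards zero.
norms = {lam: float(np.linalg.norm(ridge_fit(X, y, lam)))
         for lam in (0.1, 1.0, 10.0, 100.0)}

for lam, norm in norms.items():
    print(f"lam = {lam:6.1f}   ||beta|| = {norm:.4f}")
```

With p larger than n, ordinary least squares has no unique solution, whereas the ridge system is solvable for any lam greater than zero. The printed norms decrease as lam grows, which is why the choice of lam matters for prediction.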