SNP-set Detection and Association Test with Hamming Distance Information


Bibliographic Details
Main Authors: Charlotte Wang, 王彥雯
Other Authors: Chuhsing Kate Hsiao
Format: Others
Language: en_US
Published: 2015
Online Access: http://ndltd.ncl.edu.tw/handle/82199212159365155483
Description
Summary: Doctoral === National Taiwan University === Institute of Epidemiology and Preventive Medicine === 104 === With advances in biotechnology, many researchers try to identify disease-associated markers through genetic association studies. Two major challenges in recent genetic association studies have been reducing the intractably large number of genetic variants in genomic data to a computationally manageable number, and increasing the power of the statistical tests used. A marker-set study such as SNP-set analysis can address both problems efficiently. Such a method can also evaluate the joint effect of grouped SNPs in a pre-specified genomic region. Most current association tests, however, identify candidate marker sets either by testing pre-specified SNP sets or by testing with a sliding window across the whole genome. There appears to be no combined procedure that first defines SNP sets and then tests the association between those SNP sets and the disease of interest. To construct SNP sets, we first propose a clustering algorithm that employs Hamming distance to measure the similarity between strings of SNP genotypes and evaluates whether the given SNPs should be clustered together. We also recommend a rule of thumb for determining the number of clusters after a dendrogram is produced. With the SNP sets obtained, we next develop an association test to examine susceptibility to the disease of interest. For common variants, the proposed test assesses, based on Hamming distance, whether the similarity in genotypes between a diseased and a normal individual differs from the similarity between two individuals with the same disease status. For rare variants, the proposed test evaluates whether the similarity in genotypes within the case group differs from the similarity within the control group. Both test statistics are $U$-statistics, and their statistical properties and limiting behaviors are also discussed.
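The clustering step can be illustrated with a minimal sketch. It assumes genotypes coded as the characters 0, 1, and 2, one genotype string per SNP over the same individuals, average linkage, and a cluster count supplied by the user. None of these choices is stated in the summary, the function names `hamming` and `cluster_snps` are hypothetical, and the thesis's own algorithm and rule of thumb may differ.

```python
from itertools import combinations


def hamming(a, b):
    """Count the positions at which two equal-length genotype strings differ."""
    if len(a) != len(b):
        raise ValueError("genotype strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def cluster_snps(snps, n_clusters):
    """Agglomerative clustering of SNPs on Hamming distance.

    snps: dict mapping a SNP name to its genotype string across individuals.
    Average linkage is an assumption made for this sketch.
    Returns a list of sets of SNP names.
    """
    clusters = [{name} for name in snps]
    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Average Hamming distance over all cross-cluster SNP pairs.
            total = sum(hamming(snps[a], snps[b])
                        for a in clusters[i] for b in clusters[j])
            dist = total / (len(clusters[i]) * len(clusters[j]))
            if best is None or dist < best[0]:
                best = (dist, i, j)
        _, i, j = best
        clusters[i] |= clusters[j]  # merge the closest pair of clusters
        del clusters[j]
    return clusters


# Toy data: four SNPs genotyped on seven individuals (made up for illustration).
snps = {
    "snp1": "0012201",
    "snp2": "0012200",
    "snp3": "2200012",
    "snp4": "2210012",
}
sets = cluster_snps(snps, n_clusters=2)
```

In this toy example, snp1 and snp2 differ at one individual, as do snp3 and snp4, so the two pairs form the two SNP sets.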
Additionally, simulation studies and real data applications were conducted to demonstrate the performance of the proposed methods. The results showed that the Hamming distance-based clustering algorithm identifies correct clustering patterns and is computationally efficient. The method can be applied not only to genetic data but also to categorical data in general. For common variants, the Hamming distance-based association test (HDAT) works well regardless of the sample size, the effects of the SNPs within the given set, and the signal-to-noise ratio (the ratio of the number of disease-associated SNPs to the number of neutral SNPs). Moreover, for the genotyping data on coronary artery disease (CAD) from the WTCCC, the proposed methods identified one SNP set of four SNPs that was associated with the disease. These four SNPs have been reported in the literature. For rare variants, the numerical results demonstrated that the HDAT works well regardless of the sample size, the case-to-control ratio, and the signal-to-noise ratio. To conclude, the proposed clustering algorithm and association test are illustrated with simulations and a genome-wide association study, and the results indicate reliable and satisfactory performance. In the proposed methodology, no inference of haplotypes is needed, and the SNPs under consideration do not need to be linked. Specifically, the test works well for a SNP set containing both SNPs with a deleterious effect and SNPs with a protective effect, and for a set containing many neutral SNPs. Moreover, the statistical properties of the proposed methods are discussed. However, some issues remain unsolved. First, for common variants, extensions of the HDAT to imbalanced case and control group sizes need to be studied.
Second, even though categorical disease-related factors can be considered as pseudo genetic markers, how to incorporate disease-related factors such as environmental factors and personal characteristics still needs to be studied.
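The two Hamming distance comparisons described in the summary can be sketched as simple contrasts of mean pairwise distances. This is only an illustration under stated assumptions: the exact $U$-statistic kernels and limiting distributions are not given in this record, so the contrasts below are hypothetical forms, and the permutation p-value is substituted here in place of the thesis's asymptotic results. All function names and the toy data are invented for the example.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Count the positions at which two equal-length genotype strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_pairwise(pairs):
    """Mean Hamming distance over an iterable of (string, string) pairs."""
    pairs = list(pairs)
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


def common_variant_contrast(cases, controls):
    """Case-control pair distance minus same-status pair distance (assumed form)."""
    between = mean_pairwise((a, b) for a in cases for b in controls)
    within = mean_pairwise(list(combinations(cases, 2))
                           + list(combinations(controls, 2)))
    return between - within


def rare_variant_contrast(cases, controls):
    """Within-case distance minus within-control distance (assumed form)."""
    return (mean_pairwise(combinations(cases, 2))
            - mean_pairwise(combinations(controls, 2)))


def permutation_p_value(cases, controls, stat, n_perm=999, seed=0):
    """Two-sided permutation p-value; not the thesis's asymptotic test."""
    rng = random.Random(seed)
    observed = stat(cases, controls)
    pooled = list(cases) + list(controls)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign disease status at random
        perm = stat(pooled[:len(cases)], pooled[len(cases):])
        if abs(perm) >= abs(observed):
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data: each string is one individual's genotypes over a 4-SNP set.
cases = ["2210", "2211", "2200"]
controls = ["0010", "0011", "0000"]
common_stat = common_variant_contrast(cases, controls)
rare_stat = rare_variant_contrast(cases, controls)
p_value = permutation_p_value(cases, controls, common_variant_contrast)
```

Here each genotype string belongs to an individual rather than to a SNP, because the test compares people within one SNP set. A positive common-variant contrast means that case-control pairs are less alike than same-status pairs.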