SNP-set Detection and Association Test with Hamming Distance Information

Bibliographic Details
Main Authors: Charlotte Wang, 王彥雯
Other Authors: Chuhsing Kate Hsiao
Format: Others
Language: en_US
Published: 2015
Online Access: http://ndltd.ncl.edu.tw/handle/82199212159365155483
id ndltd-TW-104NTU05544002
record_format oai_dc
collection NDLTD
language en_US
format Others
sources NDLTD
description Doctoral dissertation === National Taiwan University === Institute of Epidemiology and Preventive Medicine === 104 === With advances in biotechnology, many researchers seek to identify disease-associated markers through genetic association studies. Two major challenges in recent genetic association studies have been reducing the intractably large number of genetic variants in genomic data to a computationally manageable number and increasing the power of the statistical tests used. A marker-set study such as SNP-set analysis can address both problems efficiently, and it can also evaluate the joint effect of grouped SNPs in a pre-specified genomic region. Most current association tests, however, identify candidate marker sets either by testing pre-specified SNP sets or by testing sliding windows across the whole genome. There appears to be no combined procedure that first defines SNP sets and then tests the association between those sets and the disease of interest. To construct SNP sets, we first propose a clustering algorithm that uses Hamming distance to measure the similarity between strings of SNP genotypes and evaluates whether the given SNPs should be clustered. We also recommend a rule of thumb for determining the number of clusters once a dendrogram has been produced. With the SNP sets obtained, we next develop an association test to examine susceptibility to the disease of interest. For common variants, the proposed test assesses, based on Hamming distance, whether the similarity in genotypes between a diseased and a normal individual differs from the similarity between two individuals with the same disease status. For rare variants, the proposed test evaluates whether the similarity in genotypes within the case group differs from the similarity within the control group. These two statistics are $U$-statistics, and their statistical properties and limiting behaviors are also discussed.
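The clustering idea described above can be illustrated with a minimal sketch. It is not the thesis's algorithm: the abstract does not specify the linkage rule, the criterion for deciding whether SNPs should be clustered, or the rule of thumb for the number of clusters. Average linkage, the `max_dist` cut-off, the 0/1/2 genotype coding, and the SNP names below are all assumptions made for illustration.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length genotype strings differ."""
    if len(a) != len(b):
        raise ValueError("genotype strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

def cluster_snps(snps, max_dist):
    """Agglomerative clustering of SNPs by Hamming distance.

    Assumptions (not from the thesis): average linkage, and merging stops
    once the closest pair of clusters is farther apart than max_dist.
    """
    clusters = [[name] for name in snps]

    def linkage(c1, c2):
        # Average Hamming distance over all cross-cluster SNP pairs.
        total = sum(hamming(snps[i], snps[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[i], clusters[j]) > max_dist:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical data: each string holds one SNP's genotypes (coded 0/1/2)
# across the same eight individuals.
snps = {
    "rs_a": "00112201",
    "rs_b": "00112211",
    "rs_c": "22100120",
    "rs_d": "22100020",
}
sets = cluster_snps(snps, max_dist=2)
print(sets)
```

Because Hamming distance only counts mismatching positions, the same sketch applies to any categorical coding, which is consistent with the abstract's remark that the method extends to categorical data in general.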
Simulation studies and real-data applications were conducted to demonstrate the performance of the proposed methods. The results showed that the Hamming distance-based clustering algorithm identifies the correct clustering patterns and is computationally efficient. The method can be applied not only to genetic data but also to categorical data in general. For common variants, the Hamming distance-based association test (HDAT) works well regardless of the sample size, the effects of the SNPs within the given set, and the signal-to-noise ratio (the ratio of the number of disease-associated SNPs to the number of neutral SNPs). Moreover, in genotyping data for coronary artery disease (CAD) from the WTCCC, the proposed methods found one SNP set, containing four SNPs, that was associated with the disease. These four SNPs have been reported in the literature. For rare variants, the numerical results demonstrated that the HDAT works well regardless of the sample size, the case-to-control ratio, and the signal-to-noise ratio. To conclude, the proposed clustering algorithm and association test are illustrated with simulations and a genome-wide association study, and the results indicate reliable and satisfactory performance. The proposed methodology requires no inference of haplotypes, and the SNPs under consideration do not need to be linked. In particular, the test works well for a SNP set containing both deleterious and protective SNPs, and for a set containing many neutral SNPs. Moreover, the statistical properties of the proposed methods are discussed. However, some issues remain unresolved. First, for common variants, extensions of the HDAT to unequal case and control group sizes need to be studied.
Second, even though categorical disease-related factors can be considered as pseudo genetic markers, how to incorporate disease-related factors, such as environmental factors and personal characteristics, still needs to be studied.
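The two Hamming distance-based comparisons described in this abstract (between-group versus within-group similarity for common variants, and within-case versus within-control similarity for rare variants) can be sketched as follows. This is an illustration only, not the HDAT itself: the abstract does not give the exact form of the $U$-statistics or their limiting distributions, so the mean-distance contrasts, the pooling of within-case and within-control pairs, and the permutation p-value used here in place of the asymptotic theory are all assumptions, as are the function names and the toy genotypes.

```python
import random
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length genotype strings differ."""
    return sum(x != y for x, y in zip(a, b))

def mean_pair_distance(pairs):
    """Average Hamming distance over a collection of (individual, individual) pairs."""
    pairs = list(pairs)
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)

def between_minus_within(cases, controls):
    """Common-variant contrast (assumed form): mean case-control distance
    minus mean distance among pairs sharing the same disease status."""
    between = mean_pair_distance((a, b) for a in cases for b in controls)
    same_status = list(combinations(cases, 2)) + list(combinations(controls, 2))
    return between - mean_pair_distance(same_status)

def within_difference(cases, controls):
    """Rare-variant contrast (assumed form): mean within-case distance
    minus mean within-control distance."""
    return (mean_pair_distance(combinations(cases, 2))
            - mean_pair_distance(combinations(controls, 2)))

def permutation_p_value(stat, cases, controls, n_perm=2000, seed=0):
    """Two-sided permutation p-value; a stand-in for the thesis's asymptotics."""
    rng = random.Random(seed)
    observed = abs(stat(cases, controls))
    pooled = list(cases) + list(controls)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign disease labels at random
        hits += abs(stat(pooled[:len(cases)], pooled[len(cases):])) >= observed
    return (hits + 1) / (n_perm + 1)

# Hypothetical SNP set of four SNPs; each string is one individual's genotypes.
cases = ["2210", "2211", "2110"]
controls = ["0010", "0011", "0110"]
common_stat = between_minus_within(cases, controls)
rare_stat = within_difference(cases, controls)
p_common = permutation_p_value(between_minus_within, cases, controls)
print(common_stat, rare_stat, p_common)
```

Both contrasts work directly on unphased genotype strings, which reflects the abstract's point that no haplotype inference is needed.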
author2 Chuhsing Kate Hsiao
author_facet Chuhsing Kate Hsiao
Charlotte Wang
王彥雯
author Charlotte Wang
王彥雯
spelling ndltd-TW-104NTU055440022016-05-10T04:08:47Z http://ndltd.ncl.edu.tw/handle/82199212159365155483 SNP-set Detection and Association Test with Hamming Distance Information 利用漢明距離偵測單一核苷酸多型性之群聚與單一核苷酸多型性集合之相關性檢定 Charlotte Wang 王彥雯 博士 國立臺灣大學 流行病學與預防醫學研究所 104 Chuhsing Kate Hsiao 蕭朱杏 2015 學位論文 ; thesis 139 en_US