Using Hamming Distance as Information for SNP-Sets Clustering and Testing in Disease Association Studies.

The availability of high-throughput genomic data has led to several challenges in recent genetic association studies, including the large number of genetic variants that must be considered and the computational complexity in statistical analyses. Tackling these problems with a marker-set study such...

Bibliographic Details
Main Authors: Charlotte Wang, Wen-Hsin Kao, Chuhsing Kate Hsiao
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2015-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC4547758?pdf=render
id doaj-2ba98012e68a4a558a84961188aaae21
record_format Article
spelling doaj-2ba98012e68a4a558a84961188aaae21 | 2020-11-25T01:56:05Z | eng | Public Library of Science (PLoS) | PLoS ONE | 1932-6203 | 2015-01-01 | 10 | 8 | e0135918 | 10.1371/journal.pone.0135918 | Using Hamming Distance as Information for SNP-Sets Clustering and Testing in Disease Association Studies. | Charlotte Wang | Wen-Hsin Kao | Chuhsing Kate Hsiao | http://europepmc.org/articles/PMC4547758?pdf=render
collection DOAJ
language English
format Article
sources DOAJ
author Charlotte Wang
Wen-Hsin Kao
Chuhsing Kate Hsiao
title Using Hamming Distance as Information for SNP-Sets Clustering and Testing in Disease Association Studies.
publisher Public Library of Science (PLoS)
series PLoS ONE
issn 1932-6203
publishDate 2015-01-01
description The availability of high-throughput genomic data has created several challenges for recent genetic association studies, including the large number of genetic variants that must be considered and the computational complexity of the statistical analyses. Tackling these problems with a marker-set study such as SNP-set analysis can be an efficient solution. To construct SNP-sets, we first propose a clustering algorithm that employs Hamming distance to measure the similarity between strings of SNP genotypes and evaluates whether the given SNPs or SNP-sets should be clustered. A dendrogram can then be constructed from this distance measure, and the number of clusters can be determined. With the resulting SNP-sets, we next develop an association test, HDAT, to examine susceptibility to the disease of interest. The proposed test assesses, based on Hamming distance, whether the similarity between a diseased and a normal individual differs from the similarity between two individuals of the same disease status. The proposed methodology needs only genotype information. No inference of haplotypes is required, and the SNPs under consideration do not need to be located in nearby regions. The proposed clustering algorithm and association test are illustrated with applications and simulation studies. Compared with existing methods, the clustering algorithm is faster and better at identifying sets containing SNPs that exert a similar effect. In addition, the simulation studies demonstrated that the proposed test works well for SNP-sets containing a large proportion of neutral SNPs. Furthermore, applying the clustering algorithm before testing a large data set helps narrow down the genetic regions that harbor susceptibility markers.
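The two ideas in the description can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm or the HDAT statistic: genotypes are assumed to be coded as strings over the characters 0, 1 and 2; the clustering step is a plain average-linkage merge with a hypothetical cut-off `max_dist`; and the test is a generic label-permutation test swapped in for HDAT's own null distribution. All function names are hypothetical.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length genotype strings differ."""
    if len(a) != len(b):
        raise ValueError("genotype strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def cluster_snps(snps, max_dist):
    """Average-linkage merging of SNPs (illustrative stand-in, not the paper's rule).

    snps: dict mapping SNP name -> genotype string across individuals.
    The closest pair of clusters is merged repeatedly until the smallest
    average Hamming distance exceeds max_dist (an assumed cut-off).
    """
    clusters = [[name] for name in snps]

    def avg_dist(c1, c2):
        total = sum(hamming(snps[i], snps[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: avg_dist(clusters[p[0]], clusters[p[1]]))
        if avg_dist(clusters[i], clusters[j]) > max_dist:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


def between_minus_within(genotypes, labels):
    """Mean distance of case-control pairs minus mean distance of same-status pairs.

    genotypes: one genotype string per individual over the SNP-set.
    labels: disease status per individual (both statuses must be present).
    """
    between, within = [], []
    for i, j in combinations(range(len(genotypes)), 2):
        d = hamming(genotypes[i], genotypes[j])
        (within if labels[i] == labels[j] else between).append(d)
    return sum(between) / len(between) - sum(within) / len(within)


def permutation_pvalue(genotypes, labels, n_perm=999, seed=0):
    """Two-sided permutation p-value (assumed null: labels are exchangeable)."""
    rng = random.Random(seed)
    observed = between_minus_within(genotypes, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(between_minus_within(genotypes, shuffled)) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

For example, `cluster_snps({"rs1": "0012", "rs2": "0012", "rs3": "2200"}, max_dist=1)` groups the two identical SNPs and leaves the third on its own. The permutation step is chosen here only because it needs no distributional assumptions; it is quadratic in the number of individuals per permutation and would be slow on large samples.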
url http://europepmc.org/articles/PMC4547758?pdf=render