Data mining of high density genomic variant data for prediction of Alzheimer's disease risk

Abstract. Background: The discovery of genetic associations is an important factor in the understanding of human illness to derive disease pathways. Identifying multiple interacting genetic mutations associated with disease remains challenging in studying...

Bibliographic Details
Main Authors:	Briones Natalia, Dinu Valentin
Format:	Article
Language:	English
Published:	BMC 2012-01-01
Series:	BMC Medical Genetics
Subjects:	Late-Onset Alzheimer's Disease GWAS SNPs Random Forest
Online Access:	http://www.biomedcentral.com/1471-2350/13/7

ISSN:	1471-2350
DOI:	10.1186/1471-2350-13-7

Abstract

Background
The discovery of genetic associations is an important factor in the understanding of human illness to derive disease pathways. Identifying multiple interacting genetic mutations associated with disease remains challenging in studying the etiology of complex diseases. And although recently new single nucleotide polymorphisms (SNPs) at genes implicated in immune response, cholesterol/lipid metabolism, and cell membrane processes have been confirmed by genome-wide association studies (GWAS) to be associated with late-onset Alzheimer's disease (LOAD), a percentage of AD heritability continues to be unexplained. We try to find other genetic variants that may influence LOAD risk utilizing data mining methods.

Methods
Two different approaches were devised to select SNPs associated with LOAD in a publicly available GWAS data set consisting of three cohorts. In both approaches, single-locus analysis (logistic regression) was conducted to filter the data with a less conservative p-value than the Bonferroni threshold; this resulted in a subset of SNPs used next in multi-locus analysis (random forest (RF)). In the second approach, we took into account prior biological knowledge, and performed sample stratification and linkage disequilibrium (LD) in addition to logistic regression analysis to preselect loci to input into the RF classifier construction step.

Results
The first approach gave 199 SNPs mostly associated with genes in calcium signaling, cell adhesion, endocytosis, immune response, and synaptic function. These SNPs together with APOE and GAB2 SNPs formed a predictive subset for LOAD status with an average error of 9.8% using 10-fold cross validation (CV) in RF modeling. Nineteen variants in LD with ST5, TRPC1, ATG10, ANO3, NDUFA12, and NISCH respectively, genes linked directly or indirectly with neurobiology, were identified with the second approach. These variants were part of a model that included APOE and GAB2 SNPs to predict LOAD risk which produced a 10-fold CV average error of 17.5% in the classification modeling.

Conclusions
With the two proposed approaches, we identified a large subset of SNPs in genes mostly clustered around specific pathways/functions and a smaller set of SNPs, within or in proximity to five genes not previously reported, that may be relevant for the prediction/understanding of AD.
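
The Methods paragraph of the abstract describes prefiltering SNPs with a single-locus p-value cutoff that is less conservative than the Bonferroni threshold. The following is a minimal Python sketch of that filtering idea only, not the authors' code. The SNP identifiers, p-values, number of tests, and relaxed cutoff are all hypothetical, since this record does not report the values actually used.

```python
from typing import Dict, List


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test cutoff that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def prefilter_snps(p_values: Dict[str, float], cutoff: float) -> List[str]:
    """Return SNP ids with p-value <= cutoff, most significant first."""
    kept = [snp for snp, p in p_values.items() if p <= cutoff]
    return sorted(kept, key=lambda snp: p_values[snp])


# Hypothetical single-locus p-values (illustrative only, not from the study).
example_p_values = {
    "rs_a": 2e-9,
    "rs_b": 4e-5,
    "rs_c": 3e-3,
    "rs_d": 0.4,
}

n_tests = 300_000  # assumed number of SNPs tested
strict = bonferroni_threshold(0.05, n_tests)
relaxed = 1e-3     # assumed relaxed cutoff; the study's value is not given here

strict_hits = prefilter_snps(example_p_values, strict)
relaxed_hits = prefilter_snps(example_p_values, relaxed)
```

With these made-up numbers the relaxed cutoff passes more SNPs to the next stage than the Bonferroni cutoff would, which is the point of the prefilter: weak single-locus signals are retained so that a multi-locus method such as a random forest can evaluate them jointly.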

Similar Items