Prediction of SNP Sequences via Gini Impurity Based Gradient Boosting Method

Machine learning approaches are increasingly applied to the analysis of single nucleotide polymorphism (SNP) data, and SNPs have been shown to be implicated in complex human diseases. In the identification of SNPs responsible for complex diseases, most genome-wide associ...
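
The title refers to feature scoring based on Gini impurity. As a rough illustration only, the sketch below computes the standard Gini impurity, 1 - sum_k p_k^2, and ranks two SNP columns by the impurity decrease obtained when samples are grouped by genotype. This is not the authors' scoring strategy, whose details are not given in this record; the function names, the 0/1/2 genotype coding, and the toy data are all assumptions made for the example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(feature, labels):
    """Impurity decrease from grouping samples by each distinct feature value."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average impurity of the groups after the split.
    weighted = sum(len(g) / n * gini_impurity(g) for g in groups.values())
    return gini_impurity(labels) - weighted


# Hypothetical toy data: genotypes coded 0/1/2, labels 0 = control, 1 = case.
labels = [0, 0, 0, 1, 1, 1]
snp_a = [0, 0, 0, 2, 2, 2]  # separates cases from controls completely
snp_b = [0, 1, 2, 0, 1, 2]  # carries no information about the label

scores = {"snp_a": gini_gain(snp_a, labels), "snp_b": gini_gain(snp_b, labels)}
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy case `snp_a` receives the full parent impurity as its score and `snp_b` receives zero, so a selection step that keeps the top-ranked features would retain `snp_a`. A boosting-based importance score would typically accumulate such decreases over the splits of many trees rather than use a single multiway split.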

Bibliographic Details
Main Authors:	Longquan Jiang, Bo Zhang, Qin Ni, Xuan Sun, Pingping Dong
Format:	Article
Language:	English
Published:	IEEE 2019-01-01
Series:	IEEE Access
Subjects:	Single nucleotide polymorphism; data mining; machine learning; interaction detection and genome-wide association studies
Online Access:	https://ieeexplore.ieee.org/document/8615995/

id	doaj-3dd376d18418414bbc8e35a62b900bfc
record_format	Article
spelling	doaj-3dd376d18418414bbc8e35a62b900bfc; 2021-03-29T22:31:28Z; eng; IEEE; IEEE Access; 2169-3536; 2019-01-01; 7; 12647; 12657; 10.1109/ACCESS.2019.2893269; 8615995; Prediction of SNP Sequences via Gini Impurity Based Gradient Boosting Method; Longquan Jiang (0), https://orcid.org/0000-0002-7333-2589; Bo Zhang (1), https://orcid.org/0000-0002-2289-2877; Qin Ni (2); Xuan Sun (3); Pingping Dong (4); (0) College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai, China; (1) College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai, China; (2) College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai, China; (3) School of Information Science and Technology, Sanda University of Shanghai, Shanghai, China; (4) College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai, China; https://ieeexplore.ieee.org/document/8615995/; Single nucleotide polymorphism; data mining; machine learning; interaction detection and genome-wide association studies
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Longquan Jiang Bo Zhang Qin Ni Xuan Sun Pingping Dong
author_facet	Longquan Jiang Bo Zhang Qin Ni Xuan Sun Pingping Dong
author_sort	Longquan Jiang
title	Prediction of SNP Sequences via Gini Impurity Based Gradient Boosting Method
publisher	IEEE
series	IEEE Access
issn	2169-3536
publishDate	2019-01-01
description	Machine learning approaches are increasingly applied to the analysis of single nucleotide polymorphism (SNP) data, and SNPs have been shown to be implicated in complex human diseases. In identifying the SNPs responsible for complex diseases, most genome-wide association studies consider one SNP at a time and ignore the diverse interactions between SNPs. A major problem is that the number of features is high relative to the small number of individuals, which complicates the task and harms predictive ability on DNA sequences. This paper proposes a novel boosting-based ensemble approach to study these interactions and introduces an importance scoring strategy based on Gini impurity for feature selection. The approach was evaluated on SNP genotyping data collected by the Southeastern University of China and compared with naive Bayes, support vector machine, and random forest. The experimental results show its validity and effectiveness for SNP interaction identification. In addition, the approach had a clear advantage in computational time and resources.
topic	Single nucleotide polymorphism; data mining; machine learning; interaction detection and genome-wide association studies
url	https://ieeexplore.ieee.org/document/8615995/
