XGBoost-Based Framework for Smoking-Induced Noncommunicable Disease Prediction

Smoking-induced noncommunicable diseases (SiNCDs) have become a significant threat to public health and a significant cause of death globally. In the last decade, numerous studies have proposed artificial intelligence techniques to predict the risk of developing SiNCDs. However, determining the most si...


Bibliographic Details
Main Authors: Khishigsuren Davagdorj, Van Huy Pham, Nipon Theera-Umpon, Keun Ho Ryu
Format: Article
Language: English
Published: MDPI AG 2020-09-01
Series: International Journal of Environmental Research and Public Health
Subjects: smoking; noncommunicable disease; feature selection; extreme gradient boosting
Online Access: https://www.mdpi.com/1660-4601/17/18/6513
id doaj-d4ea8bcce22849ca9a6db9dbcca6541a
record_format Article
spelling doaj-d4ea8bcce22849ca9a6db9dbcca6541a
2020-11-25T02:53:00Z
eng
MDPI AG
International Journal of Environmental Research and Public Health
1661-7827
1660-4601
2020-09-01
17
6513
6513
10.3390/ijerph17186513
XGBoost-Based Framework for Smoking-Induced Noncommunicable Disease Prediction
Khishigsuren Davagdorj [0]
Van Huy Pham [1]
Nipon Theera-Umpon [2]
Keun Ho Ryu [3]
[0] Database and Bioinformatics Laboratory, College of Electrical and Computer Engineering, Chungbuk National University, Cheongju 28644, Korea
[1] Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh 700000, Vietnam
[2] Department of Electrical Engineering, Faculty of Engineering, Chiang Mai University, Chiang Mai 50200, Thailand
[3] Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh 700000, Vietnam
Smoking-induced noncommunicable diseases (SiNCDs) have become a significant threat to public health and a significant cause of death globally. In the last decade, numerous studies have proposed artificial intelligence techniques to predict the risk of developing SiNCDs. However, determining the most significant features and developing interpretable models remain challenging in such systems. In this study, we propose an efficient framework based on extreme gradient boosting (XGBoost) that incorporates a hybrid feature selection (HFS) method for SiNCD prediction among the general population in South Korea and the United States. Initially, HFS is performed in three stages: (I) significant features are selected by t-test and chi-square test; (II) multicollinearity analysis serves to obtain dissimilar features; (III) the most representative features are finally selected using the least absolute shrinkage and selection operator (LASSO). The selected features are then fed into the XGBoost predictive model. The experimental results show that our proposed model outperforms several existing baseline models. In addition, the proposed model identifies important features, enhancing the interpretability of the SiNCD prediction model. Consequently, the XGBoost-based framework is expected to contribute to the early diagnosis and prevention of SiNCDs as a public health concern.
https://www.mdpi.com/1660-4601/17/18/6513
smoking
noncommunicable disease
feature selection
extreme gradient boosting
collection DOAJ
language English
format Article
sources DOAJ
author Khishigsuren Davagdorj
Van Huy Pham
Nipon Theera-Umpon
Keun Ho Ryu
title XGBoost-Based Framework for Smoking-Induced Noncommunicable Disease Prediction
publisher MDPI AG
series International Journal of Environmental Research and Public Health
issn 1661-7827
1660-4601
publishDate 2020-09-01
description Smoking-induced noncommunicable diseases (SiNCDs) have become a significant threat to public health and a significant cause of death globally. In the last decade, numerous studies have proposed artificial intelligence techniques to predict the risk of developing SiNCDs. However, determining the most significant features and developing interpretable models remain challenging in such systems. In this study, we propose an efficient framework based on extreme gradient boosting (XGBoost) that incorporates a hybrid feature selection (HFS) method for SiNCD prediction among the general population in South Korea and the United States. Initially, HFS is performed in three stages: (I) significant features are selected by t-test and chi-square test; (II) multicollinearity analysis serves to obtain dissimilar features; (III) the most representative features are finally selected using the least absolute shrinkage and selection operator (LASSO). The selected features are then fed into the XGBoost predictive model. The experimental results show that our proposed model outperforms several existing baseline models. In addition, the proposed model identifies important features, enhancing the interpretability of the SiNCD prediction model. Consequently, the XGBoost-based framework is expected to contribute to the early diagnosis and prevention of SiNCDs as a public health concern.
topic smoking
noncommunicable disease
feature selection
extreme gradient boosting
url https://www.mdpi.com/1660-4601/17/18/6513