A new analytical framework for missing data imputation and classification with uncertainty: Missing data imputation and heart failure readmission prediction.

Background: The wide adoption of electronic health record (EHR) systems has provided vast opportunities to advance health care services. However, the prevalence of missing values in EHR systems poses a great challenge for data analysis that supports clinical decision-making. The object...

Bibliographic Details
Main Authors: Zhiyong Hu, Dongping Du
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2020-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0237724
id doaj-d18913b8f38f40a3be5d81f52e2d3f83
record_format Article
spelling doaj-d18913b8f38f40a3be5d81f52e2d3f83 2021-03-04T11:12:50Z eng Public Library of Science (PLoS) PLoS ONE 1932-6203 2020-01-01 15 9 e0237724 10.1371/journal.pone.0237724 A new analytical framework for missing data imputation and classification with uncertainty: Missing data imputation and heart failure readmission prediction. Zhiyong Hu Dongping Du https://doi.org/10.1371/journal.pone.0237724
collection DOAJ
language English
format Article
sources DOAJ
author Zhiyong Hu
Dongping Du
spellingShingle Zhiyong Hu
Dongping Du
A new analytical framework for missing data imputation and classification with uncertainty: Missing data imputation and heart failure readmission prediction.
PLoS ONE
author_facet Zhiyong Hu
Dongping Du
author_sort Zhiyong Hu
title A new analytical framework for missing data imputation and classification with uncertainty: Missing data imputation and heart failure readmission prediction.
title_sort new analytical framework for missing data imputation and classification with uncertainty: missing data imputation and heart failure readmission prediction.
publisher Public Library of Science (PLoS)
series PLoS ONE
issn 1932-6203
publishDate 2020-01-01
description Background: The wide adoption of electronic health record (EHR) systems has provided vast opportunities to advance health care services. However, the prevalence of missing values in EHR systems poses a great challenge for data analysis that supports clinical decision-making. The objective of this study is to develop a new methodological framework that addresses the missing data challenge and provides a reliable tool for predicting hospital readmission among heart failure patients. Methods: We used a Gaussian Process Latent Variable Model (GPLVM) to impute the missing values. Specifically, a lower-dimensional embedding was learned from a small complete dataset and then used to impute the missing values in the incomplete dataset. The GPLVM-based imputation provides both a mean estimate and the uncertainty associated with that estimate. To incorporate this uncertainty into prediction, a constrained support vector machine (cSVM) was developed to obtain robust predictions. We first sampled multiple datasets from the distributions of input uncertainty and trained a support vector machine on each dataset. An optimal classifier was then identified by selecting the support vectors that maximize the separation margin on a newly sampled dataset and minimize the similarity with the pre-trained support vectors. Results: The proposed model was derived and validated using the PhysioNet MIMIC-III clinical database. The GPLVM imputation yielded normalized mean absolute errors of 0.11 and 0.12 when 20% and 30% of instances, respectively, contained missing values, and the confidence bounds of the estimates captured 97% of the true values. The cSVM model achieved an average area under the curve (AUC) of 0.68, which improves the prediction accuracy by 7% compared with some existing classifiers. Conclusions: The proposed method provides accurate imputation of missing values and achieves better prediction performance than existing models that can only deal with deterministic inputs.
url https://doi.org/10.1371/journal.pone.0237724
work_keys_str_mv AT zhiyonghu anewanalyticalframeworkformissingdataimputationandclassificationwithuncertaintymissingdataimputationandheartfailurereadmissionprediction
AT dongpingdu anewanalyticalframeworkformissingdataimputationandclassificationwithuncertaintymissingdataimputationandheartfailurereadmissionprediction
AT zhiyonghu newanalyticalframeworkformissingdataimputationandclassificationwithuncertaintymissingdataimputationandheartfailurereadmissionprediction
AT dongpingdu newanalyticalframeworkformissingdataimputationandclassificationwithuncertaintymissingdataimputationandheartfailurereadmissionprediction
_version_ 1714804566041559040
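
Note on the reported imputation metrics: the abstract gives a normalized mean absolute error and the share of true values that fall inside the confidence bounds, but this record does not define either one. The sketch below shows one plausible reading. It assumes the error is normalized by the range of the true values and that the bounds are the imputed mean plus or minus z standard deviations. The function names `nmae` and `coverage`, the default z = 1.96, and the toy numbers are all hypothetical and are not taken from the article.

```python
from typing import Sequence


def nmae(true_vals: Sequence[float], imputed_means: Sequence[float]) -> float:
    """Mean absolute error divided by the range of the true values.

    Range normalization is an assumption; the record does not say
    which normalization the authors used.
    """
    if not true_vals or len(true_vals) != len(imputed_means):
        raise ValueError("inputs must be non-empty and of equal length")
    value_range = max(true_vals) - min(true_vals)
    if value_range == 0:
        raise ValueError("true values have zero range; NMAE is undefined")
    mae = sum(abs(t - m) for t, m in zip(true_vals, imputed_means)) / len(true_vals)
    return mae / value_range


def coverage(
    true_vals: Sequence[float],
    imputed_means: Sequence[float],
    imputed_sds: Sequence[float],
    z: float = 1.96,  # assumed bound width, not stated in the record
) -> float:
    """Fraction of true values lying within mean +/- z * sd."""
    if not true_vals or not (len(true_vals) == len(imputed_means) == len(imputed_sds)):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        1
        for t, m, s in zip(true_vals, imputed_means, imputed_sds)
        if abs(t - m) <= z * s
    )
    return hits / len(true_vals)


# Toy values for illustration only; not data from the study.
true_vals = [1.0, 2.0, 3.0, 5.0]
means = [1.5, 2.0, 2.0, 5.0]
sds = [0.5, 0.5, 0.4, 0.1]

print(nmae(true_vals, means))
print(coverage(true_vals, means, sds))
```

Both functions take the held-out true values alongside the imputation model's per-value mean and standard deviation, which is the kind of output the abstract attributes to GPLVM-based imputation.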