A new analytical framework for missing data imputation and classification with uncertainty: Missing data imputation and heart failure readmission prediction.
<h4>Background</h4>The wide adoption of electronic health record (EHR) systems has provided vast opportunities to advance health care services. However, the prevalence of missing values in EHR systems poses a great challenge to the data analysis that supports clinical decision-making. The object...
Main Authors: | Zhiyong Hu, Dongping Du |
---|---|
Format: | Article |
Language: | English |
Published: | Public Library of Science (PLoS), 2020-01-01 |
Series: | PLoS ONE |
Online Access: | https://doi.org/10.1371/journal.pone.0237724 |
id |
doaj-d18913b8f38f40a3be5d81f52e2d3f83 |
---|---|
record_format |
Article |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Zhiyong Hu, Dongping Du |
author_facet |
Zhiyong Hu, Dongping Du |
author_sort |
Zhiyong Hu |
title |
A new analytical framework for missing data imputation and classification with uncertainty: Missing data imputation and heart failure readmission prediction. |
title_sort |
new analytical framework for missing data imputation and classification with uncertainty: missing data imputation and heart failure readmission prediction. |
publisher |
Public Library of Science (PLoS) |
series |
PLoS ONE |
issn |
1932-6203 |
publishDate |
2020-01-01 |
description |
<h4>Background</h4>The wide adoption of electronic health record (EHR) systems has provided vast opportunities to advance health care services. However, the prevalence of missing values in EHR systems poses a great challenge to the data analysis that supports clinical decision-making. The objective of this study is to develop a new methodological framework that can address the missing data challenge and provide a reliable tool to predict hospital readmission among heart failure patients.<h4>Methods</h4>We used a Gaussian Process Latent Variable Model (GPLVM) to impute the missing values. Specifically, a lower-dimensional embedding was learned from a small complete dataset and then used to impute the missing values in the incomplete dataset. The GPLVM-based imputation provides both a mean estimate and the uncertainty associated with that estimate. To incorporate this uncertainty in prediction, a constrained support vector machine (cSVM) was developed to obtain robust predictions. We first sampled multiple datasets from the distributions of input uncertainty and trained a support vector machine on each dataset. An optimal classifier was then identified by selecting the support vectors that maximize the separation margin of a newly sampled dataset and minimize the similarity with the pre-trained support vectors.<h4>Results</h4>The proposed model was derived and validated using the PhysioNet MIMIC-III clinical database. The GPLVM imputation yielded normalized mean absolute errors of 0.11 and 0.12 when 20% and 30% of instances, respectively, contained missing values, and the confidence bounds of the estimates captured 97% of the true values. The cSVM model achieved an average area under the curve of 0.68, which improves the prediction accuracy by 7% compared to some existing classifiers.<h4>Conclusions</h4>The proposed method provides accurate imputation of missing values and better prediction performance than existing models that can only deal with deterministic inputs. |
url |
https://doi.org/10.1371/journal.pone.0237724 |
_version_ |
1714804566041559040 |
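
The abstract in this record reports two imputation quality measures (a normalized mean absolute error, and the share of true values captured by the confidence bounds) and describes sampling multiple datasets from the imputation uncertainty before training classifiers. The sketch below illustrates those three steps on toy numbers. It is not the authors' implementation: the record does not say how the error was normalized or how wide the confidence bounds were, so normalizing by the range of the true values and using a bound of 1.96 standard deviations are assumptions, as are the Gaussian sampling and the function names `nmae`, `coverage`, and `sample_datasets`.

```python
import random


def nmae(true_vals, imputed_vals):
    """Mean absolute error divided by the range of the true values.

    Range normalization is an assumption; the record does not state
    which normalization the paper used.
    """
    if not true_vals or len(true_vals) != len(imputed_vals):
        raise ValueError("inputs must be non-empty and of equal length")
    value_range = max(true_vals) - min(true_vals)
    if value_range == 0:
        raise ValueError("true values have zero range; NMAE is undefined")
    mae = sum(abs(t - m) for t, m in zip(true_vals, imputed_vals)) / len(true_vals)
    return mae / value_range


def coverage(true_vals, means, sds, z=1.96):
    """Fraction of true values that fall inside mean +/- z * sd.

    z = 1.96 (a nominal 95% Gaussian interval) is an assumption.
    """
    hits = sum(1 for t, m, s in zip(true_vals, means, sds) if abs(t - m) <= z * s)
    return hits / len(true_vals)


def sample_datasets(means, sds, n_samples, seed=0):
    """Draw n_samples plausible versions of the imputed values.

    Each value is drawn from a Gaussian with the imputed mean and
    standard deviation; entries with sd == 0 are returned unchanged.
    """
    rng = random.Random(seed)
    return [
        [rng.gauss(m, s) if s > 0 else m for m, s in zip(means, sds)]
        for _ in range(n_samples)
    ]


# Toy values, invented for illustration only.
true_vals = [1.0, 2.0, 3.0, 5.0]
means = [1.2, 1.9, 3.5, 4.0]   # imputed mean estimates
sds = [0.2, 0.1, 0.2, 0.3]     # imputation standard deviations

print("NMAE:", nmae(true_vals, means))
print("coverage:", coverage(true_vals, means, sds))
print("sampled datasets:", len(sample_datasets(means, sds, n_samples=5)))
```

In a pipeline like the one the abstract describes, each list returned by `sample_datasets` would stand in for one sampled dataset on which a separate classifier is trained.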