An automated data cleaning method for Electronic Health Records by incorporating clinical knowledge

Abstract Background The use of Electronic Health Records (EHR) data in clinical research is increasing rapidly, but the abundance of data resources raises the challenge of data cleaning. Automating data cleaning can save time. In addition, the automated data cleaning tool...

Bibliographic Details
Main Authors: Xi Shi, Charlotte Prins, Gijs Van Pottelbergh, Pavlos Mamouris, Bert Vaes, Bart De Moor
Format: Article
Language: English
Published: BMC 2021-09-01
Series: BMC Medical Informatics and Decision Making
Subjects: Data cleaning; Automated method; Clinical decision support
Online Access: https://doi.org/10.1186/s12911-021-01630-7
id doaj-421af7e4e7ff46dda118729e1f803cd8
record_format Article
spelling doaj-421af7e4e7ff46dda118729e1f803cd8 2021-09-19T11:41:04Z eng BMC BMC Medical Informatics and Decision Making 1472-6947 2021-09-01 21 1 1 10 10.1186/s12911-021-01630-7
An automated data cleaning method for Electronic Health Records by incorporating clinical knowledge
Xi Shi (0)
Charlotte Prins (1)
Gijs Van Pottelbergh (2)
Pavlos Mamouris (3)
Bert Vaes (4)
Bart De Moor (5)
(0) Department of Electrical Engineering (ESAT), Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven
(1) Leuven Statistics Research Center, KU Leuven
(2) Academic Center for General Practice, KU Leuven
(3) Academic Center for General Practice, KU Leuven
(4) Academic Center for General Practice, KU Leuven
(5) Department of Electrical Engineering (ESAT), Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven
https://doi.org/10.1186/s12911-021-01630-7
Data cleaning
Automated method
Clinical decision support
collection DOAJ
language English
format Article
sources DOAJ
author Xi Shi
Charlotte Prins
Gijs Van Pottelbergh
Pavlos Mamouris
Bert Vaes
Bart De Moor
title An automated data cleaning method for Electronic Health Records by incorporating clinical knowledge
publisher BMC
series BMC Medical Informatics and Decision Making
issn 1472-6947
publishDate 2021-09-01
description Abstract. Background: The use of Electronic Health Records (EHR) data in clinical research is increasing rapidly, but the abundance of data resources raises the challenge of data cleaning. Automating data cleaning can save time. However, automated data cleaning tools built for other domains often process all variables uniformly, so they serve clinical data poorly, where variable-specific information must be considered. This paper proposes an automated data cleaning method for EHR data that takes clinical knowledge into account. Methods: We used EHR data collected from primary care in Flanders, Belgium, during 1994–2015. We constructed a Clinical Knowledge Database to store all the variable-specific information necessary for data cleaning. We applied fuzzy search to automatically detect and replace misspelled units, and performed unit conversion following the variable-specific conversion formula. The numeric values were then corrected and outliers were detected using the clinical knowledge. In total, 52 clinical variables were cleaned, and completeness (the percentage of non-missing values) and correctness (the percentage of values within the normal range) were compared before and after cleaning. Results: All variables were 100% complete before data cleaning. Completeness dropped by less than 1% for 42 variables and by 1–10% for 9 variables. Only 1 variable experienced a large decline in completeness (13.36%). After cleaning, all variables had more than 50% of values within the normal range, and 43 of them had more than 70%. Conclusions: We propose a general method for clinical variables that achieves high automation and is capable of dealing with large-scale data. The method greatly improved the efficiency of data cleaning and removed technical barriers for non-technical users.
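The pipeline summarised in the abstract (fuzzy matching of misspelled units, variable-specific unit conversion, range-based outlier detection, and a completeness measure) might be sketched roughly as follows. This is an illustration only, not the authors' implementation: the `KNOWLEDGE` table, the function names, the rounded glucose conversion factor, and the plausible range are all hypothetical placeholders, and Python's standard `difflib` stands in for whatever fuzzy search the paper used.

```python
from difflib import get_close_matches

# Hypothetical stand-in for one "Clinical Knowledge Database" entry.
# Units, the rounded factor (1 mmol/L glucose is roughly 18 mg/dL), and the
# range are illustrative assumptions, not values taken from the paper.
KNOWLEDGE = {
    "glucose": {
        "to_standard": {"mg/dL": 1.0, "mmol/L": 18.0},
        "plausible_range": (10.0, 1000.0),
    }
}


def normalise_unit(raw_unit, known_units, cutoff=0.6):
    """Return the closest known unit spelling, or None if nothing is close."""
    lookup = {u.lower(): u for u in known_units}
    matches = get_close_matches(
        raw_unit.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


def clean_value(variable, value, raw_unit):
    """Convert to the standard unit; return None for unusable records."""
    entry = KNOWLEDGE[variable]
    unit = normalise_unit(raw_unit, entry["to_standard"])
    if unit is None:
        return None  # unit could not be recognised
    converted = value * entry["to_standard"][unit]
    low, high = entry["plausible_range"]
    # Values outside the plausible range are treated as outliers.
    return converted if low <= converted <= high else None


def completeness(cleaned):
    """Percentage of non-missing values after cleaning."""
    return 100.0 * sum(v is not None for v in cleaned) / len(cleaned)
```

For example, `clean_value("glucose", 5.5, "mmoll/L")` resolves the misspelled unit to `mmol/L` and converts the value, whereas a value outside the assumed range comes back as `None` and lowers the completeness figure.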
topic Data cleaning
Automated method
Clinical decision support
url https://doi.org/10.1186/s12911-021-01630-7