An automated data cleaning method for Electronic Health Records by incorporating clinical knowledge
Abstract Background The use of Electronic Health Records (EHR) data in clinical research is increasing rapidly, but the abundance of data resources raises the challenge of data cleaning. Automating the data cleaning can save time. In addition, the automated data cleaning tool...
Main Authors: | Xi Shi, Charlotte Prins, Gijs Van Pottelbergh, Pavlos Mamouris, Bert Vaes, Bart De Moor |
---|---|
Format: | Article |
Language: | English |
Published: | BMC, 2021-09-01 |
Series: | BMC Medical Informatics and Decision Making |
Subjects: | Data cleaning, Automated method, Clinical decision support |
Online Access: | https://doi.org/10.1186/s12911-021-01630-7 |
id |
doaj-421af7e4e7ff46dda118729e1f803cd8 |
---|---|
record_format |
Article |
spelling |
Record: doaj-421af7e4e7ff46dda118729e1f803cd8, 2021-09-19T11:41:04Z, eng
Publisher: BMC
Journal: BMC Medical Informatics and Decision Making, ISSN 1472-6947, 2021-09-01, volume 21, issue 1, pages 1–10
DOI: 10.1186/s12911-021-01630-7
Title: An automated data cleaning method for Electronic Health Records by incorporating clinical knowledge
Authors and affiliations:
Xi Shi (Department of Electrical Engineering (ESAT), Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven)
Charlotte Prins (Leuven Statistics Research Center, KU Leuven)
Gijs Van Pottelbergh (Academic Center for General Practice, KU Leuven)
Pavlos Mamouris (Academic Center for General Practice, KU Leuven)
Bert Vaes (Academic Center for General Practice, KU Leuven)
Bart De Moor (Department of Electrical Engineering (ESAT), Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven)
Abstract
Background: The use of Electronic Health Records (EHR) data in clinical research is increasing rapidly, but the abundance of data resources raises the challenge of data cleaning. Automating the data cleaning can save time. In addition, automated data cleaning tools built for other domains often process all variables uniformly, so they do not serve clinical data well, where variable-specific information needs to be considered. This paper proposes an automated data cleaning method for EHR data that takes clinical knowledge into consideration.
Methods: We used EHR data collected from primary care in Flanders, Belgium during 1994–2015. We constructed a Clinical Knowledge Database to store all the variable-specific information that is necessary for data cleaning. We applied fuzzy search to automatically detect and replace misspelled units, and performed unit conversion following the variable-specific conversion formula. Then the numeric values were corrected and outliers were detected using the clinical knowledge. In total, 52 clinical variables were cleaned, and the percentage of non-missing values (completeness) and the percentage of values within the normal range (correctness) before and after the cleaning process were compared.
Results: All variables were 100% complete before data cleaning. 42 variables had a drop of less than 1% in completeness and 9 variables declined by 1–10%. Only 1 variable experienced a large decline in completeness (13.36%). All variables had more than 50% of values within the normal range after cleaning, of which 43 variables had a percentage higher than 70%.
Conclusions: We propose a general method for clinical variables, which achieves high automation and is capable of dealing with large-scale data. This method greatly improved the efficiency of data cleaning and removed the technical barriers for non-technical people.
URL: https://doi.org/10.1186/s12911-021-01630-7
Keywords: Data cleaning; Automated method; Clinical decision support |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Xi Shi Charlotte Prins Gijs Van Pottelbergh Pavlos Mamouris Bert Vaes Bart De Moor |
title |
An automated data cleaning method for Electronic Health Records by incorporating clinical knowledge |
publisher |
BMC |
series |
BMC Medical Informatics and Decision Making |
issn |
1472-6947 |
publishDate |
2021-09-01 |
topic |
Data cleaning Automated method Clinical decision support |
url |
https://doi.org/10.1186/s12911-021-01630-7 |
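The cleaning steps named in the abstract (fuzzy matching of misspelled units, variable-specific unit conversion, knowledge-based outlier detection, and the completeness and correctness percentages) can be illustrated with a minimal sketch. This is not the authors' implementation: the `KNOWLEDGE` table, the `weight` variable, every number in it, the 0.6 similarity cutoff, and the function names are hypothetical placeholders, and Python's standard `difflib` stands in for whatever fuzzy search the paper used. Correctness is computed here over non-missing values only, which is also an assumption.

```python
from difflib import get_close_matches

# Hypothetical knowledge entry. Names and numbers are illustrative placeholders,
# not values taken from the paper's Clinical Knowledge Database.
KNOWLEDGE = {
    "weight": {
        "standard_unit": "kg",
        "to_standard": {"kg": 1.0, "g": 0.001, "lb": 0.45359237},
        "plausible_range": (1.0, 400.0),   # values outside are treated as outliers
        "normal_range": (40.0, 120.0),     # used only for the correctness metric
    }
}


def clean_value(variable, value, unit):
    """Return the value in the standard unit, or None if it cannot be trusted."""
    entry = KNOWLEDGE[variable]
    factors = entry["to_standard"]
    # Fuzzy search: map a possibly misspelled unit to the closest known unit.
    match = get_close_matches(unit.strip().lower(), list(factors), n=1, cutoff=0.6)
    if not match:
        return None  # unknown unit: mark the value as missing
    converted = value * factors[match[0]]
    low, high = entry["plausible_range"]
    # Outlier detection: drop values outside the plausible range.
    return converted if low <= converted <= high else None


def completeness_and_correctness(variable, cleaned):
    """Percent of non-missing values, and percent of those in the normal range."""
    present = [v for v in cleaned if v is not None]
    low, high = KNOWLEDGE[variable]["normal_range"]
    in_range = [v for v in present if low <= v <= high]
    completeness = 100.0 * len(present) / len(cleaned) if cleaned else 0.0
    correctness = 100.0 * len(in_range) / len(present) if present else 0.0
    return completeness, correctness


records = [(72.5, "kg"), (68000, "g"), (150, "lbs"), (7250, "kgs"), (80, "mmol")]
cleaned = [clean_value("weight", value, unit) for value, unit in records]
print(cleaned)
print(completeness_and_correctness("weight", cleaned))
```

In this toy run the misspelled units `lbs` and `kgs` are mapped to `lb` and `kg`, the implausible 7250 kg reading and the unrecognised `mmol` unit become missing values, and completeness therefore drops while the remaining values all fall in the normal range. Fuzzy matching on very short unit strings is fragile, so a real knowledge table would likely restrict the candidate units per variable, as done here.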