Complexities, variations, and errors of numbering within clinical notes: the potential impact on information extraction and cohort-identification
Abstract Background Numbers and numerical concepts appear frequently in free text clinical notes from electronic health records. Knowledge of the frequent lexical variations of these numerical concepts, and their accurate identification, is important for many information extraction tasks. This paper...
Main Authors: | David A. Hanauer, Qiaozhu Mei, V. G. Vinod Vydiswaran, Karandeep Singh, Zach Landis-Lewis, Chunhua Weng |
---|---|
Format: | Article |
Language: | English |
Published: | BMC 2019-04-01 |
Series: | BMC Medical Informatics and Decision Making |
Subjects: | Lexical variation; Natural language processing; Information retrieval |
Online Access: | http://link.springer.com/article/10.1186/s12911-019-0784-1 |
id |
doaj-473dfbeddfd3427fae71e72dcb624e92 |
---|---|
record_format |
Article |
spelling |
doaj-473dfbeddfd3427fae71e72dcb624e92 2020-11-25T03:23:01Z eng BMC BMC Medical Informatics and Decision Making 1472-6947 2019-04-01 19S35969 10.1186/s12911-019-0784-1 Complexities, variations, and errors of numbering within clinical notes: the potential impact on information extraction and cohort-identification David A. Hanauer 0; Qiaozhu Mei 1; V. G. Vinod Vydiswaran 2; Karandeep Singh 3; Zach Landis-Lewis 4; Chunhua Weng 5 Department of Pediatrics, University of Michigan; School of Information, University of Michigan; School of Information, University of Michigan; Department of Learning Health Sciences, University of Michigan; Department of Learning Health Sciences, University of Michigan; Department of Biomedical Informatics, Columbia University Abstract Background Numbers and numerical concepts appear frequently in free text clinical notes from electronic health records. Knowledge of the frequent lexical variations of these numerical concepts, and their accurate identification, is important for many information extraction tasks. This paper describes an analysis of the variation in how numbers and numerical concepts are represented in clinical notes. Methods We used an inverted index of approximately 100 million notes to obtain the frequency of various permutations of numbers and numerical concepts, including the use of Roman numerals, numbers spelled as English words, and invalid dates, among others. Overall, twelve types of lexical variants were analyzed. Results We found substantial variation in how these concepts were represented in the notes, including multiple data quality issues. We also demonstrate that not considering these variations could have substantial real-world implications for cohort identification tasks, with one case missing > 80% of potential patients. Conclusions Numbering within clinical notes can be variable, and not taking these variations into account could result in missing or inaccurate information for natural language processing and information retrieval tasks. http://link.springer.com/article/10.1186/s12911-019-0784-1 Lexical variation; Natural language processing; Information retrieval |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
David A. Hanauer Qiaozhu Mei V. G. Vinod Vydiswaran Karandeep Singh Zach Landis-Lewis Chunhua Weng |
spellingShingle |
David A. Hanauer Qiaozhu Mei V. G. Vinod Vydiswaran Karandeep Singh Zach Landis-Lewis Chunhua Weng Complexities, variations, and errors of numbering within clinical notes: the potential impact on information extraction and cohort-identification BMC Medical Informatics and Decision Making Lexical variation Natural language processing Information retrieval |
author_facet |
David A. Hanauer Qiaozhu Mei V. G. Vinod Vydiswaran Karandeep Singh Zach Landis-Lewis Chunhua Weng |
author_sort |
David A. Hanauer |
title |
Complexities, variations, and errors of numbering within clinical notes: the potential impact on information extraction and cohort-identification |
title_short |
Complexities, variations, and errors of numbering within clinical notes: the potential impact on information extraction and cohort-identification |
title_full |
Complexities, variations, and errors of numbering within clinical notes: the potential impact on information extraction and cohort-identification |
title_fullStr |
Complexities, variations, and errors of numbering within clinical notes: the potential impact on information extraction and cohort-identification |
title_full_unstemmed |
Complexities, variations, and errors of numbering within clinical notes: the potential impact on information extraction and cohort-identification |
title_sort |
complexities, variations, and errors of numbering within clinical notes: the potential impact on information extraction and cohort-identification |
publisher |
BMC |
series |
BMC Medical Informatics and Decision Making |
issn |
1472-6947 |
publishDate |
2019-04-01 |
description |
Abstract Background Numbers and numerical concepts appear frequently in free text clinical notes from electronic health records. Knowledge of the frequent lexical variations of these numerical concepts, and their accurate identification, is important for many information extraction tasks. This paper describes an analysis of the variation in how numbers and numerical concepts are represented in clinical notes. Methods We used an inverted index of approximately 100 million notes to obtain the frequency of various permutations of numbers and numerical concepts, including the use of Roman numerals, numbers spelled as English words, and invalid dates, among others. Overall, twelve types of lexical variants were analyzed. Results We found substantial variation in how these concepts were represented in the notes, including multiple data quality issues. We also demonstrate that not considering these variations could have substantial real-world implications for cohort identification tasks, with one case missing > 80% of potential patients. Conclusions Numbering within clinical notes can be variable, and not taking these variations into account could result in missing or inaccurate information for natural language processing and information retrieval tasks. |
topic |
Lexical variation Natural language processing Information retrieval |
url |
http://link.springer.com/article/10.1186/s12911-019-0784-1 |
work_keys_str_mv |
AT davidahanauer complexitiesvariationsanderrorsofnumberingwithinclinicalnotesthepotentialimpactoninformationextractionandcohortidentification AT qiaozhumei complexitiesvariationsanderrorsofnumberingwithinclinicalnotesthepotentialimpactoninformationextractionandcohortidentification AT vgvinodvydiswaran complexitiesvariationsanderrorsofnumberingwithinclinicalnotesthepotentialimpactoninformationextractionandcohortidentification AT karandeepsingh complexitiesvariationsanderrorsofnumberingwithinclinicalnotesthepotentialimpactoninformationextractionandcohortidentification AT zachlandislewis complexitiesvariationsanderrorsofnumberingwithinclinicalnotesthepotentialimpactoninformationextractionandcohortidentification AT chunhuaweng complexitiesvariationsanderrorsofnumberingwithinclinicalnotesthepotentialimpactoninformationextractionandcohortidentification |
_version_ |
1724608349954113536 |
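
Illustrative note: the abstract reports that ignoring lexical variants of numbers (Arabic digits, Roman numerals, numbers spelled as English words) and invalid dates can cause a search to miss patients. The Python sketch below is only a toy illustration of that idea and is not the authors' method. The variant table, the helper names (`variant_pattern`, `is_valid_date`, `find_invalid_dates`), the assumed month/day/year date format, and the sample notes are all hypothetical and were invented for this example.

```python
import re
from datetime import date

# Hypothetical variant table for illustration only; not taken from the paper.
VARIANTS = {
    2: ["2", "two", "ii"],
    3: ["3", "three", "iii"],
    4: ["4", "four", "iv"],
}


def variant_pattern(prefix: str, number: int) -> re.Pattern:
    """Regex matching `prefix` followed by any listed variant of `number`."""
    alternatives = "|".join(re.escape(v) for v in VARIANTS[number])
    return re.compile(
        rf"\b{re.escape(prefix)}\s+(?:{alternatives})\b", re.IGNORECASE
    )


def is_valid_date(month: int, day: int, year: int) -> bool:
    """Return True if the month/day/year combination is a real calendar date."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False


def find_invalid_dates(text: str) -> list[str]:
    """Return date-like strings (assumed MM/DD/YYYY) that are not real dates."""
    found = re.findall(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b", text)
    return [
        f"{m}/{d}/{y}"
        for m, d, y in found
        if not is_valid_date(int(m), int(d), int(y))
    ]


# Invented sample notes; real clinical notes are far more varied.
notes = [
    "Diagnosed with stage 3 disease.",
    "Stage III confirmed on biopsy.",
    "stage three, follow up on 02/30/2015.",
    "Stage 2 at presentation.",
]

# A literal search finds only the Arabic-digit form.
literal_hits = [n for n in notes if "stage 3" in n]

# A variant-aware search also finds the Roman-numeral and spelled-out forms.
pattern = variant_pattern("stage", 3)
variant_hits = [n for n in notes if pattern.search(n)]

invalid_dates = [d for n in notes for d in find_invalid_dates(n)]

print(len(literal_hits), len(variant_hits), invalid_dates)
```

In this toy data the literal query retrieves fewer notes than the variant-aware query, which mirrors, at a very small scale, the kind of under-retrieval the abstract describes for cohort identification.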