Complexities, variations, and errors of numbering within clinical notes: the potential impact on information extraction and cohort-identification

Abstract. Background: Numbers and numerical concepts appear frequently in free-text clinical notes from electronic health records. Knowledge of the frequent lexical variations of these numerical concepts, and their accurate identification, is important for many information extraction tasks. This paper...


Bibliographic Details
Main Authors: David A. Hanauer, Qiaozhu Mei, V. G. Vinod Vydiswaran, Karandeep Singh, Zach Landis-Lewis, Chunhua Weng
Format: Article
Language: English
Published: BMC 2019-04-01
Series: BMC Medical Informatics and Decision Making
Subjects: Lexical variation; Natural language processing; Information retrieval
Online Access: http://link.springer.com/article/10.1186/s12911-019-0784-1
id doaj-473dfbeddfd3427fae71e72dcb624e92
record_format Article
spelling doaj-473dfbeddfd3427fae71e72dcb624e92 2020-11-25T03:23:01Z eng BMC BMC Medical Informatics and Decision Making 1472-6947 2019-04-01 19 S3 59 69 10.1186/s12911-019-0784-1 Complexities, variations, and errors of numbering within clinical notes: the potential impact on information extraction and cohort-identification
David A. Hanauer (Department of Pediatrics, University of Michigan)
Qiaozhu Mei (School of Information, University of Michigan)
V. G. Vinod Vydiswaran (School of Information, University of Michigan)
Karandeep Singh (Department of Learning Health Sciences, University of Michigan)
Zach Landis-Lewis (Department of Learning Health Sciences, University of Michigan)
Chunhua Weng (Department of Biomedical Informatics, Columbia University)
http://link.springer.com/article/10.1186/s12911-019-0784-1
Lexical variation
Natural language processing
Information retrieval
collection DOAJ
language English
format Article
sources DOAJ
author David A. Hanauer
Qiaozhu Mei
V. G. Vinod Vydiswaran
Karandeep Singh
Zach Landis-Lewis
Chunhua Weng
title Complexities, variations, and errors of numbering within clinical notes: the potential impact on information extraction and cohort-identification
publisher BMC
series BMC Medical Informatics and Decision Making
issn 1472-6947
publishDate 2019-04-01
description Abstract. Background: Numbers and numerical concepts appear frequently in free-text clinical notes from electronic health records. Knowledge of the frequent lexical variations of these numerical concepts, and their accurate identification, is important for many information extraction tasks. This paper describes an analysis of the variation in how numbers and numerical concepts are represented in clinical notes. Methods: We used an inverted index of approximately 100 million notes to obtain the frequency of various permutations of numbers and numerical concepts, including the use of Roman numerals, numbers spelled as English words, and invalid dates, among others. Overall, twelve types of lexical variants were analyzed. Results: We found substantial variation in how these concepts were represented in the notes, including multiple data quality issues. We also demonstrate that not considering these variations could have substantial real-world implications for cohort identification tasks, with one case missing more than 80% of potential patients. Conclusions: Numbering within clinical notes can be variable, and not taking these variations into account could result in missing or inaccurate information for natural language processing and information retrieval tasks.
topic Lexical variation
Natural language processing
Information retrieval
url http://link.springer.com/article/10.1186/s12911-019-0784-1
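The Methods summary in the description mentions counting lexical variants of numbers, including Roman numerals, numbers spelled as English words, and invalid dates. The following is a minimal Python sketch of what handling such variants might look like; it is not the authors' implementation, which relied on an inverted index over the notes. All function names (`to_words`, `to_roman`, `number_variants`, `variant_pattern`, `is_valid_date`) are hypothetical, the word forms cover only 1 to 99, and the date check assumes a US-style `M/D/YYYY` layout.

```python
import re
from datetime import date

# Hypothetical helpers for illustration only; not taken from the article.
_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
         "eighty", "ninety"]
_ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
          (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
          (5, "V"), (4, "IV"), (1, "I")]


def to_words(n: int) -> str:
    """Spell an integer from 0 to 99 as English words."""
    if not 0 <= n <= 99:
        raise ValueError("only 0..99 supported in this sketch")
    if n < 20:
        return _ONES[n]
    tens, ones = divmod(n, 10)
    return _TENS[tens] if ones == 0 else f"{_TENS[tens]}-{_ONES[ones]}"


def to_roman(n: int) -> str:
    """Convert an integer from 1 to 3999 to an uppercase Roman numeral."""
    if not 1 <= n <= 3999:
        raise ValueError("Roman numerals need 1..3999")
    parts = []
    for value, symbol in _ROMAN:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


def number_variants(n: int) -> set[str]:
    """Return digit, English-word, and Roman-numeral forms of n (1..99)."""
    roman = to_roman(n)
    return {str(n), to_words(n), roman, roman.lower()}


def variant_pattern(prefix: str, n: int) -> re.Pattern:
    """Build a case-insensitive regex matching prefix followed by any variant of n."""
    # Longest alternatives first so a longer form is tried before a shorter one.
    alts = sorted(number_variants(n), key=len, reverse=True)
    body = "|".join(re.escape(a) for a in alts)
    return re.compile(rf"\b{re.escape(prefix)}\s+(?:{body})\b", re.IGNORECASE)


def is_valid_date(text: str) -> bool:
    """Check whether an M/D/YYYY string names a real calendar date."""
    match = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", text)
    if not match:
        return False
    month, day, year = map(int, match.groups())
    try:
        date(year, month, day)
    except ValueError:
        return False
    return True
```

For instance, `variant_pattern("stage", 4)` would match "stage 4", "Stage IV", and "stage four" but not "stage 40", while `is_valid_date("02/30/2018")` returns `False`. A search restricted to a single form, such as the digit alone, would miss notes that use the other forms, which is the kind of gap the abstract warns about for cohort identification.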