Data-Driven Information Extraction from Chinese Electronic Medical Records.
This study aims to propose a data-driven framework that takes unstructured free text narratives in Chinese Electronic Medical Records (EMRs) as input and converts them into structured time-event-description triples, where the description is either an elaboration or an outcome of the medical event. Ou...
Main Authors: | Dong Xu, Meizhuo Zhang, Tianwan Zhao, Chen Ge, Weiguo Gao, Jia Wei, Kenny Q Zhu |
---|---|
Format: | Article |
Language: | English |
Published: | Public Library of Science (PLoS), 2015-01-01 |
Series: | PLoS ONE |
Online Access: | http://europepmc.org/articles/PMC4546596?pdf=render |
id |
doaj-177a747b2d9949d297438e194bc2c95a |
---|---|
record_format |
Article |
spelling |
doaj-177a747b2d9949d297438e194bc2c95a; 2020-11-25T00:57:16Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2015-01-01; 10; 8; e0136270; 10.1371/journal.pone.0136270; Data-Driven Information Extraction from Chinese Electronic Medical Records.; Dong Xu; Meizhuo Zhang; Tianwan Zhao; Chen Ge; Weiguo Gao; Jia Wei; Kenny Q Zhu; http://europepmc.org/articles/PMC4546596?pdf=render |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Dong Xu, Meizhuo Zhang, Tianwan Zhao, Chen Ge, Weiguo Gao, Jia Wei, Kenny Q Zhu |
author_sort |
Dong Xu |
title |
Data-Driven Information Extraction from Chinese Electronic Medical Records. |
publisher |
Public Library of Science (PLoS) |
series |
PLoS ONE |
issn |
1932-6203 |
publishDate |
2015-01-01 |
description |
This study aims to propose a data-driven framework that takes unstructured free text narratives in Chinese Electronic Medical Records (EMRs) as input and converts them into structured time-event-description triples, where the description is either an elaboration or an outcome of the medical event. Our framework uses a hybrid approach. It consists of constructing cross-domain core medical lexica, an unsupervised, iterative algorithm to accrue more accurate terms into the lexica, rules to address Chinese writing conventions and temporal descriptors, and a Support Vector Machine (SVM) algorithm that innovatively utilizes Normalized Google Distance (NGD) to estimate the correlation between medical events and their descriptions. The effectiveness of the framework was demonstrated with a dataset of 24,817 de-identified Chinese EMRs. The cross-domain medical lexica were capable of recognizing terms with an F1-score of 0.896. 98.5% of recorded medical events were linked to temporal descriptors. The NGD SVM description-event matching achieved an F1-score of 0.874. The end-to-end time-event-description extraction of our framework achieved an F1-score of 0.846. In terms of named entity recognition, the proposed framework outperforms state-of-the-art supervised learning algorithms (F1-score: 0.896 vs. 0.886). In event-description association, the NGD SVM is superior to SVM using only local context and semantic features (F1-score: 0.874 vs. 0.838). The framework is data-driven, weakly supervised, and robust against the variations and noise that tend to occur in a large corpus. It addresses Chinese medical writing conventions and variations in writing styles through patterns used for discovering new terms and rules for updating the lexica. |
url |
http://europepmc.org/articles/PMC4546596?pdf=render |
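The abstract names Normalized Google Distance (NGD) as the signal the SVM uses to estimate how strongly a medical event and a description are related. For orientation, the following is a minimal sketch of the standard NGD formula computed from occurrence counts. The function name, its parameters, and the example counts are illustrative assumptions; this record does not say how the authors obtain the counts or how the value is fed to the SVM.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Standard NGD between two terms x and y, from occurrence counts.

    fx, fy -- number of documents containing x, respectively y
    fxy    -- number of documents containing both x and y
    n      -- total number of documents (assumed larger than fx and fy)

    All names here are hypothetical; they are not taken from the paper.
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Terms that never occur (or never co-occur) are treated as
        # maximally distant in this sketch.
        return math.inf
    if n <= min(fx, fy):
        raise ValueError("n must exceed the smaller of fx and fy")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    numerator = max(log_fx, log_fy) - log_fxy
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    print(normalized_google_distance(fx=1000, fy=100, fxy=10, n=100000))
```

Smaller values indicate terms that tend to co-occur. A value of 0 means the two terms always appear together, and the sketch returns infinity when they never co-occur.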