A Bootstrapping Approach With CRF and Deep Learning Models for Improving the Biomedical Named Entity Recognition in Multi-Domains
Biomedical named entity recognition (biomedical NER) is a core component for building biomedical text processing systems, such as biomedical information retrieval and question answering systems. Recently, many machine learning-based approaches to biomedical NER have been developed. The machine learni...
Main Authors: | Juae Kim, Youngjoong Ko, Jungyun Seo |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2019-01-01 |
Series: | IEEE Access |
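Subjects: | Biomedical named entity recognition; bootstrapping; information extraction; semi-supervised learning |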
Online Access: | https://ieeexplore.ieee.org/document/8703375/ |
id |
doaj-1d68e46b55784456a64e270632ac549c |
---|---|
record_format |
Article |
spelling |
doaj-1d68e46b55784456a64e270632ac549c; 2021-03-29T23:47:49Z; eng; IEEE; IEEE Access; 2169-3536; 2019-01-01; 7; 70308; 70318; 10.1109/ACCESS.2019.2914168; 8703375; A Bootstrapping Approach With CRF and Deep Learning Models for Improving the Biomedical Named Entity Recognition in Multi-Domains; Juae Kim 0; Youngjoong Ko 1 https://orcid.org/0000-0002-0241-9193; Jungyun Seo 2; Department of Computer engineering, Sogang University, Seoul, South Korea; Department of Computer engineering, Dong-A University, Busan, South Korea; Department of Computer engineering, Sogang University, Seoul, South Korea; Biomedical named entity recognition (biomedical NER) is a core component for building biomedical text processing systems, such as biomedical information retrieval and question answering systems. Recently, many machine learning-based approaches to biomedical NER have been developed. These approaches generally require significant amounts of annotated corpora to achieve high performance. However, manually creating large, high-quality corpora is expensive because it requires biomedical experts. In addition, most existing corpora focus on a few specific sub-domains, such as disease, protein, and species, so a biomedical NER system trained on them can provide only limited information to biomedical text processing systems. In this paper, we propose a method for automatically generating a machine-labeled biomedical NER corpus that covers various sub-domains by using appropriate categories from the semantic groups of the Unified Medical Language System (UMLS). We use a bootstrapping approach that starts from a small manually annotated corpus to automatically generate a large corpus, and then construct a biomedical NER system trained on the machine-labeled corpus. Finally, we train two machine learning-based classifiers, conditional random fields (CRFs) and long short-term memory (LSTM), on the machine-labeled data to improve performance. The experimental results show that the proposed method is effective: it achieves an F1-score 23.69% higher than that of a model trained only on a small manually annotated corpus.; https://ieeexplore.ieee.org/document/8703375/; Biomedical named entity recognition; bootstrapping; information extraction; semi-supervised learning |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Juae Kim; Youngjoong Ko; Jungyun Seo |
spellingShingle |
Juae Kim Youngjoong Ko Jungyun Seo A Bootstrapping Approach With CRF and Deep Learning Models for Improving the Biomedical Named Entity Recognition in Multi-Domains IEEE Access Biomedical named entity recognition bootstrapping information extraction semi-supervised learning |
author_facet |
Juae Kim; Youngjoong Ko; Jungyun Seo |
author_sort |
Juae Kim |
title |
A Bootstrapping Approach With CRF and Deep Learning Models for Improving the Biomedical Named Entity Recognition in Multi-Domains |
title_sort |
bootstrapping approach with crf and deep learning models for improving the biomedical named entity recognition in multi-domains |
publisher |
IEEE |
series |
IEEE Access |
issn |
2169-3536 |
publishDate |
2019-01-01 |
description |
Biomedical named entity recognition (biomedical NER) is a core component for building biomedical text processing systems, such as biomedical information retrieval and question answering systems. Recently, many machine learning-based approaches to biomedical NER have been developed. These approaches generally require significant amounts of annotated corpora to achieve high performance. However, manually creating large, high-quality corpora is expensive because it requires biomedical experts. In addition, most existing corpora focus on a few specific sub-domains, such as disease, protein, and species, so a biomedical NER system trained on them can provide only limited information to biomedical text processing systems. In this paper, we propose a method for automatically generating a machine-labeled biomedical NER corpus that covers various sub-domains by using appropriate categories from the semantic groups of the Unified Medical Language System (UMLS). We use a bootstrapping approach that starts from a small manually annotated corpus to automatically generate a large corpus, and then construct a biomedical NER system trained on the machine-labeled corpus. Finally, we train two machine learning-based classifiers, conditional random fields (CRFs) and long short-term memory (LSTM), on the machine-labeled data to improve performance. The experimental results show that the proposed method is effective: it achieves an F1-score 23.69% higher than that of a model trained only on a small manually annotated corpus. |
topic |
Biomedical named entity recognition; bootstrapping; information extraction; semi-supervised learning |
url |
https://ieeexplore.ieee.org/document/8703375/ |
_version_ |
1724188953205014528 |