A Bootstrapping Approach With CRF and Deep Learning Models for Improving the Biomedical Named Entity Recognition in Multi-Domains

Biomedical named entity recognition (biomedical NER) is a core component of biomedical text processing systems, such as biomedical information retrieval and question answering systems. Recently, many machine learning-based approaches to biomedical NER have been developed. The machine learni...


Bibliographic Details
Main Authors: Juae Kim, Youngjoong Ko, Jungyun Seo
Format: Article
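Language: English
Published: IEEE 2019-01-01
Series: IEEE Access
Subjects: Biomedical named entity recognition, bootstrapping, information extraction, semi-supervised learning
Online Access: https://ieeexplore.ieee.org/document/8703375/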
id doaj-1d68e46b55784456a64e270632ac549c
record_format Article
spelling doaj-1d68e46b55784456a64e270632ac549c 2021-03-29T23:47:49Z eng IEEE IEEE Access 2169-3536 2019-01-01 7 70308 70318 10.1109/ACCESS.2019.2914168 8703375
A Bootstrapping Approach With CRF and Deep Learning Models for Improving the Biomedical Named Entity Recognition in Multi-Domains
Juae Kim 0
Youngjoong Ko 1 https://orcid.org/0000-0002-0241-9193
Jungyun Seo 2
Department of Computer engineering, Sogang University, Seoul, South Korea
Department of Computer engineering, Dong-A University, Busan, South Korea
Department of Computer engineering, Sogang University, Seoul, South Korea
https://ieeexplore.ieee.org/document/8703375/
Biomedical named entity recognition
bootstrapping
information extraction
semi-supervised learning
collection DOAJ
language English
format Article
sources DOAJ
author Juae Kim
Youngjoong Ko
Jungyun Seo
title A Bootstrapping Approach With CRF and Deep Learning Models for Improving the Biomedical Named Entity Recognition in Multi-Domains
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2019-01-01
description Biomedical named entity recognition (biomedical NER) is a core component of biomedical text processing systems, such as biomedical information retrieval and question answering systems. Recently, many machine learning-based approaches to biomedical NER have been developed. These approaches generally require large amounts of annotated corpora to achieve high performance. However, manually creating large, high-quality corpora is expensive because it requires biomedical experts. In addition, most existing corpora focus on a few specific sub-domains, such as disease, protein, and species, so a biomedical NER system trained on them can provide only limited information to biomedical text processing systems. In this paper, we propose a method for automatically generating a machine-labeled biomedical NER corpus that covers various sub-domains by using proper categories from the semantic groups of the Unified Medical Language System (UMLS). We use a bootstrapping approach that starts from a small manually annotated corpus to automatically generate a large machine-labeled corpus, and then construct a biomedical NER system trained on that corpus. Finally, we train two machine learning-based classifiers, conditional random fields (CRFs) and long short-term memory (LSTM), on the machine-labeled data to improve performance. The experimental results show that the proposed method is effective: it achieves an F1-score 23.69% higher than that of a model trained only on the small manually annotated corpus.
topic Biomedical named entity recognition
bootstrapping
information extraction
semi-supervised learning
url https://ieeexplore.ieee.org/document/8703375/