Recognition of Disease Genetic Information from Unstructured Text Data Based on BiLSTM-CRF for Molecular Mechanisms

Recognizing disease-relevant entities is an important task in mining unstructured text data from the biomedical literature to acquire biomedical knowledge. Autism spectrum disorder (ASD) is a neurological and developmental disorder characterized by deficits in communication and social i...

Bibliographic Details
Main Authors: Lejun Gong, Xingxing Zhang, Tianyin Chen, Li Zhang
Format: Article
Language: English
Published: Hindawi-Wiley 2021-01-01
Series: Security and Communication Networks
Online Access: http://dx.doi.org/10.1155/2021/6635027
id doaj-c8f4b4c8bc364279aad235bfed468bad
record_format Article
spelling doaj-c8f4b4c8bc364279aad235bfed468bad; 2021-03-01T01:14:45Z; eng; Hindawi-Wiley; Security and Communication Networks; 1939-0122; 2021-01-01; 2021; 10.1155/2021/6635027; Recognition of Disease Genetic Information from Unstructured Text Data Based on BiLSTM-CRF for Molecular Mechanisms; Lejun Gong (0), Xingxing Zhang (1), Tianyin Chen (2), Li Zhang (3); (0) Jiangsu Key Lab of Big Data Security & Intelligent Processing, (1) Jiangsu Key Lab of Big Data Security & Intelligent Processing, (2) Jiangsu Key Lab of Big Data Security & Intelligent Processing, (3) College of Computer Science and Technology; Recognizing disease-relevant entities is an important task in mining unstructured text data from the biomedical literature to acquire biomedical knowledge. Autism spectrum disorder (ASD) is a neurological and developmental disorder characterized by deficits in communication and social interaction and by repetitive behaviour. However, this kind of disease remains poorly understood to date. This study uses a computational machine learning approach to identify disease-associated entities in a collection of text data, with a focus on molecular mechanisms related to ASD. Entities related to disease are extracted from the biomedical literature related to autism using a deep learning model that combines bidirectional long short-term memory (BiLSTM) with a conditional random field (CRF). Compared with previous work, the approach is promising for identifying entities related to disease. The proposed approach, which covers five types of molecular entities, is evaluated on the GENIA corpus and obtains an F-score of 76.81%. After removing repeated molecular entities, the work extracts 9146 proteins, 145 RNAs, 7680 DNAs, 1058 cell types, and 981 cell lines from the autism biomedical literature. Finally, we perform GO and KEGG analyses of the test dataset. This study could serve as a reference for further studies on the etiology of the disease on the basis of molecular mechanisms and provide a way to explore disease genetic information.; http://dx.doi.org/10.1155/2021/6635027
collection DOAJ
language English
format Article
sources DOAJ
author Lejun Gong
Xingxing Zhang
Tianyin Chen
Li Zhang
title Recognition of Disease Genetic Information from Unstructured Text Data Based on BiLSTM-CRF for Molecular Mechanisms
publisher Hindawi-Wiley
series Security and Communication Networks
issn 1939-0122
publishDate 2021-01-01
description Recognizing disease-relevant entities is an important task in mining unstructured text data from the biomedical literature to acquire biomedical knowledge. Autism spectrum disorder (ASD) is a neurological and developmental disorder characterized by deficits in communication and social interaction and by repetitive behaviour. However, this kind of disease remains poorly understood to date. This study uses a computational machine learning approach to identify disease-associated entities in a collection of text data, with a focus on molecular mechanisms related to ASD. Entities related to disease are extracted from the biomedical literature related to autism using a deep learning model that combines bidirectional long short-term memory (BiLSTM) with a conditional random field (CRF). Compared with previous work, the approach is promising for identifying entities related to disease. The proposed approach, which covers five types of molecular entities, is evaluated on the GENIA corpus and obtains an F-score of 76.81%. After removing repeated molecular entities, the work extracts 9146 proteins, 145 RNAs, 7680 DNAs, 1058 cell types, and 981 cell lines from the autism biomedical literature. Finally, we perform GO and KEGG analyses of the test dataset. This study could serve as a reference for further studies on the etiology of the disease on the basis of molecular mechanisms and provide a way to explore disease genetic information.
url http://dx.doi.org/10.1155/2021/6635027
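
The abstract reports an F-score on GENIA and per-type counts of entities after removing repeats. The record does not say how either was computed, so the following is only a minimal sketch under stated assumptions: the tagger is assumed to emit BIO labels, the F-score is assumed to be the exact-match entity-level F1, and repeats are assumed to be removed by exact surface string. The function names and the toy sentence are hypothetical and are not taken from the article.

```python
from collections import Counter


def bio_to_spans(tags):
    """Turn a BIO tag sequence into a set of (start, end, type) spans, end exclusive.

    Assumption: a stray I- tag (no matching open span) leniently starts a new span.
    """
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold_tags, pred_tags):
    """Exact-match entity-level precision, recall, and F1 (assumed metric)."""
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def unique_entity_counts(tokens, tags):
    """Count distinct entity mentions per type, deduplicated by exact surface string."""
    unique = {(etype, " ".join(tokens[s:e])) for s, e, etype in bio_to_spans(tags)}
    return Counter(etype for etype, _ in unique)


# Hypothetical toy sentence with placeholder names, for illustration only.
tokens = ["proteinX", "binds", "the", "geneY", "promoter", "and", "proteinX"]
gold = ["B-protein", "O", "O", "B-DNA", "I-DNA", "O", "B-protein"]
pred = ["B-protein", "O", "O", "B-DNA", "O", "O", "B-protein"]

print(entity_f1(gold, pred))
print(unique_entity_counts(tokens, gold))
```

Exact-match scoring treats a span with a wrong boundary as both a false positive and a false negative, which is why the partially recognized DNA mention in the toy example lowers precision and recall together.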