Recognition of Disease Genetic Information from Unstructured Text Data Based on BiLSTM-CRF for Molecular Mechanisms
Recognizing disease-relevant entities is an important task in mining unstructured text data from the biomedical literature to obtain biomedical knowledge. Autism spectrum disorder (ASD) is a neurological and developmental disorder characterized by deficits in communication and social i...
Main Authors: | Lejun Gong, Xingxing Zhang, Tianyin Chen, Li Zhang |
---|---|
Format: | Article |
Language: | English |
Published: | Hindawi-Wiley, 2021-01-01 |
Series: | Security and Communication Networks |
Online Access: | http://dx.doi.org/10.1155/2021/6635027 |
id |
doaj-c8f4b4c8bc364279aad235bfed468bad |
---|---|
record_format |
Article |
spelling |
doaj-c8f4b4c8bc364279aad235bfed468bad; 2021-03-01T01:14:45Z; eng; Hindawi-Wiley; Security and Communication Networks; 1939-0122; 2021-01-01; 2021; 10.1155/2021/6635027; Recognition of Disease Genetic Information from Unstructured Text Data Based on BiLSTM-CRF for Molecular Mechanisms; Lejun Gong (Jiangsu Key Lab of Big Data Security & Intelligent Processing); Xingxing Zhang (Jiangsu Key Lab of Big Data Security & Intelligent Processing); Tianyin Chen (Jiangsu Key Lab of Big Data Security & Intelligent Processing); Li Zhang (College of Computer Science and Technology); http://dx.doi.org/10.1155/2021/6635027 |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Lejun Gong, Xingxing Zhang, Tianyin Chen, Li Zhang |
author_sort |
Lejun Gong |
title |
Recognition of Disease Genetic Information from Unstructured Text Data Based on BiLSTM-CRF for Molecular Mechanisms |
publisher |
Hindawi-Wiley |
series |
Security and Communication Networks |
issn |
1939-0122 |
publishDate |
2021-01-01 |
description |
Recognizing disease-relevant entities is an important task in mining unstructured text data from the biomedical literature to obtain biomedical knowledge. Autism spectrum disorder (ASD) is a neurological and developmental disorder characterized by deficits in communication and social interaction and by repetitive behaviour. However, the disease remains poorly understood to date. This study uses a computational machine learning approach to identify disease-associated entities in a text data collection, with a view to the molecular mechanisms related to ASD. Entities related to the disease are extracted from the autism-related biomedical literature using a deep learning model that combines bidirectional long short-term memory (BiLSTM) with a conditional random field (CRF). Compared with previous works, the approach is promising for identifying disease-related entities. The proposed approach, which covers five types of molecular entities, is evaluated on the GENIA corpus and obtains an F-score of 76.81%. After removing repeated molecular entities, the work extracted 9146 proteins, 145 RNAs, 7680 DNAs, 1058 cell types, and 981 cell lines from the autism biomedical literature. Finally, we perform GO and KEGG analyses of the test dataset. This study could serve as a reference for further studies on the etiology of the disease on the basis of molecular mechanisms and provides a way to explore disease genetic information. |
url |
http://dx.doi.org/10.1155/2021/6635027 |
_version_ |
1714842387259326464 |