Automatic gene annotation using GO terms from cellular component domain

Abstract. Background: The Gene Ontology (GO) is a resource that supplies information about gene product function using ontologies to represent biological knowledge. These ontologies cover three domains: Cellular Component (CC), Molecular Function (MF), and Biological Process (BP). GO annotation is a p...


Bibliographic Details
Main Authors:	Ruoyao Ding, Yingying Qu, Cathy H. Wu, K. Vijay-Shanker
Format:	Article
Language:	English
Published:	BMC 2018-12-01
Series:	BMC Medical Informatics and Decision Making
Subjects:	Natural language processing; Gene ontology annotation; Relation extraction
Online Access:	http://link.springer.com/article/10.1186/s12911-018-0694-7

id	doaj-16524b2eb38f469d97d9bc81953813f6
record_format	Article
spelling	doaj-16524b2eb38f469d97d9bc81953813f6 | 2020-11-25T00:11:16Z | eng | BMC | BMC Medical Informatics and Decision Making | 1472-6947 | 2018-12-01 | 18 | S5 | 97 | 106 | 10.1186/s12911-018-0694-7 | Automatic gene annotation using GO terms from cellular component domain | Ruoyao Ding (School of Information Science and Technology, Guangdong University of Foreign Studies); Yingying Qu (School of Business, Guangdong University of Foreign Studies); Cathy H. Wu (Department of Computer and Information Science, University of Delaware); K. Vijay-Shanker (Department of Computer and Information Science, University of Delaware) | http://link.springer.com/article/10.1186/s12911-018-0694-7 | Natural language processing; Gene ontology annotation; Relation extraction
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Ruoyao Ding, Yingying Qu, Cathy H. Wu, K. Vijay-Shanker
author_facet	Ruoyao Ding, Yingying Qu, Cathy H. Wu, K. Vijay-Shanker
author_sort	Ruoyao Ding
title	Automatic gene annotation using GO terms from cellular component domain
publisher	BMC
series	BMC Medical Informatics and Decision Making
issn	1472-6947
publishDate	2018-12-01
description	Abstract. Background: The Gene Ontology (GO) is a resource that supplies information about gene product function, using ontologies to represent biological knowledge. These ontologies cover three domains: Cellular Component (CC), Molecular Function (MF), and Biological Process (BP). GO annotation is the process of assigning functional information, expressed as GO terms, to relevant genes mentioned in the literature. It is a common task among the Model Organism Database (MOD) groups. Manual GO annotation relies on human curators who read the biomedical literature and assign GO terms to genes. This process is very time-consuming and labor-intensive. As a result, many MODs can afford to curate only a fraction of relevant articles. Methods: GO terms from the CC domain can essentially be divided into two sub-hierarchies: subcellular location terms and protein complex terms. We cast the task of gene annotation using GO terms from the CC domain as relation extraction between genes and other entities: (1) extract cases where a protein is found to be in a subcellular location, and (2) extract cases where a protein is a subunit of a protein complex. For each relation extraction task, we use an approach based on triggers and syntactic dependencies to extract the desired relations among entities. Results: We tested our approach on the BC4GO test set, a publicly available corpus for GO annotation. Our approach obtains an F1-score of 71%, a precision of 91%, and a recall of 58% for predicting GO terms from the CC domain for given genes. Conclusions: We have described a novel approach that treats gene annotation with GO terms from the CC domain as two relation extraction subtasks. Evaluation results show that our approach achieves an F1-score of 71% for predicting GO terms for given genes. Our approach can therefore be used to accelerate GO annotation for bio-annotators.
topic	Natural language processing; Gene ontology annotation; Relation extraction
url	http://link.springer.com/article/10.1186/s12911-018-0694-7
work_keys_str_mv	AT ruoyaoding automaticgeneannotationusinggotermsfromcellularcomponentdomain AT yingyingqu automaticgeneannotationusinggotermsfromcellularcomponentdomain AT cathyhwu automaticgeneannotationusinggotermsfromcellularcomponentdomain AT kvijayshanker automaticgeneannotationusinggotermsfromcellularcomponentdomain
_version_	1725405007513124864


Similar Items