Gene and protein nomenclature in public databases

Abstract. Background: Frequently, several alternative names are in use for biological objects such as genes and proteins. Applications like manual literature search, automated text-mining, named entity identification, gene/protein annotation, and linking...

Bibliographic Details
Main Authors:	Zimmer Ralf, Fundel Katrin
Format:	Article
Language:	English
Published:	BMC 2006-08-01
Series:	BMC Bioinformatics
Online Access:	http://www.biomedcentral.com/1471-2105/7/372

id	doaj-e0d4c87e69eb4931b995630b6d37ea6e
record_format	Article
spelling	doaj-e0d4c87e69eb4931b995630b6d37ea6e; 2020-11-25T01:47:51Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2006-08-01; 7; 1; 372; 10.1186/1471-2105-7-372; Gene and protein nomenclature in public databases; Zimmer Ralf; Fundel Katrin; http://www.biomedcentral.com/1471-2105/7/372
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Zimmer Ralf; Fundel Katrin
spellingShingle	Zimmer Ralf; Fundel Katrin; Gene and protein nomenclature in public databases; BMC Bioinformatics
author_facet	Zimmer Ralf; Fundel Katrin
author_sort	Zimmer Ralf
title	Gene and protein nomenclature in public databases
title_short	Gene and protein nomenclature in public databases
title_full	Gene and protein nomenclature in public databases
title_fullStr	Gene and protein nomenclature in public databases
title_full_unstemmed	Gene and protein nomenclature in public databases
title_sort	gene and protein nomenclature in public databases
publisher	BMC
series	BMC Bioinformatics
issn	1471-2105
publishDate	2006-08-01
description	Abstract. Background: Frequently, several alternative names are in use for biological objects such as genes and proteins. Applications like manual literature search, automated text-mining, named entity identification, gene/protein annotation, and linking of knowledge from different information sources require knowledge of all names in use for a given gene or protein. Various organism-specific or general public databases aim at organizing knowledge about genes and proteins. These databases can be used for deriving gene and protein name dictionaries. So far, little is known about the differences between databases in terms of size, ambiguities and overlap. Results: We compiled five gene and protein name dictionaries for each of the five model organisms (yeast, fly, mouse, rat, and human) from different organism-specific and general public databases. We analyzed the degree of ambiguity of gene and protein names within and between dictionaries and with respect to a lexicon of common English words and domain-related non-gene terms, and we compared the data sources in terms of the size of the extracted dictionaries and the overlap of synonyms between them. The study shows that the number of genes/proteins and synonyms covered in individual databases varies significantly for a given organism, and that the degree of ambiguity of synonyms varies significantly between organisms. Furthermore, it shows that, despite considerable co-curation efforts, the overlap of synonyms in different data sources is rather moderate and that the degree of ambiguity of gene names with common English words and domain-related non-gene terms varies depending on the organism considered. Conclusion: These results indicate that combining the data contained in different databases allows the generation of gene and protein name dictionaries that contain significantly more names in use than dictionaries obtained from individual data sources. Furthermore, curation of combined dictionaries considerably increases size and decreases ambiguity. The entries of the curated synonym dictionary are available for manual querying, editing, and PubMed or Google search via the ProThesaurus-wiki. For automated querying via custom software, we offer a web service and an exemplary client application.
url	http://www.biomedcentral.com/1471-2105/7/372
work_keys_str_mv	AT zimmerralf geneandproteinnomenclatureinpublicdatabases AT fundelkatrin geneandproteinnomenclatureinpublicdatabases
_version_	1725014321882202112

Similar Items