The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text.

The exponential growth of the biomedical literature is making the need for efficient, accurate text-mining tools increasingly clear. The identification of named biological entities in text is a central and difficult task. We have developed an efficient algorithm and implementation of a dictionary-ba...

Bibliographic Details
Main Authors: Evangelos Pafilis, Sune P Frankild, Lucia Fanini, Sarah Faulwetter, Christina Pavloudi, Aikaterini Vasileiadou, Christos Arvanitidis, Lars Juhl Jensen
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2013-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC3688812?pdf=render
id doaj-fc85fd5ccf214874acd5326ff6513763
record_format Article
spelling doaj-fc85fd5ccf214874acd5326ff6513763 | 2020-11-25T01:56:28Z | eng | Public Library of Science (PLoS) | PLoS ONE | 1932-6203 | 2013-01-01 | 8(6): e65390 | 10.1371/journal.pone.0065390
collection DOAJ
language English
format Article
sources DOAJ
author Evangelos Pafilis
Sune P Frankild
Lucia Fanini
Sarah Faulwetter
Christina Pavloudi
Aikaterini Vasileiadou
Christos Arvanitidis
Lars Juhl Jensen
title The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text.
publisher Public Library of Science (PLoS)
series PLoS ONE
issn 1932-6203
publishDate 2013-01-01
description The exponential growth of the biomedical literature is making the need for efficient, accurate text-mining tools increasingly clear. The identification of named biological entities in text is a central and difficult task. We have developed an efficient algorithm and implementation of a dictionary-based approach to named entity recognition, which we here use to identify names of species and other taxa in text. The tool, SPECIES, is more than an order of magnitude faster and as accurate as existing tools. The precision and recall was assessed both on an existing gold-standard corpus and on a new corpus of 800 abstracts, which were manually annotated after the development of the tool. The corpus comprises abstracts from journals selected to represent many taxonomic groups, which gives insights into which types of organism names are hard to detect and which are easy. Finally, we have tagged organism names in the entire Medline database and developed a web resource, ORGANISMS, that makes the results accessible to the broad community of biologists. The SPECIES software is open source and can be downloaded from http://species.jensenlab.org along with dictionary files and the manually annotated gold-standard corpus. The ORGANISMS web resource can be found at http://organisms.jensenlab.org.
url http://europepmc.org/articles/PMC3688812?pdf=render
_version_ 1724979928060919808
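
The abstract describes a dictionary-based approach to named entity recognition but does not spell out the mechanics. As a rough illustration of the general idea only, the following Python sketch looks up token sequences in a small name dictionary and prefers the longest match. It is not the SPECIES implementation: the toy dictionary, the function name `tag_names`, the tokenization, and the punctuation handling are all assumptions made for this example. The integer identifiers are the NCBI Taxonomy IDs commonly cited for these organisms and are used here purely as labels.

```python
import re

# Hypothetical toy dictionary: lowercased name -> taxon identifier.
# A real dictionary would be far larger; this one is for illustration only.
TOY_DICTIONARY = {
    "homo sapiens": 9606,
    "human": 9606,
    "escherichia coli": 562,
    "e. coli": 562,
    "mus musculus": 10090,
}

TOKEN = re.compile(r"\S+")  # whitespace-delimited tokens
MAX_TOKENS = max(len(name.split()) for name in TOY_DICTIONARY)


def tag_names(text, dictionary=TOY_DICTIONARY, max_tokens=MAX_TOKENS):
    """Return (start, end, matched_text, taxon_id) tuples for dictionary hits.

    At each token position the longest candidate is tried first, so
    "Homo sapiens" wins over any shorter name starting at the same token.
    """
    tokens = [(m.start(), m.end()) for m in TOKEN.finditer(text)]
    matches = []
    i = 0
    while i < len(tokens):
        hit = None
        for n in range(min(max_tokens, len(tokens) - i), 0, -1):
            raw = " ".join(text[s:e] for s, e in tokens[i:i + n])
            # Ignore an opening bracket before and punctuation after the name.
            lead = len(raw) - len(raw.lstrip("("))
            trail = len(raw) - len(raw.rstrip(".,;:)"))
            key = raw[lead:len(raw) - trail].lower()
            if key in dictionary:
                start = tokens[i][0] + lead
                end = tokens[i + n - 1][1] - trail
                hit = (start, end, text[start:end], dictionary[key])
                break
        if hit:
            matches.append(hit)
            i += n  # continue after the matched tokens
        else:
            i += 1
    return matches


if __name__ == "__main__":
    sample = "Human and Homo sapiens samples were compared with E. coli."
    for match in tag_names(sample):
        print(match)
```

Hash-based lookup keeps the cost per candidate roughly constant, which is one reason dictionary methods can be fast. A sketch this small ignores issues a production tagger has to handle, such as ambiguous names, abbreviations introduced earlier in a text, and spelling variants.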