Enhancement of chemical entity identification in text using semantic similarity validation.

Bibliographic Details
Main Authors: Tiago Grego, Francisco M Couto
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2013-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC3642108?pdf=render
id doaj-1063b1f8c471430a97f015881c8cca3f
record_format Article
spelling doaj-1063b1f8c471430a97f015881c8cca3f | 2020-11-25T01:16:11Z | eng | Public Library of Science (PLoS) | PLoS ONE | 1932-6203 | 2013-01-01 | 8(5): e62984 | 10.1371/journal.pone.0062984
collection DOAJ
language English
format Article
sources DOAJ
author Tiago Grego
Francisco M Couto
title Enhancement of chemical entity identification in text using semantic similarity validation.
publisher Public Library of Science (PLoS)
series PLoS ONE
issn 1932-6203
publishDate 2013-01-01
description As the volume of chemical data produced and reported in the literature grows rapidly, retrieving this information efficiently becomes increasingly important. Text mining tools have been applied to this problem, but despite their good performance they still produce many errors that we believe can be filtered out using semantic similarity. This paper therefore proposes a novel method that takes the results of chemical entity identification systems, such as Whatizit, and exploits the semantic relationships in ChEBI to measure the similarity between the entities found in the text. The method assigns a single validation score to each entity based on its similarity to the other entities identified in the same text. Then, using a given threshold, the method splits the entities into a set of validated entities and a set of outlier entities. We evaluated the method on the results of two state-of-the-art chemical entity identification tools, with three semantic similarity measures and two text window sizes. The method increased precision without filtering out a significant number of correctly identified entities; that is, it effectively discriminates correctly identified chemical entities while discarding a significant number of identification errors. For example, selecting a validation set containing 75% of all identified entities increased the precision by 28% for one of the tools (Whatizit) while retaining 97% of the correctly identified entities in that subset. The method can be used directly as an add-on to any state-of-the-art entity identification tool that provides mappings to a database, in order to improve its results. It is included in a freely accessible web tool at www.lasige.di.fc.ul.pt/webtools/ice/.
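The validation step described in the abstract can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration, not the authors' implementation: the toy is-a hierarchy in PARENTS is hypothetical and is not real ChEBI content, the Jaccard overlap of ancestor sets is a simple stand-in rather than one of the three similarity measures the paper evaluated, and taking the maximum similarity to any other entity is only one possible way to aggregate similarities into a single validation score. The function names and the threshold value are likewise invented.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical toy is-a hierarchy (NOT real ChEBI content).
PARENTS: Dict[str, List[str]] = {
    "entity": [],
    "alcohol": ["entity"],
    "ethanol": ["alcohol"],
    "methanol": ["alcohol"],
    "metal": ["entity"],
    "gold": ["metal"],
}


def ancestors(term: str) -> Set[str]:
    """Return the term together with all of its is-a ancestors."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def jaccard_similarity(a: str, b: str) -> float:
    """Stand-in semantic similarity: overlap of the two ancestor sets."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


def validation_scores(
    entities: List[str], similarity: Callable[[str, str], float]
) -> List[Tuple[str, float]]:
    """Score each entity by its best similarity to any other entity.

    Assumption: the maximum is used as the aggregate; an entity with
    no other entities in its text window gets a score of 0.0.
    """
    scored = []
    for i, entity in enumerate(entities):
        others = [similarity(entity, o) for j, o in enumerate(entities) if j != i]
        scored.append((entity, max(others) if others else 0.0))
    return scored


def split_by_threshold(
    scored: List[Tuple[str, float]], threshold: float
) -> Tuple[List[str], List[str]]:
    """Split entities into (validated, outliers) by validation score."""
    validated = [e for e, s in scored if s >= threshold]
    outliers = [e for e, s in scored if s < threshold]
    return validated, outliers


if __name__ == "__main__":
    found = ["ethanol", "methanol", "gold"]  # entities found in one text window
    scored = validation_scores(found, jaccard_similarity)
    validated, outliers = split_by_threshold(scored, threshold=0.3)
    print(scored)
    print("validated:", validated)
    print("outliers:", outliers)
```

In this toy window, the two alcohols support each other and pass the threshold, while the unrelated term falls into the outlier set. Passing the similarity function as a parameter mirrors the abstract's point that several similarity measures can be plugged into the same validation scheme.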
url http://europepmc.org/articles/PMC3642108?pdf=render