Neural sentence embedding models for semantic similarity estimation in the biomedical domain

Abstract. Background: Neural network-based embedding models are receiving significant attention in the field of natural language processing due to their capability to effectively capture semantic information representing words, sentences or even larger text elements in low-dimensional vector space. Wh...

Bibliographic Details
Main Authors:	Kathrin Blagec, Hong Xu, Asan Agibetov, Matthias Samwald
Format:	Article
Language:	English
Published:	BMC 2019-04-01
Series:	BMC Bioinformatics
Subjects:	Natural language processing; Semantics; Neural embedding models
Online Access:	http://link.springer.com/article/10.1186/s12859-019-2789-2

id	doaj-0ef4aa0349dc48c0a82da38f8c66c1fd
record_format	Article
spelling	doaj-0ef4aa0349dc48c0a82da38f8c66c1fd | 2020-11-25T02:58:39Z | eng | BMC | BMC Bioinformatics | 1471-2105 | 2019-04-01 | 20 | 1 | 1 | 10 | 10.1186/s12859-019-2789-2 | Neural sentence embedding models for semantic similarity estimation in the biomedical domain | Kathrin Blagec (0), Hong Xu (1), Asan Agibetov (2), Matthias Samwald (3) | Section for Artificial Intelligence and Decision Support, Center for Medical Statistics, Informatics, and Intelligent Systems, Medical University of Vienna (listed for each of the four authors) | http://link.springer.com/article/10.1186/s12859-019-2789-2 | Natural language processing; Semantics; Neural embedding models
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Kathrin Blagec, Hong Xu, Asan Agibetov, Matthias Samwald
title	Neural sentence embedding models for semantic similarity estimation in the biomedical domain
publisher	BMC
series	BMC Bioinformatics
issn	1471-2105
publishDate	2019-04-01
description	Abstract. Background: Neural network-based embedding models are receiving significant attention in the field of natural language processing due to their capability to effectively capture semantic information representing words, sentences or even larger text elements in low-dimensional vector space. While current state-of-the-art models for assessing the semantic similarity of textual statements from biomedical publications depend on the availability of laboriously curated ontologies, unsupervised neural embedding models only require large text corpora as input and do not need manual curation. In this study, we investigated the efficacy of current state-of-the-art neural sentence embedding models for semantic similarity estimation of sentences from biomedical literature. We trained different neural embedding models on 1.7 million articles from the PubMed Open Access dataset, and evaluated them based on a biomedical benchmark set containing 100 sentence pairs annotated by human experts and a smaller contradiction subset derived from the original benchmark set. Results: Experimental results showed that, with a Pearson correlation of 0.819, our best unsupervised model based on the Paragraph Vector Distributed Memory algorithm outperforms previous state-of-the-art results achieved on the BIOSSES biomedical benchmark set. Moreover, our proposed supervised model that combines different string-based similarity metrics with a neural embedding model surpasses previous ontology-dependent supervised state-of-the-art approaches in terms of Pearson’s r (r = 0.871) on the biomedical benchmark set. In contrast to the promising results for the original benchmark, we found our best models’ performance on the smaller contradiction subset to be poor. Conclusions: In this study, we have highlighted the value of neural network-based models for semantic similarity estimation in the biomedical domain by showing that they can keep up with and even surpass previous state-of-the-art approaches for semantic similarity estimation that depend on the availability of laboriously curated ontologies, when evaluated on a biomedical benchmark set. Capturing contradictions and negations in biomedical sentences, however, emerged as an essential area for further work.
topic	Natural language processing; Semantics; Neural embedding models
url	http://link.springer.com/article/10.1186/s12859-019-2789-2

Similar Items