Size Matters: The Impact of Training Size in Taxonomically-Enriched Word Embeddings

Word embeddings trained on natural corpora (e.g., newspaper collections, Wikipedia or the Web) excel in capturing thematic similarity (“topical relatedness”) on word pairs such as ‘coffee’ and ‘cup’ or ‘bus’ and ‘road’. However, they are less successful on pairs showing taxonomic similarity, like ‘c...

Bibliographic Details
Main Authors:	Maldonado Alfredo, Klubička Filip, Kelleher John
Format:	Article
Language:	English
Published:	De Gruyter 2019-10-01
Series:	Open Computer Science
Subjects:	word embeddings; taxonomic embeddings; wordnet; semantic similarity; taxonomic enrichment; retrofitting
Online Access:	https://doi.org/10.1515/comp-2019-0009

id	doaj-9a8eb5b941bc4dce88079bf39a783089
record_format	Article
spelling	doaj-9a8eb5b941bc4dce88079bf39a783089; 2021-09-06T19:19:42Z; eng; De Gruyter; Open Computer Science; ISSN 2299-1093; 2019-10-01; vol. 9, no. 1, pp. 252-267; 10.1515/comp-2019-0009; Size Matters: The Impact of Training Size in Taxonomically-Enriched Word Embeddings; Maldonado Alfredo (ADAPT Centre at Trinity College Dublin, Dublin, Ireland); Klubička Filip (ADAPT Centre at Technological University Dublin, Dublin, Ireland); Kelleher John (ADAPT Centre at Technological University Dublin, Dublin, Ireland); https://doi.org/10.1515/comp-2019-0009; word embeddings; taxonomic embeddings; wordnet; semantic similarity; taxonomic enrichment; retrofitting
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Maldonado Alfredo, Klubička Filip, Kelleher John
author_facet	Maldonado Alfredo, Klubička Filip, Kelleher John
author_sort	Maldonado Alfredo
title	Size Matters: The Impact of Training Size in Taxonomically-Enriched Word Embeddings
title_sort	size matters: the impact of training size in taxonomically-enriched word embeddings
publisher	De Gruyter
series	Open Computer Science
issn	2299-1093
publishDate	2019-10-01
description	Word embeddings trained on natural corpora (e.g., newspaper collections, Wikipedia or the Web) excel in capturing thematic similarity (“topical relatedness”) on word pairs such as ‘coffee’ and ‘cup’ or ‘bus’ and ‘road’. However, they are less successful on pairs showing taxonomic similarity, like ‘cup’ and ‘mug’ (near synonyms) or ‘bus’ and ‘train’ (types of public transport). Moreover, purely taxonomy-based embeddings (e.g., those trained on a random-walk of WordNet’s structure) outperform natural-corpus embeddings in taxonomic similarity but underperform them in thematic similarity. Previous work suggests that performance gains in both types of similarity can be achieved by enriching natural-corpus embeddings with taxonomic information from taxonomies like WordNet. This taxonomic enrichment can be done by combining natural-corpus embeddings with taxonomic embeddings (e.g., those trained on a random-walk of WordNet’s structure). This paper conducts a deep analysis of this assumption and shows that both the size of the natural corpus and of the random-walk coverage of the WordNet structure play a crucial role in the performance of combined (enriched) vectors in both similarity tasks. Specifically, we show that embeddings trained on medium-sized natural corpora benefit the most from taxonomic enrichment whilst embeddings trained on large natural corpora only benefit from this enrichment when evaluated on taxonomic similarity tasks. The implication of this is that care has to be taken in controlling the size of the natural corpus and the size of the random-walk used to train vectors. In addition, we find that, whilst the WordNet structure is finite and it is possible to fully traverse it in a single pass, the repetition of well-connected WordNet concepts in extended random-walks effectively reinforces taxonomic relations in the learned embeddings.
topic	word embeddings; taxonomic embeddings; wordnet; semantic similarity; taxonomic enrichment; retrofitting
url	https://doi.org/10.1515/comp-2019-0009

Similar Items