Towards Robust Word Embeddings for Noisy Texts

Research on word embeddings has mainly focused on improving their performance on standard corpora, disregarding the difficulties posed by noisy texts in the form of tweets and other types of non-standard writing from social media. In this work, we propose a simple extension to the skipgram model in...


Bibliographic Details
Main Authors: Yerai Doval, Jesús Vilares, Carlos Gómez-Rodríguez
Format: Article
Language: English
Published: MDPI AG 2020-10-01
Series: Applied Sciences
Online Access: https://www.mdpi.com/2076-3417/10/19/6893
id doaj-c7d9aed853894b139292460e0e47447b
collection DOAJ
doi 10.3390/app10196893
issn 2076-3417
affiliation Yerai Doval: Grupo COLE, Escola Superior de Enxeñaría Informática, Universidade de Vigo, 36310 Vigo, Spain
Jesús Vilares: Universidade da Coruña, CITIC. Grupo LyS, Departamento de Ciencias da Computación e Tecnoloxías da Información, 15071 A Coruña, Spain
Carlos Gómez-Rodríguez: Universidade da Coruña, CITIC. Grupo LyS, Departamento de Ciencias da Computación e Tecnoloxías da Información, 15071 A Coruña, Spain
description Research on word embeddings has mainly focused on improving their performance on standard corpora, disregarding the difficulties posed by noisy texts in the form of tweets and other types of non-standard writing from social media. In this work, we propose a simple extension to the skipgram model in which we introduce the concept of bridge-words, which are artificial words added to the model to strengthen the similarity between standard words and their noisy variants. Our new embeddings outperform baseline models on noisy texts on a wide range of evaluation tasks, both intrinsic and extrinsic, while retaining a good performance on standard texts. To the best of our knowledge, this is the first explicit approach at dealing with these types of noisy texts at the word embedding level that goes beyond the support for out-of-vocabulary words.
topic natural language processing
semantics
word embeddings
noisy texts
social media