Unsupervised acquisition of idiomatic units of symbolic natural language: An n-gram frequency-based approach for the chunking of news articles and tweets.

Symbolic sequential data are produced in huge quantities in numerous contexts, such as text and speech data, biometrics, genomics, financial market indexes, music sheets, and online social media posts. In this paper, an unsupervised approach for the chunking of idiomatic units of sequential text data is presented. Text chunking refers to the task of splitting a string of textual information into non-overlapping groups of related units. This is a fundamental problem in numerous fields where understanding the relations between the raw units of symbolic sequential data is relevant. Existing methods are based primarily on supervised and semi-supervised learning; in this study, however, a novel unsupervised approach is proposed that builds on the established concept of n-grams and requires no labeled text as input. The proposed methodology is applied to two natural language corpora: a Wall Street Journal corpus and a Twitter corpus. In both cases, the corpus length was increased gradually to measure accuracy as the number of unitary input elements grows. Both corpora show improvements in accuracy proportional to the increase in the number of tokens; for the Twitter corpus, the improvement follows a linear trend. These results show that the proposed methodology achieves higher accuracy with incremental usage. A future study will aim at designing an iterative system based on the proposed methodology.
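
To make the approach concrete, below is a minimal Python sketch of one way label-free, frequency-based chunking can work: adjacent tokens are merged into a chunk whenever their bigram count in a reference corpus reaches a threshold. The function name, the greedy merging rule, and the threshold value are illustrative assumptions for this sketch, not the authors' exact algorithm.

    from collections import Counter

    def chunk_by_bigram_frequency(tokens, corpus_tokens, threshold=2):
        # Illustrative sketch only; the paper's method is more elaborate.
        # Count how often each adjacent token pair occurs in the reference corpus.
        bigram_counts = Counter(zip(corpus_tokens, corpus_tokens[1:]))
        if not tokens:
            return []
        chunks, current = [], [tokens[0]]
        for prev, curr in zip(tokens, tokens[1:]):
            if bigram_counts[(prev, curr)] >= threshold:
                current.append(curr)    # frequent pair: keep extending the chunk
            else:
                chunks.append(current)  # rare pair: close the chunk, start a new one
                current = [curr]
        chunks.append(current)
        return chunks

    # Example: "wall street" co-occurs twice in the corpus, so it is merged;
    # "street journal" occurs only once and falls below the threshold.
    corpus = "the wall street journal reported that wall street rallied".split()
    print(chunk_by_bigram_frequency("wall street journal".split(), corpus))
    # -> [['wall', 'street'], ['journal']]

The sketch only captures the core intuition that high-frequency n-grams signal idiomatic units; the paper's actual frequency statistics and merging criterion differ.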

Bibliographic Details
Main Authors: Dario Borrelli, Gabriela Gongora Svartzman, Carlo Lipizzi
Format: Article
Language: English
Published: Public Library of Science (PLoS), 2020-01-01
Series: PLoS ONE, Vol 15, Iss 6, e0234214 (2020)
ISSN: 1932-6203
Online Access: https://doi.org/10.1371/journal.pone.0234214