Automatic Multilingual Stopwords Identification from Very Small Corpora
Tools for Natural Language Processing rely on language-specific linguistic resources. The complexity of building such resources causes many languages to lack them, so learning them automatically from sample texts would be a desirable solution. This usually requires huge training corpora...
Main Author: | Stefano Ferilli |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-09-01 |
Series: | Electronics |
Subjects: | natural language processing; machine learning; stopword identification |
Online Access: | https://www.mdpi.com/2079-9292/10/17/2169 |
id |
doaj-12c67420f2094a46908ac1778b2d6427 |
record_format |
Article |
spelling |
doaj-12c67420f2094a46908ac1778b2d6427 | 2021-09-09T13:42:23Z | eng | MDPI AG | Electronics | 2079-9292 | 2021-09-01 | 10(17): 2169 | 10.3390/electronics10172169 | Automatic Multilingual Stopwords Identification from Very Small Corpora | Stefano Ferilli, Department of Computer Science, University of Bari, Via E. Orabona 4, 70125 Bari, Italy | https://www.mdpi.com/2079-9292/10/17/2169 | natural language processing; machine learning; stopword identification |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Stefano Ferilli |
title |
Automatic Multilingual Stopwords Identification from Very Small Corpora |
publisher |
MDPI AG |
series |
Electronics |
issn |
2079-9292 |
publishDate |
2021-09-01 |
description |
Tools for Natural Language Processing rely on language-specific linguistic resources. The complexity of building such resources causes many languages to lack them, so learning them automatically from sample texts would be a desirable solution. This usually requires huge training corpora, which are not available for many local languages and jargons that lack a wide literature. This paper focuses on stopwords, i.e., terms in a text that do not contribute to conveying its topic or content. It provides two main, interrelated and complementary methodological contributions: (i) it proposes a novel approach, based on term and document frequency, to rank candidate stopwords, which works even on very small corpora (including single documents); and (ii) it proposes an automatic cutoff strategy to select the best candidates in the ranking, thus addressing one of the most critical problems in stopword identification practice. Notable features of these approaches are that (i) they are generic and applicable to different languages, (ii) they are fully automatic, and (iii) they require no prior linguistic knowledge. Extensive experiments show that both are extremely effective and reliable. The former outperforms all comparable state-of-the-art approaches, both in performance (Precision stays at or near 100% for a large portion of the top-ranked candidate stopwords, while Recall comes quite close to the theoretical maximum) and in smoothness of behavior (Precision decreases monotonically and Recall increases monotonically, allowing the experimenter to choose the preferred balance). The latter is more flexible than existing solutions in the literature, requiring just one parameter, intuitively related to the desired balance between Precision and Recall. |
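The abstract only outlines the method at a high level and does not give the exact scoring formula or cutoff rule, so the sketch below is purely illustrative: it assumes a candidate score built from term frequency multiplied by document frequency, and a hypothetical single-parameter cutoff (`alpha`) that trades Precision against Recall and merely stands in for the paper's automatic cutoff strategy.

```python
# Minimal, illustrative sketch (NOT the paper's actual formulas): rank candidate
# stopwords by combining term frequency and document frequency, then cut the
# ranking with a single parameter that trades Precision against Recall.
import re
from collections import Counter

def rank_stopword_candidates(documents):
    """Return (term, score) pairs sorted by a combined TF/DF score.

    Assumed score: normalized term frequency * normalized document frequency,
    so words that are both frequent overall and spread across documents rise
    to the top -- the typical profile of stopwords.
    """
    term_freq, doc_freq = Counter(), Counter()
    for doc in documents:
        tokens = re.findall(r"[^\W\d_]+", doc.lower())  # alphabetic tokens only
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    total_tokens = sum(term_freq.values())
    n_docs = len(documents)
    scores = {t: (term_freq[t] / total_tokens) * (doc_freq[t] / n_docs)
              for t in term_freq}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def cutoff(ranking, alpha=0.5):
    """Hypothetical cutoff: keep top-ranked terms until a fraction `alpha` of
    the total score mass is covered (larger alpha favors Recall, smaller
    favors Precision). This is a stand-in, not the paper's strategy.
    """
    total = sum(score for _, score in ranking)
    kept, covered = [], 0.0
    for term, score in ranking:
        if covered >= alpha * total:
            break
        kept.append(term)
        covered += score
    return kept

if __name__ == "__main__":
    docs = [
        "the quick brown fox jumps over the lazy dog",
        "the dog barks at the fox in the yard",
        "a lazy afternoon in the quiet yard",
    ]
    print(cutoff(rank_stopword_candidates(docs), alpha=0.5))
```

Note that even this toy version runs on a single document (n_docs = 1), the setting the abstract emphasizes; the actual ranking score and cutoff criterion should be taken from the paper itself.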
topic |
natural language processing; machine learning; stopword identification |
url |
https://www.mdpi.com/2079-9292/10/17/2169 |