Automatic Multilingual Stopwords Identification from Very Small Corpora

Tools for Natural Language Processing rely on linguistic resources that are language-specific. Because such resources are complex to build, many languages lack them, so learning them automatically from sample texts would be a desirable solution. This usually requires huge training corpor...


Bibliographic Details
Main Author: Stefano Ferilli
Format: Article
Language: English
Published: MDPI AG 2021-09-01
Series: Electronics
Subjects: natural language processing; machine learning; stopword identification
Online Access: https://www.mdpi.com/2079-9292/10/17/2169
id doaj-12c67420f2094a46908ac1778b2d6427
record_format Article
spelling doaj-12c67420f2094a46908ac1778b2d6427
2021-09-09T13:42:23Z
eng
MDPI AG
Electronics
2079-9292
2021-09-01
10
2169
2169
10.3390/electronics10172169
Automatic Multilingual Stopwords Identification from Very Small Corpora
Stefano Ferilli
Department of Computer Science, University of Bari, Via E. Orabona 4, 70125 Bari, Italy
https://www.mdpi.com/2079-9292/10/17/2169
natural language processing
machine learning
stopword identification
collection DOAJ
language English
format Article
sources DOAJ
author Stefano Ferilli
title Automatic Multilingual Stopwords Identification from Very Small Corpora
publisher MDPI AG
series Electronics
issn 2079-9292
publishDate 2021-09-01
description Tools for Natural Language Processing rely on linguistic resources that are language-specific. Because such resources are complex to build, many languages lack them, so learning them automatically from sample texts would be a desirable solution. This usually requires huge training corpora, which are not available for many local languages and jargons that lack a wide literature. This paper focuses on stopwords, i.e., terms in a text that do not contribute to conveying its topic or content. It provides two main, inter-related and complementary methodological contributions: (i) a novel approach, based on term and document frequency, to rank candidate stopwords, which also works on very small corpora (even single documents); and (ii) an automatic cutoff strategy to select the best candidates in the ranking, thus addressing one of the most critical problems in stopword identification practice. These approaches (i) are generic and applicable to different languages, (ii) are fully automatic, and (iii) do not require any previous linguistic knowledge. Extensive experiments show that both are extremely effective and reliable. The former outperforms all comparable state-of-the-art approaches, both in performance (Precision stays at or near 100% for a large portion of the top-ranked candidate stopwords, while Recall is quite close to the maximum reachable in theory) and in smoothness of behavior (Precision is monotonically decreasing and Recall is monotonically increasing, allowing the experimenter to choose the preferred balance). The latter is more flexible than existing solutions in the literature, requiring just one parameter that is intuitively related to the desired balance between Precision and Recall.
topic natural language processing
machine learning
stopword identification
url https://www.mdpi.com/2079-9292/10/17/2169