Stopwords in technical language processing.



Bibliographic Details
Main Authors: Serhad Sarica, Jianxi Luo
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2021-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0254937
id doaj-f8627466a7514c45b19875785fdc0289
record_format Article
citation PLoS ONE 16(8): e0254937
collection DOAJ
issn 1932-6203
description There are increasing applications of natural language processing techniques for information retrieval, indexing, topic modelling and text classification in engineering contexts. A standard component of such tasks is the removal of stopwords, which are uninformative components of the data. While researchers use readily available stopwords lists that are derived from non-technical resources, the technical jargon of engineering fields contains their own highly frequent and uninformative words and there exists no standard stopwords list for technical language processing applications. Here we address this gap by rigorously identifying generic, insignificant, uninformative stopwords in engineering texts beyond the stopwords in general texts, based on the synthesis of alternative statistical measures such as term frequency, inverse document frequency, and entropy, and curating a stopwords dataset ready for technical language processing applications.
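
The three measures named in the description (term frequency, inverse document frequency, and entropy) can be illustrated with a minimal sketch. This is not the authors' procedure: the function names `term_statistics` and `candidate_stopwords`, the threshold parameters, and the toy corpus are all hypothetical, and the simple "pass every threshold" rule is only an assumed stand-in for the synthesis the paper describes.

```python
import math
from collections import Counter


def term_statistics(docs):
    """Return {term: (tf, idf, entropy)} for a list of tokenized documents.

    tf      : total occurrences of the term in the corpus
    idf     : log(N / number of documents containing the term)
    entropy : Shannon entropy (bits) of the term's distribution over documents
    """
    n_docs = len(docs)
    counts_per_doc = [Counter(doc) for doc in docs]

    total_tf = Counter()
    doc_freq = Counter()
    for counts in counts_per_doc:
        total_tf.update(counts)          # add per-document counts
        doc_freq.update(counts.keys())   # one increment per document

    stats = {}
    for term, tf in total_tf.items():
        idf = math.log(n_docs / doc_freq[term])
        entropy = 0.0
        for counts in counts_per_doc:
            c = counts.get(term, 0)
            if c:
                p = c / tf
                entropy -= p * math.log2(p)
        stats[term] = (tf, idf, entropy)
    return stats


def candidate_stopwords(docs, min_tf, max_idf, min_entropy):
    """Hypothetical rule: a term is a candidate if it is frequent,
    appears in most documents, and is spread evenly across them."""
    stats = term_statistics(docs)
    return sorted(
        term
        for term, (tf, idf, entropy) in stats.items()
        if tf >= min_tf and idf <= max_idf and entropy >= min_entropy
    )


# Toy corpus (invented for illustration only).
docs = [
    ["the", "device", "comprises", "a", "valve"],
    ["the", "device", "comprises", "a", "rotor"],
    ["the", "device", "comprises", "a", "sensor"],
]
print(candidate_stopwords(docs, min_tf=3, max_idf=0.1, min_entropy=1.5))
# ['a', 'comprises', 'device', 'the']
```

In this toy corpus, "device" and "comprises" score like the general-language stopwords "the" and "a": high frequency, zero IDF, and maximal entropy. That is the kind of domain-specific term the description says general stopword lists miss.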