Using Natural Language Preprocessing Architecture (NLPA) for Big Data Text Sources

Bibliographic Details
Main Authors:	María Novo-Lourés, Reyes Pavón, Rosalía Laza, David Ruano-Ordas, Jose R. Méndez
Format:	Article
Language:	English
Published:	Hindawi Limited 2020-01-01
Series:	Scientific Programming
Online Access:	http://dx.doi.org/10.1155/2020/2390941

Description
Summary:	In recent years, big data analysis has become a popular means of exploiting multiple (initially valueless) sources to find relevant knowledge about real domains. However, many big data sources provide unstructured textual data, and a proper analysis requires tools that adequately combine big data and text-analysis techniques. With this in mind, we combined a pipelining framework, BDP4J (Big Data Pipelining For Java), with implementations of a set of text preprocessing techniques to create NLPA (Natural Language Preprocessing Architecture), an extensible open-source plugin whose preprocessing steps can be easily combined into a pipeline. Additionally, NLPA can generate datasets using either a classical token-based representation or a newer synset-based representation that can be further processed using semantic information (i.e., using ontologies). This work presents a case study of NLPA operation covering the transformation of raw heterogeneous big data into different dataset representations (synsets and tokens) and the use of the Weka application programming interface (API) to launch two well-known classifiers.
ISSN:	1058-9244 1875-919X
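
The idea of combining independent preprocessing steps into a pipeline that ends in a token-based representation can be illustrated with a minimal sketch. It is written in Python for brevity and does not use NLPA or BDP4J: every function name below (`strip_urls`, `to_lowercase`, `tokenize`, `token_counts`, `build_pipeline`) is a hypothetical stand-in, and the sample sentence is invented for illustration.

```python
import re
from collections import Counter
from functools import reduce

# Hypothetical steps for illustration only; none of these names
# come from NLPA or BDP4J.
def strip_urls(text):
    """Replace http(s) URLs with a space."""
    return re.sub(r"https?://\S+", " ", text)

def to_lowercase(text):
    return text.lower()

def tokenize(text):
    """Split lowercased text into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text)

def token_counts(tokens):
    """Token-based representation: map each token to its frequency."""
    return Counter(tokens)

def build_pipeline(*steps):
    """Chain steps so that each one receives the previous step's output."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)

pipeline = build_pipeline(strip_urls, to_lowercase, tokenize, token_counts)
features = pipeline("Big data, BIG text: see http://example.org/doc")
```

A synset-based representation would presumably add a further step that maps tokens to synset identifiers; that step needs an external lexical resource and is left out of this sketch.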

Similar Items