Enriching feature engineering for short text samples by language time series analysis

Abstract In this case study, we are extending feature engineering approaches for short text samples by integrating techniques which have been introduced in the context of time series classification and signal processing. The general idea of the presented feature engineering approach is to tokenize t...

Bibliographic Details
Main Authors: Yichen Tang, Kelly Blincoe, Andreas W. Kempa-Liehr
Format: Article
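Language: English
Published: SpringerOpen 2020-08-01
Series: EPJ Data Science
Subjects: Time series analysis; Language; Machine learning; Natural Language Processing; tsfresh; Feature mining
Online Access: http://link.springer.com/article/10.1140/epjds/s13688-020-00244-9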
id doaj-e7c7d266620c4244a1c7a9aa14347cae
record_format Article
spelling doaj-e7c7d266620c4244a1c7a9aa14347cae
2020-11-25T02:53:11Z
eng
SpringerOpen
EPJ Data Science
2193-1127
2020-08-01
91159
10.1140/epjds/s13688-020-00244-9
Enriching feature engineering for short text samples by language time series analysis
Yichen Tang (Department of Electrical, Computer, and Software Engineering, University of Auckland)
Kelly Blincoe (Department of Electrical, Computer, and Software Engineering, University of Auckland)
Andreas W. Kempa-Liehr (Department of Engineering Science, University of Auckland)
http://link.springer.com/article/10.1140/epjds/s13688-020-00244-9
Time series analysis
Language
Machine learning
Natural Language Processing
tsfresh
Feature mining
collection DOAJ
language English
format Article
sources DOAJ
author Yichen Tang
Kelly Blincoe
Andreas W. Kempa-Liehr
title Enriching feature engineering for short text samples by language time series analysis
publisher SpringerOpen
series EPJ Data Science
issn 2193-1127
publishDate 2020-08-01
description In this case study, we are extending feature engineering approaches for short text samples by integrating techniques which have been introduced in the context of time series classification and signal processing. The general idea of the presented feature engineering approach is to tokenize the text samples under consideration and map each token to a number, which measures a specific property of the token. Consequently, each text sample becomes a language time series, which is generated from consecutively emitted tokens, and time is represented by the position of the respective token within the text sample. The resulting language time series can be characterised by collections of established time series feature extraction algorithms from time series analysis and signal processing. This approach maps each text sample (irrespective of its original length) to 3970 stylometric features, which can be analysed with standard statistical learning methodologies. The proposed feature engineering technique for short text data is applied to two different corpora: the Federalist Papers data set and the Spooky Books data set. We demonstrate that the extracted language time series features can be successfully combined with standard machine learning approaches for natural language processing and have the potential to improve the classification performance. Furthermore, the suggested feature engineering approach can be used for visualizing differences and commonalities of stylometric features. The presented framework models the systematic feature engineering based on approaches from time series classification and develops a statistical testing methodology for multi-classification problems.
topic Time series analysis
Language
Machine learning
Natural Language Processing
tsfresh
Feature mining
url http://link.springer.com/article/10.1140/epjds/s13688-020-00244-9
_version_ 1724726269337141248
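
The pipeline summarised in the description (tokenize a text sample, map each token to a number, then characterise the resulting language time series with a fixed set of features) might look roughly like the following minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation. The token property used here (token length), the regular expression used for tokenizing, and the four hand-written features are all assumptions chosen for brevity. The record lists tsfresh as a keyword, but the sketch does not call that library and does not reproduce the 3970 features mentioned in the description. The function names `language_time_series`, `lag1_autocorrelation`, and `extract_features` are hypothetical.

```python
import re
import statistics


def language_time_series(text):
    """Map a text sample to one number per token (assumed property: token length).

    The position of a token in the returned list plays the role of time.
    """
    tokens = re.findall(r"[A-Za-z']+", text)  # simplistic tokenizer (assumption)
    return [len(token) for token in tokens]


def lag1_autocorrelation(series):
    """Lag-1 autocorrelation; returns 0.0 for constant or very short series."""
    n = len(series)
    if n < 2:
        return 0.0
    mean = sum(series) / n
    variance_sum = sum((x - mean) ** 2 for x in series)
    if variance_sum == 0:
        return 0.0
    covariance_sum = sum(
        (series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1)
    )
    return covariance_sum / variance_sum


def extract_features(series):
    """Return the same fixed set of features for a series of any length."""
    if not series:
        raise ValueError("cannot extract features from an empty series")
    return {
        "length": len(series),
        "mean": statistics.fmean(series),
        "std": statistics.pstdev(series),  # population standard deviation
        "lag1_autocorr": lag1_autocorrelation(series),
    }


if __name__ == "__main__":
    samples = [
        "The cat sat on the mat",
        "Consequently, each text sample becomes a language time series",
    ]
    for sample in samples:
        series = language_time_series(sample)
        print(series, extract_features(series))
```

Because every sample is reduced to the same feature keys regardless of how many tokens it contains, the resulting dictionaries can be stacked into a feature matrix and passed to a standard classifier, which is the role the description assigns to the extracted stylometric features.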