FDR<sup>2</sup>-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification Problems

In this paper, a methodological data condensation approach for reducing tabular big datasets in classification problems is presented, named FDR<inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula>-BD. The key to our proposal is to analyze data in a dual way (vertical and horizontal), combining feature selection, to generate dense clusters of data, with uniform sampling reduction, to keep only a few representative samples from each problem area. Its main advantage is that the model’s predictive quality is kept within a range determined by a user-defined threshold. Its robustness is built on a hyper-parametrization process, in which all data are taken into consideration by following a <i>k</i>-fold procedure. Another significant capability is its speed and scalability, achieved through fully optimized parallel operations provided by Apache Spark. An extensive experimental study is performed over 25 big datasets with different characteristics. In most cases, the obtained reduction percentages are above 95%, thus outperforming state-of-the-art solutions such as FCNN_MR, which barely reach 70%. The most promising outcome is that the representativeness of the original data is maintained, with prediction quality values within about 1% of the baseline.

Full description

Bibliographic Details
Main Authors: María José Basgall, Marcelo Naiouf, Alberto Fernández
Format: Article
Language: English
Published: MDPI AG 2021-07-01
Series: Electronics
Subjects:
Online Access: https://www.mdpi.com/2079-9292/10/15/1757
id doaj-f29cf52a859747e19da665b86a0d3613
record_format Article
spelling doaj-f29cf52a859747e19da665b86a0d36132021-08-06T15:21:02ZengMDPI AGElectronics2079-92922021-07-01101757175710.3390/electronics10151757FDR<sup>2</sup>-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification ProblemsMaría José Basgall0Marcelo Naiouf1Alberto Fernández2National Scientific and Technical Research Council, CONICET, La Plata 1900, ArgentinaInstitute of Research in Computer Science LIDI, III-LIDI, Scientific Research Commission, Province of Buenos Aires, CIC-PBA, School of Computer Science, National University of La Plata, UNLP, La Plata 1900, ArgentinaDepartment of Computer Science and Artificial Intelligence, Andalusian Research Institute in Data Science and Computational Intelligence, DaSCI, University of Granada, 18071 Granada, SpainIn this paper, a methodological data condensation approach for reducing tabular big datasets in classification problems is presented, named FDR<inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula>-BD. The key of our proposal is to analyze data in a dual way (vertical and horizontal), so as to provide a smart combination between feature selection to generate dense clusters of data and uniform sampling reduction to keep only a few representative samples from each problem area. Its main advantage is allowing the model’s predictive quality to be kept in a range determined by a user’s threshold. Its robustness is built on a hyper-parametrization process, in which all data are taken into consideration by following a <i>k</i>-fold procedure. Another significant capability is being fast and scalable by using fully optimized parallel operations provided by Apache Spark. An extensive experimental study is performed over 25 big datasets with different characteristics. 
In most cases, the obtained reduction percentages are above 95%, thus outperforming state-of-the-art solutions such as FCNN_MR that barely reach 70%. The most promising outcome is maintaining the representativeness of the original data information, with quality prediction values around 1% of the baseline.https://www.mdpi.com/2079-9292/10/15/1757big datadata reductionclassificationpreprocessing techniquesApache Spark
collection DOAJ
language English
format Article
sources DOAJ
author María José Basgall
Marcelo Naiouf
Alberto Fernández
spellingShingle María José Basgall
Marcelo Naiouf
Alberto Fernández
FDR<sup>2</sup>-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification Problems
Electronics
big data
data reduction
classification
preprocessing techniques
Apache Spark
author_facet María José Basgall
Marcelo Naiouf
Alberto Fernández
author_sort María José Basgall
title FDR<sup>2</sup>-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification Problems
title_short FDR<sup>2</sup>-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification Problems
title_full FDR<sup>2</sup>-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification Problems
title_fullStr FDR<sup>2</sup>-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification Problems
title_full_unstemmed FDR<sup>2</sup>-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification Problems
title_sort fdr<sup>2</sup>-bd: a fast data reduction recommendation tool for tabular big data classification problems
publisher MDPI AG
series Electronics
issn 2079-9292
publishDate 2021-07-01
description In this paper, a methodological data condensation approach for reducing tabular big datasets in classification problems is presented, named FDR<inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msup><mrow></mrow><mn>2</mn></msup></semantics></math></inline-formula>-BD. The key to our proposal is to analyze data in a dual way (vertical and horizontal), combining feature selection, to generate dense clusters of data, with uniform sampling reduction, to keep only a few representative samples from each problem area. Its main advantage is that the model’s predictive quality is kept within a range determined by a user-defined threshold. Its robustness is built on a hyper-parametrization process, in which all data are taken into consideration by following a <i>k</i>-fold procedure. Another significant capability is its speed and scalability, achieved through fully optimized parallel operations provided by Apache Spark. An extensive experimental study is performed over 25 big datasets with different characteristics. In most cases, the obtained reduction percentages are above 95%, thus outperforming state-of-the-art solutions such as FCNN_MR, which barely reach 70%. The most promising outcome is that the representativeness of the original data is maintained, with prediction quality values within about 1% of the baseline.
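The dual (vertical plus horizontal) reduction described above can be sketched on a single machine. The following toy Python example is only an illustration under simplified assumptions — a hand-picked feature subset and a fixed per-class sampling rate — and is not the authors' Spark-based implementation, which selects these settings via a <i>k</i>-fold hyper-parametrization process and a user-defined quality threshold.

```python
import random
from collections import defaultdict

def dual_reduce(X, y, keep_features=None, sample_rate=0.05, seed=0):
    """Toy sketch of a dual data reduction:
    - vertical: project rows onto a chosen subset of feature columns
    - horizontal: uniform per-class sampling at `sample_rate`
    Single-machine illustration only; hyper-parameters are assumed given.
    """
    rng = random.Random(seed)
    if keep_features is None:
        keep_features = list(range(len(X[0])))
    # Vertical reduction: keep only the selected feature columns.
    Xv = [[row[j] for j in keep_features] for row in X]
    # Group row indices per class so sampling stays uniform within each class.
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    # Horizontal reduction: draw a uniform sample from each class.
    kept = []
    for idxs in by_class.values():
        k = max(1, int(len(idxs) * sample_rate))
        kept.extend(rng.sample(idxs, k))
    kept.sort()
    return [Xv[i] for i in kept], [y[i] for i in kept]

# Example: 1,000 rows, 4 features, 2 balanced classes.
X = [[i, i % 7, i % 3, 0.5] for i in range(1000)]
y = [i % 2 for i in range(1000)]
Xr, yr = dual_reduce(X, y, keep_features=[0, 1], sample_rate=0.05)
print(len(Xr), len(Xr[0]))  # 50 2 -> 95% fewer rows, half the features
```

In a real deployment these steps would run as distributed DataFrame operations (e.g., column projection and stratified sampling in Apache Spark), which is what makes the approach fast and scalable on big datasets.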
topic big data
data reduction
classification
preprocessing techniques
Apache Spark
url https://www.mdpi.com/2079-9292/10/15/1757
work_keys_str_mv AT mariajosebasgall fdrsup2supbdafastdatareductionrecommendationtoolfortabularbigdataclassificationproblems
AT marcelonaiouf fdrsup2supbdafastdatareductionrecommendationtoolfortabularbigdataclassificationproblems
AT albertofernandez fdrsup2supbdafastdatareductionrecommendationtoolfortabularbigdataclassificationproblems
_version_ 1721218760722874368