EvoPreprocess—Data Preprocessing Framework with Nature-Inspired Optimization Algorithms

Bibliographic Details
Main Author: Sašo Karakatič
Format: Article
Language: English
Published: MDPI AG 2020-06-01
Series: Mathematics
Online Access: https://www.mdpi.com/2227-7390/8/6/900
Description
Summary: The quality of machine learning models can suffer when inappropriate data is used, a problem that is especially prevalent in high-dimensional and imbalanced data sets. Data preparation and preprocessing can mitigate some of these problems and thus result in better models. The use of meta-heuristic and nature-inspired methods for data preprocessing has become common, but these approaches are still not readily available to practitioners through a simple and extendable application programming interface (API). This paper presents EvoPreprocess, an open-source Python framework that preprocesses data using evolutionary and nature-inspired optimization algorithms. The main problems addressed by the framework are data sampling (simultaneous over- and under-sampling of data instances), feature selection, and data weighting for supervised machine learning problems. The EvoPreprocess framework provides a simple, object-oriented, and parallelized API for the preprocessing tasks and can be used with the scikit-learn and imbalanced-learn Python machine learning libraries. The framework uses well-known self-adaptive nature-inspired meta-heuristic algorithms and can easily be extended with custom optimization and evaluation strategies. The paper presents the architecture of the framework, its use, experimental results, and a comparison with other common preprocessing approaches.
ISSN: 2227-7390
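
Illustration: the summary describes feature selection driven by an evolutionary optimizer. The following is a minimal, standard-library-only sketch of that general idea; it does not use or reproduce the EvoPreprocess API. Everything in it is an assumption made for illustration: the synthetic data, the nearest-centroid classifier used as the evaluator, the per-feature penalty, and the hypothetical names `make_data`, `fitness`, and `evolve_mask`.

```python
import random


def make_data(n=200, n_features=8, seed=0):
    """Synthetic binary task (assumed): only features 0 and 1 carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        row = [rng.gauss(0, 1) for _ in range(n_features)]
        row[0] += 2.0 * label
        row[1] -= 2.0 * label
        X.append(row)
        y.append(label)
    return X, y


def fitness(mask, X_train, y_train, X_val, y_val, penalty=0.01):
    """Validation accuracy of a nearest-centroid classifier restricted to the
    selected features, minus a small penalty per selected feature."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty feature set cannot classify anything
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for x, t in zip(X_val, y_val):
        dist = {
            label: sum((x[j] - c) ** 2 for j, c in zip(cols, centroids[label]))
            for label in (0, 1)
        }
        correct += min(dist, key=dist.get) == t
    return correct / len(y_val) - penalty * len(cols)


def evolve_mask(X_train, y_train, X_val, y_val,
                pop_size=20, generations=30, seed=1):
    """Evolve a binary feature mask with bit-flip mutation and elitist
    (mu + lambda) survivor selection."""
    rng = random.Random(seed)
    n = len(X_train[0])

    def score(mask):
        return fitness(mask, X_train, y_train, X_val, y_val)

    # Seed the population with the all-features mask plus random masks.
    population = [[1] * n] + [
        [rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        # Each parent yields one child; each bit flips with probability 1/n.
        children = [
            [bit ^ (rng.random() < 1.0 / n) for bit in parent]
            for parent in population
        ]
        # Parents and children compete; the best pop_size masks survive.
        population = sorted(population + children, key=score, reverse=True)[:pop_size]
    best = population[0]
    return best, score(best)


X, y = make_data()
X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]
best_mask, best_score = evolve_mask(X_train, y_train, X_val, y_val)
print("selected features:", [j for j, keep in enumerate(best_mask) if keep])
print("fitness:", round(best_score, 3))
```

Because parents and children compete for survival, the best fitness never decreases between generations, so the returned mask scores at least as well as the all-features baseline included in the initial population. The sketch scores masks on a single hold-out split for brevity; a more careful setup would evaluate on data kept separate from the final test set.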