Optimization Algorithms for Scalable Stream Batch Clustering with k Estimation

The increasing volume and velocity of the continuously generated data (data stream) challenge machine learning algorithms, which must evolve to fit real-world problems. The data stream clustering algorithms face issues such as the rapidly increasing volume of the data, the variety of the number of c...

Bibliographic Details
Main Authors: Cândido, P.G.L. (Author), Faria, E.R. (Author), Naldi, M.C. (Author), Silva, J.A. (Author)
Format: Article
Language: English
Published: MDPI 2022
Subjects: clustering; data stream; machine learning; massive parallel computation
Online Access: View Fulltext in Publisher (https://doi.org/10.3390/app12136464)
LEADER 01935nam a2200217Ia 4500
001 10.3390-app12136464
008 220718s2022 CNT 000 0 und d
020 |a 20763417 (ISSN) 
245 1 0 |a Optimization Algorithms for Scalable Stream Batch Clustering with k Estimation 
260 0 |b MDPI  |c 2022 
856 |z View Fulltext in Publisher  |u https://doi.org/10.3390/app12136464 
520 3 |a The increasing volume and velocity of the continuously generated data (data stream) challenge machine learning algorithms, which must evolve to fit real-world problems. The data stream clustering algorithms face issues such as the rapidly increasing volume of the data, the variety of the number of clusters, and their shapes. The present work aims to improve the accuracy of sequential clustering batches of data streams for scenarios in which clusters evolve dynamically and continuously, automatically estimating their number. In order to achieve this goal, three evolutionary algorithms are presented, along with three novel algorithms designed to deal with clusters of normal distribution based on goodness-of-fit tests in the context of scalable batch stream clustering with automatic estimation of the number of clusters. All of them are developed on top of MapReduce, Discretized-Stream models, and the most recent MPC frameworks to provide scalability, reliability, resilience, and flexibility. The proposed algorithms are experimentally compared with state-of-the-art methods and present the best results for accuracy for normally distributed data sets, reaching their goal. © 2022 by the authors. Licensee MDPI, Basel, Switzerland. 
650 0 4 |a clustering 
650 0 4 |a data stream 
650 0 4 |a machine learning 
650 0 4 |a massive parallel computation 
700 1 |a Cândido, P.G.L.  |e author 
700 1 |a Faria, E.R.  |e author 
700 1 |a Naldi, M.C.  |e author 
700 1 |a Silva, J.A.  |e author 
773 |t Applied Sciences (Switzerland)