A Design of Experiments Comparative Study on Clustering Methods

Bibliographic Details
Main Authors: Natalia Maria Puggina Bianchesi, Estevao Luiz Romao, Marina Fernandes B. P. Lopes, Pedro Paulo Balestrassi, Anderson Paulo De Paiva
Format: Article
Language: English
Published: IEEE 2019-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/8901114/
Description
Summary: Cluster analysis is a multivariate data mining technique that is widely used in several areas. It aims to automatically group the n elements of a database into k clusters, using only the information in the variables of each case. However, the accuracy of the final clusters depends on the clustering method used. In this paper, we present an evaluation of the performance of the main methods for cluster analysis, such as Ward, K-means, and Self-Organizing Maps. Unlike many studies published in the area, we generated the datasets using the Design of Experiments (DOE) technique, in order to reach reliable conclusions about the methods by generalizing over the different possible data structures. We considered the number of variables, the number of clusters, dataset size, sample size, cluster overlapping, and the presence of outliers as the DOE factors. The datasets were analyzed by each clustering method, and the clustering partitions were compared using Attribute Agreement Analysis, providing invaluable information about the effects of the considered factors individually and about their interactions. The results showed that the number of clusters, overlapping, and the interaction between sample size and number of variables significantly affect all the studied methods. Moreover, at a significance level of 5%, the methods have similar performances, and it is not possible to affirm that one outperforms the others.
ISSN: 2169-3536
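
Illustrative note: the summary names six DOE factors but this record does not give their levels or the exact design used. As a minimal sketch only, assuming a two-level full factorial design and purely hypothetical level values, the set of experimental runs could be enumerated as follows. The names `FACTORS` and `full_factorial` are introduced here for illustration and do not come from the paper.

```python
from itertools import product

# Hypothetical low/high levels for each factor named in the summary.
# The levels actually used in the paper are not stated in this record.
FACTORS = {
    "n_variables": (2, 10),
    "n_clusters": (2, 5),
    "dataset_size": (100, 1000),
    "sample_size": (20, 50),
    "overlap": ("low", "high"),
    "outliers": (False, True),
}


def full_factorial(factors):
    """Return one dict per run, covering every combination of factor levels."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


runs = full_factorial(FACTORS)
print(len(runs))  # 2**6 = 64 runs for six two-level factors
print(runs[0])    # the run with every factor at its first (low) level
```

Each run would then define one synthetic dataset to be clustered by every method under study; a fractional design would reduce the run count at the cost of confounding some interactions.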