Summary: | Finding meaningful clustering patterns in data can be very challenging when the clusters have arbitrary shapes, different sizes, or different densities, and especially when the data set contains a high percentage (e.g., 80%) of noise. Unfortunately, most existing clustering techniques cannot properly handle this situation, and their performance often deteriorates dramatically. In this paper, a purposefully designed clustering algorithm called Density-Based Multiscale Analysis for Clustering (DBMAC)-II is proposed, which is an improved version of the latest strong-noise clustering algorithm, DBMAC. DBMAC was proposed under the assumption that all clusters are homogeneous, and it therefore does not work well when the data set contains clusters of varying densities. DBMAC-II overcomes this limitation by executing the multiscale analysis iteratively, and it can conduct strong-noise-robust clustering without any strict assumption on the shapes and densities of clusters. In DBMAC-II, as in DBMAC, each data point or object is mapped into a feature space using its r-neighborhood statistics computed for different values of the radius r. In general, the higher the value of an object's r-neighborhood statistics, the more likely the object is to be considered a “clustered” object. Instead of trying to find a single optimal r value, a set of radius values appropriate for separating “clustered” objects from “noisy” objects is identified using a formal statistical multimodality test, a procedure referred to as multiscale analysis. For clusters with varying densities, multiscale analysis is applied iteratively to extract the clusters with the highest density from the current data set. Moreover, a statistical uniformity test for measuring clustering tendency is used as the self-adaptive stopping criterion of the iteration. 
Comprehensive experimental studies on a series of challenging benchmark data sets demonstrate that DBMAC-II not only is superior to classical density-based clustering approaches, including DBSCAN, OPTICS, and HDBSCAN, but also consistently outperforms the latest strong-noise-robust clustering techniques, such as Skinny-dip.
|
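
The feature mapping described in the summary can be illustrated with a minimal sketch. The summary does not define the r-neighborhood statistic, so this sketch assumes it is the number of other points within Euclidean distance r of each point; the function name `r_neighborhood_counts` and the toy data are hypothetical and are not taken from the paper. The multimodality test, the iterative extraction, and the uniformity-based stopping criterion are not shown.

```python
from math import dist


def r_neighborhood_counts(points, radii):
    """Map each point to a feature vector with one entry per radius.

    Assumption (not from the paper): the r-neighborhood statistic is the
    count of *other* points whose Euclidean distance to the point is <= r.
    """
    features = []
    for i, p in enumerate(points):
        row = []
        for r in radii:
            # Count neighbors within radius r, excluding the point itself.
            count = sum(
                1 for j, q in enumerate(points) if j != i and dist(p, q) <= r
            )
            row.append(count)
        features.append(row)
    return features


# Toy data: two nearby points and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
radii = [1.5, 10.0]
print(r_neighborhood_counts(points, radii))  # [[1, 2], [1, 2], [0, 2]]
```

At the small radius the isolated point has a lower count than the two nearby points, which matches the idea that higher statistics suggest a "clustered" object. At the large radius all three counts coincide, which hints at why some radius values separate clustered from noisy objects better than others.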