A Robust Distributed Big Data Clustering-based on Adaptive Density Partitioning using Apache Spark

Unsupervised machine learning and knowledge discovery from large-scale datasets have recently attracted considerable research interest. The present paper proposes a distributed big data clustering approach based on adaptive density estimation. The proposed method is built on the Apache Spark frame...

Bibliographic Details
Main Authors:	Behrooz Hosseini, Kourosh Kiani
Format:	Article
Language:	English
Published:	MDPI AG 2018-08-01
Series:	Symmetry
Subjects:	distributed data clustering; big data; density-based clustering; density peak detection; gene expression; apache spark; Bayesian locality sensitive hashing; ordered weighted averaging; micro array; scalable clustering
Online Access:	http://www.mdpi.com/2073-8994/10/8/342

id	doaj-953827cb9bd740abbae7b162454023f7
record_format	Article
spelling	doaj-953827cb9bd740abbae7b162454023f7; 2020-11-25T00:20:32Z; eng; MDPI AG; Symmetry; 2073-8994; 2018-08-01; volume 10, issue 8, article 342; 10.3390/sym10080342; sym10080342; A Robust Distributed Big Data Clustering-based on Adaptive Density Partitioning using Apache Spark; Behrooz Hosseini (Electrical and Computer Engineering Department, Semnan University, Semnan 35131-1911, Iran); Kourosh Kiani (Faculty of Electrical and Computer Engineering Department, Semnan University, Semnan 35131-1911, Iran); http://www.mdpi.com/2073-8994/10/8/342
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Behrooz Hosseini, Kourosh Kiani
author_sort	Behrooz Hosseini
title	A Robust Distributed Big Data Clustering-based on Adaptive Density Partitioning using Apache Spark
title_sort	robust distributed big data clustering-based on adaptive density partitioning using apache spark
publisher	MDPI AG
series	Symmetry
issn	2073-8994
publishDate	2018-08-01
description	Unsupervised machine learning and knowledge discovery from large-scale datasets have recently attracted considerable research interest. The present paper proposes a distributed big data clustering approach based on adaptive density estimation. The proposed method is built on the Apache Spark framework and tested on several widely used datasets. In the first step of the algorithm, the input data is divided into partitions using a Bayesian type of Locality Sensitive Hashing (LSH). Partitioning makes the processing fully parallel and much simpler by avoiding unneeded calculations. Each step of the proposed algorithm is completely independent of the others, and no serial bottleneck exists anywhere in the clustering procedure. Locality preservation also filters out outliers and enhances the robustness of the proposed approach. Density is defined on the basis of the Ordered Weighted Averaging (OWA) distance, which makes clusters more homogeneous. The local density peaks are detected adaptively according to the density of each node. Merging the local peaks yields the final cluster centers, and every other data point is assigned to the cluster with the nearest center. The proposed method has been implemented and compared with similar recently published work. Cluster validity indexes obtained with the proposed method show its superiority in precision and noise robustness over recent work. Comparison with similar approaches also shows the superiority of the proposed method in scalability, performance, and computation cost. The proposed method is a general clustering approach, and it has been applied to gene expression clustering as a sample application.
topic	distributed data clustering; big data; density-based clustering; density peak detection; gene expression; apache spark; Bayesian locality sensitive hashing; ordered weighted averaging; micro array; scalable clustering
url	http://www.mdpi.com/2073-8994/10/8/342
_version_	1725366830346797056
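
The steps named in the description (hash-based partitioning, OWA-based density, local peak detection, peak merging, nearest-center assignment) can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: plain sign-of-projection hashing stands in for the Bayesian LSH, a fixed `radius` stands in for the adaptive peak detection, outlier filtering is omitted, and all function names, the OWA weights, and the two radii are assumptions made for illustration. On Spark, each bucket could be processed in parallel; here a plain loop is used.

```python
from collections import defaultdict


def owa_distance(x, y, weights):
    """OWA distance: weights are applied to the per-dimension gaps sorted largest first."""
    gaps = sorted((abs(a - b) for a, b in zip(x, y)), reverse=True)
    return sum(w * g for w, g in zip(weights, gaps))


def lsh_key(point, hyperplanes):
    """Sign pattern of the point against each hyperplane (plain LSH, not Bayesian)."""
    return tuple(
        sum(h_i * p_i for h_i, p_i in zip(h, point)) >= 0 for h in hyperplanes
    )


def partition(points, hyperplanes):
    """Group points that share the same hash key into one bucket."""
    buckets = defaultdict(list)
    for p in points:
        buckets[lsh_key(p, hyperplanes)].append(p)
    return list(buckets.values())


def local_peak(bucket, weights, radius):
    """Return the bucket point with the most neighbours within `radius` (assumed fixed)."""
    def density(p):
        return sum(1 for q in bucket if owa_distance(p, q, weights) <= radius)
    return max(bucket, key=density)


def merge_peaks(peaks, weights, merge_radius):
    """Keep a peak only if it is farther than `merge_radius` from every kept center."""
    centers = []
    for p in peaks:
        if all(owa_distance(p, c, weights) > merge_radius for c in centers):
            centers.append(p)
    return centers


def assign(points, centers, weights):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: owa_distance(p, centers[i], weights))
        for p in points
    ]


def cluster(points, hyperplanes, weights, radius, merge_radius):
    """Run the whole sketch: partition, find local peaks, merge them, assign points."""
    buckets = partition(points, hyperplanes)
    peaks = [local_peak(b, weights, radius) for b in buckets]
    centers = merge_peaks(peaks, weights, merge_radius)
    return centers, assign(points, centers, weights)
```

Calling `cluster(points, hyperplanes, weights, radius, merge_radius)` returns the merged centers and one label per input point. Because buckets are processed independently until the merge step, the per-bucket work is the part that would map naturally onto distributed partitions.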

Similar Items