An Efficient Distributed Hierarchical-Clustering Algorithm for Large Scale Data Set
Master's === National Central University === Graduate Institute of Computer Science and Information Engineering === 98 === Clustering data into groups is a common and important technique in many research areas. Clustering algorithms usually target small datasets that can be analyzed on a single machine. However, as new hardware and techniques are developed for collectin...
Main Authors: | An-Cing Huang, 黃安慶 |
---|---|
Other Authors: | Wei-Jen Wang |
Format: | Others |
Language: | zh-TW |
Published: | 2010 |
Online Access: | http://ndltd.ncl.edu.tw/handle/28399862048953094239 |
id |
ndltd-TW-098NCU05392127 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-098NCU05392127 2016-04-20T04:18:02Z http://ndltd.ncl.edu.tw/handle/28399862048953094239 An Efficient Distributed Hierarchical-Clustering Algorithm for Large Scale Data Set 適用於大資料集高效率的分散式階層分群演算法 An-Cing Huang 黃安慶 Master's National Central University Graduate Institute of Computer Science and Information Engineering 98 Wei-Jen Wang 王尉任 2010 thesis 52 zh-TW |
collection |
NDLTD |
language |
zh-TW |
format |
Others |
sources |
NDLTD |
description |
Master's === National Central University === Graduate Institute of Computer Science and Information Engineering === 98 === Clustering data into groups is a common and important technique in many research areas. Clustering algorithms usually target small datasets that can be analyzed on a single machine. However, as new hardware and techniques for collecting data are developed, datasets in many domains, such as astronomy, high-energy physics, and aircraft engine diagnostics, can grow extremely large. The time complexity of hierarchical clustering algorithms is polynomial, between O(N²) and O(N³), so their computation cost grows very quickly as the input size increases. Hierarchical clustering algorithms therefore cannot be applied directly in this setting, because they cannot guarantee that users will get results back within a bounded amount of time.
This research focuses on how to run the hierarchical clustering algorithm in parallel. The traditional hierarchical clustering algorithm is an unsupervised learning algorithm that does not require the data to be labeled in advance or the number of clusters to be specified. These characteristics make it adaptable and able to process many kinds of data. The goal of our research is to use a parallel computing architecture to speed up execution and minimize the storage space required by traditional hierarchical clustering algorithms, and to refine the hierarchical clustering process. We propose a Parallelized Hierarchical Clustering Algorithm, a modified hierarchical agglomerative algorithm that can be adapted to a distributed environment. The algorithm performs grouping in parallel and reduces both the computation load and the amount of data transmitted when processing large datasets.
|
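The complexity figures quoted in the abstract can be made concrete with a small example. The following is a minimal sketch of a naive, single-machine agglomerative clustering loop, not the parallelized algorithm the thesis proposes. The function names `single_linkage` and `naive_agglomerative`, the choice of single linkage, and the sample points are all assumptions made for illustration. Each round scans every pair of clusters, and there are N - 1 rounds, which is where the roughly cubic cost comes from.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b, points):
    """Smallest pairwise distance between two clusters of point indices."""
    return min(dist(points[i], points[j]) for i in a for j in b)


def naive_agglomerative(points):
    """Merge the two closest clusters until only one remains.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    Illustrative only: linkage choice and data layout are assumptions.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Scan every pair of clusters in each round; with N - 1 rounds
        # this gives roughly cubic work in the number of points.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(pair[0], pair[1], points),
        )
        merges.append((a, b, single_linkage(a, b, points)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical sample data: two well-separated pairs of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
history = naive_agglomerative(points)
for cluster_a, cluster_b, distance in history:
    print(cluster_a, cluster_b, round(distance, 3))
```

The pairwise scan inside the loop is the part that grows fastest with input size, so it is the natural candidate for distribution across machines; how the thesis actually partitions that work is not stated in this record.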
author2 |
Wei-Jen Wang |
author_facet |
Wei-Jen Wang An-Cing Huang 黃安慶 |
author |
An-Cing Huang 黃安慶 |
spellingShingle |
An-Cing Huang 黃安慶 An Efficient Distributed Hierarchical-Clustering Algorithm for Large Scale Data Set |
author_sort |
An-Cing Huang |
title |
An Efficient Distributed Hierarchical-Clustering Algorithm for Large Scale Data Set |
title_short |
An Efficient Distributed Hierarchical-Clustering Algorithm for Large Scale Data Set |
title_full |
An Efficient Distributed Hierarchical-Clustering Algorithm for Large Scale Data Set |
title_fullStr |
An Efficient Distributed Hierarchical-Clustering Algorithm for Large Scale Data Set |
title_full_unstemmed |
An Efficient Distributed Hierarchical-Clustering Algorithm for Large Scale Data Set |
title_sort |
efficient distributed hierarchical-clustering algorithm for large scale data set |
publishDate |
2010 |
url |
http://ndltd.ncl.edu.tw/handle/28399862048953094239 |
work_keys_str_mv |
AT ancinghuang anefficientdistributedhierarchicalclusteringalgorithmforlargescaledataset AT huángānqìng anefficientdistributedhierarchicalclusteringalgorithmforlargescaledataset AT ancinghuang shìyòngyúdàzīliàojígāoxiàolǜdefēnsànshìjiēcéngfēnqúnyǎnsuànfǎ AT huángānqìng shìyòngyúdàzīliàojígāoxiàolǜdefēnsànshìjiēcéngfēnqúnyǎnsuànfǎ AT ancinghuang efficientdistributedhierarchicalclusteringalgorithmforlargescaledataset AT huángānqìng efficientdistributedhierarchicalclusteringalgorithmforlargescaledataset |
_version_ |
1718228179352551424 |