Methods of Determining the Number of Clusters in a Data Set and a New Clustering Criterion

In cluster analysis, a fundamental problem is to determine the best estimate of the number of clusters, which has a deterministic effect on the clustering results. However, a limitation in current applications is that no convincingly acceptable solution to the best-number-of-clusters problem is avai...

Bibliographic Details
Main Author: Yan, Mingjin
Other Authors: Statistics
Format: Others
Published: Virginia Tech 2014
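Subjects: Gap statistic; Multi-layer clustering; DD-weighted gap statistic; Cluster analysis; Weighted gap statistic; Number of clusters; K-means clustering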
Online Access: http://hdl.handle.net/10919/29957
http://scholar.lib.vt.edu/theses/available/etd-12062005-153906/
id ndltd-VTETD-oai-vtechworks.lib.vt.edu-10919-29957
record_format oai_dc
spelling ndltd-VTETD-oai-vtechworks.lib.vt.edu-10919-29957
2020-09-26T05:33:12Z
Methods of Determining the Number of Clusters in a Data Set and a New Clustering Criterion
Yan, Mingjin
Statistics
Ye, Keying
Prins, Samantha C. Bates
Spitzner, Dan J.
Smith, Eric P.
Gap statistic
Multi-layer clustering
DD-weighted gap statistic
Cluster analysis
Weighted gap statistic
Number of clusters
K-means clustering
In cluster analysis, a fundamental problem is to determine the best estimate of the number of clusters, which has a deterministic effect on the clustering results. However, a limitation in current applications is that no convincingly acceptable solution to the best-number-of-clusters problem is available due to high complexity of real data sets. In this dissertation, we tackle this problem of estimating the number of clusters, which is particularly oriented at processing very complicated data which may contain multiple types of cluster structure. Two new methods of choosing the number of clusters are proposed which have been shown empirically to be highly effective given clear and distinct cluster structure in a data set. In addition, we propose a sequential type of clustering approach, called multi-layer clustering, by combining these two methods. Multi-layer clustering not only functions as an efficient method of estimating the number of clusters, but also, by superimposing a sequential idea, improves the flexibility and effectiveness of any arbitrary existing one-layer clustering method. Empirical studies have shown that multi-layer clustering has higher efficiency than one-layer clustering approaches, especially in detecting clusters in complicated data sets. The multi-layer clustering approach has been successfully implemented in clustering the WTCHP microarray data and the results can be interpreted very well based on known biological knowledge. Choosing an appropriate clustering method is another critical step in clustering. K-means clustering is one of the most popular clustering techniques used in practice. However, the k-means method tends to generate clusters containing a nearly equal number of objects, which is referred to as the "equal-size" problem. We propose a clustering method which competes with the k-means method. Our newly defined method is aimed at overcoming the so-called "equal-size" problem associated with the k-means method, while maintaining its advantage of computational simplicity. Advantages of the proposed method over k-means clustering have been demonstrated empirically using simulated data with low dimensionality.
Ph. D.
2014-03-14T20:19:52Z
2014-03-14T20:19:52Z
2005-11-28
2005-12-06
2006-12-29
2005-12-29
Dissertation
etd-12062005-153906
http://hdl.handle.net/10919/29957
http://scholar.lib.vt.edu/theses/available/etd-12062005-153906/
Proposal-Face.pdf
In Copyright
http://rightsstatements.org/vocab/InC/1.0/
application/pdf
Virginia Tech
collection NDLTD
format Others
sources NDLTD
topic Gap statistic
Multi-layer clustering
DD-weighted gap statistic
Cluster analysis
Weighted gap statistic
Number of clusters
K-means clustering
author2 Statistics
author Yan, Mingjin
title Methods of Determining the Number of Clusters in a Data Set and a New Clustering Criterion
publisher Virginia Tech
publishDate 2014
url http://hdl.handle.net/10919/29957
http://scholar.lib.vt.edu/theses/available/etd-12062005-153906/