Efficient Variations of the Quality Threshold Clustering Algorithm

Clustering gene expression data such that the diameters of the clusters formed are no greater than a specified threshold prompted the development of the Quality Threshold Clustering (QTC) algorithm. It iteratively forms clusters of non-increasing size until all points are clustered; the largest clus...

Bibliographic Details
Main Author:	Loforte, Frank, Jr.
Format:	Others
Published:	NSUWorks 2015
Subjects:	Computer engineering; Computer science; Clustering; Diameter; QTC; Quality; Radius; Threshold; Computer Sciences
Online Access:	http://nsuworks.nova.edu/gscis_etd/43 http://nsuworks.nova.edu/cgi/viewcontent.cgi?article=1042&context=gscis_etd

id	ndltd-nova.edu-oai-nsuworks.nova.edu-gscis_etd-1042
record_format	oai_dc
spelling	ndltd-nova.edu-oai-nsuworks.nova.edu-gscis_etd-1042 2016-04-25T19:34:37Z Efficient Variations of the Quality Threshold Clustering Algorithm Loforte, Frank, Jr. 2015-05-01T07:00:00Z text application/pdf http://nsuworks.nova.edu/gscis_etd/43 http://nsuworks.nova.edu/cgi/viewcontent.cgi?article=1042&context=gscis_etd CEC Theses and Dissertations NSUWorks
collection	NDLTD
format	Others
sources	NDLTD
description	Clustering gene expression data such that the diameters of the clusters formed are no greater than a specified threshold prompted the development of the Quality Threshold Clustering (QTC) algorithm. It iteratively forms clusters of non-increasing size until all points are clustered; the largest cluster is always selected first. The QTC algorithm applies in many other domains that require a similar quality guarantee based on cluster diameter. The worst-case complexity of the original QTC algorithm is O(n^5). Since practical applications often involve large datasets, researchers called for more efficient versions of the QTC algorithm. This dissertation aimed to develop and evaluate efficient variations of the QTC algorithm that guarantee a maximum cluster diameter while producing partitions that are similar to those produced by the original QTC algorithm. The QTC algorithm is expensive because it considers forming clusters around every item in the dataset. This dissertation addressed this issue by developing methods for selecting a small subset of promising items around which to form clusters. A second factor that adversely affects the efficiency of the QTC algorithm is the computational cost of updating cluster diameters as new items are added to clusters. This dissertation proposed and evaluated alternative methods that meet the cluster diameter constraint without repeatedly updating the cluster diameters. The variations of the QTC algorithm developed in this dissertation were evaluated on benchmark datasets using two measures: execution time and quality of the solutions produced. Execution times were compared to the execution time of the most efficient published implementation of the QTC algorithm. Since the partitions produced by the proposed variations are not guaranteed to be identical to those produced by the original algorithm, the Jaccard measure of partition similarity was used to measure the quality of the solutions.
The findings of this research were threefold. First, the Stochastic QTC alone was not computationally helpful: to produce partitions that were acceptably similar to those found by the deterministic QTCs, the algorithm had to be seeded with a large number of centers (ntry ≈ n). Second, the preprocessed data methods are desirable since they reduce the complexity of the search for candidate cluster points. Third, radius-based methods are promising since they produce partitions that are acceptably similar to those found by the deterministic QTCs in significantly less time.
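
For readers unfamiliar with the procedure the abstract describes, the following is a minimal Python sketch of the basic QTC idea (grow a candidate cluster around every remaining item, keep the largest, repeat) together with a pair-counting Jaccard measure for comparing two partitions. It is an illustration under stated assumptions, not the dissertation's implementation or any of its proposed variations: the function names `qt_cluster` and `jaccard_partition_similarity`, the use of Euclidean distance, and the tie-breaking by lowest index are all assumptions made here.

```python
from itertools import combinations
from math import dist  # Euclidean distance (an assumption; Python 3.8+)


def qt_cluster(points, max_diameter):
    """Partition points into clusters whose diameter is <= max_diameter.

    Returns a list of clusters, each a sorted list of point indices,
    in the order they were selected (largest candidate first).
    """
    remaining = set(range(len(points)))
    clusters = []
    while remaining:
        best = []
        # Consider forming a candidate cluster around every remaining item.
        for seed in sorted(remaining):
            candidate = [seed]
            diameter = 0.0
            # far[j]: distance from j to the farthest current cluster member.
            far = {j: dist(points[seed], points[j])
                   for j in remaining if j != seed}
            while far:
                # Pick the item that yields the smallest resulting diameter
                # (ties broken by lowest index, an arbitrary choice).
                j = min(far, key=lambda k: (max(diameter, far[k]), k))
                new_diameter = max(diameter, far[j])
                if new_diameter > max_diameter:
                    break  # no item can be added within the threshold
                candidate.append(j)
                diameter = new_diameter
                del far[j]
                for k in far:
                    far[k] = max(far[k], dist(points[j], points[k]))
            if len(candidate) > len(best):
                best = candidate
        clusters.append(sorted(best))
        remaining -= set(best)
    return clusters


def jaccard_partition_similarity(part_a, part_b):
    """Jaccard index over co-clustered pairs: |A & B| / |A | B|."""
    def co_clustered_pairs(partition):
        return {pair
                for cluster in partition
                for pair in combinations(sorted(cluster), 2)}

    pairs_a = co_clustered_pairs(part_a)
    pairs_b = co_clustered_pairs(part_b)
    union = pairs_a | pairs_b
    if not union:
        return 1.0  # both partitions consist only of singletons
    return len(pairs_a & pairs_b) / len(union)


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
    found = qt_cluster(pts, max_diameter=2.0)
    print(found)
    print(jaccard_partition_similarity(found, [[0, 1], [2], [3, 4], [5]]))
```

Because every remaining item is tried as a seed in every round, the cost of this sketch grows quickly with the number of items, which is the expense the abstract attributes to the original algorithm.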

Similar Items