Efficient Variations of the Quality Threshold Clustering Algorithm

Clustering gene expression data such that the diameters of the clusters formed are no greater than a specified threshold prompted the development of the Quality Threshold Clustering (QTC) algorithm. It iteratively forms clusters of non-increasing size until all points are clustered; the largest clus...


Bibliographic Details
Main Author: Loforte, Frank, Jr.
Format: Others
Published: NSUWorks 2015
Subjects:
QTC
Online Access: http://nsuworks.nova.edu/gscis_etd/43
http://nsuworks.nova.edu/cgi/viewcontent.cgi?article=1042&context=gscis_etd
id ndltd-nova.edu-oai-nsuworks.nova.edu-gscis_etd-1042
record_format oai_dc
spelling ndltd-nova.edu-oai-nsuworks.nova.edu-gscis_etd-1042 2016-04-25T19:34:37Z Efficient Variations of the Quality Threshold Clustering Algorithm Loforte, Frank, Jr. 2015-05-01T07:00:00Z text application/pdf http://nsuworks.nova.edu/gscis_etd/43 http://nsuworks.nova.edu/cgi/viewcontent.cgi?article=1042&context=gscis_etd CEC Theses and Dissertations NSUWorks Computer engineering Computer science Clustering Diameter QTC Quality Radius Threshold Computer Sciences
collection NDLTD
format Others
sources NDLTD
topic Computer engineering
Computer science
Clustering
Diameter
QTC
Quality
Radius
Threshold
Computer Sciences
description The Quality Threshold Clustering (QTC) algorithm was developed to cluster gene expression data so that the diameter of every cluster formed is no greater than a specified threshold. It iteratively forms clusters of non-increasing size until all points are clustered; the largest cluster is always selected first. The QTC algorithm applies in many other domains that require a similar quality guarantee based on cluster diameter. The worst-case complexity of the original QTC algorithm is O(n^5). Since practical applications often involve large datasets, researchers called for more efficient versions of the QTC algorithm. This dissertation aimed to develop and evaluate efficient variations of the QTC algorithm that guarantee a maximum cluster diameter while producing partitions that are similar to those produced by the original QTC algorithm. The QTC algorithm is expensive because it considers forming clusters around every item in the dataset. This dissertation addressed this issue by developing methods for selecting a small subset of promising items around which to form clusters. A second factor that adversely affects the efficiency of the QTC algorithm is the computational cost of updating cluster diameters as new items are added to clusters. This dissertation proposed and evaluated alternative methods that meet the cluster diameter constraint without repeatedly updating the cluster diameters. The variations of the QTC algorithm developed in this dissertation were evaluated on benchmark datasets using two measures: execution time and quality of the solutions produced. Execution times were compared to the time taken to execute the most efficient published implementation of the QTC algorithm. Since the partitions produced by the proposed variations are not guaranteed to be identical to those produced by the original algorithm, the Jaccard measure of partition similarity was used to measure the quality of the solutions.
The findings of this research were threefold. First, the Stochastic QTC alone was not computationally helpful: to produce partitions that were acceptably similar to those found by the deterministic QTCs, the algorithm had to be seeded with a large number of centers (ntry ≈ n). Second, the preprocessed data methods are desirable since they reduce the complexity of the search for candidate cluster points. Third, radius-based methods are promising since they produce partitions that are acceptably similar to those found by the deterministic QTCs in significantly less time.
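For readers unfamiliar with the baseline, the following is a minimal Python sketch of a naive QTC loop as the description outlines it (a candidate cluster is grown around every remaining item, the largest candidate is kept, and the process repeats), together with a pair-counting form of the Jaccard similarity between two partitions. It is an illustration only, not the dissertation's code and not any of its proposed variations. The function names `qt_cluster` and `jaccard_partition_similarity`, the use of Euclidean distance, the tie-breaking by lowest index, and the pair-counting definition of the Jaccard measure are all assumptions made for this sketch.

```python
import math
from itertools import combinations


def qt_cluster(points, max_diameter):
    """Naive QTC sketch (assumed form): returns clusters as sorted lists of point indices."""

    def dist(a, b):
        # Assumption: Euclidean distance between points given as coordinate tuples.
        return math.dist(points[a], points[b])

    remaining = set(range(len(points)))
    clusters = []
    while remaining:
        best = []
        # Grow one candidate cluster around every remaining item.
        for seed in sorted(remaining):
            cluster = [seed]
            candidates = remaining - {seed}
            diameter = 0.0
            while candidates:
                # Pick the candidate whose addition yields the smallest diameter;
                # ties are broken by the lower index.
                new_diameter, pick = min(
                    (max(diameter, max(dist(c, m) for m in cluster)), c)
                    for c in candidates
                )
                if new_diameter > max_diameter:
                    break
                cluster.append(pick)
                candidates.remove(pick)
                diameter = new_diameter
            # Keep the largest candidate; the first one found wins ties.
            if len(cluster) > len(best):
                best = cluster
        clusters.append(sorted(best))
        remaining -= set(best)
    return clusters


def jaccard_partition_similarity(part_a, part_b):
    """Pair-counting Jaccard: shared co-clustered pairs over pairs co-clustered in either."""

    def co_clustered_pairs(partition):
        return {
            pair
            for cluster in partition
            for pair in combinations(sorted(cluster), 2)
        }

    pairs_a = co_clustered_pairs(part_a)
    pairs_b = co_clustered_pairs(part_b)
    union = pairs_a | pairs_b
    # Two partitions made only of singletons share no pairs; treat them as identical.
    return len(pairs_a & pairs_b) / len(union) if union else 1.0


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
    found = qt_cluster(pts, max_diameter=2.0)
    print(found)
    print(jaccard_partition_similarity(found, [[0, 1], [2], [3, 4], [5]]))
```

The sketch recomputes every candidate's distances to all current cluster members on each step and restarts from every remaining item in each round, which illustrates the two cost factors the description names: forming clusters around every item and repeatedly updating cluster diameters.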
author Loforte, Frank, Jr.
title Efficient Variations of the Quality Threshold Clustering Algorithm
publisher NSUWorks
publishDate 2015
url http://nsuworks.nova.edu/gscis_etd/43
http://nsuworks.nova.edu/cgi/viewcontent.cgi?article=1042&context=gscis_etd