Data Clustering and Partial Supervision with Some Parallel Developments

Data Clustering and Partial Supell'ision with SOllie Parallel Developments by Sameh A. Salem Clustering is an important and irreplaceable step towards the search for structures in the data. Many different clustering algorithms have been proposed. Yet, the sources of variability in most clusteri...

Full description

Bibliographic Details
Main Author: Salem, Sameh A.
Published: University of Liverpool 2007
Subjects:
Online Access:http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.491357
id ndltd-bl.uk-oai-ethos.bl.uk-491357
record_format oai_dc
spelling ndltd-bl.uk-oai-ethos.bl.uk-4913572017-12-24T15:24:20ZData Clustering and Partial Supervision with Some Parallel DevelopmentsSalem, Sameh A.2007Data Clustering and Partial Supell'ision with SOllie Parallel Developments by Sameh A. Salem Clustering is an important and irreplaceable step towards the search for structures in the data. Many different clustering algorithms have been proposed. Yet, the sources of variability in most clustering algorithms affect the reliability of their results. Moreover, the majority tend to be based on the knowledge of the number of clusters as one of the input parameters. Unfortunately, there are many scenarios, where this knowledge may not be available. In addition, clustering algorithms are very computationally intensive which leads to a major challenging problem in scaling up to large datasets. This thesis gives possible solutions for such problems. First, new measures - called clustering performance measures (CPMs) - for assessing the reliability of a clustering algorithm are introduced. These CPMs can be used to evaluate: I) clustering algorithms that have a structure bias to certain type of data distribution as well as those that have no such biases, 2) clustering algorithms that have initialisation dependency as well as the clustering algorithms that have a unique solution for a given set of parameter values with no initialisation dependency. Then, a novel clustering algorithm, which is a RAdius based Clustering ALgorithm (RACAL), is proposed. RACAL uses a distance based principle to map the distributions of the data assuming that clusters are determined by a distance parameter, without having to specify the number of clusters. Furthermore, RACAL is enhanced by a validity index to choose the best clustering result, i.e. result has compact clusters with wide cluster separations, for a given input parameter. Comparisons with other clustering algorithms indicate the applicability and reliability of the proposed clustering algorithm. Additionally, an adaptive partial supervision strategy is proposed for using in conjunction with RACAL_to make it act as a classifier. Results from RACAL with partial supervision, RACAL-PS, indicate its robustness in classification. Additionally, a parallel version of RACAL (P-RACAL) is proposed. The parallel evaluations of P-RACAL indicate that P-RACAL is scalable in terms of speedup and scaleup, which gives the ability to handle large datasets of high dimensions in a reasonable time. Next, a novel clustering algorithm, which achieves clustering without any control of cluster sizes, is introduced. This algorithm, which is called Nearest Neighbour Clustering, Algorithm (NNCA), uses the same concept as the K-Nearest Neighbour (KNN) classifier with the advantage that the algorithm needs no training set and it is completely unsupervised. Additionally, NNCA is augmented with a partial supervision strategy, NNCA-PS, to act as a classifier. Comparisons with other methods indicate the robustness of the proposed method in classification. Additionally, experiments on parallel environment indicate the suitability and scalability of the parallel NNCA, P-NNCA, in handling large datasets. Further investigations on more challenging data are carried out. In this context, microarray data is considered. In such data, the number of clusters is not clearly defined. This points directly towards the clustering algorithms that does not require the knowledge of the number of clusters. Therefore, the efficacy of one of these algorithms is examined. Finally, a novel integrated clustering performance measure (lCPM) is proposed to be used as a guideline for choosing the proper clustering algorithm that has the ability to extract useful biological information in a particular dataset. Supplied by The British Library - 'The world's knowledge' Supplied by The British Library - 'The world's knowledge'005.3University of Liverpoolhttp://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.491357Electronic Thesis or Dissertation
collection NDLTD
sources NDLTD
topic 005.3
spellingShingle 005.3
Salem, Sameh A.
Data Clustering and Partial Supervision with Some Parallel Developments
description Data Clustering and Partial Supell'ision with SOllie Parallel Developments by Sameh A. Salem Clustering is an important and irreplaceable step towards the search for structures in the data. Many different clustering algorithms have been proposed. Yet, the sources of variability in most clustering algorithms affect the reliability of their results. Moreover, the majority tend to be based on the knowledge of the number of clusters as one of the input parameters. Unfortunately, there are many scenarios, where this knowledge may not be available. In addition, clustering algorithms are very computationally intensive which leads to a major challenging problem in scaling up to large datasets. This thesis gives possible solutions for such problems. First, new measures - called clustering performance measures (CPMs) - for assessing the reliability of a clustering algorithm are introduced. These CPMs can be used to evaluate: I) clustering algorithms that have a structure bias to certain type of data distribution as well as those that have no such biases, 2) clustering algorithms that have initialisation dependency as well as the clustering algorithms that have a unique solution for a given set of parameter values with no initialisation dependency. Then, a novel clustering algorithm, which is a RAdius based Clustering ALgorithm (RACAL), is proposed. RACAL uses a distance based principle to map the distributions of the data assuming that clusters are determined by a distance parameter, without having to specify the number of clusters. Furthermore, RACAL is enhanced by a validity index to choose the best clustering result, i.e. result has compact clusters with wide cluster separations, for a given input parameter. Comparisons with other clustering algorithms indicate the applicability and reliability of the proposed clustering algorithm. Additionally, an adaptive partial supervision strategy is proposed for using in conjunction with RACAL_to make it act as a classifier. Results from RACAL with partial supervision, RACAL-PS, indicate its robustness in classification. Additionally, a parallel version of RACAL (P-RACAL) is proposed. The parallel evaluations of P-RACAL indicate that P-RACAL is scalable in terms of speedup and scaleup, which gives the ability to handle large datasets of high dimensions in a reasonable time. Next, a novel clustering algorithm, which achieves clustering without any control of cluster sizes, is introduced. This algorithm, which is called Nearest Neighbour Clustering, Algorithm (NNCA), uses the same concept as the K-Nearest Neighbour (KNN) classifier with the advantage that the algorithm needs no training set and it is completely unsupervised. Additionally, NNCA is augmented with a partial supervision strategy, NNCA-PS, to act as a classifier. Comparisons with other methods indicate the robustness of the proposed method in classification. Additionally, experiments on parallel environment indicate the suitability and scalability of the parallel NNCA, P-NNCA, in handling large datasets. Further investigations on more challenging data are carried out. In this context, microarray data is considered. In such data, the number of clusters is not clearly defined. This points directly towards the clustering algorithms that does not require the knowledge of the number of clusters. Therefore, the efficacy of one of these algorithms is examined. Finally, a novel integrated clustering performance measure (lCPM) is proposed to be used as a guideline for choosing the proper clustering algorithm that has the ability to extract useful biological information in a particular dataset. Supplied by The British Library - 'The world's knowledge' Supplied by The British Library - 'The world's knowledge'
author Salem, Sameh A.
author_facet Salem, Sameh A.
author_sort Salem, Sameh A.
title Data Clustering and Partial Supervision with Some Parallel Developments
title_short Data Clustering and Partial Supervision with Some Parallel Developments
title_full Data Clustering and Partial Supervision with Some Parallel Developments
title_fullStr Data Clustering and Partial Supervision with Some Parallel Developments
title_full_unstemmed Data Clustering and Partial Supervision with Some Parallel Developments
title_sort data clustering and partial supervision with some parallel developments
publisher University of Liverpool
publishDate 2007
url http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.491357
work_keys_str_mv AT salemsameha dataclusteringandpartialsupervisionwithsomeparalleldevelopments
_version_ 1718567868847620096