Robust methods in data mining

The thesis focuses on two problems in Data Mining, namely clustering, an exploratory technique to group observations in similar groups, and classification, a technique used to assign new observations to one of the known groups. A thorough study of the two problems, which are also known in the Machin...

Bibliographic Details
Main Author: Mwitondi, K. S.
Other Authors: Taylor, C. C. ; Kent, J. T.
Published: University of Leeds 2003
Subjects: 006.312
Online Access: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.400882
id ndltd-bl.uk-oai-ethos.bl.uk-400882
record_format oai_dc
spelling ndltd-bl.uk-oai-ethos.bl.uk-400882 | 2017-10-04T03:32:58Z | Robust methods in data mining | Mwitondi, K. S. | Taylor, C. C. ; Kent, J. T. | 2003 | 006.312 | University of Leeds | http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.400882 | http://etheses.whiterose.ac.uk/807/ | Electronic Thesis or Dissertation
collection NDLTD
sources NDLTD
topic 006.312
spellingShingle 006.312
Mwitondi, K. S.
Robust methods in data mining
description The thesis focuses on two problems in Data Mining, namely clustering, an exploratory technique to group observations in similar groups, and classification, a technique used to assign new observations to one of the known groups. A thorough study of the two problems, which are also known in the Machine Learning literature as unsupervised and supervised classification respectively, is central to decision making in different fields - the thesis seeks to contribute towards that end. In the first part of the thesis we consider whether robust methods can be applied to clustering - in particular, we perform clustering on fuzzy data using two methods originally developed for outlier detection. The fuzzy data clusters are characterised by two intersecting lines such that points belonging to the same cluster lie close to the same line. This part of the thesis also investigates a new application of a finite mixture of normals to the fuzzy data problem. The second part of the thesis addresses issues relating to classification - in particular, classification trees and boosting. The boosting algorithm is a relative newcomer to the classification portfolio that seeks to enhance the performance of classifiers by iteratively re-weighting the data according to their previous classification status. We explore the performance of "boosted" trees (mainly stumps) based on 3 different models all characterised by a sine-wave boundary. We also carry out a thorough study of the factors that affect the boosting algorithm. Other results include a new look at the concept of randomness in the classification context, particularly because the form of randomness in both training and testing data directly affects the accuracy and reliability of domain-partitioning rules. Further, we provide statistical interpretations of some of the classification-related concepts, originally used in Computer Science, Machine Learning and Artificial Intelligence. This is important since there exists a need for a unified interpretation of some of the "landmark" concepts in various disciplines, as a step forward towards seeking the principles that can guide and strengthen practical applications.
author2 Taylor, C. C. ; Kent, J. T.
author_facet Taylor, C. C. ; Kent, J. T.
Mwitondi, K. S.
author Mwitondi, K. S.
author_sort Mwitondi, K. S.
title Robust methods in data mining
title_short Robust methods in data mining
title_full Robust methods in data mining
title_fullStr Robust methods in data mining
title_full_unstemmed Robust methods in data mining
title_sort robust methods in data mining
publisher University of Leeds
publishDate 2003
url http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.400882