Overlapping clustering
Analysis of large collections of data has become inescapable in many areas of scientific and commercial endeavor. As the size and dimensionality of these collections exceed the pattern recognition capability of the human mind computational analysis tools become a necessity for interpretation. Cluste...
Main Author: | |
---|---|
Format: | Others |
Language: | English |
Published: |
2010
|
Subjects: | |
Online Access: | http://hdl.handle.net/2152/ETD-UT-2010-08-2022 |
id |
ndltd-UTEXAS-oai-repositories.lib.utexas.edu-2152-ETD-UT-2010-08-2022 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-UTEXAS-oai-repositories.lib.utexas.edu-2152-ETD-UT-2010-08-20222015-09-20T16:56:13ZOverlapping clusteringKrumpelman, Chase SerhurClustering algorithmsAnalysis of large collections of data has become inescapable in many areas of scientific and commercial endeavor. As the size and dimensionality of these collections exceed the pattern recognition capability of the human mind computational analysis tools become a necessity for interpretation. Clustering algorithms, which aim to find interesting groupings within collections of data, are one such tool. Each algorithm incorporates into its design an inherent definition of “interesting” intended to capture nonrandom data groupings likely to have some interpretation to human users. Most existing algorithms include as part of their definition of “interesting” an assumption that each data point can belong at most to one grouping. While this assumption allows for algorithmic convenience and ease of analysis, it is often an artificial imposition on true underlying data structure. The idea of allowing points to belong to multiple groupings - known as “overlapping” or “multiple membership” clustering - has emerged in several domains in ad hoc solutions lacking conceptual unity in approach, interpretation, and analysis. This dissertation proposes general, domain-independent elucidations and practical techniques which address each of these. We begin by positing overlapping clustering’s role specifically, and clustering’s role in general, as assistive technologic tools allowing human minds to represent and interpret structures in data beyond the capability of our innate senses. With this guiding purpose clarified, we provide a catalog of existing techniques. We then address the issue of objectively comparing the results of different algorithms, specifically examining the previously defined Omega index, as well as multiple membership generalizations of normalized mutual information. Following that comparison, we propose a novel approach to com- paring clusterings called cluster alignment. By combining a sorting algorithm with a greedy matching algorithm, we produce comparably organized membership matrices and a means for both numerically and visually comparing multiple-membership assignments. With overlapping clustering’s purpose defined, and the means to analyze results, we move on to presenting algorithms for efficiently discovering overlapping clusters in data. First, we present a generalization of one of the common themes in the ad hoc approaches: additive clustering. Starting with a previously developed structural model of additive clustering, we generalize it to be applicable to any regular exponential family distribution thereby extending its utility into several domains, notably high-dimensional sparse domains including text and recommender systems. Finally, we address overlapping clustering by examining the properties of data in similarity spaces. We develop a probabilistic generative model of overlapping data in similarity spaces, and then develop two conceptual approaches to discovering overlapping clustering in similarity spaces. The first of these is the conceptual multiple-membership generalization of hierarchical agglomerative clustering, and the second is an iterative density hill-climbing algorithm.text2010-12-13T22:34:44Z2010-12-13T22:34:50Z2010-12-13T22:34:44Z2010-12-13T22:34:50Z2010-082010-12-13August 20102010-12-13T22:34:50Zthesisapplication/pdfhttp://hdl.handle.net/2152/ETD-UT-2010-08-2022eng |
collection |
NDLTD |
language |
English |
format |
Others
|
sources |
NDLTD |
topic |
Clustering algorithms |
spellingShingle |
Clustering algorithms Krumpelman, Chase Serhur Overlapping clustering |
description |
Analysis of large collections of data has become inescapable in many areas of scientific and commercial endeavor. As the size and dimensionality of these collections exceed the pattern recognition capability of the human mind computational analysis tools become a necessity for interpretation. Clustering
algorithms, which aim to find interesting groupings within collections of data, are one such tool. Each algorithm incorporates into its design an inherent definition of “interesting” intended to capture nonrandom data groupings likely to have some interpretation to human users. Most existing algorithms include
as part of their definition of “interesting” an assumption that each data point can belong at most to one grouping. While this assumption allows for algorithmic convenience and ease of analysis, it is often an artificial imposition on true underlying data structure. The idea of allowing points to belong to multiple groupings - known as “overlapping” or “multiple membership” clustering - has emerged in several domains in ad hoc solutions lacking conceptual unity
in approach, interpretation, and analysis. This dissertation proposes general, domain-independent elucidations and practical techniques which address each of these.
We begin by positing overlapping clustering’s role specifically, and clustering’s role in general, as assistive technologic tools allowing human minds to represent and interpret structures in data beyond the capability of our innate senses. With this guiding purpose clarified, we provide a catalog of existing techniques. We then address the issue of objectively comparing the results
of different algorithms, specifically examining the previously defined Omega
index, as well as multiple membership generalizations of normalized mutual information. Following that comparison, we propose a novel approach to com-
paring clusterings called cluster alignment. By combining a sorting algorithm with a greedy matching algorithm, we produce comparably organized membership matrices and a means for both numerically and visually comparing multiple-membership assignments. With overlapping clustering’s purpose defined, and the means to analyze results, we move on to presenting algorithms for efficiently discovering overlapping clusters in data. First, we present a generalization of one of the common themes in the ad hoc approaches: additive
clustering. Starting with a previously developed structural model of additive clustering, we generalize it to be applicable to any regular exponential
family distribution thereby extending its utility into several domains, notably high-dimensional sparse domains including text and recommender systems. Finally, we address overlapping clustering by examining the properties of data
in similarity spaces. We develop a probabilistic generative model of overlapping data in similarity spaces, and then develop two conceptual approaches to
discovering overlapping clustering in similarity spaces. The first of these is the conceptual multiple-membership generalization of hierarchical agglomerative
clustering, and the second is an iterative density hill-climbing algorithm. === text |
author |
Krumpelman, Chase Serhur |
author_facet |
Krumpelman, Chase Serhur |
author_sort |
Krumpelman, Chase Serhur |
title |
Overlapping clustering |
title_short |
Overlapping clustering |
title_full |
Overlapping clustering |
title_fullStr |
Overlapping clustering |
title_full_unstemmed |
Overlapping clustering |
title_sort |
overlapping clustering |
publishDate |
2010 |
url |
http://hdl.handle.net/2152/ETD-UT-2010-08-2022 |
work_keys_str_mv |
AT krumpelmanchaseserhur overlappingclustering |
_version_ |
1716821242466533376 |