Spectral clustering: An explorative study of proximity measures

In cluster analysis, data are clustered into meaningful groups so that the objects in the same group are very similar, and the objects residing in two different groups are different from one another. One such cluster analysis algorithm is called the spectral clustering algorithm, which originated fr...

Full description

Bibliographic Details
Main Author: Azam, Nadia Farhanaz
Format: Others
Language:en
Published: University of Ottawa (Canada) 2013
Subjects:
Online Access:http://hdl.handle.net/10393/28238
http://dx.doi.org/10.20381/ruor-19150
id ndltd-uottawa.ca-oai-ruor.uottawa.ca-10393-28238
record_format oai_dc
spelling ndltd-uottawa.ca-oai-ruor.uottawa.ca-10393-282382018-01-05T19:07:54Z Spectral clustering: An explorative study of proximity measures Azam, Nadia Farhanaz Computer Science. In cluster analysis, data are clustered into meaningful groups so that the objects in the same group are very similar, and the objects residing in two different groups are different from one another. One such cluster analysis algorithm is called the spectral clustering algorithm, which originated from the area of graph partitioning. The input, in this case, is a similarity matrix, constructed from the pair-wise similarity between data objects. The algorithm uses the eigenvalues and eigenvectors of a normalized similarity matrix to partition the data. The pair-wise similarity between the objects is calculated from the proximity (e.g. similarity or distance) measures. In any clustering task, the proximity measures often play a crucial role. In fact, one of the early and fundamental steps in a clustering process is the selection of a suitable proximity measure. A number of such measures may be used for this task. However, the success of a clustering algorithm partially depends on the selection of the proximity measure. While, the majority of prior research on the spectral clustering algorithm emphasizes on the algorithm-specific issues, little research has been performed on the evaluation of the performance of the proximity measures. To this end, we perform a comparative and exploratory analysis on several existing proximity measures to evaluate their performance when applying the spectral clustering algorithm to a number of diverse data sets. To accomplish this task, we use a ten-fold cross validation technique, and assess the clustering results using several external cluster evaluation measures. The performances of the proximity measures are then compared using the quantitative results from the external evaluation measures and analyzed further to determine the probable causes that may have led to such results. In essence, our experimental evaluation indicates that the proximity measures, in general, yield comparable results. That is, no measure is clearly superior, or inferior, to the others in its group. However, among the six similarity measures considered for the binary data, one measure (Russell and Roo similarity coefficient) frequently performed poorer than the others. For numeric data, our study shows that the distance measures based on the relative distances (i.e. the Pearson correlation coefficient and the Angular distance) generally performed better than the distance measures based on the absolute distances (e.g. the Euclidean or Manhattan distance). When considering the proximity measures for mixed data, our results indicate that the choice of distance measure for the numeric data has the highest impact on the final outcome. 2013-11-07T19:04:05Z 2013-11-07T19:04:05Z 2009 2009 Thesis Source: Masters Abstracts International, Volume: 48-05, page: 3035. http://hdl.handle.net/10393/28238 http://dx.doi.org/10.20381/ruor-19150 en 196 p. University of Ottawa (Canada)
collection NDLTD
language en
format Others
sources NDLTD
topic Computer Science.
spellingShingle Computer Science.
Azam, Nadia Farhanaz
Spectral clustering: An explorative study of proximity measures
description In cluster analysis, data are clustered into meaningful groups so that the objects in the same group are very similar, and the objects residing in two different groups are different from one another. One such cluster analysis algorithm is called the spectral clustering algorithm, which originated from the area of graph partitioning. The input, in this case, is a similarity matrix, constructed from the pair-wise similarity between data objects. The algorithm uses the eigenvalues and eigenvectors of a normalized similarity matrix to partition the data. The pair-wise similarity between the objects is calculated from the proximity (e.g. similarity or distance) measures. In any clustering task, the proximity measures often play a crucial role. In fact, one of the early and fundamental steps in a clustering process is the selection of a suitable proximity measure. A number of such measures may be used for this task. However, the success of a clustering algorithm partially depends on the selection of the proximity measure. While, the majority of prior research on the spectral clustering algorithm emphasizes on the algorithm-specific issues, little research has been performed on the evaluation of the performance of the proximity measures. To this end, we perform a comparative and exploratory analysis on several existing proximity measures to evaluate their performance when applying the spectral clustering algorithm to a number of diverse data sets. To accomplish this task, we use a ten-fold cross validation technique, and assess the clustering results using several external cluster evaluation measures. The performances of the proximity measures are then compared using the quantitative results from the external evaluation measures and analyzed further to determine the probable causes that may have led to such results. In essence, our experimental evaluation indicates that the proximity measures, in general, yield comparable results. That is, no measure is clearly superior, or inferior, to the others in its group. However, among the six similarity measures considered for the binary data, one measure (Russell and Roo similarity coefficient) frequently performed poorer than the others. For numeric data, our study shows that the distance measures based on the relative distances (i.e. the Pearson correlation coefficient and the Angular distance) generally performed better than the distance measures based on the absolute distances (e.g. the Euclidean or Manhattan distance). When considering the proximity measures for mixed data, our results indicate that the choice of distance measure for the numeric data has the highest impact on the final outcome.
author Azam, Nadia Farhanaz
author_facet Azam, Nadia Farhanaz
author_sort Azam, Nadia Farhanaz
title Spectral clustering: An explorative study of proximity measures
title_short Spectral clustering: An explorative study of proximity measures
title_full Spectral clustering: An explorative study of proximity measures
title_fullStr Spectral clustering: An explorative study of proximity measures
title_full_unstemmed Spectral clustering: An explorative study of proximity measures
title_sort spectral clustering: an explorative study of proximity measures
publisher University of Ottawa (Canada)
publishDate 2013
url http://hdl.handle.net/10393/28238
http://dx.doi.org/10.20381/ruor-19150
work_keys_str_mv AT azamnadiafarhanaz spectralclusteringanexplorativestudyofproximitymeasures
_version_ 1718602549092679680