Spectral clustering: An explorative study of proximity measures
In cluster analysis, data are clustered into meaningful groups so that the objects in the same group are very similar, and the objects residing in two different groups are different from one another. One such cluster analysis algorithm is called the spectral clustering algorithm, which originated fr...
Main Author: | |
---|---|
Format: | Others |
Language: | en |
Published: |
University of Ottawa (Canada)
2013
|
Subjects: | |
Online Access: | http://hdl.handle.net/10393/28238 http://dx.doi.org/10.20381/ruor-19150 |
id |
ndltd-uottawa.ca-oai-ruor.uottawa.ca-10393-28238 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-uottawa.ca-oai-ruor.uottawa.ca-10393-28238 2018-01-05T19:07:54Z Spectral clustering: An explorative study of proximity measures Azam, Nadia Farhanaz Computer Science. In cluster analysis, data are clustered into meaningful groups so that the objects in the same group are very similar, and the objects residing in two different groups are different from one another. One such cluster analysis algorithm is called the spectral clustering algorithm, which originated from the area of graph partitioning. The input, in this case, is a similarity matrix, constructed from the pair-wise similarity between data objects. The algorithm uses the eigenvalues and eigenvectors of a normalized similarity matrix to partition the data. The pair-wise similarity between the objects is calculated from the proximity (e.g. similarity or distance) measures. In any clustering task, the proximity measures often play a crucial role. In fact, one of the early and fundamental steps in a clustering process is the selection of a suitable proximity measure. A number of such measures may be used for this task. However, the success of a clustering algorithm partially depends on the selection of the proximity measure. While the majority of prior research on the spectral clustering algorithm emphasizes algorithm-specific issues, little research has been performed on the evaluation of the performance of the proximity measures. To this end, we perform a comparative and exploratory analysis on several existing proximity measures to evaluate their performance when applying the spectral clustering algorithm to a number of diverse data sets. To accomplish this task, we use a ten-fold cross-validation technique, and assess the clustering results using several external cluster evaluation measures.
The performances of the proximity measures are then compared using the quantitative results from the external evaluation measures and analyzed further to determine the probable causes that may have led to such results. In essence, our experimental evaluation indicates that the proximity measures, in general, yield comparable results. That is, no measure is clearly superior, or inferior, to the others in its group. However, among the six similarity measures considered for the binary data, one measure (the Russell and Rao similarity coefficient) frequently performed worse than the others. For numeric data, our study shows that the distance measures based on the relative distances (i.e. the Pearson correlation coefficient and the Angular distance) generally performed better than the distance measures based on the absolute distances (e.g. the Euclidean or Manhattan distance). When considering the proximity measures for mixed data, our results indicate that the choice of distance measure for the numeric data has the highest impact on the final outcome. 2013-11-07T19:04:05Z 2013-11-07T19:04:05Z 2009 2009 Thesis Source: Masters Abstracts International, Volume: 48-05, page: 3035. http://hdl.handle.net/10393/28238 http://dx.doi.org/10.20381/ruor-19150 en 196 p. University of Ottawa (Canada) |
collection |
NDLTD |
language |
en |
format |
Others
|
sources |
NDLTD |
topic |
Computer Science. |
spellingShingle |
Computer Science. Azam, Nadia Farhanaz Spectral clustering: An explorative study of proximity measures |
description |
In cluster analysis, data are clustered into meaningful groups so that the objects in the same group are very similar, and the objects residing in two different groups are different from one another. One such cluster analysis algorithm is called the spectral clustering algorithm, which originated from the area of graph partitioning. The input, in this case, is a similarity matrix, constructed from the pair-wise similarity between data objects. The algorithm uses the eigenvalues and eigenvectors of a normalized similarity matrix to partition the data. The pair-wise similarity between the objects is calculated from the proximity (e.g. similarity or distance) measures. In any clustering task, the proximity measures often play a crucial role. In fact, one of the early and fundamental steps in a clustering process is the selection of a suitable proximity measure. A number of such measures may be used for this task. However, the success of a clustering algorithm partially depends on the selection of the proximity measure. While the majority of prior research on the spectral clustering algorithm emphasizes algorithm-specific issues, little research has been performed on the evaluation of the performance of the proximity measures.
To this end, we perform a comparative and exploratory analysis on several existing proximity measures to evaluate their performance when applying the spectral clustering algorithm to a number of diverse data sets. To accomplish this task, we use a ten-fold cross-validation technique, and assess the clustering results using several external cluster evaluation measures. The performances of the proximity measures are then compared using the quantitative results from the external evaluation measures and analyzed further to determine the probable causes that may have led to such results.
In essence, our experimental evaluation indicates that the proximity measures, in general, yield comparable results. That is, no measure is clearly superior, or inferior, to the others in its group. However, among the six similarity measures considered for the binary data, one measure (the Russell and Rao similarity coefficient) frequently performed worse than the others. For numeric data, our study shows that the distance measures based on the relative distances (i.e. the Pearson correlation coefficient and the Angular distance) generally performed better than the distance measures based on the absolute distances (e.g. the Euclidean or Manhattan distance). When considering the proximity measures for mixed data, our results indicate that the choice of distance measure for the numeric data has the highest impact on the final outcome. |
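The pipeline the abstract describes (proximity measure, similarity matrix, eigenvectors of a normalized similarity matrix, partition) can be illustrated with a small sketch. This is not the thesis's implementation: the normalization D^(-1/2) W D^(-1/2), the row-normalized embedding, the plain k-means step with farthest-point initialization, the rescaling of Pearson's r to [0, 1], the Gaussian kernel width `sigma`, and the toy data are all assumptions chosen for illustration. Every function name is hypothetical.

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """Similarity from an absolute (Euclidean) distance: exp(-d^2 / (2 sigma^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def pearson_similarity(X):
    """Similarity from a relative measure: Pearson r rescaled from [-1, 1] to [0, 1]."""
    return (1.0 + np.corrcoef(X)) / 2.0

def russell_rao_similarity(B):
    """Russell and Rao coefficient for binary rows: shared 1s / number of attributes.
    Note that an object's similarity to itself can be below 1 under this measure."""
    B = np.asarray(B, dtype=float)
    return (B @ B.T) / B.shape[1]

def spectral_clustering(W, k, n_iter=50):
    """Partition objects from a symmetric, non-negative similarity matrix W.
    Assumes every row of W has a positive sum."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    # Normalized similarity matrix D^(-1/2) W D^(-1/2).
    N = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; keep the k largest.
    _, vecs = np.linalg.eigh(N)
    U = vecs[:, -k:]
    # Row-normalize the embedding (guarding against all-zero rows).
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    U = U / np.maximum(norms, 1e-12)
    # Deterministic farthest-point initialization for k-means.
    centers = [U[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((U - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(dist)])
    centers = np.array(centers)
    # Plain Lloyd iterations on the embedded rows.
    for _ in range(n_iter):
        sq = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(sq, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels

# Toy data (made up): two profile shapes, each at three different magnitudes.
rising = np.array([1.0, 2.0, 3.0, 4.0])
falling = rising[::-1]
X = np.array([s * p for p in (rising, falling) for s in (1.0, 5.0, 10.0)])

labels_pearson = spectral_clustering(pearson_similarity(X), k=2)
labels_gaussian = spectral_clustering(gaussian_similarity(X, sigma=10.0), k=2)
print("Pearson-based labels: ", labels_pearson)
print("Gaussian-based labels:", labels_gaussian)
```

On this toy input the Pearson-based similarity groups rows by profile shape regardless of magnitude, while the Euclidean-based similarity is sensitive to magnitude. The example only shows how the choice of proximity measure changes the similarity matrix fed to the algorithm; it is not evidence for or against the thesis's findings.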
author |
Azam, Nadia Farhanaz |
author_facet |
Azam, Nadia Farhanaz |
author_sort |
Azam, Nadia Farhanaz |
title |
Spectral clustering: An explorative study of proximity measures |
title_short |
Spectral clustering: An explorative study of proximity measures |
title_full |
Spectral clustering: An explorative study of proximity measures |
title_fullStr |
Spectral clustering: An explorative study of proximity measures |
title_full_unstemmed |
Spectral clustering: An explorative study of proximity measures |
title_sort |
spectral clustering: an explorative study of proximity measures |
publisher |
University of Ottawa (Canada) |
publishDate |
2013 |
url |
http://hdl.handle.net/10393/28238 http://dx.doi.org/10.20381/ruor-19150 |
work_keys_str_mv |
AT azamnadiafarhanaz spectralclusteringanexplorativestudyofproximitymeasures |
_version_ |
1718602549092679680 |