Worst-case and smoothed analysis of k-means clustering with Bregman divergences
The k-means method is the method of choice for clustering large-scale data sets, and it performs exceedingly well in practice despite its exponential worst-case running time. To narrow the gap between theory and practice, k-means has been studied in the semi-random input model of smoothed analysis, which often leads to more realistic conclusions than mere worst-case analysis. For the case that n data points in R^d are perturbed by Gaussian noise with standard deviation σ, it has been shown that the expected running time is bounded by a polynomial in n and 1/σ. This result assumes that squared Euclidean distances are used as the distance measure.

In many applications, however, data is to be clustered with respect to Bregman divergences rather than squared Euclidean distances. A prominent example is the Kullback-Leibler divergence (a.k.a. relative entropy), which is commonly used to cluster web pages. To broaden the knowledge about this important class of distance measures, we analyze the running time of the k-means method for Bregman divergences. We first give a smoothed analysis of k-means with (almost) arbitrary Bregman divergences, and we show bounds of poly(n^√k, 1/σ) and k^(kd)·poly(n, 1/σ). The latter yields a polynomial bound if k and d are small compared to n. On the other hand, we show that the exponential lower bound carries over to a huge class of Bregman divergences.
Main Authors: | Bodo Manthey (University of Twente), Heiko Roeglin (Maastricht University) |
---|---|
Format: | Article |
Language: | English |
Published: | Carleton University, 2013-07-01 |
Series: | Journal of Computational Geometry |
ISSN: | 1920-180X |
Online Access: | http://jocg.org/index.php/jocg/article/view/39 |
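The record itself contains no code; the following is a minimal, illustrative sketch (not the authors' algorithm or analysis) of the k-means method run with the Kullback-Leibler divergence in place of squared Euclidean distances, to show how a Bregman divergence slots into the usual Lloyd iteration. All function names and parameters (`kl_divergence`, `bregman_kmeans`, `eps`, `iters`, `seed`) are my own choices; the only Bregman-specific fact used is that the arithmetic mean remains the optimal cluster center for any Bregman divergence.

```python
import numpy as np

def kl_divergence(x, c, eps=1e-12):
    """Kullback-Leibler divergence KL(x || c) between probability vectors,
    the Bregman divergence generated by negative entropy."""
    x = np.clip(x, eps, None)
    c = np.clip(c, eps, None)
    return float(np.sum(x * (np.log(x) - np.log(c))))

def bregman_kmeans(points, k, iters=100, seed=0):
    """Lloyd-style k-means iteration with a Bregman divergence (illustrative
    sketch). For any Bregman divergence, the optimal center of a cluster is
    the arithmetic mean of its points, so only the assignment step differs
    from squared-Euclidean k-means."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point joins the center with the smallest
        # divergence d(point, center).
        labels = np.array([
            min(range(k), key=lambda j: kl_divergence(p, centers[j]))
            for p in points
        ])
        # Update step: each center moves to the mean of its cluster
        # (empty clusters keep their old center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Example: cluster random points on the probability simplex.
if __name__ == "__main__":
    data = np.random.default_rng(1).dirichlet(np.ones(5), size=200)
    centers, labels = bregman_kmeans(data, k=3)
```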