Worst-case and smoothed analysis of k-means clustering with Bregman divergences

Abstract

The k-means method is the method of choice for clustering large-scale data sets, and it performs exceedingly well in practice despite its exponential worst-case running time. To narrow the gap between theory and practice, k-means has been studied in the semi-random input model of smoothed analysis, which often leads to more realistic conclusions than mere worst-case analysis. For the case that n data points in R^d are perturbed by Gaussian noise with standard deviation σ, it has been shown that the expected running time is bounded by a polynomial in n and 1/σ. This result assumes that squared Euclidean distances are used as the distance measure.

In many applications, however, data is to be clustered with respect to Bregman divergences rather than squared Euclidean distances. A prominent example is the Kullback-Leibler divergence (also known as relative entropy), which is commonly used to cluster web pages. To broaden the knowledge about this important class of distance measures, we analyze the running time of the k-means method for Bregman divergences. We first give a smoothed analysis of k-means with (almost) arbitrary Bregman divergences, and we show bounds of poly(n^√k, 1/σ) and k^(kd)·poly(n, 1/σ). The latter yields a polynomial bound if k and d are small compared to n. On the other hand, we show that the exponential lower bound carries over to a huge class of Bregman divergences.
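For reference, since the record does not spell it out: a Bregman divergence is generated by a strictly convex, differentiable function φ, and both distance measures named in the abstract arise as instances. The standard definition is

```latex
% Bregman divergence generated by a strictly convex, differentiable phi:
\[
  D_\varphi(x, y) \;=\; \varphi(x) - \varphi(y) - \langle \nabla\varphi(y),\, x - y \rangle .
\]
% phi(x) = ||x||^2 yields the squared Euclidean distance:
\[
  \varphi(x) = \lVert x \rVert^2 \quad\Longrightarrow\quad D_\varphi(x, y) = \lVert x - y \rVert^2 ,
\]
% and the negative entropy yields the (generalized) Kullback-Leibler divergence:
\[
  \varphi(x) = \sum_i x_i \ln x_i \quad\Longrightarrow\quad
  D_\varphi(x, y) = \sum_i x_i \ln\frac{x_i}{y_i} - \sum_i x_i + \sum_i y_i .
\]
```

On the probability simplex the last expression reduces to the familiar relative entropy Σ_i x_i ln(x_i/y_i).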
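The algorithm the paper analyzes, the k-means (Lloyd's) method with a pluggable Bregman divergence, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names and the NumPy-based structure are assumptions. The update step relies on the standard fact that, for any Bregman divergence, the arithmetic mean minimizes a cluster's cost (Banerjee et al., 2005).

```python
# Minimal sketch (not from the paper): Lloyd's k-means with a
# pluggable Bregman divergence. All names here are illustrative.
import numpy as np

def squared_euclidean(x, c):
    # Bregman divergence generated by phi(x) = ||x||^2.
    return np.sum((x - c) ** 2)

def kl_divergence(x, c):
    # Kullback-Leibler divergence, generated by the negative entropy
    # phi(x) = sum_i x_i log x_i; assumes strictly positive inputs.
    return np.sum(x * np.log(x / c))

def bregman_kmeans(points, k, divergence, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centers as k distinct input points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest center under the chosen divergence.
        labels = np.array([
            min(range(k), key=lambda j: divergence(x, centers[j]))
            for x in points
        ])
        # Update step: the arithmetic mean minimizes cluster cost for
        # every Bregman divergence; keep the old center if a cluster
        # becomes empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # local optimum reached
        centers = new_centers
    return centers, labels
```

For instance, the rows of a row-normalized term-frequency matrix P could be clustered via bregman_kmeans(P, k=10, divergence=kl_divergence), matching the web-page-clustering use case mentioned in the abstract.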

Bibliographic Details
Main Authors: Bodo Manthey (University of Twente), Heiko Roeglin (Maastricht University)
Format: Article
Language: English
Published: Carleton University, 2013-07-01
Series: Journal of Computational Geometry, Vol. 4, No. 1
ISSN: 1920-180X
DOI: 10.20382/jocg.v4i1a5
Online Access: http://jocg.org/index.php/jocg/article/view/39