Worst-case and smoothed analysis of k-means clustering with Bregman divergences

Abstract

The k-means method is the method of choice for clustering large-scale data sets, and it performs exceedingly well in practice despite its exponential worst-case running time. To narrow the gap between theory and practice, k-means has been studied in the semi-random input model of smoothed analysis, which often leads to more realistic conclusions than mere worst-case analysis. For the case that n data points in R^d are perturbed by Gaussian noise with standard deviation σ, it has been shown that the expected running time is bounded by a polynomial in n and 1/σ. This result assumes that squared Euclidean distances are used as the distance measure.

In many applications, however, data is to be clustered with respect to Bregman divergences rather than squared Euclidean distances. A prominent example is the Kullback-Leibler divergence (also known as relative entropy), which is commonly used to cluster web pages. To broaden the knowledge about this important class of distance measures, we analyze the running time of the k-means method for Bregman divergences. We first give a smoothed analysis of k-means with (almost) arbitrary Bregman divergences, and we show bounds of poly(n^√k, 1/σ) and k^(kd)·poly(n, 1/σ). The latter yields a polynomial bound if k and d are small compared to n. On the other hand, we show that the exponential lower bound carries over to a huge class of Bregman divergences.
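For reference, since the record does not spell it out: a Bregman divergence is generated by a strictly convex, differentiable function φ, and both distance measures named in the abstract arise as instances. The standard definition is

```latex
% Bregman divergence generated by a strictly convex, differentiable phi:
\[
  D_\varphi(x, y) \;=\; \varphi(x) - \varphi(y) - \langle \nabla\varphi(y),\, x - y \rangle .
\]
% phi(x) = ||x||^2 yields the squared Euclidean distance:
\[
  \varphi(x) = \lVert x \rVert^2 \quad\Longrightarrow\quad D_\varphi(x, y) = \lVert x - y \rVert^2 ,
\]
% and the negative entropy yields the (generalized) Kullback-Leibler divergence:
\[
  \varphi(x) = \sum_i x_i \ln x_i \quad\Longrightarrow\quad
  D_\varphi(x, y) = \sum_i x_i \ln\frac{x_i}{y_i} - \sum_i x_i + \sum_i y_i .
\]
```

On the probability simplex the last expression reduces to the familiar relative entropy Σ_i x_i ln(x_i/y_i).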
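The algorithm the paper analyzes, the k-means (Lloyd's) method with a pluggable Bregman divergence, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names and the NumPy-based structure are assumptions. The update step relies on the standard fact that, for any Bregman divergence, the arithmetic mean minimizes a cluster's cost (Banerjee et al., 2005).

```python
# Minimal sketch (not from the paper): Lloyd's k-means with a
# pluggable Bregman divergence. All names here are illustrative.
import numpy as np

def squared_euclidean(x, c):
    # Bregman divergence generated by phi(x) = ||x||^2.
    return np.sum((x - c) ** 2)

def kl_divergence(x, c):
    # Kullback-Leibler divergence, generated by the negative entropy
    # phi(x) = sum_i x_i log x_i; assumes strictly positive inputs.
    return np.sum(x * np.log(x / c))

def bregman_kmeans(points, k, divergence, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centers as k distinct input points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest center under the chosen divergence.
        labels = np.array([
            min(range(k), key=lambda j: divergence(x, centers[j]))
            for x in points
        ])
        # Update step: the arithmetic mean minimizes cluster cost for
        # every Bregman divergence; keep the old center if a cluster
        # becomes empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # local optimum reached
        centers = new_centers
    return centers, labels
```

For instance, the rows of a row-normalized term-frequency matrix P could be clustered via bregman_kmeans(P, k=10, divergence=kl_divergence), matching the web-page-clustering use case mentioned in the abstract.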

Bibliographic Details
Main Authors: Bodo Manthey (University of Twente), Heiko Roeglin (Maastricht University)
Format: Article
Language: English
Published: Carleton University, 2013-07-01
Series: Journal of Computational Geometry, Vol. 4, No. 1
ISSN: 1920-180X
DOI: 10.20382/jocg.v4i1a5
Online Access: http://jocg.org/index.php/jocg/article/view/39