Worst-case and smoothed analysis of k-means clustering with Bregman divergences

Bibliographic Details
Main Authors: Bodo Manthey, Heiko Röglin
Format: Article
Language: English
Published: Carleton University 2013-07-01
Series: Journal of Computational Geometry
Online Access: http://jocg.org/index.php/jocg/article/view/39
Description
Summary: The k-means method is the method of choice for clustering large-scale data sets, and it performs exceedingly well in practice despite its exponential worst-case running time. To narrow the gap between theory and practice, k-means has been studied in the semi-random input model of smoothed analysis, which often leads to more realistic conclusions than mere worst-case analysis. For the case that n data points in R^d are perturbed by Gaussian noise with standard deviation σ, it has been shown that the expected running time is bounded by a polynomial in n and 1/σ. This result assumes that squared Euclidean distances are used as the distance measure.

In many applications, however, data is to be clustered with respect to Bregman divergences rather than squared Euclidean distances. A prominent example is the Kullback-Leibler divergence (a.k.a. relative entropy), which is commonly used to cluster web pages. To broaden the knowledge about this important class of distance measures, we analyze the running time of the k-means method for Bregman divergences. We first give a smoothed analysis of k-means with (almost) arbitrary Bregman divergences, and we show bounds of poly(n^√k, 1/σ) and k^(kd) · poly(n, 1/σ). The latter yields a polynomial bound if k and d are small compared to n. On the other hand, we show that the exponential lower bound carries over to a huge class of Bregman divergences.
ISSN: 1920-180X
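
To illustrate the algorithm the abstract analyzes, below is a minimal Python sketch of the k-means method (Lloyd's iteration) with a pluggable Bregman divergence; the function names and parameters are illustrative, not taken from the paper. It relies on the standard fact that, for any Bregman divergence, the arithmetic mean of a cluster minimizes the total divergence from the cluster's points to a single representative, so only the assignment step changes relative to the squared-Euclidean case.

import numpy as np

def squared_euclidean(p, q):
    # The classical k-means distance: the Bregman divergence of x -> ||x||^2.
    return np.sum((p - q) ** 2)

def kl_divergence(p, q):
    # Kullback-Leibler divergence (relative entropy), the Bregman divergence
    # of negative entropy; requires strictly positive entries in p and q.
    return np.sum(p * np.log(p / q))

def kmeans_bregman(points, k, divergence, max_iter=100, seed=0):
    # Lloyd-style k-means with an arbitrary Bregman divergence. Each point is
    # assigned to the center minimizing divergence(point, center); centers are
    # then recomputed as cluster means, which is the optimal representative
    # for every Bregman divergence, not just squared Euclidean distance.
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step.
        labels = np.array(
            [np.argmin([divergence(p, c) for c in centers]) for p in points]
        )
        # Update step; keep a center unchanged if its cluster became empty.
        new_centers = np.array(
            [points[labels == j].mean(axis=0) if np.any(labels == j)
             else centers[j] for j in range(k)]
        )
        if np.allclose(new_centers, centers):
            break  # assignments are stable; a local optimum is reached
        centers = new_centers
    return centers, labels

For example, to cluster web pages by normalized word-frequency vectors as the abstract suggests, one could call kmeans_bregman(freqs, k=10, divergence=kl_divergence), where each row of freqs is strictly positive and sums to one.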