We investigate the role of initialization in the stability of the $k$-means clustering algorithm. As opposed to other papers, we consider the actual $k$-means algorithm (also known as Lloyd's algorithm). In particular, we leverage the property that this algorithm can get stuck in local optima of the $k$-means objective function. We are interested in the actual clustering, not only in the cost of the solution. We analyze when different initializations lead to the same local optimum, and when they lead to different local optima. This enables us to prove that it is reasonable to select the number of clusters based on stability scores.
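As a rough illustration of the point about local optima, the sketch below runs Lloyd's algorithm on six one-dimensional points from two hand-picked initializations. It is not the paper's code: the toy data, the function names `lloyd`, `cost` and `pair_disagreement`, and the pair-counting disagreement measure are all assumptions made for this example, and the stability score analyzed in the paper may be defined differently.

```python
from itertools import combinations


def lloyd(points, centers, max_iter=100):
    """Run Lloyd's algorithm on 1-D points, starting from the given centers."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center
        # (ties are broken in favor of the lowest center index).
        new_labels = [
            min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
            for x in points
        ]
        if new_labels == labels:
            break  # assignments unchanged: the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its cluster.
        for j in range(len(centers)):
            members = [x for x, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = sum(members) / len(members)
    return labels, centers


def cost(points, labels, centers):
    """k-means objective: sum of squared distances to the assigned centers."""
    return sum((x - centers[lab]) ** 2 for x, lab in zip(points, labels))


def pair_disagreement(a, b):
    """Fraction of point pairs grouped together in one labeling but not the other.

    Illustrative choice only (one minus the Rand index), assumed for this sketch.
    """
    pairs = list(combinations(range(len(a)), 2))
    differ = sum((a[i] == a[j]) != (b[i] == b[j]) for i, j in pairs)
    return differ / len(pairs)


# Hypothetical toy data: three well-separated groups of two points each.
points = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]

# One initial center per group versus two initial centers in the same group.
labels_a, centers_a = lloyd(points, [0.0, 10.0, 20.0])
labels_b, centers_b = lloyd(points, [0.0, 1.0, 15.0])

print(labels_a, cost(points, labels_a, centers_a))
print(labels_b, cost(points, labels_b, centers_b))
print(pair_disagreement(labels_a, labels_b))
```

Both runs stop at a fixed point of the assignment and update steps, but the second one splits the leftmost group and merges the other two, so the two clusterings differ even though the data are identical. Averaging such a disagreement over many random initializations, for each candidate number of clusters, is one simple way to turn this observation into a stability score.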

Keywords: clustering, $k$-means, stability, model selection

@article{PS_2012__16__436_0,
     author = {Bubeck, S\'ebastien and Meil\u{a}, Marina and von Luxburg, Ulrike},
     title = {How the initialization affects the stability of the $k$-means algorithm},
     journal = {ESAIM: Probability and Statistics},
     pages = {436--452},
     publisher = {EDP-Sciences},
     volume = {16},
     year = {2012},
     doi = {10.1051/ps/2012013},
     mrnumber = {2972502},
     language = {en},
     url = {http://archive.numdam.org/articles/10.1051/ps/2012013/}
}

TY - JOUR
AU - Bubeck, Sébastien
AU - Meilă, Marina
AU - von Luxburg, Ulrike
TI - How the initialization affects the stability of the $k$-means algorithm
JO - ESAIM: Probability and Statistics
PY - 2012
SP - 436
EP - 452
VL - 16
PB - EDP-Sciences
UR - http://archive.numdam.org/articles/10.1051/ps/2012013/
DO - 10.1051/ps/2012013
LA - en
ID - PS_2012__16__436_0
ER -

%0 Journal Article
%A Bubeck, Sébastien
%A Meilă, Marina
%A von Luxburg, Ulrike
%T How the initialization affects the stability of the $k$-means algorithm
%J ESAIM: Probability and Statistics
%D 2012
%P 436-452
%V 16
%I EDP-Sciences
%U http://archive.numdam.org/articles/10.1051/ps/2012013/
%R 10.1051/ps/2012013
%G en
%F PS_2012__16__436_0

Bubeck, Sébastien; Meilă, Marina; von Luxburg, Ulrike. How the initialization affects the stability of the $k$-means algorithm. ESAIM: Probability and Statistics, Volume 16 (2012), pp. 436-452. doi : 10.1051/ps/2012013. http://archive.numdam.org/articles/10.1051/ps/2012013/

