Though mostly used as a clustering algorithm, $k$-means was originally designed as a quantization algorithm, that is, a way of compressing a probability distribution with $k$ points. Building upon Levrard (2015) and Tang and Monteleoni (2016a), we investigate how and under which conditions these two a priori distinct objectives are compatible. More precisely, we show that if the distribution the sample is drawn from satisfies a margin-type condition (named by analogy with the margin conditions introduced for supervised classification in Mammen and Tsybakov, 1999), then both the associated empirical risk minimizer and the output of Lloyd’s algorithm provide a classification that is, if not optimal (in the sense of Azizyan et al., 2013), at least relevant, together with compression rates that are fast in the sample size and optimal.
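For concreteness, the quantization objective mentioned in the abstract can be written down explicitly; the display below is the standard $k$-means distortion, with notation ours rather than quoted from the paper. A codebook $\mathbf{c} = (c_1, \dots, c_k)$ compresses the distribution $P$ by sending every point to its nearest code point, and its quality, together with its empirical counterpart over a sample $X_1, \dots, X_n$, is

\[
R(\mathbf{c}) = \mathbb{E}_{X \sim P}\Big[\min_{1 \le j \le k} \|X - c_j\|^2\Big],
\qquad
\hat{R}_n(\mathbf{c}) = \frac{1}{n} \sum_{i=1}^{n} \min_{1 \le j \le k} \|X_i - c_j\|^2.
\]

The empirical risk minimizer referred to above is a minimizer of $\hat{R}_n$ over codebooks, and the clustering it induces assigns each sample point to the Voronoi cell of its nearest center $c_j$.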
Keywords: $k$-means, unsupervised classification, quantization, separation rate, distortion
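Lloyd’s algorithm is the standard alternating heuristic for minimizing the empirical distortion $\hat{R}_n$. The following minimal sketch (in Python with NumPy; the function names, parameters, and toy data are illustrative assumptions, not material from the paper) shows the two steps of each iteration, nearest-center assignment followed by a cell-mean update, along which the empirical distortion is non-increasing.

import numpy as np

def empirical_distortion(X, centers):
    # Empirical quantization risk: mean squared distance to the nearest center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

def lloyd(X, k, n_iter=100, seed=None):
    # Plain Lloyd iterations from a random initialization (k distinct samples).
    # Note: k-means++ seeding (reference [6]) is the usual refinement in practice.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each center to the mean of its cell
        # (keep the old center if its cell happens to be empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # reached a fixed point
            break
        centers = new_centers
    return centers, labels

# Toy usage: two well-separated Gaussian clusters, the kind of
# "margin"-type situation in which the abstract's guarantees apply.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(8, 1, (200, 2))])
centers, labels = lloyd(X, k=2, seed=0)
print(empirical_distortion(X, centers))

Note that such an output is in general only a local minimizer of $\hat{R}_n$; the point of the margin-type condition is precisely to guarantee that it is nonetheless both a relevant classifier and a near-optimal quantizer.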
@article{JSFS_2018__159_1_1_0,
     author = {Levrard, Cl\'ement},
     title = {Quantization/Clustering: when and why does $k$-means work?},
     journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
     pages = {1--26},
     publisher = {Soci\'et\'e fran\c{c}aise de statistique},
     volume = {159},
     number = {1},
     year = {2018},
     zbl = {1397.62222},
     language = {en},
     url = {http://archive.numdam.org/item/JSFS_2018__159_1_1_0/}
}
TY - JOUR
AU - Levrard, Clément
TI - Quantization/Clustering: when and why does $k$-means work?
JO - Journal de la société française de statistique
PY - 2018
SP - 1
EP - 26
VL - 159
IS - 1
PB - Société française de statistique
UR - http://archive.numdam.org/item/JSFS_2018__159_1_1_0/
LA - en
ID - JSFS_2018__159_1_1_0
ER -
%0 Journal Article
%A Levrard, Clément
%T Quantization/Clustering: when and why does $k$-means work?
%J Journal de la société française de statistique
%D 2018
%P 1-26
%V 159
%N 1
%I Société française de statistique
%U http://archive.numdam.org/item/JSFS_2018__159_1_1_0/
%G en
%F JSFS_2018__159_1_1_0
Levrard, Clément. Quantization/Clustering: when and why does $k$-means work? Journal de la société française de statistique, Tome 159 (2018) no. 1, pp. 1-26. http://archive.numdam.org/item/JSFS_2018__159_1_1_0/
[1] Antoniadis, Anestis; Brossat, Xavier; Cugliari, Jairo; Poggi, Jean-Michel, Clustering functional data using wavelets (2011) no. RR-7515, 30 pages https://hal.inria.fr/inria-00559115 (Research Report)
[2] Auder, Benjamin; Fischer, Aurélie, Projection-based curve clustering, Journal of Statistical Computation and Simulation, Volume 82 (2012) no. 8, pp. 1145-1168
[3] Antos, András; Györfi, László; György, András, Individual convergence rates in empirical vector quantizer design, IEEE Trans. Inform. Theory, Volume 51 (2005) no. 11, pp. 4013-4022
[4] Awasthi, Pranjal; Sheffet, Or, Improved spectral-norm bounds for clustering, Approximation, randomization, and combinatorial optimization (Lecture Notes in Comput. Sci.), Volume 7408, Springer, Heidelberg, 2012, pp. 37-49
[5] Azizyan, Martin; Singh, Aarti; Wasserman, Larry, Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation, Advances in Neural Information Processing Systems 26 (Burges, C. J. C.; Bottou, L.; Welling, M.; Ghahramani, Z.; Weinberger, K. Q., eds.), Curran Associates, Inc., 2013, pp. 2139-2147 http://papers.nips.cc/paper/4983-minimax-theory-for-high-dimensional-gaussian-mixtures-with-sparse-mean-separation.pdf
[6] Arthur, David; Vassilvitskii, Sergei, k-means++: the advantages of careful seeding, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, ACM, New York (2007), pp. 1027-1035
[7] Biau, Gérard; Devroye, Luc; Lugosi, Gábor, On the performance of clustering in Hilbert spaces, IEEE Trans. Inform. Theory, Volume 54 (2008) no. 2, pp. 781-790
[8] Bunea, Florentina; Giraud, Christophe; Royer, Martin; Verzelen, Nicolas, PECOK: a convex optimization approach to variable clustering, ArXiv e-prints (2016)
[9] Brezis, Haim, Functional analysis, Sobolev spaces and partial differential equations, Universitext, Springer, New York, 2011, xiv+599 pages
[10] Celeux, Gilles; Govaert, Gérard, A classification EM algorithm for clustering and two stochastic versions, Comput. Statist. Data Anal., Volume 14 (1992) no. 3, pp. 315-332
[11] Chou, Philip A., The distortion of vector quantizers trained on $n$ vectors decreases to the optimum as $\mathcal{O}_p(1/n)$, Proc. IEEE Int. Symp. Inf. Theory, Trondheim, Norway (1994), p. 457
[12] Dempster, A. P.; Laird, N. M.; Rubin, D. B., Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, series B, Volume 39 (1977) no. 1, pp. 1-38
[13] Dasgupta, Sanjoy; Schulman, Leonard, A probabilistic analysis of EM for mixtures of separated, spherical Gaussians, J. Mach. Learn. Res., Volume 8 (2007), pp. 203-226
[14] Fischer, Aurélie, Quantization and clustering with Bregman divergences, J. Multivariate Anal., Volume 101 (2010) no. 9, pp. 2207-2221
[15] Gersho, Allen; Gray, Robert M., Vector Quantization and Signal Compression, Kluwer Academic Publishers, Norwell, MA, USA, 1991
[16] Graf, Siegfried; Luschgy, Harald, Foundations of quantization for probability distributions, Lecture Notes in Mathematics, 1730, Springer-Verlag, Berlin, 2000, x+230 pages
[17] Kumar, Amit; Kannan, Ravindran, Clustering with spectral norm and the $k$-means algorithm, 2010 IEEE 51st Annual Symposium on Foundations of Computer Science—FOCS 2010, IEEE Computer Soc., Los Alamitos, CA, 2010, pp. 299-308
[18] Kumar, Amit; Sabharwal, Yogish; Sen, Sandeep, Linear time algorithms for clustering problems in any dimensions, Automata, languages and programming (Lecture Notes in Comput. Sci.), Volume 3580, Springer, Berlin, 2005, pp. 1374-1385
[19] Measuring similarities between gene expression profiles through new data transformations, BMC Bioinformatics, Volume 8 (2007) no. 1
[20] Levrard, Clément, Fast rates for empirical vector quantization, Electron. J. Statist., Volume 7 (2013), pp. 1716-1746
[21] Levrard, Clément, Nonasymptotic bounds for vector quantization in Hilbert spaces, Ann. Statist., Volume 43 (2015) no. 2, pp. 592-619
[22] Levrard, Clément, Sparse oracle inequalities for variable selection via regularized quantization, Bernoulli, Volume 24 (2018) no. 1, pp. 271-296
[23] Linder, Tamás, Learning-Theoretic Methods in Vector Quantization, Springer Vienna, Vienna (2002), pp. 163-210
[24] Lloyd, Stuart P., Least squares quantization in PCM, IEEE Trans. Inform. Theory, Volume 28 (1982) no. 2, pp. 129-137
[25] Lu, Yu; Zhou, Harrison H., Statistical and Computational Guarantees of Lloyd’s Algorithm and its Variants, ArXiv e-prints (2016)
[26] MacQueen, J., Some methods for classification and analysis of multivariate observations, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, University of California Press, Berkeley, Calif. (1967), pp. 281-297 https://projecteuclid.org/euclid.bsmsp/1200512992
[27] Mahajan, Meena; Nimbhorkar, Prajakta; Varadarajan, Kasturi, The planar $k$-means problem is NP-hard, Theoret. Comput. Sci., Volume 442 (2012), pp. 13-21
[28] Mammen, Enno; Tsybakov, Alexandre B., Smooth discrimination analysis, Ann. Statist., Volume 27 (1999) no. 6, pp. 1808-1829
[29] Ng, Andrew Y.; Jordan, Michael I.; Weiss, Yair, On Spectral Clustering: Analysis and an algorithm, Advances in Neural Information Processing Systems, MIT Press (2001), pp. 849-856
[30] Orhan, Umut; Hekim, Mahmut; Ozer, Mahmut, EEG signals classification using the K-means clustering and a multilayer perceptron neural network model, Expert Syst. Appl., Volume 38 (2011) no. 10, pp. 13475-13481 http://dblp.uni-trier.de/db/journals/eswa/eswa38.html#OrhanHO11
[31] Ostrovsky, Rafail; Rabani, Yuval; Schulman, Leonard J.; Swamy, Chaitanya, The effectiveness of Lloyd-type methods for the $k$-means problem, J. ACM, Volume 59 (2012) no. 6, 22 pages
[32] Pollard, David, A central limit theorem for $k$-means clustering, Ann. Probab., Volume 10 (1982) no. 4, pp. 919-926 http://links.jstor.org/sici?sici=0091-1798(198211)10:4<919:ACLTFC>2.0.CO;2-Q&origin=MSN
[33] Tavazoie, Saeed; Hughes, Jason D.; Campbell, Michael J.; Cho, Raymond J.; Church, George M., Systematic determination of genetic network architecture, Nature genetics, Volume 22 (1999), pp. 281-285
[34] Tang, Cheng; Monteleoni, Claire, On Lloyd’s Algorithm: New Theoretical Insights for Clustering in Practice, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (Gretton, Arthur; Robert, Christian C., eds.) (Proceedings of Machine Learning Research), Volume 51, PMLR, Cadiz, Spain (2016), pp. 1280-1289 http://proceedings.mlr.press/v51/tang16b.html
[35] Tang, Cheng; Monteleoni, Claire, On Lloyd’s Algorithm: New Theoretical Insights for Clustering in Practice. Supplementary material (2016)
[36] Tsybakov, Alexandre B., Introduction to Nonparametric Estimation, Springer Publishing Company, Incorporated, 2008