Agrégation d'estimateurs et optimisation stochastique
Journal de la société française de statistique, Volume 149 (2008) no. 1, p. 3-26

This paper is a written version of the Conférence Lucien Le Cam delivered at the XXXVIIèmes Journées de Statistique in Pau, 2005. It presents an overview of some recent results on the methods of aggregation of estimators. Given a collection of M estimators, aggregation procedures consist in constructing their convex or linear combination with optimally chosen random weights. We mainly focus on the link between aggregation and stochastic optimization which leads us to the construction of some new highly efficient recursive aggregation procedures.

Cet article fait suite à la Conférence Lucien Le Cam que j’ai eu l’honneur de donner lors des XXXVIIèmes Journées de Statistique à Pau, en 2005. Il présente un aperçu de quelques résultats récents sur les méthodes d’agrégation d’estimateurs. Ces méthodes consistent à construire, à partir d’un ensemble de M estimateurs donnés, une combinaison linéaire ou convexe de ces estimateurs avec des poids aléatoires choisis de façon optimale. Nous mettons l’accent sur le lien entre agrégation et optimisation stochastique, ce qui nous permet d’aboutir à de nouvelles procédures recursives d’agrégation très performantes.

Keywords: aggregation, stochastic optimization, mirror descent, adaptive estimation, optimal rate of aggregation
@article{JSFS_2008__149_1_3_0,
     author = {Tsybakov, Alexandre B.},
     title = {Agr\'egation d'estimateurs et optimisation stochastique},
     journal = {Journal de la soci\'et\'e fran\c caise de statistique},
     publisher = {Soci\'et\'e fran\c caise de statistique},
     volume = {149},
     number = {1},
     year = {2008},
     pages = {3-26},
     language = {fr},
     url = {http://www.numdam.org/item/JSFS_2008__149_1_3_0}
}
Tsybakov, Alexandre B. Agrégation d'estimateurs et optimisation stochastique. Journal de la société française de statistique, Volume 149 (2008) no. 1, pp. 3-26. http://www.numdam.org/item/JSFS_2008__149_1_3_0/

[1] Audibert J.-Y. (2004). Aggregated estimators and empirical complexity for least square regression. Annales de l'Institut Henri Poincaré (B) Probabilités et Statistiques 40 : 685-736. | Numdam | MR 2096215 | Zbl 1052.62037

[2] Barron A., Cohen A., Dahmen W. et Devore R. (2005). Approximation and learning by greedy algorithms. Annals of Statistics, à paraître. | Zbl 1138.62019

[3] Ben-Tal A. et Nemirovski A.S. (1999). The conjugate barrier mirror descent method for non-smooth convex optimization. MINERVA Optim. Center Report., Haifa : Faculty of Industrial Engineering and Management, Technion - Israel Institute of Technology. http://iew3.technion.ac.il/Labs/Opt/opt/Pap/CP_MD.pdf

[4] Birgé L. (2006). Model selection via testing : an alternative to (penalized) maximum likelihood estimators. Annales de l'Institut Henri Poincaré (B) Probabilités et Statistiques 42 : 273 - 325. | Numdam | MR 2219712 | Zbl pre05024238

[5] Bickel P.J., Ritov Y. et Zakai A. (2006). Some theory for generalized boosting algorithms. J. Machine Learning Research 7 : 705-732. | MR 2274384

[6] Bühlman P. et Yu B. (2005). Sparse boosting. J. Machine Learning Research 7 : 1001-1024. | MR 2274395

[7] Bunea F., Tsybakov A.B. et Wegkamp M.H. (2004). Aggregation for regression learning. Preprint LPMA, Universities Paris 6 - Paris 7, n. 948, arXiv :math.ST/0410214 et https://hal.ccsd.cnrs.fr/ccsd-00003205

[8] Bunea F., Tsybakov A.B. et Wegkamp M.H. (2006). Aggregation and sparsity via 1 penalized least squares. Proceedings of 19th Annual Conference on Learning Theory (COLT 2006), Lecture Notes in Artificial Intelligence v. 4005 (Lugosi, G. and Simon, H.U., eds.), Springer-Verlag, Berlin-Heidelberg, 379-391. | MR 2280619 | Zbl 1143.62319

[9] Bunea F., Tsybakov A.B. et Wegkamp M.H. (2007a). Aggregation for Gaussian regression. Annals of Statistics 35 : 1674-1697. | MR 2351101 | Zbl pre05201517

[10] Bunea F., Tsybakov A.B. et Wegkamp M.H. (2007b). Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics 1 : 169-194. | MR 2312149 | Zbl 1146.62028

[11] Bunea F., Tsybakov A.B. et Wegkamp M.H. (2007c). Sparse density estimation with 1 penalties. Proceedings of 20th Annual Conference on Learning Theory (COLT 2007), Lecture Notes in Artificial Intelligence, v. 4539 (N.H. Bshouty and C. Gentile, eds.), Springer-Verlag, Berlin-Heidelberg, 530-543. | MR 2397610 | Zbl pre05222924

[12] Catoni O. (1999). “Universal” aggregation rules with exact bias bounds. Preprint LPMA, Universités Paris 6 - Paris 7, n.510.

[13] Catoni O. (2004). Statistical Learning Theory and Stochastic Optimization. Ecole d'Eté de Probabilités de Saint-Flour XXXI - 2001. Lecture Notes in Mathematics, vol. 1851, Springer, New York. | MR 2163920 | Zbl 1076.93002

[14] Cesa-Bianchi N. et Lugosi G. (2006). Prediction, Learning, and Games. Cambridge Univ. Press. | MR 2409394 | Zbl 1114.91001

[15] Dalalyan A., Juditsky A. et Spokoiny V. (2007). A new algorithm for estimating the effective dimension-reduction subspace. arXiv :math/0701887v1.

[16] Dalalyan A. et Tsybakov A.B. (2007). Aggregation by exponential weighting and sharp oracle inequalities. Proceedings of the 20th Annual Conference on Learning Theory (COLT-2007), Lecture Notes in Artificial Intelligence, v. 4539 (N.H. Bshouty and C. Gentile, eds.), Springer-Verlag, Berlin-Heidelberg, 97-111. | MR 2397581 | Zbl pre05222895

[17] Freund Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation 121 : 256-285. | MR 1348530 | Zbl 0833.68109

[18] Hristache M., Juditsky A. et Spokoiny V. (2001). Direct estimation of the index coefficient in a single-index model. Annals of Statistics 29 : 593-623. | MR 1865333 | Zbl 1012.62043

[19] Huber P.J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proc. Fifth Berkeley Symp. Math. Statist. Prob. 1 : 221-234. | MR 216620 | Zbl 0212.21504

[20] Juditsky A.B., Nazin A.V., Tsybakov A.B. et Vayatis N. (2005). Recursive aggregation of estimators by the mirror descent algorithm with averaging. Problems of Information Transmission 41 : 368-384. | MR 2198228 | Zbl 1123.62044

[21] Juditsky A. et Nemirovski A. (2000). Functional aggregation for nonparametric estimation. Annals of Statistics 28 : 681-712. | MR 1792783 | Zbl 1105.62338

[22] Juditsky A., Rigollet Ph. et Tsybakov A.B. (2005). Learning by mirror averaging. Annals of Statistics, à paraître. | Zbl pre05368488

[23] Klemelä J. (2006). Density estimation with stagewise optimization of the empirical risk. Machine Learning 67 : 169-195.

[24] Koltchinskii V. (2006a). Local Rademacher complexities and oracle inequalities in empirical risk minimization (with discusssion). Annals of Statistics 34 : 2697-2706. | MR 2329464 | Zbl 1118.62065

[25] Koltchinskii V. (2006b). Sparsity in penalized empirical risk minimization. Annales de l'Institut Henri Poincaré (B) Probabilités et Statistiques, à paraître.

[26] Le Cam L. (1953). On some asymptotic properties of maximum likelihood estimates and related Bayes estimates. University of California Publications in Statistics 1 : 277-329. | MR 54913 | Zbl 0052.15404

[27] Lecué G. (2005). Lower bounds and aggregation in density estimation. J. Machine Learning Research 7 : 971-981. | MR 2274393

[28] Lecué G. (2007a). Méthodes d'agrégation : optimalité et vitesses rapides. Thèse de doctorat, Université Paris 6. http://tel.archives-ouvertes.fr/tel-00150402

[29] Lecué G. (2007b). Suboptimality of penalized empirical risk minimization in classification. Proceedings of 20th Annual Conference on Learning Theory (COLT 2007), Lecture Notes in Artificial Intelligence, v. 4539 (N.H. Bshouty and C.Gentile, eds.), Springer-Verlag, Berlin-Heidelberg, 142-156. | MR 2397584 | Zbl pre05222898

[30] Lugosi G. (2006). Prédiction randomisée de suites individuelles. J. Société Française de Statistique 147 : 5-37.

[31] Lounici K. (2007). Generalized mirror averaging and D-convex aggregation. Mathematical Methods of Statistics 16 : 246-259. | MR 2356820

[32] Lugosi G. et Vayatis N. (2004). On the Bayes-risk consistency of regularized boosting methods (with discussion). Annals of Statistics 32 : 30-55. | MR 2051000 | Zbl 1105.62319

[33] Mannor S., Meir R. et Zhang T. (2003). Greedy algorithms for classification - consistency, convergence rates, and adaptivity. Journal of Machine Learning Research 4 : 713-742. | MR 2072266 | Zbl 1105.68388

[34] Nemirovski A. (2000). Topics in Non-parametric Statistics. Ecole d'Eté de Probabilités de Saint-Flour XXVIII - 1998, Lecture Notes in Mathematics, v. 1738, Springer : New York. | MR 1775640 | Zbl 0998.62033

[35] Nemirovski A.S. et Yudin D.B. (1983). Problem Complexity and Method Efficiency in Optimization, Wiley, Chichester. | MR 702836 | Zbl 0501.90062

[36] Nesterov Yu. (2005). Primal-dual subgradient methods for convex problems. CORE discussion paper 2005/67. Center for Operations Research and Econometrics, Louvain-la-Neuve, Belgique.

[37] Nesterov Yu. (2007). Primal-dual subgradient methods for convex problems. Mathematical Programming, publié en ligne, DOI : 10.1007/s10107-007-0149-x.

[38] Rigollet Ph. (2006). Inégalités d'oracle, agrégation et adaptation. Thèse de doctorat, Université Paris 6. http://tel.archives-ouvertes.fr/tel-00115494

[39] Rigollet Ph. et Tsybakov A. (2007). Linear and convex aggregation of density estimators. Mathematical Methods of Statistics 16 : 260-280. | MR 2356821

[40] Robbins H. et Monro S. (1951). A stochastic approximation method. Annals of Mathematical Statistics 22 : 400-407. | MR 42668 | Zbl 0054.05901

[41] Rockafellar R.T. et Wets R.J.B. (1998). Variational Analysis. Springer, N.Y. | MR 1491362 | Zbl 0888.49001

[42] Samarov A. et Tsybakov A. (2007) Aggregation of density estimators and dimension reduction. Advances in Statistical Modeling and Inference. Essays in Honor of Kjell A. Doksum (V. Nair, ed.), World Scientific, Singapore e.a., 233-251. | MR 2416118

[43] Schapire R.E. (1990). The strength of weak learnability. Machine Learning 5 : 197-227. | Zbl 0747.68058

[44] Tsybakov A. (2003). Optimal rates of aggregation. Computational Learning Theory and Kernel Machines, (B. Schölkopf and M. Warmuth, eds.), Lecture Notes in Artificial Intelligence, v. 2777. Springer, Heidelberg, 303-313.

[45] Vajda I. (1986). Theory of Statistical Inference and Information. Kluwer, Dordrecht. | Zbl 0711.62002

[46] Van De Geer S.A. (2006). High dimensional generalized linear models and the Lasso. Annals of Statistics, à paraître. | Zbl 1138.62323

[47] Vapnik V. et Chervonenkis A. (1974). Theory of Pattern Recognition Nauka, Moscow (in Russian). Traduction allemande : Wapnik W. und Tscherwonenkis A. Theorie der Zeichenerkennung, Berlin : Akademie-Verlag, 1979. | MR 474638

[48] Wand M.P. et Jones M.C. (1995). Kernel Smoothing. Chapman and Hall, London. | MR 1319818 | Zbl 0854.62043

[49] Wegkamp M.H. (2003). Model selection in nonparametric regression. Annals of Statistics 31 : 252-273. | MR 1962506 | Zbl 1019.62037

[50] Yang Y. (2000a). Mixing strategies for density estimation. Annals of Statistics 28 : 75-87. | MR 1762904 | Zbl 1106.62322

[51] Yang Y. (2000b). Combining different procedures for adaptive regression. Journal of Multivariate Analysis 74 : 135-161. | MR 1790617 | Zbl 0964.62032

[52] Yang Y. (2001). Adaptive regression by mixing. Journal of the American Statistical Association 96 : 574-588. | MR 1946426 | Zbl 1018.62033

[53] Yang Y. (2004). Aggregating regression procedures for a better performance. Bernoulli 10 : 25-47. | MR 2044592 | Zbl 1040.62030

[54] Zhang T. et Yu B. (2005) Boosting with early stopping : convergence and cosistency. Annals of Statistics 33 : 1538-1579. | MR 2166555 | Zbl 1078.62038