Estimator selection in the gaussian setting
Annales de l'I.H.P. Probabilités et statistiques, Volume 50 (2014) no. 3, p. 1092-1119

We consider the problem of estimating the mean f of a Gaussian vector Y with independent components of common unknown variance σ². Our estimation procedure is based on estimator selection. More precisely, we start with an arbitrary and possibly infinite collection 𝔽 of estimators of f based on Y and, with the same data Y, aim at selecting an estimator among 𝔽 with the smallest Euclidean risk. No assumptions are made on the estimators, and their dependence on Y may be unknown. We establish a non-asymptotic risk bound for the selected estimator and derive oracle-type inequalities when 𝔽 consists of linear estimators. As particular cases, our approach makes it possible to handle the problems of aggregation and model selection, as well as those of choosing a window and a kernel for estimating a regression function, or tuning the parameter involved in a penalized criterion. In all these cases but aggregation, the method can be easily implemented. For illustration, we carry out two simulation studies. One compares our procedure to cross-validation for choosing a tuning parameter. The other shows how to implement our approach to solve the problem of variable selection in practice.
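To make the setting concrete, the following minimal sketch simulates Y = f + σε, builds a finite collection of linear estimators (ridge fits over a grid of tuning parameters), and selects one by K-fold cross-validation, the baseline mentioned in the abstract. It does not implement the selection procedure of the paper. The function names (`ridge_fit`, `cv_select`, `simulate`), the design, the grid and all numerical values are assumptions chosen for illustration. The "oracle" is the estimator in the collection with the smallest Euclidean loss; it can only be computed here because f is known in a simulation.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge coefficients: solve (X'X + lam * I) b = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def cv_select(X, y, lambdas, n_folds=5, seed=0):
    """Index of the lambda with the smallest K-fold prediction error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = np.zeros(len(lambdas))
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        for j, lam in enumerate(lambdas):
            b = ridge_fit(X[train], y[train], lam)
            errors[j] += np.sum((y[test_idx] - X[test_idx] @ b) ** 2)
    return int(np.argmin(errors))


def simulate(n=100, p=20, sigma=1.0, seed=0):
    """Compare the cross-validated choice with the oracle on one sample."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:3] = [2.0, -1.0, 0.5]          # hypothetical sparse signal
    f = X @ beta                         # true mean, unknown in practice
    y = f + sigma * rng.standard_normal(n)

    lambdas = np.logspace(-2, 3, 20)     # hypothetical tuning grid
    # Euclidean loss ||f - f_hat||^2 of each estimator in the collection.
    losses = np.array(
        [np.sum((f - X @ ridge_fit(X, y, lam)) ** 2) for lam in lambdas]
    )
    j_cv = cv_select(X, y, lambdas)
    j_oracle = int(np.argmin(losses))
    return {
        "lambdas": lambdas,
        "losses": losses,
        "cv_index": j_cv,
        "cv_loss": float(losses[j_cv]),
        "oracle_index": j_oracle,
        "oracle_loss": float(losses[j_oracle]),
    }


if __name__ == "__main__":
    res = simulate()
    print("loss of CV choice:", res["cv_loss"])
    print("loss of oracle   :", res["oracle_loss"])
```

The ratio of the two printed losses gives a rough sense of what an oracle-type inequality controls: how far the loss of a data-driven choice is from the best loss attainable within the collection.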

We present a new estimator selection procedure for estimating the mean f of a vector Y of n independent Gaussian variables with unknown variance. We propose to choose an estimator of f, with the aim of minimizing the ℓ² risk, from an arbitrary and possibly infinite collection 𝔽 of estimators. Both the selection procedure and the collection 𝔽 depend only on the observations Y. We establish a non-asymptotic risk bound that requires no assumption on the estimators in 𝔽 and no knowledge of how they depend on Y. We derive oracle-type inequalities when 𝔽 is a collection of linear estimators. We consider several particular cases: estimation by aggregation, estimation by model selection, the choice of a window and of the smoothing parameter in functional regression, and the choice of the regularization parameter in a penalized criterion. In all these cases except aggregation, the method is very easy to implement. For illustration, we present simulation results with two objectives: comparing our method with cross-validation, and showing how to implement it for variable selection.

DOI : https://doi.org/10.1214/13-AIHP539
Classification: 62J05, 62J07, 62G08
Keywords: estimator selection, model selection, variable selection, linear estimator, kernel estimator, ridge regression, Lasso, elastic net, random forest, PLS1 regression
@article{AIHPB_2014__50_3_1092_0,
     author = {Baraud, Yannick and Giraud, Christophe and Huet, Sylvie},
     title = {Estimator selection in the gaussian setting},
     journal = {Annales de l'I.H.P. Probabilit\'es et statistiques},
     publisher = {Gauthier-Villars},
     volume = {50},
     number = {3},
     year = {2014},
     pages = {1092-1119},
     doi = {10.1214/13-AIHP539},
     zbl = {1298.62113},
     mrnumber = {3224300},
     language = {en},
     url = {http://www.numdam.org/item/AIHPB_2014__50_3_1092_0}
}
Baraud, Yannick; Giraud, Christophe; Huet, Sylvie. Estimator selection in the gaussian setting. Annales de l'I.H.P. Probabilités et statistiques, Volume 50 (2014) no. 3, pp. 1092-1119. doi : 10.1214/13-AIHP539. http://www.numdam.org/item/AIHPB_2014__50_3_1092_0/
