Iterative feature selection in least square regression estimation
Annales de l'I.H.P. Probabilités et statistiques, Volume 44 (2008) no. 1, p. 47-88

This paper presents a new algorithm for regression estimation, in both the inductive and the transductive setting. The estimator is defined as a linear combination of functions in a given dictionary. The coefficients of this combination are computed sequentially, by projections onto simple sets. These sets are defined as confidence regions provided by a deviation (PAC) inequality on an estimator in one-dimensional models. We prove that every projection performed by the algorithm actually improves the performance of the estimator. We first give all the estimators and results in the inductive case, where the algorithm requires knowledge of the distribution of the design, and then in the transductive case, which seems a more natural setting for this algorithm since no particular information on the distribution of the design is needed there. We finally show a connection with oracle inequalities, which enables us to prove that the estimator reaches minimax rates of convergence over Sobolev and Besov spaces.


DOI : https://doi.org/10.1214/07-AIHP106
Classification: 62G08, 62G15, 68T05
Keywords: regression estimation, statistical learning, confidence regions, thresholding methods, support vector machines
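The iterative scheme described in the abstract (update each coefficient of the linear combination only when a data-driven confidence region rules out the current value) can be sketched as thresholded coordinate updates. The function below is a hypothetical illustration: the fixed band width `threshold`, the cyclic sweep over features, and the plain least-squares corrections are simplifying assumptions, not the paper's PAC-inequality-based confidence regions.

```python
import numpy as np

def iterative_feature_selection(X, y, n_passes=10, threshold=0.05):
    """Sketch of projection-style coefficient updates.

    X: (n, d) matrix whose columns are the dictionary functions evaluated
    at the design points; y: (n,) vector of observed responses.
    """
    n, d = X.shape
    alpha = np.zeros(d)                 # coefficients of the linear combination
    residual = y.astype(float).copy()   # current residual y - X @ alpha
    for _ in range(n_passes):
        for j in range(d):
            col = X[:, j]
            # least-squares correction for the one-dimensional sub-model
            # spanned by the j-th dictionary function
            step = (residual @ col) / (col @ col)
            # keep the current coefficient when the correction lies inside
            # the (illustrative) confidence band around zero; otherwise
            # project, i.e. apply the correction
            if abs(step) > threshold:
                alpha[j] += step
                residual -= step * col
    return alpha
```

On data generated from a sparse combination of the columns, the large coefficients are recovered while small spurious corrections are mostly suppressed by the band, mimicking the feature-selection effect described in the abstract.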
@article{AIHPB_2008__44_1_47_0,
     author = {Alquier, Pierre},
     title = {Iterative feature selection in least square regression estimation},
     journal = {Annales de l'I.H.P. Probabilit\'es et statistiques},
     publisher = {Gauthier-Villars},
     volume = {44},
     number = {1},
     year = {2008},
     pages = {47-88},
     doi = {10.1214/07-AIHP106},
     zbl = {pre05610824},
     mrnumber = {2451571},
     language = {en},
     url = {http://www.numdam.org/item/AIHPB_2008__44_1_47_0}
}
Alquier, Pierre. Iterative feature selection in least square regression estimation. Annales de l'I.H.P. Probabilités et statistiques, Volume 44 (2008) no. 1, pp. 47-88. doi: 10.1214/07-AIHP106. http://www.numdam.org/item/AIHPB_2008__44_1_47_0/

P. Alquier. Transductive and inductive adaptive inference for regression and density estimation. PhD thesis, University Paris 6, 2006.

J.-Y. Audibert. Aggregated estimators and empirical complexity for least square regression. Ann. Inst. H. Poincaré. Probab. Statist. 40 (2004) 685-736. | Numdam | MR 2096215 | Zbl 1052.62037

A. Barron, A. Cohen, W. Dahmen and R. Devore. Adaptive approximation and learning by greedy algorithms. Preprint, 2006. | MR 2387964

G. Blanchard, P. Massart, R. Vert and L. Zwald. Kernel projection machine: a new tool for pattern recognition. In Advances in Neural Information Processing Systems (NIPS 2004), pp. 1649-1656. MIT Press, 2005.

B. E. Boser, I. M. Guyon and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144-152, ACM, 1992.

O. Catoni. A PAC-Bayesian approach to adaptive classification. Preprint, Laboratoire de Probabilités et Modèles Aléatoires, 2003.

O. Catoni. Statistical learning theory and stochastic optimization. Saint-Flour Summer School on Probability Theory. Lecture Notes in Math. 1851. Springer, Berlin, 2004. | MR 2163920 | Zbl 1076.93002

O. Catoni. Improved Vapnik-Cervonenkis bounds. Preprint Laboratoire de Probabilités et Modèles Aléatoires, 2005.

N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel Based Learning Methods. Cambridge University Press, 2000. | Zbl 0994.68074

D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika 81 (1994) 425-455. | MR 1311089 | Zbl 0815.62019

D. L. Donoho, I. M. Johnstone, G. Kerkyacharian and D. Picard. Density estimation by wavelet thresholding. Ann. Statist. 24 (1996) 508-539. | MR 1394974 | Zbl 0860.62032

W. Härdle, G. Kerkyacharian, D. Picard and A. B. Tsybakov. Wavelets, Approximations and Statistical Applications 129. Springer, New York, 1998. | MR 1618204 | Zbl 0899.62002

A. Juditsky, A. Nazin, A. Tsybakov and N. Vayatis. Recursive aggregation of estimators via the mirror descent algorithm with averaging. Probl. Inf. Transm. 41 (2005) 368-384. | MR 2198228 | Zbl 1123.62044

A. Juditsky, P. Rigollet and A. Tsybakov. Mirror averaging, aggregation and model selection. In Meeting on Statistical and Probabilistic Methods of Model Selection, pp. 2688-2691. Oberwolfach reports, 2005.

G. Kerkyacharian and D. Picard. Regression in random design and warped wavelets. Bernoulli 10 (2004) 1053-1105. | MR 2108043 | Zbl 1067.62039

A. Nemirovski. Topics in non-parametric statistics. In Saint-Flour Summer School on Probability Theory, pp. 85-277. Lecture Notes in Math. 1738. Springer, Berlin, 2000. | MR 1775640 | Zbl 0998.62033

D. Panchenko. Symmetrization approach to concentration inequalities for empirical processes. Ann. Probab. 31 (2003) 2068-2081. | MR 2016612 | Zbl 1042.60008

R. Schapire, Y. Freund, P. Bartlett and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Ann. Statist. 26 (1998) 1651-1686. | MR 1673273 | Zbl 0929.62069

B. Schölkopf, A. J. Smola and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10 (1998) 1299-1319.

M. Seeger. PAC-Bayesian generalization error bounds for Gaussian process classification. J. Mach. Learn. Res. 3 (2002) 233-269. | MR 1971338 | Zbl 1088.68745

A. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 (2004) 135-156. | MR 2051002 | Zbl 1105.62353

V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1998. | MR 1719582 | Zbl 0934.62009

B. Widrow and M. Hoff. Adaptive switching circuits. In IRE WESCON Convention Record, Part 4, Computers: Man-Machine Systems, pp. 96-104, 1960.

Y. Yang. Aggregating regression procedures to improve performances. Bernoulli 10 (2004) 25-47. | MR 2044592 | Zbl 1040.62030