This paper addresses variable selection in regression and binary classification. It proposes an automatic, exhaustive procedure that relies on the CART algorithm and on model selection via penalization. The work is theoretical in nature: it aims to determine adequate penalties, i.e. penalties that yield oracle-type inequalities justifying the performance of the proposed procedure. Since the exhaustive procedure cannot be carried out when the number of variables is too large, a more practical procedure is also proposed and likewise theoretically validated. A simulation study complements the theoretical results.
Keywords: binary classification, CART, model selection, penalization, regression, variable selection
@article{PS_2014__18__770_0,
  author = {Sauve, Marie and Tuleau-Malot, Christine},
  title = {Variable selection through {CART}},
  journal = {ESAIM: Probability and Statistics},
  pages = {770--798},
  publisher = {EDP-Sciences},
  volume = {18},
  year = {2014},
  doi = {10.1051/ps/2014006},
  language = {en},
  url = {http://archive.numdam.org/articles/10.1051/ps/2014006/}
}
TY - JOUR
AU - Sauve, Marie
AU - Tuleau-Malot, Christine
TI - Variable selection through CART
JO - ESAIM: Probability and Statistics
PY - 2014
SP - 770
EP - 798
VL - 18
PB - EDP-Sciences
UR - http://archive.numdam.org/articles/10.1051/ps/2014006/
DO - 10.1051/ps/2014006
LA - en
ID - PS_2014__18__770_0
ER -
Sauve, Marie; Tuleau-Malot, Christine. Variable selection through CART. ESAIM: Probability and Statistics, Volume 18 (2014), pp. 770-798. doi: 10.1051/ps/2014006. http://archive.numdam.org/articles/10.1051/ps/2014006/