Sélection de variables pour la classification binaire en grande dimension : comparaisons et application aux données de biopuces

Ghattas, Badih; Ben Ishak, Anis

Journal de la Société française de statistique & Revue de statistique appliquée, Tome 149 (2008) no. 3, pp. 43-66.

Abstract

In this paper we compare three recent variable selection methods for binary classification. We focus on the setting where the number of variables is very large and far exceeds the number of observations, as is the case for microarray data. The three approaches compared are based on Support Vector Machines, $L_{1}$-constrained Generalized Linear Models, and Random Forests.

Keywords: bootstrap, cross-validation, feature selection, forward selection, generalized linear models, GLMpath, microarray data, random forests, ranking rules, sequential methods, support vector machines, SVM-based criteria, variable hierarchies

@article{JSFS_2008__149_3_43_0,
     author = {Ghattas, Badih and Ben Ishak, Anis},
     title = {S\'election de variables pour la classification binaire en grande dimension : comparaisons et application aux donn\'ees de biopuces},
     journal = {Journal de la Soci\'et\'e fran\c{c}aise de statistique \& Revue de statistique appliqu\'ee},
     pages = {43--66},
     publisher = {Soci\'et\'e fran\c{c}aise de statistique},
     volume = {149},
     number = {3},
     year = {2008},
     language = {fr},
     url = {http://archive.numdam.org/item/JSFS_2008__149_3_43_0/}
}

TY  - JOUR
AU  - Ghattas, Badih
AU  - Ben Ishak, Anis
TI  - Sélection de variables pour la classification binaire en grande dimension : comparaisons et application aux données de biopuces
JO  - Journal de la Société française de statistique & Revue de statistique appliquée
PY  - 2008
SP  - 43
EP  - 66
VL  - 149
IS  - 3
PB  - Société française de statistique
UR  - http://archive.numdam.org/item/JSFS_2008__149_3_43_0/
LA  - fr
ID  - JSFS_2008__149_3_43_0
ER  -

%0 Journal Article
%A Ghattas, Badih
%A Ben Ishak, Anis
%T Sélection de variables pour la classification binaire en grande dimension : comparaisons et application aux données de biopuces
%J Journal de la Société française de statistique & Revue de statistique appliquée
%D 2008
%P 43-66
%V 149
%N 3
%I Société française de statistique
%U http://archive.numdam.org/item/JSFS_2008__149_3_43_0/
%G fr
%F JSFS_2008__149_3_43_0

Ghattas, Badih; Ben Ishak, Anis. Sélection de variables pour la classification binaire en grande dimension : comparaisons et application aux données de biopuces. Journal de la Société française de statistique & Revue de statistique appliquée, Tome 149 (2008) no. 3, pp. 43-66. http://archive.numdam.org/item/JSFS_2008__149_3_43_0/

References

[1] Alizadeh A. A. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403 : 503-511.

[2] Alon U., N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon cancer tissues probed by oligonucleotide arrays. Cell Biology, 96(12) : 6745-6750.

[3] Ambroise C. and G. McLachlan (2002). Selection bias in gene extraction on the basis of microarray gene expression data. Proceedings of the National Academy of Sciences, USA, 99(10) :6562-6566. | Zbl

[4] Ben Ishak A. and B. Ghattas (2005). An efficient method for variable selection using SVM-based criteria. Preprint, Institut de Mathématiques de Luminy, Marseille, France.

[5] Ben Ishak A. (2007). Sélection de variables par les machines à vecteurs supports pour la discrimination binaire et multiclasse en grande dimension. Thesis defended at the Université de la Méditerranée on 6 September 2007. (http://lumimath.univ-mrs.fr/~ghattas/theseAnisBenIshak.pdf)

[6] Boser B., I. Guyon, and V. Vapnik (1992). A training algorithm for optimal margin classifiers. In Fifth Annual Workshop on Computational Learning Theory, pages 144-152, Pittsburgh. ACM.

[7] Breiman L., J. H. Friedman, R. A. Olshen, and C. J. Stone (1984). Classification and Regression Trees. Wadsworth and Brooks. | MR | Zbl

[8] Breiman L. (2001). Random forests. Machine Learning Journal, 45 :5-32. | Zbl

[9] Cristianini N. and J. Shawe-Taylor (2000). Introduction to Support Vector Machines. Cambridge University Press. | Zbl

[10] Díaz-Uriarte R. and S. Alvarez De Andrés (2006). Gene Selection and classification of microarray data using random forest. BMC Bioinformatics, 7 :3, pp 1-13.

[11] Dudoit S., J. Fridlyand, and T. Speed (2002). Comparison of discrimination methods for the classification of tumors using gene expression data, J. Amer. Stat. Assoc.. | MR | Zbl

[12] Efron B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. Annals of Statistics, 32(2) :407-499. | MR | Zbl

[13] Ghattas B. and G. Oppenheim (2001). Etude de faisabilité : Modèles globaux pour la mise au point moteur. Technical report, Renault, 56 pages.

[14] Golub T. R., D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander (1999). Molecular classification of cancer : Class discovery and class prediction by gene expression monitoring. Science, 286 : 531-537.

[15] Guyon I. and A. Elisseeff (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3 : 1157-1182. | Zbl

[16] Guyon I., J. Weston, S. Barnhill, and V. Vapnik (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3) : 389-422. | Zbl

[17] Kohavi R. and G. H. John (1997). Wrappers for Feature Subset Selection. Artificial Intelligence, 97(1-2) : 273-324. | Zbl

[18] Liaw A. and M. Wiener (2002). Classification and Regression by randomForest. R News, 2 :18-22.

[19] Luntz A. and V. Brailovsky (1969). On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica, 3.

[20] McCullagh P. and J. Nelder (1989). Generalized Linear Models. Chapman & Hall/CRC, Boca Raton. | MR | Zbl

[21] Park M. Y. and T. Hastie (2006). $L_{1}$ Regularization Path Algorithm for Generalized Linear Models. Technical report, Stanford University.

[22] Poggi J. M. and C. Tuleau (2006). Classification supervisée en grande dimension. Application à l'agrément de conduite automobile. Revue de Statistique Appliquée, LIV (4), 39-58.

[23] Rakotomamonjy A. (2003). Variable selection using SVM-based criteria. Journal of Machine Learning Research, 3 : 1357-1370. | MR | Zbl

[24] Reunanen J. (2003). Overfitting in Making Comparisons Between Variable Selection Methods. Journal of Machine Learning Research, 3 :1371-1382. | Zbl

[25] Singh D., P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, P. Tamayo, A. A. Renshaw, A. V. D'Amico, J. P. Richie, E. S. Lander, M. Loda, P. W. Kantoff, T. R. Golub, and W. R. Sellers (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2) : 203-209.

[26] Somol P., P. Pudil, J. Novovičová, and P. Paclik (1999). Adaptive floating search methods in feature selection. Pattern Recognition Letters, 20 :1157-1163.

[27] Svetnik V., A. Liaw, C. Tong, and T. Wang (2004). Application of Breiman's random forest to modeling structure-activity relationships of pharmaceutical molecules. Multiple Classifier Systems. Lecture Notes in Computer Science, Springer, 3077 :334-343.

[28] Vapnik V. (1995). The Nature of Statistical Learning Theory. Springer Verlag, New York. | MR | Zbl

[29] Vapnik V. (1998). Statistical Learning Theory. John Wiley and Sons, New York. | MR | Zbl

[30] Vapnik V. and O. Chapelle (2000). Bounds on error expectation for support vector machines. Neural Computation, 12 : 9.

[31] Weston J., A. Elisseeff, B. Schoelkopf, and M. Tipping (2003). Use of the zero norm with linear models and kernel methods. Journal of Machine Learning Research, 3 : 1439-1461. | MR | Zbl