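Sélection de variables pour la classification binaire en grande dimension : comparaisons et application aux données de biopuces [Variable selection for high-dimensional binary classification: comparisons and application to microarray data]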
Journal de la société française de statistique, Volume 149 (2008) no. 3, p. 43-66

In this paper we compare three methods for selecting important features in binary classification. We focus on the case where the sample size is smaller than the number of variables. The three approaches are based on Support Vector Machines, L1-constrained Generalized Linear Models, and Random Forests.

In this article we compare three recent variable selection methods for binary classification. We focus on the setting where the number of variables is very large and far exceeds the number of observations, as is the case for microarray data. The approaches compared are based on SVMs, L1-constrained GLMs, and Random Forests.
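As a rough illustration of how an L1 constraint can select variables when there are more variables than observations, the sketch below fits an L1-penalized logistic regression by proximal gradient descent on synthetic data. It is not the authors' procedure: the keywords point to GLMpath, a path-following algorithm, whereas this is a simpler stand-in. The data generator, the function names, and the penalty values are all assumptions made for the example.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 norm: shrink value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def logistic_gradient(X, y, w):
    """Gradient of the mean logistic loss at w (labels in {0, 1}, no intercept)."""
    n, p = len(X), len(w)
    grad = [0.0] * p
    for xi, yi in zip(X, y):
        prob = sigmoid(sum(wj * xij for wj, xij in zip(w, xi)))
        for j in range(p):
            grad[j] += (prob - yi) * xi[j] / n
    return grad


def lambda_max(X, y):
    """Smallest penalty at which the all-zero vector is already optimal."""
    grad0 = logistic_gradient(X, y, [0.0] * len(X[0]))
    return max(abs(g) for g in grad0)


def fit_l1_logistic(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for the mean logistic loss plus lam * ||w||_1."""
    n, p = len(X), len(X[0])
    # ||X||_F^2 / (4 n) bounds the Lipschitz constant of the gradient,
    # so its inverse is a safe constant step size.
    frob_sq = sum(v * v for row in X for v in row)
    step = 4.0 * n / frob_sq
    w = [0.0] * p
    for _ in range(n_iter):
        grad = logistic_gradient(X, y, w)
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad)]
    return w


def selected_variables(w):
    """Indices with a nonzero coefficient (soft-thresholding yields exact zeros)."""
    return [j for j, wj in enumerate(w) if wj != 0.0]


def make_data(n=20, p=50, n_informative=3, seed=0):
    """Hypothetical toy data: only the first n_informative columns depend on the class."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        shift = 1.5 if label == 1 else -1.5
        X.append([rng.gauss(0.0, 1.0) + (shift if j < n_informative else 0.0)
                  for j in range(p)])
        y.append(label)
    return X, y


X, y = make_data()  # 20 observations, 50 variables
lam_top = lambda_max(X, y)
none_selected = selected_variables(fit_l1_logistic(X, y, 1.01 * lam_top))
some_selected = selected_variables(fit_l1_logistic(X, y, 0.1 * lam_top))
print("variables kept under the strong penalty:", none_selected)
print("variables kept under the weaker penalty:", some_selected)
```

Above `lambda_max` the penalty removes every variable, and lowering it lets variables enter the model. On toy data like this one would expect the class-dependent columns to enter first, though the sketch does not guarantee it. In practice the penalty would be chosen by cross-validation or a similar resampling scheme.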

Keywords: bootstrap, cross validation, feature selection, forward selection, GLMpath, microarray data, random forests, ranking rules, support vector machines, SVM-based criteria
@article{JSFS_2008__149_3_43_0,
     author = {Ghattas, Badih and Ben Ishak, Anis},
     title = {S\'election de variables pour la classification binaire en grande dimension : comparaisons et application aux donn\'ees de biopuces},
     journal = {Journal de la soci\'et\'e fran\c caise de statistique},
     publisher = {Soci\'et\'e fran\c caise de statistique},
     volume = {149},
     number = {3},
     year = {2008},
     pages = {43-66},
     language = {fr},
     url = {http://www.numdam.org/item/JSFS_2008__149_3_43_0}
}
Ghattas, Badih; Ben Ishak, Anis. Sélection de variables pour la classification binaire en grande dimension : comparaisons et application aux données de biopuces. Journal de la société française de statistique, Volume 149 (2008) no. 3, pp. 43-66. http://www.numdam.org/item/JSFS_2008__149_3_43_0/
