Revue des méthodes pour la classification jointe des lignes et des colonnes d’un tableau (Review of methods for the joint clustering of the rows and columns of a table)

Brault, Vincent; Lomet, Aurore

Literature Review

Journal de la société française de statistique, Volume 156 (2015) no. 3, pp. 27-51.

Abstract

Co-clustering aims to identify an underlying structure that exists between the rows and columns of a data table. This literature review presents the different perspectives adopted over the past fifty years to define this structure and, for each one, offers a non-exhaustive range of associated algorithms and applications. Finally, open questions are addressed, and the discussion section proposes a methodology for analyzing real data.

Co-clustering aims to identify block patterns in a data table through a joint clustering of its rows and columns. This problem has been studied since 1965 and has recently attracted interest in fields ranging from graph analysis and machine learning to data mining and genomics. Several variants have been proposed under various names: bi-clustering, block clustering, cross-clustering, or simultaneous clustering. We review these methods in order to describe, compare, and discuss the different ways to perform co-clustering according to the user's aim.

Keywords (translated from the French list): co-clustering (classification croisée), block clustering (classification croisée par blocs), nested clustering (classification imbriquée), overlapping clustering (biclustering), selection criterion
Keywords: Cross classification, co-clustering, block clustering, biclustering, selection criterion

@article{JSFS_2015__156_3_27_0,
     author = {Brault, Vincent and Lomet, Aurore},
     title = {Revue des m\'ethodes pour la classification jointe des lignes et des colonnes d{\textquoteright}un tableau},
     journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
     pages = {27--51},
     publisher = {Soci\'et\'e fran\c{c}aise de statistique},
     volume = {156},
     number = {3},
     year = {2015},
     zbl = {1335.62092},
     language = {fr},
     url = {http://archive.numdam.org/item/JSFS_2015__156_3_27_0/}
}

Brault, Vincent; Lomet, Aurore. Revue des méthodes pour la classification jointe des lignes et des colonnes d’un tableau. Journal de la société française de statistique, Volume 156 (2015) no. 3, pp. 27-51. http://archive.numdam.org/item/JSFS_2015__156_3_27_0/

References

[1] Andrews, R. L.; Currim, I. S. A comparison of segment retention criteria for finite mixture logit models, Journal of Marketing Research (2003), pp. 235-243

[2] Burnham, Kenneth P; Anderson, David R Multimodel inference: understanding AIC and BIC in model selection, Sociological Methods & Research, Volume 33 (2004) no. 2, pp. 261-304

[3] Biernacki, Christophe; Celeux, Gilles; Govaert, Gérard Assessing a mixture model for clustering with the integrated completed likelihood, Pattern Analysis and Machine Intelligence, IEEE Transactions on, Volume 22 (2000) no. 7, pp. 719-725

[4] Biernacki, C.; Celeux, G.; Govaert, G. Exact and Monte Carlo calculations of integrated likelihoods for the latent class model, Journal of Statistical Planning and Inference, Volume 140 (2010) no. 11, pp. 2991-3002 | Zbl

[5] Ben-Dor, Amir; Chor, Benny; Karp, Richard; Yakhini, Zohar Discovering local structure in gene expression data: the order-preserving submatrix problem, Journal of computational biology, Volume 10 (2003) no. 3-4, pp. 373-384

[6] Banerjee, A.; Dhillon, I.; Ghosh, J.; Merugu, S. A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation, Journal of Machine Learning Research, Volume 8 (2007), pp. 1919-1986 | Zbl

[7] Berkhin, P. A survey of clustering data mining techniques, Springer, 2006, pp. 25-71

[8] Baker, Frank B; Hubert, Lawrence J Measuring the power of hierarchical cluster analysis, Journal of the American Statistical Association, Volume 70 (1975) no. 349, pp. 31-38 | Zbl

[9] Bennett, James; Lanning, Stan The Netflix Prize, Proceedings of KDD Cup and Workshop, Volume 2007 (2007), 35 pages

[10] Bock, H Simultaneous clustering of objects and variables, Analyse des données et Informatique (1979), pp. 187-203 | Zbl

[11] Banfield, J. D.; Raftery, A. E. Model-based Gaussian and non-Gaussian clustering, Biometrics (1993), pp. 803-821 | Zbl

[12] Cheng, Y.; Church, G. M. Biclustering of expression data, Proceedings of the International Conference on Intelligent Systems for Molecular Biology (ISMB) (2000), 93 pages

[13] Corsten, LCA; Denis, JB Structuring interaction in two-way tables by clustering, Biometrics (1990), pp. 207-215 | Zbl

[14] Camiz, S.; Denimal, JJ A new method for cross-classification analysis of contingency data tables, Compstat 98-Proceedings in Computational Statistics, Physica-Verlag, Heidelberg (1998), pp. 209-214 | Zbl

[15] Celeux, G.; Govaert, G. A classification EM algorithm for clustering and two stochastic versions, Computational Statistics and Data Analysis, Volume 14 (1992) no. 3, pp. 315-332 | Zbl

[16] Caliński, Tadeusz; Harabasz, Jerzy A dendrite method for cluster analysis, Communications in Statistics-theory and Methods, Volume 3 (1974) no. 1, pp. 1-27 | Zbl

[17] Charrad, M.; Lechevallier, Y.; Saporta, G.; Ben Ahmed, M. Détermination du nombre de classes dans les méthodes de bipartitionnement, 17ème Rencontres de la Société Francophone de Classification, Saint-Denis de la Réunion (2010), pp. 119-122

[18] Celeux, Gilles; Robert, Claudine Une histoire de discrétisation, La Revue de Modulad, Volume 11 (1993), pp. 7-44

[19] Davies, David L; Bouldin, Donald W A cluster separation measure, Pattern Analysis and Machine Intelligence, IEEE Transactions on (1979) no. 2, pp. 224-227

[20] Deodhar, M.; Ghosh, J. Simultaneous co-clustering and modeling of market data, Proceedings of the Workshop for Data Mining in Marketing (DMM 2007)(Leipzig, Germany). IEEE Computer Society Press, Los Alamitos, CA (2007)

[21] Dhillon, I. S. Co-clustering documents and words using bipartite spectral graph partitioning, KDD’01: Proceedings of the seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM (2001), pp. 269-274

[22] Dempster, Arthur P; Laird, Nan M; Rubin, Donald B Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Volume 39 (1977) no. 1, pp. 1-38 | Zbl

[23] Dhillon, I. S.; Mallela, S.; Modha, D. S. Information-theoretic co-clustering, Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM (2003), pp. 89-98

[24] Duffy, D. E.; Quiroz, J. A permutation-based algorithm for block clustering, Journal of Classification, Volume 8 (1991) no. 1, pp. 65-91

[25] Dunn, Joseph C Well-separated clusters and optimal fuzzy partitions, Journal of cybernetics, Volume 4 (1974) no. 1, pp. 95-104 | Zbl

[26] Fraley, C.; Raftery, A. E. How many clusters? Which clustering method? Answers via model-based cluster analysis, The Computer Journal, Volume 41 (1998) no. 8, pp. 578-588 | Zbl

[27] Gan, Guojun; Ma, Chaoqun; Wu, Jianhong Data clustering: theory, algorithms, and applications, 20, SIAM, 2007 | Zbl

[28] Govaert, G.; Nadif, M. Clustering with block mixture models, Pattern Recognition, Volume 36 (2003), pp. 463-473

[29] Govaert, G.; Nadif, M. Clustering of contingency table and mixture model, European Journal of Operational Research, Volume 183 (2007), pp. 1055-1066 | Zbl

[30] Govaert, G.; Nadif, M. Block clustering with Bernoulli mixture models: Comparison of different approaches, Computational Statistics and Data Analysis, Volume 52 (2008), pp. 3233-3245 | Zbl

[31] Govaert, G.; Nadif, M. Un modèle de mélange pour la classification croisée d’un tableau de données continues, CAP’09, 11e conférence sur l’apprentissage artificiel (2009)

[32] Good, I. J. Categorization of classification, Mathematics and Computer Science in Biology and Medicine, Her Majesty’s Stationery Office, 1965

[33] Govaert, G. Algorithme de classification d’un tableau de contingence, First international symposium on data analysis and informatics, INRIA, Versailles (1977), pp. 487-500

[34] Govaert, G. Classification croisée, Thèse d’état, Université Pierre et Marie Curie (1983) (Ph. D. Thesis)

[35] Govaert, G. Classification croisée, Modulad, Volume 4 (1989), pp. 9-36

[36] Govaert, G. Simultaneous Clustering of Rows and Columns, Control and Cybernetics, Volume 24 (1995) no. 4, pp. 437-458 | Zbl

[37] Hansohm, J Two-mode clustering with genetic algorithms, Classification, automation, and new media, Springer, 2002, pp. 87-93

[38] Hartigan, J. A. Bloc voting in the United States senate, Journal of Classification, Volume 17 (2000) no. 1, pp. 29-49 | Zbl

[39] Hartigan, John A. Clustering Algorithms, John Wiley & Sons, Inc., New York, NY, USA, 1975 | Zbl

[40] Hedenfalk, I.; Duggan, D.; Chen, Y.; Radmacher, M.; Bitter, M.; Simon, R.; Meltzer, P.; Gusterson, B.; Esteller, M.; Raffeld, M. Gene-expression profiles in hereditary breast cancer, New Eng. J. Med., Volume 344 (2001), pp. 539-548

[41] Hubert, Lawrence J; Levin, Joel R A general statistical framework for assessing categorical clustering in free recall, Psychological Bulletin, Volume 83 (1976) no. 6, pp. 1072-1080

[42] Hanczar, Blaise; Nadif, Mohamed Bagging for Biclustering: Application to Microarray Data, Machine Learning and Knowledge Discovery in Databases (Balcázar, José Luis; Bonchi, Francesco; Gionis, Aristides; Sebag, Michèle, eds.) (Lecture Notes in Computer Science), Volume 6321, Springer Berlin Heidelberg, 2010, pp. 490-505

[43] Hofmann, Thomas Probabilistic latent semantic indexing, Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, ACM (1999), pp. 50-57

[44] Ihmels, Jan; Bergmann, Sven; Barkai, Naama Defining transcription modules using large-scale gene expression data, Bioinformatics, Volume 20 (2004) no. 13, pp. 1993-2003

[45] Jain, Anil K; Murty, M Narasimha; Flynn, Patrick J Data clustering: a review, ACM Computing Surveys (CSUR), Volume 31 (1999) no. 3, pp. 264-323

[46] Jagalur, M.; Pal, C.; Learned-Miller, E.; Zoeller, R. T.; Kulp, D. Analyzing in situ gene expression in the mouse brain with image registration, feature extraction and block clustering, BMC Bioinformatics, Volume 8 (2007) no. Suppl 10 | DOI

[47] Kluger, Y.; Basri, R.; Chang, J. T.; Gerstein, M. Spectral biclustering of microarray data: coclustering genes and conditions, Genome Research, Volume 13 (2003) no. 4, pp. 703-716

[48] Keribin, C.; Brault, V.; Celeux, G.; Govaert, G. Model selection for the binary latent block model, Compstat (2012), pp. 379-390

[49] Keribin, Christine; Brault, Vincent; Celeux, Gilles; Govaert, Gérard Estimation and Selection for the Latent Block Model on Categorical Data (2013) no. RR-8264, 30 pages (Rapport de recherche)

[50] Keribin, C.; Govaert, G.; Celeux, G. Estimation d’un modèle à blocs latents par l’algorithme SEM, 42e Journées de Statistique, SFdS, Marseille (2010)

[51] Krzanowski, Wojtek J; Lai, YT A criterion for determining the number of groups in a data set using sum-of-squares clustering, Biometrics (1988), pp. 23-34 | Zbl

[52] Kemp, C.; Tenenbaum, J. B.; Griffiths, T. L.; Yamada, T.; Ueda, N. Learning systems of concepts with an infinite relational model, Proceedings of The Twenty-First National Conference on Artificial Intelligence, AAAI Press (2006), pp. 381-388

[53] Lashkari, D.; Golland, P. Co-clustering with generative models (2009) (Technical report)

[54] Lomet, A.; Govaert, G.; Grandvalet, Y. An Approximation of the Integrated Classification Likelihood for the Latent Block Model, ICDM 2012 IEEE International Conference on Data Mining (2012)

[55] Lomet, A.; Govaert, G.; Grandvalet, Y. Model selection in block clustering by the integrated classification likelihood, Proceedings of Compstat 2012 (2012), pp. 519-530

[56] Lomet, A.; Govaert, G.; Grandvalet, Y. Un protocole de simulation de données pour la classification croisée, 44e Journées de Statistique de la SFdS (2012)

[57] Lerman, IC; Leredde, H. La méthode des pôles d’attraction, Journées Analyse des Données et Informatique (1977)

[58] Lazzeroni, Laura; Owen, Art Plaid Models for Gene Expression Data, Statistica Sinica, Volume 12 (2002), pp. 61-86 | Zbl

[59] Leredde, H.; Perin, P. Les plaques-boucles mérovingiennes, Dossiers de l’Archéologie, Volume 42 (1980), pp. 83-87

[60] Long, B.; Zhang, Z. M.; Yu, P. S. Co-clustering by block value decomposition, Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, ACM (2005), pp. 635-640

[61] Milligan, G. W.; Cooper, M. C. An examination of procedures for determining the number of clusters in a data set, Psychometrika, Volume 50 (1985) no. 2, pp. 159-179

[62] McLachlan, G. J. The classification and mixture maximum likelihood approaches to cluster analysis, Handbook of statistics, Volume 2 (1982), pp. 199-208 | Zbl

[63] Murali, TM; Kasif, Simon Extracting conserved gene expression motifs from gene expression data, Pacific Symposium on Biocomputing, Volume 8 (2003), pp. 77-88 | Zbl

[64] Mouhoubi, Karima; Létocart, Lucas; Rouveirol, Céline Extraction de biclusters contraints dans des contextes bruités, Conférence Francophone sur l’Apprentissage Automatique - CAp 2012, Nancy, France, Laurent Bougrain (2012), 16 pages

[65] Maugis, Cathy; Martin-Magniette, Marie-Laure; Tamby, Jean-Philippe; Renou, Jean-Pierre; Lecharny, Alain; Aubourg, Sébastien; Celeux, Gilles Sélection de variables pour la classification par mélanges gaussiens pour prédire la fonction des gènes orphelins, La Revue de Modulad, Volume 40 (2009), pp. 69-80

[66] Madeira, Sara C; Oliveira, Arlindo L Biclustering algorithms for biological data analysis: a survey, Computational Biology and Bioinformatics, IEEE/ACM Transactions on, Volume 1 (2004) no. 1, pp. 24-45

[67] Meeds, E.; Roweis, S. Nonparametric Bayesian biclustering (2007) (Technical report)

[68] Matias, Catherine; Robin, Stéphane Modeling heterogeneity in random graphs: a selective review, arXiv:1402.4296 (2014) | Zbl

[69] Mariadassou, M.; Robin, S.; Vacher, C. Uncovering latent structure in valued graphs: a variational approach, The Annals of Applied Statistics, Volume 4 (2010) no. 2, pp. 715-742 | Zbl

[70] Nowicki, K.; Snijders, T. A. B. Estimation and prediction for stochastic blockstructures, Journal of the American Statistical Association, Volume 96 (2001) no. 455, pp. 1077-1087 | Zbl

[71] Oyanagi, Shigeru; Kubota, Kazuto; Nakase, Akihiko Application of matrix clustering to web log analysis and access prediction, WEBKDD 2001-Mining Web Log Data Across All Customers Touch Points, Third International Workshop (2001), pp. 13-21

[72] Prelić, A.; Bleuler, S.; Zimmermann, P.; Wille, A.; Bühlmann, P.; Gruissem, W.; Hennig, L.; Thiele, L.; Zitzler, E. A systematic comparison and evaluation of biclustering methods for gene expression data, Bioinformatics, Volume 22 (2006) no. 9, pp. 1122-1129 | DOI

[73] Podani, J.; Feoli, E. A general strategy for the simultaneous classification of variables and objects in ecological data tables, Journal of Vegetation Science, Volume 2 (1991) no. 4, pp. 435-444

[74] Robert, Christian Le choix bayésien : Principes et pratique, Springer Science & Business, 2006

[75] Rooth, M. Two-dimensional clusters in grammatical relations, AAAI Symposium on Representation and Acquisition of Lexical Knowledge (1995)

[76] Rousseeuw, Peter J Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics, Volume 20 (1987), pp. 53-65 | Zbl

[77] Roy, Daniel M; Teh, Yee Whye The Mondrian Process, NIPS (2008), pp. 1377-1384

[78] Rocci, R.; Vichi, M. Two-mode multi-partitioning, Computational Statistics and Data Analysis, Volume 52 (2008) no. 4, pp. 1984-2003 | Zbl

[79] Shan, H.; Banerjee, A. Bayesian co-clustering, Eighth IEEE International Conference on Data Mining, 2008. ICDM’08 (2008), pp. 530-539

[80] Schepers, J.; Ceulemans, E.; Van Mechelen, I. Selecting among multi-mode partitioning models of different complexities: A comparison of four model selection criteria, Journal of Classification, Volume 25 (2008) no. 1, pp. 67-85 | Zbl

[81] Lee, D. D.; Seung, H. S. Algorithms for non-negative matrix factorization, Advances in Neural Information Processing Systems 13 (2001), pp. 556-562

[82] Shafiei, M.; Milios, E. Model-based overlapping co-clustering, Proceedings of SIAM Conference on Data Mining (2006)

[83] Seldin, Y.; Tishby, N. PAC-Bayesian analysis of co-clustering and beyond, The Journal of Machine Learning Research, Volume 11 (2010), pp. 3595-3646 | Zbl

[84] Tishby, N.; Pereira, F.; Bialek, W. The information bottleneck method, Invited paper to The 37th annual Allerton Conference on Communication, Control, and Computing (1999)

[85] Tanay, Amos; Sharan, Roded; Shamir, Ron Discovering Statistically Significant Biclusters in Gene Expression Data, Proceedings of ISMB 2002 (2002), pp. 136-144

[86] Van Dijk, B.; Van Rosmalen, J.; Paap, R. A Bayesian approach to two-mode clustering (2009) no. 2009-06 (Technical report)

[87] Wyse, J.; Friel, N. Block clustering with collapsed latent block models, Statistics and Computing (2010), pp. 1-14 | Zbl

[88] Wang, Pu; Laskey, Kathryn B; Domeniconi, Carlotta; Jordan, Michael Nonparametric Bayesian Co-clustering Ensembles, SIAM (2011)

[89] Yoo, J.; Choi, S. Orthogonal nonnegative matrix tri-factorization for co-clustering: Multiplicative updates on Stiefel manifolds, Information Processing & Management, Volume 46 (2010) no. 5, pp. 559-570

[90] Yang, Jiong; Wang, Wei; Wang, Haixun; Yu, Philip $\delta$-clusters: Capturing subspace correlation in a large data set, Data Engineering, 2002. Proceedings. 18th International Conference on, IEEE (2002), pp. 517-528