Nous présentons ici les méthodes de co-clustering, avec une emphase sur les modèles à blocs latents (LBM) et les parallèles qui existent entre le LBM et le Modèle à Blocs Stochastiques (SBM), notamment pour l’analyse de graphes bipartites. Nous introduisons différentes variantes du LBM (standard, sparse, bayésien) et présentons des résultats d’identifiabilité. Nous montrons comment la structure de dépendance complexe induite par le LBM rend l’estimation des paramètres par maximum de vraisemblance impossible en pratique et passons en revue des méthodes d’inférence alternatives. Ces dernières sont basées sur des procédures itératives, combinées à des approximations faciles à maximiser de la vraisemblance, ce qui les rend malaisés à analyser théoriquement. Il existe néanmoins des résultats de consistence, partiels en ce qu’ils reposent sur une condition raisonnable mais encore non démontrée. De même, les outils de sélection de modèle actuellement disponibles pour choisir le nombre de cluster reposent sur une conjecture. Nous replacons brièvement LBM dans le contexte des méthodes de co-clustering qui ne s’appuient pas sur un modèle génératif, particulièrement celles basées sur la factorisation de matrices. Nous concluons avec une étude de cas qui illustre les avantages du co-clustering sur le clustering simple.
We present here model-based co-clustering methods, with a focus on the latent block model (LBM). We introduce several specifications of the LBM (standard, sparse, Bayesian) and review some identifiability results. We show how the complex dependency structure prevents standard maximum likelihood estimation and present alternative and popular inference methods. Those estimation methods are based on a tractable approximation of the likelihood and rely on iterative procedures, which makes them difficult to analyze. We nevertheless present some asymptotic results for consistency. The results are partial as they rely on a reasonable but still unproved condition. Likewise, available model selection tools for choosing the number of groups in rows and columns are only valid up to a conjecture. We also briefly discuss non model-based co-clustering procedures. Finally, we show how LBM can be used for bipartite graph analysis and highlight throughout this review its connection to the Stochastic Block Model.
Mot clés : Modèle à blocs latents, Modèle à variables latentes, Approximation variationnelle, Sélection de modèle, ICL, BIC, Graphes bipartites
@article{JSFS_2015__156_3_120_0, author = {Brault, Vincent and Mariadassou, Mahendra}, title = {Co-clustering through {Latent} {Bloc} {Model:} a {Review}}, journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique}, pages = {120--139}, publisher = {Soci\'et\'e fran\c{c}aise de statistique}, volume = {156}, number = {3}, year = {2015}, mrnumber = {3432606}, zbl = {1341.62172}, language = {en}, url = {http://archive.numdam.org/item/JSFS_2015__156_3_120_0/} }
TY - JOUR AU - Brault, Vincent AU - Mariadassou, Mahendra TI - Co-clustering through Latent Bloc Model: a Review JO - Journal de la société française de statistique PY - 2015 SP - 120 EP - 139 VL - 156 IS - 3 PB - Société française de statistique UR - http://archive.numdam.org/item/JSFS_2015__156_3_120_0/ LA - en ID - JSFS_2015__156_3_120_0 ER -
%0 Journal Article %A Brault, Vincent %A Mariadassou, Mahendra %T Co-clustering through Latent Bloc Model: a Review %J Journal de la société française de statistique %D 2015 %P 120-139 %V 156 %N 3 %I Société française de statistique %U http://archive.numdam.org/item/JSFS_2015__156_3_120_0/ %G en %F JSFS_2015__156_3_120_0
Brault, Vincent; Mariadassou, Mahendra. Co-clustering through Latent Bloc Model: a Review. Journal de la société française de statistique, Tome 156 (2015) no. 3, pp. 120-139. http://archive.numdam.org/item/JSFS_2015__156_3_120_0/
[1] Modele à blocs latents pour l’analyse de données métagénomiques, journées de Statistiques de la SFdS (2014)
[2] Identifiability of parameters in latent structure models with many observed variables, The Annals of Statistics, Volume 37 (2009), pp. 3099-3132 | MR | Zbl
[3] A nonparametric view of network models and Newman-Girvan and other modularities, PNAS, Volume 106 (2009) no. 50, pp. 21068-21073 | DOI | Zbl
[4] Modele génératif pour données ordinales, 44e Journées de Statistique, SFdS, Bruxelles, Belgique (2012)
[5] The netflix prize, Proceedings of KDD cup and workshop, Volume 2007 (2007), 35 pages
[6] Revue bibliographique pour la classification croisée (2014)
[7] Estimation et sélection de modèles pour le modèle des blocs latents, Université Paris-Sud (2014) (Ph. D. Thesis)
[8] Consistency of maximum-likelihood and variational estimators in the stochastic block model, Electronic Journal of Statistics, Volume 6 (2012), pp. 1847-1899 | DOI | MR | Zbl
[9] Classification and estimation in the Stochastic Blockmodel based on the empirical degrees, Electronic Journal of Statistics, Volume 6 (2012), pp. 2574-2601 | DOI | MR | Zbl
[10] Model selection and clustering in stochastic block models with the exact integrated complete data likelihood, ArXiv e-prints (2013) | arXiv | MR
[11] Stochastic blockmodels with a growing number of classes, Biometrika (2012) (in press) | DOI | MR | Zbl
[12] Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Statist. Soc. Ser. B, Volume 39 (1977) no. 1, pp. 1-38 (With discussion) | MR | Zbl
[13] Finite Mixture and Markov Switching Models, Springer, 2006 | MR | Zbl
[14] Convergence Theorems for Generalized Alternating Minimization Procedures, Journal of Machine Learning Research, Volume 6 (2005), pp. 2049-2073 | MR | Zbl
[15] Accuracy of variational estimates for random graph mixture models, Journal of Statistical Computation and Simulation, Volume 0 (2011) no. 0, pp. 1-14 | DOI | Zbl
[16] Non-Uniqueness in Probabilistic Numerical Identification of Bacteria, Journal of Applied Probability, Volume 31 (1994), pp. 542-548 | MR | Zbl
[17] Clustering with block mixture models, Pattern Recognition, Volume 36 (2003), pp. 463-473
[18] Block clustering with Bernoulli mixture models: Comparison of different approaches, Computational Statistics and Data Analysis, Volume 52 (2008), pp. 3233 -3245 | MR | Zbl
[19] Latent block model for contingency table, Communication in Statistics - Theory and Methods, Volume 39 (2010), pp. 416 -425 | MR | Zbl
[20] Co-Clustering, ISTE Ltd and John Wiley & Sons, Inc, 2013
[21] Classification croisée, Université Pierre et Marie Curie (1983) (Thèse d’état)
[22] Model selection for the binary latent block model, Proceedings of COMPSTAT 2012 (2012)
[23] Estimation and selection for the latent block model on categorical data, Statistics and Computing (2014), pp. 1-16 | DOI | MR | Zbl
[24] Design of Artificial Data Tables for Co-Clustering Analysis (2012) (Technical report)
[25] Sélection de modèle pour la classification croisée de données continues., Université de Technologie de Compiègne, December (2012) (PhD thesis)
[26] Co-clustering by block value decomposition, Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, ACM (2005), pp. 635-640
[27] Convergence of the groups posterior distribution in latent or stochastic block models, Bernoulli, Volume 21 (2015), pp. 537-573 | MR | Zbl
[28] Nonparametric Bayesian biclustering (2007) (Technical report)
[29] Modeling heterogeneity in random graphs: a selective review, arXiv preprint arXiv:1402.4296 (2014) | MR | Zbl
[30] Spectral clustering and the high-dimensional Stochastic Block Model, Ann. Statist, Volume 39 (2011) no. 4, pp. 1878-1915 | MR | Zbl
[31] Applied statistical decision theory, Studies in managerial economics, Division of Research, Graduate School of Business Adminitration, Harvard University, 1961 http://books.google.fr/books?id=wPBLAAAAMAAJ | MR
[32] Bayesian co-clustering, Eighth IEEE International Conference on Data Mining, 2008. ICDM’08 (2008), pp. 530-539
[33] Algorithms for non-negative matrix factorization, Advances in Neural Information Processing Systems 13 (2001), pp. 556-562
[34] Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data, PNAS, Volume 101 (2004) no. 9, pp. 2981-2986
[35] A Bayesian approach to two-mode clustering (2009) no. 2009-06 http://hdl.handle.net/1765/15112 (Technical report)
[36] Block clustering with collapsed latent block models, Statistics and Computing, Volume 22 (2012), pp. 415-428 | MR | Zbl
[37] Inferring structure in bipartite networks using the latent block model and exact ICL, ArXiv e-prints (2014) | arXiv
[38] Orthogonal nonnegative matrix tri-factorization for co-clustering: Multiplicative updates on Stiefel manifolds, Information processing & management, Volume 46 (2010) no. 5, pp. 559-570