Regularized Maximum Likelihood Estimation and Feature Selection in Mixtures-of-Experts Models

Chamroukhi, Faicel; Huynh, Bao-Tuyen

Special issue: mixture analysis

[Estimation par maximum de vraisemblance régularisé et sélection de variables dans les modèles de mélanges d’experts]


Journal de la société française de statistique, Volume 160 (2019) no. 1, pp. 57-85.

Abstract

Mixtures of experts (MoE) are effective models for heterogeneous data in many statistical learning problems, including regression, clustering and classification. They are generally fitted by maximum likelihood via the EM algorithm, which makes their application to high-dimensional problems difficult. We consider the problem of estimation and feature selection in mixture-of-experts models, and propose a regularized maximum likelihood estimation approach that encourages sparse solutions for heterogeneous regression data with a potentially large number of predictors. Unlike state-of-the-art methods for mixtures of experts, the proposed regularization does not rely on an approximate penalty and requires no thresholding to recover the sparse solution. Sparse parameter estimation relies on regularizing the maximum likelihood estimator for both the experts and the gating functions, implemented by two versions of a hybrid EM algorithm. The M-step, carried out by coordinate ascent or by an MM algorithm, avoids matrix inversion in the updates, which makes scaling the algorithm up promising. An experimental study shows good performance of the proposed approach.
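To make "regularizing the maximum likelihood estimator for both the experts and the gating functions" concrete, one plausible form of such an objective is sketched below. This is an illustration only: the abstract does not state the penalty, so the l1 penalties, the Gaussian experts, the softmax gating network, and every symbol here (K components, gating parameters w_k, expert coefficients beta_k, variances sigma_k^2, tuning parameters lambda_k and gamma_k) are assumptions and may differ from the paper's exact formulation.

```latex
% Assumed sketch of a penalized log-likelihood for a K-component MoE
\[
PL(\theta) \;=\; \sum_{i=1}^{n} \log \sum_{k=1}^{K}
   \pi_k(x_i; w)\,
   \mathcal{N}\!\left(y_i;\ \beta_{k0} + x_i^{\top}\beta_k,\ \sigma_k^{2}\right)
 \;-\; \sum_{k=1}^{K} \lambda_k \,\lVert \beta_k \rVert_1
 \;-\; \sum_{k=1}^{K-1} \gamma_k \,\lVert w_k \rVert_1 ,
\]
% with softmax gating probabilities (w_{K0} = 0, w_K = 0 for identifiability)
\[
\pi_k(x; w) \;=\;
  \frac{\exp\!\left(w_{k0} + x^{\top} w_k\right)}
       {\sum_{l=1}^{K} \exp\!\left(w_{l0} + x^{\top} w_l\right)} .
\]
```

Under this assumed form, the first sum is the usual MoE log-likelihood and the two penalty terms shrink expert and gating coefficients towards zero, which is what produces feature selection.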

Mixture-of-experts (MoE) models are successful for modeling heterogeneous data in many statistical learning problems, including regression, clustering and classification. They are generally fitted by maximum likelihood estimation via the well-known EM algorithm, and their application to high-dimensional problems remains challenging. We consider the problem of fitting and feature selection in MoE models, and propose a regularized maximum likelihood estimation approach that encourages sparse solutions for heterogeneous regression data models with potentially high-dimensional predictors. Unlike state-of-the-art regularized MLE for MoE, the proposed approach does not require an approximation of the penalty function. We develop two hybrid EM algorithms: an Expectation-Majorization-Maximization (EM/MM) algorithm, and an EM algorithm combined with coordinate ascent. The proposed algorithms automatically yield sparse solutions without thresholding, and avoid matrix inversion by allowing univariate parameter updates. An experimental study shows the good performance of the algorithms in terms of recovering the actual sparse solutions, parameter estimation, and clustering of heterogeneous regression data.
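The idea of univariate updates that need no matrix inversion and no thresholding can be illustrated with a generic coordinate ascent step for a lasso-penalized weighted least-squares problem, which is the kind of subproblem an M-step for one Gaussian expert might reduce to. This is a minimal sketch under stated assumptions, not the authors' algorithm: the function names, the use of E-step responsibilities as `weights`, and the omission of the variance and gating-network updates are all choices made here for illustration.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator: sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def weighted_lasso_coordinate_ascent(X, y, weights, lam, n_sweeps=200, tol=1e-10):
    """Maximize -0.5 * sum_i w_i * (y_i - b0 - x_i'b)^2 - lam * ||b||_1.

    Hypothetical helper: `weights` plays the role of the posterior
    membership probabilities of one expert (an assumption of this sketch).
    Each coefficient is updated on its own, so no matrix is ever inverted.
    """
    p = len(X[0])
    b0, b = 0.0, [0.0] * p
    r = list(y)                      # residuals y_i - b0 - x_i'b
    wsum = sum(weights)
    for _ in range(n_sweeps):
        # Unpenalized intercept: shift by the weighted mean residual.
        delta = sum(w * ri for w, ri in zip(weights, r)) / wsum
        b0 += delta
        r = [ri - delta for ri in r]
        max_change = abs(delta)
        for j in range(p):
            xj = [row[j] for row in X]
            denom = sum(w * x * x for w, x in zip(weights, xj))
            if denom == 0.0:
                continue
            # Weighted correlation of feature j with its partial residual.
            rho = sum(w * x * (ri + x * b[j])
                      for w, x, ri in zip(weights, xj, r))
            new_bj = soft_threshold(rho, lam) / denom
            delta = new_bj - b[j]
            if delta != 0.0:
                r = [ri - x * delta for ri, x in zip(r, xj)]
                b[j] = new_bj
            max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return b0, b


# Toy data: y depends only on the first feature (y = 1 + 2 * x1).
X = [[-2, 1], [-1, -1], [0, 0], [1, -1], [2, 1]]
y = [-3.0, -1.0, 1.0, 3.0, 5.0]
w = [1.0] * 5
b0, b = weighted_lasso_coordinate_ascent(X, y, w, lam=1.0)
print(b0, b)  # the coefficient of the irrelevant second feature is exactly 0.0
```

The exact zero comes from the soft-thresholding operator itself, so no post-hoc thresholding of small coefficients is needed. In a full EM loop the weights would be recomputed in each E-step, and the gating parameters would need their own update, which this sketch leaves out.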


Keywords: Mixture of experts, Model-based clustering, Feature selection, Regularization, EM algorithm, Coordinate ascent, MM algorithm, High-dimensional data

@article{JSFS_2019__160_1_57_0,
     author = {Chamroukhi, Faicel and Huynh, Bao-Tuyen},
     title = {Regularized {Maximum} {Likelihood} {Estimation} and {Feature} {Selection} in {Mixtures-of-Experts} {Models}},
     journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
     pages = {57--85},
     publisher = {Soci\'et\'e fran\c{c}aise de statistique},
     volume = {160},
     number = {1},
     year = {2019},
     mrnumber = {3928540},
     zbl = {1417.62170},
     language = {en},
     url = {http://archive.numdam.org/item/JSFS_2019__160_1_57_0/}
}

TY  - JOUR
AU  - Chamroukhi, Faicel
AU  - Huynh, Bao-Tuyen
TI  - Regularized Maximum Likelihood Estimation and Feature Selection in Mixtures-of-Experts Models
JO  - Journal de la société française de statistique
PY  - 2019
SP  - 57
EP  - 85
VL  - 160
IS  - 1
PB  - Société française de statistique
UR  - http://archive.numdam.org/item/JSFS_2019__160_1_57_0/
LA  - en
ID  - JSFS_2019__160_1_57_0
ER  -

%0 Journal Article
%A Chamroukhi, Faicel
%A Huynh, Bao-Tuyen
%T Regularized Maximum Likelihood Estimation and Feature Selection in Mixtures-of-Experts Models
%J Journal de la société française de statistique
%D 2019
%P 57-85
%V 160
%N 1
%I Société française de statistique
%U http://archive.numdam.org/item/JSFS_2019__160_1_57_0/
%G en
%F JSFS_2019__160_1_57_0

Chamroukhi, Faicel; Huynh, Bao-Tuyen. Regularized Maximum Likelihood Estimation and Feature Selection in Mixtures-of-Experts Models. Journal de la société française de statistique, Volume 160 (2019) no. 1, pp. 57-85. http://archive.numdam.org/item/JSFS_2019__160_1_57_0/
