Many clustering methods exist, but they are often designed without a variable selection procedure and do not always handle missing data. In microarray datasets, genes are described by a large number of experiments, and some values are almost always missing. It is therefore important to detect the relevant biological experiments in order to improve gene clustering and its interpretation. A common practice is to discard genes that are not fully observed, or to impute the missing values before clustering; however, this preprocessing is known to have a substantial impact on the clustering result. We tackle variable selection and missing data within a single statistical framework: we propose a versatile variable selection model based on multidimensional Gaussian mixtures that takes the role of each variable in the clustering into account and handles missing values without any data preprocessing. Numerical experiments highlight the gain of our method over approaches that impute missing data, which do not always recover the true variable roles and sometimes lose biological information.
Keywords: Variable selection, Missing data, Gaussian mixture clustering
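As an illustration of how a Gaussian mixture can use incomplete observations without imputing them, here is a minimal sketch. It is not the SelvarClustMV implementation. It assumes diagonal covariance matrices (a simplification) and values missing at random, so that each component density is simply evaluated on the observed coordinates. The function names, the use of `None` to mark a missing value, and the toy parameters are all hypothetical.

```python
import math


def log_normal_pdf(x, mean, var):
    """Log-density of a univariate Gaussian N(mean, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def component_log_terms(row, weights, means, variances):
    """Log of weight_k * density_k, using only the observed entries of row.

    Missing entries are marked with None. With diagonal covariances
    (an assumption of this sketch), integrating a missing coordinate out
    of a Gaussian amounts to dropping its factor from the product.
    """
    observed = [j for j, value in enumerate(row) if value is not None]
    terms = []
    for weight, mean, var in zip(weights, means, variances):
        log_term = math.log(weight)
        for j in observed:
            log_term += log_normal_pdf(row[j], mean[j], var[j])
        terms.append(log_term)
    return terms


def observed_log_density(row, weights, means, variances):
    """Log mixture density of the observed part of row (log-sum-exp)."""
    terms = component_log_terms(row, weights, means, variances)
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def cluster_posteriors(row, weights, means, variances):
    """Posterior cluster probabilities given only the observed entries."""
    terms = component_log_terms(row, weights, means, variances)
    top = max(terms)
    scaled = [math.exp(t - top) for t in terms]
    total = sum(scaled)
    return [s / total for s in scaled]


# Toy two-cluster mixture in two dimensions (hypothetical parameters).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

# A gene whose second experiment is missing is still assigned a cluster.
posteriors = cluster_posteriors([0.0, None], weights, means, variances)
```

In this sketch an incomplete row contributes to the likelihood and receives cluster probabilities through its observed coordinates only, so no gene is discarded and no value is filled in beforehand. With full covariance matrices the same idea would require the sub-vector of each mean and the sub-matrix of each covariance restricted to the observed coordinates.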
Maugis-Rabusseau, Cathy; Martin-Magniette, Marie-Laure; Pelletier, Sandra. SelvarClustMV: Variable selection approach in model-based clustering allowing for missing values. Journal de la société française de statistique, Volume 153 (2012) no. 2, pp. 21-36. http://archive.numdam.org/item/JSFS_2012__153_2_21_0/