Data generated by high-throughput biotechnologies are characterized by their high dimension and heterogeneity. Their statistical analysis calls into question even the most tried and tested approaches, such as the usual methods of statistical inference. Motivated by issues raised by the analysis of gene expression data, I focus on the impact of dependence on the properties of multiple testing procedures in high dimension. This article presents the main results: after an introduction to the issues raised by dependence among variables, it examines more closely the measures of error rates and the procedures designed to control them. This study leads to a general framework for handling data heterogeneity in multiple testing, in which the dependence structure is modeled by factor analysis. The framework reduces the instability that dependence induces in the procedures: error rates are less variable and, consequently, power and stability of simultaneous inference improve with respect to existing multiple testing procedures. The estimation of the model parameters in a high-dimensional setting and the choice of the number of factors are also discussed. The results are illustrated on genomic data from microarray experiments, analyzed with FAMT, a package for the free software R that implements the methods presented.
This paper is an extended written version of the talk I had the honor of giving on receiving the Marie-Jeanne Laurent-Duhamel prize at the 44th Journées de Statistique, organized by the French Statistical Society (Société Française de Statistique, SFdS) in Brussels, Belgium, in May 2012.
Keywords: Multiple testing, Dependence, High dimension, Error rates, Factor analysis, Proportion of null hypotheses
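The abstract's central claim, that dependence makes error rates more variable, can be illustrated with a small simulation. The sketch below is not the FAMT method and is written in Python rather than R. It assumes a hypothetical one-factor model in which every test statistic shares a single latent factor with correlation `rho`, applies the Benjamini-Hochberg step-up procedure, and records the number of false discoveries in each simulated data set. All names and parameter values (`m`, `m1`, `shift`, `runs`, `seed`) are illustrative assumptions.

```python
import math
import random
import statistics


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the sorted indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose ordered p-value lies below its threshold
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])


def upper_tail_pvalue(z):
    """One-sided p-value P(Z >= z) for a standard normal test statistic."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def simulate_false_discoveries(rho, m=200, m1=20, shift=3.0, runs=300, seed=1):
    """Count false discoveries per run under an assumed one-factor model."""
    rng = random.Random(seed)
    counts = []
    for _ in range(runs):
        factor = rng.gauss(0.0, 1.0)  # latent factor shared by all tests
        pvalues = []
        for i in range(m):
            noise = rng.gauss(0.0, 1.0)
            # Unit-variance statistic with pairwise correlation rho
            z = math.sqrt(rho) * factor + math.sqrt(1.0 - rho) * noise
            if i < m1:  # the first m1 hypotheses are true alternatives
                z += shift
            pvalues.append(upper_tail_pvalue(z))
        rejected = benjamini_hochberg(pvalues)
        counts.append(sum(1 for i in rejected if i >= m1))
    return counts


if __name__ == "__main__":
    for rho in (0.0, 0.5):
        counts = simulate_false_discoveries(rho)
        print(f"rho={rho}: mean={statistics.mean(counts):.2f}, "
              f"sd={statistics.pstdev(counts):.2f}")
```

Comparing the standard deviation of the false discovery counts for `rho=0.0` and `rho=0.5` gives a rough sense of the instability that a factor model of the dependence structure is meant to reduce.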
@article{JSFS_2012__153_2_100_0,
  author = {Friguet, Chlo\'e},
  title = {A general approach to account for dependence in large-scale multiple testing},
  journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
  pages = {100--122},
  publisher = {Soci\'et\'e fran\c{c}aise de statistique},
  volume = {153},
  number = {2},
  year = {2012},
  mrnumber = {3008601},
  zbl = {1316.62111},
  language = {en},
  url = {http://archive.numdam.org/item/JSFS_2012__153_2_100_0/}
}
Friguet, Chloé. A general approach to account for dependence in large-scale multiple testing. Journal de la société française de statistique, Volume 153 (2012) no. 2, pp. 100-122. http://archive.numdam.org/item/JSFS_2012__153_2_100_0/