Over the last few years, case-control genome-wide association studies (GWAS) have proven to be a successful tool for identifying genomic regions associated with complex diseases. Nevertheless, current GWAS still rely heavily on a single-marker strategy, in which each biological marker (or SNP, for single nucleotide polymorphism) is tested individually for association with the disease. It is widely acknowledged that this approach oversimplifies the complexity of the underlying biological mechanisms and that gene-gene interaction must be considered. Unfortunately, detecting gene-gene interaction raises complex statistical challenges, arising from the high dimensionality and complex architecture of the data as well as from the size of the space of interaction models. The purpose of this survey is to provide a critical overview of the numerous statistical methods proposed to detect gene-gene interaction in GWAS. These methods have been developed to detect interaction at various scales of the data, and we organize our survey into three main classes: SNP-SNP interaction methods, gene-gene interaction methods, and large-scale methods. For each class of methods, we identify relative strengths and weaknesses in terms of statistical power and offer perspectives on the future of statistical strategies in gene-gene interaction analysis.
Keywords: Gene-gene interaction, Regression models, Machine learning, Information theory, Statistical power
@article{JSFS_2018__159_1_27_0,
  author = {Emily, Mathieu},
  title = {A survey of statistical methods for gene-gene interaction in case-control genome-wide association studies},
  journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
  pages = {27--67},
  publisher = {Soci\'et\'e fran\c{c}aise de statistique},
  volume = {159},
  number = {1},
  year = {2018},
  mrnumber = {3803123},
  zbl = {1398.62339},
  language = {en},
  url = {http://archive.numdam.org/item/JSFS_2018__159_1_27_0/}
}
Emily, Mathieu. A survey of statistical methods for gene-gene interaction in case-control genome-wide association studies. Journal de la société française de statistique, Volume 159 (2018) no. 1, pp. 27-67. http://archive.numdam.org/item/JSFS_2018__159_1_27_0/