A Mutual Information-based method to select informative pairs of variables in case-control genetic association studies to improve the power of detecting interaction between genetic variants
Journal de la société française de statistique, Tome 159 (2018) no. 2, pp. 84-110.

We propose a novel procedure for tagging Single Nucleotide Polymorphisms (SNPs), called EpiTag, designed for interaction detection in Genome-Wide Association Studies. The method selects a set of tag-SNPs that optimally represents the variability of the whole set of pairs of SNPs, whereas usual tagging approaches consider SNPs one at a time. The link between two pairs of SNPs is measured by the Normalized Mutual Information. The procedure is assessed through the power of interaction detection, compared with a no-tagging strategy and a usual one-dimensional tagging procedure, on both simulated and real genotype structures. EpiTag shows good power relative to the competing methods, whatever the signal strength or the dimension of the data tested.
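As an illustration of the quantity named in the abstract, the sketch below computes a Normalized Mutual Information between two pairs of SNPs, treating each pair as a single categorical variable built from the two genotypes. It is a minimal sketch and not the authors' implementation: the normalization by the geometric mean of the two entropies is an assumption (one common definition of NMI), the function names are hypothetical, and the genotype vectors are made up.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of hashable labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(x, y):
    """NMI = I(X;Y) / sqrt(H(X) * H(Y)).

    Assumed normalization (geometric mean of the entropies);
    the paper may use a different convention.
    """
    hx, hy = entropy(x), entropy(y)
    if hx == 0.0 or hy == 0.0:
        return 0.0  # a constant variable shares no information
    hxy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    mutual_information = hx + hy - hxy
    return mutual_information / sqrt(hx * hy)


def pair_variable(snp_a, snp_b):
    """Combine two genotype vectors into one joint categorical variable."""
    return list(zip(snp_a, snp_b))


# Toy genotypes coded 0, 1, 2 (hypothetical data, for illustration only).
snp1 = [0, 1, 2, 0, 1, 2, 0, 1]
snp2 = [0, 0, 1, 1, 2, 2, 0, 1]
snp3 = list(snp1)  # copy of snp1
snp4 = list(snp2)  # copy of snp2

# The two pairs carry identical information, so their NMI is 1 up to rounding.
nmi_same = normalized_mutual_information(
    pair_variable(snp1, snp2), pair_variable(snp3, snp4)
)

# Two variables whose joint distribution is the product of the marginals
# have zero mutual information, hence an NMI of 0 up to rounding.
nmi_indep = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Under this reading, a pair-based tagging procedure would compare such values across candidate pairs; how EpiTag uses them to select tag-SNPs is described in the article itself, not here.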

Keywords: Genome-wide association studies, Gene-gene interaction, Mutual information, Selection of pairs of variables

@article{JSFS_2018__159_2_84_0,
     author = {Emily, Mathieu and Friguet, Chlo\'e},
     title = {A {Mutual} {Information-based} method to select informative pairs of variables in case-control genetic association studies to improve the power of detecting interaction between genetic variants},
     journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
     pages = {84--110},
     publisher = {Soci\'et\'e fran\c{c}aise de statistique},
     volume = {159},
     number = {2},
     year = {2018},
     mrnumber = {3855902},
     zbl = {1406.62137},
     language = {en},
     url = {http://archive.numdam.org/item/JSFS_2018__159_2_84_0/}
}
Emily, Mathieu; Friguet, Chloé. A Mutual Information-based method to select informative pairs of variables in case-control genetic association studies to improve the power of detecting interaction between genetic variants. Journal de la société française de statistique, Tome 159 (2018) no. 2, pp. 84-110. http://archive.numdam.org/item/JSFS_2018__159_2_84_0/
