Modèles à variables latentes en génétique des populations
Journal de la société française de statistique, Volume 152 (2011) no. 3, pp. 3-20.

In this article, we present several families of Bayesian hierarchical models dedicated to analyzing population genetic structure from multi-locus genotypes. Bayesian analysis of genetic structure solves unsupervised classification problems for categorical data. One specific feature of population genetic models is that an individual's genome can originate from several genetic groups because of admixture. The originality of the models presented here lies in the use of a hierarchical Bayesian framework that makes it possible to include spatial and environmental covariates, through a hidden regression layer, to model admixture. In addition, we present several model selection criteria for choosing the number of genetic groups as well as the set of spatial and environmental covariates. A first application of these models concerns the detection of genetic structure in human populations and the relationships between genetic structure and linguistic classifications for Native American populations. A second application concerns the estimation of the structure of plant species and model predictions under different climate change scenarios.

In this study, we review Bayesian methods of inference of population genetic structure using multi-locus genotypic data sets. The Bayesian analysis of population genetic structure typically addresses unsupervised classification problems for categorical data. However, peculiarities of population genetic data sets arise from a process called genetic admixture, in which the genome of any individual can contain DNA from several groups of populations. A common feature of the methods presented here is the use of a hierarchical framework which allows their users to implement models of admixture based on hidden regressions of genetic clusters on geographic and ecological variables. In addition, we present techniques for choosing the number of clusters and for selecting informative subsets of ecological variables with respect to population structure. We then survey applications of Bayesian methods to human and plant genetic data. For humans, we review previous work that examined relationships between genetic structure and languages in Native American populations using two distinct linguistic classifications. For plants, we estimate population genetic structure in an alpine species, and we provide an example of forecasting potential modifications in intra-specific genetic variation in response to global climatic change.
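As an illustration only, the idea of admixture proportions driven by a hidden regression on covariates can be sketched as a small generative simulation. The softmax link, the function names (`admixture_proportions`, `simulate_genotype`), the coefficient values, and the allele frequencies are all assumptions made for this sketch; they are not taken from the article, whose models may use a different link and different priors.

```python
import math
import random


def softmax(scores):
    """Map real-valued scores to probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def admixture_proportions(covariates, coefficients):
    """Hypothetical regression layer: one score per genetic cluster.

    coefficients[k] = [intercept, weight_1, ..., weight_p] for cluster k.
    The softmax link is an assumption of this sketch.
    """
    scores = [
        beta[0] + sum(b * x for b, x in zip(beta[1:], covariates))
        for beta in coefficients
    ]
    return softmax(scores)


def simulate_genotype(q, allele_freqs, rng):
    """Simulate a diploid genotype under a simple admixture model.

    q[k] is the admixture proportion of cluster k; allele_freqs[l][k] is the
    frequency of the reference allele at locus l in cluster k. Each allele
    copy first draws its (latent) cluster of origin, then its allelic state.
    Returns one reference-allele count (0, 1 or 2) per locus.
    """
    genotype = []
    for freqs in allele_freqs:
        count = 0
        for _ in range(2):  # two allele copies per locus
            k = rng.choices(range(len(q)), weights=q)[0]
            if rng.random() < freqs[k]:
                count += 1
        genotype.append(count)
    return genotype


rng = random.Random(0)
# Two clusters, one covariate (for example a rescaled longitude); values are made up.
coefficients = [[0.0, 2.0], [0.0, -2.0]]
allele_freqs = [[0.9, 0.1] for _ in range(50)]
for x in (-1.0, 0.0, 1.0):
    q = admixture_proportions([x], coefficients)
    g = simulate_genotype(q, allele_freqs, rng)
    print(x, [round(v, 3) for v in q], sum(g) / (2 * len(g)))
```

In this toy setting, individuals sampled at larger covariate values receive more ancestry from the first cluster, so their reference-allele frequency tends to be higher. Inference would run in the opposite direction, estimating the proportions and regression coefficients from observed genotypes.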

Keywords (translated from the French): population genetic structure, Bayesian estimation, molecular ecology
Keywords: population genetic structure, Bayesian inference, ecological modeling
@article{JSFS_2011__152_3_3_0,
     author = {Jay, Flora and Blum, Michael G. B. and Frichot, Eric and Fran\c{c}ois, Olivier},
     title = {Mod\`eles \`a variables latentes en g\'en\'etique des populations},
     journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
     pages = {3--20},
     publisher = {Soci\'et\'e fran\c{c}aise de statistique},
     volume = {152},
     number = {3},
     year = {2011},
     mrnumber = {2871174},
     zbl = {1316.92049},
     language = {fr},
     url = {http://archive.numdam.org/item/JSFS_2011__152_3_3_0/}
}
Jay, Flora; Blum, Michael G. B.; Frichot, Eric; François, Olivier. Modèles à variables latentes en génétique des populations. Journal de la société française de statistique, Volume 152 (2011) no. 3, pp. 3-20. http://archive.numdam.org/item/JSFS_2011__152_3_3_0/
