Special Issue on Models and Inference in Population Genetics
Model choice using Approximate Bayesian Computation and Random Forests: analyses based on model grouping to make inferences about the genetic history of Pygmy human populations
[Choix de modèles par calcul bayésien approximé et forêts aléatoires : analyses basées sur le groupement de modèles pour inférer l’histoire génétique des populations Pygmées]
Journal de la société française de statistique, Tome 159 (2018) no. 3, pp. 167-190.

In evolutionary biology, simulation-based methods such as Approximate Bayesian Computation (ABC) are well suited to statistical inference under complex models of natural population histories. Pudlo et al. (2016) recently developed a novel approach based on Random Forests (RF): the ABC-RF algorithm. Here we present the results of ABC-RF analyses of a microsatellite genetic dataset, carried out to make inferences about the history of Pygmy human populations from Western Central Africa. A notable novelty of these statistical analyses is the application of ABC-RF methodology to model choice among predefined groups of models. We formalized eight complex evolutionary scenarios which incorporate (or not) three major events: (i) the existence of a common ancestral Pygmy population, (ii) the possibility of introgression/migration events between Pygmy and non-Pygmy populations, and (iii) the possibility of a past change in size of the non-Pygmy African population. We show that our grouping approach makes it possible to disentangle, with strong confidence, the main evolutionary events characterizing the population history of interest. The final selected scenario corresponds to a common origin of all Western Central African Pygmy groups, with the ancestral Pygmy population having diverged, with asymmetrical genetic introgression, from a demographically expanding non-Pygmy population.
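
The grouping idea in the abstract can be illustrated with a minimal sketch. It assumes that the eight scenarios are exactly the 2^3 combinations of the three events being present or absent, and shows how the model-index column of a simulated reference table could be relabeled into two groups of four scenarios for each event, so that a classifier such as a random forest could then be trained on group labels instead of individual scenarios. The event names, the scenario numbering, and the helper functions `group_label` and `relabel` are hypothetical and do not come from the article. The simulation of genetic data and the random forest step itself are omitted.

```python
from itertools import product
import random

# Hypothetical short names for the three events listed in the abstract.
EVENTS = ("common_pygmy_ancestor", "introgression", "non_pygmy_size_change")

# One scenario per combination of present (True) / absent (False) events.
# The numbering 1..8 is arbitrary and is NOT the numbering used in the article.
SCENARIOS = {
    index: dict(zip(EVENTS, flags))
    for index, flags in enumerate(product((True, False), repeat=len(EVENTS)), start=1)
}


def group_label(scenario_index, event):
    """Return the two-class group label of one scenario for one event."""
    present = SCENARIOS[scenario_index][event]
    return f"{event}={'yes' if present else 'no'}"


def relabel(model_indices, event):
    """Map per-simulation scenario indices to group labels for one event."""
    return [group_label(i, event) for i in model_indices]


# Toy stand-in for the model-index column of a simulated reference table:
# each simulation is drawn from one of the eight scenarios with equal probability.
rng = random.Random(0)
model_indices = [rng.randint(1, 8) for _ in range(1000)]

for event in EVENTS:
    labels = relabel(model_indices, event)
    members = sorted(i for i in SCENARIOS if SCENARIOS[i][event])
    n_yes = labels.count(f"{event}=yes")
    print(f"{event}: scenarios with the event = {members}, simulations = {n_yes}")
```

Each event splits the eight scenarios into two groups of four, so every grouped analysis is a two-class problem built from the same set of simulations, whereas choosing among individual scenarios is an eight-class problem.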

Keywords: Approximate Bayesian Computation, evolutionary biology, genetic variation, microsatellites, model selection, population genetics, Random Forests
@article{JSFS_2018__159_3_167_0,
     author = {Estoup, Arnaud and Raynal, Louis and Verdu, Paul and Marin, Jean-Michel},
     title = {Model choice using {Approximate} {Bayesian} {Computation} and {Random} {Forests:} analyses based on model grouping to make inferences about the genetic history of {Pygmy} human populations},
     journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
     pages = {167--190},
     publisher = {Soci\'et\'e fran\c{c}aise de statistique},
     volume = {159},
     number = {3},
     year = {2018},
     zbl = {1410.62177},
     language = {en},
     url = {http://archive.numdam.org/item/JSFS_2018__159_3_167_0/}
}
Estoup, Arnaud; Raynal, Louis; Verdu, Paul; Marin, Jean-Michel. Model choice using Approximate Bayesian Computation and Random Forests: analyses based on model grouping to make inferences about the genetic history of Pygmy human populations. Journal de la société française de statistique, Tome 159 (2018) no. 3, pp. 167-190. http://archive.numdam.org/item/JSFS_2018__159_3_167_0/

[1] Bertorelle, G.; Benazzo, A.; Mona, S. ABC as a flexible framework to estimate demography over space and time: some cons, many pros, Molecular Ecology, Volume 19 (2010) no. 13, pp. 2609-2625

[2] Biau, G.; Cérou, F.; Guyader, A. New Insights into Approximate Bayesian Computation, Annales de l’Institut Henri Poincaré B, Probability and Statistics, Volume 51 (2015) no. 1, pp. 376-403

[3] Beaumont, Mark A.; Cornuet, Jean-Marie; Marin, Jean-Michel; Robert, Christian P. Adaptive approximate Bayesian computation, Biometrika, Volume 96 (2009) no. 4, pp. 983-990

[4] Beaumont, Mark A. Approximate Bayesian Computation in Evolution and Ecology, Annual Review of Ecology, Evolution, and Systematics, Volume 41 (2010) no. 1, pp. 379-406

[5] Blum, M.; François, O. Non-linear regression models for Approximate Bayesian Computation, Statistics and Computing, Volume 20 (2010), pp. 63-73

[6] Biau, G. Analysis of a random forest model, Journal of Machine Learning Research, Volume 13 (2012), pp. 1063-1095

[7] Blum, M. Approximate Bayesian Computation: A Nonparametric Perspective, Journal of the American Statistical Association, Volume 105 (2010) no. 491, pp. 1178-1187

[8] Blum, M.G.B.; Nunes, M.; Prangle, D.; Sisson, S.A. A Comparative Review of Dimension Reduction Methods in Approximate Bayesian Computation, Statistical Science, Volume 28 (2013) no. 2, pp. 189-208

[9] Breiman, Leo Random Forests, Machine Learning, Volume 45 (2001) no. 1, pp. 5-32

[10] Beaumont, Mark A.; Zhang, W.; Balding, D. Approximate Bayesian Computation in Population Genetics, Genetics, Volume 162 (2002) no. 4, pp. 2025-2035

[11] Cavalli-Sforza, L.L. African pygmies: an evaluation of the state of research, African pygmies (Cavalli-Sforza, L.L., ed.), Orlando Academic Press (1986), pp. 361-426

[12] Csilléry, Katalin; Blum, Michael G.B.; Gaggiotti, Oscar E.; François, Olivier Approximate Bayesian Computation (ABC) in practice, Trends in Ecology & Evolution, Volume 25 (2010) no. 7, pp. 410-418

[13] Choisy, M.; Franck, P.; Cornuet, J.-M. Estimating admixture proportions with microsatellites: comparison of methods based on simulated data, Molecular Ecology, Volume 13 (2004), pp. 955-968

[14] Cornuet, Jean-Marie; Pudlo, Pierre; Veyssier, Julien; Dehne-Garcia, Alexandre; Gautier, Mathieu; Leblois, Raphaël; Marin, Jean-Michel; Estoup, Arnaud DIYABC v2.0: a software to make approximate Bayesian computation inferences about population history using single nucleotide polymorphism, DNA sequence and microsatellite data, Bioinformatics, Volume 30 (2014) no. 8, pp. 1187-1189

[15] Cornuet, J.-M.; Ravigné, V.; Estoup, A. Inference on population history and model checking using DNA sequence and microsatellite data with the software DIYABC (v1.0), BMC Bioinformatics, Volume 11 (2010) no. 1

[16] Cornuet, Jean-Marie; Santos, Filipe; Beaumont, Mark A.; Robert, Christian P.; Marin, Jean-Michel; Balding, David J.; Guillemaud, Thomas; Estoup, Arnaud Inferring population history with DIY ABC: a user-friendly approach to approximate Bayesian computation, Bioinformatics, Volume 24 (2008) no. 23, pp. 2713-2719

[17] Cavalli-Sforza, L.L.; Feldman, M.W. The application of molecular genetic approaches to the study of human evolution, Nature Genetics, Volume 33 (2003), pp. 266-275

[18] Cavalli-Sforza, L.L.; Menozzi, P.; Piazza, A. The History and Geography of Human Genes, Princeton University Press, 1994

[19] Drummond, A.; Bouckaert, R. Bayesian evolutionary analysis by sampling trees, Bayesian Evolutionary Analysis with BEAST, Cambridge University Press (2015), pp. 79-96

[20] Destro-Bisol, Giovanni; Donati, Francesco; Coia, Valentina; Boschi, Ilaria; Verginelli, Fabio; Caglià, Alessandra; Tofanelli, Sergio; Spedini, Gabriella; Capelli, Cristian Variation of Female and Male Lineages in Sub-Saharan Populations: the Importance of Sociocultural Factors, Molecular Biology and Evolution, Volume 21 (2004) no. 9, pp. 1673-1682

[21] De Iorio, M.; Griffiths, R. Importance sampling on coalescent histories. I, Advances in Applied Probability, Volume 36 (2004) no. 2, pp. 417-433

[22] De Iorio, M.; Griffiths, R. Importance sampling on coalescent histories. II: Subdivided population models, Advances in Applied Probability, Volume 36 (2004) no. 2, pp. 434-454

[23] De Iorio, M.; Griffiths, R.; Leblois, R.; Rousset, F. Stepwise mutation likelihood computation by sequential importance sampling in subdivided population models, Theoretical Population Biology, Volume 68 (2005) no. 1, pp. 41-53

[24] Del Moral, Pierre; Doucet, Arnaud; Jasra, Ajay An adaptive sequential Monte Carlo method for approximate Bayesian computation, Statistics and Computing, Volume 22 (2012) no. 5, pp. 1009-1020

[25] Drummond, A.; Rambaut, A. BEAST: Bayesian evolutionary analysis by sampling trees, BMC Evolutionary Biology, Volume 7 (2007) no. 1

[26] Excoffier, L.; Estoup, A.; Cornuet, J.-M. Bayesian analysis of an admixture model with mutations and arbitrarily linked markers, Genetics, Volume 169 (2005), pp. 1727-1738

[27] Estoup, Arnaud; Guillemaud, Thomas Reconstructing routes of invasion using genetic data: why, how and so what?, Molecular Ecology, Volume 19 (2010) no. 19, pp. 4113-4130

[28] Estoup, Arnaud; Jarne, Philippe; Cornuet, Jean-Marie Homoplasy and mutation model at microsatellite loci and their consequences for population genetics analysis, Molecular Ecology, Volume 11 (2002) no. 9, pp. 1591-1604

[29] Estoup, A.; Lombaert, E.; Marin, J.-M.; Robert, C.P.; Guillemaud, T.; Pudlo, P.; Cornuet, J.-M. Estimation of demo-genetic model probabilities with Approximate Bayesian Computation using linear discriminant analysis on summary statistics, Molecular Ecology Resources, Volume 12 (2012) no. 5, pp. 846-855

[30] Estoup, Arnaud; Verdu, Paul; Marin, Jean-Michel; Robert, Christian P.; Dehne-Garcia, Alexandre; Cornuet, Jean-Marie; Pudlo, Pierre Application of approximate Bayesian computation to infer the genetic history of Pygmy hunter-gatherers populations from Western Central Africa, Handbook of Approximate Bayesian Computation (Sisson, S.A.; Fan, Y.; Beaumont, M., eds.), Chapman and Hall/CRC (2018)

[31] Fraimout, Antoine; Debat, Vincent; Fellous, Simon; Hufbauer, Ruth A; Foucaud, Julien; Pudlo, Pierre; Marin, Jean-Michel; Price, Donald K; Cattel, Julien; Chen, Xiao Deciphering the Routes of invasion of Drosophila suzukii by Means of ABC Random Forest, Molecular Biology and Evolution, Volume 34 (2017) no. 4, pp. 980-996

[32] Fearnhead, P.; Prangle, D. Constructing summary statistics for Approximate Bayesian Computation: semi-automatic Approximate Bayesian Computation, Journal of the Royal Statistical Society: Series B (Statistical Methodology), Volume 74 (2012) no. 3, pp. 419-474

[33] Fagundes, Nelson J. R.; Ray, Nicolas; Beaumont, Mark; Neuenschwander, Samuel; Salzano, Francisco M.; Bonatto, Sandro L.; Excoffier, Laurent Statistical evaluation of alternative models of human evolution, Proceedings of the National Academy of Sciences, Volume 104 (2007) no. 45, pp. 17614-17619

[34] Frazier, David T.; Robert, Christian P.; Rousseau, Judith Model Misspecification in ABC: Consequences and Diagnostics, ArXiv e-prints (2018) no. 1708.01974v2

[35] Goldstein, D.; Linares, A.; Cavalli-Sforza, L.; Feldman, M. An evaluation of genetic distances for use with microsatellite loci, Genetics, Volume 139 (1995), pp. 463-471

[36] Grelaud, A.; Marin, J.-M.; Robert, C.P.; Rodolphe, F.; Tally, F. Likelihood-free methods for model choice in Gibbs random fields, Bayesian Analysis, Volume 3 (2009) no. 2, pp. 427-442

[37] Garza, J.; Williamson, E. Detection of reduction in population size using data from microsatellite DNA, Molecular Ecology, Volume 10 (2001), pp. 305-318

[38] Hewlett, B. S. Hunter-gatherers of the Congo Basin: cultures, histories, and biology of African pygmies, New Brunswick: Transaction Publishers, 2014

[39] Hewlett, B. S. Cultural diversity among African pygmies, Cultural Diversity among Twentieth-Century Foragers: An African Perspective (Kent, S., ed.), Cambridge: Cambridge University Press (1996), pp. 361-426

[40] Jin, L.; Chakraborty, R. Estimation of genetic distance and coefficient of gene diversity from single-probe multilocus DNA fingerprinting data, Molecular Biology and Evolution, Volume 11 (1994) no. 1, pp. 120-127

[41] Lombaert, Eric; Guillemaud, Thomas; Cornuet, Jean-Marie; Malausa, Thibaut; Facon, Benoît; Estoup, Arnaud Bridgehead effect in the worldwide invasion of the biocontrol harlequin ladybird, PLoS ONE, Volume 5 (2010) no. 3

[42] Merle, C.; Leblois, R.; Rousset, F.; Pudlo, P. Resampling: An improvement of importance sampling in varying population size models, Theoretical Population Biology, Volume 114 (2017), pp. 70-87

[43] Marjoram, P.; Molitor, J.; Plagnol, V.; Tavaré, S. Markov chain Monte Carlo without likelihoods, Proceedings of the National Academy of Sciences, Volume 100 (2003) no. 26, pp. 15324-15328

[44] Marin, Jean-Michel; Pudlo, Pierre; Estoup, Arnaud; Robert, Christian P. Likelihood-free Model Choice, Handbook of Approximate Bayesian Computation (Sisson, S.A.; Fan, Y.; Beaumont, M., eds.), Chapman and Hall/CRC (2018)

[45] Marin, Jean-Michel; Pudlo, Pierre; Robert, Christian P.; Ryder, Robin J. Approximate Bayesian computational methods, Statistics and Computing (2012), pp. 1-14

[46] Nei, M. Molecular Evolutionary Genetics, Columbia University Press, New York, USA, 1987

[47] Nordborg, Magnus Coalescent theory, Handbook of statistical genetics (2001), pp. 179-212

[48] Pascual, M.; Chapuis, M.; Mestres, F.; Balanyà, J.; Huey, R.; Gilchrist, G.; Estoup, A. Introduction history of Drosophila subobscura in the New World: a microsatellite-based survey using ABC methods, Molecular Ecology, Volume 16 (2007), pp. 3069-3083

[49] Pudlo, Pierre; Marin, Jean-Michel; Estoup, Arnaud; Cornuet, Jean-Marie; Gautier, Mathieu; Robert, Christian P. Reliable ABC model choice via random forests, Bioinformatics, Volume 32 (2016) no. 6, pp. 859-866

[50] Pritchard, J.K.; Seielstad, M.T.; Perez-Lezaun, A.; Feldman, M.W. Population growth of human Y chromosomes: a study of Y chromosome microsatellites, Molecular Biology and Evolution, Volume 16 (1999), pp. 1791-1798

[51] Robert, C.P.; Cornuet, Jean-Marie; Marin, Jean-Michel; Pillai, N.S. Lack of confidence in approximate Bayesian computation model choice, Proceedings of the National Academy of Sciences, Volume 108 (2011) no. 37, pp. 15112-15117

[52] Rannala, B.; Mountain, J. Detecting immigration by using multilocus genotypes, Proceedings of the National Academy of Sciences, USA, Volume 94 (1997), pp. 9197-9201

[53] Rousset, F.; Leblois, R. Likelihood and Approximate Likelihood Analyses of Genetic Structure in a Linear Habitat: Performance and Robustness to Model Mis-Specification, Molecular Biology and Evolution, Volume 24 (2007) no. 12, pp. 2730-2745

[54] Rousset, F.; Leblois, R. Likelihood-Based Inferences under Isolation by Distance: Two-Dimensional Habitats and Confidence Intervals, Molecular Biology and Evolution, Volume 29 (2012) no. 3, pp. 957-973

[55] Raynal, L.; Marin, J.-M.; Pudlo, P.; Ribatet, M.; Robert, C. P.; Estoup, A. ABC random forests for Bayesian parameter inference, Bioinformatics (2018) (bty867)

[56] Rubin, D. B. Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician, The Annals of Statistics, Volume 12 (1984), pp. 1151-1172

[57] Sunnåker, Mikael; Busetto, Alberto Giovanni; Numminen, Elina; Corander, Jukka; Foll, Matthieu; Dessimoz, Christophe Approximate Bayesian computation, PLoS Computational Biology, Volume 9 (2013) no. 1

[58] Scornet, E.; Biau, G.; Vert, J.-P. Consistency of random forests, The Annals of Statistics, Volume 43 (2015) no. 4, pp. 1716-1741

[59] Stephens, Matthew; Donnelly, Peter Inference in molecular population genetics, Journal of the Royal Statistical Society: Series B, Volume 62 (2000) no. 4, pp. 605-655 (With discussion and a reply by the authors)

[60] Sisson, S. A.; Fan, Y.; Tanaka, Mark M. Sequential Monte Carlo without likelihoods, Proceedings of the National Academy of Sciences, Volume 104 (2007) no. 6, pp. 1760-1765

[61] Sisson, S.A.; Fan, Y.; Tanaka, M.M. Sequential Monte Carlo without likelihoods: Errata, Proceedings of the National Academy of Sciences, Volume 106 (2009) no. 39

[62] Schrider, D.; Kern, A. S/HIC: Robust Identification of Soft and Hard Sweeps Using Machine Learning, PLOS Genetics, Volume 12 (2016) no. 3, pp. 1-31

[63] Sheehan, S.; Song, Y. Deep Learning for Population Genetic Inference, PLOS Computational Biology, Volume 12 (2016), pp. 1-28

[64] Tavaré, Simon; Balding, David; Griffiths, Robert; Donnelly, Peter Inferring coalescence times from DNA sequence data, Genetics, Volume 145 (1997) no. 2, pp. 505-518

[65] Thouzeau, V.; Mennecier, P.; Verdu, P.; Austerlitz, F. Genetic and linguistic histories in Central Asia inferred using approximate Bayesian computations, Proceedings of the Royal Society of London B: Biological Sciences, Volume 284 (2017) no. 1861

[66] Verdu, Paul; Austerlitz, Frederic; Estoup, Arnaud; Vitalis, Renaud; Georges, Myriam; Théry, Sylvain; Froment, Alain; Le Bomin, Sylvie; Gessain, Antoine; Hombert, Jean-Marie Origins and genetic diversity of pygmy hunter-gatherers from Western Central Africa, Current Biology, Volume 19 (2009) no. 4, pp. 312-318

[67] Verdu, Paul; Becker, Noémie SA; Froment, Alain; Georges, Myriam; Grugni, Viola; Quintana-Murci, Lluis; Hombert, Jean-Marie; Van der Veen, Lolke; Le Bomin, Sylvie; Bahuchet, Serge Sociocultural behavior, sex-biased admixture, and effective population sizes in Central African Pygmies and non-Pygmies, Molecular Biology and Evolution, Volume 30 (2013) no. 4, pp. 918-937

[68] Weir, B.; Cockerham, C. Estimating F-Statistics for the Analysis of Population Structure, Evolution, Volume 38 (1984) no. 6, pp. 1358-1370

[69] Wilkinson, R. Approximate Bayesian computation (ABC) gives exact results under the assumption of model error, Statistical Applications in Genetics and Molecular Biology, Volume 12 (2013) no. 2, pp. 129-141