Special issue: quantitative, event-based, incompletely observed longitudinal data
Mixed Hidden Markov Model for Heterogeneous Longitudinal Data with Missingness and Errors in the Outcome Variable
Journal de la société française de statistique, Volume 155 (2014) no. 1, pp. 73-98.

Analysing longitudinal declarative (self-reported) data raises many difficulties, such as handling errors and missingness in the outcome variable. Moreover, cohorts monitored over the long term, as commonly encountered in life-course epidemiology, may exhibit time heterogeneity, especially in the way subjects respond to the investigator. We propose a Mixed Hidden Markov Model (MHMM) that accounts for several sources of randomness in the responses and also allows a past health outcome to act on present responses through a memory state. The model thus takes into account errors, missing responses, time heterogeneity, and retrospective questions. To estimate the parameters of the MHMM, we propose a Stochastic Expectation Maximization (SEM) algorithm, which is less time-consuming than the usual EM algorithm based on integration over the random effects.
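
To illustrate the general idea of a stochastic EM iteration, the following is a minimal sketch for a plain discrete hidden Markov model, not the estimation procedure of the article. It assumes no random effects and no covariates, treats nonresponse as an extra observed symbol, and uses a pseudo-count `eps` to keep probabilities strictly positive. All function names (`ffbs`, `m_step`, `stochastic_em`) are hypothetical. The S-step draws one hidden path per subject from its conditional distribution, and the M-step maximises the resulting complete-data likelihood in closed form.

```python
import random

def ffbs(obs, pi, A, B, rng):
    """S-step for one subject: draw a hidden path from p(states | obs)
    by forward filtering followed by backward sampling."""
    K, T = len(pi), len(obs)
    alpha = []
    for t in range(T):
        if t == 0:
            a = [pi[k] * B[k][obs[0]] for k in range(K)]
        else:
            a = [B[k][obs[t]] * sum(alpha[t - 1][j] * A[j][k] for j in range(K))
                 for k in range(K)]
        total = sum(a)
        alpha.append([v / total for v in a])  # normalise to avoid underflow
    path = [0] * T
    path[T - 1] = rng.choices(range(K), weights=alpha[T - 1])[0]
    for t in range(T - 2, -1, -1):
        w = [alpha[t][j] * A[j][path[t + 1]] for j in range(K)]
        path[t] = rng.choices(range(K), weights=w)[0]
    return path

def m_step(sequences, paths, K, M, eps=1.0):
    """M-step: smoothed relative frequencies from the completed data.
    The pseudo-count eps is an assumption added for numerical safety."""
    init = [eps] * K
    trans = [[eps] * K for _ in range(K)]
    emit = [[eps] * M for _ in range(K)]
    for obs, path in zip(sequences, paths):
        init[path[0]] += 1
        for t, (o, s) in enumerate(zip(obs, path)):
            emit[s][o] += 1
            if t > 0:
                trans[path[t - 1]][s] += 1

    def normalise(row):
        total = sum(row)
        return [v / total for v in row]

    return normalise(init), [normalise(r) for r in trans], [normalise(r) for r in emit]

def stochastic_em(sequences, pi, A, B, n_iter=50, seed=0):
    """Alternate S-steps and M-steps; return the whole chain of iterates."""
    rng = random.Random(seed)
    K, M = len(pi), len(B[0])
    history = []
    for _ in range(n_iter):
        paths = [ffbs(obs, pi, A, B, rng) for obs in sequences]  # S-step
        pi, A, B = m_step(sequences, paths, K, M)                # M-step
        history.append((pi, A, B))
    return history

# Toy data: symbols 0 = reports "healthy", 1 = reports "ill", 2 = nonresponse.
sequences = [[0, 0, 1, 2, 1], [0, 2, 0, 0, 0], [1, 1, 2, 1, 1]]
history = stochastic_em(
    sequences,
    pi=[0.5, 0.5],
    A=[[0.9, 0.1], [0.1, 0.9]],
    B=[[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]],
    n_iter=20,
)
```

Because the iterates of a stochastic EM form a random chain rather than a convergent sequence, a point estimate is usually obtained by averaging the iterates after a burn-in period, which is why the sketch returns the full history.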

We carry out a simulation study to assess the performance of this algorithm in the context of cancer epidemiology with the British NCDS 1958 cohort. Simulations show that the effect of covariates on the transition probabilities is estimated with moderate bias. Finally, we present a brief real-data application on the effect of early social class on cancer through smoking behaviour. It appears that, in the female sample we used, early social class does not act mainly on smoking behaviour. However, more information is needed to compensate for missing data and declarative errors in order to improve our statistical analysis.
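
As a rough illustration of the kind of data such a simulation study involves, the sketch below generates reports for one subject under a toy model. It is an assumption-laden example, not the simulation design of the article: the hidden state is binary with an absorbing "ill" state, a covariate `x` acts on the onset probability through a logistic link, a Gaussian subject-level random effect shifts the nonresponse propensity, and reports are flipped with a fixed error probability. The function name and all parameter names (`beta`, `gamma0`, `sigma_b`, `p_error`) are hypothetical.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def simulate_subject(n_waves, x, beta, gamma0, sigma_b, p_error, rng):
    """Simulate one subject over n_waves survey waves.

    Hidden state: 0 = healthy, 1 = ill (absorbing in this toy example).
    Returns (hidden, reports); each report is 0, 1 or None (nonresponse).
    """
    b = rng.gauss(0.0, sigma_b)                 # subject-level random effect
    p_missing = logistic(gamma0 + b)            # nonresponse propensity
    p_onset = logistic(beta[0] + beta[1] * x)   # covariate acts on the transition
    state, hidden, reports = 0, [], []
    for _ in range(n_waves):
        if state == 0 and rng.random() < p_onset:
            state = 1
        hidden.append(state)
        if rng.random() < p_missing:
            reports.append(None)                # missing response
        elif rng.random() < p_error:
            reports.append(1 - state)           # declarative error
        else:
            reports.append(state)               # truthful report
    return hidden, reports

# Usage: a small cohort with a binary covariate, then the observed nonresponse rate.
rng = random.Random(0)
cohort = [simulate_subject(8, x, (-3.0, 1.0), -2.0, 1.0, 0.05, rng)
          for x in (0, 1) * 500]
n_missing = sum(r is None for _, reports in cohort for r in reports)
print("nonresponse rate:", n_missing / (8 * len(cohort)))
```

Fitting a model to many such simulated cohorts and comparing the estimated covariate effect with the value of `beta[1]` used for generation is one way to quantify bias.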

Keywords: Longitudinal data, Mixed Hidden Markov Model, Random effects, Stochastic EM
@article{JSFS_2014__155_1_73_0,
     author = {Dedieu, Dominique and Delpierre, Cyrille and Gadat, S\'ebastien and Lang, Thierry},
     title = {Mixed {Hidden} {Markov} {Model} for {Heterogeneous} {Longitudinal} {Data} with {Missingness} and {Errors} in the {Outcome} {Variable}},
     journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
     pages = {73--98},
     publisher = {Soci\'et\'e fran\c{c}aise de statistique},
     volume = {155},
     number = {1},
     year = {2014},
     zbl = {1316.62125},
     language = {en},
     url = {http://archive.numdam.org/item/JSFS_2014__155_1_73_0/}
}
Dedieu, Dominique; Delpierre, Cyrille; Gadat, Sébastien; Lang, Thierry. Mixed Hidden Markov Model for Heterogeneous Longitudinal Data with Missingness and Errors in the Outcome Variable. Journal de la société française de statistique, Volume 155 (2014) no. 1, pp. 73-98. http://archive.numdam.org/item/JSFS_2014__155_1_73_0/
