Katholieke Universiteit Leuven
Faculteit Wetenschappen

Arnošt Komárek

Accelerated Failure Time Models for Multivariate Interval-Censored Data with Flexible Distributional Assumptions

Supervisors: Prof. Emmanuel Lesaffre, Prof. Jan Beirlant

Thesis submitted to obtain the degree of Doctor of Science, May 2006

ISBN 90-8649-014-X

© Arnošt Komárek. All rights reserved. No part of this book may be reproduced, in any form or by any other means, without the written permission of the copyright owner.

Dankwoord

I wish to express my gratitude to everyone who contributed in one way or another to the successful completion of my doctoral studies. Special thanks go to my supervisor, Professor Emmanuel Lesaffre, who always guided me expertly and supported me during the difficult moments in the preparation of this thesis. The fact that this thesis led to five articles accepted by international scientific journals is above all the result of his ability to push me in the right direction at the right moment. He also offered me many opportunities for contact with other researchers at both the national and the international level, for which I thank him as well. Furthermore, I want to thank my co-supervisor, Professor Jan Beirlant, who was always ready to help whenever it was needed. Thanks also to all members of the jury, Prof. Paul Janssen, Prof. An Carbonez, Prof. Irene Gijbels, Prof. Guadalupe Gómez and Prof. Dominique Declerck, for their critical reading of this work, which led to a substantial improvement. I want to thank my (former) colleagues at the Biostatistical Centre, Dora, Kris, Samuel, Silvia, Steffen, Geert, Dimitris, Roula, Wendim, Luwis, Alejandro, María José, Ann, Roos, Annelies, Bart and Francis, for the pleasant working environment during the last five years. Extra thanks go to Jeannine for her perfect administrative support, on which I could always rely. For a fascinating year that introduced me to the field of applied statistics, and biostatistics in particular, and that led to the start of my doctoral studies in Leuven, I thank all my fellow students and teachers of the Biostatistics programme at the Limburgs Universitair Centrum in Diepenbeek in the academic year 2000–01. This thesis could not have come into being without the financial support of the research grants of the Katholieke Universiteit Leuven. The support of grants OT/00/35, OE/03/29, DB/04/031 and BDB-B/05/10 is deeply appreciated. Last but not least, I want to thank Pascale, Filip, Sibe, Ine and Wout, who managed to create a family for me in Belgium. Thank you!

Arnošt

Poděkování

This text could also never have come into being without the knowledge of mathematics and statistics that I acquired during my undergraduate studies at the Faculty of Mathematics and Physics of Charles University in Prague. For the foundations of my statistical knowledge I thank all members of the Department of Probability and Mathematical Statistics. I thank Michal Kulich for persuading me to leave for Belgium for one year in 2000, thereby changing my place of residence for at least the next six years and contributing indirectly to a complete change of the topic of my doctoral thesis. I thank Professor Jaromír Antoch, my original doctoral supervisor, for not holding my desertion against me, judging by our subsequent ROBUST and other meetings.
Finally, I thank Lenka for keeping me even though much of the time that I could have devoted to her I devoted to statistics. I also thank her for the two-and-a-half-kilogram gift, which has grown a little in the meantime and with which she brightened up the end of one COMPSTAT. I thank Jindra for simply being. Without your smiles and other displays of favour and disfavour, the final work on this text would not have been nearly as cheerful as it was. Thank you!

Arnošt

Acknowledgement

There would be no need to develop the techniques presented in this thesis if there were no data posing interesting questions. I would like to thank all who collected those interesting data sets and allowed me to use them in this thesis.

Data collection for the Signal Tandmobiel® project introduced in Section 1.1 was supported by Unilever, Belgium. The Signal Tandmobiel® project comprises the following partners: D. Declerck (Dental School, Catholic University Leuven), L. Martens (Dental School, University Ghent), J. Vanobbergen (Oral Health Promotion and Prevention, Flemish Dental Association), P. Bottenberg (Dental School, University Brussels), E. Lesaffre (Biostatistical Centre, Catholic University Leuven), K. Hoppenbrouwers (Youth Health Department, Catholic University Leuven; Flemish Association for Youth Health Care).

The WIHS data introduced in Section 1.3 were collected by the Women's Interagency HIV Study Collaborative Study Group and its Oral Substudy with centers (Principal Investigators) at New York City/Bronx Consortium (K. Anastos, J. A. Phelan); Brooklyn, NY (H. Minkoff); Washington DC Metropolitan Consortium (M. Young); The Connie Wofsy Study Consortium of Northern California (R. Greenblatt, D. Greenspan, J. S. Greenspan); Los Angeles County/Southern California Consortium (A. Levine, R. Mulligan, M. Navazesh); Chicago Consortium (M. Cohen, M. Alves); Data Coordinating Center (A. Muñoz). The WIHS is funded by the National Institute of Allergy and Infectious Diseases, with supplemental funding from the National Cancer Institute, the National Institute of Child Health & Human Development, the National Institute on Drug Abuse, the National Institute of Dental and Craniofacial Research, the Agency for Health Care Policy and Research, the National Center for Research Resources, and the Centers for Disease Control and Prevention (grants U01-AI-35004, U01-AI-31834, U01-AI-34994, U01-AI-34989, U01-HD-32632 (NICHD), U01-AI-34993, U01-AI-42590, M01-RR00079, and M01-RR00083). The WIHS Oral Substudy is funded by the National Institute of Dental and Craniofacial Research.

The EBCP data introduced in Section 1.4 were kindly provided by Catherine Legrand and Richard Sylvester from the European Organisation for Research and Treatment of Cancer.

Thank You!

Arnošt

The majority of the material in this thesis is based on original publications. Below, we list the parts of the thesis based principally on these publications.

Sections 5.1, 7.7: Lesaffre, E., Komárek, A., and Declerck, D. (2005). An overview of methods for interval-censored data with an emphasis on applications in dentistry. Statistical Methods in Medical Research, 14, 539–552.

Section 5.2: Komárek, A., Lesaffre, E., Härkänen, T., Declerck, D., and Virtanen, J. I. (2005). A Bayesian analysis of multivariate doubly-interval-censored data. Biostatistics, 6, 145–155.

Chapter 7: Komárek, A., Lesaffre, E., and Hilton, J. F. (2005). Accelerated failure time model for arbitrarily censored data with smoothed error distribution. Journal of Computational and Graphical Statistics, 14, 726–745.

Chapter 8: Komárek, A.
and Lesaffre, E. (2006a). Bayesian accelerated failure time model for correlated censored data with a normal mixture as an error distribution. To appear in Statistica Sinica.

Chapter 9: Komárek, A. and Lesaffre, E. (2006b). Bayesian accelerated failure time model with multivariate doubly-interval-censored data and flexible distributional assumptions. Submitted.

Chapter 10: Komárek, A. and Lesaffre, E. (2006c). Bayesian semiparametric accelerated failure time model for paired doubly-interval-censored data. Statistical Modelling, 6, 3–22.

Contents

Notation
Preface

Part I: Introduction

1 Motivating Data Sets
1.1 The Signal Tandmobiel® study
1.2 The Chronic Granulomatous Disease trial (CGD)
1.3 The Women's Interagency HIV Study (WIHS)
1.4 Perioperative Chemotherapy in Early Breast Cancer Patients (EBCP)

2 Basic Notions
2.1 Right, left and interval censoring
2.2 Doubly interval censoring
2.3 Density, survival, hazard and cumulative hazard functions
2.4 Independent noninformative censoring and simplified likelihood
2.4.1 Right-censored data
2.4.2 Interval-censored data
2.4.3 Simplified likelihood for interval-censored data

3 An Overview of Regression Models for Survival Data
3.1 Proportional hazards model
3.2 Accelerated failure time model
3.3 Accelerated failure time model versus proportional hazards model
3.4 Regression models for multivariate survival data
3.4.1 Frailty proportional hazards model
3.4.2 Population averaged accelerated failure time model
3.4.3 Cluster specific accelerated failure time model
3.4.4 Population averaged model versus cluster specific model

4 Frequentist and Bayesian Inference
4.1 Likelihood for interval-censored data
4.1.1 Interval-censored data
4.1.2 Doubly-interval-censored data
4.2 Likelihood for multivariate (doubly) interval-censored data
4.3 Bayesian data augmentation
4.4 Hierarchical specification of the model
4.5 Markov chain Monte Carlo
4.6 Credible regions and Bayesian p-values
4.6.1 Credible regions
4.6.2 Bayesian p-values

5 An Overview of Methods for Interval-Censored Data
5.1 Frequentist methods
5.1.1 Estimation of the survival function
5.1.2 Comparison of two survival distributions
5.1.3 Proportional hazards model
5.1.4 Accelerated failure time model
5.1.5 Interval-censored covariates
5.2 Bayesian proportional hazards model: An illustration
5.2.1 Signal Tandmobiel® study: Research question and related data characteristics
5.2.2 Proportional hazards modelling using midpoints
5.2.3 The Bayesian survival model for doubly-interval-censored data
5.2.4 Results
5.2.5 Discussion
5.3 Bayesian accelerated failure time model
5.4 Concluding remarks

Concluding Remarks to Part I and Introduction to Part II

Part II: Accelerated Failure Time Models with Flexible Distributional Assumptions

6 Mixtures as Flexible Models for Unknown Distributions
6.1 Classical normal mixture
6.1.1 From general finite mixture to normal mixture
6.1.2 Estimation of mixture parameters
6.2 Penalized B-splines
6.2.1 Introduction to B-splines
6.2.2 Penalized smoothing
6.2.3 B-splines in the survival analysis
6.2.4 B-splines as models for densities
6.2.5 B-splines for multivariate smoothing
6.3 Penalized normal mixture
6.3.1 From B-spline to normal density
6.3.2 Transformation of mixture weights
6.3.3 Penalized normal mixture for distributions with an arbitrary location and scale
6.3.4 Multivariate smoothing
6.4 Classical versus penalized normal mixture

7 Maximum Likelihood Penalized AFT Model
7.1 Model
7.1.1 Model for the error density
7.1.2 Scale regression
7.2 Penalized maximum-likelihood
7.2.1 Penalized log-likelihood
7.2.2 Remarks on the penalty function
7.2.3 Selecting the smoothing parameter
7.3 Inference based on the maximum likelihood penalized AFT model
7.3.1 Pseudo-variance
7.3.2 Asymptotic variance
7.3.3 The pseudo-variance versus the asymptotic variance
7.3.4 Remarks
7.4 Predictive survival and hazard curves and predictive densities
7.5 Simulation study
7.6 Example: WIHS data – interval censoring
7.6.1 Fitted models
7.6.2 Predictive survival and hazard curves, predictive densities
7.6.3 Conclusions
7.7 Example: Signal Tandmobiel® study – interval-censored data
7.7.1 Fitted models
7.7.2 Predictive emergence and hazard curves
7.7.3 Comparison of emergence distributions between different groups
7.7.4 Conclusions
7.8 Discussion

8 Bayesian Normal Mixture Cluster-Specific AFT Model
8.1 Model
8.1.1 Distributional assumptions
8.1.2 Likelihood
8.2 Bayesian hierarchical model
8.2.1 Prior specification of the error part
8.2.2 Prior specification of the regression part
8.2.3 Weak prior information
8.2.4 Posterior distribution
8.3 Markov chain Monte Carlo
8.3.1 Update of the error part of the model
8.3.2 Update of the regression part of the model
8.4 Bayesian estimates of the survival distribution
8.4.1 Predictive survival and hazard curves and predictive survival densities
8.4.2 Predictive error densities
8.5 Bayesian estimates of the individual random effects
8.6 Simulation study
8.7 Example: Signal Tandmobiel® study – clustered interval-censored data
8.7.1 Prior distribution
8.7.2 Results for the regression and error parameters
8.7.3 Inter-teeth relationship
8.7.4 Predictive emergence and hazard curves
8.7.5 Predictive error density
8.7.6 Conclusions
8.8 Example: CGD data – recurrent events analysis
8.8.1 Prior distribution
8.8.2 Effect of covariates on the time to infection
8.8.3 Predictive error density and variability of random effects
8.8.4 Estimates of individual random effects
8.8.5 Conclusions
8.9 Example: EBCP data – multicenter study
8.9.1 Prior distribution
8.9.2 Effect of covariates on PFS time
8.9.3 Predictive error density and variance components of random effects
8.9.4 Estimates of individual random effects
8.9.5 Conclusions
8.10 Discussion

9 Bayesian Penalized Mixture Cluster-Specific AFT Model
9.1 Model
9.1.1 Distributional assumptions
9.1.2 Likelihood
9.2 Bayesian hierarchical model
9.2.1 Prior distribution for G
9.2.2 Prior distribution for the generic node Y
9.2.3 Prior distribution for multivariate random effects in Model M
9.2.4 Prior distribution for the regression parameters
9.2.5 Prior distribution for the time variables
9.2.6 Posterior distribution
9.3 Markov chain Monte Carlo
9.3.1 Updating the parameters related to the penalized mixture G
9.3.2 Updating the generic node Y
9.3.3 Updating the parameters related to the multivariate random effects in Model M
9.3.4 Updating the regression parameters
9.4 Bayesian estimates of the survival distribution
9.4.1 Predictive survival and hazard curves and predictive survival densities
9.4.2 Predictive error and random effect densities
9.5 Bayesian estimates of the individual random effects
9.6 Simulation study
9.7 Example: Signal Tandmobiel® study – clustered doubly-interval-censored data
9.7.1 Basic Model
9.7.2 Final Model
9.7.3 Prior distribution
9.7.4 Results
9.7.5 Conclusions
9.8 Example: EBCP data – multicenter study
9.8.1 Prior distribution
9.8.2 Effect of covariates on PFS time
9.8.3 Predictive error density and variance components of random effects
9.8.4 Estimates of individual random effects
9.8.5 Conclusions
9.9 Discussion

10 Bayesian Penalized Mixture Population-Averaged AFT Model
10.1 Model
10.1.1 Distributional assumptions
10.1.2 Likelihood
10.2 Bayesian hierarchical model
10.2.1 Prior distribution for G
10.2.2 Prior distribution for the generic node Y
10.2.3 Prior distribution for the regression parameters and time variables
10.2.4 Posterior distribution
10.3 Markov chain Monte Carlo
10.4 Evaluation of association
10.5 Bayesian estimates of the survival distribution
10.5.1 Predictive survival and hazard curves and predictive survival densities
10.5.2 Predictive error densities
10.6 Example: Signal Tandmobiel® study – paired doubly-interval-censored data
10.6.1 Basic Model
10.6.2 Final Model
10.6.3 Prior distribution
10.6.4 Results
10.7 Discussion

11 Overview and Further Research
11.1 Overview
11.2 Generalizations and improvements
11.3 The use of penalized mixtures in other application areas
11.3.1 Generalized linear mixed models with random effects having a flexible distribution
11.3.2 Spatial models with the intensity specified by the penalized mixture

A Technical details for the Maximum Likelihood Penalized AFT Model
A.1 Optimization algorithm
A.2 Individual log-likelihood contributions
A.3 First derivatives of the log-likelihood
A.3.1 With respect to the regression parameters and the intercept
A.3.2 With respect to the log-scale and the scale-regression parameters
A.3.3 With respect to the transformed mixture weights
A.4 Second derivatives of the log-likelihood
A.4.1 With respect to the extended regression parameters
A.4.2 Mixed with respect to the extended regression parameters and the log-scale or the scale-regression parameters
A.4.3 Mixed with respect to the extended regression parameters and the transformed mixture weights
A.4.4 With respect to the log-scale or the scale-regression parameters
A.4.5 Mixed with respect to the log-scale or the scale-regression parameters and the transformed mixture weights
A.4.6 With respect to the transformed mixture weights
A.5 Derivatives of the penalty term
A.6 Derivatives of the constraints
A.7 Proof of Proposition 7.1

B Simulation results
B.1 Simulation for the maximum likelihood penalized AFT model
B.2 Simulation for the Bayesian normal mixture cluster-specific AFT model
B.3 Simulation for the Bayesian penalized mixture cluster-specific AFT model

C Software
C.1 Package smoothSurv
C.2 Package bayesSurv

Bibliography

Curriculum Vitae

Notation

Here, we give a list of the most often used symbols within this thesis.

$\delta_i$ ⋆ censoring indicator: 0 for right-censored, 1 for exactly observed, 2 for left-censored, 3 for interval-censored observations
$\mathbf{1}$ ⋆ vector of ones
$\varphi(e)$ ⋆ density of $N(0, 1)$
$\varphi(e \mid \mu, \sigma^2)$ ⋆ density of $N(\mu, \sigma^2)$
$\varphi_q(e \mid \mu, \Sigma)$ ⋆ density of the $q$-variate normal distribution with mean $\mu$ and covariance matrix $\Sigma$
$\Phi(e)$ ⋆ cumulative distribution function of $N(0, 1)$
$\Phi(e \mid \mu, \sigma^2)$ ⋆ cumulative distribution function of $N(\mu, \sigma^2)$
$\lfloor t^L, t^U \rfloor$ ⋆ interval-censored observation; according to the context, the interval might be closed, half closed or open
$I_{t^L}^{t^U} p(s)\,ds$ ⋆ symbol used to write down the likelihood of interval-censored data: it equals $\int_{t^L}^{t^U} p(s)\,ds$ if $t^L < t^U$, and $p(t^L) = p(t^U)$ if $t^L = t^U$

Preface

The accelerated failure time (AFT) model, the principal topic of this thesis, is a regression model used to analyze survival data.
The term survival data is usually used for data that measure the time to some event, not necessarily death. More precisely, the event time will be considered a positive real-valued variable having a continuous distribution. In typical practical situations, data on event times are obtained by following subjects in the study over (calendar) time, recording the moments of the specified events of interest and computing the time spans between some initial (onset) time and the event (e.g. entry into a study and disease progression, infection with HIV and onset of AIDS, tooth emergence and the first caries attack on that tooth). A typical feature of survival data is that the time to event is not always observed completely and observations are subject to censoring. Most commonly, either the study is finished before all subjects involved encounter the specified event, or the subject leaves the study for some reason before encountering the event. In both situations, only a lower limit for the true event time is known and we talk about right censoring (see Sections 1.2 and 1.4 for examples). In many areas of medical research, the occurrence of the event of interest can only be recorded at planned (or unplanned) visits. The exact event time is then only known to lie between two examination times (visits) and we encounter interval censoring. Typical examples are (a) time to caries development; (b) time to emergence of a tooth (Section 1.1); (c) time to HIV seroconversion; (d) time to the onset of AIDS (Section 1.3). Indeed, in the case of a cavity or of emergence the event is often observed after some delay, say at planned (or even unplanned) visits. Similarly, HIV seroconversion can only be determined by regular or irregular laboratory assessments. However, the event may also happen before the first examination (e.g. a decayed tooth is detected already at the first dental examination), giving a so-called left-censored observation, or it may happen after the last examination, resulting in a right-censored observation. Hence interval censoring is a natural generalization of the commonly encountered right censoring.

Often not only the event time but also the time which specifies the origin of the time scale for the event (the onset time) can only be recorded in the same way as described in the previous paragraph. An example is the time to caries development on a tooth, where the time of tooth emergence constitutes the onset time for caries (see Section 1.1). We then speak of doubly interval censoring. We further formalize the notion of censoring in Chapter 2.

Furthermore, independence between the event times cannot always be assumed, which brings us into the area of multivariate survival data. The dependence can be caused by very different factors. Although many methods described in this thesis can be applied to any multivariate survival data, the dependencies in our applications are all the result of some type of clustering: emergence or caries times of several teeth of one child (Section 1.1), or progression-free survival times of several patients within one hospital in a multicenter clinical trial (Section 1.4). Also recurrent infection times of one patient (Section 1.2) can be considered to result in clustered data. The ultimate goal of the research presented in this thesis was to develop AFT models which can be used to analyze multivariate survival data, possibly in the presence of doubly interval censoring.
The scale of complexity considered in this thesis starts with interval censoring, which can be handled by all methods introduced here. Possible dependencies between the observations (multivariate survival data) are viewed as the next step on the scale of complexity. Finally, doubly interval censoring is regarded as the final level of complexity treated in this thesis, and only some of the methods shown here reach this final stage. At all levels of complexity we strived for models with distributional assumptions that are as flexible as possible. Two slightly different directions are followed in the thesis to address this issue, both of which use a Gaussian mixture as a building block to model an unknown distribution. Whereas the first and more extensively explored approach uses a mixture with a large number of fixed mixture components whose weights are estimated using a penalized methodology, the second technique uses a classical mixture in which the number of components as well as their weights, locations and scales are unknown.

Chapter 1 introduces several data sets, each containing survival data that involve one or more of the issues discussed above; these data sets will be used throughout the thesis to illustrate the developed methods. Terminology and notation used in the thesis are formalized in Chapter 2, together with an explanation of some basic notions in the analysis of survival data. The most popular regression models for survival data are introduced in Chapter 3. In Chapter 4 we give the likelihood for interval- and doubly-interval-censored data and briefly discuss the difficulties encountered when using maximum-likelihood methods in the context of (doubly) interval-censored data. Subsequently, we show how Bayesian inference together with the Markov chain Monte Carlo (MCMC) methodology can simplify the calculations. Available methods for the analysis of interval-censored data are reviewed in Chapter 5, and one of the methods, namely the Bayesian proportional hazards model with a piecewise constant baseline hazard function, is applied to the analysis of the clustered doubly-interval-censored dental data. In Chapter 6 we explain in detail how the classical and the penalized normal mixtures can be used to specify unknown distributions in a flexible way. The first AFT model presented in this thesis, an AFT model whose error distribution is a normal mixture with a high number of fixed components estimated using the penalized maximum-likelihood method, is shown in Chapter 7. However, only univariate interval-censored data can be handled by this method. To move on to the area of multivariate or even doubly-interval-censored survival data we found it more advantageous to use a Bayesian methodology rather than the more classical maximum-likelihood based techniques. The Bayesian AFT model allowing for multivariate interval-censored data, which uses a classical normal mixture with an unknown number of mixture components as well as unknown mixture components themselves to specify the error distribution, is presented in Chapter 8. Finally, Chapters 9 and 10 show the Bayesian AFT models suitable for multivariate doubly-interval-censored data that exploit a penalized normal mixture with a high number of fixed components. For all methods described in this thesis, software was written in the form of R (R Development Core Team, 2005) packages called smoothSurv and bayesSurv, downloadable from the Comprehensive R Archive Network at http://www.R-project.org.
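For readers who want to try the methods directly, the following minimal R sketch shows how the two packages can be installed from CRAN and loaded; the comments only point to the packages' own documentation.

## Install the thesis software from CRAN
install.packages(c("smoothSurv", "bayesSurv"))

library("smoothSurv")   # penalized-mixture AFT model of Chapter 7 (fitting routine smoothSurvReg)
library("bayesSurv")    # Bayesian mixture AFT models of Chapters 8-10

## Overview of the functions provided by each package
help(package = "smoothSurv")
help(package = "bayesSurv")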
The software is briefly described in Appendix C.

Part I

Introduction

Chapter 1

Motivating Data Sets

This chapter introduces the data sets which will be used throughout the thesis to illustrate the developed techniques and to show their generality. Each data set involves one or more of the specific features of interest here, discussed briefly in the Preface. The Signal Tandmobiel® data set introduced in Section 1.1 involves clustered interval- and doubly-interval-censored dental observations. Section 1.2 describes a clinical trial with patients with a chronic granulomatous disease where times of possibly recurrent infections were of interest; at the same time, the time of the last infection is right-censored. The Women's Interagency HIV Study involved interval-censored data and is described in Section 1.3. In Section 1.4, a multicenter clinical trial is described which evaluated the effect of perioperative chemotherapy on disease progression in early breast cancer patients; here the heterogeneity across the centers plays an important role.

1.1 The Signal Tandmobiel® study

The Signal Tandmobiel® project is a longitudinal oral health study performed in Flanders from 1996 to 2001. It involved 4 468 schoolchildren (2 315 boys and 2 153 girls) born in 1989. Two stratification factors, i.e. geographical location (5 provinces) and educational system (3 school systems), establishing 15 strata, were taken into account. The sample represented about 7% of the corresponding Flemish population of school children. Detailed oral health data at tooth and tooth-surface level (caries experience, gingivitis, etc.) were annually collected by a team of 16 dentists whose examination method was calibrated every six months. In addition, data on dietary and oral hygiene habits were collected using a questionnaire completed by the parents. Hence the data set consists of a series of at most 6 longitudinal dental observations and reported oral health habits. The study design and research methods have been described in detail by Vanobbergen et al. (2000).

Here, we concentrate on the emergence and caries times of permanent premolars and molars (teeth κ + 4, κ + 5, κ + 6, κ = 10, 20, 30, 40 in European dental notation, see Figure 1.1). There is no doubt that an adequate knowledge of the timing and patterns of tooth emergence and/or caries attacks is still essential for diagnosis and treatment planning in paediatric dentistry and orthodontics. Additionally, the effect of certain prespecified factors (like the caries status of the primary teeth, see Figure 1.2 for their notation, use of fluoride supplements, brushing habits, etc.) on the emergence or caries processes is often of interest.

[Figure 1.1: European notation for the position of permanent teeth. Maxilla = upper jaw, mandible = lower jaw. The first and the fourth quadrants are at the right-hand side of the subject, the second and the third quadrants are at the left-hand side of the subject.]

An interesting feature of this data set, though typical in dental applications, is the fact that both emergence and onset of caries are only observable when the child is examined (by a dentist). This leads to interval-censored emergence times and to doubly-interval-censored times for caries (see also Figure 2.1). Additionally, the teeth of a single mouth share common immeasurable or only roughly measured factors like genetic dispositions or dietary habits.
As a result, the emergence times or the times to caries of teeth in the same mouth are related. Hence, when studying the emergence time or the time to caries of several teeth, dependencies among the observations taken on a single child must be taken into account. Analysis of the emergence times and times to caries is reported in several sections of the thesis.

[Figure 1.2: European notation for the position of deciduous (primary) teeth. The quadrants are numbered 5, 6, 7, 8. The fifth and the eighth quadrants are at the right-hand side of the subject, the sixth and the seventh quadrants are at the left-hand side of the subject.]

1.2 The Chronic Granulomatous Disease trial (CGD)

Chronic Granulomatous Disease is a group of inherited rare disorders of the immune function characterized by recurrent pyogenic infections which may lead to death in childhood. There is evidence of a positive role of gamma interferon in restoring the immune functions of the patients. For that reason, a multicenter placebo-controlled randomized trial was conducted to study the ability of gamma interferon to reduce the rate of serious infections. Between October 1988 and March 1989, 128 patients with CGD (63 taking gamma interferon, 65 taking placebo) were accrued by 13 hospitals in Europe and the United States. The average follow-up time was 292 days; the minimal and maximal follow-up times were 91 and 432 days, respectively. For each patient, the times of the initial and any recurrent serious infections were recorded. There is a minimum of one and a maximum of eight recurrent infection times per patient, with a total of 203 records. Besides gamma interferon there are other factors that may influence the times between infections. In the course of the study the following additional information was recorded for each patient:

• Age at time of study entry (mean 14.6 years, range 1 to 44 years, standard deviation 9.8 years);
• Gender: male (n = 104), female (n = 24);
• Pattern of inheritance: autosomal recessive (n = 42), X-linked (n = 86);
• Use of corticosteroids at time of study entry: yes (n = 3), no (n = 125);
• Use of prophylactic antibiotics at time of study entry: yes (n = 111), no (n = 17);
• Category of the hospital: US – NIH (n = 26), US – other (n = 63), Europe – Amsterdam (n = 19), Europe – other (n = 20).

The data can be found in Appendix D.2 of Fleming and Harrington (1991). It is of interest here to set up a regression model with the time between two consecutive infections as the response and the factors mentioned above as covariates. It should be taken into account that the infection times of one patient cannot be assumed to be independent. We address this issue in Section 8.8.

1.3 The Women's Interagency HIV Study (WIHS)

The Women's Interagency HIV Study comprises a cohort of 2 058 seropositive women and a comparison cohort of 568 seronegative women exposed to a higher risk of HIV infection than the general U.S. population. The study groups were enrolled between October 1994 and November 1995 through six clinical consortia at 23 sites throughout the United States. Barkan et al. (1998) provide full details on the setup of the study. In this thesis we concentrate only on the WIHS Oral Substudy involving 224 seropositive AIDS-free (at baseline) women. The women participating in the Oral Substudy were regularly (on average every 7 months) examined for AIDS symptoms, the number of copies of
the HIV RNA virus (viral load) and CD4 T-lymphocyte counts per ml of blood. Additionally, the presence of one of three oral lesion markers, oral candidiasis, hairy leukoplakia and angular cheilitis, was checked. The average follow-up time was 41 months and the maximal follow-up time was 84 months. For each woman, the time of seroconversion (HIV infection) was externally estimated and assumed to be known. Clinical AIDS diagnoses were self-reported in 73.5% of cases, presumptive or definite in 17.5%, and indeterminate in 9%; the case definition did not depend on CD4 T-lymphocytes. For 66 women the onset of AIDS was interval-censored, while for 158 women it was right-censored. For HIV-positive people, it is of interest to describe the distribution of the time to the onset of an AIDS-related illness based on some measured quantities. We examine in Section 7.6 how the classical predictors like viral load and CD4 T-cell counts together with the oral lesion markers can be used in describing this distribution.

1.4 Perioperative Chemotherapy in Early Breast Cancer Patients (EBCP)

To investigate whether a short intensive course of perioperative chemotherapy can change the course of early breast cancer compared to surgery alone, the European Organization for Research and Treatment of Cancer (EORTC) conducted a multicenter randomized clinical trial (EORTC Trial 10854). From 1986 to 1991, a total of 2 793 women with early breast cancer were randomized to receive either one perioperative course of an anthracycline-containing chemotherapeutic regimen within 24 h after surgery (n = 1 398) or surgery alone (n = 1 395). See Clahsen et al. (1996) for more details on the trial. Patients were followed up for several endpoints; however, we concentrate on the progression-free survival (PFS) time. The mean follow-up time was 8.15 years with a maximum of 14.13 years. Other factors that may influence the PFS time include:

• Category of the age of the patient: <40 years (n = 321), 40–50 years (n = 796), >50 years (n = 1 676);
• Type of surgery: mastectomy (n = 1 231), breast-conserving surgery (n = 1 542), missing data for n = 20 patients;
• Category of the tumor size: <2 cm (n = 823), ≥2 cm (n = 1 915), missing data for n = 55 patients;
• Pathological nodal status: negative (n = 1 467), positive (n = 1 303), missing data for n = 23 patients;
• Presence of other disease: no (n = 2 542), yes (n = 234), missing data for n = 19 patients.

The trial was conducted in 14 centers located in 5 geographical regions (the Netherlands, Poland, France, South of Europe and South Africa). Figure 1.3 shows Kaplan-Meier estimates of the PFS survival functions for the treatment and control groups, separately for each center. Obviously, there is a huge heterogeneity among the centers. Not only does the overall proportion of progression-free patients at fixed time points differ from center to center, but the effect of treatment on PFS also seems to vary across centers, both quantitatively and qualitatively. Models that measure the effect of covariates and that allow modelling of the heterogeneity between centers will be considered in Chapters 8 and 9.
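Kaplan-Meier curves of the kind shown in Figure 1.3 can be computed with the survfit function of the R survival package. The sketch below is only an illustration: the data frame ebcp and its column names (time, status, treat, center) are hypothetical placeholders, not the actual EORTC database.

library("survival")

## Hypothetical data frame `ebcp`:
##   time   - progression-free survival time in days
##   status - 1 = progression or death observed, 0 = right-censored
##   treat  - treatment arm ("chemotherapy" or "surgery alone")
##   center - institution code (e.g. 11, 12, ..., 51)

## Kaplan-Meier estimates of the PFS distribution by treatment arm,
## computed separately within each center
km.per.center <- lapply(split(ebcp, ebcp$center), function(d)
    survfit(Surv(time, status) ~ treat, data = d))

## One panel in the style of Figure 1.3 (center 11)
plot(km.per.center[["11"]], lty = c(1, 4),
     xlab = "Days", ylab = "Progression-free survival")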
[Figure 1.3: EBCP Data. Kaplan-Meier estimates of the PFS time distribution separately for each institution; panels: the Netherlands (11, 12, 13), Poland (21, 22), France (31, 32, 33, 34), South Europe (41, 42, 43, 44) and South Africa (51), with per-center sample sizes between n = 25 and n = 902. Solid line: treatment arm, dotted-dashed line: control arm.]

Chapter 2

Basic Notions

In this chapter we introduce some notation that will be used throughout the thesis and explain in more detail some basic notions like the types and mechanisms of censoring considered.

2.1 Right, left and interval censoring

Let $T_{i,l}$, $i = 1, \ldots, N$, $l = 1, \ldots, n_i$ be the exact event time for the $l$th observational unit of the $i$th cluster. It will be assumed throughout the thesis that $T_{i,l}$ is a nonnegative random variable with a continuous distribution with some density $p_{i,l}(t)$ which might depend on a vector of covariates, e.g., $x_{i,l} = (x_{i,l,1}, \ldots, x_{i,l,m})'$. The time $T_{i,l}$ can either be known exactly or only in a coarsened manner, in which case it is called censored. Suppose first that knowing whether the event occurred or not requires a detailed examination (visit to a dentist, laboratory assessment) executed at pre-planned visits. Then it is only known that the event time occurred after, say $t^L_{i,l}$, and before, say $t^U_{i,l}$. According to the context, we either know $t^L_{i,l} < T_{i,l} \leq t^U_{i,l}$, $t^L_{i,l} \leq T_{i,l} < t^U_{i,l}$, $t^L_{i,l} \leq T_{i,l} \leq t^U_{i,l}$, or $t^L_{i,l} < T_{i,l} < t^U_{i,l}$. Thus, the true event time $T_{i,l}$ is known to lie in the interval whose lower and upper limits are equal to $t^L_{i,l}$ and $t^U_{i,l}$, respectively, and the observation is called interval-censored. Note that all methods presented in Part II of the thesis lead to the same results irrespective of whether the interval is closed, open or half open. To cover all these situations we will write $T_{i,l} \in \lfloor t^L_{i,l}, t^U_{i,l} \rfloor$.

With the same notation, right-censored observations are obtained by setting $t^U_{i,l} = \infty$ and $t^L_{i,l}$ equal to the time the subject was last seen before leaving the study or before the study was terminated. Similarly, a left-censored observation is obtained with $t^L_{i,l} = 0$ and $t^U_{i,l}$ equal to the first time the subject was seen after the event. Finally, an exactly observed time $t_{i,l}$ is recorded with $t^L_{i,l} = t^U_{i,l} = t_{i,l}$. Below, a censoring indicator $\delta_{i,l}$ is used, which will be equal to 0 for right-censored, 1 for exactly observed, 2 for left-censored and 3 for interval-censored observations, respectively.
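This 0/1/2/3 coding coincides with the "interval" coding used by the Surv function of the R survival package, so censored observations of all four types can be stored in a single object. A minimal sketch with made-up times:

library("survival")

## One artificial observation of each type, coded as the censoring indicator delta:
## 0 = right-censored, 1 = exactly observed, 2 = left-censored, 3 = interval-censored
t1    <- c(5.0, 2.3, 4.0, 1.5)  # last visit (right), exact time, first visit after
                                # the event (left), lower limit t^L (interval)
t2    <- c(NA,  NA,  NA,  3.5)  # upper limit t^U, only used for the interval-censored record
delta <- c(0,   1,   2,   3)

y <- Surv(time = t1, time2 = t2, event = delta, type = "interval")
y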
2.2 Doubly interval censoring

Suppose that the event time $T_{i,l}$ is obtained as the difference of two random variables: $V_{i,l}$, here always called the failure time, and $U_{i,l}$, here always called the onset time, i.e. $T_{i,l} = V_{i,l} - U_{i,l}$. The pair $U_{i,l}$, $V_{i,l}$ can be, for example, the emergence time of a tooth and the onset time of caries of that tooth. Doubly interval censoring is obtained in situations when $U_{i,l}$ and/or $V_{i,l}$ are interval-censored and it is only known that $U_{i,l} \in \lfloor u^L_{i,l}, u^U_{i,l} \rfloor$ and $V_{i,l} \in \lfloor v^L_{i,l}, v^U_{i,l} \rfloor$. A scheme of a typical doubly-interval-censored observation is given in Figure 2.1, and an example is given by the Signal Tandmobiel® data of Section 1.1 with $U_{i,l}$ being the emergence time of the $l$th tooth of the $i$th child and $V_{i,l}$ being the time when the same tooth is attacked by caries for the first time. In the following, we omit the subscript $(i,l)$ from all expressions if it is not necessary to make an explicit distinction among different observations of one data set, or use only a single subscript $i$ if we do not deal with multivariate survival data.

[Figure 2.1: Doubly interval censoring. A scheme of a doubly-interval-censored observation obtained by performing examinations to check the event status at times $s_{i,l,1}, \ldots, s_{i,l,6}$. The onset time is left-censored at time $u^U_{i,l} = s_{i,l,1}$ (i.e. interval-censored in the interval $\lfloor u^L_{i,l}, u^U_{i,l} \rfloor = \lfloor 0, s_{i,l,1} \rfloor$), the failure time is interval-censored in the interval $\lfloor v^L_{i,l}, v^U_{i,l} \rfloor = \lfloor s_{i,l,5}, s_{i,l,6} \rfloor$.]

2.3 Density, survival, hazard and cumulative hazard functions

A continuous distribution of an event time $T$ is uniquely determined by its density $p(t)$. Equivalently, the distribution of $T$ is determined by a nonincreasing right-continuous survival function $S(t)$ defined as the probability that $T$ exceeds a value $t$ in its range, i.e.
\[
  S(t) = \Pr(T > t) = \int_t^\infty p(s)\,ds.
\]
Another possibility is to specify the hazard function $\hbar(t)$ which gives the instantaneous rate at which an event occurs for an item that is still at risk for the event at time $t$, i.e.
\[
  \hbar(t) = \lim_{\Delta t \to 0+} \frac{\Pr(t \leq T < t + \Delta t \mid T \geq t)}{\Delta t}
           = \Pr\bigl(T \in N_t(dt) \mid T \geq t\bigr),
\]
where $N_t(dt) = [t, t + dt)$. The density and the survival function can be computed from the hazard function using the following relationships:
\[
  p(t) = \hbar(t) \exp\bigl\{-H(t)\bigr\}, \qquad
  S(t) = \exp\bigl\{-H(t)\bigr\},
\]
where $H(t) = \int_0^t \hbar(s)\,ds$ is the cumulative hazard function.

2.4 Independent noninformative censoring and simplified likelihood

Throughout the thesis we will assume independent noninformative censoring in the terminology of Kalbfleisch and Prentice (2002). In this section, we explain this concept first in the framework of right-censored data and then extend it to the area of interval-censored data. Finally, we introduce the term simplified likelihood and remark that it can be used for inference with censored data under the assumption of independent noninformative censoring.

2.4.1 Right-censored data

Kalbfleisch and Prentice (2002) introduce the concept of independent noninformative censoring in the context of right-censored data in the following way. Let $C$ denote the random variable causing the censoring. That is, instead of observing the event time $T$ we only observe $X = \min(T, C)$ and $\delta = I[T \leq C]$.
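To make the observables $X = \min(T, C)$ and $\delta = I[T \leq C]$ concrete, the following small R simulation generates independently right-censored data; the Weibull event-time distribution and the uniform censoring distribution are arbitrary choices used purely for illustration.

set.seed(1)
n <- 200

## Event times T from a Weibull distribution with shape 1.5 and scale 3,
## i.e. hazard h(t) = (1.5 / 3) * (t / 3)^0.5
T.true <- rweibull(n, shape = 1.5, scale = 3)

## Censoring times C uniform on (0, 6), generated independently of T
## (hence the censoring is independent and noninformative)
C <- runif(n, 0, 6)

## Observed data: follow-up time X and censoring indicator delta
X     <- pmin(T.true, C)
delta <- as.numeric(T.true <= C)

mean(delta)   # observed proportion of uncensored event times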
Independent censoring

They call the censoring mechanism independent when the hazard which applies to the censored population is, at each time point, the same as the hazard which would apply had there been no censoring. That is, the hazard functions have to satisfy
\[
  \Pr\bigl(T \in N_t(dt) \mid C \geq t,\; T \geq t\bigr) = \Pr\bigl(T \in N_t(dt) \mid T \geq t\bigr) \qquad (2.1)
\]
for any $t > 0$. Note that independence of the random variables $T$ and $C$ implies that condition (2.1) is satisfied. However, $T$ and $C$ are not necessarily independent when condition (2.1) is fulfilled. Further, Kalbfleisch and MacKay (1979) proved that condition (2.1) is equivalent to the so-called constant-sum condition
\[
  \Pr\bigl(\delta = 1 \mid T \in N_t(dt)\bigr)
  + \int_0^t \Pr\bigl(C \in N_x(dx),\; \delta = 0 \mid T \geq x\bigr) = 1 \qquad (2.2)
\]
for any $t > 0$, introduced by Williams and Lagakos (1977). The term $\Pr(\delta = 1 \mid T \in N_t(dt))$ can be interpreted as the probability that a subject who would fail at time $t$ is actually observed to fail, and the term $\Pr(C \in N_x(dx),\, \delta = 0 \mid T \geq x)$ has the meaning that a subject who survives at least $x$ time units is censored at time $x$. To relate condition (2.2) to its interval-censored version, which will be introduced in the following section, we rewrite it in the form
\[
  \Pr\bigl(\delta = 1 \mid T \in N_t(dt)\bigr)
  + \int_0^t \frac{\Pr\bigl(C \in N_x(dx),\; T \in [x, \infty),\; \delta = 0\bigr)}{\Pr\bigl(T \in [x, \infty)\bigr)} = 1. \qquad (2.3)
\]

Noninformative censoring

Kalbfleisch and Prentice (2002) further call the censoring mechanism noninformative if the censoring random variable $C$ does not depend on any parameters used to model the distribution of the event time $T$. In other words, with independent noninformative censoring, the censoring procedure or rules may depend arbitrarily during the course of the study on:

• previous event times of other subjects in the study;
• previous censoring times of other subjects in the study;
• random mechanisms external to the study;
• values of covariates possibly included in the model;

but must not contain any information on the parameters used to model the event time. Independent noninformative censoring includes type I censoring. In this case, censoring can only happen at a pre-planned calendar time. This censoring scheme has been used for the CGD data introduced in Section 1.2 and for the EBCP data of Section 1.4.

2.4.2 Interval-censored data

Consider now the case of interval-censored data where the observed intervals are generated by a triplet $(T^L, T^U, T)'$. That is, we observe an interval $\lfloor t^L, t^U \rfloor$ if $T^L = t^L$, $T^U = t^U$ and $T \in \lfloor T^L, T^U \rfloor$. Note that since the observed interval $\lfloor T^L, T^U \rfloor$ must contain the event time $T$, the support of the random vector $(T^L, T^U, T)'$ is equal to
\[
  \bigl\{ (t^L, t^U, t) : 0 \leq t^L \leq t \leq t^U \leq \infty \bigr\}.
\]
Oller, Gómez, and Calle (2004) show that the interval-censored counterpart of the constant-sum condition (2.3) is given by
\[
  \iint_{(t^L,\, t^U):\; t \in \lfloor t^L, t^U \rfloor}
  \frac{\Pr\bigl(T^L \in N_{t^L}(dt^L),\; T^U \in N_{t^U}(dt^U),\; T \in \lfloor t^L, t^U \rfloor\bigr)}
       {\Pr\bigl(T \in \lfloor t^L, t^U \rfloor\bigr)} = 1 \qquad (2.4)
\]
for all $t > 0$. Further, they introduce the term noninformative condition and show that it is stronger than the constant-sum condition (2.4). It should be pointed out that Oller et al. use the term "noninformative" in a different context than Kalbfleisch and Prentice (2002), whose meaning of this word is adopted in this thesis. In summary, we will call the interval censoring independent if it satisfies the constant-sum condition (2.4) and noninformative if the distribution of the censoring random variables $T^L$ and $T^U$ does not depend on the parameters used to model the distribution of the event time $T$.
A typical example of independent noninformative interval censoring can be found in the Signal Tandmobiel® data (Section 1.1) and in the WIHS data (Section 1.3). In both cases even the stronger condition of independence of $T$ and $(T^L, T^U)'$ is satisfied. Indeed, both the dental examinations and the check-ups of the AIDS status were performed at pre-planned time points and are thus external to the studied event time. Note that interval censoring would not be independent when the event induces an examination, namely when a child visits the dentist because of a decayed tooth.

2.4.3 Simplified likelihood for interval-censored data

We explain in Chapter 4 that the likelihood is the cornerstone for inference on the event time $T$. Strictly speaking, with interval-censored data, the likelihood contribution is given by the density of the observables, i.e. by the density of the vector $(T^L, T^U)'$ whose support is such that $T \in \lfloor T^L, T^U \rfloor$ with probability one. That is, the likelihood contribution of the observed $\lfloor t^L, t^U \rfloor$ is given by
\[
  L_{full} = \Pr\bigl(T^L \in N_{t^L}(dt^L),\; T^U \in N_{t^U}(dt^U),\; T \in \lfloor t^L, t^U \rfloor\bigr).
\]
However, it is shown in Oller et al. (2004) that under the assumption of independent noninformative censoring, the likelihood contribution $L_{full}$ is proportional to the so-called simplified likelihood contribution
\[
  L = \Pr\bigl(T \in \lfloor t^L, t^U \rfloor\bigr),
\]
where the possible randomness of $T^L$ and $T^U$ is ignored. Consequently, inference on the event time $T$ can be based on this simplified likelihood. In the remainder of the thesis, we will use the simplified likelihood for the inference and omit the word 'simplified' for clarity.

Chapter 3

An Overview of Regression Models for Survival Data

Two regression models dominate survival analysis when it comes to describing the dependence of the distribution of the event time $T$ on covariates, say $x = (x_1, \ldots, x_m)'$: (a) the proportional hazards (PH) model and (b) the accelerated failure time (AFT) model. In this chapter, we introduce these two models, compare them and show how they can be extended to handle multivariate survival data. We also review these models for the analysis of right-censored data, however with an emphasis on the AFT model. For methods that allow interval- or doubly-interval-censored data we refer to Chapter 5.

3.1 Proportional hazards model

This model, introduced by Cox (1972), specifies that, for a given covariate vector $x$, the hazard function is expressed as the product of an unspecified baseline hazard function $\hbar_0(t)$ and the exponential of a linear function of the covariates, i.e.
\[
  \hbar(t \mid x) = \hbar_0(t) \exp(\beta' x). \qquad (3.1)
\]
The regression parameter vector $\beta$ is estimated by maximizing a partial likelihood (Cox, 1975) which treats $\hbar_0$ as a nuisance and does not estimate it. However, when the baseline hazard $\hbar_0$ is of interest as well, e.g. for prediction purposes, its non-parametric estimate can be obtained using the method of Breslow (1974). The survival function for a subject with covariates $x$, $S(\cdot \mid x)$, is related to the baseline survival function $S_0$ by the relationship
\[
  S(t \mid x) = \bigl\{S_0(t)\bigr\}^{\exp(\beta' x)}.
\]
An exhaustive treatment of the PH model and its extensions can be found, e.g., in Therneau and Grambsch (2000) or Kalbfleisch and Prentice (2002, Chapter 4). Software to fit the PH model using the method of maximum partial likelihood, together with possibilities to compute residuals, draw diagnostic plots or assess goodness of fit, is available in most modern statistical packages, e.g. the
function coxph in R/S-plus or the procedure PHREG in SAS.

3.2 Accelerated failure time model

The accelerated failure time model is a useful, although less frequently used, alternative to the PH model. For this model, the effect of a covariate implies on average an acceleration or deceleration of the event time. For a vector of covariates $x$ the effect is expressed by the parameter vector $\beta$ in the following way:
\[
  T = \exp(\beta' x)\, T_0,
\]
where $T_0$ is a baseline survival time. On the logarithmic scale, this model becomes a simple linear regression model
\[
  \log(T) = \beta' x + \varepsilon, \qquad (3.2)
\]
with $\varepsilon = \log(T_0)$. The hazard and survival functions of a subject with covariate vector $x$ are related to the baseline hazard ($\hbar_0$) and survival function ($S_0$) by the relationships
\[
  \hbar(t \mid x) = \hbar_0\bigl\{\exp(-\beta' x)\, t\bigr\} \exp(-\beta' x), \qquad
  S(t \mid x) = S_0\bigl\{\exp(-\beta' x)\, t\bigr\}. \qquad (3.3)
\]
Usually one assumes that the error random variable $\varepsilon$ has a density $g_\varepsilon(\varepsilon)$ from a location-scale family, i.e. $g_\varepsilon(\varepsilon) = \tau^{-1} g^*_\varepsilon\bigl\{\tau^{-1}(\varepsilon - \alpha)\bigr\}$, where $g^*_\varepsilon(\cdot)$ has location parameter 0 and scale parameter 1. The location parameter $\alpha$ and the scale parameter $\tau$ have to be estimated from the data, as well as the regression parameter $\beta$.

A parametric AFT model assumes that $g^*_\varepsilon(\cdot)$ is a density of a specific type (e.g. Gaussian, logistic or Gumbel). In that case, the parameters $\alpha$, $\tau$ and $\beta$ can easily be estimated using the method of maximum likelihood. However, the parametric assumptions evidently affect the shape and character of the resulting survival or hazard curves, which, in the case of an incorrect specification, is undesirable, especially when prediction is of interest. On the other hand, semi-parametric procedures for the AFT model leave the density $g_\varepsilon(\varepsilon)$ unspecified and provide only an estimate of the regression parameter vector $\beta$. In the past, primarily two semi-parametric methods for the AFT model with right-censored data have been examined. The first one is based on the generalization of the least squares method to censored data, first proposed by Miller (1976) and, in a different manner, by Buckley and James (1979), giving their names to this approach. A slight modification of the Buckley-James estimator and its asymptotic properties was given by Lai and Ying (1991). However, a drawback of the Buckley-James method is that it may fail to converge or may oscillate between several solutions. The second approach is based on linear rank tests for censored data and was developed by Prentice (1978), Gill (1980), and Louis (1981) in the case of one covariate. Tsiatis (1990) extended the method to the multiple regression context. The asymptotic equivalence of the Buckley-James method and the linear-rank-test-based estimators has been pointed out by Ritov (1990). The asymptotic properties of the linear-rank-test-based estimators were presented in greatest generality by Ying (1993). In contrast to the partial likelihood method for the PH model, the numerical aspect of the linear-rank-test-based estimation of the regression parameters of the AFT model can be computationally cumbersome. Only recently, Jin et al. (2003) suggested an algorithm to compute this estimate using a linear programming technique. They also provide an S-plus function. Further, there seems to exist no non-parametric method to estimate the baseline survival distribution like the method of Breslow (1974) for the PH model. Consequently, the semi-parametric procedures cannot be used when prediction is of interest.
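As an illustration of the parametric route, the sketch below fits a Weibull AFT model with the survreg function of the R survival package and converts the fit into a predicted survival curve; the data frame dat with variables time, status and x is a hypothetical placeholder standing in for any right-censored data set.

library("survival")

## Hypothetical right-censored data set with follow-up time, event indicator
## and a single covariate x:
## dat <- data.frame(time = ..., status = ..., x = ...)

## Weibull AFT model: log(T) = alpha + beta * x + tau * eps, eps ~ standard Gumbel
fit <- survreg(Surv(time, status) ~ x, data = dat, dist = "weibull")
summary(fit)

## Predicted survival curve for a subject with x = 1; for the Weibull fit,
## S(t | x) = 1 - pweibull(t, shape = 1 / tau, scale = exp(alpha + beta * x))
lp    <- predict(fit, newdata = data.frame(x = 1), type = "lp")
tgrid <- seq(0, 10, length.out = 101)
Shat  <- 1 - pweibull(tgrid, shape = 1 / fit$scale, scale = exp(lp))
plot(tgrid, Shat, type = "l", xlab = "t", ylab = "S(t | x = 1)")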
3.3 Accelerated failure time model versus proportional hazards model

Both the PH and the AFT model make an explicit assumption about the effect of covariates on the hazard function. The effect of covariates on the hazard function is given by (3.1) in the PH model and by (3.3) in the AFT model. The different assumed effect of a covariate on the baseline hazard for the PH and the AFT model is exemplified in Figure 3.1. It is seen that, as in the PH model, in the AFT model the effect of covariates on the baseline hazard function is multiplicative, but for the AFT model an acceleration or deceleration of the time scale is additionally seen. Also, in the AFT model the hazard is increased for β < 0, whereas in the PH model it is increased for β > 0.

We point out (see Kalbfleisch and Prentice, 2002, Section 2.3.4) that the PH model and the AFT model are equivalent if and only if the distribution of the standardized error term ε* = τ^{-1}(ε − α) in the AFT model (3.2) is the Gumbel distribution (the extreme value distribution of a minimum), i.e. when

    gε*(ε*) = exp{ε* − exp(ε*)}.

In that case, the distribution of the baseline survival time T0 is Weibull and the baseline hazard function ℏ0(t) has the form ℏ0(t) = λγ(λt)^{γ−1}, where λ = exp(−α) and γ = τ^{-1}.

Further, it is not always possible (e.g. due to a lack of knowledge) to include all relevant covariates in the model. One of the advantages of the AFT model is that the regression parameters of the included covariates do not change when other, important, covariates are omitted. Of course, the neglected covariates have an impact on the distribution of the error term ε in (3.2), which is typically changed into one with larger variability. Such a change, however, is of no major importance (except that it influences the precision with which the regression parameters of the included covariates are estimated) when semi-parametric methods or methods with a flexible distribution for ε are used. Unfortunately, a similar property does not hold for the PH model; see Hougaard (1999) for a more detailed discussion.

The fact that only parametric AFT models are implemented in major statistical packages, together with the computational difficulties associated with the semi-parametric AFT model, may explain why the PH model became far more popular in practice than the AFT model. See Nardi and Schemper (2003) for a comparison of the PH model and parametric AFT models. Still, the property that the AFT model postulates a direct relationship between failure time and covariates led Sir David Cox (see Reid, 1994) to remark that "accelerated life models are in many ways more appealing" than the proportional hazards model "because of their quite direct physical interpretation." Indeed, in the AFT model the regression coefficients indicate directly how the time, a quantity understandable also to non-statisticians, is increased or decreased, whereas in the PH model the regression acts directly on the hazard, which may be more difficult for practitioners to interpret.
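The Weibull/Gumbel correspondence described above can be checked numerically. The hedged R sketch below simulates Weibull event times and compares the PH coefficient implied by a Weibull AFT fit, namely −β_AFT/τ, with the coefficient estimated directly by coxph; all numerical values are arbitrary.

## Numerical check of the AFT/PH equivalence under Weibull event times:
## beta_PH should approximately equal -beta_AFT / tau (the survreg scale).
library(survival)
set.seed(3)

n   <- 5000
x   <- rbinom(n, 1, 0.5)
tau <- 0.5                                       # scale of the Gumbel error
Tt  <- exp(1 + 0.7 * x + tau * log(rexp(n)))     # log T = alpha + beta*x + tau*eps*
dat <- data.frame(time = Tt, status = 1, x = x)

fit.aft <- survreg(Surv(time, status) ~ x, data = dat, dist = "weibull")
fit.ph  <- coxph(Surv(time, status) ~ x, data = dat)

-coef(fit.aft)["x"] / fit.aft$scale   # implied PH coefficient from the AFT fit
coef(fit.ph)["x"]                     # PH coefficient estimated directly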
3.4 Regression models for multivariate survival data

Both the PH model and the AFT model can be extended to handle multivariate survival data. In this section, we briefly discuss one extension of the PH model and concentrate mainly on the multivariate versions of the AFT model that will serve as a basis for the developments presented in this thesis.

3.4.1 Frailty proportional hazards model

For multivariate survival data, a common extension of the PH model includes a cluster-specific random effect Zi, called the shared frailty, in the expression of the hazard function, i.e.

    ℏ(t | xi,l, Zi) = ℏ0(t) Zi exp(β′xi,l).    (3.4)

The frailty component Zi is most often assumed to have a parametric distribution such as a gamma or log-normal distribution. For more details, we refer to Aalen (1994), Hougaard (2000) and Therneau and Grambsch (2000), where available software is also described. Nevertheless, the model (3.4) is rather simple; e.g., in the analysis of a multicenter clinical trial only the center effect and not the center-by-treatment interaction can be controlled for. This drawback led to further developments mimicking the classical linear mixed model of Laird and Ware (1982) by assuming

    ℏ(t | xi,l, bi) = ℏ0(t) exp(β′xi,l + b′i zi,l),    (3.5)

where zi,l = (zi,l,1, ..., zi,l,q)′ is an additional vector of covariates and bi = (bi,1, ..., bi,q)′ is a cluster-specific random effect which is again usually assumed to follow a parametric distribution, most often multivariate normal. Such a model is considered, e.g., by Vaida and Xu (2000). Note that the model (3.4) is a special case of (3.5) with zi,l ≡ 1 and Zi ≡ exp(bi).

[Figure 3.1: Effect of the PH and AFT assumption on a hypothetical baseline hazard function (solid line) for a univariate covariate x taking a value of 0.6 (dashed line) and 1.2 (dotted line), with regression parameter β = −0.5 for the PH model and β = 0.5 for the AFT model.]

Besides the fact that the frailty PH model is not, similarly to the basic PH model, robust towards neglected covariates, it has another important drawback. Indeed, for most frailty distributions, the marginal hazard function obtained from (3.4) by integrating out Zi is no longer proportional with respect to the covariates xi,l. Moreover, the form in which the covariate vector xi,l modifies the marginal baseline hazard function depends on the assumed frailty distribution. Consequently, the estimates of the regression parameters β can be highly sensitive to the choice of the frailty distribution; see Hougaard (2000, Chapter 7) for more details.

3.4.2 Population averaged accelerated failure time model

A natural extension of the basic AFT model allowing for multivariate data breaks down the assumption of i.i.d. error terms ε in the model expression (3.2) by assuming

    log(Ti,l) = β′xi,l + εi,l,    i = 1, ..., N,  l = 1, ..., ni,    (3.6)

with εi = (εi,1, ..., εi,ni)′, i = 1, ..., N, being independent random vectors, each with a multivariate density gε,i(εi). Such a model is often called population-averaged (PA) or marginal. When all clusters are of the same size, i.e. when ni = n for all i, it is usually assumed that the random error vectors εi, i = 1, ..., N, are i.i.d. with a multivariate density gε(ε). The main disadvantage of the PA model is that it is designed only to account for within-cluster dependencies, and consequently structured modelling of these dependencies is rather unnatural.
Early semi-parametric approaches to the population averaged AFT model (3.6) with right-censored data are given by Lin and Wei (1992) and Lee, Wei, and Ying (1993) and are directed mainly towards the estimation of the regression parameter β. They use the following estimation strategy. In the first step, the correlation is ignored and the regression coefficient β is estimated using one of the semi-parametric approaches for uncorrelated censored data outlined in Section 3.2 (the Buckley-James estimator or the censored-data linear-rank-test-based estimator). In the second step, the standard errors of the estimate are corrected using a GEE approach (Liang and Zeger, 1986). However, ignoring the dependence in the estimation step generally does not take full advantage of the information in the data and is likely not to be efficient. For that reason, Pan and Kooperberg (1999) suggest, in the case of bivariate survival data, i.e. ni = 2 for all i = 1, ..., N, methods that account for the within-cluster correlation already in the estimation step. Briefly, their method iterates between (a) estimating the joint bivariate distribution of (εi,1, εi,2)′ using the bivariate log-spline density estimate of Kooperberg (1998), (b) multiple imputation (Wei and Tanner, 1991) of the censored observations, and (c) estimating the regression parameter β using either ordinary or generalized least squares. Note that this procedure can be considered a generalization of the basic Buckley-James estimator, for which in step (a) the Kaplan-Meier estimator of the survival distribution is used while ignoring the correlation and in step (b) a simple imputation using conditional expectations is employed. Finally, Pan and Connett (2001) present an approach that, to some extent, combines the methods of Lee et al. (1993) and Pan and Kooperberg (1999). It iterates between (a) estimating the marginal distribution of εi,l using the Kaplan-Meier estimate while ignoring the dependencies, (b) multiple imputation of the censored observations, and (c) GEE estimation of the regression parameter β using a general working correlation matrix.

3.4.3 Cluster specific accelerated failure time model

Another extension of the AFT model for multivariate data adds, similarly to the frailty PH model and analogously to the classical linear mixed model of Laird and Ware (1982), a cluster-specific random effect vector bi = (bi,1, ..., bi,q)′ combined with a vector of covariates zi,l = (zi,l,1, ..., zi,l,q)′ to the model expression, i.e.

    log(Ti,l) = β′xi,l + b′i zi,l + εi,l,    i = 1, ..., N,  l = 1, ..., ni.    (3.7)

The random effect vectors bi, i = 1, ..., N, are assumed to be i.i.d. with some (multivariate) density gb(b); the random error terms εi,l, i = 1, ..., N, l = 1, ..., ni, are assumed to be i.i.d. with some density gε(ε) and independent of the random effects. Besides the term cluster-specific (CS), the model (3.7) is sometimes called conditional, since the distribution of the event time Ti,l is modelled conditionally, given the cluster-specific characteristic bi.

In the literature, Pan and Louis (2000) and Pan and Connett (2001) consider model (3.7) with a univariate random effect bi and zi,l ≡ 1 for all i and l.
Their estimation procedure iterates between (a) estimating the distribution of the independent error terms εi,l using the Kaplan-Meier estimator, (b) multiple imputation of the censored times, and (c) a Monte Carlo EM algorithm of Wei and Tanner (1990) in Pan and Louis (2000), or restricted maximum likelihood in Pan and Connett (2001), to estimate the regression parameter β.

Observe that, in contrast to the frailty PH model, in the cluster-specific AFT model the meaning of the regression parameters β is the same conditionally given bi as well as marginally. Indeed, when the random effects bi, i = 1, ..., N, are integrated out from model (3.7), we obtain the model (3.6) with the only change being in the error distribution, which is given as an appropriate convolution.

3.4.4 Population averaged model versus cluster specific model

Compared to the PA model, the CS model not only allows for structured modelling of within-cluster dependencies but is also often preferred due to a clear decomposition of the sources of variability and a more natural interpretation of the regression parameters; see Lindsey and Lambert (1998) and Lee and Nelder (2004) for more details. However, the PA model is more general in the following sense. The CS model is specified hierarchically and always implies a particular PA model when the random effects are integrated out. On the other hand, the same PA model can correspond to several, very different CS models. Moreover, under the most common assumptions, i.e. when the error terms εi,l, i = 1, ..., N, l = 1, ..., ni, in the CS model are i.i.d., the random effects bi, i = 1, ..., N, in the CS model are i.i.d. and independent of the errors, and the error term vectors εi, i = 1, ..., N, in the PA model are i.i.d., the PA model leads to a more general covariance structure than the CS model. To illustrate this, consider the CS model (3.7) with a random intercept only, i.e. zi,l ≡ 1. Let var(εi,l) = σε² and var(bi) = σb², i = 1, ..., N. Such a model implies a covariance matrix for the log-event-times vector (log(Ti,1), ..., log(Ti,ni))′ which is of the compound symmetry type, i.e.

    var( (log(Ti,1), ..., log(Ti,ni))′ ) =
        | σε² + σb²     ...     σb²       |
        |     ...       ...     ...       |
        | σb²           ...     σε² + σb² |

That is, the variance is necessarily the same for all observations within a cluster and the correlation between two observations is the same for all pairs within a cluster. On the other hand, with the PA model (3.6) both the variance and the correlation are allowed to vary across the cluster, as usually an unstructured covariance matrix for the error terms vector εi, and subsequently also for the log-event-times vector, is assumed.
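As a small, hedged illustration of the two covariance structures just contrasted, the R sketch below constructs the compound-symmetry matrix implied by a random-intercept CS model and an arbitrary unstructured PA covariance matrix; all numerical values are illustrative only.

## Covariance of (log T_{i,1}, ..., log T_{i,n}) implied by a random-intercept CS model:
## sigma.b^2 everywhere, plus sigma.eps^2 added on the diagonal.  Illustrative values.
n         <- 3
sigma.eps <- 0.4       # sd of the error terms eps_{i,l}
sigma.b   <- 0.3       # sd of the random intercept b_i

V.cs <- matrix(sigma.b^2, nrow = n, ncol = n)
diag(V.cs) <- sigma.eps^2 + sigma.b^2
V.cs                   # compound symmetry: equal variances, equal covariances
cov2cor(V.cs)          # a single common within-cluster correlation

## A PA model with an unstructured error covariance is not restricted in this way:
V.pa <- matrix(c(0.25, 0.10, 0.02,
                 0.10, 0.36, 0.15,
                 0.02, 0.15, 0.49), nrow = n, byrow = TRUE)
cov2cor(V.pa)          # variances and correlations may differ within the cluster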
Chapter 4: Frequentist and Bayesian Inference

Both the PH and the AFT model determine a probabilistic mechanism that leads to survival data. The mechanism depends further on a vector of unknown parameters, denoted by θ, which represents the relevant information we wish to extract from the observed data. For example, for the AFT model (3.2), the vector θ is equal to (α, β′, τ)′ and the probabilistic mechanism is given by equation (3.2) together with the specification of the density of the error term ε. The assumed probabilistic mechanism together with the observed data determines the likelihood function, L(θ), which is the cornerstone for drawing inference about the unknown parameter vector θ.

Two major paradigms exist in statistics for how to use the likelihood to draw inference about θ, namely the frequentist and the Bayesian paradigm. In the classical frequentist point of view, the data are assumed to be a random sample generated by the random mechanism controlled by θ, which is unknown but fixed. Several methods exist to estimate the true value of the parameter θ, maximum likelihood (ML) being one of the most popular. The ML estimator, θ̂, maximizes the likelihood function over the set Θ of admissible θ values (the parameter space). Hypotheses about the parameter vector θ can be tested and the accuracy of the estimates can be assessed by calculating confidence intervals. See, e.g., Cox and Hinkley (1974, Chapter 9) or Lehmann and Casella (1998, Chapter 6) for more details on ML estimation.

In Bayesian statistics, both the data and the parameter vector θ are treated as random variables. Besides the probabilistic model to generate the data, a prior distribution p(θ) must be specified for the model parameters. Inference is then based on the posterior distribution p(θ | data) of the parameters given the data, which is calculated using Bayes' rule:

    p(θ | data) = L(θ) p(θ) / ∫_Θ L(θ*) p(θ*) dθ*  ∝  L(θ) p(θ).    (4.1)

As a point estimate of θ, the posterior expectation, median or mode can be used. The uncertainty about the model parameters can be expressed using credible intervals constructed from the quantiles of the posterior distribution (see Section 4.6 for more details). For an extensive introduction to Bayesian statistics, see, e.g., Carlin and Louis (2000); Gelman et al. (2004).

4.1 Likelihood for interval-censored data

We saw that the likelihood plays a principal role in drawing inference about unknown model parameters. In this section, we discuss the general form of the likelihood, first for univariate interval-censored and doubly-interval-censored data. The multivariate case will be discussed in the following section. In this section, let Ti, i = 1, ..., N, be a set of independent event times, each with a density pi(t; θ). For instance, for the AFT model (3.2) the density pi(t; θ) is given by

    pi(t; θ) = (τ t)^{-1} gε*{ τ^{-1}(log t − α − β′xi) },

where xi is the covariate vector of the ith observation.

4.1.1 Interval-censored data

Let ⌊t^L_i, t^U_i⌋ be the observed intervals and δi the corresponding censoring indicators, with the same convention as in Section 2.1. Let the corresponding survival functions be denoted by Si(t; θ). The likelihood L(θ) is then the product of the individual likelihood contributions Li(θ), i.e. L(θ) = ∏_{i=1}^{N} Li(θ), where

    Li(θ) = ∫_{t^L_i}^{∞} pi(s; θ) ds = Si(t^L_i; θ),                     δi = 0 (right-censored),
    Li(θ) = pi(ti; θ),                                                    δi = 1 (exactly observed),
    Li(θ) = ∫_{0}^{t^U_i} pi(s; θ) ds = 1 − Si(t^U_i; θ),                 δi = 2 (left-censored),
    Li(θ) = ∫_{t^L_i}^{t^U_i} pi(s; θ) ds = Si(t^L_i; θ) − Si(t^U_i; θ),  δi = 3 (interval-censored).

This can briefly be written as

    Li(θ) = ⨍_{t^L_i}^{t^U_i} pi(s; θ) ds    (4.2)

if we make use of the notation

    ⨍_{τ^L}^{τ^U} p(s) ds = ∫_{τ^L}^{τ^U} p(s) ds   if τ^L < τ^U,
    ⨍_{τ^L}^{τ^U} p(s) ds = p(τ^L) = p(τ^U)         if τ^L = τ^U,    (4.3)

i.e. the integral disappears whenever the event time is exactly observed. Note that already for simple interval-censored data, the likelihood involves integration of the density.
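A minimal, hedged R sketch of the likelihood contribution (4.2) under a log-normal AFT model follows; the function name and all numerical values are hypothetical, and the censoring codes follow the convention used above.

## Likelihood contribution (4.2) of one observation under a log-normal AFT model,
## i.e. log T ~ N(alpha + beta'x, tau^2).  Censoring codes: delta = 0 (right),
## 1 (exact), 2 (left), 3 (interval).
lik.contrib <- function(tL, tU, delta, x, alpha, beta, tau) {
  mu <- alpha + sum(beta * x)                  # location on the log-time scale
  S  <- function(t) 1 - plnorm(t, meanlog = mu, sdlog = tau)
  switch(as.character(delta),
         "0" = S(tL),                                   # right-censored: S(tL)
         "1" = dlnorm(tL, meanlog = mu, sdlog = tau),   # exactly observed density
         "2" = 1 - S(tU),                               # left-censored: 1 - S(tU)
         "3" = S(tL) - S(tU))                           # interval-censored
}

## Example: an event known to lie between 8 and 9 years for a subject with x = 1
lik.contrib(tL = 8, tU = 9, delta = 3, x = 1, alpha = 2, beta = 0.1, tau = 0.2)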
4.1.2 Doubly-interval-censored data

Let ⌊u^L_i, u^U_i⌋, i = 1, ..., N, be the observed intervals for the onset time Ui and ⌊v^L_i, v^U_i⌋ the observed intervals for the failure time Vi, in the sense of Section 2.2. It is tempting to transform the observations into single intervals of the form ⌊t^L_i, t^U_i⌋ = ⌊v^L_i − u^U_i, v^U_i − u^L_i⌋ and then to use methods for simple interval-censored data with the likelihood (4.2). However, as pointed out by De Gruttola and Lagakos (1989), this approach is only valid if the onset time Ui is uniformly distributed and independent of the event time Ti. To write the likelihood contribution of each observation in the general case, a bivariate density of the event and onset times must be considered. Let qi(t, u; θ) be the density of the random vector (Ti, Ui)′, i = 1, ..., N. The likelihood contribution of the ith observation is then given by a double integral of the form

    Li(θ) = ⨍_{u^L_i}^{u^U_i} { ⨍_{v^L_i − u}^{v^U_i − u} qi(t, u; θ) dt } du.    (4.4)

Note that whenever the onset time Ui and/or the failure time Vi are exactly observed, one or both integrals disappear from formula (4.4). In most practical situations it can be assumed that, given the parameter vector θ, the onset and the event time are independent, i.e.

    qi(t, u; θ) = pi(t; θ) p^U_i(u; θ).    (4.5)

In the rest of this thesis we shall make use of assumption (4.5). The likelihood contribution of the ith subject can then be rewritten as

    Li(θ) = ⨍_{u^L_i}^{u^U_i} { ⨍_{v^L_i − u}^{v^U_i − u} pi(t; θ) dt } p^U_i(u; θ) du.    (4.6)

4.2 Likelihood for multivariate (doubly) interval-censored data

In the case of multivariate event times Ti,l, i = 1, ..., N, l = 1, ..., ni, observed as intervals ⌊t^L_{i,l}, t^U_{i,l}⌋, the likelihood contribution of the ith cluster equals

    Li(θ) = ⨍_{t^L_{i,1}}^{t^U_{i,1}} · · · ⨍_{t^L_{i,ni}}^{t^U_{i,ni}} pi(t1, ..., tni; θ) dtni · · · dt1,    (4.7)

where pi(t1, ..., tni; θ) is the density of (Ti,1, ..., Ti,ni)′ implied by the assumed model. When the population averaged AFT model introduced in Section 3.4.2 is assumed, pi(t1, ..., tni; θ) equals

    pi(t1, ..., tni; θ) = gε,i( log(t1) − β′xi,1, ..., log(tni) − β′xi,ni ) / (t1 · · · tni).    (4.8)

In the case of the cluster-specific AFT model described in Section 3.4.3, the density pi(t1, ..., tni; θ) becomes

    pi(t1, ..., tni; θ) = ∫_{R^q} ∏_{l=1}^{ni} [ gε( log(tl) − β′xi,l − b′i zi,l ) / tl ] gb(bi) dbi.    (4.9)

For doubly-interval-censored data, under assumption (4.5), the likelihood contribution of the ith cluster is obtained by an appropriate multivariate modification of the expression (4.6).
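To give a feel for expressions (4.7)-(4.8), the hedged R sketch below evaluates the cluster contribution in the convenient special case, assumed here only for illustration, where the PA AFT error vector is multivariate normal; on the log-time scale the contribution is then a rectangle probability, which mvtnorm::pmvnorm evaluates directly. Right-censored components can be handled by setting the upper limit to Inf; exactly observed components would require the density, cf. (4.3).

## Cluster likelihood contribution (4.7)-(4.8) of a PA AFT model when the error
## vector eps_i is multivariate normal (a special case assumed for illustration).
library(mvtnorm)

cluster.contrib <- function(tL, tU, X, beta, Sigma) {
  ## tL, tU : interval end-points for the n_i events of one cluster
  ## X      : n_i x m covariate matrix, beta : regression parameters
  ## Sigma  : covariance matrix of the multivariate normal error vector
  mu <- drop(X %*% beta)
  pmvnorm(lower = log(tL) - mu, upper = log(tU) - mu, sigma = Sigma)[1]
}

## Example: a cluster of two interval-censored events with correlated errors
Sigma <- matrix(c(0.09, 0.05, 0.05, 0.09), 2, 2)
cluster.contrib(tL = c(8.1, 8.4), tU = c(9.0, 9.3),
                X = matrix(c(1, 1), ncol = 1), beta = 2.1, Sigma = Sigma)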
4.3 Bayesian data augmentation

The computation of the likelihood for interval- and doubly-interval-censored data is rather involved, and the complexity increases further when multivariate survival data are introduced. Indeed, the maximum likelihood method involves multivariate integration combined with optimization of the likelihood, which quickly becomes intractable even for simple models. On the other hand, in Bayesian statistics, where the unknown parameter vector θ is assumed to be random and its posterior distribution p(θ | data) is used for inference, we are completely free to augment the vector of unknowns by arbitrary auxiliary variables, say ψ. Inference can then equally be based on the joint posterior distribution p(θ, ψ | data). Indeed, all (marginal) posterior characteristics of θ (mean, median, credible intervals) are the same regardless of whether they are computed from p(θ | data) or p(θ, ψ | data), since

    p(θ | data) = ∫ p(θ, ψ | data) dψ.

In the case of censored data, matters simplify considerably if the unknown true event times ti are explicitly considered to be a part of the vector of unknowns, i.e. ψ = (ti : i = 1, ..., N, ti is censored)′. Assume now that all observations are censored. In this situation, it is obvious that ψ (the uncensored augmented data) conveys more precise information about the model parameter θ than the censored data, which implies p(θ | ψ, data) = p(θ | ψ). The joint posterior distribution of θ and ψ then equals

    p(θ, ψ | data) = p(θ | ψ, data) p(ψ | data) = p(θ | ψ) p(ψ | data).    (4.10)

The two terms on the right-hand side of formula (4.10) are now easily computed. Indeed, p(θ | ψ) is the posterior distribution of θ if the uncensored data were available, i.e. p(θ | ψ) ∝ L_augm(θ) p(θ), where the likelihood L_augm of the uncensored augmented data is simply

    L_augm(θ) = ∏_{i=1}^{N} pi(ti; θ).

The second term on the right-hand side of formula (4.10), p(ψ | data), is, under the assumption of independent noninformative censoring, proportional to a product of indicator functions:

    p(ψ | data) ∝ ∏_{i=1}^{N} I[ ti ∈ ⌊t^L_i, t^U_i⌋ ].

A similar procedure can be applied to doubly-censored data. In that case, both the true onset times ui and the true event times ti, i = 1, ..., N, are augmented into the vector of unknowns. The situation where only a part of the data is censored is analogous, with only some change in notation. Finally, in the case of multivariate survival data and cluster-specific models, the integrals of the form (4.9) can easily be avoided by augmenting the vector of unknowns by the values of the random effects bi, i = 1, ..., N.

The idea of data augmentation was first introduced in the context of the EM algorithm (Dempster, Laird, and Rubin, 1977) and formalized in the context of Bayesian computation by Tanner and Wong (1987). For more complex models with censored data, this technique constitutes a highly appealing alternative to difficult maximum likelihood estimation. Moreover, it is quite natural to include the true event times or the values of the latent random effects in the set of unknowns. For these reasons, most of the models developed in this thesis make use of Bayesian estimation with augmented true event times.

4.4 Hierarchical specification of the model

In Bayesian statistics, the prior distribution p(θ) and the model assumed to generate the data, represented by the likelihood L(θ) = p(data | θ), are usually specified in a hierarchical manner. Firstly, remember that the parameter vector θ contains not only the parameters in the classical sense but also all remaining latent factors such as random effects or augmented times. Crudely, the vector θ can usually be split into two parts, θ = (ψ′, φ′)′, where ψ refers to the latent factors and φ to the parameters in the classical sense. The specification of the Bayesian model then proceeds in the following steps (a small worked sketch combining these steps with the data augmentation of Section 4.3 is given after the list):

1. The Data Model step specifies the likelihood function L(θ) = p(data | θ) = p(data | ψ, φ) and is actually equivalent to the frequentist specification of the model.

2. The Latent Process Model step specifies p(ψ | φ), i.e. the distribution of the latent factors, possibly given the classical parameters φ.

3. The Parameter Model (Prior) step specifies the prior distribution for the classical parameters φ, i.e. it specifies p(φ). Often, the components of φ are assumed to be a priori independent and, if no external information is available, are assigned vague but proper prior distributions.
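The hedged R sketch below is a toy Gibbs sampler, not the method developed in this thesis, for interval-censored event times under an exponential model with a conjugate gamma prior. In the terminology above, the Data Model is the indicator that each augmented time falls in its observed interval (noninformative censoring), the Latent Process Model is the exponential distribution of the augmented times given the rate, and the Parameter Model is the gamma prior on the rate; all numerical values are illustrative.

## Toy Gibbs sampler with data augmentation for interval-censored exponential
## event times T_i ~ Exp(lambda), with prior lambda ~ Gamma(a, b).
set.seed(1)
tL <- c(2.1, 0.8, 3.0, 1.5);  tU <- c(3.0, 1.9, 4.2, 2.4)   # observed intervals
a <- 1; b <- 1                                               # Gamma(a, b) prior
N <- length(tL); M <- 5000
lambda <- numeric(M); lambda.cur <- 1

## inverse-cdf draw from Exp(rate) truncated to (lo, up)
rtexp <- function(n, rate, lo, up) {
  u <- runif(n, pexp(lo, rate), pexp(up, rate))
  qexp(u, rate)
}

for (m in 1:M) {
  t.aug <- rtexp(N, lambda.cur, tL, tU)            # 1. impute the true event times
  lambda.cur <- rgamma(1, a + N, b + sum(t.aug))   # 2. draw lambda | augmented data
  lambda[m] <- lambda.cur
}
mean(lambda[-(1:1000)])    # posterior mean of the rate after burn-in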
The overall prior distribution is then given by

    p(θ) ∝ p(ψ | φ) × p(φ),

and the posterior distribution is obtained using the relationship (4.1) as

    p(θ | data) ∝ L(θ) × p(θ) ∝ p(data | ψ, φ) × p(ψ | φ) × p(φ),    (4.11)

i.e. it is proportional to the product of the distributions specified in the above three steps.

The hierarchical structure of more complex models is usually best expressed using so-called directed acyclic graphs (DAGs), where each model quantity is represented by a node, drawn as a circle for unknowns and as a square box for observed or fixed quantities (data, covariates). Solid arrows are used to represent stochastic dependencies and dashed arrows deterministic dependencies between the nodes. A simple DAG which only distinguishes among the data, the latent quantities ψ and the classical parameters φ, and which corresponds to the expression (4.11), is shown in Figure 4.1. Further, it is assumed that, given its parents, each node is conditionally independent of all its grandparents, i.e. schematically

    p(child | parents, grandparents) = p(child | parents).

The posterior distribution of the hierarchical model is then proportional, analogously to the relationship (4.11), to the product of all conditional distributions of the type p(child | parents) times the product of the prior distributions for the nodes of the first generation (i.e. those having no parents).

[Figure 4.1: Directed acyclic graph – general scheme, with nodes φ, ψ and data.]

Illustration 4.1. Linear mixed model. As an illustration, consider a classical normal linear mixed model with data = {y1, ..., yN} being a realization of independent random vectors Yi, i = 1, ..., N, each of length n, which in a frequentist sense can be specified as

    Yi = Xi β + Zi bi + εi,    i = 1, ..., N,
    bi ~ i.i.d. Nq(0, D),    εi ~ i.i.d. Nn(0, Σ),

where Xi, Zi, i = 1, ..., N, are fixed covariate matrices. For the purpose of Bayesian modelling, the vector θ = (ψ′, φ′)′ is given by

    ψ = (b′1, ..., b′N)′,    φ = (β′, vec(D)′, vec(Σ)′)′.

The whole model can be represented by the DAG shown in Figure 4.2. The three model-building steps mentioned above proceed as follows. The Data Model is given by a normal likelihood

    L(θ) = p(data | θ) = p(data | ψ, φ) = ∏_{i=1}^{N} ϕn(yi | Xi β + Zi bi, Σ).

The Latent Process Model is determined by the normal distribution of the random effects, i.e.

    p(ψ | φ) = ∏_{i=1}^{N} ϕq(bi | 0, D).

Finally, prior distributions p(β), p(D), p(Σ) are assigned to the parameters of main interest, i.e. to β, D, Σ, and

    p(φ) = p(β) × p(D) × p(Σ).

[Figure 4.2: Directed acyclic graph for the linear mixed model, with nodes Xi, Zi, yi, bi (i = 1, ..., N) and β, D, Σ.]

4.5 Markov chain Monte Carlo

In the previous sections, we stated that the inference in the Bayesian approach is based on the posterior distribution p(θ | data), which is obtained using Bayes' formula (4.1) and is proportional to the product of the likelihood and the prior distribution. We also saw that difficult likelihood evaluations can be avoided by the introduction of a set of suitable auxiliary variables (augmented data). What remains to be discussed is how the posterior distribution can be computed and how to determine posterior summaries of θ.
Most quantities related to posterior summarization (posterior moments, quantiles, highest posterior density regions, etc.) involve computation of the posterior expectation of some function G(θ), i.e. computation of

    E[G(θ) | data] = ∫_Θ G(θ) p(θ | data) dθ = ∫_Θ G(θ) L(θ) p(θ) dθ / ∫_Θ L(θ) p(θ) dθ.    (4.12)

The integration in the expression (4.12) is usually high-dimensional and only rarely analytically tractable in realistic practical situations. Markov chain Monte Carlo (MCMC) methods avoid the explicit evaluation of the integrals. Instead, we construct a Markov chain with state space Θ whose stationary distribution is equal to p(θ | data). After a sufficient number of burn-in iterations the current draws follow the stationary distribution, i.e. the posterior distribution of interest. We keep a sample of θ values, say θ^(1), ..., θ^(M), and approximate the posterior expectation (4.12) by

    Ḡ_M = (1/M) Σ_{m=1}^{M} G(θ^(m)).    (4.13)

The ergodic theorem implies that, under mild conditions, Ḡ_M converges almost surely to E[G(θ) | data] as M → ∞ (see, e.g., Billingsley, 1995, Section 24). Many methods are available to construct Markov chains with the desired properties. The most often used are the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970) and the Gibbs algorithm (Geman and Geman, 1984; Gelfand and Smith, 1990). Both of them, often suitably adapted, will be used extensively throughout this thesis. A comprehensive introduction to the area of MCMC can be found, e.g., in Geyer (1992); Tierney (1994); Besag et al. (1995). More details can be obtained from several books, e.g., Gilks, Richardson, and Spiegelhalter (1996); Gamerman (1997); Chen, Shao, and Ibrahim (2000); Robert and Casella (2004).

4.6 Credible regions and Bayesian p-values

With a frequentist approach, confidence intervals or regions and p-values are used to summarize the estimates and the inference for the parameter of interest θ. In Bayesian statistics, the role of the confidence regions is played by credible regions and p-values are replaced by Bayesian p-values. In this section, we briefly discuss their construction.

4.6.1 Credible regions

For a given α ∈ (0, 1), the 100(1 − α)% credible region Θα for a parameter of interest θ is defined using the conditional distribution θ | data (the posterior distribution of θ) by

    Pr(θ ∈ Θα | data) = 1 − α.    (4.14)

Equal-tail credible interval. Suppose first that the parameter of interest θ is univariate. The credible region Θα can then be obtained by setting Θα = (θ^L_α, θ^U_α) such that

    Pr(θ ≤ θ^L_α | data) = Pr(θ ≥ θ^U_α | data) = α/2.

Such an interval is easily constructed when a sample from the posterior distribution of θ (obtained, e.g., using the MCMC technique) is available. Indeed, θ^L_α and θ^U_α are the 100(α/2)% and 100(1 − α/2)% quantiles, respectively, of the posterior distribution θ | data, and from the MCMC output they can be estimated using the sample quantiles.
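A short, hedged R sketch of these sample-based summaries follows: the ergodic average (4.13) and the equal-tail credible interval estimated by sample quantiles. The "MCMC sample" is simulated only to make the sketch self-contained.

## Posterior summaries from an MCMC sample: ergodic average (4.13) and the
## equal-tail credible interval estimated by sample quantiles.
set.seed(2)
theta.mcmc <- rnorm(10000, mean = 0.5, sd = 0.2)   # stand-in for MCMC draws of theta

alpha     <- 0.05
post.mean <- mean(theta.mcmc)                               # approximates E(theta | data)
ci.equal  <- quantile(theta.mcmc, c(alpha/2, 1 - alpha/2))  # 95% equal-tail interval
post.mean
ci.equal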
Simultaneous credible bands. For the case that the parameter of interest, θ = (θ1, ..., θq)′, is multivariate and we wish to make simultaneous probability statements, Besag et al. (1995, p. 30) suggest computing simultaneous credible bands. In that case, Θα equals

    Θα = (θ^L_{1,αuni}, θ^U_{1,αuni}) × · · · × (θ^L_{q,αuni}, θ^U_{q,αuni}).    (4.15)

That is, Θα is given as the product of univariate equal-tail credible intervals of the same univariate level αuni (typically αuni ≤ α). As shown by Besag et al. (1995), the simultaneous credible bands can easily be computed when a sample from the posterior distribution is available, as only the order statistics of each univariate sample are needed. From the computational point of view, the most intensive part of the computation of the simultaneous credible band is sorting the univariate samples. However, when simultaneous credible bands for different values of α are required, this must be done only once. This property is used to advantage when computing the simultaneous Bayesian p-values (see Section 4.6.2). As pointed out by Held (2004), because the simultaneous credible band is by construction restricted to be hyperrectangular, it can cover a large area actually not supported by the posterior distribution. Obviously, this problem becomes more severe when a high posterior correlation exists between the components of the vector θ.

Highest posterior density region. An alternative to the credible intervals and simultaneous credible bands is given by the highest posterior density (HPD) region. In that case, Θα is obtained by requiring (4.14) and additionally

    p(θ1 | data) > p(θ2 | data)   for all θ1 ∈ Θα, θ2 ∉ Θα.

Note that in the univariate case and for unimodal posterior densities p(θ | data), the HPD region becomes an interval. However, it is clear that, in contrast to the equal-tail credible interval or the simultaneous credible band, the computation of the HPD region is much more complicated, even when a sample from the posterior distribution is already available.

4.6.2 Bayesian p-values

The Bayesian counterpart of the p-value for the hypothesis H0: θ = θ0 (typically θ0 is a vector of zeros), the Bayesian p-value, can be defined as one minus the content of the credible region which just covers θ0, i.e.

    p = 1 − min{ 1 − α : θ0 ∈ Θα }.    (4.16)

In the univariate case, a two-sided Bayesian p-value based on the equal-tail credible interval is computed quite easily once a sample from the posterior distribution is available, since (4.16) can be expressed as

    p = 2 min{ Pr(θ ≤ θ0 | data), Pr(θ ≥ θ0 | data) },    (4.17)

and Pr(θ ≤ θ0 | data) and Pr(θ ≥ θ0 | data) can be estimated by the proportion of the sample lying below or above, respectively, the point of interest θ0.

In the multivariate case, a two-sided simultaneous Bayesian p-value based on the simultaneous credible band can be obtained by calculating the simultaneous credible bands Θα at various levels and determining the smallest credible level 1 − α such that θ0 ∈ Θα, i.e. by direct use of the expression (4.16). To compute the Bayesian p-value based on the HPD region, the expression (4.16) takes the form

    p = Pr( {θ : p(θ | data) ≤ p(θ0 | data)} | data ).    (4.18)

An MCMC estimate of (4.18) can easily be obtained when p(θ | data) (any proportionality constants may be ignored) can be evaluated efficiently. Often, however, this is not the case. Nevertheless, a technique to overcome the problem of an unknown or difficult-to-evaluate p(θ | data), using its estimate based on Rao-Blackwellization, is given by Held (2004).

Mainly for computational reasons, we report in this thesis, unless stated otherwise, univariate equal-tail credible intervals and corresponding Bayesian p-values of the type (4.17) and, multivariately, simultaneous credible regions (4.15) and corresponding simultaneous Bayesian p-values computed using an iterative procedure to evaluate (4.16).

Chapter 5: An Overview of Methods for Interval-Censored Data

For right-censored data, a variety of methods (non-, semi- and fully parametric) have been developed. Further, commercial software is available to support these techniques.
In contrast, for interval-censored data and multivariate (doubly-)interval-censored data, commercial software is much more limited and, apart from user-written programs, only parametric approaches seem to be available for regression models. Further, until recently only few methods were available at all. That is why, in practice, modelling with interval-censored data is often mimicked by methods developed for right-censored data. For this, the interval needs to be replaced by an exact or right-censored time. The most common assumption is that the event occurred at the midpoint of the interval. However, applying methods for right-censored data to these artificial fixed points can lead to biased and misleading results, and the correctness of such an approach depends strongly on the underlying distribution of the event times, see, e.g., Rücker and Messerer (1988); Law and Brookmeyer (1992); Odell, Anderson, and D'Agostino (1992); Dorey, Little, and Schenker (1993).

In Section 5.1, we first review appropriate frequentist methods to deal with (doubly-)interval-censored data and link them to the corresponding (classical) methods for right-censored data. We start with the estimation of the survival distribution, proceed to two-sample tests for survival distributions, continue with the proportional hazards and accelerated failure time models and end with a remark on the problem of interval-censored covariates. Whenever feasible, we mention computational aspects of the described methods applicable in R, S-plus and SAS.

With suitable semi-parametric approaches, both the PH and the AFT model can be used not only for the estimation of the effect of covariates but also for the estimation of the baseline survival distribution and for the comparison of two or more samples. With the Bayesian approach, it is moreover relatively easy to set up and estimate models for multivariate (doubly-)interval-censored data. We will illustrate this on the analysis of the Signal Tandmobiel® data using a semi-parametric Bayesian PH model in Section 5.2. As we are interested mainly in the AFT model, we also give an overview of the available Bayesian developments for this model in Section 5.3. We end this chapter by highlighting our motivations for the further developments presented in this thesis.

5.1 Frequentist methods

5.1.1 Estimation of the survival function

In the case of simple i.i.d. survival data, the aim is often to estimate the survival function. When only categorical covariates are involved, the survival function can be estimated for each unique combination of covariate values and can be used to check the fitted regression model. For right-censored data, the classical non-parametric maximum-likelihood estimate (NPMLE) of the survival function is given by Kaplan and Meier (1958). For interval-censored data, Peto (1973) first proposed the NPMLE and used the constrained Newton-Raphson method to compute it. Nowadays, the NPMLE of the survival function based on interval-censored data is known as the Turnbull estimate (see Turnbull, 1976), obtained by a so-called iterative self-consistency algorithm, which is, in fact, an EM-like (Dempster et al., 1977) algorithm. An improved version of the maximization algorithm, which utilizes a standard convex optimization technique, was given by Gentleman and Geyer (1994), who also discussed the unicity of the estimate. For computation, a valuable alternative, the iterative convex minorant algorithm, was suggested by Groeneboom and Wellner (1992). Finally, strong consistency of the Turnbull estimate has been proved under rather general assumptions by Yu, Li, and Wong (2000). The asymptotic distributional behaviour of the Turnbull estimator for some special cases has been established by Yu et al. (1998) and Huang (1999). An extension of the NPMLE of the survival function to bivariate interval-censored data is discussed, e.g., by Bogaerts and Lesaffre (2004). Several numerical algorithms to compute the non-parametric estimate of the survival function for interval-censored data are implemented in Vandal's and Gentleman's R package Icens, downloadable from the Comprehensive R Archive Network (CRAN), or in the S-plus function kaplanMeier.
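A hedged sketch of such a computation with the Icens package just mentioned follows; the EMICM() interface is assumed here, and argument and component names may differ across package versions, so this is an illustration rather than a definitive recipe. All interval end-points are made up.

## Turnbull NPMLE of the survival distribution from interval-censored data,
## sketched with the Icens package (interface assumed; values illustrative).
library(Icens)

## each row: left and right end-point of one observed interval
A <- rbind(c(7.1, 8.0),
           c(8.3, 9.2),
           c(6.9, 7.8),
           c(7.5, 8.6))
fit <- EMICM(A)     # EM-ICM algorithm for the NPMLE (Turnbull estimate)
str(fit)            # probability masses assigned to the Turnbull intervals
plot(fit)           # step-wise estimate of the distribution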
A valuable alternative to non-parametric procedures is obtained by smoothing the survival function or, equivalently, the density or the hazard function. In most practical situations it can be assumed that the event times are continuously distributed, and we then even obtain more realistic, non-step-wise estimates. One such method, applicable directly also to interval-censored data, is given by Kooperberg and Stone (1992), who smooth the density using splines. They also provide software in the form of the R package logspline, downloadable from CRAN, or the S-plus library splinelib, downloadable from StatLib. Splines for smoothing the hazard function are exploited in the approach of Rosenberg (1995).

Illustration 5.1. Signal Tandmobiel® study. As an illustration, we computed both the non-parametric estimate of Turnbull (1976) and the smooth estimate of Kooperberg and Stone (1992) of the cumulative distribution function (cdf) of the emergence of the right mandibular permanent first premolar, separately for boys and girls, based on the Signal Tandmobiel® data introduced in Section 1.1. The cdf, giving the proportion of children with the emerged tooth, is called in this context the emergence curve and is preferred in this situation to the survival curve. The estimates are plotted in Figure 5.1. Due to the rather high sample size in each group (more than 2 000), the non-parametric estimate is almost the same as the smooth estimate, especially for boys. From the plots it is seen that the emergence for girls is somewhat earlier than for boys.

[Figure 5.1: Signal Tandmobiel® study: Cumulative distribution functions of emergence for the right mandibular permanent first premolar (tooth 44), separately for girls and boys, plotted against age (years). Non-parametric estimate of Turnbull (solid line), smooth estimate of Kooperberg and Stone (dashed line).]

Doubly-interval-censored data. Non-parametric estimation of the survival curve based on doubly-interval-censored data was first considered by De Gruttola and Lagakos (1989), who make use of a discretization of the data and a generalization of the self-consistency algorithm of Turnbull (1976). The authors estimate the onset and the event distributions simultaneously by treating them as bivariate data. However, they point out that the large number of parameters resulting from the discretization, especially if time is grouped too finely, may cause identifiability problems. This gave rise to several two-step approaches.
First, the distribution of the onset time is estimated separately; second, the estimated onset distribution is used as an input for the estimation of the distribution of the event time. Bacchetti (1990) and Bacchetti and Jewell (1991) assume a piece-wise constant hazard and use a penalized maximum-likelihood method to estimate the levels of the hazard on each interval. The roughness penalty in the likelihood protects the method against the identifiability problems reported by De Gruttola and Lagakos (1989). The original proposal of De Gruttola and Lagakos (1989) motivates the two-step approaches of Gómez and Lagakos (1994) and Sun (1995). Finally, Gómez and Calle (1999) present an extension of the technique of Gómez and Lagakos (1994) which does not require discretization of the data.

5.1.2 Comparison of two survival distributions

If the data can be divided into two (or more) groups, e.g. boys and girls, one may wish to compare the distributions of the event times in these groups. For right-censored data, many non-parametric tests for comparing two survival curves are available, e.g. the log-rank test (Mantel, 1966), the Gehan generalization of the Wilcoxon test (Gehan, 1965), the Peto-Prentice generalization of the Wilcoxon test (Peto and Peto, 1972; Prentice, 1978) and the weighted Kaplan-Meier statistic of Pepe and Fleming (1989, 1991), which with unit weights equals the difference of the means of the two survival distributions. The Gehan-Wilcoxon test has been adapted to interval-censored data by Mantel (1967), while the interval-censored version of the Peto-Prentice-Wilcoxon test is presented by Self and Grossman (1986). The log-rank test for interval-censored data is given by Finkelstein (1986). Further, Petroni and Wolfe (1994) discuss the weighted Kaplan-Meier statistic in the context of interval censoring. The performance of the above-mentioned two-sample tests for interval-censored data is studied in detail and compared by Pan (1999a). Furthermore, Fay (1996, 1999) derived a general class of linear rank tests for interval-censored data which covers, as special cases, the Wilcoxon-based tests. Finally, Fay and Shih (1998) present a class of tests called distribution permutation tests which, besides the Wilcoxon-based tests, also covers an improved version of the weighted Kaplan-Meier test. S-plus programs to perform some distribution permutation tests are given by Gómez, Calle, and Oller (2004, Section 4.4) and can be downloaded from http://www-eio.upc.es/grass.

Regrettably, the asymptotic properties of the above methods assume the grouped continuous model, which implies that the status of each subject is checked at the same time points (in the study time scale), whose number is fixed, or that the observed intervals are grouped in such a way. For example, for the Signal Tandmobiel® study this would mean that the emergence status of the teeth was checked at prespecified ages, the same for all children. Obviously, such a setting is too restrictive in many practical situations. For instance, in the above example, each child was examined by a dentist-examiner on a prespecified day of the year, irrespective of his or her age. The grouped continuous model assumption is needed in order to apply standard maximum likelihood theory to interval-censored data measured on a continuous scale without making any parametric assumptions.
Only recently, Fang, Sun, and Lee (2002) developed a test statistic, based on the weighted Kaplan-Meier statistic of Pepe and Fleming (1989), that does not require the grouped continuous model assumption. Finally, Pan (2000b) offers two-sample test procedures obtained by combining standard right-censored tests with multiple imputation, which, in contrast to the single (e.g. midpoint) imputation mentioned at the beginning of this chapter, allows the statistical inference to be drawn appropriately.

Illustration 5.2. Signal Tandmobiel® study. The emergence curves of the right mandibular permanent first premolar for boys and girls shown in Figure 5.1 were compared using the Wilcoxon-based tests, the log-rank test and Fay's and Shih's version of the difference-in-means test. Not surprisingly, for all these tests the p-value is practically equal to zero. The values of the test statistics, their means and variances under the null hypothesis, and the standardized values, which can asymptotically be compared to the quantiles of the standard Gaussian distribution, are shown in Table 5.1.

Table 5.1: Signal Tandmobiel® study: Two-sample tests comparing the emergence of the permanent right mandibular first premolar (tooth 44) for boys and girls.

  Test                       Test Statistic    Mean       Variance         Standardized Test Statistic
  Gehan-Wilcoxon             554 812           0          2 865 333 000    10.365
  Peto-Prentice-Wilcoxon     140.607           -37.634    284.255          10.572
  Log-rank                   212.316           -53.663    675.251          10.236
  Difference in means        264.095           -76.486    1 102.340        10.258

5.1.3 Proportional hazards model

To extend the PH model to interval-censored data, basically four types of approaches can be found in the literature. Firstly, the baseline hazard ℏ0 can be specified parametrically and standard maximum likelihood theory applied to estimate all the parameters. However, the parametric assumptions can bias the inference if incorrectly specified, and especially with heavily censored data it is generally difficult to assess them. The second class of methods makes use of a combination of multiple imputation (see Rubin, 1987; Wei and Tanner, 1991) and methods for right-censored data, represented by the works of Satten (1996); Satten, Datta, and Williamson (1998); Goggins et al. (1998); Pan (2000a). A disadvantage of these methods is, however, that they are computationally highly demanding and that the procedures they use to impute the missing data have a relatively ad hoc nature. The third approach, suggested by Finkelstein (1986), Pan (1999b), and Goetghebeur and Ryan (2000), most resembles the original method of Cox (1972) combined with that of Breslow (1974). Indeed, in all three papers the baseline hazard ℏ0 is estimated non-parametrically on top of estimating the regression coefficients. Whereas the method of Finkelstein relies on the grouped data assumption, Goetghebeur and Ryan developed an EM-type procedure that relaxes that assumption. Moreover, the approach of Goetghebeur and Ryan seems to be the only one that reduces to a standard Cox model when interval censoring reduces to right censoring. Finally, the approach of Pan extends the iterative convex minorant method mentioned in Section 5.1.1 to the context of the PH model; his approach is also implemented as the R package intcox. The fourth type of approach, methods that smoothly estimate ℏ0, is a trade-off between parametric modelling, which allows for straightforward maximum likelihood estimation of the parameters, and semi-parametric models with a completely unspecified baseline hazard ℏ0.
Kooperberg and Clarkson (1997) suggest using regression splines to express the logarithm of ℏ0, while Joly et al. (1998) employ monotone splines (Ramsay, 1988) directly for the baseline hazard ℏ0. Betensky et al. (1999) use local likelihood smoothing to model the baseline hazard, at first without covariates; an extension of their method to the regression setting is given by Betensky et al. (2002). Recently, Cai and Betensky (2003) proposed using a penalized linear spline for the baseline hazard function. A nice feature of these methods is that predictive survival and hazard curves are directly available and, moreover, they are smooth rather than step-wise as in the case of semi-parametric estimation. The software for the approach of Kooperberg and Clarkson (1997) is included in the previously mentioned R package logspline and S-plus library splinelib.

Doubly-interval-censored data. One of the first approaches to the PH model with doubly-interval-censored data is given by Kim, De Gruttola, and Lagakos (1993) who, under the grouped data assumption, directly generalize the one-sample results of De Gruttola and Lagakos (1989). However, their method is computationally highly intensive. For the situation where only the onset time is interval-censored while the failure time is right-censored or exactly observed, alternatives are offered by Goggins, Finkelstein, and Zaslavsky (1999); Sun, Liao, and Pagano (1999); Pan (2001).

5.1.4 Accelerated failure time model

A parametric AFT model estimated using the maximum likelihood method can be used with interval-censored data as well. It is also implemented in major statistical packages (functions survreg in R and SurvReg in S-plus, procedure LIFEREG in SAS). On the other hand, semi-parametric methods, which are not straightforward even for right-censored data, are extended to interval-censored data only with considerable difficulty, see Rabinowitz, Tsiatis, and Aragon (1995); Betensky, Rabinowitz, and Tsiatis (2001). Moreover, both approaches are practically applicable only with low-dimensional covariate vectors x and, just as for right-censored data, there exists no non-parametric method to estimate the baseline survival distribution, implying that the semi-parametric procedures cannot be used when prediction is of interest.

More promising alternatives are the methods that make use of multiple imputation and/or smoothing. Indeed, the approaches of Pan and Kooperberg (1999); Pan and Louis (2000); Pan and Connett (2001) introduced in Sections 3.4.2 and 3.4.3 could relatively easily be extended to handle also (multivariate) interval-censored or even doubly-interval-censored data. However, it can be computationally demanding, especially with doubly-interval-censored data, to perform the integration of the form (4.4) within the optimization of the likelihood.

5.1.5 Interval-censored covariates

Up to now, we have concentrated on the problem of an interval-censored response. In the regression context it is, however, possible in practice that we have to face the problem of an interval-censored covariate. Such a problem is considered, for example, by Gómez, Espinal, and Lagakos (2003), who studied, in the framework of an HIV/AIDS clinical trial, the association between the waiting time between indinavir failure and enrolment (covariate) and the subsequent viral load (response). However, we will not consider problems of this type in this thesis.
Recent developments in this field can be found, e.g., in Topp and Gómez (2004); Langohr, Gómez, and Muga (2004); Calle and Gómez (2005).

5.2 Bayesian proportional hazards model: An illustration

For an extensive overview of the Bayesian methods for the proportional hazards model we refer the reader to the book of Ibrahim, Chen, and Sinha (2001). Here, only an analysis based on the PH model, published as Komárek et al. (2005), will be presented, namely an analysis of doubly-interval-censored data from the Signal Tandmobiel® study. The main purpose of this section is to illustrate typical features of a Bayesian analysis and to show how it can be used to answer rather complex questions. In Section 5.2.1, we formulate the research question and outline the problems related to this question. Section 5.2.2 presents a frequentist Cox PH regression model using the midpoints of the observed intervals as if they were exact observations, in order to compare our Bayesian approach with a more commonly used, though incorrect, approach. In Section 5.2.3, the Bayesian model suggested by Härkänen, Virtanen, and Arjas (2000) and modified for our purposes is explained, and results are presented in Section 5.2.4. We finalize this part with a discussion.

5.2.1 Signal Tandmobiel® study: Research question and related data characteristics

In this section we tackle the following research question: does fluoride intake at a young age have a protective effect against caries in permanent teeth? Our analyses will be limited to the caries experience of the four permanent first molars (teeth 16, 26, 36, 46 in Figure 1.1). The data suggest that the use of fluoride reduces caries experience in primary teeth, see Vanobbergen et al. (2001), and that fluoride intake delays the emergence of the permanent teeth, see Leroy et al. (2003a). The latter result raises the question whether fluoride intake only reduces the time at risk or whether it also has a direct protective effect on caries experience.

Unfortunately, fluoride intake in children cannot be measured accurately. Indeed, fluoride intake can come from: (1) fluoride supplements (systemic), (2) accidental ingestion of toothpaste or (3) tap water. Further, the intake from these sources can be recorded only crudely. Therefore it was decided to measure fluoride intake by the degree of fluorosis on some reference teeth. Fluorosis is the most common side effect of fluoride intake and appears as white spots on the enamel of the teeth. For this analysis, a child was considered fluoride-positive (covariate fluor = 1) if there were white spots on at least two permanent maxillary incisors during the fourth year of the study or during both the fifth and sixth year of the study. The prevalence of fluorosis was relatively low (480 children, 10.8%). In our analysis, 480 fluorosis children and 960 randomly selected fluorosis-free children are included. Case-control subsampling was done to reduce computation time. To check that it did not destroy the stratification, we constructed a 5 × 3 × 2 contingency table with factors province, school system and whether the child is in the subsample or not (subsample). A classical p-value of 0.13 was obtained for the significance of the interaction of the third factor with the other two, using a likelihood-ratio test in a log-linear model, implying that the stratification is similar in the used and the discarded subsamples.
The prevalence of caries experience at the age of 12 was negligible (at most 1.4%) for all permanent teeth except for the first molars (the teeth used in the analysis). For these teeth the prevalence was 25.8% in children with fluorosis compared to 29.4% in fluorosis-free children, with prevalences of 23.3% and 27.7% for boys, and 27.9% and 31.2% for girls, respectively. Thus, at first sight the impact of fluoride intake seems to be minor. However, since the emergence of permanent teeth might be delayed by fluoride intake, evaluating the impact of fluoride intake should take into account the time at risk for caries. Hence, in our analysis the response will be the time between emergence and the onset of caries development. Remember that both tooth emergence and the onset of caries development are interval-censored, implying a doubly-interval-censored response. See Figure 2.1 for a graphical illustration of a possible evolution of a particular tooth.

At the onset of the study about 86% of the permanent first molars had already emerged. The severity of this censoring will affect the efficiency with which the effect of fluoride intake can be estimated. We tried two strategies to improve the efficiency of our estimation procedure. Firstly, we included in our analysis the emergence times of teeth 14, 24, 34, 44, 12, 22, 33, 43, all of which had emerged in more than 60% of the cases during the course of the study. By incorporating information on these teeth and using the association between teeth of the same subject (via the concept of "the birth time of dentition", see the next section), it was attempted to better estimate the true emergence times of the permanent first molars. Secondly, emergence times from a Finnish longitudinal data set (Virtanen, 2001), involving 235 boys and 223 girls born in 1980–1981 with follow-up from 6 to 18 years, were added to our Flemish data. For these Finnish data almost all 28 permanent teeth emerged during the study period. Our research question is not uncommon in dentistry, but cannot be addressed within any classical statistical package. For our analysis, we have used the software package BITE (Härkänen, 2003), based on a semi-parametric Bayesian survival model developed by Härkänen et al. (2000).

5.2.2 Proportional hazards modelling using midpoints

A standard frequentist Cox PH model, introduced in Section 3.1, could be applied by replacing the interval-censored observations with the midpoints of the observed intervals and treating the resulting data as right-censored observations. In this way, we analyzed the time to caries development for the four permanent first molars. For this analysis, the left-censored emergence times were first assumed to be interval-censored with a lower limit for emergence of 5 years, which is practically the youngest age for the emergence of these teeth (Nanda, 1960). Possible dependencies between the four teeth of the same child can be taken into account, for example, by the inclusion of a gamma-frailty component in the PH model, as explained in Section 3.4.1. Based on preliminary Bayesian modelling, we do not distinguish between opposite teeth in the same jaw and assume so-called horizontal symmetry. However, we do make a distinction between maxillary (upper) and mandibular (lower) teeth and also between teeth in different positions (of a quadrant) in the mouth.
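Such a naive midpoint analysis with a gamma frailty per child could be sketched in R as follows; the data frame molars and its variables are hypothetical, and the covariates correspond to those of model (5.1) given below.

## Naive midpoint analysis of the time-to-caries with survival::coxph(); the data
## frame 'molars' is hypothetical: 'time.mid' is the midpoint-imputed time between
## emergence and caries, 'status' the event indicator, 'id' identifies the child.
library(survival)

fit.naive <- coxph(Surv(time.mid, status) ~ fluor * gender + fluor * tooth +
                     frailty(id, distribution = "gamma"),
                   data = molars)
summary(fit.naive)   # hazard ratios; to be contrasted with the Bayesian analysis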
Naive PH models for the effect of fluorosis on caries on permanent first molars. Hazard ratios (95% confidence intervals (CI)) between the fluorosis and fluorosis-free groups of children while controlling for gender and jaw.

                     Model WITHOUT frailties        Model WITH frailties
Group                Estimate   95% CI              Estimate   95% CI
Boys, maxilla        0.787      (0.541, 1.032)      0.704      (0.204, 1.204)
Boys, mandible       0.733      (0.532, 0.934)      0.613      (0.231, 0.995)
Girls, maxilla       0.871      (0.698, 1.044)      0.892      (0.610, 1.174)
Girls, mandible      0.812      (0.670, 0.953)      0.776      (0.559, 0.993)

For comparison purposes, we present the same PH model as the one in Section 5.2.3, where it will be analyzed by Bayesian methods. Hence, the hazard for the time to caries of the l-th tooth of the i-th child depends on the tooth position, fluor and gender of the child (0 = boy, 1 = girl). More specifically:

    ℏ(t | tooth_l, gender_i, fluor_i) = ℏ_0(t) · Z_i · exp(β′ x_{i,l}),    i = 1, . . . , N,  l = 16, 26, 36, 46,    (5.1)

where ℏ_0(t) is an unspecified baseline hazard function, β = (β_1, . . . , β_5)′, and x_{i,l} = (fluor_i, gender_i, tooth_l, fluor_i × gender_i, fluor_i × tooth_l)′. The covariate "tooth" is a dummy variable that distinguishes teeth at different positions in the mouth (apart from horizontal symmetry). The term Z_i is either one, corresponding to a model without frailties, or a gamma-distributed frailty term.

Estimates of the hazard ratios between the fluorosis and fluorosis-free groups, controlling for gender and jaw, are shown in Table 5.2. As seen, incorrectly ignoring the dependencies between the responses of one child by using a model without frailties artificially narrows the confidence intervals. Although both models conclude that the effect of fluorosis on the development of caries on the permanent first molars is at the borderline of 5% significance (Table 5.2), the results are not reliable. As pointed out on page 39, the correctness of midpoint imputation depends strongly on the underlying distribution of the event times. For that reason, a more sophisticated analysis is needed.

5.2.3 The Bayesian survival model for doubly-interval-censored data

The non-parametric Bayesian intensity model of Härkänen et al. (2000) provides a flexible tool for analyzing multivariate survival data. Further, a software package written in C, called BITE and downloadable from http://www.rni.helsinki.fi/~tth together with the scripts used to perform all analyses presented here, makes the analysis feasible in practice.

Model for emergence

Let U_{i,l} be the (unknown) age at which tooth l of child i emerged. The hazard for emergence of tooth l of the i-th child at time t is

    λ^{(e)}_{i,l}(t) = ℏ^{(e)}(t − η_i | tooth_l, gender_i) × I[η_i < t ≤ U_{i,l}].    (5.2)

The dependence between the emergence times of one child is accounted for by a subject-specific variable η_i called the birth time of dentition. This is a latent variable which represents the common time marking the onset of the tooth eruption process and thereby "explains" the positive correlation between the eruption times U_{i,l} within a subject. Note that η_i is always less than the first emergence time of the permanent teeth. The intensity of emergence for a particular child is zero before that time, expressed by the indicator I[η_i < t ≤ U_{i,l}]. The hazard function ℏ^{(e)}(· | tooth_l, gender_i) is defined as piece-wise constant for estimation purposes.

Model for caries experience

Let V_{i,l} be the age at which the l-th tooth of child i developed caries.
The hazard for the caries process is given by (c) λi,l (t) = Zi × ℏ(c) (t − Ui,l |toothl , genderi , fluori ) × I[Ui,l < t ≤ Vi,l ], (5.3) where the variable Zi is an unknown subject-specific frailty coefficient modulating the hazard function. Again, we assume in (5.3) that h is piece-wise constant. We call the difference Vi,l − Ui,l the time-to-caries. The covariate “fluor” will be used in two ways. Firstly, for each combination of values of fluor, gender and tooth a piece-wise constant hazard function is specified and fitted. Secondly, the term ℏ(c) (·|toothl , genderi , fluori ) in (5.3) 5.2. BAYESIAN PROPORTIONAL HAZARDS MODEL: AN ILLUSTRATION 51 (c) is replaced by ℏ0 (·) × exp(β ′ xi,l ), with β and xi,l being the same as in (5.1), thus assuming a PH model for caries experience whilst retaining a piece-wise (c) constant baseline hazard function ℏ0 (·). Remarks Our statistical model will involve the above two measurement models. Hence the possible dependencies among times of interest are taken into account by involving two types of subject-specific parameters, ηi and Zi . The first subject-specific parameter ηi is included in the model for the emergence and will shift the hazard function in time, whereas the frailty Zi recognizes that the teeth of one child can be more sensitive to caries than the corresponding teeth of another child, reflecting different dietary behavior, brushing habits, etc. Priors for baseline hazard functions In BITE the working assumption is that hazard functions are piece-wise constant. Further, for the emergence hazard functions ℏ(e) (·|toothl , genderi ) the first level of the piece-wise constant and the increment levels are assigned gamma prior distributions. This will ensure a priori an increasing hazard function for emergence. In the case of caries experience, the first level of the piece-wise constant hazard function ℏ(c) (·|toothl , genderi , fluori ) in the non(c) parametric model and ℏ0 (·) in the PH model, say h0 , is assigned a gamma prior distribution. Further, the level hm of the mth interval has, conditional on the previous levels h0 , . . . , hm−1 , a Gamma(α, α/hm−1 ) prior distribution. This gives a priori E[hm |hm−1 , . . . , h0 ] = hm−1 and assures that there is no built-in prior assumption of trend in the hazard rate. Finally, the prior for the jump points of each piece-wise constant function is a homogeneous Poisson process, as suggested by Arjas and Gasbarra (1994). Because jump points are assumed to be random and not fixed, the posterior predictive hazard functions will be smooth, rather than piece-wise constant. Priors for the random effect terms The prior distribution for the birth time of dentition ηi illustrates how we have combined the Flemish data and the Finnish data, and how the timing of emergence of the Finnish data is included in our analysis. We assume that the shapes of the emergence hazard functions f for Finland and Flanders are the same, but we do allow for a shift in emergence times by assuming different 52 CHAPTER 5. METHODS FOR INTERVAL-CENSORED DATA means for the birth time of dentition in the two countries. More precisely, the prior distribution of ηi is assumed normal N (ξ0 , τ −2 ) for a Finnish child and normal N (ξ1 , τ −2 ) for a Flemish child. The Bayesian approach allows us to include the dentist’s knowledge on the problem at hand by assigning to the parameters ξ0 and ξ1 independent normal prior distributions with mean 5.2 years and standard deviation 1 year. 
Both the normal distribution and the choice of the prior means and standard deviation of the hyperparameters ξ0 and ξ1 are motivated by the results found in the literature on the earliest emergence of permanent teeth, see Nanda (1960) or, more recently, Parner et al. (2001). This reflects the dentist's belief that permanent teeth on average emerge slightly after 5 years of age. The parameter τ² is assigned a Gamma(2, 2) prior distribution.

The individual frailties Z_i in the model for caries are a priori assumed to be, conditionally on the hyperparameter φ, independent and identically gamma distributed with both shape and inverse scale equal to that hyperparameter. The hyperparameter itself is then given a Gamma(2, 2) prior distribution. Sensitivity of the results with respect to the choice of parameters for the priors of the hyperparameters ξ0, ξ1, τ and φ will be discussed in Section 5.2.4.

Treatment of censored data

Left- and interval-censoring are treated by the Bayesian data augmentation introduced in Section 4.3. Additionally, the left-censored emergence times of all teeth are changed into interval-censored emergence times with a lower limit equal to 4 years, implying that less internal information is used here than previously with the frequentist PH model, where the limit was 5 years. In the case that both emergence and caries development were observed within one observational interval, we force the sampled values of the MCMC to satisfy V_{i,l} > U_{i,l}.

Bayes inference on model components

The posterior distributions based on the model with the prior assumptions described in the previous paragraphs are minor modifications of those derived in Härkänen et al. (2000). Our Bayesian model is complex and requires the use of the Markov chain Monte Carlo sampling techniques outlined in Section 4.5. The software package BITE (Härkänen, 2003), based on the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970), was used to sample from the posterior distributions. Further, BITE employs the reversible jump approach of Green (1995) to sample the piece-wise constant hazard functions. We carried out two runs, each with 20 000 iterations of burn-in followed by 14 000 iterations with 1:4 thinning to obtain a sample from the posterior distribution. We used the Gelman and Rubin (1992) test to check for convergence.

5.2.4 Results

A non-parametric model with Flemish and Finnish data

To evaluate the effect of fluoride-intake on the development of caries on the permanent first molars we have calculated the posterior expectations of the hazard ratios

    ℏ^{(c)}(t | tooth, gender, fluorosis) / ℏ^{(c)}(t | tooth, gender, fluorosis-free).

These hazard ratios together with their 95% equal-tail point-wise credible intervals can be found in Figure 5.2.

Figure 5.2: Signal Tandmobielr study. Bayesian non-parametric model based on Flemish and Finnish data. Posterior means of the hazard ratios between the fluorosis groups (solid line) and 95% point-wise equal-tail probability regions (dashed line). Panels: Boy, maxilla; Boy, mandible; Girl, maxilla; Girl, mandible; horizontal axis: time since emergence (years); vertical axis: HR (fluorosis/no fluorosis).

The PH assumption with respect to the covariate fluor seems to be satisfied since the credible intervals in all cases cover a horizontal line.
In three cases, this horizontal line is close to the dotted-dashed line y = 1, implying no effect of fluoride-intake on caries development. A positive effect of fluoride-intake seems to be present only for the mandibular permanent first molars in boys. There are also no deviations from the PH assumption with respect to gender and tooth (plots are not shown). This allowed us to assume for the caries model a PH effect of the three covariates, possibly including some interaction terms. By this semi-parametric assumption it was hoped to see more clearly the effect of fluoride-intake on caries experience.

A proportional hazards model with Flemish and Finnish data

For the reasons stated in the previous paragraph, we have fitted a model where the caries hazard function (5.3) was changed into

    λ^{(c)}_{i,l}(t) = Z_i × ℏ^{(c)}_0(t) × exp(β′ x_{i,l}) × I[U_{i,l} < t ≤ V_{i,l}],    (5.4)

where x_{i,l} and β are the same as in (5.1). The additional β-parameters were given a N(0, 10²) prior. However, the hazard function for emergence is still defined by (5.2). Posterior expectations of the hazard ratios between the fluorosis groups while controlling for the other covariates are given in the left part of Table 5.3.

Table 5.3: Signal Tandmobielr study. Bayesian PH models for the effect of fluorosis on caries on permanent first molars. Hazard ratios (95% equal-tail credible intervals (CI)) between the fluorosis groups while controlling for gender and jaw, for models fitted using both Flemish and Finnish data and using Flemish data only.

                     Flemish and Finnish data        Flemish data only
Group                Post. mean   95% CI             Post. mean   95% CI
Boys, maxilla        0.674        (0.492, 1.010)     0.651        (0.463, 0.960)
Boys, mandible       0.572        (0.414, 0.850)     0.549        (0.386, 0.779)
Girls, maxilla       0.991        (0.721, 1.364)     1.002        (0.698, 1.333)
Girls, mandible      0.840        (0.608, 1.136)     0.844        (0.602, 1.135)

The PH analysis for caries gives similar conclusions to the previous non-parametric analysis. A positive effect of fluoride-intake is now seen for the mandibular permanent first molars of boys and has a borderline positive effect for the maxillary permanent first molars of boys. However, no effect of fluoride-intake was seen for girls.

Table 5.4: Signal Tandmobielr study. Bayesian models with Flemish and Finnish data. Top: posterior means and 95% equal-tail credible intervals for the hyperparameters µ0 (conditional expectation of ηi for Finland), µ1 (conditional expectation of ηi for Flanders), τ^{-2} (conditional variance of ηi) and φ^{-1} (conditional variance of the frailties Zi). Bottom: means of the posterior predictive distributions and 95% equal-tail posterior predictive intervals for the birth time of dentition ηi in Finland and Flanders, respectively, and for the frailty term Zi.

Posterior mean (95% credible interval)
Hyperparameter       Non-parametric model      Cox regression model
µ0                   5.47 (5.40, 5.54)         5.45 (5.38, 5.52)
µ1                   5.69 (5.64, 5.73)         5.68 (5.64, 5.73)
τ^{-2}               0.48 (0.45, 0.52)         0.49 (0.45, 0.52)
φ^{-1}               3.85 (3.57, 4.17)         3.94 (3.58, 4.28)

Posterior predictive mean (95% posterior predictive interval)
Parameter            Non-parametric model      Cox regression model
ηi (Finland)         5.48 (4.12, 6.79)         5.45 (4.05, 6.84)
ηi (Flanders)        5.69 (4.33, 7.09)         5.69 (4.34, 7.01)
Zi                   1.02 (10^{-6}, 6.90)      0.95 (10^{-6}, 6.45)
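The entries of Table 5.4 are posterior and posterior predictive summaries of the kind that can be computed directly from MCMC output. The following minimal numpy sketch, using hypothetical arrays of sampled hyperparameter values (simulated stand-ins, not the actual BITE output), shows how equal-tail credible intervals and posterior predictive intervals for the birth time of dentition and the frailty term would be obtained.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 14_000 // 4  # e.g. number of retained MCMC iterations after 1:4 thinning

# Hypothetical posterior draws of the hyperparameters (stand-ins for real MCMC output):
xi1  = rng.normal(5.69, 0.02, M)    # mean birth time of dentition, Flanders
tau2 = rng.gamma(400, 1 / 195, M)   # precision of the birth time of dentition
phi  = rng.gamma(60, 1 / 230, M)    # frailty shape = inverse-scale hyperparameter

def equal_tail(x, level=0.95):
    """Mean and equal-tail interval of a sample of draws."""
    lo, hi = np.quantile(x, [(1 - level) / 2, 1 - (1 - level) / 2])
    return x.mean(), lo, hi

# Posterior summaries (upper part of Table 5.4), e.g. the conditional variances.
print("tau^-2 :", equal_tail(1 / tau2))
print("phi^-1 :", equal_tail(1 / phi))

# Posterior predictive draws (bottom part of Table 5.4): one new eta and Z per iteration.
eta_new = rng.normal(xi1, 1 / np.sqrt(tau2))   # eta_i ~ N(xi_1, tau^-2)
Z_new   = rng.gamma(phi, 1 / phi)              # Z_i ~ Gamma(phi, phi), i.e. rate = phi
print("eta (Flanders):", equal_tail(eta_new))
print("Z             :", equal_tail(Z_new))
```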
Remark concerning hyperparameters

The posterior expectations and 95% equal-tail credible intervals of the hyperparameters related to the birth times of dentition ηi and the frailties Zi are given in the upper part of Table 5.4. The non-parametric model and the PH model for caries give similar results. We now state our conclusions concerning the emergence process in Flanders and Finland. The emergence process starts slightly earlier in Finland (by approximately 0.2 years) than in Flanders, as is seen from the difference in the posterior expectations of the means of the birth time of dentition. The MCMC output for the hyperparameters can also be used to estimate properties of the predictive distributions of the birth time of dentition and the frailties. Their means and 95% equal-tail posterior predictive intervals are shown in the bottom part of Table 5.4, which shows that the average Finnish birth time of dentition is close to 5.5 years of age, slightly higher than the prior expectation but close to the value obtained by Härkänen et al. (2000) on another Finnish data set. The 95% posterior predictive intervals show that the actual birth time of dentition varies between about 4 and 7 years of age. Finally, the 95% posterior predictive interval of Zi shows a clear heterogeneity in the frailty for caries experience.

Sensitivity analysis

Firstly, the model (5.4) was fitted using the Flemish data only, to see how influential the inclusion of the Finnish data was. As seen in Table 5.3, the hazard ratios changed only slightly. The same was true for the remaining parameters. Moreover, the Finnish data improved only slightly the precision with which the emergence of the first permanent molars was estimated. This is seen in Figure 5.3, which compares the 95% point-wise equal-tail credible regions for the emergence hazard functions of the permanent first molars based on the analysis with both data sets and with the Flemish data set only. Still, the credible regions are somewhat narrower when both databases are used.

Figure 5.3: Signal Tandmobielr study. Bayesian PH models. Posterior means of the emergence hazard functions ℏ^{(e)}(· | tooth, gender) for the permanent first molars together with their 95% point-wise equal-tail probability regions. Comparison of the posterior means with (solid line) and without (dashed line) the additional Finnish data, together with the corresponding 95% probability regions (dotted-dashed line and dotted line, respectively). Panels: Boy, maxilla; Boy, mandible; Girl, maxilla; Girl, mandible; horizontal axis: time since birth time of dentition (years); vertical axis: hazard function.

To see how the behavior of the parameter estimates changes when the informative priors for the hyperparameters are modified, we have fitted the proportional hazards model with the Flemish data only, using different choices of priors for the hyperparameters. Specifically, we used the normal distributions N(3, 2), N(4, 1), N(5.2, 1) and N(6, 1) as priors for the expectation ξ0 of the birth time of dentition ηi. The standard deviation of the normal prior with mean 3 years was increased so as to cover realistic emergence times of permanent teeth.
We used Gamma(0.1, 0.1), Gamma(2, 2) and Gamma(10, 10) distributions as priors for the precision τ² of the birth time of dentition and for the precision φ of the frailties Zi. All other parameters were given flat priors and there is thus no reason to modify them. Posterior means and 95% equal-tail credible intervals for the hazard ratios between the fluorosis and fluorosis-free groups for the different choices of the prior distributions are shown in Figure 5.4, which shows that the influence of the choice of the prior distribution is not strong.

Figure 5.4: Sensitivity analysis. Evolution of the posterior means and 95% credible intervals for the hazard ratios between the fluorosis and fluorosis-free groups with changing prior distributions for the hyperparameters τ, φ and ξ0. Prior patterns 1, 5 and 9 use a N(3, 2) prior for ξ0; patterns 2, 6 and 10 use a N(4, 1) prior; patterns 3, 7 and 11 use a N(5.2, 1) prior; patterns 4, 8 and 12 use a N(6, 1) prior. Within each panel, the groups of patterns are labelled by the prior used for τ and φ: Γ(0.1, 0.1), Γ(2, 2) or Γ(10, 10). Panels: Boy, maxilla; Boy, mandible; Girl, maxilla; Girl, mandible; horizontal axis: prior pattern; vertical axis: hazard ratio.

We argue that our other assumptions are not strong. Indeed, we assume that the distributions of the birth time of dentition differ between the Finnish and the Flemish populations only in their means. Moreover, as indicated above, the Finnish data had only a slight impact on the results for the Flemish data. Further, the baseline hazards were estimated non-parametrically. Finally, different choices for the priors of the hyperparameters led to similar results, as discussed above.

5.2.5 Discussion

The model presented here allows for the analysis of survival data in dental research, where (doubly-)interval-censored data and dependencies between observations (e.g. between teeth in the same mouth) are common. Our specific application addresses a typical dental research question, i.e. whether fluoride-intake has a protective effect against caries. The results show that the protective effect of fluoride-ingestion is not convincing. We observed a positive effect only for the mandibular teeth of boys. This agrees with current guidelines for the use of fluoride in caries prevention, where only the topical application (e.g. fluoride in toothpaste) is considered to be essential (Oulis, Raadal, and Martens, 2000).

We acknowledge that our analyses could have been more refined if the amount of left- and right-censoring had been smaller, for instance if the study had started approximately one year earlier and had ended in high school. This would make our analyses less dependent on prior assumptions. Yet these prior assumptions are simply a reflection of basic dental knowledge and it would be a waste not to use them. Moreover, to our knowledge the Signal Tandmobielr study is possibly the largest longitudinal study executed with such great detail on dental aspects.

This section has illustrated the usefulness of the Bayesian approach.
Firstly, it was possible to incorporate prior information and to relax the parametric assumptions often made in survival analysis with interval-censored data. Secondly, even rather complex models could be specified for doubly-intervalcensored data. However, we have to admit that this approach is computationally demanding. On a Pentium IV 2 GHz PC with 512 MB RAM one BITE run took about 5 days to converge. However, in an epidemiological analysis where there is correlation among the subjects, where the response and/or the covariates are (right-, left- or interval-) censored and when we wish to avoid parametric assumptions we doubt any classical approach will suffice. 5.3. BAYESIAN ACCELERATED FAILURE TIME MODEL 5.3 59 Bayesian accelerated failure time model Most contributions to the AFT model in the Bayesian literature work explicitely only with right-censored data. However, using the idea of Bayesian data augmentation (Section 4.3) they can all be quite easily extended to handle also interval-censored data. Additionally, actually all papers dealing with the Bayesian AFT model use a Bayesian non-parametric approach (see Walker et al., 1999 or the book Ghosh and Ramamoorthi, 2003) for the distributional parts of the AFT model. In this section, we give a brief overview. Firstly, Christensen and Johnson (1988) and Johnson and Christensen (1989) consider the basic univariate AFT model (3.2) and use a Dirichlet process prior (Ferguson, 1973, 1974) for the underlying baseline survival distribution, i.e. the distribution of exp(ε). In the former paper, only a semi-Bayesian approach is used, whereas the latter paper presents a fully Bayesian analysis however, with uncensored data only. The authors state that “The analysis becomes totally intractable when there are censored observations.” Additionally, as discussed in Johnson and Christensen (1989), difficulties might arise due to the discrete nature of a Dirichlet process (the baseline survival distribution is discrete with probability one if it is assigned the Dirichlet process prior). An improvement is presented by Kuo and Mallick (1997) who consider a Dirichlet processes mixture (Lo, 1984) for either ε or exp(ε). Subsequently, Walker and Mallick (1999) suggest to use a diffuse, finite Pólya tree prior distribution described in Lavine (1992, 1994) and Mauldin, Sudderth, and Williams (1992) for the error term ε in the AFT model (3.2). The main advantages of the Pólya tree prior distribution are (1) it can assign probability one to the set of continuous distribution, (2) it is easy to constraint the resulting error term ε to have the median (or any other quantile) rather than the mean equal to zero (or any other fixed number) such that also the regression quantiles can be modelled, of which the median regression is the most important case. Additionally, Walker and Mallick (1999) break down the i.i.d. assumption of the error terms and assume also the population averaged AFT model (3.6). Successive approaches to the Bayesian non-parametric AFT concentrate on the median regression. Namely, Kottas and Gelfand (2001) suggest to use the Dirichlet process mixture of either unimodal parametric densities or unimodal step functions for the distribution of the error term ε in the basic AFT model (3.2). Another median regression AFT model is given by Hanson and Johnson (2002) who use a mixture of Pólya trees centered about a standard, parametric family of probability distributions as a prior for the error term ε. 
Finally, Hanson and Johnson (2004) consider a mixture of Dirichlet processes 60 CHAPTER 5. METHODS FOR INTERVAL-CENSORED DATA introduced by Antoniak (1974) (which is distinct from the Dirichlet process mixture used by Kuo and Mallick, 1997 or Kottas and Gelfand, 2001) as the prior for the error term ε in the basic AFT model (3.2). They also consider explicitely the interval-censored data. The area of multivariate survival data modelled by the mean of the Bayesian AFT model seems to be almost unexplored. Except the work Walker and Mallick (1999) we are not aware of any other contribution. Moreover, the structured modelling of dependencies by the mean of the cluster specific AFT model introduced in Section 3.4.3 seems to be absent at all in the literature. 5.4 Concluding remarks In this chapter and in Chapter 3 we came across with two fundamental regression models for the survival data. We mentioned that the most frequently used PH model has several drawbacks so that in many practical situations it is worthy to consider alternatives of which the AFT model is an appealing one. We pointed out that the AFT model whose distributional parts are parametrically specified can relatively easily be estimated even using the method of maximum-likelihood. However, especially for prediction purposes, it is important to avoid incorrectly specified parametric models since due to the censoring any parametric assumption is very difficult to check with survival data. For that reason, one aims for methods that leave the distributional parts of the model either completely unspecified or specify them in a flexible way. For the PH model, the partial likelihood due Cox (1975) is available for this purpose. Unfortunately, no similar concept exists for the AFT model. Several frequentist semi-parametric methods were reviewed in Sections 3.2, 3.4.2, 3.4.3, and 5.1.4. Nevertheless, we saw that, especially with interval censoring, or let alone doubly interval censoring, most of them become computationally intractable in practical situations. Moreover, with multivariate data, the situation becomes even more complex. On the other hand, the Bayesian approach together with data augmentation offer an appealing alternative allowing to formulate and also estimate realistically complex models even with multivariate and/or (doubly-)intervalcensored data. We have illustrated this issue on the Bayesian semi-parametric PH model in Section 5.2. In Section 5.3, we have subsequently reviewed existing semi-parametric approaches to the AFT model. However, we mentioned that most of them were primarily developed to handle only univariate data. Nevertheless, many survival problems lead to the analysis of the multivariate data. Concluding Remarks to Part I and Introduction to Part II We have introduced two versions of the AFT model - the population-averaged and the cluster-specific model that can be used to analyze the multivariate survival data. We have also mentioned that, especially for the cluster-specific AFT model (3.7), with unspecified distributional parts of the model, there is almost no methodology developed in the literature. In this thesis, we aim to present the methods to handle both the populationaveraged AFT model (3.6) and the cluster-specific AFT model (3.7) under the presence of multivariate and/or (doubly-)interval-censored data. At the same time, we want to minimize the parametric assumptions concerning the distributional parts of the model as much as possible. 
One possibility to reach this target is to use smoothing methods for the unknown distributional parts. In the literature, more often the baseline hazard function is smoothed (Section 5.1.3: Kooperberg and Clarkson, 1997; Joly et al., 1998; Betensky et al., 1999; Section 5.2: Härkänen et al., 2000; Komárek et al., 2005). However, with the AFT model, it is quite natural to use a flexible smooth expression for the density, either of the error term ε and/or the random effects b. For example, for the bivariate population-averaged AFT model, Pan and Kooperberg (1999) use this idea in combination with the multiple imputation (see Section 3.4.2). In principal, the methods presented in Part II of this thesis will be built on the same basis as that of Pan and Kooperberg (1999). Whereas they express the logarithm of the unknown density using the splines and use numerical integration to evaluate and optimize the likelihood we will model directly the density using a linear combination of suitable basis parametric functions and simplify thus the likelihood evaluation (see Section 6.2.4). In contrast to Pan 61 62 CONCLUDING REMARKS TO PART I AND INTRODUCTION TO PART II and Kooperberg (1999) we also exploit another strategy to determine the number of the basis functions. Whereas they choose the optimal number of basis functions using a criterion like AIC (Akaike, 1974) we will either take an overspecified number of the basis functions and prevent identifiability problems and overfitting the data using a penalty term (Chapters 7, 9, 10) or estimate the number of the basis functions simultaneously with the other model parameters (Chapter 8). Further, we will show that for univariate survival data we are able, even under the interval-censoring to use maximum-likelihood based methods without the need for multiple imputation (Chapter 7). With the introduction of multivariate and doubly-interval-censored data we avoid multiple imputation by switching to the Bayesian approach (Chapters 8, 9, 10) which is more advantageous in such situation as was explained in Chapter 4. Part II Accelerated Failure Time Models with Flexible Distributional Assumptions Chapter 6 Mixtures as Flexible Models for Unknown Distributions We aim to develop the accelerated failure time models with flexibly specified distributional parts. We have already sketched that we wish to use flexible, yet smooth expressions for densities involved in the specification of these distributional parts. In this chapter, let g(y) (g(y)) denote an unknown density of some generic univariate random variable Y (random vector Y ). We outline two similar, though conceptually different, methods to approximate g(y) or g(y) in a flexible and smooth way, namely 1. The classical mixture approach; 2. An approach based on penalized smoothing. We introduce the classical mixture approach in Section 6.1. In Section 6.2, the penalized smoothing approach exploiting B-splines will be given. In Section 6.3, we replace the B-splines by normal densities and introduce the penalized normal mixture. Finally, we compare the classical and penalized normal mixture in Section 6.4. 6.1 6.1.1 Classical normal mixture From general finite mixture to normal mixture To model unknown distributional shapes finite mixture distributions have been advocated by, e.g., Titterington, Smith, and Makov (1985, Section 2.2) as appealing semi-parametric structures. Using a finite mixture the density 65 66 CHAPTER 6. 
MIXTURES AS FLEXIBLE MODELS g(y) is modelled in the following way: g(y) = g(y | θ) = K X wj gj (y), (6.1) j=1 where gj , j = 1, . . . , K are known densities and θ = (K, w1 , . . . , wK )′ is the vector of unknown parameters. Namely, K is the number of mixture components, and P wj , j = 1, . . . , K are unknown weights satisfying wj > 0, j = 1, . . . , K and j wj = 1. In general, the number of mixture components, K, is assumed unknown, however, due to difficulties outlined further in the text, estimation of K is often separated from estimation of the remaining parameters, especially when using maximum-likelihood based methods. Further, it is often assumed that the mixture components, gj , j = 1, . . . , K have a common parametric form g̃ and each mixture component depends on an unknown vector of parameters η j , j = 1, . . . , K. Expression (6.1) changes then into K X wj g̃(y | η j ), (6.2) g(y) = g(y | θ) = j=1 w1 , . . . , wK , η ′1 , . . . , η ′K )′ . A frequently used particular form where θ = (K, of (6.2) is a normal mixture where g̃(y | η j ) equals ϕ(y | µj , Σj ), a density of the (multivariate) normal distribution with mean µj and covariance matrix Σj . For instance, Verbeke and Lesaffre (1996) use a mixture of multivariate normal distributions with Σj = Σ for all j to model a distribution of the random effects in the linear mixed model. In this thesis, we use the classical normal mixture only in a univariate context, i.e. to express an unknown univariate density g(y) as g(y) = g(y | θ) = K X j=1 wj ϕ(y | µj , σj2 ). (6.3) In this case, the vector θ equals 2 ′ θ = (K, w1 , . . . , wK , µ1 , . . . , µK , σ12 , . . . , σK ). (6.4) Figure 6.1 illustrates how two- or four-component, even homoscedastic, normal mixtures can be used to obtain densities of different shapes. 6.1.2 Estimation of mixture parameters Let θ be a vector given by the expression (6.4) and containing all unknown parameters of model (6.3). Suppose first that an i.i.d. sample y1 , . . . , yn 6.1. CLASSICAL NORMAL MIXTURE 67 from a density g(y | θ) is available to estimate the unknown parameter vector θ. Maximum-likelihood based methods pose two main difficulties when estimating θ: 1. When K, the number of mixture components is unknown, one of the basic regularity conditions for the validity of the classical maximumlikelihood theory is violated. Namely, the parameter space does not have a fixed dimension. Indeed, the number of unknowns (number of unknown mixture weights, means and variances) is one of the unknowns. See, e.g., Titterington et al. (1985, Section 1.2.2) for a detailed discussion of this difficulty. 2. For a fixed K ≥ 2, the likelihood becomes unbounded resulting in nonexistence of the maximum-likelihood estimate when one of the mixture means, say µ1 , is equal to one of the observations yi , i = 1, . . . , n and when the corresponding mixture variance, σ12 , converges to zero. See, e.g., McLachlan and Basford (1988, Section 2.1) for more details. In classical frequentist approach, the first problem is tackled by consecutive fitting of several models with different numbers of mixture components and choosing the best one using some criterion, e.g., Akaike’s information criterion (Akaike, 1974). To avoid the second problem, homoscedastic normal mixtures, i.e. with σj2 = σ 2 for all j are used leading to a bounded likelihood. µ1 µ2 µ1 µ1 µ2 µ1 µ2 µ3 µ4 µ2 Figure 6.1: Several densities expressed as two- or four-component homoscedastic normal mixtures. 68 CHAPTER 6. 
MIXTURES AS FLEXIBLE MODELS Bayesian methodology, on the other hand, offers a unified framework to estimate both the number of mixture components K and heteroscedastic normal mixtures in the same way as any other unknown parameters, i.e. using proper posterior summaries. A breakthrough in Bayesian analysis of models with a parameter space of varying dimension is the introduction of the reversible jump Markov chain Monte Carlo (RJMCMC) algorithm by Green (1995) which allows to explore a joint posterior distribution of the whole parameter vector θ from model (6.3), including the number of mixture components K. Explicit application of the RJMCMC algorithm to normal mixtures is then described by Richardson and Green (1997). The fact that the likelihood is unbounded for heteroscedastic normal mixtures leads to an improper posterior distribution in the Bayesian setting when a fully non-informative prior distribution is used for the variances of the Q mix−2 2 ) ∝ ture components (mixture variances), i.e. when p(σ12 , . . . , σK j σj . However, the problem is solved by using a slightly informative prior distribuQ tion for the mixture variances. For instance, replacing j σj−2 by a product of inverse gamma distributions with parameters h1 and h2 where h1 = h2 = 0.001 or h1 = 1, h2 = 0.005, the classical vague priors, is already sufficient to prevent that the mixture variances will tend to zero causing an infinite likelihood. We use a classical normal mixture model (6.3) for the density of the error distribution in the cluster-specific AFT model in Chapter 8. To avoid difficulties with the maximum-likelihood estimation outlined above and for other reasons (see Sections 4.1 and 4.2) only Bayesian methodology will be considered here. In Chapter 8 we also discuss the RJMCMC algorithm and the issue of the prior distribution for mixture variances in more detail. 6.2 6.2.1 Penalized B-splines Introduction to B-splines Different types of smoothing are routinely used in various places of modern statistics to express an unknown (smooth) function. Most often, either regression surfaces or densities are smoothed; see, e.g., Fahrmeir and Tutz (2001, Chapter 5) and Hastie, Tibshirani, and Friedman (2001) for an overview. In this thesis, we concentrate on smoothing based on splines. For simplicity, we consider the univariate case first. The unknown function g(y) (density in our case) is expressed as a linear combination (mixture) of suitable basis 6.2. PENALIZED B-SPLINES 69 spline functions B1 (y), . . . , BK (y), i.e. g(y) = g(y | θ) = K X wj Bj (y), (6.5) j=1 where θ = w = (w1 , . . . , wK )′ . Expression (6.5) is similar to (6.3) introduced in the previous section. Note however that in contrast to normal densities in (6.3), the basis spline functions Bj (y), j = 1, . . . , K are always fully specified, including their location and scale, and the number of basis splines, K, is always fixed beforehand. The only quantities that have to be estimated are the spline coefficients (mixture weights) w. So called B-splines (de Boor, 1978; Dierckx, 1993) form, for their numerical stability and simplicity, a suitable system of basis spline functions. Their use in statistics was promoted especially by Eilers and Marx (1996). The Bspline is a piecewise polynomial function. To fully specify the B-spline basis, µ1 µ2 µ3 µ4 µ5 µ6 µ7 µ8 µ9 µ10 µ11 µ1 µ2 µ3 µ4 µ5 µ6 µ7 µ8 µ9 µ10 µ11 µ12 Figure 6.2: Basis B-splines of degree d = 1 (upper panel) and degree d = 2 (lower panel). 70 CHAPTER 6. MIXTURES AS FLEXIBLE MODELS B1 (y), . . 
. , BK (y), we have to determine 1. Degree d of the polynomial pieces; 2. A set of values (knots) µ1 ≤ · · · ≤ µd+1 < · · · < µK+1 ≤ · · · ≤ µK+d+1 such that the interval (µ1 , µK+d+1 ) covers the domain of the function g(y) we wish to express using the B-splines. Given that, the value of each basis B-spline can easily be computed at an arbitrary point y ∈ R (see de Boor, 1978). Figure 6.2 shows a basis of linear (d = 1) and quadratic (d = 2) B-splines with K = 9. It can be found that the jth basis B-spline of degree d 1. Consists of d + 1 polynomial pieces; 2. Is only positive on the interval (µj , µj+d+1 ); 3. Has continuous derivatives up to order d − 1; 4. Except on boundaries it overlaps with 2d polynomial pieces of its neighbors. µ1 µ3 µ5 µ7 µ9 µ11 µ13 µ1 µ3 µ5 µ7 µ9 µ11 µ13 µ1 µ3 µ5 µ7 µ9 µ11 µ13 µ1 µ3 µ5 µ7 µ9 µ11 µ13 Figure 6.3: Several functions expressed as linear combinations of cubic Bsplines with K = 9 and equidistant set of knots. 6.2. PENALIZED B-SPLINES 71 Furthermore, for all y ∈ (µ1 , µK+d+1 ) the basis B-splines sum up to one, PK i.e. B (y) = 1. Finally, Dierckx (1993) gives simple recursive formulas j j=1 to compute derivatives or integrals of the function g(y) expressed by (6.5). Figure 6.3 illustrates that B-spline mixture can result in functions of various shapes. 6.2.2 Penalized smoothing Choosing the optimal number and position of knots is generally a complex task in the area of spline smoothing. Too many knots leads to overfitting the data; too few knots leads to underfitting and inaccuracy. O’Sullivan (1986, 1988) proposed to take a relatively large number of knots and to restrict the flexibility of the fitted curve by putting a penalty on the second derivative. In the context of B-splines, Eilers and Marx (1996) suggested 1. To use a large number of equidistant knots covering the domain of the function g(y) one wishes to smooth; 2. To estimate the spline coefficients using the method of penalized maximum-likelihood. Further, they propose to base the penalty on squared finite higher-order differences between adjacent spline coefficients wj . They call their method as penalized B-spline, or P-spline smoothing. Eilers and Marx (1996) use the P-splines primarily to smooth regression surfaces although they propose also a methodology, based on the Poisson generalized linear model (GLM), for smooth estimation of the density with the i.i.d. data. We sketch this method in Section 6.2.4. The strategy of several further developments (Chapters 7, 9, 10) in this thesis is based on the ideas of Eilers and Marx (1996), modified and adapted to regression modelling with censored data. Namely, 1. For reasons stated in Section 6.3 we replace the basis B-splines by normal densities with a common variance; 2. We base the penalty term on squared finite higher-order differences between appropriate transformations of the adjacent spline coefficients wj , see Section 7.2.2 for a motivation; 3. More complex models in Chapters 9 and 10 will be estimated using the Bayesian methodology using the prior distributions inspired by the penalty term used in the penalized maximum-likelihood applications; 72 CHAPTER 6. MIXTURES AS FLEXIBLE MODELS 4. We will not use the Poisson GLM-based density estimation, see Section 6.2.4 for the reasons why. In agreement with Eilers and Marx (1996) we use a set of equidistant knots in all penalized-based developments. 
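As a concrete illustration of the P-spline ingredients just described (a rich equidistant B-spline basis plus a difference penalty on adjacent coefficients), the sketch below builds a cubic B-spline design matrix and the finite-difference matrix from which a penalty of the form λ‖D_r w‖² can be formed. It assumes SciPy ≥ 1.8 for `BSpline.design_matrix`, uses an arbitrary illustrative knot grid and penalty order, and is only meant to show the building blocks of Eilers and Marx (1996), not the estimation methods developed later in this thesis.

```python
import numpy as np
from scipy.interpolate import BSpline

degree = 3                                     # cubic B-splines
inner = np.linspace(-6, 6, 41)                 # equidistant knots, delta = 0.3
knots = np.r_[np.repeat(inner[0], degree),     # repeat the boundary knots so the
              inner,                           # basis is well defined near the edges
              np.repeat(inner[-1], degree)]
K = len(knots) - degree - 1                    # number of basis B-splines

# Design matrix B[i, j] = B_j(y_i) on a grid of evaluation points.
y = np.linspace(-5.999, 5.999, 501)
B = BSpline.design_matrix(y, knots, degree).toarray()
assert np.allclose(B.sum(axis=1), 1.0)         # basis B-splines sum to one on the domain

# r-th order difference matrix D_r; the penalty on coefficients w is lam * ||D_r w||^2.
r, lam = 2, 10.0
D = np.diff(np.eye(K), n=r, axis=0)

def penalty(w):
    return lam * np.sum((D @ w) ** 2)

w = np.ones(K) / K                             # some coefficient vector
print(B.shape, D.shape, penalty(w))
```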
6.2.3 B-splines in the survival analysis General splines have been suggested at several places in the survival literature to model flexibly either the (log-)density/(log-)hazard function or the effect of covariates replacing a linear predictor by a spline function. See the discussion section of Abrahamowicz, Ciampi, and Ramsay (1992), the introductory section of Kooperberg, Stone, and Truong (1995) or Chapter 5 of Therneau and Grambsch (2000) for an overview. More specifically, B-splines have been used by Rosenberg (1995) who uses their cubic variant to express the baseline hazard function in the Cox’s PH model. He choses the optimal number of knots according to Akaike’s information criterion (Akaike, 1974) while placing the knots to the quantiles of uncensored observations. An approach based on the penalized maximumlikelihood is given by Joly, Commenges, and Letenneur (1998) who use monotone splines (close relatives of the B-splines introduced by Ramsay, 1988) to model the baseline hazard function, in the Cox’s PH model as well. Tutz and Binder (2004) and Kauermann (2005b) use B-splines to extend the basic Cox’s PH model by allowing for time-varying regression parameters. Recently, Lambert and Eilers (2005) use a Bayesian version of penalized Bsplines to model both the baseline hazard and the effect of covariates in the Cox’s PH model in an actuarial way. To our best knowledge, we are not aware of any approach where the B-splines would be used to model the density of the survival times. 6.2.4 B-splines as models for densities The function g(y | θ) expressed by (6.5) can serve as a model for the density of a continuous distribution with domain (µ1 , µK+d+1 ) provided g(y | θ) ≥ 0 for all y ∈ (µ1 , µK+d+1 ), Z µK+d+1 g(y | θ) dy = 1. µ1 (6.6) (6.7) 6.2. PENALIZED B-SPLINES 73 Condition (6.6) is satisfied if we require all the spline coefficients to be positive, i.e. wj > 0, j = 1, . . . , K. (6.8) Constraint (6.7) can easily be avoided when we change the expression (6.5) for g(y | θ) into g(y | θ) = Q−1 Q= Z K X wj Bj (y), (6.9) j=1 K µK+d+1 nX µ1 j=1 o wj Bj (y) dy The constant Q can easily be computed using the formulas given by Dierckx (1993, Section 1.3). For example, in the case of coincident boundary knots (i.e. µ1 = · · · µd+1 and µK+1 = · · · = µK+d+1 ) the constant Q equals K Q= 1 X wj (µj+d+1 − µj ). d+1 j=1 We show in Section 6.3.2 how to avoid also the inequality constraints (6.8). A somewhat different approach to estimate a density function using B-splines has been suggested in Eilers and Marx (1996, Section 8), namely by smoothing a histogram. They divide the range of the data into a large number K of bins, each of length h, and let the midpoints of the bins to define the knots µ1 , . . . , µK . The raw continuous data, y1 , . . . , yn are changed into counts n1 , . . . , nK such that nj , j = 1, . . . , K equals the number of raw observations yi , i = 1, . . . , n with µj − h/2 ≤ yi < µj + h/2. The counts n1 , . . . , nK constitute a histogram. They assume that each of these counts follows a Poisson distribution with expectation E(n1 ), . . . , E(nK ), respectively. A smoothed histogram is obtained by expressing the Poisson log-expectations as the Bspline, namely K X log E(nj ) = wk Bk (µj ), j = 1, . . . , K. k=1 The corresponding smooth density of the original continuous data is then given by K nX o g(y | θ) = Q−1 exp wk Bk (y) , k=1 74 CHAPTER 6. MIXTURES AS FLEXIBLE MODELS where Q is an appropriate proportionality constant. 
Eilers and Marx (1996) argue that the use of penalized maximum-likelihood estimation provides stable and useful results and does not lead to any pathological results resulting from discretization of the data. For our developments in the context of the AFT model, we believe that the approach with the density directly expressed as a mixture of B-splines is more advantageous since it leads to a simpler likelihood evaluation. Remember that with censored observations the likelihood involves evaluation of integrals of the assumed density (see Section 4.1 and 4.2). With the density (6.9) these integrals are simply mixtures of integrated basis B-splines whose computation only involves integration of polynomials. Nevertheless, usage of the smoothed histogram approach in the censored data regression context is presented by Lambert and Eilers (2005). 6.2.5 B-splines for multivariate smoothing The concept of B-splines can be extended to the multivariate setting, to smooth (estimate) a function g(y) of several variables. For example the bivariate case is achieved by replacing the formula (6.5) by g(y) = g(y1 , y2 ) = g(y | θ) = K2 K1 X X wj1 , j2 B1, j1 (y1 ) B2, j2 (y2 ), j1 =1 j2 =1 where B1, j1 , j1 = 1, . . . , K1 is a set of basis B-splines of degree d defined by knots µ1, 1 , . . . , µ1, K1 +d+1 , B2, j2 , j2 = 1, . . . , K2 a set of basis B-splines of degree d defined by a generally different set of knots µ2, 1 , . . . , µ2, K2 +d+1 , and θ = (w1,1 , . . . , wK1 ,K2 )′ . Namely, g(y | θ) is expressed as a Kronecker product of univariate B-splines and this idea can be extended also to higher dimensions. 6.3 6.3.1 Penalized normal mixture From B-spline to normal density Using the B-spline expression (6.5) to model a survival density has one drawback. Namely, the support of the resulting density g(y | θ) is always bounded and equal to the interval (µ1 , µK+d+1 ). However, most continuous survival distributions are thought of as having a support of (0, ∞) on the time scale and the real line on the log-scale. While in practice this might not constitute 6.3. PENALIZED NORMAL MIXTURE 75 any difficulty, in theory it might be more comfortable to approximate a density having an infinite support. Remember also that we aim to approximate densities of either the error distribution in the AFT model or the distribution of the random effects in the same model. This implies that it might be quite difficult in some settings to find a proper range of the density for the error terms and/or random effects as both distributions are seen from the data only indirectly. However, one can easily find that the basis B-spline of degree d is very close to the density of the standard normal distribution in the sense of the following proposition. Proposition 6.1. Let B d (y) be a basis B-spline of degree d defined on the grid of d + 2 equidistant knots µd1 = −δ d+1 , 2 ..., µdd+2 = δ with δ = µdj+1 − µdj , j = 1, . . . , d + 1 equal to d Bst (y) = r p d+1 d B (y), 12 d+1 , 2 12/(d + 1). Let y∈R be a standardized basis B-spline of degree d. Then d lim Bst (y) = ϕ(y) d→∞ uniformly for all y ∈ R, where ϕ denotes a density of a standard normal distribution. Proof. We give only main ideas of the proof. All technical details can be found in Unser, Aldroubi, and Eden (1992). Firstly, an arbitrary basis B-spline of degree d is proportional to the density of a sum of d + 1 independent uniformly distributed random variables. 
The properly standardized basis B-spline, B^d_st(y), is then equal to the density of a zero-mean, unit-variance random variable given as a sum of d + 1 independent uniformly distributed random variables. The proposition then follows from the central limit theorem (see, e.g., Billingsley, 1995, Section 27).

The property outlined in Proposition 6.1 is illustrated in Figure 6.4. Moreover, the convergence is rather fast. Indeed, the standardized cubic basis B-spline is already quite close to the standard normal density. This reasoning led us to replace the basis B-splines in the expression (6.5) by normal densities whose means are equal to the knots and whose variance is equal to a common value σ_0². In accordance with the idea of penalized B-splines (see Section 6.2.2), we use a larger number of equidistant knots chosen beforehand.

Figure 6.4: Standardized basis B-splines of degree 0 to 8 (solid line) compared to a standard normal density (dashed line).

Additionally, as explained in Section 6.3.3, we always use an odd number of knots, symmetric around the middle knot. For this reason, the number of mixture components will be indicated by 2K + 1 and the knots (which serve as the means) denoted by µ_{−K}, . . . , µ_0, . . . , µ_K. Namely, the unknown function g(y) (a density) is approximated by

    g(y) = g(y | θ) = Σ_{j=−K}^{K} w_j ϕ(y | µ_j, σ_0²),    (6.10)

where θ = (w_{−K}, . . . , w_K)′. The basis standard deviation σ_0 is chosen beforehand, as are the knots. For its choice we adopted the value 2δ/3, where δ = µ_{j+1} − µ_j, j = −K, . . . , K − 1, is the distance between two consecutive knots (means). The motivation for this choice is an attempt to keep a correspondence with the cubic B-splines. Remember that the basis cubic B-spline covers an interval of length 4δ. The same is nearly true for the normal density with variance (2δ/3)² if we admit that the N(µ, σ²) density is practically zero outside the interval (µ − 3σ, µ + 3σ). In this context, we will call (6.10) the penalized normal mixture.

6.3.2 Transformation of mixture weights

To ensure that the function g(y | θ) given by (6.10) is a density of some continuous distribution, we have to impose constraints analogous to (6.8) and (6.9) upon the mixture weights w = (w_{−K}, . . . , w_K)′. Namely, they have to satisfy

    w_j > 0,  j = −K, . . . , K,    (6.11)
    Σ_{j=−K}^{K} w_j = 1.    (6.12)

To avoid constrained estimation, one can use an alternative parametrization based on transformed mixture weights a = (a_{−K}, . . . , a_K)′:

    a_j(w) = log(w_j / w_0),  j = −K, . . . , K.    (6.13)

Inversely, the original weights w are computed from the transformed weights a by

    w_j(a) = exp(a_j) / Σ_{k=−K}^{K} exp(a_k),  j = −K, . . . , K.    (6.14)

Instead of estimating the constrained weights w, the vector a_{−0} of unconstrained transformed weights, except a_0 which is fixed to zero, is estimated. Note that the weights w(a) expressed by (6.14) automatically satisfy both (6.11) and (6.12).
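A minimal numerical sketch of (6.10) combined with the re-parametrization (6.13)-(6.14) is given below. It uses an illustrative grid consistent with the recommendations of Section 6.3.3 (knots from -6 to 6 with δ = 0.3 and σ_0 = 2δ/3); the transformed weights a are arbitrary numbers, chosen only to show that any real vector with a_0 = 0 yields a valid density.

```python
import numpy as np
from scipy.stats import norm

# Equidistant knots (means) mu_{-K}, ..., mu_K and the basis standard deviation sigma_0.
delta = 0.3
mu = np.linspace(-6, 6, 41)        # 2K + 1 = 41 knots, i.e. K = 20
sigma0 = 2 * delta / 3

def weights(a):
    """Transformed coefficients a (reference a_0 = 0) -> mixture weights w, see (6.13)-(6.14).
    The result is positive and sums to one by construction."""
    e = np.exp(a - a.max())        # subtracting the maximum only improves numerical stability
    return e / e.sum()

def g(y, a):
    """Penalized normal mixture density (6.10) evaluated at the points y."""
    w = weights(a)
    comp = norm.pdf(np.asarray(y)[:, None], loc=mu, scale=sigma0)  # N(mu_j, sigma_0^2) densities
    return comp @ w

# Arbitrary transformed weights with a_0 = 0 for the middle (reference) component.
rng = np.random.default_rng(2)
a = rng.normal(0.0, 1.0, size=mu.size)
a[mu.size // 2] = 0.0
y = np.linspace(-7, 7, 5)
print(weights(a).sum(), g(y, a))
```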
Further, an arbitrary mixture component can be chosen to be the reference one having a corresponding a coefficient fixed to zero without any impact on the results. However, for notational convenience, without loss of generality, we will assume that a0 = 0. 6.3.3 Penalized normal mixture for distributions with an arbitrary location and scale Let Y be a random variable with a density g(y) with var(Y ) = τ 2 . E(Y ) = α, To be able to use the same grid of knots – means µ−K , . . . , µK for distributions with an arbitrary location α and scale τ we incorporate these two parameters in the expression (6.10) for the unknown density g(y), i.e., the density g(y) will be approximated by g(y) = g(y | θ) = τ −1 K X j=−K y − α 2 µ , σ wj (a) ϕ j 0 , τ (6.15) where θ = (a−K , . . . , aK , α, τ )′ . In other words, the density of the standardized random variable Y ∗ = τ −1 (Y − α) is approximated by g∗ (y ∗ | θ ∗ ) = K X j=−K wj (a) ϕ y ∗ µj , σ02 , where θ ∗ = (a−K , . . . , aK )′ . The intercept α and the scale τ will be estimated simultaneously with the transformed mixture weights a. With expression (6.15), the knots µ−K , . . . , µK have to cover a high probability region of the zero-mean, unit-variance distribution. In most practical situations, the choice with µ−K equal to a value between −6 and −4.5 and µK equal to a value between 4.5 and 6 provides the range of the knots broad enough. Furthermore, a distance δ of 0.3 between two consecutive knots is small enough to approximate most smooth densities. As an illustration, we computed the L2 -distance between the standard normal density and its best approximation using a penalized mixture (6.15) with µ−K = −6, µK = 6, different choices of δ = µj+1 − µj , and σ0 = 2δ/3. This distance is equal to 0.00570 for δ = 1 (K = 6), and drops to 0.00104 for δ = 0.75 (K = 8). When plotted, the penalized mixture (6.15) is indistinguishable from the normal density at δ = 0.75. Further, for δ equal to 0.5 (K = 12), 0.4 (K = 15), 0.3 6.3. PENALIZED NORMAL MIXTURE 79 (K = 20), 0.2 (K = 30), and 0.1 (K = 60) we obtain distances of 0.00031, 0.00022, 0.00017, 0.00014, and 0.00012, respectively. Clearly, for δ = 0.3 the penalized mixture and the normal density are quite close. 6.3.4 Multivariate smoothing In Section 6.2.5 we discussed how the Kronecker product of univariate Bsplines can be used to model unknown multivariate functions. The same idea can be used also with the penalized normal mixture. In this thesis, we use the multivariate penalized normal mixture only in the bivariate setting which will be discussed now. Extensions to higher dimensions are obvious, only with more complex notation. Firstly, we note that the bivariate basis formed of the Kronecker product of univariate normal densities is actually the basis formed of bivariate normal densities with diagonal covariance matrices. Indeed, for arbitrary y1 ∈ R and y2 ∈ R ϕ(y1 | µ1 , σ12 ) ϕ(y2 | µ2 , σ22 ) = ϕ2 (y1 , y2 | µ, Σ), where ϕ2 (· | µ, Σ) is a density of N2 (µ, Σ) with µ = (µ1 , µ2 )′ and Σ = diag(σ12 , σ22 ). Analogously to the univariate formula (6.10), the unknown bivariate density g(y1 , y2 ) = g(y) is expressed by g(y) = g(y | θ) = K1 X K2 X j1 =−K1 j2 =−K2 wj1 ,j2 ϕ(y | µ(j1 ,j2 ) , Σ), (6.16) where µ(j1 ,j2 ) = (µ1,j1 , µ2,j2 )′ , j1 = −K1 , . . . , K1 , j2 = −K2 , . . . , K2 is a fixed fine grid of knots, Σ = diag(σ12 , σ22 ) is a fixed basis covariance matrix (the same for all mixture components) and W = (wj1 ,j2 ), j1 = −K1 , . . . , K1 , j2 = −K2 , . . . 
, K2 a matrix of unknown mixture weights satisfying wj1 ,j2 > 0, j1 = −K1 , . . . , K1 , j2 = −K2 , . . . , K2 (6.17) wj1 ,j2 = 1. (6.18) K1 X K2 X j1 =−K1 j2 =−K2 The vector θ of unknown parameters contains the elements of the matrix W. Similarly to Section 6.3.2, the constraints (6.17) and (6.18) are avoided by the reparametrization of the weight matrix W into the matrix A = (aj1 ,j2 ), 80 CHAPTER 6. MIXTURES AS FLEXIBLE MODELS j1 = −K1 , . . . , K1 , j2 = −K2 , . . . , K2 of transformed weights by wj1 ,j2 (A) = w j1 ,j2 , j1 = −K1 , . . . , K1 , w0,0 exp(aj1 ,j2 ) , j2 = −K2 , . . . , K2 . K2 K1 P P exp(ak1 ,k2 ) aj1 ,j2 (W) = log (6.19) k1 =−K1 k2 =−K2 For notational convenience and without loss of generality, the mixture component (0, 0) is chosen to be the baseline with a0,0 = 0. Moments of the bivariate penalized normal mixture It is useful to stress that although all bivariate normal components in (6.16) are uncorrelated the covariance matrix of the random vector (Y1 , Y2 )′ with the density g(y) = g(y | θ) defined by (6.16) is, except for a special combination of mixture weights, not diagonal. Namely, E(Y1 ) = K1 X wj1 + µ1,j1 , var(Y1 ) = + K1 X j1 =−K1 var(Y2 ) = σ22 + K2 X j2 =−K2 cov(Y1 , Y2 ) = K1 X w+j2 µ2,j2 , j2 =−K2 j1 =−K1 σ12 E(Y2 ) = K2 X K2 X o2 n wj1 + µ1,j1 − E(Y1 ) , n o2 w+j2 µ2,j2 − E(Y2 ) , j1 =−K1 j2 =−K2 n on o wj1 ,j2 µ1,j1 − E(Y1 ) µ2,j2 − E(Y2 ) , where subscript + means summation over the range of the corresponding index. Bivariate penalized normal mixture for distributions with an arbitrary location and scale Analogously to Section 6.3.3 we introduce here an extra intercept parameter vector α = (α1 , α2 )′ and an extra scale parameter vector τ = (τ1 , τ2 )′ to 6.4. CLASSICAL VERSUS PENALIZED NORMAL MIXTURE 81 allow for modelling the bivariate densities of a random vector Y = (Y1 , Y2 )′ with a general location and scales, i.e. with E(Y1 ) = α1 , var(Y1 ) = τ12 , E(Y2 ) = α2 , var(Y2 ) = τ22 . As before, the same values of the extreme knots µ1,−K1 , µ1,K1 , µ2,−K2 , µ1,K2 and the basis standard deviations σ1 , σ2 can be used for distributions with different location and scale. Namely the bivariate density g(y) of a general distribution will be approximated by g(y) = g(y | θ) = (τ1 τ2 )−1 K1 X K2 X j1 =−K1 j2 =−K2 (6.20) y − α y − α 1 1 2 2 wj1 ,j2 (A) ϕ2 , µ(j1 ,j2 ) , Σ , τ1 τ2 where θ = (a−K1 ,−K2 , . . . , aK1 ,K2 , α1 , α2 , τ1 , τ2 )′ . In other words, the density of the standardized random vector ! ! ! −1 ∗ Y − α τ 0 Y 1 1 1 1 = Y∗= Y2 − α2 0 τ2−1 Y2∗ is approximated by g∗ (y ∗ | θ ∗ ) = K1 X K2 X j1 =−K1 j2 =−K2 wj1 ,j2 (A) ϕ(y ∗ | µ(j1 ,j2) , Σ), (6.21) where the vector θ ∗ contains only the elements of the matrix A of transformed weights. The same guidelines as in the univariate case (Section 6.3.3) will be applied for the choice of the grid points and the basis standard deviations, i.e. both µ1,−K1 , . . . , µ1,K1 and µ2,−K2 , . . . , µ2,K2 being the univariate grids of equidistant knots with the distance between the two knots equal to δ ≈ 0.3, with the minimal knot lying between −6 and −4.5, the maximal knot lying between 4.5 and 6 and basis standard deviations equal σ1 = σ2 = 2δ/3. 6.4 Classical versus penalized normal mixture We finalize this chapter by an explicit comparison of the classical normal mixture and penalized normal mixture. 82 CHAPTER 6. 
6.4 Classical versus penalized normal mixture

We finalize this chapter with an explicit comparison of the classical normal mixture and the penalized normal mixture.

• With the penalized normal mixture, a relatively large but fixed number of mixture components is invariably needed and the smoothness of the resulting smoothed distribution is optimized via a penalty term. On the other hand, with the classical mixture, a small number of mixture components is often sufficient, but the number of components has to be estimated, which might cause some difficulties as outlined in Section 6.1.2;

• The fine grid of fixed knots in the penalized mixture approach prevents inaccuracy in the estimate of the unknown density, while the penalization inhibits overfitting. In contrast, in the case of a classical mixture, the means and the standard deviations of the mixture components must be estimated;

• In order to use a standard grid of knots, we have explicitly included the intercept and scale parameters in the model specification when using the penalized approach. This is not desirable with the classical mixture approach, as both the overall intercept and the overall scale are implicitly defined by the means and standard deviations of the mixture components;

• Extension of univariate smoothing to multivariate smoothing is conceptually simple with the penalized approach, as was shown in Section 6.3.4 using the Kronecker product of the basis functions. In higher dimensions, there are only some computational difficulties arising from the fact that the number of unknown parameters increases exponentially with the dimension. Extension of the classical mixture approach to higher dimensions is relatively easy with a fixed number of mixture components; however, it is not straightforward when the number of mixture components has to be estimated simultaneously with the remaining parameters. Even with the Bayesian approach and the reversible jump MCMC algorithm mentioned in Section 6.1.2, the multivariate extensions are still an area of active research, see Dellaportas and Papageorgiou (2006) for recent developments.

Chapter 7

Maximum Likelihood Penalized AFT Model

In this chapter, we present the AFT model for the case of independent observations. The error distribution of the model will be based on the penalized normal mixture (Section 6.3) and penalized maximum-likelihood estimation. The basic version of this approach is given by Komárek, Lesaffre, and Hilton (2005) and an extension allowing also the modelling of the dependence of the scale parameter on covariates can be found in Lesaffre, Komárek, and Declerck (2005).

In Section 7.1, we describe the model in detail. In Section 7.2, we show how the model parameters are estimated using the penalized maximum-likelihood method. Section 7.3 describes the inferential procedures. In Section 7.4, computation of predictive survival or hazard functions and predictive densities is discussed. Section 7.5 gives the results of a simulation study that evaluates the performance of the method. The proposed method is applied to the analysis of the WIHS data in Section 7.6 and to the analysis of the Signal Tandmobiel® data in Section 7.7. We finalize the chapter with a discussion in Section 7.8.

7.1 Model

Let Ti, i = 1, ..., N be independent event times observed as intervals ⌊tLi, tUi⌋ and let δi be the corresponding censoring indicators with the same convention as in Section 2.1. Let yLi = log(tLi) and yUi = log(tUi). Further, let xi = (xi,1, ..., xi,m)' be a vector of covariates associated with the ith subject.
The effect of covariates on the event time Ti will be specified using the basic AFT model introduced in Section 3.2, i.e.

\log(T_i) = \beta' x_i + \varepsilon_i, \qquad i = 1,\dots,N,  (7.1)

where β = (β1, ..., βm)' is a vector of unknown regression parameters and ε1, ..., εN are i.i.d. error random variables with density gε(ε).

7.1.1 Model for the error density

The density gε(ε) of the error term will be expressed using the penalized normal mixture (6.15), i.e.

g_\varepsilon(\varepsilon) = \tau^{-1} \sum_{j=-K}^{K} w_j(a)\, \varphi\Bigl(\frac{\varepsilon - \alpha}{\tau} \Bigm| \mu_j, \sigma_0^2\Bigr),  (7.2)

where µ_{−K}, ..., µ_K is a set of fixed equidistant knots, σ0 a fixed basis standard deviation, α an unknown intercept and τ an unknown scale parameter. Finally, w = (w_{−K}, ..., w_K)' are unknown mixture weights and a = (a_{−K}, ..., a_K)' their transformations obtained using the relationship (6.13).

Let ε*1, ..., ε*N be the standardized error terms, i.e. having the density

g_\varepsilon^*(\varepsilon^*) = \sum_{j=-K}^{K} w_j(a)\, \varphi(\varepsilon^* \mid \mu_j, \sigma_0^2).  (7.3)

Keeping the intercept α and the scale τ identifiable requires that the first two moments of the density (7.3) be fixed, i.e.,

E(\varepsilon_i^*) = \sum_{j=-K}^{K} w_j(a)\, \mu_j = 0, \qquad
var(\varepsilon_i^*) = \sum_{j=-K}^{K} w_j(a)\, (\mu_j^2 + \sigma_0^2) = 1.  (7.4)

Since \sum_{j=-K}^{K} w_j(a)\,\sigma_0^2 = \sigma_0^2, the variance constraint can be rewritten in the form \sum_{j=-K}^{K} w_j(a)\,\mu_j^2 = 1 - \sigma_0^2. It is then easily seen that the basis standard deviation σ0 must be smaller than 1 to be able to satisfy this constraint. Finally, the two equality constraints (7.4) can be avoided if two coefficients, say a_{−1} and a_1, are expressed as functions of the remaining non-baseline coefficients, denoted together as a vector d = (a_{−K}, ..., a_{−2}, a_2, ..., a_K)':

a_k(d) = \log\Bigl\{ \omega_{0,k} + \sum_{j \notin \{-1,0,1\}} \omega_{j,k} \exp(a_j) \Bigr\}, \qquad k = -1, 1,  (7.5)

with

\omega_{j,-1} = -\frac{\mu_j - \mu_1}{\mu_{-1} - \mu_1} \cdot \frac{1 - \sigma_0^2 + \mu_1 \mu_j}{1 - \sigma_0^2 + \mu_1 \mu_{-1}}, \qquad
\omega_{j,1} = -\omega_{j,-1} \cdot \frac{\mu_{-1}}{\mu_1} - \frac{\mu_j}{\mu_1}, \qquad j = -K,\dots,-2, 0, 2,\dots,K.
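To make the identifiability constraints concrete, the following R sketch reconstructs a_{−1} and a_1 from the free coefficients d via (7.5) and verifies the moment constraints (7.4); the vector d below is an arbitrary illustrative choice, not a fitted one.

    # Sketch: obtain a_{-1} and a_1 from d via (7.5) and check (7.4).
    delta  <- 0.3
    mu     <- seq(-6, 6, by = delta)                 # knots mu_{-K}, ..., mu_K
    sigma0 <- 0.2
    K      <- (length(mu) - 1) / 2
    idx    <- function(j) j + K + 1                  # maps j in -K..K to 1..(2K+1)

    a <- -abs(mu)                                    # illustrative d; a_0 = 0 is the baseline

    jset <- setdiff(-K:K, c(-1, 1))                  # j = -K, ..., -2, 0, 2, ..., K
    muj  <- mu[idx(jset)]
    m1   <- mu[idx(-1)]; p1 <- mu[idx(1)]            # mu_{-1}, mu_1
    cc   <- 1 - sigma0^2
    om.m1 <- -(muj - p1) / (m1 - p1) * (cc + p1 * muj) / (cc + p1 * m1)  # omega_{j,-1}
    om.p1 <- -om.m1 * m1 / p1 - muj / p1                                 # omega_{j,1}
    a[idx(-1)] <- log(sum(om.m1 * exp(a[idx(jset)])))                    # (7.5), k = -1
    a[idx( 1)] <- log(sum(om.p1 * exp(a[idx(jset)])))                    # (7.5), k =  1

    w <- exp(a) / sum(exp(a))                        # mixture weights via (6.13)
    c(mean = sum(w * mu), var = sum(w * (mu^2 + sigma0^2)))  # should be 0 and 1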
7.1.2 Scale regression

In most regression models, it is conventionally assumed that the covariates influence the mean but not the scale parameter. With hindsight, this is simply one model choice and in many cases it may be untenable. Recently, there has been interest in joint mean-covariance models in the context of longitudinal studies (Pourahmadi, 1999; Pan and MacKenzie, 2003). Our AFT model (7.1) with the error density (7.2) can be generalized in the same direction, yielding the mean-scale penalized AFT model. With this generalization, we allow the scale parameter τ to vary across individuals. Moreover, for the ith individual, the scale parameter τi will depend on a vector of covariates, say z_i = (z_{i,1}, ..., z_{i,ms})', as

\tau_i \equiv \tau(z_i) = \exp(\gamma' z_i),  (7.6)

where γ = (γ1, ..., γ_{ms})' is a vector of unknown parameters. Note that the covariate vector z_i usually contains an intercept term, i.e. z_{i,1} = 1 for all i. In that case, the original AFT model (7.1) with the error density (7.2) and the common scale parameter τ can be written as a mean-scale AFT model with z_i = 1 for all i and τ = exp(γ1).

All parameters in the model (the transformed mixture coefficients d, the regression parameter vector β, the intercept α, and the log-scale log(τ) or the scale-regression parameter vector γ) are estimated by means of a penalized maximum-likelihood method. In the next section, we construct the penalized log-likelihood function, which consists of an ordinary log-likelihood and a difference penalty for the transformed mixture coefficients. The penalized log-likelihood is subsequently maximized to obtain the estimates; see Appendix A for practical aspects of the optimization of the penalized log-likelihood.

7.2 Penalized maximum-likelihood

7.2.1 Penalized log-likelihood

Let θ be the vector of all unknown parameters to be estimated, i.e., θ = (α, β', γ', a_{−K}, ..., a_{−2}, a_2, ..., a_K)'. Let ℓi(θ) = log{Li(θ)}, i = 1, ..., N denote the ordinary log-likelihood contribution of the ith observation based on model (7.1) with error density (7.2), i.e., using the results of Section 4.1.1 and the convention (4.3),

L_i(\theta) = \int_{t_i^L}^{t_i^U} t^{-1}\, \tau_i^{-1}\, g_\varepsilon^*\Bigl(\frac{\log(t) - \alpha - \beta' x_i}{\tau_i}\Bigr)\, dt
\;\propto\; \int_{y_i^L}^{y_i^U} \tau_i^{-1}\, g_\varepsilon^*\Bigl(\frac{y - \alpha - \beta' x_i}{\tau_i}\Bigr)\, dy
\;=\; \int_{y_i^L}^{y_i^U} \tau_i^{-1} \sum_{j=-K}^{K} w_j(a)\, \varphi\Bigl(\frac{y - \alpha - \beta' x_i}{\tau_i} \Bigm| \mu_j, \sigma_0^2\Bigr)\, dy.

The proportionality constant is equal to tLi = tUi for exactly observed event times (δi = 1) and equal to 1 for all remaining observations (δi = 0, 2, 3). For the purpose of maximum-likelihood based estimation this constant can be ignored, so for notational convenience we will assume that it equals one. Finally, let ℓ(θ) = \sum_{i=1}^{N} ℓ_i(θ) be the ordinary log-likelihood of the whole data set.

As usual with censored data, the likelihood evaluation involves integration. With our model, however, this does not cause any considerable difficulties, irrespective of the type of censoring (left, right, interval). Indeed, all integrals involved in the computation of the likelihood are normal cumulative distribution functions, which can be evaluated easily and efficiently.

To construct the penalized log-likelihood function ℓP(θ; λ), we subtract a penalty term q(a; λ) based on the transformed mixture coefficients a from ℓ(θ), i.e.,

\ell_P(\theta; \lambda) = \ell(\theta) - q(a; \lambda),  (7.7)

where λ is a fixed tuning parameter that controls the smoothness of the fitted error distribution and inhibits identifiability problems due to overparametrization. For a given (reasonable) λ, Eilers and Marx (1996) proposed to base the penalty on squared higher-order finite differences of the coefficients of adjacent B-splines, and they used second-order differences in their examples. We base our penalty on squared finite differences of order s of the transformed coefficients of adjacent mixture components:

q(a; \lambda) = \frac{\lambda}{2} \sum_{j=-K+s}^{K} \bigl(\Delta^s a_j\bigr)^2 = \frac{\lambda}{2}\, a' P_s' P_s\, a,  (7.8)

where Δ¹aj = aj − a_{j−1}, Δˢaj = Δˢ⁻¹aj − Δˢ⁻¹a_{j−1}, s = 1, ..., and Ps is a (2K + 1 − s) × (2K + 1) difference operator matrix. According to our experience, s = 2 or s = 3 is sufficient to obtain a smooth estimate of the density. However, in our context the choice s = 3 has another interesting justification, explained in Section 7.2.2, and will be used in all applications presented in this thesis.
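The penalty (7.8) is easy to make explicit; a short R sketch constructing the difference operator matrix Ps and evaluating q(a; λ) for an illustrative coefficient vector:

    # Sketch: the difference operator matrix P_s and the penalty q(a; lambda) of (7.8).
    penalty <- function(a, lambda, s = 3) {
      Ps <- diff(diag(length(a)), differences = s)   # (2K+1-s) x (2K+1) matrix
      as.numeric(lambda / 2 * crossprod(Ps %*% a))   # (lambda/2) a' Ps' Ps a
    }

    K <- 20
    a <- -abs(seq(-6, 6, length.out = 2 * K + 1))    # illustrative coefficients
    penalty(a, lambda = exp(2))                      # larger lambda => heavier smoothing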
7.2.2 Remarks on the penalty function

There are two reasons why we penalize the transformed mixture coefficients a instead of the original coefficients w, and why we prefer the penalty of order s = 3.

First, the penalty based on a distinguishes between areas of the density where there are few data points (i.e., where the coefficients w are close to zero) and areas where there are many data points (i.e., where the coefficients w are well above zero); the penalty based on w cannot distinguish between these areas. For example, for w̆ = (0.001, 0.002, 0.001, 0.996)' and w̃ = (0.201, 0.202, 0.201, 0.396)' we have ă = (−6.904, −6.211, −6.904, 0)' and ã = (−0.678, −0.673, −0.678, 0)', and while (Δ²w̆3)² = 0.000004 = (Δ²w̃3)², we obtain (Δ²ă3)² = 1.92 ≫ 0.000099 = (Δ²ã3)². Indeed, in areas with a sufficient amount of data, the estimated shape of the error distribution is mostly driven by the data themselves, whereas in data-poor areas the shape of the fitted error distribution is inter- or extrapolated from the data-rich areas according to the flexibility allowed by the penalty term.

Second, the penalty of the third order (s = 3) based on the transformed mixture coefficients a has an interesting property which can serve as a basis for an empirical test of normality (see Section 7.2.3). A basis for this property is given by the following proposition, which is proved in Appendix A.

Proposition 7.1. Let for K ∈ N

\mu^K = \Bigl\{\mu^K_j = \frac{j}{K}, \; j = -K^2, \dots, K^2\Bigr\} = \Bigl\{-K, \, -K + \frac{1}{K}, \, \dots, \, -\frac{1}{K}, \, 0, \, \frac{1}{K}, \, \dots, \, K - \frac{1}{K}, \, K\Bigr\}

be a sequence of knots. Let for a ∈ R^{2K²+1} a discrete distribution on µK be given by Pr(µK = µK_j | a) = exp(a_j). Let a^K minimize \sum_{j=-K^2+3}^{K^2} (\Delta^3 a_j)^2 under the constraints

\sum_{j=-K^2}^{K^2} \Pr(\mu^K = \mu^K_j \mid a) = 1, \qquad E(\mu^K \mid a) = 0, \qquad var(\mu^K \mid a) = 1 - \sigma_0^2  (7.9)

for σ0 ∈ (0, 1) fixed. Let

g_K(y) = \sum_{j=-K^2}^{K^2} \Pr(\mu^K = \mu^K_j \mid a^K)\, \varphi(y \mid \mu^K_j, \sigma_0^2), \qquad y \in \mathbb{R}.

Then for all y ∈ R, lim_{K→∞} g_K(y) = φ(y).

The empirical normality test is obtained using the following consideration. Suppose that for fixed K we have 2K² + 1 knots −K, −K + 1/K, ..., −1/K, 0, 1/K, ..., K − 1/K, K. Suppose further that we maximize the penalized log-likelihood (7.7) for λ → ∞. This is equivalent (in the limit) to minimizing the penalty term (7.8) under the constraints (7.4). For fixed K, let g*_{ε,K} be the fitted standardized error density arising from the above-mentioned optimization problem. Using Proposition 7.1 with w_j(a) = Pr(µK = µK_j | a), j = −K², ..., K², we get that lim_{K→∞} g*_{ε,K}(ε*) = φ(ε*) for all ε* ∈ R. In practice, the set of knots and the basis standard deviation recommended in Sections 6.3.1 and 6.3.3 (e.g., knots from −6 to 6 by 0.3 and σ0 = 0.2) already give rise to a fitted standardized error density g*_{ε,K} practically indistinguishable from the normal density φ when only the penalty term is minimized. This property does not hold for orders s ≠ 3 of the penalty or when the penalty is based on the original mixture coefficients w.
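The numerical contrast between penalizing w and penalizing a in the example above can be reproduced directly; a quick R check (values rounded as in the text):

    # Sketch: squared second differences of the weights w versus the transformed
    # coefficients a for the two weight vectors used in the example above.
    d2 <- function(x) diff(x, differences = 2)^2     # squared second differences

    w.breve <- c(0.001, 0.002, 0.001, 0.996)
    w.tilde <- c(0.201, 0.202, 0.201, 0.396)
    a.breve <- log(w.breve / w.breve[4])             # last component is the baseline
    a.tilde <- log(w.tilde / w.tilde[4])

    c(w = d2(w.breve)[1], a = d2(a.breve)[1])        # 4e-06 and about 1.92
    c(w = d2(w.tilde)[1], a = d2(a.tilde)[1])        # 4e-06 and about 1e-04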
7.2.3 Selecting the smoothing parameter

In the area of density estimation, methods for selecting the smoothing parameter λ that rely on cross-validation are often used. The standard modified maximum-likelihood cross-validation score that we attempt to minimize is

CV(\lambda) = -\sum_{i=1}^{N} \ell_i\bigl(\hat{\theta}^{(-i)}\bigr),

where θ̂ is the penalized maximum-likelihood estimate (MLE) of θ and θ̂^{(−i)} the penalized MLE based on the sample excluding the ith observation. However, computation and optimization of the cross-validation score is extremely computationally intensive in our case. In a similar context, O'Sullivan (1988) suggested a one-step Newton-Raphson approximation combined with a first-order Taylor series approximation. Applying his method in our setting results in an approximate cross-validation score given by

CV(\lambda) = -\Bigl\{\sum_{i=1}^{N} \ell_i(\hat{\theta}) - \mathrm{trace}\bigl(\hat{H}^{-1}\hat{I}\bigr)\Bigr\},  (7.10)

where

\hat{H} = -\frac{\partial^2 \ell_P(\hat{\theta})}{\partial\theta\,\partial\theta'}, \qquad
\hat{I} = -\frac{\partial^2 \ell(\hat{\theta})}{\partial\theta\,\partial\theta'}.

We denote trace(Ĥ⁻¹Î) by df(λ) and call it the effective degrees of freedom or the effective dimension of the model, since it essentially plays the same role as the effective dimension of a linear smoother (Hastie and Tibshirani, 1990). Depending on the chosen order s of the differences in the penalty, the degrees of freedom decrease in λ from dim(β) + 2 + (2K + 1 − 3) for λ = 0 (i.e., the ordinary log-likelihood) to dim(β) + 2 + (s − 3) for λ → ∞ and s ≥ 3 (i.e., the penalized log-likelihood). For example, when K = 20, µ_{j+1} − µ_j = 0.3, σ0 = 0.2 and s = 3, penalized likelihood estimation as λ → ∞ depends effectively on 2K + 1 − s = 38 fewer parameters than does ordinary likelihood estimation.

Further, minimizing the expression (7.10) is essentially the same as maximizing Akaike's information criterion AIC(λ) = ℓ(θ̂) − df(λ) (Akaike, 1974). This can be a valuable way to compare different models and assess the importance of covariate contributions (see an example in Section 7.6). In the accompanying R programs (see Appendix C), a grid search using user-defined values λ*1, ..., λ*L (in our applications we used the values λ*1 = e^2, λ*2 = e^1, ..., λ*L = e^{−9}) is used to find the optimal AIC. Since the log-likelihood is of the order O(N), using a factor of Nλ*l/2 in the penalty term (7.8) instead of λ/2 allows one to use approximately the same grid for datasets of different sizes while also maintaining the proportional importance of the penalty term in the penalized log-likelihood at the same level.

The result immediately following Proposition 7.1 further implies that, with a sufficiently dense set of knots, we can check the normality of the error term. When the optimal value of the tuning parameter λ becomes large, the error density of the model can be considered to be normal.

Linear mixed model interpretation

Recently, Wand (2003) and Kauermann (2005a) pointed out the strong link between penalized maximum-likelihood estimation and linear mixed models, which can be used for the selection of the smoothing parameter. The idea, which also underlies the pseudo-variance estimate in Section 7.3.1 and the full Bayesian developments in Chapters 9 and 10, is the following. The coefficient vector a is considered to be a vector of random effects having the normal distribution a ~ N(0, λ⁻¹(P's Ps)⁻), where (P's Ps)⁻ is a generalized inverse of P's Ps. The smoothing parameter λ then determines (together with the fixed matrix Ps) the variability of the "random effects" a. The penalized likelihood (7.7) can then be interpreted as the likelihood of the mixed effects model with normal random effects a. The optimal λ value is obtained as the maximum-likelihood or, more frequently, as the restricted maximum-likelihood estimate of the inverse variance component in the so constructed mixed effects model. See, e.g., Cai and Betensky (2003) or Kauermann (2005b) for practical applications of this approach.
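Once a fit is available, df(λ) and AIC(λ) of Section 7.2.3 reduce to a trace and a subtraction; a minimal sketch in which Hhat, Ihat and loglik are hypothetical placeholders for the curvature matrices of (7.10) and the maximized log-likelihood, not output of an actual fitting routine:

    # Sketch: effective degrees of freedom and AIC from the curvature matrices
    # of (7.10); Hhat, Ihat and loglik are hypothetical placeholders here.
    p      <- 5
    Hhat   <- diag(p) * 2                    # -d2 l_P / d theta d theta' at the optimum
    Ihat   <- diag(p)                        # -d2 l   / d theta d theta' at the optimum
    loglik <- -123.4                         # l(theta-hat)

    df  <- sum(diag(solve(Hhat, Ihat)))      # trace(H^{-1} I)
    AIC <- loglik - df                       # AIC(lambda) = l(theta-hat) - df(lambda)
    c(df = df, AIC = AIC)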
7.3 Inference based on the maximum likelihood penalized AFT model

With the standard maximum-likelihood method, the score vector (the first derivative of the log-likelihood) has zero mean when its expectation is computed under the true parameter vector. Under mild regularity conditions, it is then possible to prove that the MLE is a consistent, asymptotically unbiased estimator. However, introduction of the penalty term with λ > 0 leads to a penalized score vector (the first derivative of the penalized log-likelihood) whose mean differs from zero when its expectation is computed under the true parameter vector. Consequently, the penalized MLE θ̂ is a biased estimator and its standard errors may not be very informative when that bias is high. Nevertheless, there are two possibilities for drawing accurate inferences based on the penalized MLE.

7.3.1 Pseudo-variance

Wahba (1983) described a pseudo-Bayesian technique for generating confidence bands around the cross-validated smoothing spline. O'Sullivan (1988) used this technique in the penalized ML framework and his approach can be adopted here as well. Basically, the penalized log-likelihood ℓP is viewed as a "posterior" log-density for the parameter θ and the penalty term as a "prior" negative log-density of that parameter. Then, the second-order Taylor series expansion of the "posterior" log-density around its mode θ̂ leads to

\ell_P(\theta) \approx \ell_P(\hat{\theta}) - \frac{1}{2}(\theta - \hat{\theta})' \hat{H} (\theta - \hat{\theta}).

Finally, the Gaussian approximation gives a "posterior" normal distribution for θ with covariance matrix

\widehat{var}_P(\hat{\theta}) = \hat{H}^{-1}.  (7.11)

We call this estimate of the variance of the penalized MLE θ̂ the "pseudo-variance estimate."

7.3.2 Asymptotic variance

More formal inference is possible under the following assumptions. Firstly, we assume independent noninformative censoring. Secondly, as the sample size N increases, the knots (both number and positions) and the basis standard deviation remain fixed. Let θT be the true parameter value of θ, assuming it exists. To get asymptotically unbiased estimates we have to either keep the value of the smoothing parameter λ constant as N → ∞ or let it increase at a rate lower than N (i.e., λ = λN and lim_{N→∞} λN/N = 0). Under these conditions, the penalty part of the penalized log-likelihood reduces its importance relative to the log-likelihood part as N → ∞ (i.e., as the sample size N increases, the smoothness of the fitted error distribution is determined to a greater extent by the data and to a lesser extent by the penalty). Then, in combination with standard maximum-likelihood arguments, for arbitrary ξ > 0 the penalized MLE θ̂ satisfies Pr_{θT}(|θ̂ − θT| < ξ) → 1. Using the same arguments as in Gray (1992), one can further show that √N(θ̂ − θT) is asymptotically normal with mean 0 and covariance matrix lim_{N→∞}(N W), where the matrix W can be consistently estimated by

\widehat{var}_A(\hat{\theta}) = \hat{H}^{-1} \hat{I}\, \hat{H}^{-1},  (7.12)

which we call the "asymptotic variance estimate." As pointed out by Gray (1992), the asymptotic distribution of θ̂ remains the same if the smoothing parameters λN are replaced by estimates satisfying λ̂N/λN → 1 in probability.
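A small R sketch contrasting the two variance estimates (7.11) and (7.12); the curvature matrices below are placeholders standing in for the quantities that would come from the fitted model:

    # Sketch: pseudo-variance (7.11) versus asymptotic (sandwich) variance (7.12)
    # from placeholder curvature matrices; in practice these come from the fit.
    p    <- 3
    Hhat <- crossprod(matrix(rnorm(p * p), p)) + diag(p)   # -Hessian of l_P (pos. def.)
    Ihat <- crossprod(matrix(rnorm(p * p), p))             # -Hessian of l

    varP <- solve(Hhat)                                    # pseudo-variance estimate
    varA <- solve(Hhat) %*% Ihat %*% solve(Hhat)           # asymptotic estimate

    # Wald-type 95% confidence interval for, say, the first parameter
    theta1 <- 0.5                                          # placeholder estimate
    theta1 + c(-1, 1) * qnorm(0.975) * sqrt(varP[1, 1])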
7.3.3 The pseudo-variance versus the asymptotic variance

In various applications, the pseudo-variance estimate (7.11) has been shown to be useful. When smoothing a spline curve g(t), Wahba (1983) showed that it yielded pointwise confidence intervals ĝ(t) ± z √(var̂P{ĝ(t)}), where z is a quantile of the normal distribution, with good frequentist coverage properties. Verweij and Van Houwelingen (1994) used it in the context of penalized likelihood estimation in Cox regression; they called the square roots of its diagonal elements "pseudo-standard errors." Joly, Commenges, and Letenneur (1998) exploited this technique to obtain confidence bands for the hazard function smoothed using M-splines.

In contrast, for the asymptotic variance estimate (7.12) there is no guarantee that for finite samples its middle matrix Î is positive semidefinite. Based on our experience, this problem is not rare. Finally, according to our simulations, the pseudo-variance estimate (7.11) yields confidence intervals β̂ ± z √(var̂P(β̂)) for the regression parameters with better coverage properties than the corresponding confidence intervals based on the asymptotic estimate (7.12).

7.3.4 Remarks

We have assumed in this section that the true parameter vector θT exists. This does not have to be true. In particular, true a coefficients may fail to exist when the true error distribution is not a mixture of the normal densities determined by the choice of knots and the standard deviation σ0. However, if the distance between two consecutive knots is small enough, the penalized mixture of normal densities can approximate every continuous distribution sufficiently well, see Dalal and Hall (1983) or O'Hagan (1994, Sec. 6.47), so that the assumption of the existence of the true parameter vector θT is not restrictive at all. Loosely speaking, combining this with the asymptotic arguments given in Section 7.3.2 implies that, by increasing the sample size, the estimated coefficients a will yield an estimated density which is close to the true error density.

7.4 Predictive survival and hazard curves and predictive densities

The penalized AFT model actually has a parametric nature: once the weights w_{−K}, ..., w_K in (7.2) are known, it is easy to compute predictive survival curves or predictive hazards or densities for a given combination of covariates, say x_new and z_new. The predictive survival function is given by

S(t \mid x_{new}, z_{new}) = 1 - \sum_{j=-K}^{K} w_j(a)\, \Phi\Bigl(\frac{\log(t) - \alpha - \beta' x_{new}}{\tau(z_{new})} \Bigm| \mu_j, \sigma_0^2\Bigr).  (7.13)

The predictive density is computed as

p(t \mid x_{new}, z_{new}) = \frac{t^{-1}}{\tau(z_{new})} \sum_{j=-K}^{K} w_j(a)\, \varphi\Bigl(\frac{\log(t) - \alpha - \beta' x_{new}}{\tau(z_{new})} \Bigm| \mu_j, \sigma_0^2\Bigr),  (7.14)

and finally the predictive hazard is obtained from the above quantities as

\hbar(t \mid x_{new}, z_{new}) = \frac{p(t \mid x_{new}, z_{new})}{S(t \mid x_{new}, z_{new})}.  (7.15)

In practice, all unknown parameters are replaced by their penalized maximum-likelihood estimates.
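The predictive quantities (7.13)–(7.15) are straightforward to evaluate; the following R sketch does so for one covariate combination, with all parameter values (knots, weights, estimates, covariates) being illustrative placeholders rather than fitted values:

    # Sketch: predictive survival, density and hazard (7.13)-(7.15) for one
    # covariate combination; all parameter values below are illustrative only.
    delta  <- 0.3
    mu     <- seq(-6, 6, by = delta); sigma0 <- 0.2
    w      <- exp(-mu^2 / (2 * (1 - sigma0^2))); w <- w / sum(w)  # placeholder weights
    alpha  <- 1.6; beta <- c(-0.8, 0.4); gamma <- log(1.4)
    x.new  <- c(1, 8.5); z.new <- 1                               # placeholder covariates
    tau    <- exp(sum(gamma * z.new))                             # scale regression (7.6)

    lp <- alpha + sum(beta * x.new)
    Spred <- function(t) 1 - sapply(t, function(tt)
               sum(w * pnorm((log(tt) - lp) / tau, mean = mu, sd = sigma0)))
    ppred <- function(t) sapply(t, function(tt)
               sum(w * dnorm((log(tt) - lp) / tau, mean = mu, sd = sigma0)) / (tt * tau))
    hpred <- function(t) ppred(t) / Spred(t)                      # hazard (7.15)

    tgrid <- seq(1, 80, by = 1)
    cbind(t = tgrid, S = Spred(tgrid), h = hpred(tgrid))[1:5, ]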
7.5 Simulation study

To see how the proposed method performs, we carried out a simulation study. 'True' uncensored data were generated according to model (7.1) with error density (7.2). Two covariates, i.e. x_i = (x_{i,1}, x_{i,2})', were included in the model and the values of the parameters were the following: α = 1.6, τ = 1.4 and β = (−0.8, 0.4)'. The covariate x_{i,1} was binary, taking the value 1 with probability 0.4, and the covariate x_{i,2} was generated according to the extreme value distribution of a minimum, with location 8.5 and scale 1. The model attempts to mimic an AFT model used for the dataset presented in Section 7.6, with x_{i,1} playing the role of the covariate lesion and x_{i,2} being distributed as log2(1 + CD4 count). The time to the event T is expressed in months. The standardized error term ε* was generated from a standard normal distribution N(0, 1), from a standardized extreme value distribution, and from a mixture of two normal distributions 0.4 N(−1.4, 0.8²) + 0.6 N(0.93, 0.8²). Samples of sizes 50, 100, 300, and 600 were generated. Each simulation involved 100 replications.

For each uncensored dataset we created four censored datasets that were then used to compute the estimates: a dataset with (1) approximately 20% right-censored and 80% uncensored observations (light RC); (2) approximately 20% right- and 80% interval-censored observations (light R+IC); (3) approximately 60% right-censored and 40% uncensored observations (heavy RC); (4) approximately 60% right- and 40% interval-censored observations (heavy R+IC). The censoring was created by simulating consecutive 'visit times' for each subject in the dataset. Times of the first 'visits' were drawn from a N(7, 1) distribution. Further, times between consecutive 'visits' were simulated from N(6, 0.5²). This approach reflects the idea that subjects in our Oral Substudy were seen for the first time about 7 months after the onset of the parent study and then approximately every 6 months for several years. At each visit, subjects were withdrawn (censored) according to a prespecified percentage (between 0.4% and 0.7% for light censoring and between 4.0% and 5.0% for heavy censoring), creating right-censored observations provided that the uncensored event time Ti was greater than the visit time at which the subject was withdrawn. To obtain interval-censored observations, we took the 'visit' interval that contained the uncensored event time Ti.

For comparison, estimates for each dataset were computed using our smoothed procedure and using two parametric models: an AFT model on the log scale with a correctly specified error distribution (normal, extreme value or mixture of normals, respectively) and a log-normal AFT model. For the smoothing procedure, the third-order penalty, equidistant knots with a distance of 0.3 between consecutive knots, and a basis standard deviation of 0.2 were used.

Selected results of the simulation are given in Appendix B, Section B.1. Namely, Tables B.1 – B.6 show the results for the regression parameters. It is seen that, in most cases, our smoothed procedure performs better than the incorrectly specified log-normal AFT model and often only slightly worse than the correctly specified parametric AFT model. Additionally, when our smoothing approach is used, the error distribution is reproduced rather satisfactorily, as can be seen in Figures B.1 – B.3. This property is quite important, especially when the estimated model is to be used for prediction purposes. Further, it is seen that even for small samples the performance of our smoothing procedure is quite similar to the performance of a parametric AFT model with a correctly specified error distribution.
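The visit-time censoring mechanism described above is easy to emulate; a simplified R sketch for a single subject (the withdrawal probability is one illustrative value, and this is not the exact code used for the simulation study):

    # Sketch: turn one uncensored event time into an interval- or right-censored
    # observation via simulated visit times, as in the simulation study above.
    censor.one <- function(t.true, p.withdraw = 0.05, max.visits = 30) {
      gaps   <- c(rnorm(1, 7, 1), rnorm(max.visits - 1, 6, 0.5))
      visits <- cumsum(gaps)                               # consecutive visit times
      drop   <- which(runif(max.visits) < p.withdraw)[1]   # first withdrawal visit, if any
      if (!is.na(drop) && visits[drop] < t.true)
        return(c(tL = visits[drop], tU = Inf))             # right-censored at withdrawal
      k <- findInterval(t.true, c(0, visits))              # visit interval containing t.true
      c(tL = c(0, visits)[k], tU = c(visits, Inf)[k])      # interval-censored observation
    }

    set.seed(1)
    censor.one(t.true = 25)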
7.6 Example: WIHS data – interval censoring

In Section 1.2, we introduced the study comprising the cohort of seropositive women and the cohort of seronegative women with an increased risk of HIV infection. In this section, we concentrate on the data set collected in the framework of the Oral Substudy, involving 224 seropositive AIDS-free (at baseline) women. We explore how the distribution of the time between the baseline measurement and the onset of an AIDS-related illness can be explained using the classical predictors, which are the number of copies of the HIV RNA virus and the count of CD4 T-cells per ml of blood. Additionally, we examine whether the presence of one of the three lesion markers, oral candidiasis, hairy leukoplakia and angular cheilitis, is useful, possibly together with one or both laboratory predictors, in describing the distribution of the residual time to onset of AIDS.

For the purpose of modelling, the three lesion markers were summarized in one binary covariate, lesion, equal to one if at least one of the above-mentioned three lesion markers was present. Further, the laboratory predictors entered the models in a transformed way, classically used in HIV research. Namely, the covariate lvload is equal to log10(1 + viral load) and the covariate lcd4 equals log2(1 + CD4 count). All three covariates are moderately to strongly associated with one another since, as AIDS progresses, viral load increases, CD4 count falls, and oral lesions occur more frequently. In our sample, for women with lesion = 0 and 1, respectively, the median lvload was 3.60 and 4.23 (Mann-Whitney p-value, 0.001). There was also a moderate negative correlation of −0.46 between lcd4 and lvload. These associations have to be taken into account when interpreting the results.

As a response, we used the time in months between the baseline visit, defined as the first visit at which the lesion markers were collected by dental professionals, and the onset of an AIDS-related illness. As mentioned in Section 1.3, the response time is right-censored for 158 women and interval-censored for 66 women, with the average length of the observed interval equal to 7 months.

7.6.1 Fitted models

To obtain the results shown below, we used a sequence of 41 equidistant knots from −6 to 6 with a distance of 0.3 between each pair. The basis standard deviation was 0.2 and the third-order difference was used in the penalty. Different models were compared using Akaike's information criterion and claims concerning the significance of the parameters were based on Wald tests using the pseudo-variance estimate (7.11). A summary of the fitted models is shown in Tables 7.1 and 7.2.

Table 7.1: WIHS Data. Akaike's information criterion, degrees of freedom, and the optimal log(λ/N) for the fitted models.

  Model                           AIC       df    log(λ/N)
  (1) lesion                    −262.39     3.2       2
  (2) lvload                    −256.16     3.4       2
  (3) lcd4                      −256.94     3.4       2
  (4) lesion + lvload           −255.63     4.4       2
  (5) lesion + lcd4             −253.19     8.9      −7
  (6) lvload + lcd4             −253.45     8.4      −6
  (7) lesion + lvload + lcd4    −250.01    10.0      −7

If used alone (model (1)), the effect of lesion on the time to onset of AIDS is statistically significant (p = 0.018) and the estimated time for women with lesion = 1 is exp(−0.87) ≈ 0.42 times that for women with lesion = 0. According to the AIC values for models (2) and (3), the transformed CD4 count and viral load are equally good predictors of the time to onset of AIDS. Addition of the lesion marker (models (4) and (5)) improves the model with lcd4 considerably but improves the model with lvload only slightly. Finally, some additional improvement is gained by considering the model with all three predictors (model (7)).
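The quantities quoted above follow directly from the estimate and pseudo-standard error reported for model (1) in Table 7.2 (−0.87 and 0.37); a quick R check:

    # Sketch: acceleration factor and Wald test for lesion in model (1),
    # using the estimate and pseudo-standard error reported in Table 7.2.
    beta.lesion <- -0.87
    se.lesion   <-  0.37
    exp(beta.lesion)                             # about 0.42: ratio of times, lesion = 1 vs 0
    2 * pnorm(-abs(beta.lesion / se.lesion))     # Wald p-value, about 0.018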
7.6.2 Predictive survival and hazard curves, predictive densities

Figure 7.1 shows predictive survival and hazard curves and predictive densities for women with lesion = 0 and lesion = 1, based on the simplest model, lesion, and on the most complex model considered, lesion + lvload + lcd4. The predictive survival curves based on the model lesion are further overlaid with the nonparametric estimate of Turnbull (1976) in each group. The two estimates are quite close to each other, illustrating the semiparametric nature of our approach. However, our procedure gives smooth estimates of the survival curves and moreover enables quantification of the difference in survival between the two groups. Notice further that, because the hazard is obtained as the ratio of the density and the survival function, and the survival function decreases relatively slowly from one, only a slight difference is observed between the predictive density and the hazard.

Further, we point out that the predictive densities for models in which lcd4 was not involved are very close to the log-normal density. This is not surprising since the optimal tuning parameter λ for these models was equal to 224 · exp(2), essentially a value of infinity in this practical situation, implying that the fitted error distributions are close to the normal distribution, as discussed in Section 7.2.3. On the other hand, models where lcd4 was used in combination with other covariates gave much lower optimal tuning parameters λ, implying non-normal error densities. This is seen on the right-hand side of Figure 7.1. The phenomenon could indicate the presence of a risk-group mixture in the data or the absence of another important predictor. Indeed, a factor that could play an important role is antiretroviral therapy, which might have been used by some women in our sample. However, this factor requires modelling time-dependent covariates, which cannot be done with our model.

Figure 7.1: WIHS Data. Predicted survival curves, hazard curves and densities for women with lesion = 1 (dotted-dashed line) vs. women with lesion = 0 (solid line) based on models lesion (left part) and lesion + lvload + lcd4 (right part). Predictive curves for the latter model control for the median value of lvload = 3.875 and the median value of lcd4 = 8.735. Predictive survival curves for model lesion are further compared to the nonparametric estimate of Turnbull (1976) in each group.

Table 7.2: WIHS Data. Estimates of the regression parameters (standard error; p-value) for the fitted models.

  Model                         lesion                  lvload                   lcd4
  (1) lesion                    −0.87 (0.37; 0.018)
  (2) lvload                                            −0.76 (0.19; < 0.001)
  (3) lcd4                                                                       0.44 (0.11; < 0.001)
  (4) lesion + lvload           −0.62 (0.36; 0.080)     −0.70 (0.19; < 0.001)
  (5) lesion + lcd4             −0.78 (0.26; 0.003)                              0.39 (0.07; < 0.001)
  (6) lvload + lcd4                                     −0.39 (0.14; 0.004)      0.38 (0.06; < 0.001)
  (7) lesion + lvload + lcd4    −0.60 (0.23; 0.008)     −0.30 (0.11; 0.005)      0.39 (0.05; < 0.001)

7.6.3 Conclusions

In conclusion, the time to AIDS onset in this study population is notably shorter in women with oral lesions. Further, this marker improves the prediction of that time based on any of the classical indicators (CD4 count and viral load). When interpreting these findings, one must bear in mind that only a limited number of WIHS women opted to participate in the Oral Substudy, the source of the dental data. Thus they may differ in unknown ways from the overall set. Nonetheless, our findings are consistent with those of others who have evaluated oral lesions as predictors of AIDS onset, and they illustrate the use of our method in the area of AIDS research.

Our method restricts us to the analysis of baseline covariates. Although this is a very widely applicable special case, extension of the method to accommodate time-dependent covariates would allow more complex relationships between outcomes and covariates.
7.7 Example: Signal Tandmobiel® study – interval-censored data

In paediatric dentistry and orthodontics, adequate knowledge of the timing and patterns of tooth emergence is useful for diagnosis and treatment planning. This motivates the example in this section, where we fit the distribution of the emergence times of the permanent maxillary right premolars (teeth 14 and 15 in Figure 1.1) based on the data from the Signal Tandmobiel® study introduced in Section 1.1.

It is anticipated that the distribution of the emergence times of a particular tooth is different for boys and girls. See Figure 5.1 and Table 5.1, where the emergence distributions for boys and girls are compared for tooth 44. However, a similar phenomenon is observed also for other teeth, 14 and 15 included. For that reason, we used the covariate gender (0 for boys and 1 for girls) in our models. Additionally, it was of dental interest to check whether the distribution of the emergence time of a permanent tooth changes when the primary predecessor of the permanent tooth experienced caries or not. For this, we included a binarised dmf score pertaining to the predecessor as a covariate: dmf = 1 if the primary predecessor of that permanent tooth was recorded as decayed, missing due to caries, or filled, and 0 otherwise.

As response, for a particular child, we consider the age of emergence of a particular permanent tooth (14 or 15), recorded in years. Due to the design of the study (annual planned examinations), the response variable is interval-censored, with intervals of length approximately equal to 1 year. It should be stressed that in this section the two teeth will be analyzed separately, i.e. ignoring their possible correlation. In Section 7.8, we indicate how the correlation between teeth can be incorporated in the analysis. For a better fit, we shifted the time origin of the AFT model to 5 years of age, which is clinically the minimal emergence time for permanent teeth (see, e.g., Ekstrand, Christiansen, and Christiansen, 2003).
Namely, we replaced Ti by Ti − 5 in the AFT model specification (7.1). As in Section 7.6, we used a sequence of 41 equidistant knots from −6 to 6 with a distance of 0.3 between each pair. The basis standard deviation was 0.2 and the third-order difference was used in the penalty.

7.7.1 Fitted models

We fitted four penalized AFT models with a constant scale parameter and two mean-scale penalized AFT models. The fitted models are described in Table 7.3 and the AICs for these models are given in Table 7.4.

Table 7.3: Signal Tandmobiel® study. Description of the fitted models.

  Models with constant scale
    gender                              x = (gender)
    dmf                                 x = (dmf)
    gender + dmf                        x = (gender, dmf)'
    gender ∗ dmf                        x = (gender, dmf, gender × dmf)'
  Mean-scale models
    gender ∗ dmf/scale(dmf)             x = (gender, dmf, gender × dmf)',  z = (dmf)
    gender ∗ dmf/scale(gender ∗ dmf)    x = (gender, dmf, gender × dmf)',  z = (gender, dmf, gender × dmf)'

Table 7.4: Signal Tandmobiel® study. Akaike's information criteria for the different models.

  Model                               Tooth 14      Tooth 15
  gender                             −5 532.59     −4 551.57
  dmf                                −5 538.03     −4 549.93
  gender + dmf                       −5 494.51     −4 526.85
  gender ∗ dmf                       −5 491.47     −4 522.76
  gender ∗ dmf/scale(dmf)            −5 468.61     −4 506.66
  gender ∗ dmf/scale(gender ∗ dmf)   −5 467.67     −4 507.59

The model selection was based on the AIC. Firstly, among the constant-scale models, the model with the interaction term gender ∗ dmf seems to fit the data best and the interaction term cannot be omitted. Secondly, the models where the scale parameter τ depends on covariates give a considerably better fit. For tooth 15, the model with only dmf included in the scale covariate vector leads to the best AIC. For tooth 14, the model with the scale depending only on dmf can be improved by the inclusion of gender and its interaction with dmf; however, the improvement is minor. These findings lead us to conclude that the model that describes the data satisfactorily well while being kept as simple as possible is the model gender ∗ dmf/scale(dmf). The estimates for this model are given in Table 7.5. It is seen that dmf = 1 accelerates the emergence for both genders and also increases the variability of the emergence distribution.

Table 7.5: Signal Tandmobiel® study. Estimates (standard errors) for the models gender ∗ dmf/scale(dmf).

  Parameter              Tooth 14             Tooth 15
  α                       1.7734 (0.0073)      1.9143 (0.0091)
  β(gender)              −0.0931 (0.0099)     −0.0803 (0.0110)
  β(dmf)                 −0.0990 (0.0116)     −0.0773 (0.0125)
  β(gender ∗ dmf)         0.0401 (0.0166)      0.0473 (0.0172)
  γ1                     −1.5613 (0.0219)     −1.6121 (0.0351)
  γ(dmf)                  0.2144 (0.0307)      0.2415 (0.0399)

7.7.2 Predictive emergence and hazard curves

For our data, predictive emergence curves (cumulative distribution functions), which in this case are preferred to survival curves, based on the model gender ∗ dmf/scale(dmf) are shown in Figure 7.2, and predictive hazards in Figure 7.3. Further, Figure 7.2 also shows the non-parametric estimates of Turnbull (1976), computed separately for each combination of covariates. It is seen that the model-based emergence curves agree with the non-parametric estimates, indicating the goodness-of-fit of our model. Further, the figures show that the difference between children with dmf = 0 and dmf = 1 is larger for boys than for girls and that the emergence process for boys is indeed delayed compared to girls. The non-decreasing predictive hazard curves reflect the nature of the problem at hand. Indeed, it can be expected that, provided the tooth of a child has not emerged yet, the probability that the tooth will emerge increases with age.

Figure 7.2: Signal Tandmobiel® study. Predictive emergence curves (proportion emerged versus age in years; panels: Tooth 14, Girls; Tooth 14, Boys; Tooth 15, Girls; Tooth 15, Boys): solid lines for curves based on the model gender ∗ dmf/scale(dmf) (on each plot: left curve for dmf = 1, right curve for dmf = 0), dashed line for the non-parametric estimate of Turnbull.
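To illustrate how the estimates in Table 7.5 translate into interpretable quantities, a small R sketch for tooth 14 (recall that the AFT model here acts on Ti − 5, i.e. years since age 5; the interpretation below is a sketch, not part of the original analysis):

    # Sketch: interpreting the tooth-14 estimates of Table 7.5 for the model
    # gender*dmf/scale(dmf); the AFT model acts on T - 5 (years).
    b.gender <- -0.0931; b.dmf <- -0.0990; b.int <- 0.0401
    g1 <- -1.5613; g.dmf <- 0.2144

    exp(b.dmf)                  # boys:  dmf = 1 multiplies (T - 5) by about 0.91
    exp(b.dmf + b.int)          # girls: dmf = 1 multiplies (T - 5) by about 0.94
    exp(g1); exp(g1 + g.dmf)    # scale tau for dmf = 0 and dmf = 1 (larger for dmf = 1)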
Figure 7.3: Signal Tandmobiel® study. Predictive hazard curves (hazard versus age in years; panels: Tooth 14, Girls; Tooth 14, Boys; Tooth 15, Girls; Tooth 15, Boys) based on the model gender ∗ dmf/scale(dmf): solid line for dmf = 1, dotted-dashed line for dmf = 0.

7.7.3 Comparison of emergence distributions between different groups

While the model gender ∗ dmf/scale(dmf) gives a parsimonious description of the emergence distributions for the different groups of children and serves as a solid basis for prediction, as was shown in the previous section, it is not suitable for providing simple p-values for a comparison of the emergence distributions between, e.g., boys and girls. Due to the fact that the interaction term gender ∗ dmf appeared to be significantly important, we could only provide a p-value for a multiple comparison of the four groups (girls with dmf = 1 and 0 and boys with dmf = 1 and 0). To simply compare two distributions, while averaging over the effect of the other covariate, the basic AFT model with a univariate covariate x (i.e. either the model gender or the model dmf) can be used together with a significance test for the group parameter. Additionally, it is possible to perform a test that compares two groups while controlling for additional confounding variables (e.g. a comparison of boys and girls while controlling for dmf, or vice versa). To do that, we perform significance tests of the β parameters in the model gender + dmf.

The estimates of the regression parameters β, together with their standard errors derived from formula (7.11), for the mentioned models are given in Table 7.6. The Wald tests of significance for each β parameter all yield p-values lower than 0.0001, which confirms the findings obtained previously that there is indeed a significant difference in the emergence distributions of the studied teeth between boys and girls, and also between the groups of children with dmf = 0 and dmf = 1. The difference remains both marginally (irrespective of the value of dmf or irrespective of the value of gender, respectively) and while controlling for the other covariate.

Table 7.6: Signal Tandmobiel® study. Estimates (standard errors) for the models gender, dmf and gender + dmf.

  Parameter                Tooth 14             Tooth 15
  Model gender or dmf
    β(gender)             −0.0740 (0.0080)     −0.0564 (0.0085)
    β(dmf)                −0.0766 (0.0081)     −0.0594 (0.0087)
  Model gender + dmf
    β(gender)             −0.0729 (0.0086)     −0.0613 (0.0089)
    β(dmf)                −0.0741 (0.0085)     −0.0628 (0.0090)

The issue of the robustness of the AFT model against omitted covariates, discussed in Section 3.3, is further illustrated in Table 7.6. The effect of gender remains almost unchanged in both models, gender and gender + dmf, and an analogous conclusion holds also for the effect of dmf.

7.7.4 Conclusions

It has been shown that the emergence process of teeth 14 and 15 is significantly different between boys and girls and that the caries experience status of the primary predecessor, expressed by the dmf score, has a significant effect on the timing of emergence of the permanent successors. Predictive emergence curves have been drawn that can be used for diagnosis and treatment planning in paediatric dentistry.
Further, it was found that the accelerating effect of caries experience on a primary predecessor on the timing of emergence of its successor was stronger for boys than for girls.

7.8 Discussion

In this chapter, we have suggested a method useful for fitting the linear regression model for independent censored observations while avoiding overly restrictive parametric assumptions on the error distribution. Most classically, the logarithmic transformation of the response leads to the well-known AFT model. However, other transformations of the response, leading to a potential range covering the whole real line, are also possible. The density of the error distribution is specified in a semi-parametric way as a mixture of an overspecified number of normal densities with fixed means – the knots – and a given common standard deviation. The mixture coefficients are estimated using the penalized maximum-likelihood method. Such a model specification allows flexibility with respect to the resulting error distribution, yet retains tractability such that data involving censoring of several types, especially interval censoring, can be handled naturally.

The method of this chapter could in principle be extended to handle also multivariate survival data. Namely, the population-averaged AFT model (see Section 3.4.2) with a multivariate error distribution specified as a multivariate penalized mixture (see Section 6.3.4) could be used. Alternatively, the cluster-specific AFT model (see Section 3.4.3) with an error distribution given as a penalized mixture and a random effects distribution specified either parametrically or as a (multivariate) penalized mixture could be considered. However, as outlined in Sections 4.2 and 4.3, the computation and, even more so, the optimization of the (penalized) likelihood is then practically intractable. For this reason, we switch to fully Bayesian approaches using the MCMC methodology.

Chapter 8

Bayesian Normal Mixture Cluster-Specific AFT Model

In this chapter we present a cluster-specific AFT model (see Section 3.4.3) with a flexible error distribution. This model, introduced by Komárek and Lesaffre (2006a), allows us to analyze also data sets where not necessarily all observations are independent. For example, we will be able to jointly analyze several teeth from the Signal Tandmobiel® study, to analyze the CGD data where the times to recurrent infections are involved, or to analyze the data from multicenter studies like the EBCP data.

The approach presented here uses the classical normal mixture (see Section 6.1) to express the error density in the AFT model. For the random effects we use a parametric (multivariate) normal distribution. The full Bayesian approach with the Markov chain Monte Carlo methodology will be used for the inference.

In Section 8.1, we specify the cluster-specific AFT model and the distributional assumptions we use in this chapter. In Section 8.2, we specify the model from the Bayesian perspective and derive the corresponding posterior distribution. Details of the Markov chain Monte Carlo methodology to sample from the posterior distribution are given in Section 8.3. In Section 8.4, we show how the survival distributions for specific combinations of covariates can be estimated. Further, in Section 8.5, we give the estimates of the individual random effects that could be used, for example, for discrimination. The performance of the method is evaluated using the simulation study in Section 8.6.
The method is applied to the analysis of the interval-censored emergence times of 8 permanent teeth in Section 8.7, to a recurrent events analysis in Section 8.8 and to the analysis of the breast cancer multicenter study in Section 8.9. The chapter is finalized by a discussion in Section 8.10.

8.1 Model

Let Ti,l, i = 1, ..., N, l = 1, ..., ni be the lth event time in the ith cluster or the lth recurrent event on the ith subject in the study. Let Ti,l be observed as an interval ⌊tLi,l, tUi,l⌋. Let the logarithmic transformations of the event and observed event times be Yi,l = log(Ti,l), yLi,l = log(tLi,l), yUi,l = log(tUi,l). We will assume that the random vectors T1, ..., TN, where Ti = (Ti,1, ..., Ti,ni)', i = 1, ..., N, are independent. However, the components of each Ti are not necessarily independent. To model the effect of covariates on the event time we use the cluster-specific AFT model (3.7), i.e.

\log(T_{i,l}) = Y_{i,l} = \beta' x_{i,l} + b_i' z_{i,l} + \varepsilon_{i,l}, \qquad i = 1,\dots,N, \; l = 1,\dots,n_i,  (8.1)

where β = (β1, ..., βm)' is the unknown regression coefficient vector, xi,l the covariate vector for the fixed effects, and bi = (bi,1, ..., bi,q)', i = 1, ..., N, are the random effect vectors with density gb(b), inducing the possible correlation between the components of Yi = (Yi,1, ..., Yi,ni)'. Further, zi,l is the covariate vector for the random effects and the εi,l are independent and identically distributed random variables with density gε(ε). Along the lines of Gelman et al. (2004, Chapter 15), we use the terms 'fixed' and 'random' effects throughout the thesis, even in a Bayesian context where all unknown parameters are treated as random quantities.

For recurrent events, usually zi,l = 1 for all i and l, and bi = bi,1 expresses an individual-specific deviation from the overall mean log-event time which is not explained by the fixed effects covariates (see the analysis of the CGD data in Section 8.8). For clustered data, the vector zi,l may define further sub-clusters (as in the analysis of the Signal Tandmobiel® data in Section 8.7), allowing for a higher dependence of observations within sub-clusters given by common values of the appropriate components of the vector bi, while keeping the dependence also across the sub-clusters through the correlation between the components of bi. In multicenter clinical trials where the aim is to evaluate the effect of some treatment (e.g. the EBCP data analyzed in Section 8.9), the vector zi,l might be equal to (1, treatmenti,l)', allowing both the baseline value of the expected event time and the treatment effect to vary across centres.
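A small simulation sketch of the data structure implied by (8.1), with a normally distributed random intercept and a two-component normal mixture for the error term; all numerical values are illustrative only (the mixture parameters echo the mixture used in the simulation study of Section 7.5).

    # Sketch: simulate clustered event times from (8.1) with a random intercept,
    # a normal-mixture error and one fixed covariate; all values are illustrative.
    set.seed(2)
    N <- 50; ni <- 4                                  # clusters and cluster size
    id   <- rep(seq_len(N), each = ni)
    x    <- rbinom(N * ni, 1, 0.5)                    # fixed-effects covariate
    beta <- -0.5
    b    <- rnorm(N, mean = 0, sd = 0.3)[id]          # random intercepts b_i (z_{i,l} = 1)

    # two-component normal mixture for the errors (weights, means, sds illustrative)
    comp <- sample(1:2, N * ni, replace = TRUE, prob = c(0.4, 0.6))
    eps  <- rnorm(N * ni, mean = c(-1.4, 0.93)[comp], sd = 0.8)

    Ti <- exp(beta * x + b + eps)                     # event times T_{i,l}
    head(data.frame(id, x, T = Ti))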
8.1.1 Distributional assumptions

The density gε(ε) of the error term εi,l in model (8.1) is specified in a flexible way as a classical normal mixture (6.3), i.e.

g_\varepsilon(\varepsilon) = \sum_{j=1}^{K} w_j\, \varphi(\varepsilon \mid \mu_j, \sigma_j^2),  (8.2)

where K is the unknown number of mixture components and, further, w = (w1, ..., wK)' are unknown mixture weights, µ = (µ1, ..., µK)' unknown mixture means and σ² = (σ1², ..., σK²)' unknown mixture variances. We have already mentioned in Section 6.1.2 that the heteroscedastic mixture (8.2) leads to a likelihood which is unbounded if the parameter space for the variances is unconstrained. In a full Bayesian analysis, this difficulty is solved by using an appropriate prior distribution for the variances, which plays the role of constraints. We discuss this issue in full detail in Section 8.2.1.

For the random effects bi, we take a suitable parametric distribution, namely the multivariate normal distribution; see Section 8.2.2 for details. The fact that we put more emphasis on a correct specification of the distribution of the error term εi,l than on the specification of the distribution of the random effects bi is driven by the following reasoning. For an AFT model, the regression parameters β express the effect of the covariates (xi,l) both conditionally (given bi) and marginally (after integrating bi out). Neither interpretation changes when different distributional assumptions are made on bi. Further, with a correctly specified distribution of εi,l, the conditional model is always correctly specified. However, when the distribution of εi,l is incorrect, neither the conditional nor the marginal model is specified correctly. Further, Keiding, Andersen, and Klein (1997) showed that for the univariate (single-spell) Weibull AFT model the regression parameters are robust against misspecification of the random effects distribution. This finding, also for non-Weibull models, is further supported by the empirical results of Lambert et al. (2004). Finally, Verbeke and Lesaffre (1997) showed, in the context of the normal linear mixed model with uncensored data, that the maximum-likelihood estimates of the regression parameters are unaffected by a misspecified random effects distribution. Of course, in situations in which the variability of the random effects considerably exceeds the variability of the error term, it becomes more important to correctly specify the distribution of the random effects rather than the distribution of the error term. However, in the applications presented in this chapter this is not the case.

8.1.2 Likelihood

The likelihood contribution of the ith cluster can be derived from expressions (4.7) and (4.9). Namely,

L_i = \int_{\mathbb{R}^q} \Biggl\{ \prod_{l=1}^{n_i} \int_{y_{i,l}^L}^{y_{i,l}^U} g_\varepsilon\bigl(y_l - \beta' x_{i,l} - b_i' z_{i,l}\bigr)\, dy_l \Biggr\}\, g_b(b_i)\, db_i.  (8.3)

It might be useful to stress again that, due to the multivariate integration in the likelihood (8.3), it is rather cumbersome to use maximum-likelihood based methods for the cluster-specific AFT model with interval-censored observations, even with gε(ε) and gb(b) being parametrically specified. Mainly for this reason, the full Bayesian approach will be exploited.

8.2 Bayesian hierarchical model

The Bayesian specification of the model continues with the specification of the prior distributions for all unknown parameters, denoted by θ. We assume a cluster-specific AFT model (8.1) with a hierarchical structure graphically represented by the directed acyclic graph (DAG) given in Figure 8.1. As explained in Section 4.4, the joint prior distribution of θ is then given by the product of the conditional distributions of the nodes pertaining to unobserved quantities given their parents, namely

p(\theta) \propto \prod_{i=1}^{N} \Biggl[ \Bigl\{ \prod_{l=1}^{n_i} p\bigl(t_{i,l} \mid \beta, b_i, \varepsilon_{i,l}\bigr)\, p\bigl(\varepsilon_{i,l} \mid \mu, \sigma^2, r_{i,l}\bigr)\, p\bigl(r_{i,l} \mid K, w\bigr) \Bigr\} \times p\bigl(b_i \mid \gamma, D\bigr) \Biggr] \times p(\mu \mid K)\, p(\sigma^2 \mid K, \eta)\, p(\eta)\, p(w \mid K)\, p(K)\, p(\beta)\, p(\gamma)\, p(D).  (8.4)

For clarity, we omitted all fixed hyperparameters and fixed covariates in the expression (8.4). As the DAG indicates, the unknown parameters can be split into two parts connected only through the node of the true event times. The conditional distribution for this node is simply a Dirac (degenerate) distribution driven by the AFT model (8.1), i.e.

p(t_{i,l} \mid \beta, b_i, \varepsilon_{i,l}) = I\bigl[\log(t_{i,l}) = \beta' x_{i,l} + b_i' z_{i,l} + \varepsilon_{i,l}\bigr], \qquad i = 1,\dots,N, \; l = 1,\dots,n_i.
In the subsequent sections, we explain all the multiplicands of expression (8.4) and also the meaning of the newly introduced parameters ri,l, i = 1, ..., N, l = 1, ..., ni, γ, D, and η.

Figure 8.1: Directed acyclic graph for the Bayesian normal mixture cluster-specific AFT model (error part: K, w, ri,l, µ, σ², η, εi,l; regression part: β, γ, D, bi; data nodes: ti,l, tLi,l, tUi,l, the covariates xi,l, zi,l and the censoring mechanism).

8.2.1 Prior specification of the error part

The prior conditional distributions pertaining to the error part of the model are inspired by the work of Richardson and Green (1997) (with some change in notation), who studied Bayesian estimation of normal mixtures in the context of i.i.d. data. That is, they did not consider covariates or censoring.

To improve the computation of the posterior distribution, it is useful to assume that εi,l, i = 1, ..., N, l = 1, ..., ni come from a heterogeneous population consisting of groups j = 1, 2, ..., K of sizes proportional to the mixture weights wj, and to introduce latent allocation variables ri,l denoting the label of the group from which each random error variable εi,l is drawn. By this we introduce the Bayesian implementation of the data augmentation algorithm (see Section 4.3). Together with the distributional assumption (8.2), this leads to the following conditional distributions appearing in the prior (8.4):

\Pr(r_{i,l} = j \mid K, w) = w_j, \quad j \in \{1,\dots,K\}, \qquad
p(\varepsilon_{i,l} \mid \mu, \sigma^2, r_{i,l}) = \varphi\bigl(\varepsilon_{i,l} \mid \mu_{r_{i,l}}, \sigma^2_{r_{i,l}}\bigr), \qquad i = 1,\dots,N, \; l = 1,\dots,n_i.

For the number of mixture components, K, we experimented with

1. a Poisson distribution with mean equal to a fixed hyperparameter λ, truncated at some prespecified (relatively large) value Kmax and with zero excluded, i.e.

\Pr(K = k) = \Bigl\{\sum_{j=1}^{K_{max}} \frac{\lambda^j}{j!}\Bigr\}^{-1} \frac{\lambda^k}{k!}, \qquad k = 1,\dots,K_{max};

2. a uniform distribution on {1, ..., Kmax}, i.e.

\Pr(K = k) = \frac{1}{K_{max}}, \qquad k = 1,\dots,K_{max}.

The prior for the mixture weights w is taken to be a symmetric K-dimensional Dirichlet distribution with prior 'sample size' equal to Kδ, i.e.

p(w \mid K) = \frac{\Gamma(K\delta)}{\Gamma(\delta)^K} \prod_{j=1}^{K} w_j^{\delta - 1},

where δ is a fixed hyperparameter. Further, the mixture means µj and variances σj², j = 1, ..., K, are a priori all drawn independently with normal N(ξ, κ) and inverse-gamma IG(ζ, η) priors, respectively, i.e.

p(\mu \mid K) = \prod_{j=1}^{K} \varphi(\mu_j \mid \xi, \kappa),  (8.5)

p(\sigma^2 \mid K, \eta) = \prod_{j=1}^{K} \frac{\eta^{\zeta}}{\Gamma(\zeta)}\, (\sigma_j^2)^{-(\zeta+1)} \exp\Bigl(-\frac{\eta}{\sigma_j^2}\Bigr),  (8.6)

where ξ, κ and ζ are fixed hyperparameters. As in Richardson and Green (1997), we let the hyperparameter η follow a gamma distribution with fixed shape parameter h1 and fixed rate parameter h2, i.e.

p(\eta) = \frac{h_2^{h_1}}{\Gamma(h_1)}\, \eta^{h_1 - 1} \exp(-h_2 \eta).

A rationale for this construction is given in Section 8.2.3. Since the error model is invariant to permutations of the labels j = 1, ..., K, the joint prior distribution of the vector µ is restricted to the set {µ : µ1 < ··· < µK} for identifiability reasons; see Stephens (2000) or Jasra, Holmes, and Stephens (2005) for other approaches to establish identifiability. The joint prior distribution of the mixture means and variances is thus K! times the product of (8.5) and (8.6), restricted to the above-mentioned set of increasing means.
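The error-part prior is easy to simulate from, which is a convenient sanity check of the hyperparameter choices; a sketch with illustrative hyperparameter values (not the values used in the applications), generating one prior draw and a few error values.

    # Sketch: one draw from the error-part prior of Section 8.2.1
    # (illustrative hyperparameters; Dirichlet and inverse-gamma via rgamma).
    Kmax <- 30; lambda <- 5; delta <- 1
    xi <- 0; kappa <- 4; zeta <- 2; h1 <- 0.2; h2 <- 0.1

    k.grid <- 1:Kmax
    K   <- sample(k.grid, 1, prob = dpois(k.grid, lambda))   # zero-excluded, truncated Poisson
    w   <- rgamma(K, shape = delta); w <- w / sum(w)          # symmetric Dirichlet(delta)
    mu  <- sort(rnorm(K, mean = xi, sd = sqrt(kappa)))        # increasing means (identifiability)
    eta <- rgamma(1, shape = h1, rate = h2)
    s2  <- 1 / rgamma(K, shape = zeta, rate = eta)            # inverse-gamma IG(zeta, eta)

    r   <- sample(1:K, 10, replace = TRUE, prob = w)          # latent allocations r_{i,l}
    eps <- rnorm(10, mean = mu[r], sd = sqrt(s2[r]))          # error draws given allocations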
8.2.2 Prior specification of the regression part

The regression part of the model has the structure of a classical Bayesian linear mixed model (see, e.g., Gelman et al., 2004, Chapter 5). Let X be a (∑_{i=1}^{N} n_i) × m matrix with the vectors x′_{1,1}, …, x′_{N,n_N} as rows. Similarly, let Z be a (∑_{i=1}^{N} n_i) × q matrix with the vectors z′_{1,1}, …, z′_{N,n_N} as rows. Further, we will assume that the matrix (X, Z) is of full column rank (m + q). In other words, covariates included in x_{i,l} are not included in z_{i,l} and vice versa. This gives rise to hierarchical centering, which in general results in better behavior of the MCMC algorithm (Gelfand, Sahu, and Carlin, 1995). Finally, since g_ε(ε) does not have zero mean, we do not allow a column of ones in the matrix X, to avoid identifiability problems.

The prior distribution for each regression coefficient β_j, j = 1, …, m is assumed to be N(ν_{β,j}, ψ_{β,j}), and the β_j are assumed to be a priori independent, i.e.

    p(β) = ∏_{j=1}^{m} ϕ(β_j | ν_{β,j}, ψ_{β,j}).

The vectors ν_β = (ν_{β,1}, …, ν_{β,m})′ and ψ_β = (ψ_{β,1}, …, ψ_{β,m})′ are fixed hyperparameters.

As already mentioned in Section 8.1.1, the (prior) distribution of the random effect vector b_i, i = 1, …, N is assumed to be (multivariate) normal with a prior mean γ and a prior covariance matrix D, i.e.

    p(b_i | γ, D) = ϕ_q(b_i | γ, D),    (8.7)

where γ = (γ_1, …, γ_q)′. The prior distribution for each γ_j, j = 1, …, q is N(ν_{γ,j}, ψ_{γ,j}), independently for j = 1, …, q, i.e.

    p(γ) = ∏_{j=1}^{q} ϕ(γ_j | ν_{γ,j}, ψ_{γ,j}).

The vectors ν_γ = (ν_{γ,1}, …, ν_{γ,q})′ and ψ_γ = (ψ_{γ,1}, …, ψ_{γ,q})′ are fixed. Special care is needed when a random intercept is included in the model (i.e. when Z contains a column of ones, say its first column). Hierarchical centering cannot be applied in this case since the overall intercept is given by the mean of the mixture (8.2). For that reason, γ_1 is fixed to zero (or equivalently, ν_{γ,1} = 0, ψ_{γ,1} = 0).

The prior distribution for the covariance matrix D of the random effects is assumed to be an inverse-Wishart with fixed degrees of freedom df and a fixed scale matrix S, i.e.

    p(D) = 2^{−df·q/2} π^{−q(q−1)/4} ∏_{j=1}^{q} Γ( (df + 1 − j)/2 )^{−1} |S|^{df/2} |D|^{−(df+q+1)/2} exp{ −½ trace(S D^{−1}) }.    (8.8)

In the special case of a univariate random effect (q = 1), we use d instead of D and s instead of S in the notation. Note that in that case the inverse-Wishart distribution is the same as the inverse-gamma distribution with shape parameter df/2 and scale parameter s/2. Further, in the situation of q = 1 we considered alternatively (see Section 8.8) also the use of a uniform prior for the standard deviation of the random effect, which is often considered a better choice (see Gelman et al., 2004, pp. 136, 390 or Gelman, 2006), i.e. a priori

    p(√d) = (1/√s) I[0 < d < s],    (8.9)

for a large value of s. On the original variance scale the prior (8.9) transforms into

    p(d) = { 1 / (2√(s·d)) } I[0 < d < s],

which is formally a truncated inverse-gamma distribution with shape parameter −1/2 and scale parameter equal to zero.

8.2.3 Weak prior information

In this problem we have opted for specifying weak prior information on the parameters of interest. When a priori information is available, our prior assumptions can be appropriately modified. For the regression part of the model we use non-informative, yet proper, distributions; that is, the prior variances of the regression parameters β (ψ_β) and γ (ψ_γ) are chosen such that the posterior variance of the regression parameters is at least 100 times lower (which must be checked from the results).
Prior hyperparameters for the covariance matrix D giving weak prior information correspond to the choices df = q − 1 + c and S = diag(c, …, c), with c a small positive number.

In the error part of the model it is not possible to be fully non-informative, i.e. to use the priors p(µ, σ² | K) ∝ 1 × ∏_{j=1}^{K} σ_j^{−2}, and still obtain proper posterior distributions (Diebolt and Robert, 1994; Roeder and Wasserman, 1997). Richardson and Green (1997) offer, in the context of i.i.d. observations, say e_1, …, e_N, the following alternative. A rather flat prior N(ξ, κ) for each µ_j is achieved by setting ξ equal to ē = N^{−1} ∑_{i=1}^{N} e_i and setting κ equal to a multiple of R², where R = max(e_i) − min(e_i). They point out that it might be restrictive to suppose that knowledge of the range or variability of the data implies much about the size of each single σ²_j, and they therefore introduced an additional hierarchical level by allowing η to follow a gamma distribution with parameters h_1 and h_2. They recommend taking ζ > 1 > h_1 to express the belief that the σ²_j are similar (which is necessary to avoid the problem of an unbounded likelihood) without being informative about their absolute size. Finally, they suggest setting the parameter h_2 to a small multiple of 1/R². Here, the residuals y_{i,l} − β′x_{i,l} − b_i′z_{i,l} play the role of the observations e_i. A rough estimate of their location and scale can be obtained from a maximum-likelihood fit of the AFT model, even without random effects (the scale of the residuals can only increase), with explicitly included intercept and scale parameters. This can be done using standard software packages such as R, S-PLUS or SAS. The estimated intercept from this model can then be used instead of ē and a multiple of the estimated scale parameter instead of R.

8.2.4 Posterior distribution

As we indicated in Section 4.4, the joint posterior distribution p(θ | data) is proportional to the product of all DAG conditional distributions, i.e.

    p(θ | data) ∝ p(θ) × ∏_{i=1}^{N} ∏_{l=1}^{n_i} p(t^L_{i,l}, t^U_{i,l} | t_{i,l}, censoring_{i,l}),    (8.10)

where p(θ) is given by (8.4) and p(t^L_{i,l}, t^U_{i,l} | t_{i,l}, censoring_{i,l}) is discussed below. A box called censoring_{i,l} in the DAG represents a realization of the random variable(s) causing the censoring of the (i, l)th event time. Note that, under the assumption of independent noninformative censoring (see Section 2.4), there is no need to specify a measurement model for the censoring mechanism since it only acts as a multiplicative constant in the posterior. After omitting the subscripts i, l for clarity, the expression for p(t^L, t^U | t, censoring) is rather obvious for most censoring mechanisms. For example, with interval censoring resulting from checking the survival status at (random) times C = {c_0, …, c_{S+1}}, where c_0 = 0, c_{S+1} = ∞, we obtain the Dirac density

    p(t^L = c_s, t^U = c_{s+1} | t, C) = I[ t ∈ ⌊c_s, c_{s+1}⌋ ],    s = 0, …, S.

With standard right censoring driven by the (random) censoring time C = c, the following Dirac densities are obtained:

    p(t^L = t^U = t | t, c) = I[t ≤ c],    p(t^L = t, t^U = ∞ | t, c) = I[t > c].
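To illustrate the censoring mechanisms above, the sketch below converts a true event time and a visit schedule into the observed pair (t^L, t^U). It is a minimal illustration, assuming that the survival status is checked at each visit and that withdrawal induces right censoring at the last visit actually seen; the function name and arguments are hypothetical.

import numpy as np

def observed_interval(t, assess_times, withdrawal=np.inf):
    """Turn a true event time t into the observed pair (tL, tU) as in
    Section 8.2.4: interval censoring by a visit schedule c_1 < ... < c_S,
    with c_0 = 0 and c_{S+1} = infinity, and right censoring at withdrawal."""
    c = np.concatenate(([0.0], np.sort(assess_times), [np.inf]))
    c = c[c <= withdrawal]            # visits after withdrawal are not observed
    if t > c[-1]:
        return c[-1], np.inf          # right-censored at the last visit seen
    s = np.searchsorted(c, t, side='left') - 1
    return c[s], c[s + 1]             # t lies in the interval (c_s, c_{s+1}]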
8.3 Markov chain Monte Carlo

Inference is based on a sample from the posterior distribution obtained using MCMC methodology (see Section 4.5). The parameters of the error part of the model are updated using a combination of the reversible jump MCMC algorithm of Green (1995) and a conventional Gibbs algorithm (Geman and Geman, 1984). For the remaining parameters of the model, each iteration of the MCMC is conducted using the Gibbs sampler. Both the reversible jump MCMC algorithm and the full conditional distributions needed to implement the Gibbs sampler are discussed below.

8.3.1 Update of the error part of the model

Details on how to implement the update of the parameters of the error part of the model are given in Richardson and Green (1997). Their guidelines, now based on the residuals ε_{i,l} = y_{i,l} − β′x_{i,l} − b_i′z_{i,l}, can be applied immediately. We give only a brief summary and refer to their paper for details. Six move types are suggested by Richardson and Green (1997), namely

(i) updating the mixture weights w while keeping K fixed;
(ii) updating the mixture means µ and variances σ² while keeping K fixed;
(iii) updating the allocation parameters r_{i,l}, i = 1, …, N, l = 1, …, n_i;
(iv) updating the variance hyperparameter η;
(v) the split-combine move, i.e. splitting one mixture component into two, or combining two into one;
(vi) the birth-death move, i.e. the birth or death of an empty mixture component.

In our context, due to the regression and the presence of censored data, we add one more move type, i.e.

(vii) updating the residuals ε_{i,l}, i = 1, …, N, l = 1, …, n_i.

Note that only move types (v) and (vi) change the dimension of the parameter vector, by changing K to K − 1 or K + 1, and are performed using the reversible jump MCMC algorithm. The moves (i)–(iv) and the move (vii) are performed by sampling from the full conditional distributions given below (a schematic sketch of one fixed-K sweep is shown after the full conditional for η).

Full conditional for the mixture weights w

The full conditional distribution of the mixture weights is Dirichlet with parameters δ + N_j, j = 1, …, K, i.e.

    p(w | ⋯) = Γ(Kδ + n) / { ∏_{j=1}^{K} Γ(δ + N_j) }  ∏_{j=1}^{K} w_j^{δ+N_j−1},

where n = ∑_{i=1}^{N} n_i is the total sample size and N_j, j = 1, …, K is the number of observations currently allocated to the jth mixture component, i.e.

    N_j = ∑_{i=1}^{N} ∑_{l=1}^{n_i} I[r_{i,l} = j],    j = 1, …, K.

Full conditional for the mixture means

The full conditional of each mixture mean is normal with mean and variance

    E(µ_j | ⋯) = var(µ_j | ⋯) × { σ_j^{−2} ∑_{(i,l): r_{i,l}=j} ε_{i,l} + κ^{−1} ξ },
    var(µ_j | ⋯) = ( σ_j^{−2} N_j + κ^{−1} )^{−1},    j = 1, …, K.

Note that, due to the ordering constraint µ_1 < ⋯ < µ_K, the full conditional only generates a proposal, which is accepted provided it does not break this ordering.

Full conditional for the mixture variances

The full conditional of each mixture variance is an inverse-gamma distribution,

    σ²_j | ⋯ ∼ I-Gamma( ζ + N_j/2,  η + ½ ∑_{(i,l): r_{i,l}=j} (ε_{i,l} − µ_j)² ).

Full conditional for the allocation variables

The full conditional of each allocation variable r_{i,l}, i = 1, …, N, l = 1, …, n_i is discrete with

    Pr(r_{i,l} = j | ⋯) ∝ (w_j / σ_j) exp{ −(ε_{i,l} − µ_j)² / (2σ²_j) },    j ∈ {1, …, K}.

Full conditional for the variance hyperparameter

The full conditional of the variance hyperparameter η is a gamma distribution,

    η | ⋯ ∼ Gamma( h_1 + Kζ,  h_2 + ∑_{j=1}^{K} σ_j^{−2} ).
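The following sketch implements one sweep of the fixed-K moves (i)–(iv) exactly as given by the full conditionals above. It is a simplified illustration: it works on the flattened vector of current residuals, the default hyperparameter values are placeholders, and the ordering constraint on µ is enforced by simple rejection.

import numpy as np

def gibbs_sweep_fixed_K(eps, alloc, w, mu, sigma2, eta, rng,
                        delta=1.0, xi=0.0, kappa=1.0, zeta=2.0, h1=0.2, h2=0.1):
    """One sweep of the fixed-K moves (i)-(iv) of Section 8.3.1.

    eps   : current residuals, flattened over all (i, l)
    alloc : current allocations r_{i,l} coded in {0, ..., K-1}
    """
    K = len(w)
    N_j = np.bincount(alloc, minlength=K)
    # (i) weights ~ Dirichlet(delta + N_j)
    w = rng.dirichlet(delta + N_j)
    # (ii) means (kept ordered) and variances
    for j in range(K):
        e_j = eps[alloc == j]
        var_j = 1.0 / (N_j[j] / sigma2[j] + 1.0 / kappa)
        mean_j = var_j * (e_j.sum() / sigma2[j] + xi / kappa)
        cand = mu.copy()
        cand[j] = rng.normal(mean_j, np.sqrt(var_j))
        if np.all(np.diff(cand) > 0):          # keep mu_1 < ... < mu_K
            mu = cand
        sigma2[j] = 1.0 / rng.gamma(zeta + 0.5 * N_j[j],
                                    1.0 / (eta + 0.5 * ((e_j - mu[j]) ** 2).sum()))
    # (iii) allocations r_{i,l}
    logp = (np.log(w) - 0.5 * np.log(sigma2)
            - (eps[:, None] - mu) ** 2 / (2.0 * sigma2))
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    alloc = np.array([rng.choice(K, p=pi) for pi in p])
    # (iv) eta ~ Gamma(h1 + K*zeta, rate = h2 + sum_j sigma_j^{-2})
    eta = rng.gamma(h1 + K * zeta, 1.0 / (h2 + (1.0 / sigma2).sum()))
    return alloc, w, mu, sigma2, eta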
Split-combine move

To perform the split-combine move, a random choice is first made whether to attempt the split or the combine move. Namely, given K, the probability of attempting the split move is π^{split}_K and the probability of attempting the combine move is π^{combine}_K = 1 − π^{split}_K. Obviously, π^{split}_1 = 1 and π^{split}_{K_max} = 0; otherwise we use π^{split}_K = π^{combine}_K = 0.5, K = 2, …, K_max − 1.

When the combine move is attempted, the new mixture with K − 1 components is proposed as follows:

1. Choose at random a pair of mixture components (j_1, j_2) such that, for the current values of the mixture means,

    µ_{j_1} < µ_{j_2} and there is no other µ_j in the interval [µ_{j_1}, µ_{j_2}];    (8.11)

2. Propose a new mixture component by merging the j_1th and the j_2th components. Label this new component j*. Set the weight, mean and variance of the new component such that its 0th, 1st and 2nd moments are the same as those of the combination of the merged components, i.e.

    w_{j*} = w_{j_1} + w_{j_2},
    µ_{j*} = (w_{j_1} µ_{j_1} + w_{j_2} µ_{j_2}) / w_{j*},    (8.12)
    σ²_{j*} = { w_{j_1}(µ²_{j_1} + σ²_{j_1}) + w_{j_2}(µ²_{j_2} + σ²_{j_2}) } / w_{j*} − µ²_{j*};

3. Propose new values for those allocation variables r_{i,l}, i = 1, …, N, l = 1, …, n_i that were equal to j_1 or j_2, i.e. set such allocation variables equal to j*;

4. Accept the proposed mixture with K − 1 components with probability

    Pr^{combine}_{accept} = min{ 1, A_{sc}(K − 1)^{−1} },

where the acceptance ratio A_{sc}(K − 1) is discussed below. If not accepted, keep the current K-component mixture.

The split move must be reversible, in the sense described in Green (1995), to the combine move. Namely, it consists of the following steps:

1. Choose at random a component j* which is proposed to be split;

2. Propose two new mixture components, labeled j_1 and j_2. To keep reversibility, set their weights, means and variances such that equation (8.12) remains satisfied. This can be done by sampling a three-dimensional auxiliary random vector u = (u_1, u_2, u_3)′ from some distribution with density p_u(u) and setting

    w_{j_1} = w_{j*} u_1,                                w_{j_2} = w_{j*} (1 − u_1),
    µ_{j_1} = µ_{j*} − u_2 σ_{j*} √(w_{j_2}/w_{j_1}),    µ_{j_2} = µ_{j*} + u_2 σ_{j*} √(w_{j_1}/w_{j_2}),    (8.13)
    σ²_{j_1} = u_3 (1 − u_2²) σ²_{j*} w_{j*}/w_{j_1},    σ²_{j_2} = (1 − u_3)(1 − u_2²) σ²_{j*} w_{j*}/w_{j_2}.

Check whether condition (8.11) holds. If not, reject the split proposal directly; otherwise continue;

3. Propose new values (either j_1 or j_2) for those allocation variables r_{i,l}, i = 1, …, N, l = 1, …, n_i that were equal to j*. This is done randomly with

    Pr_alloc(r_{i,l} = j_1) ∝ (w_{j_1}/σ_{j_1}) exp{ −(ε_{i,l} − µ_{j_1})² / (2σ²_{j_1}) },
    Pr_alloc(r_{i,l} = j_2) ∝ (w_{j_2}/σ_{j_2}) exp{ −(ε_{i,l} − µ_{j_2})² / (2σ²_{j_2}) };

4. Accept the proposed mixture with K + 1 components with probability

    Pr^{split}_{accept} = min{ 1, A_{sc}(K) },

see below for the expression of the acceptance ratio A_{sc}(K). If not accepted, keep the current K-component mixture.
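The moment-matching pair (8.12)–(8.13) is easy to check numerically. The sketch below implements both transformations and verifies that combining a freshly split component recovers the original triple (w, µ, σ²); it is an illustration only and the function names are hypothetical.

import numpy as np

def combine_components(w1, mu1, s2_1, w2, mu2, s2_2):
    """Moment-matching merge (8.12): the merged component preserves the
    0th, 1st and 2nd moments of the two components being combined."""
    w = w1 + w2
    mu = (w1 * mu1 + w2 * mu2) / w
    s2 = (w1 * (mu1 ** 2 + s2_1) + w2 * (mu2 ** 2 + s2_2)) / w - mu ** 2
    return w, mu, s2

def split_component(w, mu, s2, u1, u2, u3):
    """Split proposal (8.13) driven by the auxiliary vector u = (u1, u2, u3)."""
    w1, w2 = w * u1, w * (1.0 - u1)
    mu1 = mu - u2 * np.sqrt(s2) * np.sqrt(w2 / w1)
    mu2 = mu + u2 * np.sqrt(s2) * np.sqrt(w1 / w2)
    s2_1 = u3 * (1.0 - u2 ** 2) * s2 * w / w1
    s2_2 = (1.0 - u3) * (1.0 - u2 ** 2) * s2 * w / w2
    return (w1, mu1, s2_1), (w2, mu2, s2_2)

# sanity check: combining a split recovers the original component
c1, c2 = split_component(0.3, 1.0, 0.5, u1=0.4, u2=0.3, u3=0.6)
print(combine_components(*c1, *c2))   # approximately (0.3, 1.0, 0.5)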
The acceptance ratio A_{sc}(K) has the following general structure:

    A_{sc}(K) = [posterior ratio] × [proposal ratio](K) × [Jacobian].

The individual components of the above product have the following meaning. First,

    [posterior ratio] = p(θ^{j_1,j_2} | data) / p(θ^{j*} | data),

where the posterior density p(⋅ | data) is given by (8.10). Here θ^{j_1,j_2} refers to the parameter vector pertaining to the proposal in the case of the split move and to the current values of the parameters in the case of the combine move; similarly, θ^{j*} refers to the current parameter vector in the case of the split move and to the proposal in the case of the combine move. The proposal ratio is given by

    [proposal ratio](K) = π^{combine}_{K+1} / { π^{split}_K  p_u(u)  ∏_{(i,l): r_{i,l}=j*} Pr_alloc(r_{i,l}) }.

Finally, the Jacobian refers to the transformation (8.13) from (w_{j*}, µ_{j*}, σ²_{j*}, u_1, u_2, u_3)′ to (w_{j_1}, w_{j_2}, µ_{j_1}, µ_{j_2}, σ²_{j_1}, σ²_{j_2})′, i.e.

    [Jacobian] = w_{j*} σ²_{j_1} σ²_{j_2} (µ_{j_2} − µ_{j_1}) / { σ²_{j*} u_2 (1 − u_2²) u_3 (1 − u_3) }.

What remains to be discussed is the choice of the density p_u(u) of the auxiliary random vector u used to generate the proposal in the split move. Richardson and Green (1997) suggest generating u_1, u_2 and u_3 independently from the following beta distributions:

    u_1 ∼ Beta(2, 2),    u_2 ∼ Beta(2, 2),    u_3 ∼ Beta(1, 1).

Note that at each iteration of the MCMC a new auxiliary vector u is generated, independently of the previous iteration. Brooks, Giudici, and Roberts (2003) showed that some improvement of the MCMC sampling can be achieved by allowing (a) correlation between the components of u and (b) serial correlation between the auxiliary vectors u generated at successive iterations of the MCMC. In our practical applications (Sections 8.7, 8.8 and 8.9) we exploited their methodology as well.

Birth-death move

As for the split-combine move, it is randomly chosen whether the birth or the death move will be attempted. If the current number of mixture components is K, the birth move is attempted with probability π^{birth}_K and the death move with probability π^{death}_K = 1 − π^{birth}_K. Analogously to the probabilities of the split and combine moves, we use π^{birth}_1 = 1, π^{birth}_{K_max} = 0 and π^{birth}_K = π^{death}_K = 0.5, K = 2, …, K_max − 1.

When the birth move is attempted, the new mixture with K + 1 components is proposed in the following steps:

1. Sample the weight, mean and variance of the new component from the following distributions:

    w_{j*} ∼ Beta(1, K),    µ_{j*} ∼ N(ξ, κ),    σ²_{j*} ∼ I-Gamma(ζ, η).    (8.14)

Note that the expectation of the new weight is equal to 1/(K + 1), i.e. the reciprocal of the number of components in the proposed mixture;

2. In the proposed mixture, rescale the weights such that they, together with the new weight w_{j*}, sum to one, i.e. the weights of the proposed mixture are w′_1, …, w′_K, w_{j*} with

    w′_j = w_j (1 − w_{j*}),    j = 1, …, K;    (8.15)

3. Accept the proposed mixture with K + 1 components with probability

    Pr^{birth}_{accept} = min{ 1, A_{bd}(K) },

see below for the form of the acceptance ratio A_{bd}(K). If not accepted, keep the current K-component mixture.

When it is chosen to propose the death move, the new mixture with K − 1 components is proposed in the following way:

1. Check whether there are any empty mixture components, i.e. components for which N_j = ∑_{i,l} I[r_{i,l} = j] is equal to zero. If there are none, the death move is directly rejected;

2. Choose randomly an empty mixture component. Let j* be the label of this component;

3. In the proposed (K − 1)-component mixture, delete the j*th component and rescale the remaining weights such that they sum to one, i.e. the proposed mixture has the weights

    w′_j = w_j / (1 − w_{j*}),    j = 1, …, K,  j ≠ j*;

4. Accept the proposed mixture with K − 1 components with probability

    Pr^{death}_{accept} = min{ 1, A_{bd}(K − 1)^{−1} },

where the acceptance ratio A_{bd}(K − 1) is given below. If not accepted, keep the current K-component mixture.
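As a small illustration of steps (8.14)–(8.15), the sketch below generates one birth proposal; the parameter defaults are placeholders and the acceptance step is not shown.

import numpy as np

def birth_proposal(w, mu, sigma2, eta, rng, xi=0.0, kappa=1.0, zeta=2.0):
    """Propose a (K+1)-component mixture via the birth move, steps (8.14)-(8.15)."""
    K = len(w)
    w_new = rng.beta(1, K)                          # E(w_new) = 1/(K+1)
    mu_new = rng.normal(xi, np.sqrt(kappa))
    s2_new = 1.0 / rng.gamma(zeta, 1.0 / eta)       # I-Gamma(zeta, eta)
    w_prop = np.append(w * (1.0 - w_new), w_new)    # rescale old weights, (8.15)
    return w_prop, np.append(mu, mu_new), np.append(sigma2, s2_new)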
Analogously to the split-combine move, the acceptance ratio A_{bd}(K) has the general structure

    A_{bd}(K) = [posterior ratio] × [proposal ratio](K) × [Jacobian],

where

    [posterior ratio] = p(θ⁺ | data) / p(θ⁻ | data).

The vector θ⁺ refers to the set of parameters containing the proposed mixture in the case of the birth move and to the set of current parameter values in the case of the death move. Similarly, the vector θ⁻ refers to the set of current parameter values in the case of the birth move and to the set of parameters containing the proposed mixture in the case of the death move. Further, the proposal ratio is given by

    [proposal ratio](K) = π^{death}_{K+1} / { π^{birth}_K  p_prop(w_{j*}, µ_{j*}, σ²_{j*}) },

where p_prop(w_{j*}, µ_{j*}, σ²_{j*}) is the density of the proposal step given by (8.14), i.e.

    p_prop(w_{j*}, µ_{j*}, σ²_{j*}) = K (1 − w_{j*})^{K−1} × ϕ(µ_{j*} | ξ, κ) × { η^ζ / Γ(ζ) } (σ²_{j*})^{−(ζ+1)} exp(−η / σ²_{j*}).

Finally, the Jacobian refers to the transformation (8.15), i.e.

    [Jacobian] = (1 − w_{j*})^K.

Updating the residuals

The update of the residuals ε_{i,l}, i = 1, …, N, l = 1, …, n_i is fully deterministic provided the (i, l)th residual corresponds to an uncensored observation t_{i,l} = t^L_{i,l} = t^U_{i,l}. In that case, the update of ε_{i,l} consists of applying the AFT expression (8.1) with the current values of the parameters, i.e. the updated ε_{i,l} equals log(t_{i,l}) − β′x_{i,l} − b_i′z_{i,l}. When the residual ε_{i,l} corresponds to a censored observation with an observed interval ⌊t^L_{i,l}, t^U_{i,l}⌋, its update consists of sampling from the full conditional distribution of ε_{i,l}, which turns out to be a truncated normal distribution, namely N(µ_{r_{i,l}}, σ²_{r_{i,l}}) truncated to ⌊log(t^L_{i,l}) − β′x_{i,l} − b_i′z_{i,l}, log(t^U_{i,l}) − β′x_{i,l} − b_i′z_{i,l}⌋.
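The residual update just described can be sketched as follows, using inverse-CDF sampling for the truncated normal part; the function and its argument names are illustrative only.

import numpy as np
from scipy.stats import norm

def update_residual(tL, tU, x, z, beta, b, mu_r, sigma_r, rng):
    """Update one residual eps_{i,l} as in Section 8.3.1.

    Uncensored (tL == tU): deterministic, eps = log(t) - beta'x - b'z.
    Censored: draw from N(mu_r, sigma_r^2) truncated to
    [log(tL) - beta'x - b'z, log(tU) - beta'x - b'z] by inverse-CDF sampling.
    """
    shift = x @ beta + z @ b
    if tL == tU:
        return np.log(tL) - shift
    lo = -np.inf if tL <= 0 else np.log(tL) - shift
    hi = np.inf if np.isinf(tU) else np.log(tU) - shift
    a, c = norm.cdf(lo, mu_r, sigma_r), norm.cdf(hi, mu_r, sigma_r)
    u = rng.uniform(a, c)
    return norm.ppf(u, mu_r, sigma_r)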
8.3.2 Update of the regression part of the model

The regression part of the model is updated by sampling from the full conditional distribution of each parameter or set of parameters.

Full conditional for the fixed effects β

Let β_(S) be an arbitrary sub-vector of the vector β, let x_{i,l(S)} be the corresponding sub-vectors of the covariate vectors x_{i,l}, and let x_{i,l(−S)} be their complementary sub-vectors. Similarly, let ν_{β(S)} and ψ_{β(S)} be the appropriate sub-vectors of the hyperparameters ν_β and ψ_β, respectively. Finally, let Ψ_{β(S)} = diag(ψ_{β(S)}). Then

    β_(S) | ⋯ ∼ N( E(β_(S) | ⋯), var(β_(S) | ⋯) ),

with

    E(β_(S) | ⋯) = var(β_(S) | ⋯) × { Ψ^{−1}_{β(S)} ν_{β(S)} + ∑_{i=1}^{N} ∑_{l=1}^{n_i} σ^{−2}_{r_{i,l}} e^{(F)}_{i,l(S)} x_{i,l(S)} },
    var(β_(S) | ⋯) = ( Ψ^{−1}_{β(S)} + ∑_{i=1}^{N} ∑_{l=1}^{n_i} σ^{−2}_{r_{i,l}} x_{i,l(S)} x′_{i,l(S)} )^{−1},

where e^{(F)}_{i,l(S)} = log(t_{i,l}) − µ_{r_{i,l}} − β′_{(−S)} x_{i,l(−S)} − b_i′z_{i,l}.

Full conditional for the means of the random effects γ

There is no loss of generality in assuming that γ = (γ′_(S), γ′_(−S))′. Further, let b_{i(S)}, b_{i(−S)}, ν_{γ(S)} and ψ_{γ(S)} be the corresponding (or complementary) sub-vectors of the indicated quantities and let Ψ_{γ(S)} = diag(ψ_{γ(S)}). Furthermore, let the inverse of the matrix D be decomposed as

    D^{−1} = ( V_(S)      V_(S,−S)
               V′_(S,−S)  V_(−S) ).

Then

    γ_(S) | ⋯ ∼ N( E(γ_(S) | ⋯), var(γ_(S) | ⋯) ),

with

    E(γ_(S) | ⋯) = var(γ_(S) | ⋯) × { Ψ^{−1}_{γ(S)} ν_{γ(S)} + V_(S) ∑_{i=1}^{N} b_{i(S)} + V_(S,−S) ∑_{i=1}^{N} ( b_{i(−S)} − γ_(−S) ) },
    var(γ_(S) | ⋯) = ( Ψ^{−1}_{γ(S)} + N V_(S) )^{−1}.

Full conditional for the random effects b_i

For the random effects vectors b_i:

    b_i | ⋯ ∼ N( E(b_i | ⋯), var(b_i | ⋯) ),    i = 1, …, N,

with

    E(b_i | ⋯) = var(b_i | ⋯) × { D^{−1} γ + ∑_{l=1}^{n_i} σ^{−2}_{r_{i,l}} ( log(t_{i,l}) − µ_{r_{i,l}} − β′x_{i,l} ) z_{i,l} },
    var(b_i | ⋯) = ( D^{−1} + ∑_{l=1}^{n_i} σ^{−2}_{r_{i,l}} z_{i,l} z′_{i,l} )^{−1}.

Full conditional for the covariance matrix of the random effects D

Finally, D | ⋯ has an inverse-Wishart distribution with degrees of freedom equal to df + N and scale matrix equal to

    S + ∑_{i=1}^{N} (b_i − γ)(b_i − γ)′.

8.4 Bayesian estimates of the survival distribution

Simple posterior medians or means are suitable overall estimates for the components of the parameter vector θ. To characterize the survival distribution underlying the data we also need an estimate of the survival and hazard functions, of the survival density, or of the density of the error term in the AFT model. All these quantities are functions whose expression depends on the parameter vector θ. In Bayesian statistics they are estimated by the means of (posterior) predictive quantities, to be discussed in this section.

8.4.1 Predictive survival and hazard curves and predictive survival densities

For a specific value of the covariates, say x_new and z_new, the predictive survival function is given by

    S(t | data, x_new, z_new) = ∫ S(t | θ, data, x_new, z_new) p(θ | data) dθ

for any t > 0. Further, once the parameter vector θ is known, the data do not bring any additional information and hence

    S(t | θ, data, x_new, z_new) = S(t | θ, x_new, z_new).

Additionally, analogously to Section 7.4, the quantity S(t | θ, x_new, z_new) is expressed using the model parameters as

    S(t | θ, x_new, z_new) = 1 − ∑_{j=1}^{K} w_j Φ( log(t) − β′x_new − b′z_new | µ_j, σ²_j ).    (8.16)

The MCMC estimate of the predictive survival function is then given, using expression (4.13), by

    Ŝ(t | data, x_new, z_new) = (1/M) ∑_{m=1}^{M} S(t | θ^{(m)}, x_new, z_new),    (8.17)

where θ^{(m)}, m = 1, …, M is the MCMC sample from the posterior (predictive) distribution. All components of θ^{(m)} are directly available except b^{(m)}, which must be additionally sampled from N_q(γ^{(m)}, D^{(m)}). Analogously, predictive hazard curves and predictive survival densities are obtained using the relationship

    p(t | θ, x_new, z_new) = t^{−1} ∑_{j=1}^{K} w_j ϕ( log(t) − β′x_new − b′z_new | µ_j, σ²_j )    (8.18)

for the survival density and the relationship

    ℏ(t | θ, x_new, z_new) = p(t | θ, x_new, z_new) / S(t | θ, x_new, z_new)    (8.19)

for the hazard.

8.4.2 Predictive error densities

Averaging the error density (8.2) across the MCMC run, conditionally on a fixed value of K, gives a Bayesian predictive error density estimate for the mixture with K components, i.e. an estimate of

    E{ g_ε(e) | K, data } = ∫_{Θ_K} g_ε(e) p(θ | K, data) dθ,    e ∈ R,    (8.20)

where the domain of integration, Θ_K, is the subset of the overall parameter space pertaining to mixtures with a fixed number K of mixture components. Averaging further across the values of K gives an estimate of

    E{ g_ε(e) | data } = ∫ g_ε(e) p(θ | data) dθ,    e ∈ R,    (8.21)

the overall Bayesian predictive density estimate of the error distribution.
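The estimator (8.16)–(8.17) is straightforward to compute from stored MCMC output. The sketch below assumes the posterior draws are available as a list of dictionaries with illustrative keys; this is not the storage format of any particular package.

import numpy as np
from scipy.stats import norm

def predictive_survival(t, x_new, z_new, draws, seed=0):
    """MCMC estimate (8.17) of the predictive survival function (8.16).

    `draws` is assumed to be a list of posterior samples, each a dict with
    keys 'w', 'mu', 'sigma2', 'beta', 'gamma', 'D' (hypothetical names).
    The random effect b is re-drawn from N_q(gamma, D) at every iteration,
    as described in the text.
    """
    rng = np.random.default_rng(seed)
    t = np.asarray(t, dtype=float)
    S = np.zeros_like(t)
    for d in draws:
        b = rng.multivariate_normal(np.atleast_1d(d['gamma']), np.atleast_2d(d['D']))
        lin = np.dot(x_new, d['beta']) + np.dot(z_new, b)
        cdf = sum(w * norm.cdf(np.log(t), loc=lin + m, scale=np.sqrt(s2))
                  for w, m, s2 in zip(d['w'], d['mu'], d['sigma2']))
        S += 1.0 - cdf
    return S / len(draws)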
8.5 Bayesian estimates of the individual random effects

In some situations, for example when discrimination between clusters is of interest, an estimate of the individual random effects must be provided. In Bayesian statistics, their estimates are given by some characteristic of the posterior distribution, for instance by the posterior mean E(b_i | data). The precision of the estimate can be evaluated using a credible interval. When using MCMC to draw the sample from the posterior distribution, we estimate each individual random effect vector b_i by the average of the sampled values, i.e.

    b̂_i = (1/M) ∑_{m=1}^{M} b_i^{(m)},

where M is the number of MCMC iterations and b_i^{(m)} is the value of b_i sampled at the mth iteration. The credible interval is obtained by taking sample quantiles of the MCMC sample.

8.6 Simulation study

A simulation study was carried out to explore the performance of the proposed method. The setting mimics a study with clustered data where a continuous covariate as well as a dichotomous covariate might influence the distribution of the event time. At the same time there might be overall heterogeneity between the clusters as well as a possible interaction between the cluster effect and the effect of the dichotomous covariate. The factual setting used to generate the ‘true’ data was motivated by the results of the WIHS analysis presented in Section 7.6. Namely, ‘true’ uncensored data were generated according to the model

    log(T_{i,l}) = 1.5 + β x_{i,l} + b_{i,1} + b_{i,2} z_{i,l} + ε_{i,l},    i = 1, …, N,  l = 1, …, n_i,

where β = 0.4, γ = −0.8, (b_{i,1}, b_{i,2})′ ∼ N_2( (0, γ)′, D ), var(b_{i,1}) = 0.5², var(b_{i,2}) = 0.1², corr(b_{i,1}, b_{i,2}) = 0.4. The covariate x_{i,l} was generated from the extreme-value distribution of a minimum, with location 8.5 and scale 1, inspired more or less by the log2(1 + CD4 count) covariate in the WIHS data set. The covariate z_{i,l} was binary, taking the value 1 with probability 0.4. The error term ε_{i,l} was generated from a standard normal distribution, from a Cauchy distribution, from a Student t_2 distribution, from a standardized extreme value distribution, and from the normal mixture 0.4 N_1(−2.000, 0.25) + 0.6 N_1(1.333, 0.36), respectively. Two sample sizes were considered: (1) N = 50, n_i = 5 for all i (small sample size) and (2) N = 100, n_i = 10 for all i (large sample size). Each simulation involved 100 replications. All event times were interval-censored by simulating 120 consecutive ‘assessment times’ for each ‘patient’ in the dataset (the first assessment time was drawn from N(7, 1), the times between consecutive assessments from N(6, 0.25)). At each assessment, between 0.2% and 0.6% of randomly selected patients were withdrawn from the study, resulting in approximately 15% right-censored observations. For each dataset, the estimates were computed using the Bayesian normal mixture cluster-specific AFT model, using the Bayesian cluster-specific model with a normal error, and using the maximum-likelihood AFT model with a normal error that ignores the random-effects structure.

Appendix B, Section B.2 gives selected results of the simulation. Average estimates of the regression parameters, their standard errors and mean squared errors are given in Tables B.7 and B.8. The results related to the covariance matrix D of the random effects are given in Tables B.9 – B.11. It is seen that, in most cases, the Bayesian mixture approach performs better than the incorrectly specified models. A large difference in favour of the Bayesian mixture model is seen when the error distribution is a normal mixture or a Cauchy distribution. Additionally, when the Bayesian mixture approach is used, the error distribution, and consequently also the hazard and survival functions, are reproduced closely, which is not always the case when the Bayesian normal model is used. See Figures B.4 – B.9.
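For concreteness, the sketch below generates one ‘true’ data set from the design described above (normal errors only, and without the interval censoring step). The minimum-type extreme-value covariate is obtained as the location minus a standard Gumbel draw, which is one common parameterization and should be treated as an assumption; all names are illustrative.

import numpy as np

def simulate_dataset(N=50, n_i=5, beta=0.4, gamma=-0.8, seed=1):
    """One uncensored 'true' data set of the simulation study in Section 8.6
    (normal errors shown; the study also used Cauchy, t2, extreme value and
    normal mixture errors)."""
    rng = np.random.default_rng(seed)
    D = np.array([[0.5 ** 2, 0.4 * 0.5 * 0.1],
                  [0.4 * 0.5 * 0.1, 0.1 ** 2]])            # corr(b1, b2) = 0.4
    b = rng.multivariate_normal([0.0, gamma], D, size=N)   # (b_{i,1}, b_{i,2})
    rows = []
    for i in range(N):
        x = 8.5 - rng.gumbel(0.0, 1.0, size=n_i)           # min-extreme value, loc 8.5
        z = rng.binomial(1, 0.4, size=n_i)
        eps = rng.normal(0.0, 1.0, size=n_i)
        logT = 1.5 + beta * x + b[i, 0] + b[i, 1] * z + eps
        rows.append(np.column_stack([np.full(n_i, i), x, z, np.exp(logT)]))
    return np.vstack(rows)   # columns: cluster id, x, z, event time T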
8.7 Example: Signal Tandmobiel® study – clustered interval-censored data

In Section 7.7 we analyzed the emergence times of teeth 14 and 15 separately. In this section, we extend this analysis by including all permanent premolars, i.e. teeth 14, 15, 24, 25, 34, 35, 44 and 45 in Figure 1.1, and additionally all eight teeth will be analyzed jointly. This allows us to answer not only the question of what the impact of the different covariates on the emergence time is, but also the question concerning the relationship between the emergence times of different teeth.

A random sample of 500 boys and 500 girls will be used for the inference. The response variable T_{i,l}, i = 1, …, 1 000, l = 1, …, 8 refers to the age of emergence of the lth permanent premolar of the ith child. As indicated in Sections 1.1 and 7.7, the response variable is interval-censored with intervals of length approximately equal to 1 year. For the reasons stated in Section 7.7 we shifted the time origin of the AFT model to 5 years of age, i.e. we replaced T_{i,l} by T_{i,l} − 5 in the model (8.1). Further, Leroy et al. (2003b) have shown that there is horizontal symmetry with respect to emergence, i.e. the same emergence distribution can be assumed at horizontally symmetric positions (e.g., for teeth 14 and 24). In model (8.1), this leads to the random effect vector b_i = (b_{i,1}, …, b_{i,4})′ with z_{i,l} = (1, man4_{i,l}, max5_{i,l}, man5_{i,l})′, where man4_{i,l}, max5_{i,l} and man5_{i,l} are dummies for the mandibular first premolars (teeth 34, 44), the maxillary second premolars (teeth 15, 25) and the mandibular second premolars (teeth 35, 45), respectively. With this model specification, apart from the random variation given by the error term ε_{i,l}, the terms

    b*_{i,max4} = b_{i,1},    b*_{i,man4} = b_{i,1} + b_{i,2},    b*_{i,max5} = b_{i,1} + b_{i,3},    b*_{i,man5} = b_{i,1} + b_{i,4}

determine how the log-emergence times of a pair of horizontally symmetric teeth of a single child differ from the population average. As fixed effects we used gender ≡ girl, dmf, the interaction between gender and dmf, and all two-way interaction terms between gender, dmf and the dummies for the pairs of horizontally symmetric teeth, i.e.

    x_{i,l} = (gender_i, dmf_{i,l}, gender_i ∗ dmf_{i,l}, gender_i ∗ man4_{i,l}, gender_i ∗ max5_{i,l}, gender_i ∗ man5_{i,l}, dmf_{i,l} ∗ man4_{i,l}, dmf_{i,l} ∗ max5_{i,l}, dmf_{i,l} ∗ man5_{i,l})′.

See Section 7.7 for the definition of the covariate dmf. For the inference we sampled two chains, each of length 20 000 with 1:3 thinning, which took about 27 hours on a Pentium IV 2 GHz PC with 512 MB RAM. The first 1 500 iterations of each chain were discarded. Convergence was evaluated by the method of Gelman and Rubin (1992).

Table 8.1: Signal Tandmobiel® study. Posterior medians and 95% equal-tail credible intervals for the effect of the different covariates and the error variance.
                    Maxilla 4                             Maxilla 5
Effect        Post. median   95% CI                 Post. median   95% CI
intercept      1.7566   (1.7338, 1.7822)             1.9001   (1.8729, 1.9283)
gender        −0.0680   (−0.1003, −0.0368)          −0.0504   (−0.0844, −0.0163)
dmf           −0.0457   (−0.0631, −0.0284)          −0.0317   (−0.0500, −0.0135)

                    Mandible 4                            Mandible 5
Effect        Post. median   95% CI                 Post. median   95% CI
intercept      1.7242   (1.7019, 1.7484)             1.9060   (1.8805, 1.9323)
gender        −0.0668   (−0.0972, −0.0375)          −0.0654   (−0.0965, −0.0323)
dmf           −0.0201   (−0.0378, −0.0032)          −0.0090   (−0.0283, 0.0098)

                    All teeth
Effect               Post. median   95% CI
gender ∗ dmf          0.0105   (−0.0073, 0.0279)
log(scale) log(σ)    −2.2580   (−2.3111, −2.1721)
error scale σ         0.1046   (0.0992, 0.1139)

8.7.1 Prior distribution

The initial maximum-likelihood AFT model, fitted for each tooth separately with a normal error distribution and without random effects, estimated the intercept as 1.8 and the scale as 0.25. Following the suggestions of Section 8.2.3, we used the following values of the hyperparameters: ξ = 1.8, κ = (3 · 0.25)², ζ = 2, h_1 = 0.2, h_2 = 0.1, δ = 1. For the number of mixture components, K, a truncated Poisson prior with λ = 5, reflecting our prior belief that the error distribution is skewed, and K_max = 30 was used. All β and γ parameters were assigned a spread N(0, 100) prior. For the covariance matrix D of the random effects we used an inverse-Wishart prior with df = 4; since 1 000 clusters are involved in the data set, even a higher value could be used with a negligible impact on the results. The prior scale matrix S was equal to diag(0.002) (corresponding to an inverse-gamma(df, 0.001) prior in the univariate case).

Table 8.2: Signal Tandmobiel® study. Posterior medians, 95% equal-tail credible intervals and Bayesian two-sided p-values for the effect of dmf > 0 for the two genders and the different teeth.

Tooth         Gender   Post. median   95% CI                p-value
Maxilla 4     Girl     −0.0352        (−0.0522, −0.0185)    < 0.001
              Boy      −0.0457        (−0.0631, −0.0284)    < 0.001
Maxilla 5     Girl     −0.0212        (−0.0390, −0.0035)      0.019
              Boy      −0.0317        (−0.0500, −0.0135)    < 0.001
Mandible 4    Girl     −0.0098        (−0.0267, 0.0070)       0.255
              Boy      −0.0201        (−0.0378, −0.0032)      0.021
Mandible 5    Girl      0.0015        (−0.0162, 0.0193)       0.870
              Boy      −0.0090        (−0.0283, 0.0098)       0.353

8.7.2 Results for the regression and error parameters

The effect of the different covariates on emergence, separately for each tooth, is given in Table 8.1. The results in Table 8.1 were obtained as MCMC summaries of the appropriate combinations of the model parameters. For example, the intercept effect for the maxillary teeth 4 equals the error mean α = ∑_{j=1}^{K} w_j µ_j. For the maxillary teeth 5, the intercept effect equals α + γ(max5), where γ(max5) is the mean of the random effect b_{i,3}. The intercept effects for the remaining teeth are defined in an analogous manner. The effect of gender in Table 8.1 is defined as β(gender) for the maxillary teeth 4, as β(gender) + β(gender ∗ max5) for the maxillary teeth 5, and analogously for the remaining teeth. Finally, the effect of dmf is given by β(dmf) for the maxillary teeth 4, by β(dmf) + β(dmf ∗ max5) for the maxillary teeth 5, and analogously for the remaining teeth. The error scale refers to the summary of the standard deviation σ of the error distribution, i.e. σ = √( ∑_{j=1}^{K} w_j (µ_j² + σ_j²) − α² ), and the row labeled log(scale) refers to the summary of log(σ).
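These derived error summaries are computed at every MCMC iteration and then summarized by posterior medians; a minimal sketch of the two formulas just given:

import numpy as np

def error_moments(w, mu, sigma2):
    """Derived error summaries used in Table 8.1: the error mean
    alpha = sum_j w_j mu_j and the error scale
    sigma = sqrt(sum_j w_j (mu_j^2 + sigma_j^2) - alpha^2)."""
    w, mu, sigma2 = map(np.asarray, (w, mu, sigma2))
    alpha = np.sum(w * mu)
    sigma = np.sqrt(np.sum(w * (mu ** 2 + sigma2)) - alpha ** 2)
    return alpha, sigma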
Most of the quantities in Table 8.1 are comparable to the results of the earlier analysis (see Section 7.7) given in Table 7.5. Remember, however, that in Section 7.7 we analyzed separately only one maxillary tooth 4 (tooth 14) and one maxillary tooth 5 (tooth 15). Furthermore, in contrast to the present analysis, in Section 7.7 we allowed the error variance to depend on the covariates.

In this analysis, the main interest lies in the effect of dmf on emergence. This can be evaluated from Table 8.2, which shows posterior summary statistics for the effect of dmf (appropriate linear combinations of the β parameters) for boys and girls and the four pairs of horizontally symmetric teeth. It is seen that caries on the primary predecessor significantly accelerates the emergence of the permanent successor in the case of the maxillary teeth. For the mandibular teeth, a slight effect is observed only for the first premolar in boys. Additionally, besides the effect of dmf, the emergence process of girls is ahead of that of boys.

8.7.3 Inter-teeth relationship

Further, Table 8.3 shows posterior summary statistics for the standard deviations and correlations of the above-defined tooth-specific linear combinations b*_{i,max4}, b*_{i,man4}, b*_{i,max5}, b*_{i,man5} of the random effects b_{i,1}, …, b_{i,4}. It shows how important the child effect is and how strongly the different teeth in one mouth are correlated. The posterior medians of all standard deviations in Table 8.3 are all about 0.2, which is approximately two times higher than the posterior median of the standard deviation of the error distribution, which was equal to 0.1. The posterior medians of all correlation parameters lie between 0.79 and 0.91.

Table 8.3: Signal Tandmobiel® study. Posterior medians and 95% equal-tail credible intervals for the variances and correlations of the tooth-specific linear combinations of the random effects.

Parameter                       Posterior median   95% CI
sd(b*_{i,max4})                 0.204              (0.192, 0.218)
sd(b*_{i,man4})                 0.198              (0.186, 0.211)
sd(b*_{i,max5})                 0.205              (0.190, 0.221)
sd(b*_{i,man5})                 0.202              (0.187, 0.218)
corr(b*_{max4}, b*_{man4})      0.887              (0.856, 0.914)
corr(b*_{max4}, b*_{max5})      0.914              (0.887, 0.938)
corr(b*_{max4}, b*_{man5})      0.842              (0.804, 0.874)
corr(b*_{man4}, b*_{max5})      0.793              (0.749, 0.832)
corr(b*_{man4}, b*_{man5})      0.895              (0.864, 0.923)
corr(b*_{max5}, b*_{man5})      0.847              (0.810, 0.880)

8.7.4 Predictive emergence and hazard curves

Predictive emergence curves (predictive cumulative distribution functions) computed using the approach described in Section 8.4.1 are shown in Figure 8.2.

[Figure 8.2: Signal Tandmobiel® study. Posterior predictive emergence curves (proportion emerged versus age in years) for girls and boys and each pair of horizontally symmetric teeth. Solid line: dmf = 1, dotted-dashed line: dmf = 0.]
In agreement with the results discussed in Section 8.7.2, an almost negligible difference is observed between the predictive emergence curves for dmf > 0 and dmf = 0 for the mandibular teeth. The same is true for the predictive hazard functions of emergence shown in Figure 8.3. As expected (see Section 7.7.2 for the reasons why), the predictive hazard functions are all increasing.

[Figure 8.3: Signal Tandmobiel® study. Posterior predictive hazard curves (hazard versus age in years) for girls and boys and each pair of horizontally symmetric teeth. Solid line: dmf = 1, dotted-dashed line: dmf = 0.]

8.7.5 Predictive error density

In our sample, the number of mixture components K ranged from 2 to 24, while the mixtures with K ∈ {6, 7, 8} each occupied more than 10% of the sample, with the highest frequency for K = 7 (11.2%). Mixtures with K ≥ 17 each took less than 1.5% of the sample. Apparently, the model did not suffer from the technical restriction given by K_max = 30. Figure 8.4 further shows both the overall estimate of the predictive error density (8.21) and the conditional (given K) estimates of the predictive error density (8.20). It is seen that the estimates for the mixtures with the most frequent numbers of components are all almost the same.

[Figure 8.4: Signal Tandmobiel® study. Posterior predictive error density g_ε(e): unconditional estimate and conditional estimates given K = 4, …, 10.]

8.7.6 Conclusions

This section showed an analysis of clustered data where, moreover, a closer dependence between some observations within a cluster could be assumed. Since in Section 7.7 we showed, in a similar analysis of the same data set, that the error variance might depend on covariates, the model presented in this section might be improved by allowing the variances of the mixture components determining the error distribution to depend on covariates as well. However, in the current mixture setting such an extension is not trivial and requires further research.

8.8 Example: CGD data – recurrent events analysis

The chronic granulomatous disease (CGD) trial was introduced in Section 1.2. The response variable T_{i,l} is the time to the lth (recurrent) infection of the ith patient, i = 1, …, 128, l = 1, …, n_i, 1 ≤ n_i ≤ 8, so that a patient represents a cluster and the infection times the individual observations. The problem of recurrent events in this data set has been discussed by several authors in the literature. Among others, Therneau and Hamilton (1997) used
the CGD data to illustrate several approaches to recurrent event analysis based on Cox's PH model. Vaida and Xu (2000) used this dataset to illustrate the PH model with random effects. They specify the hazard function of the (i, l)th event as

    ℏ(t | x_{i,l}, z_{i,l}, b_i) = ℏ_0(t) exp(β′x_{i,l} + b_i′z_{i,l}),

where ℏ_0 is a baseline hazard function, β the vector of regression parameters for the ‘fixed’ effects, x a covariate vector of ‘fixed’ effects, b_i a random effect vector and z_{i,l} the corresponding covariates; see also Section 3.4.1. They use a normal distribution for b_i. In this section we present an analysis of the CGD data using the Bayesian CS normal mixture AFT model, which can be considered as an AFT counterpart of the random-effects PH model of Vaida and Xu (2000). In the model formula (8.1) a univariate random effect b_i is used with z_{i,l} ≡ 1. As fixed-effects covariates we used the same covariates as Vaida and Xu (2000), namely

    x_{i,l} = (trtmt_i, inher_i, age_i, cortic_i, prophy_i, gender_i, hcatUSother_i, hcatEUAmster_i, hcatEUother_i)′,

where trtmt equals 1 for the gamma interferon group and 0 for the placebo group, inher equals 1 for patients with the autosomal recessive pattern and 0 for patients with the X-linked pattern of inheritance, age is the age of the patient in years, cortic equals 1 if corticosteroids are used and 0 otherwise, prophy equals 1 if prophylactic antibiotics are used and 0 otherwise, gender equals 1 for females and 0 for males, and finally hcatUSother, hcatEUAmster and hcatEUother are dummies for the hospital categories US–other, EU–Amsterdam and EU–other, respectively.

For the inference we sampled two chains, each of length 60 000 with 1:6 thinning, which took about 5 minutes on a Pentium IV 2 GHz PC with 512 MB RAM. The first 30 000 iterations of each chain were discarded. Convergence was evaluated by a critical examination of the trace and autocorrelation plots and using the method of Gelman and Rubin (1992).

8.8.1 Prior distribution

The initial maximum-likelihood AFT model with a normal error distribution and without random effects gave an estimate of the intercept equal to 3.66 and a scale equal to 1.69. Following the suggestions made in Section 8.2.3, we used the following values of the hyperparameters: ξ = 3.66, κ = 25 ≈ (3 · 1.69)², ζ = 2, h_1 = 0.2, h_2 = 0.1, δ = 1. For the number of mixture components, K, a truncated Poisson prior with λ = 5, reflecting our prior belief that the error distribution is skewed, and K_max = 30 was used. The prior means of all regression parameters were equal to 0 and their prior variances to 1 000. For the variance d of the random effect we tried either an inverse-gamma I-Gamma(0.001, 0.001) prior (df = 0.002, s = 0.002 in terms of the inverse-Wishart distribution) or a uniform Unif(0, √s) prior on √d with √s = 100, 50, 10.

[Figure 8.5: CGD data. Scaled histograms of the sampled standard deviations of the random effect b_i under the different prior distributions: d ∼ I-Gamma(0.001, 0.001), √d ∼ Unif(0, 100), √d ∼ Unif(0, 50) and √d ∼ Unif(0, 10).]
As discussed in Gelman (2006, Sections 2.2 and 4.3), with the I-Gamma(ǫ, ǫ) prior the inference might become very sensitive to the choice of ǫ. This is not the case for the uniform distribution on √d, where the choice of the range of the uniform distribution has practically no impact on the results (provided the upper limit of the uniform distribution is not chosen too small). In Figure 8.5 we show scaled histograms of the sampled values of √d for the above-mentioned prior distributions. It is seen that the inverse-gamma prior leads to a high posterior probability mass close to zero, a phenomenon driven by the prior distribution, which has a peak close to zero. On the other hand, with the uniform prior on √d the posterior distribution is clearly separated from zero, with the region of support obviously driven by the data. Moreover, in agreement with the findings of Gelman (2006), the posterior distribution is practically the same irrespective of the choice of the range of the uniform prior. The results presented below are based on the Unif(0, 100) prior for √d (practically the same results were obtained with the remaining uniform priors on √d).

8.8.2 Effect of covariates on the time to infection

Table 8.4 shows posterior summary statistics for the effect of the included covariates on the distribution of the time to infection. The reported Bayesian p-value is simultaneous in the case of the covariate hospital category.

Table 8.4: CGD data. Posterior medians, 95% equal-tail credible intervals and Bayesian two-sided (simultaneous) p-values for the effect of covariates.

Parameter                                      Posterior median   95% CI              p-value
Treatment group: gamma interferon               1.273             (0.437, 2.195)      p = 0.001
Pattern of inheritance: autosomal recessive    −0.914             (−1.829, 0.071)     p = 0.067
Age                                             0.047             (0.007, 0.092)      p = 0.022
Use of corticosteroids: yes                    −2.767             (−5.727, −0.161)    p = 0.038
Use of prophylactic antibiotics: yes            1.191             (0.150, 2.330)      p = 0.023
Gender: female                                  1.476             (0.050, 3.111)      p = 0.042
Hospital category                                                                     p = 0.065
  US – other                                    0.461             (−0.481, 1.451)
  Europe – Amsterdam                            1.729             (0.183, 3.377)
  Europe – other                                1.268             (0.017, 2.637)

Table 8.5: CGD data. Posterior medians and 95% equal-tail credible intervals for the moments of the error distribution and the standard deviation of the random effects.

Parameter                                    Posterior median   95% CI
Moments of the error distribution
  Intercept α                                4.088              (2.532, 5.527)
  Error scale σ                              2.495              (1.399, 4.083)
Standard deviation of the random effects
  sd(b_i)                                    0.748              (0.183, 1.395)

It is seen that the effect of gamma interferon is highly significant, increasing the time to infection by a factor of exp(1.273) = 3.57. The effect of the pattern of inheritance is slightly non-significant at the conventional 5% level. On the other hand, an increase of age by 1 year significantly increases the infection-free time, by a factor of exp(0.047) = 1.05. Further, the use of corticosteroids should be avoided as it significantly decreases the infection-free time, by a factor of exp(−2.767) = 0.06, whereas the use of prophylactic antibiotics significantly increases the infection-free time, by a factor of exp(1.191) = 3.29. The infection-free time is also significantly higher for females, being exp(1.476) = 4.38 times higher than for males. Finally, the effect of the hospital category is slightly non-significant;
however, the posterior median suggests that the best results are obtained in the hospitals of category Europe – Amsterdam, whereas the worst results are obtained in the hospital category US – NIH. Although the parameters of the AFT model are not directly comparable to the parameters of the PH model, we can at least compare the direction of the relationships obtained here with those obtained by Vaida and Xu (2000), who used the PH model. Care must be taken as Vaida and Xu (2000) use a different 0–1 coding of the dichotomous variables than we do. However, we conclude that the directions of the relationships between the covariates and the time to infection found by the AFT model are the same as those found using the PH model. The effect of the treatment (gamma interferon) is also seen in Figure 8.6, where we plot predictive survival and hazard curves for males and females taking either gamma interferon or placebo. The remaining covariates were fixed either to their mean or to their most common value.

[Figure 8.6: CGD data. Predictive survival (upper panel) and hazard (lower panel) curves for males and females taking either treatment or placebo. The remaining covariates were fixed either to the mean value (age = 14.6) or to the most common value (X-linked pattern of inheritance, no use of corticosteroids, use of prophylactic antibiotics, and hospital category US–other).]

8.8.3 Predictive error density and variability of random effects

Posterior summary statistics for the moments of the error distribution, computed in the same way as indicated in Section 8.7.2, and for the standard deviation of the random effects are given in Table 8.5. The estimate of the error density is given in Figure 8.7. In this case, mixtures with a high number of components were also quite strongly represented in the sample. For clarity, the conditional estimates of the error density (given K) are plotted only for selected values of K. A higher number of components is needed, firstly, because of the clear skewness of the error density and, secondly, because of the somewhat higher probability mass in the right tail of the density.

[Figure 8.7: CGD data. Posterior predictive error density g_ε(e): unconditional estimate and conditional estimates given K = 1, 5, 10, 15, 20, 25, 30.]

8.8.4 Estimates of individual random effects

Figure 8.8 shows posterior means and 95% equal-tail posterior credible intervals for the values of the individual random effects b_i, i = 1, …, 128. For the purpose of plotting, the patients were sorted according to the number of records they have in the data set. Since there are no big differences in the follow-up times of the different patients, fewer records in the data set generally imply longer infection-free periods.

[Figure 8.8: CGD data. Posterior means and 95% equal-tail credible intervals for the individual random effects. Patients are sorted according to the number of records in the data set (1 record per patient, 2 records, 3 records, ≥ 4 records).]
Indeed, for the patients with only one recorded infection time practically all estimated individual random effects lie above zero, the mean of b_i. Furthermore, a decreasing trend in the estimated individual random effects can be observed as the number of recorded infection times increases.

8.8.5 Conclusions

In this section we have shown how the Bayesian normal mixture CS AFT model can be used to analyse recurrent event data. It might be useful to include the covariate number of infections in the model. However, such a covariate would be time-dependent, and it is not possible to include covariates of this type in any model where the (baseline) survival distribution is modelled via the density and not via the hazard function.

8.9 Example: EBCP data – multicenter study

In Section 1.4 we introduced a multicenter randomized clinical trial aiming to evaluate the effect of perioperative chemotherapy, given in addition to surgery, on the progression-free survival (PFS) time compared to surgery alone for early breast cancer patients, while controlling for several baseline covariates. In Figure 1.3 we indicated that there possibly exists heterogeneity between the centers with respect to the PFS distribution. Additionally, there is some evidence of heterogeneity with respect to the treatment effect. In this section, we perform an analysis using the Bayesian normal mixture cluster-specific AFT model that addresses all these issues.

The cluster is represented by the center, i.e. i = 1, …, 14; in the ith center, n_i patients were involved in the trial, with 25 ≤ n_i ≤ 902. As the response T_{i,l}, i = 1, …, 14, l = 1, …, n_i we use the PFS time in days of the lth patient treated in the ith center. To allow for baseline heterogeneity across the centers and also for heterogeneity with respect to the treatment effect, we include a bivariate random effect b_i = (b_{i,1}, b_{i,2})′ in the CS AFT model (8.1). The covariate vector z_{i,l} for the random effects has the form z_{i,l} = (1, trtmtGroup_{i,l})′, where trtmtGroup_{i,l} equals one if the (i, l)th patient underwent surgery alone and zero if she additionally received the course of perioperative chemotherapy. Additionally, as fixed effects we include all baseline factors mentioned in Section 1.4 in the model. Namely, the covariate vector x_{i,l} in the model (8.1) equals

    x_{i,l} = (ageMid_{i,l}, ageOld_{i,l}, tySu_{i,l}, tumSiz_{i,l}, nodSt_{i,l}, otDis_{i,l}, regionNL_i, regionPL_i, regionSE_i, regionSA_i)′,

where ageMid and ageOld are dummies for the age groups 40–50 years and older than 50 years, respectively, with the group younger than 40 years as the baseline; tySu equals 1 for breast-conserving surgery and 0 for mastectomy; tumSiz equals 1 for tumors of size ≥ 2 cm and 0 for tumors of size < 2 cm; nodSt equals 1 for a positive and 0 for a negative pathological nodal status; and otDis equals 1 if another disease was present and 0 otherwise. Finally, the covariates regionNL, regionPL, regionSE and regionSA are dummies for the geographical location of the center, with France as the baseline.

Since the covariate region is categorical and center-specific, it should be possible to reveal, at least partially, the regional structure of the centers from the estimates of their individual random effects b_{i,1}, i = 1, …, 14 when we omit the covariate region from the model.
To show this, we additionally fitted a model in which all dummies for the region were omitted from the covariate vector x (the model without region). For the inference we sampled two chains, each of length 200 000 with 1:5 thinning, which took about 32 hours on a Pentium IV 2 GHz PC with 512 MB RAM. The first 150 000 iterations of each chain were discarded. Convergence was evaluated by a critical examination of the trace and autocorrelation plots and using the method of Gelman and Rubin (1992).

8.9.1 Prior distribution

The initial maximum-likelihood AFT model without random effects gave an estimate of the intercept equal to 9.43 and an estimate of the error scale equal to 1.73. As the prior mean of the mixture components, ξ, we have taken zero to show that the posterior for the mixture means manages to shift away from a slightly misspecified location. To set up the remaining hyperparameters we followed closely the guidelines given in Section 8.2.3, namely κ = 40, which is slightly higher than (3 · 1.73)², ζ = 2, h_1 = 0.2, h_2 = 0.1, δ = 1. For the number of mixture components, K, we used a truncated Poisson prior with λ = 5 and K_max = 30. Both γ_2 (the mean of the random effects b_{i,2}) and all β regression parameters were assigned a spread N(0, 100) prior. The covariance matrix D of the random effects was given an inverse-Wishart prior with df = 2 and S = diag(0.002).

8.9.2 Effect of covariates on PFS time

The effect of the considered covariates on the progression-free survival time, in both the model with and the model without the covariate region, can be evaluated from Table 8.6, where we report posterior medians, 95% equal-tail credible intervals and Bayesian p-values (simultaneous for categorical covariates with more than 2 levels) for the β and γ parameters. It is seen that the results for the model with region included are almost the same as those for the model with region excluded. This is in agreement with the general property of the AFT model, mentioned in Section 3.3, that the regression parameters of the included covariates do not change when an important factor is omitted from the model. If we base our conclusions on the model with region included, we see that, after adjustment for the remaining covariates, surgery alone decreases the time to cancer progression by a factor of exp(−0.173) = 0.84 compared to surgery given together with perioperative chemotherapy. However, the difference is not significant at the conventional 5% level.

Table 8.6: Early breast cancer patients data. Posterior medians, 95% equal-tail credible intervals and Bayesian two-sided (simultaneous) p-values for the effect of covariates.
Further, the prognosis for cancer progression is most optimistic in the middle age group of 40–50 years, where the time to progression of the disease is increased by a factor of exp(0.417) = 1.52 compared to the youngest group (< 40 years). In the oldest group (> 50 years) the time to disease progression is still increased, by a factor of exp(0.260) = 1.30, compared to the youngest group. The estimates for the effect of age thus suggest a non-linear relationship between age and the log-progression-free survival time.

The effect of the type of surgery on disease progression is just above the 5% significance level when basing the inference on the model with region. However, the posterior median of the β parameter for this covariate suggests that breast-conserving surgery increases the time to cancer progression by a factor of exp(0.174) = 1.20 compared to mastectomy.

The effects of the remaining patient-specific covariates are highly significant and in the direction expected from the clinical point of view. Namely, a tumor of size ≥ 2 cm decreases the time to cancer progression by a factor of exp(−0.494) = 0.61 compared to smaller tumors of size < 2 cm. A positive pathological nodal status drastically decreases the time to cancer progression, by a factor of exp(−0.653) = 0.52, compared to a negative result. The presence of another related disease decreases the PFS time by a factor of exp(−0.385) = 0.68. Finally, a significant effect of the geographical region on the PFS time is seen. The best performing region is found to be Poland, followed by France, South Europe and the Netherlands; the region which performs worst is South Africa. The relatively small effect of perioperative chemotherapy compared to surgery alone is also seen from the posterior predictive survival curves shown in Figure 8.9, drawn for region = France and two typical combinations of covariates.

Figure 8.9: Early breast cancer patients data. Predictive survival curves based on the model with region for region = France and two typical combinations of covariates: (1) breast-conserving surgery, tumor size ≥ 2 cm, negative nodal status and no other associated disease (9.79% of the sample); (2) mastectomy, tumor size ≥ 2 cm, positive nodal status and no other associated disease (13.88% of the sample). (Two panels: survival against time in days, for surgery + chemotherapy and surgery alone.)

8.9.3 Predictive error density and variance components of random effects

Posterior summary statistics for the moments of the error distribution and the variance components of the random effects are given in Table 8.7.

Table 8.7: Early breast cancer patients data. Posterior medians and 95% equal-tail credible intervals for the moments of the error distribution and the variance components of the random effects.

                               Model with region                Model without region
Parameter                 Poster. median   95% CI          Poster. median   95% CI
Moments of the error distribution
  Intercept α                9.453   (8.983, 9.853)           9.229   (8.822, 9.796)
  Error scale σ              1.741   (1.600, 1.859)           1.749   (1.597, 2.376)
Variance components of the random effects
  sd(b_{i,1})                0.126   (0.026, 0.392)           0.348   (0.192, 0.616)
  sd(b_{i,2})                0.060   (0.020, 0.228)           0.085   (0.023, 0.275)
  corr(b_{i,1}, b_{i,2})    −0.071   (−0.988, 0.973)         −0.842   (−0.995, 0.978)
The moments of the error distribution are computed in the same way as indicated in Section 8.7.2. It is seen that, although there is heterogeneity between centers, the within-center variability, given by the variance of the error distribution, is much higher than the between-center variability, given by the variance of the random effects. Furthermore, as expected, the variability of the random intercept term $b_{i,1}$ increased considerably when we omitted the covariate region.

According to the posterior median there is a very low negative correlation between the overall center level and the treatment × center interaction in the model with region, and a relatively high negative correlation in the model with region excluded. However, in both cases the 95% equal-tail credible interval covers almost the whole range (−1, 1) of possible values for ρ, forcing us to conclude that almost nothing can be said about the random-effects correlation ρ, probably because effectively only a sample of size 14 is available to estimate this correlation. The reason for the quite large difference in the posterior median of ρ between the two models can be found in Figure 8.10, where we show scaled histograms of the sampled values of ρ, i.e. estimates of the posterior density of ρ. The posterior density has, in both cases, a 'U' shape, while putting somewhat more mass on negative values in the case of the model without region.

Figure 8.10: Early breast cancer patients data. Scaled histograms for sampled corr(b_{i,1}, b_{i,2}) in the models with and without region.

In the sample, mostly error densities with a low number of mixture components were present. Namely, in the model with region, 90.68% of the sample was formed by a one-component density, 7.07% by a two-component mixture and 1.50% by a three-component mixture; mixtures with more than three components were all together represented in only 0.75% of the sample. In the model with the covariate region omitted the proportion of densities with at least two components quite logically increased: a one-component density is now represented in only 74.40% of the sample, two-component mixtures in 22.13% and three-component mixtures in 1.73%. Mixtures with more than three components are still quite rare, being all together represented in only 1.74% of the sample.
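As an aside, both quantities discussed above (the posterior distribution of the number of mixture components K and the random-effects correlation ρ) are simple functions of the MCMC output. The sketch below is a hypothetical post-processing helper of our own, not part of the software used in the thesis; it assumes the sampled K values and the sampled 2×2 covariance matrices D are available as arrays.

```python
import numpy as np

def mixture_size_table(K_chain):
    """Posterior percentages of the number of mixture components K."""
    values, counts = np.unique(np.asarray(K_chain, dtype=int), return_counts=True)
    return dict(zip(values.tolist(), np.round(100 * counts / counts.sum(), 2)))

def correlation_summary(D_chain):
    """Posterior median and 95% equal-tail CI of rho = corr(b_{i,1}, b_{i,2})
    computed from sampled 2x2 random-effect covariance matrices D."""
    D = np.asarray(D_chain)                               # shape (M, 2, 2)
    rho = D[:, 0, 1] / np.sqrt(D[:, 0, 0] * D[:, 1, 1])
    return float(np.median(rho)), tuple(np.quantile(rho, [0.025, 0.975]))
```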
The estimates of the error density (both unconditionally and conditionally on the number of mixture components) are given in Figure 8.11.

Figure 8.11: Early breast cancer patients data. Posterior predictive error densities g_ε(e), unconditional and conditional on the number of mixture components (K = 1, 2, 3), for the models with and without region.

8.9.4 Estimates of individual random effects

Estimates of the individual random effects, which could serve to discriminate the centers, are given in Figure 8.12. To be able to compare the models with and without the covariate region directly, the plots related to the random intercept $b_{i,1}$ also take into account the overall intercept α (the mean of the error distribution) and, in the case of the model with region, the appropriate main effect of region (β(regionNL), β(regionPL), β(regionSE) and β(regionSA), respectively). It is seen that the estimates of the individual random intercepts in the model without region managed quite nicely to capture the region effect as well, of course at the price of a decreased precision of the estimates.

Figure 8.12: Early breast cancer patients data. Posterior means and 95% equal-tail credible intervals for the individual random effects (intercept and treatment), by institution. Random intercepts are further shifted by the overall intercept α and, in the model with region, also by the corresponding region main effect β(region).

8.9.5 Conclusions

In this section, we have shown an analysis of a typical multicenter clinical trial with heterogeneity with respect to the overall center effect as well as the center × treatment interaction. Among other things, we have further shown how the center-specific random effects may capture the effect of an omitted center-specific covariate.

8.10 Discussion

In this chapter, we have proposed a Bayesian cluster-specific accelerated failure time model whose error distribution is modelled in a flexible way as a finite normal mixture. An advantage of the full Bayesian approach is that a general random effect vector can easily be included in the model. Subsequently, the effect of covariates can be evaluated jointly with the association among clustered responses. Further, interval-, right- or left-censored data are easy to handle and, finally, the MCMC sampling-based implementation of the model offers a straightforward way to obtain credible intervals for the model parameters as well as predictive survival or hazard curves.

Observe that the Bayesian approach is used here mainly for technical convenience. Indeed, in practice likelihood (8.3) is hardly tractable using the maximum-likelihood method. On the other hand, Bayesian estimation using MCMC does not pose any real difficulties. Further, since all our prior distributions are non-informative (or close to non-informative, cf. the variance parameters) and we use (on a proper scale) more or less posterior modes as point estimates, classical maximum-likelihood estimation would lead to almost the same results.

The proposed methodology aims to contribute to the area of semi-parametric modelling of correlated and at the same time interval-censored data. Furthermore, our approach allows us to bring structure into the dependencies between observations in one cluster.
For instance, in multicenter studies, the vector z i,l = (1, treatmenti,l )′ in the model formula (8.1) allows to consider not only the random center effect but also a random center-by-treatment interaction which can sometimes be substantial. Unfortunately, our approach cannot handle time-dependent covariates. However, the same is true for any model where the distribution of the response is specified by the density and not by the hazard function. To include also the time-dependent covariates, usually the Cox’s proportional hazards model is used. For example, Kooperberg and Clarkson (1997); Betensky et al. (1999); Goetghebeur and Ryan (2000) consider independent interval-censored data. Vaida and Xu (2000) offer an approach based on the proportional hazards linear mixed model with right-censored data. Finally, our approach can be quite easily extended along the lines presented in Chapters 9 and 10 to handle also doubly-interval-censored data, i.e. the data where the response is given as the difference of two interval-censored observations. Chapter 9 Bayesian Penalized Mixture Cluster-Specific AFT Model This chapter continues with the developments in the framework of the clusterspecific AFT model. However, to model unknown distributional shapes a penalized normal mixture introduced in Section 6.3 will be exploited instead of the classical normal mixture that was used in Chapter 8. Furthermore, we directly describe a model for doubly-interval-censored data although it can also be used with interval- or right-censored data. This approach, introduced by Komárek and Lesaffre (2006b), will allow us to analyze the caries times in the Signal Tandmobielr study. The cluster-specific AFT model for doubly-interval-censored data is specified in Section 9.1. In Section 9.2, we specify the prior distributions of all model parameters and derive their posterior distribution. Markov chain Monte Carlo methodology for the model of this chapter is described in Section 9.3. Estimation of the survival distribution and of the individual random effects is described in Sections 9.4 and 9.5, respectively. Results of the simulation study aiming to evaluate the performance of the proposed method are shown in Section 9.6. Section 9.7 presents the analysis of doubly-intervalcensored caries times of the four permanent first molars. The analysis of the breast cancer multicenter study is given in Section 9.8. Discussion finalizes the chapter in Section 9.9. 155 156 9.1 CHAPTER 9. BAYESIAN PENALIZED MIXTURE CS AFT MODEL Model P Let N i=1 ni observational units be divided into N clusters, the ith one of size ni . Let Ui,l and Vi,l , i = 1, . . . , N, l = 1, . . . , ni denote the true chronological onset and failure time, respectively and Ti,l = Vi,l − Ui,l the true event time. With doubly interval censoring, it is only known that Ui,l occurred U L U within an interval of time ⌊uL i,l , ui,l ⌋, where ui,l ≤ ui,l . Similarly, the failL , v U ⌋, with v L ≤ v U , ure time Vi,l is only known to lie in an interval ⌊vi,l i,l i,l i,l i = 1, . . . , N, l = 1, . . . , ni . As in the whole thesis, it is assumed that observed intervals result from an independent noninformative censoring process (see Section 2.4). Further, as indicated in Section 4.1.2, we will assume that, given the model parameters, the true event time Ti,l is independent of the true onset time Ui,l for all i and l. Below, we discuss this issue further. 
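To fix ideas, one observational unit of the scheme just described can be stored as a small record of the four interval endpoints. The helper below is our own illustration (not part of the thesis software); it also returns the bounds implied for the event time T = V − U.

```python
from typing import NamedTuple

class DoublyIntervalCensored(NamedTuple):
    """One unit: onset U known to lie in [uL, uU], failure V in [vL, vU]."""
    uL: float
    uU: float
    vL: float
    vU: float

    def event_time_bounds(self):
        """Interval implied for T = V - U (truncated below at zero)."""
        return max(self.vL - self.uU, 0.0), self.vU - self.uL

# e.g. emergence observed between ages 6.1 and 7.2, caries between 8.0 and 9.1:
lo, hi = DoublyIntervalCensored(6.1, 7.2, 8.0, 9.1).event_time_bounds()  # ~(0.8, 3.0)
```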
To account for possible dependencies between different individuals within a cluster, the cluster-specific random effects $\boldsymbol{d}_i = (d_{i,1}, \dots, d_{i,q_d})'$ and $\boldsymbol{b}_i = (b_{i,1}, \dots, b_{i,q_b})'$ are introduced and incorporated in the cluster-specific AFT model for doubly-interval-censored data:
$$\log(U_{i,l}) = \boldsymbol{\delta}'\boldsymbol{x}^u_{i,l} + \boldsymbol{d}_i'\boldsymbol{z}^u_{i,l} + \zeta_{i,l}, \qquad (9.1)$$
$$\log(V_{i,l} - U_{i,l}) = \log(T_{i,l}) = \boldsymbol{\beta}'\boldsymbol{x}^t_{i,l} + \boldsymbol{b}_i'\boldsymbol{z}^t_{i,l} + \varepsilon_{i,l}, \qquad (9.2)$$
$i = 1, \dots, N$, $l = 1, \dots, n_i$, where $\boldsymbol{\delta} = (\delta_1, \dots, \delta_{m_u})'$ and $\boldsymbol{\beta} = (\beta_1, \dots, \beta_{m_t})'$ are unknown regression parameter vectors, $\boldsymbol{z}^u_{i,l}$ is the covariate vector for the random effects influencing the distribution of the onset time, $\boldsymbol{z}^t_{i,l}$ the covariate vector for the random effects influencing the distribution of the event time and, similarly, $\boldsymbol{x}^u_{i,l}$ is the covariate vector for the fixed effects possibly having an impact on the onset time and $\boldsymbol{x}^t_{i,l}$ the covariate vector for the fixed effects possibly having an impact on the event time.

The error terms $\zeta_{i,l}$, $i = 1, \dots, N$, $l = 1, \dots, n_i$ are i.i.d. random variables with some density $g_\zeta(\zeta)$. Analogously, the error terms $\varepsilon_{i,l}$, $i = 1, \dots, N$, $l = 1, \dots, n_i$ are i.i.d. random variables with density $g_\varepsilon(\varepsilon)$. The random effects $\boldsymbol{d}_i$, $i = 1, \dots, N$ and $\boldsymbol{b}_i$, $i = 1, \dots, N$, respectively, are assumed to be i.i.d. with densities $g_d(\boldsymbol{d})$ and $g_b(\boldsymbol{b})$, respectively. Furthermore, we assume that $\varepsilon_{i_1,l_1}$, $\zeta_{i_2,l_2}$, $\boldsymbol{b}_{i_3}$ and $\boldsymbol{d}_{i_4}$ are independent for all $i_1, i_2, i_3, i_4$ and $l_1, l_2$. This assumption implies that, given the model parameters and the random effects $\boldsymbol{b}_i$ and $\boldsymbol{d}_i$, $U_{i,l}$ and $T_{i,l}$ are independent for each $i$ and $l$, and the vectors $\boldsymbol{U}_i = (U_{i,1}, \dots, U_{i,n_i})'$ and $\boldsymbol{T}_i = (T_{i,1}, \dots, T_{i,n_i})'$ are independent for each $i$. Furthermore, in the context of the Signal Tandmobiel® application (see Section 9.7), where $U_{i,l}$ and $T_{i,l}$ are the emergence time and the time to caries, respectively, of the $l$th tooth of the $i$th child, it also implies the following decomposition:
(a) whether a child is an early or a late emerger is independent of whether the child is more or less sensitive to caries (independence of $\boldsymbol{d}_i$ and $\boldsymbol{b}_i$);
(b) whether a specific tooth emerges early or late is independent of whether that tooth is more or less sensitive to caries (independence of $\zeta_{i,l}$ and $\varepsilon_{i,l}$).

9.1.1 Distributional assumptions

To finalize the specification of the measurement model we have to specify the densities $g_\zeta$, $g_\varepsilon$ of the random errors and the densities $g_d$, $g_b$ of the random effects. According to the dimensionality of the problem, we distinguish two situations.

Model U

In the case of univariate densities, i.e. for the densities $g_\zeta$ and $g_\varepsilon$, and for the densities $g_d$ and $g_b$ if the corresponding random effects are univariate (in which case we use the notation $d_i = (d_{i,1}) \equiv d_i$ and/or $b_i = (b_{i,1}) \equiv b_i$), a penalized normal mixture as introduced in Section 6.3 will be used. That is, a generic density $g(y)$ of a random variable $Y$ (substitute $\zeta_{i,l}$, $\varepsilon_{i,l}$, $d_i$ or $b_i$) is modelled as a location-and-scale transformed weighted sum of normal densities over a fixed fine grid of knots $\boldsymbol{\mu} = (\mu_{-K}, \dots, \mu_K)'$ centered around $\mu_0 = 0$. The means of the normal components are equal to the knots and their variances are all equal and fixed to $\sigma^2$, i.e.
$$g(y) = \tau^{-1} \sum_{j=-K}^{K} w_j(\boldsymbol{a})\, \varphi\Bigl(\frac{y - \alpha}{\tau} \,\Big|\, \mu_j, \sigma^2\Bigr), \qquad (9.3)$$
where the unknown intercept term $\alpha$ and the unknown scale parameter $\tau$ have to be estimated, as well as the vector $\boldsymbol{a} = (a_{-K}, \dots, a_K)'$ of transformed weights. See (6.14) for the relationship between $\boldsymbol{a}$ and $\boldsymbol{w} = (w_{-K}, \dots, w_K)'$.
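A minimal numerical sketch of evaluating the density (9.3) is given below. It assumes the softmax link between a and w of (6.14); the grid, σ and the weights used in the example are purely illustrative.

```python
import numpy as np

def penalized_mixture_density(y, a, mu, sigma, alpha, tau):
    """Evaluate g(y) of (9.3): a location-scale normal mixture over fixed knots.

    a  : transformed weights a_{-K},...,a_K (softmax link assumed, cf. (6.14))
    mu : equidistant knots mu_{-K},...,mu_K centred at 0; sigma: fixed basis sd
    alpha, tau : intercept and scale, both estimated in the model
    """
    y = np.atleast_1d(np.asarray(y, dtype=float))
    w = np.exp(a - np.max(a)); w /= w.sum()          # mixture weights w_j(a)
    z = (y[:, None] - alpha) / tau                   # standardised argument
    phi = np.exp(-0.5 * ((z - mu[None, :]) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return (phi @ w) / tau

# illustrative grid: 31 knots on [-4.5, 4.5], sigma = 2/3 of the knot spacing
K = 15
mu = np.linspace(-4.5, 4.5, 2 * K + 1)
g = penalized_mixture_density([-1.0, 0.0, 1.0], a=-0.5 * mu**2, mu=mu,
                              sigma=2 * (mu[1] - mu[0]) / 3, alpha=0.0, tau=1.0)
```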
Model M

In the case when a random effect vector $\boldsymbol{d}_i$ or $\boldsymbol{b}_i$ is multivariate it is assumed, analogously to Chapter 8, that it follows a multivariate normal distribution. This choice is driven mainly by computational convenience. Note, however, that the densities $g_\zeta$ and $g_\varepsilon$ are still modelled using the penalized normal mixture (9.3). Finally, the same reasoning as in Section 8.1.1 can be used to explain why we put more emphasis on a correct specification of the error distribution.

For notational convenience and clarity of exposition we will assume that in Model U both random effects are univariate ($q_d = q_b = 1$), whereas in Model M both random effects are multivariate ($q_d > 1$ and $q_b > 1$). However, in practical situations both cases can be mixed. For example, the distribution of a univariate $d_i$ can be specified as a penalized normal mixture (9.3) whereas for a multivariate $\boldsymbol{b}_i$ a multivariate normal distribution can be used.

9.1.2 Likelihood

Denoting by $p$ a generic density, factorizing the joint density as $p(t_{i,l}, \boldsymbol{b}_i, u_{i,l}, \boldsymbol{d}_i) = p(t_{i,l} \mid \boldsymbol{b}_i, u_{i,l}, \boldsymbol{d}_i)\, p(\boldsymbol{b}_i \mid u_{i,l}, \boldsymbol{d}_i)\, p(u_{i,l} \mid \boldsymbol{d}_i)\, p(\boldsymbol{d}_i)$ and using the independence assumptions of Section 9.1, the likelihood contribution of the $i$th cluster is given by
$$L_i = \int_{\mathbb{R}^{q_d}} \int_{\mathbb{R}^{q_b}} \Biggl[\, \prod_{l=1}^{n_i} \int_{u^L_{i,l}}^{u^U_{i,l}} \biggl\{ \int_{v^L_{i,l}-u_{i,l}}^{v^U_{i,l}-u_{i,l}} p(t_{i,l} \mid \boldsymbol{b}_i)\, \mathrm{d}t_{i,l} \biggr\}\, p(u_{i,l} \mid \boldsymbol{d}_i)\, \mathrm{d}u_{i,l} \Biggr]\, p(\boldsymbol{b}_i)\, p(\boldsymbol{d}_i)\, \mathrm{d}\boldsymbol{b}_i\, \mathrm{d}\boldsymbol{d}_i, \qquad (9.4)$$
where
$$p(t_{i,l} \mid \boldsymbol{b}_i) = t_{i,l}^{-1}\, g_\varepsilon\bigl(\log(t_{i,l}) - \boldsymbol{b}_i'\boldsymbol{z}^t_{i,l} - \boldsymbol{\beta}'\boldsymbol{x}^t_{i,l}\bigr), \qquad p(u_{i,l} \mid \boldsymbol{d}_i) = u_{i,l}^{-1}\, g_\zeta\bigl(\log(u_{i,l}) - \boldsymbol{d}_i'\boldsymbol{z}^u_{i,l} - \boldsymbol{\delta}'\boldsymbol{x}^u_{i,l}\bigr)$$
are modelled using expression (9.3) for $g_\varepsilon$ and $g_\zeta$.

Further, in Model U, $p(b_i) = g_b(b_i)$ and $p(d_i) = g_d(d_i)$ are penalized normal mixtures (9.3). Since it is not possible to distinguish between the intercept terms of the error and the random effect, the intercepts $\alpha = \alpha_d$ for $g_d$ and $\alpha = \alpha_b$ for $g_b$ are fixed to zero for identifiability reasons. In the case of Model M, the densities $p(\boldsymbol{b}_i) = g_b(\boldsymbol{b}_i)$ and $p(\boldsymbol{d}_i) = g_d(\boldsymbol{d}_i)$ are densities of an appropriate multivariate normal distribution (see also Section 9.2.3).

The method of penalized maximum likelihood, suggested in Chapter 7, is computationally quite demanding for likelihood (9.4). Instead, a Bayesian approach together with MCMC methodology will be used here to avoid explicit integration and optimization.

9.2 Bayesian hierarchical model

To specify the model from a Bayesian point of view, prior distributions for all unknown parameters have to be given. For our model we assume a hierarchical structure described by a directed acyclic graph (DAG). The DAG for Model U, where the distributions of the univariate random effects and the error terms are estimated using the penalized mixture, is given in Figure 9.1.

Figure 9.1: Directed acyclic graph for the Bayesian penalized mixture cluster-specific AFT model with univariate random effects (Model U).

The DAG for Model M, with multivariate normal random effects and error terms expressed using the penalized mixture, is given in Figure 9.2.
For Model U, the joint prior distribution of the total parameter vector $\boldsymbol{\theta}$ is given by
$$p(\boldsymbol{\theta}) \propto \prod_{i=1}^{N} \prod_{l=1}^{n_i} \Bigl\{ p(v_{i,l} \mid u_{i,l}, t_{i,l})\, p(t_{i,l} \mid \boldsymbol{\beta}, b_i, \varepsilon_{i,l})\, p(u_{i,l} \mid \boldsymbol{\delta}, d_i, \zeta_{i,l})\, p(\varepsilon_{i,l} \mid G_\varepsilon, r^\varepsilon_{i,l})\, p(\zeta_{i,l} \mid G_\zeta, r^\zeta_{i,l})\, p(r^\varepsilon_{i,l} \mid G_\varepsilon)\, p(r^\zeta_{i,l} \mid G_\zeta) \Bigr\} \times \prod_{i=1}^{N} \Bigl\{ p(b_i \mid G_b, r^b_i)\, p(d_i \mid G_d, r^d_i)\, p(r^b_i \mid G_b)\, p(r^d_i \mid G_d) \Bigr\} \times p(G_\varepsilon)\, p(G_\zeta)\, p(G_b)\, p(G_d)\, p(\boldsymbol{\delta})\, p(\boldsymbol{\beta}). \qquad (9.5)$$

Figure 9.2: Directed acyclic graph for the Bayesian penalized mixture cluster-specific AFT model with multivariate normal random effects (Model M).

The node $G_\varepsilon$ refers to the set $\{\sigma^\varepsilon, \boldsymbol{\mu}^\varepsilon, \alpha^\varepsilon, \tau^\varepsilon, \boldsymbol{w}^\varepsilon, \boldsymbol{a}^\varepsilon, \lambda^\varepsilon\}$, which contains the parameters of formulas (9.3) and (6.14) and a smoothing parameter $\lambda^\varepsilon$ that will be discussed further in Section 9.2.1. The sets $G_\zeta$, $G_b$, $G_d$ are defined in an analogous manner. Further, let $G$ be a generic symbol for its subscripted counterparts (i.e. for $G_\varepsilon$, $G_\zeta$, $G_b$, $G_d$) and let $y$ be a generic symbol for $\varepsilon_{i,l}$, $\zeta_{i,l}$, $b_i$ or $d_i$, $i = 1, \dots, N$, $l = 1, \dots, n_i$, respectively. The sub-DAG for the generic random variable $Y$ is shown in Figure 9.3 and the corresponding DAG conditional distributions are discussed in Sections 9.2.1 and 9.2.2.

In the case of Model M, the joint prior distribution is given by
$$p(\boldsymbol{\theta}) \propto \prod_{i=1}^{N} \prod_{l=1}^{n_i} \Bigl\{ p(v_{i,l} \mid u_{i,l}, t_{i,l})\, p(t_{i,l} \mid \boldsymbol{\beta}, \boldsymbol{b}_i, \varepsilon_{i,l})\, p(u_{i,l} \mid \boldsymbol{\delta}, \boldsymbol{d}_i, \zeta_{i,l})\, p(\varepsilon_{i,l} \mid G_\varepsilon, r^\varepsilon_{i,l})\, p(\zeta_{i,l} \mid G_\zeta, r^\zeta_{i,l})\, p(r^\varepsilon_{i,l} \mid G_\varepsilon)\, p(r^\zeta_{i,l} \mid G_\zeta) \Bigr\} \times \prod_{i=1}^{N} \Bigl\{ p(\boldsymbol{b}_i \mid \boldsymbol{\gamma}_b, \mathbf{D}_b)\, p(\boldsymbol{d}_i \mid \boldsymbol{\gamma}_d, \mathbf{D}_d) \Bigr\} \times p(G_\varepsilon)\, p(G_\zeta)\, p(\boldsymbol{\gamma}_b)\, p(\mathbf{D}_b)\, p(\boldsymbol{\gamma}_d)\, p(\mathbf{D}_d)\, p(\boldsymbol{\delta})\, p(\boldsymbol{\beta}), \qquad (9.6)$$
where $\boldsymbol{\gamma}_d$ and $\mathbf{D}_d$ are the mean and the covariance matrix of the random effect vectors $\boldsymbol{d}_i$, and $\boldsymbol{\gamma}_b$ and $\mathbf{D}_b$ are the mean and the covariance matrix of the random effect vectors $\boldsymbol{b}_i$. These parameters will be discussed in detail in Section 9.2.3. All the factors of expressions (9.5) and (9.6) will be discussed in detail in the following sections.

Figure 9.3: Directed acyclic graph for the penalized mixture.

9.2.1 Prior distribution for G

The prior distribution of a generic node $G$, whose structure is given in Figure 9.3, equals $p(G) \propto p(\boldsymbol{a} \mid \lambda)\, p(\lambda)\, p(\alpha)\, p(\tau)$.

Prior for transformed mixture weights

Although the grid length $(2K + 1)$ is often of moderate size, it results in a rather large number of unknown $a$ parameters. To avoid overfitting of the data and identifiability problems, a restriction on the $a$ parameters is needed. In Chapter 7 we added a penalty term for the transformed weights to the log-likelihood for this purpose. This penalty term can be interpreted as an informative log-prior distribution (e.g., Silverman, 1985, Section 6). Therefore the prior distribution $p(\boldsymbol{a} \mid \lambda)$ is defined as the exponential of the penalty term used in Chapter 7, i.e.
$$p(\boldsymbol{a} \mid \lambda) \propto \exp\Bigl\{ -\frac{\lambda}{2} \sum_{j=-K+s}^{K} \bigl(\Delta^s a_j\bigr)^2 \Bigr\} = \exp\Bigl\{ -\frac{\lambda}{2}\, \boldsymbol{a}'\mathbf{P}'\mathbf{P}\boldsymbol{a} \Bigr\}, \qquad (9.7)$$
where $\Delta^s$ denotes a difference operator of order $s$ and $\mathbf{P}$ the corresponding difference operator matrix. The hyperparameter $\lambda$ controls the smoothness of the resulting density $g(y)$.

Expression (9.7) is that of a multivariate normal density with zero mean and covariance matrix $\lambda^{-1}(\mathbf{P}'\mathbf{P})^{-}$, where $(\mathbf{P}'\mathbf{P})^{-}$ denotes a generalized inverse of the matrix $\mathbf{P}'\mathbf{P}$. This distribution is known as a Gaussian Markov random field (GMRF) and is extensively used in spatial statistics.
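The penalty matrix P and the corresponding log-prior of (9.7) are easy to construct explicitly. The short sketch below is our own illustration; it also exhibits the rank deficiency of P'P that is discussed next.

```python
import numpy as np

def difference_matrix(n_coef, s=3):
    """Matrix P such that P @ a stacks the s-th order differences Delta^s a_j."""
    P = np.eye(n_coef)
    for _ in range(s):
        P = np.diff(P, axis=0)          # one pass = first-order differences
    return P

def log_gmrf_prior(a, lam, s=3):
    """Log of the improper GMRF prior (9.7), up to an additive constant."""
    P = difference_matrix(len(a), s)
    return -0.5 * lam * a @ (P.T @ P) @ a

P = difference_matrix(31, s=3)          # 31 transformed weights, third differences
rank = np.linalg.matrix_rank(P.T @ P)   # 28 = 31 - 3: rank deficiency of order s
```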
Although the distribution (9.7) is improper (the matrix P′ P has a deficiency of s in its rank) the resulting posterior distribution is proper as soon as there is some informative data available, see Besag et al. (1995). 9.2. BAYESIAN HIERARCHICAL MODEL 163 As a consequence of the findings discussed in Section 7.2, prior distribution (9.7) favours smooth estimates of the estimated densities (gε , gζ , gb or gd ). Due to the correspondence of the prior (9.7) with the penalty term in the penalized maximum-likelihood approach we will call the mixture model (9.3) with this prior a penalized mixture. Prior for the smoothing parameter The smoothing hyperparameter λ can be interpreted as a component of the prior precision of the transformed weights a. See Section 7.2.3 for the approaches to determine the optimal value of λ in the context of penalized maximum-likelihood estimation. For our full Bayesian inference, the unknown smoothing parameter λ is considered stochastic and is estimated simultaneously with all the remaining parameters of the model. Therefore, here a hyperprior has been assigned to λ, i.e. a highly dispersed Gamma(hλ,1 , hλ,2 ) prior, i.e. h p(λ) = λ,1 hλ,2 Γ(hλ,1 ) λhλ,1−1 exp −hλ,2 λ , where hλ,1 is the fixed shape parameter and hλ,2 the fixed rate parameter. A dispersed gamma distribution is obtained for instance with hλ,1 = hλ,2 = 0.001 or hλ,1 = 1, hλ,2 = 0.005. Prior for the mixture intercept Finally, in the case when the intercept term α is not fixed to zero (intercept of error distributions), a highly dispersed normal distribution has been taken for p(α), i.e. p(α) = ϕ(α | να , ψα ), where να is the fixed prior mean and ψα is the fixed large prior variance. Prior for the mixture scale For the precision τ −2 we have taken a highly dispersed Gamma(hτ,1 , hτ,2 ) distribution, see above the paragraph on the prior for the smoothing parameter. Alternatively a uniform distribution on τ (formally a truncated gamma distribution for τ −2 with hτ,1 = −1/2 and hτ,2 = 0) which is sometimes preferred for hierarchical models (Gelman et al., 2004, pp. 136, 390) could be taken. 164 CHAPTER 9. BAYESIAN PENALIZED MIXTURE CS AFT MODEL 9.2.2 Prior distribution for the generic node Y To specify the prior distribution of generic Y (εi,l , ζi,l , i = 1, . . . , N , l = 1, . . . , ni in Models U and M and bi , di , i = 1, . . . , N in Model U ) we introduce, analogously to Section 8.2.1, a latent allocation variable r taking values in {−K, . . . , K}. Actually, data augmentation (Tanner and Wong, 1987) is introduced which simplifies the MCMC procedure. The DAG conditional distribution p(y | G, r) is simply a normal distribution: p(y | G, r) = p(y | σ, µ, α, τ, r) = ϕ y | α + τ µr , (τ σ)2 . Further, p(r | G) = p(r | w) is given by Pr r = j w = wj , j ∈ {−K, . . . , K}. Had the latent allocation variable r not been introduced we would have had to work with the conditional distribution p(y | G) = p(y | σ, µ, α, τ, w) which is a normal mixture given by the formula (9.3). 9.2.3 Prior distribution for multivariate random effects in Model M As was mentioned in Section 9.1.1, the multivariate random effects bi and di , i = 1, . . . , N in Model M are assumed to be a priori normally distributed. That is, the densities p(bi | γ b , Db ) and p(di | γ d , Dd ) in the expression (9.6) are p(bi | γ b , Db ) = ϕqb (bi |γ b , Db ), p(di | γ d , Dd ) = ϕqd (di |γ d , Dd ), where γ b = (γb,1 , . . . , γb,qb )′ is the prior mean of the random effects bi , γ d = (γd,1 , . . . 
, γd,qd )′ the prior mean of the random effects di , Db is the prior covariance matrix of the random effects bi and Dd is the prior covariance matrix of the random effects di . Both prior random effect means γ b and γ d as well as random effect covariance matrices Db and Dd are further assigned hyperpriors. These hyperpriors are chosen analogously to Section 8.2.2. That is, the prior distribution for each γb,j , j = 1, . . . , qb and γd,j ∗ , j ∗ = 1, . . . , qd , respectively is N (νγb ,j , ψγb ,j ) and N (νγd ,j ∗ , ψγd ,j ∗ ), respectively, independently for j = 1, . . . , qb and j ∗ = 1, . . . , qd , i.e. p(γ b ) p(γ d ) = qb nY j=1 ϕ(γb,j qd o o nY ϕ(γd,j ∗ | νγd ,j ∗ , ψγd ,j ∗ ) . | νγb ,j , ψγb ,j ) × j ∗ =1 9.2. BAYESIAN HIERARCHICAL MODEL 165 The vectors ν γb = (νγb ,1 , . . . , νγb ,qb )′ , ν γd = (νγd ,1 , . . . , νγd ,qd )′ , ψ γb = (ψγb ,1 , . . . , ψγb ,qb )′ , and ψ γd = (ψγd ,1 , . . . , ψγd ,qd )′ are fixed hyperparameters. Special care is needed when the random intercept is included in the model. If for example z ti,l,1 ≡ 1, i = 1, . . . , N , l = 1, . . . , ni , then for identifiability reasons γb,1 must be fixed to zero (or equivalently, νγb ,1 = 0, ψγb ,1 = 0) as the overall intercept is given by the intercept αε of the error terms εi,l . The prior distributions for the covariance matrices Db and Dd are inverseWishart with fixed degrees of freedom dfb and dfd , respectively and fixed scale matrices Sb and Sd , respectively. See formula (8.8) for the expression of the corresponding density. 9.2.4 Prior distribution for the regression parameters The prior specification for the regression parameters β and δ is analogous to Section 8.2.2. Firstly, also here, we use the hierarchical centering. That is, the covariates included in xti,l or xui,l , respectively are not included in z ti,l or z ui,l , respectively and vice versa. Further, the covariate vectors xti,l and xui,l , respectively never contain an intercept term since the overall intercept are already included in the model in the form of the parameters αε and αζ , respectively. The prior distribution for each regression coefficient βj , j = 1, . . . , mt and δj ∗ , j ∗ = 1, . . . , mu is N (νβ,j , ψβ,j ) and N (νδ,j ∗ , ψβ,j ∗ ), respectively, independently for j = 1, . . . , mt and j ∗ = 1, . . . , mu , i.e. mu mt o o nY nY ϕ(δj ∗ | νδ,j ∗ , ψδ,j ∗ ) . ϕ(βj | νβ,j , ψβ,j ) × p(β) p(δ) = j ∗ =1 j=1 The vectors ν β = (νβ,1 , . . . , νβ,mt )′ , ν δ = (νδ,1 , . . . , νδ,mt )′ , ψ β = (ψβ,1 ,. . . , ψβ,mt )′ , and ψ δ = (ψδ,1 , . . . , ψδ,mt )′ are fixed hyperparameters. 9.2.5 Prior distribution for the time variables The terms p(vi,l | ui,l , ti,l ), p(ti,l | β, bi , εi,l ) and p(ui,l | δ, di , ζi,l ) appearing in the expressions (9.5) and (9.6) are all Dirac (degenerated) densities driven by the AFT models (9.1) and (9.2). Namely: p(vi,l | ui,l , ti,l ) = I[vi,l = ui,l + ti,l ], p(ui,l | δ, di , ζi,l ) = I[log(ui,l ) = δ ′ xui,l + d′i z ui,l + ζi,l ], p(ti,l | β, bi , εi,l ) = I[log(ti,l ) = β ′ xti,l + b′i z ti,l + εi,l ], i = 1, . . . , N, l = 1, . . . , ni . 166 9.2.6 CHAPTER 9. BAYESIAN PENALIZED MIXTURE CS AFT MODEL Posterior distribution The product of all DAG conditional distributions determines the joint posterior distribution p(θ | data), i.e. p(θ | data) ∝ p(θ) × ni n N Y Y i=1 l=1 U p uL i,l , ui,l ui,l , censoring i,l × o U L , vi,l p vi,l vi,l , censoringi,l , where p(θ) is given by (9.5) for Model U and by (9.6) for ModelL M,Urespec U u , censoring tively. 
Further, the terms $p\bigl(\lfloor u^L_{i,l}, u^U_{i,l}\rfloor \mid u_{i,l}, \mathit{censoring}_{i,l}\bigr)$ and $p\bigl(\lfloor v^L_{i,l}, v^U_{i,l}\rfloor \mid v_{i,l}, \mathit{censoring}_{i,l}\bigr)$, where $\mathit{censoring}_{i,l}$ represents a realization of the random variable(s) causing the censoring of the $(i,l)$th onset and failure time, are the same as in Section 8.2.4, with an obvious change in notation.

9.3 Markov chain Monte Carlo

As indicated in Section 4.5, we base the inference on a sample from the posterior distribution obtained using MCMC methods. Here, Gibbs sampling (Geman and Geman, 1984; Gelfand and Smith, 1990) was chosen, which necessitates sampling from the full conditional distributions of all blocks of model parameters. Below, the full conditional distributions are discussed.

9.3.1 Updating the parameters related to the penalized mixture G

Let $y_{i^*}$, $i^* = 1, \dots, n$ be the current values of the appropriate generic nodes $y$ and $r_{i^*}$, $i^* = 1, \dots, n$ the corresponding latent allocation variables. That is,
• for $G_\varepsilon$ we have $\{y_{i^*}\} = \{\varepsilon_{i,l}: i = 1, \dots, N,\ l = 1, \dots, n_i\}$, $\{r_{i^*}\} = \{r^\varepsilon_{i,l}: i = 1, \dots, N,\ l = 1, \dots, n_i\}$ and $n = \sum_{i=1}^N n_i$;
• for $G_\zeta$ we have $\{y_{i^*}\} = \{\zeta_{i,l}: i = 1, \dots, N,\ l = 1, \dots, n_i\}$, $\{r_{i^*}\} = \{r^\zeta_{i,l}: i = 1, \dots, N,\ l = 1, \dots, n_i\}$ and $n = \sum_{i=1}^N n_i$;
• for $G_b$ we have $\{y_{i^*}\} = \{b_i: i = 1, \dots, N\}$, $\{r_{i^*}\} = \{r^b_i: i = 1, \dots, N\}$ and $n = N$;
• for $G_d$ we have $\{y_{i^*}\} = \{d_i: i = 1, \dots, N\}$, $\{r_{i^*}\} = \{r^d_i: i = 1, \dots, N\}$ and $n = N$.

Full conditional for transformed mixture weights

The full conditional of each element of $\boldsymbol{a}$ is given by
$$p(a_j \mid \cdots) \propto \frac{\exp(N_j a_j)}{\bigl\{\sum_{k=-K}^{K} \exp(a_k)\bigr\}^{n}} \times \exp\Biggl[ -\frac{\bigl\{a_j - \mathrm{E}\bigl(a_j \mid \boldsymbol{a}_{-(j)}, \lambda\bigr)\bigr\}^2}{2\, \mathrm{var}\bigl(a_j \mid \boldsymbol{a}_{-(j)}, \lambda\bigr)} \Biggr], \qquad j = -K, \dots, K, \qquad (9.8)$$
where $N_j$ is the number of $y_{i^*}$ for which the latent allocation variable $r_{i^*}$ equals $j$, i.e. $N_j = \sum_{i^*=1}^{n} I[r_{i^*} = j]$. Further, $\mathrm{E}(a_j \mid \boldsymbol{a}_{-(j)}, \lambda)$ and $\mathrm{var}(a_j \mid \boldsymbol{a}_{-(j)}, \lambda)$ are the mean and the variance resulting from the GMRF prior (9.7). For example, for the third-order differences ($s = 3$), which have been used in all applications in this thesis (Sections 9.7 and 9.8), we have
$$\mathrm{E}\bigl(a_j \mid \boldsymbol{a}_{-(j)}\bigr) = \frac{a_{j-3} - 6a_{j-2} + 15a_{j-1} + 15a_{j+1} - 6a_{j+2} + a_{j+3}}{20}, \qquad j = -K+3, \dots, K-3,$$
$$\mathrm{E}\bigl(a_{-K+2} \mid \boldsymbol{a}_{-(-K+2)}\bigr) = \frac{-3a_{-K} + 12a_{-K+1} + 15a_{-K+3} - 6a_{-K+4} + a_{-K+5}}{19},$$
$$\mathrm{E}\bigl(a_{K-2} \mid \boldsymbol{a}_{-(K-2)}\bigr) = \frac{-3a_{K} + 12a_{K-1} + 15a_{K-3} - 6a_{K-4} + a_{K-5}}{19},$$
$$\mathrm{E}\bigl(a_{-K+1} \mid \boldsymbol{a}_{-(-K+1)}\bigr) = \frac{3a_{-K} + 12a_{-K+2} - 6a_{-K+3} + a_{-K+4}}{10},$$
$$\mathrm{E}\bigl(a_{K-1} \mid \boldsymbol{a}_{-(K-1)}\bigr) = \frac{3a_{K} + 12a_{K-2} - 6a_{K-3} + a_{K-4}}{10},$$
$$\mathrm{E}\bigl(a_{-K} \mid \boldsymbol{a}_{-(-K)}\bigr) = 3a_{-K+1} - 3a_{-K+2} + a_{-K+3}, \qquad \mathrm{E}\bigl(a_{K} \mid \boldsymbol{a}_{-(K)}\bigr) = 3a_{K-1} - 3a_{K-2} + a_{K-3},$$
and
$$\mathrm{var}\bigl(a_j \mid \boldsymbol{a}_{-(j)}\bigr) = (20\lambda)^{-1}, \quad j = -K+3, \dots, K-3, \qquad \mathrm{var}\bigl(a_{-K+2} \mid \cdot\bigr) = \mathrm{var}\bigl(a_{K-2} \mid \cdot\bigr) = (19\lambda)^{-1},$$
$$\mathrm{var}\bigl(a_{-K+1} \mid \cdot\bigr) = \mathrm{var}\bigl(a_{K-1} \mid \cdot\bigr) = (10\lambda)^{-1}, \qquad \mathrm{var}\bigl(a_{-K} \mid \cdot\bigr) = \mathrm{var}\bigl(a_{K} \mid \cdot\bigr) = \lambda^{-1}.$$
Distribution (9.8) is log-concave, so we experimented both with the slice sampler of Neal (2003) and with the adaptive rejection sampling (ARS) method of Gilks and Wild (1992) to update the elements of $\boldsymbol{a}$. However, in our applications neither method was found to be superior with respect to the performance of the MCMC. The results presented in Sections 9.7 and 9.8 were obtained using slice sampling. Furthermore, it is seen that the full conditional distribution of each transformed mixture weight depends only on the weights of the neighboring mixture components.
For a better performance of the MCMC, especially to decrease the autocorrelation of the sampled chain, it is thus advantageous to update the transformed mixture weights within one MCMC iteration in such an order that the full conditional of the $a$ currently being updated does not depend on an $a$ that has just been updated. This is obtained, for example, using the following update order:
$$\cdots \to a_0 \to a_{s+1} \to a_{2(s+1)} \to \cdots \to a_1 \to a_{1+(s+1)} \to a_{1+2(s+1)} \to \cdots.$$

Full conditional for the smoothing parameter

For the smoothing parameter $\lambda$, the full conditional distribution is $\mathrm{Gamma}(h^*_{\lambda,1}, h^*_{\lambda,2})$, where
$$h^*_{\lambda,1} = h_{\lambda,1} + \frac{2K + 1 - s + 1}{2}, \qquad h^*_{\lambda,2} = h_{\lambda,2} + \frac{1}{2}\, \boldsymbol{a}'\mathbf{P}'\mathbf{P}\boldsymbol{a}.$$

Full conditional for the mixture intercept

The full conditional for the mixture intercept $\alpha$ is a normal distribution with mean and variance
$$\mathrm{E}(\alpha \mid \cdots) = \mathrm{var}(\alpha \mid \cdots) \times \Bigl\{ (\sigma\tau)^{-2} \sum_{i^*=1}^{n} (y_{i^*} - \tau\mu_{r_{i^*}}) + \psi_\alpha^{-1}\nu_\alpha \Bigr\}, \qquad \mathrm{var}(\alpha \mid \cdots) = \bigl\{ (\sigma\tau)^{-2} n + \psi_\alpha^{-1} \bigr\}^{-1},$$
respectively, and is thus easily sampled from.

Full conditional for the mixture scale

The full conditional distribution of $\tau^{-2}$ has the form
$$p(\tau^{-2} \mid \cdots) \propto (\tau^{-2})^{\xi_1 - 1} \exp\bigl( \xi_3 \sqrt{\tau^{-2}} - \xi_2\, \tau^{-2} \bigr), \qquad (9.9)$$
with
$$\xi_1 = h_{\tau,1} + 0.5\, n, \qquad \xi_2 = h_{\tau,2} + 0.5\, \sigma^{-2} \sum_{i^*=1}^{n} (y_{i^*} - \alpha)^2, \qquad \xi_3 = \sigma^{-2} \sum_{i^*=1}^{n} \mu_{r_{i^*}} (y_{i^*} - \alpha).$$
Distribution (9.9) is generally not log-concave, so that the adaptive rejection sampling (ARS) method of Gilks and Wild (1992), successfully used in many situations where the full conditional distribution does not have a standard form, cannot be used here. However, it can easily be shown that the density (9.9) is always unimodal, and the slice sampler of Neal (2003) can be used to update the parameter $\tau^{-2}$ in an MCMC run.

Full conditional for the allocation variables

The full conditional for each allocation variable $r_{i^*}$, $i^* = 1, \dots, n$ is discrete with
$$\Pr(r_{i^*} = j \mid \cdots) \propto w_j \exp\Bigl\{ -\frac{(y_{i^*} - \alpha - \tau\mu_j)^2}{2(\sigma\tau)^2} \Bigr\}, \qquad j \in \{-K, \dots, K\}.$$

9.3.2 Updating the generic node Y

The update of the generic node $Y$ is of two types: (1) the update of the residuals $\varepsilon_{i,l}$, $\zeta_{i,l}$, $i = 1, \dots, N$, $l = 1, \dots, n_i$; (2) the update of the univariate random effects $b_i$, $d_i$, $i = 1, \dots, N$ in Model U.

Updating the residuals

The update of the 'onset' residuals $\zeta_{i,l}$, $i = 1, \dots, N$, $l = 1, \dots, n_i$ is fully deterministic provided the $(i,l)$th onset time $u_{i,l} = u^L_{i,l} = u^U_{i,l}$ is uncensored. The update of $\zeta_{i,l}$ then consists of using the AFT expression (9.1) with the current values of the parameters, i.e. the updated $\zeta_{i,l}$ equals $\log(u_{i,l}) - \boldsymbol{\delta}'\boldsymbol{x}^u_{i,l} - \boldsymbol{d}_i'\boldsymbol{z}^u_{i,l}$. When the $(i,l)$th onset time is interval-censored with an observed interval $\lfloor u^L_{i,l}, u^U_{i,l}\rfloor$, its update consists of sampling from a truncated normal distribution, namely $\mathrm{N}\bigl(\alpha^\zeta + \tau^\zeta \mu_{r^\zeta_{i,l}},\, (\sigma^\zeta\tau^\zeta)^2\bigr)$ truncated to
$$\Bigl\lfloor \log(u^L_{i,l}) - \boldsymbol{\delta}'\boldsymbol{x}^u_{i,l} - \boldsymbol{d}_i'\boldsymbol{z}^u_{i,l},\ \ \log(u^U_{i,l}) - \boldsymbol{\delta}'\boldsymbol{x}^u_{i,l} - \boldsymbol{d}_i'\boldsymbol{z}^u_{i,l} \Bigr\rfloor.$$

A similar procedure is used when updating the 'event' residuals $\varepsilon_{i,l}$, $i = 1, \dots, N$, $l = 1, \dots, n_i$. It is useful to stress that for the update of $\varepsilon_{i,l}$ also the 'onset' residual $\zeta_{i,l}$, and subsequently also the true onset time $u_{i,l} = \exp(\boldsymbol{\delta}'\boldsymbol{x}^u_{i,l} + \boldsymbol{d}_i'\boldsymbol{z}^u_{i,l} + \zeta_{i,l})$, form part of the conditioning when exploiting the full conditional distribution. This implies that the update of $\varepsilon_{i,l}$ is fully deterministic provided the $(i,l)$th failure time $v_{i,l} = v^L_{i,l} = v^U_{i,l}$ is uncensored, irrespective of whether the onset time is censored or not.
The update of εi,l consists then of using the AFT expression (9.2) with the current values of the parameters, i.e. the updated εi,l is equal to log(vi,l − ui,l ) − β′ xti,l − b′i z ti,l . When the residual εi,l corresponds to the censored failure time with an obL , v U ⌋ its update consists of the sampling from the full served interval ⌊vi,l i,l conditional distribution of εi,l which is here a truncated normal distribution, namely ε ε 2 ε , (σ τ ) truncated on N αε + τ ε µri,l k j L U log(vi,l − ui,l ) − β ′ xti,l − b′i z ti,l , log(vi,l − ui,l ) − β ′ xti,l − b′i z ti,l . Updating the univariate random effects in Model U In Model U, the full conditional distributions for the univariate random effects bi and/or di , i = 1, . . . , N are normal distributions, namely bi | · · · ∼ N E(bi | · · · ), var(bi | · · · ) , i = 1, . . . , N, with E(bi | · · · ) = var(bi | · · · ) × ni h X i log(ti,l ) − αε − β ′ xti,l − τ ε µεrε , (σ b τ b )−2 τ b µbrb + (σ ε τ ε )−2 i i,l l=1 o−1 n var(bi | · · · ) = (σ b τ b )−2 + (σ ε τ ε )−2 ni , 9.3. MARKOV CHAIN MONTE CARLO 171 Analogous formulas, with an obvious change in notation, hold for di , i = 1, . . . , N . 9.3.3 Updating the parameters related to the multivariate random effects in Model M In the case of the multivariate random effects bi and/or di having a multivariate normal prior distribution the following full conditionals are used to update the related parameters. Full conditionals for the multivariate random effects bi and di The full conditional of the multivariate random effects vector bi , i = 1, . . . , N is multivariate normal distribution, i.e. bi | · · · ∼ N E(bi | · · · ), var(bi | · · · ) , i = 1, . . . , N, with E(bi | · · · ) = var(bi | · · · ) × ni h X i t ε ε −2 ε ′ t ε ε z γ + (σ τ ) D−1 , log(t ) − α − β x − τ µ ε i,l b i,l i,l r b i,l l=1 var(bi | · · · ) = n D−1 b ε ε −2 + (σ τ ) ni X l=1 z ti,l (z ti,l )′ o−1 . The full conditional distribution of the multivariate random effects di , i = 1, . . . , N is analogous with an obvious change in notation. Full conditionals for the means γ b , γ d and the covariance matrices Db , Dd of the multivariate random effects For the means γ b , γ d and the covariance matrices Db , Dd of the multivariate random effects, the full conditional distributions are exactly the same as these derived for the Bayesian normal mixture CS AFT model in Section 8.3.2. Only appropriate subscripts have to be added to expressions appearing in formulas given in Section 8.3.2. 172 CHAPTER 9. BAYESIAN PENALIZED MIXTURE CS AFT MODEL 9.3.4 Updating the regression parameters Full conditionals for the fixed effects δ and β Let β (S) be an arbitrary sub-vector of vector β, and xi,l(S) the corresponding sub-vectors of covariate vectors xti,l , and further let xi,l(−S) be their complementary sub-vectors. Similarly, let further ν β(S) and ψ β(S) be appropriate sub-vectors of hyperparameters ν β and ψ β , respectively. Finally, let Ψβ(S) = diag(ψ β(S) ). Then β (S) | · · · ∼ N E(β (S) | · · · ), var(β (S) | · · · ) , with E(β (S) | · · · ) = var(β (S) | · · · )× n n Ψ−1 β(S) ν β(S) ε ε −2 + (σ τ ) ε ε −2 var(β (S) | · · · ) = Ψ−1 β(S) + (σ τ ) ni N X X i=1 l=1 n N i XX o (F ) xi,l(S) ei,l(S) , xi,l(S) x′i,l(S) i=1 l=1 o−1 , (F ) where ei,l(S) = log(ti,l ) − αε − β ′(−S) xi,l(−S) − b′i z ti,l − τ ε µεrε . i,l The full conditional distribution for an arbitrary subvector of the vector δ is analogous with an obvious change in notation. 
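As we read the full conditional just given, the Gibbs update of a block β_(S) amounts to a single multivariate normal draw. The sketch below is our own illustration under these assumptions; the helper and its argument names are not part of the thesis software.

```python
import numpy as np

def gibbs_update_beta_block(log_t, X_S, offset, nu_S, psi_S, sigma_eps, tau_eps, rng):
    """Draw beta_(S) from its normal full conditional (cf. Section 9.3.4).

    offset : alpha_eps + beta_(-S)' x_(-S) + b_i' z^t + tau_eps * mu_{r^eps},
             evaluated at the current MCMC state for every observation,
             so that 'resid' below plays the role of e^(F)_{i,l(S)} in the text.
    nu_S, psi_S : prior means and variances of the block; X_S : its design matrix.
    """
    prec_err = (sigma_eps * tau_eps) ** -2
    resid = log_t - offset
    prec = np.diag(1.0 / psi_S) + prec_err * X_S.T @ X_S   # posterior precision
    cov = np.linalg.inv(prec)
    mean = cov @ (nu_S / psi_S + prec_err * X_S.T @ resid)
    return rng.multivariate_normal(mean, cov)
```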
9.4 9.4.1 Bayesian estimates of the survival distribution Predictive survival and hazard curves and predictive survival densities Analogously to Section 8.4, the survival and hazard functions or the survival densities for a specific combination of covariates are estimated by the mean of (posterior) predictive quantities. Almost all expressions given in Section 8.4.1 apply also here with the following changes. To get the Bayesian estimate of the survival function of the event time T , given the covariates xtnew and z tnew , the expression (8.16) changes 9.5. BAYESIAN ESTIMATES OF THE INDIVIDUAL RANDOM EFFECTS 173 into S(t | θ, xtnew , z tnew ) = 1− K X j=−K (9.10) wjε Φ log(t) − β ′ xtnew − b′ z tnew αε + τ ε µεj , (σ ε τ ε )2 . Similarly, to get the estimate of the survival density, we use p(t | θ, xtnew , z tnew ) = t−1 K X j=−K (9.11) wjε ϕ log(t) − β ′ xtnew − b′ z tnew αε + τ ε µεj , (σ ε τ ε )2 instead of the expression (8.18). To be able to use a relationship analogous to (8.17) we need a sample {b(m) : m = 1, . . . , M } of the posterior predictive values of the random effects. In the case of a univariate randomeffect b in Model U, b(m) is sampled from the P (m) b,(m) µb , (σ b τ b,(m) )2 . In the case of a mulnormal mixture K j j=−K wj N τ (m) (m) tivariate random effect b in Model M, b(m) is sampled from Nqb (γ b , Db ). The predictive quantities for the onset time U are obtained in an analogous manner. 9.4.2 Predictive error and random effect densities The estimate of the smoothed densities gε , gζ , gb , gd is obtained by the mean of the (posterior) predictive density which is given, for example in the case of gε , by Z E gε (e) data = gε (e)p(θ | data) dθ, e ∈ R. (9.12) The MCMC estimate of (9.12) is obtained by averaging the error density (9.3) over the MCMC run, i.e. M K e − αε,(m) X 1 X ε ε,(m) ε 2 ε,(m) −1 ĝε (e) = . (9.13) wj ϕ τ µj , (σ ) M τ ε,(m) m=1 j=−K 9.5 Bayesian estimates of the individual random effects As explained in Section 8.5 in the context of the Bayesian normal mixture CS AFT model, in some situation estimates of the individual random effects 174 CHAPTER 9. BAYESIAN PENALIZED MIXTURE CS AFT MODEL must be provided. In the context of this chapter these can be computed in the same way as shown in Section 8.5. 9.6 Simulation study To validate our approach we conducted a simulation study which mimics to a certain extent the Signal Tandmobielr data. From each of 150 clusters we simulated 4 observations. The onset time Ui,l and the event time Ti,l , i = 1, . . . , 150, l = 1, . . . , 4 were generated according to the AFT models (9.1) and (9.2) with xui,l = (xui,l,1 , xui,l,2 )′ , z ui,l ≡ 1, δ = (0.20, −0.10)′ and xti,l = (xti,l,1 , xti,l,2 )′ , z ti,l ≡ 1, β = (0.30, −0.15)′ . The covariates xui,l,1 and xti,l,1 are continuous and generated independently from a uniform distribution on (0, 1), the covariates xui,l,2 and xti,l,2 are binary with the equal probabilities for zeros and ones. ∗ (αζ = 1.75, The error terms ζi,l and εi,l are obtained from ζi,l = αζ + τ ζ ζi,l ∗ ε ∗ ε ε ∗ ∗ ∗ ζi,l ∼ gζ ) and εi,l = α + τ εi,l (α = 2.00, εi,l ∼ gε ), respectively. Further, the random effects di and bi are obtained from di = τ d d∗i (d∗i ∼ gd∗ ) and bi = τ b b∗i (b∗i ∼ gb∗ ), respectively. The scale parameters were chosen such 2 2 that (τ d )2 + (τ ζ )2 = τonset = 0.1 and (τ b )2 + (τ ε )2 = τevent = 1.0, see below 2 2 for the individual values. The choice of τonset and τevent was motivated by the results of the analysis in Section 9.7. 
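The data-generating step of this simulation can be sketched as follows. For brevity the standardized error terms and random effects are drawn as standard normals here, whereas the study uses the mixture and extreme-value shapes, and the range of variance ratios, described in the next paragraphs.

```python
import numpy as np

def simulate_true_times(n_clus=150, n_per=4, tau2_onset=0.1, tau2_event=1.0,
                        ratio=1.0, seed=1):
    """Generate true onset times U and event times T of the simulation design."""
    rng = np.random.default_rng(seed)
    delta, beta = np.array([0.20, -0.10]), np.array([0.30, -0.15])
    # split each total variance between random effect and error: tau_d/tau_zeta = ratio
    tau_zeta = np.sqrt(tau2_onset / (1 + ratio ** 2)); tau_d = ratio * tau_zeta
    tau_eps = np.sqrt(tau2_event / (1 + ratio ** 2)); tau_b = ratio * tau_eps
    n = n_clus * n_per
    xu = np.column_stack([rng.uniform(size=n), rng.integers(0, 2, n)])
    xt = np.column_stack([rng.uniform(size=n), rng.integers(0, 2, n)])
    d = np.repeat(tau_d * rng.standard_normal(n_clus), n_per)   # cluster effects
    b = np.repeat(tau_b * rng.standard_normal(n_clus), n_per)
    U = np.exp(1.75 + xu @ delta + d + tau_zeta * rng.standard_normal(n))
    T = np.exp(2.00 + xt @ beta + b + tau_eps * rng.standard_normal(n))
    return U, T, U + T                                          # V = U + T

U, T, V = simulate_true_times()
```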
Two scenarios for the distributional parts of the model were considered. In scenario I, both densities gζ∗ and gε∗ (of the standardized error terms) are a mixture of normals, i.e. equal to 0.4 N (−2.000, 0.25) + 0.6 N (1.333, 0.36) standardized to have unit variance. For the densities gd∗ and gb∗ (of the standardized random effects) the density of a standardized extreme value of minimum distribution was taken. In scenario II, we reversed the setting, i.e. we have taken an extreme value distribution for the error terms and a normal mixture for the random effects. Additionally, within each scenario, the vari2 2 were decomposed such that the ratios τ d /τ ζ = τ b /τ ε and τevent ances τonset were equal to 5, 3, 2, 1, 1/2, 1/3, and 1/5, respectively. The true onset and event times were interval-censored by simulating the ‘visit’ times for each subject in the data set. The first visit was drawn from N (1, 0.22 ). Each of the distances between the consecutive visits was drawn from N (0.5, 0.052 ). The results for the simulation study are shown in Appendix B.3. Tables B.12 and B.13 give the results for the regression parameters and show that they are estimated practically unbiasedly and with a reasonable precision. It is further seen that the precision of the estimation decreases when the 9.7. EXAMPLE: SIGNAL TANDMOBIELr STUDY – CLUSTERED DOUBLY-INTERVAL-CENSORED DATA 175 within-cluster variability (variance of the error terms) increases compared to the between-cluster variability (variance of the random effects). In practice however, the between-cluster variability is often much higher than the within-cluster variability. Further, Tables B.14 and B.15 show results for the standard deviations of the error terms and random effects. Here, the precision is sometimes somewhat worse. However, also the standard deviations are, in most cases, estimated with minimal bias. Furthermore, the shape of the survival functions or survival densities is correctly estimated as is illustrated in Figures B.10–B.17 which show results for the fitted survival functions and survival densities for selected combinations of covariates. 9.7 Example: Signal Tandmobielr study – clustered doubly-interval-censored data This analysis of the Signal Tandmobielr data, introduced in Section 1.1, involves (a) doubly-interval-censored data, i.e. the time from tooth emergence to onset of caries; (b) clustering. Indeed, we will examine several teeth jointly and the teeth from the same mouth are related. The primary interest of the present analysis is to address the influence of sound versus affected (decayed/filled/missing due to caries) deciduous second molars (in Figure 1.2, teeth 55, 65, 75, 85, respectively) on the caries susceptibility of the adjacent permanent first molars (in Figure 1.1, teeth 16, 26, 36, 46, respectively). Note that for about five years the deciduous second molars are in the mouth together with the permanent first molars. It is possible that the caries processes on the primary and the permanent molar occur simultaneously. In this case it is difficult to know whether caries on the deciduous molar caused caries on the permanent molar or vice versa. For this reason, the permanent first molar was excluded from the analysis if caries was present when emergence was recorded. This implies that the data are not balanced with respect to the size of the clusters. In total, 3 520 children were included in the analysis of which 187 contributed 1 tooth, 317 2 teeth, 400 3 teeth and 2 616 all 4 teeth. 
Additionally, we considered the impact of gender (boy/girl), presence of sealants in pits and fissures of the permanent first molar (none/present), occlusal plaque accumulation on the permanent first molar (none/in pits and 176 CHAPTER 9. BAYESIAN PENALIZED MIXTURE CS AFT MODEL fissures/on total surface), and reported oral brushing habits (not daily/daily). Note that pits and fissures sealing is a preventive action which is expected to protect the tooth against caries development. The presence of plaque on the occlusal surfaces of the permanent first molars was assessed using a simplified version of the index described by Carvalho, Ekstrand, and Thylstrup (1989). All explanatory variables were obtained at the examination where the presence of the permanent first molar was first recorded. The choice of explanatory variables is motivated by the results of Leroy et al. (2005) where a GEE multivariate log-logistic AFT model was used to analyze the time to caries. Multiple imputation was used to deal with the intervalcensored emergence times. Further, on top of that, the caries status of the deciduous first molars (in Figure 1.2, teeth 54, 64, 74, 84, respectively) was included in the covariate part of the model. We will not use this factor as an explanatory variable due to its high dependence with the status of the deciduous second molar (in all quadrants of the mouth, the χ2 test statistics with 9 degrees of freedom exceeded 1 100). The onset time Ui,l , l = 1, . . . , 4 is the age (in years) of the ith child (ith cluster) at which the lth permanent first molar emerged. The failure time, Vi,l , indicates the onset of caries of the lth permanent first molar. The time from tooth emergence to the onset of caries, Ti,l , is doubly-interval-censored. Here, both the time of tooth emergence and the onset of caries experience are only known to lie in an interval of about 1 year. Further, in our example about 85% of the permanent first molars had emerged at the first examination giving rise to a huge amount of left-censored onset times. However, at each examination the permanent teeth were scored according to their clinical eruption stage using a grading that starts at P0 (tooth not visible in the mouth) and ends with P4 (fully erupted tooth with full occlusion). Based on the clinical eruption stage at the moment of the first examination, all left-censored emergence times were transformed into interval-censored ones with the lower limit of the observed interval equal to the age at examination minus 0.25 year, 0.5 year and 1 year, respectively for the teeth with the eruption stage P1, P2 and P3, respectively and with the lower limit equal to 5 years for the teeth with the eruption stage P4. We refer to Leroy et al. (2005) for details and motivation. 9.7.1 Basic Model The analysis starts with the Basic Model where we allowed for a different effect of the covariates on both emergence and caries experience for the four permanent first molars. Namely, the Basic Model was based on the AFT 9.7. 
EXAMPLE: SIGNAL TANDMOBIELr STUDY – CLUSTERED DOUBLY-INTERVAL-CENSORED DATA 177 models (9.1) and (9.2) with the covariate vector xui,l for emergence: xui,l = (gender i , tooth26i,l , tooth36i,l , tooth46i,l , tooth26i,l ∗ gender i , tooth36i,l ∗ genderi , tooth46i,l ∗ genderi )′ , and the covariate vector xti,l for caries: xti,l = (x̃ti,l , tooth26i,l , tooth36i,l , tooth46i,l , tooth26i,l ∗ x̃ti,l , tooth36i,l ∗ x̃ti,l , tooth46i,l ∗ x̃ti,l )′ , where x̃ti,l = (gender i , statusDi,l , statusFi,l , statusMi,l , brushingi , sealantsi,l , plaquePFi,l , plaqueTi,l ). The covariates tooth26, tooth36, tooth46 are dummies for the position of the permanent first molar with the molar 16 as the baseline, the covariate gender equals 1 for boys and equals 0 for girls. The covariates statusD, statusF, statusM are dummies for the status of the adjacent deciduous molar: decayed, filled, missing due to caries with sound being the baseline. The covariate brushing is dichotomous (1 = daily, 0 = not daily) as well as the covariate sealants (1 = present, 0 = not present). Finally, the covariates plaquePF and plaqueT are dummies for the plaque accumulation: in pits and fissures, on total surface with no plaque as the baseline. To account for clustering, univariate child-specific random effects di and bi u = are included in the model expressions (9.1) and (9.2), respectively with zi,l t zi,l ≡ 1. Finally, analogously to Sections 7.7 and 8.7, we subtracted 5 years from all observed times, i.e. log(Ui,l − 5) was used in the left-hand side of the model formula (9.1). As discussed already in Section 9.1, our model assumes that, given the covariates and child-specific random effects, the emergence time Ui,l and the time to caries Ti,l are independent for each i and l. Specifically, we assume that the caries process on a specific tooth only depends on the time when that tooth is at risk for caries and not on the chronological time. This assumption seems reasonable for the Signal Tandmobielr data taking into account the results of Leroy et al. (2005) who evaluated also the effect of the emergence time on the time to caries and found it non-significant (p = 0.78). 9.7.2 Final Model Based on the results for the Basic Model (see below) we fitted the Final Model where we omitted all two-way interactions with the covariates 178 CHAPTER 9. BAYESIAN PENALIZED MIXTURE CS AFT MODEL tooth26, tooth36, tooth46. Additionally, we binarized the covariates statusD, statusF, statusM into a new covariate status which was equal to 1 for decayed, filled or missing due to caries deciduous molars and was equal to 0 for sound deciduous molars. Also the covariates plaquePF and plaqueT were binarized into the covariate plaque equal to 1 for the teeth with plaque present either in pits and fissures or on total surface and equal to 0 otherwise. That is, the onset and event covariate vectors are equal to xui,l = (gender i , tooth26i,l , tooth36i,l , tooth46i,l )′ , xti,l = (gender i , statusi,l , brushingi , sealantsi,l , plaquei,l , tooth26i,l , tooth36i,l , tooth46i,l )′ . 9.7.3 Prior distribution Firstly, for all penalized mixtures we used the same grid of equidistant knots of length 31 (K = 15) defined on [−4.5, 4.5] with the basis standard deviation σ = 2(µj − µj−1 )/3 = 0.2. Secondly, the third order difference (s = 3) was used in the prior (9.7). Further, the prior distributions of the nodes in DAGs (Figures 9.1 and 9.3) without parents were taken highly dispersed. 
That is, all λ and τ⁻² parameters were a priori Gamma(1, 0.005) distributed, and all α, β and δ parameters were given a N(0, 100) prior.

9.7.4 Results

For each considered model we ran 500 000 iterations with 1:3 thinning, which took about 44 hours on a 3 GHz Pentium IV PC with 1 GB RAM. We kept the last 100 000 iterations for inference.

Results for the Basic Model

The analysis of the Basic Model revealed that all interaction terms with the tooth covariates are redundant, implying that the effect of all these covariates is the same for all four permanent first molars. To evaluate this we used simultaneous Bayesian p-values computed using the method described in Section 4.6.2. For the emergence part, the simultaneous p-value for the tooth:gender interactions is higher than 0.5. For the caries part of the model, the p-values are higher than 0.5 for the interactions of tooth with gender and plaque, and higher than 0.1 for the interactions with brushing, sealants and status. Also the covariate tooth is not significant; however, we kept it in the model to address the question whether the emergence and caries timing are the same for the four permanent first molars. Further, for none of the four permanent first molars was a significant difference found between the status groups decayed, filled or missing, or between the plaque groups present in pits and fissures or present on the total surface. This finding, together with the fact that the group with an extracted deciduous molar and the group with plaque present on the total surface had very low prevalences (1.45% and 3.13%, respectively), led to the simplification of these two covariates in the Final Model.

Results for the Final Model

Table 9.1 shows posterior medians, 95% equal-tail credible intervals and Bayesian two-sided p-values for the parameters in the Final Model.

Table 9.1: Signal Tandmobiel® study, Final Model. Posterior medians, 95% equal-tail credible regions (CR) and Bayesian two-sided p-values for the model parameters. For the parameter Tooth the CR and the p-value are simultaneous.

                                Emergence                              Caries
Parameter              Posterior median   95% CR             Posterior median   95% CR
Tooth                      p > 0.5                               p > 0.5
  tooth 26                 −0.003   (−0.013, 0.007)              −0.006   (−0.045, 0.031)
  tooth 36                  0.001   (−0.008, 0.011)              −0.009   (−0.051, 0.034)
  tooth 46                  0.002   (−0.008, 0.012)              −0.016   (−0.059, 0.026)
Gender                     p = 0.008                             p = 0.085
  girl                     −0.023   (−0.039, −0.007)             −0.071   (−0.155, 0.009)
Status                                                           p < 0.001
  dmf                                                            −0.140   (−0.193, −0.091)
Brushing                                                         p < 0.001
  daily                                                           0.337   (0.233, 0.436)
Sealants                                                         p < 0.001
  present                                                         0.119   (0.060, 0.178)
Plaque                                                           p < 0.001
  present                                                        −0.114   (−0.171, −0.067)
E(error)                    0.442   (0.427, 0.456)                1.920   (1.810, 2.059)
sd(error)                   0.029   (0.025, 0.034)                0.767   (0.712, 0.834)
sd(random)                  0.199   (0.191, 0.210)                0.672   (0.614, 0.734)

It is seen that neither for the emergence nor for the caries process is there a significant difference between the four permanent first molars. However, the molars of girls emerge significantly earlier than those of boys. With respect to caries experience, the difference between boys and girls is not significant at the 5% level. However, all remaining covariates have a significant impact on the caries process. Namely, daily brushing increases the time to caries by a factor of exp(0.337) = 1.40 compared to less frequent brushing. The presence of sealants increases the time to caries by a factor of exp(0.119) = 1.13.
On the other hand, the presence of the plaque decreases the time to caries with a factor of exp(−0.114) = 0.89 and the fact that the neighboring deciduous second molar is either decayed, filled or extracted due to caries decreases the time time to caries with a factor of exp(−0.140) = 0.87. Figure 9.4 shows the posterior predictive survival and hazard functions for the time to caries on the upper right permanent first molar of boys, for ‘the best’, ‘the worst’ and two intermediate combinations of covariates (the curves for the remaining teeth and girls are similar). It is seen that when the teeth are daily brushed, plaque-free and sealed the hazard for caries starts to increase approximately 1 year after emergence however then remains almost constant. Whereas, when the teeth are not brushed daily and are exposed to other risk factors the hazard starts to increase already approximately 6 months after emergence. After a period of constant risk then the hazard starts to increase again. The peak in the hazard for caries approximately 1 year after emergence was observed also by Leroy et al. (2005) and can be explained by the fact that teeth are most vulnerable for caries soon after the emergence when the enamel is not yet fully developed. This peak is also present, although with a different size and with a slight shift, for all covariate combinations. On the other hand, for covariate combinations reflecting good oral health and hygiene habits, the hazard remains almost constant after the initial period of highly increasing risk whereas for combinations of covariates reflecting bad oral conditions the hazard starts to increase again approximately 3 years after emergence. This shows clearly the relationship between caries experience and oral health and hygiene habits. Finally, Figure 9.5 shows Bayesian predictive error and random effect density estimates. The estimate of the emergence random effect density gd suggests the children could be divided, even after conditioning on gender, into two 9.7. EXAMPLE: SIGNAL TANDMOBIELr STUDY – CLUSTERED DOUBLY-INTERVAL-CENSORED DATA 181 0.4 0.6 0.8 0.0 0.2 Caries free 1.0 Caries: survival function 0 1 2 3 4 5 6 5 6 Time since emergence (years) 0.00 0.05 0.10 0.15 0.20 Hazard of caries Caries: hazard function 0 1 2 3 4 Time since emergence (years) Figure 9.4: Signal Tandmobielr study, Final Model. Posterior predictive caries free (survival) and caries hazard curves for tooth 16 of boys and the following combinations of covariates: solid and dashed lines for no plaque, present sealing, daily brushing and sound primary second molar (solid line) or dmf primary second molar (dashed line), dotted and dotted-dashed lines for present plaque, no sealing, not daily brushing and sound primary second molar (dotted line) or dmf primary second molar (dotted-dashed line). 182 CHAPTER 9. BAYESIAN PENALIZED MIXTURE CS AFT MODEL Emergence: random effect 0.5 1.0 gd (d) 8 6 0 0.0 2 4 gζ (ζ) 10 1.5 12 14 Emergence: error 0.40 0.45 0.50 0.55 −0.4 −0.2 0.0 0.2 0.4 ζ d Caries: error Caries: random effect 0.8 0.6 0.0 0.0 0.2 0.4 0.5 gb (b) gε (ε) 1.0 1.0 1.2 1.5 1.4 0.35 −1 0 1 2 ε 3 4 −2.0 −1.0 0.0 1.0 b Figure 9.5: Signal Tandmobielr study, Final Model. Estimates of the densities of the error terms and random effects. 9.7. EXAMPLE: SIGNAL TANDMOBIELr STUDY – CLUSTERED DOUBLY-INTERVAL-CENSORED DATA 183 groups: early and late emergers. Also, the children can be divided into two groups with respect to caries sensitivity (see random effect density gb ). 
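The hazard shapes described for Figure 9.4 above, an early peak followed by a nearly constant stretch, are typical of what a normal mixture on the log-time scale can generate. The following toy illustration (hypothetical Python with made-up parameters, not the fitted penalized mixture) evaluates the hazard as the ratio of the density to the survival function for a two-component mixture and exhibits such a non-monotone shape:

```python
import numpy as np
from scipy.stats import norm

# Toy error distribution on the log-time scale: a two-component normal mixture
weights = np.array([0.6, 0.4])
means   = np.array([0.0, 1.5])      # component locations on log(time since emergence)
sds     = np.array([0.3, 0.8])

t = np.linspace(0.05, 6.0, 500)     # time since emergence (years)
logt = np.log(t)

pdf_log  = (weights * norm.pdf(logt[:, None], means, sds)).sum(axis=1)
density  = pdf_log / t                                               # density of T = exp(error)
survival = 1.0 - (weights * norm.cdf(logt[:, None], means, sds)).sum(axis=1)
hazard   = density / survival

# The hazard rises to an early peak and then stays roughly flat --
# the qualitative shape discussed above, here with made-up numbers.
print(round(t[np.argmax(hazard)], 2))
```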
Finally, as the estimate of the caries error density gε shows three modes it seems that there are other important factors influencing the caries process besides the included covariates. 9.7.5 Conclusions This section showed how the Bayesian penalized mixture CS AFT model can be used to analyze clustered doubly-interval-censored data. Owing to flexible distributional assumptions it was not here necessary to perform the classical checks for correct distributional specification. Clearly, this step cannot be avoided when using fully parametric methods. However, for censored, or let alone doubly-interval-censored data, this is far from trivial. As was illustrated in this section new important findings concerning the distribution of the event time, derived e.g. from the shape of the hazard function, can be discovered when avoiding strong parametric assumptions. Further, we point out that the Basic Model corresponded, for comparison purposes, as closely as possible to the model used by Leroy et al. (2005). The differences were in detail outlined above. The most important one is that we used here the flexible and cluster-specific (conditional) model fitted in the Bayesian way, whereas in Leroy et al. (2005) a parametric and populationaveraged (marginal) model fitted using a frequentist method. The results for the regression parameters of the caries part of the model correspond quite closely to the earlier findings of Leroy et al. (2005) where, however, no attempts were done to simplify the model. Nevertheless, our results largely confirmed their findings. Namely, they found the overall effect (on all four teeth) of all factors except gender to be significant with p-value < 0.001. For the effect of gender they observed a p-value of 0.060 compared to 0.085 found by us. Due to the fact that Leroy et al. (2005) used a parametric log-logistic AFT model, they could not reveal the second period of increased hazard found here. Finally, we have to admit that some covariates used in our dental application should actually be treated as time-dependent. Unfortunately, with our and any other method where the distribution of the event time is specified using a density and not using an instantaneous quantity like the hazard function, inclusion of time-dependent covariates is difficult. 184 9.8 CHAPTER 9. BAYESIAN PENALIZED MIXTURE CS AFT MODEL Example: EBCP data – multicenter study In this section, we re-analyze the Early Breast Cancer Patients data introduced in Section 1.4 using the penalized mixture cluster-specific AFT model and compare the results to the results of the earlier analysis conducted using the classical normal mixture cluster-specific AFT model (see Section 8.9). Except for the model for the error distribution of the AFT model, we fitted exactly the same models as in Section 8.9. Here is their brief overview. The response event time Ti,l , i = 1, . . . , 14, l = 1, . . . , ni , 25 ≤ ni ≤ 902 is the progression-free survival (PFS) time of the lth patient treated by the ith center. In the CS AFT model (9.2), a bivariate random effect bi = (bi,1 , bi,2 )′ with the covariate vector z ti,l = (1, trtmtGroupi,l )′ is included to allow for the baseline heterogeneity as well as the heterogeneity with respect to the treatment effect across centra. The covariate vector for the fixed effects is given by xti,l = (ageMidi,l , ageOldi,l , tySui,l , tumSizi,l , nodSti,l , otDisi,l , regionNLi , regionPLi , regionSEi , regionSAi )′ . See Section 8.9 for explanation of the meaning of the single covariates. 
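For clarity, the cluster-specific model fitted here can be written out as follows (a restatement of (9.2) with the covariate structure just described, in the notation of the text):

```latex
\log(T_{i,l}) \;=\; \beta' x^{t}_{i,l} + b_i' z^{t}_{i,l} + \varepsilon_{i,l},
\qquad
z^{t}_{i,l} = (1,\ \mathit{trtmtGroup}_{i,l})', \quad
b_i = (b_{i,1},\, b_{i,2})',
\qquad i = 1,\dots,14,\ \ l = 1,\dots,n_i,
```

so that b_{i,1} captures the baseline level of the ith centre and b_{i,2} the centre-specific deviation of the treatment effect, while the error term ε_{i,l} follows the penalized normal mixture of this chapter.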
Analogously to Section 8.9, besides the model with region described above we fitted also the model without region for which the covariates regionNL, regionPL, regionSE, and regionSA were omitted from the covariate vector xti,l . The motivation for this step was an attempt to see whether the the regional structure can be revealed from the estimates of the individual random effects bi,1 , i = 1, . . . , 14. For the inference we sampled two chains, each of length 125 000 with 1:5 thinning which took about 2.5 hour on a Pentium IV 2 GHz PC with 512 MB RAM. For the inference we kept the last 25 000 iterations of each chain. 9.8.1 Prior distribution To specify the penalized mixture defining the distribution of the error terms εi,l , i = 1, . . . , N , l = 1, . . . , ni we used the grid of equidistant knots of length 31 (K = 15) defined on the interval [−4.5, 4.5] with the basis standard deviation σ = 2(µj −µj−1 )/3 = 0.2. In the prior (9.7), we used the third order differences (s = 3). Further, the smoothing parameter λε as well as the error precision parameter (τ ε )−2 were given a dispersed Gamma(1, 0.005) prior. The intercept parameter αε as well as all fixed effect regression parameters β and the parameter γb,2 – the mean of the treatment random effects bi,2 9.8. EXAMPLE: EBCP DATA – MULTICENTER STUDY 185 Table 9.2: Early breast cancer patients data. Posterior medians, 95% equaltail credible intervals and Bayesian two-sided (simultaneous) p-values for the effect of covariates. Parameter Treatment group surgery alone Age 40–50 years > 50 years Type of surgery breast conserving Tumor size ≥ 2cm Nodal status positive Other disease present Region the Netherlands Poland South Europe South Africa Model with region Poster. median 95% CI p = 0.084 −0.153 (−0.325, 0.026) p = 0.026 0.325 (0.059, 0.585) 0.285 (0.041, 0.520) p = 0.008 0.229 (0.053, 0.404) p < 0.001 −0.462 (−0.643, −0.283) p < 0.001 −0.599 (−0.758, −0.442) p = 0.016 −0.323 (−0.605, −0.059) p = 0.007 −0.403 (−0.737, −0.017) 0.349 (−0.113, 0.802) −0.339 (−0.729, 0.033) −0.737 (−1.161, −0.320) Model without region Poster. median 95% CI p = 0.074 −0.150 (−0.310, 0.015) p = 0.014 0.344 (0.088, 0.619) 0.313 (0.073, 0.565) p = 0.007 0.248 (0.078, 0.420) p < 0.001 −0.470 (−0.656, −0.288) p < 0.001 −0.605 (−0.771, −0.440) p = 0.015 −0.335 (−0.609, −0.067) – were given a dispersed N (0, 100) prior. Finally, the covariance matrix Db of the random effects got an inverse Wishart prior with dfb = 2 and Sb = diag(0.002). 9.8.2 Effect of covariates on PFS time Table 9.2 shows the posterior summary for the effect of considered covariates in both models with included or excluded covariate region. In the model with region included, surgery alone decreases the time to the cancer progression by the factor of exp(−0.153) = 0.86 compared to the surgery given together with the perioperative chemotherapy. However, as well as in the previous analysis in Section 8.9, the difference is not significant at conventional 5% level. 186 CHAPTER 9. BAYESIAN PENALIZED MIXTURE CS AFT MODEL Table 9.3: Early breast cancer patients data. Posterior medians and 95% equal-tail credible intervals for the moments of the error distribution and variance components of the random effects. Parameter E(ε) sd(ε) Model with region Poster. median 95% CI Model without region Poster. 
median 95% CI Moments of the error distribution 9.155 (8.771, 9.525) 8.967 1.481 (1.356, 1.663) 1.470 (8.570, 9.353) (1.352, 1.639) Variance components of the random effects sd(bi,1 ) 0.111 (0.024, 0.336) 0.302 (0.157, 0.541) sd(bi,2 ) 0.057 (0.020, 0.217) 0.074 (0.022, 0.245) corr(bi,1 , bi,2 ) −0.219 (−0.987, 0.963) −0.675 (−0.993, 0.980) Also the results for the effect of remaining covariates is very similar to the results of the earlier analysis given in Table 8.6. Firstly, again, the estimates in both models – with and without region – are almost the same. Further, according to the model with region included, in the middle age group 40 – 50 years, the time to the progression of cancer is increased by the factor of exp(0.325) = 1.38 compared to the youngest group <40 years. For the patients from the oldest group >50 years, this time is increased by the factor of exp(0.285) = 1.33 compared to the youngest group. The variable breast conserving surgery increases the PFS time by the factor exp(0.229) = 1.26 compared to mastectomy. Further, the tumor of size ≥2 cm decreases the PFS time by the factor of exp(−0.462) = 0.63 compared to the smaller tumors of size <2 cm. A positive pathological nodal status decreases the PFS time by the factor of exp(−0.599) = 0.55 compared to the negative result. The presence of other related disease decreases the PFS time by the factor of exp(−0.323) = 0.72. Analogously to Section 8.9, the effect of the geographical reason on the PFS time is highly significant with the same ordering of regions, namely Poland performs the best, followed by France, South Europe, the Netherlands and South Africa. Finally, Figure 9.6 illustrates rather small effect of the perioperative therapy compared to surgery alone on the posterior predictive survival curves drawn for region = France and two typical combinations of covariates. More or less the same picture has been seen also in Figure 8.9 referring to the results of the earlier analysis. 9.8. EXAMPLE: EBCP DATA – MULTICENTER STUDY 187 0.6 0.4 Survival 0.8 1.0 BCS, ≥2 cm, nodal−, no other disease Surgery + chemotherapy 0.0 0.2 Surgery alone 0 1000 2000 3000 4000 5000 Time (days) 0.6 0.4 Survival 0.8 1.0 Mastectomy, ≥2 cm, nodal+, no other disease Surgery + chemotherapy 0.0 0.2 Surgery alone 0 1000 2000 3000 4000 5000 Time (days) Figure 9.6: Early breast cancer patients data. Predictive survival curves based on the model with region for region = France, and two typical combinations of covariates: (1) breast conserving surgery, tumor size ≥2 cm, negative nodal status and no other associated disease (9.79% of the sample), (2) mastectomy, tumor size ≥2 cm, positive nodal status and no other associated disease (13.88% of the sample). 188 CHAPTER 9. BAYESIAN PENALIZED MIXTURE CS AFT MODEL 9.8.3 Predictive error density and variance components of random effects Table 9.3 gives posterior summary statistics for the moments of the error distribution and the variance components of the random effects. Also in this case, the results are very similar to these related to the earlier analysis and shown in Table 8.7. Furthermore, the 95% equal-tail credible interval for the correlation between the overall center level and the treatment × center interaction covers again almost the whole range (−1, 1) of possible value. This is also seen on the scaled histograms of sampled values of ̺ in Figure 9.7. The estimates of the error densities in both models with and without the covariate region are shown in Figure 9.8. 
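The very wide interval for corr(b_{i,1}, b_{i,2}) partly reflects the small number of centres together with the weak prior placed on D_b in Section 9.8.1. The sketch below (hypothetical Python; it assumes that SciPy's inverse Wishart parameterization by degrees of freedom and scale matrix corresponds to df_b and S_b) draws from that prior and shows the spread it implies for the random-effect standard deviations and correlation:

```python
import numpy as np
from scipy.stats import invwishart

# Prior of Section 9.8.1 for the 2x2 random-effects covariance matrix D_b
prior = invwishart(df=2, scale=np.diag([0.002, 0.002]))
D = prior.rvs(size=20000, random_state=1)        # array of shape (20000, 2, 2)

sd1, sd2 = np.sqrt(D[:, 0, 0]), np.sqrt(D[:, 1, 1])
corr = D[:, 0, 1] / (sd1 * sd2)

# With df equal to the dimension, the implied prior correlation spreads over the whole of (-1, 1)
print(np.percentile(sd1,  [2.5, 50, 97.5]))
print(np.percentile(corr, [2.5, 50, 97.5]))
```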
It is seen that exclusion of the covariate region had hardly an effect on the estimated error distribution. Indeed, since this covariate only groups different centra (clusters), its omission approved itself mainly in the variability of the random intercept bi,1 (see Table 9.3). 3 0 1 2 Posterior density 2 0 1 Posterior density 3 4 Model without region 4 Model with region −1.0 −0.5 0.0 0.5 corr(bi,1 , bi,2 ) 1.0 −1.0 −0.5 0.0 0.5 1.0 corr(bi,1 , bi,2 ) Figure 9.7: Early breast cancer patients data. Scaled histograms for sampled corr(bi,1 , bi,2 ). 9.8. EXAMPLE: EBCP DATA – MULTICENTER STUDY 189 0.00 0.05 0.10 0.15 0.20 0.25 0.30 gε (e) Model with region 6 8 10 12 14 12 14 e 0.00 0.05 0.10 0.15 0.20 0.25 0.30 gε (e) Model without region 6 8 10 e Figure 9.8: Early breast cancer patients data. Posterior predictive error densities. 190 CHAPTER 9. BAYESIAN PENALIZED MIXTURE CS AFT MODEL 1.0 0.8 0.6 0.4 Survival 0.6 0.4 Survival 0.8 1.0 BCS, ≥2 cm, nodal−, no other disease 0.0 0.0 0.2 Surgery alone 0.2 Surgery + chemotherapy 0 1000 3000 5000 0 1000 Time (days) 3000 5000 Time (days) 1.0 0.8 0.6 0.4 Survival 0.6 0.4 Survival 0.8 1.0 Mastectomy, ≥2 cm, nodal+, no other disease 0.2 Surgery alone 0.0 0.0 0.2 Surgery + chemotherapy 0 1000 3000 Time (days) 5000 0 1000 3000 5000 Time (days) Figure 9.9: Early breast cancer patients data, comparison of the penalized mixture CS AFT model (solid lines) and the classical mixture CS AFT model (dashed lines). Predictive survival curves based on the model with region for region = France, and two typical combinations of covariates: (1) breast conserving surgery, tumor size ≥2 cm, negative nodal status and no other associated disease (9.79% of the sample), (2) mastectomy, tumor size ≥2 cm, positive nodal status and no other associated disease (13.88% of the sample). 9.8. EXAMPLE: EBCP DATA – MULTICENTER STUDY 191 Model with region 8.0 8.5 9.0 9.5 11 12 13 21 22 31 32 33 34 41 42 43 44 51 Treatment Institution 11 12 13 21 22 31 32 33 34 41 42 43 44 51 Institution Intercept + region effect 10.0 −0.6 −0.2 0.0 0.2 b2 b1 + E(ε) + β(region) Model without region Institution 8.0 8.5 9.0 9.5 b1 + E(ε) 10.0 11 12 13 21 22 31 32 33 34 41 42 43 44 51 Treatment 11 12 13 21 22 31 32 33 34 41 42 43 44 51 Institution Intercept −0.6 −0.2 0.0 0.2 b2 Figure 9.10: Early breast cancer patients data. Posterior means and 95% equal-tail credible intervals for individual random effects. Random intercepts are further shifted by the error mean E(ε) and in the model with region also by a corresponding region main effect β(region). 192 CHAPTER 9. BAYESIAN PENALIZED MIXTURE CS AFT MODEL The shape of the estimated error density seems to be somewhat different from what has been found using the classical mixture model in Section 8.9 (Figure 8.11). In Figure 8.11, the shape similar to what is seen now (Figure 9.8) can only be found when looking at conditional predictive densities, given K > 1. However, both estimated error distributions lead to almost the same estimates of the survival curves as it is seen in Figure 9.9. 9.8.4 Estimates of individual random effects Finally, Figure 9.10 shows estimates of individual random effects. Analogously to Figure 8.12, the plots related to the random intercept takes into account also the mean of the error term and in the case of the model with region also the appropriate main effect of region. It can be seen that, analogous to the remaining model characteristics, Figure 9.10 resembles quite closely Figure 8.12. 
Among other things, the estimates of the individual random intercepts in the model without region again captured the region effect quite nicely.

9.8.5 Conclusions

The main purpose of this section was to explore how the chosen method for the estimation of the error distribution influences the results of a particular analysis. We have seen that, except for the estimate of the error distribution itself, the differences were almost negligible. Moreover, although the estimated shapes of the error distribution were somewhat different, they both led to almost identical survival curves.

9.9 Discussion

A semiparametric method to perform a regression analysis with clustered doubly-interval-censored data was suggested in this chapter. We opted for a fully Bayesian approach and MCMC methodology. Note, however, that as in Chapter 8 the Bayesian approach is used only for technical convenience, to avoid the difficult optimization that is unavoidable with more classical maximum-likelihood based estimation. Remember that we use a penalty-like prior distribution for the transformed mixture weights a and vague priors for all remaining parameters. We did not make any attempt to use prior information, although it could have been utilized. Taking into account the above reasoning, we conclude that similar results would have been obtained if penalized maximum-likelihood estimation had been used.

Chapter 10

Bayesian Penalized Mixture Population-Averaged AFT Model

In Section 9.7, we evaluated the impact of several covariates on the time to caries of the permanent first molars, which are the teeth most often attacked by caries during childhood. It was also of interest to know whether the covariates have the same effect on all teeth. Hence all four teeth had to be modelled jointly. In the same section, univariate cluster-specific random effects were included in the model expression to account for within-cluster dependencies. Given these random effects, the observations within each cluster were assumed to be independent. The distributional parts of the model were specified as penalized univariate normal mixtures.

However, it is also of interest to evaluate the association between the times to caries of the studied teeth, whereas the approach of Chapter 9 treats the within-cluster association as a nuisance and, except for the estimated variance of the random effects, does not give a direct measure of it. For this reason, we modify the method of Chapter 9 and assume a multivariate error distribution in the form of a penalized multivariate normal mixture with a high number of components with equidistant means and constant covariance matrices. For expository and computational reasons we describe only a bivariate version of the model, as given by Komárek and Lesaffre (2006c), and apply it to the analysis of the right permanent first molars in Section 10.6. The approach of this chapter makes it possible to visualize the estimated bivariate error distribution and to evaluate the association of paired responses.

In Section 10.1, we specify the penalized mixture population-averaged AFT model. Further, the prior distributions are given and the posterior distribution is derived in Section 10.2. Section 10.3 provides the details of the Markov chain Monte Carlo method in the context of the model of this chapter. In Section 10.4, we show how the association between the paired responses can be evaluated.

Estimation of the survival distribution is discussed in Section 10.5. The analysis of the doubly-interval-censored caries times of the right permanent first molars is given in Section 10.6. Finally, we provide discussion in Section 10.7. 10.1 Model A similar notation as in Chapter 9 will be used here. That is, let Ui,l and Vi,l , i = 1, . . . , N, l = 1, 2 be the onset time and the failure time, respectively for the lth unit of the ith cluster in the study. Let Ti,l = Vi,l − Ui,l denote the corresponding event time The onset time Ui,l is only observed in an interval U ⌊uL i,l , ui,l ⌋. Similarly, we only know that the event time Vi,l lies in an interval L , v U ⌋. ⌊vi,l i,l Further, let xui,l be the vector of covariates which might have an effect on the onset time Ui,l and xti,l be the vector of covariates which can possibly influence the event time Ti,l . Additionally, we assume that the onset times vector (Ui,1 , Ui,2 )′ and the time-to-event vector (Ti,1 , Ti,2 )′ are, given the covariates, for each i independent (see Chapter 9 for a detailed discussion of this assumption) and that the interval censoring is independent and noninformative (e.g. pre-scheduled visits, see Section 2.4). The distribution of (Ui,1 , Ui,2 , Ti,1 , Ti,2 )′ , i = 1, . . . , N , given the covariates, is given by the following accelerated failure time model: log(Ui,l ) = δ ′ xui,l + ζi,l , log(Vi,l − Ui,l ) = log(Ti,l ) = β ′ xti,l (10.1) + εi,l , i = 1, . . . , N, (10.2) l = 1, 2, where δ = (δ1 , . . . , δmu )′ and β = (β1 , . . . , βmt )′ are unknown regression parameter vectors, ζ i = (ζi,1 , ζi,2 )′ , i = 1, . . . , N are i.i.d. random vectors with a bivariate density gζ (ζ1 , ζ2 ) and similarly, εi = (εi,1 , εi,2 )′ , i = 1, . . . , N i.i.d. random vectors with a bivariate density gε (ε1 , ε2 ). 10.1.1 Distributional assumptions Our model for the unknown bivariate densities gε (ε1 , ε2 ) and gζ (ζ1 , ζ2 ) is motivated by a penalized smoothing as introduced in Section 6.3.4 and directly 10.1. MODEL 195 generalizes the method used in Chapter 9 into higher dimensions. Let Y = (Y1 , Y2 )′ be a generic symbol for either ε = (ε1 , ε2 )′ or ζ = (ζ1 , ζ2 )′ and g(y) = g(y1 , y2 ) be a generic symbol for its density. We express the unknown density g(y) as a location-and-scale transformed finite mixture of bivariate normal densities with zero correlation over a fixed fine grid with knots µ(j1 ,j2) = (µ1,j1 , µ2,j2 )′ , j1 = −K1 , . . . , K1 , j2 = −K2 , . . . , K2 that are centered around zero, i.e. µ(0,0) = (0, 0)′ . The means of the bivariate normal components are equal to the knots and their covariance matrices are all equal but fixed to Σ = diag σ12 , σ22 . Thus, g(y) = (τ1 τ2 )−1 K1 X K2 X j1 =−K1 j2 =−K2 (10.3) y − α y − α 2 2 1 1 , wj1 ,j2 (A) ϕ2 µ(j1 ,j2 ) , Σ . τ1 τ2 In expression (10.3), the intercept term α = (α1 , α1 )′ and the scale parameters vector τ = (τ1 , τ2 )′ have to be estimated as well as the matrix A = (aj1 ,j2 ), j1 = −K1 , . . . , K1 , j2 = −K2 , . . . , K2 of the transformed weights. See (6.19) for the relationship between A and W = (wj1 ,j2 ), j1 = −K1 , . . . , K1 , j2 = −K2 ,. . . , K2 . The density of the zero-mean, unit-variance random vec′ tor Y ∗ = τ1−1 (Y1 − α1 ), τ2−1 (Y2 − α2 ) is a density of the bivariate normal mixture with uncorrelated components given by (6.21). In the following, let Gε refers to the set {Σε , µε , αε , τ ε , Wε , Aε , λε } which contains the parameters defining the distribution of ε and a smoothing parameter vector λε which we will discuss in Section 10.2.1. 
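Expression (10.3) is easy to transcribe into code. The sketch below (hypothetical Python; all parameter values are placeholders rather than estimates, and the transformed weights A are mapped to W by the softmax relationship referred to as (6.19)) evaluates the bivariate penalized-mixture density at a point:

```python
import numpy as np
from scipy.stats import norm

K1 = K2 = 15
mu1 = np.linspace(-4.5, 4.5, 2 * K1 + 1)    # knot positions in the first margin
mu2 = np.linspace(-4.5, 4.5, 2 * K2 + 1)    # knot positions in the second margin
sigma1 = sigma2 = 0.2                       # fixed basis standard deviations

rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(2 * K1 + 1, 2 * K2 + 1))   # placeholder transformed weights
W = np.exp(A) / np.exp(A).sum()                            # mixture weights w_{j1,j2}

alpha = np.array([0.0, 0.0])                # intercepts alpha_1, alpha_2 (placeholders)
tau   = np.array([1.0, 1.0])                # scales tau_1, tau_2 (placeholders)

def g(y1, y2):
    """Density (10.3) at (y1, y2); zero within-component correlation lets phi_2 factorize."""
    z1 = (y1 - alpha[0]) / tau[0]
    z2 = (y2 - alpha[1]) / tau[1]
    comp = np.outer(norm.pdf(z1, mu1, sigma1), norm.pdf(z2, mu2, sigma2))
    return (W * comp).sum() / (tau[0] * tau[1])

print(g(0.0, 0.0))
```

Averaging such evaluations over the retained MCMC draws of (A, α, τ) gives the posterior predictive error density of Section 10.5.2, expression (10.12).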
Similarly, let Gζ refers to the set µζ , αζ , τ ζ , Wζ , Aζ , λζ } which contains the parameters defining the distribution of ζ and a corresponding smoothing parameter vector λζ . Finally, let G be a generic symbol for Gε or Gζ . 10.1.2 Likelihood Let p denote a generic density. The likelihood contribution of the ith paired response equals Li = I uU i,1 uL i,1 I uU i,2 uL i,2 I U −u vi,1 i,1 L −u vi,1 i,1 I U −u vi,2 i,2 L −u vi,2 i,2 dti,2 dti,1 dui,2 dui,1 p(ui,1 , ui,2 , ti,1 , ti,2 ) 196 CHAPTER 10. BAYESIAN PENALIZED MIXTURE PA AFT MODEL = I uU i,1 uL i,1 I uU i,2 uL i,2 I p(ui,1 , ui,2 ) U −u vi,1 i,1 L −u vi,1 i,1 I U −u vi,2 i,2 L −u vi,2 i,2 (10.4) p(ti,1 , ti,2 ) dti,2 dti,1 dui,2 dui,1 , where p(ti,1 , ti,2 ) = (ti,1 ti,2 )−1 gε log(ti,1 ) − β ′ xi,1 , log(ti,2 ) − β ′ xi,2 , p(ui,1 , ui,2 ) = (ui,1 ui,2 )−1 gζ log(ui,1 ) − δ ′ z i,1 , log(ui,2 ) − δ ′ z i,2 , are obtained using the expression (10.3) for gε and gζ . In another context, Ghidey, Lesaffre, and Eilers (2004) used an expression similar to (10.3) to model a density of the random intercept and slope in the linear mixed model with uncensored data. Further, Bogaerts and Lesaffre (2006) used this approach to model a density of bivariate simply-intervalcensored data without covariates. In both papers, a penalized maximum likelihood method has been used to estimate unknown parameters. In our context, however, a maximum likelihood procedure is difficult and computationally almost intractable. Like in Chapter 9 we suggest to use the Bayesian approach together with MCMC methodology. 10.2 Bayesian hierarchical model Let θ be a vector of all unknown parameters in our model. We assume the hierarchical structure represented by the directed acyclic graph (DAG) shown in Figure 10.1. The DAG implies the following prior distribution: p(θ) ∝ N Y i=1 p vi,1 , vi,2 ui,1 , ui,2 , ti,1 , ti,2 × p ti,1 , ti,2 β, εi,1 , εi,2 × p ui,1 , ui,2 δ, ζi,1 , ζi,2 × ζ ζ ε ε p εi,1 , εi,2 Gε , ri,1 , ri,2 × p ζi,1 , ζi,2 Gζ , ri,1 , ri,2 × ζ ζ ε ε Gζ × , ri,2 p ri,1 , ri,2 Gε × p ri,1 p Gε × p Gζ × p δ × p β . (10.5) The DAG child-parent conditional distributions and priors for the parameters residing on the top of the hierarchy are similar to these used in Chapter 9. We give a brief overview and highlight the differences with the bivariate model considered here. 10.2. BAYESIAN HIERARCHICAL MODEL 10.2.1 197 Prior distribution for G The structure of the prior distribution of the generic node G is the same as in Section 9.2.1, i.e. p(G) ∝ p(A | λ) p(λ) p(α) p(τ ). With the bivariate setting, the number of unknown elements of the matrix A is naturally much higher than with the univariate setting used in Chapter 9, namely it is equal to (2K1 +1)×(2K2 +1) (e.g. equal to 961 in the analysis of Onset Event Gζ Σζ µζ δ αζ τζ xui,l λζ λε Aζ Aε Wζ Wε rζi,l rεi,l Gε τε εi,l ζi,l ui,l αε xti,l µε Σε β ti,l vi,l uL i,l uU i,l L vi,l censoringi,l U vi,l l = 1, 2 i = 1, . . . , N Figure 10.1: Directed acyclic graph for the Bayesian penalized mixture population-averaged AFT model. 198 CHAPTER 10. BAYESIAN PENALIZED MIXTURE PA AFT MODEL the Signal Tandmobielr data in Section10.6). With an uninformative prior for A, this could cause overfitting of the data or identifiability problems. Spatial prior for A Since the (transformed) mixture weights correspond to spatially located normal components, a Gaussian Markov random field (GMRF) prior (see, e.g., Besag et al., 1995, Section 3), common in spatial statistics, can be exploited here. 
Such a prior distribution can be defined by specifying the conditional distribution of each aj1 ,j2 given remaining ak1 ,k2 , (k1 , k2 ) 6= (j1 , j2 ), here denoted as A−(j1 , j2 ) , and the hyperparameter λ that controls the smoothness. Usually, only a few neighboring coefficients are effectively used in the specification of p(aj1 ,j2 | A−(j1 , j2 ) , λ). A commonly used conditional distribution is a normal distribution with expectation and variance equal to aj −1,j2 + aj1 +1,j2 + aj1 ,j2 −1 + aj1 ,j2 +1 − E aj1 ,j2 | A−(j1 ,j2 ) , λ = 1 2 aj1 −1,j2 −1 + aj1 −1,j2 +1 + aj1 +1,j2 −1 + aj1 +1,j2 +1 , 4 var aj1 ,j2 | A−(j1 ,j2 ) , λ = (4λ)−1 , (10.6) respectively, based on the eight nearest neighbors and local quadratic smoothing. Note that the expectation and variance formulas have to be changed appropriately on edges where only five neighbors are available and in corners where we have only three neighbors out of the original eight. Namely, for the edge given by j1 = K1 : aK1 ,j2−1 + aK1 ,j2 +1 E aK1 ,j2 | A−(K1 ,j2 ) , λ = aK1 −1,j2 + − 2 aK1 −1,j2 −1 + aK1 −1,j2+1 2 −1 var aK1 ,j2 | A−(K1 ,j2 ) , λ = (2λ) , j2 = −K2 + 1, . . . , K2 − 1, and similarly for the remaining edges. In the corner given by (j1 , j2 ) = (K1 , K2 ): E aK1 ,K2 | A−(K1 ,K2) , λ = aK1 −1,K2 + aK1 ,K2−1 − aK1 −1,K2 −1 , var aK1 ,K2 | A−(K1 ,K2) , λ = λ−1 , and similarly for the remaining corners. Let a denote the matrix A stacked into a column vector. Using a bivariate difference operator ∆ aj1 ,j2 = aj1 ,j2 − aj1 +1,j2 − aj1 ,j2 +1 + aj1 +1,j2 +1 , 10.2. BAYESIAN HIERARCHICAL MODEL 199 and denoting D the associated difference operator matrix, the joint prior of all transformed weights A given the smoothing hyperparameter λ can be written as n λ p(A | λ) ∝ exp − 2 KX 1 −1 KX 2 −1 ∆ aj1 ,j2 j1 =−K1 j2 =−K2 2 o λ = exp − a′ D′ Da (10.7) 2 which shows that the DAG conditional distribution p(A | λ) specified as a GMRF − − is multivariate normal with covariance matrix λ−1 D′ D , where D′ D denotes a generalized inverse of D′ D. Although this distribution is improper (the matrix D′ D has a deficiency of 2(K1 + K2 ) + 1 in its rank) the resulting posterior distribution is proper as soon as there is some informative data available, see Besag et al. (1995). Conditionally univariate difference prior An alternative prior, still belonging to the class of GMRF, corresponding closely to the prior for A used in Chapter 9 is obtained by considering a univariate difference operator for each row and each column of the matrix A with possibly two different smoothing hyperparameters stacked in a vector λ = (λ1 , λ2 )′ acting on rows and columns separately. Then λ1 p(A | λ) ∝ exp − 2 − K1 X K2 X ∆s1 aj1 ,j2 j1 =−K1 j2 =−K2 +s λ2 2 K2 X K1 X j2 =−K2 j1 =−K1 +s 2 ∆s2 aj1 ,j2 2 n 1 o (10.8) = exp − a′ λ1 D′1 D1 + λ2 D′2 D2 a 2 where ∆sl , l = 1, 2 denotes a difference operator of order s for the lth dimension, e.g. ∆31 aj1 ,j2 = aj1 ,j2 − 3 aj1 ,j2 −1 + 3 aj1 ,j2 −2 − aj1 ,j2 −3 and D1 and D2 are the corresponding difference operator matrices for each dimension. This prior distribution corresponds to a local polynomial smoothing of degree s−1 in each row and each column of the matrix A. For example, the conditional mean and variance are given (for s = 3 and except on the corners and on edges) by λ1 Aj2 | j1 + λ2 Aj1 | j2 E aj1 ,j2 | A−(j1 ,j2) , λ = λ1 + λ2 1 , var aj1 ,j2 | A−(j1 ,j2) , λ = 20(λ1 + λ2 ) (10.9) 200 CHAPTER 10. BAYESIAN PENALIZED MIXTURE PA AFT MODEL where Ak | j = aj,k−3 − 6 aj,k−2 + 15 aj,k−1 + 15 aj,k+1 − 6 aj,k+2 + aj,k+3 . 
20 Both the spatial prior for A and the conditionally univariate difference prior for A put higher probability mass in areas where spatially close coefficients of the matrix A do not substantially differ. In other words, a priori, we believe that the estimated densities gζ (ζ1 , ζ2 ) and gε (ε1 , ε2 ) are smooth. In general, prior (10.8) leads to better a fit in our context and hence is preferred. Prior for the smoothing parameter The λ parameter in the prior (10.7) or the components λ1 , λ2 of the λ parameter in the prior (10.8) determine, together with the fixed difference operator matrix D, the precision of the transformed weights A. We assign these parameters standardly used highly dispersed (but proper) Gamma priors. Prior for the mixture intercepts and scales The intercept parameters αε1 , αε2 , αζ1 , αζ2 can obtain a vague normal prior unless there is some external information available. For the scale parameters τ1ε , τ2ε , τ1ζ , τ2ζ we suggest to use either the uniform prior or a highly dispersed inverse-Gamma prior for the squared scale parameters. 10.2.2 Prior distribution for the generic node Y To specify the prior distribution of the generic node Y , i.e. of the nodes εi and ζ i , i = 1, . . . , N , we introduce, analogously to Chapter 9 and using the idea of Bayesian data augmentation (see Section 4.3), latent allocation vector r = (r1 , r2 )′ that can take discrete values from {−K1 , . . . , K1 } × {−K2 , . . . , K2 }. Its DAG conditional distribution is given by Pr(r = (j1 , j2 )′ | G) = Pr(r = (j1 , j2 )′ | W) = wj1 , j2 , j1 ∈ {−K1 , . . . , K1 }, j2 ∈ {−K2 , . . . , K2 }. The DAG conditional distribution of the generic node Y is then simply bivariate normal with independent margins: p(y1 , y2 | G, r1 , r2 ) = ϕ2 y α + diag(τ )µ(r1 , r2 ) , diag(τ ) Σ diag(τ ) . Without introducing the latent allocation vectors we would have to work with p(y | G) = p(y | µ, Σ, α, τ , W) which is a bivariate normal mixture given by (10.3). 10.3. MARKOV CHAIN MONTE CARLO 10.2.3 201 Prior distribution for the regression parameters and time variables The prior distribution of the regression parameters and the time variables is exactly the same as in Chapter 9. That is, the regression parameter vectors β and δ are given a vague normal prior unless there is some external U L U L U information available. Finally, the nodes uL i,l , ui,l , vi,l , vi,l , ti,l and ti,l have, conditionally on their parents, the Dirac distribution driven by the censoring mechanism and the true onset, failure or event time, respectively. See Section 9.2.5 with an obvious change in notation. Finally, remember that we do not have to specify an exact form of the censoring mechanism as soon as it is noninformative and independent. 10.2.4 Posterior distribution The posterior distribution is given as a product of all DAG conditional distributions. See Section 9.2.6 for details. 10.3 Markov chain Monte Carlo In practice we obtain a sample from the posterior distribution using the Markov chain Monte Carlo method and base our inference on this sample. Analogously to Chapter 9, the basis for the MCMC algorithm is Gibbs sampling (Geman and Geman, 1984) using the full conditional distributions. In the situations when the full conditional distribution was not of standard form we used either slice sampling (Neal, 2003) or adaptive rejection sampling (Gilks and Wild, 1992). For most parameters the full conditionals are identical (with only a slight change in notation) to those given in Section 9.3 and we refer the reader thereinto. 
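For reference, the penalty in the conditionally univariate difference prior (10.8) has a Kronecker structure that is convenient to build explicitly. The sketch below (hypothetical Python; it assumes that A is stacked into a column by column, and the pairing of λ1, λ2 with the two directions may need to be swapped depending on the convention behind D1 and D2) constructs λ1 D1'D1 + λ2 D2'D2 for third-order differences on the 31 × 31 grid:

```python
import numpy as np

def third_diff(n):
    """(n - 3) x n matrix applying third-order differences to a vector of length n."""
    return np.diff(np.eye(n), n=3, axis=0)

n1, n2 = 31, 31                      # (2*K1 + 1) x (2*K2 + 1) transformed weights
D1 = third_diff(n1)                  # differences along the first dimension (within a column of A)
D2 = third_diff(n2)                  # differences along the second dimension (within a row of A)
lam1, lam2 = 1.0, 1.0                # smoothing hyperparameters (placeholder values)

# Precision matrix of a = vec(A) with column-major stacking
P = lam1 * np.kron(np.eye(n2), D1.T @ D1) + lam2 * np.kron(D2.T @ D2, np.eye(n1))

# p(A | lambda) is proportional to exp(-a' P a / 2); the penalty is rank deficient,
# so the prior is improper, as noted for the spatial prior (10.7) as well.
print(P.shape, np.linalg.matrix_rank(P))
```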
Here we mention only the full conditional distribution for the transformed mixture weights which, due to the bivariate nature considered here, differs 202 CHAPTER 10. BAYESIAN PENALIZED MIXTURE PA AFT MODEL from that in Chapter 9 and is equal to exp(Nj1 ,j2 aj1 ,j2 ) oN × K2 P exp(ak1 ,k2 ) p(aj1 ,j2 | · · · ) ∝ n K P1 k1 =−K1 k2 =−K2 o2 # " n aj1 ,j2 − E aj1 ,j2 | A−(j1 ,j2 ) , λ exp − , 2 var aj1 ,j2 | A−(j1 ,j2 ) , λ j1 = −K1 , . . . , K1 , j2 = −K2 , . . . , K2 , where Nj1 ,j2 denotes the number of latent allocation vectors r i that are equal ′ to (j1 , j2 ) and E aj1 ,j2 | A−(j1 ,j2) , λ and var aj1 ,j2 | A−(j1 ,j2 ) , λ follow from (10.6) or (10.9). 10.4 Evaluation of association The association between the paired responses, after adjustment for the effect of covariates, can be evaluated for example using the Pearson correlation coefficient of the error terms ζi,1 and ζi,2 , or εi,1 and εi,2 , respectively. For example, the Pearson correlation coefficient of the error terms εi,1 and εi,2 equals ̺ε = n K1 P K2 P j1 =−K1 j2 =−K2 (σ1ε )2 where wjε1 + = + K1 P j1 =−K1 K2 X wjε1 + wjε1 ,j2 , j2 =−K2 ε w+j = 2 K1 X j1 =−K1 wjε1 ,j2 , µε1,j1 wjε1 ,j2 µε1,j1 − M1ε µε2,j2 − M2ε − M1ε o 12 n ε (σ2 )2 + K2 P j2 =−K2 j1 = −K1 , . . . , K1 , M1ε = j2 = −K2 , . . . , K2 , M2ε = ε w+j 2 µε2,j2 K1 X − M2ε o 21 wjε1 + µε1,j1 , j1 =−K1 K2 X ε w+j µε . 2 2,j2 j2 =−K2 Another popular measure of association for censored data is the Kendall’s tau, denoted by τKend, of which one advantage is that it is invariant towards monotone transformations. This implies in our context that after adjustment for the effect of covariates, the same value of the Kendall’s tau is obtained for both the original event times and for their logarithmic transformation , 10.5. BAYESIAN ESTIMATES OF THE SURVIVAL DISTRIBUTION 203 represented by the error terms. For example for the time-to-event part of the ε is equal to model, given the model parameters, the Kendall’s tau τKend ε τKend =4 · K1 X K2 X K1 X K2 X wjε1 ,j2 wkε1 ,k2 j1 =−K1 j2 =−K2 k1 =−K1 k2 =−K2 µε − µε µε − µε 2,j2 1,j1 √ ε 1,k1 Φ √ ε 2,k2 Φ 2σ1 2σ2 − 1, see Bogaerts and Lesaffre (2006) for details. 10.5 Bayesian estimates of the survival distribution 10.5.1 Predictive survival nad hazard curves and predictive survival densities The estimates of the survival and hazard functions or the survival densites for a specific combination of covariates are estimated by the mean of (posterior) predictive quantities. In practice, this is done analogously to Sections 8.4 and 9.4. However, due to the bivariate approach in this chapter, we have to distinguish between the quantities for the first margin: the onset time U1 and the event time T1 and for the second margin: the onset time U2 and the event time T2 . For example, to get the Bayesian estimate of the predictive survival function of the event time T1 , given the covariates xtnew and z tnew , we can use the relationship (8.17) while replacing the expression (8.16) by S1 (t1 | θ, xtnew , z tnew ) = 1− K1 X j1 =−K1 (10.10) wjε1 ,+ Φ log(t1 ) − β ′ xtnew − b′ z tnew αε1 + τ1ε µε1,j1 , (σ1ε τ1ε )2 . To get the Bayesian estimate of the predictive survival density of the event time T1 , we replace the expression (8.18) by p1 (t1 | θ, xtnew , z tnew ) = t−1 1 K1 X j1 =−K1 (10.11) wjε1 ,+ ϕ log(t1 ) − β ′ xtnew − b′ z tnew αε1 + τ1ε µε1,j1 , (σ1ε τ1ε )2 . Analogously, the quantities for the event time T2 in the second margin and for the onset times U1 and U2 are obtained. 204 CHAPTER 10. 
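Given one posterior draw of the parameters, the predictive survival function (10.10) for the first margin is a finite sum over the marginal weights w_{j1,+}. A sketch is given below (hypothetical Python; weights, knots and parameters are placeholders, and `eta` stands for the whole linear predictor subtracted inside Φ in (10.10)); the Bayesian estimate averages such curves over the retained MCMC iterations, and the predictive density (10.11) and hazard are obtained in the same way.

```python
import numpy as np
from scipy.stats import norm

K1 = 15
mu1 = np.linspace(-4.5, 4.5, 2 * K1 + 1)               # knots of the first margin
sigma1 = 0.2
W = np.full((2 * K1 + 1, 2 * K1 + 1), 1 / (2 * K1 + 1) ** 2)   # placeholder weights w_{j1,j2}
w_marg = W.sum(axis=1)                                 # marginal weights w_{j1,+}
alpha1, tau1 = 0.0, 1.0                                # intercept and scale (placeholders)

def surv_T1(t, eta):
    """S_1(t | theta, covariates) as in (10.10) for a single posterior draw."""
    z = np.log(t) - eta
    return 1.0 - (w_marg * norm.cdf(z[:, None],
                                    loc=alpha1 + tau1 * mu1,
                                    scale=sigma1 * tau1)).sum(axis=1)

t_grid = np.linspace(0.1, 6.0, 200)                    # time since emergence (years)
print(surv_T1(t_grid, eta=2.5)[:5])
```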
BAYESIAN PENALIZED MIXTURE PA AFT MODEL 10.5.2 Predictive error densities The MCMC estimates of the predictive error densities are obtained in the same way as explained in Section 9.4.2. We only have to use a bivariate counterpart of the expression (9.13), i.e. for the event error density we use M 1 X ε,(m) ε,(m) −1 τ1 τ2 ĝε (e1 , e2 ) = M m=1 K1 X K2 X j1 =−K1 j2 =−K2 10.6 ε,(m) wj1 ,j2 (10.12) e − αε,(m) e − αε,(m) ε 1 2 ε 1 2 ϕ2 , . µ(j1 ,j2 ) , Σ ε,(m) ε,(m) τ1 τ2 Example: Signal Tandmobielr study – paired doubly-interval-censored data In Section 9.7, we have analyzed the time to caries of the permanent first molars based on the data from the Signal Tandmobielr study using the clusterspecific AFT model. The results were compared to the earlier analysis of Leroy et al. (2005). In this section, we perform a similar analysis. However, for practical reasons (see the introduction to this chapter) it is only possible to analyze a pair of teeth. In our analysis, we concentrated on differences between the maxillary (upper) and mandibular (lower) teeth and analyzed separately the pair of right teeth (teeth 16 and 46) and the pair of left teeth (teeth 26 and 36). The results for both pairs were very similar so we report only the results for the right teeth in this thesis. Due to the fact that a (parametric) population-averaged AFT model was used by Leroy et al. (2005), the results presented in this section can even more closely be compared to their findings. The analysis proceeded in a similar way as in Section 9.7 with only changes related to the fact we analyze only two teeth now. Specifically, the onset time Ui,l , i = 1, . . . , N , l = 1, 2 refers to the age (in years) of the ith child at which the lth tooth (l = 1 ≡ tooth 16, l = 2 ≡ tooth 46) emerged. The failure time Vi,l , i = 1, . . . , N , l = 1, . . . , 2 refers to the onset of caries and the event time Ti,l to the time between the emergence and the onset of caries. As explained in Section 9.7, left-censored emergence times were transformed into interval-censored ones based on the clinical eruption stage. Finally, as in Section 9.7, we subtracted 5 years from all observed times, i.e. log(Ui,l − 5) was used in the left-hand side of the model formula (10.1). Analogously to Section 9.7, we started the analysis with the Basic Model and based on 10.6. EXAMPLE: SIGNAL TANDMOBIELr STUDY – PAIRED DOUBLY-INTERVAL-CENSORED DATA 205 the results for the Basic Model we subsequently fitted its simplified version, referred as the Final Model. 10.6.1 Basic Model In the Basic Model we allowed for a different effect of the covariates on both emergence and caries experience for the maxillary and mandibular tooth, respectively. That is, in the AFT models (10.1) and (10.2) we used the following covariate vectors xui,l and xti,l for the emergence and caries parts of the model, respectively. xui,l = (gender i , jawi,l ∗ genderi )′ , xti,l = (x̃ti,l , jawi,l ∗ x̃ti,l ), where x̃ti,l = (gender i , statusDi,l , statusFi,l , statusMi,l , brushingi , sealantsi,l , plaquePFi,l , plaqueTi,l ). The covariate jaw is dichotomous (1 = maxilla, 0 = mandible) and distinguishes between the maxillary and mandibular tooth. It replaces the covariates tooth26, tooth36, tooth46 used in Section 9.7.1. Note that as well in the caries part as in the emergence part of the model the main effect of jaw is expressed by the intercept terms αε and αζ , respectively. See Section 9.7.1 for the explanation of the remaining covariates. 
10.6.2 Final Model In the Final Model, we excluded all interaction terms with the covariate jaw, i.e. we assumed that the studied factors have the same effect on the emergence and caries for both the maxillary and mandibular tooth. Additionally, as in Section 9.7, we binarized the covariates plaquePF, plaqueT and statusD, statusF, statusM into new covariates plaque and status, respectively. Bayesian two-sided p-values and for factors with more than two levels simultaneous two-sided Bayesian p-values (see Section 4.6.2) were used to arrive at the Final Model. 10.6.3 Prior distribution To model the bivariate densities gζ and gε we used in both cases a grid of 31× 31 (K1 = K2 = 15) knots with the distance d between the two knots in each 206 CHAPTER 10. BAYESIAN PENALIZED MIXTURE PA AFT MODEL margin equal to 0.3 and the basis standard deviations σ1ε = σ2ε = σ1ζ = σ2ζ = 0.2. The grid of knots is defined on a square [−4.5, 4.5] × [−4.5, 4.5] which covers the support of most standardized unimodal distributions (unimodality was checked after the analysis). For the transformed mixture weights Aε and Aζ we used the prior (10.8) with the differences of the third order (s = 3). The smoothing parameters λε1 , λε2 , λζ1 , λζ2 were all assigned dispersed Gamma(1, 0.005) priors. The same priors were used also for the scale parameters τ1ε , τ2ε , τ1ζ , τ2ζ . The intercept terms αε1 , αε2 , αζ1 , αζ2 as well as regression parameters contained in vectors β and δ were all assigned dispersed N (0, 100) priors. 10.6.4 Results For each model we ran 250 000 iterations with 1:3 thinning and kept last 25 000 iterations for the inference. Sampling for each model took about 68 hours on a 3 GHz Pentium IV PC with 1 GB RAM. Results for the Basic Model Table 10.1 shows the posterior medians, (simultaneous) 95% equal-tail credible intervals and (simultaneous) Bayesian two-sided p-values for the effect of each considered factor on emergence and caries experience, separately for the maxillary and the mandibular tooth. It is seen that the results for the mandibular and the maxillary tooth are very similar. Indeed, the interaction terms between jaw and the remaining factor variables were all non-significant at 5%, namely, the p-values were > 0.5, > 0.5, > 0.5, 0.262, > 0.5, 0.145, respectively for the interaction with gender in the emergence and the caries part of the model, and for the interaction with brushing, sealants, plaque, and status, respectively. Additionally, we computed the (simultaneous) Bayesian two-sided p-values for the two contrasts justifying the simplification of the covariates plaque and status for the Final Model, again separately for the mandibular and the maxillary tooth. For the variable status contrast decayed vs. filled vs. missing due to caries, the p-values were equal to 0.342 and 0.308, respectively for the maxillary and the mandibular tooth, respectively. For the variable plaque contrast in pits and fissures vs. on total surface, the p-values were equal to 0.262 and 0.301, respectively for the maxillary and the mandibular tooth, respectively. 10.6. EXAMPLE: SIGNAL TANDMOBIELr STUDY – PAIRED DOUBLY-INTERVAL-CENSORED DATA 207 Table 10.1: Signal Tandmobielr study, Basic Model. Posterior medians, 95% equal-tail credible regions (CR) and Bayesian two-sided p-values for each factor variable, separately for the maxillary tooth 16 and the mandibular tooth 46. 
Effect Maxillary tooth 16 Posterior median 95% CR Mandibular tooth 46 Posterior median 95% CR Emergence Gender girl Gender girl Status decayed filled missing Brushing daily Sealants present Plaque in pits and fissures on total surface −0.018 p = 0.094 (−0.039, 0.003) −0.016 Caries p = 0.534 −0.035 (−0.139, 0.073) −0.049 p < 0.001 −0.449 (−0.704, −0.224) −0.379 −0.627 (−0.844, −0.414) −0.375 −0.470 (−1.377, 0.138) −0.726 p = 0.003 0.226 (0.086, 0.386) 0.265 p = 0.019 0.158 (0.028, 0.283) 0.055 p = 0.014 −0.183 (−0.333, −0.031) −0.252 −0.389 (−0.819, −0.015) −0.468 p = 0.142 (−0.036, 0.005) p = 0.403 (−0.162, 0.063) p < 0.001 (−0.641, −0.151) (−0.588, −0.175) (−1.398, −0.208) p < 0.001 (0.097, 0.426) p = 0.401 (−0.077, 0.180) p = 0.002 (−0.404, −0.107) (−0.997, −0.038) Results for the Final Model Results for the Final Model are given in Table 10.2. This table contains also the main effect of jaw which is given by E(ζ2 ) − E(ζ1 ) and by E(ε2 ) − E(ε1 ) in the case of emergence and caries, respectively. It is seen that the lower tooth 46 emerges slightly later than the upper tooth 16. On the other hand, emergence occurs slightly earlier for girls than for boys. However, neither the position of the tooth nor gender have a significant 208 CHAPTER 10. BAYESIAN PENALIZED MIXTURE PA AFT MODEL Table 10.2: Signal Tandmobielr study, Final Model. Posterior medians, 95% equal-tail credible regions (CR) and Bayesian two-sided p-values for each factor variable. Emergence Effect Jaw lower Gender girl Status dmf Brushing daily Sealants present Plaque present Posterior median Caries 95% CR p = 0.021 0.017 (0.003, 0.032) p = 0.018 −0.017 (−0.033, −0.003) Posterior median 0.024 −0.044 −0.482 0.249 0.110 −0.228 95% CR p = 0.816 (−0.158, 0.218) p = 0.267 (−0.120, 0.033) p < 0.001 (−0.576, −0.388) p < 0.001 (0.139, 0.369) p = 0.022 (0.019, 0.195) p < 0.001 (−0.313, −0.141) Table 10.3: Signal Tandmobielr study, Final Model. Posterior medians and 95% equal-tail credible regions (CR) for the mean, standard deviation, Pearson correlation and Kendall’s tau of the error terms. Param. E(ζ1 ) E(ζ2 ) sd(ζ1 ) sd(ζ2 ) ̺ζ ζ τKend Posterior median 95% CR Emergence 0.392 (0.379, 0.409 (0.397, 0.170 (0.163, 0.170 (0.164, 0.037 (0.030, 0.022 (0.016, 0.404) 0.421) 0.178) 0.177) 0.050) 0.030) Param. E(ε1 ) E(ε2 ) sd(ε1 ) sd(ε2 ) ̺ε ε τKend Posterior median 95% CR Caries 2.846 (2.645, 2.870 (2.706, 1.737 (1.631, 1.812 (1.722, 0.023 (0.018, 0.011 (0.008, 3.043) 3.040) 1.855) 1.918) 0.028) 0.013) 10.6. EXAMPLE: SIGNAL TANDMOBIELr STUDY – PAIRED DOUBLY-INTERVAL-CENSORED DATA 209 80 60 40 Posterior density 60 40 0 0 20 20 Posterior density 100 80 120 Emergence 0.025 0.035 0.045 0.055 0.015 0.020 0.025 0.030 ζ τKend ̺ζ 200 150 0 0 50 100 Posterior density 100 50 Posterior density 150 250 Caries 0.015 0.020 0.025 ̺ ε 0.030 0.006 0.010 0.014 ε τKend Figure 10.2: Signal Tandmobielr study, Final Model. Scaled histograms for sampled Pearson correlation and Kendall’s tau between the error terms. 210 CHAPTER 10. BAYESIAN PENALIZED MIXTURE PA AFT MODEL effect on the time to caries. The remaining factors do influence significantly the time to caries, namely, daily brushing increases this time with a factor of exp(0.249) = 1.283, presence of sealants with a factor of exp(0.110) = 1.116. The factor for presence of plaque is exp(−0.228) = 0.796 and when the adjacent deciduous second molar was not sound the factor is exp(−0.482) = 0.618. 
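Table 10.3 below summarizes the residual association through the Pearson correlation ̺ and Kendall's tau of the error terms. By the formulas of Section 10.4, both are functions of the mixture weights, knots and basis standard deviations only (the intercepts and scales drop out). A sketch of the computation for one posterior draw, in hypothetical Python with placeholder weights rather than the fitted values, is:

```python
import numpy as np
from scipy.stats import norm

K1 = K2 = 15
mu1 = np.linspace(-4.5, 4.5, 2 * K1 + 1)
mu2 = np.linspace(-4.5, 4.5, 2 * K2 + 1)
sigma1 = sigma2 = 0.2

rng = np.random.default_rng(2)
A = rng.normal(size=(2 * K1 + 1, 2 * K2 + 1))     # placeholder transformed weights
W = np.exp(A) / np.exp(A).sum()                   # weights w_{j1,j2}

# Pearson correlation of the error terms (Section 10.4)
w1, w2 = W.sum(axis=1), W.sum(axis=0)             # marginal weights w_{j1,+} and w_{+,j2}
m1, m2 = w1 @ mu1, w2 @ mu2                       # mixture means M_1, M_2
var1 = sigma1 ** 2 + w1 @ (mu1 - m1) ** 2
var2 = sigma2 ** 2 + w2 @ (mu2 - m2) ** 2
cov = ((mu1 - m1)[:, None] * (mu2 - m2)[None, :] * W).sum()
rho = cov / np.sqrt(var1 * var2)

# Kendall's tau (Section 10.4): a double sum over pairs of mixture components
F1 = norm.cdf((mu1[:, None] - mu1[None, :]) / (np.sqrt(2) * sigma1))
F2 = norm.cdf((mu2[:, None] - mu2[None, :]) / (np.sqrt(2) * sigma2))
tau_kendall = 4.0 * np.einsum('jk,lm,jl,km->', W, W, F1, F2, optimize=True) - 1.0

print(rho, tau_kendall)
```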
It is seen that the results given in Table 10.2 are slightly different from the summary given in Table 9.1 which relates to the earlier joint analysis of all four permanent first molars using the cluster-specific (conditional) AFT model. Especially, the effect of the covariate status appears to be more profound when evaluated using the population-averaged (marginal) model. However, the conclusions concerning a beneficial effect of sealing and daily brushing and an indisposed effect of not sound primary predecessors or plaque on the caries process on the permanent first molars are the same irrespective of the used model. Further, Table 10.3 shows the mean and standard deviation of the error terms and also the residual association (after adjustment for the effect of covariates) between the maxillary and the mandibular tooth. For both the emergence and the caries processes, a very low posterior median for the Pearson correlation coefficient was found on the log-scale and the same is true also for the Kendall’s tau. Moreover, as seen in Figure 10.2, the whole posterior distribution for the correlation coefficients and the Kendall’s taus is concentrated in the neighborhood of zero. Figures 10.3 and 10.4 show the estimates of the error densities gζ (ζ) and gε (ε) and their margins and illustrate the smoothing nature of our approach. These figures also reveal the low association between error terms for the upper and lower tooth. For the interpretation of the figure, we must take into account that about 75% of the caries times were right-censored and practically all around 12 years of age, which is 5 to 6 years after emergence. This implies that in fact each margin is identifiable from the data only up to approximately the first quartile. The right tail of the density is extrapolated from the left tail using the weights distributed according to the GMRF prior. It also implies that the association might be underestimated, see, e.g. Bogaerts and Lesaffre (2006). Figure 10.5 shows the predictive survival and hazard functions for caries on the upper tooth 16 of boys and ‘the best’, ‘the worst’ and two intermediate combinations of covariates. Corresponding curves for the lower tooth 46 or for girls are almost the same due to the non-significant effect of the covariates gender and jaw on the caries. For teeth that are not brushed daily and are exposed to other risk factors, a high peak in the hazard function is 10.6. EXAMPLE: SIGNAL TANDMOBIELr STUDY – PAIRED DOUBLY-INTERVAL-CENSORED DATA 0.4 0.0 0.2 ζ2 0.6 0.8 211 0.0 0.2 0.4 0.6 0.8 0.5 1.0 gζ (ζ2 ) 1.0 0.0 0.0 0.5 gζ (ζ1 ) 1.5 1.5 2.0 2.0 ζ1 0.0 0.2 0.6 0.4 ζ1 0.8 1.0 0.0 0.2 0.6 0.4 0.8 1.0 ζ2 Figure 10.3: Signal Tandmobielr study, Final Model. Estimate of the density gζ (ζ1 , ζ2 ) and the corresponding marginal densities gζ (ζ1 ) and gζ (ζ2 ) of the error terms in the emergence part of the model. CHAPTER 10. BAYESIAN PENALIZED MIXTURE PA AFT MODEL 4 0 2 ε2 6 8 212 0 2 4 6 8 0.05 0.10 gε (ε2 ) 0.10 0.00 0.00 0.05 gε (ε1 ) 0.15 0.15 0.20 0.20 ε1 0 2 4 ε1 6 8 0 2 4 6 8 ε2 Figure 10.4: Signal Tandmobielr study, Final Model. Estimate of the density gε (ε1 , ε2 ) and the corresponding marginal densities gε (ε1 ) and gε (ε2 ) of the error terms in the caries part of the model. The shaded part in the marginal densities extends to the first quartile. 10.6. 
EXAMPLE: SIGNAL TANDMOBIELr STUDY – PAIRED DOUBLY-INTERVAL-CENSORED DATA 213 0.4 0.6 0.8 0.0 0.2 Caries free 1.0 Caries: survival function 0 1 2 3 4 5 6 5 6 Time since emergence (years) 0.00 0.05 0.10 0.15 0.20 Hazard of caries Caries: hazard function 0 1 2 3 4 Time since emergence (years) Figure 10.5: Signal Tandmobielr study, Final Model. Posterior predictive caries free (survival) and caries hazard curves for tooth 16 of boys and the following combinations of covariates: solid and dashed lines for no plaque, present sealing, daily brushing and sound primary second molar (solid line) or dmf primary second molar (dashed line), dotted and dotted-dashed lines for present plaque, no sealing, not daily brushing and sound primary second molar (dotted line) or dmf primary second molar (dotted-dashed line). 214 CHAPTER 10. BAYESIAN PENALIZED MIXTURE PA AFT MODEL observed already less than 1 year after emergence. A similar peak, however shifted to right and of much lower magnitude is seen also for other covariate combinations. The same peak, also approximately of the same magnitude, has already been found when analyzing all four permanent first molars using the cluster-specific AFT model in Chapter 9 (see Figure 9.4) and can be explained by the fact that permanent first molars are most vulnerable by caries soon after they emerge, possibly because of not yet fully developed enamel on their surfaces. However, when using the population-averaged model we do not see the second period of increased hazard for the ‘worse’ combinations of covariates as we have seen in Figure 9.4. This alleged difference between the results of the population-averaged and cluster-specific model could be caused by a failure to compare like with like, see Lee and Nelder (2004) for a deeper discussion to this point. 10.7 Discussion In this chapter, we have suggested a semiparametric method to analyze bivariate doubly-interval-censored data in the presence of covariates. The method was applied to the analysis of a dental data set where all covariates were categorical. However, continuous covariates would not cause any difficulties and could have been used as well. Although the method was presented to deal with doubly-interval-censored data it can be used to analyze also simple interval- or right-censored data. Further, using the ideas outlined in Section 6.3.4, the method of this chapter could theoretically be extended to handle not only bivariate data but also data of an arbitrary dimension (i.e. ni > 2 for all i). However, the number of unknown parameters increases exponentially and the estimation becomes quite fast computationally intractable. A disadvantage of the current method is that it requires balanced data, i.e. exactly two observations must be supplied for each cluster and if only one observation of the cluster is missing the whole cluster must be removed from the analysis. Missingness in one event time out of the pair could have been solved using the Bayesian data augmentation in the same way as it solves the problem of censoring. However, if the missingness is caused by a missing covariate value, the Bayesian data augmentation would not help unless a measurement model is set up also for the covariates. With unbalanced data, the cluster specific approach of Chapter 9 can be used, however. 
Chapter 11

Overview and Further Research

In this thesis, we have developed several modifications of the accelerated failure time model for the analysis of multivariate (doubly-)interval-censored data while making only weak distributional assumptions. We now give an overview and list topics for future research.

11.1 Overview

Chapter 1 introduces several data sets that motivate the developments presented in the thesis. The data sets are then used to illustrate the use of the presented methods in practical situations. Chapter 2 briefly explains several notions used in the area of survival data and introduces the notation used in the thesis. An overview of regression models for the analysis of survival data is given in Chapter 3. We described the Cox proportional hazards (PH) model and the accelerated failure time (AFT) model as the most popular models in this area. For the reasons stated in Section 3.3 we chose the accelerated failure time model as the basis for all developments in this thesis. In Chapter 4, we discuss the form of the likelihood in the case of (multivariate) (doubly-)interval-censored data and show several advantages of Bayesian inference over maximum-likelihood estimation in such situations. Further, we suggest using Markov chain Monte Carlo methodology as the means of Bayesian estimation. The final chapter of the introductory part of the thesis, Chapter 5, gives an overview of existing methods for the analysis of interval-censored data and shows in detail a Bayesian analysis of the dental multivariate doubly-interval-censored data using a PH model with piecewise constant baseline hazard functions.

The main part of the thesis starts with Chapter 6, where we describe two slightly different classes of models for flexible modelling of continuous densities. Firstly, a classical normal mixture is introduced and, secondly, we propose a penalized normal mixture, motivated by penalized B-splines, as a useful tool to model unknown densities. Both approaches are subsequently used in the AFT models to express either the error density or the density of the random effects.

Chapter 7 gives the AFT model for univariate interval-censored data where the error distribution is specified as the penalized normal mixture. The inference is based on the maximum-likelihood paradigm. The model is further extended to allow not only the mean of the response but also its scale to depend on covariates.

The AFT models presented in the subsequent chapters can also handle multivariate (doubly-)interval-censored data. However, for the reasons discussed in Chapter 4 and in Section 7.8, we switch to Bayesian inference. Firstly, Chapter 8 gives the AFT model with normal random effects (the cluster-specific model) and the distribution of the error term specified as the classical normal mixture. Secondly, Chapter 9 presents the cluster-specific AFT model where the error distribution, and in the case of univariate random effects also the distribution of the random effects, is specified as the penalized normal mixture. In this chapter, we also explicitly show and illustrate the use of the proposed methods in the context of doubly-interval-censored data. Finally, Chapter 10 gives the population-averaged AFT model for paired (doubly-)interval-censored data where the error distribution is given by a bivariate penalized normal mixture.
11.2 Generalizations and improvements

In this section, we list several topics to generalize or improve the models presented in this thesis.

Time-dependent covariates and joint modelling of survival data and longitudinal profiles

In many applications of survival analysis, it is of interest to evaluate the effect of factors that evolve over time. The values of such factors (e.g., blood pressure, dose of medication, etc.) are typically determined at (prespecified) occasions and it is assumed that they remain constant (deterministic) until the next occasion. In the last decade, several models have been developed for the joint modelling of the evolution of such factors (longitudinal data analysis) and the time-to-event; see Tsiatis and Davidian (2004) for an overview. That is, a stochastic component is included in the evolution of the time-dependent factors that possibly influence the survival time. To include time-dependent covariates, both deterministic and stochastic, in the survival model, it is necessary to specify the dependence of the survival time on the covariates using a local characteristic like the hazard function. However, in all models presented in Part II of this thesis, the covariates modified a global characteristic of the survival time, i.e. the mean log-time. One possibility to extend the models of this thesis to handle time-dependent covariates would be to use the hazard specification (3.3) of the AFT model and a mixture model for the baseline hazard function h0.

Dependence of the scale parameters on covariates

In Section 7.1.2 we suggested extending the basic AFT model by allowing the scale parameter to depend on covariates. The same extension could quite easily be applied to both Bayesian penalized approaches in Chapters 9 and 10. However, in the case of the classical mixture (Chapter 8), a similar extension would be much more complicated due to the fact that the scale of the response is derived from an unknown number of estimated variances of the mixture components.

Dependent censoring

All models in this thesis assumed that the censoring mechanism is independent of the time-to-event (see Section 2.4). In general, this does not always have to be true. All Bayesian models (Chapters 8–10) could relatively easily be extended to handle dependent censoring as well. However, a reasonable measurement model then has to be specified for the censoring mechanism.

Goodness-of-fit

An important topic not discussed in this thesis is the evaluation of goodness-of-fit. Admittedly, in all models in this thesis the distribution of the response is specified in a flexible manner, so there is less need to evaluate the distributional assumptions. Nevertheless, one should also check the appropriateness of the AFT assumption with respect to the form in which the covariates modify the distribution of the response. In a few places in this thesis, and only for categorical covariates, this was checked graphically by comparing the fitted survival curves with their nonparametric estimates. Classical goodness-of-fit methods are based on residuals, whose form is straightforward in the case of linear regression with uncensored data. In the case of right-censored data, various forms of residuals are derived from the counting process specification of survival models; see, e.g., Therneau and Grambsch (2000, Chapter 4).
However, the definition of residuals for interval-censored data is not straightforward, and only recently has work in this direction appeared in the literature (Topp and Gómez, 2004).

Model selection

General model selection is another important topic somewhat neglected in this thesis. In Chapter 7, we based the model selection on Akaike's information criterion, whereas in Chapters 8–10 we used (simultaneous) Bayesian p-values for model contrasts. Also in the Bayesian framework some form of information criterion could be used for model selection; recently, the most popular one seems to be the deviance information criterion (Spiegelhalter et al., 2002).

The use of specifically developed optimizers

Due to the complexity of the likelihood, we considered estimation by penalized maximum-likelihood only in Chapter 7. However, there are currently several convenient gateways to optimization software and services available on the Internet. For example, the Kestrel interface to the NEOS server (Czyzyk, Mesnier, and Moré, 1998; Ferris, Mesnier, and Moré, 2000), together with the modeling language for mathematical programming AMPL (Fourer, Gay, and Kernighan, 2003), makes it possible to optimize complicated functions subject to different types of constraints. These possibilities could be explored as promising alternatives to the full Bayesian approaches presented in Chapters 8–10.

11.3 The use of penalized mixtures in other application areas

Finally, we indicate how we intend to use the ideas of this thesis in future work.

11.3.1 Generalized linear mixed models with random effects having a flexible distribution

Firstly, we aim to develop a generalized linear mixed model (GLMM) with the random-effects distribution specified as a penalized mixture. The setting is the following. Let Y_{i,l}, i = 1, …, N, l = 1, …, n_i, be discrete random variables for which the components of the vector Y_i = (Y_{i,1}, …, Y_{i,n_i})′ are possibly dependent. Typically, Y_i represents the outcomes of the i-th subject at n_i different time points t_{i,1}, …, t_{i,n_i} in a longitudinal study, or the outcomes of the n_i subjects forming the i-th cluster in the case of clustered data. Further, let μ_{i,l} = E(Y_{i,l}). In the GLMM, the expected outcome μ_{i,l} is expressed as

μ_{i,l} = h^{-1}(x′_{i,l} β + z′_{i,l} b_i),   i = 1, …, N,  l = 1, …, n_i,

where h is a known link function (e.g. log, logit, probit), β is the vector of unknown regression parameters (fixed effects), x_{i,l} the vector of covariates for the fixed effects, b_i the vector of random effects and z_{i,l} the vector of covariates for the random effects; see, e.g., Molenberghs and Verbeke (2005) for more details. We aim to concentrate mainly on longitudinal studies where usually z_{i,l} = (1, t_{i,l})′ and b_i = (b_{i,1}, b_{i,2})′.

Classically, it is assumed that the random effects b_i, i = 1, …, N, are i.i.d. following a (multivariate) normal distribution. However, it has been shown (see Molenberghs and Verbeke, 2005, Chapter 23) that an incorrect assumption of normality of the random effects may lead to biased estimates of the regression parameters β. But, due to the fact that the random effects b_i are latent, it is very difficult to check the normality assumption. That is why one strives for more flexible methods with respect to the distribution of the random effects. One possibility we wish to explore is to specify the distribution of the random effects as the penalized bivariate mixture (10.3).
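To fix ideas, the sketch below evaluates the univariate version of the penalized normal mixture introduced in Chapter 6, which is the building block behind (10.3) and the extension proposed here: equidistant means, a common basis standard deviation, weights obtained from unconstrained coefficients through a multinomial-logit (softmax) transform with a_0 = 0, and a difference penalty on those coefficients. All numerical values (number of knots, knot spacing, basis standard deviation, penalty order, λ) are illustrative choices only, not the settings used in the analyses of this thesis.

```r
## Univariate penalized normal mixture (sketch; all numerical values illustrative)
K      <- 15
delta  <- 0.3                        # distance between neighbouring means
mu     <- delta * (-K:K)             # equidistant means mu_{-K}, ..., mu_K
sigma0 <- (2/3) * delta              # common basis standard deviation

a <- -0.05 * mu^2                    # transformed weights; a_0 = 0 here
w <- exp(a) / sum(exp(a))            # softmax weights, sum(w) = 1

## mixture density g(e) = sum_j w_j * dnorm(e, mu_j, sigma0)
gdens <- function(e)
  vapply(e, function(ei) sum(w * dnorm(ei, mean = mu, sd = sigma0)), numeric(1))

## difference penalty on the transformed weights: q(a) = (lambda / 2) * ||D a||^2
D      <- diff(diag(2 * K + 1), differences = 3)   # third-order differences
lambda <- 10
qpen   <- 0.5 * lambda * sum((D %*% a)^2)

curve(gdens(x), from = -4, to = 4, ylab = "density")
```

The same construction, with a matrix of coefficients and a tensor product of the basis densities, gives the bivariate mixture (10.3).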
11.3.2 Spatial models with the intensity specified by the penalized mixture

Secondly, we would like to explore the possibilities of penalized mixtures in the context of spatial models. The motivation is the following. In epidemiology, it is of interest to model the prevalence or incidence of a disease spatially in order to represent the true risk in an honest manner. Let A denote the study area, R a region within A, and y = (y_1, y_2) the coordinates of a location in A. Generation of the disease cases can be formalized by considering an underlying point process described by a counting measure N on A, i.e. N(R) denotes the number of disease cases in R. Finally, let

λ(y) = lim_{‖Δy‖→0} E{N(Δy)} / ‖Δy‖,

where Δy is an infinitesimal region around y and ‖Δy‖ its area, be the intensity of the point process. Different approaches have been suggested in the literature to express λ(y), one of which uses the expression

λ(y) = ϱ g(y) f(y; θ),        (11.1)

where ϱ denotes an overall region-wide rate, g(y) a known background function representing the reference population, and f(y; θ) a function of the spatial location and possibly of other parameters and associated covariates as well (see Lawson et al., 1999). However, there exists no gold standard for the expression of f(y; θ). The main requirement for f(y; θ) is that it varies smoothly across A. To model smoothly the variation of the intensity λ(y) across the region of interest A, a penalized mixture could be used to express f(y; θ) in (11.1) as

f(y; θ) = 1 + Σ_{k1=−K1}^{K1} Σ_{k2=−K2}^{K2} w_{k1,k2} φ_{k1}(y_1) φ_{k2}(y_2),        (11.2)

where the weights are, in contrast to the approaches used in this thesis, not constrained. Further, it is here of interest (a) to develop an efficient procedure to test the null hypothesis of w_{−K1,−K2} = · · · = w_{K1,K2} = 0, corresponding to a constant ratio λ(y)/g(y), which is known as a standardized mortality rate, and (b) to develop a general procedure for model selection.

Further, to allow for the dependence of the intensity λ(y) on other (region-specific) covariates x(y), we would like to explore a generalization of the model (11.2) of the form

f(y; θ, β) = h{x(y), β} + Σ_{k1=1}^{K1} Σ_{k2=1}^{K2} w_{k1,k2} φ_{k1}(y_1) φ_{k2}(y_2),

where h is an unknown (nonlinear) function and β a vector of unknown regression parameters.
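As a purely numerical illustration of (11.2), the sketch below evaluates the surface f(y; θ) on a grid, taking φ_k to be a normal density centred at an equidistant knot, as in the penalized mixtures of this thesis; the weights are left unconstrained, as stated above. The number of knots, the knot spacing, the basis standard deviation and the weights themselves are arbitrary placeholder values.

```r
## Illustrative evaluation of the intensity-ratio surface (11.2):
## f(y; theta) = 1 + sum_{k1} sum_{k2} w_{k1,k2} * phi_{k1}(y1) * phi_{k2}(y2)
## (all numerical values below are arbitrary placeholders)
K1 <- 10; K2 <- 10
delta  <- 0.25
mu1    <- delta * (-K1:K1)
mu2    <- delta * (-K2:K2)
sigma0 <- (2/3) * delta

set.seed(1)
W <- matrix(rnorm((2 * K1 + 1) * (2 * K2 + 1), sd = 0.05),
            nrow = 2 * K1 + 1)                  # unconstrained weights w_{k1,k2}

f <- function(y1, y2) {
  b1 <- dnorm(y1, mean = mu1, sd = sigma0)      # phi_{k1}(y1)
  b2 <- dnorm(y2, mean = mu2, sd = sigma0)      # phi_{k2}(y2)
  1 + as.numeric(t(b1) %*% W %*% b2)            # tensor-product expansion
}

y1 <- y2 <- seq(-2, 2, length.out = 41)
contour(y1, y2, outer(y1, y2, Vectorize(f)))    # f(y) = lambda(y) / {rho * g(y)}
```

Under the null hypothesis in (a) all w_{k1,k2} are zero and the surface reduces to the constant 1, i.e. to a constant standardized mortality rate.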
Appendix A

Technical details for the Maximum Likelihood Penalized AFT Model

This appendix provides the technical details for the practical computation of the penalized maximum-likelihood estimate for the AFT model of Chapter 7. Namely, we give more details concerning the optimization algorithm, provide the formulas for the first and second derivatives of the penalized log-likelihood needed to implement this algorithm, and give the proof of Proposition 7.1. Notation introduced in Chapter 7 is used throughout this appendix. Additionally, the following notation is employed:

e^L_i = τ_i^{-1}(y^L_i − α − β′x_i),      e^U_i = τ_i^{-1}(y^U_i − α − β′x_i),
ẽ^L_{i,j} = σ_0^{-1}(e^L_i − μ_j),        ẽ^U_{i,j} = σ_0^{-1}(e^U_i − μ_j),
φ^L_{i,j} = φ(ẽ^L_{i,j}),                 φ^U_{i,j} = φ(ẽ^U_{i,j}),
φ̄^L_{i,j} = ẽ^L_{i,j} φ(ẽ^L_{i,j}),      φ̄^U_{i,j} = ẽ^U_{i,j} φ(ẽ^U_{i,j}),
φ̆^L_{i,j} = {(ẽ^L_{i,j})² − 1} φ(ẽ^L_{i,j}),   φ̆^U_{i,j} = {(ẽ^U_{i,j})² − 1} φ(ẽ^U_{i,j}),
Φ^L_{i,j} = Φ(ẽ^L_{i,j}),                 Φ^U_{i,j} = Φ(ẽ^U_{i,j}),

for j = −K, …, K, i = 1, …, N, and

φ^L_i = (φ^L_{i,−K}, …, φ^L_{i,K})′,      φ^U_i = (φ^U_{i,−K}, …, φ^U_{i,K})′,
φ̄^L_i = (φ̄^L_{i,−K}, …, φ̄^L_{i,K})′,    φ̄^U_i = (φ̄^U_{i,−K}, …, φ̄^U_{i,K})′,
φ̆^L_i = (φ̆^L_{i,−K}, …, φ̆^L_{i,K})′,    φ̆^U_i = (φ̆^U_{i,−K}, …, φ̆^U_{i,K})′,
Φ^L_i = (Φ^L_{i,−K}, …, Φ^L_{i,K})′,      Φ^U_i = (Φ^U_{i,−K}, …, Φ^U_{i,K})′,

for i = 1, …, N. We omit the superscripts 'L' and 'U' in the case of exactly observed event times (δ_i = 1), for which y^L_i = y^U_i = y_i. Finally, in all formulas we omit the Jacobian term (t_i^{-1} for exactly observed event times with t^L_i = t^U_i = t_i) resulting from the logarithmic transformation of the event times in the log-likelihood.

A.1 Optimization algorithm

To compute the penalized maximum-likelihood estimate we first maximize the penalized log-likelihood (7.7) with respect to θ̃ = (α, β′, γ′, a′_{−0})′ under the constraints (7.4), and upon convergence we compute the second derivative matrix of ℓ_P with respect to θ = (α, β′, γ′, d′)′ to obtain the variance estimates. The constrained optimization is conducted using the sequential quadratic programming (SQP) algorithm, see Han (1977) and Fletcher (1987, Section 12.4). The idea of this algorithm is to iteratively maximize a slightly modified quadratic approximation of the objective function subject to a linear approximation of the constraints. Let

c_1(θ̃) = Σ_{j=−K}^{K} w_j μ_j,      c_2(θ̃) = 1 − σ_0² − Σ_{j=−K}^{K} w_j μ_j²        (A.1)

be the constraint equations resulting from (7.4), and let

L(θ̃, ξ_1, ξ_2) = ℓ_P(θ̃) + ξ_1 c_1(θ̃) + ξ_2 c_2(θ̃)

be the Lagrange function with the Lagrange multipliers ξ_1 and ξ_2 corresponding to the maximization problem max_{θ̃} ℓ_P(θ̃) subject to c_1(θ̃) = 0 and c_2(θ̃) = 0. Let QP(θ̃, H) be the quadratic programming problem

max_δ { δ′ (∂ℓ_P/∂θ̃)(θ̃) + 0.5 δ′ H δ }        (A.2)

subject to

c_1(θ̃) + δ′ (∂c_1/∂θ̃)(θ̃) = 0,      c_2(θ̃) + δ′ (∂c_2/∂θ̃)(θ̃) = 0,        (A.3)

where

H = H(θ̃, ξ_1, ξ_2) = (∂²L/∂θ̃∂θ̃′)(θ̃, ξ_1, ξ_2).        (A.4)

Note that the objective function in (A.2) is the second-order Taylor approximation of ℓ_P(θ̃) around some fixed point θ̃⁰, with δ = θ̃ − θ̃⁰, the constant term omitted, and the matrix of second derivatives ∂²ℓ_P/∂θ̃∂θ̃′ replaced by the θ̃–θ̃ block of the second derivative matrix of the Lagrange function L. The SQP algorithm proceeds in the following steps.

Step 0. Give the initial estimate θ̃^(0) and the initial guesses ξ_1^(0), ξ_2^(0) for the Lagrange multipliers. Set H^(0) = H(θ̃^(0), ξ_1^(0), ξ_2^(0)).

In the s-th iteration:

Step 1. Find the point δ^(s) which solves the quadratic program QP(θ̃^(s), H^(s)).

Step 2. Set θ̃^(s+1) = θ̃^(s) + δ^(s). If θ̃^(s+1) does not lead to an increase of ℓ_P, use a step-halving procedure.

Step 3. Set ξ_1^(s+1) and ξ_2^(s+1) to the optimal Lagrange multipliers of the quadratic program QP(θ̃^(s), H^(s)).

Step 4. Check convergence; if it is not reached, go to Step 1.
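The only nonstandard ingredient of the algorithm above is the equality-constrained quadratic program QP(θ̃, H) in (A.2)–(A.3), which can be solved directly through its Karush–Kuhn–Tucker linear system. The sketch below shows this step in R; the gradient, the matrix H and the constraint values and gradients are assumed to be supplied by the formulas of the following sections, and the function name and its interface are illustrative, not part of the smoothSurv implementation.

```r
## One step of the SQP scheme of Section A.1 (sketch; interface is illustrative).
## grad : gradient of l_P at theta~             (vector of length p)
## H    : theta~-theta~ block of d2 L, see (A.4) (p x p matrix)
## cval : c(c1(theta~), c2(theta~))              (length 2)
## Cjac : rows = gradients of c1 and c2          (2 x p matrix)
## The QP   max_delta  delta' grad + 0.5 * delta' H delta
##          s.t.       cval + Cjac %*% delta = 0
## has stationarity condition  H delta + t(Cjac) xi = -grad, which together
## with the linearized constraints gives one linear (KKT) system in (delta, xi).
sqp_step <- function(grad, H, cval, Cjac) {
  p   <- length(grad)
  q   <- nrow(Cjac)
  KKT <- rbind(cbind(H,    t(Cjac)),
               cbind(Cjac, matrix(0, q, q)))
  sol <- solve(KKT, c(-grad, -cval))
  list(delta = sol[1:p],              # Step 1: search direction
       xi    = sol[(p + 1):(p + q)])  # Step 3: updated Lagrange multipliers
}
```

The new iterate is then θ̃^(s+1) = θ̃^(s) + δ^(s), with step-halving whenever ℓ_P does not increase (Step 2).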
A.2 Individual log-likelihood contributions

For i = 1, …, N,

ℓ_i(θ̃) =
  log{1 − w′Φ^L_i}                     if δ_i = 0,
  −log(τ_i) + log(w′φ_i)               if δ_i = 1,
  log(w′Φ^U_i)                         if δ_i = 2,
  log{w′(Φ^U_i − Φ^L_i)}               if δ_i = 3.

A.3 First derivatives of the log-likelihood

A.3.1 With respect to the regression parameters and the intercept

∂ℓ/∂α = Σ_{i=1}^{N} (τ_i σ_0)^{-1} w′ db_i,
∂ℓ/∂β_l = Σ_{i=1}^{N} (τ_i σ_0)^{-1} x_{i,l} w′ db_i,      l = 1, …, m,

where db_i is a vector of length 2K + 1 of the form

db_i =
  (1 − w′Φ^L_i)^{-1} φ^L_i                           if δ_i = 0,
  (w′φ_i)^{-1} φ̄_i                                   if δ_i = 1,
  −(w′Φ^U_i)^{-1} φ^U_i                              if δ_i = 2,
  {w′(Φ^U_i − Φ^L_i)}^{-1} (φ^L_i − φ^U_i)           if δ_i = 3,

for i = 1, …, N.

A.3.2 With respect to the log-scale and the scale-regression parameters

Firstly, we consider the case when the scale parameter τ does not depend on covariates, i.e. log(τ) = γ_1. Then

∂ℓ/∂γ_1 = −Σ_{i=1}^{N} I[δ_i = 1] + σ_0^{-1} Σ_{i=1}^{N} w′ dl_i.

Secondly, we consider the case when log(τ_i) = γ′z_i, where z_i = (z_{i,1}, …, z_{i,m_s})′. Then

∂ℓ/∂γ_l = −Σ_{i=1}^{N} I[δ_i = 1] z_{i,l} + σ_0^{-1} Σ_{i=1}^{N} z_{i,l} w′ dl_i,      l = 1, …, m_s.

In both formulas, dl_i is a vector of length 2K + 1 of the form

dl_i =
  {w′(1 − Φ^L_i)}^{-1} e^L_i φ^L_i                            if δ_i = 0,
  (w′φ_i)^{-1} e_i φ̄_i                                        if δ_i = 1,
  −(w′Φ^U_i)^{-1} e^U_i φ^U_i                                 if δ_i = 2,
  {w′(Φ^U_i − Φ^L_i)}^{-1} (e^L_i φ^L_i − e^U_i φ^U_i)        if δ_i = 3,

for i = 1, …, N.

A.3.3 With respect to the transformed mixture weights

Let a_{−0} be the vector of transformed mixture weights without the baseline coefficient, which is fixed to zero (without loss of generality a_0 = 0). Then

∂ℓ/∂a_{−0} = Σ_{i=1}^{N} (∂w/∂a_{−0}) da_i,

where da_i is a vector of length 2K + 1 of the form

da_i =
  {w′(1 − Φ^L_i)}^{-1} (1 − Φ^L_i)                 if δ_i = 0,
  (w′φ_i)^{-1} φ_i                                 if δ_i = 1,
  (w′Φ^U_i)^{-1} Φ^U_i                             if δ_i = 2,
  {w′(Φ^U_i − Φ^L_i)}^{-1} (Φ^U_i − Φ^L_i)         if δ_i = 3,

for i = 1, …, N, and ∂w/∂a_{−0} is a 2K × (2K + 1) matrix whose (j, k)-th element equals ∂w_k/∂a_j, j = −K, …, −1, 1, …, K, k = −K, …, K. Namely,

∂w_j/∂a_j = w_j(1 − w_j),      j = −K, …, −1, 1, …, K,
∂w_k/∂a_j = −w_j w_k,          j = −K, …, −1, 1, …, K,  k = −K, …, K,  j ≠ k.
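To make the case distinctions above concrete, the sketch below codes the individual contribution ℓ_i of Section A.2 for the four values of the censoring indicator and checks the analytic derivative ∂ℓ_i/∂α = (τ_i σ_0)^{-1} w′db_i of Section A.3.1 against a central finite difference for an interval-censored observation. The grid, weights, covariate value and interval limits are arbitrary illustrative values; as in the text, the constant −log σ_0 and the Jacobian of the log transformation are omitted.

```r
## Individual log-likelihood contribution (Section A.2) and a finite-difference
## check of d l_i / d alpha (Section A.3.1).  All values are illustrative.
K <- 10; mu <- 0.3 * (-K:K); sigma0 <- 0.2
a <- -0.1 * mu^2; w <- exp(a) / sum(exp(a))       # mixture weights
alpha <- 0.5; beta <- 0.3; x <- 1.2; tau <- 1.3   # one covariate, fixed scale

loglik_i <- function(alpha, yL, yU, delta) {
  eL <- (yL - alpha - beta * x) / tau
  eU <- (yU - alpha - beta * x) / tau
  PhiL <- pnorm((eL - mu) / sigma0)
  PhiU <- pnorm((eU - mu) / sigma0)
  phi  <- dnorm((eL - mu) / sigma0)               # used for exact times (yL = yU)
  switch(as.character(delta),
         "0" = log(sum(w * (1 - PhiL))),          # right-censored
         "1" = -log(tau) + log(sum(w * phi)),     # exactly observed
         "2" = log(sum(w * PhiU)),                # left-censored
         "3" = log(sum(w * (PhiU - PhiL))))       # interval-censored
}

## interval-censored example: numerical vs analytic derivative w.r.t. alpha
yL <- 0.2; yU <- 1.5; h <- 1e-6
num <- (loglik_i(alpha + h, yL, yU, 3) - loglik_i(alpha - h, yL, yU, 3)) / (2 * h)

eL <- (yL - alpha - beta * x) / tau; eU <- (yU - alpha - beta * x) / tau
phiL <- dnorm((eL - mu) / sigma0);  phiU <- dnorm((eU - mu) / sigma0)
PhiL <- pnorm((eL - mu) / sigma0);  PhiU <- pnorm((eU - mu) / sigma0)
db  <- (phiL - phiU) / sum(w * (PhiU - PhiL))     # db_i for delta_i = 3
ana <- sum(w * db) / (tau * sigma0)
c(numerical = num, analytic = ana)                # the two values should agree
```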
A.4 Second derivatives of the log-likelihood

Let β̃ be the vector of regression parameters extended by the intercept, i.e. β̃ = (α, β′)′, and let x̃_i, i = 1, …, N, be the covariate vectors extended by the intercept term, i.e. x̃_i = (1, x′_i)′.

A.4.1 With respect to the extended regression parameters

∂²ℓ/∂β̃∂β̃′ = Σ_{i=1}^{N} (ddbb_{i,1} − ddbb²_{i,2}) x̃_i x̃′_i,

where ddbb_{i,1} and ddbb_{i,2} are scalars of the following form (i = 1, …, N):

ddbb_{i,1} =
  (τ_i σ_0)^{-2} · w′φ̄^L_i / {w′(1 − Φ^L_i)}                  if δ_i = 0,
  (τ_i σ_0)^{-2} · w′φ̆_i / (w′φ_i)                             if δ_i = 1,
  −(τ_i σ_0)^{-2} · w′φ̄^U_i / (w′Φ^U_i)                        if δ_i = 2,
  (τ_i σ_0)^{-2} · w′(φ̄^L_i − φ̄^U_i) / {w′(Φ^U_i − Φ^L_i)}    if δ_i = 3,

ddbb_{i,2} =
  (τ_i σ_0)^{-1} · w′φ^L_i / {w′(1 − Φ^L_i)}                   if δ_i = 0,
  (τ_i σ_0)^{-1} · w′φ̄_i / (w′φ_i)                             if δ_i = 1,
  −(τ_i σ_0)^{-1} · w′φ^U_i / (w′Φ^U_i)                        if δ_i = 2,
  (τ_i σ_0)^{-1} · w′(φ^L_i − φ^U_i) / {w′(Φ^U_i − Φ^L_i)}     if δ_i = 3.

A.4.2 Mixed with respect to the extended regression parameters and the log-scale or the scale-regression parameters

In the case when the scale parameter does not depend on covariates we have

∂²ℓ/∂β̃∂γ_1 = Σ_{i=1}^{N} {ddbl_{i,1} − ddbb_{i,2}(1 + ddbl_{i,2})} x̃_i.

In the case of log(τ_i) = γ′z_i we have

∂²ℓ/∂β̃∂γ′ = Σ_{i=1}^{N} {ddbl_{i,1} − ddbb_{i,2}(1 + ddbl_{i,2})} x̃_i z′_i.

In both formulas, ddbb_{i,2} is given in Section A.4.1, and ddbl_{i,1} and ddbl_{i,2} are scalars of the form (i = 1, …, N):

ddbl_{i,1} =
  {e^L_i/(τ_i σ_0²)} · w′φ̄^L_i / {w′(1 − Φ^L_i)}                        if δ_i = 0,
  {e_i/(τ_i σ_0²)} · w′φ̆_i / (w′φ_i)                                     if δ_i = 1,
  −{e^U_i/(τ_i σ_0²)} · w′φ̄^U_i / (w′Φ^U_i)                              if δ_i = 2,
  {1/(τ_i σ_0²)} · w′(e^L_i φ̄^L_i − e^U_i φ̄^U_i) / {w′(Φ^U_i − Φ^L_i)}  if δ_i = 3,

ddbl_{i,2} =
  (e^L_i/σ_0) · w′φ^L_i / {w′(1 − Φ^L_i)}                         if δ_i = 0,
  (e_i/σ_0) · w′φ̄_i / (w′φ_i)                                     if δ_i = 1,
  −(e^U_i/σ_0) · w′φ^U_i / (w′Φ^U_i)                              if δ_i = 2,
  (1/σ_0) · w′(e^L_i φ^L_i − e^U_i φ^U_i) / {w′(Φ^U_i − Φ^L_i)}   if δ_i = 3.

A.4.3 Mixed with respect to the extended regression parameters and the transformed mixture weights

∂²ℓ/∂β̃∂a′_{−0} = Σ_{i=1}^{N} {ddba_i − (τ_i σ_0)^{-1}(w′db_i) x̃_i da′_i} (∂w/∂a_{−0})′,

where ddba_i is an (m + 1) × (2K + 1) matrix of the form (i = 1, …, N)

ddba_i =
  {τ_i σ_0 w′(1 − Φ^L_i)}^{-1} x̃_i (φ^L_i)′                   if δ_i = 0,
  (τ_i σ_0 w′φ_i)^{-1} x̃_i φ̄′_i                                if δ_i = 1,
  −(τ_i σ_0 w′Φ^U_i)^{-1} x̃_i (φ^U_i)′                         if δ_i = 2,
  {τ_i σ_0 w′(Φ^U_i − Φ^L_i)}^{-1} x̃_i (φ^L_i − φ^U_i)′        if δ_i = 3.

Further, db_i is the vector of length 2K + 1 given in Section A.3.1. Finally, da_i and ∂w/∂a_{−0} are the vector of length 2K + 1 and the 2K × (2K + 1) matrix, respectively, given in Section A.3.3.

A.4.4 With respect to the log-scale or the scale-regression parameters

In the case when the scale parameter does not depend on covariates we have

∂²ℓ/∂γ_1² = Σ_{i=1}^{N} {ddll_i − ddbl_{i,2}(1 + ddbl_{i,2})}.

In the case of log(τ_i) = γ′z_i we have

∂²ℓ/∂γ∂γ′ = Σ_{i=1}^{N} {ddll_i − ddbl_{i,2}(1 + ddbl_{i,2})} z_i z′_i.

In both formulas, ddbl_{i,2} is the scalar given in Section A.4.2 and ddll_i is a scalar given by (i = 1, …, N)

ddll_i =
  (e^L_i/σ_0)² · w′φ̄^L_i / {w′(1 − Φ^L_i)}                              if δ_i = 0,
  (e_i/σ_0)² · w′φ̆_i / (w′φ_i)                                          if δ_i = 1,
  −(e^U_i/σ_0)² · w′φ̄^U_i / (w′Φ^U_i)                                   if δ_i = 2,
  σ_0^{-2} · w′{(e^L_i)² φ̄^L_i − (e^U_i)² φ̄^U_i} / {w′(Φ^U_i − Φ^L_i)}  if δ_i = 3.

A.4.5 Mixed with respect to the log-scale or the scale-regression parameters and the transformed mixture weights

In the case when the scale parameter does not depend on covariates we have

∂²ℓ/∂γ_1∂a′_{−0} = Σ_{i=1}^{N} {ddla_i − σ_0^{-1}(w′dl_i) da′_i} (∂w/∂a_{−0})′.

In the case of log(τ_i) = γ′z_i we have

∂²ℓ/∂γ∂a′_{−0} = Σ_{i=1}^{N} z_i {ddla_i − σ_0^{-1}(w′dl_i) da′_i} (∂w/∂a_{−0})′.

In both formulas, da_i and ∂w/∂a_{−0} are the vector of length 2K + 1 and the 2K × (2K + 1) matrix, respectively, given in Section A.3.3, and ddla_i is a row vector of length 2K + 1 of the form (i = 1, …, N)

ddla_i =
  σ_0^{-1} e^L_i {w′(1 − Φ^L_i)}^{-1} (φ^L_i)′                       if δ_i = 0,
  σ_0^{-1} e_i (w′φ_i)^{-1} φ̄′_i                                     if δ_i = 1,
  −σ_0^{-1} e^U_i (w′Φ^U_i)^{-1} (φ^U_i)′                            if δ_i = 2,
  {σ_0 w′(Φ^U_i − Φ^L_i)}^{-1} (e^L_i φ^L_i − e^U_i φ^U_i)′          if δ_i = 3.

A.4.6 With respect to the transformed mixture weights

∂²ℓ/∂a_{−0}∂a′_{−0} = Σ_{i=1}^{N} ddaa_i − (∂w/∂a_{−0}) {Σ_{i=1}^{N} da_i da′_i} (∂w/∂a_{−0})′,

where da_i and ∂w/∂a_{−0} are the vector of length 2K + 1 and the 2K × (2K + 1) matrix, respectively, given in Section A.3.3. Further, ddaa_i is a 2K × 2K matrix given by (i = 1, …, N)

ddaa_i =
  {w′(1 − Φ^L_i)}^{-1} Σ_{j=−K}^{K} (1 − Φ^L_{i,j}) ∂²w_j/(∂a_{−0}∂a′_{−0})              if δ_i = 0,
  (w′φ_i)^{-1} Σ_{j=−K}^{K} φ_{i,j} ∂²w_j/(∂a_{−0}∂a′_{−0})                              if δ_i = 1,
  (w′Φ^U_i)^{-1} Σ_{j=−K}^{K} Φ^U_{i,j} ∂²w_j/(∂a_{−0}∂a′_{−0})                          if δ_i = 2,
  {w′(Φ^U_i − Φ^L_i)}^{-1} Σ_{j=−K}^{K} (Φ^U_{i,j} − Φ^L_{i,j}) ∂²w_j/(∂a_{−0}∂a′_{−0})  if δ_i = 3,

where ∂²w_j/(∂a_{−0}∂a′_{−0}), j = −K, …, K, is a 2K × 2K matrix with elements ddwaa^j_{k,l}, k, l = −K, …, −1, 1, …, K, given by

ddwaa^j_{j,j} = w_j(1 − w_j)(1 − 2w_j),      j ≠ 0,
ddwaa^j_{k,k} = −w_j w_k(1 − 2w_k),          k ≠ j,
ddwaa^j_{j,k} = −w_j w_k(1 − 2w_j),          j ≠ 0,  k ≠ j,
ddwaa^j_{k,j} = −w_j w_k(1 − 2w_j),          j ≠ 0,  k ≠ j,
ddwaa^j_{k,l} = 2 w_j w_k w_l,               k ≠ j,  l ≠ j,  k ≠ l.

A.5 Derivatives of the penalty term

The penalty term depends only on the a_{−0} part of θ̃, so we only have to provide the derivatives with respect to this parameter sub-vector:

∂q/∂a_{−0} = λ D′Da with the 0-th element removed,
∂²q/(∂a_{−0}∂a′_{−0}) = λ D′D with the 0-th row and 0-th column removed.
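The quantities ∂w/∂a_{−0}, ∂²w_j/(∂a_{−0}∂a′_{−0}) and D that appear throughout Sections A.3–A.5 are straightforward to code. The sketch below builds the weight transform and its Jacobian for a small illustrative grid and verifies one row of the Jacobian by a finite difference; the grid size, the coefficients and the use of third-order differences (as in the proof of Proposition 7.1) are illustrative choices.

```r
## Weight transform w(a) and its Jacobian dw/da_{-0} (Section A.3.3),
## plus the difference-penalty matrix D of Section A.5.  Illustrative grid.
set.seed(1)
K <- 4
j <- -K:K                                   # component labels -K, ..., K
free <- which(j != 0)                       # positions of a_j with j != 0
a_free <- rnorm(2 * K)                      # a_0 = 0 is fixed

weights <- function(a_free) {
  a <- numeric(2 * K + 1); a[free] <- a_free
  exp(a) / sum(exp(a))
}
w <- weights(a_free)

## 2K x (2K + 1) Jacobian with (j, k) element dw_k / da_j:
##   dw_j / da_j = w_j (1 - w_j),   dw_k / da_j = -w_j w_k   (k != j)
Jac <- -outer(w[free], w)
Jac[cbind(seq_len(2 * K), free)] <- w[free] * (1 - w[free])

## finite-difference check of the first row of the Jacobian
h  <- 1e-6
e1 <- replace(numeric(2 * K), 1, h)
max(abs((weights(a_free + e1) - weights(a_free - e1)) / (2 * h) - Jac[1, ]))

## difference-penalty matrix: q(a) = (lambda / 2) a' D'D a, so dq/da = lambda D'D a
D <- diff(diag(2 * K + 1), differences = 3)
```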
A.6 Derivatives of the constraints

To be able to compute the matrix H in (A.4), derivatives of the constraint functions (A.1) are needed. Since the constraints (A.1) depend only on the a_{−0} part of θ̃, we only have to provide the derivatives with respect to this parameter sub-vector. The first derivatives are

∂c_1/∂a_{−0} = (∂w/∂a_{−0}) μ,      ∂c_2/∂a_{−0} = −(∂w/∂a_{−0}) μ²,

where μ = (μ_{−K}, …, μ_K)′, μ² = (μ²_{−K}, …, μ²_K)′, and ∂w/∂a_{−0} is the 2K × (2K + 1) matrix given in Section A.3.3. The second derivatives are given by

∂²c_1/(∂a_{−0}∂a′_{−0}) = Σ_{j=−K}^{K} μ_j ∂²w_j/(∂a_{−0}∂a′_{−0}),
∂²c_2/(∂a_{−0}∂a′_{−0}) = −Σ_{j=−K}^{K} μ_j² ∂²w_j/(∂a_{−0}∂a′_{−0}),

where ∂²w_j/(∂a_{−0}∂a′_{−0}), j = −K, …, K, is the 2K × 2K matrix introduced in Section A.4.6.

A.7 Proof of Proposition 7.1

It is easily seen that the unconstrained minimizer of Σ_{j=−K²+3}^{K²} (∆³a_j)² is not unique and is given by an arbitrary quadratic function of the knots, i.e.

a^K_j = b^K_0 − b^K_2 (μ^K_j − b^K_1)²,      j = −K², …, K².

Under the constraints (7.9), the minimizer becomes unique with b^K_1 = 0, b^K_0 = −log[Σ_{j=−K²}^{K²} exp{−b^K_2 (μ^K_j)²}], and b^K_2 being the solution to C_K(b) = 0, where

C_K(b) = [Σ_{j=−K²}^{K²} (μ^K_j)² exp{−b(μ^K_j)²}] / [Σ_{j=−K²}^{K²} exp{−b(μ^K_j)²}] − (1 − σ_0²).

The function C_K(b) has the following properties:

• It is continuous on [0, ∞).
• For all b ∈ [0, ∞),
  dC_K(b)/db = [E{(μ^K)² | b^K_2 = b}]² − E{(μ^K)⁴ | b^K_2 = b},
  and from Hölder's inequality (see, e.g., Billingsley, 1995, p. 80) dC_K(b)/db < 0, i.e. C_K(b) is decreasing on [0, ∞).
• C_K(0) = (K² + 1)/3 − (1 − σ_0²) > 0 for all K ≥ 2.
• lim_{b→∞} C_K(b) = −(1 − σ_0²) < 0.

Hence for all K ≥ 2 there exists exactly one root b^K_2 ∈ (0, ∞) of the equation C_K(b) = 0. Let the function C(b) be defined as

C(b) = [∫_{−∞}^{∞} s² exp(−b s²) ds] / [∫_{−∞}^{∞} exp(−b s²) ds] − (1 − σ_0²) = (2b)^{-1} − (1 − σ_0²).

The equation C(b) = 0 has the unique solution b_2 = {2(1 − σ_0²)}^{-1} ∈ (0.5, ∞). It follows from the properties of the integral that for all b ∈ (0, ∞)

lim_{K→∞} C_K(b) = C(b),

and consequently, using the properties of C_K(b), also lim_{K→∞} b^K_2 = b_2.

Let F_K(μ) be the cumulative distribution function of μ^K under b^K_2, i.e.

F_K(μ) = [Σ_{j=−K²}^{min(Kμ, K²)} exp{−b^K_2 (μ^K_j)²}] / [Σ_{j=−K²}^{K²} exp{−b^K_2 (μ^K_j)²}],

and let Φ(μ | 0, 1 − σ_0²) be the cumulative distribution function of the normal distribution N(0, 1 − σ_0²), i.e.

Φ(μ | 0, 1 − σ_0²) = [∫_{−∞}^{μ} exp(−b_2 s²) ds] / [∫_{−∞}^{∞} exp(−b_2 s²) ds].

It can now be shown that for all μ ∈ ℝ

lim_{K→∞} F_K(μ) = Φ(μ | 0, 1 − σ_0²),

i.e. the random variable μ^K under b^K_2 converges in distribution to an N(0, 1 − σ_0²) random variable. Finally, for all y ∈ ℝ,

g_K(y) = ∫_{−∞}^{∞} φ(y | μ, σ_0²) dF_K(μ)      and      φ(y) = ∫_{−∞}^{∞} φ(y | μ, σ_0²) dΦ(μ | 0, 1 − σ_0²).

The assertion of the proposition now follows from the fact that φ(y | μ, σ_0²) is, for every y ∈ ℝ, a bounded and continuous function of μ.

Appendix B

Simulation results

B.1 Simulation for the maximum likelihood penalized AFT model

Here we present selected results of the simulation study introduced in Section 7.5. Tables B.1 – B.6 show the results for the regression parameters. In the first third of the tables, results based on the penalized AFT model are shown. The second third of the tables shows the results based on the parametric AFT model estimated using the maximum-likelihood method assuming a correct (true) error distribution.
Finally, the last third of the tables shows the results obtained by the parametric AFT model estimated using the maximum-likelihood method while assuming (in most case incorrectly) normal error distribution. Figures B.1 – B.3 show the fitted error distributions. For comparison purposes, we plot also the true error distribution. 235 236 APPENDIX B. SIMULATION RESULTS Table B.1: Results for the regression parameter β1 = −0.800 related to the binary covariate. True error distribution: normal. Mean, standard deviation and MSE (×10−4 ) are calculated over the simulations. Smoothed N 600 300 100 50 β̂ (SD) MSE (×10−4 ) β̂ (SD) light RC (0.114) (0.168) (0.316) (0.401) MSE (×10−4 ) −0.792 −0.812 −0.787 −0.772 (0.118) (0.175) (0.337) (0.478) 600 300 100 50 −0.794 −0.817 −0.775 −0.792 (0.119) (0.176) (0.351) (0.513) 142.59 311.80 1235.08 2635.81 light R+IC −0.794 (0.117) 136.20 −0.812 (0.172) 295.97 −0.778 (0.323) 1045.28 −0.769 (0.424) 1806.80 600 300 100 50 −0.780 −0.798 −0.789 −0.629 (0.140) (0.198) (0.491) (0.622) 200.25 391.34 2412.45 4156.99 −0.782 −0.799 −0.793 −0.652 −0.787 −0.811 −0.837 −0.680 (0.150) (0.212) (0.487) (0.717) 600 300 100 50 138.93 307.92 1140.71 2290.06 Assumed Error Distribution True 2280.00 449.06 2387.49 5278.77 −0.792 −0.812 −0.778 −0.762 130.57 282.47 1005.28 1623.70 heavy RC (0.135) 186.94 (0.198) 391.21 (0.413) 1708.48 (0.490) 2616.76 heavy R+IC −0.786 (0.141) 201.63 −0.800 (0.206) 425.72 −0.799 (0.425) 1809.93 −0.664 (0.514) 2826.40 Normal β̂ (SD) MSE (×10−4 ) −0.792 −0.812 −0.778 −0.762 (0.114) (0.168) (0.316) (0.401) 130.57 282.47 1005.28 1623.70 −0.794 −0.812 −0.778 −0.769 (0.117) (0.172) (0.323) (0.424) 136.20 295.97 1045.28 1806.80 −0.782 −0.799 −0.793 −0.652 (0.135) (0.198) (0.413) (0.490) 186.94 391.21 1708.48 2616.76 −0.786 −0.800 −0.799 −0.664 (0.141) (0.206) (0.425) (0.514) 201.63 425.72 1809.93 2826.40 B.1. SIMULATION FOR THE MAXIMUM LIKELIHOOD PENALIZED AFT MODEL 237 Table B.2: Results for the regression parameter β1 = −0.800 related to the binary covariate. True error distribution: extreme value. Mean, standard deviation and MSE (×10−4 ) are calculated over the simulations. Smoothed N 600 300 100 50 β̂ (SD) MSE (×10−4 ) β̂ (SD) light RC (0.104) (0.151) (0.267) (0.323) MSE (×10−4 ) −0.791 −0.827 −0.796 −0.888 (0.112) (0.151) (0.300) (0.467) 600 300 100 50 −0.795 −0.826 −0.796 −0.869 (0.109) (0.156) (0.299) (0.428) 118.64 250.12 896.48 1883.02 light R+IC −0.786 (0.104) 109.58 −0.824 (0.151) 232.65 −0.782 (0.266) 712.12 −0.884 (0.324) 1117.59 600 300 100 50 −0.788 −0.851 −0.813 −0.891 (0.149) (0.218) (0.459) (0.732) 222.96 499.81 2104.51 5437.10 −0.785 −0.853 −0.777 −0.921 −0.800 −0.855 −0.853 −0.872 (0.156) (0.229) (0.469) (0.684) 600 300 100 50 126.39 235.96 901.70 2257.10 Assumed Error Distribution True 242.35 552.22 2225.78 4725.93 −0.786 −0.824 −0.782 −0.883 110.56 233.70 714.10 1112.5 heavy RC (0.140) 198.32 (0.200) 427.42 (0.360) 1301.67 (0.546) 3132.58 heavy R+IC −0.786 (0.138) 191.75 −0.856 (0.203) 442.33 −0.786 (0.368) 1357.85 −0.936 (0.563) 3360.86 Normal β̂ (SD) MSE (×10−4 ) −0.819 −0.864 −0.842 −0.912 (0.136) (0.188) (0.349) (0.465) 187.24 393.02 1234.08 2290.56 −0.793 −0.833 −0.808 −0.885 (0.123) (0.173) (0.320) (0.430) 152.84 309.89 1024.61 1919.72 −0.869 −0.935 −0.877 −0.973 (0.173) (0.249) (0.460) (0.648) 348.10 802.15 2176.66 4493.43 −0.819 −0.880 −0.833 −0.944 (0.152) (0.229) (0.420) (0.620) 233.59 589.85 1778.50 4048.72 238 APPENDIX B. 
SIMULATION RESULTS Table B.3: Results for the regression parameter β1 = −0.800 related to the binary covariate. True error distribution: normal mixture. Mean, standard deviation and MSE (×10−4 ) are calculated over the simulations. Smoothed N 600 300 100 50 β̂ (SD) −0.817 −0.817 −0.829 −0.845 (0.154) (0.201) (0.386) (0.624) 600 300 100 50 −0.824 −0.834 −0.803 −0.871 (0.159) (0.226) (0.411) (0.688) 600 300 100 50 600 300 100 50 MSE (×10−4 ) β̂ (SD) light RC (0.142) (0.187) (0.319) (0.502) MSE (×10−4 ) β̂ (SD) MSE (×10−4 ) (0.173) (0.262) (0.438) (0.628) 319.10 713.18 1917.94 3963.00 258.53 523.20 1686.57 4781.04 light R+IC −0.819 (0.150) 229.01 −0.819 (0.201) 408.76 −0.803 (0.323) 1043.16 −0.807 (0.567) 3209.78 −0.877 −0.880 −0.833 −0.867 (0.184) (0.283) (0.466) (0.692) 399.17 865.88 2184.66 4839.10 −0.80 (0.213) −0.752 (0.318) −0.781 (0.558) −0.723 (0.915) 451.77 1036.14 3114.28 8426.26 −0.797 −0.763 −0.780 −0.810 −0.743 −0.715 −0.716 −0.728 (0.194) (0.310) (0.520) (0.788) 407.02 1033.05 2771.73 6257.40 (0.263) (0.376) (0.640) (1.183) 700.18 1412.76 4118.05 14012.96 −0.821 −0.782 −0.779 −0.851 (0.230) (0.366) (0.609) (0.981) 531.61 1345.87 3711.43 9655.56 −0.813 −0.814 −0.809 −0.819 203.08 350.64 1019.39 2526.53 Normal −0.845 −0.850 −0.814 −0.836 −0.826 −0.789 −0.752 −0.846 239.76 408.04 1498.05 3912.84 Assumed Error Distribution True heavy RC (0.187) 349.75 (0.285) 827.02 (0.485) 2357.78 (0.746) 5568.92 heavy R+IC −0.808 (0.223) 497.55 −0.759 (0.342) 1189.15 −0.776 (0.548) 3012.35 −0.868 (0.969) 9440.90 B.1. SIMULATION FOR THE MAXIMUM LIKELIHOOD PENALIZED AFT MODEL 239 Table B.4: Results for the regression parameter β2 = 0.400 related to the continuous covariate. True error distribution: normal. Mean, standard deviation and MSE (×10−4 ) are calculated over the simulations. Smoothed N 600 300 100 50 β̂ (SD) 0.406 0.399 0.380 0.398 (0.046) (0.064) (0.134) (0.202) MSE (×10−4 ) 21.58 41.34 182.30 407.62 Assumed Error Distribution True β̂ (SD) MSE (×10−4 ) Normal β̂ (SD) MSE (×10−4 ) 0.406 0.397 0.388 0.391 light RC (0.046) 21.20 (0.059) 34.48 (0.121) 147.19 (0.176) 311.05 0.406 0.397 0.388 0.391 (0.046) (0.059) (0.121) (0.176) 21.20 34.48 147.19 311.05 0.406 0.397 0.391 0.398 (0.048) (0.062) (0.121) (0.184) 23.10 38.69 147.62 338.59 600 300 100 50 0.407 0.397 0.389 0.402 (0.049) (0.063) (0.133) (0.215) 24.60 39.78 178.22 461.64 0.406 0.397 0.391 0.398 light R+IC (0.048) 23.10 (0.062) 38.69 (0.121) 147.62 (0.184) 338.59 600 300 100 50 0.404 0.398 0.385 0.400 (0.051) (0.070) (0.173) (0.264) 26.57 48.90 299.81 697.82 0.405 0.402 0.392 0.407 heavy RC (0.050) 25.05 (0.068) 46.37 (0.140) 197.65 (0.214) 460.12 0.405 0.402 0.392 0.407 (0.050) (0.068) (0.140) (0.214) 25.05 46.37 197.65 460.12 31.72 75.08 275.28 997.94 heavy R+IC 0.406 (0.054) 29.01 0.403 (0.074) 54.82 0.399 (0.142) 200.59 0.424 (0.244) 600.63 0.406 0.403 0.399 0.424 (0.054) (0.074) (0.142) (0.244) 29.01 54.82 200.59 600.63 600 300 100 50 0.408 0.403 0.404 0.438 (0.056) (0.087) (0.166) (0.314) 240 APPENDIX B. SIMULATION RESULTS Table B.5: Results for the regression parameter β2 = 0.400 related to the continuous covariate. True error distribution: extreme value. Mean, standard deviation and MSE (×10−4 ) are calculated over the simulations. 
Smoothed N 600 300 100 50 β̂ (SD) 0.402 0.415 0.414 0.428 (0.040) (0.061) (0.101) (0.188) MSE (×10−4 ) 15.96 39.21 104.29 361.29 Assumed Error Distribution True β̂ (SD) MSE (×10−4 ) Normal β̂ (SD) MSE (×10−4 ) 0.400 0.413 0.408 0.436 light RC (0.039) 15.33 (0.057) 33.84 (0.093) 86.87 (0.158) 260.77 0.420 0.432 0.415 0.438 (0.048) (0.076) (0.113) (0.186) 27.40 68.03 129.37 359.47 0.404 0.417 0.410 0.429 (0.045) (0.067) (0.105) (0.174) 20.06 48.18 111.89 311.89 600 300 100 50 0.403 0.416 0.416 0.433 (0.041) (0.059) (0.101) (0.182) 17.17 37.93 103.67 343.47 0.400 0.412 0.409 0.436 light R+IC (0.039) 15.53 (0.056) 32.98 (0.093) 87.23 (0.160) 268.39 600 300 100 50 0.407 0.427 0.389 0.454 (0.061) (0.086) (0.155) (0.294) 38.05 82.19 241.04 895.33 0.403 0.420 0.398 0.441 heavy RC (0.058) 34.09 (0.077) 63.28 (0.138) 190.95 (0.229) 540.61 0.453 0.463 0.431 0.464 (0.073) (0.098) (0.155) (0.256) 80.63 135.13 248.61 698.09 38.47 80.49 271.55 736.90 heavy R+IC 0.403 (0.059) 34.94 0.420 (0.077) 63.71 0.405 (0.143) 203.32 0.452 (0.241) 607.49 0.426 0.440 0.425 0.461 (0.066) (0.087) (0.151) (0.250) 50.42 92.29 234.82 662.22 600 300 100 50 0.413 0.432 0.419 0.445 (0.061) (0.084) (0.164) (0.268) B.1. SIMULATION FOR THE MAXIMUM LIKELIHOOD PENALIZED AFT MODEL 241 Table B.6: Results for the regression parameter β2 = 0.400 related to the continuous covariate. True error distribution: normal mixture. Mean, standard deviation and MSE (×10−4 ) are calculated over the simulations. Smoothed N 600 300 100 50 β̂ (SD) 0.405 0.401 0.386 0.361 (0.051) (0.075) (0.154) (0.274) MSE (×10−4 ) 26.07 56.31 239.62 763.56 Assumed Error Distribution True β̂ (SD) MSE (×10−4 ) Normal β̂ (SD) MSE (×10−4 ) 0.403 0.400 0.386 0.358 light RC (0.050) 24.79 (0.072) 51.28 (0.125) 158.23 (0.250) 640.84 0.412 0.418 0.397 0.369 (0.068) (0.090) (0.176) (0.282) 48.18 84.56 311.23 806.32 0.424 0.432 0.417 0.390 (0.076) (0.098) (0.196) (0.316) 62.83 105.56 386.41 997.41 600 300 100 50 0.408 0.408 0.403 0.376 (0.059) (0.079) (0.183) (0.313) 35.94 62.88 336.76 983.42 0.407 0.401 0.397 0.391 light R+IC (0.056) 31.74 (0.071) 50.22 (0.152) 230.94 (0.306) 935.42 600 300 100 50 0.400 0.392 0.367 0.315 (0.078) (0.110) (0.201) (0.409) 60.87 121.55 414.59 1747.62 0.396 0.404 0.380 0.332 heavy RC (0.069) 48.17 (0.092) 85.67 (0.172) 301.03 (0.347) 1253.25 0.368 0.373 0.363 0.327 (0.081) (0.100) (0.206) (0.331) 74.92 106.46 437.47 1148.82 84.33 117.03 924.86 2296.79 heavy R+IC 0.402 (0.084) 69.92 0.408 (0.095) 90.05 0.427 (0.249) 628.79 0.392 (0.441) 1941.29 0.401 0.405 0.418 0.381 (0.096) (0.113) (0.267) (0.429) 92.04 128.28 713.71 1843.27 600 300 100 50 0.410 0.418 0.434 0.385 (0.091) (0.107) (0.302) (0.479) APPENDIX B. SIMULATION RESULTS 3 3 3 1 3 3 0.2 0.0 0.4 0.0 0.6 0.4 −3 −1 N = 100, heavy R+IC 0.2 1 N = 50, heavy RC 3 0.0 −3 −1 3 0.2 1 0.4 0.6 0.4 0.2 −3 −1 −3 −1 N = 300, heavy R+IC 1 0.0 0.2 1 0.0 0.0 0.2 0.4 0.6 N = 600, heavy R+IC N = 100, heavy RC 0.0 −3 −1 −3 −1 1 3 N = 50, heavy R+IC 0.6 3 3 N = 50, light R+IC 3 0.6 1 1 0.4 0.6 0.4 0.2 −3 −1 −3 −1 N = 300, heavy RC 1 0.2 0.4 1 0.0 0.0 0.2 0.4 0.6 N = 600, heavy RC N = 100, light R+IC 0.0 −3 −1 −3 −1 0.4 3 3 0.6 1 1 0.2 0.4 0.6 N = 300, light R+IC 0.2 −3 −1 −3 −1 0.6 1 0.0 0.0 0.2 0.4 0.6 N = 600, light R+IC 0.4 0.6 0.4 0.2 −3 −1 0.2 3 N = 50, light RC 0.0 1 0.6 −3 −1 N = 100, light RC 0.0 0.2 0.4 0.6 N = 300, light RC 0.0 0.0 0.2 0.4 0.6 N = 600, light RC 0.6 242 −3 −1 1 3 −3 −1 1 3 Figure B.1: Results for the standardized error distribution. True error distribution: normal. 
Solid line: average fitted density, grey lines: 95% pointwise confidence band, dashed line: true error density.

Figure B.2: Results for the standardized error distribution. True error distribution: extreme value. Solid line: average fitted density, grey lines: 95% pointwise confidence band, dashed line: true error density.

Figure B.3: Results for the standardized error distribution. True error distribution: normal mixture. Solid line: average fitted density, grey lines: 95% pointwise confidence band, dashed line: true error density.

B.2 Simulation for the Bayesian normal mixture cluster-specific AFT model

In this section we give the results of the simulation study introduced in Section 8.6. Tables B.7 and B.8 show the results for the regression parameters. Further, Tables B.9 – B.11 give the results related to the covariance matrix D of the random effects. In the first third (or half) of the tables, results based on the Bayesian normal mixture AFT model are shown. The second third (half) of the tables shows the results based on the Bayesian AFT model with an assumed normal error distribution, and finally the last third of Tables B.7 and B.8 shows the results obtained using the parametric AFT model with an assumed normal distribution, no random effects included and estimated by maximum likelihood. Figures B.4 and B.5 give the fitted standardized (in the case of the Cauchy and Student t2 distributions only centered) error distribution compared with the true density. Figures B.6 and B.7 show the fitted hazard function for the combination of covariates z_{i,l} = 0 and x_{i,l} = 8.13 (median value). In each case, a comparison between the Bayesian normal mixture and the Bayesian model with the (incorrectly) specified normal error distribution is given. The same comparison, but with respect to the fitted survivor functions, is given in Figures B.8 and B.9.
SIMULATION RESULTS Table B.7: Results for the mean of the covariate random effect γ = −0.800 related to the binary covariate. Mean, standard deviation and MSE (×10−4 ) are clculated over the simulations. Estimation method Bayesian normal Bayesian mixture N, ni γ̂ (SD) MSE (×10−4 ) 100, 10 50, 5 −0.798 (0.069) −0.813 (0.155) 47.01 240.43 100, 10 50, 5 −0.8100 (0.103) −0.766 (0.224) 107.16 512.72 100, 10 50, 5 −0.793 (0.100) −0.778 (0.218) 99.91 479.02 100, 10 50, 5 −0.797 (0.069) −0.815 (0.137) 47.39 191.09 100, 10 50, 5 −0.804 (0.051) −0.787 (0.097) 26.60 95.70 γ̂ (SD) MSE (×10−4 ) True error = normal −0.798 (0.069) 48.17 −0.811 (0.149) 222.78 True error = Cauchy t1 −0.736 (0.139) 234.52 −0.719 (0.255) 716.39 True error = Student t2 −0.761 (0.104) 123.72 −0.759 (0.196) 401.28 ML, no random effects γ̂ (SD) MSE (×10−4 ) −0.798 (0.078) −0.812 (0.153) 60.45 235.67 −0.738 (0.142) −0.721 (0.253) 238.91 703.97 −0.7600 (0.108) −0.761 (0.200) 132.14 415.46 True error = extreme value −0.80 (0.075) 56.62 −0.802 (0.082) −0.811 (0.138) 192.4 −0.809 (0.142) True error = normal mixture −0.926 (0.144) 366.82 −0.923 (0.148) −0.869 (0.291) 894.99 −0.863 (0.283) 66.90 202.26 369.41 840.46 B.2. SIMULATION FOR THE BAYESIAN NORMAL MIXTURE CLUSTER-SPECIFIC AFT MODEL 247 Table B.8: Results for the regression parameter β = 0.400 related to the continuous covariate. Mean, standard deviation and MSE (×10−4 ) are calculated over the simulations. Estimation method Bayesian normal Bayesian mixture β̂ (SD) MSE (×10−4 ) 100, 10 50, 5 0.402 (0.027) 0.397 (0.051) 7.28 26.31 100, 10 50, 5 0.392 (0.036) 0.412 (0.071) 13.23 52.54 100, 10 50, 5 0.394 (0.033) 0.393 (0.076) 11.51 58.15 100, 10 50, 5 0.404 (0.021) 0.393 (0.042) 100, 10 50, 5 0.400 (0.019) 0.394 (0.042) N, ni ML, no random effects β̂ (SD) MSE (×10−4 ) 0.402 (0.030) 0.399 (0.059) 9.01 34.67 0.357 (0.057) 0.378 (0.109) 50.73 124.59 0.378 (0.041) 0.382 (0.084) 21.62 72.93 4.34 17.92 True error = extreme value 0.403 (0.023) 5.41 0.402 (0.026) 0.395 (0.045) 20.89 0.395 (0.051) 6.76 25.87 3.46 17.65 True error = normal mixture 0.448 (0.048) 45.62 0.450 (0.052) 0.432 (0.076) 68.1 0.444 (0.104) 52.93 127.26 β̂ (SD) MSE (×10−4 ) True error = normal 0.402 (0.027) 7.28 0.397 (0.051) 25.6 True error = Cauchy t1 0.361 (0.051) 41.84 0.383 (0.081) 68.17 True error = Student t2 0.379 (0.038) 19.02 0.386 (0.069) 49.62 248 APPENDIX B. SIMULATION RESULTS Table B.9: Results for the standard deviation of the random intercept sd(bi,1 ) = 0.500. Mean, standard deviation and MSE (×10−4 ) are calculated over the simulations. N, ni Estimation method Bayesian mixture Bayesian normal b b sd(bi,1 ) (SD) MSE sd(bi,1 ) (SD) MSE 100, 10 50, 5 0.476 (0.069) 0.321 (0.154) True error = normal 52.98 0.476 (0.068) 559.95 0.324 (0.156) 52.52 551.48 100, 10 50, 5 True error = Cauchy t1 0.381 (0.120) 284.36 0.188 (0.121) 0.118 (0.060) 1492.37 0.086 (0.013) 1117.45 1718.19 100, 10 50, 5 True error = Student t2 0.452 (0.094) 111.52 0.418 (0.106) 0.160 (0.128) 1321.32 0.125 (0.086) 179.32 1480.14 100, 10 50, 5 True error = extreme value 0.489 (0.061) 38.41 0.495 (0.069) 0.343 (0.144) 453.41 0.305 (0.156) 48.37 625.15 100, 10 50, 5 True error = normal mixture 0.493 (0.047) 22.33 0.428 (0.176) 360.99 0.446 (0.093) 115.32 0.105 (0.048) 1583.26 B.2. SIMULATION FOR THE BAYESIAN NORMAL MIXTURE CLUSTER-SPECIFIC AFT MODEL 249 Table B.10: Results for the standard deviation of the covariate random effect sd(bi,2 ) = 0.100. Mean, standard deviation and MSE (×10−4 ) are calculated over the simulations. 
N, ni Estimation method Bayesian mixture Bayesian normal b b sd(bi,2 ) (SD) MSE sd(bi,2 ) (SD) MSE 100, 10 50, 5 True error = normal 0.125 (0.040) 22.13 0.125 (0.040) 0.152 (0.059) 61.67 0.153 (0.059) 22.30 62.53 100, 10 50, 5 True error = Cauchy t1 0.156 (0.058) 64.95 0.124 (0.054) 0.093 (0.017) 3.36 0.083 (0.008) 34.43 3.60 100, 10 50, 5 True error = Student t2 0.135 (0.031) 22.03 0.142 (0.033) 0.105 (0.039) 15.61 0.097 (0.029) 28.49 8.30 100, 10 50, 5 True error = extreme value 0.109 (0.027) 8.22 0.112 (0.028) 0.151 (0.059) 60.72 0.142 (0.052) 9.16 44.71 100, 10 50, 5 True error = normal mixture 0.094 (0.029) 8.51 0.174 (0.062) 93.20 0.139 (0.043) 33.42 0.090 (0.020) 5.24 250 APPENDIX B. SIMULATION RESULTS Table B.11: Results for the random effects correlation corr(bi,1 , , bi,2 ) = 0.400. Mean, standard deviation and MSE (×10−4 ) are calculated over the simulations. Estimation method Bayesian mixture Bayesian normal N, ni corr(b d i,1 , bi,2 ) (SD) MSE corr(b d i,1 , bi,2 ) (SD) MSE 100, 10 50, 5 0.391 (0.457) 0.293 (0.343) True error = normal 2086.96 0.395 (0.459) 1292.59 0.290 (0.343) 100, 10 50, 5 0.380 (0.372) 0.061 (0.090) True error = Cauchy t1 1385.34 0.210 (0.226) 1232.64 0.014 (0.032) 873.41 1502.97 100, 10 50, 5 0.266 (0.434) 0.100 (0.197) True error = Student t2 2066.67 0.240 (0.423) 1284.85 0.070 (0.118) 2045.53 1230.40 100, 10 50, 5 0.388 (0.428) 0.327 (0.409) True error = extreme value 1831.69 0.244 (0.469) 1722.62 0.249 (0.352) 2442.23 1466.79 100, 10 50, 5 True error = normal mixture 0.376 (0.401) 1617.48 0.228 (0.415) 0.307 (0.433) 1958.50 0.027 (0.051) 2019.53 1413.30 2108.60 1299.25 B.2. SIMULATION FOR THE BAYESIAN NORMAL MIXTURE CLUSTER-SPECIFIC AFT MODEL 0.4 0.3 0.2 0.1 0.0 0.0 0.1 0.2 0.3 0.4 0.5 Normal, N = 50, ni = 5 0.5 Normal, N = 100, ni = 10 251 −3 −2 −1 0 1 2 −3 3 −1 0 1 2 3 0.20 0.10 0.00 0.00 0.10 0.20 0.30 Cauchy, N = 50, ni = 5 0.30 Cauchy, N = 100, ni = 10 −2 −3 −2 −1 0 1 2 −3 3 −1 0 1 2 3 0.20 0.10 0.00 0.00 0.10 0.20 0.30 Student t2 , N = 50, ni = 5 0.30 Student t2 , N = 100, ni = 10 −2 −3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3 Figure B.4: Results for the standardized error density, estimated using the Bayesian mixture model. Solid line: average fitted standardized density, grey lines: 95% pointwise confidence band, dashed line: true standardized error density. 252 APPENDIX B. SIMULATION RESULTS Extr. value, N = 50, ni = 5 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.0 0.1 0.2 0.3 0.4 0.5 0.6 Extr. value, N = 100, ni = 10 −3 −2 −1 1 0 −3 2 −1 1 0 2 0.2 0.4 0.6 0.8 Normal mixture, N = 50, ni = 5 0.0 0.0 0.2 0.4 0.6 0.8 Normal mixture, N = 100, ni = 10 −2 −2 −1 0 1 2 −2 −1 0 1 2 Figure B.5: Results for the standardized error density, estimated using the Bayesian mixture model. Solid line: average fitted standardized density, grey lines: 95% pointwise confidence band, dashed line: true standardized error density. 0 200 600 0 0 200 200 600 Student t2 (50, 5) 600 0.010 0.015 0.000 0.002 0.004 0.006 0.008 0.010 0.000 0.002 0.004 0.006 0.008 0.010 0.000 0.002 0.004 0.006 0.008 0.010 0.000 0.002 0.004 0.006 0.008 0.010 253 Bayesian normal Normal (50, 5) Normal (100, 10) 0 Cauchy (50, 5) 0 0 200 200 200 600 Normal (50, 5) 600 0 Cauchy (100, 10) 600 0 Student t2 (100, 10) 0 200 200 200 600 Cauchy (50, 5) 600 Student t2 (50, 5) 600 Figure B.6: Results for the hazard function, estimated using the Bayesian mixture model (left part) and the Bayesian normal model (right part). Each row shows the results for different true error densities. 
Solid line: average fitted hazard, grey lines: 95% pointwise confidence band, dashed line: true hazard function.

Figure B.7: Results for the hazard function, estimated using the Bayesian mixture model (left part) and the Bayesian normal model (right part). Each row shows the results for different true error densities. Solid line: average fitted hazard, grey lines: 95% pointwise confidence band, dashed line: true hazard function.

Figure B.8: Results for the survivor function, estimated using the Bayesian mixture model (left part) and the Bayesian normal model (right part). Each row shows the results for different true error densities. Solid line: average fitted survivor function, grey lines: 95% pointwise confidence band, dashed line: true survivor function.

Figure B.9: Results for the survivor function, estimated using the Bayesian mixture model (left part) and the Bayesian normal model (right part). Each row shows the results for different true error densities. Solid line: average fitted survivor function, grey lines: 95% pointwise confidence band, dashed line: true survivor function.

B.3 Simulation for the Bayesian penalized mixture cluster-specific AFT model

This section presents selected results of the simulation study introduced in Section 9.6.
Tables B.12 and B.13 show the results for the regression parameters. Tables B.14 and B.15 give the results for the variance components of the model. Figures B.10 and B.11 show the fitted survivor densities for the onset part of the model for a combination of covariates xui,l,1 = 0.5 (median value) and xui,l,2 = 1. Figures B.12 and B.13 give the fitted survivor densities for the event part of the model for a combination of covariates xti,l,1 = 0.5 (median value) and xti,l,2 = 1. Corresponding fitted survivor functions are given in Figures B.14 – B.17. 258 APPENDIX B. SIMULATION RESULTS Table B.12: Results for the regression parameters from the onset part of the model. Mean, standard deviation and MSE (×10−4 ) over the simulation. τ d /τ ζ = τ b /τ ε δ1 = 0.200 MSE δ̂1 (SD) (×10−4 ) δ2 = −0.100 MSE δ̂2 (SD) (×10−4 ) Scenario I (error ∼ normal mixture, random effect ∼ extreme value) 5 3 2 1 1/2 1/3 1/5 0.199 0.201 0.198 0.199 0.200 0.201 0.198 (0.007) (0.008) (0.011) (0.014) (0.018) (0.019) (0.019) 0.56 0.68 1.30 1.84 3.14 3.74 3.51 −0.101 −0.100 −0.100 −0.100 −0.100 −0.101 −0.100 (0.004) (0.005) (0.006) (0.009) (0.010) (0.010) (0.010) −0.101 −0.101 −0.099 −0.099 −0.097 −0.099 −0.100 (0.005) (0.008) (0.011) (0.019) (0.025) (0.024) (0.020) 0.17 0.20 0.37 0.76 0.92 1.02 0.95 Scenario II (error ∼ extreme value, random effect ∼ normal mixture) 5 3 2 1 1/2 1/3 1/5 0.200 0.202 0.200 0.196 0.194 0.201 0.203 (0.010) (0.015) (0.019) (0.029) (0.038) (0.041) (0.043) 0.93 2.38 3.44 8.73 14.46 16.73 18.12 0.30 0.72 1.27 3.45 6.30 5.90 4.12 B.3. SIMULATION FOR THE BAYESIAN PENALIZED MIXTURE CLUSTER-SPECIFIC AFT MODEL 259 Table B.13: Results for the regression parameters from the event part of the model. Mean, standard deviation and MSE (×10−4 ) over the simulation. τ d /τ ζ = τ b /τ ε β1 = 0.300 MSE β̂1 (SD) (×10−4 ) β2 = −0.150 MSE β̂2 (SD) (×10−4 ) Scenario I (error ∼ normal mixture, random effect ∼ extreme value) 5 3 2 1 1/2 1/3 1/5 0.302 0.301 0.298 0.304 0.301 0.311 0.299 (0.014) (0.032) (0.056) (0.054) (0.043) (0.058) (0.050) 2.12 9.99 30.55 29.04 18.07 34.67 25.11 −0.149 −0.149 −0.150 −0.148 −0.147 −0.150 −0.151 (0.008) (0.021) (0.034) (0.028) (0.031) (0.035) (0.031) −0.148 −0.152 −0.146 −0.149 −0.151 −0.146 −0.142 (0.016) (0.022) (0.036) (0.057) (0.070) (0.071) (0.065) 0.64 4.47 11.75 7.55 9.66 11.88 9.68 Scenario II (error ∼ extreme value, random effect ∼ normal mixture) 5 3 2 1 1/2 1/3 1/5 0.298 0.291 0.306 0.299 0.304 0.296 0.308 (0.031) (0.040) (0.065) (0.103) (0.121) (0.126) (0.112) 9.40 16.44 42.02 105.54 144.59 157.36 125.51 2.74 4.99 13.01 32.60 48.40 50.06 42.10 260 APPENDIX B. SIMULATION RESULTS Table B.14: Results for the scale parameters from the onset part of the model. Mean, standard deviation and MSE (×10−4 ) over the simulation. 
τd d ζ τ /τ = τ b /τ ε True τ d τ̂ d (SD) τζ MSE (×10−4 ) True τ ζ τ̂ ζ (SD) MSE (×10−4 ) Scenario I (error ∼ normal mixture, random effect ∼ extreme value) 5 3 2 1 1/2 1/3 1/5 0.310 0.300 0.283 0.224 0.141 0.100 0.062 0.341 0.324 0.283 0.219 0.143 0.103 0.110 (0.035) (0.037) (0.031) (0.024) (0.018) (0.035) (0.097) 21.20 19.13 9.31 5.85 3.24 12.23 116.93 0.062 0.100 0.141 0.224 0.283 0.300 0.310 0.062 0.100 0.141 0.223 0.283 0.301 0.325 (0.002) (0.002) (0.003) (0.006) (0.006) (0.012) (0.034) 0.04 0.04 0.08 0.39 0.31 1.35 13.74 Scenario II (error ∼ extreme value, random effect ∼ normal mixture) 5 3 2 1 1/2 1/3 1/5 0.310 0.300 0.283 0.224 0.141 0.100 0.062 0.311 0.318 0.299 0.218 0.132 0.065 0.040 (0.009) (0.112) (0.141) (0.011) (0.021) (0.037) (0.030) 0.86 128.44 198.36 1.54 5.23 25.76 13.87 0.062 0.100 0.141 0.224 0.283 0.300 0.310 0.061 0.116 0.159 0.224 0.285 0.304 0.314 (0.003) (0.099) (0.126) (0.012) (0.013) (0.013) (0.015) 0.11 99.25 159.21 1.35 1.70 1.83 2.34 B.3. SIMULATION FOR THE BAYESIAN PENALIZED MIXTURE CLUSTER-SPECIFIC AFT MODEL 261 Table B.15: Results for the scale parameters from the event part of the model. Mean, standard deviation and MSE (×10−4 ) over the simulation. τb d ζ τ /τ = τ b /τ ε True τ b τ̂ b (SD) τε MSE (×10−4 ) True τ ε τ̂ ε (SD) MSE (×10−4 ) Scenario I (error ∼ normal mixture, random effect ∼ extreme value) 5 3 2 1 1/2 1/3 1/5 0.981 0.949 0.894 0.707 0.447 0.316 0.196 0.980 0.987 0.827 0.647 0.428 0.307 0.180 (0.393) (0.517) (0.065) (0.046) (0.039) (0.037) (0.037) 1532.34 2660.34 87.10 57.34 18.27 14.42 15.93 0.196 0.316 0.447 0.707 0.894 0.949 0.981 0.202 0.417 0.663 0.741 0.901 0.954 0.984 (0.005) (0.160) (0.217) (0.090) (0.017) (0.018) (0.018) 0.60 356.10 932.57 91.97 3.47 3.57 3.49 Scenario II (error ∼ extreme value, random effect ∼ normal mixture) 5 3 2 1 1/2 1/3 1/5 0.981 0.949 0.894 0.707 0.447 0.316 0.196 0.971 0.941 0.884 0.671 0.394 0.079 0.024 (0.030) (0.040) (0.049) (0.040) (0.092) (0.115) (0.026) 9.67 16.72 24.54 28.78 111.54 695.18 302.36 0.196 0.316 0.447 0.707 0.894 0.949 0.981 0.202 0.325 0.532 0.886 1.160 1.286 1.345 (0.012) (0.054) (0.237) (0.273) (0.230) (0.214) (0.235) 1.73 29.79 626.75 1056.64 1228.67 1589.69 1873.27 262 APPENDIX B. SIMULATION RESULTS τ b /τ ε = 1/5 0.0 0.00 0.1 0.10 0.2 0.3 0.20 0.4 τ b /τ ε = 5 0 2 4 6 8 0 10 2 τ b /τ ε = 3 4 6 8 10 8 10 8 10 0.00 0.00 0.10 0.10 0.20 0.20 0.30 τ b /τ ε = 1/3 0 2 4 6 8 0 10 2 τ b /τ ε = 2 4 6 0.00 0.00 0.10 0.10 0.20 0.20 τ b /τ ε = 1/2 0 2 4 6 8 10 8 10 0 2 4 6 0.00 0.10 0.20 τ b /τ ε = 1 0 2 4 6 Figure B.10: Results for the survivor density of the onset time, for the combination of covariates xui,l = (0.5, 1)′ , scenario I (error ∼ normal mixture, random effect ∼ extreme value). Solid line: average fitted survivor density, grey lines: 95% pointwise confidence band, dashed line: true survivor density.. B.3. SIMULATION FOR THE BAYESIAN PENALIZED MIXTURE CLUSTER-SPECIFIC AFT MODEL τ b /τ ε = 1/5 0.0 0.00 0.1 0.10 0.2 0.3 0.20 0.4 τ b /τ ε = 5 263 0 2 4 6 8 0 10 2 τ b /τ ε = 3 4 6 8 10 8 10 8 10 0.0 0.00 0.1 0.10 0.2 0.3 0.20 τ b /τ ε = 1/3 0 2 4 6 8 0 10 2 τ b /τ ε = 2 4 6 0.00 0.00 0.10 0.10 0.20 0.20 0.30 τ b /τ ε = 1/2 0 2 4 6 8 10 8 10 0 2 4 6 0.00 0.10 0.20 τ b /τ ε = 1 0 2 4 6 Figure B.11: Results for the survivor density of the onset time, for the combination of covariates xui,l = (0.5, 1)′ , scenario II (error ∼ extreme value, random effect ∼ normal mixture). 
Solid line: average fitted survivor density, grey lines: 95% pointwise confidence band, dashed line: true survivor density.. 264 APPENDIX B. SIMULATION RESULTS τ b /τ ε = 1/5 0.00 0.04 0.08 0.00 0.05 0.10 0.15 0.20 τ b /τ ε = 5 0 10 20 30 40 50 0 10 τ b /τ ε = 3 20 30 40 50 40 50 40 50 0.00 0.00 0.05 0.04 0.10 0.08 0.15 0.20 τ b /τ ε = 1/3 0 10 20 30 40 50 0 10 τ b /τ ε = 2 20 30 0.00 0.05 0.10 0.00 0.02 0.04 0.06 0.08 0.15 τ b /τ ε = 1/2 0 10 20 30 40 50 40 50 0 10 20 30 0.00 0.04 0.08 τ b /τ ε = 1 0 10 20 30 Figure B.12: Results for the survivor density of the event time, for the combination of covariates xti,l = (0.5, 1)′ , scenario I (error ∼ normal mixture, random effect ∼ extreme value). Solid line: average fitted survivor density, grey lines: 95% pointwise confidence band, dashed line: true survivor density.. B.3. SIMULATION FOR THE BAYESIAN PENALIZED MIXTURE CLUSTER-SPECIFIC AFT MODEL τ b /τ ε = 1/5 0.00 0.00 0.02 0.10 0.04 0.06 0.20 0.08 τ b /τ ε = 5 265 0 10 20 30 40 50 0 10 τ b /τ ε = 3 20 30 40 50 40 50 40 50 0.00 0.00 0.05 0.02 0.10 0.04 0.15 0.06 0.20 0.08 τ b /τ ε = 1/3 0 10 20 30 40 50 0 10 τ b /τ ε = 2 20 30 0.00 0.00 0.02 0.05 0.04 0.10 0.06 0.15 0.08 τ b /τ ε = 1/2 0 10 20 30 40 50 40 50 0 10 20 30 0.00 0.04 0.08 τ b /τ ε = 1 0 10 20 30 Figure B.13: Results for the survivor density of the event time, for the combination of covariates xti,l = (0.5, 1)′ , scenario II (error ∼ extreme value, random effect ∼ normal mixture). Solid line: average fitted survivor density, grey lines: 95% pointwise confidence band, dashed line: true survivor density.. 266 APPENDIX B. SIMULATION RESULTS τ b /τ ε = 1/5 1.0 0.8 0.6 0.4 0.2 0.0 0.0 0.2 0.4 0.6 0.8 1.0 τ b /τ ε = 5 0 2 4 6 8 0 10 2 4 6 8 10 8 10 8 10 τ b /τ ε = 1/3 1.0 0.8 0.6 0.4 0.2 0.0 0.0 0.2 0.4 0.6 0.8 1.0 τ b /τ ε = 3 0 2 4 6 8 0 10 2 τ b /τ ε = 2 4 6 1.0 0.8 0.6 0.4 0.2 0.0 0.0 0.2 0.4 0.6 0.8 1.0 τ b /τ ε = 1/2 0 2 4 6 8 10 8 10 0 2 4 6 0.0 0.2 0.4 0.6 0.8 1.0 τ b /τ ε = 1 0 2 4 6 Figure B.14: Results for the survivor function of the onset time, for the combination of covariates xui,l = (0.5, 1)′ , scenario I (error ∼ normal mixture, random effect ∼ extreme value). Solid line: average fitted survivor function, grey lines: 95% pointwise confidence band, dashed line: true survivor function.. B.3. SIMULATION FOR THE BAYESIAN PENALIZED MIXTURE CLUSTER-SPECIFIC AFT MODEL 267 τ b /τ ε = 1/5 1.0 0.8 0.6 0.4 0.2 0.0 0.0 0.2 0.4 0.6 0.8 1.0 τ b /τ ε = 5 0 2 4 6 8 0 10 2 4 6 8 10 8 10 8 10 τ b /τ ε = 1/3 1.0 0.8 0.6 0.4 0.2 0.0 0.0 0.2 0.4 0.6 0.8 1.0 τ b /τ ε = 3 0 2 4 6 8 0 10 2 τ b /τ ε = 2 4 6 1.0 0.8 0.6 0.4 0.2 0.0 0.0 0.2 0.4 0.6 0.8 1.0 τ b /τ ε = 1/2 0 2 4 6 8 10 8 10 0 2 4 6 0.0 0.2 0.4 0.6 0.8 1.0 τ b /τ ε = 1 0 2 4 6 Figure B.15: Results for the survivor function of the onset time, for the combination of covariates xui,l = (0.5, 1)′ , scenario II (error ∼ extreme value, random effect ∼ normal mixture). Solid line: average fitted survivor function, grey lines: 95% pointwise confidence band, dashed line: true survivor function.. 268 APPENDIX B. 
Figure B.16 (plot panels for τ^b/τ^ε = 5, 3, 2, 1, 1/2, 1/3, 1/5): Results for the survivor function of the event time, for the combination of covariates x^t_{i,l} = (0.5, 1)′, scenario I (error ∼ normal mixture, random effect ∼ extreme value). Solid line: average fitted survivor function, grey lines: 95% pointwise confidence band, dashed line: true survivor function.

Figure B.17 (plot panels for τ^b/τ^ε = 5, 3, 2, 1, 1/2, 1/3, 1/5): Results for the survivor function of the event time, for the combination of covariates x^t_{i,l} = (0.5, 1)′, scenario II (error ∼ extreme value, random effect ∼ normal mixture). Solid line: average fitted survivor function, grey lines: 95% pointwise confidence band, dashed line: true survivor function.

Appendix C
Software

For all methodologies described in Part II of the thesis, software has been written in the form of the R (R Development Core Team, 2005) packages smoothSurv and bayesSurv. Both packages can be downloaded, together with extensive manuals and a description of how to perform the analyses shown in this thesis, from the Comprehensive R Archive Network at http://www.R-project.org. To reduce the computational time, all time-consuming computations are performed in compiled C++ code. In this appendix, we only briefly list the most important functions of both packages.

C.1 Package smoothSurv

This package implements the methods for the penalized maximum-likelihood AFT model described in Chapter 7 and provides, among others, the following functions:

smoothSurvReg fits the AFT model (7.1) with the error density (7.2) using the method of penalized maximum likelihood; it also allows for the scale regression (7.6);
plot.smoothSurvReg computes and plots the fitted error density (7.2);
survfit.smoothSurvReg computes and plots the fitted survival function (7.13) for a specified combination of covariates;
fdensity computes and plots the fitted survival density (7.14) for a specified combination of covariates;
hazard computes and plots the fitted hazard function (7.15) for a specified combination of covariates;
estimTdiff estimates the expected survival time for a specified combination of covariates, or the expected difference between the survival times for two specified combinations of covariates, based on the AFT model fitted with smoothSurvReg.
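To make the use of these functions concrete, the following minimal sketch shows how a smoothSurv analysis might be set up. It is only an illustration under assumptions: the data frame teeth with interval bounds low and upp and the covariate gender is hypothetical, and the exact argument names and defaults should be taken from the package manual rather than from this sketch.

    ## Hypothetical data: one row per interval-censored emergence time with
    ## bounds `low` and `upp` (in years) and a binary covariate `gender`.
    ## install.packages("smoothSurv")   # once, from CRAN
    library("survival")                 # provides Surv()
    library("smoothSurv")

    ## Penalized maximum-likelihood AFT fit (the Chapter 7 model); interval
    ## censoring is encoded with Surv(low, upp, type = "interval2").
    fit <- smoothSurvReg(Surv(low, upp, type = "interval2") ~ gender,
                         data = teeth)

    print(fit)      ## regression coefficients and the selected smoothing
    plot(fit)       ## fitted smoothed error density (plot.smoothSurvReg)
    survfit(fit)    ## fitted survival function; covariate values for the
                    ## prediction are passed via arguments listed in the manual

The functions fdensity, hazard and estimTdiff are called analogously on the fitted object.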
C.2 Package bayesSurv

This package implements the Bayesian methods described in Chapters 8 – 10. For the Bayesian normal mixture cluster-specific AFT model of Chapter 8, the core functions include:

bayessurvreg1 runs the MCMC simulation for the AFT model (8.1) with the error density (8.2) and normally distributed (multivariate) random effects;
bayesDensity computes the estimate of the predictive error densities (8.20) and (8.21);
predictive computes the MCMC estimate of the predictive survival, density or hazard function for a specified combination of covariates, based on the formulas (8.16), (8.18) and (8.19).

For the Bayesian penalized mixture cluster-specific and population-averaged AFT models of Chapters 9 and 10, the core functions include:

bayessurvreg2 runs the MCMC simulation for the cluster-specific AFT model (9.1), (9.2) with the error densities specified by (9.3) and normally distributed (multivariate) random effects (Model M);
bayessurvreg3 runs the MCMC simulation for the cluster-specific AFT model (9.1), (9.2) with the error densities specified by (9.3) and univariate random effects whose distribution is specified by (9.3) (Model U);
bayesBisurvreg runs the MCMC simulation for the population-averaged AFT model (10.1), (10.2) with the error densities specified by (10.3);
bayesGspline computes the estimate of the predictive density of the factors whose distribution was specified as the penalized normal mixture (9.3) or (10.3); the function is based on formulas (9.13) and (10.12);
marginal.bayesGspline computes the estimates of the predictive marginal densities of the factors whose distribution was specified as the bivariate penalized normal mixture (10.3);
predictive2 computes the MCMC estimate of the predictive survival, density or hazard function for a specified combination of covariates, based on the formulas (9.10), (9.11) or (10.10), (10.11).
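As a rough orientation, a similarly hedged sketch of a bayesSurv analysis for clustered interval-censored times is given below. Everything in it is illustrative rather than taken from the thesis: the data frame teeth, its variables idnr, low, upp and gender, and the chain lengths are hypothetical, and the prior specification (left at assumed defaults here) as well as the full argument lists of bayessurvreg1 and the related functions are documented in the package manual.

    ## Hypothetical clustered data: children identified by `idnr`, several
    ## teeth per child, emergence bounds `low`, `upp` and covariate `gender`.
    library("survival")
    library("bayesSurv")

    ## MCMC for the normal mixture cluster-specific AFT model of Chapter 8:
    ## cluster(idnr) identifies the clusters, `random = ~1` requests a random
    ## intercept; prior settings are omitted and assumed to have defaults.
    fit <- bayessurvreg1(Surv(low, upp, type = "interval2") ~ gender + cluster(idnr),
                         random = ~1,
                         nsimul = list(niter = 20000, nthin = 10, nburn = 10000),
                         dir = "chain-teeth")   # sampled chains are written here

    ## The stored chains are then post-processed, e.g. with bayesDensity() for
    ## the predictive error density and with predictive() for predictive
    ## survival, density or hazard curves at chosen covariate combinations.

The penalized mixture models of Chapters 9 and 10 are run in the same spirit with bayessurvreg2, bayessurvreg3 or bayesBisurvreg, followed by bayesGspline and predictive2.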
Bibliography

Aalen, O. O. (1994). Effects of frailty in survival analysis. Statistical Methods in Medical Research, 3, 227–243.
Abrahamowicz, M., Ciampi, A., and Ramsay, J. O. (1992). Nonparametric density estimation for censored survival data: regression-spline approach. The Canadian Journal of Statistics, 20, 171–185.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19, 716–723.
Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2, 1152–1174.
Arjas, E. and Gasbarra, D. (1994). Nonparametric Bayesian inference from right censored survival data, using the Gibbs sampler. Statistica Sinica, 4, 505–524.
Bacchetti, P. (1990). Estimating the incubation period of AIDS by comparing population infection and diagnosis patterns. Journal of the American Statistical Association, 85, 1002–1008.
Bacchetti, P. and Jewell, N. P. (1991). Nonparametric estimation of the incubation period of AIDS based on a prevalent cohort with unknown infection times. Biometrics, 47, 947–960.
Barkan, S. E., Melnick, S. L., Preston-Martin, S., Weber, K., Kalish, L. A., Miotti, P., Young, M., Greenblatt, R., Sacks, H., and Feldman, J. (1998). The Women’s Interagency HIV Study. Epidemiology, 9, 117–125.
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995). Bayesian computation and stochastic systems (with Discussion). Statistical Science, 10, 3–66.
Betensky, R. A., Lindsey, J. C., Ryan, L. M., and Wand, M. P. (1999). Local EM estimation of the hazard function for interval-censored data. Biometrics, 55, 238–245.
Betensky, R. A., Lindsey, J. C., Ryan, L. M., and Wand, M. P. (2002). A local likelihood proportional hazards model for interval censored data. Statistics in Medicine, 21, 263–275.
Betensky, R. A., Rabinowitz, D., and Tsiatis, A. A. (2001). Computationally simple accelerated failure time regression for interval censored data. Biometrika, 88, 703–711.
Billingsley, P. (1995). Probability and Measure. John Wiley & Sons, New York, Third edition. ISBN 0-471-00710-2.
Bogaerts, K. and Lesaffre, E. (2004). A new, fast algorithm to find the regions of possible support for bivariate interval-censored data. Journal of Computational and Graphical Statistics, 13, 330–340.
Bogaerts, K. and Lesaffre, E. (2006). Estimating Kendall’s tau for bivariate interval censored data with a smooth estimate of the density. Submitted.
Breslow, N. E. (1974). Covariance analysis of censored survival data. Biometrics, 30, 89–99.
Brooks, S. P., Giudici, P., and Roberts, G. O. (2003). Efficient construction of reversible jump Markov chain Monte Carlo proposal distributions (with Discussion). Journal of the Royal Statistical Society, Series B, 65, 3–55.
Buckley, J. and James, I. (1979). Linear regression with censored data. Biometrika, 66, 429–436.
Cai, T. and Betensky, R. A. (2003). Hazard regression for interval-censored data with penalized spline. Biometrics, 59, 570–579.
Calle, M. L. and Gómez, G. (2005). A semiparametric hierarchical method for a regression model with an interval-censored covariate. Australian and New Zealand Journal of Statistics, 47, 351–364.
Carlin, B. P. and Louis, T. A. (2000). Bayes and Empirical Bayes Methods for Data Analysis. Chapman & Hall/CRC, Boca Raton, Second edition. ISBN 1-58488-170-4.
Carvalho, J. C., Ekstrand, K. R., and Thylstrup, A. (1989). Dental plaque and caries on occlusal surfaces of first permanent molars in relation to stage of eruption. Journal of Dental Research, 68, 773–779.
Chen, M.-H., Shao, Q.-M., and Ibrahim, J. G. (2000). Monte Carlo Methods in Bayesian Computation. Springer-Verlag, New York. ISBN 0-387-98935-8.
Christensen, R. and Johnson, W. (1988). Modelling accelerated failure time with a Dirichlet process. Biometrika, 75, 693–704.
Clahsen, P. C., van de Velde, C. J., Julien, J. P., Floiras, J. L., Delozier, T., Mignolet, F. Y., and Sahmoud, T. M. (1996). Improved local control and disease-free survival after perioperative chemotherapy for early-stage breast cancer. A European Organization for Research and Treatment of Cancer Breast Cancer Cooperative Group Study. Journal of Clinical Oncology, 14, 745–753.
Cox, D. R. (1972). Regression models and life-tables (with Discussion). Journal of the Royal Statistical Society, Series B, 34, 187–220.
Cox, D. R. (1975). Partial likelihood. Biometrika, 62, 269–276.
Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics. Chapman & Hall, London. ISBN 0-412-16160-5.
Czyzyk, J., Mesnier, M. P., and Moré, J. J. (1998). The NEOS server. IEEE Journal on Computational Science and Engineering, 5, 68–75.
Dalal, S. R. and Hall, W. J. (1983). Approximating priors by mixtures of natural conjugate priors. Journal of the Royal Statistical Society, Series B, 45, 278–286.
de Boor, C. (1978). A Practical Guide to Splines. Springer, New York. ISBN 0-387-90356-9.
De Gruttola, V. and Lagakos, S. W. (1989). Analysis of doubly-censored survival data, with application to AIDS. Biometrics, 45, 1–11.
Dellaportas, P. and Papageorgiou, I. (2006). Multivariate mixtures of normals with unknown number of components. Statistics and Computing, 16, 57–68.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Diebolt, J. and Robert, C. P. (1994). Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society, Series B, 56, 363–375.
Dierckx, P. (1993). Curve and Surface Fitting with Splines. Clarendon, Oxford. ISBN 0-19-853440-X.
Dorey, F. J., Little, R. J., and Schenker, N. (1993). Multiple imputation for threshold-crossing data with interval censoring. Statistics in Medicine, 12, 1589–1603.
Eilers, P. H. C. and Marx, B. D. (1996). Flexible smoothing with B-splines and penalties (with Discussion). Statistical Science, 11, 89–121.
Ekstrand, K. R., Christiansen, J., and Christiansen, M. E. (2003). Time and duration of eruption of first and second permanent molars: a longitudinal investigation. Community Dentistry and Oral Epidemiology, 31, 344–350.
Fahrmeir, L. and Tutz, G. (2001). Multivariate Statistical Modelling Based on Generalized Linear Models. Springer-Verlag, New York, Second edition.
Fang, H.-B., Sun, J., and Lee, M.-L. T. (2002). Nonparametric survival comparisons for interval-censored continuous data. Statistica Sinica, 12, 1073–1083.
Fay, M. P. (1996). Rank invariant tests for interval censored data under grouped continuous model. Biometrics, 52, 811–822.
Fay, M. P. (1999). Comparing several score tests for interval censored data. Statistics in Medicine, 18, 273–285.
Fay, M. P. and Shih, J. H. (1998). Permutation tests using estimated distribution functions. Journal of the American Statistical Association, 93, 387–396.
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1, 209–230.
Ferguson, T. S. (1974). Prior distributions on spaces of probability measures. The Annals of Statistics, 2, 615–629.
Ferris, M. C., Mesnier, M. P., and Moré, J. (2000). NEOS and Condor: Solving nonlinear optimization problems over the Internet. ACM Transactions on Mathematical Software, 26, 1–18.
Finkelstein, D. M. (1986). A proportional hazards model for interval-censored failure time data. Biometrics, 42, 845–854.
Fleming, T. R. and Harrington, D. P. (1991). Counting Processes and Survival Analysis. John Wiley & Sons, New York. ISBN 0-471-52218-X.
Fletcher, R. (1987). Practical Methods of Optimization. John Wiley & Sons, Chichester, Second edition. ISBN 0-471-49463-1.
Fourer, R., Gay, D. M., and Kernighan, B. W. (2003). AMPL: A Modeling Language for Mathematical Programming. Duxbury Press, Second edition. ISBN 0-534-38809-4.
Gamerman, D. (1997). Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. Chapman & Hall, London. ISBN 0-412-81820-5.
Gehan, E. A. (1965). A generalized Wilcoxon test for comparing arbitrarily singly-censored samples. Biometrika, 52, 203–223.
Gelfand, A. E., Sahu, S. K., and Carlin, B. P. (1995). Efficient parametrisations for normal linear mixed models. Biometrika, 82, 479–499.
Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398–409.
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. To appear in Bayesian Analysis.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004). Bayesian Data Analysis. Chapman & Hall/CRC, Boca Raton, Second edition. ISBN 1-58488-388-X.
Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulations using multiple sequences (with Discussion). Statistical Science, 7, 457–511.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Gentleman, R. and Geyer, C. J. (1994). Maximum likelihood for interval censored data: consistency and computation. Biometrika, 81, 618–623.
Geyer, C. J. (1992). Practical Markov chain Monte Carlo (with Discussion). Statistical Science, 7, 473–511.
Ghidey, W., Lesaffre, E., and Eilers, P. (2004). Smooth random effects distribution in a linear mixed model. Biometrics, 60, 945–953.
Ghosh, J. K. and Ramamoorthi, R. V. (2003). Bayesian Nonparametrics. Springer-Verlag, New York. ISBN 0-387-95537-2.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J., editors (1996). Markov Chain Monte Carlo in Practice. Chapman & Hall, London. ISBN 0-412-05551-1.
Gilks, W. R. and Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling. Applied Statistics, 41, 337–348.
Gill, R. D. (1980). Censoring and Stochastic Integrals. Number 124 in Mathematical Centre Tracts. Mathematisch Centrum, Amsterdam. ISBN 90-6196-197-1.
Goetghebeur, E. and Ryan, L. (2000). Semiparametric regression analysis of interval-censored data. Biometrics, 56, 1139–1144.
Goggins, W. B., Finkelstein, D. M., Schoenfeld, D. A., and Zaslavsky, A. M. (1998). A Markov chain Monte Carlo EM algorithm for analyzing interval-censored data under the Cox proportional hazards model. Biometrics, 54, 1498–1507.
Goggins, W. B., Finkelstein, D. M., and Zaslavsky, A. M. (1999). Applying the Cox proportional hazards model for analysis of latency data with interval censoring. Statistics in Medicine, 18, 2737–2747.
Gómez, G. and Calle, M. L. (1999). Non-parametric estimation with doubly censored data. Journal of Applied Statistics, 26, 45–58.
Gómez, G., Calle, M. L., and Oller, R. (2004). Frequentist and Bayesian approaches for interval-censored data. Statistical Papers, 45, 139–173.
Gómez, G., Espinal, A., and Lagakos, S. W. (2003). Inference for a linear regression model with an interval-censored covariate. Statistics in Medicine, 22, 409–425.
Gómez, G. and Lagakos, S. W. (1994). Estimation of the infection time and latency distribution of AIDS with doubly censored data. Biometrics, 50, 204–212.
Gray, R. J. (1992). Flexible methods for analyzing survival data using splines, with application to breast cancer prognosis. Journal of the American Statistical Association, 87, 942–951.
Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 711–732.
Groeneboom, P. and Wellner, J. A. (1992). Information Bounds and Nonparametric Maximum Likelihood Estimation. Birkhäuser-Verlag, Boston. ISBN 0-8176-2794-4.
Han, S. P. (1977). A globally convergent method for nonlinear programming. Journal of Optimization Theory and Applications, 22, 297–309.
Hanson, T. and Johnson, W. O. (2002). Modeling regression error with a mixture of Polya trees. Journal of the American Statistical Association, 97, 1020–1033.
Hanson, T. and Johnson, W. O. (2004). A Bayesian semiparametric AFT model for interval-censored data. Journal of Computational and Graphical Statistics, 13, 341–361.
Härkänen, T. (2003). BITE: A Bayesian intensity estimator. Computational Statistics, 18, 565–583.
Härkänen, T., Virtanen, J. I., and Arjas, E. (2000). Caries on permanent teeth: a nonparametric Bayesian analysis. Scandinavian Journal of Statistics, 27, 577–588.
Hastie, T. and Tibshirani, R. (1990). Exploring the nature of covariate effects in the proportional hazards model. Biometrics, 46, 1005–1016.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer-Verlag, New York. ISBN 0-387-95284-5.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109.
Held, L. (2004). Simultaneous posterior probability statements from Monte Carlo output. Journal of Computational and Graphical Statistics, 13, 20–35.
Hougaard, P. (1999). Fundamentals of survival data. Biometrics, 55, 13–22.
Hougaard, P. (2000). Analysis of Multivariate Survival Data. Springer-Verlag, New York. ISBN 0-387-98873-4.
Huang, J. (1999). Asymptotic properties of nonparametric estimation based on partly interval-censored data. Statistica Sinica, 9, 501–519.
Ibrahim, J. G., Chen, M.-H., and Sinha, D. (2001). Bayesian Survival Analysis. Springer-Verlag, New York. ISBN 0-387-95277-2.
Jasra, A., Holmes, C. C., and Stephens, D. A. (2005). Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science, 20, 50–67.
Jin, Z., Lin, D. Y., Wei, L. J., and Ying, Z. (2003). Rank-based inference for the accelerated failure time model. Biometrika, 90, 341–353.
Johnson, W. and Christensen, R. (1989). Nonparametric Bayesian analysis of the accelerated failure time model. Statistics and Probability Letters, 8, 179–184.
Joly, P., Commenges, D., and Letenneur, L. (1998). A penalized likelihood approach for arbitrarily censored and truncated data: application to age-specific incidence of dementia. Biometrics, 54, 185–194.
Kalbfleisch, J. D. and MacKay, R. J. (1979). On constant-sum models for censored survival data. Biometrika, 66, 87–90.
Kalbfleisch, J. D. and Prentice, R. L. (2002). The Statistical Analysis of Failure Time Data. John Wiley & Sons, Chichester, Second edition. ISBN 0-471-36357-X.
Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53, 457–481.
Kauermann, G. (2005a). A note on smoothing parameter selection for penalised spline smoothing. Journal of Statistical Planning and Inference, 127, 53–69.
Kauermann, G. (2005b). Penalised spline smoothing in multivariable survival models with varying coefficients. Computational Statistics and Data Analysis, 49, 169–186.
Keiding, N., Andersen, P. K., and Klein, J. P. (1997). The role of frailty models and accelerated failure time models in describing heterogeneity due to omitted covariates. Statistics in Medicine, 16, 215–225.
Kim, M. Y., De Gruttola, V. G., and Lagakos, S. W. (1993). Analyzing doubly censored data with covariates, with application to AIDS. Biometrics, 49, 13–22.
Komárek, A. and Lesaffre, E. (2006a). Bayesian accelerated failure time model for correlated censored data with a normal mixture as an error distribution. To appear in Statistica Sinica.
Komárek, A. and Lesaffre, E. (2006b). Bayesian accelerated failure time model with multivariate doubly-interval-censored data and flexible distributional assumptions. Submitted.
Komárek, A. and Lesaffre, E. (2006c). Bayesian semiparametric accelerated failure time model for paired doubly-interval-censored data. Statistical Modelling, 6, 3–22.
Komárek, A., Lesaffre, E., Härkänen, T., Declerck, D., and Virtanen, J. I. (2005). A Bayesian analysis of multivariate doubly-interval-censored data. Biostatistics, 6, 145–155.
Komárek, A., Lesaffre, E., and Hilton, J. F. (2005). Accelerated failure time model for arbitrarily censored data with smoothed error distribution. Journal of Computational and Graphical Statistics, 14, 726–745.
Kooperberg, C. (1998). Bivariate density estimation with an application to survival analysis. Journal of Computational and Graphical Statistics, 7, 322–341.
Kooperberg, C. and Clarkson, D. B. (1997). Hazard regression with interval-censored data. Biometrics, 53, 1485–1494.
Kooperberg, C. and Stone, C. J. (1992). Logspline density estimation for censored data. Journal of Computational and Graphical Statistics, 1, 301–328.
Kooperberg, C., Stone, C. J., and Truong, Y. K. (1995). Hazard regression. Journal of the American Statistical Association, 90, 78–94.
Kottas, A. and Gelfand, A. E. (2001). Bayesian semiparametric median regression modeling. Journal of the American Statistical Association, 96, 1458–1468.
Kuo, L. and Mallick, B. (1997). Bayesian semiparametric inference for the accelerated failure time model. The Canadian Journal of Statistics, 25, 457–472.
Lai, T. L. and Ying, Z. (1991). Large sample theory of a modified Buckley-James estimator for regression analysis with censored data. The Annals of Statistics, 19, 1370–1402.
Laird, N. M. and Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, 38, 963–974.
Lambert, P., Collett, D., Kimber, A., and Johnson, R. (2004). Parametric accelerated failure time models with random effects and an application to kidney transplant survival. Statistics in Medicine, 23, 3177–3192.
Lambert, P. and Eilers, P. H. C. (2005). Bayesian proportional hazards model with time-varying regression coefficients: A penalized Poisson regression approach. Statistics in Medicine, 24, 3977–3989.
Langohr, K., Gómez, G., and Muga, R. (2004). A parametric survival model with an interval-censored covariate. Statistics in Medicine, 23, 3159–3175.
Lavine, M. (1992). Some aspects of Pólya tree distributions for statistical modelling. The Annals of Statistics, 20, 1222–1235.
Lavine, M. (1994). More aspects of Pólya tree distributions for statistical modelling. The Annals of Statistics, 22, 1161–1176.
Law, C. G. and Brookmeyer, R. (1992). Effects of mid-point imputation on the analysis of doubly censored data. Statistics in Medicine, 11, 1569–1578.
Lawson, A., Biggeri, A., Böhning, D., Lesaffre, E., Viel, J.-F., and Bertollini, R., editors (1999). Disease Mapping and Risk Assessment for Public Health. John Wiley & Sons, Chichester. ISBN 0-471-98634-8.
Lee, E. W., Wei, L. J., and Ying, Z. (1993). Linear regression analysis for highly stratified failure time data. Journal of the American Statistical Association, 88, 557–565.
Lee, Y. and Nelder, J. A. (2004). Conditional and marginal models: Another view (with Discussion). Statistical Science, 19, 219–238.
Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation. Springer-Verlag, New York, Second edition. ISBN 0-387-98502-6.
Leroy, R., Bogaerts, K., Lesaffre, E., and Declerck, D. (2003a). The effect of fluorides and caries in primary teeth on permanent tooth emergence. Community Dentistry and Oral Epidemiology, 31, 463–470.
Leroy, R., Bogaerts, K., Lesaffre, E., and Declerck, D. (2003b). The emergence of permanent teeth in Flemish children (Belgium). Community Dentistry and Oral Epidemiology, 31, 30–39.
Leroy, R., Bogaerts, K., Lesaffre, E., and Declerck, D. (2005). Effect of caries experience in primary molars on cavity formation in the adjacent permanent first molar. Caries Research, 39, 342–349.
Lesaffre, E., Komárek, A., and Declerck, D. (2005). An overview of methods for interval-censored data with an emphasis on applications in dentistry. Statistical Methods in Medical Research, 14, 539–552.
Liang, K. Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.
Lin, J. S. and Wei, L. J. (1992). Linear regression analysis for multivariate failure time observations. Journal of the American Statistical Association, 87, 1091–1097.
Lindsey, J. K. and Lambert, P. (1998). On the appropriateness of marginal models for repeated measurements in clinical trials. Statistics in Medicine, 17, 447–469.
Lo, A. Y. (1984). On a class of Bayesian nonparametric estimates: I. Density estimates. The Annals of Statistics, 12, 351–357.
Louis, T. A. (1981). Nonparametric analysis of an accelerated failure time model. Biometrika, 68, 381–390.
Mantel, N. (1966). Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer Chemotherapy Reports, 50, 163–170.
Mantel, N. (1967). Ranking procedures for arbitrarily restricted observations. Biometrics, 23, 65–78.
Mauldin, R. D., Sudderth, W. D., and Williams, S. C. (1992). Pólya trees and random distributions. The Annals of Statistics, 20, 1203–1221.
McLachlan, G. J. and Basford, K. E. (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker, Inc., New York. ISBN 0-8247-7691-7.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., and Teller, A. H. (1953). Equations of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1091.
Miller, R. G. (1976). Least squares regression with censored data. Biometrika, 63, 449–464.
Molenberghs, G. and Verbeke, G. (2005). Models for Discrete Longitudinal Data. Springer Science+Business Media, New York. ISBN 0-387-25144-8.
Nanda, R. S. (1960). Eruption of human teeth. American Journal of Orthodontics, 46, 363–378.
Nardi, A. and Schemper, M. (2003). Comparing Cox and parametric models in clinical studies. Statistics in Medicine, 22, 3597–3610.
Neal, R. M. (2003). Slice sampling (with Discussion). The Annals of Statistics, 31, 705–767.
Odell, P. M., Anderson, K. M., and D’Agostino, R. B. (1992). Maximum likelihood estimation for interval-censored data using a Weibull-based accelerated failure time model. Biometrics, 48, 951–959.
O’Hagan, A. (1994). Kendall’s Advanced Theory of Statistics, Volume 2B: Bayesian Inference. Arnold, London, Sixth edition. ISBN 0-340-52922-9.
Oller, R., Gómez, G., and Calle, M. L. (2004). Interval censoring: model characterization for the validity of the simplified likelihood. The Canadian Journal of Statistics, 32, 315–326.
O’Sullivan, F. (1986). A statistical perspective on ill-posed inverse problems (with Discussion). Statistical Science, 1, 502–527.
O’Sullivan, F. (1988). Fast computation of fully automated log-density and log-hazard estimators. SIAM Journal on Scientific and Statistical Computing, 9, 363–379.
Oulis, C. J., Raadal, M., and Martens, L. (2000). Guidelines on the use of fluoride in children: an EAPD policy document. European Journal of Paediatric Dentistry, 1, 7–12.
Pan, J. and MacKenzie, G. (2003). On modelling mean-covariance structures in longitudinal studies. Biometrika, 90, 239–244.
Pan, W. (1999a). A comparison of some two-sample tests with interval censored data. Nonparametric Statistics, 12, 133–146.
Pan, W. (1999b). Extending the iterative convex minorant algorithm to the Cox model for interval-censored data. Journal of Computational and Graphical Statistics, 8, 109–120.
Pan, W. (2000a). A multiple imputation approach to Cox regression with interval-censored data. Biometrics, 56, 199–203.
Pan, W. (2000b). A two-sample test with interval censored data via multiple imputation. Statistics in Medicine, 19, 1–11.
Pan, W. (2001). A multiple imputation approach to regression analysis for doubly censored data with application to AIDS studies. Biometrics, 57, 1245–1250.
Pan, W. and Connett, J. E. (2001). A multiple imputation approach to linear regression with clustered censored data. Lifetime Data Analysis, 7, 111–123.
Pan, W. and Kooperberg, C. (1999). Linear regression for bivariate censored data via multiple imputation. Statistics in Medicine, 18, 3111–3121.
Pan, W. and Louis, T. A. (2000). A linear mixed-effects model for multivariate censored data. Biometrics, 56, 160–166.
Parner, E. T., Heidmann, J. M., Væth, M., and Poulsen, S. (2001). A longitudinal study of time trends in the eruption of permanent teeth in Danish children. Archives of Oral Biology, 46, 425–431.
Pepe, M. S. and Fleming, T. R. (1989). Weighted Kaplan-Meier statistics: a class of distance tests for censored survival data. Biometrics, 45, 497–507.
Pepe, M. S. and Fleming, T. R. (1991). Weighted Kaplan-Meier statistics: large sample and optimality considerations. Journal of the Royal Statistical Society, Series B, 53, 341–352.
Peto, R. (1973). Experimental survival curves for interval-censored data. Applied Statistics, 22, 86–91.
Peto, R. and Peto, J. (1972). Asymptotically efficient rank-invariant test procedures (with Discussion). Journal of the Royal Statistical Society, Series A, 135, 185–206.
Petroni, G. R. and Wolfe, R. A. (1994). A two-sample test for stochastic ordering with interval-censored data. Biometrics, 50, 77–87.
Pourahmadi, M. (1999). Joint mean-covariance models with applications to longitudinal data: Unconstrained parametrisation. Biometrika, 86, 677–690.
Prentice, R. L. (1978). Linear rank tests with right censored data. Biometrika, 65, 167–179.
R Development Core Team (2005). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org. ISBN 3-900051-07-0.
Rabinowitz, D., Tsiatis, A., and Aragon, J. (1995). Regression with interval-censored data. Biometrika, 82, 501–513.
Ramsay, J. O. (1988). Monotone regression splines in action. Statistical Science, 3, 425–461.
Reid, N. (1994). A conversation with Sir David Cox. Statistical Science, 9, 439–455.
Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with unknown number of components (with Discussion). Journal of the Royal Statistical Society, Series B, 59, 731–792.
Ritov, Y. (1990). Estimation in a linear regression model with censored data. The Annals of Statistics, 18, 303–328.
Robert, C. P. and Casella, G. (2004). Monte Carlo Statistical Methods. Springer-Verlag, New York, Second edition. ISBN 0-387-21239-6.
Roeder, K. and Wasserman, L. (1997). Practical Bayesian density estimation using mixtures of normals. Journal of the American Statistical Association, 92, 894–902.
Rosenberg, P. S. (1995). Hazard function estimation using B-splines. Biometrics, 51, 874–887.
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York. ISBN 0-471-08705-X.
Rücker, G. and Messerer, D. (1988). Remission duration: an example of interval-censored observations. Statistics in Medicine, 7, 1139–1145.
Satten, G. A. (1996). Rank-based inference in the proportional hazards model for interval censored data. Biometrika, 83, 355–370.
Satten, G. A., Datta, S., and Williamson, J. M. (1998). Inference based on imputed failure times for the proportional hazards model with interval-censored data. Journal of the American Statistical Association, 93, 318–327.
Self, S. G. and Grossman, E. A. (1986). Linear rank tests for interval-censored data with application to PCB levels in adipose tissue of transformer repair workers. Biometrics, 42, 521–530.
Silverman, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society, Series B, 47, 1–52.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002). Bayesian measures of model complexity and fit (with Discussion). Journal of the Royal Statistical Society, Series B, 64, 583–639.
Stephens, M. (2000). Dealing with label switching in mixture models. Journal of the Royal Statistical Society, Series B, 62, 795–809.
Sun, J. (1995). Empirical estimation of a distribution function with truncated and doubly interval-censored data and its application to AIDS studies. Biometrics, 51, 1096–1104.
Sun, J., Liao, Q., and Pagano, M. (1999). Regression analysis of doubly censored failure time data with application to AIDS studies. Biometrics, 55, 909–914.
Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82, 528–550.
Therneau, T. M. and Grambsch, P. M. (2000). Modeling Survival Data: Extending the Cox Model. Springer-Verlag, New York. ISBN 0-387-98784-3.
Therneau, T. M. and Hamilton, S. A. (1997). rhDNase as an example of recurrent event analysis. Statistics in Medicine, 16, 2029–2047.
Tierney, L. (1994). Markov chains for exploring posterior distributions (with Discussion). The Annals of Statistics, 22, 1701–1762.
Titterington, D. M., Smith, A. F. M., and Makov, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons, Chichester. ISBN 0-471-90763-4.
Topp, R. and Gómez, G. (2004). Residual analysis in linear regression models with an interval-censored covariate. Statistics in Medicine, 23, 3377–3391.
Tsiatis, A. A. (1990). Estimating regression parameters using linear rank tests for censored data. The Annals of Statistics, 18, 354–372.
Tsiatis, A. A. and Davidian, M. (2004). Joint modeling of longitudinal and time-to-event data: An overview. Statistica Sinica, 14, 809–834.
Turnbull, B. (1976). The empirical distribution function with arbitrarily grouped, censored and truncated data. Journal of the Royal Statistical Society, Series B, 38, 290–295.
Tutz, G. and Binder, H. (2004). Flexible modelling of discrete failure time including time-varying smooth effects. Statistics in Medicine, 23, 2445–2461.
Unser, M., Aldroubi, A., and Eden, M. (1992). On the asymptotic convergence of B-spline wavelets to Gabor functions. IEEE Transactions on Information Theory, 38, 864–872.
Vaida, F. and Xu, R. (2000). Proportional hazards model with random effects. Statistics in Medicine, 19, 3309–3324.
Vanobbergen, J., Martens, L., Lesaffre, E., Bogaerts, K., and Declerck, D. (2001). Assessing risk indicators for dental caries in the primary dentition. Community Dentistry and Oral Epidemiology, 29, 424–434.
Vanobbergen, J., Martens, L., Lesaffre, E., and Declerck, D. (2000). The Signal-Tandmobiel® project – a longitudinal intervention health promotion study in Flanders (Belgium): baseline and first year results. European Journal of Paediatric Dentistry, 2, 87–96.
Verbeke, G. and Lesaffre, E. (1996). A linear mixed-effects model with heterogeneity in the random-effects population. Journal of the American Statistical Association, 91, 217–221.
Verbeke, G. and Lesaffre, E. (1997). The effect of misspecifying the random-effects distribution in linear mixed models for longitudinal data. Computational Statistics and Data Analysis, 23, 541–556.
Verweij, P. J. M. and Van Houwelingen, H. C. (1994). Penalized likelihood in Cox regression. Statistics in Medicine, 13, 2427–2436.
Virtanen, J. I. (2001). Changes and trends in attack distributions and progression of dental caries in three age cohorts in Finland. Journal of Epidemiology and Biostatistics, 6, 325–329.
Wahba, G. (1983). Bayesian “confidence intervals” for the cross-validated smoothing spline. Journal of the Royal Statistical Society, Series B, 45, 133–150.
Walker, S. G., Damien, P., Laud, P. W., and Smith, A. F. M. (1999). Bayesian nonparametric inference for random distributions and related functions (with Discussion). Journal of the Royal Statistical Society, Series B, 61, 485–527.
Walker, S. G. and Mallick, B. K. (1999). A Bayesian semiparametric accelerated failure time model. Biometrics, 55, 477–483.
Wand, M. P. (2003). Smoothing and mixed models. Computational Statistics, 18, 223–249.
Wei, G. C. G. and Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. Journal of the American Statistical Association, 85, 699–704.
Wei, G. C. G. and Tanner, M. A. (1991). Applications of multiple imputation to the analysis of censored regression data. Biometrics, 47, 1297–1309.
Williams, J. S. and Lagakos, S. W. (1977). Models for censored survival analysis: Constant-sum and variable-sum models. Biometrika, 64, 215–224.
Ying, Z. (1993). A large sample study of rank estimation for censored regression data. The Annals of Statistics, 21, 76–99.
Yu, Q., Li, L., and Wong, G. Y. C. (2000). On consistency of the self-consistent estimator of survival functions with interval-censored data. Scandinavian Journal of Statistics, 27, 35–44.
Yu, Q., Schick, A., Li, L., and Wong, G. Y. C. (1998). Asymptotic properties of the GLME in the case 1 interval-censorship model with discrete inspection times. Canadian Journal of Statistics, 26, 619–627.

Curriculum Vitae

Arnošt Komárek was born on March 28, 1977 in Hradec Králové in the Czech Republic. After secondary school at Božena Němcová Secondary Grammar School (Gymnázium Boženy Němcové) in Hradec Králové, he started undergraduate studies in Mathematics in September 1995 at the Faculty of Mathematics and Physics of the Charles University (Univerzita Karlova) in Prague, the Czech Republic, where he chose the direction of Mathematical Statistics and graduated as Master of Science in Mathematical Statistics in May 2000. From October 2000 until September 2001, he was enrolled as an Erasmus exchange student at the University of Limburg (Limburgs Universitair Centrum, nowadays Universiteit Hasselt) in Diepenbeek, Belgium, and obtained the degree of Master of Science in Biostatistics.
In October 2001 he started his career as a researcher, initially as a predoctoral student, at the Biostatistical Centre of the Catholic University of Leuven (Katholieke Universiteit Leuven) in Leuven, Belgium. At the same institute, he entered the doctoral programme in October 2002, of which this thesis is the most important outcome.