Katholieke Universiteit Leuven
Faculty of Science
Arnošt Komárek
Accelerated Failure Time Models
for Multivariate Interval-Censored Data
with Flexible Distributional Assumptions
Promotors:
Prof. Emmanuel Lesaffre
Prof. Jan Beirlant
Thesis submitted to obtain the degree of
Doctor of Science
May 2006
ISBN 90-8649-014-X
© Arnošt Komárek
All rights reserved. No part of this book may be reproduced, in any form or
by any other means, without the written permission of the copyright owner.
Word of Thanks
I wish to express my gratitude to everyone who contributed in one way or another to the successful completion of my doctoral studies.
Special thanks go to my supervisor, Professor Emmanuel Lesaffre, who always guided me expertly and knew how to support me during difficult moments while this thesis was being prepared. The fact that this thesis led to five articles accepted by international scientific journals is above all the result of his ability to push me in the right direction at the right moment. He also offered me many opportunities for contact with other researchers at both the national and the international level, for which I thank him as well.
Furthermore, I would like to thank my co-supervisor, Professor Jan Beirlant, who was always ready to help me whenever it was needed.
Thanks also to all members of the jury, Prof. Paul Janssen, Prof. An Carbonez, Prof. Irene Gijbels, Prof. Guadalupe Gómez and Prof. Dominique Declerck, for their critical reading of this work, which led to a substantial improvement.
I would like to thank my (former) colleagues at the Biostatistical Centre, Dora, Kris, Samuel, Silvia, Steffen, Geert, Dimitris, Roula, Wendim, Luwis, Alejandro, María José, Ann, Roos, Annelies, Bart and Francis, for the pleasant working environment during the last five years. Extra thanks go to Jeannine for her perfect administrative support, on which I could always rely.
For a fascinating year that introduced me to the field of applied statistics, and in particular biostatistics, and that led to the start of my doctoral studies in Leuven, I would like to thank all my fellow students and teachers of the Biostatistics programme at the Limburgs Universitair Centrum in Diepenbeek in the academic year 2000–01.
This thesis could not have come into existence without the financial support of the research grants of the Katholieke Universiteit Leuven. The support of grants OT/00/35, OE/03/29, DB/04/031 and BDB-B/05/10 is deeply appreciated.
Last but not least, I would like to thank Pascale, Filip, Sibe, Ine and Wout, who managed to create a family for me in Belgium.
Thank you!
Arnošt
Thanks
This text could also never have come into existence without the knowledge of mathematics and statistics that I acquired during my undergraduate studies at the Faculty of Mathematics and Physics of Charles University in Prague. For the foundations of my statistical knowledge I would like to thank all members of the Department of Probability and Mathematical Statistics.
I further thank Michal Kulich for persuading me to leave for Belgium for one year in 2000, thereby changing my place of permanent residence for at least the next six years and indirectly contributing to a complete change of the topic of my doctoral thesis. I thank Professor Jaromír Antoch, my original doctoral supervisor, for not holding my desertion against me, judging by our subsequent ROBUST and other meetings.
Finally, I thank Lenka for keeping me even though much of the time that I could have devoted to her I devoted to statistics. I also thank her for a two-and-a-half-kilo present, which has grown a bit in the meantime and with which she brightened up the end of one COMPSTAT. I thank Jindra for simply being. Without your smiles and other expressions of favour and disfavour, the final work on this text would not have been nearly as cheerful as it was.
Thank you!
Arnošt
Acknowledgement
There would be no need to develop the techniques presented in this thesis if
there were no data posing interesting questions. I would like to thank all
who collected those interesting data sets and allowed me to use them in this
thesis.
Data collection for the Signal Tandmobiel® project introduced in Section
1.1 was supported by Unilever, Belgium. The Signal Tandmobiel® project
comprises the following partners: D. Declerck (Dental School, Catholic University Leuven), L. Martens (Dental School, University Ghent), J. Vanobbergen (Oral Health Promotion and Prevention, Flemish Dental Association), P.
Bottenberg (Dental School, University Brussels), E. Lesaffre (Biostatistical
Centre, Catholic University Leuven), K. Hoppenbrouwers (Youth Health Department, Catholic University Leuven; Flemish Association for Youth Health
Care).
The WIHS data introduced in Section 1.3 were collected by the Women’s
Interagency HIV Study Collaborative Study Group and its Oral Substudy
with centers (Principal Investigators) at New York City/Bronx Consortium
(K. Anastos, J. A. Phelan); Brooklyn, NY (H. Minkoff); Washington DC
Metropolitan Consortium (M. Young); The Connie Wofsy Study Consortium
of Northern California (R. Greenblatt, D. Greenspan, J. S. Greenspan); Los
Angeles County/Southern California Consortium (A. Levine, R. Mulligan, M.
Navazesh); Chicago Consortium (M. Cohen, M. Alves); Data Coordinating
Center (A. Muñoz). The WIHS is funded by the National Institute of Allergy
and Infectious Diseases, with supplemental funding from the National Cancer
Institute, the National Institute of Child Health & Human Development,
the National Institute on Drug Abuse, the National Institute of Dental and
Craniofacial Research, the Agency for Health Care Policy and Research, the
National Center for Research Resources, and the Centers for Disease Control
and Prevention (grants U01-AI-35004, U01-AI-31834, U01-AI-34994, U01-AI-34989,
U01-HD-32632 (NICHD), U01-AI-34993, U01-AI-42590, M01-RR00079, and
M01-RR00083). The WIHS Oral Substudy is funded by the National Institute
of Dental and Craniofacial Research.
The EBCP data introduced in Section 1.4 were kindly provided by Catherine
Legrand and Richard Sylvester from the European Organisation for Research
and Treatment of Cancer.
Thank You!
Arnošt
The majority of the material in this thesis is based on original publications. Below, we list the parts of the thesis that are principally based on
these publications.
Sections 5.1, 7.7: Lesaffre, E., Komárek, A., and Declerck, D. (2005).
An overview of methods for interval-censored data with an emphasis on
applications in dentistry. Statistical Methods in Medical Research, 14,
539–552.
Section 5.2: Komárek, A., Lesaffre, E., Härkänen, T., Declerck,
D., and Virtanen, J. I. (2005). A Bayesian analysis of multivariate
doubly-interval-censored data. Biostatistics, 6, 145–155.
Chapter 7: Komárek, A., Lesaffre, E., and Hilton, J. F. (2005). Accelerated failure time model for arbitrarily censored data with smoothed
error distribution. Journal of Computational and Graphical Statistics,
14, 726–745.
Chapter 8: Komárek, A. and Lesaffre, E. (2006a). Bayesian accelerated failure time model for correlated censored data with a normal
mixture as an error distribution. To appear in Statistica Sinica.
Chapter 9: Komárek, A. and Lesaffre, E. (2006b). Bayesian accelerated failure time model with multivariate doubly-interval-censored data
and flexible distributional assumptions. Submitted.
Chapter 10: Komárek, A. and Lesaffre, E. (2006c). Bayesian semiparametric accelerated failure time model for paired doubly-interval-censored data. Statistical Modelling, 6, 3–22.
Contents

Notation
Preface

Part I  Introduction

1 Motivating Data Sets
   1.1 The Signal Tandmobiel® study
   1.2 The Chronic Granulomatous Disease trial (CGD)
   1.3 The Women's Interagency HIV Study (WIHS)
   1.4 Perioperative Chemotherapy in Early Breast Cancer Patients (EBCP)

2 Basic Notions
   2.1 Right, left and interval censoring
   2.2 Doubly interval censoring
   2.3 Density, survival, hazard and cumulative hazard functions
   2.4 Independent noninformative censoring and simplified likelihood
      2.4.1 Right-censored data
      2.4.2 Interval-censored data
      2.4.3 Simplified likelihood for interval-censored data

3 An Overview of Regression Models for Survival Data
   3.1 Proportional hazards model
   3.2 Accelerated failure time model
   3.3 Accelerated failure time model versus proportional hazards model
   3.4 Regression models for multivariate survival data
      3.4.1 Frailty proportional hazards model
      3.4.2 Population averaged accelerated failure time model
      3.4.3 Cluster specific accelerated failure time model
      3.4.4 Population averaged model versus cluster specific model

4 Frequentist and Bayesian Inference
   4.1 Likelihood for interval-censored data
      4.1.1 Interval-censored data
      4.1.2 Doubly-interval-censored data
   4.2 Likelihood for multivariate (doubly) interval-censored data
   4.3 Bayesian data augmentation
   4.4 Hierarchical specification of the model
   4.5 Markov chain Monte Carlo
   4.6 Credible regions and Bayesian p-values
      4.6.1 Credible regions
      4.6.2 Bayesian p-values

5 An Overview of Methods for Interval-Censored Data
   5.1 Frequentist methods
      5.1.1 Estimation of the survival function
      5.1.2 Comparison of two survival distributions
      5.1.3 Proportional hazards model
      5.1.4 Accelerated failure time model
      5.1.5 Interval-censored covariates
   5.2 Bayesian proportional hazards model: An illustration
      5.2.1 Signal Tandmobiel® study: Research question and related data characteristics
      5.2.2 Proportional hazards modelling using midpoints
      5.2.3 The Bayesian survival model for doubly-interval-censored data
      5.2.4 Results
      5.2.5 Discussion
   5.3 Bayesian accelerated failure time model
   5.4 Concluding remarks

Concluding Remarks to Part I and Introduction to Part II

Part II  Accelerated Failure Time Models with Flexible Distributional Assumptions

6 Mixtures as Flexible Models for Unknown Distributions
   6.1 Classical normal mixture
      6.1.1 From general finite mixture to normal mixture
      6.1.2 Estimation of mixture parameters
   6.2 Penalized B-splines
      6.2.1 Introduction to B-splines
      6.2.2 Penalized smoothing
      6.2.3 B-splines in the survival analysis
      6.2.4 B-splines as models for densities
      6.2.5 B-splines for multivariate smoothing
   6.3 Penalized normal mixture
      6.3.1 From B-spline to normal density
      6.3.2 Transformation of mixture weights
      6.3.3 Penalized normal mixture for distributions with an arbitrary location and scale
      6.3.4 Multivariate smoothing
   6.4 Classical versus penalized normal mixture

7 Maximum Likelihood Penalized AFT Model
   7.1 Model
      7.1.1 Model for the error density
      7.1.2 Scale regression
   7.2 Penalized maximum-likelihood
      7.2.1 Penalized log-likelihood
      7.2.2 Remarks on the penalty function
      7.2.3 Selecting the smoothing parameter
   7.3 Inference based on the maximum likelihood penalized AFT model
      7.3.1 Pseudo-variance
      7.3.2 Asymptotic variance
      7.3.3 The pseudo-variance versus the asymptotic variance
      7.3.4 Remarks
   7.4 Predictive survival and hazard curves and predictive densities
   7.5 Simulation study
   7.6 Example: WIHS data – interval censoring
      7.6.1 Fitted models
      7.6.2 Predictive survival and hazard curves, predictive densities
      7.6.3 Conclusions
   7.7 Example: Signal Tandmobiel® study – interval-censored data
      7.7.1 Fitted models
      7.7.2 Predictive emergence and hazard curves
      7.7.3 Comparison of emergence distributions between different groups
      7.7.4 Conclusions
   7.8 Discussion

8 Bayesian Normal Mixture Cluster-Specific AFT Model
   8.1 Model
      8.1.1 Distributional assumptions
      8.1.2 Likelihood
   8.2 Bayesian hierarchical model
      8.2.1 Prior specification of the error part
      8.2.2 Prior specification of the regression part
      8.2.3 Weak prior information
      8.2.4 Posterior distribution
   8.3 Markov chain Monte Carlo
      8.3.1 Update of the error part of the model
      8.3.2 Update of the regression part of the model
   8.4 Bayesian estimates of the survival distribution
      8.4.1 Predictive survival and hazard curves and predictive survival densities
      8.4.2 Predictive error densities
   8.5 Bayesian estimates of the individual random effects
   8.6 Simulation study
   8.7 Example: Signal Tandmobiel® study – clustered interval-censored data
      8.7.1 Prior distribution
      8.7.2 Results for the regression and error parameters
      8.7.3 Inter-teeth relationship
      8.7.4 Predictive emergence and hazard curves
      8.7.5 Predictive error density
      8.7.6 Conclusions
   8.8 Example: CGD data – recurrent events analysis
      8.8.1 Prior distribution
      8.8.2 Effect of covariates on the time to infection
      8.8.3 Predictive error density and variability of random effects
      8.8.4 Estimates of individual random effects
      8.8.5 Conclusions
   8.9 Example: EBCP data – multicenter study
      8.9.1 Prior distribution
      8.9.2 Effect of covariates on PFS time
      8.9.3 Predictive error density and variance components of random effects
      8.9.4 Estimates of individual random effects
      8.9.5 Conclusions
   8.10 Discussion

9 Bayesian Penalized Mixture Cluster-Specific AFT Model
   9.1 Model
      9.1.1 Distributional assumptions
      9.1.2 Likelihood
   9.2 Bayesian hierarchical model
      9.2.1 Prior distribution for G
      9.2.2 Prior distribution for the generic node Y
      9.2.3 Prior distribution for multivariate random effects in Model M
      9.2.4 Prior distribution for the regression parameters
      9.2.5 Prior distribution for the time variables
      9.2.6 Posterior distribution
   9.3 Markov chain Monte Carlo
      9.3.1 Updating the parameters related to the penalized mixture G
      9.3.2 Updating the generic node Y
      9.3.3 Updating the parameters related to the multivariate random effects in Model M
      9.3.4 Updating the regression parameters
   9.4 Bayesian estimates of the survival distribution
      9.4.1 Predictive survival and hazard curves and predictive survival densities
      9.4.2 Predictive error and random effect densities
   9.5 Bayesian estimates of the individual random effects
   9.6 Simulation study
   9.7 Example: Signal Tandmobiel® study – clustered doubly-interval-censored data
      9.7.1 Basic Model
      9.7.2 Final Model
      9.7.3 Prior distribution
      9.7.4 Results
      9.7.5 Conclusions
   9.8 Example: EBCP data – multicenter study
      9.8.1 Prior distribution
      9.8.2 Effect of covariates on PFS time
      9.8.3 Predictive error density and variance components of random effects
      9.8.4 Estimates of individual random effects
      9.8.5 Conclusions
   9.9 Discussion

10 Bayesian Penalized Mixture Population-Averaged AFT Model
   10.1 Model
      10.1.1 Distributional assumptions
      10.1.2 Likelihood
   10.2 Bayesian hierarchical model
      10.2.1 Prior distribution for G
      10.2.2 Prior distribution for the generic node Y
      10.2.3 Prior distribution for the regression parameters and time variables
      10.2.4 Posterior distribution
   10.3 Markov chain Monte Carlo
   10.4 Evaluation of association
   10.5 Bayesian estimates of the survival distribution
      10.5.1 Predictive survival and hazard curves and predictive survival densities
      10.5.2 Predictive error densities
   10.6 Example: Signal Tandmobiel® study – paired doubly-interval-censored data
      10.6.1 Basic Model
      10.6.2 Final Model
      10.6.3 Prior distribution
      10.6.4 Results
   10.7 Discussion

11 Overview and Further Research
   11.1 Overview
   11.2 Generalizations and improvements
   11.3 The use of penalized mixtures in other application areas
      11.3.1 Generalized linear mixed models with random effects having a flexible distribution
      11.3.2 Spatial models with the intensity specified by the penalized mixture

A Technical details for the Maximum Likelihood Penalized AFT Model
   A.1 Optimization algorithm
   A.2 Individual log-likelihood contributions
   A.3 First derivatives of the log-likelihood
      A.3.1 With respect to the regression parameters and the intercept
      A.3.2 With respect to the log-scale and the scale-regression parameters
      A.3.3 With respect to the transformed mixture weights
   A.4 Second derivatives of the log-likelihood
      A.4.1 With respect to the extended regression parameters
      A.4.2 Mixed with respect to the extended regression parameters and the log-scale or the scale-regression parameters
      A.4.3 Mixed with respect to the extended regression parameters and the transformed mixture weights
      A.4.4 With respect to the log-scale or the scale-regression parameters
      A.4.5 Mixed with respect to the log-scale or the scale-regression parameters and the transformed mixture weights
      A.4.6 With respect to the transformed mixture weights
   A.5 Derivatives of the penalty term
   A.6 Derivatives of the constraints
   A.7 Proof of Proposition 7.1

B Simulation results
   B.1 Simulation for the maximum likelihood penalized AFT model
   B.2 Simulation for the Bayesian normal mixture cluster-specific AFT model
   B.3 Simulation for the Bayesian penalized mixture cluster-specific AFT model

C Software
   C.1 Package smoothSurv
   C.2 Package bayesSurv

Bibliography
Curriculum Vitae
Notation

Here, we give a list of the most often used symbols within this thesis.

δ_i
   ⋆ censoring indicator,
   ⋆ 0 for right-censored, 1 for exactly observed, 2 for left-censored, 3 for interval-censored observations;

1
   ⋆ vector of ones;

ϕ(e)
   ⋆ density of N(0, 1);

ϕ(e | µ, σ²)
   ⋆ density of N(µ, σ²);

ϕ_q(e | µ, Σ)
   ⋆ density of the q-variate normal distribution with mean µ and covariance matrix Σ;

Φ(e)
   ⋆ cumulative distribution function of N(0, 1);

Φ(e | µ, σ²)
   ⋆ cumulative distribution function of N(µ, σ²);

⌊t^L, t^U⌋
   ⋆ interval-censored observation,
   ⋆ according to the context, the interval might be closed, half closed or open;

⨍_{t^L}^{t^U} p(s) ds
   ⋆ symbol used to write down the likelihood of interval-censored data,
   ⋆ = ∫_{t^L}^{t^U} p(s) ds if t^L < t^U,
   ⋆ = p(t^L) = p(t^U) if t^L = t^U.
Preface
The accelerated failure time (AFT) model, the principal topic of this thesis, is a regression model used to analyze survival data. The term survival data is usually used for data that measure the time to some event, not necessarily death. More precisely, the event time will be considered a positive real-valued variable having a continuous distribution. In practical situations, data on event times are obtained by following subjects in the study over (calendar) time, recording the moments of the specified events of interest and computing the time spans between the event and some initial (onset) time (e.g. entry into a study and disease progression, HIV infection and the onset of AIDS, tooth emergence and the time the tooth is attacked by caries for the first time).
A typical feature of survival data is the fact that the time to the event is not always observed completely and observations are subject to censoring. Most commonly, either the study is finished before all subjects involved encounter the specified event, or the subject leaves the study for some reason before encountering the event. In both situations, only a lower limit for the true event time is known and we talk about right censoring (see Sections 1.2 and 1.4 for examples).
In many areas of medical research, the occurrence of the event of interest
can only be recorded at planned (or unplanned) visits. The exact event time
is then only known to happen between two examination times (visits) and
we encounter interval censoring. Typical examples are (a) time to caries development; (b) time to emergence of a tooth (Section 1.1); (c) time to HIV seroconversion; (d) time to the onset of AIDS (Section 1.3). Indeed, in the case of a cavity or of emergence the event is often observed after some delay, say at planned (or even unplanned) visits. Similarly, HIV seroconversion can only be determined by regular or irregular laboratory assessments. However, the event may also happen before the first examination (e.g. a decayed tooth is detected already at the first dental examination) and we get a so-called left-censored observation, or it may happen after the last examination, resulting in a right-censored observation. Hence interval censoring is a natural
generalization of the commonly encountered right censoring.
Often not only the event time but also the time which specifies the origin
of the time scale for the event (the onset time) can only be recorded in the
same way as described in the previous paragraph. An example is the time to
caries development on a tooth where the time of tooth emergence constitutes
the onset time for caries (see Section 1.1). We then speak of doubly interval
censoring. We further formalize the notion of censoring in Chapter 2.
Furthermore, independence between the event times cannot always be assumed, thereby entering the area of multivariate survival data. The dependence can be caused by very different factors. Although many methods described in this thesis can be applied to any multivariate survival data, the dependencies in our applications are all the result of some type of clustering: emergence or caries times of several teeth of one child (Section 1.1), or progression-free survival times of several patients within one hospital in a multicenter clinical trial (Section 1.4). Also recurrent infection times of one patient (Section 1.2) can be considered to result in clustered data.
The ultimate goal of the research presented in this thesis was to develop AFT models that can be used to analyze multivariate survival data, possibly in the presence of doubly interval censoring. The scale of complexity considered in this thesis starts with interval censoring, which can be handled by all methods introduced here. Possible dependencies between the observations (multivariate survival data) are viewed as the next step on the scale of complexity and, finally, doubly interval censoring is regarded as the final level of complexity treated by this thesis; only some methods shown here reached this final stage. At all levels of complexity we strove for models with distributional assumptions that are as flexible as possible. Two slightly different directions are followed in the thesis to address this issue. Both of them use a Gaussian mixture as a building block to model an unknown distribution. Whereas the first and more extensively explored approach uses a mixture with a higher number of fixed mixture components, with the mixture weights estimated using a penalized methodology, the second technique uses a classical mixture with the number as well as the weights, locations and scales of the mixture components unknown.
Chapter 1 introduces several data sets, each containing survival data involving one or more of the issues discussed above, that will be used throughout the thesis to illustrate the developed methods. Terminology and notation used in the thesis are formalized in Chapter 2, together with an explanation of some basic notions in the analysis of survival data. The most popular regression models for survival data are introduced in Chapter 3.
In Chapter 4 we give the likelihood for interval- and doubly-interval-censored data and briefly discuss the difficulties encountered when using maximum-likelihood methods in the context of (doubly) interval-censored data. Subsequently, we show how Bayesian inference together with the Markov chain Monte Carlo (MCMC) methodology can simplify the calculations.
Available methods for the analysis of interval-censored data will be reviewed in Chapter 5, and one of the methods, namely the Bayesian proportional hazards model with a piecewise constant baseline hazard function, will be applied to the analysis of the clustered doubly-interval-censored dental data.
In Chapter 6 we explain in detail how the classical and the penalized normal
mixtures can be used to specify unknown distributions in a flexible way.
The first AFT model presented in this thesis – the AFT model with an error
distribution being a normal mixture with a high number of fixed components estimated using the penalized maximum-likelihood method – is shown
in Chapter 7. However, only univariate interval-censored data can be handled by this method. To move on to the area of multivariate or even doubly-interval-censored survival data, we found it more advantageous to use a Bayesian methodology rather than the more classical maximum-likelihood based techniques. The Bayesian AFT model allowing for multivariate interval-censored data and using a classical normal mixture, with both the number of mixture components and the mixture components themselves unknown, to specify the error distribution is presented in Chapter 8. Finally, Chapters 9 and 10 show the Bayesian AFT models suitable for multivariate doubly-interval-censored data that exploit a penalized normal mixture with a higher number of fixed components. For all methods described in this thesis, software was written in the form of R (R Development Core Team, 2005) packages called smoothSurv and bayesSurv, downloadable from the Comprehensive R Archive Network at http://www.R-project.org. The software is briefly described in Appendix C.
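As a pointer for readers who wish to try the methods, a minimal sketch of obtaining and loading these packages in R is given below; it assumes only that both packages are available from CRAN, as stated above.

## Minimal sketch: install and load the packages mentioned above
## (assumes an internet connection and that both packages are on CRAN).
install.packages(c("smoothSurv", "bayesSurv"))

library("smoothSurv")   # penalized maximum-likelihood AFT model (Chapter 7)
library("bayesSurv")    # Bayesian AFT models (Chapters 8, 9 and 10)

## An overview of the available functions:
## help(package = "smoothSurv"); help(package = "bayesSurv")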
Part I
Introduction
Chapter 1
Motivating Data Sets
This chapter introduces the data sets which will be used throughout the thesis to illustrate the developed techniques and to show their generality. Each data set involves one or more of the specific features of interest here, discussed briefly in the Preface. The Signal Tandmobiel® data set introduced in Section 1.1 involves clustered interval- and doubly-interval-censored dental observations. Section 1.2 describes a clinical trial with patients with chronic granulomatous disease where the times of possibly recurrent infections were of interest; at the same time, the time of the last infection is right-censored. The Women's Interagency HIV Study involved interval-censored data and is described in Section 1.3. In Section 1.4, a multicenter clinical trial is described which evaluated the effect of perioperative chemotherapy on disease progression in early breast cancer patients, where heterogeneity across the centra plays an important role.
1.1 The Signal Tandmobiel® study

The Signal Tandmobiel® project is a longitudinal oral health study performed
in Flanders from 1996 to 2001. It involved 4 468 schoolchildren (2 315 boys
and 2 153 girls) born in 1989. Two stratification factors, i.e. geographical
location (5 provinces) and educational system (3 school systems) establishing
15 strata, were taken into account. The sample represented about 7% of the
corresponding Flemish population of school children. Detailed oral health
data at tooth and tooth-surface level (caries experience, gingivitis, etc.) were
annually collected by a team of 16 dentists whose examination method was
calibrated every six months. In addition, data on dietary and oral hygiene
habits were collected using a questionnaire completed by the parents. Hence
the data set consists of a series of at most 6 longitudinal dental observations
and reported oral health habits. The study design and research methods have been described in detail by Vanobbergen et al. (2000).
Here, we concentrate on the emergence and caries times of permanent premolars and molars (teeth κ + 4, κ + 5, κ + 6, κ = 10, 20, 30, 40 in European
dental notation, see Figure 1.1). There is no doubt that an adequate knowledge of the timing and patterns of tooth emergence and/or caries attacks is still essential for diagnosis and treatment planning in paediatric dentistry and orthodontics. Additionally, the effect of certain prespecified factors (such as the caries status of the primary teeth, see Figure 1.2 for their notation, the use of fluoride supplements, brushing habits, etc.) on the emergence or caries processes is often of interest.
An interesting feature of this data set, though typical in dental applications, is
the fact that both emergence and onset of caries are only observable when the
child is examined (by a dentist). This leads to interval-censored emergence
times and to doubly-interval-censored times for caries (see also Figure 2.1).
Additionally, the teeth of a single mouth share common immeasurable or only roughly measured factors like genetic dispositions or dietary habits. As a result, the emergence times or the times to caries of teeth in the same mouth are related. Hence, when studying the emergence time or the time to caries of several teeth, dependencies among the observations taken on a single child must be taken into account. Analysis of the emergence time or time to caries is reported in several sections of the thesis.

Figure 1.1: European notation for the position of permanent teeth. Maxilla = upper jaw, mandible = lower jaw. The first and the fourth quadrants are at the right-hand side of the subject, the second and the third quadrants are at the left-hand side of the subject.
1.2 The Chronic Granulomatous Disease trial (CGD)
Chronic Granulomatous Disease (CGD) is a group of rare inherited disorders of immune function characterized by recurrent pyogenic infections which
may lead to death in childhood. There is evidence of a positive role of gamma
interferon in restoring the immune functions of the patients. For that reason,
a multicenter placebo-controlled randomized trial was conducted to study the
ability of gamma interferon to reduce the rate of serious infections.
Between October 1988 and March 1989, 128 patients (63 taking gamma interferon, 65 taking placebo) with CGD were accrued by 13 hospitals in Europe and the United States. The average follow-up time was 292 days; the minimal and maximal follow-up times were 91 and 432 days, respectively. For each patient, the times of the initial and any recurrent serious infections were recorded. There is a minimum of one and a maximum of eight recurrent infection times per patient, with a total of 203 records.

Figure 1.2: European notation for the position of deciduous (primary) teeth. The quadrants are numbered 5, 6, 7, 8. The fifth and the eighth quadrants are at the right-hand side of the subject, the sixth and the seventh quadrants are at the left-hand side of the subject.
Besides the gamma interferon there are other factors that may influence the
times between the infections. In the course of the study the following additional information was recorded for each patient:
• Age at time of study entry (mean 14.6 years, range from 1 to 44 years,
standard deviation 9.8 years);
• Gender: male (n = 104), female (n = 24);
• Pattern of inheritance: autosomal recessive (n = 42), X-linked (n =
86);
• Using corticosteroids at time of study entry: yes (n = 3), no (n = 125);
• Using prophylactic antibiotics at time of study entry: yes (n = 111),
no (n = 17);
• Category of the hospital: US – NIH (n = 26), US – other (n = 63),
Europe – Amsterdam (n = 19), Europe – other (n = 20).
The data can be found in Appendix D.2 of Fleming and Harrington (1991).
It is of interest here to set up a regression model with the time between two consecutive infections as the response and the above factors as covariates.
It should be taken into account that the infection times of one patient cannot
be assumed to be independent. We address this issue in Section 8.8.
1.3 The Women's Interagency HIV Study (WIHS)

The Women's Interagency HIV Study comprises a cohort of 2 058 seropositive women together with a comparison cohort of 568 seronegative women who are at a higher risk of HIV infection than the general U.S. population. The study groups were enrolled between October 1994 and November
1995 through six clinical consortia at 23 sites throughout the United States.
Barkan et al. (1998) provide full details on the setup of the study. In this
thesis we concentrate only on the WIHS Oral Substudy involving 224 seropositive AIDS-free (at baseline) women.
The women participating in the Oral Substudy were regularly (on average
every 7 months) examined for AIDS symptoms, the number of copies of
the HIV RNA virus (viral load) and CD4 T-lymphocyte counts per ml of
blood. Additionally, the presence of one of three oral lesion markers (oral candidiasis, hairy leukoplakia and angular cheilitis) was checked. The average follow-up time was 41 months and the maximal follow-up time was 84
months. For each woman, the time of seroconversion (HIV infection) was externally estimated and assumed to be known. Clinical AIDS diagnoses were
self-reported in 73.5% of cases, presumptive or definite in 17.5%, and indeterminate in 9%; the case definition did not depend on CD4 T-lymphocytes.
For 66 women the onset of AIDS was interval-censored, while for 158 women
it was right-censored.
For HIV positive people, it is of interest to describe the distribution of the
time to the onset of an AIDS-related illness based on some measured quantities. In Section 7.6 we examine how classical predictors like viral load and CD4 T-cell counts, together with oral lesion markers, can be used to describe this distribution.
1.4 Perioperative Chemotherapy in Early Breast Cancer Patients (EBCP)
To investigate whether a short intensive course of perioperative chemotherapy
can change the course of early breast cancer compared to surgery alone, the
European Organization for Research and Treatment of Cancer (EORTC) conducted a multicenter randomized clinical trial (EORTC Trial 10854). From
1986 to 1991, a total of 2 793 women with early breast cancer were randomized to receive either one perioperative course of an anthracycline-containing
chemotherapeutic regimen within 24 h after surgery (n = 1 398) or surgery
alone (n = 1 395). See Clahsen et al. (1996) for more details on the trial.
Patients were followed up for several endpoints; however, we concentrate on
the progression-free survival (PFS) time. The mean follow-up time was 8.15
years with a maximum of 14.13 years. Other factors that may influence the
PFS time include:
• Category of the age of the patient: <40 years (n = 321), 40–50 years
(n = 796), >50 years (n = 1 676);
• Type of surgery: mastectomy (n = 1 231), breast-conserving surgery
(n = 1 542), missing data for n = 20 patients;
• Category of the tumor size: <2 cm (n = 823), ≥2 cm (n = 1 915),
missing data for n = 55 patients;
• Pathological nodal status: negative (n = 1 467), positive (n = 1 303),
missing data for n = 23 patients;
• Presence of other disease: no (n = 2 542), yes (n = 234), missing for
n = 19 patients.
The trial was conducted in 14 centra located in 5 geographical regions (the
Netherlands, Poland, France, South of Europe and South Africa). Figure 1.3
shows Kaplan-Meier estimates of the PFS time survival functions for the treatment and control group, separately for each center. Obviously, there is a huge heterogeneity among the centra. Not only does the overall proportion of progression-free patients at fixed time points differ from center to center, but the effect of treatment on PFS also seems to vary across centra, both quantitatively and qualitatively. Models that measure the effect of covariates and that allow modelling heterogeneity between centra will be considered in Chapters 8 and 9.
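To illustrate how such center-specific Kaplan-Meier curves could be computed, the sketch below uses the survfit function of the R survival package; the data frame ebcp and its variables (days, progression, arm, center) are hypothetical names standing in for the EBCP data, which are not publicly distributed.

library(survival)

## Hypothetical data frame 'ebcp', one row per patient:
##   days        follow-up time in days
##   progression 1 = progression or death observed, 0 = right-censored
##   arm         "treatment" or "control"
##   center      center identifier (11, 12, ..., 51)

## Kaplan-Meier estimate of the PFS distribution by treatment arm
## within a single center, e.g. center 11:
fit11 <- survfit(Surv(days, progression) ~ arm,
                 data = subset(ebcp, center == 11))
plot(fit11, lty = c(2, 1), xlab = "Days", ylab = "PFS probability")

## Repeating this for every center gives one panel per center, as in Figure 1.3.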
Figure 1.3: EBCP Data. Kaplan-Meier estimates of the PFS time distribution separately for each institution. Solid line: treatment arm, dotted-dashed line: control arm. (Panels: the Netherlands (11), (12), (13); Poland (21), (22); France (31), (32), (33), (34); South Europe (41), (42), (43), (44); South Africa (51).)
Chapter 2
Basic Notions
In this chapter we introduce some notation that will be used throughout the thesis and explain in more detail some basic notions, such as the types and mechanisms of censoring considered.
2.1 Right, left and interval censoring
Let T_{i,l}, i = 1, …, N, l = 1, …, n_i be the exact event time for the lth observational unit of the ith cluster. It will be assumed throughout the thesis that T_{i,l} is a nonnegative random variable with a continuous distribution with some density p_{i,l}(t) which might depend on a vector of covariates, e.g., x_{i,l} = (x_{i,l,1}, …, x_{i,l,m})'. The time T_{i,l} can either be known exactly or in a coarsened manner and is then called censored. Suppose first that knowing whether the event occurred or not requires a detailed examination (visit to a dentist, laboratory assessment) executed at pre-planned visits. Then it is only known that the event time occurred after, say t^L_{i,l}, and before, say t^U_{i,l}. According to the context, we either know t^L_{i,l} < T_{i,l} ≤ t^U_{i,l}, t^L_{i,l} ≤ T_{i,l} < t^U_{i,l}, t^L_{i,l} ≤ T_{i,l} ≤ t^U_{i,l}, or t^L_{i,l} < T_{i,l} < t^U_{i,l}. Thus, the true event time T_{i,l} is known to lie in the interval whose lower and upper limits are equal to t^L_{i,l} and t^U_{i,l}, respectively, and the observation is called interval-censored. Note that all methods presented in Part II of the thesis lead to the same results irrespective of whether the interval is closed, open or half open. To cover all these situations we will write T_{i,l} ∈ ⌊t^L_{i,l}, t^U_{i,l}⌋.

With the same notation, right-censored observations are obtained by setting t^U_{i,l} = ∞ and t^L_{i,l} equal to the time the subject was last seen before leaving the study or before the study was terminated. Similarly, a left-censored observation is obtained with t^L_{i,l} = 0 and t^U_{i,l} equal to the first time the subject was seen after the event. Finally, an exactly observed time t_{i,l} is recorded with t^L_{i,l} = t^U_{i,l} = t_{i,l}. Below, a censoring indicator δ_{i,l} is used, which will be equal to 0 for right-censored, 1 for exactly observed, 2 for left-censored and 3 for interval-censored observations, respectively.
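For concreteness, observations encoded in this way can be represented in R with the Surv function of the survival package using type = "interval2", where a missing lower limit means t^L = 0 (left censoring) and a missing upper limit means t^U = ∞ (right censoring). A minimal sketch with made-up values:

library(survival)

## Four artificial observations illustrating the four censoring types,
## encoded as intervals (t^L, t^U):
##   right-censored at 6      -> (6, NA)     i.e. t^U = infinity
##   exactly observed at 4.2  -> (4.2, 4.2)
##   left-censored at 3       -> (NA, 3)     i.e. t^L = 0
##   interval-censored        -> (2, 5)
tL <- c(6,  4.2, NA, 2)
tU <- c(NA, 4.2, 3,  5)

y <- Surv(time = tL, time2 = tU, type = "interval2")
y
## Internally, the survival package stores a status code that matches the
## indicator delta defined above (0 = right, 1 = exact, 2 = left, 3 = interval).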
2.2 Doubly interval censoring
Suppose that the event time T_{i,l} is obtained as the difference of two random variables: V_{i,l}, here always called the failure time, and U_{i,l}, here always called the onset time, i.e. T_{i,l} = V_{i,l} − U_{i,l}. The pair (U_{i,l}, V_{i,l}) can be, for example, the emergence time of a tooth and the onset time of caries of that tooth. Doubly interval censoring is obtained in situations when U_{i,l} and/or V_{i,l} are interval-censored and it is only known that U_{i,l} ∈ ⌊u^L_{i,l}, u^U_{i,l}⌋ and V_{i,l} ∈ ⌊v^L_{i,l}, v^U_{i,l}⌋. A scheme of a typical doubly-interval-censored observation is given in Figure 2.1 and an example is given by the Signal Tandmobiel® data of Section 1.1, with U_{i,l} being the emergence time of the lth tooth of the ith child and V_{i,l} being the time when the same tooth is attacked by caries for the first time.

In the following, we omit the subscript (i, l) from all expressions if it is not necessary to make an explicit distinction among different observations of one data set, or use only a single subscript i if we do not deal with multivariate survival data.
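For concreteness, a doubly-interval-censored record can be stored as two pairs of interval limits. The sketch below builds one such record, mimicking the situation in Figure 2.1; the variable names and all numerical values are made up for illustration.

## One doubly-interval-censored observation (cf. Figure 2.1):
## the onset time U is left-censored at the first examination,
## the failure time V is interval-censored between two later examinations.
obs <- data.frame(
  uL = 0,    uU = 1.9,    # onset interval   |u^L, u^U|
  vL = 9.3,  vU = 10.5    # failure interval |v^L, v^U|
)
## The event time of interest is T = V - U, which is only known to lie
## between vL - uU and vU - uL:
with(obs, c(lower = vL - uU, upper = vU - uL))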
Figure 2.1: Doubly interval censoring. A scheme of a doubly-interval-censored observation obtained by performing examinations to check the event status at times s_{i,l,1}, …, s_{i,l,6}. The onset time is left-censored at time u^U_{i,l} = s_{i,l,1} (i.e. interval-censored in the interval ⌊u^L_{i,l}, u^U_{i,l}⌋ = ⌊0, s_{i,l,1}⌋), the failure time is interval-censored in the interval ⌊v^L_{i,l}, v^U_{i,l}⌋ = ⌊s_{i,l,5}, s_{i,l,6}⌋.
2.3 Density, survival, hazard and cumulative hazard functions
A continuous distribution of an event time T is uniquely determined by its density p(t). Equivalently, the distribution of T is determined by a nonincreasing right-continuous survival function S(t) defined as the probability that T exceeds a value t in its range, i.e.

$$S(t) = \Pr(T > t) = \int_t^{\infty} p(s)\, ds.$$

Another possibility is to specify the hazard function ℏ(t), which gives the instantaneous rate at which an event occurs for an item that is still at risk for the event at time t, i.e.

$$\hbar(t) = \lim_{\Delta t \to 0+} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \Pr\bigl(T \in N_t(dt) \mid T \ge t\bigr),$$

where N_t(dt) = [t, t + dt).

The density and the survival function can be computed from the hazard function using the following relationships:

$$p(t) = \hbar(t) \exp\{-H(t)\}, \qquad S(t) = \exp\{-H(t)\},$$

where $H(t) = \int_0^t \hbar(s)\, ds$ is the cumulative hazard function.
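As a concrete check of these relationships, the following sketch evaluates them numerically in R for a Weibull distribution, whose density and survival function are available in closed form; the particular shape and scale values are arbitrary.

## Numerical check of p(t) = h(t) exp{-H(t)} and S(t) = exp{-H(t)}
## for a Weibull(shape = 1.5, scale = 2) distribution.
shape <- 1.5; scale <- 2
t <- seq(0.1, 6, by = 0.1)

h <- function(t) (shape / scale) * (t / scale)^(shape - 1)   # hazard function
H <- function(t) (t / scale)^shape                           # cumulative hazard

max(abs(h(t) * exp(-H(t)) - dweibull(t, shape, scale)))                          # ~ 0
max(abs(exp(-H(t)) - pweibull(t, shape, scale, lower.tail = FALSE)))             # ~ 0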
2.4 Independent noninformative censoring and simplified likelihood
Throughout the thesis we will assume independent noninformative censoring in the terminology of Kalbfleisch and Prentice (2002). In this section, we explain this concept first in the framework of right-censored data and then extend it to the area of interval-censored data. Finally, we introduce the term simplified likelihood and remark that it can be used for the inference with censored data under the assumption of independent noninformative censoring.
2.4.1 Right-censored data
Kalbfleisch and Prentice (2002) introduce the concept of independent noninformative censoring in the context of right-censored data in the following
way. Let C denote the random variable causing the censoring. That is, instead of observing the event time T we only observe X = min(T, C) and
δ = I[T ≤ C].
Independent censoring
They call the censoring mechanism independent when the hazard which applies to the censored population is at each time point the same as the hazard which would apply had there been no censoring. That is, the hazard functions have to satisfy

$$\Pr\bigl(T \in N_t(dt) \mid C \ge t,\, T \ge t\bigr) = \Pr\bigl(T \in N_t(dt) \mid T \ge t\bigr) \qquad (2.1)$$
for any t > 0. Note that independence of random variables T and C implies
that the condition (2.1) is satisfied. However, T and C are not necessarily
independent when the condition (2.1) is fulfilled.
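A one-line sketch of the first claim: if T and C are independent then, since the event {T ∈ N_t(dt)} is contained in {T ≥ t}, the joint probabilities factor and

$$\Pr\bigl(T \in N_t(dt) \mid C \ge t,\, T \ge t\bigr) = \frac{\Pr\bigl(T \in N_t(dt)\bigr)\,\Pr(C \ge t)}{\Pr(T \ge t)\,\Pr(C \ge t)} = \Pr\bigl(T \in N_t(dt) \mid T \ge t\bigr),$$

which is exactly condition (2.1).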
Further, Kalbfleisch and MacKay (1979) proved that the condition (2.1) is equivalent to the so-called constant-sum condition

$$\Pr\bigl(\delta = 1 \mid T \in N_t(dt)\bigr) + \int_0^t \Pr\bigl(C \in N_x(dx),\, \delta = 0 \mid T \ge x\bigr) = 1 \qquad (2.2)$$

for any t > 0, introduced by Williams and Lagakos (1977). The term Pr(δ = 1 | T ∈ N_t(dt)) could be interpreted as the probability that a subject who would fail at time t is actually observed to fail, and the term Pr(C ∈ N_x(dx), δ = 0 | T ≥ x) has the meaning that a subject who survives at least x time units is censored at time x. To relate the condition (2.2) to its interval-censored version, which will be introduced in the following section, we rewrite it into the form

$$\Pr\bigl(\delta = 1 \mid T \in N_t(dt)\bigr) + \int_0^t \frac{\Pr\bigl(C \in N_x(dx),\, T \in [x, \infty),\, \delta = 0\bigr)}{\Pr\bigl(T \in [x, \infty)\bigr)} = 1. \qquad (2.3)$$
Noninformative censoring
Kalbfleisch and Prentice (2002) further call the censoring mechanism noninformative if the censoring random variable C does not depend on any parameters used to model the distribution of the event time T. In other words, under independent noninformative censoring, the censoring procedure or
rules may depend arbitrarily during the course of the study on:
• previous event times of other subjects in the study;
• previous censoring times of other subjects in the study;
• random mechanisms external to the study;
• values of covariates possibly included in the model;
but must not contain any information on the parameters used to model the
event time.
The independent noninformative censoring includes type I censoring. In this
case, censoring can only happen at a pre-planned calendar time. This censoring scheme has been used for the CGD data introduced in Section 1.2 and
for the EBCP data of Section 1.4.
2.4.2 Interval-censored data
Consider now the case of interval-censored data where the observed intervals are generated by a triplet (T^L, T^U, T)'. That is, we observe an interval ⌊t^L, t^U⌋ if T^L = t^L, T^U = t^U and T ∈ ⌊T^L, T^U⌋. Note that since the observed interval ⌊T^L, T^U⌋ must contain the event time T, the support of the random vector (T^L, T^U, T)' is equal to

$$\bigl\{(t^L, t^U, t) : 0 \le t^L \le t \le t^U \le \infty\bigr\}.$$

Oller, Gómez, and Calle (2004) show that the interval-censored counterpart of the constant-sum condition (2.3) is given by

$$\iint_{(t^L,\, t^U):\ t \in \lfloor t^L,\, t^U \rfloor} \frac{\Pr\bigl(T^L \in N_{t^L}(dt^L),\, T^U \in N_{t^U}(dt^U),\, T \in \lfloor t^L, t^U \rfloor\bigr)}{\Pr\bigl(T \in \lfloor t^L, t^U \rfloor\bigr)} = 1 \qquad (2.4)$$

for all t > 0. Further, they introduce the term noninformative condition and show that it is stronger than the constant-sum condition (2.4). It should be pointed out that Oller et al. use the term "noninformative" in a different context than Kalbfleisch and Prentice (2002), whose meaning of this word is adopted in this thesis.

In summary, we will call the interval censoring independent if it satisfies the constant-sum condition (2.4) and noninformative if the distribution of the censoring random variables T^L and T^U does not depend on the parameters used to model the distribution of the event time T.
A typical example of independent noninformative interval censoring can be found in the Signal Tandmobiel® data (Section 1.1) and in the WIHS data (Section 1.3). In both cases even the stronger condition of independence of T and (T^L, T^U)' is satisfied. Indeed, both the dental examinations and the check-ups of the AIDS status were performed at pre-planned time points and were thus external to the studied event time. Note that interval censoring would not be independent when the event induces an examination, e.g. when a child visits the dentist because of a decayed tooth.
2.4.3 Simplified likelihood for interval-censored data
We explain in Chapter 4 that the likelihood is the cornerstone for the inference on the event time T. Strictly speaking, with interval-censored data, the likelihood contribution is given by the density of the observables, i.e. by the density of the vector (T^L, T^U)' whose support is such that T ∈ ⌊T^L, T^U⌋ with probability one. That is, the likelihood contribution of the observed ⌊t^L, t^U⌋ is given by

$$L_{full} = \Pr\bigl(T^L \in N_{t^L}(dt^L),\, T^U \in N_{t^U}(dt^U),\, T \in \lfloor t^L, t^U \rfloor\bigr).$$

However, it is shown in Oller et al. (2004) that under the assumption of independent noninformative censoring, the likelihood contribution L_full is proportional to the so-called simplified likelihood contribution

$$L = \Pr\bigl(T \in \lfloor t^L, t^U \rfloor\bigr),$$

where the possible randomness of T^L and T^U is ignored. Consequently, the inference on the event time T can be based on this simplified likelihood. In the remainder of the thesis, we will use the simplified likelihood for the inference and omit the word 'simplified' for clarity.
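To make the simplified likelihood contribution concrete, the sketch below evaluates L = Pr(T ∈ ⌊t^L, t^U⌋) for one interval-censored and one right-censored observation under an assumed Weibull model for T; the parameter values are arbitrary.

## Simplified likelihood contribution L = Pr(T in |tL, tU|)
## under a Weibull(shape, scale) model for the event time T.
lik_contrib <- function(tL, tU, shape, scale) {
  if (is.finite(tU) && tU > tL) {        # interval-censored (left-censored if tL = 0)
    pweibull(tU, shape, scale) - pweibull(tL, shape, scale)
  } else if (!is.finite(tU)) {           # right-censored: Pr(T > tL)
    pweibull(tL, shape, scale, lower.tail = FALSE)
  } else {                               # exactly observed: density contribution
    dweibull(tL, shape, scale)
  }
}

lik_contrib(tL = 2, tU = 5,   shape = 1.5, scale = 4)   # interval-censored in (2, 5)
lik_contrib(tL = 6, tU = Inf, shape = 1.5, scale = 4)   # right-censored at 6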
Chapter 3
An Overview of Regression Models for Survival Data
Two regression models dominate survival analysis when describing the dependence of the distribution of the event time T on covariates, say x = (x_1, …, x_m)': (a) the proportional hazards (PH) model and (b) the accelerated failure time (AFT) model. In this chapter, we introduce these two models, compare them and show how they can be extended to handle multivariate survival data. We also review these models for the analysis of right-censored data, with an emphasis on the AFT model. For methods that allow interval- or doubly-interval-censored data we refer to Chapter 5.
3.1 Proportional hazards model
This model, introduced by Cox (1972), specifies that, for a given covariate
vector x, the hazard function is expressed as the product of an unspecified
baseline hazard function ℏ0 (t) and the exponential of a linear function of the
covariates, i.e.
$$\hbar(t \mid x) = \hbar_0(t) \exp(\beta' x). \qquad (3.1)$$
The regression parameter vector β is estimated by maximizing a partial likelihood (Cox, 1975) which treats ℏ0 as nuisance and does not estimate it.
However, when the baseline hazard ℏ0 is of interest as well, e.g. for prediction purposes, its non-parametric estimate can be obtained using the method of Breslow (1974). The survival function for a subject with covariates x, S(· | x), is related to the baseline survival function S0 by the relationship

$$S(t \mid x) = S_0(t)^{\exp(\beta' x)}.$$
An exhaustive treatment of the PH model and its extensions can be found,
e.g., in Therneau and Grambsch (2000) or Kalbfleisch and Prentice (2002,
Chapter 4). Software to fit the PH model by maximum partial likelihood, together with facilities to compute residuals, draw diagnostic plots or assess goodness of fit, is available in most modern statistical
packages, e.g. function coxph in R/S-plus or procedure PHREG in SAS.
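As an illustration of the software mentioned above, the sketch below fits a PH model with coxph on the lung cancer data shipped with the R survival package; the covariates are chosen purely for illustration.

library(survival)

## Proportional hazards model (3.1) fitted by maximum partial likelihood;
## 'lung' (right-censored survival of lung cancer patients) ships with the package.
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(fit)                       # exp(coef) gives the hazard ratios exp(beta)

## Breslow-type estimate of the baseline cumulative hazard:
bh <- basehaz(fit, centered = FALSE)
head(bh)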
3.2 Accelerated failure time model
The accelerated failure time model is a useful, however less frequently used
alternative to the PH model. For this model, the effect of a covariate implies
on average an acceleration or deceleration of the event time. For a vector of
covariates x the effect is expressed by the parameter vector β in the following
way:
T = exp(β ′ x) T0 ,
where T0 is a baseline survival time. On the logarithmic scale, this model
becomes a simple linear regression model
log(T ) = β′x + ε,    (3.2)

with ε = log(T0 ). The hazard and survival functions of a subject with covariate vector x are related to the baseline hazard (ℏ0 ) and survival function (S0 ) by the relationships

ℏ(t | x) = ℏ0 (exp(−β′x) t) exp(−β′x),    S(t | x) = S0 (exp(−β′x) t).    (3.3)
Usually one assumes that the error random variable ε has a density gε (ε) from a location-scale family, i.e. gε (ε) = τ^(−1) gε*{τ^(−1) (ε − α)}, where gε*(·) has location parameter 0 and scale parameter 1. The location parameter α and the scale parameter τ have to be estimated from the data as well as the regression parameter vector β.
A parametric AFT model assumes that gε∗ (·) is a density of a specific type
(e.g. Gaussian, logistic or Gumbel). In that case, the parameters α, τ and
β can easily be estimated using the method of maximum likelihood. However, the parametric assumptions evidently affect the shape and character
of the resulting survival or hazard curves, which, in the case of an incorrect specification, is undesirable, especially when prediction is of interest.
On the other hand, semi-parametric procedures for the AFT model leave
the density gε (ε) unspecified and provide only the estimate of the regression
parameter vector β. In the past, primarily two semi-parametric methods for
the AFT model with right-censored data have been examined. The first one
is based on a generalization of the least squares method to censored data, first proposed by Miller (1976) and in a different manner by Buckley and James (1979), who gave their names to this approach. A slight modification of the Buckley-James estimator and its asymptotic properties were given by Lai
and Ying (1991). However, a drawback of the Buckley-James method is that
it may fail to converge or may oscillate between several solutions.
The second approach is based on linear-rank-tests for censored data and was
developed by Prentice (1978), Gill (1980), and Louis (1981) in the case of
one covariate. Tsiatis (1990) extended the method to the multiple regression
context. The asymptotic equivalence of the Buckley-James method and the
linear-rank-test-based estimators has been pointed out by Ritov (1990). The
asymptotic properties of the linear-rank-test-based estimators were presented
in greatest generality by Ying (1993). In contrast to the partial likelihood method for the PH model, linear-rank-test-based estimation of the regression parameters of the AFT model can be computationally cumbersome. Only recently, Jin et al. (2003) suggested an algorithm
to compute this estimate using a linear programming technique. They also
provide an S-plus function. Further, there seems to exist no non-parametric
method to estimate the baseline survival distribution like the method of Breslow (1974) for the PH model. Consequently, the semi-parametric procedures
cannot be used when prediction is of interest.
Only parametric AFT models have been implemented in major statistical
packages (functions survreg in R and SurvReg in S-plus and procedure
LIFEREG in SAS).
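For illustration, a minimal sketch of a parametric AFT fit with the survreg function mentioned above follows; the data frame dat and its variables are hypothetical and the Weibull choice is only one possible parametric assumption.

library(survival)
## Parametric AFT model (3.2) with Gumbel error, i.e. Weibull event times
fit <- survreg(Surv(time, status) ~ x1 + x2, data = dat, dist = "weibull")
summary(fit)    # Intercept = alpha, covariate effects = beta, Scale = tau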
3.3 Accelerated failure time model versus proportional hazards model
Both the PH as well as the AFT model make an explicit assumption about
the effect of covariates on the hazard function. The effect of covariates on
the hazard function in the PH model is given by (3.1), in the AFT model
by (3.3). The assumed different effect of a covariate on the baseline hazard for
the PH and AFT model is exemplified in Figure 3.1. It is seen that, like in the
PH model, in the AFT model the effect of covariates on the baseline hazard
function is multiplicative, but additionally for the AFT model an acceleration
or deceleration of the time scale is seen. Also, in the AFT model the hazard
is increased for β < 0 whereas in the PH model for β > 0.
We point out (see Kalbfleisch and Prentice, 2002, Section 2.3.4) that the PH
model and the AFT model are equivalent if and only if the distribution of
the standardized error term ε∗ = τ −1 (ε − α) in the AFT model (3.2) is the
Gumbel (extreme value distribution of a minimum), i.e. when
gε*(ε*) = exp{ε* − exp(ε*)}.
In that case, the distribution of the baseline survival time T0 is Weibull and
the baseline hazard function ℏ0 (t) has the form
ℏ0 (t) = λγ (λt)^(γ−1),
where λ = exp(−α) and γ = τ^(−1).
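This correspondence can be verified numerically; the following base R sketch (with illustrative values of α and τ) compares the baseline survival function implied by the AFT representation with the Weibull form.

alpha <- 0.5; tau <- 0.4
lambda <- exp(-alpha); gamma <- 1/tau
t <- seq(0.1, 5, by = 0.1)
S.aft <- exp(-exp((log(t) - alpha)/tau))   # Pr(T0 > t) from log(T0) = alpha + tau*eps*, eps* Gumbel
S.wei <- exp(-(lambda*t)^gamma)            # Weibull survival with lambda = exp(-alpha), gamma = 1/tau
max(abs(S.aft - S.wei))                    # zero up to rounding error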
Further, it is not always possible (e.g. due to lack of knowledge) to include all relevant covariates in the model. One of the advantages of the AFT model is that the regression parameters of the included covariates do not change when other, important, covariates are omitted. Of course, the neglected covariates have an impact on the distribution of the error term ε in (3.2), which is typically changed into one with larger variability.
Such change, however, is of no major importance (except that it influences
the precision with which the regression parameters of the included covariates
are estimated) when semi-parametric methods or methods with a flexible
distribution for ε are used. Unfortunately, a similar property does not hold
for the PH model, see Hougaard (1999) for a more detailed discussion.
The fact that only parametric AFT models are implemented in major statistical packages, together with the computational difficulties associated with the semi-parametric AFT model, may explain why the PH model became far more popular in practice than the AFT model. See Nardi and Schemper (2003) for a comparison of the PH model and parametric AFT models. Nevertheless, the property that the AFT model postulates a direct relationship between failure time and covariates led Sir David Cox (see Reid, 1994) to remark that “accelerated life models are in many ways more appealing” than the proportional hazards model “because of their quite direct physical interpretation.” Indeed, in the AFT model, the regression indicates directly how the time, a quantity understandable also by non-statisticians, is increased or decreased, whereas in the PH model the direct effect of the regression is on the hazard, which might be more difficult for practitioners to understand.
3.4 Regression models for multivariate survival data
Both the PH model and the AFT model can be extended to handle multivariate survival data. In this section, we briefly discuss one extension of the
PH model and concentrate mainly on the multivariate versions of the AFT
model that will serve as a basis for developments presented in this thesis.
3.4.1 Frailty proportional hazards model
For multivariate survival data, a common extension of the PH model includes
a cluster specific random effect Zi , called the shared frailty, in the expression
of the hazard function, i.e.
ℏ(t | xi,l , Zi ) = ℏ0 (t) Zi exp(β ′ xi,l ).
(3.4)
The frailty component Zi is most often assumed to have a parametric distribution such as a gamma or log-normal distribution. For more details, we
refer to Aalen (1994), Hougaard (2000) and Therneau and Grambsch (2000)
where also available software is described.
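As a minimal sketch, the shared frailty model (3.4) with a gamma frailty can be fitted in R with the coxph function and a frailty term; the data frame dat and the variables time, status, treatment and center are hypothetical.

library(survival)
## Shared frailty PH model (3.4); 'center' identifies the clusters,
## the default frailty distribution in frailty() is gamma
fit <- coxph(Surv(time, status) ~ treatment + frailty(center), data = dat)
summary(fit)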
Figure 3.1: Effect of PH and AFT assumption on a hypothetical baseline hazard function ℏ(t) (solid line) for a univariate covariate x taking a value of 0.6 (dashed line) and 1.2 (dotted line) with regression parameter β = −0.5 for the PH model and β = 0.5 for the AFT model.

Nevertheless, the model (3.4) is rather simple, e.g., in the analysis of a multicenter clinical trial only the center effect and not the center by treatment interaction can be controlled for. This drawback led to further developments
mimicking the classical linear mixed model of Laird and Ware (1982) by
assuming
ℏ(t | xi,l , bi ) = ℏ0 (t) exp(β ′ xi,l + b′i z i,l ),
(3.5)
where z i,l = (zi,l,1 , . . . , zi,l,q )′ is an additional vector of covariates and bi
= (bi,1 , . . . , bi,q )′ is a cluster specific random effect which is again usually
assumed to follow a parametric distribution, most often multivariate normal.
Such model is considered, e.g., by Vaida and Xu (2000). Note that the
model (3.4) is a special case of (3.5) with z i,l ≡ 1 and Zi ≡ exp(bi ).
Besides the fact that the frailty PH model is, like the basic PH model, not robust against neglected covariates, it has another important drawback. Indeed, for most frailty distributions, the marginal hazard function obtained from (3.4) by integrating out Zi is no longer proportional with respect to the covariates xi,l . Moreover, the form in which the covariate vector
xi,l modifies the marginal baseline hazard function depends on the assumed
frailty distribution. Consequently, the estimates of the regression parameters
β can be highly sensitive towards the choice of the frailty distribution; see
Hougaard (2000, Chapter 7) for more details.
3.4.2 Population averaged accelerated failure time model
A natural extension of the basic AFT model allowing for multivariate data relaxes the assumption of i.i.d. error terms ε in the model expression (3.2) by assuming
log(Ti,l ) = β ′ xi,l + εi,l ,
i = 1, . . . , N, l = 1, . . . , ni ,
(3.6)
with εi = (εi,1 , . . . , εi,ni )′ , i = 1, . . . , N being independent random vectors, each with a multivariate density gε,i (εi ). Such model is often called
population-averaged (PA) or marginal. When all clusters are of the same
size, i.e. when ni = n for all i, it is usually assumed that the random error
vectors εi , i = 1, . . . , N are i.i.d. with a multivariate density gε (ε). The
main disadvantage of the PA model is that it is designed only to account for within-cluster dependencies; consequently, structured modelling of these dependencies is rather unnatural.
Early semi-parametric approaches to the population averaged AFT model
(3.6) with right-censored data are given by Lin and Wei (1992); Lee, Wei,
and Ying (1993) and are directed mainly towards the estimation of the regression parameter β. They use the following estimation strategy. In the
first step, they ignore the correlation and estimate the regression coefficient
β using one of the semi-parametric approaches for uncorrelated censored
data outlined in Section 3.2 (the Buckley-James estimator or censored data
linear-rank-test-based estimator). In the second step, they correct the standard errors of the estimate using a GEE approach (Liang and Zeger, 1986).
However, we can point out that ignoring the dependence in the estimation
step generally does not take full advantage of the information in the data
and is likely not to be efficient. For that reason, Pan and Kooperberg (1999)
suggest, in the case of bivariate survival data, i.e. ni = 2 for all i = 1, . . . , N ,
methods that account already in the estimation step for the within-cluster
correlation. Briefly, their method iterates between (a) estimating the joint
bivariate distribution of (εi,1 , εi,2 )′ using the bivariate log-spline density estimate of Kooperberg (1998), (b) multiple imputation (Wei and Tanner, 1991)
of censored observations, (c) estimating the regression parameter β using
either ordinary or generalized least squares. Note that this procedure can
be considered as a generalization of the basic Buckley-James estimator, for
which in step (a) the Kaplan-Meier estimator of the survival distribution is
used while ignoring the correlation and in step (b) a simple imputation using
conditional expectations is employed.
Finally, Pan and Connett (2001) present an approach that, to some extent,
combines the methods of Lee et al. (1993) and Pan and Kooperberg (1999).
It iterates between (a) estimating the marginal distribution of εi,l using the
Kaplan-Meier estimate while ignoring the dependencies, (b) multiple imputation of censored observations, (c) GEE estimation of the regression parameter
β using a general working correlation matrix.
3.4.3 Cluster specific accelerated failure time model
Another extension of the AFT model for multivariate data adds, similarly to the frailty PH model and analogously to the classical linear mixed model of Laird and Ware (1982), a cluster-specific random effect vector bi = (bi,1 , . . . , bi,q )′ combined with a vector of covariates z i,l = (zi,l,1 , . . . , zi,l,q )′ to the model expression, i.e.
log(Ti,l ) = β ′ xi,l + b′i z i,l + εi,l ,
i = 1, . . . , N, l = 1, . . . , ni .
(3.7)
The random effect vectors bi , i = 1, . . . , N are assumed to be i.i.d. with
some (multivariate) density gb (b), the random error terms εi,l , i = 1, . . . , N,
l = 1, . . . , ni are assumed to be i.i.d. with some density gε (ε) and independent of the random effects. Besides the term cluster-specific (CS), the model (3.7)
is sometimes called conditional, since the distribution of the event time Ti,l
is modelled conditionally, given the cluster specific characteristic bi .
In the literature, Pan and Louis (2000) and Pan and Connett (2001) consider
model (3.7) with a univariate random effect bi and zi,l ≡ 1 for all i and l.
The estimation procedure iterates between (a) estimating the distribution of
independent error terms εi,l using the Kaplan-Meier estimator, (b) multiple
imputation of censored times, (c) a Monte Carlo EM algorithm of Wei and
Tanner (1990) in Pan and Louis (2000) or restricted maximum likelihood in
Pan and Connett (2001) to estimate the regression parameter β.
Observe that in contrast to the frailty PH model, in the cluster-specific AFT
model the meaning of the regression parameters β is the same conditionally
given bi as well as marginally. Indeed, when the random effects bi , i =
1, . . . , N are integrated out from model (3.7), we obtain the model (3.6) with
the only change in the error distribution which is given as an appropriate
convolution.
3.4.4 Population averaged model versus cluster specific model
Compared to the PA model, the CS model not only allows for structured modelling of within-cluster dependencies but is also often preferred due to its clear decomposition of the sources of variability and a more natural interpretation of the regression parameters; see Lindsey and Lambert (1998) and Lee and Nelder (2004) for more details.
However, the PA model is more general in the following sense. The CS model is specified hierarchically and always implies a particular PA model when the random effects are integrated out. On the other hand, the same PA model can correspond to several, very different CS models. Moreover, under the most common assumptions, i.e. when the error terms εi,l , i = 1, . . . , N, l = 1, . . . , ni in the CS model are assumed to be i.i.d., the random effects bi , i = 1, . . . , N in the CS model i.i.d. and independent of the errors, and the error term vectors εi , i = 1, . . . , N in the PA model i.i.d., the
PA model leads to a more general covariance structure than the CS model. To
illustrate this, consider the CS model (3.7) with a random intercept only, i.e.
zi,l ≡ 1. Let var(εi,l ) = σε2 and var(bi ) = σb2 , i = 1, . . . , N . Such model implies
a covariance matrix for the log-event times vector (log(Ti,1 ), . . . , log(Ti,ni ))′ which is of the compound symmetry type, i.e.

var{(log(Ti,1 ), . . . , log(Ti,ni ))′} =
  [ σε² + σb²      σb²      · · ·      σb²     ]
  [     σb²     σε² + σb²   · · ·      σb²     ]
  [     ···        ···       ···       ···     ]
  [     σb²        σb²      · · ·   σε² + σb²  ]
That is, the variance is necessarily the same for all observations within a cluster and the correlation is the same for all pairs of observations within a cluster. On the other hand, with the PA model (3.6) both the variance and the correlation are allowed to vary within the cluster since usually an unstructured covariance matrix is assumed for the error term vector εi and consequently also for the log-event times vector.
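The compound symmetry structure implied by the random-intercept CS model is easily constructed explicitly, as in the following base R sketch (parameter values are illustrative).

## Covariance matrix of (log T_{i,1}, ..., log T_{i,n_i})' implied by model (3.7)
## with a random intercept: sigma2.e = var(eps), sigma2.b = var(b_i)
compound.symmetry <- function(n, sigma2.e, sigma2.b)
  sigma2.e * diag(n) + sigma2.b * matrix(1, n, n)
compound.symmetry(n = 3, sigma2.e = 1, sigma2.b = 0.5)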
Chapter 4
Frequentist and Bayesian Inference
Both PH and AFT models determine a probabilistic mechanism that leads
to survival data. The mechanism depends further on a vector of unknown
parameters, denoted by θ, which represents the relevant information we wish
to pick up from the observed data. For example, for the AFT model (3.2),
the θ vector is equal to (α, β ′ , τ )′ and the probabilistic mechanism is given
by equation (3.2) together with the specification of the density of the error
term ε. The assumed probabilistic mechanism together with the observed
data determines the likelihood function, L(θ), which is the corner stone to
draw the inference about the unknown parameter vector θ.
Two major paradigms exist in statistics for how to use the likelihood to draw inference about θ, namely the frequentist and the Bayesian
paradigms. In the classical frequentist point of view, the data are assumed
to be a random sample generated by the random mechanism controlled by
θ, which is unknown but fixed. Several methods exist to estimate the true
value of the parameter θ, maximum likelihood (ML) being one of the most popular ones. The estimator, θ̂, maximizes the likelihood function over a set Θ of admissible θ values – the parameter space. Hypotheses about the parameter vector θ can be tested and the accuracy of the estimates can be assessed by calculating confidence intervals. See, e.g., Cox and Hinkley (1974,
Chapter 9) or Lehmann and Casella (1998, Chapter 6) for more details on
ML estimation.
In Bayesian statistics, both the data and the parameter vector θ are treated
as random variables. Besides the probabilistic model to generate the data,
a prior distribution p(θ) must be specified for the model parameters. Inference is then based on the posterior distribution p(θ | data) of the parameters
given the data which is calculated using Bayes’ rule:
p(θ | data) = L(θ) p(θ) / ∫_Θ L(θ*) p(θ*) dθ* ∝ L(θ) p(θ).    (4.1)
As point estimate of θ, the posterior expectation, median or mode can be
used. The uncertainty about the model parameters can be expressed using
credible intervals constructed using the quantiles of the posterior distribution
(see Section 4.6 for more details). For an extensive introduction into the area
of Bayesian statistics, see, e.g., Carlin and Louis (2000); Gelman et al. (2004).
4.1 Likelihood for interval-censored data
We saw that the likelihood plays a principal role in drawing inference about
unknown model parameters. In this section, we discuss the general form of the
likelihood, first for univariate interval-censored and doubly-interval-censored
data. The multivariate case will be discussed in the following section.
In this section, let Ti , i = 1, . . . , N be a set of independent event times each
with a density pi (t; θ). For instance, for the AFT model (3.2) the density pi (t; θ) is given by

pi (t; θ) = (τ t)^(−1) gε*{τ^(−1) (log t − α − β′xi )},
where xi is a covariate vector for the ith observation.
4.1.1 Interval-censored data
Let ⌊t_i^L , t_i^U ⌋ be the observed intervals and δi the corresponding censoring indicators, with the same convention as in Section 2.1. Let the corresponding survival functions be denoted by Si (t; θ). The likelihood L(θ) is then the product of the individual likelihood contributions Li (θ), i.e. L(θ) = ∏_{i=1}^{N} Li (θ), where

Li (θ) = ∫_{t_i^L}^{∞} pi (s; θ) ds = Si (t_i^L ; θ)                          if δi = 0,
Li (θ) = pi (ti ; θ)                                                          if δi = 1,
Li (θ) = ∫_{0}^{t_i^U} pi (s; θ) ds = 1 − Si (t_i^U ; θ)                      if δi = 2,
Li (θ) = ∫_{t_i^L}^{t_i^U} pi (s; θ) ds = Si (t_i^L ; θ) − Si (t_i^U ; θ)     if δi = 3.
This can be briefly written as

Li (θ) = I_{t_i^L}^{t_i^U} pi (s; θ) ds    (4.2)

if we make use of the notation

I_{τ^L}^{τ^U} p(s) ds = ∫_{τ^L}^{τ^U} p(s) ds    if τ^L < τ^U ,
I_{τ^L}^{τ^U} p(s) ds = p(τ^L ) = p(τ^U )        if τ^L = τ^U ,    (4.3)
i.e. the integral disappears whenever the event time is exactly observed.
Note that already for simple interval-censored data, the likelihood involves
integration of the density.
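As an illustration of (4.2), the following base R sketch evaluates one likelihood contribution for a log-normal AFT model, i.e. with gε* the standard normal density; the interval endpoints, the coding of δi as in Section 2.1 and the parameter values are purely illustrative.

## Likelihood contribution (4.2) under a log-normal AFT model
Li <- function(tL, tU, delta, alpha, beta, tau, x) {
  mu <- alpha + sum(beta * x)                           # location of log(T)
  S  <- function(t) 1 - plnorm(t, meanlog = mu, sdlog = tau)
  switch(as.character(delta),
         "0" = S(tL),                                   # right-censored
         "1" = dlnorm(tL, meanlog = mu, sdlog = tau),   # exactly observed (tL = tU)
         "2" = 1 - S(tU),                               # left-censored
         "3" = S(tL) - S(tU))                           # interval-censored
}
Li(tL = 2.1, tU = 3.4, delta = 3, alpha = 1, beta = 0.5, tau = 0.3, x = 1)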
4.1.2 Doubly-interval-censored data

Let ⌊u_i^L , u_i^U ⌋, i = 1, . . . , N be the observed intervals for the onset time Ui and ⌊v_i^L , v_i^U ⌋ the observed intervals for the failure time Vi in the sense of Section 2.2. It is tempting to transform the observations into single intervals of the form ⌊t_i^L , t_i^U ⌋ = ⌊v_i^L − u_i^U , v_i^U − u_i^L ⌋ and then to use methods for simple interval-censored data with the likelihood (4.2). However, as pointed out by De Gruttola and Lagakos (1989), this approach would only be valid if the onset time Ui is uniformly distributed and independent of the event time Ti .
To write a likelihood contribution of each observation in the general case
a bivariate density of an event and onset times must be considered. Let
qi (t, u; θ) be a density of the random vector (Ti , Ui )′ , i = 1, . . . , N . The likelihood contribution of the ith observation is then given by a double integral
of the form
Li (θ) = I_{u_i^L}^{u_i^U} { I_{v_i^L − u}^{v_i^U − u} qi (t, u; θ) dt } du.    (4.4)
Note that whenever the onset time Ui and/or the failure time Vi are exactly observed, one or both integrals in formula (4.4) disappear.
In most practical situations it can be assumed that, given the parameter
vector θ, the onset and the event time are independent, i.e.
qi (t, u; θ) = pi (t; θ) p_i^U (u; θ).    (4.5)
In the rest of this thesis we shall make use of assumption (4.5). The likelihood
contribution of the ith subject can then be rewritten as
Li (θ) = I_{u_i^L}^{u_i^U} { I_{v_i^L − u}^{v_i^U − u} pi (t; θ) dt } p_i^U (u; θ) du.    (4.6)
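The double integral in (4.6) can be evaluated numerically; the sketch below uses the base R function integrate and assumes, purely for illustration, log-normal densities for both the onset time and the time-to-event.

## Likelihood contribution (4.6) by numerical integration (illustrative parameters)
Li.dic <- function(uL, uU, vL, vU, mu.u, sd.u, mu.t, sd.t) {
  inner <- function(u)                      # integral over t for a fixed onset u
    plnorm(vU - u, mu.t, sd.t) - plnorm(pmax(vL - u, 0), mu.t, sd.t)
  integrate(function(u) inner(u) * dlnorm(u, mu.u, sd.u),
            lower = uL, upper = uU)$value
}
Li.dic(uL = 1, uU = 2, vL = 4, vU = 6, mu.u = 0.2, sd.u = 0.3, mu.t = 1, sd.t = 0.4)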
4.2 Likelihood for multivariate (doubly) interval-censored data
In the case of multivariate event times Ti,l , i = 1, . . . , N , l = 1, . . . , ni , observed as intervals ⌊t_{i,l}^L , t_{i,l}^U ⌋, the likelihood contribution of the ith cluster equals

Li (θ) = I_{t_{i,1}^L}^{t_{i,1}^U} · · · I_{t_{i,ni}^L}^{t_{i,ni}^U} pi (t1 , . . . , tni ; θ) dtni · · · dt1 ,    (4.7)
where pi (t1 , . . . , tni ; θ) is the density of (Ti,1 , . . . , Ti,ni )′ implied by the assumed model.
When the population averaged AFT model introduced in Section 3.4.2 is assumed, pi (t1 , . . . , tni ; θ) equals

pi (t1 , . . . , tni ; θ) = gε,i {log(t1 ) − β′xi,1 , . . . , log(tni ) − β′xi,ni } / (t1 · · · tni ).    (4.8)
In the case of the cluster-specific AFT model described in Section 3.4.3, the
density pi (t1 , . . . , tni ; θ) becomes
pi (t1 , . . . , tni ; θ) = ∫_{R^q} ∏_{l=1}^{ni} [ gε {log(tl ) − β′xi,l − b′i z i,l } / tl ] gb (bi ) dbi .    (4.9)
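The q-dimensional integral in (4.9) can also be approximated by Monte Carlo integration over the random effects; a minimal sketch for a random-intercept model (q = 1, z i,l ≡ 1) with normal random effects and log-normal errors follows, with all numerical values illustrative.

## Monte Carlo approximation of the cluster density (4.9), random intercept only
p.cluster <- function(t, x, beta, tau, sd.b, M = 10000) {
  b <- rnorm(M, mean = 0, sd = sd.b)                    # draws from g_b
  contrib <- sapply(b, function(bi)                     # product over the cluster members
    prod(dlnorm(t, meanlog = drop(x %*% beta) + bi, sdlog = tau)))
  mean(contrib)                                         # averages the random effect out
}
p.cluster(t = c(2.5, 3.1), x = cbind(c(1, 1)), beta = 0.8, tau = 0.3, sd.b = 0.5)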
For doubly-interval-censored data, under assumption (4.5), the likelihood
contribution of the ith cluster is obtained by an appropriate multivariate
modification of the expression (4.6).
4.3 Bayesian data augmentation
The computation of the likelihood for interval- and doubly-interval-censored
data is rather involved. The complexity even increases when multivariate survival data are introduced. Indeed, the maximum likelihood method involves
multivariate integration combined with the optimization of the likelihood
which becomes quickly intractable even for simple models.
On the other hand, in Bayesian statistics, where the unknown parameter
vector θ is assumed to be random and its posterior distribution p(θ | data) is
used for inference, we are completely free to augment the vector of unknowns
by arbitrary auxiliary variables, let say ψ. Inference can then equally be
based on the joint posterior distribution p(θ, ψ | data). Indeed, all (marginal)
posterior characteristics of θ (mean, median, credible intervals) are the same
regardless whether they are computed from p(θ | data) or p(θ, ψ | data) since
p(θ | data) = ∫ p(θ, ψ | data) dψ.
In the case of censored data, matters simplify considerably if the unknown
true event times ti are explicitly considered to be part of the vector
of unknowns, i.e. ψ = (ti : i = 1, . . . , N, ti is censored)′ . Assume now
that all observations are censored. In this situation, it is obvious that ψ
(uncensored augmented data) conveys more precise information about the
model parameter θ than the censored data which implies
p(θ | ψ, data) = p(θ | ψ).
The joint posterior distribution of θ and ψ then equals
p(θ, ψ | data) = p(θ | ψ, data) p(ψ | data) = p(θ | ψ) p(ψ | data).
(4.10)
The two terms on the right hand side of formula (4.10) are now easily computed. Indeed, p(θ | ψ) is the posterior distribution of θ if the uncensored
data were available, i.e.
p(θ | ψ) ∝ Laugm (θ) p(θ),
where the likelihood Laugm of the uncensored augmented data is simply
Laugm (θ) = ∏_{i=1}^{N} pi (ti ; θ).
The second term of the right hand side of formula (4.10), p(ψ | data), is under
the assumption of independent noninformative censoring proportional to the
product of indicator functions:
p(ψ | data) ∝ ∏_{i=1}^{N} I(ti ∈ ⌊t_i^L , t_i^U ⌋).
A similar procedure can be applied for doubly-censored data. In that case,
both the true onset times ui and the true event times ti , i = 1, . . . , N , are augmented into the vector of unknowns. The situation where only part of the data is censored is analogous, only with some change in notation. Finally, in the
case of multivariate survival data and cluster specific models, the integrals of
the form (4.9) can easily be avoided by augmenting the vector of unknowns
by the values of the random effects bi , i = 1, . . . , N .
The idea of data augmentation was first introduced in the context of the EM
algorithm (Dempster, Laird, and Rubin, 1977) and formalized in the context
of Bayesian computation by Tanner and Wong (1987). For more complex
models with censored data, this technique constitutes a highly appealing
alternative to difficult maximum likelihood estimation. Moreover, it is quite
natural to include the true event times or the values of latent random effects
in the set of unknowns. For these reasons, most of the models developed in
this thesis make use of the Bayesian estimation with augmented true event
times.
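To fix ideas, the following minimal sketch implements the data augmentation scheme for the simplest possible case: an interval-censored log-normal model with an intercept α only, a known scale τ and a normal prior on α. All data and parameter values are illustrative; the models used later in the thesis are of course considerably richer.

## Data augmentation for interval-censored data (log-normal model, intercept only)
set.seed(1)
tL <- c(1.2, 0.8, 2.0); tU <- c(1.9, 1.5, 3.5)   # observed intervals (illustrative)
tau <- 0.5; m0 <- 0; s0 <- 10                     # known scale, N(m0, s0^2) prior on alpha
N <- length(tL); M <- 5000
alpha <- 0; keep <- numeric(M)
for (m in 1:M) {
  ## (1) augment: draw log event times from the normal truncated to [log tL, log tU]
  lo <- pnorm((log(tL) - alpha) / tau)
  up <- pnorm((log(tU) - alpha) / tau)
  y  <- alpha + tau * qnorm(runif(N, lo, up))
  ## (2) update alpha from its conjugate normal full conditional given augmented data
  prec  <- N / tau^2 + 1 / s0^2
  alpha <- rnorm(1, mean = (sum(y) / tau^2 + m0 / s0^2) / prec, sd = sqrt(1 / prec))
  keep[m] <- alpha
}
mean(keep[-(1:1000)])    # posterior mean of alpha after burn-in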
4.4 Hierarchical specification of the model
In Bayesian statistics, the prior distribution p(θ) and the model assumed to
generate the data, represented by the likelihood L(θ) = p(data | θ), are usually specified in a hierarchical manner. Firstly, remember that the parameter
vector θ contains not only the parameters in a classical sense but also all
remaining latent factors like random effects or augmented times. Crudely,
the vector θ can usually be split into two parts θ = (ψ′ , φ′ )′ where ψ
refers to the latent factors and φ to the parameters in a classical sense. The
specification of the Bayesian model then proceeds in the following steps:
1. Data Model step specifies the likelihood function
L(θ) = p(data | θ) = p(data | ψ, φ)
and is actually equivalent to the frequentist specification of the model.
2. Latent Process Model step specifies
p(ψ | φ),
i.e. the distribution of the latent factors, possibly given the classical
parameters φ.
3. Parameter Model (Prior) specifies the prior distribution for the classical parameters φ, i.e. it specifies
p(φ).
Often, the components of φ are assumed to be a priori independent and
if no external information is available are assigned vague but proper
prior distributions.
The overall prior distribution is then given by
p(θ) ∝ p(ψ | φ) × p(φ),
and the posterior distribution is obtained using the relationship (4.1) as
p(θ | data) ∝ L(θ) × p(θ)
∝ p(data | ψ, φ) × p(ψ | φ) × p(φ),
(4.11)
i.e. it is proportional to the product of the distributions specified in the above
three steps.
The hierarchical structure of more complex models is usually best expressed using so-called directed acyclic graphs (DAGs), where each model quantity is represented by a node, drawn as a circle for unknowns and as a square box for observed or fixed quantities (data, covariates).
Solid arrows are used to represent stochastic dependencies and dashed arrows
deterministic dependencies between the nodes. A simple DAG which only
distinguishes among the data, latent quantities ψ and classical parameters φ
and which corresponds to the expression (4.11) is shown in Figure 4.1.
Further, it is assumed that, given its parents, each node is conditionally independent of all its grandparents, i.e. schematically
p(child | parents, grandparents) = p(child | parents).
The posterior distribution of the hierarchical model is then proportional, analogously to the relationship (4.11), to the product of all conditional distributions of the type p(child | parents) times the product of the prior distributions
for the nodes of the first generation (i.e. having no parents).
Figure 4.1: Directed acyclic graph – general scheme (nodes: classical parameters φ, latent quantities ψ, and the data).

Illustration 4.1. Linear mixed model. As an illustration, consider a classical normal linear mixed model with data = {y 1 , . . . , y N } being a realization of independent random vectors Y i , i = 1, . . . , N , each of length n, which in
a frequentist sense can be specified as
Y i = Xi β + Zi bi + εi ,    i = 1, . . . , N ,
bi ∼ Nq (0, D)  i.i.d.,    εi ∼ Nn (0, Σ)  i.i.d.,
where Xi , Zi , i = 1, . . . , N are fixed covariate matrices. For the sake of the Bayesian modelling, the vector θ = (ψ′ , φ′ )′ is given by
ψ = (b′1 , . . . , b′N )′ ,    φ = (β′ , vec(D)′ , vec(Σ)′ )′ .
The whole model can be represented by the DAG shown in Figure 4.2. The
above mentioned three steps in the model building proceed as follows. The Data Model is given by a normal likelihood

L(θ) = p(data | θ) = p(data | ψ, φ) = ∏_{i=1}^{N} ϕn (y i | Xi β + Zi bi , Σ).
The Latent Process Model is determined by the normal distribution of the
random effects, i.e.
p(ψ | φ) = ∏_{i=1}^{N} ϕq (bi | 0, D).
Finally, some prior distributions p(β), p(D), p(Σ) are assigned to the parameters of the main interest, i.e. to β, D, Σ and
p(φ) = p(β) × p(D) × p(Σ).
Figure 4.2: Directed acyclic graph for the linear mixed model (nodes y i , bi , β, D, Σ and covariate matrices Xi , Zi , i = 1, . . . , N ).
4.5 Markov chain Monte Carlo
In previous sections, we stated that the inference in the Bayesian approach
is based on the posterior distribution p(θ | data) which is obtained using the
Bayes’ formula (4.1) and is proportional to the product of the likelihood and
the prior distribution. We also saw that difficult likelihood evaluations can be
avoided by the introduction of a set of suitable auxiliary variables (augmented
data). What needs to be discussed is how the posterior distribution can be
computed and how to determine posterior summaries about θ. Most quantities related to the posterior summarization (posterior moments, quantiles,
highest posterior density regions etc.) involve computation of the posterior
expectation of some function G(θ), i.e. computation of
E{G(θ) | data} = ∫_Θ G(θ) p(θ | data) dθ = ( ∫_Θ G(θ) L(θ) p(θ) dθ ) / ( ∫_Θ L(θ) p(θ) dθ ).    (4.12)
The integration in the expression (4.12) is usually high-dimensional and only
rarely analytically tractable in realistic practical situations.
Markov chain Monte Carlo (MCMC) methods avoid the explicit evaluations
of integrals. Instead, we construct a Markov chain with state space Θ whose
stationary distribution is equal to p(θ | data). After a sufficient number of
burn-in iterations the current draws follow the stationary distribution, i.e.
the posterior distribution of interest. We keep a sample of θ values, say θ^(1) , . . . , θ^(M ) , and approximate the posterior expectation (4.12) by

G_M = (1/M) Σ_{m=1}^{M} G(θ^(m) ).    (4.13)
The ergodic theorem implies that, under mild conditions, G_M converges almost surely to E{G(θ) | data} as M → ∞ (see, e.g., Billingsley, 1995, Section 24).
Many methods are available to construct Markov chains with the desired properties. The most often used are the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970) and the Gibbs algorithm (Geman and Geman, 1984; Gelfand and Smith, 1990). Both of them, often suitably tailored, will be used extensively throughout this thesis. A comprehensive introduction to the area of MCMC can be found, e.g., in Geyer (1992);
Tierney (1994); Besag et al. (1995). More details can be obtained from several
books, e.g., Gilks, Richardson, and Spiegelhalter (1996); Gamerman (1997);
Chen, Shao, and Ibrahim (2000); Robert and Casella (2004).
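For completeness, a minimal random-walk Metropolis sketch is shown below; the target is here an illustrative standard normal "posterior", but any function returning the unnormalized log-posterior log{L(θ) p(θ)} could be plugged in.

## Random-walk Metropolis targeting p(theta | data) (illustrative univariate target)
log.post <- function(theta) dnorm(theta, mean = 0, sd = 1, log = TRUE)
M <- 10000; theta <- 0; draws <- numeric(M)
for (m in 1:M) {
  prop <- theta + rnorm(1, sd = 0.8)                              # random-walk proposal
  if (log(runif(1)) < log.post(prop) - log.post(theta)) theta <- prop
  draws[m] <- theta
}
mean(draws[-(1:1000)])   # ergodic average (4.13) of G(theta) = theta after burn-in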
4.6 Credible regions and Bayesian p-values
With a frequentist approach, confidence intervals or regions and p-values
are used to summarize the estimates and the inference for θ – parameter of
interest. In Bayesian statistics, the role of the confidence regions is played
by the credible regions and p-values are replaced by the Bayesian p-values.
In this section, we briefly discuss their construction.
4.6.1 Credible regions
For a given α ∈ (0, 1), the 100(1 − α)% credible region Θα for a parameter
of interest θ is defined using the conditional distribution θ | data (posterior
distribution of θ) as
Pr(θ ∈ Θα | data) = 1 − α.    (4.14)
Equal-tail credible interval
Suppose first that the parameter of interest θ is univariate. The credible region Θα can then be obtained by setting Θα = (θα^L , θα^U ) such that

Pr(θ ≤ θα^L | data) = Pr(θ ≥ θα^U | data) = α/2.
Such an interval is easily constructed when a sample from the posterior distribution of θ (obtained, e.g., using the MCMC technique) is available. Indeed,
θα^L and θα^U are the 100(α/2)% and 100(1 − α/2)% quantiles, respectively, of the posterior distribution θ | data, and from the MCMC output they can be estimated using the sample quantiles.
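In R, this amounts to a single call to the quantile function applied to the (post burn-in) MCMC sample, as in the following sketch with illustrative draws.

## Equal-tail 95% credible interval from a posterior sample (illustrative draws)
theta.sample <- rnorm(10000, mean = 1.3, sd = 0.4)
quantile(theta.sample, probs = c(0.025, 0.975))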
Simultaneous credible bands
For the case the parameter of interest, θ = (θ1 , . . . , θq ), is multivariate and
we wish to calculate simultaneous probability statements, Besag et al. (1995,
p. 30) suggest to compute simultaneous credible bands. In that case, Θα
equals
Θα = (θ_{1,αuni}^L , θ_{1,αuni}^U ) × · · · × (θ_{q,αuni}^L , θ_{q,αuni}^U ).    (4.15)
That is, Θα is given as a product of univariate equal-tail credible intervals of
the same univariate level αuni (typically αuni ≥ α). As shown by Besag et al.
(1995), the simultaneous credible bands can easily be computed when the
sample from the posterior distribution is available as only order statistics for
each univariate sample are needed. From the computational point of view,
the most intensive part in computation of the simultaneous credible band is to
sort the univariate samples. However, when simultaneous credible bands for
different values of α are required this must be done only once. This property
is advantageously used when computing the simultaneous Bayesian p-values
(see Section 4.6.2).
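A minimal base R sketch of this construction is given below; it follows the order-statistics algorithm of Besag et al. (1995) applied to an M × q matrix of posterior draws and assumes continuously distributed draws (no ties).

## Simultaneous credible band of Besag et al. (1995) from an M x q sample matrix S
sim.cred.band <- function(S, alpha = 0.05) {
  M <- nrow(S)
  R <- apply(S, 2, rank)                         # componentwise ranks
  s <- apply(pmax(R, M + 1 - R), 1, max)         # extremeness of each draw
  t.star <- sort(s)[ceiling((1 - alpha) * M)]    # smallest band covering (1-alpha)M draws
  ord <- apply(S, 2, sort)
  rbind(lower = ord[M + 1 - t.star, ], upper = ord[t.star, ])
}
S <- matrix(rnorm(10000 * 3, mean = c(0, 1, 2)), ncol = 3, byrow = TRUE)  # illustrative sample
sim.cred.band(S, alpha = 0.05)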
As pointed out by Held (2004), due to the fact that the simultaneous credible band
is by construction restricted to be hyperrectangular, it can cover a huge area
actually not supported by the posterior distribution. Obviously, this problem
becomes more severe when a high posterior correlation exists between the
components of the vector θ.
Highest posterior density region
An alternative to the credible intervals and simultaneous credible bands is
given by the highest posterior density (HPD) region. In that case, Θα is
obtained by requiring (4.14) and additionally
p(θ1 | data) > p(θ2 | data)    for all θ1 ∈ Θα , θ2 ∉ Θα .
Note that in the univariate case and for unimodal posterior densities p(θ | data),
the HPD region becomes an interval. However, it is clear that in contrast to
the equal-tail credible interval or the simultaneous credible band the computation of the HPD region is much more complicated even when the sample
from the posterior distribution is already available.
4.6.2 Bayesian p-values
The Bayesian counterpart of the p-value for the hypothesis H0 : θ = θ 0
(typically θ 0 is a vector of zeros) – the Bayesian p-value – can be defined as
1 minus the content of the credible region which just covers θ 0 , i.e.
p = 1 − min{α : θ0 ∈ Θα }.    (4.16)
In the univariate case, a two-sided Bayesian p-value based on the equal-tail
credible interval is computed quite easily once the sample from the posterior
distribution is available since (4.16) can be expressed as
p = 2 min{Pr(θ ≤ θ0 | data), Pr(θ ≥ θ0 | data)},    (4.17)
and Pr(θ ≤ θ0 | data), Pr(θ ≥ θ0 | data) can be estimated as a proportion of
the sample being higher or lower, respectively, than the point of interest θ0 .
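From an MCMC sample, (4.17) is therefore computed directly as in the following sketch (illustrative draws, θ0 = 0).

## Two-sided Bayesian p-value (4.17) based on the equal-tail credible interval
bayes.p <- function(theta.sample, theta0 = 0)
  2 * min(mean(theta.sample <= theta0), mean(theta.sample >= theta0))
bayes.p(rnorm(10000, mean = 1.3, sd = 0.4))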
In the multivariate case, a two-sided simultaneous Bayesian p-value based on
the simultaneous credible band can be obtained by calculating the simultaneous credible bands Θα on various levels α and determining the smallest level,
such that θ 0 ∈ Θα , i.e. by direct usage of the expression (4.16).
To compute the Bayesian p-value based on the HPD region, the expression
(4.16) takes the form
p = Pr({θ : p(θ | data) ≤ p(θ0 | data)} | data).    (4.18)
An MCMC estimate of (4.18) can easily be obtained when p(θ | data) (any
proportionality constants may be ignored) can efficiently be evaluated. Often,
this is not the case however. Nevertheless, a technique how to overcome the
problem of unknown or difficult to evaluate p(θ | data) using its estimate
based on Rao-Blackwellization is given by Held (2004).
Mainly for computational reasons, we report in this thesis, if not stated otherwise, equal-tail credible intervals and corresponding Bayesian p-values of the type (4.17) in the univariate case, and simultaneous credible regions (4.15) and corresponding simultaneous Bayesian p-values computed using an iterative procedure to evaluate (4.16) in the multivariate case.
Chapter 5
An Overview of Methods for Interval-Censored Data
For right-censored data, a variety of methods (non-, semi- and fully parametric) have been developed. Further, commercial software is available to
support these techniques. In contrast, for interval-censored data and multivariate (doubly-)interval-censored data, commercial software is much more limited and, apart from user-written programs, only parametric approaches seem to be available for regression models. Further, until recently only few methods were available. That is why, in practice, modelling with interval-censored data is often mimicked by methods developed for right-censored data. For this, the interval needs to be replaced by an exact time or
right-censored time. The most common assumption is that the event occurred
at the midpoint of the interval. However, applying methods for right-censored
data on these artificial fixed points can lead to biased and misleading results
and the correctness of such approach depends strongly on the underlying distribution of the event times, see e.g., Rücker and Messerer (1988); Law and
Brookmeyer (1992); Odell, Anderson, and D’Agostino (1992); Dorey, Little,
and Schenker (1993).
In Section 5.1, we first review appropriate frequentist methods to deal with
(doubly-)interval-censored data and link them to the corresponding (classical)
method for right-censored data. We start with the estimation of the survival
distribution, proceed to the two-sample tests for the survival distributions,
continue with the proportional hazards and accelerated failure time models
and end up with the remark on the problem of interval-censored covariates.
Whenever feasible, we mention computational aspects of described methods
applicable for R, Splus and SAS.
With suitable semi-parametric approaches, both PH and AFT models can be
used not only for the estimation of the effect of covariates but also for the estimation of the baseline survival distribution or the comparison of two or more samples. With the Bayesian approach, it is moreover relatively easy to set up and estimate the models for multivariate (doubly-)interval-censored data. We will illustrate this on the analysis of the Signal Tandmobiel® data using a semi-parametric Bayesian PH model in Section 5.2. As we are interested mainly in the AFT model, we also give an overview of available Bayesian developments for this model in Section 5.3. We end this chapter by highlighting
our motivations for the further developments presented in this thesis.
5.1 Frequentist methods

5.1.1 Estimation of the survival function
In the case of simple i.i.d. survival data, often the aim is to estimate the
survival function. When only categorical covariates are involved, the survival
function can be estimated for each unique combination of covariate values
and could be used to check the fitted regression model.
For right-censored data, the classical non-parametric maximum-likelihood
estimate (NPMLE) of the survival function is given by Kaplan and Meier
(1958). For interval-censored data Peto (1973) first proposed the NPMLE
and used the constrained Newton-Raphson method to compute it. Nowadays, the NPMLE of the survival function based on interval-censored data is known as Turnbull's estimate, after Turnbull (1976), who suggested a so-called iterative self-consistency algorithm, which is, in fact, an EM-like
(Dempster et al., 1977) algorithm. An improved version of the maximization
algorithm which utilizes standard convex optimization technique was given
by Gentleman and Geyer (1994) who also discussed the unicity of the estimate. For computation, a valuable alternative, the iterative convex minorant
algorithm, was suggested by Groeneboom and Wellner (1992). Finally, strong
consistency of the Turnbull’s estimate has been proved under rather general
assumptions by Yu, Li, and Wong (2000). The asymptotic distributional
behaviour of the Turnbull’s estimator for some special cases has been established by Yu et al. (1998) and Huang (1999). An extension of the NPMLE
of the survival function for bivariate interval-censored data is discussed, e.g.,
by Bogaerts and Lesaffre (2004).
Several numerical algorithms to compute the non-parametric estimate of the survival function for interval-censored data are implemented in Vandal's
and Gentleman’s R package Icens downloadable from the Comprehensive R
Archive Network (CRAN) or in the S-plus function kaplanMeier.
A valuable alternative to non-parametric procedures is obtained by smoothing the survival function or, equivalently, the density or the hazard function. In most practical situations, it can be assumed that the event times are continuously distributed, and smoothing then yields more realistic, non-step-wise estimates. One such method, applicable directly also to interval-censored data, is given
by Kooperberg and Stone (1992) who smooth the density using splines. They
also provide software in the form of the R package logspline downloadable
from CRAN or the S-plus library splinelib downloadable from StatLib.
Splines for smoothing the hazard function are exploited in the approach of Rosenberg (1995).
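As a minimal sketch of how the NPMLE can be obtained with the Icens package mentioned above, assuming its EMICM function accepts a two-column matrix of interval endpoints (the exact interface should be checked in the package documentation); the data below are purely illustrative.

library(Icens)
## Illustrative interval-censored data: intervals (L, R) with R = Inf for right-censored cases
A <- cbind(L = c(5.5, 6.0, 7.2, 8.1),
           R = c(6.3, 7.1, 8.0, Inf))
fit <- EMICM(A)    # hybrid EM / iterative convex minorant algorithm (assumed interface)
plot(fit)          # step-wise NPMLE (Turnbull-type estimate)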
Illustration 5.1. Signal Tandmobiel® study. As an illustration, we computed both the non-parametric estimate of Turnbull (1976) and the smooth estimate of Kooperberg and Stone (1992) of the cumulative distribution function (cdf) for the emergence of the right mandibular permanent first premolar, separately for boys and girls, based on the Signal Tandmobiel® data introduced in Section 1.1. The cdf, giving the proportion of children with the emerged tooth, is called in this context the emergence curve and is
preferred in this situation to the survival curve. The estimates are plotted in Figure 5.1.

Figure 5.1: Signal Tandmobiel® study: Cumulative distribution functions of emergence (proportion emerged versus age in years, tooth 44) for the right mandibular permanent first premolar, separately for girls and boys. Non-parametric estimate of Turnbull (solid line), smooth estimate of Kooperberg and Stone (dashed line).
Figure 5.1. Due to rather high sample size in each group (more than 2 000),
the non-parametric estimate is almost the same as the smooth estimate, especially for boys. From the plots it is seen that the emergence for girls is somewhat earlier compared to boys.
Doubly-interval-censored data
Non-parametric estimation of the survival curve based on doubly-interval-censored data was first considered by De Gruttola and Lagakos (1989) who
make use of discretization of data and generalization of the self-consistency
algorithm of Turnbull (1976). The authors estimate simultaneously the onset
and the event distributions by treating them as bivariate data. However, they
point out that the large number of parameters resulting from discretization,
especially if time is grouped too coarsely may cause identifiability problems.
This gave rise to several two-step approaches. First, the distribution of the onset time is separately estimated and second, the estimated onset distribution
is used as an input for estimation of the distribution of the event time. Bacchetti (1990); Bacchetti and Jewell (1991) assume piece-wise constant hazard
and use penalized maximum-likelihood method to estimate the levels of the
hazard on each interval. The roughness penalty in the likelihood prevents the
method from identifiability problems reported by De Gruttola and Lagakos
(1989). The original proposal of De Gruttola and Lagakos (1989) motivates
the two-step approaches of Gómez and Lagakos (1994); Sun (1995). Finally,
Gómez and Calle (1999) present an extension of the technique of Gómez and
Lagakos (1994) which does not require discretization of the data.
5.1.2 Comparison of two survival distributions
If the data can be divided in two (or more) groups, e.g. boys and girls, one
could compare the distributions of the event times in these two groups. For
right-censored data, many non-parametric tests for comparing two survival
curves are available, e.g. the log-rank test (Mantel, 1966), the Gehan generalization of the Wilcoxon test (Gehan, 1965), the Peto-Prentice generalization
of the Wilcoxon test (Peto and Peto, 1972; Prentice, 1978) and the weighted
Kaplan-Meier statistic of Pepe and Fleming (1989, 1991) which with unit
weights is equal to the difference of means of the two survival distributions.
The Gehan-Wilcoxon test has been adapted to interval-censored data by
Mantel (1967), while the interval-censored version of the Peto-Prentice-Wilcoxon test is presented by Self and Grossman (1986). The log-rank test for
interval-censored data is given by Finkelstein (1986). Further, Petroni and
Wolfe (1994) discuss the weighted Kaplan-Meier statistic in the context of
interval-censoring. The performance of above mentioned two-sample tests
for interval-censored data is in detail studied and compared by Pan (1999a).
Furthermore, Fay (1996, 1999) derived a general class of linear-rank tests
for interval-censored data which covers, as special cases, the Wilcoxon-based
tests. Finally, Fay and Shih (1998) present a class of tests called distribution permutation tests which besides the Wilcoxon-based tests covers also
an improved version of the weighted Kaplan-Meier test. Splus programs to
perform some distribution permutation tests are given by Gómez, Calle, and
Oller (2004, Section 4.4) and can be downloaded from
http://www-eio.upc.es/grass.
Regrettably, the asymptotic properties of the above methods assume the
grouped continuous model, which implies that the status of each subject is
checked at the same timepoints (in the study time scale) whose number is
fixed or that observed intervals are grouped in such a way. For example,
for the Signal Tandmobiel® study this would mean that the emergence status of the teeth was checked at prespecified ages, the same for all children.
Obviously, such setting is too restrictive in many practical situations. For
instance, in the above example, each child was checked by a dentist-examiner
on a prespecified day of the year, irrespective of his or her age.
The grouped continuous model assumption is necessary to be able to apply
the standard maximum likelihood theory to interval-censored data measured
on a continuous scale without making any parametric assumptions. Only
recently, Fang, Sun, and Lee (2002) developed a test statistic, based on the
weighted Kaplan-Meier statistic of Pepe and Fleming (1989) that does not require the grouped continuous model assumption. Finally, Pan (2000b) offers
two-sample test procedures obtained by combining standard right-censored
tests and multiple imputation which, in contrast to the single (e.g. midpoint) imputation mentioned at the beginning of this chapter, allows the statistical inference to be drawn appropriately.
Illustration 5.2. Signal Tandmobiel® study. The emergence curves of the right mandibular permanent first premolar for boys and girls shown in Figure 5.1 were compared using the Wilcoxon-based tests, the log-rank test and Fay's and Shih's version of the difference-in-means test. Not surprisingly, for all these
tests, the p-value is practically equal to zero. The values of the test statistics, their mean and variance under the null hypothesis and the standardized
value, which can asymptotically be compared to the quantile of the standard
Gaussian distribution, are shown in Table 5.1.
Table 5.1: Signal Tandmobiel® study: Two-sample tests comparing the emergence of the permanent right mandibular first premolar (tooth 44) for boys and girls.

Test                        Test Statistic    Mean        Variance          Standardized Test Statistic
Gehan-Wilcoxon              554 812           0           2 865 333 000     10.365
Peto-Prentice-Wilcoxon      140.607           −37.634     284.255           10.572
Log-rank                    212.316           −53.663     675.251           10.236
Difference in means         264.095           −76.486     1 102.340         10.258

5.1.3 Proportional hazards model
To extend the PH model to interval-censored data, basically four types of
approaches can be found in the literature. Firstly, the baseline hazard ℏ0 can
be parametrically specified and standard maximum likelihood theory applied
to estimate all the parameters. However, the parametric assumptions can
cause bias in the inference if incorrectly specified and, especially with heavily censored data, it is generally difficult to assess them.
The second class of methods makes use of a combination of multiple imputation (see Rubin, 1987; Wei and Tanner, 1991) and methods for right-censored
data represented by works of Satten (1996); Satten, Datta, and Williamson
(1998); Goggins et al. (1998); Pan (2000a). A disadvantage of these methods
is, however, that they are highly computationally demanding and the fact
that the procedures they use to impute missing data have a relatively ad hoc
nature.
The third approach, suggested by Finkelstein (1986), Pan (1999b), and Goetghebeur and Ryan (2000) resembles most the original method of Cox (1972)
combined with that of Breslow (1974). Indeed, in all three papers the baseline
hazard ℏ0 is estimated non-parametrically on top of estimating the regression coefficients. Whereas the method of Finkelstein relies on the grouped
data assumption, Goetghebeur and Ryan developed an EM-type procedure
that relaxes that assumption. Moreover, the approach of Goetghebeur and
Ryan seems to be the only one that reduces to a standard Cox model when
interval-censoring reduces to right-censoring. Finally, the approach of Pan
extends the iterative convex minorant method mentioned in Section 5.1.1 into
the context of the PH model. His approach is also implemented as R package
intcox.
Finally, methods that smoothly estimate ℏ0 are a trade-off between parametric modelling that allows for a straightforward maximum likelihood estimation of the parameters and semi-parametric models with a completely
unspecified baseline hazard ℏ0 . Kooperberg and Clarkson (1997) suggest to
use regression splines to express the logarithm of ℏ0 , while Joly et al. (1998)
employ monotone splines (Ramsay, 1988) directly for the baseline hazard ℏ0 .
Betensky et al. (1999) use local likelihood smoothing to model the baseline
hazard, firstly without covariates. Extension of their method into the regression setting is given by Betensky et al. (2002). Recently, Cai and Betensky
(2003) propose to use penalized linear spline for the baseline hazard function.
A nice feature of these methods is that predictive survival and hazard curves
are directly available and moreover, they are smooth rather than step-wise
as in the case of semi-parametric estimation. The software for the approach
of Kooperberg and Clarkson (1997) is included in the previously mentioned
R package logspline or S-plus library splinelib.
Doubly-interval-censored data
One of the first approaches to the PH model with doubly-interval-censored
data is given by Kim, De Gruttola, and Lagakos (1993) who, under the assumption of the grouped data, directly generalize the one-sample results of
De Gruttola and Lagakos (1989). However, their method is highly computationally intensive. For the situation when only the onset time is interval-censored while the failure time is right-censored or exactly observed,
alternatives are offered by Goggins, Finkelstein, and Zaslavsky (1999); Sun,
Liao, and Pagano (1999); Pan (2001).
5.1.4 Accelerated failure time model
A parametric AFT model estimated using the maximum likelihood method
can be used with interval-censored data as well. It is also implemented in major statistical packages (functions survreg in R and SurvReg in S-plus, procedure LIFEREG in SAS). On the other hand, semi-parametric methods, which are not straightforward even with right-censored data, are extended to interval-censored data only with considerable difficulties, see Rabinowitz, Tsiatis, and Aragon (1995); Betensky, Rabinowitz, and Tsiatis (2001). Moreover, both approaches are practically applicable only with low-dimensional covariate vectors x and, as for right-censored data, there exists no non-parametric method to estimate the baseline survival distribution, implying
that the semi-parametric procedures cannot be used when prediction is of
interest.
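For illustration, a parametric AFT model for interval-censored data can be fitted with survreg using the 'interval2' coding of the Surv function, where NA in the lower or upper endpoint denotes left- or right-censoring; the data frame dat and its variables tL, tU and gender are hypothetical.

library(survival)
## Parametric AFT model for interval-censored data (hypothetical data frame 'dat')
fit <- survreg(Surv(tL, tU, type = "interval2") ~ gender,
               data = dat, dist = "loglogistic")
summary(fit)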
More promising alternatives are the methods that make use of multiple imputation and/or smoothing. Indeed, approaches of Pan and Kooperberg (1999);
Pan and Louis (2000); Pan and Connett (2001) introduced in Sections 3.4.2
and 3.4.3 could relatively easily be extended to handle also (multivariate)
interval-censored or even doubly-interval-censored data. However, it can be
computationally demanding, especially with doubly-interval-censored data,
to perform integration of the form (4.4) in the optimization of the likelihood.
5.1.5 Interval-censored covariates
Up to now, we have concentrated on the problem of an interval-censored response. In the regression context, it is however possible in practice that we have to face the problem of an interval-censored covariate. Such a problem is considered,
for example, by Gómez, Espinal, and Lagakos (2003) who studied, in the
framework of an HIV/AIDS clinical trial, the association between waiting
time between indinavir failure and enrolment (covariate) and subsequent viral
load (response).
However, we will not consider problems of this type in this thesis. Recent
developments in this field can be found, e.g., in Topp and Gómez (2004);
Langohr, Gómez, and Muga (2004); Calle and Gómez (2005).
5.2 Bayesian proportional hazards model: An illustration
For an extensive overview of the Bayesian methods for the proportional hazards model we refer the reader to the book of Ibrahim, Chen, and Sinha
(2001). Here, only one analysis based on the PH model will be presented, namely the analysis of doubly-interval-censored data from the Signal Tandmobiel® study published as Komárek et al. (2005). Actually, the main purpose of this section is
to illustrate typical features of a Bayesian analysis and show how it can be
used to answer rather complex questions.
In Section 5.2.1, we formulate the research question and outline the problems
related to this question. Section 5.2.2 presents a frequentist Cox’s PH regression model using midpoints of the observed intervals as if they were exact
observations, to compare our Bayesian approach to a more commonly used,
however incorrect, approach. In Section 5.2.3, the Bayesian model suggested
by Härkänen, Virtanen, and Arjas (2000) and modified for our purposes is
explained and results are presented in Section 5.2.4. We finalize this part by
a discussion.
5.2.1 Signal Tandmobiel® study: Research question and related data characteristics
In this section we will tackle the following research question: Does fluoride-intake at a young age have a protective effect on caries in permanent teeth?
Our analyses will be limited to caries experience of the four permanent first
molars (teeth number 16, 26, 36, 46 in Figure 1.1).
The data suggest that the use of fluoride reduces caries experience in primary teeth, see Vanobbergen et al. (2001) and that fluoride-intake delays the
emergence of the permanent teeth, see Leroy et al. (2003a). The latter result
raises the question whether the fluoride-intake only reduces the time at risk
or whether it has also a direct protective effect on caries experience.
Unfortunately, fluoride-intake in children cannot be measured accurately. Indeed, fluoride-intake can come from: (1) fluoride supplements (systemic),
(2) accidental ingestion of toothpaste or (3) tap water. Further, the intake
from these sources can be recorded only crudely. Therefore it was decided
to measure fluoride-intake by the degree of fluorosis on some reference teeth.
Fluorosis is the most common side-effect of fluoride-intake and appears as
white spots on the enamel of teeth. For this analysis, a child was considered
fluoride-positive (covariate fluor = 1) if there were white spots on at least two
permanent maxillary incisors during the fourth year of the study or during
both the fifth and sixth year of the study.
The prevalence of fluorosis was relatively low (480 children, 10.8%). In our
analysis, 480 fluorosis children and 960 randomly selected fluorosis-free children are included. Case-control subsampling was done to reduce computation
time. To check that it did not destroy the stratification, we constructed a
5 × 3 × 2 contingency table with factors province, school system and whether
the child is in the subsample or not (subsample). A classical p-value of 0.13
was obtained for the significance of the interaction of the third factor with
the other two using a likelihood-ratio test in a log-linear model, implying that
the stratification is similar in the used and the discarded subsamples.
The prevalence of caries experience at the age of 12 was negligible (at most
1.4%) for all permanent teeth except for the first molars (teeth used in the
analysis). For these teeth the prevalence was 25.8% in children with fluorosis
compared to 29.4% in fluorosis-free children, with prevalence of 23.3% and
27.7% for boys, and 27.9% and 31.2% for girls, respectively. Thus, at first
sight the impact of fluoride-intake seems to be minor. However, since the
emergence of permanent teeth might be delayed by fluoride-intake, evaluating
the impact of fluoride-intake should take into account the time at risk for
caries. Hence, in our analysis the response will be the time between emergence
and the onset of caries development. Remember that both tooth emergence
and onset of caries development are interval-censored, implying a doubly-interval-censored response. See Figure 2.1 for a graphical illustration of a
possible evolution of a particular tooth.
At the onset of the study about 86% of the permanent first molars had already
emerged. The severity of this censoring will affect the efficiency with which
the effect of fluoride-intake can be estimated. We tried two strategies to
improve the efficiency of our estimation procedure. Firstly, we included in
our analysis the emergence times of teeth 14, 24, 34, 44, 12, 22, 33, 43 all
of which had emerged in more than 60% of cases during the course of the
study. By incorporating information on these teeth and using the association
between teeth of the same subject (via the concept of “the birth time of
dentition”, see next section), it was attempted to better estimate the true
emergence time of the permanent first molars. Secondly, emergence times
from a Finnish longitudinal data set (Virtanen, 2001), involving 235 boys
and 223 girls born in 1980–1981 with follow-up from 6 to 18 years, were
added to our Flemish data. For these Finnish data almost all 28 permanent
teeth emerged during the study period.
Our research question is not uncommon in dentistry, but cannot be addressed
within any classical statistical package. For our analysis, we have used
the software package BITE (Härkänen, 2003), based on a semi-parametric
Bayesian survival model developed by Härkänen et al. (2000).
5.2.2 Proportional hazards modelling using midpoints
A standard frequentist Cox’s PH model introduced in Section 3.1 could be
applied, replacing interval-censored observations by the midpoints of the observed intervals and treating the resulting data as right-censored observations.
In this way, we analyzed time to caries development for the four permanent
first molars. For our analysis, the left-censored emergence times were first
assumed to be interval-censored with a lower limit for emergence of 5 years,
which is practically the youngest age for the emergence of these teeth (Nanda,
1960). Possible dependencies between the four teeth of the same child can be
taken into account, for example by inclusion of a gamma–frailty component
in the PH model as explained in Section 3.4.1.
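The following is a minimal sketch, not the code used for the analyses in this thesis, of how the naive midpoint conversion can be carried out; the column names and example values are hypothetical. The resulting (time, event) pairs would then be passed to any standard right-censored Cox PH routine, possibly with a gamma frailty per child.

```python
# A minimal sketch (not the thesis code) of the naive midpoint approach:
# interval-censored observations (low, upp) are replaced by their midpoints
# and then treated as exact, right-censored event times.  Column names and
# the example values are hypothetical.
import numpy as np
import pandas as pd

obs = pd.DataFrame({
    "low": [5.0, 7.1, 8.3],       # lower limits; left-censored times get the 5-year bound
    "upp": [7.0, 8.2, np.inf],    # upper limits; np.inf marks a right-censored observation
})

right_censored = np.isinf(obs["upp"])
obs["time"] = np.where(right_censored, obs["low"], (obs["low"] + obs["upp"]) / 2.0)
obs["event"] = (~right_censored).astype(int)   # 1 = event observed, 0 = right-censored

print(obs)
```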
Based on preliminary Bayesian modelling, we do not distinguish between
opposite teeth in the same jaw and assume so called horizontal symmetry.
However, we do make a distinction between maxillary (upper) and mandibular (lower) teeth and also between teeth in different positions (of a quadrant)
in the mouth.
Table 5.2: Signal Tandmobiel® study. Naive PH models for the effect of fluorosis on caries on permanent first molars. Hazard ratios (95% confidence intervals (CI)) between a fluorosis and a fluorosis-free group of children while controlling for gender and jaw.

                      Model WITHOUT frailties         Model WITH frailties
  Group               Estimate   95% CI               Estimate   95% CI
  Boys, maxilla       0.787      (0.541, 1.032)       0.704      (0.204, 1.204)
  Boys, mandible      0.733      (0.532, 0.934)       0.613      (0.231, 0.995)
  Girls, maxilla      0.871      (0.698, 1.044)       0.892      (0.610, 1.174)
  Girls, mandible     0.812      (0.670, 0.953)       0.776      (0.559, 0.993)
For comparison purposes, we fit the same PH model as the one shown in Section 5.2.3, where it is analyzed by Bayesian methods. Hence, the hazard for the time to caries of the lth tooth of the ith child depends on the tooth position, fluor and the gender of the child (0 = boy, 1 = girl). More specifically:
\[
  h(t \mid \mathrm{tooth}_l, \mathrm{gender}_i, \mathrm{fluor}_i) = h_0(t) \cdot Z_i \cdot \exp(\boldsymbol{\beta}'\boldsymbol{x}_{i,l}), \qquad i = 1, \dots, N,\; l = 16, 26, 36, 46, \tag{5.1}
\]
where h_0(t) is an unspecified baseline hazard function, β = (β_1, ..., β_5)′, and x_{i,l} = (fluor_i, gender_i, tooth_l, fluor_i × gender_i, fluor_i × tooth_l)′. The covariate "tooth" is a dummy variable that distinguishes teeth on different positions in the mouth (apart from horizontal symmetry). The term Z_i is either one, corresponding to a model without frailties, or a gamma distributed frailty term.
Estimates of hazard ratios between the fluorosis and fluorosis-free group controlling for gender and jaw are shown in Table 5.2. As seen, incorrectly
ignoring dependencies between the responses of one child by using a model
without frailties artificially decreases the size of the confidence interval. Although both models conclude that the effect of fluorosis on the development
of caries on the permanent first molars is at the borderline of 5% significance (Table 5.2), the results are not reliable. As pointed out on page 39, the appropriateness of midpoint imputation depends strongly on the underlying distribution of the event times. For that reason, a more sophisticated analysis
is needed.
5.2.3 The Bayesian survival model for doubly-interval-censored data
The non-parametric Bayesian intensity model of Härkänen et al. (2000) provides a flexible tool for analyzing multivariate survival data. Further, a software package written in C, called BITE and downloadable from
http://www.rni.helsinki.fi/~tth
together with scripts used to perform all analyses presented here, makes the
analysis feasible in practice.
Model for emergence
Let Ui,l be the (unknown) age at which tooth l of child i emerged. The hazard
for emergence of tooth l of the ith child at time t is
\[
  \lambda^{(e)}_{i,l}(t) = h^{(e)}(t - \eta_i \mid \mathrm{tooth}_l, \mathrm{gender}_i) \times I[\eta_i < t \le U_{i,l}]. \tag{5.2}
\]
The dependence between emergence times of one child is accounted for by
using a subject-specific variable ηi called birth time of dentition. This is
a latent variable which represents the common time marking the onset of
the tooth eruption process and hereby “explains” the positive correlation
between eruption times Ui,l within a subject. Note that ηi is always less than
the first emergence time of the permanent teeth. The intensity of emergence
for a particular child is zero before that time, expressed by the indicator
I[η_i < t ≤ U_{i,l}]. The hazard function h^{(e)}(· | tooth_l, gender_i) is defined as piece-wise constant for estimation purposes.
Model for caries experience
Let Vi,l be the age at which the lth tooth of child i developed caries. The
hazard for the caries process is given by
\[
  \lambda^{(c)}_{i,l}(t) = Z_i \times h^{(c)}(t - U_{i,l} \mid \mathrm{tooth}_l, \mathrm{gender}_i, \mathrm{fluor}_i) \times I[U_{i,l} < t \le V_{i,l}], \tag{5.3}
\]
where the variable Z_i is an unknown subject-specific frailty coefficient modulating the hazard function. Again, we assume in (5.3) that h^{(c)} is piece-wise constant. We call the difference V_{i,l} − U_{i,l} the time-to-caries.
The covariate "fluor" will be used in two ways. Firstly, for each combination of values of fluor, gender and tooth, a piece-wise constant hazard function is specified and fitted. Secondly, the term h^{(c)}(· | tooth_l, gender_i, fluor_i) in (5.3) is replaced by h_0^{(c)}(·) × exp(β′ x_{i,l}), with β and x_{i,l} being the same as in (5.1), thus assuming a PH model for caries experience whilst retaining a piece-wise constant baseline hazard function h_0^{(c)}(·).
Remarks
Our statistical model will involve the above two measurement models. Hence
the possible dependencies among times of interest are taken into account
by involving two types of subject-specific parameters, ηi and Zi . The first
subject-specific parameter ηi is included in the model for the emergence and
will shift the hazard function in time, whereas the frailty Zi recognizes that
the teeth of one child can be more sensitive to caries than the corresponding
teeth of another child, reflecting different dietary behavior, brushing habits,
etc.
Priors for baseline hazard functions
In BITE the working assumption is that hazard functions are piece-wise constant. Further, for the emergence hazard functions h^{(e)}(· | tooth_l, gender_i) the first level of the piece-wise constant function and the increment levels are assigned gamma prior distributions. This will ensure a priori an increasing hazard function for emergence. In the case of caries experience, the first level of the piece-wise constant hazard function h^{(c)}(· | tooth_l, gender_i, fluor_i) in the non-parametric model and h_0^{(c)}(·) in the PH model, say h_0, is assigned a gamma prior distribution. Further, the level h_m of the mth interval has, conditional on the previous levels h_0, ..., h_{m−1}, a Gamma(α, α/h_{m−1}) prior distribution. This gives a priori E[h_m | h_{m−1}, ..., h_0] = h_{m−1} and assures that there is no built-in prior assumption of a trend in the hazard rate. Finally, the prior for the jump points of each piece-wise constant function is a homogeneous Poisson process, as suggested by Arjas and Gasbarra (1994). Because the jump points are assumed to be random and not fixed, the posterior predictive hazard functions will be smooth, rather than piece-wise constant.
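As a small illustration, the sketch below draws one hazard function from the prior just described. It is not BITE code, and the rate of the jump-point process, the value of α and the Gamma(1, 1) prior for the first level are hypothetical choices made only for this illustration.

```python
# A small sketch of one draw from the prior on a piece-wise constant hazard
# described above: the first level is gamma distributed and each subsequent
# level h_m given h_{m-1} is Gamma(alpha, rate = alpha / h_{m-1}), so that
# E[h_m | h_{m-1}] = h_{m-1}.  Jump points come from a homogeneous Poisson
# process.  All numerical values below are hypothetical illustrations.
import numpy as np

rng = np.random.default_rng(1)
t_max, poisson_rate, alpha = 12.0, 1.0, 2.0

# jump points of the piece-wise constant function on (0, t_max)
n_jumps = rng.poisson(poisson_rate * t_max)
jumps = np.sort(rng.uniform(0.0, t_max, size=n_jumps))

# hazard levels: h_0 ~ Gamma(1, 1); h_m | h_{m-1} ~ Gamma(alpha, rate alpha/h_{m-1})
levels = [rng.gamma(shape=1.0, scale=1.0)]
for _ in range(n_jumps):
    prev = levels[-1]
    levels.append(rng.gamma(shape=alpha, scale=prev / alpha))  # mean = prev

def hazard(t):
    """Evaluate the sampled piece-wise constant hazard at time t."""
    return levels[np.searchsorted(jumps, t)]

print([round(hazard(t), 3) for t in (0.5, 3.0, 7.5, 11.0)])
```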
Priors for the random effect terms
The prior distribution for the birth time of dentition ηi illustrates how we
have combined the Flemish data and the Finnish data, and how the timing
of emergence of the Finnish data is included in our analysis. We assume that
the shapes of the emergence hazard functions for Finland and Flanders are the same, but we allow for a shift in emergence times by assuming different
52
CHAPTER 5. METHODS FOR INTERVAL-CENSORED DATA
means for the birth time of dentition in the two countries. More precisely,
the prior distribution of ηi is assumed normal N (ξ0 , τ −2 ) for a Finnish child
and normal N (ξ1 , τ −2 ) for a Flemish child.
The Bayesian approach allows us to include the dentist’s knowledge on the
problem at hand by assigning to the parameters ξ0 and ξ1 independent normal
prior distributions with mean 5.2 years and standard deviation 1 year. Both
the normal distribution as well as the choice of the prior means and standard
deviation of the hyperparameters ξ0 and ξ1 are motivated by the results found
in the literature on the earliest emergence of permanent teeth, see Nanda
(1960) or more recently Parner et al. (2001). This reflects the dentist’s belief
that permanent teeth on average emerge slightly after 5 years of age. The
parameter τ 2 is assigned a Gamma(2, 2) prior distribution.
The individual frailties Z_i in the model for caries are a priori assumed to be, conditionally on the hyper-parameter φ, independent and identically gamma distributed with both shape and inverse scale equal to that hyper-parameter.
The hyper-parameter itself is then given a Gamma(2, 2) prior distribution.
Sensitivity of the results with respect to the choice of parameters for priors
of hyperparameters ξ0 , ξ1 , τ and φ will be discussed in Section 5.2.4.
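To make the hierarchy concrete, the following sketch draws the birth times of dentition and the frailties from the priors just described. The numbers of children correspond to the Finnish and Flemish samples mentioned above, but the sketch is purely illustrative and is not part of the BITE analysis.

```python
# A brief sketch of sampling from the priors for the random-effect terms
# described above: xi_0, xi_1 ~ N(5.2, 1^2), the precision tau^2 ~ Gamma(2, 2),
# eta_i ~ N(xi, 1/tau^2), phi ~ Gamma(2, 2) and Z_i | phi ~ Gamma(phi, phi).
import numpy as np

rng = np.random.default_rng(2)
n_finnish, n_flemish = 458, 1440

xi0, xi1 = rng.normal(5.2, 1.0, size=2)          # country-specific means of eta_i
tau2 = rng.gamma(shape=2.0, scale=1.0 / 2.0)     # precision of eta_i, Gamma(2, rate 2)
eta_fin = rng.normal(xi0, 1.0 / np.sqrt(tau2), size=n_finnish)
eta_fla = rng.normal(xi1, 1.0 / np.sqrt(tau2), size=n_flemish)

phi = rng.gamma(shape=2.0, scale=1.0 / 2.0)      # frailty hyper-parameter, Gamma(2, rate 2)
Z = rng.gamma(shape=phi, scale=1.0 / phi, size=n_flemish)  # frailties with E(Z_i) = 1

print(round(eta_fin.mean(), 2), round(eta_fla.mean(), 2), round(Z.mean(), 2))
```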
Treatment of censored data
Left- and interval-censoring are treated by Bayesian data augmentation introduced in Section 4.3. Additionally, the left-censored emergence times of all
teeth are changed into interval-censored emergence times with a lower limit
equal to 4 years, implying that less internal information is used here than
previously with the frequentist PH model where the limit was 5 years. In
the case that both emergence and caries development were observed within one observational interval, we force the sampled MCMC values to satisfy V_{i,l} > U_{i,l}.
Bayes inference on model components
The posterior distributions based on the model with prior assumptions described in the previous paragraphs are minor modifications of those derived in
Härkänen et al. (2000). Our Bayesian model is complex and requires the use
of Markov Chain Monte Carlo sampling techniques outlined in Section 4.5.
The software package BITE (Härkänen, 2003), based on the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970), was used to sample from the posterior distributions. Further, BITE employs the reversible jump approach of Green (1995) to sample piece-wise constant hazard functions. We carried out two runs, each with 20 000 iterations of burn-in
followed by 14 000 iterations with a 1:4 thinning to obtain a sample from the
posterior distribution. We used the Gelman and Rubin (1992) test to check
for convergence.
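For readers unfamiliar with this diagnostic, the sketch below computes the Gelman and Rubin (1992) potential scale reduction factor for a single parameter from several chains. It is a generic illustration, not the convergence check implemented in BITE, and the toy draws are hypothetical.

```python
# A minimal sketch (not BITE code) of the Gelman and Rubin (1992) convergence
# diagnostic: the potential scale reduction factor computed from several
# chains of posterior draws for a single parameter.
import numpy as np

def gelman_rubin(chains: np.ndarray) -> float:
    """chains: array of shape (m, n) with m chains of n draws each."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)              # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    var_hat = (n - 1) / n * W + B / n            # pooled posterior variance estimate
    return float(np.sqrt(var_hat / W))           # values close to 1 indicate convergence

# toy example with two well-mixed chains
rng = np.random.default_rng(3)
draws = rng.normal(0.0, 1.0, size=(2, 14000 // 4))  # e.g. two thinned runs
print(round(gelman_rubin(draws), 3))
```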
5.2.4 Results
A non-parametric model with Flemish and Finnish data
To evaluate the effect of fluoride-intake on the development of caries on the permanent first molars we have calculated the posterior expectations of the hazard ratios
\[
  \frac{h^{(c)}(t \mid \mathrm{tooth}, \mathrm{gender}, \text{fluorosis})}{h^{(c)}(t \mid \mathrm{tooth}, \mathrm{gender}, \text{fluorosis-free})}.
\]
These hazard ratios, together with their 95% equal-tail point-wise credible intervals, can be found in Figure 5.2. The PH assumption with respect to the covariate fluor seems to be satisfied since the credible intervals in all cases cover a horizontal line.
[Figure 5.2: four panels (Boy maxilla, Boy mandible, Girl maxilla, Girl mandible) plotting HR (fluor./NO fluor.) against time since emergence (years).]
Figure 5.2: Signal Tandmobiel® study. Bayesian non-parametric model based on Flemish and Finnish data. Posterior means of the hazard ratios between the fluorosis groups (solid line), 95% point-wise equal-tail probability region (dashed line).
Table 5.3: Signal Tandmobiel® study. Bayesian PH models for the effect of fluorosis on caries on permanent first molars. Hazard ratios (95% equal-tail credible intervals (CI)) between fluorosis groups while controlling for gender and jaw, for models fitted using both Flemish and Finnish data and Flemish data only.

                      Flemish and Finnish data            Flemish data only
  Group               Posterior mean  95% CI              Posterior mean  95% CI
  Boys, maxilla       0.674           (0.492, 1.010)      0.651           (0.463, 0.960)
  Boys, mandible      0.572           (0.414, 0.850)      0.549           (0.386, 0.779)
  Girls, maxilla      0.991           (0.721, 1.364)      1.002           (0.698, 1.333)
  Girls, mandible     0.840           (0.608, 1.136)      0.844           (0.602, 1.135)
In three cases, this horizontal line is close to the dotted-dashed line y = 1 in Figure 5.2, implying no effect of fluoride-intake on caries development. A positive effect of fluoride-intake seems to be present only for the mandibular permanent first molars in boys. There are also no deviations from the PH assumption with respect to gender and tooth (plots not shown). This allowed us to assume for the caries model a PH effect of the three covariates, possibly including some interaction terms. With this semi-parametric assumption, it was hoped that the effect of fluoride-intake on caries experience would be seen more clearly.
A proportional hazards model with Flemish and Finnish data
For reasons stated in the previous paragraph, we have fitted a model where
the caries hazard function (5.3) was changed into
\[
  \lambda^{(c)}_{i,l}(t) = Z_i \times h^{(c)}_0(t) \times \exp(\boldsymbol{\beta}'\boldsymbol{x}_{i,l}) \times I[U_{i,l} < t \le V_{i,l}], \tag{5.4}
\]
where x_{i,l} and β are the same as in (5.1). The additional β-parameters were given an N(0, 10²) prior. However, the hazard function for emergence is still
defined by (5.2). Posterior expectations of the hazard ratios between the
fluorosis groups while controlling for the other covariates are given in the left
part of Table 5.3.
The PH analysis for caries gives similar conclusions to the previous non-parametric analysis. A positive effect of fluoride-intake is now seen for the mandibular permanent first molars of boys, with a borderline positive effect for the maxillary permanent first molars of boys. However, no effect of fluoride-intake was seen for girls.
Table 5.4: Signal Tandmobiel® study. Bayesian models with Flemish and Finnish data. Posterior means and 95% equal-tail credible intervals for the hyperparameters: µ0 – conditional expectation of ηi for Finland, µ1 – conditional expectation of ηi for Flanders, τ−2 – conditional variance of ηi, φ−1 – conditional variance of the frailties Zi (top of the table). Means of the posterior predictive distributions and 95% equal-tail posterior predictive intervals for the birth time of dentition ηi in Finland and Flanders, respectively, and for the frailty term Zi (bottom of the table).

  Posterior mean (95% credible interval)
  Hyperparameter    Non-parametric model    Cox regression model
  µ0                5.47 (5.40, 5.54)       5.45 (5.38, 5.52)
  µ1                5.69 (5.64, 5.73)       5.68 (5.64, 5.73)
  τ−2               0.48 (0.45, 0.52)       0.49 (0.45, 0.52)
  φ−1               3.85 (3.57, 4.17)       3.94 (3.58, 4.28)

  Posterior predictive mean (95% posterior predictive interval)
  Parameter         Non-parametric model    Cox regression model
  ηi (Finland)      5.48 (4.12, 6.79)       5.45 (4.05, 6.84)
  ηi (Flanders)     5.69 (4.33, 7.09)       5.69 (4.34, 7.01)
  Zi                1.02 (10^-6, 6.90)      0.95 (10^-6, 6.45)
Remark concerning hyperparameters
The posterior expectations and 95% equal-tail credible intervals of the hyperparameters related to the birth times of dentition ηi and frailties Zi are given
in the upper part of Table 5.4. The non-parametric model and PH model for
caries give similar results.
We now state our conclusions concerning the emergence process in Flanders
and Finland. The emergence process starts slightly earlier in Finland (by approx. 0.2 years) than in Flanders, as is seen by the difference in the posterior
expectations of the means of birth time of dentition. The MCMC output for
the hyperparameters can also be used to estimate properties of the predictive distributions of birth time of dentition and frailties. Their means and
95% equal-tail posterior predictive intervals are shown in the bottom part of
Table 5.4, which shows that the average of Finnish birth time of dentition is
close to 5.5 years of age, slightly higher than the prior expectation but close
to the value obtained by Härkänen et al. (2000) on another Finnish data set.
The 95% posterior predictive intervals show that the actual moment of birth
time of dentition varies between about 4 and 7 years of age. Finally, the 95%
posterior predictive interval of Zi shows a clear heterogeneity in the frailty
for caries experience.
Sensitivity analysis
Firstly, the model (5.4) was fitted using Flemish data only, to see how influential the inclusion of the Finnish data was. As seen in Table 5.3, the hazard ratios changed only slightly. The same was true for the remaining parameters. Moreover, the Finnish data improved only slightly the precision with which the emergence of the first permanent molars was estimated. This is seen in Figure 5.3, which shows a comparison of the 95% pointwise equal-tail credible regions for the emergence hazard functions of the permanent first molars based on the analysis with both data sets and with the Flemish data set only. Nevertheless, the credible regions are somewhat narrower when both databases are used.
[Figure 5.3: four panels (Boy maxilla, Boy mandible, Girl maxilla, Girl mandible) plotting the emergence hazard function against time since birth time of dentition (years).]
Figure 5.3: Signal Tandmobiel® study. Bayesian PH models. Posterior means of the emergence hazard functions h^{(e)}(· | tooth, gender) for the permanent first molars together with their 95% pointwise equal-tail probability regions. Comparison of the posterior means with (solid line) and without additional Finnish data (dashed line), together with the corresponding 95% probability regions (dotted-dashed line and dotted line, respectively).
To see how the behavior of the parameter estimates changes when informative
priors for the hyperparameters are modified we have fitted the proportional
hazards model with Flemish data only, using different choices of priors for
the hyperparameters. Specifically, we used normal distributions N(3, 2), N(4, 1), N(5.2, 1), and N(6, 1) as priors for the expectation ξ0 of the birth time of dentition ηi. The standard deviation of the normal prior with mean 3 years
was increased so as to cover realistic emergence times of permanent teeth.
We used Gamma(0.1, 0.1), Gamma(2, 2), and Gamma(10, 10) distributions
as priors for the precision τ of the variance of the birth time of dentition and
for the precision φ of frailties Zi . All other parameters were given flat priors
and there is thus no reason to modify them.
Posterior means and 95% equal-tail credible intervals for hazard ratios between the fluorosis and fluorosis-free groups for different choices of the prior
distributions are shown in Figure 5.4, which shows that the influence of the
choice of the prior distribution is not strong.
[Figure 5.4: four panels (Boy maxilla, Boy mandible, Girl maxilla, Girl mandible) plotting the hazard ratio against the prior pattern (1 to 12), grouped by τ, φ ∼ Γ(0.1, 0.1), Γ(2, 2) and Γ(10, 10).]
Figure 5.4: Sensitivity analysis. Evolution of the posterior means and 95% credible intervals for the hazard ratios between the fluorosis and fluorosis-free groups with changing prior distributions for the hyperparameters τ, φ and ξ0. Prior patterns 1, 5 and 9 use an N(3, 2) prior for ξ0, patterns 2, 6 and 10 an N(4, 1) prior, patterns 3, 7 and 11 an N(5.2, 1) prior, and patterns 4, 8 and 12 an N(6, 1) prior.
We argue that our other assumptions are not strong. Indeed, we assume that
the distributions of the birth time of dentition differ between Finnish and
Flemish populations only in their means. Moreover, as indicated above, the
Finnish data had only a slight impact on the results for the Flemish data.
Further, the baseline hazards were estimated non-parametrically. Finally,
different choices for the priors of the hyperparameters led to similar results
as discussed above.
5.2.5 Discussion
The model presented here allows for the analysis of survival data in dental
research where (doubly-)interval-censored data and dependencies between observations (e.g. between teeth in the same mouth) are common. Our specific
application is to a typical dental research question, i.e. whether fluoride-intake has a protective effect against caries. The results show that the protective
effect of fluoride-ingestion is not convincing. We observed a positive effect
only for mandibular teeth of boys. This agrees with current guidelines for
the use of fluoride in caries prevention, where only the topical application
(e.g. fluoride in tooth paste) is considered to be essential (Oulis, Raadal, and
Martens, 2000).
We acknowledge that our analyses could have been more refined if the amount of left- and right-censoring had been smaller, for instance if the study had started
approximately one year earlier and ended in high school. This would make our
analyses less dependent on prior assumptions. Yet these prior assumptions
are simply a reflection of basic dental knowledge and it would be a waste
not to use them. Moreover, to our knowledge the Signal Tandmobiel® study is possibly the largest longitudinal study executed with such great detail on
dental aspects.
This section has illustrated the usefulness of the Bayesian approach. Firstly,
it was possible to incorporate prior information and to relax the parametric assumptions often made in survival analysis with interval-censored data.
Secondly, even rather complex models could be specified for doubly-interval-censored data. However, we have to admit that this approach is computationally demanding. On a Pentium IV 2 GHz PC with 512 MB RAM, one BITE run took about 5 days to converge. Nevertheless, in an epidemiological analysis where there is correlation among the subjects, where the response
and/or the covariates are (right-, left- or interval-) censored and when we
wish to avoid parametric assumptions we doubt any classical approach will
suffice.
5.3 Bayesian accelerated failure time model
Most contributions to the AFT model in the Bayesian literature work explicitly only with right-censored data. However, using the idea of Bayesian data augmentation (Section 4.3), they can all be quite easily extended to handle also interval-censored data. Additionally, practically all papers dealing with the Bayesian AFT model use a Bayesian non-parametric approach (see Walker et al., 1999, or the book by Ghosh and Ramamoorthi, 2003) for the distributional parts of the AFT model. In this section, we give a brief overview.
Firstly, Christensen and Johnson (1988) and Johnson and Christensen (1989)
consider the basic univariate AFT model (3.2) and use a Dirichlet process
prior (Ferguson, 1973, 1974) for the underlying baseline survival distribution,
i.e. the distribution of exp(ε). In the former paper, only a semi-Bayesian approach is used, whereas the latter paper presents a fully Bayesian analysis, albeit with uncensored data only. The authors state that "The analysis
becomes totally intractable when there are censored observations.” Additionally, as discussed in Johnson and Christensen (1989), difficulties might
arise due to the discrete nature of a Dirichlet process (the baseline survival
distribution is discrete with probability one if it is assigned the Dirichlet process prior). An improvement is presented by Kuo and Mallick (1997), who consider a Dirichlet process mixture (Lo, 1984) for either ε or exp(ε).
Subsequently, Walker and Mallick (1999) suggest using a diffuse, finite Pólya tree prior distribution described in Lavine (1992, 1994) and Mauldin, Sudderth, and Williams (1992) for the error term ε in the AFT model (3.2). The main advantages of the Pólya tree prior distribution are that (1) it can assign probability one to the set of continuous distributions, and (2) it is easy to constrain the resulting error term ε to have its median (or any other quantile), rather than its mean, equal to zero (or any other fixed number), so that regression quantiles can also be modelled, of which median regression is the most important case. Additionally, Walker and Mallick (1999) relax the i.i.d. assumption on the error terms and assume also the population-averaged AFT model (3.6).
Subsequent approaches to the Bayesian non-parametric AFT model concentrate on median regression. Namely, Kottas and Gelfand (2001) suggest using the
Dirichlet process mixture of either unimodal parametric densities or unimodal
step functions for the distribution of the error term ε in the basic AFT
model (3.2). Another median regression AFT model is given by Hanson and
Johnson (2002) who use a mixture of Pólya trees centered about a standard,
parametric family of probability distributions as a prior for the error term ε.
Finally, Hanson and Johnson (2004) consider a mixture of Dirichlet processes
introduced by Antoniak (1974) (which is distinct from the Dirichlet process
mixture used by Kuo and Mallick, 1997 or Kottas and Gelfand, 2001) as the
prior for the error term ε in the basic AFT model (3.2). They also explicitly consider interval-censored data.
The area of multivariate survival data modelled by means of the Bayesian AFT model seems to be almost unexplored. Except for the work of Walker and Mallick (1999), we are not aware of any other contribution. Moreover, the structured modelling of dependencies by means of the cluster-specific AFT model introduced in Section 3.4.3 seems to be entirely absent from the literature.
5.4 Concluding remarks
In this chapter and in Chapter 3 we encountered two fundamental regression models for survival data. We mentioned that the most frequently used PH model has several drawbacks, so that in many practical situations it is worthwhile to consider alternatives, of which the AFT model is an appealing one. We pointed out that the AFT model whose distributional parts
are parametrically specified can relatively easily be estimated even using the
method of maximum-likelihood. However, especially for prediction purposes,
it is important to avoid incorrectly specified parametric models since due
to the censoring any parametric assumption is very difficult to check with
survival data. For that reason, one aims for methods that leave the distributional parts of the model either completely unspecified or specify them in
a flexible way. For the PH model, the partial likelihood due to Cox (1975) is
available for this purpose. Unfortunately, no similar concept exists for the
AFT model. Several frequentist semi-parametric methods were reviewed in
Sections 3.2, 3.4.2, 3.4.3, and 5.1.4. Nevertheless, we saw that, especially with interval censoring, let alone doubly interval censoring, most of them
become computationally intractable in practical situations. Moreover, with
multivariate data, the situation becomes even more complex.
On the other hand, the Bayesian approach together with data augmentation offers an appealing alternative that allows us to formulate and also estimate realistically complex models even with multivariate and/or (doubly-)interval-censored data. We have illustrated this with the Bayesian semi-parametric PH model in Section 5.2. In Section 5.3, we subsequently reviewed existing semi-parametric approaches to the AFT model. However, we mentioned that most of them were primarily developed to handle only univariate data. Nevertheless, many survival problems lead to the analysis of multivariate data.
Concluding Remarks to Part I and Introduction to Part II
We have introduced two versions of the AFT model, the population-averaged and the cluster-specific model, that can be used to analyze multivariate survival data. We have also mentioned that, especially for the cluster-specific
AFT model (3.7), with unspecified distributional parts of the model, there is
almost no methodology developed in the literature.
In this thesis, we aim to present methods to handle both the population-averaged AFT model (3.6) and the cluster-specific AFT model (3.7) in the presence of multivariate and/or (doubly-)interval-censored data. At the
same time, we want to minimize the parametric assumptions concerning the
distributional parts of the model as much as possible. One possibility to
reach this target is to use smoothing methods for the unknown distributional
parts. In the literature, it is more often the baseline hazard function that is smoothed
(Section 5.1.3: Kooperberg and Clarkson, 1997; Joly et al., 1998; Betensky
et al., 1999; Section 5.2: Härkänen et al., 2000; Komárek et al., 2005).
However, with the AFT model, it is quite natural to use a flexible smooth expression for the density of the error term ε and/or of the random effects b. For example, for the bivariate population-averaged AFT model, Pan and Kooperberg (1999) use this idea in combination with multiple imputation (see Section 3.4.2).
In principle, the methods presented in Part II of this thesis will be built on the same basis as that of Pan and Kooperberg (1999). Whereas they express the logarithm of the unknown density using splines and use numerical integration to evaluate and optimize the likelihood, we will model the density directly using a linear combination of suitable parametric basis functions and thus simplify the likelihood evaluation (see Section 6.2.4). In contrast to Pan
and Kooperberg (1999) we also exploit another strategy to determine the
number of the basis functions. Whereas they choose the optimal number of
basis functions using a criterion like AIC (Akaike, 1974), we will either take an overspecified number of basis functions and prevent identifiability problems and overfitting of the data by using a penalty term (Chapters 7, 9, 10)
or estimate the number of the basis functions simultaneously with the other
model parameters (Chapter 8).
Further, we will show that for univariate survival data we are able, even under interval censoring, to use maximum-likelihood based methods without the need for multiple imputation (Chapter 7). With the introduction of multivariate and doubly-interval-censored data, we avoid multiple imputation by switching to the Bayesian approach (Chapters 8, 9, 10), which is more advantageous in such situations, as was explained in Chapter 4.
Part II
Accelerated Failure Time Models with Flexible Distributional Assumptions
Chapter 6
Mixtures as Flexible Models for Unknown Distributions
We aim to develop the accelerated failure time models with flexibly specified
distributional parts. We have already sketched that we wish to use flexible,
yet smooth expressions for densities involved in the specification of these
distributional parts. In this chapter, let g(y) denote an unknown density of some generic univariate random variable Y, and g(\boldsymbol{y}) the density of a generic random vector \boldsymbol{Y}. We outline two similar, though conceptually different, methods to approximate g(y) or g(\boldsymbol{y}) in a flexible and smooth way, namely
1. The classical mixture approach;
2. An approach based on penalized smoothing.
We introduce the classical mixture approach in Section 6.1. In Section 6.2, the
penalized smoothing approach exploiting B-splines will be given. In Section
6.3, we replace the B-splines by normal densities and introduce the penalized
normal mixture. Finally, we compare the classical and penalized normal
mixture in Section 6.4.
6.1 Classical normal mixture

6.1.1 From general finite mixture to normal mixture
To model unknown distributional shapes finite mixture distributions have
been advocated by, e.g., Titterington, Smith, and Makov (1985, Section 2.2)
as appealing semi-parametric structures. Using a finite mixture the density
g(y) is modelled in the following way:
\[
  g(y) = g(y \mid \boldsymbol{\theta}) = \sum_{j=1}^{K} w_j\, g_j(y), \tag{6.1}
\]
where g_j, j = 1, ..., K are known densities and θ = (K, w_1, ..., w_K)′ is the vector of unknown parameters. Namely, K is the number of mixture components, and w_j, j = 1, ..., K are unknown weights satisfying w_j > 0, j = 1, ..., K and \(\sum_j w_j = 1\). In general, the number of mixture components, K, is assumed unknown; however, due to difficulties outlined further in the text, the estimation of K is often separated from the estimation of the remaining parameters, especially when using maximum-likelihood based methods.
Further, it is often assumed that the mixture components g_j, j = 1, ..., K have a common parametric form g̃ and that each mixture component depends on an unknown vector of parameters η_j, j = 1, ..., K. Expression (6.1) then changes into
\[
  g(y) = g(y \mid \boldsymbol{\theta}) = \sum_{j=1}^{K} w_j\, \tilde{g}(y \mid \boldsymbol{\eta}_j), \tag{6.2}
\]
where θ = (K, w_1, ..., w_K, η′_1, ..., η′_K)′. A frequently used particular form of (6.2) is a normal mixture, where g̃(y | η_j) equals ϕ(y | µ_j, Σ_j), a density of the (multivariate) normal distribution with mean µ_j and covariance matrix Σ_j. For instance, Verbeke and Lesaffre (1996) use a mixture of multivariate normal distributions with Σ_j = Σ for all j to model the distribution of the random effects in the linear mixed model.

In this thesis, we use the classical normal mixture only in a univariate context, i.e. to express an unknown univariate density g(y) as
\[
  g(y) = g(y \mid \boldsymbol{\theta}) = \sum_{j=1}^{K} w_j\, \varphi(y \mid \mu_j, \sigma_j^2). \tag{6.3}
\]
In this case, the vector θ equals
\[
  \boldsymbol{\theta} = (K, w_1, \dots, w_K, \mu_1, \dots, \mu_K, \sigma_1^2, \dots, \sigma_K^2)'. \tag{6.4}
\]
Figure 6.1 illustrates how two- or four-component, even homoscedastic, normal mixtures can be used to obtain densities of different shapes.
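As a concrete illustration of (6.3), the following sketch evaluates a two-component homoscedastic normal mixture density; the weights, means and common variance are hypothetical values chosen only to produce a bimodal shape similar to those in Figure 6.1.

```python
# A tiny sketch of evaluating the univariate normal mixture density (6.3);
# the weights, means and common variance below are hypothetical values.
import numpy as np
from scipy.stats import norm

w = np.array([0.35, 0.65])       # mixture weights, positive and summing to one
mu = np.array([-1.5, 1.0])       # component means mu_j
sigma2 = np.array([0.5, 0.5])    # component variances sigma_j^2 (homoscedastic here)

def mixture_density(y):
    """g(y | theta) = sum_j w_j * phi(y | mu_j, sigma_j^2)."""
    y = np.atleast_1d(y)[:, None]
    return (w * norm.pdf(y, loc=mu, scale=np.sqrt(sigma2))).sum(axis=1)

print(mixture_density([-1.5, 0.0, 1.0]))
```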
6.1.2 Estimation of mixture parameters
Let θ be a vector given by the expression (6.4) and containing all unknown
parameters of model (6.3). Suppose first that an i.i.d. sample y1 , . . . , yn
from a density g(y | θ) is available to estimate the unknown parameter vector θ. Maximum-likelihood based methods pose two main difficulties when
estimating θ:
1. When K, the number of mixture components is unknown, one of the
basic regularity conditions for the validity of the classical maximum-likelihood theory is violated. Namely, the parameter space does not
have a fixed dimension. Indeed, the number of unknowns (number of
unknown mixture weights, means and variances) is one of the unknowns.
See, e.g., Titterington et al. (1985, Section 1.2.2) for a detailed discussion of this difficulty.
2. For a fixed K ≥ 2, the likelihood becomes unbounded resulting in nonexistence of the maximum-likelihood estimate when one of the mixture
means, say µ1 , is equal to one of the observations yi , i = 1, . . . , n and
when the corresponding mixture variance, σ12 , converges to zero. See,
e.g., McLachlan and Basford (1988, Section 2.1) for more details.
In the classical frequentist approach, the first problem is tackled by consecutively fitting several models with different numbers of mixture components and choosing the best one using some criterion, e.g., Akaike's information criterion (Akaike, 1974). To avoid the second problem, homoscedastic normal mixtures, i.e. with σ_j² = σ² for all j, are used, leading to a bounded likelihood.
Figure 6.1: Several densities expressed as two- or four-component homoscedastic normal mixtures.
Bayesian methodology, on the other hand, offers a unified framework to estimate both the number of mixture components K and heteroscedastic normal
mixtures in the same way as any other unknown parameters, i.e. using proper
posterior summaries. A breakthrough in Bayesian analysis of models with
a parameter space of varying dimension is the introduction of the reversible
jump Markov chain Monte Carlo (RJMCMC) algorithm by Green (1995)
which allows exploration of the joint posterior distribution of the whole parameter
vector θ from model (6.3), including the number of mixture components K.
Explicit application of the RJMCMC algorithm to normal mixtures is then
described by Richardson and Green (1997).
The fact that the likelihood is unbounded for heteroscedastic normal mixtures leads to an improper posterior distribution in the Bayesian setting when a fully non-informative prior distribution is used for the variances of the mixture components (mixture variances), i.e. when \(p(\sigma_1^2, \dots, \sigma_K^2) \propto \prod_j \sigma_j^{-2}\). However, the problem is solved by using a slightly informative prior distribution for the mixture variances. For instance, replacing \(\prod_j \sigma_j^{-2}\) by a product of inverse gamma densities with parameters h_1 and h_2, where h_1 = h_2 = 0.001 or h_1 = 1, h_2 = 0.005 (the classical vague priors), is already sufficient to prevent the mixture variances from tending to zero and causing an infinite likelihood.
We use a classical normal mixture model (6.3) for the density of the error
distribution in the cluster-specific AFT model in Chapter 8. To avoid difficulties with the maximum-likelihood estimation outlined above and for other
reasons (see Sections 4.1 and 4.2) only Bayesian methodology will be considered here. In Chapter 8 we also discuss the RJMCMC algorithm and the
issue of the prior distribution for mixture variances in more detail.
6.2 Penalized B-splines

6.2.1 Introduction to B-splines
Different types of smoothing are routinely used in various places of modern
statistics to express an unknown (smooth) function. Most often, either regression surfaces or densities are smoothed; see, e.g., Fahrmeir and Tutz (2001,
Chapter 5) and Hastie, Tibshirani, and Friedman (2001) for an overview.
In this thesis, we concentrate on smoothing based on splines. For simplicity, we consider the univariate case first. The unknown function g(y) (a density in our case) is expressed as a linear combination (mixture) of suitable basis spline functions B_1(y), ..., B_K(y), i.e.
\[
  g(y) = g(y \mid \boldsymbol{\theta}) = \sum_{j=1}^{K} w_j\, B_j(y), \tag{6.5}
\]
where θ = w = (w1 , . . . , wK )′ . Expression (6.5) is similar to (6.3) introduced
in the previous section. Note however that in contrast to normal densities in
(6.3), the basis spline functions Bj (y), j = 1, . . . , K are always fully specified,
including their location and scale, and the number of basis splines, K, is
always fixed beforehand. The only quantities that have to be estimated are
the spline coefficients (mixture weights) w.
So-called B-splines (de Boor, 1978; Dierckx, 1993) form, thanks to their numerical stability and simplicity, a suitable system of basis spline functions. Their use in statistics was promoted especially by Eilers and Marx (1996). The B-spline is a piecewise polynomial function.
Figure 6.2: Basis B-splines of degree d = 1 (upper panel) and degree d = 2 (lower panel).
To fully specify the B-spline basis B_1(y), ..., B_K(y), we have to determine
1. Degree d of the polynomial pieces;
2. A set of values (knots) µ1 ≤ · · · ≤ µd+1 < · · · < µK+1 ≤ · · · ≤ µK+d+1
such that the interval (µ1 , µK+d+1 ) covers the domain of the function
g(y) we wish to express using the B-splines.
Given that, the value of each basis B-spline can easily be computed at an arbitrary point y ∈ R (see de Boor, 1978). Figure 6.2 shows a basis of linear
(d = 1) and quadratic (d = 2) B-splines with K = 9. It can be found that
the jth basis B-spline of degree d
1. Consists of d + 1 polynomial pieces;
2. Is only positive on the interval (µj , µj+d+1 );
3. Has continuous derivatives up to order d − 1;
4. Except on boundaries it overlaps with 2d polynomial pieces of its neighbors.
Figure 6.3: Several functions expressed as linear combinations of cubic B-splines with K = 9 and an equidistant set of knots.
Furthermore, for all y ∈ (µ_1, µ_{K+d+1}) the basis B-splines sum up to one, i.e. \(\sum_{j=1}^{K} B_j(y) = 1\). Finally, Dierckx (1993) gives simple recursive formulas to compute derivatives or integrals of the function g(y) expressed by (6.5). Figure 6.3 illustrates that a B-spline mixture can result in functions of various shapes.
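To make the construction explicit, the sketch below evaluates the basis B-splines B_1(y), ..., B_K(y) of degree d on a given knot sequence using the standard recursion of de Boor (1978); it is illustrative code, not the software used later in this thesis, and the knot sequence is hypothetical.

```python
# A short sketch of evaluating the basis B-splines B_j(y) of degree d on a
# given (non-decreasing) knot sequence via the usual recursion of de Boor (1978).
import numpy as np

def bspline_basis(y, knots, d):
    """Return the values B_1(y), ..., B_K(y) with K = len(knots) - d - 1."""
    knots = np.asarray(knots, dtype=float)
    K = len(knots) - d - 1
    # degree 0: indicator functions of the knot intervals
    B = np.array([1.0 if knots[j] <= y < knots[j + 1] else 0.0
                  for j in range(len(knots) - 1)])
    for deg in range(1, d + 1):
        B_new = np.zeros(len(knots) - deg - 1)
        for j in range(len(B_new)):
            left = right = 0.0
            if knots[j + deg] > knots[j]:
                left = (y - knots[j]) / (knots[j + deg] - knots[j]) * B[j]
            if knots[j + deg + 1] > knots[j + 1]:
                right = (knots[j + deg + 1] - y) / (knots[j + deg + 1] - knots[j + 1]) * B[j + 1]
            B_new[j] = left + right
        B = B_new
    return B[:K]

# quadratic B-splines (d = 2) on equidistant knots; the basis sums to one
knots = np.arange(0.0, 13.0)                   # mu_1, ..., mu_13
print(bspline_basis(4.3, knots, d=2).sum())    # ~1 well inside the support
```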
6.2.2 Penalized smoothing
Choosing the optimal number and position of knots is generally a complex
task in the area of spline smoothing. Too many knots leads to overfitting the
data; too few knots leads to underfitting and inaccuracy. O’Sullivan (1986,
1988) proposed to take a relatively large number of knots and to restrict the
flexibility of the fitted curve by putting a penalty on the second derivative.
In the context of B-splines, Eilers and Marx (1996) suggested
1. To use a large number of equidistant knots covering the domain of the
function g(y) one wishes to smooth;
2. To estimate the spline coefficients using the method of penalized maximum-likelihood. Further, they propose to base the penalty on squared
finite higher-order differences between adjacent spline coefficients wj .
They call their method penalized B-spline, or P-spline, smoothing. Eilers and Marx (1996) use P-splines primarily to smooth regression surfaces, although they also propose a methodology, based on the Poisson generalized linear model (GLM), for smooth estimation of a density from i.i.d. data.
We sketch this method in Section 6.2.4.
The strategy of several further developments (Chapters 7, 9, 10) in this thesis
is based on the ideas of Eilers and Marx (1996), modified and adapted to
regression modelling with censored data. Namely,
1. For reasons stated in Section 6.3 we replace the basis B-splines by normal densities with a common variance;
2. We base the penalty term on squared finite higher-order differences
between appropriate transformations of the adjacent spline coefficients
wj , see Section 7.2.2 for a motivation;
3. More complex models in Chapters 9 and 10 will be estimated using Bayesian methodology, with prior distributions inspired by the penalty term used in the penalized maximum-likelihood applications;
4. We will not use the Poisson GLM-based density estimation, see Section 6.2.4 for the reasons why.
In agreement with Eilers and Marx (1996), we use a set of equidistant knots in all developments based on penalized smoothing.
6.2.3 B-splines in survival analysis
General splines have been suggested at several places in the survival literature
to model flexibly either the (log-)density/(log-)hazard function or the effect
of covariates replacing a linear predictor by a spline function. See the discussion section of Abrahamowicz, Ciampi, and Ramsay (1992), the introductory
section of Kooperberg, Stone, and Truong (1995) or Chapter 5 of Therneau
and Grambsch (2000) for an overview.
More specifically, B-splines have been used by Rosenberg (1995), who uses their cubic variant to express the baseline hazard function in Cox's PH model. He chooses the optimal number of knots according to Akaike's information criterion (Akaike, 1974) while placing the knots at the quantiles of the uncensored observations. An approach based on penalized maximum-likelihood is given by Joly, Commenges, and Letenneur (1998), who use monotone splines (close relatives of the B-splines, introduced by Ramsay, 1988) to model the baseline hazard function, in Cox's PH model as well. Tutz and Binder (2004) and Kauermann (2005b) use B-splines to extend the basic Cox's PH model by allowing for time-varying regression parameters.
Recently, Lambert and Eilers (2005) used a Bayesian version of penalized B-splines to model both the baseline hazard and the effect of covariates in Cox's PH model in an actuarial context. To the best of our knowledge, there is no approach in which B-splines are used to model the density of the survival times.
6.2.4 B-splines as models for densities
The function g(y | θ) expressed by (6.5) can serve as a model for the density of a continuous distribution with domain (µ_1, µ_{K+d+1}) provided
\[
  g(y \mid \boldsymbol{\theta}) \ge 0 \quad \text{for all } y \in (\mu_1, \mu_{K+d+1}), \tag{6.6}
\]
\[
  \int_{\mu_1}^{\mu_{K+d+1}} g(y \mid \boldsymbol{\theta})\, dy = 1. \tag{6.7}
\]
Condition (6.6) is satisfied if we require all the spline coefficients to be positive, i.e.
\[
  w_j > 0, \qquad j = 1, \dots, K. \tag{6.8}
\]
Constraint (6.7) can easily be avoided when we change the expression (6.5) for g(y | θ) into
\[
  g(y \mid \boldsymbol{\theta}) = Q^{-1} \sum_{j=1}^{K} w_j\, B_j(y), \qquad
  Q = \int_{\mu_1}^{\mu_{K+d+1}} \Big\{ \sum_{j=1}^{K} w_j\, B_j(y) \Big\}\, dy. \tag{6.9}
\]
The constant Q can easily be computed using the formulas given by Dierckx (1993, Section 1.3). For example, in the case of coincident boundary knots (i.e. µ_1 = · · · = µ_{d+1} and µ_{K+1} = · · · = µ_{K+d+1}) the constant Q equals
\[
  Q = \frac{1}{d+1} \sum_{j=1}^{K} w_j\, (\mu_{j+d+1} - \mu_j).
\]
We show in Section 6.3.2 how to avoid also the inequality constraints (6.8).
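As a small numerical illustration of the normalization in (6.9), the sketch below computes the constant Q in the case of coincident boundary knots using the closed form given above; the degree, knots and spline coefficients are hypothetical.

```python
# A small sketch of computing the normalizing constant Q from (6.9) for a
# knot sequence with coincident boundary knots, using the closed form above.
import numpy as np

d = 3                                      # cubic B-splines
inner = np.arange(0.0, 6.0)                # interior knots 0, 1, ..., 5
knots = np.concatenate(([inner[0]] * d, inner, [inner[-1]] * d))   # coincident boundaries
K = len(knots) - d - 1                     # number of basis splines
w = np.random.default_rng(4).uniform(0.5, 1.5, size=K)             # positive coefficients

Q = (w * (knots[d + 1:] - knots[:K])).sum() / (d + 1)
print(K, round(Q, 4))
# Dividing the B-spline mixture by Q makes it integrate to one over (mu_1, mu_{K+d+1}).
```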
A somewhat different approach to estimating a density function using B-splines has been suggested by Eilers and Marx (1996, Section 8), namely by smoothing a histogram. They divide the range of the data into a large number K of bins, each of length h, and let the midpoints of the bins define the knots µ_1, ..., µ_K. The raw continuous data y_1, ..., y_n are changed into counts n_1, ..., n_K such that n_j, j = 1, ..., K equals the number of raw observations y_i, i = 1, ..., n with µ_j − h/2 ≤ y_i < µ_j + h/2. The counts n_1, ..., n_K constitute a histogram. They assume that each of these counts follows a Poisson distribution with expectation E(n_1), ..., E(n_K), respectively. A smoothed histogram is obtained by expressing the Poisson log-expectations as a B-spline, namely
\[
  \log E(n_j) = \sum_{k=1}^{K} w_k\, B_k(\mu_j), \qquad j = 1, \dots, K.
\]
The corresponding smooth density of the original continuous data is then given by
\[
  g(y \mid \boldsymbol{\theta}) = Q^{-1} \exp\Big\{ \sum_{k=1}^{K} w_k\, B_k(y) \Big\},
\]
where Q is an appropriate proportionality constant. Eilers and Marx (1996) argue that the use of penalized maximum-likelihood estimation provides stable and useful results and does not lead to any pathological behavior resulting from the discretization of the data.
For our developments in the context of the AFT model, we believe that the
approach with the density directly expressed as a mixture of B-splines is more
advantageous since it leads to a simpler likelihood evaluation. Remember that
with censored observations the likelihood involves evaluation of integrals of
the assumed density (see Sections 4.1 and 4.2). With the density (6.9) these
integrals are simply mixtures of integrated basis B-splines whose computation
only involves integration of polynomials. Nevertheless, usage of the smoothed
histogram approach in the censored data regression context is presented by
Lambert and Eilers (2005).
6.2.5 B-splines for multivariate smoothing
The concept of B-splines can be extended to the multivariate setting, to smooth (estimate) a function g(\boldsymbol{y}) of several variables. For example, the bivariate case is obtained by replacing formula (6.5) by
\[
  g(\boldsymbol{y}) = g(y_1, y_2) = g(\boldsymbol{y} \mid \boldsymbol{\theta})
  = \sum_{j_1=1}^{K_1} \sum_{j_2=1}^{K_2} w_{j_1, j_2}\, B_{1, j_1}(y_1)\, B_{2, j_2}(y_2),
\]
where B_{1, j_1}, j_1 = 1, ..., K_1 is a set of basis B-splines of degree d defined by the knots µ_{1, 1}, ..., µ_{1, K_1+d+1}, B_{2, j_2}, j_2 = 1, ..., K_2 a set of basis B-splines of degree d defined by a generally different set of knots µ_{2, 1}, ..., µ_{2, K_2+d+1}, and θ = (w_{1,1}, ..., w_{K_1,K_2})′. Namely, g(\boldsymbol{y} | θ) is expressed as a Kronecker product of univariate B-splines and this idea can be extended also to higher dimensions.
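The tensor-product construction can be illustrated with a compact sketch; to keep it self-contained it uses linear (d = 1) B-splines, for which each basis function is a simple hat function, and the knots and coefficients w_{j1, j2} are hypothetical.

```python
# A compact sketch of the bivariate (Kronecker-product) construction above,
# using linear (d = 1) B-splines ("hat" functions); knots and coefficients
# are hypothetical.
import numpy as np

def hat_basis(y, knots):
    """Linear (d = 1) B-splines: hat functions centred at the interior knots."""
    B = np.zeros(len(knots) - 2)
    for j in range(len(B)):
        left, mid, right = knots[j], knots[j + 1], knots[j + 2]
        if left <= y <= mid and mid > left:
            B[j] = (y - left) / (mid - left)
        elif mid < y <= right and right > mid:
            B[j] = (right - y) / (right - mid)
    return B

knots1 = np.linspace(0.0, 5.0, 8)      # knots for the first coordinate
knots2 = np.linspace(-2.0, 2.0, 6)     # knots for the second coordinate
rng = np.random.default_rng(5)
W = rng.uniform(0.0, 1.0, size=(len(knots1) - 2, len(knots2) - 2))   # w_{j1, j2}

def g(y1, y2):
    """g(y1, y2) = sum_{j1, j2} w_{j1, j2} B_{1, j1}(y1) B_{2, j2}(y2)."""
    return float(hat_basis(y1, knots1) @ W @ hat_basis(y2, knots2))

print(round(g(2.3, 0.4), 4))
```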
6.3 Penalized normal mixture

6.3.1 From B-spline to normal density
Using the B-spline expression (6.5) to model a survival density has one drawback. Namely, the support of the resulting density g(y | θ) is always bounded
and equal to the interval (µ1 , µK+d+1 ). However, most continuous survival
distributions are thought of as having a support of (0, ∞) on the time scale
and the real line on the log-scale. While in practice this might not constitute
any difficulty, in theory it might be more comfortable to approximate a density having an infinite support. Remember also that we aim to approximate
densities of either the error distribution in the AFT model or the distribution
of the random effects in the same model. This implies that it might be quite
difficult in some settings to find a proper range of the density for the error
terms and/or random effects as both distributions are seen from the data
only indirectly. However, one can easily find that the basis B-spline of degree
d is very close to the density of the standard normal distribution in the sense
of the following proposition.
Proposition 6.1. Let B^d(y) be a basis B-spline of degree d defined on the grid of d + 2 equidistant knots
\[
  \mu^d_1 = -\delta\,\frac{d+1}{2}, \;\dots,\; \mu^d_{d+2} = \delta\,\frac{d+1}{2},
\]
with δ = µ^d_{j+1} − µ^d_j, j = 1, ..., d + 1 equal to \(\sqrt{12/(d+1)}\). Let
\[
  B^d_{\mathrm{st}}(y) = \sqrt{\frac{d+1}{12}}\; B^d(y), \qquad y \in \mathbb{R}
\]
be a standardized basis B-spline of degree d. Then
\[
  \lim_{d\to\infty} B^d_{\mathrm{st}}(y) = \varphi(y) \quad \text{uniformly for all } y \in \mathbb{R},
\]
where ϕ denotes the density of the standard normal distribution.
Proof. We give only the main ideas of the proof. All technical details can be found in Unser, Aldroubi, and Eden (1992).
Firstly, an arbitrary basis B-spline of degree d is proportional to the density of a sum of d + 1 independent uniformly distributed random variables. The properly standardized basis B-spline, B^d_st(y), is then equal to the density of a zero-mean, unit-variance random variable given as a sum of d + 1 independent uniformly distributed random variables. The proposition then follows from the central limit theorem (see, e.g., Billingsley, 1995, Section 27).
The property outlined in Proposition 6.1 is illustrated in Figure 6.4. Moreover, the convergence is rather fast. Indeed, the standardized cubic basis
B-spline is already quite close to the standard normal density.
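The argument behind the proof can also be checked numerically: a properly standardized sum of d + 1 independent uniform variables has the density B^d_st, and the Monte Carlo sketch below (with hypothetical simulation sizes) compares its empirical density at zero with ϕ(0) for a few degrees d.

```python
# A numerical sketch of the argument used in the proof of Proposition 6.1:
# the standardized sum of d + 1 independent Uniform(0, 1) variables has the
# density B^d_st, which approaches the standard normal density as d grows.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
for d in (1, 3, 8):
    n_uniform = d + 1
    s = rng.uniform(0.0, 1.0, size=(200_000, n_uniform)).sum(axis=1)
    z = (s - n_uniform / 2.0) / np.sqrt(n_uniform / 12.0)   # mean 0, variance 1
    half_width = 0.1
    dens_at_0 = np.mean(np.abs(z) < half_width) / (2 * half_width)
    print(d, round(dens_at_0, 3), round(norm.pdf(0.0), 3))  # compare with phi(0)
```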
This reasoning led us to replace the basis B-splines in the expression (6.5) by normal densities whose means are equal to the knots and whose variance is equal to a common value σ_0². In accordance with the idea of penalized B-splines (see Section 6.2.2), we use a larger number of equidistant knots chosen beforehand.
Figure 6.4: Standardized basis B-splines of degree 0 to 8 (solid line) compared to a standard normal density (dashed line).
Additionally, as explained in Section 6.3.3, we always use an odd number of knots, symmetric around the middle knot. For this reason, the number of mixture components will be indicated by 2K + 1 and the knots (means) denoted by µ_{−K}, ..., µ_0, ..., µ_K. Namely, the unknown function g(y) (a density) is approximated by
\[
  g(y) = g(y \mid \boldsymbol{\theta}) = \sum_{j=-K}^{K} w_j\, \varphi(y \mid \mu_j, \sigma_0^2), \tag{6.10}
\]
where θ = (w_{−K}, ..., w_K)′. The basis standard deviation σ_0 is chosen beforehand, as well as the knots. For its choice we adopted the value 2δ/3, where δ = µ_{j+1} − µ_j, j = −K, ..., K − 1 is the distance between two consecutive knots (means). The motivation for this choice is provided by an attempt to keep a correspondence with the cubic B-splines. Remember, the basis cubic B-spline covers an interval of length 4δ. The same is nearly true for the normal density with variance (2δ/3)², if we admit that the N(µ, σ²) density is practically zero outside the interval (µ − 3σ, µ + 3σ). In this context, we will call (6.10) a penalized normal mixture.
6.3.2 Transformation of mixture weights
To ensure that the function g(y | θ) given by (6.10) is a density of some
continuous distribution, we have to impose constraints analogous to (6.8)
and (6.9) upon the mixture weights w = (w−K , . . . , wK )′ . Namely, they have
to satisfy
\[
  w_j > 0, \qquad j = -K, \dots, K, \tag{6.11}
\]
\[
  \sum_{j=-K}^{K} w_j = 1. \tag{6.12}
\]
To avoid constrained estimation, one can use an alternative parametrization based on the transformed mixture weights a = (a_{−K}, ..., a_K)′,
\[
  a_j(\boldsymbol{w}) = \log\Big(\frac{w_j}{w_0}\Big), \qquad j = -K, \dots, K. \tag{6.13}
\]
Inversely, the original weights w are computed from the transformed weights a by
\[
  w_j(\boldsymbol{a}) = \frac{\exp(a_j)}{\sum_{k=-K}^{K} \exp(a_k)}, \qquad j = -K, \dots, K. \tag{6.14}
\]
Instead of estimating the constrained weights w, the vector a_{−0} of unconstrained transformed weights, i.e. all transformed weights except a_0, which is fixed to zero, is estimated.
Note that the weights w(a) expressed by (6.14) automatically satisfy both
(6.11) and (6.12). Further, an arbitrary mixture component can be chosen to
be the reference one having a corresponding a coefficient fixed to zero without
any impact on the results. However, for notational convenience, without loss
of generality, we will assume that a0 = 0.
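
A minimal R sketch of the transformations (6.13) and (6.14) — an illustration written for this text, assuming the middle component plays the role of the reference with a_0 = 0 — is the following.

a_from_w <- function(w) log(w / w[(length(w) + 1) / 2])   # (6.13), middle element as baseline
w_from_a <- function(a) exp(a) / sum(exp(a))              # (6.14)

a <- rnorm(41, sd = 0.5); a[21] <- 0          # 2K + 1 = 41 transformed weights, a_0 = 0
w <- w_from_a(a)
all(w > 0); sum(w)                            # constraints (6.11) and (6.12) hold
max(abs(a_from_w(w) - a))                     # the transformation is recovered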
6.3.3 Penalized normal mixture for distributions with an arbitrary location and scale
Let Y be a random variable with a density g(y) with E(Y) = α and var(Y) = τ².
To be able to use the same grid of knots – means µ−K , . . . , µK for distributions
with an arbitrary location α and scale τ we incorporate these two parameters
in the expression (6.10) for the unknown density g(y), i.e., the density g(y)
will be approximated by
g(y) = g(y | θ) = τ^{−1} ∑_{j=−K}^{K} w_j(a) ϕ((y − α)/τ | µ_j, σ_0²),     (6.15)
where θ = (a_{−K}, . . . , a_K, α, τ)′. In other words, the density of the standardized
random variable Y* = τ^{−1}(Y − α) is approximated by
g*(y* | θ*) = ∑_{j=−K}^{K} w_j(a) ϕ(y* | µ_j, σ_0²),
where θ ∗ = (a−K , . . . , aK )′ . The intercept α and the scale τ will be estimated
simultaneously with the transformed mixture weights a.
With expression (6.15), the knots µ−K , . . . , µK have to cover a high probability region of the zero-mean, unit-variance distribution. In most practical
situations, the choice with µ−K equal to a value between −6 and −4.5 and
µK equal to a value between 4.5 and 6 provides the range of the knots broad
enough. Furthermore, a distance δ of 0.3 between two consecutive knots is
small enough to approximate most smooth densities. As an illustration, we
computed the L2 -distance between the standard normal density and its best
approximation using a penalized mixture (6.15) with µ−K = −6, µK = 6,
different choices of δ = µj+1 − µj , and σ0 = 2δ/3. This distance is equal to
0.00570 for δ = 1 (K = 6), and drops to 0.00104 for δ = 0.75 (K = 8). When
plotted, the penalized mixture (6.15) is indistinguishable from the normal
density at δ = 0.75. Further, for δ equal to 0.5 (K = 12), 0.4 (K = 15), 0.3
(K = 20), 0.2 (K = 30), and 0.1 (K = 60) we obtain distances of 0.00031,
0.00022, 0.00017, 0.00014, and 0.00012, respectively. Clearly, for δ = 0.3 the
penalized mixture and the normal density are quite close.
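
The reported distances can be checked numerically along the following lines. The R sketch below is only an illustration and assumes α = 0, τ = 1, and that the best approximation is obtained by an unconstrained search over the transformed weights; it uses δ = 1 (K = 6).

delta  <- 1
mu     <- seq(-6, 6, by = delta)
sigma0 <- 2 * delta / 3

mixd <- function(y, w) sapply(y, function(x) sum(w * dnorm(x, mu, sigma0)))

l2sq <- function(a) {                          # squared L2-distance to the standard normal density
  w <- exp(a) / sum(exp(a))
  integrate(function(y) (dnorm(y) - mixd(y, w))^2, -8, 8)$value
}

fit <- optim(rep(0, length(mu)), l2sq, method = "BFGS", control = list(maxit = 500))
sqrt(fit$value)     # should be of the same order as the value reported above for delta = 1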
6.3.4 Multivariate smoothing
In Section 6.2.5 we discussed how the Kronecker product of univariate B-splines can be used to model unknown multivariate functions. The same idea
can be used also with the penalized normal mixture. In this thesis, we use the
multivariate penalized normal mixture only in the bivariate setting which will
be discussed now. Extensions to higher dimensions are obvious, only with
more complex notation.
Firstly, we note that the bivariate basis formed of the Kronecker product of
univariate normal densities is actually the basis formed of bivariate normal
densities with diagonal covariance matrices. Indeed, for arbitrary y1 ∈ R and
y2 ∈ R
ϕ(y_1 | µ_1, σ_1²) ϕ(y_2 | µ_2, σ_2²) = ϕ_2(y_1, y_2 | µ, Σ),
where ϕ_2(· | µ, Σ) is a density of N_2(µ, Σ) with µ = (µ_1, µ_2)′ and Σ = diag(σ_1², σ_2²).
Analogously to the univariate formula (6.10), the unknown bivariate density
g(y1 , y2 ) = g(y) is expressed by
g(y) = g(y | θ) = ∑_{j1=−K1}^{K1} ∑_{j2=−K2}^{K2} w_{j1,j2} ϕ(y | µ_{(j1,j2)}, Σ),     (6.16)
where µ_{(j1,j2)} = (µ_{1,j1}, µ_{2,j2})′, j1 = −K1, . . . , K1, j2 = −K2, . . . , K2, is a fixed
fine grid of knots, Σ = diag(σ_1², σ_2²) is a fixed basis covariance matrix (the
same for all mixture components) and W = (w_{j1,j2}), j1 = −K1, . . . , K1,
j2 = −K2, . . . , K2, a matrix of unknown mixture weights satisfying
w_{j1,j2} > 0,   j1 = −K1, . . . , K1, j2 = −K2, . . . , K2,     (6.17)
∑_{j1=−K1}^{K1} ∑_{j2=−K2}^{K2} w_{j1,j2} = 1.     (6.18)
The vector θ of unknown parameters contains the elements of the matrix W.
Similarly to Section 6.3.2, the constraints (6.17) and (6.18) are avoided by
the reparametrization of the weight matrix W into the matrix A = (aj1 ,j2 ),
j1 = −K1, . . . , K1, j2 = −K2, . . . , K2, of transformed weights by
a_{j1,j2}(W) = log(w_{j1,j2} / w_{0,0}),
w_{j1,j2}(A) = exp(a_{j1,j2}) / ∑_{k1=−K1}^{K1} ∑_{k2=−K2}^{K2} exp(a_{k1,k2}),     (6.19)
j1 = −K1, . . . , K1, j2 = −K2, . . . , K2.
For notational convenience and without loss of generality, the mixture component (0, 0) is chosen to be the baseline with a0,0 = 0.
Moments of the bivariate penalized normal mixture
It is useful to stress that although all bivariate normal components in (6.16)
are uncorrelated the covariance matrix of the random vector (Y1 , Y2 )′ with the
density g(y) = g(y | θ) defined by (6.16) is, except for a special combination
of mixture weights, not diagonal. Namely,
E(Y_1) = ∑_{j1=−K1}^{K1} w_{j1+} µ_{1,j1},        E(Y_2) = ∑_{j2=−K2}^{K2} w_{+j2} µ_{2,j2},
var(Y_1) = σ_1² + ∑_{j1=−K1}^{K1} w_{j1+} {µ_{1,j1} − E(Y_1)}²,
var(Y_2) = σ_2² + ∑_{j2=−K2}^{K2} w_{+j2} {µ_{2,j2} − E(Y_2)}²,
cov(Y_1, Y_2) = ∑_{j1=−K1}^{K1} ∑_{j2=−K2}^{K2} w_{j1,j2} {µ_{1,j1} − E(Y_1)} {µ_{2,j2} − E(Y_2)},
where subscript + means summation over the range of the corresponding
index.
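
In R, these moments follow directly from the weight matrix W; the sketch below uses arbitrary illustrative inputs (a weight matrix mimicking a correlated bivariate normal shape), not quantities from any analysis in this thesis.

mu1 <- seq(-4.5, 4.5, by = 0.3);  mu2 <- seq(-4.5, 4.5, by = 0.3)
sigma1 <- sigma2 <- 2 * 0.3 / 3                 # basis standard deviations

## weight matrix W (rows ~ j1, columns ~ j2) giving a correlated mixture
W <- outer(mu1, mu2, function(m1, m2)
           exp(-0.5 * (m1^2 - 1.6 * m1 * m2 + m2^2) / (1 - 0.8^2)))
W <- W / sum(W)

w1 <- rowSums(W); w2 <- colSums(W)              # marginal weights w_{j1+} and w_{+j2}
EY1 <- sum(w1 * mu1);  EY2 <- sum(w2 * mu2)
varY1 <- sigma1^2 + sum(w1 * (mu1 - EY1)^2)
varY2 <- sigma2^2 + sum(w2 * (mu2 - EY2)^2)
covY12 <- sum(W * outer(mu1 - EY1, mu2 - EY2))
covY12 / sqrt(varY1 * varY2)                    # induced correlation, nonzero in general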
Bivariate penalized normal mixture for distributions with an arbitrary
location and scale
Analogously to Section 6.3.3 we introduce here an extra intercept parameter
vector α = (α1 , α2 )′ and an extra scale parameter vector τ = (τ1 , τ2 )′ to
allow for modelling the bivariate densities of a random vector Y = (Y1 , Y2 )′
with a general location and scales, i.e. with
E(Y_1) = α_1,   var(Y_1) = τ_1²,   E(Y_2) = α_2,   var(Y_2) = τ_2².
As before, the same values of the extreme knots µ_{1,−K1}, µ_{1,K1}, µ_{2,−K2}, µ_{2,K2}
and the basis standard deviations σ_1, σ_2 can be used for distributions with
different location and scale.
Namely the bivariate density g(y) of a general distribution will be approximated by
g(y) = g(y | θ) = (τ_1 τ_2)^{−1} ∑_{j1=−K1}^{K1} ∑_{j2=−K2}^{K2} w_{j1,j2}(A) ϕ_2((y_1 − α_1)/τ_1, (y_2 − α_2)/τ_2 | µ_{(j1,j2)}, Σ),     (6.20)
where θ = (a−K1 ,−K2 , . . . , aK1 ,K2 , α1 , α2 , τ1 , τ2 )′ . In other words, the density of the standardized random vector
Y* = (Y_1*, Y_2*)′ = diag(τ_1^{−1}, τ_2^{−1}) (Y_1 − α_1, Y_2 − α_2)′
is approximated by
g*(y* | θ*) = ∑_{j1=−K1}^{K1} ∑_{j2=−K2}^{K2} w_{j1,j2}(A) ϕ(y* | µ_{(j1,j2)}, Σ),     (6.21)
where the vector θ ∗ contains only the elements of the matrix A of transformed
weights. The same guidelines as in the univariate case (Section 6.3.3) will be
applied for the choice of the grid points and the basis standard deviations,
i.e. both µ1,−K1 , . . . , µ1,K1 and µ2,−K2 , . . . , µ2,K2 being the univariate grids of
equidistant knots with the distance between the two knots equal to δ ≈ 0.3,
with the minimal knot lying between −6 and −4.5, the maximal knot lying
between 4.5 and 6, and basis standard deviations equal to σ_1 = σ_2 = 2δ/3.
6.4 Classical versus penalized normal mixture
We finalize this chapter by an explicit comparison of the classical normal
mixture and penalized normal mixture.
• With the penalized normal mixture, invariably a relatively large but
fixed number of mixture components is needed and the smoothness of
the resulting smoothed distribution is optimized via a penalty term.
On the other hand, with the classical mixture, often a small number
of mixture components is sufficient, but the number of components
has to be estimated, which might cause some difficulties as outlined in
Section 6.1.2;
• The fine grid of fixed knots in the penalized mixture approach prevents
inaccuracy in the estimate of the unknown density, while the penalization inhibits overfitting. In contrast, in the case of a classical mixture,
the means and the standard deviations of the mixture components must
be estimated;
• In order to use a standard grid of knots we have included explicitly
the intercept and scale parameters in the model specification when using the penalized approach. This is not desirable with the classical
mixture approach as both the overall intercept and the overall scale
are implicitly defined by the means and standard deviations of the mixture
components;
• Extension of the univariate smoothing into the multivariate smoothing
is conceptually simple with the penalized approach as was shown in
Section 6.3.4 using the Kronecker product of the basis functions. In
higher dimensions, there are only some computational difficulties arising from the fact that the number of unknown parameters increases
exponentially with the dimension.
Extension of the classical mixture approach into higher dimensions is
relatively easy with a fixed number of mixture components; however, it
is not straightforward when the number of mixture components has
to be estimated simultaneously with the remaining parameters. Even
with the Bayesian approach and the reversible jump MCMC algorithm
mentioned in Section 6.1.2 the multivariate extensions are still an area
of active research, see Dellaportas and Papageorgiou (2006) for recent
developments.
Chapter 7
Maximum Likelihood Penalized AFT Model
In this chapter, we present the AFT model for the case of independent observations. The error distribution of the model will be based on the penalized
normal mixture (Section 6.3), and the model parameters will be estimated by penalized maximum likelihood.
The basic version of this approach is given by Komárek, Lesaffre, and Hilton
(2005) and an extension allowing also modelling the dependence of the scale
parameter on the covariates can be found in Lesaffre, Komárek, and Declerck
(2005).
In Section 7.1, we describe the model in detail. In Section 7.2, we show how
the model parameters are estimated using the penalized maximum-likelihood
method. Section 7.3 describes the inferential procedures. In Section 7.4, computation of predictive survival or hazard functions and predictive densities
is discussed. Section 7.5 gives the results of a simulation study that evaluates the performance of the method. The proposed method is applied to the
analysis of the WIHS data in Section 7.6 and to the analysis of the Signal
Tandmobiel data in Section 7.7. We finalize the chapter by a discussion in
Section 7.8.
7.1 Model

Let T_i, i = 1, . . . , N, be independent event times observed as intervals ⌊t_i^L, t_i^U⌋
and δ_i be the corresponding censoring indicator with the same convention as
in Section 2.1. Let y_i^L = log(t_i^L) and y_i^U = log(t_i^U). Further, let x_i =
(x_{i,1}, . . . , x_{i,m})′ be a vector of covariates associated with the ith subject. The
effect of covariates on the event time Ti will be specified using the basic AFT
model introduced in Section 3.2, i.e.
log(T_i) = β′x_i + ε_i,   i = 1, . . . , N,     (7.1)
where β = (β1 , . . . , βm )′ is a vector of unknown regression parameters and
ε1 , . . . , εN are i.i.d. error random variables with the density gε (ε).
7.1.1 Model for the error density
The density gε (ε) of the error term will be expressed using the penalized
normal mixture (6.15), i.e.
g_ε(ε) = τ^{−1} ∑_{j=−K}^{K} w_j(a) ϕ((ε − α)/τ | µ_j, σ_0²),     (7.2)
where µ_{−K}, . . . , µ_K is a set of fixed equidistant knots, σ_0 a fixed basis standard
deviation, α an unknown intercept and τ an unknown scale parameter. Finally,
w = (w−K , . . . , wK )′ are unknown mixture weights and a = (a−K , . . . , aK )′
their transformations obtained using the relationship (6.13).
Let ε∗1 , . . . , ε∗N be standardized error terms, i.e. having the density
g_ε*(ε*) = ∑_{j=−K}^{K} w_j(a) ϕ(ε* | µ_j, σ_0²).     (7.3)
Keeping the intercept α and the scale τ identifiable requires that the first
two moments of the density (7.3) be fixed, i.e.,
E(ε_i*) = ∑_{j=−K}^{K} w_j(a) µ_j = 0,     var(ε_i*) = ∑_{j=−K}^{K} w_j(a) (µ_j² + σ_0²) = 1.     (7.4)
Due to the fact that ∑_{j=−K}^{K} w_j(a) σ_0² = σ_0², the variance constraint can be
rewritten into the form ∑_{j=−K}^{K} w_j(a) µ_j² = 1 − σ_0². It is then easily seen
that the basis standard deviation σ_0 must be smaller than 1 to be able to
satisfy this constraint. Finally, the two equality constraints (7.4) can be
avoided if two coefficients, say a_{−1} and a_1, are expressed as functions of
the remaining non-baseline coefficients, denoted together as a vector d =
(a_{−K}, . . . , a_{−2}, a_2, . . . , a_K)′:
a_k(d) = log{ω_{0,k} + ∑_{j∉{−1,0,1}} ω_{j,k} exp(a_j)},   k = −1, 1,     (7.5)
with
ω_{j,−1} = − (µ_j − µ_1)/(µ_{−1} − µ_1) · (1 − σ_0² + µ_1 µ_j)/(1 − σ_0² + µ_1 µ_{−1}),
ω_{j,1} = −ω_{j,−1} · µ_{−1}/µ_1 − µ_j/µ_1,          j = −K, . . . , −2, 0, 2, . . . , K.
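
The following R sketch (an illustration written for this text, not the thesis implementation) computes a_{−1}(d) and a_1(d) from the remaining coefficients and then verifies that the resulting weights satisfy the moment constraints (7.4).

delta <- 0.3; K <- 20
jj    <- -K:K
mu    <- jj * delta                      # knots mu_{-K}, ..., mu_K
sigma0 <- 2 * delta / 3
c0    <- 1 - sigma0^2

a <- dnorm(mu, log = TRUE) - dnorm(0, log = TRUE)   # some starting coefficients, a_0 = 0
rest <- !(jj %in% c(-1, 1))              # indices j = -K,...,-2, 0, 2,...,K
m1 <- mu[jj == 1]; mm1 <- mu[jj == -1]; mur <- mu[rest]

om_m1 <- -(mur - m1) / (mm1 - m1) * (c0 + m1 * mur) / (c0 + m1 * mm1)
om_p1 <- -om_m1 * mm1 / m1 - mur / m1

a[jj == -1] <- log(sum(om_m1 * exp(a[rest])))        # a_{-1}(d), cf. (7.5)
a[jj ==  1] <- log(sum(om_p1 * exp(a[rest])))        # a_1(d)

w <- exp(a) / sum(exp(a))
c(mean = sum(w * mu), var = sum(w * (mu^2 + sigma0^2)))   # should equal (0, 1)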
7.1.2 Scale regression
In most regression models, it is conventionally assumed that the covariates
influence the mean but not the scale parameter. This is, however, simply one
model choice and in many cases it may be untenable. Recently, there has been
interest in joint mean-covariance
models in the context of longitudinal studies (Pourahmadi, 1999; Pan and
MacKenzie, 2003). Our AFT model (7.1) with the error density (7.2) can
be generalized in the same direction yielding the mean-scale penalized AFT
model. With this generalization, we allow the scale parameter τ to vary
across individuals. Moreover, for the ith individual, the scale parameter τi
will depend on a vector of covariates, say z i = (zi,1 , . . . , zi,ms )′ , as
τ_i ≡ τ(z_i) = exp(γ′z_i),     (7.6)
where γ = (γ1 , . . . , γms )′ is a vector of unknown parameters. Note, that the
covariate vector z i usually contains the intercept term, i.e. zi,1 = 1 for all i.
In that case, the original AFT model (7.1) with the error density (7.2) and
the common scale parameter τ can be written as the mean-scale AFT model
with z i = 1 for all i and τ = exp(γ1 ).
All parameters in the model (transformed mixture coefficients d; regression
parameter vector β; intercept α; and log-scale log(τ) or scale-regression
parameter vector γ) are estimated by means of a penalized maximum-likelihood method. In the next section, we construct the penalized log-likelihood function, which consists of an ordinary log-likelihood and a difference penalty for the transformed mixture coefficients. The penalized log-likelihood is subsequently maximized to obtain the estimates, see Appendix A
for practical aspects of the optimization of the penalized log-likelihood.
7.2 Penalized maximum-likelihood

7.2.1 Penalized log-likelihood
Let θ be the vector of all unknown parameters to be estimated, i.e., θ =
(α, β′, γ′, a_{−K}, . . . , a_{−2}, a_2, . . . , a_K)′. Let ℓ_i(θ) = log{L_i(θ)}, i = 1, . . . , N,
denote the ordinary log-likelihood contribution of the ith observation based
on model (7.1) with error density (7.2), i.e., using the results of Section 4.1.1
and the convention (4.3),
L_i(θ) = ∫_{t_i^L}^{t_i^U} τ_i^{−1} t^{−1} g_ε*((log(t) − α − β′x_i)/τ_i) dt
       ∝ ∫_{y_i^L}^{y_i^U} τ_i^{−1} g_ε*((y − α − β′x_i)/τ_i) dy
       = ∫_{y_i^L}^{y_i^U} τ_i^{−1} ∑_{j=−K}^{K} w_j(a) ϕ((y − α − β′x_i)/τ_i | µ_j, σ_0²) dy.
The proportionality constant is equal to t_i^L = t_i^U for exactly observed event
times (δ_i = 1) and equal to 1 for all remaining observations (δ_i = 0, 2, 3).
For the purpose of maximum-likelihood based estimation, this constant can
be ignored, so for notational convenience we will assume that it equals one.
Finally, let ℓ(θ) = ∑_{i=1}^{N} ℓ_i(θ) be the ordinary log-likelihood of the whole data set.
As usual with censored data, the likelihood evaluation involves integration.
With our model, however, this does not cause any considerable difficulties
irrespective of the type of censoring (left-, right-, interval-). Indeed, all integrals involved in the computation of the likelihood are normal cumulative
distribution functions, which can be easily and efficiently evaluated.
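
For instance, the log-likelihood contribution of a single interval-censored observation can be coded as below (an illustrative R sketch with hypothetical input values; setting yU = Inf covers right censoring).

loglik_interval <- function(yL, yU, x, alpha, beta, tau, w, mu, sigma0) {
  zL <- (yL - alpha - sum(beta * x)) / tau
  zU <- (yU - alpha - sum(beta * x)) / tau
  ## mixture of differences of normal distribution functions
  log(sum(w * (pnorm(zU, mean = mu, sd = sigma0) -
               pnorm(zL, mean = mu, sd = sigma0))))
}

mu <- seq(-4.5, 4.5, by = 0.3); sigma0 <- 0.2
w  <- dnorm(mu); w <- w / sum(w)
loglik_interval(yL = log(18), yU = log(25), x = c(1, 8.7),
                alpha = 1.6, beta = c(-0.8, 0.4), tau = 1.4,
                w = w, mu = mu, sigma0 = sigma0)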
To construct the penalized log-likelihood function ℓP (θ; λ), we subtract a penalty term q(a; λ) based on the transformed mixture coefficients a from ℓ(θ),
i.e.,
ℓ_P(θ; λ) = ℓ(θ) − q(a; λ),     (7.7)
where λ is a fixed tuning parameter that controls the smoothness of the
fitted error distribution and inhibits identifiability problems due to overparametrization. For a given (reasonable) λ, Eilers and Marx (1996) proposed to base the penalty on squared higher-order finite differences of the
coefficients of adjacent B-splines, and they used second-order difference in
their examples. We base our penalty on squared finite differences of order s
of the transformed coefficients of adjacent mixture components:
q(a; λ) = (λ/2) ∑_{j=−K+s}^{K} (∆^s a_j)² = (λ/2) a′ P_s′ P_s a,     (7.8)
where ∆¹a_j = a_j − a_{j−1}, ∆^s a_j = ∆^{s−1} a_j − ∆^{s−1} a_{j−1} for s ≥ 2, and P_s
is a (2K + 1 − s) × (2K + 1) difference operator matrix. According to our
experience, s = 2 or s = 3 is sufficient to obtain a smooth estimate of the
density. However, in our context the choice s = 3 has another interesting
justification, as explained in Section 7.2.2, and it will be used in all applications
presented in this thesis.
7.2.2 Remarks on the penalty function
There are two reasons why we penalize the transformed mixture coefficients
a instead of the original coefficients w and why we prefer the penalty of order
s = 3.
First, the penalty based on a distinguishes between areas of the density where
there are few datapoints (i.e., where the coefficients w are close to zero) and
areas where there are many datapoints (i.e., where the coefficients w are well
above zero); the penalty based on w cannot distinguish between these areas.
For example, for
w̆ = (0.001, 0.002, 0.001, 0.996)′   and   w̃ = (0.201, 0.202, 0.201, 0.396)′
we have
ă = (−6.904, −6.211, −6.904, 0)′   and   ã = (−0.678, −0.673, −0.678, 0)′,
while
(∆²w̆_3)² = 0.000004 = (∆²w̃_3)²,   whereas   (∆²ă_3)² = 1.92 ≫ 0.000099 = (∆²ã_3)².
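
The numbers above are easily verified, for instance with the following small R check (an illustration; the last component is taken as the baseline with transformed coefficient zero).

w_breve <- c(0.001, 0.002, 0.001, 0.996)
w_tilde <- c(0.201, 0.202, 0.201, 0.396)
a_breve <- log(w_breve / w_breve[4])      # transformed weights, last component as baseline
a_tilde <- log(w_tilde / w_tilde[4])

diff(w_breve, differences = 2)[1]^2       # 4e-06, the same as for w_tilde
diff(a_breve, differences = 2)[1]^2       # about 1.92
diff(a_tilde, differences = 2)[1]^2       # about 1e-04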
Indeed, in the areas with a sufficient amount of data, the estimated shape
of the error distribution is mostly driven by the data themselves, whereas
in the data-poor areas the shape of the fitted error distribution is inter- or
extrapolated from the data-rich areas according to the flexibility allowed by
the penalty term.
Second, the penalty of the third order (s = 3) based on transformed mixture coefficients a has an interesting property which can serve as a basis for
an empirical test of normality (see Section 7.2.3). A basis for this property
is given by the following proposition which is proved in Appendix A.
Proposition 7.1. Let for K ∈ N
µ^K = {µ_j^K = j/K, j = −K², . . . , K²} = {−K, −K + 1/K, . . . , −1/K, 0, 1/K, . . . , K − 1/K, K}
be a sequence of knots. Let for a ∈ R^{2K²+1} a discrete distribution on µ^K be given by
Pr(µ^K = µ_j^K | a) = exp(a_j).
Let a^K minimize ∑_{j=−K²+3}^{K²} (∆³a_j)² under the constraints
∑_{j=−K²}^{K²} Pr(µ^K = µ_j^K | a) = 1,   E(µ^K | a) = 0,   var(µ^K | a) = 1 − σ_0²     (7.9)
for σ_0 ∈ (0, 1) fixed. Let
g_K(y) = ∑_{j=−K²}^{K²} Pr(µ^K = µ_j^K | a^K) ϕ(y | µ_j^K, σ_0²),   y ∈ R.
Then for all y ∈ R
lim_{K→∞} g_K(y) = ϕ(y).
The empirical normality test is obtained using the following consideration.
Suppose that for fixed K we have 2K² + 1 knots −K, −K + 1/K, . . . , −1/K, 0, 1/K,
. . . , K − 1/K, K. Suppose further that we maximize the penalized log-likelihood (7.7) for λ → ∞. This is equivalent (in the limit) to minimizing
the penalty term (7.8) under the constraints (7.4). For fixed K, let g*_{ε,K} be
the fitted standardized error density arising from the above-mentioned optimization problem. Using Proposition 7.1 with w_j(a) = Pr(µ^K = µ_j^K | a),
j = −K², . . . , K², we get that lim_{K→∞} g*_{ε,K}(ε*) = ϕ(ε*) for all ε* ∈ R. In
practice, the set of knots and the basis standard deviation recommended
in Sections 6.3.1 and 6.3.3 (e.g., knots from −6 to 6 by 0.3 and σ_0 = 0.2)
already give rise to a fitted standardized error density g*_{ε,K} that is practically
indistinguishable from the normal density ϕ when only the penalty term is
minimized. This property does not hold for a penalty of order s ≠ 3 or
when the penalty is based on the original mixture coefficients w.
7.2.3 Selecting the smoothing parameter
In the area of density estimation, methods for selecting the smoothing parameter λ that rely on cross-validation are often used. The standard modified
maximum-likelihood cross-validation score that we are attempting to minimize is
CV(λ) = − ∑_{i=1}^{N} ℓ_i(θ̂^{(−i)}),
where θ̂ is the penalized maximum likelihood estimate (MLE) of θ and θ̂^{(−i)}
the penalized MLE based on the sample excluding the ith observation. However, computation and optimization of the cross-validation score is extremely
computationally intensive in our case. In a similar context, O'Sullivan (1988)
suggested a one-step Newton-Raphson approximation combined with a first-order Taylor series approximation. Applying his method in our setting results
in an approximate cross-validation score given by
CV(λ) = − {∑_{i=1}^{N} ℓ_i(θ̂) − trace(Ĥ^{−1} Î)},     (7.10)
where
Ĥ = −∂²ℓ_P(θ̂)/∂θ∂θ^T,     Î = −∂²ℓ(θ̂)/∂θ∂θ^T.
We denote trace(Ĥ^{−1}Î) by df(λ) and call it the effective degrees of freedom or
the effective dimension of the model since it essentially plays the same role
as the effective dimension of a linear smoother (Hastie and Tibshirani, 1990).
Depending on a chosen order s of the differences in the penalty, the degrees
of freedom decreases in λ from dim(β) + 2 + (2K + 1 − 3) for λ = 0 (i.e., the
ordinary log-likelihood) to dim(β) + 2 + (s − 3) for λ → ∞ and s ≥ 3 (i.e.,
the penalized log-likelihood). For example, when K = 20, µj+1 − µj = 0.3,
σ0 = 0.2 and s = 3, penalized likelihood estimation as λ → ∞ depends
effectively on 2K + 1 − s = 38 fewer parameters than does ordinary likelihood
estimation.
Further, minimizing the expression (7.10) is essentially the same as maximizing Akaike’s information criterion AIC(λ) = ℓ(θ̂)−df(λ) (Akaike, 1974). This
can be a valuable way to compare different models and assess the importance
of covariate contributions (see an example in Section 7.6).
In the accompanying R programs (see Appendix C), a grid search using user-defined values λ*_1, . . . , λ*_L (in our applications we used values λ*_1 = e², λ*_2 =
e¹, . . . , λ*_L = e^{−9}) is used to find the optimal AIC. Since the log-likelihood is
of λ/2 allows one to use approximately the same grid for datasets of different
sizes while also maintaining the proportional importance of the penalty term
in the penalized log-likelihood at the same level.
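
Schematically, the grid search looks as follows; fit_penalized_aft() is a hypothetical placeholder for one penalized maximum-likelihood fit at a fixed λ, and data stands for the data set at hand.

log.lambda.grid <- 2:(-9)                         # grid of log(lambda/N) values, as above
fits <- lapply(log.lambda.grid, function(ll)
               fit_penalized_aft(lambda = nrow(data) * exp(ll), data = data))
aic  <- sapply(fits, function(f) f$loglik - f$df)  # AIC(lambda) = l(theta.hat) - df(lambda)
best <- fits[[which.max(aic)]]                     # model at the optimal lambda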
The result immediately following Proposition 7.1 further implies that with
a sufficiently dense set of knots, we can check the normality of the error term.
When the optimal value of the tuning parameter λ becomes large the error
density of the model can be considered to be normal.
Linear mixed model interpretation
Recently, Wand (2003) and Kauermann (2005a) pointed out the strong link
between penalized maximum-likelihood estimation and linear mixed models, which can be used for the selection of the smoothing parameter. The idea,
which underlies also the pseudo-variance estimate in Section 7.3.1 and the
full Bayesian developments in Chapters 9 and 10, is the following. The coefficient vector a is considered to be a vector of random effects having the
normal distribution
a ∼ N(0, λ^{−1}(P_s′P_s)^−),
where (P_s′P_s)^− is the generalized inverse of P_s′P_s. The smoothing parameter λ
then determines (together with the fixed matrix P_s) the variability of the
“random effects” a. The penalized likelihood (7.7) can then be interpreted as
the likelihood of the mixed effects model with normal random effects a. The
optimal λ value is obtained as the maximum-likelihood or, more frequently,
as the restricted maximum-likelihood estimate of the inverse variance component in the mixed effects model constructed in this way. See, e.g., Cai and Betensky
(2003) or Kauermann (2005b) for practical applications of this approach.
7.3 Inference based on the maximum likelihood penalized AFT model
With the standard maximum-likelihood method, the score vector (the first derivative of the log-likelihood) has a zero mean when its expectation is computed
under the true parameter vector. Under mild regularity conditions, it is
then possible to prove that the MLE is an asymptotically unbiased estimate. However, introduction of the penalty term with λ > 0 leads to the penalized score vector
(the first derivative of the penalized log-likelihood) having a mean different
from zero when its expectation is computed under the true parameter vector.
Consequently, the penalized MLE θ̂ is a biased estimator and its standard
errors may not be very informative when that bias is high. However, there
are two possibilities for drawing accurate inferences based on the penalized MLE.
7.3.1 Pseudo-variance
Wahba (1983) described a pseudo-Bayesian technique for generating confidence bands around the cross-validated smoothing spline. O’Sullivan (1988)
used this technique in the penalized ML framework and his approach can
be adopted also here. Basically, the penalized log-likelihood ℓ_P is viewed
as a “posterior” log-density for the parameter θ and the penalty term as
a “prior” negative log-density of that parameter. Then, the second-order
Taylor series expansion of the “posterior” log-density around its mode θ̂ leads to
ℓ_P(θ) ≈ ℓ_P(θ̂) − (1/2)(θ − θ̂)^T Ĥ (θ − θ̂).
Finally, the Gaussian approximation gives a “posterior” normal distribution for
θ with covariance matrix
var̂_P(θ̂) = Ĥ^{−1}.     (7.11)
We call this estimate of the variance of the penalized MLE θ̂ the “pseudo-variance estimate.”
7.3.2 Asymptotic variance
More formal inference is possible under the following assumptions. Firstly, we
assume independent noninformative censoring. Secondly, as the sample size
N increases, the knots (both number and positions) and the basis standard
deviation remain fixed. Let θ T be the true parameter value of θ, assuming it
exists. To get asymptotically unbiased estimates we have to either keep the
value of the smoothing parameter λ constant as N → ∞ or let it increase
at a rate lower than N (i.e., λ = λN and limN →∞ λN /N = 0). Under
these conditions, the penalty part of the penalized log-likelihood reduces its
importance relative to the log-likelihood part as N → ∞ (i.e., as the sample
size N increases, the smoothness of the fitted error distribution is determined
to a greater extent by the data and to a lesser extent by the penalty). Then,
in combination with standard maximum likelihood arguments, for arbitrary
ξ > 0 the penalized MLE θ̂ satisfies Pr_{θ_T}(|θ̂ − θ_T| < ξ) → 1. Using the
same arguments as in Gray (1992), one can further show that √N(θ̂ − θ_T)
is asymptotically normal with mean 0 and covariance matrix lim_{N→∞}(N W),
where the matrix W can be consistently estimated by
var̂_A(θ̂) = Ĥ^{−1} Î Ĥ^{−1},     (7.12)
which we call the “asymptotic variance estimate.” As pointed out by Gray
(1992), the asymptotic distribution of θ̂ remains the same if the smoothing
parameters λ_N are replaced by estimates satisfying λ̂_N/λ_N → 1 in probability.
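
Both variance estimates are simple functions of the two Hessians; the R sketch below is an illustration in which loglik(), pen_loglik() and theta_hat are hypothetical placeholders for the ordinary log-likelihood, the penalized log-likelihood and the penalized MLE.

library(numDeriv)

H    <- -hessian(pen_loglik, theta_hat)     # H-hat: minus Hessian of the penalized log-likelihood
Imat <- -hessian(loglik,     theta_hat)     # I-hat: minus Hessian of the ordinary log-likelihood

var_pseudo     <- solve(H)                            # pseudo-variance estimate (7.11)
var_asymptotic <- solve(H) %*% Imat %*% solve(H)      # sandwich estimate (7.12)
df_lambda      <- sum(diag(solve(H, Imat)))           # effective degrees of freedom trace(H^-1 I)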
7.3.3 The pseudo-variance versus the asymptotic variance
In various applications, the pseudo-variance estimate (7.11) has been shown
to be useful. When smoothing a spline curve g(t), Wahba (1983) showed
it yielded pointwise confidence intervals ĝ(t) ± z √(var̂_P{ĝ(t)}), where z is
a quantile of the normal distribution, that have good frequentist coverage
properties. Verweij and Van Houwelingen (1994) used it in the context of
penalized likelihood estimation in Cox regression; they called the square
roots of its diagonal elements “pseudo-standard errors.” Joly, Commenges,
and Letenneur (1998) exploited this technique to get confidence bands on the
hazard function smoothed using M-splines. In contrast, for the asymptotic
variance estimate (7.12) there is no guarantee that for finite samples its middle matrix Î is positive semidefinite. Based on our experience, this problem is
not rare. Finally, according to our simulations, the pseudo-variance estimate
(7.11) yields confidence intervals β̂ ± z √(var̂_P(β̂)) for regression parameters
with better coverage properties than the corresponding confidence intervals
based on the asymptotic estimate (7.12).
7.3.4 Remarks
We have assumed in this section that the true parameter vector θ T exists.
This does not have to be true. In particular, true a coefficients may fail to
exist when the true error distribution is not a mixture of the normal densities
determined by the choice of knots and the standard deviation σ0 . However, if
the distance between two consecutive knots is small enough, we argue that the
penalized mixture of the normal densities can approximate every continuous
distribution sufficiently well (see Dalal and Hall, 1983, or O'Hagan, 1994, Sec.
6.47), so that the assumption of the existence of the true parameter vector θ_T
is not restrictive at all. Loosely speaking, combining this with the asymptotic
arguments given in Section 7.3.2 implies that by increasing the sample size,
the estimated coefficients a will yield an estimated density which is close to
the true error density.
7.4 Predictive survival and hazard curves and predictive densities
The penalized AFT model has in fact a parametric nature once the weights
w_{−K}, . . . , w_K in (7.2) are known. This makes it easy to compute predictive
survival curves, hazards, or densities for a given combination of
covariates, say x_new and z_new. The predictive survival function is given by
S(t | x_new, z_new) = 1 − ∑_{j=−K}^{K} w_j(a) Φ((log(t) − α − β′x_new)/τ(z_new) | µ_j, σ_0²).     (7.13)
The predictive density is computed by
p(t | x_new, z_new) = {t τ(z_new)}^{−1} ∑_{j=−K}^{K} w_j(a) ϕ((log(t) − α − β′x_new)/τ(z_new) | µ_j, σ_0²),     (7.14)
and finally the predictive hazard is obtained from the above quantities as
ℏ(t | x_new, z_new) = p(t | x_new, z_new) / S(t | x_new, z_new).     (7.15)
In practice, all unknown parameters are replaced by their penalized maximum-likelihood estimates.
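
A small R sketch of (7.13)–(7.15) (illustrative only; all inputs below, including the covariate values, are hypothetical) reads as follows.

predictive <- function(t, xnew, znew, alpha, beta, gamma, w, mu, sigma0) {
  tau <- exp(sum(gamma * znew))                        # scale regression (7.6)
  z   <- (log(t) - alpha - sum(beta * xnew)) / tau
  S <- 1 - sum(w * pnorm(z, mean = mu, sd = sigma0))   # predictive survival (7.13)
  p <- sum(w * dnorm(z, mean = mu, sd = sigma0)) / (t * tau)   # predictive density (7.14)
  c(survival = S, density = p, hazard = p / S)         # predictive hazard (7.15)
}

mu <- seq(-4.5, 4.5, by = 0.3); sigma0 <- 0.2
w  <- dnorm(mu); w <- w / sum(w)
predictive(t = 24, xnew = c(1, 8.7), znew = 1,
           alpha = 1.6, beta = c(-0.8, 0.4), gamma = log(1.4),
           w = w, mu = mu, sigma0 = sigma0)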
7.5 Simulation study
To see how the proposed method performs, we carried out a simulation study.
‘True’ uncensored data were generated according to model (7.1) with error
density (7.2). Two covariates, i.e. xi = (xi,1 , xi,2 )′ were included in the
model and the values of the parameters were the following: α = 1.6, τ = 1.4
and β = (−0.8, 0.4)′ . The covariate xi,1 was binary taking a value of 1 with
probability 0.4 and covariate xi,2 was generated according to the extreme
value distribution of a minimum, with location 8.5 and scale 1. The model
attempts to mimic an AFT model used for the dataset presented in Section 7.6
with xi,1 playing the role of the covariate lesion and xi,2 being distributed
as log2 (1 + CD4 count). Time to the event T is expressed in months. The
standardized error term ε∗ was generated from a standard normal distribution
N (0, 1), from a standardized extreme value distribution, and from a mixture
of two normal distributions 0.4 N(−1.4, 0.8²) + 0.6 N(0.93, 0.8²). Samples of
sizes 50, 100, 300, and 600 were generated. Each simulation involved 100
replications.
For each uncensored dataset we created four censored datasets that were then
used to compute the estimates: a dataset with (1) approximately 20% right-censored and 80% uncensored observations (light RC); (2) approximately 20%
right and 80% interval-censored observations (light R+IC); (3) approximately
60% right and 40% uncensored observations (heavy RC); (4) approximately
60% right and 40% interval-censored observations (heavy R+IC). The censoring was created by simulating consecutive ‘visit times’ for each subject in the
dataset. Times of the first ‘visits’ were drawn from a N(7, 1) distribution. Further, times between consecutive ‘visits’ were simulated from a N(6, 0.5²) distribution.
This approach reflects the idea that subjects in our Oral Substudy were seen
for the first time about 7 months after the onset of the parent study and
then approximately every 6 months for several years. At each visit, subjects
were withdrawn (censored) according to a prespecified percentage (between
0.4% and 0.7% for light censoring and between 4.0% and 5.0% for heavy
censoring) creating right-censored observations provided that the uncensored
event time Ti was greater than the visit time at which the subject was withdrawn. To obtain interval-censored observations, we took the ‘visit’ interval
that contained the uncensored event time Ti .
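
For completeness, one replication of this data-generating mechanism might be sketched in R as follows (a simplified illustration: normal errors, a fixed number of visits, and no subject-level withdrawal percentages).

set.seed(1)
N <- 300
alpha <- 1.6; tau <- 1.4; beta <- c(-0.8, 0.4)

x1 <- rbinom(N, 1, 0.4)                         # binary covariate (role of lesion)
x2 <- 8.5 + log(rexp(N))                        # minimum extreme value, location 8.5, scale 1
eps <- rnorm(N)                                 # standardized error (normal scenario)
T.true <- exp(alpha + beta[1] * x1 + beta[2] * x2 + tau * eps)   # event times in months

## visit times: first visit ~ N(7, 1), then gaps ~ N(6, 0.5^2)
visits <- t(apply(cbind(rnorm(N, 7, 1),
                        matrix(rnorm(N * 19, 6, 0.5), ncol = 19)), 1, cumsum))

tL <- tU <- numeric(N)
for (i in 1:N) {
  k <- findInterval(T.true[i], visits[i, ])
  if (k == ncol(visits)) {                      # event after the last visit: right-censored
    tL[i] <- visits[i, k]; tU[i] <- Inf
  } else {                                      # event inside a visit interval
    tL[i] <- if (k == 0) 0 else visits[i, k]
    tU[i] <- visits[i, k + 1]
  }
}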
For comparison, estimates for each dataset were computed using our smoothed
procedure and using two parametric models: an AFT model on the log scale
with a correctly specified error distribution (normal, extreme value or mixture
of normals, respectively) and a log-normal AFT model. For the smoothing
procedure, the third order penalty, equidistant knots with a distance of 0.3
between consecutive knots, and the basis standard deviation of 0.2 were used.
Selected results of the simulation are given in Appendix B, Section B.1.
Namely, Tables B.1 – B.6 show the results for the regression parameters.
It is seen that, in most cases, our smoothed procedure performs better than
the incorrectly specified log-normal AFT model and often only slightly
worse than the correctly specified parametric AFT model. Additionally, when
our smoothing approach is used, the error distribution is reproduced rather
satisfactorily, as can be seen in Figures B.1 – B.3. This property is quite
important especially when the estimated model is to be used for prediction
purposes. Further, it is seen that even for small samples the performance of
our smoothing procedure is quite similar to the performance of a parametric
AFT model with a correctly specified error distribution.
7.6 Example: WIHS data – interval censoring
In Section 1.2, we introduced the study comprising the cohort of seropositive women and the cohort of seronegative women with an increased risk of
HIV infection. In this section, we concentrate on the data set collected in
the framework of the Oral Substudy involving 224 seropositive AIDS-free (at
baseline) women. We explore how the distribution of the time between the
baseline measurement and the onset of an AIDS-related illness can be explained
using classical predictors, which are the number of copies of the HIV RNA virus
and the count of CD4 T-cells per ml of blood. Additionally, we examine whether
presence of one of the three lesion markers, oral candidiasis, hairy leukoplakia
and angular cheilitis, is useful, possibly together with one or both laboratory
predictors, in describing the distribution of the residual time to onset of AIDS.

Table 7.1: WIHS Data. Akaike's information criterion, degrees of freedom, and
the optimal log(λ/N) for the fitted models.

  Model                            AIC      df   log(λ/N)
  (1) lesion                    −262.39    3.2       2
  (2) lvload                    −256.16    3.4       2
  (3) lcd4                      −256.94    3.4       2
  (4) lesion + lvload           −255.63    4.4       2
  (5) lesion + lcd4             −253.19    8.9      −7
  (6) lvload + lcd4             −253.45    8.4      −6
  (7) lesion + lvload + lcd4    −250.01   10.0      −7
For the purpose of modelling, the three lesion markers were summarized in
one binary covariate, lesion, equal to one if at least one of the above mentioned three lesion markers was present. Further, the laboratory predictors
entered the models in a transformed way, classically used in HIV research. Namely, the covariate lvload is equal to log10(1 + viral load) and the
covariate lcd4 equals log2 (1 + CD4 count). All three covariates are moderately to strongly associated with one another since, as AIDS progresses, viral
load increases, CD4 count falls, and oral lesions occur more frequently. In
our sample, for women with lesion = 0 and 1, respectively, the median lvload
was 3.60 and 4.23 (Mann-Whitney p-value 0.001). There was also a moderate negative correlation of −0.46 between lcd4 and lvload. These associations
have to be taken into account when interpreting the results.
As a response, we used the time in months between the baseline visit, defined
as the first visit at which the lesion markers were collected by dental professionals, and the onset of an AIDS-related illness. As mentioned in Section
1.3, the response time is right-censored for 158 women and interval-censored
for 66 women with the average length of the observed interval equal to 7
months.
7.6.1 Fitted models
To obtain the results shown below, we used a sequence of 41 equidistant
knots from −6 to 6 with a distance of 0.3 between each pair. The basis
standard deviation was 0.2 and the third order difference was used in the
penalty. Different models were compared using Akaike’s information criterion
and claims concerning the significance of the parameters were based on Wald’s
tests using the pseudo-variance estimate (7.11). Summary of the fitted models
is shown in Tables 7.1 and 7.2.
If used alone (model (1)), the effect of lesion on the time to onset of AIDS is
statistically significant (p = 0.018), and the estimated time for women with
lesion = 1 is exp(−0.87) ≈ 0.42 times that for women with lesion = 0.
According to the AIC values for models (2) and (3), the transformed CD4
count and viral load are equally good predictors of the time to onset of AIDS.
Addition of the lesion marker (models (4) and (5)) improves the model with
lcd4 considerably but improves the model with lvload only slightly. Finally,
some additional improvement is gained by considering the model with all
three predictors (model (7)).
7.6.2 Predictive survival and hazard curves, predictive densities
Figure 7.1 shows predictive survival and hazard curves and predictive densities for women with lesion = 0 and lesion = 1 based on the simplest model
lesion and on the most complex model considered lesion + lvload + lcd4. The
predictive survival curves based on the model lesion are further overlaid with
the nonparametric estimate of Turnbull (1976) in each group. The two estimates are quite close to each other, illustrating the semiparametric nature
of our approach. However, our procedure gives smooth estimates of the survival curves and moreover enables quantification of the difference in survival
between the two groups. Notice further that, because the hazard is obtained
as the ratio of the density and the survival function, and the latter decreases
only slowly from one, only a slight difference is observed between the
predictive density and the hazard.
Further, we point out that the predictive densities for models where lcd4
was not involved are very close to the log-normal density. This is not surprising since the optimal tuning parameter λ for these models was equal to
224 · exp(2), essentially a value of infinity in this practical situation and thus
implying that the fitted error distributions are close to the normal distribution, as discussed in Section 7.2.3. On the other hand, models where lcd4 was
used in combination with other covariates gave much lower optimal tuning
parameters λ, implying also non-normal error densities. This is seen on the
right-hand side of Figure 7.1. The phenomenon could indicate presence of
a risk-group mixture in the data or absence of another important predictor.
Indeed, a factor that could play an important role is antiretroviral therapy,
which might have been used by some women in our sample. However, this
factor requires modelling time-dependent covariates, which cannot be done
with our model.

Table 7.2: WIHS Data. Estimates of the regression parameters (standard
error; p-value) for the fitted models.

  Model                      lesion                 lvload                 lcd4
  (1) lesion             −0.87 (0.37; 0.018)
  (2) lvload                                   −0.76 (0.19; <0.001)
  (3) lcd4                                                            0.44 (0.11; <0.001)
  (4) lesion + lvload    −0.62 (0.36; 0.080)   −0.70 (0.19; <0.001)
  (5) lesion + lcd4      −0.78 (0.26; 0.003)                          0.39 (0.07; <0.001)
  (6) lvload + lcd4                            −0.39 (0.14; 0.004)    0.38 (0.06; <0.001)
  (7) lesion + lvload
      + lcd4             −0.60 (0.23; 0.008)   −0.30 (0.11; 0.005)    0.39 (0.05; <0.001)
7.6.3 Conclusions
In conclusion, the time to AIDS onset in this study population is notably
shorter in women with oral lesions. Further, this marker improves the prediction of that time based on any of the classical indicators (CD4 count and
viral load). When interpreting these findings, one must bear in mind that only
a limited number of WIHS women opted to participate in the Oral Substudy,
the source of the dental data. Thus they may differ in unknown ways from the
overall set. Nonetheless, our findings are consistent with those of others who
have evaluated oral lesions as predictors of AIDS onset and they illustrate
use of our method in the area of AIDS research. Our method restricts us to
analysis of baseline covariates. Although this is a very widely applicable special
case, extension of the method to accommodate time-dependent covariates
would allow more complex relationships between outcomes and covariates.

Figure 7.1: WIHS Data. Predicted survival curves, hazard curves and densities
for women with lesion = 1 (dotted-dashed line) vs. women with lesion = 0
(solid line) based on models lesion (left part) and lesion + lvload + lcd4 (right
part). Predictive curves for the latter model control for a median value of
lvload = 3.875 and a median value of lcd4 = 8.735. Predictive survival curves
for model lesion are further compared to the nonparametric estimate of
Turnbull (1976) in each group.

Table 7.3: Signal Tandmobiel study. Description of fitted models.

  Models with constant scale
    gender                              x = (gender)
    dmf                                 x = (dmf)
    gender + dmf                        x = (gender, dmf)′
    gender ∗ dmf                        x = (gender, dmf, gender × dmf)′
  Mean-scale models
    gender ∗ dmf/scale(dmf)             x = (gender, dmf, gender × dmf)′, z = (dmf)
    gender ∗ dmf/scale(gender ∗ dmf)    x = (gender, dmf, gender × dmf)′, z = (gender, dmf, gender × dmf)′
7.7 Example: Signal Tandmobiel study – interval-censored data
In paediatric dentistry and orthodontics, adequate knowledge of timing and
patterns of tooth emergence is useful for diagnosis and treatment planning.
This motivates an example in this section where we fit the distribution of
emergence times of permanent maxillary right premolars (teeth 14 and 15 in
Figure 1.1) based on the data from the Signal Tandmobiel study introduced
in Section 1.1.
It is anticipated that the distribution of emergence times of a particular
tooth is different for boys and girls. See Figure 5.1 and Table 5.1 where
the emergence distributions for boys and girls are compared for tooth 44.
However, a similar phenomenon is observed also for other teeth, 14 and 15
included. For that reason, we used the covariate gender (0 for boys and 1 for
girls) in our models. Additionally, it was of dental interest to check whether
the distribution of the emergence time of a permanent tooth differs according
to whether the primary predecessor of the permanent tooth experienced caries.
Table 7.4: Signal Tandmobiel study. Akaike's information criteria for different models.

  Model                                  Tooth 14     Tooth 15
  gender                                −5 532.59    −4 551.57
  dmf                                   −5 538.03    −4 549.93
  gender + dmf                          −5 494.51    −4 526.85
  gender ∗ dmf                          −5 491.47    −4 522.76
  gender ∗ dmf/scale(dmf)               −5 468.61    −4 506.66
  gender ∗ dmf/scale(gender ∗ dmf)      −5 467.67    −4 507.59
For this, we included a binarised dmf score pertaining to the predecessor as
a covariate, dmf = 1 if the primary predecessor of that permanent tooth was
recorded as decayed, or missing due to caries, or filled and 0 otherwise.
As a response, for a particular child, we consider the age of emergence of a particular permanent tooth (14 or 15), recorded in years. Due to the design of
the study (annual planned examinations), the response variable is interval-censored with intervals of length equal to approximately 1 year. It should
be stressed that in this section, the two teeth will be analyzed separately,
i.e. ignoring their possible correlation. In Section 7.8, we indicate how the
correlation between teeth can be incorporated in the analysis. For a better
fit, we shifted the time origin of the AFT model to 5 years of age, which
is clinically the minimal emergence time for the permanent teeth (see, e.g., Ekstrand, Christiansen, and Christiansen, 2003). Namely, we replaced T_i by
T_i − 5 in the AFT model specification (7.1). As in Section 7.6, we
used a sequence of 41 equidistant knots from −6 to 6 with a distance of 0.3
between each pair. The basis standard deviation was 0.2 and the third order
difference was used in the penalty.
7.7.1 Fitted models
We fitted four penalized AFT models with constant scale parameter and two
mean-scale penalized AFT models. The fitted models are described in Table
7.3 and AIC’s for these models are given in Table 7.4. The model selection
was based on the AIC.
Firstly, the model with the interaction term gender ∗ dmf seems to fit the
data best and the interaction term cannot be omitted. Secondly, the models
where the scale parameter τ depends on covariates give a considerably better
fit. For tooth 15, only dmf included in the scale covariate vector leads to the
best AIC. For tooth 14, the model with the scale depending only on dmf can
be improved by inclusion of gender and its interaction with dmf; however, the
improvement is minor. These findings lead us to conclude that the model that
describes the data satisfactorily well while being kept as simple as possible is
the model gender ∗ dmf/scale(dmf). The estimates for this model are given
in Table 7.5. It is seen that dmf = 1 accelerates the emergence for both
genders and also increases the variability of the emergence distribution.

Table 7.5: Signal Tandmobiel study. Estimates (standard errors) for the
models gender ∗ dmf/scale(dmf).

  Parameter              Tooth 14             Tooth 15
  α                      1.7734 (0.0073)      1.9143 (0.0091)
  β(gender)             −0.0931 (0.0099)     −0.0803 (0.0110)
  β(dmf)                −0.0990 (0.0116)     −0.0773 (0.0125)
  β(gender ∗ dmf)        0.0401 (0.0166)      0.0473 (0.0172)
  γ1                    −1.5613 (0.0219)     −1.6121 (0.0351)
  γ(dmf)                 0.2144 (0.0307)      0.2415 (0.0399)
7.7.2 Predictive emergence and hazard curves
For our data, predictive emergence curves (cumulative distribution functions), which are preferred in this case to survival curves, based on the model
gender ∗ dmf/scale(dmf) are shown in Figure 7.2 and predictive hazards
in Figure 7.3. Further, Figure 7.2 shows also the non-parametric estimates of
Turnbull (1976) computed separately for each combination of covariates. It
is seen that model-based emergence curves agree with the non-parametric estimates indicating the goodness-of-fit of our model. Further, the figures show
that the difference between children with dmf = 0 and dmf = 1 is higher
for boys than for girls and that the emergence process for boys is indeed
postponed compared to girls.
Non-decreasing predictive hazard curves reflect the nature of the problem at
hand. Indeed, it can be expected that, provided the tooth of a child has not
emerged yet, the probability that the tooth will emerge increases with age.
Figure 7.2: Signal Tandmobiel study. Predictive emergence curves: solid
lines for curves based on the model gender ∗ dmf/scale(dmf) (on each plot: left
curve for dmf = 1, right curve for dmf = 0), dashed line for a non-parametric
estimate of Turnbull.
Figure 7.3: Signal Tandmobiel study. Predictive hazard curves based on
the model gender ∗ dmf/scale(dmf): solid line for dmf = 1, dotted-dashed
line for dmf = 0.
Table 7.6: Signal Tandmobiel study. Estimates (standard errors) for the
models gender, dmf and gender + dmf.

  Parameter        Model gender or dmf    Model gender + dmf
  Tooth 14
    β(gender)      −0.0740 (0.0080)       −0.0729 (0.0086)
    β(dmf)         −0.0766 (0.0081)       −0.0741 (0.0085)
  Tooth 15
    β(gender)      −0.0564 (0.0085)       −0.0613 (0.0089)
    β(dmf)         −0.0594 (0.0087)       −0.0628 (0.0090)
7.7.3 Comparison of emergence distributions between different groups
While the model gender ∗ dmf/scale(dmf) gives a parsimonious description of
emergence distributions for different groups of children and serves as a solid
basis for prediction as was shown in the previous section, it is not suitable to
provide simple p-values for a comparison of emergence distributions between
e.g. boys and girls. Due to the fact that the interaction term gender ∗ dmf
appeared to be significant, we could only provide a p-value for
a multiple comparison of the four groups (girls with dmf = 1 and 0 and boys
with dmf = 1 and 0).
To simply compare two distributions, while averaging the effect of other covariates, the basic AFT model with a univariate covariate x (i.e. either the
model gender or the model dmf) can be used together with a significance test
for the group parameter. Additionally, it is possible to perform a test that
compares two groups while controlling for additional confounding variables
(e.g. comparison of boys and girls while controlling for dmf or vice versa). To
do that, we perform significance tests of β parameters in the model gender
+ dmf.
The estimates of the regression parameters β together with their standard errors,
derived from formula (7.11), in the mentioned models are given in Table 7.6.
The Wald tests of significance for each β parameter all yield p-values lower
than 0.0001, which confirm the findings obtained previously that there is
indeed a significant difference in emergence distributions of studied teeth
between boys and girls and also between the group of children with dmf = 0
and dmf = 1. The difference holds both marginally (irrespective of the value
of dmf or of gender, respectively) and while controlling
for the other covariate.
The issue of the robustness of the AFT model against omitted covariates,
discussed in Section 3.3, is further illustrated in Table 7.6. The effect of
gender remains almost unchanged in both models, gender and gender + dmf,
and an analogous conclusion holds also for the effect of dmf.
7.7.4 Conclusions
It has been shown that the emergence process of teeth 14 and 15 is significantly different between boys and girls and that the caries experience status
of a primary predecessor, expressed by the dmf score, has a significant effect
on the timing of emergence of permanent successors.
Predictive emergence curves have been drawn that can be used for diagnosis
and treatment planning in paediatric dentistry. Further, it was found that
the acceleration effect of caries experience on a primary predecessor on the
timing of emergence of its successor was stronger for boys than for girls.
7.8 Discussion
In this chapter, we have suggested a method useful for fitting the linear regression model for independent censored observations while avoiding overly
restrictive parametric assumptions on the error distribution. Most classically,
the logarithmic transformation of the response leads to the well known AFT
model. However, other transformations of the response leading to its potential range covering the whole real line are also possible. The density of the
error distribution is specified in a semi-parametric way as a mixture of the
overspecified number of normal densities with fixed means – knots and given
common standard deviation. Mixture coefficients are estimated using the
penalized maximum-likelihood method. Such model specifications allow flexibility with respect to the resulting error distribution yet retain tractability
such that data carrying censoring of several types, especially interval censoring, can be handled naturally.
The method of this chapter could generally be extended to handle also multivariate survival data. Namely, the population averaged AFT model (see
Section 3.4.2) with a multivariate error distribution specified as a multivariate penalized mixture (see Section 6.3.4) could be used. Or alternatively,
the cluster specific AFT model (see Section 3.4.3) with an error distribution
given as a penalized mixture and random effects distribution specified either
parametrically or as a (multivariate) penalized mixture could be considered.
However, as outlined in Sections 4.2 and 4.3, the computation, let alone the
optimization of the (penalized) likelihood is practically intractable. For this
reason, we switch to fully Bayesian approaches using the MCMC methodology.
Chapter 8
Bayesian Normal Mixture Cluster-Specific AFT Model
In this chapter we present a cluster-specific AFT model (see Section 3.4.3)
with a flexible error distribution. This model, introduced by Komárek and
Lesaffre (2006a), allows us to analyze also data sets where not necessarily all
observations are independent. For example, we will be able to analyze jointly
several teeth from the Signal Tandmobiel study, analyze the CGD data
where the times to recurrent infections are involved, or analyze data
from multicenter studies like the EBCP data. The approach presented here
uses the classical normal mixture (see Section 6.1) to express the error density
in the AFT model. For the random effects we use a parametric (multivariate) normal distribution. The full Bayesian approach with the Markov chain
Monte Carlo methodology will be used for the inference.
In Section 8.1, we specify the cluster-specific AFT model and the distributional assumptions we use in this chapter. In Section 8.2, we specify the
model from the Bayesian perspective and derive the corresponding posterior
distribution. Details of the Markov chain Monte Carlo methodology to sample from the posterior distribution are given in Section 8.3. In Section 8.4, we
show how the survival distributions for specific combinations of covariates can
be estimated. Further, in Section 8.5, we give the estimates of the individual
random effects that could be used, for example, for the discrimination. The
performance of the method is evaluated using the simulation study in Section
8.6. The method is applied to the analysis of the interval-censored emergence
times of 8 permanent teeth in Section 8.7, to the recurrent events analysis
in Section 8.8 and to the analysis of the breast cancer multicenter study in
Section 8.9. The chapter is finalized by the discussion in Section 8.10.
8.1 Model
Let T_{i,l}, i = 1, . . . , N, l = 1, . . . , n_i, be the lth event time in the ith cluster or
the lth recurrent event on the ith subject in the study. Let T_{i,l} be observed
as an interval ⌊t_{i,l}^L, t_{i,l}^U⌋. Let the logarithmic transformations of the event and
observed event times be Y_{i,l} = log(T_{i,l}), y_{i,l}^L = log(t_{i,l}^L), y_{i,l}^U = log(t_{i,l}^U). We will
assume that the random vectors T_1, . . . , T_N, where T_i = (T_{i,1}, . . . , T_{i,n_i})′,
i = 1, . . . , N, are independent. However, the components of each T_i are not
necessarily independent.
To model the effect of covariates on the event time we use the cluster-specific
AFT model (3.7), i.e.
log(T_{i,l}) = Y_{i,l} = β′x_{i,l} + b_i′z_{i,l} + ε_{i,l},   i = 1, . . . , N, l = 1, . . . , n_i,     (8.1)
where β = (β1 , . . . , βm )′ is the unknown regression coefficient vector, xi,l the
covariate vector for fixed effects, bi = (bi,1 , . . . , bi,q )′ , i = 1, . . . , N are the
random effect vectors with the density g_b(b), inducing possible correlation
between the components of Y_i = (Y_{i,1}, . . . , Y_{i,n_i})′. Further, z_{i,l} is the covariate
vector for random effects and εi,l are independent and identically distributed
random variables with the density gε (ε). Along the lines of Gelman et al.
(2004, Chapter 15) we use the terms ‘fixed’ and ‘random’ effects throughout
the thesis even in a Bayesian context where all unknown parameters are
treated as random quantities.
For recurrent events, usually z_{i,l} = 1 for all i and l, and bi = bi,1 expresses an individual-specific deviation from an overall mean log-event time which is not explained by the fixed effects covariates (see the analysis of the CGD data in Section 8.8). For clustered data, the vector z_{i,l} may define further sub-clusters (as in the analysis of the Signal Tandmobiel® data in Section 8.7), allowing for a higher dependence of observations within sub-clusters given by common values of appropriate components of the vector bi, while keeping the dependence also across the sub-clusters through the correlation between the components of bi. In multicenter clinical trials where the aim is to evaluate the effect of some treatment (e.g. the EBCP data analyzed in Section 8.9), the vector z_{i,l} might be equal to (1, treatment_{i,l})′, allowing both the baseline value of the expected event time and the treatment effect to vary across centres.
8.1.1 Distributional assumptions
The density gε(ε) of the error term εi,l in model (8.1) is specified in a flexible way as a classical normal mixture (6.3), i.e.
$$g_\varepsilon(\varepsilon) = \sum_{j=1}^{K} w_j\, \varphi(\varepsilon \mid \mu_j, \sigma_j^2), \qquad\qquad (8.2)$$
where K is the unknown number of mixture components and, further, w = (w1, . . . , wK)′ are the unknown mixture weights, µ = (µ1, . . . , µK)′ the unknown mixture means and σ² = (σ²1, . . . , σ²K)′ the unknown mixture variances.
We have already mentioned in Section 6.1.2 that a heteroscedastic mixture (8.2) leads to a likelihood which is unbounded if the parameter space for the variances is unconstrained. In a full Bayesian analysis, this difficulty is solved by using an appropriate prior distribution for the variances, which plays the role of a constraint. We discuss this issue in full detail in Section 8.2.1.
For the random effects bi , we take a suitable parametric distribution, namely
the multivariate normal distribution, see Section 8.2.2 for details. The fact
that we put more emphasis on a correct specification of the distribution of
the error term εi,l than on a specification of the distribution of random effects
bi is driven by the following reasoning.
For an AFT model, the regression parameters β express the effect of covariates (xi,l) both conditionally (given bi) and marginally (after integrating bi out). Neither interpretation changes when different distributional assumptions are made on bi. Further, with a correctly specified distribution of εi,l the conditional model is always correctly specified. However, when the distribution of εi,l is incorrect, neither the conditional nor the marginal model is specified correctly. Further, Keiding, Andersen, and Klein (1997) showed that for the univariate (single-spell) Weibull AFT model the regression parameters are robust against misspecification of the random effects distribution. This finding is further supported, also for non-Weibull models, by the empirical results of Lambert et al. (2004). Finally, Verbeke and Lesaffre (1997) showed, in the context of the normal linear mixed model with uncensored data, that the maximum-likelihood estimates of the regression parameters are unaffected by a misspecified random effects distribution.
Of course, in situations in which the variability of the random effects considerably exceeds the variability of the error term, it becomes more important to specify correctly the distribution of the random effects rather than the distribution of the error term. However, in the applications presented in this chapter this is not the case.
8.1.2 Likelihood
The likelihood contribution of the ith cluster can be derived from expressions (4.7) and (4.9). Namely,
$$L_i = \int_{\mathbb{R}^q} \prod_{l=1}^{n_i} \Biggl\{ \int_{y^L_{i,l}}^{y^U_{i,l}} g_\varepsilon\bigl(y_l - \beta' x_{i,l} - b_i' z_{i,l}\bigr)\, dy_l \Biggr\}\, g_b(b_i)\, db_i. \qquad\qquad (8.3)$$
It might be useful to stress again that due to multivariate integration in the
likelihood (8.3), it is rather cumbersome to use maximum-likelihood based
methods for the cluster-specific AFT model with interval-censored observations even with gε (ε) and gb (b) being parametrically specified. Mainly for
this reason, the full Bayesian approach will be exploited.
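To illustrate the nature of the integral in (8.3), the following minimal Python sketch approximates a single likelihood contribution by naive Monte Carlo integration over the random effects; all numerical inputs (mixture parameters, covariates, observed intervals) are purely illustrative and the sketch is not part of the estimation machinery developed below.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative error mixture (8.2): weights, means, standard deviations
w  = np.array([0.4, 0.6])
mu = np.array([-2.0, 1.333])
sd = np.array([0.5, 0.6])

def G_eps(e):
    """Mixture cdf of the error term, evaluated elementwise."""
    e = np.atleast_1d(e)[:, None]
    return np.sum(w * norm.cdf(e, loc=mu, scale=sd), axis=1)

def cluster_likelihood(yL, yU, X, Z, beta, gamma, D, n_mc=10_000):
    """Monte Carlo approximation of L_i in (8.3):
    integrate the product of interval probabilities over b_i ~ N(gamma, D)."""
    b = rng.multivariate_normal(gamma, D, size=n_mc)   # draws of the random effects
    lin = X @ beta + b @ Z.T                           # linear predictors, shape (n_mc, n_i)
    probs = G_eps((yU - lin).ravel()).reshape(lin.shape) - \
            G_eps((yL - lin).ravel()).reshape(lin.shape)
    return np.mean(np.prod(probs, axis=1))

# A cluster with n_i = 3 interval-censored log-times, m = 2 fixed and q = 1 random covariate
X  = np.array([[1.0, 0.0], [0.5, 1.0], [0.2, 1.0]])
Z  = np.ones((3, 1))
yL = np.array([0.1, 0.4, 0.2])
yU = np.array([0.9, 1.2, 1.0])
print(cluster_likelihood(yL, yU, X, Z,
                         beta=np.array([0.4, -0.3]),
                         gamma=np.array([0.0]),
                         D=np.array([[0.25]])))
```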
8.2 Bayesian hierarchical model
The Bayesian specification of the model continues by specifying the prior distributions for all unknown parameters, jointly denoted by θ. We assume the cluster-specific AFT model (8.1) with a hierarchical structure graphically represented by the directed acyclic graph (DAG) given in Figure 8.1. As explained in Section 4.4, the joint prior distribution of θ is then given by the product of the conditional distributions of the nodes pertaining to unobserved quantities given their parents, namely
$$p(\theta) \propto \Biggl[\prod_{i=1}^{N}\prod_{l=1}^{n_i} p\bigl(t_{i,l}\mid \beta, b_i, \varepsilon_{i,l}\bigr) \times p\bigl(\varepsilon_{i,l}\mid \mu, \sigma^2, r_{i,l}\bigr) \times p\bigl(r_{i,l}\mid K, w\bigr) \times p\bigl(b_i\mid \gamma, D\bigr)\Biggr] \times$$
$$p\bigl(\mu\mid K\bigr) \times p\bigl(\sigma^2\mid K, \eta\bigr) \times p\bigl(\eta\bigr) \times p\bigl(w\mid K\bigr) \times p\bigl(K\bigr) \times p\bigl(\beta\bigr) \times p\bigl(\gamma\bigr) \times p\bigl(D\bigr). \qquad\qquad (8.4)$$
For clarity, we omitted all fixed hyperparameters and fixed covariates in the
expression (8.4). As the DAG indicates, the unknown parameters can be
split into two parts connected only through the node of the true event times.
The conditional distribution for this node is simply a Dirac (degenerate)
distribution driven by the AFT model (8.1), i.e.
p(ti,l | β, bi , εi,l ) = I[log(ti,l ) = β ′ xi,l + b′i z i,l + εi,l ],
i = 1, . . . , N, l = 1, . . . , ni .
In the subsequent sections, we explain all the multiplicands of expression (8.4)
and also the meaning of the newly introduced parameters ri,l , i = 1, . . . , N ,
l = 1, . . . , ni , γ, D, and η.
8.2.1 Prior specification of the error part
The prior conditional distributions pertaining to the error part of the model
are inspired by the work of Richardson and Green (1997) (with some change
in notation) who studied Bayesian estimation of the normal mixtures in the
context of i.i.d. data. That is, they did not consider covariates or censoring.
To improve the computation of the posterior distribution, it is useful to assume that εi,l , i = 1, . . . , N, l = 1, . . . , ni come from a heterogeneous population consisting of groups j = 1, 2, . . . , K of sizes proportional to the mixture
weights wj and introduce latent allocation variables ri,l denoting the label of
the group from which each random error variable εi,l is drawn. In this way we introduce the Bayesian implementation of the data augmentation algorithm (see Section 4.3).

Figure 8.1: Directed acyclic graph for the Bayesian normal mixture cluster-specific AFT model.

Together with the distributional assumption (8.2), this leads to the following conditional distributions appearing in the prior (8.4):
$$\Pr(r_{i,l} = j \mid K, w) = w_j, \qquad j \in \{1, \dots, K\},$$
$$p(\varepsilon_{i,l} \mid \mu, \sigma^2, r_{i,l}) = \varphi\bigl(\varepsilon_{i,l} \mid \mu_{r_{i,l}}, \sigma^2_{r_{i,l}}\bigr), \qquad i = 1, \dots, N,\; l = 1, \dots, n_i.$$
For the number of mixture components, K, we experimented with

1. a Poisson distribution with mean equal to a fixed hyperparameter λ, truncated at zero and at some prespecified (relatively large) value Kmax, i.e.
$$\Pr(K = k) = \Bigl\{\sum_{j=1}^{K_{\max}} \frac{\lambda^j}{j!}\Bigr\}^{-1} \frac{\lambda^k}{k!}, \qquad k = 1, \dots, K_{\max};$$

2. a uniform distribution on {1, . . . , Kmax}, i.e.
$$\Pr(K = k) = \frac{1}{K_{\max}}, \qquad k = 1, \dots, K_{\max}.$$

Both priors are illustrated numerically in the short sketch below.
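A small numerical illustration of these two priors in Python, with the illustrative choices λ = 5 and Kmax = 30 that are also used in the applications later in this chapter:

```python
import numpy as np
from scipy.special import gammaln

lam, K_max = 5.0, 30
k = np.arange(1, K_max + 1)

# Zero-truncated Poisson prior, additionally truncated at K_max
log_pois = k * np.log(lam) - gammaln(k + 1)
prior_pois = np.exp(log_pois - np.logaddexp.reduce(log_pois))

# Uniform prior on {1, ..., K_max}
prior_unif = np.full(K_max, 1.0 / K_max)

print(prior_pois.sum(), prior_unif.sum())   # both equal 1.0
print(prior_pois[:8].round(4))              # most mass around K = lambda = 5
```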
The prior for the mixture weights w is taken to be a symmetric K-dimensional Dirichlet distribution with prior ‘sample size’ equal to Kδ, i.e.
$$p(w \mid K) = \frac{\Gamma(K\delta)}{\Gamma(\delta)^K} \prod_{j=1}^{K} w_j^{\delta - 1},$$
where δ is a fixed hyperparameter.
Further, the mixture means µj and variances σ²j, j = 1, . . . , K are a priori all drawn independently with normal N(ξ, κ) and inverse-gamma IG(ζ, η) priors, respectively, i.e.
$$p(\mu \mid K) = \prod_{j=1}^{K} \varphi(\mu_j \mid \xi, \kappa), \qquad\qquad (8.5)$$
$$p(\sigma^2 \mid K, \eta) = \prod_{j=1}^{K} \frac{\eta^{\zeta}}{\Gamma(\zeta)}\, (\sigma_j^2)^{-(\zeta+1)} \exp\Bigl(-\frac{\eta}{\sigma_j^2}\Bigr), \qquad\qquad (8.6)$$
where ξ, κ and ζ are fixed hyperparameters. As in Richardson and Green (1997) we let the hyperparameter η follow a gamma distribution with fixed shape parameter h1 and fixed rate parameter h2, i.e.
$$p(\eta) = \frac{h_2^{h_1}}{\Gamma(h_1)}\, \eta^{h_1 - 1} \exp(-h_2 \eta).$$
A rationale for this construction is given in Section 8.2.3.
Since the error model is invariant to permutations of the labels j = 1, . . . , K, the joint prior distribution of the vector µ is restricted to the set {µ : µ1 < · · · < µK} for identifiability reasons; see Stephens (2000) or Jasra, Holmes, and Stephens (2005) for other approaches to establish identifiability. The joint prior distribution of the mixture means and variances is thus K! times the product of (8.5) and (8.6), restricted to the above-mentioned set of increasing means.
8.2.2 Prior specification of the regression part
The regression part of the model has the structure of a classical Bayesian linear mixed model (see, e.g., Gelman et al., 2004, Chapter 5). Let X be a (∑_{i=1}^N n_i) × m matrix with the vectors x′_{1,1}, . . . , x′_{N,n_N} as rows. Similarly, let Z be a (∑_{i=1}^N n_i) × q matrix with the vectors z′_{1,1}, . . . , z′_{N,n_N} as rows. Further, we will assume that the matrix (X, Z) is of full column rank (m + q). In other words, covariates included in xi,l are not included in z_{i,l} and vice versa. This gives rise to hierarchical centering, which in general results in a better behavior of the MCMC algorithm (Gelfand, Sahu, and Carlin, 1995). Finally, since gε(ε) does not have zero mean, we do not allow a column of ones in the matrix X, to avoid identifiability problems.
The prior distribution for each regression coefficient βj, j = 1, . . . , m is assumed to be N(ν_{β,j}, ψ_{β,j}), and the βj are assumed to be a priori independent, i.e.
$$p(\beta) = \prod_{j=1}^{m} \varphi(\beta_j \mid \nu_{\beta,j}, \psi_{\beta,j}).$$
The vectors ν_β = (ν_{β,1}, . . . , ν_{β,m})′ and ψ_β = (ψ_{β,1}, . . . , ψ_{β,m})′ are fixed hyperparameters.
As already mentioned in Section 8.1.1, the (prior) distribution of the random effect vectors bi, i = 1, . . . , N is assumed to be (multivariate) normal with a prior mean γ and a prior covariance matrix D, i.e.
$$p(b_i \mid \gamma, D) = \varphi_q(b_i \mid \gamma, D), \qquad\qquad (8.7)$$
where γ = (γ1, . . . , γq)′.
The prior distribution for each γj is N(ν_{γ,j}, ψ_{γ,j}), independently for j = 1, . . . , q, i.e.
$$p(\gamma) = \prod_{j=1}^{q} \varphi(\gamma_j \mid \nu_{\gamma,j}, \psi_{\gamma,j}).$$
The vectors ν_γ = (ν_{γ,1}, . . . , ν_{γ,q})′ and ψ_γ = (ψ_{γ,1}, . . . , ψ_{γ,q})′ are fixed. Special care is needed when a random intercept is included in the model (i.e. when Z contains a column of ones, say its first column). Hierarchical centering cannot be applied in this case since the overall intercept is given by the mean of the mixture (8.2). For that reason, γ1 is fixed to zero (or equivalently, ν_{γ,1} = 0, ψ_{γ,1} = 0).
The prior distribution for the covariance matrix D of the random effects is assumed to be an inverse-Wishart distribution with fixed degrees of freedom df and a fixed scale matrix S, i.e.
$$p(D) = \Bigl\{2^{\frac{df\, q}{2}}\, \pi^{\frac{q(q-1)}{4}} \prod_{j=1}^{q} \Gamma\Bigl(\frac{df + 1 - j}{2}\Bigr)\Bigr\}^{-1} |S|^{\frac{df}{2}}\, |D|^{-\frac{df + q + 1}{2}} \exp\Bigl\{-\frac{1}{2}\,\mathrm{trace}\bigl(S D^{-1}\bigr)\Bigr\}. \qquad (8.8)$$
In the special case of a univariate random effect (q = 1), we use d instead of D and s instead of S in the notation. Note that in that case, the inverse-Wishart distribution is the same as the inverse-gamma distribution with shape parameter equal to df/2 and scale parameter equal to s/2.
Further, in the situation of q = 1, we alternatively considered (see Section 8.8) the use of a uniform prior for the standard deviation of the random effect, which is often considered to be a better choice (see Gelman et al., 2004, pp. 136, 390 or Gelman, 2006), i.e. a priori
$$p(\sqrt{d}) = \frac{1}{\sqrt{s}}\, I[0 < d < s], \qquad\qquad (8.9)$$
for a large value of s. On the original variance scale the prior (8.9) transforms into
$$p(d) = \frac{1}{2\sqrt{s\, d}}\, I[0 < d < s],$$
which is formally a truncated inverse-gamma distribution with shape parameter equal to −1/2 and scale parameter equal to zero.
8.2.3 Weak prior information
In this problem, we have opted for specifying weak prior information on the parameters of interest. When a priori information is available, our prior assumptions can be appropriately modified.
For the regression part of the model, we use non-informative but proper distributions; that is, the prior variances of the regression parameters β (ψ_β) and γ (ψ_γ) are chosen such that the posterior variances of the regression parameters are at least 100 times lower than the prior variances (which must be checked from the results). Prior hyperparameters for the covariance matrix D giving weak prior information correspond to the choices df = q − 1 + c and S = diag(c, . . . , c) with c a small positive number.
In the error part of the model, it is not possible to be fully non-informative, i.e. to use priors p(µ, σ² | K) ∝ 1 × ∏_{j=1}^K σ_j^{-2}, and still obtain proper posterior distributions (Diebolt and Robert, 1994; Roeder and Wasserman, 1997). Richardson and Green (1997) offer, in the context of i.i.d. observations, say e1, . . . , eN, the following alternative: a rather flat prior N(ξ, κ) for each µj is achieved by letting ξ equal ē = N^{-1} ∑_{i=1}^N e_i and setting κ equal to a multiple of R², where R = max(e_i) − min(e_i). They point out that it might be restrictive to suppose that knowledge of the range or variability of the data implies much about the size of each single σ²j and therefore introduce an additional hierarchical level by allowing η to follow a gamma distribution with parameters h1 and h2. They recommend taking ζ > 1 > h1 to express the belief that the σ²j are similar, which is necessary to avoid the problem of an unbounded likelihood, without being informative about their absolute size. Finally, they suggest setting the parameter h2 to a small multiple of 1/R². Here, the residuals yi,l − β′xi,l − b′i z_{i,l} play the role of the observations e_i. A rough estimate of their location and scale can be obtained from a maximum-likelihood fit of the AFT model, even without random effects (the scale of the residuals can only increase), with an explicitly included intercept and scale parameter in the model. This can be done using standard software packages such as R, S-PLUS or SAS. The estimated intercept from this model can then be used instead of ē and a multiple of the estimated scale parameter instead of R.
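As a compact numerical summary of this recipe, the sketch below collects the resulting hyperparameter values; the numbers mirror the choices used later in Section 8.7.1 and serve only to show how the quantities are related, not to prescribe them.

```python
# Hyperparameters expressing weak prior information, following the recipe of this
# section and mirroring the values used later in Section 8.7.1 (teeth example).
intercept_hat = 1.8          # intercept of a preliminary ML AFT fit (plays the role of e-bar)
scale_hat     = 0.25         # estimated scale parameter of that preliminary fit

xi    = intercept_hat        # prior mean of each mixture mean mu_j
kappa = (3 * scale_hat)**2   # flat prior variance, a multiple of the squared residual spread
zeta  = 2.0                  # shape of the inverse-gamma prior for sigma_j^2 (zeta > 1)
h1    = 0.2                  # shape of the gamma hyperprior for eta (h1 < 1 < zeta)
h2    = 0.1                  # rate of the gamma hyperprior for eta
delta = 1.0                  # Dirichlet 'prior sample size' per component for the weights

print(dict(xi=xi, kappa=kappa, zeta=zeta, h1=h1, h2=h2, delta=delta))
```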
8.2.4 Posterior distribution
As we indicated in Section 4.4, the joint posterior distribution, p(θ | data), is proportional to the product of all DAG conditional distributions, i.e.
$$p(\theta \mid \text{data}) \propto p(\theta) \times \prod_{i=1}^{N}\prod_{l=1}^{n_i} p\bigl(t^L_{i,l}, t^U_{i,l} \mid t_{i,l}, \text{censoring}_{i,l}\bigr), \qquad\qquad (8.10)$$
where p(θ) is given by (8.4) and p(t^L_{i,l}, t^U_{i,l} | t_{i,l}, censoring_{i,l}) is discussed below.
i,l , ti,l | ti,l , censoring i,l ) is discussed below.
A box called censoringi,l in the DAG represents a realization of the random
variable(s) causing the censoring of the (i, l)th event time. Note, that under
the assumption of independent noninformative censoring (see Section 2.4)
there is no need to specify a measurement model for the censoring mechanism
116
CHAPTER 8. BAYESIAN NORMAL MIXTURE CS AFT MODEL
since it only acts as a multiplicative constant in the posterior. After omitting
subscripts i, l for clarity, the expression of p(tL , tU | t, censoring) is rather
obvious for most censoring mechanisms.
For example with interval censoring resulting from checking the survival status at (random) times C = {c0 , . . . , cS+1 }, where c0 = 0, cS+1 = ∞ we obtain
a Dirac density
p(tL = cs , tU = cs+1 | t, C) = I t ∈ ⌊cs , cs+1 ⌋ , s = 0, . . . , S.
With standard right-censoring driven by the (random) censoring time C = c,
the following Dirac densities are obtained
p(tL = tU = t | t, c) = I[t ≤ c],
p(tL = t, tU = ∞ | t, c) = I[t > c].
8.3 Markov chain Monte Carlo
Inference is based on a sample from the posterior distribution obtained using
the MCMC methodology (see Section 4.5). The parameters of the error part
of the model are updated using the combination of the reversible jump MCMC
algorithm of Green (1995) and a conventional Gibbs algorithm (Geman and
Geman, 1984). For the remaining parameters of the model, each iteration of
the MCMC is conducted using the Gibbs sampler. Both the reversible jump
MCMC algorithm and the full conditional distributions needed to implement
the Gibbs sampler are discussed below.
8.3.1 Update of the error part of the model
Details on how to implement the update of the parameters of the error part of the model are given in Richardson and Green (1997). Their guidelines, now based on the residuals εi,l = yi,l − β′xi,l − b′i z_{i,l}, can be applied immediately. We give only a brief summary and refer to that paper for details.
Six move types are suggested by Richardson and Green (1997), namely
(i) Updating the mixture weights w while keeping K fixed;
(ii) Updating the mixture means µ and variances σ 2 while keeping K fixed;
(iii) Updating the allocation parameters ri,l , i = 1, . . . , N , l = 1, . . . , ni ;
(iv) Updating the variance-hyperparameter η;
(v) Split-combine move, i.e. splitting one mixture component into two, or
combining two into one;
(vi) Birth-death move, i.e. the birth or death of an empty mixture component.
In our context, due to the regression and the presence of censored data, we
add one more move type, i.e.
(vii) Updating the residuals εi,l , i = 1, . . . , N , l = 1, . . . , ni .
Note that only move types (v) and (vi) change the dimension of the parameter
vector by changing K to K −1 or K +1 and are performed using the reversible
jump MCMC algorithm. The moves (i)–(iv) and the move (vii) are performed
by sampling from the full conditional distributions given below.
Full conditional for mixture weights w

The full conditional distribution for the mixture weights is Dirichlet with parameters δ + Nj, j = 1, . . . , K, i.e.
$$p(w \mid \cdots) = \frac{\Gamma(K\delta + n)}{\prod_{j=1}^{K} \Gamma(\delta + N_j)} \prod_{j=1}^{K} w_j^{\delta + N_j - 1},$$
where n = ∑_{i=1}^N n_i is the total sample size and Nj, j = 1, . . . , K is the number of observations currently allocated to the jth mixture component, i.e.
$$N_j = \sum_{i=1}^{N} \sum_{l=1}^{n_i} I[r_{i,l} = j], \qquad j = 1, \dots, K.$$
Full conditional for mixture means

The full conditional for each mixture mean is normal with mean and variance
$$E(\mu_j \mid \cdots) = \frac{\sigma_j^{-2} \sum_{(i,l):\, r_{i,l} = j} \varepsilon_{i,l} + \kappa^{-1} \xi}{\sigma_j^{-2} N_j + \kappa^{-1}}, \qquad \mathrm{var}(\mu_j \mid \cdots) = \frac{1}{\sigma_j^{-2} N_j + \kappa^{-1}}, \qquad j = 1, \dots, K.$$
Note that due to the ordering constraint µ1 < · · · < µK , the full conditional
only generates a proposal which is accepted provided it does not break this
ordering.
Full conditional for mixture variances

The full conditional for each mixture variance is an inverse-gamma distribution,
$$\sigma_j^2 \mid \cdots \;\sim\; \text{I-Gamma}\Bigl(\zeta + \frac{N_j}{2},\; \eta + \frac{1}{2} \sum_{(i,l):\, r_{i,l} = j} (\varepsilon_{i,l} - \mu_j)^2\Bigr).$$
Full conditional for the allocation variables

The full conditional for each allocation variable ri,l, i = 1, . . . , N, l = 1, . . . , ni is discrete with
$$\Pr(r_{i,l} = j \mid \cdots) \;\propto\; \frac{w_j}{\sigma_j} \exp\Bigl\{-\frac{(\varepsilon_{i,l} - \mu_j)^2}{2\sigma_j^2}\Bigr\}, \qquad j \in \{1, \dots, K\}.$$
Full conditional for the variance hyperparameter

The full conditional for the variance hyperparameter η is a gamma distribution,
$$\eta \mid \cdots \;\sim\; \text{Gamma}\Bigl(h_1 + K\zeta,\; h_2 + \sum_{j=1}^{K} \sigma_j^{-2}\Bigr).$$
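To make the fixed-dimension part of the sampler concrete, the following Python sketch performs one sweep of the updates (i)–(iv) together with the allocation update for given residuals. It is a simplified illustration written for this text: the ordering constraint on the means and the reversible jump moves (v)–(vi) are omitted, and all function and variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_sweep_error_part(eps, r, w, mu, sig2, eta,
                           delta, xi, kappa, zeta, h1, h2):
    """One Gibbs sweep over (w, mu, sigma^2, r, eta) for fixed K,
    given current residuals eps (1d array) and allocations r."""
    K = len(w)
    N_j = np.bincount(r, minlength=K)

    # (i) mixture weights: Dirichlet(delta + N_j)
    w = rng.dirichlet(delta + N_j)

    # (ii) mixture means and variances
    for j in range(K):
        e_j = eps[r == j]
        prec = N_j[j] / sig2[j] + 1.0 / kappa
        mean = (e_j.sum() / sig2[j] + xi / kappa) / prec
        mu[j] = rng.normal(mean, np.sqrt(1.0 / prec))
        shape = zeta + 0.5 * N_j[j]
        rate  = eta + 0.5 * np.sum((e_j - mu[j]) ** 2)
        sig2[j] = 1.0 / rng.gamma(shape, 1.0 / rate)     # inverse-gamma draw

    # (iii) allocations: probabilities proportional to (w_j / sigma_j) exp(...)
    logp = (np.log(w) - 0.5 * np.log(sig2)
            - 0.5 * (eps[:, None] - mu) ** 2 / sig2)
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    r = np.array([rng.choice(K, p=pi) for pi in p])

    # (iv) variance hyperparameter: Gamma(h1 + K*zeta, h2 + sum 1/sigma_j^2)
    eta = rng.gamma(h1 + K * zeta, 1.0 / (h2 + np.sum(1.0 / sig2)))

    return r, w, mu, sig2, eta

# Tiny illustration on artificial residuals
eps = np.concatenate([rng.normal(-1, 0.3, 60), rng.normal(1, 0.5, 40)])
K = 2
r, w, mu, sig2, eta = gibbs_sweep_error_part(
    eps, r=rng.integers(0, K, eps.size), w=np.ones(K) / K,
    mu=np.array([-1.0, 1.0]), sig2=np.array([0.2, 0.2]), eta=1.0,
    delta=1.0, xi=0.0, kappa=4.0, zeta=2.0, h1=0.2, h2=0.1)
print(w.round(3), mu.round(3), sig2.round(3), round(eta, 3))
```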
Split-combine move

To perform the split-combine move, a random choice is first made whether to attempt the split or the combine move: given K, the probability of attempting the split move is π^split_K and the probability of attempting the combine move is π^combine_K = 1 − π^split_K. Obviously, π^split_1 = 1 and π^split_{Kmax} = 0. Otherwise we use π^split_K = π^combine_K = 0.5, K = 2, . . . , Kmax − 1.
When the combine move is attempted, the new mixture with K − 1 components is proposed as follows:

1. Choose at random a pair of mixture components (j1, j2) such that for the current values of the mixture means
$$\mu_{j_1} < \mu_{j_2} \quad \text{and there is no other } \mu_j \text{ in the interval } [\mu_{j_1}, \mu_{j_2}]; \qquad\qquad (8.11)$$
2. Propose a new mixture component by merging the j1th and the j2th components. Label this new component by j*. Set the weight, mean and variance of the new component such that its 0th, 1st and 2nd moments are the same as those of the combination of the merged components, i.e.
$$w_{j^*} = w_{j_1} + w_{j_2}, \qquad \mu_{j^*} = \frac{w_{j_1}\mu_{j_1} + w_{j_2}\mu_{j_2}}{w_{j^*}}, \qquad \sigma^2_{j^*} = \frac{w_{j_1}(\mu_{j_1}^2 + \sigma_{j_1}^2) + w_{j_2}(\mu_{j_2}^2 + \sigma_{j_2}^2)}{w_{j^*}} - \mu_{j^*}^2. \qquad (8.12)$$

3. Propose new values for the allocation variables ri,l, i = 1, . . . , N, l = 1, . . . , ni that were equal to j1 or to j2, i.e. set such allocation variables equal to j*.

4. Accept the proposed mixture with K − 1 components with probability
$$\Pr^{\text{combine}}_{\text{accept}} = \min\bigl\{1,\; A_{sc}(K-1)^{-1}\bigr\},$$
where the acceptance ratio A_sc(K − 1) is discussed below. If not accepted, keep the current K-component mixture.
The split move must be reversible, in the sense described in Green (1995), to the combine move. Namely, it consists of the following steps:

1. Choose at random a component j* which is proposed to be split.

2. Propose two new mixture components, labeled j1 and j2. To keep reversibility, set their weights, means and variances such that equation (8.12) holds. This can be done by sampling a three-dimensional auxiliary random vector u = (u1, u2, u3)′ from some distribution with a density pu(u) and setting
$$w_{j_1} = w_{j^*} u_1, \qquad w_{j_2} = w_{j^*}(1 - u_1),$$
$$\mu_{j_1} = \mu_{j^*} - u_2\, \sigma_{j^*} \sqrt{\frac{w_{j_2}}{w_{j_1}}}, \qquad \mu_{j_2} = \mu_{j^*} + u_2\, \sigma_{j^*} \sqrt{\frac{w_{j_1}}{w_{j_2}}}, \qquad\qquad (8.13)$$
$$\sigma^2_{j_1} = u_3\,(1 - u_2^2)\, \sigma^2_{j^*}\, \frac{w_{j^*}}{w_{j_1}}, \qquad \sigma^2_{j_2} = (1 - u_3)\,(1 - u_2^2)\, \sigma^2_{j^*}\, \frac{w_{j^*}}{w_{j_2}}.$$
A numerical check that this construction indeed preserves the moments in (8.12) is given after these steps. Check whether the condition (8.11) holds. If not, reject the split proposal directly, otherwise continue.
3. Propose new values (either j1 or j2) for those allocation variables ri,l, i = 1, . . . , N, l = 1, . . . , ni that were equal to j*. This is done randomly with
$$\Pr_{\text{alloc}}(r_{i,l} = j_1) \propto \frac{w_{j_1}}{\sigma_{j_1}} \exp\Bigl\{-\frac{(\varepsilon_{i,l} - \mu_{j_1})^2}{2\sigma_{j_1}^2}\Bigr\}, \qquad \Pr_{\text{alloc}}(r_{i,l} = j_2) \propto \frac{w_{j_2}}{\sigma_{j_2}} \exp\Bigl\{-\frac{(\varepsilon_{i,l} - \mu_{j_2})^2}{2\sigma_{j_2}^2}\Bigr\}.$$

4. Accept the proposed mixture with K + 1 components with probability
$$\Pr^{\text{split}}_{\text{accept}} = \min\bigl\{1,\; A_{sc}(K)\bigr\},$$
see below for the expression of the acceptance ratio A_sc(K). If not accepted, keep the current K-component mixture.
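The moment-matching relation between (8.12) and (8.13) can be verified numerically; the short sketch below splits a component using auxiliary variables u and then recombines the two parts via (8.12), recovering the original weight, mean and variance (illustrative numbers only).

```python
import numpy as np

rng = np.random.default_rng(2)

# Current component to be split (illustrative values)
w0, mu0, s20 = 0.5, 0.3, 0.8

# Auxiliary variables as in Richardson and Green (1997)
u1, u2, u3 = rng.beta(2, 2), rng.beta(2, 2), rng.beta(1, 1)

# Split transformation (8.13)
w1, w2 = w0 * u1, w0 * (1 - u1)
mu1 = mu0 - u2 * np.sqrt(s20) * np.sqrt(w2 / w1)
mu2 = mu0 + u2 * np.sqrt(s20) * np.sqrt(w1 / w2)
s21 = u3 * (1 - u2**2) * s20 * w0 / w1
s22 = (1 - u3) * (1 - u2**2) * s20 * w0 / w2

# Combine transformation (8.12) applied to the two new components
w_c  = w1 + w2
mu_c = (w1 * mu1 + w2 * mu2) / w_c
s2_c = (w1 * (mu1**2 + s21) + w2 * (mu2**2 + s22)) / w_c - mu_c**2

print(np.allclose([w_c, mu_c, s2_c], [w0, mu0, s20]))   # True: the moments are preserved
```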
The acceptance ratio A_sc(K) has the following general structure:
$$A_{sc}(K) = [\text{posterior ratio}] \times [\text{proposal ratio}](K) \times [\text{Jacobian}].$$
The individual components of the above product have the following meaning:
$$[\text{posterior ratio}] = \frac{p(\theta^{j_1, j_2} \mid \text{data})}{p(\theta^{j^*} \mid \text{data})},$$
where the posterior density p(· | data) is given by (8.10). Further, θ^{j1,j2} refers to the parameter vector pertaining to the proposal in the case of the split move and to the current values of the parameters in the case of the combine move. Similarly, θ^{j*} refers to the current parameter vector in the case of the split move and to the proposal in the case of the combine move. The proposal ratio is given by
$$[\text{proposal ratio}](K) = \frac{\pi^{\text{combine}}_{K+1}}{\pi^{\text{split}}_{K}\; p_u(u) \prod_{(i,l):\, r_{i,l} = j^*} \Pr_{\text{alloc}}(r_{i,l})}.$$
Finally, the Jacobian refers to the transformation (8.13) from (w_{j*}, µ_{j*}, σ²_{j*}, u1, u2, u3)′ to (w_{j1}, w_{j2}, µ_{j1}, µ_{j2}, σ²_{j1}, σ²_{j2})′, i.e.
$$[\text{Jacobian}] = \frac{w_{j^*}\, \sigma^2_{j_1}\, \sigma^2_{j_2}\, (\mu_{j_2} - \mu_{j_1})}{\sigma^2_{j^*}\; u_2 (1 - u_2^2)\; u_3 (1 - u_3)}.$$
What remains to be discussed is the choice of the density pu(u) of the auxiliary
random vector u used to generate the proposal in the split move. Richardson
and Green (1997) suggest generating u1, u2 and u3 independently from the
following beta distributions:
u1 ∼ Beta(2, 2),
u2 ∼ Beta(2, 2),
u3 ∼ Beta(1, 1).
Note that at each iteration of the MCMC a new auxiliary vector u is generated, independently of the previous iterations. Brooks, Giudici, and Roberts (2003) showed that some improvement of the MCMC sampling can be achieved by allowing (a) a correlation between the components of u; (b) a serial correlation between the auxiliary vectors u generated at successive iterations of the MCMC. In our practical applications (Sections 8.7, 8.8 and 8.9) we exploited their methodology as well.
Birth-death move

Similarly as in the split-combine move, it is randomly chosen whether the birth or the death move will be attempted. If the current number of mixture components is K, the birth move is attempted with probability π^birth_K and the death move with probability π^death_K = 1 − π^birth_K. Analogously to the probabilities of the split and combine moves we use π^birth_1 = 1, π^birth_{Kmax} = 0 and π^birth_K = π^death_K = 0.5, K = 2, . . . , Kmax − 1.
When the birth move is attempted, the new mixture with K + 1 components is proposed in the following steps:

1. Sample the weight, mean and variance of the new component from the following distributions:
$$w_{j^*} \sim \text{Beta}(1, K), \qquad \mu_{j^*} \sim N(\xi, \kappa), \qquad \sigma^2_{j^*} \sim \text{I-Gamma}(\zeta, \eta). \qquad\qquad (8.14)$$
Note that the expectation of the new weight is equal to 1/(K + 1), i.e. the reciprocal of the number of components in the proposed mixture.

2. In the proposed mixture, rescale the weights such that they, together with the new weight w_{j*}, sum to one, i.e. the weights of the proposed mixture are w′_1, . . . , w′_K, w_{j*} with
$$w'_j = w_j (1 - w_{j^*}), \qquad j = 1, \dots, K. \qquad\qquad (8.15)$$

3. Accept the proposed mixture with K + 1 components with probability
$$\Pr^{\text{birth}}_{\text{accept}} = \min\bigl\{1,\; A_{bd}(K)\bigr\},$$
see below for the form of the acceptance ratio Abd (K). If not accepted,
keep the current K-component mixture.
When it is chosen to propose the death move, the new mixture with K − 1 components is proposed in the following way:

1. Check whether there are any empty mixture components, i.e. components for which N_j = ∑_{i,l} I[r_{i,l} = j] is equal to zero. If there are none, the death move is directly rejected.

2. Choose randomly an empty mixture component. Let j* be the label of this component.

3. In the proposed (K − 1)-component mixture, delete the j*th component and rescale the remaining weights such that they sum to one, i.e. the proposed mixture has the weights
$$w'_j = \frac{w_j}{1 - w_{j^*}}, \qquad j = 1, \dots, K,\; j \neq j^*.$$

4. Accept the proposed mixture with K − 1 components with probability
$$\Pr^{\text{death}}_{\text{accept}} = \min\bigl\{1,\; A_{bd}(K-1)^{-1}\bigr\},$$
where the acceptance ratio A_bd(K − 1) is given below. If not accepted, keep the current K-component mixture.
Analogously to the split-combine move, the acceptance ratio A_bd(K) has the general structure
$$A_{bd}(K) = [\text{posterior ratio}] \times [\text{proposal ratio}](K) \times [\text{Jacobian}],$$
where
$$[\text{posterior ratio}] = \frac{p(\theta^{+} \mid \text{data})}{p(\theta^{-} \mid \text{data})}.$$
The vector θ+ refers to the set of parameters containing the proposed mixture in the case of the birth move and to the set of current parameter values in the case of the death move. Similarly, the vector θ− refers to the set of current parameter values in the case of the birth move and to the set of parameters containing the proposed mixture in the case of the death move. Further, the proposal ratio is given by
$$[\text{proposal ratio}](K) = \frac{\pi^{\text{death}}_{K+1}}{\pi^{\text{birth}}_{K}\; p_{\text{prop}}(w_{j^*}, \mu_{j^*}, \sigma^2_{j^*})},$$
where p_prop(w_{j*}, µ_{j*}, σ²_{j*}) is the density of the proposal step given by (8.14), i.e.
$$p_{\text{prop}}(w_{j^*}, \mu_{j^*}, \sigma^2_{j^*}) = K (1 - w_{j^*})^{K-1} \times \varphi(\mu_{j^*} \mid \xi, \kappa) \times \frac{\eta^{\zeta}}{\Gamma(\zeta)}\, (\sigma^2_{j^*})^{-(\zeta+1)} \exp\Bigl(-\frac{\eta}{\sigma^2_{j^*}}\Bigr).$$
Finally, the Jacobian refers to the transformation (8.15), i.e.
$$[\text{Jacobian}] = (1 - w_{j^*})^{K}.$$
Updating the residuals

The update of the residuals εi,l, i = 1, . . . , N, l = 1, . . . , ni is fully deterministic provided the (i, l)th residual corresponds to an uncensored observation t_{i,l} = t^L_{i,l} = t^U_{i,l}. In that case, the update of εi,l consists of using the AFT expression (8.1) with the current values of the parameters, i.e. the updated εi,l is equal to log(t_{i,l}) − β′x_{i,l} − b′_i z_{i,l}.
When the residual εi,l corresponds to a censored observation with an observed interval ⌊t^L_{i,l}, t^U_{i,l}⌋, its update consists of sampling from the full conditional distribution of εi,l, which is a truncated normal distribution, namely N(µ_{r_{i,l}}, σ²_{r_{i,l}}) truncated to
$$\Bigl\lfloor \log(t^L_{i,l}) - \beta' x_{i,l} - b_i' z_{i,l},\;\; \log(t^U_{i,l}) - \beta' x_{i,l} - b_i' z_{i,l} \Bigr\rfloor.$$
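One simple way to draw from this truncated normal distribution is the inverse-cdf method, sketched below in Python with illustrative values for the truncation limits and for the component mean and standard deviation.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

def rtruncnorm(lower, upper, mean, sd):
    """Draw from N(mean, sd^2) truncated to [lower, upper] by inverting the cdf."""
    a, b = norm.cdf([lower, upper], loc=mean, scale=sd)
    return norm.ppf(a + rng.uniform() * (b - a), loc=mean, scale=sd)

# Residual update for one censored observation (illustrative numbers):
# observed interval on the log scale minus the current regression part
lower = np.log(8.0) - 1.2     # log(t^L) - beta'x - b'z
upper = np.log(9.0) - 1.2     # log(t^U) - beta'x - b'z
eps_new = rtruncnorm(lower, upper, mean=0.1, sd=0.3)
print(eps_new, lower <= eps_new <= upper)
```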
8.3.2 Update of the regression part of the model
The regression part of the model is updated by sampling from the full conditional distribution of each parameter or a set of parameters.
Full conditional for the fixed effects β
Let β_(S) be an arbitrary sub-vector of the vector β, let x_{i,l(S)} be the corresponding sub-vectors of the covariate vectors x_{i,l}, and let x_{i,l(−S)} be their complementary sub-vectors. Similarly, let ν_{β(S)} and ψ_{β(S)} be the appropriate sub-vectors of the hyperparameters ν_β and ψ_β, respectively. Finally, let Ψ_{β(S)} = diag(ψ_{β(S)}). Then
$$\beta_{(S)} \mid \cdots \;\sim\; N\bigl(E(\beta_{(S)} \mid \cdots),\; \mathrm{var}(\beta_{(S)} \mid \cdots)\bigr),$$
with
$$E(\beta_{(S)} \mid \cdots) = \mathrm{var}(\beta_{(S)} \mid \cdots) \times \Bigl\{\Psi_{\beta(S)}^{-1} \nu_{\beta(S)} + \sum_{i=1}^{N}\sum_{l=1}^{n_i} \sigma_{r_{i,l}}^{-2}\, x_{i,l(S)}\, e^{(F)}_{i,l(S)}\Bigr\},$$
$$\mathrm{var}(\beta_{(S)} \mid \cdots) = \Bigl(\Psi_{\beta(S)}^{-1} + \sum_{i=1}^{N}\sum_{l=1}^{n_i} \sigma_{r_{i,l}}^{-2}\, x_{i,l(S)}\, x_{i,l(S)}'\Bigr)^{-1},$$
where e^{(F)}_{i,l(S)} = log(t_{i,l}) − µ_{r_{i,l}} − β′_{(−S)} x_{i,l(−S)} − b′_i z_{i,l}.
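A direct translation of these two expressions into Python is sketched below; the data are artificial and the helper name update_beta_S is ours.

```python
import numpy as np

rng = np.random.default_rng(4)

def update_beta_S(X_S, e_F, sig2_r, nu_S, psi_S):
    """Draw beta_(S) from its full conditional:
    X_S     -- covariate sub-matrix with rows x_{i,l(S)}',
    e_F     -- partial residuals e^(F)_{i,l(S)},
    sig2_r  -- sigma^2_{r_{i,l}} of the currently allocated components,
    nu_S, psi_S -- prior means and variances of beta_(S)."""
    Psi_inv = np.diag(1.0 / psi_S)
    prec = Psi_inv + (X_S.T * (1.0 / sig2_r)) @ X_S
    cov  = np.linalg.inv(prec)
    mean = cov @ (Psi_inv @ nu_S + X_S.T @ (e_F / sig2_r))
    return rng.multivariate_normal(mean, cov)

# Illustrative data: 50 observations, 2 fixed-effect covariates in the sub-vector
n, m = 50, 2
X_S = rng.normal(size=(n, m))
e_F = X_S @ np.array([0.4, -0.3]) + rng.normal(scale=0.2, size=n)
print(update_beta_S(X_S, e_F, sig2_r=np.full(n, 0.04),
                    nu_S=np.zeros(m), psi_S=np.full(m, 100.0)))
```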
Full conditional for the means of random effects γ

There is no loss of generality in assuming that γ = (γ′_(S), γ′_(−S))′. Further, let b_{i(S)}, b_{i(−S)}, ν_{γ(S)}, ψ_{γ(S)} be the corresponding sub-vectors or complementary sub-vectors of the indicated quantities and let Ψ_{γ(S)} = diag(ψ_{γ(S)}). Furthermore, let the inverse of the matrix D be decomposed as
$$D^{-1} = \begin{pmatrix} V_{(S)} & V_{(S,-S)} \\ V_{(S,-S)}' & V_{(-S)} \end{pmatrix};$$
then
$$\gamma_{(S)} \mid \cdots \;\sim\; N\bigl(E(\gamma_{(S)} \mid \cdots),\; \mathrm{var}(\gamma_{(S)} \mid \cdots)\bigr),$$
with
$$E(\gamma_{(S)} \mid \cdots) = \mathrm{var}(\gamma_{(S)} \mid \cdots) \times \Bigl\{\Psi_{\gamma(S)}^{-1} \nu_{\gamma(S)} + V_{(S)} \sum_{i=1}^{N} b_{i(S)} + V_{(S,-S)} \sum_{i=1}^{N} \bigl(b_{i(-S)} - \gamma_{(-S)}\bigr)\Bigr\},$$
$$\mathrm{var}(\gamma_{(S)} \mid \cdots) = \bigl(\Psi_{\gamma(S)}^{-1} + N\, V_{(S)}\bigr)^{-1}.$$

Full conditional for the random effects bi

For the random effects vectors bi:
$$b_i \mid \cdots \;\sim\; N\bigl(E(b_i \mid \cdots),\; \mathrm{var}(b_i \mid \cdots)\bigr), \qquad i = 1, \dots, N,$$
with
$$E(b_i \mid \cdots) = \mathrm{var}(b_i \mid \cdots) \times \Bigl\{D^{-1}\gamma + \sum_{l=1}^{n_i} \sigma_{r_{i,l}}^{-2}\, z_{i,l}\, \bigl(\log(t_{i,l}) - \mu_{r_{i,l}} - \beta' x_{i,l}\bigr)\Bigr\},$$
$$\mathrm{var}(b_i \mid \cdots) = \Bigl(D^{-1} + \sum_{l=1}^{n_i} \sigma_{r_{i,l}}^{-2}\, z_{i,l}\, z_{i,l}'\Bigr)^{-1}.$$
Full conditional for the covariance matrix of random effects D
Finally, D | · · · is an inverse-Wishart distribution with degrees of freedom
equal to df + N and a scale matrix equal to
$$S + \sum_{i=1}^{N} (b_i - \gamma)(b_i - \gamma)'.$$
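Assuming the parametrization of the inverse-Wishart distribution used by scipy.stats.invwishart, one draw from this full conditional can be obtained as in the following sketch (artificial values of df, S, γ and the current b_i).

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(5)

df, q, N = 4, 2, 100
S = np.diag([0.002, 0.002])                        # prior scale matrix
gamma = np.array([0.0, 0.0])
b = rng.multivariate_normal(gamma, np.diag([0.25, 0.01]), size=N)  # current random effects

resid = b - gamma
scale_post = S + resid.T @ resid                   # S + sum_i (b_i - gamma)(b_i - gamma)'
D_new = invwishart.rvs(df=df + N, scale=scale_post)
print(D_new)
```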
8.4 Bayesian estimates of the survival distribution
The simple posterior median or mean is a suitable overall estimate for the components of the parameter vector θ. To characterize the survival distribution underlying the data we also need an estimate of the survival and hazard function, or of the survival density, or of the density of the error term in the AFT model. All these quantities are functions whose expression depends on the parameter vector θ. In Bayesian statistics they are estimated by the means of (posterior) predictive quantities, to be discussed in this section.
8.4.1 Predictive survival and hazard curves and predictive survival densities
For a specific value of the covariates, say x_new and z_new, the predictive survival function is given by
$$S(t \mid \text{data}, x_{\text{new}}, z_{\text{new}}) = \int S(t \mid \theta, \text{data}, x_{\text{new}}, z_{\text{new}})\; p(\theta \mid \text{data})\, d\theta$$
for any t > 0. Further, once the parameter vector θ is known the data do not bring any additional information and hence
$$S(t \mid \theta, \text{data}, x_{\text{new}}, z_{\text{new}}) = S(t \mid \theta, x_{\text{new}}, z_{\text{new}}).$$
Additionally, analogously to Section 7.4, the quantity S(t | θ, x_new, z_new) is expressed using the model parameters as
$$S(t \mid \theta, x_{\text{new}}, z_{\text{new}}) = 1 - \sum_{j=1}^{K} w_j\, \Phi\bigl(\log(t) - \beta' x_{\text{new}} - b' z_{\text{new}} \mid \mu_j, \sigma_j^2\bigr). \qquad\qquad (8.16)$$
The MCMC estimate of the predictive survival function is then given, using expression (4.13), by
$$\hat{S}(t \mid \text{data}, x_{\text{new}}, z_{\text{new}}) = \frac{1}{M} \sum_{m=1}^{M} S\bigl(t \mid \theta^{(m)}, x_{\text{new}}, z_{\text{new}}\bigr), \qquad\qquad (8.17)$$
where θ^(m), m = 1, . . . , M is the MCMC sample from the posterior (predictive) distribution. All components of θ^(m) are directly available except b^(m). These must be additionally sampled from N_q(γ^(m), D^(m)).
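A compact way of evaluating (8.17) from stored MCMC output is sketched below; the list `draws` is a placeholder for the actual chain and contains only two artificial draws so that the sketch runs.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)

def predictive_survival(t, x_new, z_new, draws):
    """MCMC estimate (8.17) of S(t | data, x_new, z_new).
    `draws` is a list of dictionaries, one per MCMC iteration, with keys
    w, mu, sig2, beta, gamma, D (placeholder for the stored chain)."""
    t = np.atleast_1d(t)
    S = np.zeros_like(t, dtype=float)
    for d in draws:
        b = rng.multivariate_normal(d["gamma"], d["D"])   # b^(m) ~ N_q(gamma^(m), D^(m))
        shift = d["beta"] @ x_new + b @ z_new
        z = np.log(t)[:, None] - shift
        S += 1.0 - np.sum(d["w"] * norm.cdf(z, loc=d["mu"],
                                            scale=np.sqrt(d["sig2"])), axis=1)
    return S / len(draws)

# Two artificial 'posterior draws' just to make the sketch runnable
draws = [dict(w=np.array([0.3, 0.7]), mu=np.array([-0.5, 0.8]),
              sig2=np.array([0.2, 0.3]), beta=np.array([0.4]),
              gamma=np.array([0.0]), D=np.array([[0.25]])) for _ in range(2)]
print(predictive_survival([1.0, 2.0, 5.0], x_new=np.array([1.0]),
                          z_new=np.array([1.0]), draws=draws))
```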
Analogously, predictive hazard curves and predictive survival densities are obtained using the relationship
$$p(t \mid \theta, x_{\text{new}}, z_{\text{new}}) = t^{-1} \sum_{j=1}^{K} w_j\, \varphi\bigl(\log(t) - \beta' x_{\text{new}} - b' z_{\text{new}} \mid \mu_j, \sigma_j^2\bigr) \qquad\qquad (8.18)$$
for the survival density and the relationship
$$\hbar(t \mid \theta, x_{\text{new}}, z_{\text{new}}) = \frac{p(t \mid \theta, x_{\text{new}}, z_{\text{new}})}{S(t \mid \theta, x_{\text{new}}, z_{\text{new}})} \qquad\qquad (8.19)$$
for the hazard.

8.4.2 Predictive error densities
Averaging the error density (8.2) across the MCMC run, conditionally on a fixed value of K, gives a Bayesian predictive error density estimate of the mixture with K components, i.e. an estimate of
$$E\bigl\{g_\varepsilon(e) \mid K, \text{data}\bigr\} = \int_{\Theta_K} g_\varepsilon(e)\; p(\theta \mid K, \text{data})\, d\theta, \qquad e \in \mathbb{R}, \qquad\qquad (8.20)$$
where the domain of integration, Θ_K, is the subset of the overall parameter space pertaining to mixtures with a fixed number K of mixture components.
Averaging further across the values of K gives an estimate of
$$E\bigl\{g_\varepsilon(e) \mid \text{data}\bigr\} = \int g_\varepsilon(e)\; p(\theta \mid \text{data})\, d\theta, \qquad e \in \mathbb{R}, \qquad\qquad (8.21)$$
the overall Bayesian predictive density estimate of the error distribution.
8.5 Bayesian estimates of the individual random effects
In some situations, for example when discrimination between clusters is of interest, an estimate of the individual random effects must be provided. In Bayesian statistics, these estimates are given by some characteristic of the posterior distribution, for instance by the posterior mean E(bi | data). The precision of the estimate can be evaluated using a credible interval.
When using MCMC to draw the sample from the posterior distribution, we estimate each individual random effect vector bi by the average of the sampled values, i.e.
$$\hat{b}_i = \frac{1}{M} \sum_{m=1}^{M} b_i^{(m)},$$
where M is the number of MCMC iterations and b_i^{(m)} is the value of bi sampled at the mth iteration. The credible interval is obtained by taking sample quantiles of the MCMC sample.
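With the sampled values of the random effects stored in an array, the estimates and the equal-tail credible intervals are obtained directly, as in the following sketch on an artificial chain.

```python
import numpy as np

rng = np.random.default_rng(7)

# Artificial MCMC sample of a univariate random effect b_i: M draws for N clusters
M, N = 1000, 5
b_samples = rng.normal(loc=np.linspace(-1, 1, N), scale=0.3, size=(M, N))

b_hat = b_samples.mean(axis=0)                                     # posterior means
ci_low, ci_upp = np.quantile(b_samples, [0.025, 0.975], axis=0)    # 95% equal-tail CIs

for i in range(N):
    print(f"cluster {i+1}: {b_hat[i]: .3f}  ({ci_low[i]: .3f}, {ci_upp[i]: .3f})")
```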
8.6 Simulation study
A simulation study was carried out to explore the performance of the proposed method. The setting mimics a study with clustered data where a continuous covariate as well as a dichotomous covariate might influence the distribution of the event time. At the same time there might be an overall heterogeneity between clusters as well as a possible interaction between the cluster effect and the effect of the dichotomous covariate. The factual setting used to generate the ‘true’ data was motivated by the results of the WIHS analysis presented in Section 7.6.
Namely, ‘true’ uncensored data were generated according to the model
$$\log(T_{i,l}) = 1.5 + \beta\, x_{i,l} + b_{i,1} + b_{i,2}\, z_{i,l} + \varepsilon_{i,l}, \qquad i = 1, \dots, N,\; l = 1, \dots, n_i,$$
where β = 0.4, γ = −0.8, (b_{i,1}, b_{i,2})′ ∼ N_2((0, γ)′, D), with var(b_{i,1}) = 0.5², var(b_{i,2}) = 0.1², corr(b_{i,1}, b_{i,2}) = 0.4.
The covariate x_{i,l} was generated according to the extreme-value distribution of a minimum, with location equal to 8.5 and scale equal to 1, inspired more or less by the log2(1 + CD4 count) covariate in the WIHS data set. The covariate z_{i,l} was binary, taking the value 1 with probability 0.4. The error term ε_{i,l} was generated from a standard normal distribution, from a Cauchy distribution, from a Student t2 distribution, from a standardized extreme value distribution, and from a normal mixture 0.4 N1(−2.000, 0.25) + 0.6 N1(1.333, 0.36), respectively. Two sample sizes were considered: (1) N = 50, ni = 5 for all i (small sample size) and (2) N = 100, ni = 10 for all i (large sample size). Each simulation involved 100 replications.
All event times were interval-censored by simulating 120 consecutive ‘assessment times’ for each ‘patient’ in the dataset (the first assessment time was drawn from N(7, 1), the times between consecutive assessments from N(6, 0.25)). At each assessment, between 0.2% and 0.6% of randomly selected patients were withdrawn from the study, resulting in approximately 15% right-censored observations. For each dataset, the estimates were computed using the Bayesian normal mixture cluster-specific AFT model, using the Bayesian cluster-specific model with a normal error, and using the maximum-likelihood AFT model with a normal error and ignoring the random effects structure.
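A condensed Python sketch of this data-generating mechanism (showing the normal-mixture error case; the patient-withdrawal process that produces the right-censored observations is omitted for brevity) reads as follows.

```python
import numpy as np

rng = np.random.default_rng(8)

N, n_i = 50, 5                                       # 'small sample size' setting
beta = 0.4
gamma2 = -0.8                                        # mean of b_{i,2}
rho, sd1, sd2 = 0.4, 0.5, 0.1
D = np.array([[sd1**2, rho * sd1 * sd2],
              [rho * sd1 * sd2, sd2**2]])            # covariance matrix of (b_{i,1}, b_{i,2})

def r_error(size):
    """Normal-mixture error: 0.4 N(-2.000, 0.25) + 0.6 N(1.333, 0.36)."""
    comp = rng.random(size) < 0.4
    return np.where(comp, rng.normal(-2.0, 0.5, size), rng.normal(1.333, 0.6, size))

def censor(t):
    """Interval-censor an event time by 120 consecutive 'assessment times'."""
    visits = np.cumsum(np.r_[rng.normal(7, 1), rng.normal(6, 0.5, 119)])
    k = np.searchsorted(visits, t)
    lower = 0.0 if k == 0 else visits[k - 1]
    upper = np.inf if k == len(visits) else visits[k]   # right-censored if beyond last visit
    return lower, upper

data = []
for i in range(N):
    b1, b2 = rng.multivariate_normal([0.0, gamma2], D)
    x = 8.5 - rng.gumbel(0.0, 1.0, n_i)                 # extreme-value (minimum) covariate
    z = rng.binomial(1, 0.4, n_i)                       # dichotomous covariate
    logT = 1.5 + beta * x + b1 + b2 * z + r_error(n_i)
    for l in range(n_i):
        lo, up = censor(np.exp(logT[l]))
        data.append((i, x[l], z[l], lo, up))

print(len(data), data[0])
```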
Appendix B, Section B.2 gives selected results of the simulation. Average estimates of the regression parameters, their standard errors and mean squared errors are given in Tables B.7 and B.8. The results related to the covariance matrix D of the random effects are given in Tables B.9 – B.11. It is seen that, in most cases, the Bayesian mixture approach performs better than the incorrectly specified models. A large difference in favour of the Bayesian mixture model is seen in the case of a normal mixture or Cauchy error distribution.
Additionally, when the Bayesian mixture approach is used, the error distribution and consequently also the hazard and survival functions are reproduced closely, which is not always the case when the Bayesian normal model is used; see Figures B.4 – B.9.
8.7 Example: Signal Tandmobiel® study – clustered interval-censored data
In Section 7.7 we analyzed separately the emergence times of teeth 14 and 15. In this section, we extend this analysis by including all permanent premolars, i.e. teeth 14, 15, 24, 25, 34, 35, 44, 45 in Figure 1.1, and additionally, all eight teeth will be analyzed jointly. This allows us to answer not only the question of what the impact of different covariates on the emergence time is, but also the question concerning the relationship between the emergence times of different teeth. A random sample of 500 boys and 500 girls will be used for the inference.
The response variable Ti,l , i = 1, . . . , 1 000, l = 1, . . . , 8, refers to the age of
emergence of the lth permanent premolar of the ith child. As indicated in
Sections 1.1 and 7.7 the response variable is interval-censored with intervals
of length equal to approximately 1 year. For reasons stated in Section 7.7 we
shifted the time origin of the AFT model to 5 years of age, i.e. by replacing
Ti,l by Ti,l − 5 in the model (8.1).
Further, Leroy et al. (2003b) have shown that there is horizontal symmetry with respect to emergence, i.e. the same emergence distribution can be assumed at horizontally symmetric positions (e.g., for teeth 14 and 24). In model (8.1), this leads to the random effect vector
$$b_i = (b_{i,1}, \dots, b_{i,4})' \qquad \text{with} \qquad z_{i,l} = (1,\ \text{man4}_{i,l},\ \text{max5}_{i,l},\ \text{man5}_{i,l})',$$
where man4_{i,l}, max5_{i,l} and man5_{i,l} are dummies for the mandibular first premolars (teeth 34, 44), the maxillary second premolars (teeth 15, 25) and the mandibular second premolars (teeth 35, 45), respectively. With this model specification, apart from the random variation given by the error term ε_{i,l}, the terms
$$b^*_{i,\text{max4}} = b_{i,1}, \qquad b^*_{i,\text{man4}} = b_{i,1} + b_{i,2}, \qquad b^*_{i,\text{max5}} = b_{i,1} + b_{i,3}, \qquad b^*_{i,\text{man5}} = b_{i,1} + b_{i,4}$$
determine how the log-emergence times of a pair of horizontally symmetric teeth of a single child differ from the population average. As fixed effects we used gender ≡ girl, dmf, the interaction between gender and dmf, and all two-way interaction terms between gender, dmf and the dummies for the pairs of horizontally symmetric teeth, i.e.
$$x_{i,l} = (\text{gender}_i,\ \text{dmf}_{i,l},\ \text{gender}_i * \text{dmf}_{i,l},\ \text{gender}_i * \text{man4}_{i,l},\ \text{gender}_i * \text{max5}_{i,l},\ \text{gender}_i * \text{man5}_{i,l},\ \text{dmf}_{i,l} * \text{man4}_{i,l},\ \text{dmf}_{i,l} * \text{max5}_{i,l},\ \text{dmf}_{i,l} * \text{man5}_{i,l})'.$$
See Section 7.7 for the definition of the covariate dmf.
For the inference we sampled two chains, each of length 20 000 with 1:3 thinning which took about 27 hours on a Pentium IV 2 GHz PC with 512 MB
RAM. The first 1 500 iterations of each chain were discarded. Convergence
was evaluated by the method of Gelman and Rubin (1992).
Table 8.1: Signal Tandmobiel® study. Posterior medians and 95% equal-tail credible intervals for the effect of different covariates and the error variance.

Effect              Posterior median   95% CI
Maxilla 4
  intercept         1.7566             (1.7338, 1.7822)
  gender            −0.0680            (−0.1003, −0.0368)
  dmf               −0.0457            (−0.0631, −0.0284)
Maxilla 5
  intercept         1.9001             (1.8729, 1.9283)
  gender            −0.0504            (−0.0844, −0.0163)
  dmf               −0.0317            (−0.0500, −0.0135)
Mandible 4
  intercept         1.7242             (1.7019, 1.7484)
  gender            −0.0668            (−0.0972, −0.0375)
  dmf               −0.0201            (−0.0378, −0.0032)
Mandible 5
  intercept         1.9060             (1.8805, 1.9323)
  gender            −0.0654            (−0.0965, −0.0323)
  dmf               −0.0090            (−0.0283, 0.0098)
All teeth
  gender ∗ dmf          0.0105         (−0.0073, 0.0279)
  log(scale) log(σ)     −2.2580        (−2.3111, −2.1721)
  error scale σ         0.1046         (0.0992, 0.1139)

8.7.1 Prior distribution
The initial maximum-likelihood AFT model, fitted for each tooth separately with a normal error distribution and without random effects, estimated the intercept as 1.8 and the scale as 0.25. Following the suggestions of Section 8.2.3 we used the following values of the hyperparameters: ξ = 1.8, κ = (3 · 0.25)², ζ = 2, h1 = 0.2, h2 = 0.1, δ = 1. For the number of mixture components, K, a truncated Poisson prior with λ = 5, reflecting our prior belief that the error distribution is skewed, and Kmax = 30 was used. All β and γ parameters were assigned a diffuse N(0, 100) prior. For the covariance matrix D of the random effects we used an inverse-Wishart prior with df = 4; due to the fact that 1 000 clusters are involved in the data set, even a higher value could be used with a negligible impact on the results. The prior scale matrix S was equal to diag(0.002) (corresponding to an inverse-gamma(df/2, 0.001) prior in the univariate case).
Table 8.2: Signal Tandmobiel® study. Posterior medians, 95% equal-tail credible intervals and Bayesian two-sided p-values for the effect of dmf > 0 for the two genders and the different teeth.

Tooth        Gender   Post. median   95% CI                p-value
Maxilla 4    Girl     −0.0352        (−0.0522, −0.0185)    < 0.001
             Boy      −0.0457        (−0.0631, −0.0284)    < 0.001
Maxilla 5    Girl     −0.0212        (−0.0390, −0.0035)    0.019
             Boy      −0.0317        (−0.0500, −0.0135)    < 0.001
Mandible 4   Girl     −0.0098        (−0.0267, 0.0070)     0.255
             Boy      −0.0201        (−0.0378, −0.0032)    0.021
Mandible 5   Girl     0.0015         (−0.0162, 0.0193)     0.870
             Boy      −0.0090        (−0.0283, 0.0098)     0.353

8.7.2 Results for the regression and error parameters
The effect of the different covariates on emergence, separately for each tooth, is given in Table 8.1. The results in Table 8.1 were obtained as MCMC summaries of the appropriate combinations of model parameters. For example, the intercept effect for the maxillary teeth 4 equals the error mean α = ∑_{j=1}^K w_j µ_j. For the maxillary teeth 5, the intercept effect equals α + γ(max5), where γ(max5) is the mean of the random effect b_{i,3}. The intercept effects for the remaining teeth are defined in an analogous manner. The effect of gender in Table 8.1 is defined as β(gender) for the maxillary teeth 4, β(gender) + β(gender ∗ max5) for the maxillary teeth 5, and analogously for the remaining teeth. Finally, the effect of dmf is given by β(dmf) for the maxillary teeth 4, by β(dmf) + β(dmf ∗ max5) for the maxillary teeth 5, and analogously for the remaining teeth. The error scale refers to the summary for the standard deviation σ of the error distribution, i.e.
$$\sigma = \sqrt{\sum_{j=1}^{K} w_j (\mu_j^2 + \sigma_j^2) - \alpha^2}.$$
The row labeled log(scale) refers to the summary for log(σ).
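For one sampled parameter vector, these two derived quantities are computed as in the short sketch below (illustrative values of the mixture parameters).

```python
import numpy as np

# One illustrative draw of the error-mixture parameters
w    = np.array([0.2, 0.5, 0.3])
mu   = np.array([1.55, 1.75, 1.95])
sig2 = np.array([0.004, 0.006, 0.009])

alpha = np.sum(w * mu)                                   # error mean (intercept effect)
sigma = np.sqrt(np.sum(w * (mu**2 + sig2)) - alpha**2)   # error standard deviation (scale)
print(round(alpha, 4), round(sigma, 4))
```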
Most of the quantities in Table 8.1 are comparable to the results of the earlier analysis (see Section 7.7) given in Table 7.5. Remember, however, that in Section 7.7 we analyzed separately only one maxillary tooth 4 (tooth 14) and one maxillary tooth 5 (tooth 15). Furthermore, in contrast to the present analysis, in Section 7.7 we allowed the error variance to depend on the covariates.
In this analysis, the main interest lies in the effect of dmf on emergence. This
can be evaluated from Table 8.2 that shows posterior summary statistics for
the effect of dmf (appropriate linear combinations of β parameters) for boys
Table 8.3: Signal Tandmobiel® study. Posterior medians and 95% equal-tail credible intervals for standard deviations and correlations of tooth-specific linear combinations of random effects.

Parameter                    Posterior median   95% CI
sd(b*_{i,max4})              0.204              (0.192, 0.218)
sd(b*_{i,man4})              0.198              (0.186, 0.211)
sd(b*_{i,max5})              0.205              (0.190, 0.221)
sd(b*_{i,man5})              0.202              (0.187, 0.218)
corr(b*_max4, b*_man4)       0.887              (0.856, 0.914)
corr(b*_max4, b*_max5)       0.914              (0.887, 0.938)
corr(b*_max4, b*_man5)       0.842              (0.804, 0.874)
corr(b*_man4, b*_max5)       0.793              (0.749, 0.832)
corr(b*_man4, b*_man5)       0.895              (0.864, 0.923)
corr(b*_max5, b*_man5)       0.847              (0.810, 0.880)
and girls and for the four pairs of horizontally symmetric teeth. It is seen that caries on the primary predecessor significantly accelerates the emergence of the permanent successor in the case of the maxillary teeth. For the mandibular teeth, a slight effect is observed only for the first premolars of boys. Additionally, besides the effect of dmf, the emergence process of girls is ahead of that of boys.
8.7.3 Inter-teeth relationship
Further, Table 8.3 shows posterior summary statistics for the standard deviations and correlations of the above defined tooth-specific linear combinations b*_{i,max4}, b*_{i,man4}, b*_{i,max5}, b*_{i,man5} of the random effects b_{i,1}, . . . , b_{i,4}. It shows how important the child effect is and how strongly the different teeth in one mouth are correlated. The posterior medians of all standard deviations in Table 8.3 are about 0.2, which is approximately two times higher than the posterior median of the standard deviation of the error distribution, which was equal to 0.1. Posterior medians of all correlation parameters lie between 0.79 and 0.91.
8.7.4 Predictive emergence and hazard curves
Predictive emergence curves (predictive cumulative distribution functions) computed using the approach described in Section 8.4.1 are shown in Figure 8.2.

Figure 8.2: Signal Tandmobiel® study. Posterior predictive emergence curves (proportion emerged versus age in years; panels: Maxilla 4, Mandible 4, Maxilla 5 and Mandible 5, separately for girls and boys). Solid line: dmf = 1, dotted-dashed line: dmf = 0.

Figure 8.3: Signal Tandmobiel® study. Posterior predictive hazard curves (hazard versus age in years; same panels as Figure 8.2). Solid line: dmf = 1, dotted-dashed line: dmf = 0.

Figure 8.4: Signal Tandmobiel® study. Posterior predictive error density; upper panel: unconditional estimate, lower panel: estimates conditional on K = 4, . . . , 10.
In agreement with the results discussed in Section 8.7.2, an almost negligible difference is observed between the predictive emergence curves for
dmf > 0 and dmf = 0 for mandibular teeth. The same is true for the predictive hazard functions of emergence shown in Figure 8.3. As expected (see
Section 7.7.2 for the reasons why) the predictive hazard functions are all
increasing.
8.7.5 Predictive error density
In our sample, the number of mixture components K ranged from 2 to 24, while mixtures with K ∈ {6, 7, 8} each occupied more than 10% of the sample, with the highest frequency for K = 7 (11.2%). Mixtures with K ≥ 17 each took less than 1.5% of the sample. Apparently, the model did not suffer from the technical restriction given by Kmax = 30.
Figure 8.4 further shows both the overall estimate of the predictive error density (8.21) and the conditional (given K) estimates of the predictive error density (8.20). It is seen that the conditional estimates for the most frequent numbers of components are all almost the same.
8.7.6 Conclusions
This section presented an analysis of clustered data where, moreover, a closer dependence between some observations within a cluster could be assumed. Since in Section 7.7 we have shown, in a similar analysis of the same data set, that the error variance might depend on covariates, the model presented in this section might be improved by allowing the variances of the mixture components determining the error distribution to depend on covariates as well. However, in the current mixture setting such an extension is not trivial and requires further research.
8.8 Example: CGD data – recurrent events analysis
The chronic granulomatous disease (CGD) trial has been introduced in Section 1.2. The response variable Ti,l is the time to the lth (recurrent) infection of the ith patient, i = 1, . . . , 128, l = 1, . . . , ni, 1 ≤ ni ≤ 8. Thus a patient represents a cluster and the infection times are the individual observations.
Figure 8.5: CGD data. Scaled histograms of sampled standard deviations of the random effect bi for different prior distributions (d ∼ I-Gamma(0.001, 0.001) and √d ∼ Unif(0, 100), Unif(0, 50), Unif(0, 10)).

The problem of recurrent events in this data set was discussed by several authors in the literature. Among others, Therneau and Hamilton (1997) used
the CGD data to illustrate several approaches for recurrent event analysis
based on the Cox’s PH model. Vaida and Xu (2000) used this dataset to illustrate the PH model with random effects. They specify the hazard function
for the (i, l)th event as
$$\hbar(t \mid x_{i,l}, z_{i,l}, b_i) = \hbar_0(t) \exp(\beta' x_{i,l} + b_i' z_{i,l}),$$
where ℏ0 is a baseline hazard function, β the regression parameter vector for the ‘fixed’ effects, x_{i,l} a covariate vector of ‘fixed’ effects, bi a random effect vector and z_{i,l} the corresponding covariates; see also Section 3.4.1. They use a normal distribution for bi.
In this section we present an analysis of the CGD data using the Bayesian CS
normal mixture AFT model that could be considered as an AFT counterpart
of the random effects PH model of Vaida and Xu (2000). In the model
formula (8.1) a univariate random effect bi is used with zi,l ≡ 1. As fixed
effects covariates we used the same covariates as Vaida and Xu (2000), namely
the xi,l vector equals
xi,l = (trtmti , inheri , agei , cortici , prophyi , gender i ,
hcatUSotheri , hcatEUAmsteri , hcatEUotheri )′ ,
where trtmt equals 1 for the gamma interferon group and 0 for the placebo group, inher equals 1 for patients with the autosomal recessive pattern of inheritance and 0 for patients with the X-linked pattern, age is the age of the patient in years, cortic equals 1 if corticosteroids are used and 0 otherwise, prophy equals 1 if prophylactic antibiotics are used and 0 otherwise, gender equals 1 for females and 0 for males, and finally hcatUSother, hcatEUAmster, and hcatEUother are dummies for the hospital categories US–other, EU–Amsterdam, and EU–other, respectively.
For the inference we sampled two chains, each of length 60 000 with 1:6 thinning which took about 5 minutes on a Pentium IV 2 GHz PC with 512 MB
RAM. The first 30 000 iterations of each chain were discarded. The convergence was evaluated by a critical examination of the trace and autocorrelation
plots and using the method of Gelman and Rubin (1992).
8.8.1 Prior distribution
The initial maximum-likelihood AFT model with a normal error distribution and without random effects gave an estimate of the intercept equal to 3.66 and a scale equal to 1.69. Following the suggestions made in Section 8.2.3 we used the following values of the hyperparameters: ξ = 3.66, κ = 25 ≈ (3 · 1.69)², ζ = 2, h1 = 0.2, h2 = 0.1, δ = 1. For the number of mixture components, K, a truncated Poisson prior with λ = 5, reflecting our prior belief that the error distribution is skewed, and Kmax = 30 was used. Prior means of all regression parameters were equal to 0 and their prior variances to 1 000.
For the variance d of the random effect we tried either an inverse-gamma I-Gamma(0.001, 0.001) prior (df = 0.002, s = 0.002 in terms of the inverse-Wishart distribution) or a uniform Unif(0, √s) prior on √d with √s = 100, 50, 10. As discussed in Gelman (2006, Sections 2.2 and 4.3), with the I-Gamma(ε, ε) prior the inference might become very sensitive to the choice of ε. This is not the case for the uniform distribution on √d, where the choice of the range of the uniform distribution has practically no impact on the results (provided the upper limit of the uniform distribution is not chosen too small). In Figure 8.5, we show scaled histograms of the sampled values of √d for the above mentioned prior distributions. It is seen that the inverse-gamma prior leads to a high posterior probability mass close to zero, a phenomenon driven by the prior distribution, which has a peak close to zero. On the other hand, with the uniform prior on √d, the posterior distribution is clearly separated from zero, with the region of support obviously driven by the data. Moreover, in agreement with the findings of Gelman (2006), the posterior distribution is practically the same irrespective of the choice of the range of the uniform prior. The results presented below are based on the Unif(0, 100) prior on √d (practically the same results were obtained with the remaining uniform priors on √d as well).
8.8.2 Effect of covariates on the time to infection
Table 8.4 shows posterior summary statistics for the effect of the included covariates on the distribution of the time to infection. The reported Bayesian p-value is simultaneous in the case of the covariate hospital category. It is seen that the effect of gamma interferon is highly significant, increasing the time to infection by a factor of exp(1.273) = 3.57. The effect of the pattern of inheritance is slightly non-significant at the conventional 5% level. On the other hand, an increase of age by 1 year significantly increases the infection-free time by a factor of exp(0.047) = 1.05. Further, the use of corticosteroids should be avoided as it significantly decreases the infection-free time by a factor of exp(−2.767) = 0.06, whereas the use of prophylactic antibiotics significantly increases the infection-free time by a factor of exp(1.191) = 3.29. The infection-free time is further significantly higher for females, being exp(1.476) = 4.38 times higher than for males. Finally, the effect of the hospital category is slightly non-significant; however,
Table 8.4: CGD data. Posterior medians, 95% equal-tail credible intervals and Bayesian two-sided (simultaneous) p-values for the effect of covariates.

Parameter                                      Posterior median   95% CI              p-value
Treatment group: gamma interferon              1.273              (0.437, 2.195)      0.001
Pattern of inheritance: autosomal recessive    −0.914             (−1.829, 0.071)     0.067
Age                                            0.047              (0.007, 0.092)      0.022
Use of corticosteroids: yes                    −2.767             (−5.727, −0.161)    0.038
Use of prophylactic antibiotics: yes           1.191              (0.150, 2.330)      0.023
Gender: female                                 1.476              (0.050, 3.111)      0.042
Hospital category                                                                     0.065
  US – other                                   0.461              (−0.481, 1.451)
  Europe – Amsterdam                           1.729              (0.183, 3.377)
  Europe – other                               1.268              (0.017, 2.637)
Table 8.5: CGD data. Posterior medians and 95% equal-tail credible intervals for the moments of the error distribution and the standard deviation of the random effects.

Parameter                                  Posterior median   95% CI
Moments of the error distribution
  Intercept α                              4.088              (2.532, 5.527)
  Error scale σ                            2.495              (1.399, 4.083)
Standard deviation of the random effects
  sd(bi)                                   0.748              (0.183, 1.395)
Figure 8.6: CGD data. Predictive survival (upper panel) and hazard (lower panel) curves for males and females taking either treatment or placebo. Remaining covariates were fixed either to the mean value (age = 14.6) or to the most common value (X-linked pattern of inheritance, no use of corticosteroids, use of prophylactic antibiotics, and hospital category US – other).
Figure 8.7: CGD data. Posterior predictive error density gε(e): unconditional (upper panel) and conditional on the number of mixture components K (lower panel; shown for K = 1 (2.92%), K = 5 (2.17%), K = 10 (2.49%), K = 15 (2.58%), K = 20 (3.22%), K = 25 (4.69%) and K = 30 (6.77%)).
the posterior median suggests that the best results are obtained in the hospitals of category Europe – Amsterdam, whereas the worst results are obtained in the hospital category US – NIH.
Although the parameters of the AFT model are not directly comparable to the parameters of the PH model, we can compare at least the direction of the relationships obtained here and by Vaida and Xu (2000), who used the PH model. Care must be taken as Vaida and Xu (2000) use a different 0-1 coding of the dichotomous variables than we do. Nevertheless, we conclude that the directions of the relationships between the covariates and the time to infection found by the AFT model are the same as the findings obtained using the PH model.
The effect of the treatment (gamma interferon) is also seen in Figure 8.6, where we plot predictive survival and hazard curves for males and females taking either gamma interferon or placebo. The remaining covariates were fixed either to their mean or to their most common value.
Figure 8.8: CGD data. Posterior means and 95% equal-tail credible intervals
for individual random effects. Patients are sorted according to the number
of records in the data set.
8.8.3 Predictive error density and variability of random effects
Posterior summary statistics for the moments of the error distribution, computed in the same way as indicated in Section 8.7.2, and for the standard
deviation of the random effects are given in Table 8.5.
The estimate of the error density is given in Figure 8.7. In this case, also mixtures with a high number of components were quite highly represented in the sample. For clarity, the conditional estimates of the error density (given K) are plotted only for chosen values of K. A higher number of components is needed firstly because of the clear skewness of the error density and secondly because of the somewhat higher probability mass in the right tail of the density.
8.8.4 Estimates of individual random effects
Figure 8.8 shows posterior means and 95% equal-tail posterior credible intervals for the values of the individual random effects bi , i = 1, . . . , 128. For the purpose of plotting, the patients were sorted according to the number of records they have in the data set. Since there are no big differences in the follow-up times for different patients, fewer records in the data set generally imply longer infection-free periods. Indeed, for the patients with only one recorded infection time practically all estimated individual random effects lie above zero, the mean of bi . Furthermore, a decreasing trend in the estimated individual random effects can be observed as the number of recorded infection times increases.
8.8.5 Conclusions
In this section we have shown how the Bayesian normal mixture CS AFT model can be used to analyse recurrent events data. It might be useful to include the covariate number of infections in the model. However, such a covariate would be time-dependent and it is not possible to include covariates of this type in any model where the (baseline) survival distribution is modelled via the density and not the hazard function.
8.9 Example: EBCP data – multicenter study
In Section 1.4 we have introduced a multicenter randomized clinical trial aiming to evaluate the effect of perioperative chemotherapy given in addition to surgery, compared to surgery alone, on the progression-free survival (PFS) time of early breast cancer patients, while controlling for several baseline covariates.
In Figure 1.3 we have indicated that there possibly exists heterogeneity between centra with respect to the PFS distribution. Additionally, there is some evidence for heterogeneity with respect to the treatment effect. In this section, we perform an analysis using the Bayesian normal mixture cluster-specific AFT model that addresses all these issues.
The cluster is represented by the center, i.e. i = 1, . . . , 14; within the ith center, ni patients were involved in the trial, with 25 ≤ ni ≤ 902. As the response Ti,l , i = 1, . . . , 14, l = 1, . . . , ni , we use the PFS time in days of the lth patient treated in the ith center.
To allow for the baseline heterogeneity across centra and also for the heterogeneity with respect to the treatment effect we include a bivariate random
effect bi = (bi,1 , bi,2 )′ in the CS AFT model (8.1). The covariate vector z i,l
for the random effects has the form
z i,l = (1, trtmtGroupi,l )′ ,
where trtmtGroupi,l equals one if the (i, l)th patient underwent surgery alone
and equals zero if she additionally got the course of perioperative chemotherapy.
Additionally, as fixed effects we include all baseline factors mentioned in
Section 1.4 in the model. Namely, the covariate vector xi,l in the model (8.1)
equals
xi,l = (ageMidi,l , ageOldi,l , tySui,l , tumSizi,l , nodSti,l , otDisi,l ,
regionNLi , regionPLi , regionSEi , regionSAi )′ ,
where ageMid and ageOld are dummies for the age groups 40–50 years and
older than 50 years, respectively with the group younger than 40 years as
the baseline, tySu being equal to 1 for the breast-conserving surgery and
equal to 0 for mastectomy, tumSiz being equal to 1 for the tumors of size
≥ 2cm and equal to 0 for tumors of size < 2cm, nodSt being equal to 1 for
the positive and equal to 0 for the negative pathological nodal status, otDis
being equal to 1 if there was another disease present and equal to 0 otherwise.
Finally, covariates regionNL, regionPL, regionSE, regionSA are dummies for the
geographical location of the center with France as the baseline.
Since the covariate region is categorical and center-specific, it should be possible to reveal, at least partially, the regional structure of the centra from the estimates of their individual random effects bi,1 , i = 1, . . . , 14, when we omit the covariate region from the model. To show this, we additionally fitted a model where all dummies for the region were omitted from the covariate vector x (model without region).
For the inference we sampled two chains, each of length 200 000 with 1:5 thinning, which took about 32 hours on a Pentium IV 2 GHz PC with 512 MB RAM. The first 150 000 iterations of each chain were discarded. Convergence was evaluated by a critical examination of the trace and autocorrelation plots and by the method of Gelman and Rubin (1992).
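The Gelman and Rubin (1992) diagnostic compares between-chain and within-chain variability; values of the potential scale reduction factor close to one indicate that the chains have mixed. As a rough, self-contained illustration only (the function below is hypothetical and is not the implementation used for this analysis), the factor can be computed for one scalar parameter as follows:

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for one scalar parameter.

    chains: array of shape (m, n) holding m parallel chains of length n,
            taken after burn-in and thinning.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)            # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled variance estimate
    return np.sqrt(var_hat / W)

# Example with two artificial chains; a value near 1 suggests convergence.
rng = np.random.default_rng(1)
print(gelman_rubin(rng.normal(size=(2, 10_000))))
```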
8.9.1 Prior distribution
The initial maximum-likelihood AFT model without random effects gave an estimate of the intercept equal to 9.43 and an estimate of the error scale equal to 1.73. As the prior mean for the mixture components, ξ, we have taken zero to show that the posterior for the mixture means manages to shift away from a slightly misspecified location. To set up the remaining hyperparameters we followed closely the guidelines given in Section 8.2.3, namely κ = 40, which is slightly higher than (3 · 1.73)², ζ = 2, h1 = 0.2, h2 = 0.1, δ = 1. For the number of mixture components, K, we used a truncated Poisson prior with λ = 5 and Kmax = 30. Both γ2 (the mean of the random effects bi,2 ) as well as all β regression parameters were assigned a dispersed N (0, 100) prior. The covariance matrix D of the random effects was given an inverse-Wishart prior with df = 2 and S = diag(0.002).
8.9.2 Effect of covariates on PFS time
The effect of the considered covariates on the progression-free survival time, in both the model with and the model without the covariate region, can be evaluated from Table 8.6, where we report posterior medians, 95% equal-tail credible intervals and Bayesian p-values (simultaneous for categorical covariates with more than 2 levels) for the β and γ parameters.

It is seen that the results for the model with region included are almost the same as those in the model with region excluded. This is in agreement with the general property of the AFT model mentioned in Section 3.3 that the regression parameters for included covariates do not change when an important factor is omitted from the model. If we base our conclusions on the model with region included then we see that, after adjustment for the remaining covariates, surgery alone decreases the time to cancer progression by a factor of exp(−0.173) = 0.84 compared to surgery given together with perioperative chemotherapy. However, the difference is not significant at the conventional 5% level.
Table 8.6: Early breast cancer patients data. Posterior medians, 95% equal-tail
credible intervals and Bayesian two-sided (simultaneous) p-values for the
effect of covariates.

                            Model with region                Model without region
Parameter                   Poster. median  95% CI           Poster. median  95% CI
Treatment group             p = 0.070                        p = 0.086
  surgery alone                 −0.173  (−0.350, 0.016)          −0.166  (−0.342, 0.026)
Age                         p = 0.005                        p = 0.003
  40–50 years                    0.417  (0.140, 0.695)            0.429  (0.154, 0.715)
  > 50 years                     0.260  (0.002, 0.520)            0.295  (0.036, 0.558)
Type of surgery             p = 0.056                        p = 0.029
  breast conserving              0.174  (−0.005, 0.357)           0.197  (0.021, 0.379)
Tumor size                  p < 0.001                        p < 0.001
  ≥ 2cm                         −0.494  (−0.686, −0.306)         −0.507  (−0.697, −0.314)
Nodal status                p < 0.001                        p < 0.001
  positive                      −0.653  (−0.819, −0.488)         −0.657  (−0.822, −0.490)
Other disease               p = 0.008                        p = 0.008
  present                       −0.385  (−0.666, −0.099)         −0.394  (−0.683, −0.102)
Region                      p = 0.033
  the Netherlands               −0.512  (−0.878, −0.068)
  Poland                         0.119  (−0.394, 0.663)
  South Europe                  −0.450  (−0.857, −0.038)
  South Africa                  −0.864  (−1.343, −0.371)
Further, the prognosis for cancer progression is most optimistic in the middle age group 40–50 years, where the time to progression of the disease is increased by a factor of exp(0.417) = 1.52 compared to the youngest group <40 years. In the oldest group >50 years the time to disease progression is still increased, by a factor of exp(0.260) = 1.30, compared to the youngest group. The estimates for the effect of age further suggest a non-linear relationship between age and the log-progression-free survival time.

The effect of the type of surgery on the disease progression is slightly non-significant at the 5% level when basing the inference on the model with region. However, the posterior median of the β parameter for this covariate suggests that breast-conserving surgery increases the time to cancer progression by
Table 8.7: Early breast cancer patients data. Posterior medians and 95%
equal-tail credible intervals for the moments of the error distribution and
the variance components of the random effects.

                            Model with region                Model without region
Parameter                   Poster. median  95% CI           Poster. median  95% CI
Moments of the error distribution
  Intercept α                    9.453  (8.983, 9.853)            9.229  (8.822, 9.796)
  Error scale σ                  1.741  (1.600, 1.859)            1.749  (1.597, 2.376)
Variance components of the random effects
  sd(bi,1)                       0.126  (0.026, 0.392)            0.348  (0.192, 0.616)
  sd(bi,2)                       0.060  (0.020, 0.228)            0.085  (0.023, 0.275)
  corr(bi,1, bi,2)              −0.071  (−0.988, 0.973)          −0.842  (−0.995, 0.978)
a factor of exp(0.174) = 1.20 when compared to mastectomy. The effect of the remaining patient-specific covariates is highly significant and in the direction expected from the clinical point of view. Namely, a tumor of size ≥2 cm decreases the time to cancer progression by a factor of exp(−0.494) = 0.61 compared to smaller tumors of size <2 cm. A positive pathological nodal status decreases the time to cancer progression drastically, by a factor of exp(−0.653) = 0.52, compared to a negative result. The presence of another related disease decreases the PFS time by a factor of exp(−0.385) = 0.68.
Finally, a significant effect of the geographical region on the PFS time is seen.
The best performing region is found to be Poland, followed by France, South
Europe and the Netherlands. The region which performs the worst is then
South Africa.
The relatively small effect of the perioperative chemotherapy compared to surgery alone is also seen from the posterior predictive survival curves shown in Figure 8.9, drawn for region = France and two typical combinations of covariates.
8.9.3 Predictive error density and variance components of random effects
Posterior summary statistics for the moments of the error distribution and
the variance components of the random effects are given in Table 8.7. The
Figure 8.9: Early breast cancer patients data. Predictive survival curves
based on the model with region for region = France, and two typical combinations of covariates: (1) breast conserving surgery, tumor size ≥2 cm,
negative nodal status and no other associated disease (9.79% of the sample), (2) mastectomy, tumor size ≥2 cm, positive nodal status and no other
associated disease (13.88% of the sample).
moments of the error distribution are computed in the same way as indicated in Section 8.7.2. It is seen that although there is heterogeneity between centra, the within-center variability, given by the variance of the error distribution, is much higher than the between-centra variability, given by the variance of the random effects. Furthermore, as expected, the variability of the random intercept term bi,1 increased considerably when we omitted the covariate region.

According to the posterior median there exists a very low negative correlation between the overall center level and the treatment × center interaction in the model with region, and a relatively high negative correlation in the model with region excluded. However, in both cases the 95% equal-tail credible interval covers almost the whole range (−1, 1) of possible values for ϱ, forcing us to conclude that almost nothing can be said about the random effects correlation ϱ, probably because effectively only a sample of size 14 is used to estimate this correlation. The reason for the quite large difference
Figure 8.10: Early breast cancer patients data. Scaled histograms for sampled
corr(bi,1 , bi,2 ).
Figure 8.11: Early breast cancer patients data. Posterior predictive error
densities.
in the posterior median for ϱ in the two models can be found in Figure 8.10, where we show scaled histograms of sampled values of ϱ, i.e. estimates of the posterior density of ϱ. It is seen that the posterior density has, in both cases, a ‘U’ shape, while putting somewhat more mass on negative values in the case of the model without region.

In the sample, mostly error densities with a low number of mixture components were represented. Namely, in the model with region, 90.68% of the sample was formed by a one-component density, 7.07% of the sample was formed by a two-component mixture, 1.50% of the sample contained a three-component mixture, and mixtures with more than 3 components were all together represented in only 0.75% of the sample. In the model with the omitted covariate region the proportion of densities with at least two components quite logically increased: the one-component density is now represented only in 74.40% of the sample, two-component mixtures in 22.13% of the sample and three-component mixtures in 1.73% of the sample. Mixtures with more than 3 components are still quite rare, being all together represented in only 1.74% of the sample. The estimates of the error density (both unconditionally and conditionally given the number of mixture components) are given in Figure 8.11.
8.9.4 Estimates of individual random effects
Estimates of individual random effects that could serve to discriminate the centra are given in Figure 8.12. To be able to directly compare the models with and without the covariate region, the plots related to the random intercept bi,1 also take into account the overall intercept α (the mean of the error distribution) and, in the case of the model with region, also the appropriate main effect of region (β(regionNL), β(regionPL), β(regionSE), and β(regionSA), respectively). It is seen that the estimates of the individual random intercepts in the model without region managed quite nicely to capture also the region effect, of course at the price of a decreased precision of the estimates.
8.9.5 Conclusions
In this section, we have shown an analysis of a typical multicenter clinical trial with heterogeneity with respect to the overall center effect as well as the center × treatment interaction. Among other things, we have further shown how the center-specific random effects may capture the effect of an omitted center-specific covariate.
Figure 8.12: Early breast cancer patients data. Posterior means and 95%
equal-tail credible intervals for individual random effects. Random intercepts
are further shifted by an overall intercept α and in the model with region also
by a corresponding region main effect β(region).
8.10 Discussion
In this chapter, we have proposed a Bayesian cluster-specific accelerated failure time model whose error distribution is modelled in a flexible way as a finite
normal mixture. An advantage of the full Bayesian approach is the fact that
a general random effect vector can be easily included in the model. Subsequently, the effect of covariates can be evaluated jointly with the association
among clustered responses. Further, interval-, right-, or left-censored data
are easy to handle and finally, the MCMC sampling-based implementation of
the model offers a straightforward way to obtain credible intervals of model
parameters as well as predictive survival or hazard curves.
Observe that the Bayesian approach is used here mainly for technical convenience. Indeed, in practice the likelihood (8.3) is hardly tractable using the maximum-likelihood method. On the other hand, Bayesian estimation using MCMC does not pose any real difficulties. Further, since all our prior distributions are non-informative (or close to non-informative, cf. the variance parameters) and we use (on a proper scale) more or less posterior modes as point estimates, classical maximum-likelihood estimation would lead to almost the same results.
The proposed methodology aims to contribute to the area of semi-parametric modelling of correlated and at the same time interval-censored data. Furthermore, our approach allows us to bring structure into the dependencies between observations within one cluster. For instance, in multicenter studies, the vector z i,l = (1, treatmenti,l )′ in the model formula (8.1) allows us to consider not only a random center effect but also a random center-by-treatment interaction, which can sometimes be substantial.
Unfortunately, our approach cannot handle time-dependent covariates. However, the same is true for any model where the distribution of the response is specified by the density and not by the hazard function. To include time-dependent covariates, the Cox proportional hazards model is usually used. For example, Kooperberg and Clarkson (1997), Betensky et al. (1999) and Goetghebeur and Ryan (2000) consider independent interval-censored data. Vaida and Xu (2000) offer an approach based on the proportional hazards linear mixed model with right-censored data.
Finally, our approach can be quite easily extended along the lines presented
in Chapters 9 and 10 to handle also doubly-interval-censored data, i.e. the
data where the response is given as the difference of two interval-censored
observations.
Chapter 9

Bayesian Penalized Mixture Cluster-Specific AFT Model
This chapter continues with the developments in the framework of the cluster-specific AFT model. However, to model unknown distributional shapes a penalized normal mixture introduced in Section 6.3 will be exploited instead of the classical normal mixture that was used in Chapter 8. Furthermore, we directly describe a model for doubly-interval-censored data, although it can also be used with interval- or right-censored data. This approach, introduced by Komárek and Lesaffre (2006b), will allow us to analyze the caries times in the Signal Tandmobiel® study.
The cluster-specific AFT model for doubly-interval-censored data is specified in Section 9.1. In Section 9.2, we specify the prior distributions of all model parameters and derive their posterior distribution. Markov chain Monte Carlo methodology for the model of this chapter is described in Section 9.3. Estimation of the survival distribution and of the individual random effects is described in Sections 9.4 and 9.5, respectively. Results of a simulation study aiming to evaluate the performance of the proposed method are shown in Section 9.6. Section 9.7 presents the analysis of the doubly-interval-censored caries times of the four permanent first molars. The analysis of the breast cancer multicenter study is given in Section 9.8. A discussion concludes the chapter in Section 9.9.
9.1 Model
Let Σ_{i=1}^{N} ni observational units be divided into N clusters, the ith one of size ni . Let Ui,l and Vi,l , i = 1, . . . , N, l = 1, . . . , ni , denote the true chronological onset and failure time, respectively, and Ti,l = Vi,l − Ui,l the true event time. With doubly interval censoring, it is only known that Ui,l occurred within an interval of time ⌊u^L_{i,l}, u^U_{i,l}⌋, where u^L_{i,l} ≤ u^U_{i,l}. Similarly, the failure time Vi,l is only known to lie in an interval ⌊v^L_{i,l}, v^U_{i,l}⌋, with v^L_{i,l} ≤ v^U_{i,l}, i = 1, . . . , N, l = 1, . . . , ni . As in the whole thesis, it is assumed that the observed intervals result from an independent noninformative censoring process (see Section 2.4). Further, as indicated in Section 4.1.2, we will assume that, given the model parameters, the true event time Ti,l is independent of the true onset time Ui,l for all i and l. Below, we discuss this issue further.
To account for possible dependencies of different individuals within a cluster,
the cluster-specific random effects di = (di,1 , . . . , di,qd )′ and bi = (bi,1 , . . . ,
bi,qb )′ are introduced and incorporated in the cluster-specific AFT model for
doubly-interval-censored data:
log(Ui,l) = δ′ xui,l + d′i zui,l + ζi,l ,                                   (9.1)
log(Vi,l − Ui,l) = log(Ti,l) = β′ xti,l + b′i zti,l + εi,l ,                (9.2)
i = 1, . . . , N,   l = 1, . . . , ni ,
where δ = (δ1 , . . . , δmu )′ and β = (β1 , . . . , βmt )′ are unknown regression
parameter vectors, z ui,l is the covariate vector for random effects influencing
the distribution of the onset time, z ti,l the covariate vector for random effects
influencing the distribution of the event time and similary, xui,l is the covariate
vector for fixed effects having possibly an impact on the onset time and xti,l the
covariate vector for fixed effects having possibly an impact on the event time.
The error terms ζi,l , i = 1, . . . , N , l = 1, . . . , ni are i.i.d. random variables
with some density gζ (ζ). Analogously, the error terms εi,l , i = 1, . . . , N,
l = 1, . . . , ni are i.i.d. random variables with density gε (ε). The random
effects di , i = 1, . . . , N and bi , i = 1, . . . , N , respectively are assumed to be
i.i.d. with a density gd (d) and gb (b), respectively. Furthermore we assume
that εi1 ,l1 , ζi2 ,l2 , bi3 and di4 are independent for all i1 , i2 , i3 , i4 and l1 , l2 .
This assumption implies that, given the model parameters and the random
effects bi and di , Ui,l and Ti,l are independent for each i and l and the
vectors U i = (Ui,1 , . . . , Ui,ni )′ and T i = (Ti,1 , . . . , Ti,ni )′ are independent for
each i. Furthermore, for example in the context of the Signal Tandmobielr
application (see Section 9.7) where Ui,l and Ti,l are the emergence time and
the time to caries, respectively, for the lth tooth of the ith child, it also
implies the following decomposition
(a) Whether a child is an early or late emerger is independent of whether
a child is more or less sensitive against caries (independence of di and
bi );
(b) Whether a specific tooth emerges early or late is independent of whether
that tooth is more or less sensitive against caries (independence of ζi,l
and εi,l ).
9.1.1 Distributional assumptions
To finalize the specification of the measurement model we have to specify the
densities gζ , gε of the random errors and the densities gd , gb of the random
effects. According to the dimensionality of the problem, we distinguish two
situations.
Model U
In the case of univariate densities, i.e. for the densities gζ and gε and for
the densities gd and gb if the corresponding random effects are univariate (in
which case we will use the notation di = (di,1 ) ≡ di and/or bi = (bi,1 ) ≡ bi ),
a penalized normal mixture as introduced in Section 6.3 will be used.
That is, a generic density g(y) of a random variable Y (substitute ζi,l , εi,l ,
di or bi ) is modelled as a location-and-scale transformed weighted sum of
normal densities over a fixed fine grid of knots µ = (µ−K , . . . , µK )′ centered
around µ0 = 0. The means of the normal components are equal to the knots
and their variances are all equal and fixed to σ 2 , i.e.
g(y) = τ⁻¹ Σ_{j=−K}^{K} wj(a) ϕ( (y − α)/τ | µj , σ² ),                     (9.3)
where the unknown intercept term α and the unknown scale parameter τ have
to be estimated as well as the vector a = (a−K , . . . , aK )′ of the transformed
weights. See (6.14) for the relationship between a and w = (w−K , . . . , wK )′ .
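As an illustration of how formula (9.3) is evaluated, the sketch below (with purely hypothetical knots, transformed weights and location-scale parameters, not the thesis software) converts a into the weights w via the softmax-type relation of (6.14) and sums the scaled normal densities:

```python
import numpy as np
from scipy.stats import norm

def mixture_density(y, a, knots, sigma, alpha, tau):
    """Evaluate g(y) of (9.3): a location-scale transformed normal mixture."""
    w = np.exp(a - a.max())
    w /= w.sum()                               # transformed weights -> weights w(a)
    z = (np.asarray(y, dtype=float)[..., None] - alpha) / tau
    # sum_j w_j * phi(z | mu_j, sigma^2), divided by tau (Jacobian of the rescaling)
    return (w * norm.pdf(z, loc=knots, scale=sigma)).sum(axis=-1) / tau

# Illustrative grid of 2K+1 knots centred at 0 and arbitrary transformed weights.
K = 15
knots = np.linspace(-4.5, 4.5, 2 * K + 1)
a = -0.5 * knots**2                            # hypothetical values of a
print(mixture_density(np.linspace(-3, 8, 5), a, knots,
                      sigma=0.3, alpha=2.0, tau=1.5))
```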
Model M
In the case when a random effect vector di or bi is multivariate it is assumed,
analogously to Chapter 8 that it follows a multivariate normal distribution.
This choice is driven mainly by computational convenience. Note however,
that the densities gζ and gε are still modelled using the penalized normal
mixture (9.3). Finally, the same reasoning as in Section 8.1.1 can be used
to explain why we put more emphasis on a correct specification of the error
distribution.
For notational convenience and clarity of the exposition we will assume that
in Model U, both random effects are univariate (qd = qb = 1) whereas in Model
M, both random effects are multivariate (qd > 1 and qb > 1). However, in
practical situations both cases can be mixed. For example the distribution of
the univariate di can be specified as a penalized normal mixture (9.3) whereas
for the multivariate bi a multivariate normal distribution can be used.
9.1.2 Likelihood
Denoting p a generic density, the likelihood contribution of the ith cluster is
given by
Li = ∫_{R^qd} ∫_{R^qb} ∏_{l=1}^{ni} ∫_{u^L_{i,l}}^{u^U_{i,l}} ∫_{v^L_{i,l}−ui,l}^{v^U_{i,l}−ui,l} p(ti,l , bi , ui,l , di) dti,l dui,l dbi ddi

   = ∫_{R^qd} ∫_{R^qb} ∏_{l=1}^{ni} ∫_{u^L_{i,l}}^{u^U_{i,l}} ∫_{v^L_{i,l}−ui,l}^{v^U_{i,l}−ui,l} p(ti,l | bi , ui,l , di) p(bi | ui,l , di) p(ui,l | di) p(di) dti,l dui,l dbi ddi

   = ∫_{R^qd} ∫_{R^qb} [ ∏_{l=1}^{ni} ∫_{u^L_{i,l}}^{u^U_{i,l}} { ∫_{v^L_{i,l}−ui,l}^{v^U_{i,l}−ui,l} p(ti,l | bi) dti,l } p(ui,l | di) dui,l ] p(bi) p(di) dbi ddi ,        (9.4)

where

p(ti,l | bi) = ti,l⁻¹ gε( log(ti,l) − b′i zti,l − β′ xti,l ),
p(ui,l | di) = ui,l⁻¹ gζ( log(ui,l) − d′i zui,l − δ′ xui,l )
are modelled using the expression (9.3) for gε and gζ .
Further, in Model U, p(bi) = gb(bi) and p(di) = gd(di) are penalized normal mixtures (9.3). Since it is not possible to distinguish between the intercept terms of the error and the random effect, the intercepts α = αd for gd and α = αb for gb are fixed to zero for identifiability reasons. In the case of Model M, the densities p(bi) = gb(bi) and p(di) = gd(di) are densities of an appropriate multivariate normal distribution (see also Section 9.2.3).

The method of penalized maximum likelihood, suggested in Chapter 7, is computationally quite demanding for the likelihood (9.4). Instead, a Bayesian approach together with MCMC methodology will be used here to avoid explicit integration and optimization.
9.2 Bayesian hierarchical model
To specify the model from a Bayesian point of view, prior distributions for
all unknown parameters have to be given. For our model we assume a hierarchical structure described by a directed acyclic graph (DAG). The DAG
for Model U where the distributions of the univariate random effects and the
error terms are estimated using the penalized mixture is given in Figure 9.1.
Figure 9.1: Directed acyclic graph for the Bayesian penalized mixture cluster-specific AFT model with univariate random effects (Model U).
The DAG for Model M with multivariate normal random effects and error
terms expressed using the penalized mixture is given in Figure 9.2.
For Model U, the joint prior distribution of the total parameter vector θ is
given by
p(θ) ∝ [ ∏_{i=1}^{N} ∏_{l=1}^{ni} p(vi,l | ui,l , ti,l) × p(ti,l | β, bi , εi,l) × p(ui,l | δ, di , ζi,l) ×
         p(εi,l | Gε , r^ε_{i,l}) × p(ζi,l | Gζ , r^ζ_{i,l}) × p(r^ε_{i,l} | Gε) × p(r^ζ_{i,l} | Gζ) ×
         p(bi | Gb , r^b_i) × p(di | Gd , r^d_i) × p(r^b_i | Gb) × p(r^d_i | Gd) ] ×
       p(Gε) × p(Gζ) × p(Gb) × p(Gd) × p(δ) × p(β).                          (9.5)
Figure 9.2: Directed acyclic graph for the Bayesian penalized mixture cluster-specific AFT model with multivariate normal random effects (Model M).
The node Gε refers to the set {σ ε , µε , αε , τ ε , wε , aε , λε } which contains
the parameters of formulas (9.3) and (6.14) and a smoothing parameter λε
which will be further discussed in Section 9.2.1. The sets Gζ , Gb , Gd are
defined in an analogous manner. Further, let G be a generic symbol for its
subscripted counterpart (i.e. for Gε , Gζ , Gb , Gd ) and let y be a generic symbol
for εi,l , ζi,l , bi , or di , i = 1, . . . , N, l = 1, . . . , ni , respectively. The sub-DAG for
the generic Y random variable is shown in Figure 9.3 and the corresponding
DAG conditional distributions are discussed in Sections 9.2.1 and 9.2.2.
In the case of Model M, the joint prior distribution is given by
p(θ) ∝ [ ∏_{i=1}^{N} ∏_{l=1}^{ni} p(vi,l | ui,l , ti,l) × p(ti,l | β, bi , εi,l) × p(ui,l | δ, di , ζi,l) ×
         p(εi,l | Gε , r^ε_{i,l}) × p(ζi,l | Gζ , r^ζ_{i,l}) × p(r^ε_{i,l} | Gε) × p(r^ζ_{i,l} | Gζ) ×
         p(bi | γb , Db) × p(di | γd , Dd) ] ×
       p(Gε) × p(Gζ) × p(γb) × p(Db) × p(γd) × p(Dd) × p(δ) × p(β),          (9.6)
Figure 9.3: Directed acyclic graph for the penalized mixture.
where γ d and Dd are the mean and the covariance matrix for the random
effect vectors di and γ b and Db are the mean and the covariance matrix for
the random effect vectors bi . These parameters will be discussed in detail in
Section 9.2.3.
All the multiplicands of expressions (9.5) and (9.6) will be discussed in detail
in the following sections.
9.2.1 Prior distribution for G
The prior distribution of a generic node G whose structure is given in Figure 9.3 equals
p(G) ∝ p(a | λ) p(λ) p(α) p(τ ).
Prior for transformed mixture weights
Although the grid length (2K + 1) is often of moderate size, it results in
a rather large number of unknown a parameters. To avoid overfitting of
the data and identifiability problems, a restriction on the a parameters is
needed. In Chapter 7 we added a penalty term for the transformed weights
to the log-likelihood for this purpose. This penalty term can be interpreted
as an informative log-prior distribution (e.g., Silverman, 1985, Section 6).
Therefore the prior distribution p(a | λ) is defined as the exponential of the
penalty term used in Chapter 7, i.e.
p(a | λ) ∝ exp{ −(λ/2) Σ_{j=−K+s}^{K} (Δ^s aj)² } = exp{ −(λ/2) a′ P′P a },        (9.7)
where ∆s denotes a difference operator of order s and P the corresponding
difference operator matrix. The hyperparameter λ controls the smoothness
of the resulting density g(y).
Expression (9.7) is that of a multivariate normal density with zero mean and covariance matrix λ⁻¹ (P′P)⁻, where (P′P)⁻ denotes a generalized inverse of the matrix P′P. This distribution is known as a Gaussian Markov random field (GMRF) and is extensively used in spatial statistics. Although the distribution (9.7) is improper (the matrix P′P has a rank deficiency of s), the resulting posterior distribution is proper as soon as some informative data are available, see Besag et al. (1995).
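For readers who wish to experiment, the difference-operator matrix P of order s can be built by repeatedly differencing an identity matrix; the following sketch (illustrative only, not the thesis implementation) constructs P and confirms that the prior precision matrix P′P has the rank deficiency of s mentioned above:

```python
import numpy as np

def difference_matrix(dim, s):
    """Matrix P such that P @ a yields the s-th order differences of a."""
    P = np.eye(dim)
    for _ in range(s):
        P = np.diff(P, axis=0)        # each pass removes one row, raising the order
    return P

K, s = 15, 3
dim = 2 * K + 1
P = difference_matrix(dim, s)          # shape (2K+1-s, 2K+1)
Q = P.T @ P                            # prior precision of a is lambda * Q, cf. (9.7)
print(dim - np.linalg.matrix_rank(Q))  # rank deficiency equals s
```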
As a consequence of the findings discussed in Section 7.2, prior distribution
(9.7) favours smooth estimates of the estimated densities (gε , gζ , gb or gd ).
Due to the correspondence of the prior (9.7) with the penalty term in the
penalized maximum-likelihood approach we will call the mixture model (9.3)
with this prior a penalized mixture.
Prior for the smoothing parameter
The smoothing hyperparameter λ can be interpreted as a component of the
prior precision of the transformed weights a. See Section 7.2.3 for the approaches to determine the optimal value of λ in the context of penalized
maximum-likelihood estimation. For our full Bayesian inference, the unknown smoothing parameter λ is considered stochastic and is estimated simultaneously with all the remaining parameters of the model. Therefore, here
a hyperprior has been assigned to λ, namely a highly dispersed Gamma(hλ,1 , hλ,2 ) prior:

p(λ) = { hλ,2^{hλ,1} / Γ(hλ,1) } λ^{hλ,1 − 1} exp( −hλ,2 λ ),
where hλ,1 is the fixed shape parameter and hλ,2 the fixed rate parameter.
A dispersed gamma distribution is obtained for instance with hλ,1 = hλ,2 =
0.001 or hλ,1 = 1, hλ,2 = 0.005.
Prior for the mixture intercept
Finally, in the case when the intercept term α is not fixed to zero (intercept
of error distributions), a highly dispersed normal distribution has been taken
for p(α), i.e.
p(α) = ϕ(α | να , ψα ),
where να is the fixed prior mean and ψα is the fixed large prior variance.
Prior for the mixture scale
For the precision τ⁻² we have taken a highly dispersed Gamma(hτ,1 , hτ,2 ) distribution; see the paragraph above on the prior for the smoothing parameter. Alternatively, a uniform distribution on τ (formally a truncated gamma distribution for τ⁻² with hτ,1 = −1/2 and hτ,2 = 0), which is sometimes preferred for hierarchical models (Gelman et al., 2004, pp. 136, 390), could be taken.
9.2.2 Prior distribution for the generic node Y
To specify the prior distribution of generic Y (εi,l , ζi,l , i = 1, . . . , N , l =
1, . . . , ni in Models U and M and bi , di , i = 1, . . . , N in Model U ) we introduce, analogously to Section 8.2.1, a latent allocation variable r taking values
in {−K, . . . , K}. Actually, data augmentation (Tanner and Wong, 1987) is
introduced which simplifies the MCMC procedure. The DAG conditional
distribution p(y | G, r) is simply a normal distribution:
p(y | G, r) = p(y | σ, µ, α, τ, r) = ϕ( y | α + τ µr , (τσ)² ).

Further, p(r | G) = p(r | w) is given by

Pr(r = j | w) = wj ,   j ∈ {−K, . . . , K}.
Had the latent allocation variable r not been introduced we would have had
to work with the conditional distribution p(y | G) = p(y | σ, µ, α, τ, w) which
is a normal mixture given by the formula (9.3).
9.2.3 Prior distribution for multivariate random effects in Model M
As was mentioned in Section 9.1.1, the multivariate random effects bi and
di , i = 1, . . . , N in Model M are assumed to be a priori normally distributed.
That is, the densities p(bi | γ b , Db ) and p(di | γ d , Dd ) in the expression (9.6)
are
p(bi | γ b , Db ) = ϕqb (bi |γ b , Db ),
p(di | γ d , Dd ) = ϕqd (di |γ d , Dd ),
where γ b = (γb,1 , . . . , γb,qb )′ is the prior mean of the random effects bi , γ d =
(γd,1 , . . . , γd,qd )′ the prior mean of the random effects di , Db is the prior
covariance matrix of the random effects bi and Dd is the prior covariance
matrix of the random effects di .
Both prior random effect means γ b and γ d as well as random effect covariance
matrices Db and Dd are further assigned hyperpriors. These hyperpriors are
chosen analogously to Section 8.2.2. That is, the prior distribution for each
γb,j , j = 1, . . . , qb and γd,j ∗ , j ∗ = 1, . . . , qd , respectively is N (νγb ,j , ψγb ,j )
and N (νγd ,j ∗ , ψγd ,j ∗ ), respectively, independently for j = 1, . . . , qb and j ∗ =
1, . . . , qd , i.e.
p(γb) p(γd) = { ∏_{j=1}^{qb} ϕ(γb,j | νγb,j , ψγb,j) } × { ∏_{j*=1}^{qd} ϕ(γd,j* | νγd,j* , ψγd,j*) }.
The vectors ν γb = (νγb ,1 , . . . , νγb ,qb )′ , ν γd = (νγd ,1 , . . . , νγd ,qd )′ , ψ γb = (ψγb ,1 ,
. . . , ψγb ,qb )′ , and ψ γd = (ψγd ,1 , . . . , ψγd ,qd )′ are fixed hyperparameters. Special
care is needed when the random intercept is included in the model. If for
example z ti,l,1 ≡ 1, i = 1, . . . , N , l = 1, . . . , ni , then for identifiability reasons
γb,1 must be fixed to zero (or equivalently, νγb ,1 = 0, ψγb ,1 = 0) as the overall
intercept is given by the intercept αε of the error terms εi,l .
The prior distributions for the covariance matrices Db and Dd are inverse-Wishart with fixed degrees of freedom dfb and dfd , respectively, and fixed
scale matrices Sb and Sd , respectively. See formula (8.8) for the expression of
the corresponding density.
9.2.4 Prior distribution for the regression parameters
The prior specification for the regression parameters β and δ is analogous
to Section 8.2.2. Firstly, also here, we use the hierarchical centering. That
is, the covariates included in xti,l or xui,l , respectively are not included in z ti,l
or z ui,l , respectively and vice versa. Further, the covariate vectors xti,l and
xui,l , respectively, never contain an intercept term since the overall intercepts
are already included in the model in the form of the parameters αε and αζ ,
respectively.
The prior distribution for each regression coefficient βj , j = 1, . . . , mt and δj* , j* = 1, . . . , mu is N (νβ,j , ψβ,j) and N (νδ,j* , ψδ,j*), respectively, independently for j = 1, . . . , mt and j* = 1, . . . , mu , i.e.

p(β) p(δ) = { ∏_{j=1}^{mt} ϕ(βj | νβ,j , ψβ,j) } × { ∏_{j*=1}^{mu} ϕ(δj* | νδ,j* , ψδ,j*) }.

The vectors νβ = (νβ,1 , . . . , νβ,mt)′, νδ = (νδ,1 , . . . , νδ,mu)′, ψβ = (ψβ,1 , . . . , ψβ,mt)′, and ψδ = (ψδ,1 , . . . , ψδ,mu)′ are fixed hyperparameters.
9.2.5 Prior distribution for the time variables
The terms p(vi,l | ui,l , ti,l ), p(ti,l | β, bi , εi,l ) and p(ui,l | δ, di , ζi,l ) appearing
in the expressions (9.5) and (9.6) are all Dirac (degenerate) densities driven
by the AFT models (9.1) and (9.2). Namely:
p(vi,l | ui,l , ti,l ) = I[vi,l = ui,l + ti,l ],
p(ui,l | δ, di , ζi,l ) = I[log(ui,l ) = δ ′ xui,l + d′i z ui,l + ζi,l ],
p(ti,l | β, bi , εi,l ) = I[log(ti,l ) = β ′ xti,l + b′i z ti,l + εi,l ],
i = 1, . . . , N,
l = 1, . . . , ni .
9.2.6 Posterior distribution
The product of all DAG conditional distributions determines the joint posterior distribution p(θ | data), i.e.
p(θ | data) ∝ p(θ) × ∏_{i=1}^{N} ∏_{l=1}^{ni} { p(u^L_{i,l}, u^U_{i,l} | ui,l , censoringi,l) × p(v^L_{i,l}, v^U_{i,l} | vi,l , censoringi,l) },

where p(θ) is given by (9.5) for Model U and by (9.6) for Model M, respectively. Further, the terms p(u^L_{i,l}, u^U_{i,l} | ui,l , censoringi,l) and p(v^L_{i,l}, v^U_{i,l} | vi,l , censoringi,l), where censoringi,l represents a realization of the random variable(s) causing the censoring of the (i, l)th onset and failure time, are the same as in Section 8.2.4, with an obvious change in notation.
9.3 Markov chain Monte Carlo
As indicated in Section 4.5 we base the inference on a sample from the posterior distribution obtained using MCMC methods. Here, Gibbs sampling
(Geman and Geman, 1984; Gelfand and Smith, 1990) was chosen, which necessitates sampling from all full conditional distributions of blocks of model
parameters. Below, the full conditional distributions are discussed.
9.3.1 Updating the parameters related to the penalized mixture G
Let y_{i*}, i* = 1, . . . , n, be the current values of the appropriate generic nodes y and r_{i*}, i* = 1, . . . , n, the corresponding latent allocation variables. That is,

• For Gε we have {y_{i*} : i* = 1, . . . , n} = {εi,l : i = 1, . . . , N, l = 1, . . . , ni}, {r_{i*} : i* = 1, . . . , n} = {r^ε_{i,l} : i = 1, . . . , N, l = 1, . . . , ni}, and n = Σ_{i=1}^{N} ni ;

• For Gζ we have {y_{i*} : i* = 1, . . . , n} = {ζi,l : i = 1, . . . , N, l = 1, . . . , ni}, {r_{i*} : i* = 1, . . . , n} = {r^ζ_{i,l} : i = 1, . . . , N, l = 1, . . . , ni}, and n = Σ_{i=1}^{N} ni ;

• For Gb we have {y_{i*} : i* = 1, . . . , n} = {bi : i = 1, . . . , N}, {r_{i*} : i* = 1, . . . , n} = {r^b_i : i = 1, . . . , N}, and n = N ;

• For Gd we have {y_{i*} : i* = 1, . . . , n} = {di : i = 1, . . . , N}, {r_{i*} : i* = 1, . . . , n} = {r^d_i : i = 1, . . . , N}, and n = N.
Full conditional for transformed mixture weights
The full conditional of each element of a is given by
p(aj | · · · ) ∝ [ exp(Nj aj) / { Σ_{k=−K}^{K} exp(ak) }^n ] × exp[ − { aj − E(aj | a−(j) , λ) }² / { 2 var(aj | a−(j) , λ) } ],
j = −K, . . . , K,        (9.8)

where Nj is the number of y_{i*} for which the latent allocation variable r_{i*} is equal to j, i.e.

Nj = Σ_{i*=1}^{n} I[r_{i*} = j].
Further, E aj | a−(j) , λ and var aj | a−(j) , λ are the mean and the variance
resulting from the GMRF prior (9.7). For example, for the third order differences (s = 3), which have been used in all applications in this thesis (Sections
9.7 and 9.8), we have
E(aj | a−(j)) = (aj−3 − 6 aj−2 + 15 aj−1 + 15 aj+1 − 6 aj+2 + aj+3) / 20,   j = −K + 3, . . . , K − 3,
E(a−K+2 | a−(−K+2)) = (−3 a−K + 12 a−K+1 + 15 a−K+3 − 6 a−K+4 + a−K+5) / 19,
E(aK−2 | a−(K−2)) = (−3 aK + 12 aK−1 + 15 aK−3 − 6 aK−4 + aK−5) / 19,
E(a−K+1 | a−(−K+1)) = (3 a−K + 12 a−K+2 − 6 a−K+3 + a−K+4) / 10,
E(aK−1 | a−(K−1)) = (3 aK + 12 aK−2 − 6 aK−3 + aK−4) / 10,
E(a−K | a−(−K)) = 3 a−K+1 − 3 a−K+2 + a−K+3 ,
E(aK | a−(K)) = 3 aK−1 − 3 aK−2 + aK−3 ,

and

var(aj | a−(j)) = (20 λ)⁻¹,   j = −K + 3, . . . , K − 3,
var(a−K+2 | a−(−K+2)) = var(aK−2 | a−(K−2)) = (19 λ)⁻¹,
var(a−K+1 | a−(−K+1)) = var(aK−1 | a−(K−1)) = (10 λ)⁻¹,
var(a−K | a−(−K)) = var(aK | a−(K)) = λ⁻¹.
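These closed forms are the standard GMRF full-conditional identities E(aj | a−(j)) = −Q_jj⁻¹ Σ_{k≠j} Q_jk ak and var(aj | a−(j)) = (λ Q_jj)⁻¹ with Q = P′P. A small numerical check of the interior-knot case for s = 3 (illustrative code only, not part of the thesis implementation) could look as follows:

```python
import numpy as np

def third_order_precision(dim):
    """Q = P'P for third-order differences (s = 3)."""
    P = np.eye(dim)
    for _ in range(3):
        P = np.diff(P, axis=0)
    return P.T @ P

rng = np.random.default_rng(0)
dim, lam = 31, 4.0
Q = third_order_precision(dim)
a = rng.normal(size=dim)
j = dim // 2                                     # an interior knot index

# generic GMRF full-conditional moments
cond_mean = -(Q[j, :] @ a - Q[j, j] * a[j]) / Q[j, j]
cond_var = 1.0 / (lam * Q[j, j])

# closed form from the text for an interior knot
closed = (a[j-3] - 6*a[j-2] + 15*a[j-1] + 15*a[j+1] - 6*a[j+2] + a[j+3]) / 20
print(np.isclose(cond_mean, closed), np.isclose(cond_var, 1 / (20 * lam)))
```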
Distribution (9.8) is log-concave, so we experimented both with the slice sampler of Neal (2003) and with the adaptive rejection sampling (ARS) method of Gilks and Wild (1992) to update the elements of a. However, in our applications neither method was found to be superior with respect to the performance of the MCMC. The results presented in Sections 9.7 and 9.8 were obtained using slice sampling.

Furthermore, it is seen that the full conditional distribution of each transformed mixture weight depends only on the weights of the neighboring mixture components. For a better performance of the MCMC, especially to decrease the autocorrelation of the sampled chain, it is thus advantageous to update the transformed mixture weights within one iteration of the MCMC in such an order that the full conditional of the element being updated does not depend on an element which has just been updated. This is obtained, for example, using the
following update order:
· · · → a0 → as+1 → a2(s+1) → · · · → a1 → a1+s+1 → a1+2(s+1) → · · · .
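For a given K and penalty order s, such an update order can be generated mechanically, e.g. (a trivial illustrative sketch, not taken from the thesis software):

```python
# Visit the indices -K, ..., K in s+1 interleaved passes; within a pass the
# updated elements are s+1 knots apart, so none of them occurs in the full
# conditional of the element updated immediately before it.
K, s = 15, 3
order = [j for start in range(s + 1)
           for j in range(-K + start, K + 1, s + 1)]
print(order[:12])
```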
Full conditional for the smoothing parameter
For the smoothing parameter λ, the full conditional distribution is Gamma
(h∗λ,1 , h∗λ,2 ) where
h*λ,1 = hλ,1 + (2K + 1 − s + 1)/2,        h*λ,2 = hλ,2 + (1/2) a′ P′P a.
Full conditional for the mixture intercept
The full conditional for the mixture intercept α is a normal distribution with
the mean and variance
E(α | · · · ) = var(α | · · · ) × { (στ)⁻² Σ_{i*=1}^{n} (y_{i*} − τ µ_{r_{i*}}) + ψα⁻¹ να },

var(α | · · · ) = { (στ)⁻² n + ψα⁻¹ }⁻¹,

respectively, and is thus easily sampled from.
Full conditional for the mixture scale
The full conditional distribution of τ −2 has the form
p(τ⁻² | · · · ) ∝ (τ⁻²)^{ξ1 − 1} exp( ξ3 √τ⁻² − ξ2 τ⁻² ),                     (9.9)

with

ξ1 = hτ,1 + 0.5 n,
ξ2 = hτ,2 + 0.5 σ⁻² Σ_{i*=1}^{n} (y_{i*} − α)²,
ξ3 = σ⁻² Σ_{i*=1}^{n} µ_{r_{i*}} (y_{i*} − α).
Distribution (9.9) is generally not log-concave so that the adaptive rejection
sampling (ARS) method of Gilks and Wild (1992), successfully used in many
situations when the full conditional distribution does not have a standard
form, cannot be used here. However, it can easily be shown that the density
(9.9) is always unimodal and the slice sampler of Neal (2003) can be used to
update the parameter τ −2 in an MCMC run.
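A generic stepping-out slice sampler in the spirit of Neal (2003) suffices to update τ⁻² from the unnormalized log-density implied by (9.9); the sketch below uses made-up values of ξ1, ξ2 and ξ3 and is not the thesis code:

```python
import numpy as np

def log_density(x, xi1, xi2, xi3):
    """Unnormalised log of (9.9) as a function of x = tau^{-2} > 0."""
    if x <= 0:
        return -np.inf
    return (xi1 - 1) * np.log(x) + xi3 * np.sqrt(x) - xi2 * x

def slice_sample(x0, logf, w=0.5, max_steps=50, rng=np.random.default_rng()):
    """One stepping-out / shrinkage slice-sampling update (Neal, 2003)."""
    logy = logf(x0) + np.log(rng.uniform())          # auxiliary vertical level
    left = x0 - w * rng.uniform()
    right = left + w
    for _ in range(max_steps):                       # step out to the left
        if logf(left) <= logy:
            break
        left -= w
    for _ in range(max_steps):                       # step out to the right
        if logf(right) <= logy:
            break
        right += w
    while True:                                      # shrink until accepted
        x1 = rng.uniform(left, right)
        if logf(x1) > logy:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1

xi1, xi2, xi3 = 50.0, 40.0, 10.0                     # hypothetical values
x = 1.0
for _ in range(5):
    x = slice_sample(x, lambda v: log_density(v, xi1, xi2, xi3))
print(x)
```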
Full conditional for the allocation variables
The full conditional for each allocation variable ri∗ , i∗ = 1, . . . , n is discrete
with
Pr(r_{i*} = j | · · · ) ∝ wj exp{ − (y_{i*} − α − τ µj)² / (2 (στ)²) },   j ∈ {−K, . . . , K}.

9.3.2 Updating the generic node Y
The update of the generic node Y is of two types: (1) update of the residuals
εi,l , ζi,l , i = 1, . . . , N , l = 1, . . . , ni (2) update of the univariate random effects
bi , di , i = 1, . . . , N in Model U.
Updating the residuals
The update of the ‘onset’ residuals ζi,l , i = 1, . . . , N , l = 1, . . . , ni , is fully deterministic provided the (i, l)th onset time ui,l = u^L_{i,l} = u^U_{i,l} is uncensored. The update of ζi,l then consists of using the AFT expression (9.1) with the
current values of the parameters, i.e. the updated ζi,l is equal to log(ui,l ) −
δ ′ xui,l − d′i z ui,l .
When the (i, l)th onset time is interval-censored with an observed interval ⌊u^L_{i,l}, u^U_{i,l}⌋, its update consists of sampling from a truncated normal distribution, namely

N( α^ζ + τ^ζ µ_{r^ζ_{i,l}} , (σ^ζ τ^ζ)² )   truncated on
⌊ log(u^L_{i,l}) − δ′ xui,l − d′i zui,l ,   log(u^U_{i,l}) − δ′ xui,l − d′i zui,l ⌋.
A similar procedure is used when updating the ‘event’ residuals εi,l , i = 1, . . . , N , l = 1, . . . , ni . It is useful to stress that for the update of εi,l also the ‘onset’ residual ζi,l , and subsequently also the true onset time ui,l = exp(δ′ xui,l + d′i zui,l + ζi,l), are part of the conditioning set when exploiting the full conditional distribution. This implies that the update of εi,l is fully deterministic provided the (i, l)th failure time vi,l = v^L_{i,l} = v^U_{i,l} is uncensored, irrespective of whether the onset time is censored or not. The update of εi,l then consists of using the AFT expression (9.2) with the current values of the parameters, i.e. the updated εi,l is equal to log(vi,l − ui,l) − β′ xti,l − b′i zti,l .

When the residual εi,l corresponds to a censored failure time with an observed interval ⌊v^L_{i,l}, v^U_{i,l}⌋, its update consists of sampling from the full conditional distribution of εi,l , which is here a truncated normal distribution, namely

N( α^ε + τ^ε µ_{r^ε_{i,l}} , (σ^ε τ^ε)² )   truncated on
⌊ log(v^L_{i,l} − ui,l) − β′ xti,l − b′i zti,l ,   log(v^U_{i,l} − ui,l) − β′ xti,l − b′i zti,l ⌋.
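Drawing such a truncated normal residual can be done by inversion of the normal c.d.f. restricted to the truncation interval; a generic sketch with hypothetical bounds and moments (not necessarily the sampler used in the thesis software) is:

```python
import numpy as np
from scipy.stats import norm

def rtruncnorm(lower, upper, mean, sd, rng=np.random.default_rng()):
    """Draw from N(mean, sd^2) truncated to [lower, upper] by CDF inversion."""
    p_lo, p_hi = norm.cdf([lower, upper], loc=mean, scale=sd)
    u = rng.uniform(p_lo, p_hi)
    return norm.ppf(u, loc=mean, scale=sd)

# Hypothetical update of one censored 'event' residual eps_{i,l}: the bounds
# are log(v^L - u) and log(v^U - u) minus the linear predictor beta'x + b'z.
lin_pred = 1.2                       # illustrative value of the linear predictor
vL, vU, u_onset = 300.0, 360.0, 120.0
eps = rtruncnorm(np.log(vL - u_onset) - lin_pred,
                 np.log(vU - u_onset) - lin_pred,
                 mean=0.4, sd=0.8)   # alpha^eps + tau^eps mu_r and sigma^eps tau^eps
print(eps)
```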
Updating the univariate random effects in Model U
In Model U, the full conditional distributions for the univariate random effects
bi and/or di , i = 1, . . . , N are normal distributions, namely
bi | · · · ∼ N( E(bi | · · · ), var(bi | · · · ) ),   i = 1, . . . , N,

with

E(bi | · · · ) = var(bi | · · · ) × [ (σ^b τ^b)⁻² τ^b µ^b_{r^b_i} + (σ^ε τ^ε)⁻² Σ_{l=1}^{ni} { log(ti,l) − α^ε − β′ xti,l − τ^ε µ^ε_{r^ε_{i,l}} } ],

var(bi | · · · ) = { (σ^b τ^b)⁻² + (σ^ε τ^ε)⁻² ni }⁻¹.
Analogous formulas, with an obvious change in notation, hold for di , i =
1, . . . , N .
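In words, bi is drawn from a normal distribution whose precision adds the prior precision (σ^b τ^b)⁻² to ni copies of the error precision (σ^ε τ^ε)⁻², and whose mean precision-weights the prior location τ^b µ^b_{r^b_i} against the sum of the cluster's residuals. A compact sketch of this one-dimensional update (illustrative inputs only, not the thesis code) is:

```python
import numpy as np

def update_bi(resid_sum, n_i, prior_loc, prec_b, prec_eps,
              rng=np.random.default_rng()):
    """Gibbs draw of a univariate random effect b_i in Model U.

    resid_sum : sum over l of log(t_{i,l}) - alpha^eps - beta'x - tau^eps mu_r
    prior_loc : tau^b * mu_{r_i^b}, the current prior location of b_i
    prec_b    : (sigma^b tau^b)^{-2};  prec_eps : (sigma^eps tau^eps)^{-2}
    """
    var = 1.0 / (prec_b + prec_eps * n_i)
    mean = var * (prec_b * prior_loc + prec_eps * resid_sum)
    return rng.normal(mean, np.sqrt(var))

print(update_bi(resid_sum=2.3, n_i=4, prior_loc=0.0, prec_b=4.0, prec_eps=1.5))
```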
9.3.3 Updating the parameters related to the multivariate random effects in Model M
In the case of the multivariate random effects bi and/or di having a multivariate normal prior distribution the following full conditionals are used to
update the related parameters.
Full conditionals for the multivariate random effects bi and di
The full conditional of the multivariate random effects vector bi , i = 1, . . . , N, is a multivariate normal distribution, i.e.

bi | · · · ∼ N( E(bi | · · · ), var(bi | · · · ) ),   i = 1, . . . , N,

with

E(bi | · · · ) = var(bi | · · · ) × [ Db⁻¹ γb + (σ^ε τ^ε)⁻² Σ_{l=1}^{ni} zti,l { log(ti,l) − α^ε − β′ xti,l − τ^ε µ^ε_{r^ε_{i,l}} } ],

var(bi | · · · ) = { Db⁻¹ + (σ^ε τ^ε)⁻² Σ_{l=1}^{ni} zti,l (zti,l)′ }⁻¹.
The full conditional distribution of the multivariate random effects di , i =
1, . . . , N is analogous with an obvious change in notation.
Full conditionals for the means γb , γd and the covariance matrices Db , Dd of the multivariate random effects
For the means γb , γd and the covariance matrices Db , Dd of the multivariate random effects, the full conditional distributions are exactly the same as those derived for the Bayesian normal mixture CS AFT model in Section 8.3.2. Only appropriate subscripts have to be added to the expressions appearing in the formulas given in Section 8.3.2.
9.3.4 Updating the regression parameters
Full conditionals for the fixed effects δ and β
Let β (S) be an arbitrary sub-vector of vector β, and xi,l(S) the corresponding
sub-vectors of covariate vectors xti,l , and further let xi,l(−S) be their complementary sub-vectors. Similarly, let further ν β(S) and ψ β(S) be appropriate sub-vectors of hyperparameters ν β and ψ β , respectively. Finally, let
Ψβ(S) = diag(ψ β(S) ). Then
β(S) | · · · ∼ N( E(β(S) | · · · ), var(β(S) | · · · ) ),

with

E(β(S) | · · · ) = var(β(S) | · · · ) × { Ψβ(S)⁻¹ νβ(S) + (σ^ε τ^ε)⁻² Σ_{i=1}^{N} Σ_{l=1}^{ni} xi,l(S) e^{(F)}_{i,l(S)} },

var(β(S) | · · · ) = { Ψβ(S)⁻¹ + (σ^ε τ^ε)⁻² Σ_{i=1}^{N} Σ_{l=1}^{ni} xi,l(S) x′i,l(S) }⁻¹,

where e^{(F)}_{i,l(S)} = log(ti,l) − α^ε − β′(−S) xi,l(−S) − b′i zti,l − τ^ε µ^ε_{r^ε_{i,l}} .

The full conditional distribution for an arbitrary sub-vector of the vector δ is analogous with an obvious change in notation.
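This is the standard conjugate normal update for a block of regression coefficients with known residual precision; a sketch of the corresponding draw (with simulated, hypothetical data and hyperparameters) is:

```python
import numpy as np

def update_beta_block(X, e, prec_eps, prior_mean, prior_var,
                      rng=np.random.default_rng()):
    """Gibbs draw of a sub-vector beta_(S).

    X          : (n, p) matrix with rows x_{i,l(S)}
    e          : (n,)  partial residuals e^{(F)}_{i,l(S)}
    prec_eps   : (sigma^eps tau^eps)^{-2}
    prior_mean : nu_{beta(S)};  prior_var : diagonal of Psi_{beta(S)}
    """
    prior_prec = np.diag(1.0 / prior_var)
    post_prec = prior_prec + prec_eps * X.T @ X
    post_cov = np.linalg.inv(post_prec)
    post_mean = post_cov @ (prior_prec @ prior_mean + prec_eps * X.T @ e)
    return rng.multivariate_normal(post_mean, post_cov)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
e = X @ np.array([0.3, -0.15]) + rng.normal(scale=0.5, size=50)
print(update_beta_block(X, e, prec_eps=4.0,
                        prior_mean=np.zeros(2), prior_var=np.full(2, 100.0)))
```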
9.4 Bayesian estimates of the survival distribution

9.4.1 Predictive survival and hazard curves and predictive survival densities
Analogously to Section 8.4, the survival and hazard functions or the survival
densities for a specific combination of covariates are estimated by the mean
of (posterior) predictive quantities.
Almost all expressions given in Section 8.4.1 apply also here with the following
changes. To get the Bayesian estimate of the survival function of the event
time T , given the covariates xtnew and z tnew , the expression (8.16) changes
into
S(t | θ, xtnew , ztnew) = 1 − Σ_{j=−K}^{K} w^ε_j Φ( log(t) − β′ xtnew − b′ ztnew | α^ε + τ^ε µ^ε_j , (σ^ε τ^ε)² ).        (9.10)

Similarly, to get the estimate of the survival density, we use

p(t | θ, xtnew , ztnew) = t⁻¹ Σ_{j=−K}^{K} w^ε_j ϕ( log(t) − β′ xtnew − b′ ztnew | α^ε + τ^ε µ^ε_j , (σ^ε τ^ε)² )        (9.11)

instead of the expression (8.18).
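For a single parameter draw θ^(m) and one sampled random-effect value b (see the next paragraph for how b^(m) is obtained), expression (9.10) is a finite mixture of normal c.d.f.'s on the log-time scale; a sketch of its evaluation over a time grid, with illustrative parameter values, is given below. The Bayesian estimate then averages such curves over the MCMC iterations.

```python
import numpy as np
from scipy.stats import norm

def survival_curve(t, w, knots, alpha, tau, sigma, lin_pred):
    """S(t | theta, x_new, z_new) of (9.10) for one parameter draw.

    lin_pred = beta' x_new + b' z_new for the covariate combination of interest.
    """
    resid = np.log(np.asarray(t, dtype=float))[:, None] - lin_pred
    cdf = norm.cdf(resid, loc=alpha + tau * knots, scale=sigma * tau)
    return 1.0 - cdf @ w

# Hypothetical draw: mixture weights w, knots mu_j, and AFT parameters.
K = 15
knots = np.linspace(-4.5, 4.5, 2 * K + 1)
w = np.exp(-0.5 * knots**2); w /= w.sum()
tgrid = np.linspace(1, 4000, 5)
print(survival_curve(tgrid, w, knots,
                     alpha=9.4, tau=1.7, sigma=0.3, lin_pred=-0.2))
```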
To be able to use a relationship analogous to (8.17) we need a sample {b^(m) : m = 1, . . . , M} of the posterior predictive values of the random effects. In the case of a univariate random effect b in Model U, b^(m) is sampled from the normal mixture Σ_{j=−K}^{K} w^{b,(m)}_j N( τ^{b,(m)} µ^b_j , (σ^b τ^{b,(m)})² ). In the case of a multivariate random effect b in Model M, b^(m) is sampled from N_{qb}( γb^(m), Db^(m) ). The predictive quantities for the onset time U are obtained in an analogous manner.
9.4.2 Predictive error and random effect densities
The estimate of the smoothed densities gε , gζ , gb , gd is obtained by the mean
of the (posterior) predictive density which is given, for example in the case
of gε , by
E{ gε(e) | data } = ∫ gε(e) p(θ | data) dθ,   e ∈ R.        (9.12)

The MCMC estimate of (9.12) is obtained by averaging the error density (9.3) over the MCMC run, i.e.

ĝε(e) = (1/M) Σ_{m=1}^{M} (τ^{ε,(m)})⁻¹ Σ_{j=−K}^{K} w^{ε,(m)}_j ϕ( (e − α^{ε,(m)})/τ^{ε,(m)} | µ^ε_j , (σ^ε)² ).        (9.13)
9.5 Bayesian estimates of the individual random effects
As explained in Section 8.5 in the context of the Bayesian normal mixture CS AFT model, in some situations estimates of the individual random effects must be provided. In the context of this chapter these can be computed in the same way as shown in Section 8.5.
9.6 Simulation study
To validate our approach we conducted a simulation study which mimics to a certain extent the Signal Tandmobiel® data. From each of 150 clusters we simulated 4 observations. The onset time Ui,l and the event time Ti,l , i = 1, . . . , 150, l = 1, . . . , 4, were generated according to the AFT models (9.1) and (9.2) with xui,l = (xui,l,1 , xui,l,2)′, zui,l ≡ 1, δ = (0.20, −0.10)′ and xti,l = (xti,l,1 , xti,l,2)′, zti,l ≡ 1, β = (0.30, −0.15)′. The covariates xui,l,1 and xti,l,1 are continuous and generated independently from a uniform distribution on (0, 1); the covariates xui,l,2 and xti,l,2 are binary with equal probabilities for zeros and ones.

The error terms ζi,l and εi,l are obtained from ζi,l = α^ζ + τ^ζ ζ*_{i,l} (α^ζ = 1.75, ζ*_{i,l} ∼ g*_ζ) and εi,l = α^ε + τ^ε ε*_{i,l} (α^ε = 2.00, ε*_{i,l} ∼ g*_ε), respectively. Further, the random effects di and bi are obtained from di = τ^d d*_i (d*_i ∼ g*_d) and bi = τ^b b*_i (b*_i ∼ g*_b), respectively. The scale parameters were chosen such that (τ^d)² + (τ^ζ)² = τ²_onset = 0.1 and (τ^b)² + (τ^ε)² = τ²_event = 1.0; see below for the individual values. The choice of τ²_onset and τ²_event was motivated by the results of the analysis in Section 9.7.
Two scenarios for the distributional parts of the model were considered. In scenario I, both densities g*_ζ and g*_ε (of the standardized error terms) are a mixture of normals, i.e. equal to 0.4 N(−2.000, 0.25) + 0.6 N(1.333, 0.36) standardized to have unit variance. For the densities g*_d and g*_b (of the standardized random effects) the density of a standardized minimum extreme value distribution was taken. In scenario II, we reversed the setting, i.e. we have taken an extreme value distribution for the error terms and a normal mixture for the random effects. Additionally, within each scenario, the variances τ²_onset and τ²_event were decomposed such that the ratios τ^d/τ^ζ = τ^b/τ^ε were equal to 5, 3, 2, 1, 1/2, 1/3, and 1/5, respectively.
The true onset and event times were interval-censored by simulating the ‘visit’ times for each subject in the data set. The first visit was drawn from N(1, 0.2²). Each of the distances between consecutive visits was drawn from N(0.5, 0.05²).
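The interval-censoring mechanism of the simulation can be mimicked by generating a visit schedule per subject and bracketing the true time between the last visit before it and the first visit after it; a sketch under the stated visit-time distributions (the number of visits below is an arbitrary choice, not taken from the thesis) is:

```python
import numpy as np

def censor_by_visits(true_time, n_visits=12, rng=np.random.default_rng()):
    """Return the observed interval (lower, upper) bracketing true_time.

    First visit ~ N(1, 0.2^2); gaps between visits ~ N(0.5, 0.05^2), as in the
    simulation design.  If the last visit precedes the true time, the
    observation is right-censored (upper = inf).
    """
    gaps = rng.normal(0.5, 0.05, size=n_visits - 1)
    visits = np.cumsum(np.r_[rng.normal(1.0, 0.2), gaps])
    before = visits[visits < true_time]
    after = visits[visits >= true_time]
    lower = before[-1] if before.size else 0.0
    upper = after[0] if after.size else np.inf
    return lower, upper

print(censor_by_visits(3.27))
```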
The results for the simulation study are shown in Appendix B.3. Tables B.12 and B.13 give the results for the regression parameters and show that they are estimated with practically no bias and with reasonable precision. It is further seen that the precision of the estimation decreases when the within-cluster variability (variance of the error terms) increases compared to the between-cluster variability (variance of the random effects). In practice, however, the between-cluster variability is often much higher than the within-cluster variability. Further, Tables B.14 and B.15 show the results for the standard deviations of the error terms and random effects. Here, the precision is sometimes somewhat worse; however, the standard deviations are also, in most cases, estimated with minimal bias. Furthermore, the shape of the survival functions and survival densities is correctly estimated, as illustrated in Figures B.10–B.17, which show the fitted survival functions and survival densities for selected combinations of covariates.
9.7 Example: Signal Tandmobiel® study – clustered doubly-interval-censored data
This analysis of the Signal Tandmobiel® data, introduced in Section 1.1, involves
(a) doubly-interval-censored data, i.e. the time from tooth emergence to onset of caries;
(b) clustering. Indeed, we will examine several teeth jointly and the teeth from the same mouth are related.
The primary interest of the present analysis is to address the influence of
sound versus affected (decayed/filled/missing due to caries) deciduous second molars (in Figure 1.2, teeth 55, 65, 75, 85, respectively) on the caries
susceptibility of the adjacent permanent first molars (in Figure 1.1, teeth 16,
26, 36, 46, respectively). Note that for about five years the deciduous second
molars are in the mouth together with the permanent first molars.
It is possible that the caries processes on the primary and the permanent
molar occur simultaneously. In this case it is difficult to know whether caries
on the deciduous molar caused caries on the permanent molar or vice versa.
For this reason, the permanent first molar was excluded from the analysis
if caries was present when emergence was recorded. This implies that the
data are not balanced with respect to the size of the clusters. In total, 3 520 children were included in the analysis, of which 187 contributed 1 tooth, 317 contributed 2 teeth, 400 contributed 3 teeth and 2 616 contributed all 4 teeth.
Additionally, we considered the impact of gender (boy/girl), presence of
sealants in pits and fissures of the permanent first molar (none/present),
occlusal plaque accumulation on the permanent first molar (none/in pits and
fissures/on total surface), and reported oral brushing habits (not daily/daily).
Note that pits and fissures sealing is a preventive action which is expected
to protect the tooth against caries development. The presence of plaque on
the occlusal surfaces of the permanent first molars was assessed using a simplified version of the index described by Carvalho, Ekstrand, and Thylstrup
(1989). All explanatory variables were obtained at the examination where
the presence of the permanent first molar was first recorded.
The choice of explanatory variables is motivated by the results of Leroy et al. (2005), where a GEE multivariate log-logistic AFT model was used to analyze the time to caries and multiple imputation was used to deal with the interval-censored emergence times. Further, on top of that, the caries status of the deciduous first molars (in Figure 1.2, teeth 54, 64, 74, 84, respectively) was included in the covariate part of the model. We will not use this factor as an explanatory variable because of its strong association with the status of the deciduous second molar (in all quadrants of the mouth, the χ² test statistics with 9 degrees of freedom exceeded 1 100).
The onset time Ui,l , l = 1, . . . , 4 is the age (in years) of the ith child (ith
cluster) at which the lth permanent first molar emerged. The failure time,
Vi,l , indicates the onset of caries of the lth permanent first molar. The time
from tooth emergence to the onset of caries, Ti,l , is doubly-interval-censored.
Here, both the time of tooth emergence and the onset of caries experience
are only known to lie in an interval of about 1 year.
Further, in our example about 85% of the permanent first molars had emerged
at the first examination giving rise to a huge amount of left-censored onset
times. However, at each examination the permanent teeth were scored according to their clinical eruption stage using a grading that starts at P0
(tooth not visible in the mouth) and ends with P4 (fully erupted tooth with
full occlusion). Based on the clinical eruption stage at the moment of the
first examination, all left-censored emergence times were transformed into
interval-censored ones with the lower limit of the observed interval equal to
the age at examination minus 0.25 year, 0.5 year and 1 year, respectively for
the teeth with the eruption stage P1, P2 and P3, respectively and with the
lower limit equal to 5 years for the teeth with the eruption stage P4. We
refer to Leroy et al. (2005) for details and motivation.
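The transformation rule can be written as a one-line lookup; a sketch with hypothetical names follows.

```python
# Offset (in years) subtracted from the age at the first examination to obtain
# the lower limit of the emergence interval, by clinical eruption stage.
OFFSET = {"P1": 0.25, "P2": 0.5, "P3": 1.0}

def emergence_interval(age_at_exam, stage):
    """Return (lower, upper) limits in years for an emergence time that was
    originally left-censored at age_at_exam, following the rule above."""
    lower = 5.0 if stage == "P4" else age_at_exam - OFFSET[stage]
    return lower, age_at_exam
```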
9.7.1 Basic Model
The analysis starts with the Basic Model where we allowed for a different effect of the covariates on both emergence and caries experience for the four permanent first molars. Namely, the Basic Model was based on the AFT models (9.1) and (9.2) with the covariate vector $x^u_{i,l}$ for emergence:
x^u_{i,l} = (gender_i, tooth26_{i,l}, tooth36_{i,l}, tooth46_{i,l}, tooth26_{i,l} * gender_i, tooth36_{i,l} * gender_i, tooth46_{i,l} * gender_i)',
and the covariate vector $x^t_{i,l}$ for caries:
x^t_{i,l} = (x̃^t_{i,l}, tooth26_{i,l}, tooth36_{i,l}, tooth46_{i,l}, tooth26_{i,l} * x̃^t_{i,l}, tooth36_{i,l} * x̃^t_{i,l}, tooth46_{i,l} * x̃^t_{i,l})',
where
x̃^t_{i,l} = (gender_i, statusD_{i,l}, statusF_{i,l}, statusM_{i,l}, brushing_i, sealants_{i,l}, plaquePF_{i,l}, plaqueT_{i,l}).
The covariates tooth26, tooth36, tooth46 are dummies for the position of the permanent first molar, with molar 16 as the baseline; the covariate gender equals 1 for boys and 0 for girls. The covariates statusD, statusF, statusM are dummies for the status of the adjacent deciduous molar: decayed, filled, or missing due to caries, with sound as the baseline. The covariate brushing is dichotomous (1 = daily, 0 = not daily), as is the covariate sealants (1 = present, 0 = not present). Finally, the covariates plaquePF and plaqueT are dummies for the plaque accumulation: in pits and fissures or on the total surface, with no plaque as the baseline.
To account for clustering, univariate child-specific random effects $d_i$ and $b_i$ are included in the model expressions (9.1) and (9.2), respectively, with $z^u_{i,l} = z^t_{i,l} \equiv 1$. Finally, analogously to Sections 7.7 and 8.7, we subtracted 5 years from all observed times, i.e. $\log(U_{i,l} - 5)$ was used on the left-hand side of the model formula (9.1).
As discussed already in Section 9.1, our model assumes that, given the covariates and child-specific random effects, the emergence time Ui,l and the time
to caries Ti,l are independent for each i and l. Specifically, we assume that
the caries process on a specific tooth only depends on the time when that
tooth is at risk for caries and not on the chronological time. This assumption
seems reasonable for the Signal Tandmobiel® data, taking into account the results of Leroy et al. (2005), who also evaluated the effect of the emergence time on the time to caries and found it non-significant (p = 0.78).
9.7.2 Final Model
Based on the results for the Basic Model (see below) we fitted the Final Model, where we omitted all two-way interactions with the covariates tooth26, tooth36, tooth46. Additionally, we binarized the covariates statusD, statusF, statusM into a new covariate status, which was equal to 1 for decayed, filled or missing due to caries deciduous molars and equal to 0 for sound deciduous molars. Also the covariates plaquePF and plaqueT were binarized into the covariate plaque, equal to 1 for teeth with plaque present either in pits and fissures or on the total surface and equal to 0 otherwise. That is, the onset and event covariate vectors are equal to
x^u_{i,l} = (gender_i, tooth26_{i,l}, tooth36_{i,l}, tooth46_{i,l})',
x^t_{i,l} = (gender_i, status_{i,l}, brushing_i, sealants_{i,l}, plaque_{i,l}, tooth26_{i,l}, tooth36_{i,l}, tooth46_{i,l})'.
9.7.3 Prior distribution
Firstly, for all penalized mixtures we used the same grid of 31 equidistant knots ($K = 15$) defined on $[-4.5, 4.5]$ with the basis standard deviation $\sigma = 2(\mu_j - \mu_{j-1})/3 = 0.2$. Secondly, the third order difference ($s = 3$) was used in the prior (9.7). Further, the prior distributions of the nodes in the DAGs (Figures 9.1 and 9.3) without parents were taken highly dispersed. That is, all $\lambda$ and $\tau^{-2}$ parameters were a priori Gamma(1, 0.005) distributed, and all $\alpha$, $\beta$ and $\delta$ parameters were given a $N(0, 100)$ prior.
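As a sketch of this specification, the knot grid, the basis standard deviation and the third-order difference matrix entering the penalty prior (9.7) could be constructed as follows (illustrative names only):

```python
import numpy as np

K = 15
knots = np.linspace(-4.5, 4.5, 2 * K + 1)        # mu_{-K}, ..., mu_{K}
delta = knots[1] - knots[0]                      # knot distance (0.3)
sigma = 2.0 * delta / 3.0                        # basis standard deviation (0.2)

# Third-order difference operator matrix (s = 3): each row carries the
# coefficient pattern (-1, 3, -3, 1) of a third difference of a.
s = 3
D = np.diff(np.eye(2 * K + 1), n=s, axis=0)
penalty = D.T @ D                                # enters exp(-lambda/2 * a' D'D a)
```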
9.7.4 Results
For each considered model we ran 500 000 iterations with 1:3 thinning which
took about 44 hours on a 3 GHz Pentium IV PC with 1 GB RAM. We kept
the last 100 000 iterations for inference.
Results for the Basic Model
The analysis of the Basic Model revealed that all interaction terms with the tooth covariates are redundant, implying that the effect of all these covariates is the same for all four permanent first molars. To evaluate this we
used simultaneous Bayesian p-values computed using the method described
in Section 4.6.2. For the emergence part, the simultaneous p-value for the
tooth:gender interactions is higher than 0.5. For the caries part of the model,
the p-values are higher than 0.5 for the interactions of tooth with gender and
plaque and higher than 0.1 for the interactions with brushing, sealants and
status. Also the covariate tooth is not significant; however, we kept it in the model to address the question whether the emergence and caries timing are the same for the four permanent first molars.
Table 9.1: Signal Tandmobiel® study, Final Model. Posterior medians, 95% equal-tail credible regions (CR) and Bayesian two-sided p-values for the model parameters. For the parameter Tooth the CR and the p-value are simultaneous.

                              Emergence                                Caries
Parameter            Posterior median  95% CR             Posterior median  95% CR
Tooth                                  p > 0.5                               p > 0.5
  tooth 26                 −0.003  (−0.013, 0.007)              −0.006  (−0.045, 0.031)
  tooth 36                  0.001  (−0.008, 0.011)              −0.009  (−0.051, 0.034)
  tooth 46                  0.002  (−0.008, 0.012)              −0.016  (−0.059, 0.026)
Gender                                 p = 0.008                              p = 0.085
  girl                     −0.023  (−0.039, −0.007)             −0.071  (−0.155, 0.009)
Status                                                                        p < 0.001
  dmf                                                            −0.140  (−0.193, −0.091)
Brushing                                                                      p < 0.001
  daily                                                           0.337  (0.233, 0.436)
Sealants                                                                      p < 0.001
  present                                                         0.119  (0.060, 0.178)
Plaque                                                                        p < 0.001
  present                                                        −0.114  (−0.171, −0.067)
E(error)                    0.442  (0.427, 0.456)                 1.920  (1.810, 2.059)
sd(error)                   0.029  (0.025, 0.034)                 0.767  (0.712, 0.834)
sd(random)                  0.199  (0.191, 0.210)                 0.672  (0.614, 0.734)
Further, no significant difference was found, for any of the four permanent first molars, between the status groups decayed, filled or missing, or between the plaque groups present in pits and fissures or present on the total surface. This finding, together with the fact that the group with an extracted deciduous molar and the group with plaque present on the total surface had very low prevalences (1.45% and 3.13%, respectively), led to the simplification of these two covariates in the Final Model.
Results for the Final Model
Table 9.1 shows posterior medians, 95% equal-tail credible intervals and Bayesian two-sided p-values for the parameters in the Final Model. It is seen that neither for the emergence nor for the caries process is there a significant difference between the four permanent first molars. However, the molars of girls emerge significantly earlier than those of boys. With respect to caries experience, the difference between boys and girls is not significant at 5%. However, all remaining covariates have a significant impact on the caries process. Namely, daily brushing increases the time to caries by a factor of exp(0.337) = 1.40 compared to less frequent brushing. Presence of sealants increases the time to caries by a factor of exp(0.119) = 1.13. On the other hand, the presence of plaque decreases the time to caries by a factor of exp(−0.114) = 0.89, and the fact that the neighboring deciduous second molar is either decayed, filled or extracted due to caries decreases the time to caries by a factor of exp(−0.140) = 0.87.
Figure 9.4 shows the posterior predictive survival and hazard functions for the time to caries on the upper right permanent first molar of boys, for 'the best', 'the worst' and two intermediate combinations of covariates (the curves for the remaining teeth and for girls are similar). It is seen that when the teeth are brushed daily, plaque-free and sealed, the hazard for caries starts to increase approximately 1 year after emergence but then remains almost constant. In contrast, when the teeth are not brushed daily and are exposed to the other risk factors, the hazard starts to increase already approximately 6 months after emergence. After a period of constant risk, the hazard then starts to increase again.
The peak in the hazard for caries approximately 1 year after emergence was also observed by Leroy et al. (2005) and can be explained by the fact that teeth are most vulnerable to caries soon after emergence, when the enamel is not yet fully developed. This peak is present, although with a different size and a slight shift, for all covariate combinations. On the other hand, for covariate combinations reflecting good oral health and hygiene habits, the hazard remains almost constant after the initial period of sharply increasing risk, whereas for combinations of covariates reflecting bad oral conditions the hazard starts to increase again approximately 3 years after emergence. This clearly shows the relationship between caries experience and oral health and hygiene habits.
Finally, Figure 9.5 shows Bayesian predictive error and random effect density estimates. The estimate of the emergence random effect density $g_d$ suggests that the children could be divided, even after conditioning on gender, into two groups: early and late emergers.
[Figure 9.4 appears here: two panels, 'Caries: survival function' (caries-free probability) and 'Caries: hazard function' (hazard of caries), both plotted against time since emergence (years).]
Figure 9.4: Signal Tandmobiel® study, Final Model. Posterior predictive caries free (survival) and caries hazard curves for tooth 16 of boys and the following combinations of covariates: solid and dashed lines for no plaque, present sealing, daily brushing and sound primary second molar (solid line) or dmf primary second molar (dashed line); dotted and dotted-dashed lines for present plaque, no sealing, not daily brushing and sound primary second molar (dotted line) or dmf primary second molar (dotted-dashed line).
[Figure 9.5 appears here: four panels, 'Emergence: error' ($g_\zeta(\zeta)$), 'Emergence: random effect' ($g_d(d)$), 'Caries: error' ($g_\varepsilon(\varepsilon)$) and 'Caries: random effect' ($g_b(b)$).]
Figure 9.5: Signal Tandmobiel® study, Final Model. Estimates of the densities of the error terms and random effects.
Also, the children can be divided into two groups with respect to caries sensitivity (see the random effect density $g_b$). Finally, as the estimate of the caries error density $g_\varepsilon$ shows three modes, it seems that there are other important factors influencing the caries process besides the included covariates.
9.7.5 Conclusions
This section showed how the Bayesian penalized mixture CS AFT model can be used to analyze clustered doubly-interval-censored data. Owing to the flexible distributional assumptions it was not necessary here to perform the classical checks for correct distributional specification. Clearly, this step cannot be avoided when using fully parametric methods; however, for censored, let alone doubly-interval-censored, data it is far from trivial. As illustrated in this section, important new findings concerning the distribution of the event time, derived e.g. from the shape of the hazard function, can be discovered when strong parametric assumptions are avoided.
Further, we point out that the Basic Model corresponded, for comparison purposes, as closely as possible to the model used by Leroy et al. (2005). The differences were outlined in detail above. The most important one is that we used here a flexible and cluster-specific (conditional) model fitted in the Bayesian way, whereas in Leroy et al. (2005) a parametric and population-averaged (marginal) model was fitted using a frequentist method. The results for the regression parameters of the caries part of the model correspond quite closely to the earlier findings of Leroy et al. (2005), where, however, no attempt was made to simplify the model. Nevertheless, our results largely confirmed their findings. Namely, they found the overall effect (on all four teeth) of all factors except gender to be significant with p-value < 0.001. For the effect of gender they observed a p-value of 0.060 compared to the 0.085 found by us. Because Leroy et al. (2005) used a parametric log-logistic AFT model, they could not reveal the second period of increased hazard found here.
Finally, we have to admit that some covariates used in our dental application should actually be treated as time-dependent. Unfortunately, with our method, as with any other method where the distribution of the event time is specified through a density rather than through an instantaneous quantity like the hazard function, the inclusion of time-dependent covariates is difficult.
9.8 Example: EBCP data – multicenter study
In this section, we re-analyze the Early Breast Cancer Patients data introduced in Section 1.4 using the penalized mixture cluster-specific AFT model and compare the results with those of the earlier analysis conducted using the classical normal mixture cluster-specific AFT model (see Section 8.9). Except for the model for the error distribution of the AFT model, we fitted exactly the same models as in Section 8.9. Here is a brief overview. The response event time $T_{i,l}$, $i = 1, \ldots, 14$, $l = 1, \ldots, n_i$, $25 \le n_i \le 902$, is the progression-free survival (PFS) time of the $l$th patient treated by the $i$th center. In the CS AFT model (9.2), a bivariate random effect $b_i = (b_{i,1}, b_{i,2})'$ with the covariate vector $z^t_{i,l} = (1, trtmtGroup_{i,l})'$ is included to allow for baseline heterogeneity as well as heterogeneity with respect to the treatment effect across centers. The covariate vector for the fixed effects is given by
x^t_{i,l} = (ageMid_{i,l}, ageOld_{i,l}, tySu_{i,l}, tumSiz_{i,l}, nodSt_{i,l}, otDis_{i,l}, regionNL_i, regionPL_i, regionSE_i, regionSA_i)'.
See Section 8.9 for an explanation of the individual covariates. Analogously to Section 8.9, besides the model with region described above we also fitted the model without region, for which the covariates regionNL, regionPL, regionSE, and regionSA were omitted from the covariate vector $x^t_{i,l}$. The motivation for this step was to see whether the regional structure can be revealed from the estimates of the individual random effects $b_{i,1}$, $i = 1, \ldots, 14$.
For the inference we sampled two chains, each of length 125 000 with 1:5 thinning, which took about 2.5 hours on a Pentium IV 2 GHz PC with 512 MB RAM. We kept the last 25 000 iterations of each chain for inference.
9.8.1 Prior distribution
To specify the penalized mixture defining the distribution of the error terms $\varepsilon_{i,l}$, $i = 1, \ldots, N$, $l = 1, \ldots, n_i$, we used a grid of 31 equidistant knots ($K = 15$) defined on the interval $[-4.5, 4.5]$ with the basis standard deviation $\sigma = 2(\mu_j - \mu_{j-1})/3 = 0.2$. In the prior (9.7), we used the third order differences ($s = 3$). Further, the smoothing parameter $\lambda^\varepsilon$ as well as the error precision parameter $(\tau^\varepsilon)^{-2}$ were given a dispersed Gamma(1, 0.005) prior. The intercept parameter $\alpha^\varepsilon$, all fixed effect regression parameters $\beta$, and the parameter $\gamma_{b,2}$ – the mean of the treatment random effects $b_{i,2}$ – were given a dispersed $N(0, 100)$ prior. Finally, the covariance matrix $D_b$ of the random effects was given an inverse Wishart prior with $\mathrm{df}_b = 2$ and $S_b = \mathrm{diag}(0.002)$.
Table 9.2: Early breast cancer patients data. Posterior medians, 95% equal-tail credible intervals and Bayesian two-sided (simultaneous) p-values for the effect of covariates.

                              Model with region                   Model without region
Parameter              Poster. median  95% CI               Poster. median  95% CI
Treatment group                        p = 0.084                            p = 0.074
  surgery alone            −0.153  (−0.325, 0.026)              −0.150  (−0.310, 0.015)
Age                                    p = 0.026                            p = 0.014
  40–50 years               0.325  (0.059, 0.585)                0.344  (0.088, 0.619)
  > 50 years                0.285  (0.041, 0.520)                0.313  (0.073, 0.565)
Type of surgery                        p = 0.008                            p = 0.007
  breast conserving         0.229  (0.053, 0.404)                0.248  (0.078, 0.420)
Tumor size                             p < 0.001                            p < 0.001
  ≥ 2 cm                   −0.462  (−0.643, −0.283)             −0.470  (−0.656, −0.288)
Nodal status                           p < 0.001                            p < 0.001
  positive                 −0.599  (−0.758, −0.442)             −0.605  (−0.771, −0.440)
Other disease                          p = 0.016                            p = 0.015
  present                  −0.323  (−0.605, −0.059)             −0.335  (−0.609, −0.067)
Region                                 p = 0.007
  the Netherlands          −0.403  (−0.737, −0.017)
  Poland                    0.349  (−0.113, 0.802)
  South Europe             −0.339  (−0.729, 0.033)
  South Africa             −0.737  (−1.161, −0.320)
9.8.2 Effect of covariates on PFS time
Table 9.2 shows the posterior summary for the effect of the considered covariates in both models, i.e. with and without the covariate region. In the model with region included, surgery alone decreases the time to cancer progression by a factor of exp(−0.153) = 0.86 compared to surgery combined with perioperative chemotherapy. However, as in the previous analysis in Section 8.9, the difference is not significant at the conventional 5% level.
Table 9.3: Early breast cancer patients data. Posterior medians and 95% equal-tail credible intervals for the moments of the error distribution and variance components of the random effects.

                              Model with region                 Model without region
Parameter               Poster. median  95% CI             Poster. median  95% CI

Moments of the error distribution
E(ε)                         9.155  (8.771, 9.525)              8.967  (8.570, 9.353)
sd(ε)                        1.481  (1.356, 1.663)              1.470  (1.352, 1.639)

Variance components of the random effects
sd(b_{i,1})                  0.111  (0.024, 0.336)              0.302  (0.157, 0.541)
sd(b_{i,2})                  0.057  (0.020, 0.217)              0.074  (0.022, 0.245)
corr(b_{i,1}, b_{i,2})      −0.219  (−0.987, 0.963)            −0.675  (−0.993, 0.980)
Also the results for the effect of the remaining covariates are very similar to the results of the earlier analysis given in Table 8.6. Firstly, again, the estimates in both models – with and without region – are almost the same. Further, according to the model with region included, in the middle age group 40–50 years the time to progression of cancer is increased by a factor of exp(0.325) = 1.38 compared to the youngest group (<40 years). For the patients from the oldest group (>50 years), this time is increased by a factor of exp(0.285) = 1.33 compared to the youngest group. Breast conserving surgery increases the PFS time by a factor of exp(0.229) = 1.26 compared to mastectomy. Further, a tumor of size ≥2 cm decreases the PFS time by a factor of exp(−0.462) = 0.63 compared to smaller tumors of size <2 cm. A positive pathological nodal status decreases the PFS time by a factor of exp(−0.599) = 0.55 compared to a negative result. The presence of other related disease decreases the PFS time by a factor of exp(−0.323) = 0.72. Analogously to Section 8.9, the effect of the geographical region on the PFS time is highly significant with the same ordering of regions, namely Poland performs the best, followed by France, South Europe, the Netherlands and South Africa.
Finally, Figure 9.6 illustrates the rather small effect of the perioperative therapy, compared to surgery alone, on the posterior predictive survival curves drawn for region = France and two typical combinations of covariates. More or less the same picture was seen in Figure 8.9, which refers to the results of the earlier analysis.
[Figure 9.6 appears here: two panels of predictive survival curves against time (days), 'BCS, ≥2 cm, nodal−, no other disease' and 'Mastectomy, ≥2 cm, nodal+, no other disease', each with curves for 'Surgery + chemotherapy' and 'Surgery alone'.]
Figure 9.6: Early breast cancer patients data. Predictive survival curves based on the model with region for region = France, and two typical combinations of covariates: (1) breast conserving surgery, tumor size ≥2 cm, negative nodal status and no other associated disease (9.79% of the sample), (2) mastectomy, tumor size ≥2 cm, positive nodal status and no other associated disease (13.88% of the sample).
9.8.3 Predictive error density and variance components of random effects
Table 9.3 gives posterior summary statistics for the moments of the error distribution and the variance components of the random effects. Also in this case, the results are very similar to those from the earlier analysis shown in Table 8.7. Furthermore, the 95% equal-tail credible interval for the correlation between the overall center level and the treatment × center interaction again covers almost the whole range (−1, 1) of possible values. This is also seen in the scaled histograms of the sampled values of ϱ in Figure 9.7. The estimates of the error densities in both models, with and without the covariate region, are shown in Figure 9.8. It is seen that exclusion of the covariate region had hardly any effect on the estimated error distribution. Indeed, since this covariate only groups different centers (clusters), its omission manifested itself mainly in the variability of the random intercept $b_{i,1}$ (see Table 9.3).
[Figure 9.7 appears here: two panels, 'Model with region' and 'Model without region', showing the posterior density of corr(b_{i,1}, b_{i,2}).]
Figure 9.7: Early breast cancer patients data. Scaled histograms for sampled corr(b_{i,1}, b_{i,2}).
[Figure 9.8 appears here: two panels, 'Model with region' and 'Model without region', showing the predictive error density $g_\varepsilon(e)$ against $e$.]
Figure 9.8: Early breast cancer patients data. Posterior predictive error densities.
[Figure 9.9 appears here: two panels of predictive survival curves against time (days), 'BCS, ≥2 cm, nodal−, no other disease' and 'Mastectomy, ≥2 cm, nodal+, no other disease', each with curves for 'Surgery + chemotherapy' and 'Surgery alone'.]
Figure 9.9: Early breast cancer patients data, comparison of the penalized mixture CS AFT model (solid lines) and the classical mixture CS AFT model (dashed lines). Predictive survival curves based on the model with region for region = France, and two typical combinations of covariates: (1) breast conserving surgery, tumor size ≥2 cm, negative nodal status and no other associated disease (9.79% of the sample), (2) mastectomy, tumor size ≥2 cm, positive nodal status and no other associated disease (13.88% of the sample).
[Figure 9.10 appears here: panels by institution (11 12 13 21 22 31 32 33 34 41 42 43 44 51); for the model with region, 'Intercept + region effect' ($b_1 + E(\varepsilon) + \beta(\text{region})$) and 'Treatment' ($b_2$); for the model without region, 'Intercept' ($b_1 + E(\varepsilon)$) and 'Treatment' ($b_2$).]
Figure 9.10: Early breast cancer patients data. Posterior means and 95% equal-tail credible intervals for the individual random effects. Random intercepts are further shifted by the error mean E(ε) and, in the model with region, also by the corresponding region main effect β(region).
The shape of the estimated error density seems to be somewhat different from what was found using the classical mixture model in Section 8.9 (Figure 8.11). In Figure 8.11, a shape similar to what is seen now (Figure 9.8) can only be found when looking at the conditional predictive densities, given K > 1. However, both estimated error distributions lead to almost the same estimates of the survival curves, as seen in Figure 9.9.
9.8.4 Estimates of individual random effects
Finally, Figure 9.10 shows estimates of the individual random effects. Analogously to Figure 8.12, the plots related to the random intercept also take into account the mean of the error term and, in the case of the model with region, the appropriate main effect of region. It can be seen that, analogously to the remaining model characteristics, Figure 9.10 resembles Figure 8.12 quite closely. Among other things, also here the estimates of the individual random intercepts in the model without region capture the region effect quite nicely.
9.8.5 Conclusions
The main purpose of this section was to explore how the chosen method for the estimation of the error distribution influences the results of a particular analysis. We have seen that, except for the estimate of the error distribution itself, the differences were almost negligible. Moreover, although the estimated shapes of the error distribution were somewhat different, they both led to almost identical survival curves.
9.9 Discussion
A semiparametric method to perform a regression analysis with clustered doubly-interval-censored data was suggested in this chapter. We opted for a fully Bayesian approach and MCMC methodology. Note, however, that as in Chapter 8, the Bayesian approach is used only for technical convenience, to avoid the difficult optimization that is unavoidable with more classical maximum-likelihood based estimation. Remember that we use a penalty-like prior distribution for the transformed mixture weights a and vague priors for all remaining parameters. We did not make any attempt to use prior information, although it could have been utilized. Taking into account the above reasoning, we conclude that similar results would have been obtained if penalized maximum-likelihood estimation had been used.
Chapter 10
Bayesian Penalized Mixture Population-Averaged AFT Model
In Section 9.7, we evaluated the impact of several covariates on the time to caries of the permanent first molars, which are the teeth most often attacked by caries during childhood. It was also of interest to know whether the covariates have the same effect on all teeth. Hence all four teeth had to be modelled jointly. In the same section, univariate cluster-specific random effects were included in the model expression to account for within-cluster dependencies. Given these random effects, the observations within each cluster were assumed to be independent. The distributional parts of the model were specified as penalized univariate normal mixtures.
However, it is also of interest to evaluate the association between the times-to-caries of the studied teeth. Nevertheless, the approach of Chapter 9 treats the within-cluster association as a nuisance and, except for the estimated variance of the random effects, it does not give a direct measure of the within-cluster association. For this reason, we modify the method of Chapter 9 and assume a multivariate error distribution in the form of a penalized multivariate normal mixture with a high number of components with equidistant means and constant covariance matrices.
For explanatory as well as computational reasons we describe only a bivariate version of the model, as given by Komárek and Lesaffre (2006c), and apply it to the analysis of the right permanent first molars in Section 10.6. The approach of this chapter allows one to visualize the estimated bivariate error distribution and to evaluate the association of the paired responses.
In Section 10.1, we specify the penalized mixture population-averaged AFT
model. Further, the prior distributions are given and the posterior distribution is derived in Section 10.2. Section 10.3 provides the details of the Markov chain Monte Carlo method in the context of the model of this chapter. In Section 10.4, we show how the association between the paired responses can be evaluated. Estimation of the survival distribution is discussed in Section 10.5. The analysis of the doubly-interval-censored caries times of the right permanent first molars is given in Section 10.6. Finally, we provide a discussion in Section 10.7.
10.1 Model
A similar notation as in Chapter 9 will be used here. That is, let $U_{i,l}$ and $V_{i,l}$, $i = 1, \ldots, N$, $l = 1, 2$, be the onset time and the failure time, respectively, for the $l$th unit of the $i$th cluster in the study. Let $T_{i,l} = V_{i,l} - U_{i,l}$ denote the corresponding event time. The onset time $U_{i,l}$ is only observed in an interval $\lfloor u^L_{i,l}, u^U_{i,l} \rfloor$. Similarly, we only know that the failure time $V_{i,l}$ lies in an interval $\lfloor v^L_{i,l}, v^U_{i,l} \rfloor$.
Further, let xui,l be the vector of covariates which might have an effect on the
onset time Ui,l and xti,l be the vector of covariates which can possibly influence the event time Ti,l . Additionally, we assume that the onset times vector
(Ui,1 , Ui,2 )′ and the time-to-event vector (Ti,1 , Ti,2 )′ are, given the covariates,
for each i independent (see Chapter 9 for a detailed discussion of this assumption) and that the interval censoring is independent and noninformative
(e.g. pre-scheduled visits, see Section 2.4).
The distribution of $(U_{i,1}, U_{i,2}, T_{i,1}, T_{i,2})'$, $i = 1, \ldots, N$, given the covariates, is given by the following accelerated failure time model:
$$ \log(U_{i,l}) = \delta' x^u_{i,l} + \zeta_{i,l}, \tag{10.1} $$
$$ \log(V_{i,l} - U_{i,l}) = \log(T_{i,l}) = \beta' x^t_{i,l} + \varepsilon_{i,l}, \qquad i = 1, \ldots, N, \; l = 1, 2, \tag{10.2} $$
where $\delta = (\delta_1, \ldots, \delta_{m_u})'$ and $\beta = (\beta_1, \ldots, \beta_{m_t})'$ are unknown regression parameter vectors, $\zeta_i = (\zeta_{i,1}, \zeta_{i,2})'$, $i = 1, \ldots, N$, are i.i.d. random vectors with a bivariate density $g_\zeta(\zeta_1, \zeta_2)$ and, similarly, $\varepsilon_i = (\varepsilon_{i,1}, \varepsilon_{i,2})'$, $i = 1, \ldots, N$, are i.i.d. random vectors with a bivariate density $g_\varepsilon(\varepsilon_1, \varepsilon_2)$.
10.1.1 Distributional assumptions
Our model for the unknown bivariate densities gε (ε1 , ε2 ) and gζ (ζ1 , ζ2 ) is motivated by a penalized smoothing as introduced in Section 6.3.4 and directly
generalizes the method used in Chapter 9 into higher dimensions.
Let $Y = (Y_1, Y_2)'$ be a generic symbol for either $\varepsilon = (\varepsilon_1, \varepsilon_2)'$ or $\zeta = (\zeta_1, \zeta_2)'$ and $g(y) = g(y_1, y_2)$ be a generic symbol for its density. We express the unknown density $g(y)$ as a location-and-scale transformed finite mixture of bivariate normal densities with zero correlation over a fixed fine grid with knots $\mu_{(j_1,j_2)} = (\mu_{1,j_1}, \mu_{2,j_2})'$, $j_1 = -K_1, \ldots, K_1$, $j_2 = -K_2, \ldots, K_2$, that are centered around zero, i.e. $\mu_{(0,0)} = (0, 0)'$. The means of the bivariate normal components are equal to the knots and their covariance matrices are all equal but fixed to $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2)$. Thus,
$$ g(y) = (\tau_1 \tau_2)^{-1} \sum_{j_1=-K_1}^{K_1} \sum_{j_2=-K_2}^{K_2} w_{j_1,j_2}(A)\; \varphi_2\bigg( \Big(\frac{y_1 - \alpha_1}{\tau_1}, \frac{y_2 - \alpha_2}{\tau_2}\Big)' \,\bigg|\, \mu_{(j_1,j_2)}, \Sigma \bigg). \tag{10.3} $$
In expression (10.3), the intercept term $\alpha = (\alpha_1, \alpha_2)'$ and the scale parameter vector $\tau = (\tau_1, \tau_2)'$ have to be estimated, as well as the matrix $A = (a_{j_1,j_2})$, $j_1 = -K_1, \ldots, K_1$, $j_2 = -K_2, \ldots, K_2$, of the transformed weights. See (6.19) for the relationship between $A$ and $W = (w_{j_1,j_2})$, $j_1 = -K_1, \ldots, K_1$, $j_2 = -K_2, \ldots, K_2$. The density of the zero-mean, unit-variance random vector $Y^* = \big(\tau_1^{-1}(Y_1 - \alpha_1), \tau_2^{-1}(Y_2 - \alpha_2)\big)'$ is the density of the bivariate normal mixture with uncorrelated components given by (6.21).
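Because the component covariance matrix Σ is diagonal, each bivariate normal density in (10.3) factorizes into two univariate ones, which makes the mixture cheap to evaluate. A small numpy sketch follows (array names are assumptions for illustration, not the notation of any particular software):

```python
import numpy as np
from scipy.stats import norm

def mixture_density(y, W, mu1, mu2, sigma1, sigma2, alpha, tau):
    """Evaluate g(y1, y2) of (10.3).

    W          : (2K1+1, 2K2+1) mixture weights w_{j1,j2}
    mu1, mu2   : knot coordinates in each margin
    sigma1/2   : fixed basis standard deviations (Sigma = diag(s1^2, s2^2))
    alpha, tau : length-2 intercept and scale vectors
    """
    z1 = (y[0] - alpha[0]) / tau[0]
    z2 = (y[1] - alpha[1]) / tau[1]
    # zero-correlation components: phi_2(z | mu, Sigma) = phi(z1 | mu1) * phi(z2 | mu2)
    f1 = norm.pdf(z1, loc=mu1, scale=sigma1)        # (2K1+1,)
    f2 = norm.pdf(z2, loc=mu2, scale=sigma2)        # (2K2+1,)
    return (W * np.outer(f1, f2)).sum() / (tau[0] * tau[1])
```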
In the following, let $G^\varepsilon$ refer to the set $\{\Sigma^\varepsilon, \mu^\varepsilon, \alpha^\varepsilon, \tau^\varepsilon, W^\varepsilon, A^\varepsilon, \lambda^\varepsilon\}$, which contains the parameters defining the distribution of $\varepsilon$ and a smoothing parameter vector $\lambda^\varepsilon$ which we will discuss in Section 10.2.1. Similarly, let $G^\zeta$ refer to the set $\{\Sigma^\zeta, \mu^\zeta, \alpha^\zeta, \tau^\zeta, W^\zeta, A^\zeta, \lambda^\zeta\}$, which contains the parameters defining the distribution of $\zeta$ and a corresponding smoothing parameter vector $\lambda^\zeta$. Finally, let $G$ be a generic symbol for $G^\varepsilon$ or $G^\zeta$.
10.1.2 Likelihood
Let $p$ denote a generic density. The likelihood contribution of the $i$th paired response equals
$$ L_i = \int_{u^L_{i,1}}^{u^U_{i,1}} \int_{u^L_{i,2}}^{u^U_{i,2}} \int_{v^L_{i,1}-u_{i,1}}^{v^U_{i,1}-u_{i,1}} \int_{v^L_{i,2}-u_{i,2}}^{v^U_{i,2}-u_{i,2}} p(u_{i,1}, u_{i,2}, t_{i,1}, t_{i,2}) \, dt_{i,2}\, dt_{i,1}\, du_{i,2}\, du_{i,1} $$
$$ = \int_{u^L_{i,1}}^{u^U_{i,1}} \int_{u^L_{i,2}}^{u^U_{i,2}} p(u_{i,1}, u_{i,2}) \int_{v^L_{i,1}-u_{i,1}}^{v^U_{i,1}-u_{i,1}} \int_{v^L_{i,2}-u_{i,2}}^{v^U_{i,2}-u_{i,2}} p(t_{i,1}, t_{i,2}) \, dt_{i,2}\, dt_{i,1}\, du_{i,2}\, du_{i,1}, \tag{10.4} $$
where
$$ p(t_{i,1}, t_{i,2}) = (t_{i,1} t_{i,2})^{-1}\, g_\varepsilon\big( \log(t_{i,1}) - \beta' x^t_{i,1},\; \log(t_{i,2}) - \beta' x^t_{i,2} \big), $$
$$ p(u_{i,1}, u_{i,2}) = (u_{i,1} u_{i,2})^{-1}\, g_\zeta\big( \log(u_{i,1}) - \delta' x^u_{i,1},\; \log(u_{i,2}) - \delta' x^u_{i,2} \big), $$
are obtained using the expression (10.3) for $g_\varepsilon$ and $g_\zeta$.
In another context, Ghidey, Lesaffre, and Eilers (2004) used an expression similar to (10.3) to model the density of the random intercept and slope in the linear mixed model with uncensored data. Further, Bogaerts and Lesaffre (2006) used this approach to model the density of bivariate simply-interval-censored data without covariates. In both papers, a penalized maximum likelihood method was used to estimate the unknown parameters. In our context, however, a maximum likelihood procedure is difficult and computationally almost intractable. As in Chapter 9, we suggest using the Bayesian approach together with MCMC methodology.
10.2 Bayesian hierarchical model
Let $\theta$ be a vector of all unknown parameters in our model. We assume the hierarchical structure represented by the directed acyclic graph (DAG) shown in Figure 10.1. The DAG implies the following prior distribution:
$$ p(\theta) \propto \prod_{i=1}^{N} \Big[ p\big(v_{i,1}, v_{i,2} \mid u_{i,1}, u_{i,2}, t_{i,1}, t_{i,2}\big) \times p\big(t_{i,1}, t_{i,2} \mid \beta, \varepsilon_{i,1}, \varepsilon_{i,2}\big) \times p\big(u_{i,1}, u_{i,2} \mid \delta, \zeta_{i,1}, \zeta_{i,2}\big) \times p\big(\varepsilon_{i,1}, \varepsilon_{i,2} \mid G^\varepsilon, r^\varepsilon_{i,1}, r^\varepsilon_{i,2}\big) \times p\big(\zeta_{i,1}, \zeta_{i,2} \mid G^\zeta, r^\zeta_{i,1}, r^\zeta_{i,2}\big) \times p\big(r^\varepsilon_{i,1}, r^\varepsilon_{i,2} \mid G^\varepsilon\big) \times p\big(r^\zeta_{i,1}, r^\zeta_{i,2} \mid G^\zeta\big) \Big] \times p\big(G^\varepsilon\big) \times p\big(G^\zeta\big) \times p(\delta) \times p(\beta). \tag{10.5} $$
The DAG child-parent conditional distributions and the priors for the parameters residing at the top of the hierarchy are similar to those used in Chapter 9. We give a brief overview and highlight the differences for the bivariate model considered here.
10.2.1 Prior distribution for G
The structure of the prior distribution of the generic node G is the same as
in Section 9.2.1, i.e.
p(G) ∝ p(A | λ) p(λ) p(α) p(τ ).
With the bivariate setting, the number of unknown elements of the matrix A is naturally much higher than in the univariate setting used in Chapter 9, namely it is equal to $(2K_1 + 1) \times (2K_2 + 1)$ (e.g. equal to 961 in the analysis of the Signal Tandmobiel® data in Section 10.6). With an uninformative prior for A, this could cause overfitting of the data or identifiability problems.

[Figure 10.1 appears here: the DAG with an onset part (nodes $G^\zeta$, $\delta$, $x^u_{i,l}$, $r^\zeta_{i,l}$, $\zeta_{i,l}$, $u_{i,l}$) and an event part (nodes $G^\varepsilon$, $\beta$, $x^t_{i,l}$, $r^\varepsilon_{i,l}$, $\varepsilon_{i,l}$, $t_{i,l}$, $v_{i,l}$), together with the censoring nodes $u^L_{i,l}$, $u^U_{i,l}$, $v^L_{i,l}$, $v^U_{i,l}$, for $l = 1, 2$, $i = 1, \ldots, N$.]
Figure 10.1: Directed acyclic graph for the Bayesian penalized mixture population-averaged AFT model.
Spatial prior for A
Since the (transformed) mixture weights correspond to spatially located normal components, a Gaussian Markov random field (GMRF) prior (see, e.g., Besag et al., 1995, Section 3), common in spatial statistics, can be exploited here. Such a prior distribution can be defined by specifying the conditional distribution of each $a_{j_1,j_2}$ given the remaining $a_{k_1,k_2}$, $(k_1,k_2) \neq (j_1,j_2)$, here denoted as $A_{-(j_1,j_2)}$, and the hyperparameter $\lambda$ that controls the smoothness. Usually, only a few neighboring coefficients are effectively used in the specification of $p(a_{j_1,j_2} \mid A_{-(j_1,j_2)}, \lambda)$. A commonly used conditional distribution is a normal distribution with expectation and variance equal to
$$ E\big(a_{j_1,j_2} \mid A_{-(j_1,j_2)}, \lambda\big) = \frac{a_{j_1-1,j_2} + a_{j_1+1,j_2} + a_{j_1,j_2-1} + a_{j_1,j_2+1}}{2} - \frac{a_{j_1-1,j_2-1} + a_{j_1-1,j_2+1} + a_{j_1+1,j_2-1} + a_{j_1+1,j_2+1}}{4}, $$
$$ \mathrm{var}\big(a_{j_1,j_2} \mid A_{-(j_1,j_2)}, \lambda\big) = (4\lambda)^{-1}, \tag{10.6} $$
respectively, based on the eight nearest neighbors and local quadratic smoothing. Note that the expectation and variance formulas have to be changed appropriately on edges, where only five neighbors are available, and in corners, where only three of the original eight neighbors remain. Namely, for the edge given by $j_1 = K_1$:
$$ E\big(a_{K_1,j_2} \mid A_{-(K_1,j_2)}, \lambda\big) = a_{K_1-1,j_2} + \frac{a_{K_1,j_2-1} + a_{K_1,j_2+1}}{2} - \frac{a_{K_1-1,j_2-1} + a_{K_1-1,j_2+1}}{2}, $$
$$ \mathrm{var}\big(a_{K_1,j_2} \mid A_{-(K_1,j_2)}, \lambda\big) = (2\lambda)^{-1}, \qquad j_2 = -K_2+1, \ldots, K_2-1, $$
and similarly for the remaining edges. In the corner given by $(j_1, j_2) = (K_1, K_2)$:
$$ E\big(a_{K_1,K_2} \mid A_{-(K_1,K_2)}, \lambda\big) = a_{K_1-1,K_2} + a_{K_1,K_2-1} - a_{K_1-1,K_2-1}, $$
$$ \mathrm{var}\big(a_{K_1,K_2} \mid A_{-(K_1,K_2)}, \lambda\big) = \lambda^{-1}, $$
and similarly for the remaining corners.
Let $a$ denote the matrix A stacked into a column vector. Using the bivariate difference operator
$$ \Delta a_{j_1,j_2} = a_{j_1,j_2} - a_{j_1+1,j_2} - a_{j_1,j_2+1} + a_{j_1+1,j_2+1}, $$
and denoting by $D$ the associated difference operator matrix, the joint prior of all transformed weights A given the smoothing hyperparameter $\lambda$ can be written as
$$ p(A \mid \lambda) \propto \exp\Big\{ -\frac{\lambda}{2} \sum_{j_1=-K_1}^{K_1-1} \sum_{j_2=-K_2}^{K_2-1} \big(\Delta a_{j_1,j_2}\big)^2 \Big\} = \exp\Big\{ -\frac{\lambda}{2}\, a' D' D\, a \Big\}, \tag{10.7} $$
which shows that the DAG conditional distribution $p(A \mid \lambda)$ specified as a GMRF is multivariate normal with covariance matrix $\lambda^{-1} (D'D)^{-}$, where $(D'D)^{-}$ denotes a generalized inverse of $D'D$. Although this distribution is improper (the matrix $D'D$ has a rank deficiency of $2(K_1 + K_2) + 1$), the resulting posterior distribution is proper as soon as some informative data are available, see Besag et al. (1995).
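Since ∆ a_{j1,j2} is simply a first difference taken in each direction, the operator matrix D of (10.7) can be assembled as a Kronecker product of univariate first-difference matrices, assuming A is stacked row by row into a. A sketch under that assumption, which also reproduces the rank deficiency mentioned above:

```python
import numpy as np

def gmrf_penalty(K1, K2):
    """Build D'D for the GMRF prior (10.7), assuming the matrix A is
    stacked row by row into the vector a (row-major flattening)."""
    n1, n2 = 2 * K1 + 1, 2 * K2 + 1
    D1 = np.diff(np.eye(n1), n=1, axis=0)      # first differences over j1
    D2 = np.diff(np.eye(n2), n=1, axis=0)      # first differences over j2
    D = np.kron(D1, D2)                        # bivariate difference operator
    return D.T @ D                             # lambda * a' D'D a is the penalty

P = gmrf_penalty(2, 2)
# rank deficiency of D'D is 2*(K1 + K2) + 1, as stated above:
print(P.shape[0] - np.linalg.matrix_rank(P))   # -> 9 for K1 = K2 = 2
```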
Conditionally univariate difference prior
An alternative prior, still belonging to the class of GMRFs and corresponding closely to the prior for A used in Chapter 9, is obtained by considering a univariate difference operator for each row and each column of the matrix A, with possibly two different smoothing hyperparameters stacked in a vector $\lambda = (\lambda_1, \lambda_2)'$ acting on rows and columns separately. Then
$$ p(A \mid \lambda) \propto \exp\Big\{ -\frac{\lambda_1}{2} \sum_{j_1=-K_1}^{K_1} \sum_{j_2=-K_2+s}^{K_2} \big(\Delta^s_1 a_{j_1,j_2}\big)^2 - \frac{\lambda_2}{2} \sum_{j_2=-K_2}^{K_2} \sum_{j_1=-K_1+s}^{K_1} \big(\Delta^s_2 a_{j_1,j_2}\big)^2 \Big\} = \exp\Big\{ -\frac{1}{2}\, a' \big(\lambda_1 D_1'D_1 + \lambda_2 D_2'D_2\big)\, a \Big\}, \tag{10.8} $$
where $\Delta^s_l$, $l = 1, 2$, denotes a difference operator of order $s$ for the $l$th dimension, e.g. $\Delta^3_1 a_{j_1,j_2} = a_{j_1,j_2} - 3a_{j_1,j_2-1} + 3a_{j_1,j_2-2} - a_{j_1,j_2-3}$, and $D_1$ and $D_2$ are the corresponding difference operator matrices for each dimension. This prior distribution corresponds to a local polynomial smoothing of degree $s - 1$ in each row and each column of the matrix A. For example, the conditional mean and variance are given (for $s = 3$ and except on the corners and edges) by
$$ E\big(a_{j_1,j_2} \mid A_{-(j_1,j_2)}, \lambda\big) = \frac{\lambda_1 A_{j_2 \mid j_1} + \lambda_2 A_{j_1 \mid j_2}}{\lambda_1 + \lambda_2}, \qquad \mathrm{var}\big(a_{j_1,j_2} \mid A_{-(j_1,j_2)}, \lambda\big) = \frac{1}{20(\lambda_1 + \lambda_2)}, \tag{10.9} $$
where
$$ A_{k \mid j} = \frac{a_{j,k-3} - 6a_{j,k-2} + 15a_{j,k-1} + 15a_{j,k+1} - 6a_{j,k+2} + a_{j,k+3}}{20}. $$
Both the spatial prior for A and the conditionally univariate difference prior for A put higher probability mass in areas where spatially close coefficients of the matrix A do not substantially differ. In other words, a priori we believe that the estimated densities $g_\zeta(\zeta_1, \zeta_2)$ and $g_\varepsilon(\varepsilon_1, \varepsilon_2)$ are smooth. In general, prior (10.8) leads to a better fit in our context and is hence preferred.
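Similarly, the penalty matrix λ1 D1'D1 + λ2 D2'D2 of prior (10.8) can be assembled from Kronecker products of an order-s difference matrix with identity matrices, again assuming A is stacked row by row; a sketch, not the implementation used for the analyses:

```python
import numpy as np

def difference_penalty(K1, K2, lam1, lam2, s=3):
    """Penalty matrix lambda1*D1'D1 + lambda2*D2'D2 of prior (10.8), assuming
    A (size (2K1+1) x (2K2+1)) is stacked row by row into the vector a."""
    n1, n2 = 2 * K1 + 1, 2 * K2 + 1
    d1 = np.diff(np.eye(n2), n=s, axis=0)      # order-s differences along j2 (within a row)
    d2 = np.diff(np.eye(n1), n=s, axis=0)      # order-s differences along j1 (within a column)
    D1 = np.kron(np.eye(n1), d1)               # Delta^s_1 applied to every row
    D2 = np.kron(d2, np.eye(n2))               # Delta^s_2 applied to every column
    return lam1 * D1.T @ D1 + lam2 * D2.T @ D2

# For example, the prior precision of the stacked vector a under (10.8):
P = difference_penalty(15, 15, lam1=1.0, lam2=1.0)   # shape (961, 961)
```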
Prior for the smoothing parameter
The $\lambda$ parameter in the prior (10.7), or the components $\lambda_1$, $\lambda_2$ of the $\lambda$ parameter in the prior (10.8), determine, together with the fixed difference operator matrix D, the precision of the transformed weights A. We assign these parameters the standardly used, highly dispersed (but proper) Gamma priors.
Prior for the mixture intercepts and scales
The intercept parameters $\alpha^\varepsilon_1$, $\alpha^\varepsilon_2$, $\alpha^\zeta_1$, $\alpha^\zeta_2$ are given a vague normal prior unless some external information is available. For the scale parameters $\tau^\varepsilon_1$, $\tau^\varepsilon_2$, $\tau^\zeta_1$, $\tau^\zeta_2$ we suggest using either a uniform prior or a highly dispersed inverse-Gamma prior for the squared scale parameters.
10.2.2 Prior distribution for the generic node Y
To specify the prior distribution of the generic node $Y$, i.e. of the nodes $\varepsilon_i$ and $\zeta_i$, $i = 1, \ldots, N$, we introduce, analogously to Chapter 9 and using the idea of Bayesian data augmentation (see Section 4.3), a latent allocation vector $r = (r_1, r_2)'$ that can take discrete values from $\{-K_1, \ldots, K_1\} \times \{-K_2, \ldots, K_2\}$. Its DAG conditional distribution is given by
$$ \Pr\big(r = (j_1, j_2)' \mid G\big) = \Pr\big(r = (j_1, j_2)' \mid W\big) = w_{j_1,j_2}, \qquad j_1 \in \{-K_1, \ldots, K_1\}, \; j_2 \in \{-K_2, \ldots, K_2\}. $$
The DAG conditional distribution of the generic node $Y$ is then simply bivariate normal with independent margins:
$$ p(y_1, y_2 \mid G, r_1, r_2) = \varphi_2\big( y \,\big|\, \alpha + \mathrm{diag}(\tau)\, \mu_{(r_1,r_2)},\; \mathrm{diag}(\tau)\, \Sigma\, \mathrm{diag}(\tau) \big). $$
Without introducing the latent allocation vectors we would have to work with $p(y \mid G) = p(y \mid \mu, \Sigma, \alpha, \tau, W)$, which is the bivariate normal mixture given by (10.3).
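A sketch of this two-stage (data-augmentation) representation, drawing the allocation vector r from the weights W and then Y from the selected component, with illustrative array names:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_Y(W, mu1, mu2, sigma, alpha, tau):
    """Draw Y = (Y1, Y2)' by first drawing the latent allocation r ~ W and
    then Y | r from the (independent-margin) bivariate normal component.
    sigma is the pair (sigma1, sigma2) of basis standard deviations."""
    n1, n2 = W.shape
    flat = rng.choice(n1 * n2, p=W.ravel())          # Pr(r = (j1, j2)') = w_{j1, j2}
    j1, j2 = divmod(flat, n2)
    y_star = rng.normal(loc=[mu1[j1], mu2[j2]], scale=sigma)   # Y* given r
    return np.asarray(alpha) + np.asarray(tau) * y_star        # location-scale shift
```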
10.2.3 Prior distribution for the regression parameters and time variables
The prior distribution of the regression parameters and the time variables is exactly the same as in Chapter 9. That is, the regression parameter vectors $\beta$ and $\delta$ are given a vague normal prior unless some external information is available. Further, the nodes $u^L_{i,l}$, $u^U_{i,l}$, $v^L_{i,l}$, $v^U_{i,l}$, $t^L_{i,l}$ and $t^U_{i,l}$ have, conditionally on their parents, the Dirac distribution driven by the censoring mechanism and the true onset, failure or event time, respectively; see Section 9.2.5, with an obvious change in notation. Finally, remember that we do not have to specify an exact form of the censoring mechanism as long as it is noninformative and independent.
10.2.4 Posterior distribution
The posterior distribution is given as a product of all DAG conditional distributions. See Section 9.2.6 for details.
10.3 Markov chain Monte Carlo
In practice we obtain a sample from the posterior distribution using the Markov chain Monte Carlo method and base our inference on this sample. Analogously to Chapter 9, the basis for the MCMC algorithm is Gibbs sampling (Geman and Geman, 1984) using the full conditional distributions. In situations where the full conditional distribution was not of a standard form we used either slice sampling (Neal, 2003) or adaptive rejection sampling (Gilks and Wild, 1992). For most parameters the full conditionals are identical (with only a slight change in notation) to those given in Section 9.3, and we refer the reader there.
Here we mention only the full conditional distribution for the transformed mixture weights which, due to the bivariate nature considered here, differs from that in Chapter 9 and is equal to
$$ p(a_{j_1,j_2} \mid \cdots) \propto \frac{\exp\big(N_{j_1,j_2}\, a_{j_1,j_2}\big)}{\Big\{ \sum_{k_1=-K_1}^{K_1} \sum_{k_2=-K_2}^{K_2} \exp(a_{k_1,k_2}) \Big\}^{N}} \times \exp\Bigg[ -\frac{\big\{ a_{j_1,j_2} - E\big(a_{j_1,j_2} \mid A_{-(j_1,j_2)}, \lambda\big) \big\}^2}{2\, \mathrm{var}\big(a_{j_1,j_2} \mid A_{-(j_1,j_2)}, \lambda\big)} \Bigg], $$
$$ j_1 = -K_1, \ldots, K_1, \qquad j_2 = -K_2, \ldots, K_2, $$
where $N_{j_1,j_2}$ denotes the number of latent allocation vectors $r_i$ that are equal to $(j_1, j_2)'$, and $E\big(a_{j_1,j_2} \mid A_{-(j_1,j_2)}, \lambda\big)$ and $\mathrm{var}\big(a_{j_1,j_2} \mid A_{-(j_1,j_2)}, \lambda\big)$ follow from (10.6) or (10.9).
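When a single a_{j1,j2} is updated by slice sampling, only the unnormalized log of the expression above is needed; a sketch, assuming a hypothetical helper `cond_moments` that returns the prior conditional mean and variance from (10.6) or (10.9):

```python
import numpy as np

def log_full_conditional(a_val, j1, j2, A, N_alloc, N, cond_moments, lam):
    """Unnormalized log p(a_{j1,j2} | ...): multinomial-logit likelihood part
    plus the GMRF prior part.  N_alloc[j1, j2] counts allocation vectors equal
    to (j1, j2)'; N is the number of clusters."""
    A = A.copy()
    A[j1, j2] = a_val
    loglik = N_alloc[j1, j2] * a_val - N * np.log(np.exp(A).sum())
    mean, var = cond_moments(A, j1, j2, lam)       # E and var from (10.6) or (10.9)
    logprior = -0.5 * (a_val - mean) ** 2 / var
    return loglik + logprior
```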
10.4 Evaluation of association
The association between the paired responses, after adjustment for the effect of covariates, can be evaluated, for example, using the Pearson correlation coefficient of the error terms $\zeta_{i,1}$ and $\zeta_{i,2}$, or $\varepsilon_{i,1}$ and $\varepsilon_{i,2}$, respectively. For example, the Pearson correlation coefficient of the error terms $\varepsilon_{i,1}$ and $\varepsilon_{i,2}$ equals
$$ \varrho^\varepsilon = \frac{\displaystyle\sum_{j_1=-K_1}^{K_1} \sum_{j_2=-K_2}^{K_2} w^\varepsilon_{j_1,j_2}\, \big(\mu^\varepsilon_{1,j_1} - M^\varepsilon_1\big)\big(\mu^\varepsilon_{2,j_2} - M^\varepsilon_2\big)}{\Big\{ (\sigma^\varepsilon_1)^2 + \displaystyle\sum_{j_1=-K_1}^{K_1} w^\varepsilon_{j_1+}\, \big(\mu^\varepsilon_{1,j_1} - M^\varepsilon_1\big)^2 \Big\}^{\frac12} \Big\{ (\sigma^\varepsilon_2)^2 + \displaystyle\sum_{j_2=-K_2}^{K_2} w^\varepsilon_{+j_2}\, \big(\mu^\varepsilon_{2,j_2} - M^\varepsilon_2\big)^2 \Big\}^{\frac12}}, $$
where
$$ w^\varepsilon_{j_1+} = \sum_{j_2=-K_2}^{K_2} w^\varepsilon_{j_1,j_2}, \quad j_1 = -K_1, \ldots, K_1, \qquad w^\varepsilon_{+j_2} = \sum_{j_1=-K_1}^{K_1} w^\varepsilon_{j_1,j_2}, \quad j_2 = -K_2, \ldots, K_2, $$
$$ M^\varepsilon_1 = \sum_{j_1=-K_1}^{K_1} w^\varepsilon_{j_1+}\, \mu^\varepsilon_{1,j_1}, \qquad M^\varepsilon_2 = \sum_{j_2=-K_2}^{K_2} w^\varepsilon_{+j_2}\, \mu^\varepsilon_{2,j_2}. $$
Another popular measure of association for censored data is Kendall's tau, denoted by $\tau_{\mathrm{Kend}}$, one advantage of which is that it is invariant to monotone transformations. In our context this implies that, after adjustment for the effect of covariates, the same value of Kendall's tau is obtained for the original event times and for their logarithmic transformation represented by the error terms. For example, for the time-to-event part of the model, given the model parameters, the Kendall's tau $\tau^\varepsilon_{\mathrm{Kend}}$ is equal to
$$ \tau^\varepsilon_{\mathrm{Kend}} = 4 \sum_{j_1=-K_1}^{K_1} \sum_{j_2=-K_2}^{K_2} \sum_{k_1=-K_1}^{K_1} \sum_{k_2=-K_2}^{K_2} w^\varepsilon_{j_1,j_2}\, w^\varepsilon_{k_1,k_2}\, \Phi\Big( \frac{\mu^\varepsilon_{1,j_1} - \mu^\varepsilon_{1,k_1}}{\sqrt{2}\,\sigma^\varepsilon_1} \Big)\, \Phi\Big( \frac{\mu^\varepsilon_{2,j_2} - \mu^\varepsilon_{2,k_2}}{\sqrt{2}\,\sigma^\varepsilon_2} \Big) - 1, $$
see Bogaerts and Lesaffre (2006) for details.
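Both association measures depend only on the weight matrix, the knots and the basis standard deviations, so they can be computed directly from each MCMC draw. A numpy transcription of the two formulas above (names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def pearson_and_kendall(W, mu1, mu2, sigma1, sigma2):
    """Pearson correlation and Kendall's tau of the bivariate penalized
    normal mixture, per the two formulas above."""
    w1 = W.sum(axis=1)                      # w_{j1+}
    w2 = W.sum(axis=0)                      # w_{+j2}
    M1 = np.sum(w1 * mu1)
    M2 = np.sum(w2 * mu2)
    cov = np.sum(W * np.outer(mu1 - M1, mu2 - M2))
    var1 = sigma1 ** 2 + np.sum(w1 * (mu1 - M1) ** 2)
    var2 = sigma2 ** 2 + np.sum(w2 * (mu2 - M2) ** 2)
    rho = cov / np.sqrt(var1 * var2)

    # Kendall's tau: the quadruple sum written with outer differences
    P1 = norm.cdf((mu1[:, None] - mu1[None, :]) / (np.sqrt(2) * sigma1))
    P2 = norm.cdf((mu2[:, None] - mu2[None, :]) / (np.sqrt(2) * sigma2))
    tau = 4.0 * np.einsum('ij,kl,ik,jl->', W, W, P1, P2) - 1.0
    return rho, tau
```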
10.5 Bayesian estimates of the survival distribution
10.5.1 Predictive survival and hazard curves and predictive survival densities
The survival and hazard functions or the survival densities for a specific combination of covariates are estimated by the mean of the corresponding (posterior) predictive quantities. In practice, this is done analogously to Sections 8.4 and 9.4. However, due to the bivariate approach in this chapter, we have to distinguish between the quantities for the first margin (the onset time $U_1$ and the event time $T_1$) and those for the second margin (the onset time $U_2$ and the event time $T_2$).
For example, to get the Bayesian estimate of the predictive survival function of the event time $T_1$, given the covariates $x^t_{\text{new}}$ and $z^t_{\text{new}}$, we can use the relationship (8.17) while replacing the expression (8.16) by
$$ S_1(t_1 \mid \theta, x^t_{\text{new}}, z^t_{\text{new}}) = 1 - \sum_{j_1=-K_1}^{K_1} w^\varepsilon_{j_1,+}\, \Phi\Big( \log(t_1) - \beta' x^t_{\text{new}} - b' z^t_{\text{new}} \;\Big|\; \alpha^\varepsilon_1 + \tau^\varepsilon_1 \mu^\varepsilon_{1,j_1},\; (\sigma^\varepsilon_1 \tau^\varepsilon_1)^2 \Big). \tag{10.10} $$
To get the Bayesian estimate of the predictive survival density of the event time $T_1$, we replace the expression (8.18) by
$$ p_1(t_1 \mid \theta, x^t_{\text{new}}, z^t_{\text{new}}) = t_1^{-1} \sum_{j_1=-K_1}^{K_1} w^\varepsilon_{j_1,+}\, \varphi\Big( \log(t_1) - \beta' x^t_{\text{new}} - b' z^t_{\text{new}} \;\Big|\; \alpha^\varepsilon_1 + \tau^\varepsilon_1 \mu^\varepsilon_{1,j_1},\; (\sigma^\varepsilon_1 \tau^\varepsilon_1)^2 \Big). \tag{10.11} $$
The quantities for the event time $T_2$ in the second margin and for the onset times $U_1$ and $U_2$ are obtained analogously.
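A sketch of (10.10): the marginal predictive survival function of T1 needs only the row sums of the weight matrix; the linear predictor (here called `eta`) is assumed to be computed beforehand, and all names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def marginal_survival_T1(t1, W, mu1, sigma1, alpha1, tau1, eta):
    """Predictive S_1(t_1) of (10.10) for given parameter values.
    eta is the linear predictor beta' x_new + b' z_new."""
    w1 = W.sum(axis=1)                                     # w_{j1,+}
    z = np.log(t1) - eta
    cdf = norm.cdf(z, loc=alpha1 + tau1 * mu1, scale=sigma1 * tau1)
    return 1.0 - np.sum(w1 * cdf)
```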
10.5.2 Predictive error densities
The MCMC estimates of the predictive error densities are obtained in the same way as explained in Section 9.4.2. We only have to use a bivariate counterpart of the expression (9.13), i.e. for the event error density we use
$$ \hat{g}_\varepsilon(e_1, e_2) = \frac{1}{M} \sum_{m=1}^{M} \big(\tau^{\varepsilon,(m)}_1 \tau^{\varepsilon,(m)}_2\big)^{-1} \sum_{j_1=-K_1}^{K_1} \sum_{j_2=-K_2}^{K_2} w^{\varepsilon,(m)}_{j_1,j_2}\; \varphi_2\bigg( \Big( \frac{e_1 - \alpha^{\varepsilon,(m)}_1}{\tau^{\varepsilon,(m)}_1}, \frac{e_2 - \alpha^{\varepsilon,(m)}_2}{\tau^{\varepsilon,(m)}_2} \Big)' \,\bigg|\, \mu^\varepsilon_{(j_1,j_2)}, \Sigma^\varepsilon \bigg). \tag{10.12} $$

10.6 Example: Signal Tandmobiel® study – paired doubly-interval-censored data
In Section 9.7, we analyzed the time to caries of the permanent first molars based on the data from the Signal Tandmobiel® study using the cluster-specific AFT model. The results were compared to the earlier analysis of Leroy et al. (2005). In this section, we perform a similar analysis. However, for practical reasons (see the introduction to this chapter) it is only possible to analyze a pair of teeth. In our analysis, we concentrated on differences between the maxillary (upper) and mandibular (lower) teeth and analyzed separately the pair of right teeth (teeth 16 and 46) and the pair of left teeth (teeth 26 and 36). The results for both pairs were very similar, so we report only the results for the right teeth in this thesis. Since a (parametric) population-averaged AFT model was used by Leroy et al. (2005), the results presented in this section can be compared even more closely to their findings.
The analysis proceeded in a similar way as in Section 9.7, with the only changes related to the fact that we now analyze only two teeth. Specifically, the onset time $U_{i,l}$, $i = 1, \ldots, N$, $l = 1, 2$, refers to the age (in years) of the $i$th child at which the $l$th tooth ($l = 1 \equiv$ tooth 16, $l = 2 \equiv$ tooth 46) emerged. The failure time $V_{i,l}$ refers to the onset of caries and the event time $T_{i,l}$ to the time between emergence and the onset of caries. As explained in Section 9.7, left-censored emergence times were transformed into interval-censored ones based on the clinical eruption stage. Finally, as in Section 9.7, we subtracted 5 years from all observed times, i.e. $\log(U_{i,l} - 5)$ was used on the left-hand side of the model formula (10.1). Analogously to Section 9.7, we started the analysis with the Basic Model and, based on
the results for the Basic Model, we subsequently fitted its simplified version, referred to as the Final Model.
10.6.1 Basic Model
In the Basic Model we allowed for a different effect of the covariates on both emergence and caries experience for the maxillary and mandibular tooth, respectively. That is, in the AFT models (10.1) and (10.2) we used the following covariate vectors $x^u_{i,l}$ and $x^t_{i,l}$ for the emergence and caries parts of the model, respectively:
x^u_{i,l} = (gender_i, jaw_{i,l} * gender_i)',
x^t_{i,l} = (x̃^t_{i,l}, jaw_{i,l} * x̃^t_{i,l}),
where
x̃^t_{i,l} = (gender_i, statusD_{i,l}, statusF_{i,l}, statusM_{i,l}, brushing_i, sealants_{i,l}, plaquePF_{i,l}, plaqueT_{i,l}).
The covariate jaw is dichotomous (1 = maxilla, 0 = mandible) and distinguishes between the maxillary and the mandibular tooth. It replaces the covariates tooth26, tooth36, tooth46 used in Section 9.7.1. Note that in both the caries part and the emergence part of the model the main effect of jaw is expressed by the intercept terms $\alpha^\varepsilon$ and $\alpha^\zeta$, respectively. See Section 9.7.1 for the explanation of the remaining covariates.
10.6.2 Final Model
In the Final Model, we excluded all interaction terms with the covariate jaw, i.e. we assumed that the studied factors have the same effect on emergence and caries for both the maxillary and the mandibular tooth. Additionally, as in Section 9.7, we binarized the covariates plaquePF, plaqueT and statusD, statusF, statusM into new covariates plaque and status, respectively. Bayesian two-sided p-values and, for factors with more than two levels, simultaneous two-sided Bayesian p-values (see Section 4.6.2) were used to arrive at the Final Model.
10.6.3 Prior distribution
To model the bivariate densities $g_\zeta$ and $g_\varepsilon$ we used in both cases a grid of 31 × 31 ($K_1 = K_2 = 15$) knots with the distance between two consecutive knots in each margin equal to 0.3 and the basis standard deviations $\sigma^\varepsilon_1 = \sigma^\varepsilon_2 = \sigma^\zeta_1 = \sigma^\zeta_2 = 0.2$. The grid of knots is defined on a square $[-4.5, 4.5] \times [-4.5, 4.5]$, which covers the support of most standardized unimodal distributions (unimodality was checked after the analysis).
For the transformed mixture weights $A^\varepsilon$ and $A^\zeta$ we used the prior (10.8) with differences of the third order ($s = 3$). The smoothing parameters $\lambda^\varepsilon_1$, $\lambda^\varepsilon_2$, $\lambda^\zeta_1$, $\lambda^\zeta_2$ were all assigned dispersed Gamma(1, 0.005) priors. The same priors were also used for the scale parameters $\tau^\varepsilon_1$, $\tau^\varepsilon_2$, $\tau^\zeta_1$, $\tau^\zeta_2$. The intercept terms $\alpha^\varepsilon_1$, $\alpha^\varepsilon_2$, $\alpha^\zeta_1$, $\alpha^\zeta_2$ as well as the regression parameters contained in the vectors $\beta$ and $\delta$ were all assigned dispersed $N(0, 100)$ priors.
10.6.4 Results
For each model we ran 250 000 iterations with 1:3 thinning and kept the last 25 000 iterations for the inference. Sampling for each model took about 68 hours on a 3 GHz Pentium IV PC with 1 GB RAM.
Results for the Basic Model
Table 10.1 shows the posterior medians, (simultaneous) 95% equal-tail credible intervals and (simultaneous) Bayesian two-sided p-values for the effect of
each considered factor on emergence and caries experience, separately for the
maxillary and the mandibular tooth.
It is seen that the results for the mandibular and the maxillary tooth are very similar. Indeed, the interaction terms between jaw and the remaining factor variables were all non-significant at 5%: the p-values were > 0.5, > 0.5, > 0.5, 0.262, > 0.5 and 0.145 for the interaction with gender in the emergence and the caries parts of the model, and for the interactions with brushing, sealants, plaque and status, respectively.
Additionally, we computed the (simultaneous) Bayesian two-sided p-values
for the two contrasts justifying the simplification of the covariates plaque
and status for the Final Model, again separately for the mandibular and the
maxillary tooth. For the variable status, the contrast decayed vs. filled vs. missing due to caries gave p-values of 0.342 and 0.308 for the maxillary and the mandibular tooth, respectively. For the variable plaque, the contrast in pits and fissures vs. on total surface gave p-values of 0.262 and 0.301 for the maxillary and the mandibular tooth, respectively.
Table 10.1: Signal Tandmobiel® study, Basic Model. Posterior medians, 95% equal-tail credible regions (CR) and Bayesian two-sided p-values for each factor variable, separately for the maxillary tooth 16 and the mandibular tooth 46.

Effect                        Maxillary tooth 16                         Mandibular tooth 46
                              Post. median  95% CR          p            Post. median  95% CR          p
Emergence
  Gender: girl                −0.018   (−0.039, 0.003)    p = 0.094      −0.016   (−0.036, 0.005)    p = 0.142
Caries
  Gender: girl                −0.035   (−0.139, 0.073)    p = 0.534      −0.049   (−0.162, 0.063)    p = 0.403
  Status                                                  p < 0.001                                  p < 0.001
    decayed                   −0.449   (−0.704, −0.224)                  −0.379   (−0.641, −0.151)
    filled                    −0.627   (−0.844, −0.414)                  −0.375   (−0.588, −0.175)
    missing                   −0.470   (−1.377, 0.138)                   −0.726   (−1.398, −0.208)
  Brushing: daily              0.226   (0.086, 0.386)     p = 0.003       0.265   (0.097, 0.426)     p < 0.001
  Sealants: present            0.158   (0.028, 0.283)     p = 0.019       0.055   (−0.077, 0.180)    p = 0.401
  Plaque                                                  p = 0.014                                  p = 0.002
    in pits and fissures      −0.183   (−0.333, −0.031)                  −0.252   (−0.404, −0.107)
    on total surface          −0.389   (−0.819, −0.015)                  −0.468   (−0.997, −0.038)
Results for the Final Model
Results for the Final Model are given in Table 10.2. This table also contains the main effect of jaw, which is given by E(ζ2) − E(ζ1) and by E(ε2) − E(ε1) in the case of emergence and caries, respectively.
It is seen that the lower tooth 46 emerges slightly later than the upper tooth 16. On the other hand, emergence occurs slightly earlier for girls than for boys. However, neither the position of the tooth nor gender has a significant
Table 10.2: Signal Tandmobiel® study, Final Model. Posterior medians, 95% equal-tail credible regions (CR) and Bayesian two-sided p-values for each factor variable.

Effect                 Emergence                                  Caries
                       Post. median  95% CR           p           Post. median  95% CR           p
Jaw: lower              0.017   (0.003, 0.032)     p = 0.021       0.024   (−0.158, 0.218)    p = 0.816
Gender: girl           −0.017   (−0.033, −0.003)   p = 0.018      −0.044   (−0.120, 0.033)    p = 0.267
Status: dmf                                                       −0.482   (−0.576, −0.388)   p < 0.001
Brushing: daily                                                    0.249   (0.139, 0.369)     p < 0.001
Sealants: present                                                  0.110   (0.019, 0.195)     p = 0.022
Plaque: present                                                   −0.228   (−0.313, −0.141)   p < 0.001
Table 10.3: Signal Tandmobiel® study, Final Model. Posterior medians and 95% equal-tail credible regions (CR) for the mean, standard deviation, Pearson correlation and Kendall's tau of the error terms.

Emergence                                          Caries
Param.       Post. median   95% CR                 Param.       Post. median   95% CR
E(ζ1)        0.392          (0.379, 0.404)         E(ε1)        2.846          (2.645, 3.043)
E(ζ2)        0.409          (0.397, 0.421)         E(ε2)        2.870          (2.706, 3.040)
sd(ζ1)       0.170          (0.163, 0.178)         sd(ε1)       1.737          (1.631, 1.855)
sd(ζ2)       0.170          (0.164, 0.177)         sd(ε2)       1.812          (1.722, 1.918)
ρ_ζ          0.037          (0.030, 0.050)         ρ_ε          0.023          (0.018, 0.028)
τ_Kend^ζ     0.022          (0.016, 0.030)         τ_Kend^ε     0.011          (0.008, 0.013)
[Figure 10.2 here: four panels of scaled histograms (posterior densities), for the emergence part ρ_ζ and τ_Kend^ζ and for the caries part ρ_ε and τ_Kend^ε.]
Figure 10.2: Signal Tandmobiel® study, Final Model. Scaled histograms for sampled Pearson correlation and Kendall's tau between the error terms.
effect on the time to caries. The remaining factors do significantly influence the time to caries: daily brushing increases this time by a factor of exp(0.249) = 1.283 and presence of sealants by a factor of exp(0.110) = 1.116. The factor for presence of plaque is exp(−0.228) = 0.796, and when the adjacent deciduous second molar was not sound the factor is exp(−0.482) = 0.618.
It is seen that the results given in Table 10.2 are slightly different from the summary given in Table 9.1, which relates to the earlier joint analysis of all four permanent first molars using the cluster-specific (conditional) AFT model. In particular, the effect of the covariate status appears to be more pronounced when evaluated using the population-averaged (marginal) model. However, the conclusions concerning a beneficial effect of sealing and daily brushing and an adverse effect of unsound primary predecessors or plaque on the caries process on the permanent first molars are the same irrespective of the model used.
Further, Table 10.3 shows the mean and standard deviation of the error terms and also the residual association (after adjustment for the effect of covariates) between the maxillary and the mandibular tooth. For both the emergence and the caries processes, a very low posterior median for the Pearson correlation coefficient was found on the log-scale, and the same holds for Kendall's tau. Moreover, as seen in Figure 10.2, the whole posterior distribution of the correlation coefficients and Kendall's taus is concentrated in the neighborhood of zero.
Figures 10.3 and 10.4 show the estimates of the error densities gζ (ζ) and gε (ε)
and their margins and illustrate the smoothing nature of our approach. These
figures also reveal the low association between error terms for the upper and
lower tooth. For the interpretation of the figure, we must take into account
that about 75% of the caries times were right-censored and practically all
around 12 years of age, which is 5 to 6 years after emergence. This implies
that in fact each margin is identifiable from the data only up to approximately
the first quartile. The right tail of the density is extrapolated from the left tail
using the weights distributed according to the GMRF prior. It also implies
that the association might be underestimated, see, e.g. Bogaerts and Lesaffre
(2006).
Figure 10.5 shows the predictive survival and hazard functions for caries on
the upper tooth 16 of boys and ‘the best’, ‘the worst’ and two intermediate combinations of covariates. Corresponding curves for the lower tooth 46
or for girls are almost the same due to the non-significant effect of the covariates gender and jaw on the caries. For teeth that are not brushed daily
and are exposed to other risk factors, a high peak in the hazard function is
[Figure 10.3 here: contour plot of the bivariate density estimate over (ζ1, ζ2) together with the two marginal density estimates.]
Figure 10.3: Signal Tandmobiel® study, Final Model. Estimate of the density gζ(ζ1, ζ2) and the corresponding marginal densities gζ(ζ1) and gζ(ζ2) of the error terms in the emergence part of the model.
[Figure 10.4 here: contour plot of the bivariate density estimate over (ε1, ε2) together with the two marginal density estimates.]
Figure 10.4: Signal Tandmobiel® study, Final Model. Estimate of the density gε(ε1, ε2) and the corresponding marginal densities gε(ε1) and gε(ε2) of the error terms in the caries part of the model. The shaded part in the marginal densities extends to the first quartile.
[Figure 10.5 here: two panels, the caries-free (survival) function and the caries hazard function, both plotted against time since emergence (years, 0–6).]
Figure 10.5: Signal Tandmobiel® study, Final Model. Posterior predictive caries-free (survival) and caries hazard curves for tooth 16 of boys and the following combinations of covariates: solid and dashed lines for no plaque, present sealing, daily brushing and sound primary second molar (solid line) or dmf primary second molar (dashed line); dotted and dotted-dashed lines for present plaque, no sealing, not daily brushing and sound primary second molar (dotted line) or dmf primary second molar (dotted-dashed line).
observed already less than 1 year after emergence. A similar peak, shifted to the right and of much lower magnitude, is also seen for the other covariate combinations. The same peak, of approximately the same magnitude, has already been found when analyzing all four permanent first molars using the cluster-specific AFT model in Chapter 9 (see Figure 9.4) and can be explained by the fact that permanent first molars are most vulnerable to caries soon after they emerge, possibly because of the not yet fully developed enamel on their surfaces. However, when using the population-averaged model we do not see the second period of increased hazard for the 'worse' combinations of covariates that we have seen in Figure 9.4. This apparent difference between the results of the population-averaged and the cluster-specific model could be caused by a failure to compare like with like; see Lee and Nelder (2004) for a deeper discussion of this point.
10.7 Discussion
In this chapter, we have suggested a semiparametric method to analyze bivariate doubly-interval-censored data in the presence of covariates. The method
was applied to the analysis of a dental data set where all covariates were
categorical. However, continuous covariates would not cause any difficulties
and could have been used as well. Although the method was presented to deal with doubly-interval-censored data, it can also be used to analyze simple interval- or right-censored data.
Further, using the ideas outlined in Section 6.3.4, the method of this chapter
could theoretically be extended to handle not only bivariate data but also
data of an arbitrary dimension (i.e. ni > 2 for all i). However, the number of unknown parameters increases exponentially with the dimension and the estimation quickly becomes computationally intractable.
A disadvantage of the current method is that it requires balanced data, i.e. exactly two observations must be supplied for each cluster, and if one observation of the cluster is missing the whole cluster must be removed from the analysis. Missingness in one event time out of the pair could be handled using Bayesian data augmentation in the same way as it handles censoring. However, if the missingness is caused by a missing covariate value, Bayesian data augmentation would not help unless a measurement model is also set up for the covariates. With unbalanced data, however, the cluster-specific approach of Chapter 9 can be used.
Chapter 11
Overview and Further Research
In this thesis, we have developed several modifications of the accelerated failure time model for the analysis of multivariate (doubly-)interval-censored data while making only weak distributional assumptions. We now give an overview and list topics for future research.
11.1 Overview
Chapter 1 presents several data sets that motivate the developments of this thesis. The data sets are then used to illustrate the use of the presented methods in practical situations. Chapter 2 briefly explains several notions used in the area of survival data and introduces the notation used in the thesis.
An overview of the regression models for the analysis of survival data is given in Chapter 3. We described the Cox proportional hazards (PH) model and the accelerated failure time (AFT) model as the most popular models in this area. For reasons stated in Section 3.3 we chose the accelerated failure time model as the basis for all developments in this thesis.
In Chapter 4, we discuss the form of the likelihood in the case of (multivariate) (doubly-)interval-censored data and show several advantages of Bayesian inference compared to maximum-likelihood estimation in such situations. Further, we suggest using Markov chain Monte Carlo methodology as the means of Bayesian estimation.
The final chapter of the introductory part of the thesis, Chapter 5, gives an overview of existing methods for the analysis of interval-censored data and shows in detail a Bayesian analysis of the dental multivariate doubly-interval-censored data using a PH model with piecewise constant baseline hazard functions.
The main part of the thesis starts with Chapter 6, where we describe two slightly different classes of models for flexible modelling of continuous densities. Firstly, a classical normal mixture is introduced and secondly, we propose a penalized normal mixture motivated by penalized B-splines as a useful tool to model unknown densities. Both approaches are subsequently used in the AFT models to express either the error density or the density of the random effects.
Chapter 7 gives the AFT model for univariate interval-censored data where
the error distribution is specified as the penalized normal mixture. The inference is based on the maximum-likelihood paradigm. The model is further
extended to allow not only the mean response but also the scale of the response to depend on covariates.
The AFT models presented in the subsequent chapters can also handle multivariate (doubly-)interval-censored data. However, for the reasons discussed in Chapter 4 and in Section 7.8, we switch to Bayesian inference. Firstly, Chapter 8 gives the AFT model with normal random effects (cluster-specific model) and the distribution of the error term specified as the classical normal mixture.
Secondly, Chapter 9 presents the cluster-specific AFT model where the error distribution and, in the case of univariate random effects, also the distribution of the random effects is specified as the penalized normal mixture. In this chapter, we also explicitly show and illustrate the use of the proposed methods in the context of doubly-interval-censored data.
Finally, Chapter 10 gives the population-averaged AFT model for paired
(doubly-)interval-censored data where the error distribution is given by a bivariate penalized normal mixture.
11.2 Generalizations and improvements
In this section, we list several topics to generalize or improve the models
presented in this thesis.
Time-dependent covariates and joint modelling of survival data and longitudinal profiles
In many applications of survival analysis, it is of interest to evaluate the effect of factors that can evolve over time. The values of such factors (e.g., blood pressure, dose of medication, etc.) are typically determined at (prespecified) occasions and it is assumed that they remain constant (deterministic) until the next occasion. In the last decade, several models were developed for joint modelling of the evolution of time-varying factors (longitudinal data analysis) and the time-to-event; see Tsiatis and Davidian (2004) for an overview. That is, a stochastic component is included in the evolution of the time-dependent factors possibly influencing the survival time.
To include time-dependent covariates, both deterministic and stochastic, in the survival model, it is necessary to specify the dependence of the survival time on the covariates through a local characteristic like the hazard function. However, in all models presented in Part II of this thesis, the covariates modified a global characteristic of the survival time, i.e. the mean log-time. One possibility to extend the models of this thesis to handle time-dependent covariates would be to use the hazard specification (3.3) of the AFT model, shown below for reference, and a mixture model for the baseline hazard function ℏ0.
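Under the usual AFT parametrization with covariate vector x and regression parameters β (up to the sign convention fixed in (3.3) earlier in the thesis), the hazard-based form of the AFT model reads
\[
  \hbar(t \mid x)
    = \exp(-\beta' x)\,
      \hbar_0\bigl(t\,\exp(-\beta' x)\bigr),
  \qquad t > 0,
\]
so that the covariates act locally on the hazard rather than only on the mean log-time, which is what makes time-dependent covariates tractable in this formulation.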
Dependence of the scale parameters on covariates
In Section 7.1.2 we suggested extending the basic AFT model by allowing the scale parameter to depend on covariates. The same extension could quite easily be applied to both Bayesian penalized approaches of Chapters 9 and 10. However, in the case of the classical mixture (Chapter 8), a similar extension would be much more complicated due to the fact that the scale of the response is derived from an unknown number of estimated mixture-component variances.
Dependent censoring
The models in this thesis assumed all that the censoring mechanism is independent on the time-to-event (see Section 2.4). Generally, this does not
218
CHAPTER 11. OVERVIEW AND FURTHER RESEARCH
always have to be true. All Bayesian models (Chapters 8–10) could relatively
easily be extended to handle also dependent censoring. However, a reasonable
measurement model has to be specified for the censoring mechanism.
Goodness-of-fit
An important topic, not discussed in this thesis is the evaluation of goodnessof-fit. Indeed, in all models in this thesis, the distribution of the response is
specified in a flexible manner and there is less need to evaluate the distributional assumptions. Nevertheless, one should also check an appropriatness of
the AFT assumption with respect to the form in which the covariates modify
the distribution of the response. On few places in this thesis, and in the case
of categorical covariates, this was only graphically checked by comparing the
fitted survival curves with their nonparametric estimates.
Classical goodness-of-fit methods are based on residuals whose form is straightforward in the case of a linear regression with uncensored data. In the case of
right censored data, various forms of residuals are derived from the counting
process specification of the survival models, see, e.g., Therneau and Grambsch
(2000, Chapter 4). However, the definition of residuals for interval-censored
data is not straightforward and only recently (Topp and Gómez, 2004) a work
in this direction appeared in the literature.
Model selection
A general model selection is another important topic somewhat neglectful in
this thesis. In Chapter 7, we based the model selection on the Akaike’s information criterion whereas in Chapters 8–10 on the (simultaneous) Bayesian
p-values for model contrasts. In general, also in the Bayesian framework some
form of the information criterion could be used for the model selection. Recently, the most popular one seems to be the deviance information criterion
(Spiegelhalter et al., 2002).
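For completeness, writing D(θ) = −2 log L(θ) for the deviance, D̄ for its posterior mean and θ̄ for the posterior mean of the parameters, the criterion of Spiegelhalter et al. (2002) is
\[
  \mathrm{DIC} = \bar{D} + p_D,
  \qquad
  p_D = \bar{D} - D(\bar{\theta}),
\]
where p_D plays the role of an effective number of parameters and smaller values of DIC indicate a better trade-off between fit and complexity.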
The use of specifically developed optimizers
Due to the complexity of the likelihood, we have considered estimation through the method of penalized maximum likelihood only in Chapter 7. However, there are currently several convenient gateways to optimization software and services available on the Internet. For example, the Kestrel interface to the NEOS server (Czyzyk, Mesnier, and Moré, 1998; Ferris, Mesnier, and Moré, 2000) together with the modeling language for mathematical programming AMPL (Fourer, Gay, and Kernighan, 2003) makes it possible to optimize complicated functions subject to different types of constraints. These possibilities could be explored as promising alternatives to the full Bayesian approaches presented in Chapters 8–10.
11.3 The use of penalized mixtures in other application areas
Finally, we indicate how we intend to use the ideas of this thesis in future work.
11.3.1 Generalized linear mixed models with random effects having a flexible distribution
Firstly, we aim to develop a generalized linear mixed model (GLMM) with the random effects distribution specified as the penalized mixture. The proposed work has the following objectives.
Let Y_{i,l}, i = 1, ..., N, l = 1, ..., n_i, be discrete random variables for which the components of the vector Y_i = (Y_{i,1}, ..., Y_{i,n_i})' are possibly dependent. Typically, Y_i represents the outcomes of the ith subject at n_i different time points t_{i,1}, ..., t_{i,n_i} in a longitudinal study, or the outcomes of the n_i subjects forming the ith cluster in the case of clustered data. Further, let µ_{i,l} = E(Y_{i,l}). Using the GLMM, the expected outcome µ_{i,l} is expressed as
\[
  \mu_{i,l} = h^{-1}\bigl(x_{i,l}'\beta + z_{i,l}'b_i\bigr),
  \qquad i = 1,\dots,N,\ \ l = 1,\dots,n_i,
\]
where h is a known link function (e.g. log, logit, probit), β is the vector of unknown regression parameters (fixed effects), x_{i,l} the vector of covariates for the fixed effects, b_i the vector of random effects and z_{i,l} the vector of covariates for the random effects; see, e.g., Molenberghs and Verbeke (2005) for more details. We aim to concentrate mainly on longitudinal studies where usually z_{i,l} = (1, t_{i,l})' and b_i = (b_{i,1}, b_{i,2})'.
Classically, it is assumed that the random effects b_i, i = 1, ..., N, are i.i.d. following a (multivariate) normal distribution. However, it has been shown (see Molenberghs and Verbeke, 2005, Chapter 23) that an incorrect assumption of normality of the random effects may lead to biased estimates of the regression parameters β. Yet, due to the fact that the random effects b_i are latent, it is very difficult to check the normality assumption. That is why one strives for methods that are more flexible with respect to the distribution of the random effects. One possibility we wish to explore is to specify the distribution of the random effects as a penalized bivariate mixture (10.3); a small numerical illustration of the GLMM mean structure is sketched below.
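To fix ideas, a minimal sketch of this mean structure for a longitudinal binary outcome follows (Python; the logit link, covariate values and parameter values are illustrative assumptions, not quantities from the proposed work).

```python
import numpy as np

# Illustrative sketch (not from the thesis): the GLMM mean for subject i at
# time t, with a logit link, fixed effects beta, and a random intercept and
# slope b_i = (b_i1, b_i2)'.  All names and values below are hypothetical.
def glmm_mean(x_il, t_il, beta, b_i):
    """mu_{i,l} = h^{-1}(x'beta + z'b) with z = (1, t) and h = logit."""
    eta = x_il @ beta + b_i[0] + b_i[1] * t_il   # linear predictor
    return 1.0 / (1.0 + np.exp(-eta))            # inverse logit link

# Example: one subject, two covariates, observed at three time points.
beta = np.array([-0.5, 0.8])
x_il = np.array([1.0, 0.3])          # covariates for the fixed effects
b_i = np.array([0.2, -0.1])          # random intercept and slope
for t in (0.0, 1.0, 2.0):
    print(t, glmm_mean(x_il, t, beta, b_i))
```

The only change relative to the classical GLMM would then be the prior/distributional assumption on b_i, replaced by the penalized bivariate mixture.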
11.3.2 Spatial models with the intensity specified by the penalized mixture
Secondly, we would like to explore the possibilities of the penalized mixtures in the context of spatial models. The motivation is the following. In epidemiology, it is of interest to model the prevalence or incidence of a disease spatially in order to represent the true risk in an honest manner.
Let A denote the study area, R a region within A, and y = (y1, y2) the coordinates of a location in A. Generation of the disease cases can be formalized by considering an underlying point process described by a counting measure N on A, i.e. N(R) denotes the number of disease cases in R. Finally, let
\[
  \lambda(y) = \lim_{\|\Delta y\| \to 0} \frac{E\{N(\Delta y)\}}{\|\Delta y\|},
\]
where Δy is an infinitesimal region around y and ‖Δy‖ its area, be the intensity of the point process. Different approaches have been suggested in the literature to express λ(y), one of which uses the expression
\[
  \lambda(y) = \varrho\, g(y)\, f(y; \theta), \qquad (11.1)
\]
where ϱ denotes an overall region-wide rate, g(y) a known background function representing the reference population, and f(y; θ) a function of the spatial location and possibly other parameters and associated covariates as well (see Lawson et al., 1999). However, there exists no gold standard for the expression of f(y; θ). The main requirement for f(y; θ) is, however, that it varies smoothly across A.
To model smoothly the variation of the intensity λ(y) across the region of interest A, a penalized mixture could be used to express f(y; θ) as part of expression (11.1) as
\[
  f(y; \theta) = 1 + \sum_{k_1=-K_1}^{K_1} \sum_{k_2=-K_2}^{K_2}
    w_{k_1,k_2}\, \varphi_{k_1}(y_1)\, \varphi_{k_2}(y_2), \qquad (11.2)
\]
where the weights are, in contrast to the approaches used in this thesis, not constrained.
Further, it is of interest here to develop efficient procedures (a) to test the null hypothesis w_{-K1,-K2} = ⋯ = w_{K1,K2} = 0, corresponding to a constant ratio λ(y)/g(y), which is known as a standardized mortality rate, and (b) to develop a general procedure for model selection.
Further, to allow for the dependence of the intensity λ(y) on other (region-specific) covariates x(y), we would like to explore a generalization of the model (11.2) of the form
\[
  f(y; \theta, \beta) = h\bigl(x(y), \beta\bigr)
    + \sum_{k_1=1}^{K_1} \sum_{k_2=1}^{K_2}
      w_{k_1,k_2}\, \varphi_{k_1}(y_1)\, \varphi_{k_2}(y_2),
\]
where h is an unknown (nonlinear) function and β a vector of unknown regression parameters. A small sketch of evaluating the basic surface (11.2) is given below.
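To make (11.2) concrete, the following sketch evaluates the surface f(y; θ) with normal basis functions φ_k centred at a marginal grid of knots (Python; the number of knots, the basis standard deviation and the weights are made-up illustrations, not quantities from the proposed work).

```python
import numpy as np
from scipy.stats import norm

# Sketch of evaluating the unconstrained penalized-mixture surface
# f(y; theta) = 1 + sum_{k1} sum_{k2} w_{k1,k2} phi_{k1}(y1) phi_{k2}(y2)
# of equation (11.2).  Knots, basis standard deviation and weights below
# are hypothetical illustrations.
K1 = K2 = 5
knots = np.linspace(-2.0, 2.0, 2 * K1 + 1)    # knots mu_{-K1}, ..., mu_{K1}
sigma = 0.5                                    # basis standard deviation
w = np.zeros((2 * K1 + 1, 2 * K2 + 1))         # unconstrained weights
w[K1, K2] = 0.8                                # a single bump at the centre

def f(y1, y2):
    phi1 = norm.pdf(y1, loc=knots, scale=sigma)   # phi_{k1}(y1) for all k1
    phi2 = norm.pdf(y2, loc=knots, scale=sigma)   # phi_{k2}(y2) for all k2
    return 1.0 + phi1 @ w @ phi2

print(f(0.0, 0.0), f(2.0, -2.0))
```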
Appendix A
Technical details for the Maximum Likelihood Penalized AFT Model
This appendix provides the technical details for the practical computation of
the penalized maximum-likelihood estimate for the AFT model of Chapter 7.
Namely, we give more details concerning the optimization algorithm, provide
the formulas for computation of the first and second derivatives of the penalized log-likelihood needed to implement this algorithm and give the proof of
Proposition 7.1.
Notation introduced in Chapter 7 will be used throughout this appendix.
Additionally, the following notation is employed.
\[
\begin{aligned}
  e^L_i &= \tau_i^{-1}(y^L_i - \alpha - \beta'x_i), &
  e^U_i &= \tau_i^{-1}(y^U_i - \alpha - \beta'x_i), \\
  \tilde{e}^L_{i,j} &= \sigma_0^{-1}(e^L_i - \mu_j), &
  \tilde{e}^U_{i,j} &= \sigma_0^{-1}(e^U_i - \mu_j), \\
  \varphi^L_{i,j} &= \varphi(\tilde{e}^L_{i,j}), &
  \varphi^U_{i,j} &= \varphi(\tilde{e}^U_{i,j}), \\
  \bar{\varphi}^L_{i,j} &= \tilde{e}^L_{i,j}\,\varphi(\tilde{e}^L_{i,j}), &
  \bar{\varphi}^U_{i,j} &= \tilde{e}^U_{i,j}\,\varphi(\tilde{e}^U_{i,j}), \\
  \breve{\varphi}^L_{i,j} &= \bigl\{(\tilde{e}^L_{i,j})^2 - 1\bigr\}\varphi(\tilde{e}^L_{i,j}), &
  \breve{\varphi}^U_{i,j} &= \bigl\{(\tilde{e}^U_{i,j})^2 - 1\bigr\}\varphi(\tilde{e}^U_{i,j}), \\
  \Phi^L_{i,j} &= \Phi(\tilde{e}^L_{i,j}), &
  \Phi^U_{i,j} &= \Phi(\tilde{e}^U_{i,j}),
\end{aligned}
\]
for j = −K, ..., K and i = 1, ..., N, and the corresponding vectors of length 2K + 1
\[
\begin{aligned}
  \varphi^L_i &= (\varphi^L_{i,-K}, \dots, \varphi^L_{i,K})', &
  \varphi^U_i &= (\varphi^U_{i,-K}, \dots, \varphi^U_{i,K})', \\
  \bar{\varphi}^L_i &= (\bar{\varphi}^L_{i,-K}, \dots, \bar{\varphi}^L_{i,K})', &
  \bar{\varphi}^U_i &= (\bar{\varphi}^U_{i,-K}, \dots, \bar{\varphi}^U_{i,K})', \\
  \breve{\varphi}^L_i &= (\breve{\varphi}^L_{i,-K}, \dots, \breve{\varphi}^L_{i,K})', &
  \breve{\varphi}^U_i &= (\breve{\varphi}^U_{i,-K}, \dots, \breve{\varphi}^U_{i,K})', \\
  \Phi^L_i &= (\Phi^L_{i,-K}, \dots, \Phi^L_{i,K})', &
  \Phi^U_i &= (\Phi^U_{i,-K}, \dots, \Phi^U_{i,K})',
\end{aligned}
\]
for i = 1, ..., N.
We omit the superscripts 'L' and 'U' in the case of exactly observed event times (δ_i = 1), for which y^L_i = y^U_i = y_i. Finally, in all formulas we omit the Jacobian term t_i^{-1} (for exactly observed event times with t^L_i = t^U_i = t_i) resulting from the logarithmic transformation of the event times in the log-likelihood.
A.1 Optimization algorithm
To compute the penalized maximum-likelihood estimate we first maximize the penalized log-likelihood (7.7) with respect to θ̃ = (α, β', γ', a'_{-0})' under the constraints (7.4) and, upon convergence, we compute the second derivative matrix of ℓ_P with respect to θ = (α, β', γ', d')' to get the variance estimates.
The constrained optimization is conducted using the sequential quadratic programming (SQP) algorithm, see Han (1977); Fletcher (1987, Section 12.4). The idea of this algorithm is to iteratively maximize a slightly modified quadratic approximation of the objective function subject to a linear approximation of the constraints.
Let
\[
  c_1(\tilde{\theta}) = \sum_{j=-K}^{K} w_j \mu_j,
  \qquad
  c_2(\tilde{\theta}) = 1 - \sigma_0^2 - \sum_{j=-K}^{K} w_j \mu_j^2
  \qquad (\mathrm{A.1})
\]
be the constraint equations resulting from (7.4), and let
\[
  L(\tilde{\theta}, \xi_1, \xi_2)
    = \ell_P(\tilde{\theta}) + \xi_1 c_1(\tilde{\theta}) + \xi_2 c_2(\tilde{\theta})
\]
be the Lagrange function with the Lagrange multipliers ξ_1 and ξ_2 corresponding to the maximization problem max_{θ̃} ℓ_P(θ̃) subject to c_1(θ̃) = 0 and c_2(θ̃) = 0.
Let QP(θ̃, H) be the quadratic programming problem
\[
  \max_{\delta}\;
  \Bigl\{ \delta' \frac{\partial \ell_P}{\partial \tilde{\theta}}\bigl(\tilde{\theta}\bigr)
          + 0.5\, \delta' H \delta \Bigr\}
  \qquad (\mathrm{A.2})
\]
subject to
\[
  c_1(\tilde{\theta}) + \delta' \frac{\partial c_1}{\partial \tilde{\theta}}\bigl(\tilde{\theta}\bigr) = 0,
  \qquad
  c_2(\tilde{\theta}) + \delta' \frac{\partial c_2}{\partial \tilde{\theta}}\bigl(\tilde{\theta}\bigr) = 0,
  \qquad (\mathrm{A.3})
\]
where
\[
  H = H(\tilde{\theta}, \xi_1, \xi_2)
    = \frac{\partial^2 L}{\partial \tilde{\theta}\, \partial \tilde{\theta}'}
      \bigl(\tilde{\theta}, \xi_1, \xi_2\bigr).
  \qquad (\mathrm{A.4})
\]
Note that the objective function in (A.2) is the second-order Taylor approximation of ℓ_P(θ̃) around some fixed point θ̃_0 with δ = θ̃ − θ̃_0, an omitted constant term, and the matrix of second derivatives ∂²ℓ_P/∂θ̃∂θ̃' replaced by the θ̃-θ̃ block of the second derivative matrix of the Lagrange function L.
The SQP algorithm proceeds in the following steps.
Step 0. Give the initial estimate θ̃^(0) and the initial guesses ξ_1^(0), ξ_2^(0) for the Lagrange multipliers. Set H^(0) = H(θ̃^(0), ξ_1^(0), ξ_2^(0)).
In the sth iteration:
Step 1. Find the point δ^(s) which solves the quadratic program QP(θ̃^(s), H^(s));
Step 2. Set θ̃^(s+1) = θ̃^(s) + δ^(s). If θ̃^(s+1) does not lead to an increase of ℓ_P, use a step-halving procedure;
Step 3. Set ξ_1^(s+1) and ξ_2^(s+1) to the optimal Lagrange multipliers of the quadratic program QP(θ̃^(s), H^(s));
Step 4. Check convergence; if it is not reached, go to Step 1.
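As an aside, general-purpose SQP implementations can solve constrained problems of exactly this "maximize ℓ_P subject to c_1 = c_2 = 0" type. The sketch below (Python/SciPy; a toy stand-in, not the thesis implementation — the objective is a simple difference penalty rather than the actual penalized log-likelihood) illustrates the setup with SciPy's SLSQP, itself a sequential quadratic programming method.

```python
import numpy as np
from scipy.optimize import minimize

# Toy constrained maximization of the same structure as the problem above:
# weights w_j on knots mu_j, subject to zero mean and variance 1 - sigma0^2.
mu = np.linspace(-2.0, 2.0, 5)      # fixed knots mu_{-K}, ..., mu_K (K = 2)
sigma0 = 0.6                         # basis standard deviation (illustrative)

def weights(a):                      # logistic transform, a_0 fixed at 0
    ea = np.exp(np.insert(a, len(a) // 2, 0.0))
    return ea / ea.sum()

def neg_objective(a):                # stand-in "penalized log-likelihood" (negated)
    w = weights(a)
    return np.sum(np.diff(np.log(w), n=3) ** 2)

constraints = (
    {"type": "eq", "fun": lambda a: weights(a) @ mu},                      # c1 = 0
    {"type": "eq", "fun": lambda a: 1 - sigma0**2 - weights(a) @ mu**2},   # c2 = 0
)
res = minimize(neg_objective, x0=np.zeros(4), method="SLSQP",
               constraints=constraints)
print(res.x, weights(res.x))
```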
A.2 Individual log-likelihood contributions
\[
  \ell_i(\tilde{\theta}) =
  \begin{cases}
    \log\bigl(1 - w'\Phi^L_i\bigr), & \delta_i = 0,\\[2pt]
    -\log(\tau_i) + \log\bigl(w'\varphi_i\bigr), & \delta_i = 1,\\[2pt]
    \log\bigl(w'\Phi^U_i\bigr), & \delta_i = 2,\\[2pt]
    \log\bigl\{w'(\Phi^U_i - \Phi^L_i)\bigr\}, & \delta_i = 3,
  \end{cases}
  \qquad i = 1, \dots, N.
\]
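A direct transcription of these four cases into code looks as follows (an illustrative sketch only, not the thesis software; additive constants are omitted exactly as in the formulas above, and all argument names are hypothetical).

```python
import numpy as np
from scipy.stats import norm

# One observation's log-likelihood contribution for the penalized-mixture
# AFT model, following the censoring indicator delta_i used above
# (0 right-, 1 exactly observed, 2 left-, 3 interval-censored).
def loglik_i(yL, yU, delta, x, alpha, beta, tau, w, mu, sigma0):
    eL = (yL - alpha - x @ beta) / tau          # standardized lower residual
    eU = (yU - alpha - x @ beta) / tau          # standardized upper residual
    PhiL = norm.cdf((eL - mu) / sigma0)         # Phi^L_{i,j}, j = -K, ..., K
    PhiU = norm.cdf((eU - mu) / sigma0)         # Phi^U_{i,j}
    if delta == 0:                              # right-censored
        return np.log(1.0 - w @ PhiL)
    if delta == 1:                              # exactly observed (yL == yU)
        phi = norm.pdf((eL - mu) / sigma0)      # phi_{i,j}, constants omitted
        return -np.log(tau) + np.log(w @ phi)
    if delta == 2:                              # left-censored
        return np.log(w @ PhiU)
    return np.log(w @ (PhiU - PhiL))            # interval-censored
```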
A.3 First derivatives of the log-likelihood

A.3.1 With respect to the regression parameters and the intercept
\[
  \frac{\partial \ell}{\partial \alpha}
    = \sum_{i=1}^{N} (\tau_i \sigma_0)^{-1}\, w'\, db_i,
  \qquad
  \frac{\partial \ell}{\partial \beta_l}
    = \sum_{i=1}^{N} (\tau_i \sigma_0)^{-1}\, x_{i,l}\, w'\, db_i,
  \quad l = 1, \dots, m,
\]
where db_i is a vector of length 2K + 1 of the form
\[
  db_i =
  \begin{cases}
    (1 - w'\Phi^L_i)^{-1}\, \varphi^L_i, & \delta_i = 0,\\[2pt]
    (w'\varphi_i)^{-1}\, \bar{\varphi}_i, & \delta_i = 1,\\[2pt]
    -(w'\Phi^U_i)^{-1}\, \varphi^U_i, & \delta_i = 2,\\[2pt]
    \bigl\{w'(\Phi^U_i - \Phi^L_i)\bigr\}^{-1} (\varphi^L_i - \varphi^U_i), & \delta_i = 3,
  \end{cases}
  \qquad i = 1, \dots, N.
\]

A.3.2 With respect to the log-scale and the scale-regression parameters
Firstly, we consider the case when the scale parameter τ does not depend on covariates, i.e. log(τ) = γ_1. Then
\[
  \frac{\partial \ell}{\partial \gamma_1}
    = -\sum_{i=1}^{N} I[\delta_i = 1]
      + \sigma_0^{-1} \sum_{i=1}^{N} w'\, dl_i.
\]
Secondly, we consider the case when log(τ_i) = γ'z_i, where z_i = (z_{i,1}, ..., z_{i,m_s})'. Then
\[
  \frac{\partial \ell}{\partial \gamma_l}
    = -\sum_{i=1}^{N} I[\delta_i = 1]\, z_{i,l}
      + \sigma_0^{-1} \sum_{i=1}^{N} z_{i,l}\, w'\, dl_i,
  \qquad l = 1, \dots, m_s.
\]
In both formulas, dl_i is a vector of length 2K + 1 of the form
\[
  dl_i =
  \begin{cases}
    \bigl\{w'(1 - \Phi^L_i)\bigr\}^{-1} e^L_i\, \varphi^L_i, & \delta_i = 0,\\[2pt]
    (w'\varphi_i)^{-1}\, e_i\, \bar{\varphi}_i, & \delta_i = 1,\\[2pt]
    -(w'\Phi^U_i)^{-1}\, e^U_i\, \varphi^U_i, & \delta_i = 2,\\[2pt]
    \bigl\{w'(\Phi^U_i - \Phi^L_i)\bigr\}^{-1} (e^L_i \varphi^L_i - e^U_i \varphi^U_i), & \delta_i = 3,
  \end{cases}
  \qquad i = 1, \dots, N.
\]

A.3.3 With respect to the transformed mixture weights
Let a_{-0} be the vector of transformed mixture weights without the baseline coefficient, which is fixed to zero (without loss of generality a_0 = 0). Then
\[
  \frac{\partial \ell}{\partial a_{-0}}
    = \frac{\partial w}{\partial a_{-0}} \sum_{i=1}^{N} da_i,
\]
where da_i is a vector of length 2K + 1 of the form
\[
  da_i =
  \begin{cases}
    \bigl\{w'(1 - \Phi^L_i)\bigr\}^{-1} (1 - \Phi^L_i), & \delta_i = 0,\\[2pt]
    (w'\varphi_i)^{-1}\, \varphi_i, & \delta_i = 1,\\[2pt]
    (w'\Phi^U_i)^{-1}\, \Phi^U_i, & \delta_i = 2,\\[2pt]
    \bigl\{w'(\Phi^U_i - \Phi^L_i)\bigr\}^{-1} (\Phi^U_i - \Phi^L_i), & \delta_i = 3,
  \end{cases}
  \qquad i = 1, \dots, N,
\]
and ∂w/∂a_{-0} is a 2K × (2K + 1) matrix whose (j, k)th element equals ∂w_k/∂a_j, j = −K, ..., −1, 1, ..., K, k = −K, ..., K. Namely,
\[
  \frac{\partial w_j}{\partial a_j} = w_j (1 - w_j),
  \qquad j = -K, \dots, -1, 1, \dots, K,
\]
\[
  \frac{\partial w_k}{\partial a_j} = -w_j w_k,
  \qquad j = -K, \dots, -1, 1, \dots, K,\quad k = -K, \dots, K,\quad j \neq k.
\]
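These two formulas are just the Jacobian of the logistic (softmax) transform w_k = exp(a_k)/Σ_j exp(a_j) with a_0 fixed at zero; a quick numerical check of them (illustration only, with made-up values of a) is sketched below.

```python
import numpy as np

# Numerical check of dw_j/da_j = w_j(1 - w_j) and dw_k/da_j = -w_j w_k (k != j).
K = 2
a_m0 = np.array([0.3, -0.2, 0.5, 0.1])            # a_{-K},...,a_{-1},a_1,...,a_K

def w_of(a_m0):
    a = np.insert(a_m0, K, 0.0)                    # put a_0 = 0 in the middle
    ea = np.exp(a)
    return ea / ea.sum()

w = w_of(a_m0)
w_m0 = np.delete(w, K)                             # weights whose a_j is free
rows = [j for j in range(2 * K + 1) if j != K]     # column index of w_j, j != 0

J_analytic = -np.outer(w_m0, w)                    # -w_j w_k everywhere
for r, j in enumerate(rows):                       # add w_j on the (a_j, w_j) entry
    J_analytic[r, j] += w_m0[r]

eps = 1e-6
J_numeric = np.empty_like(J_analytic)
for r in range(2 * K):                             # central finite differences
    da = np.zeros(2 * K); da[r] = eps
    J_numeric[r] = (w_of(a_m0 + da) - w_of(a_m0 - da)) / (2 * eps)

print(np.max(np.abs(J_analytic - J_numeric)))      # ~1e-10
```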
A.4 Second derivatives of the log-likelihood

Let β̃ be the vector of regression parameters extended by the intercept, i.e. β̃ = (α, β')', and let x̃_i, i = 1, ..., N, be the covariate vectors extended by the intercept term, i.e. x̃_i = (1, x_i')'.
A.4.1 With respect to the extended regression parameters
\[
  \frac{\partial^2 \ell}{\partial \tilde{\beta}\, \partial \tilde{\beta}'}
    = \sum_{i=1}^{N} \bigl(ddbb_{i,1} - ddbb_{i,2}^2\bigr)\, \tilde{x}_i \tilde{x}_i',
\]
where ddbb_{i,1} and ddbb_{i,2} are scalars of the following form
\[
  ddbb_{i,1} =
  \begin{cases}
    (\tau_i \sigma_0)^{-2}\, \dfrac{w'\bar{\varphi}^L_i}{w'(1 - \Phi^L_i)}, & \delta_i = 0,\\[10pt]
    (\tau_i \sigma_0)^{-2}\, \dfrac{w'\breve{\varphi}_i}{w'\varphi_i}, & \delta_i = 1,\\[10pt]
    -(\tau_i \sigma_0)^{-2}\, \dfrac{w'\bar{\varphi}^U_i}{w'\Phi^U_i}, & \delta_i = 2,\\[10pt]
    (\tau_i \sigma_0)^{-2}\, \dfrac{w'(\bar{\varphi}^L_i - \bar{\varphi}^U_i)}{w'(\Phi^U_i - \Phi^L_i)}, & \delta_i = 3,
  \end{cases}
  \qquad i = 1, \dots, N,
\]
\[
  ddbb_{i,2} =
  \begin{cases}
    (\tau_i \sigma_0)^{-1}\, \dfrac{w'\varphi^L_i}{w'(1 - \Phi^L_i)}, & \delta_i = 0,\\[10pt]
    (\tau_i \sigma_0)^{-1}\, \dfrac{w'\bar{\varphi}_i}{w'\varphi_i}, & \delta_i = 1,\\[10pt]
    -(\tau_i \sigma_0)^{-1}\, \dfrac{w'\varphi^U_i}{w'\Phi^U_i}, & \delta_i = 2,\\[10pt]
    (\tau_i \sigma_0)^{-1}\, \dfrac{w'(\varphi^L_i - \varphi^U_i)}{w'(\Phi^U_i - \Phi^L_i)}, & \delta_i = 3,
  \end{cases}
  \qquad i = 1, \dots, N.
\]

A.4.2 Mixed with respect to the extended regression parameters and the log-scale or the scale-regression parameters
In the case when the scale parameter does not depend on covariates we have
\[
  \frac{\partial^2 \ell}{\partial \tilde{\beta}\, \partial \gamma_1}
    = \sum_{i=1}^{N} \bigl\{ddbl_{i,1} - ddbb_{i,2}\,(1 + ddbl_{i,2})\bigr\}\, \tilde{x}_i.
\]
In the case of log(τ_i) = γ'z_i we have
\[
  \frac{\partial^2 \ell}{\partial \tilde{\beta}\, \partial \gamma'}
    = \sum_{i=1}^{N} \bigl\{ddbl_{i,1} - ddbb_{i,2}\,(1 + ddbl_{i,2})\bigr\}\, \tilde{x}_i z_i'.
\]
In both formulas, ddbb_{i,2} is given in Section A.4.1, and ddbl_{i,1} and ddbl_{i,2} are scalars of the form
\[
  ddbl_{i,1} =
  \begin{cases}
    \dfrac{e^L_i}{\tau_i \sigma_0^2} \cdot \dfrac{w'\bar{\varphi}^L_i}{w'(1 - \Phi^L_i)}, & \delta_i = 0,\\[10pt]
    \dfrac{e_i}{\tau_i \sigma_0^2} \cdot \dfrac{w'\breve{\varphi}_i}{w'\varphi_i}, & \delta_i = 1,\\[10pt]
    -\dfrac{e^U_i}{\tau_i \sigma_0^2} \cdot \dfrac{w'\bar{\varphi}^U_i}{w'\Phi^U_i}, & \delta_i = 2,\\[10pt]
    \dfrac{1}{\tau_i \sigma_0^2} \cdot \dfrac{w'(e^L_i \bar{\varphi}^L_i - e^U_i \bar{\varphi}^U_i)}{w'(\Phi^U_i - \Phi^L_i)}, & \delta_i = 3,
  \end{cases}
  \qquad i = 1, \dots, N,
\]
\[
  ddbl_{i,2} =
  \begin{cases}
    \dfrac{e^L_i}{\sigma_0} \cdot \dfrac{w'\varphi^L_i}{w'(1 - \Phi^L_i)}, & \delta_i = 0,\\[10pt]
    \dfrac{e_i}{\sigma_0} \cdot \dfrac{w'\bar{\varphi}_i}{w'\varphi_i}, & \delta_i = 1,\\[10pt]
    -\dfrac{e^U_i}{\sigma_0} \cdot \dfrac{w'\varphi^U_i}{w'\Phi^U_i}, & \delta_i = 2,\\[10pt]
    \dfrac{1}{\sigma_0} \cdot \dfrac{w'(e^L_i \varphi^L_i - e^U_i \varphi^U_i)}{w'(\Phi^U_i - \Phi^L_i)}, & \delta_i = 3,
  \end{cases}
  \qquad i = 1, \dots, N.
\]

A.4.3 Mixed with respect to the extended regression parameters and the transformed mixture weights
\[
  \frac{\partial^2 \ell}{\partial \tilde{\beta}\, \partial a_{-0}'}
    = \sum_{i=1}^{N}
      \bigl\{ddba_i - (\tau_i \sigma_0)^{-1}\, (w'db_i)\, \tilde{x}_i\, da_i'\bigr\}
      \Bigl(\frac{\partial w}{\partial a_{-0}}\Bigr)',
\]
where ddba_i is a (m + 1) × (2K + 1) matrix of the form
\[
  ddba_i =
  \begin{cases}
    \bigl\{\tau_i \sigma_0\, w'(1 - \Phi^L_i)\bigr\}^{-1} \tilde{x}_i\, \varphi^{L\,\prime}_i, & \delta_i = 0,\\[4pt]
    (\tau_i \sigma_0\, w'\varphi_i)^{-1}\, \tilde{x}_i\, \bar{\varphi}_i', & \delta_i = 1,\\[4pt]
    -(\tau_i \sigma_0\, w'\Phi^U_i)^{-1}\, \tilde{x}_i\, \varphi^{U\,\prime}_i, & \delta_i = 2,\\[4pt]
    \bigl\{\tau_i \sigma_0\, w'(\Phi^U_i - \Phi^L_i)\bigr\}^{-1} \tilde{x}_i\, (\varphi^L_i - \varphi^U_i)', & \delta_i = 3,
  \end{cases}
  \qquad i = 1, \dots, N.
\]
Further, db_i is the vector of length 2K + 1 given in Section A.3.1. Finally, da_i and ∂w/∂a_{-0} are the vector of length 2K + 1 and the 2K × (2K + 1) matrix, respectively, given in Section A.3.3.

A.4.4 With respect to the log-scale or the scale-regression parameters
In the case when the scale parameter does not depend on covariates we have
\[
  \frac{\partial^2 \ell}{\partial \gamma_1^2}
    = \sum_{i=1}^{N} \bigl\{ddll_i - ddbl_{i,2}\,(1 + ddbl_{i,2})\bigr\}.
\]
In the case of log(τ_i) = γ'z_i we have
\[
  \frac{\partial^2 \ell}{\partial \gamma\, \partial \gamma'}
    = \sum_{i=1}^{N} \bigl\{ddll_i - ddbl_{i,2}\,(1 + ddbl_{i,2})\bigr\}\, z_i z_i'.
\]
In both formulas, ddbl_{i,2} is the scalar given in Section A.4.2 and ddll_i is a scalar given by the formula
\[
  ddll_i =
  \begin{cases}
    \Bigl(\dfrac{e^L_i}{\sigma_0}\Bigr)^{2} \cdot \dfrac{w'\bar{\varphi}^L_i}{w'(1 - \Phi^L_i)}, & \delta_i = 0,\\[10pt]
    \Bigl(\dfrac{e_i}{\sigma_0}\Bigr)^{2} \cdot \dfrac{w'\breve{\varphi}_i}{w'\varphi_i}, & \delta_i = 1,\\[10pt]
    -\Bigl(\dfrac{e^U_i}{\sigma_0}\Bigr)^{2} \cdot \dfrac{w'\bar{\varphi}^U_i}{w'\Phi^U_i}, & \delta_i = 2,\\[10pt]
    \sigma_0^{-2}\, \dfrac{w'\bigl\{(e^L_i)^2 \bar{\varphi}^L_i - (e^U_i)^2 \bar{\varphi}^U_i\bigr\}}{w'(\Phi^U_i - \Phi^L_i)}, & \delta_i = 3,
  \end{cases}
  \qquad i = 1, \dots, N.
\]

A.4.5 Mixed with respect to the log-scale or the scale-regression parameters and the transformed mixture weights
In the case when the scale parameter does not depend on covariates we have
\[
  \frac{\partial^2 \ell}{\partial \gamma_1\, \partial a_{-0}'}
    = \sum_{i=1}^{N}
      \bigl\{ddla_i - \sigma_0^{-1}\, (w'dl_i)\, da_i'\bigr\}
      \Bigl(\frac{\partial w}{\partial a_{-0}}\Bigr)'.
\]
In the case of log(τ_i) = γ'z_i we have
\[
  \frac{\partial^2 \ell}{\partial \gamma\, \partial a_{-0}'}
    = \sum_{i=1}^{N} z_i
      \bigl\{ddla_i - \sigma_0^{-1}\, (w'dl_i)\, da_i'\bigr\}
      \Bigl(\frac{\partial w}{\partial a_{-0}}\Bigr)'.
\]
In both formulas, da_i and ∂w/∂a_{-0} are the vector of length 2K + 1 and the 2K × (2K + 1) matrix, respectively, given in Section A.3.3, and ddla_i is a row vector of length 2K + 1 of the form
\[
  ddla_i =
  \begin{cases}
    \sigma_0^{-1}\, \dfrac{e^L_i}{w'(1 - \Phi^L_i)}\, \varphi^{L\,\prime}_i, & \delta_i = 0,\\[10pt]
    \sigma_0^{-1}\, \dfrac{e_i}{w'\varphi_i}\, \bar{\varphi}_i', & \delta_i = 1,\\[10pt]
    -\sigma_0^{-1}\, \dfrac{e^U_i}{w'\Phi^U_i}\, \varphi^{U\,\prime}_i, & \delta_i = 2,\\[10pt]
    \bigl\{\sigma_0\, w'(\Phi^U_i - \Phi^L_i)\bigr\}^{-1} (e^L_i \varphi^L_i - e^U_i \varphi^U_i)', & \delta_i = 3,
  \end{cases}
  \qquad i = 1, \dots, N.
\]

A.4.6 With respect to the transformed mixture weights
\[
  \frac{\partial^2 \ell}{\partial a_{-0}\, \partial a_{-0}'}
    = \sum_{i=1}^{N} ddaa_i
      - \frac{\partial w}{\partial a_{-0}}
        \Bigl(\sum_{i=1}^{N} da_i\, da_i'\Bigr)
        \Bigl(\frac{\partial w}{\partial a_{-0}}\Bigr)',
\]
where da_i and ∂w/∂a_{-0} are the vector of length 2K + 1 and the 2K × (2K + 1) matrix, respectively, given in Section A.3.3. Further, ddaa_i is a 2K × 2K matrix given by
\[
  ddaa_i =
  \begin{cases}
    \bigl\{w'(1 - \Phi^L_i)\bigr\}^{-1}
      \displaystyle\sum_{j=-K}^{K} (1 - \Phi^L_{i,j})\,
      \frac{\partial^2 w_j}{\partial a_{-0}\, \partial a_{-0}'}, & \delta_i = 0,\\[12pt]
    (w'\varphi_i)^{-1}
      \displaystyle\sum_{j=-K}^{K} \varphi_{i,j}\,
      \frac{\partial^2 w_j}{\partial a_{-0}\, \partial a_{-0}'}, & \delta_i = 1,\\[12pt]
    (w'\Phi^U_i)^{-1}
      \displaystyle\sum_{j=-K}^{K} \Phi^U_{i,j}\,
      \frac{\partial^2 w_j}{\partial a_{-0}\, \partial a_{-0}'}, & \delta_i = 2,\\[12pt]
    \bigl\{w'(\Phi^U_i - \Phi^L_i)\bigr\}^{-1}
      \displaystyle\sum_{j=-K}^{K} (\Phi^U_{i,j} - \Phi^L_{i,j})\,
      \frac{\partial^2 w_j}{\partial a_{-0}\, \partial a_{-0}'}, & \delta_i = 3,
  \end{cases}
  \qquad i = 1, \dots, N,
\]
where ∂²w_j/∂a_{-0}∂a_{-0}', j = −K, ..., K, is a 2K × 2K matrix with elements ddwaa^j_{k,l}, k, l = −K, ..., −1, 1, ..., K, given by
\[
\begin{aligned}
  ddwaa^j_{j,j} &= w_j (1 - w_j)(1 - 2w_j), & & j \neq 0,\\
  ddwaa^j_{k,k} &= -w_j w_k (1 - 2w_k), & & k \neq j,\\
  ddwaa^j_{j,k} &= -w_j w_k (1 - 2w_j), & & j \neq 0,\ k \neq j,\\
  ddwaa^j_{k,j} &= -w_j w_k (1 - 2w_j), & & j \neq 0,\ k \neq j,\\
  ddwaa^j_{k,l} &= 2\, w_j w_k w_l, & & k \neq j,\ l \neq j,\ k \neq l.
\end{aligned}
\]

A.5 Derivatives of the penalty term
The penalty term depends only on the a_{-0} part of θ̃, so we only have to provide the derivatives with respect to this parameter sub-vector:
\[
  \frac{\partial q}{\partial a_{-0}} = \lambda\, D'D a
    \ \text{ with the 0th element removed},
  \qquad
  \frac{\partial^2 q}{\partial a_{-0}\, \partial a_{-0}'} = \lambda\, D'D
    \ \text{ with the 0th row and the 0th column removed}.
\]
A.6 Derivatives of the constraints
To be able to compute the matrix H in (A.4), derivatives of the constraint functions (A.1) are needed. Since the constraints (A.1) depend only on the a_{-0} part of θ̃, we only have to provide the derivatives with respect to this parameter sub-vector. The first derivatives are computed as
\[
  \frac{\partial c_1}{\partial a_{-0}} = \frac{\partial w}{\partial a_{-0}}\, \mu,
  \qquad
  \frac{\partial c_2}{\partial a_{-0}} = \frac{\partial w}{\partial a_{-0}}\, \mu^2,
\]
where µ = (µ_{-K}, ..., µ_K)', µ² = (µ²_{-K}, ..., µ²_K)', and ∂w/∂a_{-0} is the 2K × (2K + 1) matrix given in Section A.3.3.
The second derivatives are given by
\[
  \frac{\partial^2 c_1}{\partial a_{-0}\, \partial a_{-0}'}
    = \sum_{j=-K}^{K} \mu_j\,
      \frac{\partial^2 w_j}{\partial a_{-0}\, \partial a_{-0}'},
  \qquad
  \frac{\partial^2 c_2}{\partial a_{-0}\, \partial a_{-0}'}
    = \sum_{j=-K}^{K} \mu_j^2\,
      \frac{\partial^2 w_j}{\partial a_{-0}\, \partial a_{-0}'},
\]
where ∂²w_j/∂a_{-0}∂a_{-0}', j = −K, ..., K, is the 2K × 2K matrix introduced in Section A.4.6.
A.7 Proof of Proposition 7.1
It is easily seen that the unconstrained minimizer of $\sum_{j=-K^2+3}^{K^2} (\Delta^3 a_j)^2$ is not unique and is given by an arbitrary quadratic function of the knots, i.e.
\[
  a^K_j = b^K_0 - b^K_2\, (\mu^K_j - b^K_1)^2,
  \qquad j = -K^2, \dots, K^2.
\]
Under the constraints (7.9), the minimizer becomes unique with b^K_1 = 0,
\[
  b^K_0 = -\log\Bigl[\sum_{j=-K^2}^{K^2} \exp\bigl\{-b^K_2\, (\mu^K_j)^2\bigr\}\Bigr],
\]
and b^K_2 being a solution of C_K(b) = 0, where
\[
  C_K(b)
    = \frac{\sum_{j=-K^2}^{K^2} (\mu^K_j)^2 \exp\bigl\{-b\, (\mu^K_j)^2\bigr\}}
           {\sum_{j=-K^2}^{K^2} \exp\bigl\{-b\, (\mu^K_j)^2\bigr\}}
      - (1 - \sigma_0^2).
\]
The function C_K(b) has the following properties:
• It is continuous on [0, ∞);
• For all b ∈ [0, ∞),
\[
  \frac{d}{db} C_K(b)
    = \Bigl[E\bigl\{(\mu^K)^2 \,\big|\, b^K_2 = b\bigr\}\Bigr]^2
      - E\bigl\{(\mu^K)^4 \,\big|\, b^K_2 = b\bigr\},
\]
and from the Hölder inequality (see, e.g., Billingsley, 1995, p. 80), d C_K(b)/db < 0, i.e. C_K(b) is decreasing on [0, ∞);
• C_K(0) = (K² + 1)/3 − (1 − σ_0²) > 0 for all K ≥ 2;
• lim_{b→∞} C_K(b) = −(1 − σ_0²) < 0.
So for all K ≥ 2 there exists exactly one root b^K_2 ∈ (0, ∞) of the equation C_K(b) = 0.
Let the function C(b) be defined as
\[
  C(b)
    = \frac{\int_{-\infty}^{\infty} s^2 \exp(-b s^2)\, ds}
           {\int_{-\infty}^{\infty} \exp(-b s^2)\, ds}
      - (1 - \sigma_0^2)
    = (2b)^{-1} - (1 - \sigma_0^2).
\]
The equation C(b) = 0 has the unique solution b_2 = {2(1 − σ_0²)}^{-1} ∈ (0.5, ∞).
It follows from the properties of the integral that for all b ∈ (0, ∞)
\[
  \lim_{K \to \infty} C_K(b) = C(b),
\]
and consequently, using the properties of C_K(b), also
\[
  \lim_{K \to \infty} b^K_2 = b_2.
\]
Let F_K(µ) be the cumulative distribution function of µ^K under b^K_2, i.e.
\[
  F_K(\mu)
    = \frac{\sum_{j=-K^2}^{\min(K\mu,\, K^2)} \exp\bigl\{-b^K_2\, (\mu^K_j)^2\bigr\}}
           {\sum_{j=-K^2}^{K^2} \exp\bigl\{-b^K_2\, (\mu^K_j)^2\bigr\}},
\]
and let Φ(µ | 0, 1 − σ_0²) be the cumulative distribution function of the normal distribution N(0, 1 − σ_0²), i.e.
\[
  \Phi(\mu \mid 0, 1 - \sigma_0^2)
    = \frac{\int_{-\infty}^{\mu} \exp(-b_2 s^2)\, ds}
           {\int_{-\infty}^{\infty} \exp(-b_2 s^2)\, ds}.
\]
It can now be shown that for all µ ∈ ℝ
\[
  \lim_{K \to \infty} F_K(\mu) = \Phi(\mu \mid 0, 1 - \sigma_0^2),
\]
i.e. the random variable µ^K under b^K_2 converges in distribution to a N(0, 1 − σ_0²) random variable.
Finally, for all y ∈ ℝ,
\[
  g_K(y) = \int_{-\infty}^{\infty} \varphi(y \mid \mu, \sigma_0^2)\, dF_K(\mu)
  \qquad \text{and} \qquad
  \varphi(y) = \int_{-\infty}^{\infty} \varphi(y \mid \mu, \sigma_0^2)\, d\Phi(\mu \mid 0, 1 - \sigma_0^2).
\]
The assertion of the proposition now follows from the fact that the function φ(y | µ, σ_0²) is, for all y ∈ ℝ, a bounded and continuous function of µ.
Appendix B
Simulation results
B.1 Simulation for the maximum likelihood penalized AFT model
Here we present selected results of the simulation study introduced in Section 7.5. Tables B.1 – B.6 show the results for the regression parameters. In the first third of the tables, the results based on the penalized AFT model are shown. The second third of the tables shows the results based on the parametric AFT model estimated using the maximum-likelihood method assuming the correct (true) error distribution. Finally, the last third of the tables shows the results obtained by the parametric AFT model estimated using the maximum-likelihood method while assuming a (in most cases incorrectly) normal error distribution.
Figures B.1 – B.3 show the fitted error distributions. For comparison purposes, we also plot the true error distribution.
Table B.1: Results for the regression parameter β1 = −0.800 related to the binary covariate. True error distribution: normal. Mean, standard deviation and MSE (×10⁻⁴) are calculated over the simulations.

                         Smoothed                       Assumed error: True            Assumed error: Normal
Censoring     N      β̂ (SD)            MSE          β̂ (SD)            MSE          β̂ (SD)            MSE
light RC    600    −0.792 (0.118)    138.93       −0.792 (0.114)    130.57       −0.792 (0.114)    130.57
            300    −0.812 (0.175)    307.92       −0.812 (0.168)    282.47       −0.812 (0.168)    282.47
            100    −0.787 (0.337)   1140.71       −0.778 (0.316)   1005.28       −0.778 (0.316)   1005.28
             50    −0.772 (0.478)   2290.06       −0.762 (0.401)   1623.70       −0.762 (0.401)   1623.70
light R+IC  600    −0.794 (0.119)    142.59       −0.794 (0.117)    136.20       −0.794 (0.117)    136.20
            300    −0.817 (0.176)    311.80       −0.812 (0.172)    295.97       −0.812 (0.172)    295.97
            100    −0.775 (0.351)   1235.08       −0.778 (0.323)   1045.28       −0.778 (0.323)   1045.28
             50    −0.792 (0.513)   2635.81       −0.769 (0.424)   1806.80       −0.769 (0.424)   1806.80
heavy RC    600    −0.780 (0.140)    200.25       −0.782 (0.135)    186.94       −0.782 (0.135)    186.94
            300    −0.798 (0.198)    391.34       −0.799 (0.198)    391.21       −0.799 (0.198)    391.21
            100    −0.789 (0.491)   2412.45       −0.793 (0.413)   1708.48       −0.793 (0.413)   1708.48
             50    −0.629 (0.622)   4156.99       −0.652 (0.490)   2616.76       −0.652 (0.490)   2616.76
heavy R+IC  600    −0.787 (0.150)   2280.00       −0.786 (0.141)    201.63       −0.786 (0.141)    201.63
            300    −0.811 (0.212)    449.06       −0.800 (0.206)    425.72       −0.800 (0.206)    425.72
            100    −0.837 (0.487)   2387.49       −0.799 (0.425)   1809.93       −0.799 (0.425)   1809.93
             50    −0.680 (0.717)   5278.77       −0.664 (0.514)   2826.40       −0.664 (0.514)   2826.40
Table B.2: Results for the regression parameter β1 = −0.800 related to the
binary covariate. True error distribution: extreme value. Mean, standard
deviation and MSE (×10−4 ) are calculated over the simulations.
Smoothed
N
600
300
100
50
β̂ (SD)
MSE
(×10−4 )
β̂ (SD)
light RC
(0.104)
(0.151)
(0.267)
(0.323)
MSE
(×10−4 )
−0.791
−0.827
−0.796
−0.888
(0.112)
(0.151)
(0.300)
(0.467)
600
300
100
50
−0.795
−0.826
−0.796
−0.869
(0.109)
(0.156)
(0.299)
(0.428)
118.64
250.12
896.48
1883.02
light R+IC
−0.786 (0.104)
109.58
−0.824 (0.151)
232.65
−0.782 (0.266)
712.12
−0.884 (0.324)
1117.59
600
300
100
50
−0.788
−0.851
−0.813
−0.891
(0.149)
(0.218)
(0.459)
(0.732)
222.96
499.81
2104.51
5437.10
−0.785
−0.853
−0.777
−0.921
−0.800
−0.855
−0.853
−0.872
(0.156)
(0.229)
(0.469)
(0.684)
600
300
100
50
126.39
235.96
901.70
2257.10
Assumed Error Distribution
True
242.35
552.22
2225.78
4725.93
−0.786
−0.824
−0.782
−0.883
110.56
233.70
714.10
1112.5
heavy RC
(0.140)
198.32
(0.200)
427.42
(0.360)
1301.67
(0.546)
3132.58
heavy R+IC
−0.786 (0.138)
191.75
−0.856 (0.203)
442.33
−0.786 (0.368)
1357.85
−0.936 (0.563)
3360.86
Normal
β̂ (SD)
MSE
(×10−4 )
−0.819
−0.864
−0.842
−0.912
(0.136)
(0.188)
(0.349)
(0.465)
187.24
393.02
1234.08
2290.56
−0.793
−0.833
−0.808
−0.885
(0.123)
(0.173)
(0.320)
(0.430)
152.84
309.89
1024.61
1919.72
−0.869
−0.935
−0.877
−0.973
(0.173)
(0.249)
(0.460)
(0.648)
348.10
802.15
2176.66
4493.43
−0.819
−0.880
−0.833
−0.944
(0.152)
(0.229)
(0.420)
(0.620)
233.59
589.85
1778.50
4048.72
Table B.3: Results for the regression parameter β1 = −0.800 related to the
binary covariate. True error distribution: normal mixture. Mean, standard
deviation and MSE (×10−4 ) are calculated over the simulations.
Smoothed
N
600
300
100
50
β̂ (SD)
−0.817
−0.817
−0.829
−0.845
(0.154)
(0.201)
(0.386)
(0.624)
600
300
100
50
−0.824
−0.834
−0.803
−0.871
(0.159)
(0.226)
(0.411)
(0.688)
600
300
100
50
600
300
100
50
MSE
(×10−4 )
β̂ (SD)
light RC
(0.142)
(0.187)
(0.319)
(0.502)
MSE
(×10−4 )
β̂ (SD)
MSE
(×10−4 )
(0.173)
(0.262)
(0.438)
(0.628)
319.10
713.18
1917.94
3963.00
258.53
523.20
1686.57
4781.04
light R+IC
−0.819 (0.150)
229.01
−0.819 (0.201)
408.76
−0.803 (0.323)
1043.16
−0.807 (0.567)
3209.78
−0.877
−0.880
−0.833
−0.867
(0.184)
(0.283)
(0.466)
(0.692)
399.17
865.88
2184.66
4839.10
−0.80 (0.213)
−0.752 (0.318)
−0.781 (0.558)
−0.723 (0.915)
451.77
1036.14
3114.28
8426.26
−0.797
−0.763
−0.780
−0.810
−0.743
−0.715
−0.716
−0.728
(0.194)
(0.310)
(0.520)
(0.788)
407.02
1033.05
2771.73
6257.40
(0.263)
(0.376)
(0.640)
(1.183)
700.18
1412.76
4118.05
14012.96
−0.821
−0.782
−0.779
−0.851
(0.230)
(0.366)
(0.609)
(0.981)
531.61
1345.87
3711.43
9655.56
−0.813
−0.814
−0.809
−0.819
203.08
350.64
1019.39
2526.53
Normal
−0.845
−0.850
−0.814
−0.836
−0.826
−0.789
−0.752
−0.846
239.76
408.04
1498.05
3912.84
Assumed Error Distribution
True
heavy RC
(0.187)
349.75
(0.285)
827.02
(0.485)
2357.78
(0.746)
5568.92
heavy R+IC
−0.808 (0.223)
497.55
−0.759 (0.342)
1189.15
−0.776 (0.548)
3012.35
−0.868 (0.969)
9440.90
Table B.4: Results for the regression parameter β2 = 0.400 related to the
continuous covariate. True error distribution: normal. Mean, standard
deviation and MSE (×10−4 ) are calculated over the simulations.
Smoothed
N
600
300
100
50
β̂ (SD)
0.406
0.399
0.380
0.398
(0.046)
(0.064)
(0.134)
(0.202)
MSE
(×10−4 )
21.58
41.34
182.30
407.62
Assumed Error Distribution
True
β̂ (SD)
MSE
(×10−4 )
Normal
β̂ (SD)
MSE
(×10−4 )
0.406
0.397
0.388
0.391
light RC
(0.046)
21.20
(0.059)
34.48
(0.121)
147.19
(0.176)
311.05
0.406
0.397
0.388
0.391
(0.046)
(0.059)
(0.121)
(0.176)
21.20
34.48
147.19
311.05
0.406
0.397
0.391
0.398
(0.048)
(0.062)
(0.121)
(0.184)
23.10
38.69
147.62
338.59
600
300
100
50
0.407
0.397
0.389
0.402
(0.049)
(0.063)
(0.133)
(0.215)
24.60
39.78
178.22
461.64
0.406
0.397
0.391
0.398
light R+IC
(0.048)
23.10
(0.062)
38.69
(0.121)
147.62
(0.184)
338.59
600
300
100
50
0.404
0.398
0.385
0.400
(0.051)
(0.070)
(0.173)
(0.264)
26.57
48.90
299.81
697.82
0.405
0.402
0.392
0.407
heavy RC
(0.050)
25.05
(0.068)
46.37
(0.140)
197.65
(0.214)
460.12
0.405
0.402
0.392
0.407
(0.050)
(0.068)
(0.140)
(0.214)
25.05
46.37
197.65
460.12
31.72
75.08
275.28
997.94
heavy R+IC
0.406 (0.054)
29.01
0.403 (0.074)
54.82
0.399 (0.142)
200.59
0.424 (0.244)
600.63
0.406
0.403
0.399
0.424
(0.054)
(0.074)
(0.142)
(0.244)
29.01
54.82
200.59
600.63
600
300
100
50
0.408
0.403
0.404
0.438
(0.056)
(0.087)
(0.166)
(0.314)
Table B.5: Results for the regression parameter β2 = 0.400 related to the continuous covariate. True error distribution: extreme value. Mean, standard
deviation and MSE (×10−4 ) are calculated over the simulations.
Smoothed
N
600
300
100
50
β̂ (SD)
0.402
0.415
0.414
0.428
(0.040)
(0.061)
(0.101)
(0.188)
MSE
(×10−4 )
15.96
39.21
104.29
361.29
Assumed Error Distribution
True
β̂ (SD)
MSE
(×10−4 )
Normal
β̂ (SD)
MSE
(×10−4 )
0.400
0.413
0.408
0.436
light RC
(0.039)
15.33
(0.057)
33.84
(0.093)
86.87
(0.158)
260.77
0.420
0.432
0.415
0.438
(0.048)
(0.076)
(0.113)
(0.186)
27.40
68.03
129.37
359.47
0.404
0.417
0.410
0.429
(0.045)
(0.067)
(0.105)
(0.174)
20.06
48.18
111.89
311.89
600
300
100
50
0.403
0.416
0.416
0.433
(0.041)
(0.059)
(0.101)
(0.182)
17.17
37.93
103.67
343.47
0.400
0.412
0.409
0.436
light R+IC
(0.039)
15.53
(0.056)
32.98
(0.093)
87.23
(0.160)
268.39
600
300
100
50
0.407
0.427
0.389
0.454
(0.061)
(0.086)
(0.155)
(0.294)
38.05
82.19
241.04
895.33
0.403
0.420
0.398
0.441
heavy RC
(0.058)
34.09
(0.077)
63.28
(0.138)
190.95
(0.229)
540.61
0.453
0.463
0.431
0.464
(0.073)
(0.098)
(0.155)
(0.256)
80.63
135.13
248.61
698.09
38.47
80.49
271.55
736.90
heavy R+IC
0.403 (0.059)
34.94
0.420 (0.077)
63.71
0.405 (0.143)
203.32
0.452 (0.241)
607.49
0.426
0.440
0.425
0.461
(0.066)
(0.087)
(0.151)
(0.250)
50.42
92.29
234.82
662.22
600
300
100
50
0.413
0.432
0.419
0.445
(0.061)
(0.084)
(0.164)
(0.268)
Table B.6: Results for the regression parameter β2 = 0.400 related to the
continuous covariate. True error distribution: normal mixture. Mean,
standard deviation and MSE (×10−4 ) are calculated over the simulations.
Smoothed
N
600
300
100
50
β̂ (SD)
0.405
0.401
0.386
0.361
(0.051)
(0.075)
(0.154)
(0.274)
MSE
(×10−4 )
26.07
56.31
239.62
763.56
Assumed Error Distribution
True
β̂ (SD)
MSE
(×10−4 )
Normal
β̂ (SD)
MSE
(×10−4 )
0.403
0.400
0.386
0.358
light RC
(0.050)
24.79
(0.072)
51.28
(0.125)
158.23
(0.250)
640.84
0.412
0.418
0.397
0.369
(0.068)
(0.090)
(0.176)
(0.282)
48.18
84.56
311.23
806.32
0.424
0.432
0.417
0.390
(0.076)
(0.098)
(0.196)
(0.316)
62.83
105.56
386.41
997.41
600
300
100
50
0.408
0.408
0.403
0.376
(0.059)
(0.079)
(0.183)
(0.313)
35.94
62.88
336.76
983.42
0.407
0.401
0.397
0.391
light R+IC
(0.056)
31.74
(0.071)
50.22
(0.152)
230.94
(0.306)
935.42
600
300
100
50
0.400
0.392
0.367
0.315
(0.078)
(0.110)
(0.201)
(0.409)
60.87
121.55
414.59
1747.62
0.396
0.404
0.380
0.332
heavy RC
(0.069)
48.17
(0.092)
85.67
(0.172)
301.03
(0.347)
1253.25
0.368
0.373
0.363
0.327
(0.081)
(0.100)
(0.206)
(0.331)
74.92
106.46
437.47
1148.82
84.33
117.03
924.86
2296.79
heavy R+IC
0.402 (0.084)
69.92
0.408 (0.095)
90.05
0.427 (0.249)
628.79
0.392 (0.441)
1941.29
0.401
0.405
0.418
0.381
(0.096)
(0.113)
(0.267)
(0.429)
92.04
128.28
713.71
1843.27
600
300
100
50
0.410
0.418
0.434
0.385
(0.091)
(0.107)
(0.302)
(0.479)
[Figure B.1 here: 4 × 4 panels of fitted standardized error densities, for N = 600, 300, 100, 50 and the censoring schemes light RC, light R+IC, heavy RC, heavy R+IC.]
Figure B.1: Results for the standardized error distribution. True error distribution: normal. Solid line: average fitted density, grey lines: 95% pointwise confidence band, dashed line: true error density.
[Figure B.2 here: 4 × 4 panels of fitted standardized error densities, for N = 600, 300, 100, 50 and the censoring schemes light RC, light R+IC, heavy RC, heavy R+IC.]
Figure B.2: Results for the standardized error distribution. True error distribution: extreme value. Solid line: average fitted density, grey lines: 95% pointwise confidence band, dashed line: true error density.
[Figure B.3 here: 4 × 4 panels of fitted standardized error densities, for N = 600, 300, 100, 50 and the censoring schemes light RC, light R+IC, heavy RC, heavy R+IC.]
Figure B.3: Results for the standardized error distribution. True error distribution: normal mixture. Solid line: average fitted density, grey lines: 95% pointwise confidence band, dashed line: true error density.
B.2 Simulation for the Bayesian normal mixture cluster-specific AFT model
In this section we give the results of the simulation study introduced in Section 8.6. Tables B.7 and B.8 show the results for the regression parameters. Further, Tables B.9 – B.11 give the results related to the covariance matrix D of the random effects. In the first third (or half) of the tables, the results based on the Bayesian normal mixture AFT model are shown. The second third (half) of the tables shows the results based on the Bayesian AFT model with an assumed normal error distribution, and finally the last third of Tables B.7 and B.8 shows the results obtained using the parametric AFT model with an assumed normal distribution, no random effects included, and estimated using maximum likelihood.
Figures B.4 and B.5 give the fitted standardized (in the case of the Cauchy and Student t2 distributions only centered) error distribution compared to the true density. Figures B.6 and B.7 show the fitted hazard function for the combination of covariates z_{i,l} = 0 and x_{i,l} = 8.13 (median value). In each case a comparison between the Bayesian normal mixture and the Bayesian model with an (incorrectly) specified normal error distribution is given. The same comparison, but with respect to the fitted survivor functions, is given in Figures B.8 and B.9.
Table B.7: Results for the mean of the covariate random effect γ = −0.800
related to the binary covariate. Mean, standard deviation and MSE (×10−4 )
are calculated over the simulations.
Estimation method
Bayesian normal
Bayesian mixture
N, ni
γ̂ (SD)
MSE
(×10−4 )
100, 10
50, 5
−0.798 (0.069)
−0.813 (0.155)
47.01
240.43
100, 10
50, 5
−0.8100 (0.103)
−0.766 (0.224)
107.16
512.72
100, 10
50, 5
−0.793 (0.100)
−0.778 (0.218)
99.91
479.02
100, 10
50, 5
−0.797 (0.069)
−0.815 (0.137)
47.39
191.09
100, 10
50, 5
−0.804 (0.051)
−0.787 (0.097)
26.60
95.70
γ̂ (SD)
MSE
(×10−4 )
True error = normal
−0.798 (0.069) 48.17
−0.811 (0.149) 222.78
True error = Cauchy t1
−0.736 (0.139) 234.52
−0.719 (0.255) 716.39
True error = Student t2
−0.761 (0.104) 123.72
−0.759 (0.196) 401.28
ML, no random effects
γ̂ (SD)
MSE
(×10−4 )
−0.798 (0.078)
−0.812 (0.153)
60.45
235.67
−0.738 (0.142)
−0.721 (0.253)
238.91
703.97
−0.7600 (0.108)
−0.761 (0.200)
132.14
415.46
True error = extreme value
−0.80 (0.075)
56.62
−0.802 (0.082)
−0.811 (0.138) 192.4
−0.809 (0.142)
True error = normal mixture
−0.926 (0.144) 366.82
−0.923 (0.148)
−0.869 (0.291) 894.99
−0.863 (0.283)
66.90
202.26
369.41
840.46
Table B.8: Results for the regression parameter β = 0.400 related to the continuous covariate. Mean, standard deviation and MSE (×10−4 ) are calculated
over the simulations.
Estimation method
Bayesian normal
Bayesian mixture
β̂ (SD)
MSE
(×10−4 )
100, 10
50, 5
0.402 (0.027)
0.397 (0.051)
7.28
26.31
100, 10
50, 5
0.392 (0.036)
0.412 (0.071)
13.23
52.54
100, 10
50, 5
0.394 (0.033)
0.393 (0.076)
11.51
58.15
100, 10
50, 5
0.404 (0.021)
0.393 (0.042)
100, 10
50, 5
0.400 (0.019)
0.394 (0.042)
N, ni
ML, no random effects
β̂ (SD)
MSE
(×10−4 )
0.402 (0.030)
0.399 (0.059)
9.01
34.67
0.357 (0.057)
0.378 (0.109)
50.73
124.59
0.378 (0.041)
0.382 (0.084)
21.62
72.93
4.34
17.92
True error = extreme value
0.403 (0.023)
5.41
0.402 (0.026)
0.395 (0.045) 20.89
0.395 (0.051)
6.76
25.87
3.46
17.65
True error = normal mixture
0.448 (0.048) 45.62
0.450 (0.052)
0.432 (0.076)
68.1
0.444 (0.104)
52.93
127.26
β̂ (SD)
MSE
(×10−4 )
True error = normal
0.402 (0.027)
7.28
0.397 (0.051)
25.6
True error = Cauchy t1
0.361 (0.051) 41.84
0.383 (0.081) 68.17
True error = Student t2
0.379 (0.038) 19.02
0.386 (0.069) 49.62
Table B.9: Results for the standard deviation of the random intercept
sd(bi,1 ) = 0.500. Mean, standard deviation and MSE (×10−4 ) are calculated
over the simulations.
N, ni
Estimation method
Bayesian mixture
Bayesian normal
b
b
sd(bi,1 ) (SD)
MSE
sd(bi,1 ) (SD)
MSE
100, 10
50, 5
0.476 (0.069)
0.321 (0.154)
True error = normal
52.98
0.476 (0.068)
559.95
0.324 (0.156)
52.52
551.48
100, 10
50, 5
True error = Cauchy t1
0.381 (0.120)
284.36
0.188 (0.121)
0.118 (0.060) 1492.37 0.086 (0.013)
1117.45
1718.19
100, 10
50, 5
True error = Student t2
0.452 (0.094)
111.52
0.418 (0.106)
0.160 (0.128) 1321.32 0.125 (0.086)
179.32
1480.14
100, 10
50, 5
True error = extreme value
0.489 (0.061)
38.41
0.495 (0.069)
0.343 (0.144)
453.41
0.305 (0.156)
48.37
625.15
100, 10
50, 5
True error = normal mixture
0.493 (0.047)
22.33
0.428 (0.176)
360.99
0.446 (0.093)
115.32
0.105 (0.048) 1583.26
Table B.10: Results for the standard deviation of the covariate random effect sd(bi,2) = 0.100. Mean, standard deviation and MSE (×10−4) are calculated over the simulations.

N, ni      Bayesian mixture             Bayesian normal
           ŝd(bi,2) (SD)     MSE         ŝd(bi,2) (SD)     MSE

True error = normal
100, 10    0.125 (0.040)     22.13      0.125 (0.040)     22.30
50, 5      0.152 (0.059)     61.67      0.153 (0.059)     62.53

True error = Cauchy t1
100, 10    0.156 (0.058)     64.95      0.124 (0.054)     34.43
50, 5      0.093 (0.017)      3.36      0.083 (0.008)      3.60

True error = Student t2
100, 10    0.135 (0.031)     22.03      0.142 (0.033)     28.49
50, 5      0.105 (0.039)     15.61      0.097 (0.029)      8.30

True error = extreme value
100, 10    0.109 (0.027)      8.22      0.112 (0.028)      9.16
50, 5      0.151 (0.059)     60.72      0.142 (0.052)     44.71

True error = normal mixture
100, 10    0.094 (0.029)      8.51      0.174 (0.062)     93.20
50, 5      0.139 (0.043)     33.42      0.090 (0.020)      5.24
Table B.11: Results for the random effects correlation corr(bi,1, bi,2) = 0.400. Mean, standard deviation and MSE (×10−4) are calculated over the simulations.

N, ni      Bayesian mixture               Bayesian normal
           ĉorr(bi,1, bi,2) (SD)  MSE      ĉorr(bi,1, bi,2) (SD)  MSE

True error = normal
100, 10    0.391 (0.457)    2086.96      0.395 (0.459)    2108.60
50, 5      0.293 (0.343)    1292.59      0.290 (0.343)    1299.25

True error = Cauchy t1
100, 10    0.380 (0.372)    1385.34      0.210 (0.226)     873.41
50, 5      0.061 (0.090)    1232.64      0.014 (0.032)    1502.97

True error = Student t2
100, 10    0.266 (0.434)    2066.67      0.240 (0.423)    2045.53
50, 5      0.100 (0.197)    1284.85      0.070 (0.118)    1230.40

True error = extreme value
100, 10    0.388 (0.428)    1831.69      0.244 (0.469)    2442.23
50, 5      0.327 (0.409)    1722.62      0.249 (0.352)    1466.79

True error = normal mixture
100, 10    0.376 (0.401)    1617.48      0.228 (0.415)    2019.53
50, 5      0.307 (0.433)    1958.50      0.027 (0.051)    1413.30
[Figure B.4: six panels (normal, Cauchy t1 and Student t2 true error, each for N = 100, ni = 10 and N = 50, ni = 5).]
Figure B.4: Results for the standardized error density, estimated using the Bayesian mixture model. Solid line: average fitted standardized density, grey lines: 95% pointwise confidence band, dashed line: true standardized error density.
[Figure B.5: four panels (extreme value and normal mixture true error, each for N = 100, ni = 10 and N = 50, ni = 5).]
Figure B.5: Results for the standardized error density, estimated using the Bayesian mixture model. Solid line: average fitted standardized density, grey lines: 95% pointwise confidence band, dashed line: true standardized error density.
[Figure B.6: panels for normal, Cauchy t1 and Student t2 true error, each for (N, ni) = (100, 10) and (50, 5), shown separately for the Bayesian mixture and the Bayesian normal model.]
Figure B.6: Results for the hazard function, estimated using the Bayesian mixture model (left part) and the Bayesian normal model (right part). Each row shows the results for different true error densities. Solid line: average fitted hazard, grey lines: 95% pointwise confidence band, dashed line: true hazard function.
[Figure B.7: panels for extreme value and normal mixture true error, each for (N, ni) = (100, 10) and (50, 5), shown separately for the Bayesian mixture and the Bayesian normal model.]
Figure B.7: Results for the hazard function, estimated using the Bayesian mixture model (left part) and the Bayesian normal model (right part). Each row shows the results for different true error densities. Solid line: average fitted hazard, grey lines: 95% pointwise confidence band, dashed line: true hazard function.
[Figure B.8: panels for normal, Cauchy t1 and Student t2 true error, each for (N, ni) = (100, 10) and (50, 5), shown separately for the Bayesian mixture and the Bayesian normal model.]
Figure B.8: Results for the survivor function, estimated using the Bayesian mixture model (left part) and the Bayesian normal model (right part). Each row shows the results for different true error densities. Solid line: average fitted survivor function, grey lines: 95% pointwise confidence band, dashed line: true survivor function.
[Figure B.9: panels for extreme value and normal mixture true error, each for (N, ni) = (100, 10) and (50, 5), shown separately for the Bayesian mixture and the Bayesian normal model.]
Figure B.9: Results for the survivor function, estimated using the Bayesian mixture model (left part) and the Bayesian normal model (right part). Each row shows the results for different true error densities. Solid line: average fitted survivor function, grey lines: 95% pointwise confidence band, dashed line: true survivor function.
B.3 Simulation for the Bayesian penalized mixture cluster-specific AFT model
This section presents selected results of the simulation study introduced in
Section 9.6. Tables B.12 and B.13 show the results for the regression parameters. Tables B.14 and B.15 give the results for the variance components of
the model.
Figures B.10 and B.11 show the fitted survivor densities for the onset part of the model for a combination of covariates x^u_{i,l,1} = 0.5 (median value) and x^u_{i,l,2} = 1. Figures B.12 and B.13 give the fitted survivor densities for the event part of the model for a combination of covariates x^t_{i,l,1} = 0.5 (median value) and x^t_{i,l,2} = 1. Corresponding fitted survivor functions are given in Figures B.14 – B.17.
Table B.12: Results for the regression parameters from the onset part of the model. Mean, standard deviation and MSE (×10−4) over the simulations.

τ^d/τ^ζ =        δ1 = 0.200                   δ2 = −0.100
τ^b/τ^ε          δ̂1 (SD)          MSE         δ̂2 (SD)           MSE

Scenario I (error ∼ normal mixture, random effect ∼ extreme value)
5                0.199 (0.007)     0.56        −0.101 (0.004)     0.17
3                0.201 (0.008)     0.68        −0.100 (0.005)     0.20
2                0.198 (0.011)     1.30        −0.100 (0.006)     0.37
1                0.199 (0.014)     1.84        −0.100 (0.009)     0.76
1/2              0.200 (0.018)     3.14        −0.100 (0.010)     0.92
1/3              0.201 (0.019)     3.74        −0.101 (0.010)     1.02
1/5              0.198 (0.019)     3.51        −0.100 (0.010)     0.95

Scenario II (error ∼ extreme value, random effect ∼ normal mixture)
5                0.200 (0.010)     0.93        −0.101 (0.005)     0.30
3                0.202 (0.015)     2.38        −0.101 (0.008)     0.72
2                0.200 (0.019)     3.44        −0.099 (0.011)     1.27
1                0.196 (0.029)     8.73        −0.099 (0.019)     3.45
1/2              0.194 (0.038)    14.46        −0.097 (0.025)     6.30
1/3              0.201 (0.041)    16.73        −0.099 (0.024)     5.90
1/5              0.203 (0.043)    18.12        −0.100 (0.020)     4.12
Table B.13: Results for the regression parameters from the event part of the model. Mean, standard deviation and MSE (×10−4) over the simulations.

τ^d/τ^ζ =        β1 = 0.300                   β2 = −0.150
τ^b/τ^ε          β̂1 (SD)          MSE         β̂2 (SD)           MSE

Scenario I (error ∼ normal mixture, random effect ∼ extreme value)
5                0.302 (0.014)     2.12        −0.149 (0.008)     0.64
3                0.301 (0.032)     9.99        −0.149 (0.021)     4.47
2                0.298 (0.056)    30.55        −0.150 (0.034)    11.75
1                0.304 (0.054)    29.04        −0.148 (0.028)     7.55
1/2              0.301 (0.043)    18.07        −0.147 (0.031)     9.66
1/3              0.311 (0.058)    34.67        −0.150 (0.035)    11.88
1/5              0.299 (0.050)    25.11        −0.151 (0.031)     9.68

Scenario II (error ∼ extreme value, random effect ∼ normal mixture)
5                0.298 (0.031)     9.40        −0.148 (0.016)     2.74
3                0.291 (0.040)    16.44        −0.152 (0.022)     4.99
2                0.306 (0.065)    42.02        −0.146 (0.036)    13.01
1                0.299 (0.103)   105.54        −0.149 (0.057)    32.60
1/2              0.304 (0.121)   144.59        −0.151 (0.070)    48.40
1/3              0.296 (0.126)   157.36        −0.146 (0.071)    50.06
1/5              0.308 (0.112)   125.51        −0.142 (0.065)    42.10
Table B.14: Results for the scale parameters from the onset part of the model. Mean, standard deviation and MSE (×10−4) over the simulations.

τ^d/τ^ζ =      True τ^d    τ̂^d (SD)          MSE        True τ^ζ    τ̂^ζ (SD)          MSE
τ^b/τ^ε

Scenario I (error ∼ normal mixture, random effect ∼ extreme value)
5              0.310       0.341 (0.035)     21.20      0.062       0.062 (0.002)      0.04
3              0.300       0.324 (0.037)     19.13      0.100       0.100 (0.002)      0.04
2              0.283       0.283 (0.031)      9.31      0.141       0.141 (0.003)      0.08
1              0.224       0.219 (0.024)      5.85      0.224       0.223 (0.006)      0.39
1/2            0.141       0.143 (0.018)      3.24      0.283       0.283 (0.006)      0.31
1/3            0.100       0.103 (0.035)     12.23      0.300       0.301 (0.012)      1.35
1/5            0.062       0.110 (0.097)    116.93      0.310       0.325 (0.034)     13.74

Scenario II (error ∼ extreme value, random effect ∼ normal mixture)
5              0.310       0.311 (0.009)      0.86      0.062       0.061 (0.003)      0.11
3              0.300       0.318 (0.112)    128.44      0.100       0.116 (0.099)     99.25
2              0.283       0.299 (0.141)    198.36      0.141       0.159 (0.126)    159.21
1              0.224       0.218 (0.011)      1.54      0.224       0.224 (0.012)      1.35
1/2            0.141       0.132 (0.021)      5.23      0.283       0.285 (0.013)      1.70
1/3            0.100       0.065 (0.037)     25.76      0.300       0.304 (0.013)      1.83
1/5            0.062       0.040 (0.030)     13.87      0.310       0.314 (0.015)      2.34
Table B.15: Results for the scale parameters from the event part of the model. Mean, standard deviation and MSE (×10−4) over the simulations.

τ^d/τ^ζ =      True τ^b    τ̂^b (SD)          MSE        True τ^ε    τ̂^ε (SD)          MSE
τ^b/τ^ε

Scenario I (error ∼ normal mixture, random effect ∼ extreme value)
5              0.981       0.980 (0.393)   1532.34      0.196       0.202 (0.005)      0.60
3              0.949       0.987 (0.517)   2660.34      0.316       0.417 (0.160)    356.10
2              0.894       0.827 (0.065)     87.10      0.447       0.663 (0.217)    932.57
1              0.707       0.647 (0.046)     57.34      0.707       0.741 (0.090)     91.97
1/2            0.447       0.428 (0.039)     18.27      0.894       0.901 (0.017)      3.47
1/3            0.316       0.307 (0.037)     14.42      0.949       0.954 (0.018)      3.57
1/5            0.196       0.180 (0.037)     15.93      0.981       0.984 (0.018)      3.49

Scenario II (error ∼ extreme value, random effect ∼ normal mixture)
5              0.981       0.971 (0.030)      9.67      0.196       0.202 (0.012)      1.73
3              0.949       0.941 (0.040)     16.72      0.316       0.325 (0.054)     29.79
2              0.894       0.884 (0.049)     24.54      0.447       0.532 (0.237)    626.75
1              0.707       0.671 (0.040)     28.78      0.707       0.886 (0.273)   1056.64
1/2            0.447       0.394 (0.092)    111.54      0.894       1.160 (0.230)   1228.67
1/3            0.316       0.079 (0.115)    695.18      0.949       1.286 (0.214)   1589.69
1/5            0.196       0.024 (0.026)    302.36      0.981       1.345 (0.235)   1873.27
[Figure B.10: seven panels, one for each ratio τ^b/τ^ε = 5, 3, 2, 1, 1/2, 1/3, 1/5.]
Figure B.10: Results for the survivor density of the onset time, for the combination of covariates x^u_{i,l} = (0.5, 1)′, scenario I (error ∼ normal mixture, random effect ∼ extreme value). Solid line: average fitted survivor density, grey lines: 95% pointwise confidence band, dashed line: true survivor density.
[Figure B.11: seven panels, one for each ratio τ^b/τ^ε = 5, 3, 2, 1, 1/2, 1/3, 1/5.]
Figure B.11: Results for the survivor density of the onset time, for the combination of covariates x^u_{i,l} = (0.5, 1)′, scenario II (error ∼ extreme value, random effect ∼ normal mixture). Solid line: average fitted survivor density, grey lines: 95% pointwise confidence band, dashed line: true survivor density.
[Figure B.12: seven panels, one for each ratio τ^b/τ^ε = 5, 3, 2, 1, 1/2, 1/3, 1/5.]
Figure B.12: Results for the survivor density of the event time, for the combination of covariates x^t_{i,l} = (0.5, 1)′, scenario I (error ∼ normal mixture, random effect ∼ extreme value). Solid line: average fitted survivor density, grey lines: 95% pointwise confidence band, dashed line: true survivor density.
[Figure B.13: seven panels, one for each ratio τ^b/τ^ε = 5, 3, 2, 1, 1/2, 1/3, 1/5.]
Figure B.13: Results for the survivor density of the event time, for the combination of covariates x^t_{i,l} = (0.5, 1)′, scenario II (error ∼ extreme value, random effect ∼ normal mixture). Solid line: average fitted survivor density, grey lines: 95% pointwise confidence band, dashed line: true survivor density.
[Figure B.14: seven panels, one for each ratio τ^b/τ^ε = 5, 3, 2, 1, 1/2, 1/3, 1/5.]
Figure B.14: Results for the survivor function of the onset time, for the combination of covariates x^u_{i,l} = (0.5, 1)′, scenario I (error ∼ normal mixture, random effect ∼ extreme value). Solid line: average fitted survivor function, grey lines: 95% pointwise confidence band, dashed line: true survivor function.
[Figure B.15: seven panels, one for each ratio τ^b/τ^ε = 5, 3, 2, 1, 1/2, 1/3, 1/5.]
Figure B.15: Results for the survivor function of the onset time, for the combination of covariates x^u_{i,l} = (0.5, 1)′, scenario II (error ∼ extreme value, random effect ∼ normal mixture). Solid line: average fitted survivor function, grey lines: 95% pointwise confidence band, dashed line: true survivor function.
[Figure B.16: seven panels, one for each ratio τ^b/τ^ε = 5, 3, 2, 1, 1/2, 1/3, 1/5.]
Figure B.16: Results for the survivor function of the event time, for the combination of covariates x^t_{i,l} = (0.5, 1)′, scenario I (error ∼ normal mixture, random effect ∼ extreme value). Solid line: average fitted survivor function, grey lines: 95% pointwise confidence band, dashed line: true survivor function.
[Figure B.17: seven panels, one for each ratio τ^b/τ^ε = 5, 3, 2, 1, 1/2, 1/3, 1/5.]
Figure B.17: Results for the survivor function of the event time, for the combination of covariates x^t_{i,l} = (0.5, 1)′, scenario II (error ∼ extreme value, random effect ∼ normal mixture). Solid line: average fitted survivor function, grey lines: 95% pointwise confidence band, dashed line: true survivor function.
Appendix C
Software
For all methodologies described in Part II of the thesis, software in the form of the R (R Development Core Team, 2005) packages smoothSurv and bayesSurv has been written. Both packages can be downloaded, together with extensive manuals and a description of how to perform the analyses shown in this thesis, from the Comprehensive R Archive Network at http://www.R-project.org. To reduce the computational time, all time-consuming computations are performed in compiled C++ code. In this appendix, we only briefly list the most important functions from both packages.
C.1 Package smoothSurv
This package implements the methods for the penalized maximum-likelihood AFT model described in Chapter 7 and involves, among others, the following functions (a schematic usage example is given after the list):
smoothSurvReg fits the AFT model (7.1) with the error density (7.2) using
the method of penalized maximum-likelihood. It also allows for the
scale regression (7.6);
plot.smoothSurvReg computes and plots the fitted error density (7.2);
survfit.smoothSurvReg computes and plots the fitted survival function (7.13)
for a specified combination of covariates;
fdensity computes and plots the fitted survival density (7.14) for a specified
combination of covariates;
hazard computes and plots the fitted hazard function (7.15) for a specified
combination of covariates;
estimTdiff estimates the expected survival time for a specified combination of covariates, or the expected value of the difference between the survival times for two specified combinations of covariates, based on the AFT model fitted using the function smoothSurvReg.
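To indicate how these functions are typically combined, a minimal sketch of a smoothSurv session follows. Only the function names are taken from the list above; the data frame teeth, its variables left, right and girl, and the argument names used in the calls (type = "interval2", cov, cov1, cov2) are illustrative assumptions only, so the package manual should be consulted for the exact interface.

    ## minimal sketch with an assumed data set `teeth' containing
    ## interval-censored times (left, right) and a binary covariate `girl';
    ## argument names other than `formula' and `data' are assumptions
    library(survival)      # provides Surv()
    library(smoothSurv)

    ## penalized maximum-likelihood AFT model (7.1) with smoothed error density (7.2)
    fit <- smoothSurvReg(Surv(left, right, type = "interval2") ~ girl, data = teeth)

    plot(fit)                            # fitted error density (7.2)
    survfit(fit, cov = 1)                # fitted survival function (7.13), girl = 1
    fdensity(fit, cov = 1)               # fitted survival density (7.14)
    hazard(fit, cov = 1)                 # fitted hazard function (7.15)
    estimTdiff(fit, cov1 = 1, cov2 = 0)  # expected difference of the survival times

Each post-processing function takes the fitted object returned by smoothSurvReg together with a specification of the covariate combination of interest.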
C.2 Package bayesSurv
This package implements the Bayesian methods described in Chapters 8 –
10.
For the Bayesian normal mixture cluster-specific AFT model of Chapter 8, the core functions include (a schematic call is sketched after the list):
bayessurvreg1 runs the MCMC simulation for the AFT model (8.1) with
the error density (8.2) and normally distributed (multivariate) random
effects;
bayesDensity computes the estimate of the predictive error densities (8.20)
and (8.21);
predictive computes the MCMC estimate of the predictive survival, density
or hazard function for a specified combination of covariates based on
the formulas (8.16), (8.18) and (8.19).
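As an illustration, a heavily abbreviated call might look as follows. The data frame teeth, the cluster indicator idnr, the MCMC sample sizes and the argument names used here (random, nsimul, dir) are assumptions made only for this sketch; the real functions take many more arguments (priors, initial values, storage options) that are described in the package manual.

    ## sketch only: assumed data set and argument names
    library(bayesSurv)

    ## cluster-specific AFT model (8.1) with normal-mixture error (8.2)
    ## and a normally distributed random intercept
    fit1 <- bayessurvreg1(Surv(left, right, type = "interval2") ~ girl + cluster(idnr),
                          random = ~ 1,
                          data   = teeth,
                          nsimul = list(niter = 50000, nthin = 10, nburn = 25000),
                          dir    = "chainMixAFT")   # directory storing the sampled chains

    ## predictive error density, formulas (8.20) and (8.21)
    bayesDensity(dir = "chainMixAFT")

    ## predictive(...) would then compute the predictive survival, density or
    ## hazard curves (8.16), (8.18), (8.19) for a chosen covariate combination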
For the Bayesian penalized mixture cluster-specific and population-averaged AFT models of Chapters 9 and 10, the core functions include (a schematic call again follows the list):
bayessurvreg2 runs the MCMC simulation for the cluster-specific AFT model
(9.1), (9.2) with the error densities specified by (9.3) and normally distributed (multivariate) random effects (Model M );
bayessurvreg3 runs the MCMC simulation for the cluster-specific AFT model
(9.1), (9.2) with the error densities specified by (9.3) and univariate
random effects whose distribution is specified by (9.3) (Model U );
bayesBisurvreg runs the MCMC simulation for the population-averaged AFT
model (10.1), (10.2) with the error densities specified by (10.3);
bayesGspline computes the estimate of the predictive density of the factors
whose distribution was specified as the penalized normal mixture (9.3)
or (10.3). The function is based on formulas (9.13) and (10.12);
marginal.bayesGspline computes the estimates of the predictive marginal
densities of the factors whose distribution was specified as the bivariate
penalized normal mixture (10.3);
predictive2 computes the MCMC estimate of the predictive survival, density
or hazard function for a specified combination of covariates based on
the formulas (9.10), (9.11) or (10.10), (10.11).
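The penalized-mixture functions are used in the same spirit; the sketch below again relies on an assumed data set and assumed argument names and is not a substitute for the package documentation.

    ## sketch only: assumed data set and argument names; prior and
    ## random-effect specifications are omitted here
    ## cluster-specific AFT model (9.1), (9.2) with penalized normal mixture
    ## error (9.3) and normally distributed random effects (Model M)
    fit2 <- bayessurvreg2(Surv(left, right, type = "interval2") ~ girl + cluster(idnr),
                          data   = teeth,
                          nsimul = list(niter = 50000, nthin = 10, nburn = 25000),
                          dir    = "chainGsplineAFT")

    ## predictive density of the error term, formula (9.13)
    bayesGspline(dir = "chainGsplineAFT")

    ## predictive2(...) computes the predictive survival, density or hazard curves
    ## based on (9.10), (9.11) or (10.10), (10.11); marginal.bayesGspline(...) gives
    ## the marginal densities for the bivariate penalized mixture (10.3)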
Bibliography
Aalen, O. O. (1994). Effects of frailty in survival analysis. Statistical
Methods in Medical Research, 3, 227–243.
Abrahamowicz, M., Ciampi, A., and Ramsay, J. O. (1992). Nonparametric density estimation for censored survival data: regression-spline approach. The Canadian Journal of Statistics, 20, 171–185.
Akaike, H. (1974). A new look at the statistical model identification. IEEE
Transactions on Automatic Control, AC-19, 716–723.
Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications
to Bayesian nonparametric problems. The Annals of Statistics, 2, 1152–
1174.
Arjas, E. and Gasbarra, D. (1994). Nonparametric Bayesian inference
from right censored survival data, using the Gibbs sampler. Statistica
Sinica, 4, 505–524.
Bacchetti, P. (1990). Estimating the incubation period of AIDS by comparing population infection and diagnosis patterns. Journal of the American
Statistical Association, 85, 1002–1008.
Bacchetti, P. and Jewell, N. P. (1991). Nonparametric estimation of
the incubation period of AIDS based on a prevalent cohort with unknown
infection times. Biometrics, 47, 947–960.
Barkan, S. E., Melnick, S. L., Preston-Martin, S., Weber, K.,
Kalish, L. A., Miotti, P., Young, M., Greenblatt, R., Sacks,
H., and Feldman, J. (1998). The Women’s Interagency HIV Study.
Epidemiology, 9, 117–125.
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995). Bayesian
computation and stochastic systems (with Discussion). Statistical Science,
10, 3–66.
Betensky, R. A., Lindsey, J. C., Ryan, L. M., and Wand, M. P.
(1999). Local EM estimation of the hazard function for interval-censored
data. Biometrics, 55, 238–245.
Betensky, R. A., Lindsey, J. C., Ryan, L. M., and Wand, M. P.
(2002). A local likelihood proportional hazards model for interval censored
data. Statistics in Medicine, 21, 263–275.
Betensky, R. A., Rabinowitz, D., and Tsiatis, A. A. (2001). Computationally simple accelerated failure time regression for interval censored
data. Biometrika, 88, 703–711.
Billingsley, P. (1995). Probability and Measure. John Wiley & Sons, New
York, Third edition. ISBN 0-471-00710-2.
Bogaerts, K. and Lesaffre, E. (2004). A new, fast algorithm to find the
regions of possible support for bivariate interval-censored data. Journal of
Computational and Graphical Statistics, 13, 330–340.
Bogaerts, K. and Lesaffre, E. (2006). Estimating Kendall’s tau for
bivariate interval censored data with a smooth estimate of the density.
Submitted.
Breslow, N. E. (1974). Covariance analysis of censored survival data.
Biometrics, 30, 89–99.
Brooks, S. P., Giudici, P., and Roberts, G. O. (2003). Efficient construction of reversible jump Markov chain Monte Carlo proposal distributions (with Discussion). Journal of the Royal Statistical Society, Series B,
65, 3–55.
Buckley, J. and James, I. (1979). Linear regression with censored data.
Biometrika, 66, 429–436.
Cai, T. and Betensky, R. A. (2003). Hazard regression for interval-censored data with penalized spline. Biometrics, 59, 570–579.
Calle, M. L. and Gómez, G. (2005). A semiparametric hierarchical
method for a regression model with an interval-censored covariate. Australian and New Zealand Journal of Statistics, 47, 351–364.
Carlin, B. P. and Louis, T. A. (2000). Bayes and Empirical Bayes Methods for Data Analysis. Chapman & Hall/CRC, Boca Raton, Second edition.
ISBN 1-58488-170-4.
Carvalho, J. C., Ekstrand, K. R., and Thylstrup, A. (1989). Dental
plaque and caries on occlusal surfaces of first permanent molars in relation
to stage of eruption. Journal of Dental Research, 68, 773–779.
Chen, M.-H., Shao, Q.-M., and Ibrahim, J. G. (2000). Monte Carlo
Methods in Bayesian Computation. Springer-Verlag, New York. ISBN
0-387-98935-8.
Christensen, R. and Johnson, W. (1988). Modelling accelerated failure
time with a Dirichlet process. Biometrika, 75, 693–704.
Clahsen, P. C., van de Velde, C. J., Julien, J. P., Floiras, J. L., Delozier, T., Mignolet, F. Y., and Sahmoud, T. M. (1996). Improved
local control and disease-free survival after perioperative chemotherapy
for early-stage breast cancer. A European Organization for Research and
Treatment of Cancer Breast Cancer Cooperative Group Study. Journal of
Clinical Oncology, 14, 745–753.
Cox, D. R. (1972). Regression models and life-tables (with Discussion).
Journal of the Royal Statistical Society, Series B, 34, 187–220.
Cox, D. R. (1975). Partial likelihood. Biometrika, 62, 269–276.
Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics. Chapman
& Hall, London. ISBN 0-412-16160-5.
Czyzyk, J., Mesnier, M. P., and Moré, J. J. (1998). The NEOS server.
IEEE Journal on Computational Science and Engineering, 5, 68–75.
Dalal, S. R. and Hall, W. J. (1983). Approximating priors by mixtures
of natural conjugate priors. Journal of the Royal Statistical Society, Series
B, 45, 278–286.
de Boor, C. (1978). A Practical Guide to Splines. Springer, New York.
ISBN 0-387-90356-9.
De Gruttola, V. and Lagakos, S. W. (1989). Analysis of doubly-censored survival data, with application to AIDS. Biometrics, 45, 1–11.
Dellaportas, P. and Papageorgiou, I. (2006). Multivariate mixtures of
normals with unknown number of components. Statistics and Computing,
16, 57–68.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum
likelihood from incomplete data via the EM algorithm. Journal of the
Royal Statistical Society, Series B, 39, 1–38.
Diebolt, J. and Robert, C. P. (1994). Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical
Society, Series B, 56, 363–375.
Dierckx, P. (1993). Curve and Surface Fitting with Splines. Clarendon,
Oxford. ISBN 0-19-853440-X.
Dorey, F. J., Little, R. J., and Schenker, N. (1993). Multiple imputation for threshold-crossing data with interval censoring. Statistics in
Medicine, 12, 1589–1603.
Eilers, P. H. C. and Marx, B. D. (1996). Flexible smoothing with B-splines and penalties (with Discussion). Statistical Science, 11, 89–121.
Ekstrand, K. R., Christiansen, J., and Christiansen, M. E. (2003).
Time and duration of eruption of first and second permanent molars: a
longitudinal investigation. Community Dentistry and Oral Epidemiology,
31, 344–350.
Fahrmeir, L. and Tutz, G. (2001). Multivariate Statistical Modelling Based
on Generalized Linear Models. Springer-Verlag, New York, Second edition.
Fang, H.-B., Sun, J., and Lee, M.-L. T. (2002). Nonparametric survival
comparisons for interval-censored continuous data. Statistica Sinica, 12,
1073–1083.
Fay, M. P. (1996). Rank invariant tests for interval censored data under
grouped continuous model. Biometrics, 52, 811–822.
Fay, M. P. (1999). Comparing several score tests for interval censored data.
Statistics in Medicine, 18, 273–285.
Fay, M. P. and Shih, J. H. (1998). Permutation tests using estimated
distribution functions. Journal of the American Statistical Association,
93, 387–396.
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1, 209–230.
Ferguson, T. S. (1974). Prior distributions on spaces of probability measures. The Annals of Statistics, 2, 615–629.
Ferris, M. C., Mesnier, M. P., and Moré, J. (2000). NEOS and Condor: Solving nonlinear optimization problems over the Internet. ACM
Transactions on Mathematical Software, 26, 1–18.
Finkelstein, D. M. (1986). A proportional hazards model for interval-censored failure time data. Biometrics, 42, 845–854.
Fleming, T. R. and Harrington, D. P. (1991). Counting Processes and
Survival Analysis. John Wiley & Sons, New York. ISBN 0-471-52218-X.
Fletcher, R. (1987). Practical Methods of Optimization. John Wiley &
Sons, Chichester, Second edition. ISBN 0-471-49463-1.
Fourer, R., Gay, D. M., and Kernighan, B. W. (2003). AMPL: A
Modeling Language for Mathematical Programming. Duxbury Press, Second edition. ISBN 0-534-388094.
Gamerman, D. (1997). Markov Chain Monte Carlo: Stochastic Simulation
for Bayesian Inference. Chapman & Hall, London. ISBN 0-412-81820-5.
Gehan, E. A. (1965). A generalized Wilcoxon test for comparing arbitrarily
singly-censored samples. Biometrika, 52, 203–223.
Gelfand, A. E., Sahu, S. K., and Carlin, B. P. (1995). Efficient
parametrisations for normal linear mixed models. Biometrika, 82, 479–
499.
Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches
to calculating marginal densities. Journal of the American Statistical Association, 85, 398–409.
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. To appear in Bayesian Analysis.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004).
Bayesian Data Analysis. Chapman & Hall/CRC, Boca Raton, Second
edition. ISBN 1-58488-388-X.
Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulations
using multiple sequences (with Discussion). Statistical Science, 7, 457–511.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 6, 721–741.
Gentleman, R. and Geyer, C. J. (1994). Maximum likelihood for interval
censored data: consistency and computation. Biometrika, 81, 618–623.
Geyer, C. J. (1992). Practical Markov chain Monte Carlo (with Discussion).
Statistical Science, 7, 473–511.
Ghidey, W., Lesaffre, E., and Eilers, P. (2004). Smooth random effects
distribution in a linear mixed model. Biometrics, 60, 945–953.
Ghosh, J. K. and Ramamoorthi, R. V. (2003). Bayesian Nonparametrics.
Springer-Verlag, New York. ISBN 0-387-95537-2.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J., editors
(1996). Markov Chain Monte Carlo in Practice. Chapman & Hall, London.
ISBN 0-412-05551-1.
Gilks, W. R. and Wild, P. (1992). Adaptive rejection sampling for Gibbs
sampling. Applied Statistics, 41, 337–348.
Gill, R. D. (1980). Censoring and Stochastic Integrals. Number 124 in
Mathematical Centre Tracts. Mathematisch Centrum, Amsterdam. ISBN
90-6196-197-1.
Goetghebeur, E. and Ryan, L. (2000). Semiparametric regression analysis
of interval-censored data. Biometrics, 56, 1139–1144.
Goggins, W. B., Finkelstein, D. M., Schoenfeld, D. A., and Zaslavsky, A. M. (1998). A Markov chain Monte Carlo EM algorithm
for analyzing interval-censored data under the Cox proportional hazards
model. Biometrics, 54, 1498–1507.
Goggins, W. B., Finkelstein, D. M., and Zaslavsky, A. M. (1999).
Applying the Cox proportional hazards model for analysis of latency data
with interval censoring. Statistics in Medicine, 18, 2737–2747.
Gómez, G. and Calle, M. L. (1999). Non-parametric estimation with
doubly censored data. Journal of Applied Statistics, 26, 45–58.
Gómez, G., Calle, M. L., and Oller, R. (2004). Frequentist and
Bayesian approaches for interval-censored data. Statistical Papers, 45,
139–173.
Gómez, G., Espinal, A., and Lagakos, S. W. (2003). Inference for a
linear regression model with an interval-censored covariate. Statistics in
Medicine, 22, 409–425.
Gómez, G. and Lagakos, S. W. (1994). Estimation of the infection time
and latency distribution of AIDS with doubly censored data. Biometrics,
50, 204–212.
Gray, R. J. (1992). Flexible methods for analyzing survival data using
splines, with application to breast cancer prognosis. Journal of the American Statistical Association, 87, 942–951.
Green, P. J. (1995). Reversible jump Markov chain computation and
Bayesian model determination. Biometrika, 82, 711–732.
Groeneboom, P. and Wellner, J. A. (1992). Information Bounds
and Nonparametric Maximum Likelihood Estimation. Birkhäuser-Verlag,
Boston. ISBN 0-8176-2794-4.
Han, S. P. (1977). A globally convergent method for nonlinear programming.
Journal of Optimization Theory and Applications, 22, 297–309.
Hanson, T. and Johnson, W. O. (2002). Modeling regression error with
a mixture of Polya trees. Journal of the American Statistical Association,
97, 1020–1033.
Hanson, T. and Johnson, W. O. (2004). A Bayesian semiparametric AFT
model for interval-censored data. Journal of Computational and Graphical
Statistics, 13, 341–361.
Härkänen, T. (2003). BITE: A Bayesian intensity estimator. Computational Statistics, 18, 565–583.
Härkänen, T., Virtanen, J. I., and Arjas, E. (2000). Caries on permanent teeth: a nonparametric Bayesian analysis. Scandinavian Journal of
Statistics, 27, 577–588.
Hastie, T. and Tibshirani, R. (1990). Exploring the nature of covariate
effects in the proportional hazards model. Biometrics, 46, 1005–1016.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of
Statistical Learning. Springer-Verlag, New York. ISBN 0-387-95284-5.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov
chains and their applications. Biometrika, 57, 97–109.
Held, L. (2004). Simultaneous posterior probability statements from Monte
Carlo output. Journal of Computational and Graphical Statistics, 13,
20–35.
Hougaard, P. (1999). Fundamentals of survival data. Biometrics, 55,
13–22.
Hougaard, P. (2000). Analysis of Multivariate Survival Data. SpringerVerlag, New York. ISBN 0-387-98873-4.
Huang, J. (1999). Asymptotic properties of nonparametric estimation based
on partly interval-censored data. Statistica Sinica, 9, 501–519.
Ibrahim, J. G., Chen, M.-H., and Sinha, D. (2001). Bayesian Survival
Analysis. Springer-Verlag, New York. ISBN 0-387-95277-2.
Jasra, A., Holmes, C. C., and Stephens, D. A. (2005). Markov chain
Monte Carlo methods and the label switching problem in Bayesian mixture
modeling. Statistical Science, 20, 50–67.
Jin, Z., Lin, D. Y., Wei, L. J., and Ying, Z. (2003). Rank-based inference
for the accelerated failure time model. Biometrika, 90, 341–353.
Johnson, W. and Christensen, R. (1989). Nonparametric Bayesian analysis of the accelerated failure time model. Statistics and Probability Letters,
8, 179–184.
Joly, P., Commenges, D., and Letenneur, L. (1998). A penalized likelihood approach for arbitrarily censored and truncated data: application
to age–specific incidence of dementia. Biometrics, 54, 185–194.
Kalbfleisch, J. D. and MacKay, R. J. (1979). On constant-sum models
for censored survival data. Biometrika, 66, 87–90.
Kalbfleisch, J. D. and Prentice, R. L. (2002). The Statistical Analysis
of Failure Time Data. John Wiley & Sons, Chichester, Second edition.
ISBN 0-471-36357-X.
Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from
incomplete observations. Journal of the American Statistical Association,
53, 457–481.
Kauermann, G. (2005a). A note on smoothing parameter selection for
penalised spline smoothing. Journal of Statistical Planning and Inference,
127, 53–69.
Kauermann, G. (2005b). Penalised spline smoothing in multivariable survival models with varying coefficients. Computational Statistics and Data
Analysis, 49, 169–186.
Keiding, N., Andersen, P. K., and Klein, J. P. (1997). The role of frailty
models and accelerated failure time models in describing heterogeneity due
to omitted covariates. Statistics in Medicine, 16, 215–225.
Kim, M. Y., De Gruttola, V. G., and Lagakos, S. W. (1993). Analyzing doubly censored data with covariates, with application to AIDS.
Biometrics, 49, 13–22.
Komárek, A. and Lesaffre, E. (2006a). Bayesian accelerated failure time
model for correlated censored data with a normal mixture as an error
distribution. To appear in Statistica Sinica.
Komárek, A. and Lesaffre, E. (2006b). Bayesian accelerated failure
time model with multivariate doubly-interval-censored data and flexible
distributional assumptions. Submitted.
Komárek, A. and Lesaffre, E. (2006c). Bayesian semiparametric accelerated failure time model for paired doubly-interval-censored data. Statistical
Modelling, 6, 3–22.
Komárek, A., Lesaffre, E., Härkänen, T., Declerck, D., and Virtanen, J. I. (2005). A Bayesian analysis of multivariate doubly-interval-censored data. Biostatistics, 6, 145–155.
Komárek, A., Lesaffre, E., and Hilton, J. F. (2005). Accelerated
failure time model for arbitrarily censored data with smoothed error distribution. Journal of Computational and Graphical Statistics, 14, 726–745.
Kooperberg, C. (1998). Bivariate density estimation with an application
to survival analysis. Journal of Computational and Graphical Statistics, 7,
322–341.
Kooperberg, C. and Clarkson, D. B. (1997). Hazard regression with
interval-censored data. Biometrics, 53, 1485–1494.
Kooperberg, C. and Stone, C. J. (1992). Logspline density estimation
for censored data. Journal of Computational and Graphical Statistics, 1,
301–328.
Kooperberg, C., Stone, C. J., and Truong, Y. K. (1995). Hazard
regression. Journal of the American Statistical Association, 90, 78–94.
Kottas, A. and Gelfand, A. E. (2001). Bayesian semiparametric median
regression modeling. Journal of the American Statistical Association, 96,
1458–1468.
Kuo, L. and Mallick, B. (1997). Bayesian semiparametric inference for
the accelerated failure time model. The Canadian Journal of Statistics,
25, 457–472.
Lai, T. L. and Ying, Z. (1991). Large sample theory of a modified Buckley-James estimator for regression analysis with censored data. The Annals of
Statistics, 19, 1370–1402.
Laird, N. M. and Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, 38, 963–974.
Lambert, P., Collett, D., Kimber, A., and Johnson, R. (2004). Parametric accelerated failure time models with random effects and an application to kidney transplant survival. Statistics in Medicine, 23, 3177–3192.
Lambert, P. and Eilers, P. H. C. (2005). Bayesian proportional hazards model with time-varying regression coefficients: A penalized Poisson
regression approach. Statistics in Medicine, 24, 3977–3989.
Langohr, K., Gómez, G., and Muga, R. (2004). A parametric survival
model with an interval-censored covariate. Statistics in Medicine, 23,
3159–3175.
Lavine, M. (1992). Some aspects of Pólya tree distributions for statistical
modelling. The Annals of Statistics, 20, 1222–1235.
Lavine, M. (1994). More aspects of Pólya tree distributions for statistical
modelling. The Annals of Statistics, 22, 1161–1176.
Law, C. G. and Brookmeyer, R. (1992). Effects of mid-point imputation
on the analysis of doubly censored data. Statistics in Medicine, 11, 1569–
1578.
Lawson, A., Biggeri, A., Böhning, D., Lesaffre, E., Viel, J.-F., and
Bertollini, R., editors (1999). Disease Mapping and Risk Assessment for
Public Health. John Wiley & Sons, Chichester. ISBN 0-471-98634-8.
Lee, E. W., Wei, L. J., and Ying, Z. (1993). Linear regression analysis
for highly stratified failure time data. Journal of the American Statistical
Association, 88, 557–565.
Lee, Y. and Nelder, J. A. (2004). Conditional and marginal models:
Another view (with Discussion). Statistical Science, 19, 219–238.
Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation.
Springer-Verlag, New York, Second edition. ISBN 0-387-98502-6.
Leroy, R., Bogaerts, K., Lesaffre, E., and Declerck, D. (2003a).
The effect of fluorides and caries in primary teeth on permanent tooth
emergence. Community Dentistry and Oral Epidemiology, 31, 463–470.
Leroy, R., Bogaerts, K., Lesaffre, E., and Declerck, D. (2003b).
The emergence of permanent teeth in Flemish children (Belgium). Community Dentistry and Oral Epidemiology, 31, 30–39.
Leroy, R., Bogaerts, K., Lesaffre, E., and Declerck, D. (2005).
Effect of caries experience in primary molars on cavity formation in the
adjacent permanent first molar. Caries Research, 39, 342–349.
Lesaffre, E., Komárek, A., and Declerck, D. (2005). An overview
of methods for interval-censored data with an emphasis on applications in
dentistry. Statistical Methods in Medical Research, 14, 539–552.
Liang, K. Y. and Zeger, S. L. (1986). Longitudinal data analysis using
generalized linear models. Biometrika, 73, 13–22.
Lin, J. S. and Wei, L. J. (1992). Linear regression analysis for multivariate
failure time observations. Journal of the American Statistical Association,
87, 1091–1097.
Lindsey, J. K. and Lambert, P. (1998). On the appropriateness of
marginal models for repeated measurements in clinical trials. Statistics
in Medicine, 17, 447–469.
Lo, A. Y. (1984). On a class of Bayesian nonparametric estimates: I. Density
estimates. The Annals of Statistics, 12, 351–357.
Louis, T. A. (1981). Nonparametric analysis of an accelerated failure time
model. Biometrika, 68, 381–390.
Mantel, N. (1966). Evaluation of survival data and two new rank order
statistics arising in its consideration. Cancer Chemotherapy Reports, 50,
163–170.
Mantel, N. (1967). Ranking procedures for arbitrarily restricted observations. Biometrics, 23, 65–78.
Mauldin, R. D., Sudderth, W. D., and Williams, S. C. (1992). Pólya
trees and random distributions. The Annals of Statistics, 20, 1203–1221.
McLachlan, G. J. and Basford, K. E. (1988). Mixture Models: Inference
and Applications to Clustering. Marcel Dekker, Inc., New York. ISBN 0-8247-7691-7.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., and
Teller, A. H. (1953). Equations of state calculations by fast computing
machines. Journal of Chemical Physics, 21, 1087–1091.
Miller, R. G. (1976). Least squares regression with censored data.
Biometrika, 63, 449–464.
Molenberghs, G. and Verbeke, G. (2005). Models for Discrete Longitudinal Data. Springer Science+Business Media, New York. ISBN 0-387-25144-8.
Nanda, R. S. (1960). Eruption of human teeth. American Journal of Orthodontics, 46, 363–378.
Nardi, A. and Schemper, M. (2003). Comparing Cox and parametric
models in clinical studies. Statistics in Medicine, 22, 3597–3610.
Neal, R. M. (2003). Slice sampling (with Discussion). The Annals of
Statistics, 31, 705–767.
Odell, P. M., Anderson, K. M., and D’Agostino, R. B. (1992). Maximum likelihood estimation for interval-censored data using a Weibull-based
accelerated failure time model. Biometrics, 48, 951–959.
O’Hagan, A. (1994). Kendall’s Advanced Theory of Statistics, Volume 2B:
Bayesian Inference. Arnold, London, Sixth edition. ISBN 0-340-52922-9.
Oller, R., Gómez, G., and Calle, M. L. (2004). Interval censoring:
model characterization for the validity of the simplified likelihood. The
Canadian Journal of Statistics, 32, 315–326.
O’Sullivan, F. (1986). A statistical perspective on ill-posed inverse problems (with Discussion). Statistical Science, 1, 502–527.
(with Discussion). Statistical Science, 1, 502–527.
O’Sullivan, F. (1988). Fast computation of fully automated log–density
and log–hazard estimators. SIAM Journal on Scientific and Statistical
Computing, 9, 363–379.
Oulis, C. J., Raadal, M., and Martens, L. (2000). Guidelines on the
use of fluoride in children: an EAPD policy document. European Journal
of Paediatric Dentistry, 1, 7–12.
Pan, J. and MacKenzie, G. (2003). On modelling mean-covariance structures in longitudinal studies. Biometrika, 90, 239–244.
Pan, W. (1999a). A comparison of some two-sample tests with interval
censored data. Nonparametric Statistics, 12, 133–146.
Pan, W. (1999b). Extending the iterative convex minorant algorithm to
the Cox model for interval-censored data. Journal of Computational and
Graphical Statistics, 8, 109–120.
Pan, W. (2000a). A multiple imputation approach to Cox regression with
interval-censored data. Biometrics, 56, 199–203.
Pan, W. (2000b). A two-sample test with interval censored data via multiple
imputation. Statistics in Medicine, 19, 1–11.
Pan, W. (2001). A multiple imputation approach to regression analysis for
doubly censored data with application to AIDS studies. Biometrics, 57,
1245–1250.
Pan, W. and Connett, J. E. (2001). A multiple imputation approach to
linear regression with clustered censored data. Lifetime Data Analysis, 7,
111–123.
Pan, W. and Kooperberg, C. (1999). Linear regression for bivariate censored data via multiple imputation. Statistics in Medicine, 18, 3111–3121.
Pan, W. and Louis, T. A. (2000). A linear mixed-effects model for multivariate censored data. Biometrics, 56, 160–166.
Parner, E. T., Heidmann, J. M., Væth, M., and Poulsen, S. (2001).
A longitudinal study of time trends in the eruption of permanent teeth in
Danish children. Archives of Oral Biology, 46, 425–431.
Pepe, M. S. and Fleming, T. R. (1989). Weighted Kaplan-Meier statistics:
a class of distance tests for censored survival data. Biometrics, 45, 497–507.
Pepe, M. S. and Fleming, T. R. (1991). Weighted Kaplan-Meier statistics:
large sample and optimality considerations. Journal of the Royal Statistical
Society, Series B, 53, 341–352.
Peto, R. (1973). Experimental survival curves for interval-censored data.
Applied Statistics, 22, 86–91.
Peto, R. and Peto, J. (1972). Asymptotically efficient rank-invariant test
procedures (with Discussion). Journal of the Royal Statistical Society, Series A, 135, 185–206.
Petroni, G. R. and Wolfe, R. A. (1994). A two-sample test for stochastic
ordering with interval-censored data. Biometrics, 50, 77–87.
Pourahmadi, M. (1999). Joint mean-covariance models with applications
to longitudinal data: Unconstrained parametrisation. Biometrika, 86,
677–690.
Prentice, R. L. (1978). Linear rank tests with right censored data. Biometrika, 65, 167–179.
R Development Core Team (2005). R: A language and environment for
statistical computing. R Foundation for Statistical Computing, Vienna,
Austria. URL http://www.R-project.org. ISBN 3-900051-07-0.
Rabinowitz, D., Tsiatis, A., and Aragon, J. (1995). Regression with
interval-censored data. Biometrika, 82, 501–513.
Ramsay, J. O. (1988). Monotone regression splines in action. Statistical
Science, 3, 425–461.
Reid, N. (1994). A conversation with Sir David Cox. Statistical Science, 9,
439–455.
Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures
with unknown number of components (with Discussion). Journal of the
Royal Statistical Society, Series B, 59, 731–792.
Ritov, Y. (1990). Estimation in a linear regression model with censored
data. The Annals of Statistics, 18, 303–328.
Robert, C. P. and Casella, G. (2004). Monte Carlo Statistical Methods.
Springer-Verlag, New York, Second edition. ISBN 0-387-21239-6.
Roeder, K. and Wasserman, L. (1997). Practical Bayesian density estimation
using mixtures of normals. Journal of the American Statistical Association,
92, 894–902.
Rosenberg, P. S. (1995). Hazard function estimation using B-splines. Biometrics, 51, 874–887.
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. John
Wiley & Sons, New York. ISBN 0-471-08705-X.
Rücker, G. and Messerer, D. (1988). Remission duration: an example
of interval-censored observations. Statistics in Medicine, 7, 1139–1145.
Satten, G. A. (1996). Rank-based inference in the proportional hazards
model for interval censored data. Biometrika, 83, 355–370.
Satten, G. A., Datta, S., and Williamson, J. M. (1998). Inference
based on imputed failure times for the proportional hazards model with
interval-censored data. Journal of the American Statistical Association,
93, 318–327.
Self, S. G. and Grossman, E. A. (1986). Linear rank tests for interval-censored data with application to PCB levels in adipose tissue of transformer repair workers. Biometrics, 42, 521–530.
Silverman, B. W. (1985). Some aspects of the spline smoothing approach
to non–parametric regression curve fitting. Journal of the Royal Statistical
Society, Series B, 47, 1–52.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde,
A. (2002). Bayesian measures of model complexity and fit (with Discussion). Journal of the Royal Statistical Society, Series B, 64, 583–639.
Stephens, M. (2000). Dealing with label switching in mixture models. Journal of the Royal Statistical Society, Series B, 62, 795–809.
Sun, J. (1995). Empirical estimation of a distribution function with truncated and doubly interval-censored data and its application to AIDS studies. Biometrics, 51, 1096–1104.
Sun, J., Liao, Q., and Pagano, M. (1999). Regression analysis of doubly
censored failure time data with application to AIDS studies. Biometrics,
55, 909–914.
Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior
distributions by data augmentation. Journal of the American Statistical
Association, 82, 528–550.
Therneau, T. M. and Grambsch, P. M. (2000). Modeling Survival Data:
Extending the Cox Model. Springer-Verlag, New York. ISBN 0-387-98784-3.
Therneau, T. M. and Hamilton, S. A. (1997). rhDNase as an example
of recurrent event analysis. Statistics in Medicine, 16, 2029–2047.
Tierney, L. (1994). Markov chains for exploring posterior distributions
(with Discussion). The Annals of Statistics, 22, 1701–1762.
Titterington, D. M., Smith, A. F. M., and Makov, U. E. (1985).
Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons,
Chichester. ISBN 0-471-90763-4.
Topp, R. and Gómez, G. (2004). Residual analysis in linear regression
models with an interval-censored covariate. Statistics in Medicine, 23,
3377–3391.
Tsiatis, A. A. (1990). Estimating regression parameters using linear rank
tests for censored data. The Annals of Statistics, 18, 354–372.
Tsiatis, A. A. and Davidian, M. (2004). Joint modeling of longitudinal
and time-to-event data: An overview. Statistica Sinica, 14, 809–834.
Turnbull, B. (1976). The empirical distribution function with arbitrarily
grouped, censored and truncated data. Journal of the Royal Statistical
Society, Series B, 37, 290–295.
Tutz, G. and Binder, H. (2004). Flexible modelling of discrete failure
time including time-varying smooth effects. Statistics in Medicine, 23,
2445–2461.
Unser, M., Aldroubi, A., and Eden, M. (1992). On the asymptotic
convergence of B-spline wavelets to Gabor functions. IEEE Transactions
on Information Theory, 38, 864–872.
Vaida, F. and Xu, R. (2000). Proportional hazards model with random
effects. Statistics in Medicine, 19, 3309–3324.
Vanobbergen, J., Martens, L., Lesaffre, E., Bogaerts, K., and
Declerck, D. (2001). Assessing risk indicators for dental caries in the
primary dentition. Community Dentistry and Oral Epidemiology, 29, 424–
434.
Vanobbergen, J., Martens, L., Lesaffre, E., and Declerck, D.
(2000). The Signal-Tandmobiel® project – a longitudinal intervention
health promotion study in Flanders (Belgium): baseline and first year results. European Journal of Paediatric Dentistry, 2, 87–96.
Verbeke, G. and Lesaffre, E. (1996). A linear mixed-effects model with
heterogeneity in the random-effects population. Journal of the American
Statistical Association, 91, 217–221.
Verbeke, G. and Lesaffre, E. (1997). The effect of misspecifying the
random-effects distribution in linear mixed models for longitudinal data.
Computational Statistics and Data Analysis, 23, 541–556.
Verweij, P. J. M. and Van Houwelingen, H. C. (1994). Penalized
likelihood in Cox regression. Statistics in Medicine, 13, 2427–2436.
Virtanen, J. I. (2001). Changes and trends in attack distributions and
progression of dental caries in three age cohorts in Finland. Journal of
Epidemiology and Biostatistics, 6, 325–329.
Wahba, G. (1983). Bayesian “confidence intervals” for the cross–validated
smoothing spline. Journal of the Royal Statistical Society, Series B, 45,
133–150.
Walker, S. G., Damien, P., Laud, P. W., and Smith, A. F. M. (1999).
Bayesian nonparametric inference for random distributions and related
functions (with Discussion). Journal of the Royal Statistical Society, Series
B, 61, 485–527.
Walker, S. G. and Mallick, B. K. (1999). A Bayesian semiparametric
accelerated failure time model. Biometrics, 55, 477–483.
Wand, M. P. (2003). Smoothing and mixed models. Computational Statistics, 18, 223–249.
Wei, G. C. G. and Tanner, M. A. (1990). A Monte Carlo implementation
of the EM algorithm and the poor man’s data augmentation algorithms.
Journal of the American Statistical Association, 85, 699–704.
Wei, G. C. G. and Tanner, M. A. (1991). Applications of multiple imputation to the analysis of censored regression data. Biometrics, 47,
1297–1309.
Williams, J. S. and Lagakos, S. W. (1977). Models for censored survival
analysis: Constant-sum and variable-sum models. Biometrika, 64, 215–
224.
Ying, Z. (1993). A large sample study of rank estimation for censored
regression data. The Annals of Statistics, 21, 76–99.
Yu, Q., Li, L., and Wong, G. Y. C. (2000). On consistency of the
self-consistent estimator of survival functions with interval-censored data.
Scandinavian Journal of Statistics, 27, 35–44.
Yu, Q., Schick, A., Li, L., and Wong, G. Y. C. (1998). Asymptotic
properties of the GLME in the case 1 interval-censorship model with discrete inspection times. Canadian Journal of Statistics, 26, 619–627.
Curriculum Vitae
Arnošt Komárek was born on March 28, 1977 in Hradec Králové in the Czech Republic. After secondary school at the Božena Němcová Secondary Grammar School (Gymnázium Boženy Němcové) in Hradec Králové, he started undergraduate studies in Mathematics in September 1995 at the Faculty of Mathematics and Physics of the Charles University (Univerzita Karlova) in Prague, the Czech Republic, where he chose the direction of Mathematical Statistics and graduated as Master of Science in Mathematical Statistics in May 2000. From October 2000 till September 2001 he was enrolled as an Erasmus exchange student at the University of Limburg (Limburgs Universitair Centrum, nowadays Universiteit Hasselt) in Diepenbeek, Belgium, and obtained the degree of Master of Science in Biostatistics. In October 2001 he started his career as a researcher and predoctoral student at the Biostatistical Centre of the Catholic University of Leuven (Katholieke Universiteit Leuven) in Leuven, Belgium. At the same institute, he started the doctoral programme in October 2002, of which this thesis is the most important outcome.