Multiple Time Scales and Longitudinal Measurements in Event History Analysis Danardono

Statistical Studies No. 33
Department of Statistics
Umeå University 2005
Doctoral Dissertation
Department of Statistics
Umeå University
SE-901 87 Umeå, Sweden
Department of Public Health and Clinical Medicine,
Epidemiology and Public Health Sciences
Umeå University
SE-901 85 Umeå, Sweden
Copyright © 2005 by Danardono
ISSN: 1100-8989
ISBN: 91-7305-812-2
Printed by Solfjädern Offset AB Umeå 2005
Abstract
A general time-to-event data analysis known as event history
analysis is considered. The focus is on the analysis of time-to-event
data using Cox’s regression model when the time to the event may be
measured from different origins giving several observable time scales
and when longitudinal measurements are involved. For the multiple
time scales problem, procedures to choose a basic time scale in Cox’s
regression model are proposed. The connections between piecewise
constant hazards, time-dependent covariates and time-dependent
strata in the dual time scales are discussed. For the longitudinal
measurements problem, four methods known in the literature together with two proposed methods are compared. All quantitative
comparisons are performed by means of simulations. Applications to
the analysis of infant mortality, morbidity, and growth are provided.
Keywords and phrases: Cox regression, multiple events, proportional hazards, random effects, survival analysis, time-dependent
covariates, time origin.
AMS subject classification:
62P10, 62N03.
To Leni, Fiyan and Lila
Acknowledgments
I would like to thank and express my deepest gratitude to:
Professor Göran Broström, my main supervisor, for his support
and help during my studies and the writing of this thesis. I learned
a lot from all our discussions during the last five years;
Dr. Hans Stenlund, my co-supervisor from the Department
of Public Health and Clinical Medicine, Epidemiology and Public
Health Sciences, for his support, comments and friendship;
Dr. Marie Lindkvist, who discussed the thesis manuscript in my
slutseminarium and provided many valuable comments.
Professor Subanar, the dean of the Faculty of Mathematics and
Natural Sciences, Gadjah Mada University, Indonesia, for his advice
and support.
I would also like to thank the Community Health and Nutrition Research Laboratories (CHN-RL), Faculty of Medicine, Gadjah
Mada University, for allowing me to use the surveillance data, and
Dr. Torbjörn Lind, for allowing me to use the ZINAK data.
I received financial support from STINT (Stiftelsen för internationalisering av högre utbildning och forskning - the Swedish foundation for international cooperation in research and higher education)
during the initial stage of my studies at Umeå University, for my
licentiate degree. Subsequently, I received financial support from
Umeå University through the Department of Statistics and from the
Department of Public Health and Clinical Medicine, Epidemiology
and Public Health Sciences. To them I am very thankful.
Thanks to my many friends and colleagues who supported me
during the life course of my studies. My warmest thanks to Birgitta Åström, for her friendship and endless assistance to me and
my family. I also thank Anna Winkvist for her support, friendship
and scientific discussions. To all Indonesian friends in Umeå, I say
”terima kasih banyak”.
Thanks (and goodbye...) to my ”old” classmates Jari’-san’,
Maria, Marie; and to the ”younger”-mates, Mathias-ever-been-a-roommate, Ingeborg, Juke (thanks for your comments and corrections), Suad, Leake and Tea. Lycka till! ”Tack så mycket” to Birgitta Löfroth for your help and all my colleagues at the Department
of Statistics, Umeå University.
To anyone else who, because of my limited memory, may have
been omitted from being mentioned by name, I thank you for your
assistance.
To Leni, Fiyan and Lila, my beloved family, thank you for supporting me and being here. I apologize that my mind was often
engaged with this thesis during dinner. I do not have enough words
to thank you here. This thesis is dedicated to you.
I would also like to say something about my name. Many people
asked me why I only have one name (one word). In Indonesia, where
I come from, there is no requirement to have a family name. We
have liberty to have our own name. I have one name, my wife and
our children have three names (three words) each.
Finally, thanks for reading this thesis, at least this page...
Contents
Abstract
Acknowledgments
List of Figures
List of Tables
1 Introduction
1.1 Event history and longitudinal data
1.2 Review of the problem
1.3 Objectives and scope
1.4 Outline and summary
2 Basic Methods
2.1 Introduction
2.2 Event history analysis
2.2.1 Hazard and survival
2.2.2 The counting process approach
2.2.3 Regression models
2.2.4 Diagnostics and stratification
2.2.5 Frailty
2.2.6 Multistate models
2.3 Longitudinal data analysis
2.3.1 Notation and approaches
2.3.2 General linear models
2.3.3 Generalized estimating equations
2.3.4 Generalized linear mixed models
2.4 Time-dependent covariates
2.4.1 Some useful classifications
2.4.2 Approaches in the Cox model
2.4.3 Time-dependent confounders
3 Analysis of Childhood Mortality, Morbidity and Growth
3.1 Introduction
3.2 Mortality
3.2.1 Data, study variables and models
3.2.2 Results
3.3 Morbidity: surveillance data
3.3.1 Data, study variables and models
3.3.2 Age time scale
3.3.3 Calendar time
3.3.4 Time since weaning
3.4 Morbidity: trial data
3.4.1 Data, study variables and models
3.4.2 Results
3.5 Infant growth
3.6 Remarks
4 Multiple Time Scales
4.1 Introduction
4.2 The choice of relevant time scales
4.3 Modeling dual time scales
4.3.1 Piecewise constant hazards
4.3.2 Time-dependent approaches
4.4 Simulation studies
4.4.1 Erroneous scale
4.4.2 Dual time scales
4.4.3 Miss-specification
4.5 Application to infant mortality age-period analysis
4.6 Remarks
5 Event History Analysis with Longitudinal Measurements
5.1 Introduction
5.2 Problem and models
5.3 Methods
5.4 Simulation studies
5.5 Application to infant respiratory infection and weight data
5.6 Remarks
6 Concluding Remarks
Appendix
A-1 Simulating alternative time scale
A-2 Simulating dual time scales
A-3 Simulating longitudinal measurements and event-time data
A-3.1 Time-dependent covariate model
A-3.2 Joint model
List of Figures
1.1 History of a hypothetical child experiencing healthy, ill and dead states, observed at two periods
1.2 Repeated measurements on weight
1.3 Repeated measurements on weight and respiratory infections
1.4 Four subjects on two different time scales
1.5 Four subjects on a Lexis diagram
2.1 Time-to-event and time-dependent covariates
3.1 Sibling as a time-dependent covariate
3.2 Profile likelihood for the mother and household random effect variance for infant mortality model
3.3 The cumulative hazard and hazard plot of childhood respiratory infection and diarrhea by age
3.4 The cumulative hazards and hazards plot of childhood respiratory infection and diarrhea by calendar time
3.5 Raw and smoothed hazard plot of childhood respiratory infection by age
3.6 The children’s weight across age
4.1 Lexis diagram and separate scale
4.2 Hypothetical event history data on a Lexis diagram
5.1 Event history data and longitudinal measurements
List of Tables
3.1 Five hazard models for infant mortality (0-1 years)
3.2 Five hazard models for child mortality (1-5 years)
3.3 Data layout for morbidity study
3.4 Hazard model for diarrhea, age time scale
3.5 Hazards model for respiratory infection, calendar time
3.6 Hazards model for respiratory infection, time since weaning
3.7 Hazards model for respiratory infection using the Andersen-Gill model, ZINAK study
3.8 Hazards model for respiratory infection using the gap-time model, ZINAK study
3.9 Growth curve model for weight using random effect and ordinary linear model, ZINAK study
4.1 Simulation study for erroneous scale with $\delta_i$ follows uniform distribution
4.2 Simulation study for erroneous scale with $\delta_i$ follows an exponential distribution
4.3 Simulation study for dual time scales $S_1$ and $S_2$ with $\beta_1 = 1.5$, $\beta_2 = 0, 1$ and $\delta_i$ follows exponential with rate 0.85
4.4 Simulation study for dual time scales $S_1$ and $S_2$ with $\beta_1 = 1.5$, $\beta_2 = 0, 1$ and $\delta_i$ follows uniform(0,2)
4.5 Likelihood ratio test (LRT) for variables in the infant mortality models
4.6 Estimated coefficients and their standard errors for gender and maternal education in the infant mortality models
5.1 Simulation study for Cox’s time-dependent covariate model analyzed with the LVCF, TEL, two-stage, Cox-frailty and Cox-strata methods
5.2 Simulation study for joint model analyzed with the LVCF, TEL, two-stage, Cox-frailty and Cox-strata methods
5.3 Likelihood ratio test for the LVCF, TEL and two-stage models
5.4 Hazards model for respiratory infection using the Andersen-Gill model
5.5 Hazards model for respiratory infection using the gap-time model
5.6 Separate and joint model analyses for infant respiratory infection and weight data
A-1 The specification of hazard functions and times T generation
Chapter 1
Introduction
1.1 Event history and longitudinal data
Event history and longitudinal data arise frequently in scientific investigations. Important examples are found in epidemiological
surveillance and clinical trials. The nature of the data is that specific units or subjects are followed over time.
The term event history data possibly originated from sociology.
Other applicable terms are survival data and duration data. Other popular terms for longitudinal data are repeated measurements, commonly used in the biological and health sciences, and panel data, commonly used in the social sciences.
While, generally, event history and longitudinal data have many
characteristics in common, their differences will be emphasized here.
Event history data refers to time-to-event data, whereas longitudinal
data refers mostly to repeated measurements. Two examples that will
be used throughout this thesis are given below.
In 1994, an epidemiological surveillance was established in Purworejo district in Indonesia under the Community Health and Nutrition Research Laboratories (CHN-RL), Gadjah Mada University, Yogyakarta.
Households were visited every 90-th day to record vital demographic
events, morbidity events, nutritional status and utilization of health
services (Wilopo and CHN-RL Team, 1997). The general aim of the
surveillance was to improve the health and nutritional status in the
district, particularly for children and women. Vital events such as
births and deaths were recorded continuously over time; other events,
such as morbidity events, were not. Like many other
surveillance data sets, these covered a large number of subjects but
without very detailed information on each subject. In the period
between 1994 and 1998, there were about 15,000 households with
around 8,000 children involved, but the information on childhood
morbidity was available for only a two-week period every 90-th day.
Figure 1.1 shows a typical event history collected in the surveillance.
Data for certain events of interest (for instance, illness or death) are
recorded for each child. The data are then available for investigating
the determinants of childhood mortality and morbidity. Often in
general surveillance data collection, observations can only be made
partially because of technical or logistical reasons. Referring to Figure 1.1, as the surveillance was only conducted every 90-th day,
observations can only be recorded during period 1 and period 2. This
common feature of event history data, known as censoring and truncation, has to be considered in the analysis.
Many specific epidemiological studies and trials are also conducted and organized under the surveillance system. One of them
was the ZINAK study on zinc and iron supplementation in infants
(Lind, 2004). This study was a community based, randomized,
double-blind, placebo-controlled trial whose purpose was to investigate the effect of four supplementation groups (iron, zinc, iron+zinc
and placebo) on iron and zinc status, infant growth, cognitive development and incidence of infant infectious diseases during the first six to
twelve months of age. This thesis utilized the data on infant growth (weight) and on infectious disease (respiratory infection). There were 680
infants aged six to twelve months participating in the study with
daily supplementation and daily morbidity records, and monthly infant growth records.
[Figure: states y(t) (healthy, sick, dead) against time t, with observation periods 1 and 2 marked]
Figure 1.1: History of a hypothetical child experiencing healthy, ill
and dead states, observed at two periods.
Figure 1.2 shows an example of longitudinal data, repeated measurements of the weight of four infants across age in the ZINAK
study. Here, the measurements are intermittently performed, once
every month. One objective of the analysis of the data is to investigate the effect of the supplementations on weight development,
taking into account other explanatory variables.
The study also considered morbidity (illness), such as respiratory
infections. Figure 1.3 presents longitudinal measurements of weight
together with the occurrence of respiratory infections. Interesting
analyses of the data include studying the effect of supplementations
on weight development, taking into account the respiratory infections
as mentioned in the previous paragraph, or the effect of supplementations on the incidence of respiratory infections, taking into account
weight development. The third possible analysis is to investigate
weight and respiratory infection simultaneously, as both outcomes
may actually affect each other, given the supplementations.
[Figure: weight (kgs) against age (months)]
Figure 1.2: Repeated measurements on weight.
1.2 Review of the problem
Time-to-event analysis deals with the analysis of time measured from
a well-defined time origin up to the occurrence of a certain event of
interest. The scale for measuring time can be ordinary clock time
(minutes, days, years, and so forth) or another measurement, such
as mileage or usage, which are common in reliability, or experience or
exposure, which are common in epidemiology and the social sciences.
Regression modeling of time-to-event data is commonly applied
in studying the relationship between the outcome and independent
(predictor) variables. The analysis can be performed through the
density function or through the hazard function. As with many
other statistical procedures, the analysis can be performed parametrically by specifying the density function, or non-parametrically by
specifying nothing about the density function.
[Figure: two panels against age (months); upper panel "Longitudinal measurement", weight (kgs); lower panel "Event occurrence", resp-inf (0 or 1)]
Figure 1.3: Repeated measurements on weight and respiratory infections for one infant.
In this thesis, emphasis is given to the modeling of hazard functions using Cox’s semiparametric model (Cox, 1972; Cox, 1975). The
reasons for modeling the hazard are (Cox and Oakes, 1984; Hosmer and Lemeshow, 1999): (i) considering the immediate risk may
be useful; (ii) comparisons of groups of individuals are sometimes
sharpened by the hazard, for example, specific questions such as
how survival is related to the treatments under study can be investigated by studying the estimated regression parameters from the
hazard model; (iii) hazard-based models can be extended to
more general event processes, such as multiple events.
The semiparametric model is appealing in fields like epidemiology
since most of the phenomena in epidemiological data are ’irregular’
in the sense that a specific distribution function may not be easily
determined. Furthermore, the idea of hazard comparison in the Cox
model is similar to the well-known relative risk in common epidemiological analysis.
It has already been mentioned in the previous section that censoring and truncation are quite natural in event history data. Figure
1.4 gives a common description of censoring and truncation. The
examples refer to the CHN-RL surveillance mortality data, for the
period of time from 1994 to 1998, and for children under 5 years of
age.
On the calendar time scale, many of the children did not enter
the study at the beginning of the period in 1994 (subjects
3 and 4, Figure 1.4(a)). There is a similar situation on the age time
scale, where many of the children did not enter the study on their
day of birth (subjects 1 and 2, Figure 1.4(b)). This kind of missing
information, where the subjects are observed only after the time origin
in the time-to-event data, is known as staggered entry, late entry or
left-truncation.
[Figure: two panels with vertical axis subjects 1 to 4; panel (a) Calendar time scale, horizontal axis 1994 to 1998; panel (b) Age time scale, horizontal axis age 0 to 5]
Figure 1.4: Four subjects with staggered entry (left-truncation),
right-censored (the lines without dots) and event (the lines with dots)
on two different time scales.
Some of the children experienced the event (death) and some
of them were only partially observed, known as censored, because of
the time limitation (observation only up to 1998 or until reaching 5 years of age), and
also due to other causes such as emigration.
Truncation and censoring may introduce several problems in the
analysis, such as length-biased sampling (a higher chance of being sampled for the longer survivors) and wasted information (if the analysis
only utilizes complete observations). Nowadays, time-to-event analysis can deal with these problems easily, for instance by using a counting process approach (Andersen, Borgan, Gill and Keiding, 1993; Therneau and Grambsch, 2000). The tools that facilitate truncation
and censoring have made event history analysis with various time
scales easier. For instance, the four subjects can be analyzed using a
calendar time scale as easily as using an age time scale by specifying
a counting process style of input (Therneau and Grambsch, 2000)
corresponding to the scale used in the analysis. However, another
complication may arise, as discussed later.
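As a minimal sketch of what a counting process style of input looks like, the Python fragment below converts the dates of one child into (start, stop, event) triples on the age scale and on the calendar time scale. The function names, the 365.25-day year and the dates are illustrative assumptions; they are not taken from the CHN-RL data or from any particular software.

```python
from datetime import date

def years_between(start: date, end: date) -> float:
    """Elapsed time in years, assuming a 365.25-day year."""
    return (end - start).days / 365.25

def counting_process_rows(birth, entry, exit_, event, study_start):
    """Return (start, stop, event) on the age and calendar time scales.

    birth, entry, exit_ and study_start are dates; event is 1 if the
    event was observed at exit_ and 0 if the subject was censored.
    """
    age_row = (years_between(birth, entry), years_between(birth, exit_), event)
    calendar_row = (
        years_between(study_start, entry),
        years_between(study_start, exit_),
        event,
    )
    return {"age": age_row, "calendar": calendar_row}

# Hypothetical child: born in mid-1993, under observation from the start
# of the study period in 1994, with the event recorded one year later.
rows = counting_process_rows(
    birth=date(1993, 7, 1),
    entry=date(1994, 1, 1),
    exit_=date(1995, 1, 1),
    event=1,
    study_start=date(1994, 1, 1),
)
print(rows["age"])
print(rows["calendar"])
```

The interval has the same length on both scales; only the origin changes. On the age scale this child enters with start > 0, which is how left-truncation shows up in this style of input.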
Figure 1.5 represents the life experiences of the four subjects in
Figures 1.4(a) and 1.4(b) on a Lexis diagram (Keiding, 1990). A
Lexis diagram is a dual time scale system (usually calendar time and
age), representing individual lives by line segments of unit slope, with
events usually marked by dots. Representing the life experiences of
the subjects in the period from 1994 to 1998 and under 5 years of age
is clearer in this diagram than in the separate time scales of Figures
1.4(a) and 1.4(b).
Event history analysis often involves data with more than one
time scale as shown in Figure 1.5. One early paper discussing this
problem gave an example on the choice of time scale between age
and age at first child’s birth of women with breast cancer (Farewell
and Cox, 1979). Another famous example is the age-period-cohort
model (Holford, 1998) which is popular in demography but carries
an identification problem.
[Figure: Lexis diagram; vertical axis age (years), 0 to 5; horizontal axis calendar time, 1994 to 1998]
Figure 1.5: Four subjects with staggered entry (left-truncation),
right-censored (the lines without dots) and event (the lines with dots)
on a Lexis diagram.
The multiple time scales problem also arises in multi-state models
when many time scales are involved in the transition between states.
Coping with several time scales is one of the challenges of multi-state
models in epidemiology (Commenges, 1999).
Multiple time origins may be a more appropriate term than
multiple time scales, since this problem deals with life experiences measured from many different origins (birthdate, starting
date of surveillance, etc.). However, many authors have used the
term multiple time scales in reference to this problem (Farewell
and Cox, 1979; Berzuini and Clayton, 1994a; Oakes, 1995; Duchesne, 1999; Efron, 2002) and we continue to use the term.
This thesis considers the multiple time scales problem in
event history analysis as the first problem. This first problem includes the procedure to choose the most relevant time scale and to
simultaneously model time scales.
Typically, event history data, such as the ZINAK study mentioned in the previous section, will also include longitudinal measurements collected intermittently across time. For instance,
growth or nutritional status, such as weight, was measured among
children together with morbidity outcomes, such as respiratory
infections. The second problem considered in this thesis is the dual
outcomes of event occurrence and longitudinal measurement.
When weight is considered as the primary outcome, weight will be
the response variable with the occurrence or the symptom duration of
respiratory infections as an explanatory variable, possibly with some
other variables. The analysis can then be done using the longitudinal
analysis methods proposed by Diggle, Heagerty, Liang and Zeger
(2002).
Complications may arise when respiratory infection is the outcome of interest and weight is to be included as one explanatory
variable. In many applications, continuous measurements of a longitudinal covariate, such as weight in the ZINAK study, are usually
only available at some finite number of measurement times. This,
potentially, becomes a problem in the ordinary Cox regression, since
the method requires all values of covariates to be available at event
times. Compromising the analysis by using cases with complete values of covariates is possible, but will lead to bias in the estimated
regression coefficient.
Several methods have been proposed to cope with the above problem. They are the last value carried forward (LVCF), elapsed time
(TEL) (Bruijne, Cessie, Kluin-Nelemans and Houwelingen, 2001),
two-stage (Tsiatis, DeGruttola and Wulfsohn, 1995) and joint model
methods (Wulfsohn and Tsiatis, 1997; Henderson, Diggle and Dobson, 2000; Tsiatis and Davidian, 2004). Some of these methods have been
compared with each other. The most recent, and perhaps most comprehensive, comparison is the investigation by Andersen and Liestøl (2003). No
attempt, however, has been made to compare the methods for repeated events such as respiratory infection in the ZINAK study.
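To make the missing-covariate problem concrete, the sketch below shows one simple reading of last value carried forward: the covariate value used at an event time is the most recent measurement taken at or before that time. It also returns the time elapsed since that measurement, which is our assumption about the kind of quantity the elapsed time method works with. The function names and the monthly weights are hypothetical.

```python
from bisect import bisect_right

def last_value_carried_forward(measure_times, values, t):
    """Most recent measurement taken at or before time t, or None if
    no measurement is available yet. measure_times must be sorted."""
    k = bisect_right(measure_times, t)
    return None if k == 0 else values[k - 1]

def time_since_last_measurement(measure_times, t):
    """Time elapsed between the last measurement at or before t and t."""
    k = bisect_right(measure_times, t)
    return None if k == 0 else t - measure_times[k - 1]

# Hypothetical infant: weight (kg) measured at ages 6, 7 and 8 months,
# with a respiratory infection starting at age 7.5 months.
ages = [6.0, 7.0, 8.0]
weights = [7.1, 7.4, 7.6]
event_age = 7.5

weight_at_event = last_value_carried_forward(ages, weights, event_age)
gap = time_since_last_measurement(ages, event_age)
print(weight_at_event, gap)
```

The carried-forward value ignores any change in weight during the gap, which is one reason why the methods listed above can give different answers.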
1.3 Objectives and scope
The focus of this thesis is on the analysis of event history data using
Cox’s proportional hazards model with the objectives
• to demonstrate the use of event history analysis in the analysis
of infant and child mortality, morbidity and growth and to
identify the methodological problems in the analysis,
• to propose procedures to choose a basic time scale,
• to discuss the connections between the methods for modeling
dual time scales and to perform quantitative comparisons between them,
• to compare existing methods to deal with longitudinal measurements in the Cox model with two proposed methods.
1.4 Outline and summary
Chapter 2 provides technical reviews of event history and longitudinal analysis. The concept of time-dependent covariates, which plays
an important role in this thesis, is reviewed more comprehensively
than the other topics. Chapter 3 presents the application of event
history and longitudinal data analysis to childhood mortality and
morbidity data from the CHN-RL surveillance data, and application
on respiratory infection and weight data from the ZINAK study.
This chapter gives the background to problems considered in the
later chapters. Chapter 4 is devoted to the problem of multiple time
scales. The procedures to choose the most relevant time scale and
to model dual time scales are discussed. Simulation studies and an application to infant mortality data are provided. Chapter 5 presents
a comparison of the methods to deal with longitudinal measurements
in event history analysis. An application to the infant respiratory infection and weight data is provided. Chapter 6 summarizes
and concludes this thesis and outlines further research and work in
this area.
Chapter 2
Basic Methods
2.1 Introduction
This chapter is a brief technical exposition of the basic theories and
methods used for further developments in the later chapters. Longitudinal data analysis (LDA) and event history analysis (EHA) have
similarities, for instance in the nature of the data involved, as mentioned in the previous chapter. The methods have many overlapping techniques and areas (see, for example, the review paper by
Doksum and Gasko (1990)). The classical books on
survival analysis and counting process theory by Cox and Oakes
(1984), Kalbfleisch and Prentice (2002) and Andersen et al. (1993), and
the book on LDA by Diggle et al. (2002), are the main references
for this chapter. This chapter also presents the similarities between
the two analyses, especially for topics related to time-dependent
covariates.
2.2 Event history analysis
2.2.1 Hazard and survival
Generic survival data are of the form $(T, \delta)$, where $T = \min(T_e, T_c)$ is
the minimum of the time to event $T_e$ (such as failure or death time) and
the time to censoring $T_c$, and $\delta = I\{T_e \le T_c\}$ is an indicator with value 1
if the event is observed and 0 if it is censored. Most often, we are
also interested in including covariates in the data. The survival data
then become $(T, \delta, Z)$, where $Z = (Z_1, \ldots, Z_p)'$ is a $p$-dimensional vector
of covariates.
$T$ is a non-negative random variable that can be continuous or
discrete. We first consider the continuous case. There are many functions that describe the distribution of $T$. The cumulative distribution
function $F(t) = P(T \le t)$ and the density function $f(t) = dF(t)/dt$
are the usual functions characterizing a random variable. More useful functions in survival analysis are the survivor function
\[ S(t) = 1 - F(t) = P(T > t), \tag{2.1} \]
i.e., the probability of the duration time (e.g., lifetime) being longer
than $t$, and the hazard function
\[ \lambda(t) = \lim_{\Delta t \downarrow 0} \frac{1}{\Delta t} P(t \le T < t + \Delta t \mid T \ge t), \tag{2.2} \]
i.e., the instantaneous rate at which an event (e.g., death) occurs at time $t$,
conditional upon survival to time $t$.
Applying the definition of conditional probability and the relations between $F(t)$, $f(t)$ and $S(t)$, the relation between $\lambda(t)$ and
$S(t)$ can be derived as
\[ \lambda(t) = \frac{dF(t)}{dt}\,\frac{1}{S(t)} = \frac{f(t)}{S(t)}. \]
It also follows that
\[ \lambda(t) = -\frac{d}{dt} \log S(t) \]
and
\[ S(t) = \exp\{-\Lambda(t)\}, \tag{2.3} \]
where
\[ \Lambda(t) = \int_0^t \lambda(u)\,du \tag{2.4} \]
is the integrated or cumulative hazard function.
As noted by Fleming and Lin (2000), observing $(T, \delta)$ rather
than $T_e$ gives the crude hazard (Equation (2.2)) rather than the net
hazard $\lambda_{net}(t) = \lim_{\Delta t \downarrow 0} P(t \le T_e < t + \Delta t \mid T_e \ge t)/\Delta t$. Therefore,
in survival analysis the equality of the crude hazard and the net
hazard is an important assumption. A sufficient condition for this
assumption to hold is the independence of $T_e$ and $T_c$.
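The relations between the hazard, the cumulative hazard and the survivor function in (2.3) and (2.4) can be checked numerically. The sketch below does this for a Weibull distribution; the choice of distribution, the parameter values and the function names are assumptions made only for illustration.

```python
import math

# Hypothetical Weibull parameters, chosen only for illustration.
SHAPE, SCALE = 1.5, 2.0

def survivor(t):
    """S(t) = P(T > t) for the Weibull distribution."""
    return math.exp(-((t / SCALE) ** SHAPE))

def density(t):
    """f(t) = dF(t)/dt."""
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1) * survivor(t)

def hazard(t):
    """lambda(t) = f(t) / S(t)."""
    return density(t) / survivor(t)

def cumulative_hazard(t, steps=20_000):
    """Lambda(t) from Equation (2.4), approximated with the midpoint rule."""
    h = t / steps
    return h * sum(hazard((k + 0.5) * h) for k in range(steps))

t = 1.7
eps = 1e-6
# lambda(t) = -d/dt log S(t), checked with a central difference.
slope = -(math.log(survivor(t + eps)) - math.log(survivor(t - eps))) / (2 * eps)
print(hazard(t), slope)
# S(t) = exp{-Lambda(t)}, Equation (2.3).
print(survivor(t), math.exp(-cumulative_hazard(t)))
```

Each printed pair should agree up to the error of the numerical differentiation and integration.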
2.2.2 The counting process approach
Aalen (1978) introduced a martingale-based approach to survival
analysis, unifying the previously proposed non-parametric methods
under a counting process framework. In this approach, survival data
for a single subject $i$, $(T_i, \delta_i)$, is represented as $(N_i(t), Y_i(t))$, $t > 0$,
where $N_i(t) = I\{T_i \le t, \delta_i = 1\}$ is the number of observed events in $[0, t]$
for subject $i$, and $Y_i(t) = I\{T_i \ge t\}$ is the at-risk process.
The estimator of the cumulative hazard is based on the aggregated process $\tilde{N}(t) = \sum_i N_i(t)$, the total number of events up to and
including $t$, and $R(t) = \sum_i Y_i(t)$, the risk size at time $t$. The estimator of the cumulative hazard (Equation (2.4)) is the Nelson-Aalen
estimator, defined as
\[ \hat{\Lambda}(t) = \int_0^t \frac{I\{R(u) > 0\}}{R(u)}\, d\tilde{N}(u), \tag{2.5} \]
which intuitively can be thought of as the sum of the conditional
probabilities that an event happens in the short intervals over $(0, t]$.
The increment $d\tilde{N}(t)$ can be decomposed into a discrete and a continuous part,
$d\tilde{N}(t) = \Delta\tilde{N}(t) + n(t)\,dt$, where $\Delta\tilde{N}(t) = \tilde{N}(t) - \tilde{N}(t-)$ is the
number of events occurring precisely at $t$ for the discrete part and
$n(t)$ is the change or differential for the continuous part.
An equivalent representation of the estimator is (Therneau and
Grambsch, 2000)
\[ \hat{\Lambda}(t) = \sum_{i: t_i \le t} \frac{\Delta\tilde{N}(t_i)}{R(t_i)}, \tag{2.6} \]
where $t_1, t_2, \ldots$ are the ordered event times.
The Nelson-Aalen estimator $\hat{\Lambda}(t)$ has a close connection to the
Kaplan-Meier estimator (Kaplan and Meier, 1958). Let $\hat{S}(t) = \exp\{-\hat{\Lambda}(t)\}$ and $d\hat{\Lambda}(t_i) = d\tilde{N}(t_i)/R(t_i)$, the increment in the Nelson-Aalen estimator at the i-th event. Then, when $\Delta\tilde{N}(t_i)/R(t_i) \approx 0$,
\[ \hat{S}(t) = \prod_{i:\, t_i \le t} \exp\{-d\hat{\Lambda}(t_i)\} \approx \prod_{i:\, t_i \le t} \{1 - d\hat{\Lambda}(t_i)\}, \]
which is the Kaplan-Meier product limit estimator.
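The connection between the two estimators can be checked numerically. The following is a minimal Python sketch, not part of the original analysis: the function name and the toy data are hypothetical, and the at-risk indicator follows the definition Yi(t) = I{Ti ≥ t} given above.

```python
import math

def nelson_aalen_and_km(times, events):
    """Return rows (t, Nelson-Aalen, exp(-Nelson-Aalen), Kaplan-Meier) at each event time."""
    # Distinct observed event times, in increasing order.
    event_times = sorted({t for t, d in zip(times, events) if d == 1})
    cum_hazard, km = 0.0, 1.0
    rows = []
    for t in event_times:
        # R(t): number of subjects at risk at t, using Y_i(t) = I{T_i >= t}.
        at_risk = sum(1 for s in times if s >= t)
        # Number of events occurring exactly at t.
        d_n = sum(1 for s, d in zip(times, events) if s == t and d == 1)
        increment = d_n / at_risk      # increment of the Nelson-Aalen estimator
        cum_hazard += increment        # sum form, Equation (2.6)
        km *= 1.0 - increment          # product-limit form
        rows.append((t, cum_hazard, math.exp(-cum_hazard), km))
    return rows

# Hypothetical toy data: follow-up times and event indicators (1 = event, 0 = censored).
times = [2, 3, 3, 5, 7, 8, 11, 12]
events = [1, 1, 0, 1, 0, 1, 1, 0]
rows = nelson_aalen_and_km(times, events)
for t, na, s_na, s_km in rows:
    print(f"t={t:>2}  NA={na:.4f}  exp(-NA)={s_na:.4f}  KM={s_km:.4f}")
```

Because exp(−x) ≥ 1 − x, the survival estimate based on the Nelson-Aalen estimator is never smaller than the Kaplan-Meier estimate, and the two are close as long as the increments are small.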
Further, the process given by
\[ M_i(t) = N_i(t) - \int_0^t Y_i(u)\lambda_i(u)\,du \tag{2.7} \]
is a martingale for subject i with respect to a proper filtration
(Aalen, 1978; Fleming and Harrington, 1991; Therneau and Grambsch, 2000). The martingale Mi (t) in (2.7) represents the difference between the observed and the model-predicted number of events over
the interval (0, t]. Informally, a martingale with respect to a history H(t) is a stochastic process with the key property
E{M (t) | H(s)} = M (s) for any 0 ≤ s < t.
We may rewrite (2.7) as $N_i(t) = \int_0^t Y_i(u)\lambda_i(u)\,du + M_i(t)$ and refer
to this decomposition as counting process = compensator + martingale,
which is analogous to data = model + noise in the statistical model
decomposition (Therneau and Grambsch, 2000). This notion is important in studying residuals and diagnostics for survival models.
2.2.3 Regression models
Most often, it is desired to assess the effect of some covariates on
survival. We need the time-to-event, event indicator and covariates
information (T, δ, Z) for this analysis. The covariates may be fixed
throughout the observation period (time independent covariate) or
change with time (time dependent covariate).
The Cox proportional hazards regression model (Cox, 1972) is the
most frequently used regression model in survival analysis. There are
two approaches to this censored data regression model, the approach
originally proposed by Cox and the counting process approach.
At this stage, we assume that the covariates are time independent. Let S(t | Z) be the conditional survival function given the
covariate vector Z. The conditional hazard function is
\[ \lambda(t \mid Z) = \lim_{\Delta t \downarrow 0} \frac{1}{\Delta t} P(t \le T < t + \Delta t \mid T \ge t, Z). \tag{2.8} \]
When ∆t > 0 is small, λ(t | Z)∆t is approximately the conditional
probability of an event (failure, death) in the interval from t to t + ∆t, given
survival until time t and covariates Z.
The Cox proportional hazards model specifies that
\[ \lambda(t \mid Z) = \lambda_0(t) \exp(\beta' Z), \tag{2.9} \]
where λ0 (t) is an unspecified non-negative function called the baseline hazard common to all subjects, and β is a set of unknown regression coefficients.
Cox (1972; 1975) proposed a semiparametric approach for the
proportional hazards model (2.9). Let D be the set of indices k
of the ordered event times t1 , t2 , . . . , tk , . . . (for the moment we assume
that only one subject has an event at each event time), and let Rk
be the risk set at time tk , that is, the subjects under observation and event-free
immediately prior to tk . The partial likelihood is given by
\[ L(\beta) = \prod_{k \in D} \frac{\exp(\beta' Z_k)}{\sum_{j \in R_k} \exp(\beta' Z_j)}, \tag{2.10} \]
in which the baseline hazard λ0 (t) cancels out. The coefficient vector β can be estimated by maximizing the partial likelihood. Many researchers have
investigated the large-sample properties of this partial likelihood (see
the review by Fleming and Lin (2000)). If there is more than one event
at a certain event time (tied event times), at least four procedures
have been proposed to handle it (Therneau and Grambsch, 2000):
Breslow’s approximation, Efron’s approximation, the exact partial likelihood, and the averaged likelihood. A method based on maximum
likelihood (ML) as an alternative to maximum partial likelihood
(MPL) has also been proposed (Bailey, 1984; Broström, 2002). Efron’s approximation is recommended since it is computationally feasible even
with many ties (Therneau and Grambsch, 2000). For more heavily
tied data, the ML estimator is superior (Broström, 2002).
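To make the partial likelihood (2.10) concrete, the sketch below evaluates its logarithm for a single covariate and untied event times, and locates the maximum with a simple ternary search. This is an illustration only: the function names, the search routine and the toy data are assumptions, and production software uses Newton-type iterations together with the tie-handling methods listed above.

```python
import math

def cox_partial_loglik(beta, times, events, z):
    """Log of the partial likelihood (2.10) for one covariate, assuming no tied event times."""
    loglik = 0.0
    for k, (t_k, d_k) in enumerate(zip(times, events)):
        if d_k != 1:
            continue  # censored subjects contribute only through the risk sets
        # Risk set R_k: subjects still under observation and event-free just before t_k.
        denom = sum(math.exp(beta * z[j]) for j, t_j in enumerate(times) if t_j >= t_k)
        loglik += beta * z[k] - math.log(denom)
    return loglik

def maximize_concave(f, lo=-5.0, hi=5.0, iters=200):
    """Ternary search for the maximizer of a concave function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

# Hypothetical toy data: distinct times, event indicators, and a binary covariate.
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 0, 1, 1, 1]
z = [1, 0, 1, 0, 1, 0]

beta_hat = maximize_concave(lambda b: cox_partial_loglik(b, times, events, z))
print("maximum partial likelihood estimate:", round(beta_hat, 4))
```

At β = 0 every subject in a risk set is equally likely to fail, so the log partial likelihood reduces to minus the sum of the logarithms of the risk set sizes at the event times, which gives a quick sanity check of the implementation.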
The counting process approach treats the survival data in a more
general way using the counting process notation (Ni (t), Yi (t)) discussed earlier in this section. This generality is useful for a more
elaborate survival analysis such as including time-dependent covariates, time-dependent strata, left truncation, multiple time scales,
multiple events per subject, various problems with correlated data
and case-cohort models. In the counting process approach, the partial likelihood is written as
\[ L(\beta) = \prod_{t \ge 0} \prod_{k=1}^{n} \left[ \frac{Y_k(t) \exp(\beta' Z_k)}{\sum_{j=1}^{n} Y_j(t) \exp(\beta' Z_j)} \right]^{dN_k(t)}, \tag{2.11} \]
where Yk (t) is the zero-one at-risk process, and dNk (t) = 1 if Nk (t) −
Nk (t−) = 1, and dNk (t) = 0 otherwise.
2.2.4 Diagnostics and stratification
As in ordinary linear regression, diagnostics are also important in the
Cox regression model. There is a wide variety of model diagnostics
available. Lindkvist (2000) has given an extensive review of the diagnostics and studied the added variable plot in the Cox model. For
detecting the departure from the proportional hazards assumption,
Schoenfeld residuals are useful (Grambsch and Therneau, 1994).
For certain situations, it is often necessary to stratify the subjects into disjoint groups when the proportionality assumptions do
not hold for one or several covariates. In the stratified Cox model,
the subjects in a certain stratum have a distinct baseline hazard function but common values for the regression coefficients. The partial
likelihood for the stratified Cox model is given by
\[ L(\beta) = \prod_{s=1}^{S} L_s(\beta), \tag{2.12} \]
where S is the number of strata and Ls (β) is the partial likelihood
as in Equations (2.10) or (2.11) but calculated only for the subjects
in stratum s.
2.2.5 Frailty
In a situation where the assumptions of independence and homogeneity of all individuals are violated, introducing frailty models may be
useful (Andersen, 1991; Hougaard, 1995). Vaupel, Manton and Stallard (1979) introduced the term frailty in survival analysis. In the
frailty model, an additional term is added to the Cox model of (2.9),
\[ \lambda(t \mid W, Z) = W \lambda_0(t) \exp(\beta' Z), \tag{2.13} \]
where W is the frailty term or the random effect term that is
assumed to operate multiplicatively on the baseline hazard. Dependence and heterogeneity among individuals are modeled via this
term by assuming W to follow a certain distribution. Estimation of
W can be done using penalized partial likelihood, EM algorithm or
the Bayesian Gibbs sampler approach (Sastry, 1997; Therneau and
Grambsch, 2000; Manda, 2001).
2.2.6 Multistate models
The concepts and methods in survival analysis extend naturally to
models with more than two states. For instance, the subjects may
move among healthy, diseased and death states over time.
A multistate model is a stochastic process {X(t), t ∈ T}, with
X(t) ∈ S and T = [0, τ ), τ ≤ +∞. X(t) denotes the state occupied
by a subject at time t and S = {0, 1, . . . , m} is a finite state space.
The process starts with the initial distribution πj (0) = P(X(0) =
j), j ∈ S. As the process develops, a history (also called a filtration) H(t) will be generated containing all information about the
process over interval [0, t), such as the number of transitions until t
(a counting process).
The multistate process is governed either by the transition probabilities from state j to state k, defined as
Pjk (s, t) = P(X(t) = k | X(s) = j, H(s−))
(2.14)
for j, k ∈ S, s, t ∈ T, s ≤ t; or by the transition intensities given the
history just before t, H(t−), defined as
\[ \alpha_{jk}(t \mid H(t-)) = \lim_{\Delta t \to 0} \frac{P_{jk}(t, t + \Delta t)}{\Delta t}. \tag{2.15} \]
A state j ∈ S is absorbing if for all t ∈ T, k ∈ S, j ≠ k,
αjk (t) = 0, otherwise j is transient.
Here of course, we will always assume that the limits in the definition of the transition intensities αjk (t | H(t−)) exist. Another
assumption that may be applied to αjk (t | H(t−)) is the nonhomogeneous Markov assumption, αjk (t | H(t−)) = αjk (t), ignoring the history but still depending on time. A stronger assumption
is the homogeneous Markov assumption, which ignores both time and history,
αjk (t) = αjk . In certain applications, it is possible to assume that
the transitions depend on the time spent in the states, which leads
to the semi-Markov assumption.
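Under the homogeneous Markov assumption, the transition probabilities follow from the constant intensities alone. The sketch below illustrates this for a three-state healthy, diseased and dead model with death as the absorbing state. The intensity values and function names are hypothetical, and the matrix exponential is approximated by repeatedly squaring a one-step matrix, which is a simple numerical device rather than a recommended algorithm.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def transition_matrix(q, t, squarings=20):
    """Approximate P(0, t) = exp(Q t) by (I + Q h)^(2^squarings) with h = t / 2^squarings."""
    n = len(q)
    h = t / 2 ** squarings
    p = [[(1.0 if i == j else 0.0) + q[i][j] * h for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        p = mat_mul(p, p)
    return p

# Hypothetical constant intensities: 0 = healthy, 1 = diseased, 2 = dead (absorbing).
a01, a02, a12 = 0.10, 0.02, 0.30
q = [[-(a01 + a02), a01, a02],
     [0.0, -a12, a12],
     [0.0, 0.0, 0.0]]   # all intensities out of the absorbing state are zero

t = 5.0
p = transition_matrix(q, t)

# Closed-form check for this particular model (valid when a01 + a02 != a12).
p00_closed = math.exp(-(a01 + a02) * t)
p01_closed = a01 / (a01 + a02 - a12) * (math.exp(-a12 * t) - math.exp(-(a01 + a02) * t))
print("P00:", round(p[0][0], 5), "closed form:", round(p00_closed, 5))
print("P01:", round(p[0][1], 5), "closed form:", round(p01_closed, 5))
```

Each row of the resulting matrix sums to one, and the row for the absorbing state stays at (0, 0, 1).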
2.3 Longitudinal data analysis
2.3.1 Notation and approaches
Longitudinal data sets consist of a measurement (outcome or response) variable Yij and vector of explanatory variables xij observed
at time tij for subject i = 1, . . . , m and observation j = 1, . . . , ni .
The mean and variance of Yij are denoted by E(Yij ) = µij and
Var(Yij ) = vij . For each subject i, Yi = (Yi1 , . . . , Yini )′ denotes the
vector of measurements with mean E(Yi ) = µi and ni × ni covariance matrix Var(Yi ) = Vi . The covariance between Yij and Yik is
denoted by Cov(Yij , Yik ) = vijk . The ni × ni correlation matrix of
Yi is denoted by Ri . The complete set of $N = \sum_{i=1}^{m} n_i$ measurements is
denoted by Y = (Y1′ , . . . , Ym′ )′ with mean E(Y) = µ and variance
matrix Var(Y) = V.
The scientific question of interest could be the pattern of change
over time of the outcome or the dependence of the outcome on the
covariates. Most approaches in LDA consider regression models under the general linear model or extensions of the generalized linear
model.
2.3.2 General linear models
We consider the data setup and notations as described in the previous section. Under the general linear model, it is assumed that Y
has a multivariate Normal distribution
Y ∼ MVN(µ, V).
(2.16)
This longitudinal data model is completed by specifying the form of
mean vector µ and variance matrix V.
The mean µ is specified as a linear model
µ = Xβ
(2.17)
where X, whose rows are (xij1 , . . . , xijp ), is an N × p design matrix that may include
covariates of interest and functions of time, and β = (β1 , . . . , βp ) is
a p-vector of unknown regression coefficients.
The specification of V can be made to include at least three
different sources of random variation: random effects, serial correlations and measurement errors. A model that incorporates all the
three sources of variation is
\[ Y = X\beta + ZU + W(t) + \epsilon, \tag{2.18} \]
where U, W(t) and ε correspond to random effects, serial correlation and measurement error, respectively; Z is the design matrix of
U; and t = {tij } is the set of times at which the measurements are made.
Together, U, W(t) and ε have zero mean and specify the variance
matrix V of model (2.16).
To be precise, it is assumed that U ∼ MVN(0, Ψ), ε ∼ N (0, τ²),
and W(t) are independent stationary Gaussian processes with mean
zero, variance σ² and correlation function ρ(u), which still needs to
be parameterized further. For instance, a popular choice of ρ(u) is
ρ(u) = exp(−φu^c) with c = 1 (the exponential correlation) or c = 2
(the Gaussian correlation) and φ > 0 (Diggle, 1988).
For each individual i, the covariance matrix Vi can be written
as
\[ V_i = Z_i \Psi Z_i' + \sigma^2 H_i + \tau^2 I_i, \tag{2.19} \]
where Hi is the ni × ni symmetric matrix with the (j, k)-th element
hijk = ρ(| tij − tik |), and Ii is the ni × ni identity matrix.
The specification of Vi leads to various linear models, from
the simple classical linear model with independent errors to more
complicated ones, such as a linear model that includes all three
sources of variation.
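As an illustration of (2.19), the sketch below builds Vi for one subject in the simplest random-effects case, a random intercept, so that Zi is a column of ones and Ψ reduces to a scalar. The function name, argument names and numerical values are hypothetical.

```python
import math

def covariance_matrix(times, psi, sigma2, phi, tau2, c=1):
    """V_i of (2.19) for a random-intercept model: psi * J + sigma2 * H_i + tau2 * I_i.

    psi    : variance of the random intercept (Z_i is a column of ones)
    sigma2 : variance of the serially correlated process W(t)
    phi, c : parameters of rho(u) = exp(-phi * u**c); c=1 exponential, c=2 Gaussian
    tau2   : measurement error variance
    """
    n = len(times)
    v = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            u = abs(times[j] - times[k])
            rho = math.exp(-phi * u ** c)              # element h_ijk of H_i
            v[j][k] = psi + sigma2 * rho + (tau2 if j == k else 0.0)
    return v

# Hypothetical measurement times and variance components for one subject.
v = covariance_matrix(times=[0.0, 1.0, 3.0], psi=2.0, sigma2=1.0, phi=0.5, tau2=0.25)
for row in v:
    print([round(x, 4) for x in row])
```

The diagonal elements equal the sum of the three variance components, while the off-diagonal elements decay toward the random-intercept variance as the time separation grows.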
Several estimation methods for this longitudinal model have been
proposed for the special case of the variance structure given by (2.19) or
for the general case. Laird and Ware (1982) and Diggle et al. (2002) suggested maximum likelihood (ML) and restricted maximum likelihood
(REML), with the remark that REML is usually better than ML.
Goldstein (1986; 1989) suggested iterative generalized least squares
(IGLS) and restricted IGLS (RIGLS) for a more general multilevel
structure. Bates and Pinheiro (1998) proposed EM estimation followed by Newton-Raphson or quasi-Newton optimization of the log-likelihood or the log-restricted-likelihood. Bayesian methods have also
been suggested, for instance using Gibbs sampling (Zeger and
Karim, 1991). The multilevel mixed models as a general case for
the longitudinal models with normal and non-normal responses are
reviewed in Section 2.3.4.
2.3.3 Generalized estimating equations
For a more general longitudinal model with non-Gaussian outcome,
an extension of the generalized linear model (GLM) was suggested
by Liang and Zeger (1986). Like the ordinary GLM (McCullagh and
Nelder, 1989), the model can handle a wide range of discrete and
continuous outcome distributions such as binomial, Poisson, gamma
and normal.
Using the notation and data setup introduced in Section 2.3.1,
in this model the mean of Yi is specified as
µi = h(Xi β),
(2.20)
where β is a p-vector of unknown parameters. The inverse of h is
known as the ”link” function in the GLM terminology. The variance
of Yi is specified through the ni × ni ”working” correlation matrix
Ri (α). It is said to be ”working” since we do not expect it to
be correctly specified (Zeger and Liang, 1986). The α are some
unknown parameters common to all subjects.
The working covariance matrix of Yi is
\[ V_i = A_i^{1/2} R_i(\alpha) A_i^{1/2} / \phi, \tag{2.21} \]
where Ai is an ni × ni diagonal matrix with known function g(µij )
as the j-th diagonal element and φ is a scale parameter.
The generalized estimating equation (GEE) of this longitudinal
data model is given by
\[ \sum_{i=1}^{m} D_i' V_i^{-1} S_i = 0, \tag{2.22} \]
where Di = ∂µi /∂β and Si = Yi − µi . The GEE estimator of β is
the solution of equation (2.22). Liang and Zeger (1986) studied the
consistency of the estimator and proposed an iterative procedure to
estimate β.
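A small worked case may help to see what (2.22) does. Assume, purely for illustration, an intercept-only model µij = β with identity link, constant variance function, and an exchangeable working correlation with a known common correlation α. Then Di is a vector of ones, and since an exchangeable Ri maps the vector of ones to (1 + (ni − 1)α) times itself, (2.22) reduces to a weighted average with cluster weights wi = 1/(1 + (ni − 1)α). The function name and data in the Python sketch below are hypothetical.

```python
def gee_mean_exchangeable(clusters, alpha):
    """Solve (2.22) for an intercept-only model with exchangeable working correlation.

    clusters : list of lists, the repeated measurements Y_i of each subject
    alpha    : assumed (known) working correlation between any two measurements
    """
    num = 0.0
    den = 0.0
    for y in clusters:
        n_i = len(y)
        # Weight implied by R_i^{-1} applied to the vector of ones.
        w_i = 1.0 / (1.0 + (n_i - 1) * alpha)
        num += w_i * sum(y)
        den += w_i * n_i
    return num / den

# Hypothetical data: one subject with three measurements, one with a single measurement.
clusters = [[1.0, 2.0, 3.0], [10.0]]
print(gee_mean_exchangeable(clusters, alpha=0.0))  # independence: overall mean of all values
print(gee_mean_exchangeable(clusters, alpha=1.0))  # perfect correlation: mean of subject means
```

The two limiting cases show how the working correlation changes the weighting of subjects with many measurements. In practice α is estimated within the iterative procedure rather than fixed in advance.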
A problem that frequently arises in longitudinal data is missing
values. The GEE estimator remains consistent even when Ri is misspecified, provided that the values are missing completely at random
(Liang and Zeger, 1986; Diggle et al., 2002). When the values are not missing completely at random, joint modeling of dropouts (missing
values) and longitudinal measurements may be needed.
The approach considered here is called the population-averaged
(PA) model (Zeger, Liang and Albert, 1988), in which the aggregate response for the population is modeled. Another approach is
the subject-specific (SS) model, in which heterogeneity in the regression
parameters is modeled. The next section considers the second approach.
2.3.4 Generalized linear mixed models
The models discussed in the previous two sections can be extended
to a more general class of models. The generalized linear mixed model
(GLMM) is an extension of the GLM that includes random effects, or a
more general multilevel or hierarchical structure, in the model.
Rather than modeling the mean of Yi as in the previous section,
this model focuses on modeling ui = E(Yi | bi ), specified as
\[ u_i = h(X_i \beta + Z_i b_i), \tag{2.23} \]
where bi is a vector of random effects with design matrix Zi . The
inverse of h is the “link” function as in Equation (2.20). This model
is also known as the subject-specific (SS) model (Zeger et al., 1988). SS
models are desirable when the response of an individual is the focus
rather than the average population response.
The GEE can be used for this model as well. In the GLMM,
both the link function and the random effects distribution must be
correctly specified. To use GEE for the GLMM, the marginal moments µi and Vi of Equations (2.20) and (2.21) are calculated from
the conditional moments and the random effects distribution F, and
the GEE is then solved.
GLMM estimation using GEE aims primarily at estimating the
fixed effects and does not estimate the random component terms,
which are often useful for prediction or in model diagnostics. More recently,
Lee and Nelder (2001) developed the hierarchical GLM, which allows models with any combination of a GLM distribution for the response with
any conjugate distribution for the random effects, structured dispersion components, different link functions for the fixed and random
effects, and the use of quasi-likelihoods in place of likelihoods for either
or both of the mean and dispersion models.
2.4 Time-dependent covariates
2.4.1 Some useful classifications
Longitudinal or event history data have the advantage that
the temporal order of the outcome and covariate is observed. The analysis of
covariate changes may be useful in studying causal relationships. A
time-dependent covariate is a covariate that varies over time. This
section discusses basic issues of time-dependent covariates for both
event history and longitudinal data.
In survival analysis, Kalbfleisch and Prentice (2002, Section 6.3)
classify time-dependent covariates as external and internal. Let xi (t)
denote the time-dependent covariate at time t for individual i and
Xi (t) = {xi (u); 0 ≤ u < t} denote the covariate history up to time
t. For each individual i, the hazard function of (2.8) becomes
\[ \lambda_i(t \mid X_i(t)) = \lim_{\Delta t \downarrow 0} \frac{1}{\Delta t} P(t \le T_i < t + \Delta t \mid T_i \ge t, X_i(t)). \tag{2.24} \]
An external (time-dependent) covariate Xi (t) satisfies the condition
P(u ≤ Ti < u + ∆u | Ti ≥ u, Xi (u)) =
P(u ≤ Ti < u + ∆u | Ti ≥ u, Xi (t)) (2.25)
for all u, t such that 0 < u ≤ t. An equivalent condition is
P(Xi (t) | Ti ≥ u, Xi (u)) = P(Xi (t) | Ti = u, Xi (u)),
0 < u ≤ t.
(2.26)
This condition implies that the future path of Xi (t) up to any time
t > u is not affected by the occurrence of an event at time u.
When condition (2.25) or (2.26) is not satisfied, Xi (t) is
called an internal covariate. The main consequence of an internal covariate is that the future path of the covariate is affected by the
event occurrence.
External covariates may be classified further as fixed, defined and
ancillary covariates. When the external covariate is fixed across
time, e.g., X(t) = Z, then the hazard function of (2.24) is the same
as (2.8). A defined covariate is one whose path X(t) is determined in advance
for each individual. Such a covariate is usually a factor determined
in an experimental study. Another example is the age of an individual or
calendar time across the study. An ancillary covariate is the output
of a stochastic process that is external to the time-to-event process
of the individual, such as pollution, seasonality or socioeconomic
conditions.
The relation between the hazard function and the survival function for the external covariate is given by
\[ S(t \mid X(t)) = \exp\left\{ -\int_0^t \lambda(u \mid X(u))\,du \right\}, \tag{2.27} \]
which is similar to that of a time-independent covariate. The relationship for an internal covariate differs from (2.27) and is discussed
in Section 2.4.3.
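Relation (2.27) is easy to evaluate when the external covariate is a step function. The sketch below assumes, for illustration only, a constant baseline hazard and a proportional hazards form λ(u | X(u)) = λ0 exp(βx(u)). The function name and all numerical values are hypothetical.

```python
import math

def survival_step_covariate(t, change_times, values, lambda0, beta):
    """S(t | X(t)) from (2.27) for a step-function covariate and constant baseline hazard.

    values[k] is the covariate value on [change_times[k], change_times[k + 1]);
    change_times must be increasing and start at 0.
    """
    cum_hazard = 0.0
    for k, start in enumerate(change_times):
        end = change_times[k + 1] if k + 1 < len(change_times) else math.inf
        length = max(0.0, min(t, end) - start)   # time spent in this piece before t
        cum_hazard += lambda0 * math.exp(beta * values[k]) * length
    return math.exp(-cum_hazard)

# Hypothetical example: the covariate switches from 0 to 1 at time 15,
# doubling a baseline hazard of 0.01 per time unit.
s10 = survival_step_covariate(10.0, [0.0, 15.0], [0, 1], lambda0=0.01, beta=math.log(2))
s30 = survival_step_covariate(30.0, [0.0, 15.0], [0, 1], lambda0=0.01, beta=math.log(2))
print(round(s10, 4), round(s30, 4))
```

The cumulative hazard is simply accumulated piece by piece, which is the same idea that underlies splitting an individual's follow-up time at covariate change times.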
In LDA, there are similar definitions for internal and external covariates. We use the notation of Section 2.3.1 with a modification:
Xij denotes the time-dependent covariates and Zij denotes the time-independent covariates. Here j represents discrete follow-up times.
Adapting econometrics terminology, in LDA a covariate is
classified as exogenous or endogenous (Diggle et al., 2002).
Define the history of the time-dependent covariates and outcomes
for individual i up to time t as HXi (t) = {Xi1 , Xi2 , . . . , Xit } and
HY i (t) = {Yi1 , Yi2 , . . . , Yit }, respectively. A covariate is exogenous if
f (Xit | HY i (t), HXi (t − 1), Zi ) = f (Xit | HXi (t − 1), Zi ),
(2.28)
where f (.) represents a density or probability function of the covariate. When condition (2.28) is not satisfied, the covariate is endogenous.
When covariates are exogenous, future covariate values are
not affected by the outcomes and the analysis can focus on specifying
the dependence of Yit on Xi(t−1) , Xi(t−2) , . . .. Generally, the approach
considers E(Yit | Xis , s < t). For example, a GEE model with a single
lagged covariate can be specified as
h(E(Yit | Xis , Zi )) = β0 + β1 Xi(t−k) + β ′2 Zi .
(2.29)
All methods and inferences discussed in Section 2.3.2 and Section
2.3.3 can in principle be used in the lagged model.
2.4.2 Approaches in the Cox model
The partial likelihood for the Cox model with time-dependent covariates is similar to (2.11). The form of the Cox partial likelihood
is
\[ L(\beta) = \prod_{t \ge 0} \prod_{k=1}^{n} \left[ \frac{Y_k(t) \exp(\beta' Z_k(t))}{\sum_{j=1}^{n} Y_j(t) \exp(\beta' Z_j(t))} \right]^{dN_k(t)}, \tag{2.30} \]
where Zj (t) is the time-dependent covariate at time t. The calculation of the likelihood requires covariate values at the event times.
Typical situations in survival analysis with time dependent covariates are illustrated in Figure 2.1. Figure 2.1(c) is a switching
treatments time dependent covariate (Cox and Oakes, 1984, Chapter 8) in which subjects may change from one treatment to another.
The usual method to deal with such a covariate, given that the covariate is external, is to split the individual life time by the time
when the covariate values change. This is easy to manage in standard statistical packages that facilitate the counting process style of
input.
Figure 2.1(b) is an example of a defined time-dependent covariate. For example, if the time scale used in the analysis is time since
entering the study, a defined covariate could be the age of the individuals. Age advances at the same rate as the survival time,
and its value is always available at any event time. Unlike the
previous example, it is computationally more efficient to split the
individual life times by event times.
Often, covariates are collected intermittently across the time such
that their values are not available at the event times (Figure 2.1(a)).
In this situation several methods have been proposed. These include
the last value carried forward (LVCF ) method, using the last value of
the covariate to substitute the missing value prior to the event time.
[Figure 2.1: Time-to-event and time-dependent covariates: (a) intermittently observed, (b) defined covariate, (c) switching treatments covariate.]
Imputation methods such as two-stage estimation and smoothing can
be applied to this problem as well. In the two-stage method, a mixed
model is fitted to the data at each event time with time-dependent
covariate as the response (Pawitan and Self, 1993; Tsiatis et al.,
1995). Bruijne et al. (2001) suggested another approach using the time
elapsed since the last measurement (TEL) in Cox’s regression
model together with the LVCF or other methods of imputation. The
TEL can be considered as “the age of the longitudinal measurement”,
and a Cox model that includes TEL may perform better than a
Cox model with only LVCF or two-stage imputation.
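The LVCF value and the TEL at a given event time can be computed together. The following Python sketch is illustrative only: the function name and the measurement data are hypothetical, and a measurement taken exactly at the event time is treated as available, which is an assumption.

```python
from bisect import bisect_right

def lvcf_with_tel(measure_times, measure_values, t):
    """Return (last value carried forward, time elapsed since that measurement) at time t.

    measure_times must be sorted in increasing order. Returns (None, None) when
    no measurement has been taken at or before t.
    """
    idx = bisect_right(measure_times, t) - 1   # index of last measurement at or before t
    if idx < 0:
        return None, None
    return measure_values[idx], t - measure_times[idx]

# Hypothetical intermittent measurements (e.g. weight in kg at ages in months).
measure_times = [0.0, 3.0, 6.0, 9.0]
measure_values = [5.1, 5.8, 6.4, 7.0]

value, tel = lvcf_with_tel(measure_times, measure_values, t=7.5)
print(value, tel)
```

Both quantities can then enter the Cox model as time-dependent covariates evaluated at each event time.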
More general methods based on the joint modeling of event-times
and longitudinal measurements have also been proposed (Wulfsohn
and Tsiatis, 1997; Henderson et al., 2000; Lin, Turnbull, McCulloch
and Slate, 2002; Xu and Zeger, 2001; Tsiatis and Davidian, 2004).
Basically, these models consider two linked sub-models, one for the
longitudinal measurements and one for the event times.
The two sub-models are joined together by a Gaussian latent
process. Without the latent process, the models reduce to the ordinary separate longitudinal measurement and event-time models.
To estimate the model, a likelihood-based method leading to EM
algorithms has been proposed (Wulfsohn and Tsiatis, 1997; Henderson et al., 2000; Lin, Turnbull, McCulloch and Slate, 2002).
Other methods are based on a Bayesian approach (Faucett and
Thomas, 1996; Xu and Zeger, 2001; Guo and Carlin, 2004). Utilizing the usual connection between survival analysis and GLM, the
model can also be estimated using the GEE approach (Rochon and
Gillespie, 2001) and by generalized linear latent and mixed models (Rabe-Hesketh, Yang and Pickles, 2001).
2.4.3 Time-dependent confounders
The notion of time-dependent confounders in epidemiology has been
recognized at least since Robins (1986) and later in the epidemiological journals in the 1990s (see for example articles by The Cebu Study
Team (1991); Pearce (1992); and Zohoori and Savitz (1997)). Keiding (1999) gave an overview of this problem in event history analysis.
A time-dependent confounder, often arising in longitudinal or cohort
studies, is both a confounder and an intermediate variable. Such situations are also
described by feedback models (Zeger and Liang, 1991) and are related to the
internal or endogenous covariates discussed in Section 2.4.1.
To deal with time-dependent confounders in longitudinal data,
we may use a method proposed by Zeger and Liang (1991). The
method is based on GEE models allowing for both lagged response
and endogenous covariates. A more general solution with theoretical
exposition can be found in a book by van der Laan and Robins
(2003).
For EHA, time-dependent confounders are closely related to internal covariates. The hazard function for an internal covariate is
defined by (2.24) but conditioned on the time-dependent covariate
only up to t− (the time just before t) and not further. The relation
(2.27) does not hold. In fact, for survival data, the internal covariate requires the survival of the individual for its existence; therefore the
survival function is always one, provided that x(t−) ≠ 0. Generally,
the survival function will be (Jewell and Kalbfleisch, 1996; Andersen, 2003)
\[ S(t \mid X(t)) = E\left[ \exp\left\{ -\int_0^t \lambda(u \mid X(u))\,du \right\} \right], \tag{2.31} \]
where the expectation is taken with respect to the sample path X(.).
The marginal survival probability at t given the past history is the
average over the possible paths among individuals at risk for X(t).
In Cox’s regression model, care must be taken in interpreting
the estimated coefficients, since X(t) may serve as an intermediate variable. However, an internal covariate is not something to be
avoided; a particular kind of internal covariate known as a marker
or surrogate end-point has many useful applications (Jewell and
Kalbfleisch, 1996; Prentice, 1989).
The multiple time scales problem in the next chapter is closely
related to the defined covariate (Figure 2.1(b)), whereas the longitudinal measurement problem in Chapter 5 is closely related to the
intermittently observed time-dependent covariate (Figure 2.1(a)).
Chapter 3
Analysis of Childhood Mortality, Morbidity and Growth
3.1 Introduction
This chapter presents some applications of event history analysis
(EHA) and longitudinal data analysis (LDA) to a childhood epidemiological study. The Community Health and Nutrition Research Laboratories
(CHN-RL) surveillance and the ZINAK study on zinc and iron supplementation in infants introduced in Chapter 1 are the two main
sources of data used in the analysis. This chapter is also meant to
be a natural background for methodological development in the later
chapters.
3.2 Mortality
Child survival in developing countries has been investigated intensively, especially since the study by Mosley and Chen (1984). The
Cox model for analyzing childhood mortality in developing countries
has been employed by, among others, Trussell and Hammerslough
(1983) and Pebley and Stupp (1987). Using the Community Health
and Nutrition Research Laboratories (CHN-RL) data, infant mortality has been investigated relating to the effects of sibling status
(Wahab, Winkvist, Stenlund and Wilopo, 2001). In general, they
concluded that boys had higher infant mortality rates than girls although the difference was not great. The risk for boys was even
higher when they were born after a few siblings compared with being first-born. Further study is still needed to evaluate the different
mortality pattern among boys and girls in that area.
Here, we investigated further aspects of the effect of siblings and
gender on childhood mortality, taking into account clustering at the levels
of mother, household, community and village using EHA. Details of
the analysis have been reported elsewhere by Danardono (2003).
3.2.1 Data, study variables and models
Rather than considering the live births for the period 1995 to 1996
in the CHN-RL surveillance (Wahab et al., 2001) as the subjects, we
considered all children observed since the start of surveillance in October 1994. This scheme has the advantage of utilizing all information
available in the surveillance but introduces length-biased sampling
(Section 1.2). Consequently, the length-biased sample selection has
to be taken into account in the analysis by using left truncation. After excluding some twins and incomplete records, 7889 children were
available in the data set, with 2948 of them born after the start
of the surveillance data collection.
Specifically, we investigated the sibling and gender effects on
mortality. The sibling factor has been pointed out as being of interest, in that it may explain the difference in care between boys and girls and possible competition for resources among them
(Wahab et al., 2001). To study this effect, several variables were
constructed based on gender and birth order. The sibling variable is
a time-dependent covariate of the “switching treatment” type
(see Figure 2.1(c) in Chapter 2).
We give one example of this variable construction. We use the
term index child to denote the child under consideration. Suppose
we have information as in Figure 3.1(a). When a younger sibling
is born, the value of this time-dependent covariate changes from
0 to 1. We may further consider the gender of the younger sibling
and use the categories boy or girl rather than just 1 as the value of this
time-dependent covariate.
In Figure 3.1(a), there are two children who experienced the
event before the event time of the index child, and one child, the
sibling of the index child, who has not experienced the event. We
can construct the data suitable for event history analysis using Cox’s
model by event-time splitting (Figure 3.1(b)) or covariate-time splitting (Figure 3.1(c)). Both constructions lead to the same result.
However, in the case of a switching treatment covariate, in which the
value of the covariate is a step function with only a few values, splitting by covariate times is more efficient since it usually gives fewer
intervals than event-time splitting.
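The two constructions can be sketched in a few lines of Python. The helper below is hypothetical and not the code used for the analysis; it uses the ages shown in Figure 3.1 (events of other children at 12 and 24 months, birth of the younger sibling at 15 months, death of the index child at 30 months) and evaluates the covariate at the end of each interval.

```python
def split_episode(entry, exit_, status, cut_times, covariate_at):
    """Split the follow-up interval (entry, exit_] at cut_times into (start, stop] records.

    covariate_at(t) returns the covariate value in effect at time t; it is
    evaluated at the end of each interval. Only the last interval carries the
    subject's event status.
    """
    cuts = sorted(c for c in set(cut_times) if entry < c < exit_)
    bounds = [entry] + cuts + [exit_]
    rows = []
    for start, stop in zip(bounds[:-1], bounds[1:]):
        rows.append({
            "start": start,
            "stop": stop,
            "status": status if stop == exit_ else 0,
            "sibling": covariate_at(stop),
        })
    return rows

SIBLING_BIRTH = 15  # age of the index child (months) when the younger sibling is born

def sibling(t):
    """Younger-sibling indicator: 0 up to and including the birth time, 1 afterwards."""
    return 1 if t > SIBLING_BIRTH else 0

# Index child followed from birth to death at 30 months.
by_event_times = split_episode(0, 30, 1, cut_times=[12, 24], covariate_at=sibling)
by_covariate_times = split_episode(0, 30, 1, cut_times=[SIBLING_BIRTH], covariate_at=sibling)

for row in by_event_times:
    print("event-time split:    ", row)
for row in by_covariate_times:
    print("covariate-time split:", row)
```

Event-time splitting yields three records for the index child and covariate-time splitting yields two, which illustrates why the latter is more economical for a step-function covariate with few changes.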
Another situation is when the index child did not enter from
birth (delayed entry or left-truncation) and the younger sibling was
born before the entry time. In this case, there is no splitting by
the younger sibling covariate, except if the sibling dies. A similar
construction is applied for the older sibling covariate where the value
is changed when the older sibling dies. For this analysis, we only
constructed covariates for the closest sibling (one younger or one
older sibling).
We used the Cox proportional hazards model reviewed in Section 2.2.3, i.e., the standard model of Equation (2.9) and the shared
frailty model of Equation (2.13). We assumed a gamma distribution
for the frailties. Currently, there is no general agreement about the
best frailty distribution for practical frailty modeling (Therneau
and Grambsch, 2000). The gamma distribution, however, has
been used in several statistical and demographic studies (Guo and
Rodríguez, 1992; Sastry, 1997). To estimate the frailty term, we used
the penalized partial likelihood approach (Therneau and Grambsch, 2000), available in the R survival package (Ihaka and Gentleman, 1996; R Development Core Team, 2004).
3.2.2 Results
We obtained two hazard models for childhood mortality:
infant mortality (0-1 year of age) and child mortality (1-5 years of
age), presented in Tables 3.1 and 3.2, respectively.
For the infant mortality hazard model, the strongest, yet unsurprising, result is the effect of maternal education. Higher education
had a protective effect against childhood mortality. The gender of the
index child alone was a marginally significant factor for childhood mortality; girls seemed to have a lower risk than boys. Birth order also
shows a significant linear effect on mortality: the risk increases with
higher birth order. The older sibling variable does not seem to show
any effect; the relative risks of infants (0-1 year of age) who had no
older sibling, an older brother or an older sister are the same.
After infancy (aged 1-5 years), the effects of gender, birth order
and maternal education seem to disappear, whereas the
effects of siblings appear. We also examined the interaction between
the gender of the index child and the gender of the older sibling as well
[Figure 3.1: panel (a) shows the life lines against age (months) in an event-death frame and a younger sibling frame, with ticks at 0, 12, 15, 24 and 30 months]

(b)  start  stop  status  sibling
         0    12       0        0
        12    24       0        1
        24    30       1        1

(c)  start  stop  status  sibling
         0    15       0        0
        15    30       1        1

Figure 3.1: Sibling as a time-dependent covariate: (a) The bold line
under event-death frame is the index child, the dashed lines are other
children; the line under younger sibling frame is the time-dependent
covariate value; (b) splitting by event times; (c) splitting by covariate
times.
[Figure 3.2: log partial likelihood plotted against the random effect variance (0 to 8), with the 95% c.i. marked for the household and the mother frailty]
Figure 3.2: Profile likelihood for the mother and household random
effect variance for infant mortality model.
as the younger sibling. Neither interaction was significant. The
risk of mortality is higher when the index child (boy or girl) has an
older or younger brother. These results probably do not reflect a
gender difference in care in favor of boys, since the index child with
the higher risk may be either a boy or a girl, but they may reflect the exhaustion of
resources when a family has a boy (or boys), which in turn leads to childhood
mortality. The confidence intervals of the relative risks of this model
are shown in Table 3.2, under the standard model. The estimates
are imprecise, with wide confidence intervals for the sibling variables
and maternal education.
We also included several frailty terms that were assumed to operate
at certain meaningful levels. The mother frailty may capture any
unobserved variables that operate on children born to the same
Table 3.1: Five hazard models for infant mortality (0-1 years)

Variables                   standard model      mother frailty      household frailty   community frailty   village frailty
                            RR (c.i.)           RR (c.i.)           RR (c.i.)           RR (c.i.)           RR (c.i.)
Gender
  boy                       1                   1                   1                   1                   1
  girl                      0.71 (0.50-1.01)    0.70 (0.49-0.99)    0.69 (0.49-0.98)    0.71 (0.50-1.01)    0.72 (0.51-1.01)
Birth order (linear)        1.16 (1.00-1.35)    1.17 (1.01-1.36)    1.18 (1.01-1.37)    1.16 (1.00-1.34)    1.16 (1.00-1.34)
Maternal age at delivery
  20-29 years               1                   1                   1                   1                   1
  < 20 years                1.72 (0.89-3.32)    1.76 (0.91-3.40)    1.79 (0.93-3.45)    1.69 (0.88-3.26)    1.70 (0.88-3.28)
  30+ years                 0.95 (0.61-1.48)    0.98 (0.63-1.53)    0.99 (0.64-1.54)    0.97 (0.62-1.50)    0.96 (0.62-1.48)
Older sibling
  none                      1                   1                   1                   1                   1
  older brother             1.29 (0.72-2.34)    1.23 (0.68-2.22)    1.20 (0.67-2.17)    1.29 (0.71-2.32)    1.29 (0.72-2.34)
  older sister              1.07 (0.57-1.99)    1.02 (0.55-1.90)    0.99 (0.54-1.87)    1.06 (0.57-1.98)    1.07 (0.57-1.99)
Maternal education
  12 years of education     1                   1                   1                   1                   1
  no education              11.59 (3.84-34.96)  11.83 (3.92-35.69)  11.98 (3.97-36.14)  11.06 (3.67-33.37)  11.47 (3.80-34.60)
  6 years of education      5.95 (2.17-16.33)   6.07 (2.21-16.64)   6.08 (2.22-16.69)   5.8 (2.11-15.89)    5.96 (2.17-16.34)
  9 years of education      4.63 (1.56-13.72)   4.73 (1.59-14.02)   4.76 (1.61-14.11)   4.59 (1.55-13.61)   4.63 (1.56-13.74)
Variance of random effect(1)
  mother                                        2.135 (0.041)
  household                                                         3.074 (0.004)
  community                                                                             0.103 (0.319)
  village                                                                                                   0.054 (0.334)

(1) The estimated variance of random effects and the p-value of the LRT
Table 3.2: Five hazard models for child mortality (1-5 years)

Variables                   standard model      mother frailty      household frailty   community frailty   village frailty
                            RR (c.i.)           RR (c.i.)           RR (c.i.)           RR (c.i.)           RR (c.i.)
Gender
  boy                       1                   1                   1                   1                   1
  girl                      0.97 (0.46-2.07)    0.97 (0.46-2.07)    0.97 (0.46-2.07)    0.97 (0.46-2.07)    0.96 (0.45-2.04)
Birth order (linear)        0.87 (0.58-1.31)    0.87 (0.58-1.31)    0.87 (0.58-1.31)    0.87 (0.58-1.31)    0.87 (0.58-1.30)
Maternal age at delivery
  20-29 years               1                   1                   1                   1                   1
  < 20 years                1.31 (0.19-8.90)    1.31 (0.14-12.10)   1.31 (0.14-12.1)    1.31 (0.14-12.10)   1.33 (0.14-12.28)
  30+ years                 0.84 (0.38-1.88)    0.84 (0.34-2.06)    0.84 (0.34-2.06)    0.84 (0.34-2.06)    0.85 (0.35-2.09)
Older sibling
  none                      1                   1                   1                   1                   1
  older brother             11.46 (2.64-49.8)   11.46 (2.01-65.23)  11.46 (2.01-65.23)  11.46 (2.01-65.23)  11.57 (2.03-65.82)
  older sister              6.34 (1.26-32.01)   6.34 (1.03-38.87)   6.34 (1.03-38.87)   6.34 (1.03-38.87)   6.35 (1.04-38.97)
Younger sibling
  none                      1                   1                   1                   1                   1
  younger brother           4.86 (1.44-16.45)   4.86 (1.45-16.31)   4.86 (1.45-16.31)   4.86 (1.45-16.31)   4.88 (1.46-16.38)
  younger sister            1.2 (0.16-9.01)     1.2 (0.15-9.65)     1.2 (0.15-9.65)     1.2 (0.15-9.65)     1.17 (0.14-9.39)
Maternal education
  12 years of education     1                   1                   1                   1                   1
  no education              2.03 (0.12-33.78)   2.03 (0.13-32.95)   2.03 (0.13-32.95)   2.03 (0.13-32.93)   1.93 (0.12-31.33)
  6 years of education      5.02 (0.68-37.29)   5.02 (0.67-37.67)   5.02 (0.67-37.67)   5.02 (0.67-37.67)   4.95 (0.66-37.12)
  9 years of education      2.07 (0.19-22.09)   2.07 (0.19-22.92)   2.07 (0.19-22.92)   2.07 (0.19-22.92)   2.07 (0.19-23.00)
Variance of random effect(1)
  mother                                        ≈0 (≈1)
  household                                                         0.005 (0.947)
  community                                                                             ≈0 (≈1)
  village                                                                                                   0.184 (0.423)

(1) The estimated variance of random effects and the p-value of the LRT
mother, such as genetic factors and maternal competence. At the
household level, family size, socio-economic status and housing conditions may be captured by the household frailty term. At
broader levels, community and village frailty terms were also included.
These terms account for the possible effects of infrastructure,
climate, and other environmental factors within the community, and for
institutional effects within the village.
Figure 3.2 shows the profile likelihood for the mother and household frailty terms. The 95% confidence interval is constructed by
drawing a horizontal reference line 3.84/2 units below the maximum log partial likelihood. The reference line is obtained by assuming that
twice the difference in log likelihood has a chi-square distribution with one
degree of freedom. The maximum log likelihood of the household
frailty model is -1015.48, which corresponds to an estimated random effect variance of 3.074;
for the mother frailty model it is -1017.43, which corresponds to an estimated random effect variance of 2.14. The intervals range from 0.63 to 7.70 for the household
frailty, and from 0.07 to 6.38 for the mother frailty. Neither interval
covers the zero value of the random effect variance, suggesting that the
household and mother frailties are important. For the community and
village frailties, the 95% confidence intervals cover the zero value of
the random effect variance, indicating that these frailties are not important. This confirms the results of Table
3.1, in which the household and mother frailties are important, whereas
the community and village frailties are not.
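The interval construction described above can be sketched in Python as follows. The maximum of -1015.48 and the drop of 3.84/2 are taken from the text; the quadratic shape of the profile curve and the function name `profile_interval` are purely hypothetical and serve only to show the mechanics on a grid of variance values.

```python
CHI2_1DF_95 = 3.84  # 95% point of the chi-square distribution with 1 df, as in the text


def profile_interval(grid, loglik, drop=CHI2_1DF_95 / 2):
    """Lowest and highest grid value whose profile log-likelihood lies
    within `drop` units of the maximum, together with the cutoff used."""
    cutoff = max(loglik) - drop
    inside = [g for g, l in zip(grid, loglik) if l >= cutoff]
    return min(inside), max(inside), cutoff


# Hypothetical profile curve with its maximum of -1015.48 at variance 3.0.
grid = [i * 0.5 for i in range(17)]                      # 0.0, 0.5, ..., 8.0
loglik = [-1015.48 - 0.3 * (v - 3.0) ** 2 for v in grid]  # illustrative shape only
low, high, cutoff = profile_interval(grid, loglik)
```

With these made-up values the cutoff is -1017.40 and the interval excludes zero, which is the pattern reported for the household frailty; a finer grid or root finding would be needed for accurate end points.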
The large household frailty effect indicates that housing conditions,
socio-economic status and other household-level factors are more
important than factors that operate at the mother, community or
village level. The mother frailty effect was smaller than the household frailty effect, probably because some of the important maternal variables
for childhood mortality have been accounted for in the model, such
as maternal education and maternal age at delivery, whereas none
of the household-level variables has been included. Household-level
variables should therefore be included in further studies.
Similar to the infant mortality model, the estimated parameters
in the child mortality models with frailty do not differ from the
standard model (Table 3.2).
The general conclusion regarding the sibling and gender factors is
that there was no evidence of a gender difference in care between boys and girls in Purworejo district, Indonesia, that
might lead to mortality. This finding is in accord with previous
research (Wahab et al., 2001) and with the general trend of narrowing gaps in many respects between boys and girls in Indonesian
society (Kevane and Levine, 2003). There is, however, an indication
that having brother(s) may lead to a higher risk of child mortality.
3.3 Morbidity: surveillance data
Because of its importance, childhood morbidity has been investigated by many researchers from diverse disciplines such as public
health, biomedicine and social science. Two common childhood
diseases, diarrhea and respiratory infection, remain the most
important causes of death among children (Rice, Sacco, Hyder and
Black, 2000; Black, Morris and Bryce, 2003; UNICEF, 2003). In Indonesia, especially in the CHN-RL area, several studies related to
childhood morbidity have been conducted. Machfudz (1998) studied the effect of morbidity (diarrhea and respiratory
infection) on the change in mid-upper-arm circumference in children under five years of age. Danardono (2000) studied the multilevel
effects at the community, household and individual levels for
diarrheal disease. Wibowo (2000) evaluated the influence
of nutritional status on morbidity (diarrhea and respiratory tract
infection) among infants.
We present an application of EHA to two common
and important childhood diseases, diarrhea and respiratory infection,
in the CHN-RL surveillance area, and demonstrate the use of various time scales to address the research questions of interest. As in
the previous section, the details of the analysis in this section have
been reported elsewhere by Danardono (2003).
3.3.1 Data, study variables and models
We utilized the CHN-RL morbidity surveillance for this analysis.
The surveillance used a two-week recall questionnaire to collect
information on childhood morbidity on the day of the visit and the 14
preceding days, together with related variables. This type of questionnaire has been
widely used for morbidity records, for instance in the Demographic
and Health Surveys (DHS) in many countries, including Indonesia
(CBS, NFPCB, MOH and MI, 1998).
The variables of interest are gender of the child, maternal education and maternal age (at the time of illness), sibling variables
(as in the childhood mortality models in the previous section) and
breastfeeding. Individual frailty effects as well as environmental and
institutional frailty effects are also investigated. To ensure that information on the breastfeeding variable is available, cohort data from
February 1995 until June 1998 were used with 2804 children available
in the data set.
To analyze the data, we need to restructure the data set into the
counting process style of input, (start, stop], event. The process
is straightforward but tedious, and computationally demanding when the
data set is large and includes time-dependent covariates. Table 3.3
shows the data layout for the morbidity study. The observation
column is the information obtained by the two-week recall questionnaire. The start, stop and event columns are constructed from the
observation column and the visit column. For instance, the child with ID
Table 3.3: Data layout for morbidity study. In this example there are
2 children with 2 and 4 visits resulting in 9 spells (intervals with
(start, stop] and event). The event of interest is 1 in the observation column. Some observations are split because of the occurrence
of the event or a change in a time-dependent covariate (e.g., weaned)

ID  start     stop      event  observation      visit     weaned
96  96-05-15  96-05-29  0      000000000000000  96-05-29  --
96  96-08-20  96-08-31  1      000000000001111  96-09-03  --

81  96-10-23  96-10-26  1      000111110000000  96-11-06  97-07-31
81  96-10-31  96-11-06  0                       96-11-06  97-07-31
81  97-01-31  97-02-14  0      000000000000000  97-02-14  97-07-31
81  97-04-29  97-05-07  1      000000001111111  97-05-13  97-07-31
81  97-07-25  97-07-31  0      000000000011100  97-08-08  97-07-31
81  97-07-31  97-08-04  1                       97-08-08  97-07-31
81  97-08-07  97-08-08  0                       97-08-08  97-07-31
81 at visit 1996-11-06 was split into two intervals, one ending at
1996-10-26 with an event, and one ending at 1996-11-06, censored. The dates
are constructed backwards in time from the visit date. When the
value of a time-dependent covariate changes, such as
weaned at 1997-07-31, the observation is split according to the
covariate times (e.g., ID 81 at visit 1997-08-08).
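The construction of the spells from one recall record can be sketched in Python. The sketch is not the code used for the analysis; it encodes the convention that can be read off Table 3.3 (the last character of the observation string is the visit day, a spell starts on the first day of a healthy run and stops at the onset of illness or at the visit), and the function name `recall_to_spells` is hypothetical. Splitting by time-dependent covariates such as weaning is not included.

```python
from datetime import date, timedelta


def recall_to_spells(visit, observation):
    """Turn one two-week recall string into (start, stop, event) at-risk spells.

    observation[i] is the status ('1' = ill) on day visit - (len - 1) + i,
    so the last character refers to the visit day itself.
    """
    first_day = visit - timedelta(days=len(observation) - 1)
    spells, start = [], None
    for i, status in enumerate(observation):
        day = first_day + timedelta(days=i)
        if status == "0" and start is None:
            start = day                     # a healthy run begins: at risk
        elif status == "1" and start is not None:
            spells.append((start, day, 1))  # onset of illness ends the spell
            start = None                    # not at risk while ill
    if start is not None and start < visit:
        spells.append((start, visit, 0))    # healthy at the visit: censored
    return spells


# ID 81 at visit 1996-11-06, observation string taken from Table 3.3.
spells_81 = recall_to_spells(date(1996, 11, 6), "000111110000000")
```

For this record the sketch returns the two spells shown in the first two rows for ID 81 in Table 3.3.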
We used the Andersen-Gill (AG) model, an extension of Cox’s
model, with three time scales: age, calendar time, and time since weaning.
The model assumes independent increments, i.e., the numbers of
events in non-overlapping time intervals are independent, given the
history, with a common baseline hazard for all events.
The AG model specifies an intensity process similar to the hazard function in the Cox model,
$$\lambda(t \mid Z(t)) = Y(t)\,\lambda_0(t)\exp(\beta' Z(t)), \tag{3.1}$$
where $\lambda_0(t)$ is the baseline intensity, $\beta$ is the vector of unknown regression coefficients, $Z(t)$ is the vector of covariates, possibly time-dependent, and $Y(t)$
is the zero-one at-risk process. Unlike in the Cox model for survival data,
$Y(t)$ in the AG model is not absorbed at zero when an event occurs
but alternates between zero and one depending on the event process.
The purpose of the counting process style of input, (start, stop], event,
mentioned above is to specify $Y(t)$.
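How the (start, stop] intervals determine the at-risk process can be shown with a few lines of Python. This is only an illustration: the function name `at_risk` and the example child (time measured in days) are hypothetical.

```python
def at_risk(t, spells):
    """Y(t): 1 if t falls in some at-risk interval (start, stop], else 0."""
    return int(any(start < t <= stop for start, stop, _event in spells))


# Hypothetical child: at risk on (0, 3], ill until day 8, at risk again on (8, 14].
spells = [(0, 3, 1), (8, 14, 0)]
path = [at_risk(t, spells) for t in (1, 3, 5, 8, 10, 14, 15)]
```

The resulting path switches from one to zero after the event at day 3 and back to one once the child is observed healthy again, which is the alternating behavior described above.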
In the analysis we used the AG model with three different time
scales, i.e., age, time since the start of the surveillance, and time
since weaning.
3.3.2 Age time scale
Respiratory infection and diarrhea, like many other childhood
diseases, are usually age dependent. Choosing age as the time scale
does not allow age itself to be a covariate in the model, but we can check the
dependency by looking at the hazard plot. Figure 3.3 shows the plot
of the hazards for both diseases. The hazard plots are smoothed
by the Epanechnikov kernel, with a bandwidth of 10 months chosen
by visual inspection, and plotted over the monthly crude hazard
rates (the shaded barplot). Visual inspection is of course not
an optimal method for choosing a bandwidth, compared with the
method suggested by Andersen et al. (1993), but it is adequate
for exploratory purposes. The cumulative hazards of both diseases
increase almost linearly. The estimated hazards suggest that the
hazard is associated with age, and that both diseases
may peak at around 12 months of age.
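The kernel smoothing of the hazard can be sketched in Python. The sketch assumes the usual kernel estimator that smooths the increments of the cumulative hazard, without any boundary correction; the function names and the constant monthly increments of 0.05 are hypothetical and are not taken from the data.

```python
def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def smoothed_hazard(t, times, increments, bandwidth):
    """Kernel-smoothed hazard at time t: (1/b) * sum_j K((t - t_j)/b) * dA(t_j),
    where dA(t_j) are the increments of the cumulative hazard at times t_j."""
    total = sum(epanechnikov((t - tj) / bandwidth) * dA
                for tj, dA in zip(times, increments))
    return total / bandwidth


# Hypothetical monthly crude hazard increments of 0.05, located at mid-months.
times = [m + 0.5 for m in range(40)]
increments = [0.05] * 40
h20 = smoothed_hazard(20.0, times, increments, bandwidth=10.0)
```

With a constant monthly increment the smoothed hazard in the interior stays close to that constant, as expected; near the ends of the age range the estimate is biased downwards unless a boundary correction is applied.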
Table 3.4 gives the results for diarrhea. Increasing maternal age
seems to be associated with increasing risk. The breastfeeding
variable makes a marginally significant contribution to the model:
the never breastfed children had the highest risk compared to the
other categories. Maternal education and the sibling variables did not
contribute significantly to the model.
[Figure 3.3: four panels showing the cumulative hazard and the hazard of respiratory infection and of diarrhea against age (months)]
Figure 3.3: The cumulative hazard and hazard plot of childhood
respiratory infection and diarrhea by age.
Table 3.4: Hazard model for diarrhea, age time scale

Variables                   Relative risk (c.i.)  p-value   p-value
                                                  LRT(1)    Non-prop(2)
Gender                                            0.298
  boy                       1     (reference)
  girl                      1.14  (0.87-1.51)               0.581
Maternal education                                0.174
  non-educated              1     (reference)
  educated                  1.58  (0.8-3.14)                0.784
Breastfeeding status                              0.088
  breastfed                 1     (reference)
  weaned                    1.09  (0.69-1.73)               0.179
  never breastfed           2.07  (1.04-4.11)               0.260
Maternal age (years)                              <0.001
  15-19                     1     (reference)
  20-24                     1.40  (0.75-2.60)               0.555
  25-29                     1.70  (0.95-3.03)               0.247
  30-34                     1.14  (0.59-2.22)               0.731
  35+                       1.66  (0.86-3.21)               0.383
Older sibling                                     0.640
  none                      1     (reference)
  brother                   0.84  (0.57-1.24)               0.902
  sister                    0.91  (0.62-1.33)               0.846
Younger sibling                                   0.988
  none                      1     (reference)
  brother                   1.12  (0.29-4.39)               0.630
  sister                    0.99  (0.27-3.67)               0.991

(1) Likelihood ratio test
(2) Non-proportionality test, global p-value=0.89
The frailty effects in this hazard model for diarrhea are all significant, with values of 1.273, 1.229, 1.237, 0.614 and 0.350 for the individual,
mother, household, community and village frailty, respectively. The
estimated coefficients in the frailty models differ only slightly from the estimated coefficients of the standard model (Table 3.4),
and so may not give any further important information. However, the
significant frailty effects of these models indicate the existence
of unobserved heterogeneity in the individual, mother, household,
community and village groups, which may need to be investigated
further.
For respiratory infection, maternal age shows a pattern similar to that in the diarrhea models, as do the gender and sibling variables.
In contrast to the diarrhea model, maternal education made a significant
contribution to the model whereas the breastfeeding variable did
not. The frailty effects for the respiratory infection model were also
found to be important.
For both hazard models, of respiratory infection and of diarrhea,
there is no evidence of non-proportionality, as indicated by the large global
p-values of the test for non-proportionality and by the p-values
for each coefficient in both models. All relevant interactions, such
as that between maternal education and breastfeeding, have also been checked and
taken care of.
3.3.3 Calendar time
We used time scales other than age to allow age to enter the model as a time-dependent
covariate. One possible choice is the time since the
start of the surveillance (February 1995).
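When the time scale is time since the start of the surveillance, a categorical age covariate changes value whenever the child crosses an age boundary, so the spells have to be split at those times. A minimal Python sketch of this step is given below. The age bands 0-5, 6-23 and 24+ months are those used in Table 3.5; the function names and the example child are hypothetical, and both time scales are measured in months.

```python
AGE_CUTS = (6, 24)                      # months; bands 0-5, 6-23, 24+
AGE_LABELS = ("0-5", "6-23", "24+")


def age_band(age):
    """Label of the age band containing `age` (in months)."""
    return AGE_LABELS[sum(age >= cut for cut in AGE_CUTS)]


def split_by_age_band(start, stop, event, age_at_start):
    """Split the spell (start, stop] on the surveillance time scale at the
    times where the child's age crosses a band boundary."""
    rows, t = [], start
    for cut in AGE_CUTS:
        crossing = start + (cut - age_at_start)  # surveillance time when age == cut
        if t < crossing < stop:
            rows.append((t, crossing, 0, age_band(age_at_start + (t - start))))
            t = crossing
    rows.append((t, stop, event, age_band(age_at_start + (t - start))))
    return rows


# Hypothetical child aged 4 months at surveillance time 0, with an event at time 30.
rows = split_by_age_band(0, 30, 1, age_at_start=4)
```

The child contributes three rows, one in each age band, and only the last row carries the event.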
Figure 3.4 shows the cumulative hazards and hazards as functions of time since the start of the surveillance, with time converted
back to calendar time. Over the whole period, the hazard of respiratory infection is higher than that of diarrhea. The peaks of respiratory
infection and diarrhea incidence seemed to occur in April-June, the
transition period from the rainy to the dry season, and in September-October, the transition from the dry to the rainy season. There was also
a long dry season in 1997 and an economic crisis that might have
caused the peak incidence in that year.
[Figure 3.4: cumulative hazard (top) and hazard (bottom) of respiratory infection and diarrhea against calendar time, February 1995 to February 1998]
Figure 3.4: The cumulative hazards and hazards plot of childhood
respiratory infection and diarrhea by calendar time.
Table 3.5 shows the hazards model for respiratory infection. The
child’s age is significantly associated with the risk of developing respiratory infection. The highest risk for respiratory infection is in the 6-23 months age group. The conclusions
for maternal education and maternal age are the same as in the model using the age
time scale. The other variables show a pattern similar to that in the models using age as the time scale. The pattern is also similar for the
diarrhea models.
Also similar to the age time scale models, introducing frailty did
Table 3.5: Hazards model for respiratory infection, calendar time

Variables                   Relative risk (c.i.)  p-value   p-value
                                                  LRT(1)    Non-prop(2)
Gender                                            0.984
  boy                       1     (reference)
  girl                      0.99  (0.91-1.10)               0.135
Age of the child (months)                         <0.001
  0-5                       1     (reference)
  6-23                      1.83  (1.61-2.07)               0.267
  24+                       1.51  (1.21-1.90)               0.722
Maternal education                                0.013
  no education              1     (reference)
  6 yrs of education        1.33  (1.04-1.70)               0.168
  9 yrs of education        1.29  (0.99-1.69)               0.145
  12 yrs of education       1.42  (1.09-1.85)               0.259
Maternal age (years)                              <0.001
  15-19                     1     (reference)
  20-24                     1.29  (1.05-1.58)               0.303
  25-29                     1.30  (1.04-1.61)               0.367
  30-34                     1.15  (0.92-1.45)               0.461
  35+                       1.32  (1.04-1.67)               0.656
Breastfeeding status                              0.674
  breastfed                 1     (reference)
  weaned                    1.01  (0.87-1.18)               0.725
  never breastfed           0.84  (0.55-1.28)               0.595
Older sibling                                     0.641
  none                      1     (reference)
  brother                   0.98  (0.85-1.12)               0.560
  sister                    0.94  (0.82-1.08)               0.458
Younger sibling                                   0.480
  none                      1     (reference)
  brother                   0.98  (0.57-1.68)               0.661
  sister                    1.32  (0.85-2.05)               0.922

(1) Likelihood ratio test
(2) Non-proportionality test, global p-value=0.93
not change the estimated coefficients for either the respiratory infection
or the diarrhea model, but the frailty variance was quite significant,
indicating unobserved heterogeneity in the data. Neither model violates the proportionality assumption of the Cox proportional hazards
model according to the non-proportionality test.
3.3.4 Time since weaning
The protective effect of breastfeeding against childhood illness is well
known and has been investigated by many authors; see, for example, Bhandari, Bahl, Mazumdar, Martines, Black, Bhan and Infant
Feeding Study Group (2003) and references therein. In this section,
the aim is to demonstrate the use of time since weaning as
an alternative time scale for investigating the effect of breastfeeding
on childhood morbidity. The breastfeeding definition in this section
is simply based on the questionnaire on health status, breastfeeding
and feeding practice and does not include breastfeeding patterns, such
as exclusive breastfeeding and frequency of breastfeeding. The percentage of breastfed children is quite high in the surveillance area,
about 98%, which is similar to the national figure of 96% (CBS
et al., 1998). The median duration of breastfeeding is relatively
long at 24.1 months, which is also close to the national figure of 23.9
months.
The weaned age, or the duration of breastfeeding, is one of the
variables of interest. This variable is a time-independent covariate
that is fixed from the time of weaning. Age of the child is also included
as a time-dependent covariate.
Table 3.6 gives the hazards model for respiratory infection. The
weaned age is significant in the model. Although the differences in
the effects between the weaned age categories are not large, a later
weaned age seems to give a protective effect against respiratory infection.
In contrast to the previous models (with the age and calendar time
scales), the effect of maternal education is weak and points in a different direction. Maternal age shows a pattern similar to that in the previous
models. As with the previous models, there is no evidence of gender
or sibling effects in this model.
The frailty effects are significant but do not change the general
conclusion of the model (the estimated coefficients). In this model,
there is no indication of a violation of the proportionality assumption.
The hazards model for diarrhea generally gives results similar to those
for respiratory infection. Here, the results are presented only for
the weaned age. The variable has fewer categories than for
respiratory infection because of fitting problems. The relative
risks with confidence intervals are 0.97 (0.25-3.76) and 0.53 (0.13-2.13)
for the weaned age groups 6-12 and 12+ months (the reference is 0-5
months), with a p-value (LRT) of 0.607.
The frailty effects do not change the coefficient estimates of the
hazards model for diarrhea. In fact, none of the frailty effects for diarrhea
is significant. This may indicate that there are no unobserved
factors for the risk of diarrhea, or no difference in risk between these groups
or clusters (individual, mother, household, community and village).
However, it is also possible that the number of observations is not
large enough to show the frailty effects.
In general, the analysis concludes that children aged 6-23 months,
or around one year of age, are prone to develop respiratory infection and diarrhea, and that there is a seasonal pattern in both
diseases. Maternal education is important. Surprisingly, the risk of
developing the diseases is higher for the children of more highly educated mothers. As in the mortality study, there is no evidence of
gender or sibling effects.
Table 3.6: Hazards model for respiratory infection, time since weaning

Variables                   Relative risk (c.i.)  p-value   p-value
                                                  LRT(1)    Non-prop(2)
Gender                                            0.123
  boy                       1     (reference)
  girl                      1.22  (0.95-1.58)               0.925
Weaned age (months)                               0.040
  0-4                       1     (reference)
  5-6                       0.48  (0.18-1.32)               0.904
  7-12                      0.97  (0.55-1.74)               0.314
  13+                       0.58  (0.33-1.04)               0.431
Age of the child (months)                         0.684
  0-5                       1     (reference)
  6-23                      1.41  (0.59-3.34)               0.747
  24+                       1.35  (0.52-3.51)               0.516
Maternal education                                0.341
  no education              1     (reference)
  6 yrs of education        0.84  (0.41-1.7)                0.113
  9 yrs of education        0.98  (0.47-2.12)               0.191
  12 yrs of education       0.88  (0.42-1.86)               0.197
Maternal age                                      0.018
  15-19                     1     (reference)
  20-24                     2.44  (1.03-5.76)               0.851
  25-29                     3.23  (1.36-7.68)               0.551
  30-34                     2.32  (0.94-5.74)               0.564
  35+                       2.36  (0.93-5.97)               0.800
Older sibling                                     0.567
  none                      1     (reference)
  brother                   0.84  (0.59-1.2)                0.794
  sister                    0.84  (0.58-1.21)               0.688
Younger sibling                                   0.495
  none                      1     (reference)
  brother                   0.77  (0.36-1.65)               0.507
  sister                    1.26  (0.73-2.16)               0.633

(1) Likelihood ratio test
(2) Non-proportionality test, global p-value=0.86
3.4 Morbidity: trial data
Deficiencies of iron and zinc often coexist and cause growth faltering,
delayed development and increased morbidity from infectious diseases during infancy and childhood (Lind, 2004, Paper V). Therefore,
combined iron and zinc supplementation may be a logical prevention
strategy.
To investigate the effect of the supplementations, a community-based, randomized, double-blind, controlled trial, the ZINAK study,
was conducted from July 1997 to May 1999 in the CHN-RL area,
Purworejo, Indonesia. The subjects differ from the children in
the morbidity surveillance discussed in the previous section.
This section demonstrates the use of EHA for morbidity analysis
in the ZINAK data. Unlike the morbidity analysis in the previous
section, here we have continuous data collection, so that analyses
other than the AG model can also be performed.
We considered respiratory infection as the event of interest. Together
with the infant growth analysis in the next section, this section serves
as a background problem for Chapter 5.
3.4.1 Data, study variables and models
The ZINAK study was conducted from July 1997 to May 1999 in
the CHN-RL surveillance area, Purworejo, Indonesia. Healthy
singleton infants aged less than six months were recruited. After
assessment of their eligibility, 680 infants were randomized to one of
four treatments: iron, zinc, iron+zinc or placebo, given from 6 to 12 months
of age (180 days of supplementation). A more detailed description of
the design and data collection is given by Lind (2004). There are
several outcomes of interest: biochemical outcomes (iron and zinc
concentration in the blood), infant growth (anthropometry), infant
development (mental and psychomotor development) and morbidity.
Here, we consider respiratory infection as the outcome of interest.
Morbidity information was obtained by visits every third day.
Field workers asked the parents or guardians about compliance with supplementation as well as about symptoms of
illness on the day of the visit and on the two days preceding the visit.
Among the 680 infants, 666 completed supplementation and the remainder dropped out. It may be necessary to consider the drop-outs
in the analysis since all of them were related to the supplementation,
as reported by Lind (2004). However, at this moment we analyze the
completed records only, according to intent-to-treat analysis. Covariates under consideration, other than the treatment itself, are gender
and maternal education.
We used the AG model with age as the time scale, as in the
previous section (Equation (3.1)). Additionally, we used gap time,
or sojourn time, as an alternative time scale. The gap time
is defined as the time since entry or since the previous event. When both
models give similar results, we can safely assume a renewal process
and consider a constant baseline hazard.
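The change from the age time scale to the gap-time scale can be sketched in Python. The sketch resets the clock at entry and at every event, following the definition above; the function name `to_gap_time` and the example spells (ages in days) are hypothetical.

```python
def to_gap_time(spells):
    """Convert one child's (start, stop, event) spells, sorted by age,
    to the gap-time scale: time since entry or since the previous event."""
    rows = []
    origin = spells[0][0]          # the clock starts at entry
    for start, stop, event in spells:
        rows.append((start - origin, stop - origin, event))
        if event:
            origin = stop          # the clock is reset at each event
    return rows


# Hypothetical child entering at 180 days of age with two episodes of illness.
age_spells = [(180, 200, 1), (205, 260, 1), (263, 360, 0)]
gap_spells = to_gap_time(age_spells)
```

Because the child is not at risk while ill, the second and third spells start a few days after the clock has been reset, which corresponds to delayed entry on the gap-time scale.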
As in the previous section, we could also use calendar time,
since morbidity may have a strong seasonal pattern. However,
given the rather short observation period (six months) and the fact
that most of the children entered the study at almost the same time,
analyses using calendar time and age are almost identical. Moreover, when we
want to model morbidity together with growth, which depends on age
rather than calendar time, the age time scale has a clear advantage
over calendar time.
3.4.2 Results
Tables 3.7 and 3.8 give the results of the hazard models using the AG
model and the gap-time model. They are quite similar in their
Table 3.7: Hazards model for respiratory infection using the Andersen-Gill model, ZINAK study

Variables                   Risk ratio (c.i.)     p-value   p-value
                                                  LRT(1)    Non-prop(2)
Gender                                            0.044
  boy                       1     (reference)
  girl                      0.91  (0.83-1.00)               0.308
Supplementation                                   0.411
  placebo                   1     (reference)
  zinc                      1.00  (0.88-1.14)               0.805
  zinc+iron                 0.91  (0.79-1.03)               0.235
  iron                      0.97  (0.85-1.11)               0.723
Maternal Education                                <0.001
  no-education              1     (reference)
  6 years                   0.84  (0.64-1.10)               0.133
  9 years                   0.70  (0.53-0.92)               0.427
  12 years or more          0.46  (0.29-0.75)               0.102

(1) Likelihood ratio test
(2) Non-proportionality test, global p-value=0.177
risk ratios and the p-values of the likelihood ratio test. Assuming a constant
baseline hazard will give the same result. The raw and smoothed
hazard functions in Figure 3.5 also indicate a constant hazard during
the period from 6 to 12 months of age.
Looking at the estimates, there is no pronounced effect of the
supplementation on respiratory infection, which confirms the result
of Lind (2004), in which Poisson regression was used. This result also
reiterates the importance of maternal education, as found
in the respiratory infection models using the surveillance data (Section
3.3). Here, the direction of the maternal education effect differs from
that in the surveillance data: higher education seemed to have a protective effect against respiratory infection. The infants’ gender was marginally
significant, with girls having a lower hazard than boys.
Table 3.8: Hazards model for respiratory infection using the gap-time
model, ZINAK study

Variables                   Risk ratio (c.i.)     p-value   p-value
                                                  LRT(1)    Non-prop(2)
Gender                                            0.051
  boy                       1     (reference)
  girl                      0.91  (0.83-1.00)               0.014
Supplementation                                   0.474
  placebo                   1     (reference)
  zinc                      1.01  (0.89-1.15)               0.172
  zinc+iron                 0.91  (0.80-1.04)               0.791
  iron                      0.98  (0.86-1.11)               0.652
Maternal Education                                <0.001
  no-education              1     (reference)
  6 years                   0.85  (0.65-1.12)               0.484
  9 years                   0.72  (0.54-0.95)               0.176
  12 years or more          0.50  (0.31-0.80)               0.883

(1) Likelihood ratio test
(2) Non-proportionality test, global p-value=0.101

[Figure 3.5: raw and smoothed hazard (0.0 to 0.5) against age (months), 5 to 13 months]
Figure 3.5: Raw and smoothed hazard plot of childhood respiratory
infection by age.
The other purpose of this analysis, aside from demonstrating the
application of EHA, is to give a background for the problem of analyzing EHA together with longitudinal measurements in Chapter
5. It is well known that nutrition, growth and morbidity are closely
related (Scrimshaw, 2003). Therefore, evaluating supplementation
on both growth and morbidity simultaneously may give less bias
than analyzing the two outcomes separately. Although it has also
been reported briefly that anthropometric status was not associated with the incidence of infectious disease (Lind, 2004), a more
careful analysis may be needed.
3.5 Infant growth
Infant growth indicators such as weight, length, knee-heel length and mid-upper
arm circumference are another outcome of interest collected in the
ZINAK study. Obviously, this type of outcome is not time-to-event data but ordinary continuous data. We present the use of
LDA to analyze such data, taking weight as the outcome of interest.
Together with the morbidity analysis in the previous section,
this section also serves as a background problem for Chapter 5.
Weight was measured every month.
Weight measurements before the trial period were also available
for most of the children. Figure 3.6 shows the children’s weight
by age with smoothing lines. During the trial period from 6 to 12
months of age, a linear model for this weight growth curve may be
sufficient. However, weight growth varies greatly between individuals,
so that the between-individual variance is usually large. Therefore,
the linear random effects model reviewed in Section 2.3.2
is more suitable for the weight data than the ordinary linear model.
[Figure: weight (kg) against age (months)]
Figure 3.6: The children’s weight across age. The grey points denote the actual measurements of weight; the line denotes the smoothing splines of the weight measurements; the dashed line denotes the reference population (CDC 2000 growth charts); and the two vertical lines denote the starting and ending points of the trial.
Table 3.9: Growth curve model for weight using random effect and ordinary linear model, ZINAK study

Variables                Random effect model        Linear model
Intercept                 6.37  ( 5.88,  6.86)       6.38  ( 6.11,  6.65)
Age                       0.17  ( 0.17,  0.18)       0.17  ( 0.15,  0.19)
Gender
  boy                     (reference)                (reference)
  girl                   -0.54  (-0.68, -0.40)      -0.54  (-0.61, -0.48)
Supplementation
  placebo                 (reference)                (reference)
  zinc                    0.02  (-0.18,  0.22)       0.08  (-0.01,  0.16)
  zinc+iron               0.01  (-0.19,  0.21)       0.02  (-0.07,  0.10)
  iron                    0.01  (-0.19,  0.21)       0.03  (-0.06,  0.12)
Maternal Education
  no-education            (reference)                (reference)
  6 years                 0.20  (-0.28,  0.68)       0.19  (-0.02,  0.40)
  9 years                 0.31  (-0.18,  0.79)       0.30  ( 0.09,  0.51)
  12 years or more        0.26  (-0.41,  0.94)       0.27  (-0.03,  0.57)
Illness days             -0.53  (-0.64, -0.41)      -1.12  (-1.57, -0.67)
Random effect
  sd(Intercept)           0.993 ( 0.923,  1.064)
  sd(Age)                 0.065 ( 0.061,  0.070)
  corr(Intercept,Age)    -0.617 (-0.860, -0.430)
The model for weight is
\[
y_i = X_i \beta + Z_i b_i + \epsilon_i, \qquad b_i \sim N(0, \Sigma), \qquad \epsilon_i \sim N(0, \sigma^2 I), \qquad i = 1, \ldots, N, \tag{3.2}
\]
where $y_i$ is the vector of weight measurements on child $i$, $N$ is the number of children, $b_i$ is the vector of random effects, and $X_i$ and $Z_i$ are the covariate matrices for the fixed and random effects, respectively.
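To make the structure of model (3.2) concrete, the following minimal Python sketch simulates one child's weight trajectory from a random intercept and random slope model. It is only an illustration: the fixed and random effect values are taken from Table 3.9, but the residual standard deviation `sigma`, the omission of all covariates other than age, and the use of uncentred age in months are assumptions made here, not taken from the thesis.

```python
import math
import random


def simulate_child(ages, beta0, beta1, sd_int, sd_slope, corr, sigma, rng):
    """Simulate weights for one child from a random intercept and slope model."""
    # Draw correlated random effects (b0, b1) using a 2x2 Cholesky factor.
    z0, z1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    b0 = sd_int * z0
    b1 = sd_slope * (corr * z0 + math.sqrt(1.0 - corr ** 2) * z1)
    # y_ij = (beta0 + b0) + (beta1 + b1) * age_ij + eps_ij
    return [(beta0 + b0) + (beta1 + b1) * a + rng.gauss(0.0, sigma) for a in ages]


rng = random.Random(1)
ages = list(range(6, 13))  # monthly visits from 6 to 12 months of age
# sigma = 0.3 is a hypothetical residual standard deviation (assumption).
weights = simulate_child(ages, 6.37, 0.17, 0.993, 0.065, -0.617, 0.3, rng)
```

Repeating the call for many children gives trajectories whose intercepts and slopes vary between individuals, which is the feature the ordinary linear model ignores.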
Table 3.9 shows the results of fitting the weight models using a random effects model and, for comparison, the ordinary linear model. The age and illness-days covariates are measured as continuous variables while the rest are categorical. Illness days is the number of days with illness (symptoms) from the previous measurement time up to the current measurement time, used as a proxy variable for the effect of duration of illness.
The random effects model has two parts, the fixed part (upper part of the table) and the random part (the lower part). First, we look at the random effects, which are summarized by the standard deviations and the correlation of the random intercept and age slope (the $b_i$ in model (3.2)). They were found to be significantly different from zero, as indicated by their intervals, which do not include zero. The result supports the assumption that weight growth varies considerably between individuals.
Now we look at the fixed part (β in model (3.2)) and compare the estimates with those of the ordinary linear model. The estimated coefficients of the two models are quite similar except for illness days. The confidence intervals from the random effects model are generally wider than those from the ordinary linear model. There seemed to be no effect of supplementation on weight. The pronounced effects were gender, age and illness days. We have also checked some interactions and found no interaction between supplementation and illness days.
For comparison, we also performed two alternative analyses of the longitudinal weight measurements. The first one is an analysis with WAZ (weight-for-age z-score) instead of weight. The WAZ is a standardized value of the weight compared to a reference population. We used the CDC 2000 reference population (Kuczmarski, Ogden and Guo, 2002), which was also used by Lind (2004) (see the dashed line in Figure 3.6). The age and gender variables were important, as in the weight random effect model of Table 3.9, but the direction of the estimated regression coefficients was reversed. The estimated 95% confidence intervals were (-0.22, -0.206) and (0.00, 0.33) for age and girl, respectively. This indicates that growth decreased relative to the CDC 2000 reference and that the boys seemed to suffer more than the girls. The conclusions for supplementation, maternal education and illness days did not differ.
The second one is an analysis using weight velocity. The weight velocity of an individual at a given age is the difference between the current weight and the previously measured weight, divided by the length of time between the previous and the current measurement age. We used the ordinary linear model, as the random effect part did not show any significant contribution. Age and illness days still showed a large effect, as in the weight models. The gender effect, however, disappeared. As in the weight models, supplementation did not show any significant effect in this weight velocity model.
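The definition of weight velocity above can be sketched as follows. The function name `weight_velocity` and the example measurements are hypothetical and serve only to illustrate the calculation.

```python
def weight_velocity(ages, weights):
    """Return (age, velocity) pairs; velocity is weight change per unit of age
    since the previous measurement."""
    out = []
    for k in range(1, len(ages)):
        dt = ages[k] - ages[k - 1]
        if dt <= 0:
            raise ValueError("ages must be strictly increasing")
        out.append((ages[k], (weights[k] - weights[k - 1]) / dt))
    return out


# Hypothetical child measured at 6, 7 and 9 months (illustrative values, in kg).
v = weight_velocity([6.0, 7.0, 9.0], [7.0, 7.4, 8.0])
```

Note that the first measurement of each child has no velocity, so one observation per child is lost relative to the weight models.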
In conclusion, there is a general growth decrease for children in
the study compared to the standard reference population, but the
supplementation did not seem to affect the growth. It is also of
interest to investigate the growth model in relation to time-to-event
morbidity data. We will discuss this problem in Chapter 5.
3.6 Remarks
We have demonstrated the application of EHA and LDA to analyze data from childhood health studies. There are two points of
methodological interest emerging from the applications.
In EHA, we sometimes face more than one competing time scale. For instance, we may use calendar time instead of age in the morbidity model of Section 3.3. Age-period and age-period-cohort models are another situation in which more than one time scale is involved. The problem of multiple time scales will be discussed in the next chapter.
An important statistical issue in the ZINAK study is that the outcomes of interest may actually interact with each other, and analyzing them separately may give biased results. Specifically, the interest is in the joint analysis of time-to-event and longitudinal measurement outcomes. A comparison of approaches and further analysis of the ZINAK respiratory infection and growth data will be presented in Chapter 5.
Chapter 4
Multiple Time Scales
4.1 Introduction
Time is indispensable in event history analysis. Although time may be just a proxy measure for other influences on the events (Berzuini and Clayton, 1994b), time is the most readily available measurement and is easy to use for comparison and generalization. For example, in epidemiology, age is the most often used time scale since it reflects cumulative damage that causes mortality, whereas in clinical studies, time since diagnosis may be more important. This chapter considers the problem of choosing an appropriate baseline time scale and modeling dual time scales.
The choice of time scale is driven by the research question of the
study. However, in the absence of knowledge about the importance
of time scales, we may have to consider all relevant time scales. In
an epidemiological surveillance study, it is common to perform an
exploratory study to identify new emerging risk factors. One way of
exploring the factors is by investigating several relevant time scales.
In general, the choice of relevant time scales in epidemiology or observational studies is more difficult than in clinical studies (Liestøl and Andersen, 2002).
Farewell and Cox (1979) and Oakes (1995) suggested choosing a basic time scale that accounts for as much of the variation as possible. Duchesne (1999) and Duchesne and Lawless (2000) introduced the concept of an ideal time scale. However, their focus is on a usage variable (such as mileage or asbestos exposure) as the other scale, rather than on the multiple origins problem considered in this thesis. Multiple time scales have been considered in the multistate model as well. Jones and Crowley (1992) and Commenges (1999) considered the problem of multiple time scales under the Markov and semi-Markov models. Ng and Cook (1997) developed a random effects model that includes piecewise constant formulations. Andersen and Keiding (2002) suggested a practical approach to choosing a basic time scale in the Cox model.
The piecewise constant hazards and discrete time models are the
usual approaches to the multiple time scales problem, if we want
to treat multiple time scales symmetrically (Keiding, 1990; Berzuini
and Clayton, 1994b). Those approaches utilize the relation between
Poisson regression and Cox’s proportional hazards model. Efron
(2002) considered the discrete time approach to develop a two-way
proportional hazards model and decomposed the hazards multiplicatively for a dual time scales problem.
In the Cox model, other time scales (than the basic time scale)
can be considered as a defined time-dependent covariate (see Section
2.4.1). Therefore, Cox models with a time-dependent approach, such
as a time-dependent covariate and time-dependent strata, can be
used for multiple time scales modeling.
In this chapter, procedures to choose a basic time scale in Cox’s regression model are proposed. For the dual time scales problem, the connection between piecewise constant hazards and the time-dependent approach is discussed. Quantitative comparisons are performed through simulation.
4.2 The choice of relevant time scales
The multiple time scales problem considered here is basically a multiple time origins problem, with time equal to ordinary clock time. The nature of the problem is different from the usual multivariate survival problems, such as bivariate survival in twin studies or studies of paired human organs. In the multiple time origins problem (see the Lexis diagram in Figure 4.1(a)), the pair of time scales moves in the same direction, along a line with slope 1 (Keiding, 1990). When a subject dies, for instance, both movements for that subject stop. In twin studies, a pair of twins may have different paths: if one dies, the other may still continue along the path.
Figure 4.1 shows the life line in a Lexis diagram for one subject and its corresponding separate time scales. Usually the time on the abscissa ($T_1$) represents calendar time, life length measured from the "zero" calendar date (the birth of Christ), whereas time on the ordinate ($T_2$) represents age, life length measured from the subject’s birthdate. Another example is a clinical study, where $T_1$ represents age and $T_2$ represents time since diagnosis. As we can see from the figure, both time scales stop at the same event time (the dashed lines) at a certain reference time, but their origins are different. The problem is to choose the most relevant time scale as the baseline.
No regression coefficient is estimated for the basic time scale. Therefore, a time variable whose effect is of interest should not be used as the basic time scale (Andersen and Keiding, 2002). However, a time variable suspected of having an irregular effect, which is difficult to model parametrically via a time-dependent covariate, may be chosen as the basic time scale.
[Figure: panels (a) and (b); labels T, T1, T2 and reference time]
Figure 4.1: (a) Lexis diagram and (b) separate scale
This guideline may be useful enough in practice, yet there are situations where a more formal procedure for choosing a time scale is needed. When the time origin is suspected to be erroneously specified, we may need a formal procedure to examine the observed time scales. Henceforth, we call a procedure that deals with this problem an erroneous scale procedure.
The erroneous scale model assumes a data generating mechanism as in Figure 4.1. The hazard function of a true but unobserved duration $T$ is modeled as a Cox model
\[
\lambda_i(t) = \lambda_0(t) \exp(\beta Z_i(t)), \qquad t > 0, \tag{4.1}
\]
where $\lambda_i(t)$ is the hazard function for subject $i$ and $\lambda_0(t)$ is the baseline hazard function.
Several alternative time origins might be observed, resulting in several time scales (durations), e.g., $T_1$ and $T_2$ in Figure 4.1. In a real situation, the true duration $T$ may be the time from onset until the event of interest, which is not observable, and the alternative durations $T_1$ and $T_2$ are age and time since diagnosis, respectively. We are interested in choosing the most relevant time scale as a surrogate for the true time scale.
The Cox model with alternative time scales can be specified as
\[
\lambda_i(t) = \lambda_0(t + \delta_i) \exp(\beta Z_i(t + \delta_i)), \qquad t > 0, \tag{4.2}
\]
where $\delta_i$ represents the difference or delay between the true origin and the alternative origin for subject $i$. For example, $\delta_i$ is the duration from onset until diagnosis.
In this situation we may no longer have a proportional hazards model, since $\delta_i$ varies between individuals. Therefore, when we observe only the alternative time scales, a simple procedure to investigate whether a time scale is appropriate is to examine the proportional hazards assumption.
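A short worked step, added here as an illustration under the simplifying assumption of time-fixed covariates $Z_i$ and $Z_j$, shows why. For two subjects $i$ and $j$, model (4.2) gives
\[
\frac{\lambda_i(t)}{\lambda_j(t)} = \frac{\lambda_0(t + \delta_i)}{\lambda_0(t + \delta_j)} \exp\{\beta (Z_i - Z_j)\}.
\]
The first factor generally changes with $t$ when $\delta_i \neq \delta_j$, so the hazard ratio is then not constant over time.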
If we assume a Gompertz hazard function, we can write the hazard $\lambda_0(t + \delta_i)$ as $\tilde{\lambda}_0(t) W_i$, separating the baseline hazard from a subject-specific factor (Liestøl and Andersen, 2002). Model (4.2) is then a Cox model with frailty (Section 2.2.5),
\[
\lambda_i(t) = \tilde{\lambda}_0(t) W_i \exp(\beta Z_i(t + \delta_i)), \qquad t > 0, \tag{4.3}
\]
where $W_i$ is the random effect or frailty variable, a function of $\delta_i$. In this situation we may estimate a frailty effect, for instance by assuming that $W_i$ is gamma distributed with mean 1 and variance $\omega$. Therefore, another procedure to examine the time scales is to examine the frailty effects.
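As a sketch of the Gompertz step, suppose the baseline is written as $\lambda_0(t) = a \exp(bt)$, where the symbols $a > 0$ and $b$ are notation introduced here for illustration. Then
\[
\lambda_0(t + \delta_i) = a \exp(bt) \exp(b \delta_i) = \lambda_0(t) W_i, \qquad W_i = \exp(b \delta_i),
\]
so the delay acts as a multiplicative subject-specific factor on the baseline hazard. In this parametrisation the rescaled baseline can be taken as $\lambda_0$ itself, and any constant needed to give $W_i$ mean 1 can be absorbed into the baseline.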
However, when those procedures do not seem to reveal the most
relevant time scale, and there is scientific reason that the time scales
are all important, we may model multiple time scales simultaneously.
We discuss this problem for the case of dual time scales in the next
section.
[Figure: Lexis diagram with calendar time (x), marked x0 to x6, on the horizontal axis and age (y), marked y0 to y4, on the vertical axis]
Figure 4.2: Hypothetical event history data on a Lexis diagram. The lines represent the observed follow-up time by age and calendar time (period); the dots represent the events of interest (deaths, disease).
4.3 Modeling dual time scales
We will discuss the multiple time scales problem for the case of dual
time scales such as age and calendar time (period). Figure 4.2 represents typical dual time scales event history data on a Lexis diagram.
The general aim is to model the hazard as a function of age $y$, calendar time $x$ and covariate $Z$, which may also depend on $y$ and $x$. Let $\mu(x, y)$ be the hazard function at period $x$ and age $y$. Generalizing from the single time scale, the Cox proportional hazards model for dual time scales is
\[
\mu(x, y \mid Z) = \mu_0(x, y) \exp(\beta Z), \tag{4.4}
\]
where $\mu_0(x, y)$ is the baseline hazard function at period $x$ and age $y$, common to all individuals. Three approaches to model (4.4) are considered here: the piecewise constant hazards, time-dependent strata and time-dependent covariate methods.
4.3.1 Piecewise constant hazards
In the piecewise constant hazards model we assume that the hazard function $\mu(x, y)$ is piecewise constant across the Lexis plane. Technically, the Lexis plane is divided into sufficiently small rectangles such that a constant hazard $\mu$ can reasonably be assumed in each rectangle. Let $u_i$ be the total exposure time in a rectangle and $d_i$ be the number of events (0 or 1) for individual $i$; then the contribution of individual $i$ to the likelihood in this specific rectangle is
\[
L_i(\mu) = \mu^{d_i} \exp(-\mu u_i), \qquad i = 1, \ldots, n. \tag{4.5}
\]
To assess other effects on the hazard we may specify $\mu \exp(\beta Z)$ instead of only $\mu$, where $Z$ is a vector of covariates and $\beta$ is a vector of unknown regression coefficients. Although any functional form of $Z$ and $\beta$ is possible, the log-linear form $\exp(\beta Z)$ is convenient.
Let the Lexis plane, as in Figure 4.2, be divided into smaller rectangles
\[
\Omega(r,s) = \{(x, y) : x \in [x_{r-1}, x_r) \text{ and } y \in [y_{s-1}, y_s)\},
\]
$r = 1, 2, \ldots, R$, $s = 1, 2, \ldots, S$, and let $d_{i(r,s)}$ and $u_{i(r,s)}$ be the number of observed events and the time spent (exposure time) in each $\Omega(r,s)$ for individual $i$, $i = 1, \ldots, n$. The likelihood for the piecewise constant hazards model (4.5) for all individuals and over the Lexis grid $\Omega$ is
\[
L(\mu, \beta) = \prod_{r=1}^{R} \prod_{s=1}^{S} \prod_{i=1}^{n} \left[ \left( \mu_{rs} e^{\beta Z_i} \right)^{d_{i(r,s)}} \exp\left( -\mu_{rs} e^{\beta Z_i} u_{i(r,s)} \right) \right], \tag{4.6}
\]
where $\mu_{rs}$ is the baseline hazard in $\Omega(r,s)$.
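As an illustration of how the event counts and exposure times entering (4.6) could be obtained, the sketch below splits a single life line over a Lexis grid. The function name, argument layout and example grid are assumptions made for illustration only; the event, if any, is counted in the rectangle in which follow-up ends.

```python
def lexis_split(x_in, y_in, duration, event, x_breaks, y_breaks):
    """Exposure u[(r, s)] and event count d[(r, s)] for one life line.

    The life line starts at (x_in, y_in) and moves along a line of slope 1
    for `duration` time units. Rectangle (r, s) is
    [x_breaks[r-1], x_breaks[r]) x [y_breaks[s-1], y_breaks[s]).
    """
    u, d = {}, {}
    for r in range(1, len(x_breaks)):
        for s in range(1, len(y_breaks)):
            # Time since entry during which the line is inside rectangle (r, s).
            lo = max(0.0, x_breaks[r - 1] - x_in, y_breaks[s - 1] - y_in)
            hi = min(duration, x_breaks[r] - x_in, y_breaks[s] - y_in)
            if hi > lo:
                u[(r, s)] = hi - lo
                # Count the event in the rectangle where follow-up ends.
                d[(r, s)] = 1 if (event and hi == duration) else 0
    return u, d


# Hypothetical subject: enters at period 0.5 and age 0, is followed for
# 2 time units and then experiences the event.
u, d = lexis_split(0.5, 0.0, 2.0, 1, [0, 1, 2, 3], [0, 1, 2, 3])
```

Summing such dictionaries over individuals gives the totals needed for each rectangle; rectangles that the life line never enters are simply absent from the output.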
It is possible to assess the effects of time (age and calendar time) on the hazard by assuming a multiplicative decomposition $\mu_{rs} = \lambda_s \gamma_r$.
4.3.2 Time-dependent approaches
In the single time scale situation, the partial likelihood used in the Cox proportional hazards model to estimate the regression coefficients can be interpreted as a profile likelihood, obtained from a piecewise constant hazards likelihood by maximizing with respect to the nuisance parameters and letting the width of the time intervals approach zero (Johansen, 1983; Clayton, 1988). This procedure does not work in the dual time scale situation, due to the lack of smoothness of the maximum likelihood baseline rate estimates (Keiding, 1990; Berzuini and Clayton, 1994b). Efron (2002) was able to construct a genuine two-way proportional hazards model by considering discrete time scales.
An alternative approach is to keep the partition of one time scale fixed while the partition in the other direction gets finer and finer. In the limit, we get two different solutions, depending on which partition is kept fixed.
We consider the likelihood for the piecewise constant hazards model of Equation (4.6). Now, given $\beta$, the $\mu_{rs}$ may be estimated separately as follows. Looking at specific values of $r$ and $s$, suppressing the dependence on them, and taking logs gives
\[
\ell_{rs} = \sum_{i=1}^{n} \left[ d_i \log \mu + d_i \beta Z_i - \mu \exp(\beta Z_i) u_i \right]. \tag{4.7}
\]
By equating the derivative of (4.7) with respect to $\mu$ to zero, we get $\hat{\mu}_{rs}(\beta)$:
\[
\hat{\mu}_{rs}(\beta) = \frac{\sum_{i=1}^{n} d_{i(r,s)}}{\sum_{i=1}^{n} u_{i(r,s)} \exp(\beta Z_i)}, \qquad r = 1, \ldots, R; \; s = 1, \ldots, S. \tag{4.8}
\]
By replacing $\mu_{rs}$ in (4.6) by (4.8), taking logarithms and simplifying, we get the profile log likelihood
\[
\ell_p(\beta) \propto \sum_{r=1}^{R} \sum_{s=1}^{S} \sum_{i=1}^{n} d_{i(r,s)} \log \left( \frac{e^{\beta Z_i}}{\sum_{j=1}^{n} e^{\beta Z_j} u_{j(r,s)}} \right). \tag{4.9}
\]
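The profiling step can be checked numerically for a single rectangle. The sketch below, with hypothetical function names and made-up data for three individuals and a scalar covariate, evaluates the log likelihood (4.7), the estimate (4.8) and one rectangle's term of (4.9); the two log likelihoods should differ only by a constant that does not depend on $\beta$.

```python
import math


def weighted_exposure(u, z, beta):
    """Sum over individuals of u_j * exp(beta * z_j) in one rectangle."""
    return sum(uj * math.exp(beta * zj) for uj, zj in zip(u, z))


def loglik_cell(mu, d, u, z, beta):
    """Log likelihood (4.7) for one rectangle."""
    return sum(di * math.log(mu) + di * beta * zi - mu * math.exp(beta * zi) * ui
               for di, ui, zi in zip(d, u, z))


def mu_hat(d, u, z, beta):
    """Estimate (4.8): events divided by covariate-weighted exposure."""
    return sum(d) / weighted_exposure(u, z, beta)


def profile_loglik_cell(d, u, z, beta):
    """One rectangle's term of (4.9), dropping terms free of beta."""
    log_denom = math.log(weighted_exposure(u, z, beta))
    return sum(di * (beta * zi - log_denom) for di, zi in zip(d, z))


# Made-up data for one rectangle: events, exposure times and a 0/1 covariate.
d = [1, 0, 1]
u = [0.5, 1.0, 0.8]
z = [1.0, 0.0, 0.0]
```

With $D$ the number of events in the rectangle, the dropped constant is $D \log D - D$, which is why the relation in (4.9) is written with a proportionality sign rather than an equality.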
The time-dependent strata approach
We proceed with the approach with a fixed period (calendar time)
x scale, i.e., we keep R in (4.9) fixed.
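Now let $\omega = y_s - y_{s-1}$ be the constant width of the time intervals on the $y$ scale. When $S \to \infty$ ($\omega \to 0$), $d_{i(r,s)}$ and $u_{i(r,s)}$ become
\[
d_{i(r,s)} = \begin{cases} 1 & \text{if an event occurs for individual } i \text{ in } \Omega(r,s), \\ 0 & \text{otherwise,} \end{cases}
\]
\[
u_{i(r,s)} \approx \begin{cases} \omega & \text{if individual } i \text{ is observed in } \Omega(r,s), \\ 0 & \text{otherwise.} \end{cases}
\]
Let
\[
Y_i(r,s) = \begin{cases} 1 & \text{if individual } i \text{ is observed in } \Omega(r,s), \\ 0 & \text{otherwise.} \end{cases}
\]
The profile likelihood (4.9) then becomes
\begin{align*}
\ell_p(\beta) &\approx \sum_{r=1}^{R} \sum_{s=1}^{S} \sum_{i=1}^{n} d_{i(r,s)} \log \left( \frac{e^{\beta Z_i}}{\sum_{j=1}^{n} e^{\beta Z_j} \omega Y_j(r,s)} \right) \\
&= \sum_{r} \sum_{s} \sum_{i} d_{i(r,s)} \log \left( \frac{e^{\beta Z_i}}{\sum_{j} e^{\beta Z_j} Y_j(r,s)} \right) - \sum_{r} \sum_{s} \sum_{i} d_{i(r,s)} \log(\omega) \\
&\propto \sum_{r} \sum_{s} \sum_{i} d_{i(r,s)} \log \left( \frac{e^{\beta Z_i}}{\sum_{j} e^{\beta Z_j} Y_j(r,s)} \right) \tag{4.10}
\end{align*}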
removing the terms independent of $\beta$. Since $d_{i(r,s)}$ equals 1 only at the event times, the contributions to the likelihood arise only at the event times. Therefore the denominator of the log part is actually a sum over the risk set given $r$. The profile likelihood can be written as
\[
\ell_p(\beta) = \sum_{r=1}^{R} \sum_{i \in D_r} \log \left( \frac{\exp(\beta Z_i)}{\sum_{j \in R_r(y_i)} e^{\beta Z_j}} \right), \tag{4.11}
\]
where $D_r$ is the event set and $R_r(y_i)$ is the risk set at $y_i$, given $r$. In Figure 4.2, the event set consists of all lines with dots, and the risk set consists of all lines that intersect the horizontal line crossing each dot (the event times $y$). The profile likelihood (4.11) corresponds to the partial likelihood of Cox’s proportional hazards model with basic time scale age $y$ and time-dependent strata on the time scale $x$.
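A minimal sketch of evaluating (4.11) directly is given below. It assumes, as an illustration, that each subject's follow-up has already been cut into episodes of the form (stratum, entry age, exit age, event indicator, covariate), one episode per period stratum visited, that the covariate is scalar and that there are no tied event times. The function name and the data are hypothetical.

```python
import math


def strata_partial_loglik(episodes, beta):
    """Log partial likelihood with age as basic time scale and period strata.

    Each episode is (stratum, entry_age, exit_age, event, z).
    """
    total = 0.0
    for (r_i, _, exit_i, event_i, z_i) in episodes:
        if not event_i:
            continue
        # Risk set: episodes in the same stratum under observation at age exit_i.
        denom = sum(math.exp(beta * z_j)
                    for (r_j, entry_j, exit_j, _, z_j) in episodes
                    if r_j == r_i and entry_j < exit_i <= exit_j)
        total += beta * z_i - math.log(denom)
    return total


# Hypothetical data: one subject contributes two episodes (strata 1 and 2),
# the other two subjects contribute one episode each.
episodes = [
    (1, 0.0, 1.0, 0, 1.0),
    (2, 1.0, 1.5, 1, 1.0),
    (1, 0.2, 0.8, 1, 0.0),
    (2, 0.5, 2.0, 0, 0.0),
]
value_at_zero = strata_partial_loglik(episodes, 0.0)
```

Because the risk set is restricted to episodes in the same stratum, a subject is compared only with subjects of the same age who are in the same calendar period at that age.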
Time-dependent covariate approach
Assuming a multiplicative model for the baseline hazard function,
\[
\mu_{rs} = \lambda_s \gamma_r, \qquad r = 1, \ldots, R; \; s = 1, \ldots, S, \tag{4.12}
\]
we get a slightly different profiling procedure, leading to the time-dependent covariate approach. The log likelihood becomes
\[
\ell(\gamma, \lambda, \beta) = \sum_{s=1}^{S} \sum_{r=1}^{R} \sum_{i=1}^{n} \left[ d_{i(r,s)} \log\left( \lambda_s \gamma_r e^{\beta Z_i} \right) - \lambda_s \gamma_r e^{\beta Z_i} u_{i(r,s)} \right]. \tag{4.13}
\]
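Given $\beta$ and $\gamma_r$, $r = 1, \ldots, R$, maximizing (4.13) with respect to $\lambda_1, \ldots, \lambda_S$ is straightforward. The solution is
\[
\hat{\lambda}_s = \frac{\sum_{r} \sum_{i} d_{i(r,s)}}{\sum_{r} \gamma_r \sum_{i} e^{\beta Z_i} u_{i(r,s)}}, \qquad s = 1, \ldots, S. \tag{4.14}
\]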
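Substituting (4.14) into (4.13) and simplifying by removing the terms independent of $\beta$ and $\gamma$ gives the profile likelihood
\[
\ell_p(\gamma, \beta) \propto \sum_{i=1}^{n} \sum_{r=1}^{R} \sum_{s=1}^{S} d_{i(r,s)} \log \left( \frac{\gamma_r \exp(\beta Z_i)}{\sum_{t=1}^{R} \gamma_t \sum_{j=1}^{n} e^{\beta Z_j} u_{j(t,s)}} \right). \tag{4.15}
\]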
We proceed with this derivation in a similar manner to the case with time-dependent strata. When $S \to \infty$ or $\omega \to 0$, $d_{i(r,s)}$ and $u_{i(r,s)}$ become the event indicator and the at-risk indicator at time $s$. The summation over all individuals $i$ becomes a summation over the events $i \in D$. In the log part, the summation is determined only at the event times $s$, since all other terms vanish by the definition of the event indicator $d_{i(r,s)}$. Similarly, in the denominator the summation will be over the risk set $R(y_i)$. The summation over $\gamma_t$ will also be completely determined by $j \in R(y_i)$. The profile likelihood becomes
\[
\ell_p(\gamma, \beta) = \sum_{i \in D} \sum_{r=1}^{R} \log \left( \frac{\gamma_r \exp(\beta Z_i)}{\sum_{j \in R(y_i)} \sum_{m} \gamma_m e^{\beta Z_j}} \right). \tag{4.16}
\]
The log profile likelihood (4.16) is exactly the log of Cox’s partial likelihood with a time-dependent categorical covariate, where the categories are defined by the time intervals $(x_{r-1}, x_r]$, $r = 1, \ldots, R$. Instead of a categorical covariate, we may also specify the values of $x_r$ or any function of $x_r$ at the event times.
A similar connection can be derived by letting the age intervals be fixed and the period interval lengths approach zero. The result will be the Cox proportional hazards model with age entering as time-dependent strata or as a time-dependent covariate in the model with calendar time as the basic time scale.
Other pairs of time scales are of course possible. For instance, the dual time scales age and time since diagnosis arise frequently in clinical studies; age and time since weaning is another example from childhood life studies.
4.4 Simulation studies
4.4.1 Erroneous scale
The first simulation study investigates the performance of the procedures to select relevant time scales discussed in Section 4.2. The procedures are the proportional hazards assumption test and frailty model estimation.
Several data generating models are assumed. We consider two competing time scales, $S_1$ with duration $T_1$ and $S_2$ with duration $T_2$. $S_1$ was specified as a better time scale than $S_2$ in the sense that $S_1$ has a lower value of the time delay $\delta_i$ than $S_2$. As an example from a real study, the true time scale is time since the onset of a certain disease, $T_1$ is time since the subject first feels any symptoms of the disease and $T_2$ is time since diagnosis. We assume that the time since onset cannot be determined from the diagnosis.
The true duration $T$ is generated by the ordinary proportional hazards model,
\[
\lambda_i(t) = \lambda_0(t) \exp(\beta Z_i), \qquad t > 0, \tag{4.17}
\]
but we can only observe $T_1$ and $T_2$, generated from the true time scale with delays $\delta_i$ for each individual $i$. The details of the simulation procedure are described in Appendix A-1. No truncation or censoring is considered in this simulation. Similar simulation studies have been considered by Liestøl and Andersen (2002) for the Gompertz-Makeham baseline hazard function, with the purpose of showing the effect of misalignment of patients and measurement error on the estimated regression coefficients.
To make $S_1$ better than $S_2$, the mean of $\delta_i$ for $T_1$ was specified to be lower than that for $T_2$, with $\delta_i$ following uniform and exponential distributions. For this simulation the baseline hazards were specified parametrically as Gompertz, exponential or Weibull hazard functions. One fixed zero-one covariate $Z_i$, generated from the Bernoulli distribution, was also included.
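A data generating step of this kind can be sketched with inverse-transform sampling. The code below is not the procedure of Appendix A-1, which is not reproduced here; it is a minimal illustration under several loudly stated assumptions: the baseline parameters are fixed at simple hypothetical values, $Z_i$ is Bernoulli(0.5), the delays follow the uniform distributions quoted in Table 4.1, and the observed duration is taken to be the true duration plus the delay.

```python
import math
import random


def draw_true_duration(z, beta, baseline, rng):
    """Inverse-transform draw of T with hazard lambda_0(t) * exp(beta * z)."""
    # Value that the baseline cumulative hazard H_0 must reach at T.
    e = -math.log(1.0 - rng.random()) / math.exp(beta * z)
    if baseline == "exponential":   # lambda_0(t) = 1,      H_0(t) = t
        return e
    if baseline == "weibull":       # lambda_0(t) = 2t,     H_0(t) = t**2
        return math.sqrt(e)
    if baseline == "gompertz":      # lambda_0(t) = exp(t), H_0(t) = exp(t) - 1
        return math.log(1.0 + e)
    raise ValueError(baseline)


def simulate_scales(n, beta, baseline, rng):
    """Return rows (z, T, T1, T2) for n subjects."""
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0   # ASSUMPTION: Bernoulli(0.5)
        t = draw_true_duration(z, beta, baseline, rng)
        d1 = rng.uniform(0.0, 1.0)           # delay for S1, U(0, 1)
        d2 = rng.uniform(0.5, 2.0)           # delay for S2, U(0.5, 2)
        # ASSUMPTION: observed duration = true duration + delay.
        rows.append((z, t, t + d1, t + d2))
    return rows


rows = simulate_scales(200, 2.0, "gompertz", random.Random(42))
```

Each of the observed durations could then be analyzed with a Cox model, with and without frailty, to compare the two candidate scales.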
Now we compare the performance of the proportional hazards test (ph-test) and frailty variance estimation in detecting the relevant time scale. The relevant time scale is expected to satisfy the proportional hazards assumption, and will therefore have larger p-values. In the frailty model, the estimated gamma frailty variance is used to detect the relevant time scale. A smaller frailty variance indicates a better time scale. In a real situation, a more careful analysis can be performed. For example, a Schoenfeld residuals plot may be used to accompany the ph-test, and a confidence interval constructed from the profile likelihood may be calculated for the gamma frailty variance.
In the simulation, the mean and standard deviation of the ph-test p-value and of the gamma frailty variance are used to summarize the results from 1000 replications. Histograms of the values were also examined (results not shown). In Tables 4.1 and 4.2 the mean of the ph-test p-value is in the zph column, and the mean of the gamma frailty variance is in the ω column.
Some general comments on the generated data are in order. Delays ($\delta_i$) that follow an exponential distribution seem to make the model suffer from violation of the proportional hazards assumption, as shown by the low values of the coverage (the percentage of confidence intervals covering the true coefficient β). The model with exponential baseline hazard suffers the most.

Table 4.1: Simulation study for erroneous scale with δi following the uniform distributions U(0, 1) and U(0.5, 2), for S1 and S2 respectively. The true coefficient β is 2. Each value is calculated based on a sample of size 200 with 1000 replications

Baseline                      Time scale S1                     Time scale S2
hazards       Model    β̄      p      zph    ω           β̄      p      zph    ω
Gompertz      CPH      1.98   94.8   0.65   –           1.92   91.8   0.63   –
                       0.19   –      0.25   –           0.19   –      0.25   –
              CPHF     1.99   94.2   0.66   0.009       1.99   94.2   0.66   0.009
                       0.21   –      0.23   0.039       0.21   –      0.23   0.039
Exponential   CPH      1.64   47.8   0.44   –           1.43   9.6    0.41   –
                       0.18   –      0.27   –           0.16   –      0.28   –
              CPHF     1.64   48.2   0.44   0.001       1.43   9.6    0.41   0
                       0.18   –      0.27   0.01        0.16   –      0.28   0.001
Weibull       CPH      1.85   85.8   0.58   –           1.70   59.6   0.53   –
                       0.19   –      0.26   –           0.19   –      0.27   –
              CPHF     1.85   85.8   0.59   0.004       1.70   60.0   0.54   0.002
                       0.19   –      0.26   0.028       0.19   –      0.27   0.016

CPH: Cox’s proportional hazards model; CPHF: Cox’s proportional hazards model with frailty
p is the coverage (percentage) of the interval estimation
β̄ is the mean of the estimated coefficient
zph is the mean of the proportional hazards test p-value
ω is the mean of the estimated frailty variance
The values in every second row are standard deviations

Table 4.2: Simulation study for erroneous scale with δi following exponential distributions with mean 0.5 and 1.25, for S1 and S2 respectively. The true coefficient β is 2. Each value is calculated based on a sample of size 200 with 1000 replications

Baseline                      Time scale S1                     Time scale S2
hazards       Model    β̄      p      zph    ω           β̄      p      zph    ω
Gompertz      CPH      1.87   87.0   0.63   –           1.44   13.6   0.34   –
                       0.20   –      0.25   –           0.20   –      0.30   –
              CPHF     1.90   87.0   0.66   0.02        1.65   48.8   0.57   0.169
                       0.22   –      0.22   0.067       0.28   –      0.19   0.19
Exponential   CPH      1.23   1.2    0.36   –           0.70   0.0    0.13   –
                       0.19   –      0.29   –           0.18   –      0.21   –
              CPHF     1.39   19.6   0.50   0.147       1.04   10.0   0.25   0.425
                       0.26   –      0.21   0.196       0.36   –      0.18   0.417
Weibull       CPH      1.55   28.2   0.51   –           0.94   0.0    0.13   –
                       0.18   –      0.29   –           0.18   –      0.20   –
              CPHF     1.63   46.2   0.61   0.069       1.31   22.6   0.34   0.375
                       0.23   –      0.20   0.124       0.32   –      0.16   0.304

CPH: Cox’s proportional hazards model; CPHF: Cox’s proportional hazards model with frailty
p is the coverage (percentage) of the interval estimation
β̄ is the mean of the estimated coefficient
zph is the mean of the proportional hazards test p-value
ω is the mean of the estimated frailty variance
The values in every second row are standard deviations
When the baseline hazard function follows a Gompertz model, both the ph-test and the frailty model perform well. In Table 4.1, $S_1$ and $S_2$ are equally good, whereas in Table 4.2, $S_1$ is better than $S_2$, as shown by the larger value of zph and the smaller ω. This is confirmed by the coverage percentages p, which are lower for the wrong time scale.
Exponential baseline hazards are very much affected by the erroneous scale. Although the zph values are not very low and the ω values are not very large, the coverage probabilities are very low. In Table 4.1, it is rather hard to distinguish the time scales, because the values of zph and ω look similar, but the coverage probabilities are quite low for $S_2$. For the larger effect of erroneous scale in Table 4.2, $S_1$ and $S_2$ can be distinguished by the values of zph and ω. The estimated frailty variances under the exponential baseline hazard (Table 4.1) do not seem to reveal the frailty effect. They are small, although the actual effect is rather bad (lower coverage).
The performance of the procedures under the Weibull baseline hazard is generally similar to that under the Gompertz. Under the Weibull hazard the delays have a larger effect than under the Gompertz. The zph and ω can distinguish $S_1$ and $S_2$ in the data with a larger effect of erroneous scale.
When the procedures do not show a difference between $S_1$ and $S_2$, dual time scales modeling may be performed. For data generated from these erroneous scale models, the inclusion of other time scales in the analysis is not likely to improve the model fit.
4.4.2 Dual time scales
The second simulation study considers the approaches discussed in Section 4.3 for modeling dual time scales $S_1$ and $S_2$. In a real study, $S_1$ and $S_2$ could be calendar time and age, or age and time since diagnosis. In this simulation we assume that the true model generating the duration $T_1$ on scale $S_1$ is a Cox model with a time-dependent covariate,
\[
\lambda_i(t \mid Z(t)) = \theta \exp(\beta_1 \eta_i + \beta_2 (t + \delta_i)), \qquad t > 0. \tag{4.18}
\]
For example, $T_1$ is time since onset of a certain disease, $T_2$ is age and $\delta_i$ is the age at onset, so that $T_1 = T_2 - \delta_i$. For positive $\beta_2$ the hazard for individual $i$ increases with time, and the hazard is higher for individuals with higher $\delta_i$ (higher age at onset). The details of the data generating procedure of this simulation are presented in Appendix A-2.
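Durations from a hazard of the form (4.18) can be drawn by inverting the cumulative hazard, since for fixed $\eta_i$ and $\delta_i$ the hazard is of Gompertz type in $t$. The sketch below is an illustration and not the procedure of Appendix A-2: the function names, the value $\theta = 1$ and the Bernoulli(0.5) distribution of $\eta_i$ are assumptions, while the exponential delay with rate 0.85 and $\beta_1 = 1.5$ follow the description of Table 4.3.

```python
import math
import random


def draw_t1(eta, delta, theta, beta1, beta2, rng):
    """Draw T1 with hazard theta * exp(beta1 * eta + beta2 * (t + delta))."""
    if beta2 < 0:
        raise ValueError("this sketch only covers beta2 >= 0")
    a = theta * math.exp(beta1 * eta + beta2 * delta)  # hazard at t = 0
    e = -math.log(1.0 - rng.random())                  # unit exponential draw
    if beta2 == 0:
        return e / a                                   # constant hazard
    # Solve a * (exp(beta2 * t) - 1) / beta2 = e for t.
    return math.log(1.0 + beta2 * e / a) / beta2


def simulate_dual(n, beta1, beta2, rng, theta=1.0):
    """Return rows (eta, delta, T1, T2) with T2 = T1 + delta."""
    rows = []
    for _ in range(n):
        eta = 1 if rng.random() < 0.5 else 0   # ASSUMPTION: Bernoulli(0.5)
        delta = rng.expovariate(0.85)          # exponential delay, rate 0.85
        t1 = draw_t1(eta, delta, theta, beta1, beta2, rng)
        rows.append((eta, delta, t1, t1 + delta))
    return rows


data = simulate_dual(200, 1.5, 1.0, random.Random(2005))
```

Either $T_1$ or $T_2$ can then be used as the basic time scale in the analysis, with the other scale handled by strata or by a time-dependent covariate.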
In reality, we do not know the exact data generating process; we only believe that $T_1$ and $T_2$ should be modeled simultaneously. The performance measures compared in this simulation are the estimate of $\beta_1$ (its mean and standard deviation) and the mean of the ph-test p-value (for the analyses with Cox’s model). One example of a dual analysis is in the childhood mortality studies (Section 3.2). We may believe that mortality depends on age and seasonality; therefore both time scales, age and period (as a proxy for seasonality), have to be taken care of. The variables of interest are not the times themselves but other explanatory variables such as gender, maternal education, etc. What we want to compare is how the method of handling $T_1$ and $T_2$ affects the estimates for the explanatory variable of interest.
For the piecewise constant hazards approach, each time scale was divided into four equal-width intervals. After experimenting with several variations of gridding for the generated data used in this simulation, four intervals gave reasonable piecewise constant hazards and were computationally feasible.
For the time-dependent strata, the same intervals as in the piecewise constant hazards approach were used. The analysis used the counting process data setup (Section 2.2). For each generated data set, two time-dependent strata estimation procedures were carried out. The first one used $S_1$ as the basic time scale with $S_2$ as the time-dependent stratum, and the second used $S_2$ as the basic time scale with $S_1$ as the time-dependent stratum.

Table 4.3: Simulation study for dual time scales S1 and S2 analyzed with piecewise constant hazards and time dependent approaches. The true coefficients are β1 = 1.5, β2 = 0, 1 and δi is exponential with rate 0.85. Each value is calculated based on a sample of size 200 with 1000 replications

                                   β2 = 0                        β2 = 1
Method                    β̄1 (sd)     p1     zph       β̄1 (sd)     p1     zph
piecewise const-hzd       1.55(.18)   95.2   –         1.51(.15)   95.5   –
S1 time-dep strata        1.51(.19)   94.9   .81       1.50(.18)   95.7   .71
S2 time-dep strata        1.51(.19)   95.2   .63       1.51(.18)   96.4   .66
S1 time-dep covariate     1.51(.18)   95.3   .90       1.48(.16)   95.7   .98
S2 time-dep covariate     0.91(.13)   5.6    .89       0.58(.13)   0.1    .99

S1 and S2 in front of the method’s name denote the basic time scale used
In the Cox time-dependent covariate analysis, the values of the
covariate are only used at event times, with a certain functional form.
Analyzing time-dependent covariates in that way is computationally
demanding, therefore we used similar time intervals as in the
piecewise constant hazard and the time-dependent strata methods. The
form of the function is modeled non-parametrically using a penalized
smoothing spline (Hastie and Tibshirani, 1990), which is available,
for instance, in the survival package of R or S-PLUS. As in the
time-dependent strata case, two analyses were carried out with this
model, using each time scale in turn as the basic time scale and
including the other time scale as a time-dependent covariate.
The results are shown in Table 4.3 for exponentially distributed
δi and in Table 4.4 for uniformly distributed δi .
Table 4.4: Simulation study for dual time scales S1 and S2 analyzed
with piecewise constant hazards and time-dependent approaches.
The true coefficients are β1 = 1.5, β2 = 0, 1 and δi is uniform(0,2).
Each value is calculated based on a sample of size 200 with 1000
replications

                         β2 = 0                     β2 = 1
Method                   β̄1 (sd)      p̃1     zph    β̄1 (sd)      p̃1     zph
piecewise const-hzd      1.53 (.17)   95.1   –      1.50 (.15)   96.4   –
S1 time-dep strata       1.51 (.18)   95.4   .81    1.49 (.18)   95.7   .74
S2 time-dep strata       1.51 (.18)   94.8   .61    1.50 (.15)   97.6   .47
S1 time-dep covariate    1.50 (.18)   95.1   .97    1.48 (.17)   95.2   .95
S2 time-dep covariate    0.88 (.13)    3.1   .92    0.50 (.11)    0.0   .93

S1 and S2 in front of the method’s name denote the basic time scale used
In general, the piecewise constant hazard and time-dependent strata
approaches perform well. For the time-dependent strata, the
appropriate analysis assuming model (4.18) is to use S1 as the basic
time scale, which gave good performance. However, even if the
inappropriate basic time scale S2 is used, the performance is still
good, with only a slight violation of the proportional hazards
assumption.
For the time-dependent covariate approaches, using S1 as the basic
time scale gave good performance, which is not surprising given
the data generating model. Using the wrong basic time scale S2 is
really harmful, and worse if the time-dependent covariate is really
in the model, i.e., β2 = 1. The simulation with β2 = 0 complements
the result given by Liestøl and Andersen (2002). In their simulation
T1 is ‘time since diagnosis’, which had a Gompertz form, and T2 is
age. The Gompertz baseline hazard is convenient since T1 can be
switched into a time-dependent covariate and still give the same
result. In this simulation it is shown that a baseline hazard other
than the Gompertz (a constant hazard in this simulation) will give
different results.
This issue has also been discussed for the case of an epidemiological
follow-up study by Korn, Graubard and Midthune (1997).
4.4.3 Misspecification
We also analyzed the generated data sets under misspecified analysis,
i.e., (i) the data were generated from the erroneous scale model
but analyzed with the dual time scales methods; (ii) the data were
generated from the dual time scales model but analyzed with the
erroneous scale methods.
For the first misspecified analysis, all dual time scales approaches
showed good performance for the low effect of erroneous
scale (the Gompertz baseline hazard case) but not for the large effect
of erroneous scale (the exponential baseline hazard case). The Cox
model with time-dependent strata and the piecewise constant hazards
approach have similar performance, and they are better than the
Cox model with a time-dependent covariate.
For the second misspecified analysis with exponentially distributed
δi, the ph-test and the frailty model suggest that S1 is the most
relevant time scale. However, for uniformly distributed δi the
procedures do not show any difference.
4.5 Application to infant mortality age–period analysis
We look again at the application considered in Section 3.2 about
infant and child mortality. We mainly concentrate on the dual time
scales age-period problem with categorical covariates gender (boy or
girl) and maternal education (none, 6, 9, 12 years of education) for
infant mortality data.
Analyses with piecewise constant hazards, Cox’s proportional
hazards with age time scale and Cox’s proportional hazards with
period time scale were performed. Two-month grids were applied for
both age and period. For the piecewise constant hazards model, the
standard Poisson model for the number of events in each grid cell with
log link function was used. The total exposure time in each grid cell
was entered into the model as an offset (on the log scale, given the
log link). For the Cox model with age time scale, period time was
included as time-dependent strata or as a time-dependent covariate.
Similarly, in the Cox model with period time scale, age was included
as time-dependent strata or as a time-dependent covariate. The
time-dependent covariates in both models using age and period as the
basic time scale were treated non-parametrically
using a penalized smoothing spline.
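The link between the piecewise constant hazards model and the Poisson
model with an exposure offset can be illustrated with a small Python
sketch. When the model contains only interval effects, the maximum
likelihood estimate of the hazard in each interval is the number of
events divided by the total exposure time, which is what the Poisson
fit with offset reproduces. The sketch below uses a single time scale
for brevity (the analysis above uses a grid over both age and period);
the function names and the toy records are ours.

```python
def events_and_exposure(records, cuts):
    """Tabulate events and exposure time per interval.

    records: iterable of (entry, exit, event) on one time scale.
    cuts: sorted interval boundaries [c0, c1, ..., ck].
    """
    k = len(cuts) - 1
    events = [0] * k
    exposure = [0.0] * k
    for entry, exit_, event in records:
        for j in range(k):
            lo = max(entry, cuts[j])
            hi = min(exit_, cuts[j + 1])
            if hi > lo:
                exposure[j] += hi - lo
                # An event is counted in the interval where follow-up ends.
                if event and cuts[j] < exit_ <= cuts[j + 1]:
                    events[j] += 1
    return events, exposure


def piecewise_hazards(events, exposure):
    # Occurrence/exposure rates: events divided by time at risk.
    return [d / e if e > 0 else float("nan") for d, e in zip(events, exposure)]


toy = [(0, 3, 1), (0, 5, 0), (1, 4, 1)]
d, e = events_and_exposure(toy, cuts=[0, 2, 4, 6])
rates = piecewise_hazards(d, e)
```

Covariates such as gender and maternal education would enter through
the Poisson regression itself, which this sketch does not attempt.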
There is no scientific background suggesting that the two time
scales, age and calendar time, are two alternative time scales.
However, we can examine this by checking the proportionality
assumption of the model using age and period as the basic time scale
in separate analyses. Neither model violates the proportionality
assumption: the p-values of the proportionality test are relatively
large, 0.332 and 0.763 for the age and period time scales,
respectively.
Tables 4.5 and 4.6 show the results of the likelihood ratio tests for
the variables in each model and the estimated coefficients,
respectively. In this particular data set, the methods in fact gave
similar results. However, the safe approach is to consider the results
from a Cox model with time-dependent strata or from the piecewise
constant hazards method. The general conclusion is that maternal
education is quite important and has a protective effect on infant
mortality.
Table 4.5: Likelihood ratio test (LRT) for variables in the infant
mortality models using piecewise constant hazards (pc-hazards), Cox
proportional hazards with age time scale (Cox-age), Cox proportional
hazards with period time scale (Cox-period)

                                Cox-age                 Cox-period
Variables        pc-hazards     td-strata   td-covar    td-strata   td-covar
Age              < .001         –           –           –           < .001
Period           .395           –           .979        –           –
Gender           .080           .100        .075        .104        .132
Maternal educ.   < .001         < .001      < .001      < .001      < .001
Table 4.6: Estimated coefficients and their standard errors for gender
and maternal education in the infant mortality models

                                Cox-age                       Cox-period
Variables        pc-hazards     td-strata      td-covar       td-strata      td-covar
Gender
  boy            –              –              –              –              –
  girl           -0.31 (.18)    -0.29 (.18)    -0.32 (.18)    -0.29 (.18)    -0.29 (.19)
Maternal educ.
  none           –              –              –              –              –
  6 years        -0.76 (.26)    -0.76 (.27)    -0.74 (.26)    -0.72 (.27)    -0.53 (.28)
  9 years        -1.17 (.34)    -1.18 (.35)    -1.16 (.34)    -1.11 (.35)    -1.10 (.36)
  12 years       -2.74 (.56)    -2.72 (.56)    -2.70 (.56)    -2.67 (.56)    -2.43 (.57)
4.6 Remarks
The first consideration when we face a multiple time scales problem
in event history analysis is to look for the scientific background of
the time scales. The background may be obvious in clinical studies
but may not be so in epidemiological or observational studies.
A proportional hazards test is advisable for checking the alternative
time scales. This procedure is simpler than using a frailty model.
Moreover, analyzing individual frailty may give wrong conclusions,
especially when we use an incorrect underlying frailty distribution. We
have noticed this problem also in the simulation studies.
A safe approach in analyzing dual time scales is to use the Cox
model with time-dependent strata or the piecewise constant hazard
approach. Simulation studies showed that both approaches gave
good performance when analyzing dual time scales generated by the
erroneous scale model or by the dual time scales model. The Cox
model with a time-dependent covariate is superior to the other
approaches when the other time scale (other than the basic time scale)
is really a time-dependent covariate in the model.
Chapter 5
Event History Analysis with Longitudinal Measurements
5.1 Introduction
We consider modeling event history with longitudinal measurements
when the longitudinal measurements are intermittently observed and
possibly measured with error. Analysis of respiratory infection
and weight in the ZINAK study presented in Chapter 3 is one example
of such a situation.
One way of analyzing such data is by considering the time-to-event
data as the outcome and the longitudinal measurements as a
time-dependent covariate. Another way is to analyze both outcomes
simultaneously, assuming that they are independent given certain
latent processes.
Several methods have been proposed to deal with this kind of
problem and some of them have been reviewed in Section 2.4.1. Four
methods that have been around in the literature are LVCF (Last
Value Carried Forward), TEL (time elapsed since the last
measurement), two-stage, and the joint model of event time and
longitudinal measurements. Two methods based on Cox’s proportional
hazards model with stratification and frailty are proposed. The
emphasis of the analysis is on the joint evolution of time-to-event
and longitudinal measurements rather than on longitudinal
measurements as surrogate markers for the event.
To our knowledge, the methods mentioned above have mostly been
applied to clinical settings such as AIDS studies, psychiatric
disorders and cancer prevention trials, not to observational or
epidemiological settings, which are more “irregular” than the clinical
ones. Applications of the methods to multiple events or repeated
events are also rarely considered in the literature.
The aim of this chapter is to compare the methods by means of
simulation and to perform further analysis of the respiratory infection and weight data from the ZINAK study introduced in Chapter 3.
5.2 Problem and models
Suppose n individuals are followed over a time interval [0, L) with
longitudinal measurements {yij : i = 1, 2, . . . , n; j = 1, 2, . . . , ni }
at times {tij : i = 1, 2, . . . , n; j = 1, 2, . . . , ni }. Together with the
measurements, a counting process {Ni (u) : 0 ≤ u ≤ L} for the
events and a predictable at risk process {Ki (u) : 0 ≤ u ≤ L} are also
recorded. An additional fixed time covariate or baseline covariate Z
may be included.
One example of such a data setup is illustrated in Figure 5.1. The
event history data are repeated events data in which one individual
may have several counting process records of the form
(t0, t1], event. The at-risk process {Ki(u) : 0 ≤ u ≤ L} alternates
between 0 and 1 at time points specified by the intervals. Notice that
for repeated events such as morbidity, after an event occurrence, the
individual is not at risk for a certain period of time. The
not-at-risk period corresponds to the duration of illness (denoted by
dashed lines in Figure 5.1(a) under event-symptoms).
The longitudinal measurements are obtained throughout the period of observation and do not necessarily coincide with the event
times. The observed measurement data are not perfect since the true
time-dependent covariate might be a continuous curve as depicted
in the figure but we only collect some values (the ⋆’s in the picture,
for id = 1). Moreover, the values may be subject to measurement
errors (the ⋆’s are not exactly on the curve). This is a quite common
situation in many applications, for instance when measuring infant
weights. The situation creates a problem when we use Cox’s model
for analyzing event history data since the partial likelihood requires
the values of all covariates at the event times (see Equation (2.30)).
We consider two models for the data generating mechanism of
time-to-event and longitudinal outcomes. The first one assumes a
model of time-to-event with the longitudinal measurements as a
time-dependent covariate. The second one assumes a joint model
of time-to-event and longitudinal measurements induced by a latent
process.
In general, any model may be specified for the longitudinal measurements. Here, we consider a linear random effects model for the
longitudinal measurements as in the infants’ growth model of the
ZINAK study (Section 3.5). The covariate process for the infant
weight data may be specified as
$$Y^{\star}(t) = \alpha_0 + \alpha_1 Z + \alpha_2 t + U_1 + U_2 t, \quad t > 0, \tag{5.1}$$
where Y⋆(t) = (Y1⋆(t), . . . , Yn⋆(t)) is the “true” weight at age t for
individual i = 1, . . . , n; Z = (Z1, . . . , Zn) are fixed time covariates;
U1 and U2 are the unobservable random effects for the intercept and
age t with (U1, U2) ∼ N(0, Σ). The observed covariate process is
$$Y(t) = Y^{\star}(t) + \epsilon_t, \quad t = t_1, t_2, \ldots, \tag{5.2}$$
where εt ∼ N(0, σ) are the measurement errors. The actual
observation Y(t) is not continuously observed but finitely observed at
times t1, t2, . . ..

Figure 5.1: (a) Event history data and longitudinal measurements
and an illustration of imputing time-dependent covariate values using
LVCF and time since measurements; (b) event history data; (c)
longitudinal data
[Panel (a): event-symptoms timelines for id = 1, 2, 3 and the
longitudinal measurements for id = 1, plotted against time t from 0 to
10; ⋆ marks the observed measurements and ◦ the imputed values.]

(b) Event history data
id   t0   t1   event
1    0    6    1
1    7    10   0
2    0    3    1
2    4    9    1
3    0    5    1
3    7    10   0

(c) Longitudinal data
id   t    Y
1    2    2.9
1    4    4.2
1    6    5
1    8    4
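To make the covariate process of (5.1) and (5.2) concrete, the
following Python sketch draws one subject's “true” and observed
trajectories. It is only an illustration under stated assumptions: Σ
is parameterized by two standard deviations and a correlation, σ is
treated as the standard deviation of the measurement error, and all
names and numerical values are ours rather than those of the
simulation in Appendix A-3.

```python
import math
import random


def simulate_subject(times, z, alpha, sd_u, rho, sd_e, rng):
    """Draw (U1, U2), the true curve Y*(t) and the observed Y(t).

    alpha = (alpha0, alpha1, alpha2); sd_u = (sd of U1, sd of U2);
    rho is the correlation of (U1, U2); sd_e is the error sd.
    """
    a0, a1, a2 = alpha
    s1, s2 = sd_u
    # Correlated random intercept and slope via a 2x2 Cholesky factor.
    e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
    u1 = s1 * e1
    u2 = s2 * (rho * e1 + math.sqrt(1 - rho ** 2) * e2)
    # Equation (5.1): the unobserved true covariate process.
    true = [a0 + a1 * z + a2 * t + u1 + u2 * t for t in times]
    # Equation (5.2): add independent measurement error at each visit.
    observed = [y + rng.gauss(0, sd_e) for y in true]
    return u1, u2, true, observed


rng = random.Random(2005)
u1, u2, true, observed = simulate_subject(
    times=[0, 1, 2, 3], z=1, alpha=(3.0, 0.2, 0.5),
    sd_u=(0.4, 0.1), rho=0.3, sd_e=0.2, rng=rng)
```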
The time-to-event is modeled through Cox’s proportional hazards
model
$$\lambda(t \mid X, Y^{\star}(t)) = \lambda_0(t) \exp(\beta_1 X + \beta_2 Y^{\star}(t)), \tag{5.3}$$
where X are fixed time covariates such as maternal education and
supplementation, and may also include Z (e.g., the gender variable in
the covariate process model).
The central methodological and practical problem is how to estimate the parameters in (5.3) when Y ⋆ (t) is not available but Y (t)
is instead. The methods we consider are LVCF, TEL, two-stage,
Cox-frailty and Cox-strata discussed in the next section.
The joint model was mentioned in Section 2.4.2 as a more general
methodology to deal with time-dependent covariates. It has two submodels, one for the longitudinal measurements and another for the
time-to-event.
The longitudinal measurements model is the same as model (5.2).
The specification of the time-to-event model, however, is different
from that of (5.3). The hazard function of this joint model is
$$\lambda(t \mid X) = \lambda_0(t) \exp(\beta X + \gamma (U_1 + U_2 t)), \tag{5.4}$$
where X are fixed time covariates, which could be the same as or
overlap with Z; U1 and U2 are specified as in model (5.1). The
difference compared to model (5.3) is that the “true” value Y⋆(t) is
not included in the hazard function but only the random effect part
U1 + U2 t.
The idea of the joint model is that the dependence between the
longitudinal measurements and the time-to-event can arise through
the common covariate X and the possible unobserved heterogeneity
in both models. The joint model attempts to take care of the latent heterogeneity in both models simultaneously. When there is no
latent association, γ = 0, the joint model is actually two separate
models of longitudinal data and event history data.
It is also possible to include more random effect terms than
U1 + U2 t. The latent association can also be extended as in the
model of Henderson et al. (2000), and there can even be more than
one longitudinal measurement (Lin, McCulloch and Mayne, 2002).
However, in many practical situations, the random effects for
the initial value of the measurement (the intercept) and for the slope
of the longitudinal covariate over time (e.g., age) are the most
important terms.
5.3 Methods
The four methods of LVCF, TEL, two-stage and joint model have
been reviewed briefly in Section 2.4.2. We discuss the methods further here with some illustrations.
Suppose we have event history data and longitudinal data as in
Figure 5.1. To construct LVCF and TEL for the individual with
id = 1, we have to know the covariate value for this individual at
the event times 3, 5, 6, and 9.
The values obtained by the LVCF method are shown by the symbol ◦ in
Figure 5.1. They are obtained by assuming that the most recent
measurement value is the value at the event time. It is possible that
an event time coincides with a measurement time (as at t = 6, the
symbol ◦⋆). The time elapsed since the last measurement (TEL) is
the length of the horizontal line connecting the actual measurement
and the event time.
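The construction of the LVCF value and of TEL can be sketched in a
few lines of Python. The measurement times and values below are
illustrative, in the spirit of the individual with id = 1 in
Figure 5.1, and the function name is ours. A measurement taken exactly
at an event time is used for that event time, as in the description
above.

```python
from bisect import bisect_right


def lvcf_and_tel(meas_times, meas_values, event_times):
    """For each event time, return (time, LVCF value, TEL).

    meas_times must be sorted. If no measurement is available yet,
    the value and the elapsed time are reported as None.
    """
    out = []
    for t in event_times:
        # Index of the most recent measurement taken at or before t.
        k = bisect_right(meas_times, t) - 1
        if k < 0:
            out.append((t, None, None))
        else:
            out.append((t, meas_values[k], t - meas_times[k]))
    return out


imputed = lvcf_and_tel([2, 4, 6, 8], [2.9, 4.2, 5.0, 4.0], [3, 5, 6, 9])
```

Here the event time 6 coincides with a measurement time, so the
carried-forward value is the measurement itself and TEL is zero.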
The LVCF and TEL methods are used in the ordinary Cox regression as
$$\lambda(t \mid Z, \tilde{Y}_t) = \lambda_0(t) \exp(\beta Z + \tilde{Y}_t), \tag{5.5}$$
$$\lambda(t \mid Z, \tilde{Y}_t, \tau) = \lambda_0(t) \exp(\beta Z + \tilde{Y}_t f_1(\tau_t) + f_2(\tau_t)), \tag{5.6}$$
respectively, where Z are fixed-time covariates; Ỹt denotes the value
obtained by LVCF, τt is the TEL at t, and f1 and f2 are suitable
functions.
The LVCF method is known to give biased estimates of the parameters (Prentice, 1982). However, we believe that LVCF is commonly used in practice. The Ỹt could be a good predictor of hazard
when the effect of the longitudinal measurements on event time is
delayed.
The idea of the TEL method is that the measurements could be
“aging” and new information closer to the event time is better than
old information. When the measurements are irregularly observed,
the value of τt (TEL) could carry information about the subject’s
disease progression (Bruijne et al., 2001). However, it is not likely
to be an added advantage for regular measurements such as in the
ZINAK study. The added difficulty in using TEL instead of LVCF is
in specifying f1 and f2. Apart from that, the skills and tools needed
for TEL are the same as for modeling the ordinary Cox regression. This
method could be a practical alternative to more sophisticated
methods.
The two-stage method is mentioned briefly in Section 2.4.2. The
main idea of the method is to reconstruct the covariate function given
the observed values of the covariate. In the two-stage approach,
Cox’s proportional hazards model becomes
$$\lambda(t \mid Z, Y_t) = \lambda_0(t) \exp(\beta_1 Z)\, E\left[\exp(\beta_2 Y^{\star}(t)) \mid K(t) = 1, Y_t, Z\right], \tag{5.7}$$
where Z are fixed-time covariates always available at the event times;
Yt denotes the observed values (the Y(t) in model (5.2)); and K(t) =
I{T ≥ t} is an at-risk process indicator function (the usual notation
Y(t) for the at-risk process has been reserved for the longitudinal
measurements). Tsiatis et al. (1995) used a first-order approximation
of the conditional expectation in model (5.7).
The LVCF, TEL and two-stage methods basically assume the
first model discussed in Section 5.2 (Equations (5.1), (5.2), and
(5.3)), where the central problem is imputing the missing
longitudinal covariate values at the event times. The two-stage method
is more computationally demanding than the others since it needs to
fit a longitudinal model at each event time.
The joint model method assumes the joint model of the time-to-event
and longitudinal measurements processes induced by a latent
process discussed in Section 5.2. We have mentioned several methods
to fit the model in Section 2.4.2. Basically, the joint method
maximizes the likelihood function of the longitudinal and hazard
models simultaneously. Theoretically, it has many desirable
properties, such as less biased parameter estimates, efficient use
of the data and easier model validation (Tsiatis et al., 1995;
Henderson et al., 2000; Ibrahim, Chen and Sinha, 2001; Tsiatis and
Davidian, 2004). Practically, computational tools for this model are
still lacking and the model is computationally
demanding (Do, 2002).
In addition to the methods discussed above, we propose two
methods based on Cox’s model with stratification and Cox’s model
with frailty. We call them Cox-strata and Cox-frailty, henceforth.
The main idea of both methods is adjustment for the longitudinal
covariate when the covariate is considered as a nuisance variable. For
example, when the interest is not in the effect of weight on
morbidity but in the effect of other variables such as
supplementation, gender or maternal education, these methods could be
a reasonable choice.
Basically we assume a constant multiplicative effect of exp(Y ⋆ (t))
on the baseline hazard function over time. This assumption may be
violated when we use Y (t) as a proxy of Y ⋆ (t). Stratification is the
usual approach to deal with non-proportionality.
In this longitudinal measurements problem, the Y(t) is
time-dependent, therefore the stratification is actually a
time-dependent stratification. Cox’s model with time-dependent strata
is
$$\lambda(t \mid Z, Y(t)) = \lambda_{0 y_j}(t) \exp(\beta Z), \quad \text{if } Y(t) = y_j, \tag{5.8}$$
where yj is the value of Y(t), j = 1, 2, . . . , V, and V is the number
of unique values of Y(t).
In practice, when V is large, the Cox-strata method may not be
feasible and the precision of estimated coefficients may be low. To
overcome this, we may categorize Y (t) such that the size of V is
reasonable.
The term exp(β2 Y ⋆ (t)) in model (5.3) which is a random variable
instead of a fixed variable also leads naturally to Cox’s model with
frailty (Section 2.2.5, Equation (2.13)). In this case, clusters in this
frailty model are the longitudinal measurements, therefore we use the
value of the longitudinal measurements as a categorical variable, the
same as yj in the Cox-strata method. In fact, this frailty approach
is one alternative solution when the V in the Cox-strata method is
large.
Practically, we may use the value obtained by the LVCF method
or by the two-stage method for the value of Y (t). The problem of
specifying the distribution of the frailty effects is the same as in
any Cox’s frailty model. We have discussed this problem for the
childhood mortality model (Chapter 3) and the multiple time scales
problem (Chapter 4).
5.4 Simulation studies
The purpose of the simulation study is to investigate the performance
of the methods discussed in the previous section. We compared the
LVCF, TEL, two-stage, Cox-frailty and Cox-strata methods. The
joint model was not included in the comparison since, as we have
noticed in the previous section, it requires heavy computation and
is not feasible for large simulations.
The simulated data for each individual consist of event history
data (t0, t1], event with one fixed covariate Z and longitudinal data
(t, Yt), similar to the illustration in Figure 5.1. The fixed covariate
could be supplementation, gender or maternal education as in the
ZINAK study. The details of the simulation procedures are found in
Appendix A-3. Simulations were performed in R (Ihaka and Gentleman,
1996; R Development Core Team, 2004).
We look at β̂1, the estimated coefficient of Z, as one criterion
of method performance. The coverage, i.e., the percentage of
replications in which the interval estimate includes the true
parameter, was calculated. Proportionality of the hazard model is
another criterion to be investigated. Additionally we also looked at
β̂2, the estimated coefficient of the longitudinal measurements Y,
for the LVCF, TEL and two-stage methods.
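The summary measures reported for each method can be sketched as
follows. The sketch assumes Wald-type intervals, estimate ± 1.96 ×
standard error; the interval actually used in the simulations is not
specified here, so this choice, the function name and the toy numbers
are ours.

```python
import statistics


def summarize(estimates, std_errors, true_value, z=1.96):
    """Return (mean, sd, coverage in percent) over the replications.

    Coverage is the percentage of intervals estimate +/- z * se
    that contain the true parameter value.
    """
    covered = sum(
        abs(b - true_value) <= z * se
        for b, se in zip(estimates, std_errors)
    )
    return (
        statistics.mean(estimates),
        statistics.stdev(estimates),
        100.0 * covered / len(estimates),
    )


mean_b, sd_b, coverage = summarize(
    [1.0, 1.2, 1.4, 2.0], [0.2, 0.2, 0.2, 0.2], true_value=1.2)
```

In this toy input three of the four intervals contain the true value
1.2, so the coverage is 75 percent.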
Tables 5.1 and 5.2 show the results of the simulations based on
the two main models discussed in Section 5.2. When there is no
covariate effect in the hazard model (5.3) or no latent association in
model (5.4), that is β2 = 0, all methods arrived at similar results.
The performance of the methods in estimating β1 was similarly good,
with coverages close to 95%. The performance with respect to the
proportionality assumption was also good for all methods, but the
TEL method generally has better hazard proportionality than the
others.
For the LVCF, TEL and two-stage methods, we can also look at
the performance based on the estimates of β2. The β2 was almost
perfectly estimated, with means of 0.003, 0.002, and 0.002 and
coverage of 94.9%, 95.7% and 94.1% for the LVCF, TEL and two-stage
methods, respectively.

Table 5.1: Simulation study for Cox’s time-dependent covariate
model analyzed with the LVCF, TEL, two-stage, Cox-frailty and
Cox-strata methods. See the text and Appendix A-3 for the simulation
specifications. Each value is calculated based on a sample of
size 50 with 500 replications

               β2 = 0                        β2 = −0.1
Method         β̄1 (sd)        p1     zph     β̄1 (sd)        p1     zph
LVCF           1.23 (.170)    94.3   .477    1.24 (.232)    95.2   .498
TEL            1.23 (.170)    94.1   .547    1.24 (.233)    95.2   .580
Two-stage      1.22 (.167)    95.5   .507    1.24 (.236)    95.3   .505
Cox-strata     1.23 (.271)    94.7   .510    1.24 (.361)    95.0   .492
Cox-frailty    1.23 (.168)    95.3   .492    1.24 (.232)    95.6   .490

β̄1 is the mean of the estimated coefficient β̂1 (true value β1 = 1.2) and
their standard deviation (sd) in parentheses
p1 is the coverage (percentage) of the interval estimation
zph is the mean of the proportional hazards test p-value

Table 5.2: Simulation study for joint model analyzed with the LVCF,
TEL, two-stage, Cox-frailty and Cox-strata methods. See the text
and Appendix A-3 for the simulation specifications. Each value is
calculated based on a sample of size 50 with 500 replications

               β2 = 0                        β2 = 1
Method         β̄1 (sd)        p1     zph     β̄1 (sd)        p1     zph
LVCF           1.20 (.193)    94.8   .512    1.01 (.254)    78.4   .429
TEL            1.20 (.196)    94.6   .589    1.01 (.257)    78.4   .510
Two-stage      1.22 (.203)    94.5   .493    1.02 (.259)    78.6   .441
Cox-strata     1.22 (.365)    96.0   .515    1.03 (.447)    85.1   .478
Cox-frailty    1.20 (.160)    95.5   .508    1.04 (.207)    84.5   .476

β̄1 is the mean of the estimated coefficient β̂1 (true value β1 = 1.2) and
their standard deviation (sd) in parentheses
p1 is the coverage (percentage) of the interval estimation
zph is the mean of the proportional hazards test p-value
When the data were generated by Cox’s time-dependent covariate
model and the effect of the covariate Y was present, i.e., β2 = −0.1
(Table 5.1), the performance of all methods was as good as with
β2 = 0. It is rather surprising that the LVCF method is about as good
as the two-stage method at estimating β1 in this situation, except
perhaps for its proportionality. The LVCF method is probably good when
the time-dependent covariate follows a linear model with low gradient,
as specified for Y in this simulation.
For the LVCF, TEL and two-stage methods, the estimates of
β2 were rather poor, with means of -0.005, -0.013, and -0.005,
respectively, and coverage of 82.2%, 89.8% and 83%, respectively. This
problem was due to the measurement errors of Y. The estimation of β2
was slightly better with the TEL method.
When the data were generated from the joint model with latent
association (Table 5.2), the estimates of β1 were biased for all
methods. Here, the Cox-strata and Cox-frailty methods are slightly
better than the LVCF, TEL and two-stage methods, with larger coverage
probability. The performance went down dramatically for the LVCF, TEL
and two-stage methods in estimating β2. Their estimates were close
to zero and far from the true value 1 of β2. Their coverage was close
to zero as well. These severe underestimations were largely caused
by the misspecification of Y.
5.5 Application to infant respiratory infection and weight data
We continue the analysis of the infant respiratory infection and
weight data from the ZINAK study introduced in Chapter 3. We included
the weight longitudinal covariate in the hazard model and analyzed
the data using the LVCF, TEL, two-stage, Cox-strata, Cox-frailty,
and joint model methods.
We used both the Andersen-Gill model (AG model) and the gap-time
model to specify the model of repeated events. We expect
improvements over the models presented in Table 3.7 (the AG model)
and Table 3.8 (the gap-time model) by using these methods.
The implementation of the LVCF method on the data is straightforward
but rather tedious. First we have to arrange the data by
event-time splitting (see the example in Figure 5.1, Section 5.2).
There were 2,423 records for this repeated time-to-event data set,
collected from 666 subjects. The number of records grew considerably,
to 104,838 records, after splitting. The missing values of weight at
event times were then imputed by the values of weight from the 3,770
records of the weight data set.
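Event-time splitting, and the reason why the number of records grows
so quickly, can be illustrated with a short Python sketch. Each
counting process record is cut at every distinct event time that falls
strictly inside it, so that a covariate value can be attached at each
event time. The function name is ours, and the six toy records are
patterned on the small example of Figure 5.1(b).

```python
def split_at_event_times(records):
    """Cut each (id, t0, t1, event) record at all distinct event times
    lying strictly between t0 and t1."""
    event_times = sorted({t1 for _, _, t1, ev in records if ev == 1})
    out = []
    for sid, t0, t1, ev in records:
        cuts = [t for t in event_times if t0 < t < t1]
        bounds = [t0] + cuts + [t1]
        for a, b in zip(bounds[:-1], bounds[1:]):
            # Only the final piece keeps the original event indicator.
            out.append((sid, a, b, ev if b == t1 else 0))
    return out


toy = [
    (1, 0, 6, 1), (1, 7, 10, 0),
    (2, 0, 3, 1), (2, 4, 9, 1),
    (3, 0, 5, 1), (3, 7, 10, 0),
]
split = split_at_event_times(toy)
```

For these six records and four distinct event times the split data
set has 13 records; with many subjects and many events the growth is
correspondingly larger.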
The τ (the time since the most recent measurement) in the TEL
method can be directly calculated from the LVCF data set. There
are many alternatives to model f1 and f2, parametrically or
non-parametrically. We consider parametric models of the exponential
form.
Cox and Oakes (1984) and Bruijne et al. (2001) considered the
exponential form
$$\beta_1 Y_{(t-\tau)} + \beta_2 Y_{(t-\tau)} \exp(C\tau) + \beta_3 \exp(-D\tau), \tag{5.9}$$
where Y(t−τ) is the measurement obtained by the LVCF method,
τ is the time since the most recent measurement, and C and D are
constant parameters which give the highest maximized log-likelihood
of the model. We do not elaborate on this form and chose the simplest,
C = D = 1. As we can see later, this simple form of the TEL model
had the best fit compared to the LVCF and two-stage methods.
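As a small worked illustration, the contribution of the
carried-forward value and the elapsed time to the log hazard under
the form (5.9), exactly as written above, can be computed as below.
The function name and the coefficient values are hypothetical and are
not estimates from the ZINAK data.

```python
import math


def tel_linear_predictor(y_last, tau, b1, b2, b3, c=1.0, d=1.0):
    """Evaluate b1*Y + b2*Y*exp(C*tau) + b3*exp(-D*tau), where Y is
    the carried-forward (LVCF) value and tau the time since it was
    measured. Defaults C = D = 1 follow the simple choice in the text."""
    return (
        b1 * y_last
        + b2 * y_last * math.exp(c * tau)
        + b3 * math.exp(-d * tau)
    )


# At tau = 0 both exponentials equal 1, so the terms reduce to
# (b1 + b2) * Y + b3.
value_at_zero = tel_linear_predictor(4.0, 0.0, b1=0.5, b2=0.1, b3=0.2)
```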
In the two-stage method, a linear random effects model was used
for the weight growth curve model as in Equation (3.2) but only
included gender as a fixed covariate. The two-stage method was the
most time consuming in the data preparation since a new model had
to be fitted at each event time.
The Cox-strata and Cox-frailty methods used the actual value of
weight obtained by the LVCF method as the stratum or cluster in
the models. The weight data were measured with 2-digit precision.
For Cox-strata, the values were rounded to 1 digit, giving about
90 unique values of weight as strata.
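The coarsening of the carried-forward weights into strata can be
sketched as follows, assuming that “digits” refers to decimal places.
The weights in the example are made up, and the function name is ours.

```python
from collections import Counter


def weight_strata(weights, digits=1):
    """Round weights to the given number of decimals and count the
    number of records falling into each resulting stratum."""
    return Counter(round(w, digits) for w in weights)


strata = weight_strata([3.14, 3.12, 3.26, 4.07, 4.01])
n_strata = len(strata)
```

Fewer decimals give fewer and larger strata, which is the trade-off
between feasibility and precision mentioned for the Cox-strata method
in Section 5.3.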
The results from the LVCF, TEL and two-stage methods were
actually quite similar; the LVCF and two-stage methods were especially
close. The likelihood ratio test in Table 5.3 shows that the
two-stage method was only slightly better than the LVCF method
but the TEL method clearly had the best goodness of fit of the
three, both for the Andersen-Gill and gap-time repeated events
models. We only present the result from the TEL method in the
subsequent discussion.
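The p-value of a likelihood ratio test is the upper tail probability
of a chi-square distribution evaluated at the deviance. The sketch
below computes it with the Python standard library only, using the
closed forms for integer degrees of freedom; the function name is
ours. For instance, a deviance of 13.4 on 3 degrees of freedom gives a
p-value of about .004.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability P(X > x) for a chi-square variable with
    integer df >= 1, via the recurrence for the incomplete gamma
    function Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)."""
    y = x / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(y))  # Q(1/2, y)
    else:
        a, q = 1.0, math.exp(-y)             # Q(1, y)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


p_one_df = chi2_sf(2.9, 1)
p_three_df = chi2_sf(13.4, 3)
```

The degrees of freedom equal the number of terms added to the
reference model, for example three for the TEL model (weight, the
exponential of TEL, and their interaction).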
The results from the Cox-frailty method were almost identical to
the previous AG and gap-time models (Table 3.7 and Table 3.8). The
random effect variance was very small and its likelihood ratio test
was not significant. The results from Cox-strata generally included
wider confidence intervals than those from the other models. We only
present the results from the Cox-strata together with the results from
the TEL method.
Tables 5.4 and 5.5 present the results of the fitted model using
the TEL and Cox-strata methods from the previous models in Tables
3.7 and 3.8. Note that the reference categories in the variables were
omitted for conciseness.

Table 5.3: Likelihood ratio test for the LVCF, TEL and two-stage
models compared to the model in Table 3.7 for the AG model and
Table 3.8 for the gap-time model

               AG model                   gap-time model
Method         Deviance (df)   p-val      Deviance (df)   p-val
LVCF           2.9 (1)         .087       12.7 (1)        .0004
TEL            13.4 (3)        .004       22.0 (3)        .00005
two-stage      3.1 (1)         .076       13.1 (1)        .0002
The estimates for gender, supplementation and maternal education
were similar to those in Table 3.7 (the AG model without the weight
variable and τ). Maternal education and gender made important
contributions to the model. The Cox-strata method was generally
conservative, with wider confidence intervals than those of the TEL
method and the results in Table 3.7. The proportionality assumptions
for these models were checked, and there was no indication of a
violation of proportional hazards.
The weight variable was only slightly important in the model, but
its statistical interaction with exp(τ) was very significant. Weight
seemed to have a protective effect against respiratory infections in
infants. Removing weight and its interaction with τ from the model
did not improve the fit; therefore weight, τ and their interaction
were all kept in the model. Similar results were found for the
gap-time repeated event model of Table 5.5.
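One way to read the interaction is to compute the hazard ratio for a unit increase in weight at different values of tel. The sketch below assumes that the linear predictor contains the terms weight, exp(tel) and weight*exp(tel) exactly as listed in Table 5.4, so that the hazard ratio per unit of weight is exp(β_weight) times exp(β_interaction) raised to the power exp(tel). The point estimates are taken from the TEL column of Table 5.4; the grid of tel values is arbitrary, and the unit of tel is not restated in this section.

```python
import math

# Point estimates exp(beta) from the TEL column of Table 5.4
HR_WEIGHT = 0.96
HR_INTERACTION = 1.002


def weight_hazard_ratio(tel):
    """Hazard ratio per unit increase in weight at a given value of tel.

    Assumes the linear predictor
        b_w * weight + b_t * exp(tel) + b_wt * weight * exp(tel),
    so the ratio is exp(b_w + b_wt * exp(tel)).
    """
    return HR_WEIGHT * HR_INTERACTION ** math.exp(tel)


# Arbitrary illustrative grid of tel values
for tel in (0.0, 1.0, 2.0, 3.0):
    print(f"tel = {tel:.1f}: hazard ratio per unit weight = {weight_hazard_ratio(tel):.4f}")
```

Under these assumptions the hazard ratio moves towards one as tel grows, which would mean that an older weight measurement carries less information about the current risk.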
Table 5.4: Hazards model for respiratory infection using the
Andersen-Gill model

Parameter           TEL method               Cox-strata method
girl                0.91  (0.81, 0.98)       0.89 (0.80, 0.99)
zinc                1.01  (0.88, 1.15)       1.02 (0.89, 1.17)
zinc+iron           0.91  (0.80, 1.04)       0.93 (0.80, 1.06)
iron                0.96  (0.85, 1.10)       0.96 (0.83, 1.10)
6 years             0.85  (0.64, 1.11)       0.86 (0.64, 1.16)
9 years             0.71  (0.54, 0.94)       0.70 (0.52, 0.95)
12 years or more    0.48  (0.30, 0.77)       0.50 (0.30, 0.82)
weight              0.96  (0.91, 1.01)
exp(tel)            0.99  (0.98, 1.00)
weight*exp(tel)     1.002 (1.001, 1.003)

tel is the time since the most recent measurement of weight
The estimated parameters are presented as exp(β̂)

Table 5.5: Hazards model for respiratory infection using the gap-time
model

Parameter           TEL method               Cox-strata method
girl                0.87  (0.79, 0.96)       0.87 (0.79, 0.96)
zinc                1.01  (0.88, 1.15)       0.99 (0.87, 1.13)
zinc+iron           0.91  (0.80, 1.04)       0.91 (0.79, 1.04)
iron                0.97  (0.85, 1.10)       0.95 (0.83, 1.08)
6 years             0.85  (0.65, 1.12)       0.82 (0.62, 1.08)
9 years             0.72  (0.54, 0.95)       0.69 (0.52, 0.91)
12 years or more    0.48  (0.30, 0.77)       0.46 (0.29, 0.75)
weight              0.92  (0.87, 0.96)
exp(tel)            0.99  (0.99, 1.00)
weight*exp(tel)     1.002 (1.001, 1.003)

tel is the time since the most recent measurement of weight
The estimated parameters are presented as exp(β̂)

Finally, we compared the above methods with the joint model
method. The hazard model was specified as an exponential hazard
model including all the variables in Table 5.4 or 5.5 except
the weight and tel variables. As we found in Section 3.4 and
also in this section, the results for the AG model and the gap-time
model were similar, indicating that an exponential baseline hazard
should be adequate. The longitudinal model was specified similarly to
that in Table 3.9 (random effect growth curve model). The estimates
are based on the jointly maximized likelihood of the exponential
hazard model and the linear random effect model. SAS with the NLMIXED
procedure (Guo and Carlin, 2004) was used to fit the joint model.
With 2,423 records of event history data and 3,172 records of
longitudinal data, it took 35 hours to fit the model. The result is
presented in Table 5.6.
The hazard model in the separate analysis column of Table 5.6
is comparable to that of Table 5.5 or Table 3.8, since the exponential
model fitted in the joint model also used gap-time as the time scale.
The longitudinal model in the separate analysis column is the same
as in Table 3.9.
Although the differences were small, the estimated risk ratios for
the hazard model were generally further from one in the joint model
than in the separate model, indicating that the possible frailty
effect had been taken care of by the joint model. For the
longitudinal model, the random effect components were stronger than
in the separate model, which may indicate that possible
under-estimation had been taken care of. The general conclusion about
the effects of the variables, however, is similar to that of the
separate models.
The significant latent association γ gave additional information
about the positive association between the random effects of the two
models. In this type of joint model, we may interpret γ as the
effect of a time-dependent frailty in the hazard model which operates
through age. The positive value indicates that age had a considerable
effect on the hazard of experiencing respiratory infection. We
compared this finding with an ordinary gap-time Cox-frailty model
using age as a frailty term. The analysis was performed by adding the
frailty term age to the TEL model of Table 5.5. We found that the
variance of the random effect was 1.03, with a highly significant
LRT (497.6 with 1 degree of freedom), which confirms
the significant latent association in the joint model.
We summarize the findings for the infants’ respiratory infection
and weight data. Maternal education seemed to be important for
infant respiratory infection but not for weight. Weight was
associated with infant respiratory infection and its duration.
Finally, none of the methods showed a statistically significant
effect of supplementation. However, we have not considered other
important variables such as breastfeeding, food intake and
socio-economic indicators in the model; further analyses with those
variables may be necessary.
5.6 Remarks
Cox based models such as the LVCF, TEL, and two-stage methods
should be good enough in situations where the data come from a Cox
model with a time-dependent covariate. In practice, the TEL method
would be the first choice. The TEL method may even be used when
the measurement is performed only once; in this situation the
two-stage or the joint model methods may be difficult to apply. The
TEL method is also well suited to a switching-treatment type
covariate, where the covariate path is a step function taking only
a few values during the period of observation rather than a
continuous function.
Care must be taken in using a Cox based model with a time-dependent
covariate when the model is misspecified. The Cox-strata or
Cox-frailty methods may be more appropriate when the longitudinal
covariate is regarded as a nuisance variable, for which there is no
need to estimate the effect explicitly. Alternatively, the joint
model method can be used, but this may require complex and heavy
computation.
Table 5.6: Separate and joint model analysis for infant respiratory
infection and weight data

Parameter                Separate analysis            Joint analysis
hazard model
  Intercept               0.64  (0.49, 0.85)           0.53  (0.35, 0.79)
  girl                    0.91  (0.83, 1.00)           0.92  (0.81, 1.05)
  zinc                    1.00  (0.88, 1.14)           0.98  (0.82, 1.17)
  zinc+iron               0.91  (0.79, 1.03)           0.90  (0.75, 1.08)
  iron                    0.97  (0.85, 1.11)           0.98  (0.82, 1.17)
  6 years                 0.84  (0.64, 1.10)           0.86  (0.58, 1.28)
  9 years                 0.70  (0.53, 0.92)           0.72  (0.48, 1.08)
  12 years or more        0.46  (0.28, 0.74)           0.47  (0.25, 0.88)
longitudinal model
  Intercept               6.37  (5.88, 6.86)           6.37  (5.89, 6.84)
  Age                     0.17  (0.17, 0.18)           0.17  (0.16, 0.18)
  girl                   -0.54  (-0.68, -0.40)        -0.54  (-0.68, -0.40)
  zinc                    0.02  (-0.18, 0.22)          0.02  (-0.18, 0.21)
  zinc+iron               0.01  (-0.19, 0.21)          0.01  (-0.18, 0.20)
  iron                    0.01  (-0.19, 0.21)          0.01  (-0.18, 0.20)
  6 years                 0.20  (-0.27, 0.68)          0.20  (-0.26, 0.67)
  9 years                 0.31  (-0.17, 0.79)          0.31  (-0.16, 0.78)
  12 years or more        0.26  (-0.41, 0.94)          0.26  (-0.39, 0.92)
  Illness days           -0.53  (-0.64, -0.41)        -0.53  (-0.65, -0.41)
Random effects
  sd(Intercept)           0.993 (0.923, 1.064)         0.997 (0.927, 1.068)
  sd(Age)                 0.065 (0.061, 0.070)         0.067 (0.062, 0.072)
  corr(Intercept, Age)   -0.617 (-0.860, -0.430)      -0.684 (-0.940, -0.486)
latent association
  γ                                                    0.596 (0.500, 0.692)

For the hazard models, the estimated parameters are presented as exp(β̂)
γ is the parameter specified in Equation (5.4)
Chapter 6
Concluding Remarks
This thesis has contributed several solutions and discussions to the
problems in event history analysis with multiple time scales and
longitudinal measurements, motivated by some epidemiological studies.
The focus is on the Cox regression model, but this is by no means
the solution to all problems. Other approaches such as the parametric
proportional hazards, additive hazards and accelerated failure time
models, which have been omitted from the discussion, deserve
attention. Problems similar to those presented in this thesis will
certainly appear in those approaches as well.
We have presented methods for choosing the time scale in the
Cox regression model based on the proportional hazards test and
the frailty model. Although the methods are inferential, we suggest
using the methods as exploratory tools together with consideration
of the scientific background of the data.
When several time scales are considered to be important and the
model is a pure bivariate or multivariate time scale model, the Cox
model with time-dependent strata or the piecewise constant hazards
approach is suggested. The price is that, in the Cox model
with time-dependent strata, the effect of the time scale cannot be
quantified by means of the estimated regression coefficients, and the
piecewise constant hazards model is only an approximation.
A general methodology for this multivariate time scale problem still
needs more investigation. The developments since the review by
Andersen et al. (1993, Chapter X) are the non-parametric estimation of
the bivariate survivor function (Prentice, 1999; Gentleman and
Vandal, 2002) and a more theoretical grounding by Ivanoff and Merzbach
(2002).
We have presented comparisons of several widely used methods
for dealing with longitudinal measurements in event history analysis,
together with two proposed methods. Comparison by simulation
showed that the time elapsed measurement time method (TEL) performed
well when the data came from the Cox model with a time-dependent
covariate. The two proposed methods, based on Cox’s
model with stratification and frailty, may be useful when the data are
suspected to cause misspecification in the Cox model. In the
comparison by simulation we have left out the joint model, a promising
method that unfortunately requires heavy and complex computation.
The joint model is not yet in a mature state of development,
especially in its computing aspects. Further research is certainly
needed. An estimation method for generalized linear latent variable
models (Huber, Ronchetti and Victoria-Feser, 2004) seems promising
for estimating the joint model. Another urgent topic for future
research is diagnostic tools for the joint model, an area that is
still in its infancy.
Finally, any developed methods should have a real advantage in
practice. We have performed several analyses by the discussed methods using epidemiological surveillance and randomized trial data. We
have confirmed the results obtained by the original investigators and
contributed additional insights to their findings.
Bibliography
Aalen, O. (1978). Nonparametric inference for a family of counting
processes, The Annals of Statistics 6: 701–726.
Andersen, P. (2003). Two encyclopedia contributions: Time-dependent
covariate, Technical report, Department of Biostatistics, Institute of
Public Health, University of Copenhagen.
Andersen, P. K. (1991). Survival analysis 1982-1991: The second decade
of the proportional hazards regression model, Statistics in Medicine
10: 1931–1941.
Andersen, P. K., Borgan, Ø., Gill, R. D. and Keiding, N. (1993). Statistical
Models Based on Counting Processes, Springer-Verlag Inc.
Andersen, P. K. and Keiding, N. (2002). Multi-state models for event
history analysis, Statistical Methods in Medical Research 11(2): 91–
115.
Andersen, P. K. and Liestøl, K. (2003). Attenuation caused by infrequently
updated covariates in survival analysis, Biostatistics 4: 633–649.
Bailey, K. R. (1984). Asymptotic equivalence between the Cox estimator
and the general ML estimators of regression and survival parameters
in the Cox model, The Annals of Statistics 12: 730–736.
Bates, D. M. and Pinheiro, J. (1998). Computational methods for multilevel
models, Technical memorandum bl0112140-980226-01tm, Bell Labs,
Lucent Technologies, Murray Hill, NJ.
Berzuini, C. and Clayton, D. (1994a). Bayesian analysis of survival on
multiple time scales, Statistics in Medicine 13(8): 823–838.
Berzuini, C. and Clayton, D. (1994b). Bayesian analysis of survival on
multiple time scales, Statistics in Medicine 13: 823–838.
Bhandari, N., Bahl, R., Mazumdar, S., Martines, J., Black, R., Bhan, M.
and Infant Feeding Study Group (2003). Effect of community-based
promotion of exclusive breastfeeding on diarrhoeal illness and growth:
A cluster randomised controlled trial, Lancet 361: 1418–1423.
Black, R., Morris, S. and Bryce, J. (2003). Where and why are 10 million
children dying every year?, Lancet 361: 2226–2234.
Broström, G. (2002). Cox regression; ties without tears, Communications
in Statistics, Part A – Theory and Methods 31(2): 285–297.
Bruijne, M. H. J. d., Cessie, S. l., Kluin-Nelemans, H. C. and Houwelingen,
H. C. v. (2001). On the use of Cox regression in the presence of an
irregularly observed time-dependent covariate, Statistics in Medicine
20(24): 3817–3829.
Central Bureau of Statistics (CBS) [Indonesia], State Ministry of Population/National Family Planning Coordinating Board (NFPCB) and
Ministry of Health (MOH) and Macro International Inc. (MI) (1998).
Indonesia Demographic and Health Survey 1997, CBS and MI,
Calverton, Maryland.
Clayton, D. (1988). The analysis of event history data: A review of progress
and outstanding problems, Statistics in Medicine 7: 819–841.
Commenges, D. (1999). Multi-state models in epidemiology, Lifetime Data
Analysis 5: 315–327.
Cox, D. R. (1972). Regression models and life-tables (with discussion),
Journal of the Royal Statistical Society, Series B, Methodological
34: 187–220.
Cox, D. R. (1975). Partial likelihood, Biometrika 62: 269–276.
Cox, D. R. and Oakes, D. (1984). Analysis of Survival Data, Chapman &
Hall Ltd.
Danardono (2000). Multilevel Model of the Diarrhea Occurrence in Children, Master’s thesis, Department of Biostatistics and Demography,
Faculty of Public Health Khon Kaen University, Thailand.
Danardono (2003). Event history analysis of childhood mortality and morbidity in Purworejo, Indonesia, Statistical Studies 30, Department of
Statistics, Umeå University.
Diggle, P. (1988). An approach to the analysis of repeated measurements,
Biometrics 44: 959–971.
Diggle, P., Heagerty, P., Liang, K.-Y. and Zeger, S. L. (2002). Analysis of
Longitudinal Data, second edn, Oxford University Press.
Do, K.-A. (2002). Biostatistical approaches for modeling longitudinal and
event time data, Clin. Cancer Res. 8(8): 2473–2474.
Doksum, K. A. and Gasko, M. (1990). On a correspondence between models
in binary regression analysis and in survival analysis, International
Statistical Review 58: 243–252.
Duchesne, T. (1999). Multiple Time Scales in Survival Analysis, PhD thesis, University of Waterloo.
Duchesne, T. and Lawless, J. (2000). Alternative time scales and failure
time models, Lifetime Data Analysis 6(2): 157–179.
Efron, B. (2002). The two-way proportional hazards model, Journal of the
Royal Statistical Society, Series B, Methodological 64(4): 899–909.
Farewell, V. T. and Cox, D. R. (1979). A note on multiple time scales in
life testing, Applied Statistics 28: 73–75.
Faucett, C. L. and Thomas, D. C. (1996). Simultaneously modelling censored survival data and repeatedly measured covariates: A Gibbs sampling approach, Statistics in Medicine 15: 1663–1685.
Fleming, T. and Harrington, D. (1991). Counting Processes and Survival
Analysis, Wiley.
Fleming, T. and Lin, D. (2000). Survival analysis in clinical trials: Past
developments and future directions, Biometrics. 56(4): 971–983.
Gentleman, R. and Vandal, A. C. (2002). Nonparametric estimation of the
bivariate CDF for arbitrarily censored data, The Canadian Journal of
Statistics 30(4): 557–571.
Goldstein, H. (1986). Multilevel mixed linear model analysis using iterative
generalized least squares, Biometrika 73: 43–56.
Goldstein, H. (1989). Restricted unbiased iterative generalized least-squares estimation, Biometrika 76: 622–623.
Grambsch, P. and Therneau, T. (1994). Proportional hazards tests and
diagnostics based on weighted residuals, Biometrika 81: 515–526.
Guo, G. and Rodrı́guez, G. (1992). Estimating a multivariate proportional
hazards model for clustered data using the EM algorithm, with an
application to child survival in Guatemala, Journal of the American
Statistical Association 87: 969–976.
Guo, X. and Carlin, B. P. (2004). Separate and joint modeling of longitudinal and event time data using standard computer packages, The
American Statistician 58: 16–24.
Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models,
Chapman and Hall, London.
Hastie, T. and Tibshirani, R. (1986). Generalized additive models, Stat.
Sci. 1: 297–318.
Henderson, R., Diggle, P. and Dobson, A. (2000). Joint modelling of longitudinal measurements and event time data, Biostatistics 1: 465–480.
Holford, T. (1998). Age-period-cohort analysis, in P. Armitage and
T. Colton (eds), Encyclopedia of Biostatistics, John Wiley and Sons,
Ltd.
Hosmer, D. and Lemeshow, S. (1999). Applied Survival Analysis. Regression
Modeling of Time to Event Data, John Wiley and Sons, Inc.
Hougaard, P. (1995). Frailty models for survival data, Lifetime Data Analysis 1: 255–273.
Huber, P., Ronchetti, E. and Victoria-Feser, M.-P. (2004). Estimation of
generalized linear latent variable models, J. R. Statist. Soc. B 66: 893–
908.
Ibrahim, J. G., Chen, M.-H. and Sinha, D. (2001). Bayesian Survival
Analysis, Springer-Verlag Inc.
Ihaka, R. and Gentleman, R. (1996). R: A language for data analysis and graphics, Journal of Computational and Graphical Statistics
5(3): 299–314.
Ivanoff, B. and Merzbach, E. (2002). Random censoring in set-indexed
survival analysis, The Annals of Applied Probability 12: 944–971.
Jewell, N. and Kalbfleisch, J. (1996). Marker processes in survival analysis,
Lifetime Data Analysis 2: 15–29.
Johansen, S. (1983). An extension of Cox’s regression model, International
Statistical Review 51: 165–174.
Jones, M. P. and Crowley, J. (1992). Nonparametric tests of the Markov
model for survival data, Biometrika 79: 513–522.
Kalbfleisch, J. D. and Prentice, R. L. (2002). The Statistical Analysis of
Failure Time Data, second edn, John Wiley and Sons.
Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations, Journal of the American Statistical Association 53: 457–481.
Keiding, N. (1990). Statistical inference in the Lexis diagram, Phil. Trans.
R. Soc. London A 332: 487–509.
Keiding, N. (1999). Event history analysis and inference from observational
epidemiology, Statistics in Medicine 18: 2353–2363.
Kevane, M. and Levine, D. I. (2003). Changing status of daughters in Indonesia, Paper C03-126, Center for International and Development Economics Research, University of California, Berkeley.
http://Repositories.Cdlib.Org/Iber/Cider/C03-126.
Korn, E., Graubard, B. and Midthune, D. (1997). Time-to-event analysis
of longitudinal follow-up of a survey: Choice of the time-scale, Am. J. Epidemiol. 145: 72–80.
Kuczmarski, R., Ogden, C. and Guo, S. (2002). CDC growth charts for the
United States: Methods and development, Vital Health Stat 11 246,
National Center for Health Statistics.
Laird, N. M. and Ware, J. H. (1982). Random-effects models for longitudinal data, Biometrics 38: 963–974.
Lee, Y. and Nelder, J. (2001). Hierarchical generalised linear models: A
synthesis of generalised linear models, random-effect models and structured dispersions, Biometrika 88: 987–1006.
Liang, K. and Zeger, S. (1986). Longitudinal data analysis using generalized
linear models, Biometrika. 73: 13–22.
Liestøl, K. and Andersen, P. (2002). Updating of covariates and choice of
time origin in survival analysis: Problems with vaguely defined disease
states, Statist. Med. 21: 3701–3714.
Lin, H., McCulloch, C. E. and Mayne, S. T. (2002). Maximum likelihood
estimation in the joint analysis of time-to-event and multiple longitudinal variables, Statistics in Medicine 21(16): 2369–2382.
Lin, H., Turnbull, B. W., McCulloch, C. E. and Slate, E. H. (2002). Latent
class models for joint analysis of longitudinal biomarker and event
process data: Application to longitudinal prostate-specific antigen
readings and prostate cancer, Journal of the American Statistical Association 97(457): 53–65.
Lind, T. (2004). Iron and Zinc in Infancy: Results from Experimental
Trials in Sweden and Indonesia, Umeå university medical dissertations, Epidemiology and Public Health Sciences, Department of Public
Health and Clinical Medicine, and Pediatrics Department of Clinical
Sciences, Umeå University, Sweden.
Lindkvist, M. (2000). Added Variable Plots and Influence in Cox’s Regression Model, PhD thesis, Department of Statistics, Umeå University.
Machfudz, S. (1998). Effect of Morbidity on Change in Mid-upper-arm
Circumference in Children Under Five Years of Age. a Cohort Study
in Purworejo, Central Java, Indonesia, Master’s thesis, Department
of Epidemiology and Public Health Umeå University.
Manda, S. (2001). A comparison of methods for analysing a nested frailty
model to child survival in Malawi, Australian & New Zealand Journal of
Statistics 43(1): 7–16.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models (Second
Edition), Chapman & Hall Ltd.
Mosley, W. and Chen, L. (1984). An analytical framework for the study
of child survival in developing countries, Population and Development
Review 10: 25–48. Suppl.
Ng, E. T. M. and Cook, R. J. (1997). Modeling two-state disease processes
with random effects, Lifetime Data Analysis 3: 315–335.
Oakes, D. (1995). Multiple time scales in survival analysis, Lifetime Data
Analysis 1: 7–18.
Pawitan, Y. and Self, S. (1993). Modeling disease marker processes in
AIDS, Journal of the American Statistical Association 88: 719–726.
Pearce, N. (1992). Methodological problems of time-related variables in
occupational cohort studies, Rev Epidemiol Sante Publique 40 Suppl
1: S43–54.
Pebley, A. and Stupp, P. (1987). Reproductive patterns and child mortality
in Guatemala, Demography 24(1): 43–60.
Prentice, R. (1982). Covariate measurement errors and parameter estimates
in a failure time regression model, Biometrika 69: 331–342.
Prentice, R. L. (1989). Surrogate endpoints in clinical trials: Definition
and operational criteria, Statistics in Medicine 8: 431–440.
Prentice, R. L. (1999). On non-parametric maximum likelihood estimation
of the bivariate survivor function, Statistics in Medicine 18: 2517–
2527.
R Development Core Team (2004). R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna,
Austria. ISBN 3-900051-00-3.
http://www.R-project.org
Rabe-Hesketh, S., Yang, S. and Pickles, A. (2001). Multilevel models for
censored and latent responses, Stat. Methods Med. Res. 10: 409–427.
Rice, A., Sacco, L., Hyder, A. and Black, R. (2000). Malnutrition as an underlying cause of childhood deaths associated with infectious diseases
in developing countries, Bulletin of the World Health Organization
78: 1207–1221.
Robins, J. M. (1986). A new approach to causal inference in mortality
studies with sustained exposure periods - application to control of the
healthy worker survivor effect, Mathematical Modelling 7: 1393–1512.
Rochon, J. and Gillespie, B. (2001). A methodology for analysing a
repeated measures and survival outcome simultaneously, Stat. Med.
20(8): 1173–1184.
Sastry, N. (1997). A nested frailty model for survival data, with an application to the study of child survival in northeast Brazil, Journal of
the American Statistical Association 92: 426–435.
Scrimshaw, N. S. (2003). Historical concepts of interactions, synergism and
antagonism between nutrition and infection, J. Nutr. 133: 316S–321S.
The Cebu Study Team (1991). Underlying and proximate determinants of
child health: The Cebu longitudinal health and nutrition study, Am.
J. Epidemiol. 133: 185–201.
Therneau, T. M. and Grambsch, P. M. (2000). Modeling Survival Data:
Extending the Cox Model, Springer-Verlag Inc.
Trussell, J. and Hammerslough, C. (1983). A hazard-model analysis of
the covariates of infant and child mortality in Sri Lanka, Demography
20: 1–26.
Tsiatis, A. A. and Davidian, M. (2004). Joint modeling of longitudinal and
time-to-event data: An overview, Statistica Sinica 14: 809–834.
Tsiatis, A. A., DeGruttola, V. and Wulfsohn, M. S. (1995). Modeling the
relationship of survival to longitudinal data measured with error. Applications to survival and CD4 counts in patients with AIDS, Journal
of the American Statistical Association 90: 27–37.
UNICEF (2003). Child Survival and Health. http://www.childinfo.org/
eddb/health.htm. Accessed October 13, 2003.
van der Laan, M. J. and Robins, J. M. (2003). Unified Methods for Censored
Longitudinal Data and Causality, Springer-Verlag, Inc.
Vaupel, J. W., Manton, K. G. and Stallard, E. (1979). The impact of
heterogeneity in individual frailty on the dynamics of mortality, Demography 16: 439–454.
Wahab, A., Winkvist, A., Stenlund, H. and Wilopo, S. (2001). Infant
mortality among Indonesian boys and girls: Effect of sibling status,
Annals of Tropical Paediatrics 21(1): 66–71.
Wibowo, T. (2000). Does Poor Nutritional Status Lead to Morbidity? A
Longitudinal Study of Infants 6 - 12 Months in Purworejo, Central
Java, Indonesia, Master’s thesis, Department of Epidemiology and
Public Health, Umeå University.
Wilopo, S. and CHN-RL Team (1997). Key Issues on Research Design,
Data Collection and Management. Community Health and Nutrition
Research Laboratory, Faculty of Medicine, Gadjah Mada University,
Reprint Series No. 2, Community Health and Nutrition Research Laboratory, Yogyakarta.
Wulfsohn, M. S. and Tsiatis, A. A. (1997). A joint model for survival and
longitudinal data measured with error, Biometrics 53: 330–339.
Xu, J. and Zeger, S. L. (2001). Joint analysis of longitudinal data comprising repeated measures and times to events, Applied Statistics
50(3): 375–387.
Zeger, S. L. and Karim, M. R. (1991). Generalized linear models with
random effects: A Gibbs sampling approach, Journal of the American
Statistical Association 86: 79–86.
Zeger, S. L. and Liang, K.-Y. (1986). Longitudinal data analysis for discrete
and continuous outcomes, Biometrics 42: 121–130.
Zeger, S. L. and Liang, K.-Y. (1991). Feedback models for discrete and
continuous time series, Statistica Sinica 1: 51–64.
Zeger, S. L., Liang, K.-Y. and Albert, P. S. (1988). Models for longitudinal data: A generalized estimating equation approach, Biometrics
44: 1049–1060. (Correction: V45 P347).
Zohoori, N. and Savitz, D. (1997). Econometric approaches to epidemiologic data: Relating endogeneity and unobserved heterogeneity to
confounding, Ann. Epidemiol 7: 251–257.
Appendix
A-1 Simulating alternative time scale
The simulation procedure for the alternative time scales in Section 4.4.1
is described here. The true duration T is generated by the ordinary Cox
model
λ(t | Z) = λ0 (t) exp(βZ), t > 0,
(A-1)
where λ(t | Z) is the hazard for an individual, λ0 (t) is the baseline hazard, parametrically specified in this simulation, Z is a zero-one fixed time
covariate with coefficient β.
Z is specified by the Bernoulli distribution with probability 0.4 of success and the true value of β is 2. The baseline hazards are specified by
Gompertz, exponential and Weibull hazard functions. Table A-1 shows
the detailed specifications.
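Table A-1: The specification of hazard functions and generation of times T

Baseline       Hazard $\lambda_0(t)$                            Generation of $T$                                                                             Specification
Gompertz       $\theta_1 e^{\theta_2 t}$                        $\frac{1}{\theta_2}\log\left(-\frac{\theta_2}{\theta_1}\frac{\log(u)}{\Psi_i} + 1\right)$     $\theta_1 = 0.15$, $\theta_2 = 2$
exponential    $\theta$                                         $-\frac{\log(u)}{\theta\Psi_i}$                                                               $\theta = 0.85$
Weibull        $\theta_1\theta_2(\theta_2 t)^{\theta_1 - 1}$    $\frac{1}{\theta_2}\left(-\frac{\log(u)}{\Psi_i}\right)^{1/\theta_1}$                         $\theta_1 = 1.2$, $\theta_2 = 0.5$

$\Psi_i = \exp(\beta Z_i)$, $u \sim U(0, 1)$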
After T is generated, T1 and T2 are generated by adding δ1 and δ2 ,
respectively. In the simulation, δ1 is U (0, 1) or exponential(0.5); δ2 is
U (0.5, 2) or exponential(1.25). Samples of size n = 200 individuals were
generated according to this procedure with 1000 replications.
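The generation step in Table A-1 can be sketched in code as follows. This is a minimal illustration, not the program used for the thesis: the function names are hypothetical, only the uniform choices for δ1 and δ2 are shown, and the random seed is arbitrary. The helper baseline_cumulative_hazard is included only so that the inversion can be checked.

```python
import math
import random

BETA = 2.0   # true coefficient of the covariate Z
P_Z = 0.4    # success probability of the Bernoulli covariate

# Parameter values from the "Specification" column of Table A-1
GOMPERTZ = (0.15, 2.0)   # (theta1, theta2)
EXPONENTIAL = 0.85       # theta
WEIBULL = (1.2, 0.5)     # (theta1, theta2)


def baseline_cumulative_hazard(baseline, t):
    """Integral of the baseline hazard lambda_0 from 0 to t."""
    if baseline == "gompertz":
        th1, th2 = GOMPERTZ
        return th1 * (math.exp(th2 * t) - 1.0) / th2
    if baseline == "exponential":
        return EXPONENTIAL * t
    if baseline == "weibull":
        th1, th2 = WEIBULL
        return (th2 * t) ** th1
    raise ValueError(f"unknown baseline: {baseline}")


def draw_duration(baseline, z, u):
    """Duration T for covariate value z and uniform draw u (Table A-1)."""
    psi = math.exp(BETA * z)
    target = -math.log(u) / psi  # value the baseline cumulative hazard must reach
    if baseline == "gompertz":
        th1, th2 = GOMPERTZ
        return math.log(th2 * target / th1 + 1.0) / th2
    if baseline == "exponential":
        return target / EXPONENTIAL
    if baseline == "weibull":
        th1, th2 = WEIBULL
        return target ** (1.0 / th1) / th2
    raise ValueError(f"unknown baseline: {baseline}")


def simulate_sample(baseline, n=200, seed=0):
    """One replication: rows (z, T, T1, T2) using uniform delta_1 and delta_2."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < P_Z else 0
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        t = draw_duration(baseline, z, u)
        rows.append((z, t, t + rng.uniform(0.0, 1.0), t + rng.uniform(0.5, 2.0)))
    return rows
```

Calling simulate_sample once per replication with a different seed would give the 1000 replications of size n = 200 described above.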
A-2 Simulating dual time scales
The simulation procedure for the dual time scales in Section 4.4.2 used
time-dependent covariate models. In general, if we have a Cox model with
a time-dependent covariate
\[ \lambda(t \mid Z(t)) = \lambda_0(t)\Psi(\beta, t), \quad t > 0, \tag{A-2} \]
the duration T can be generated through the relationship between hazard
and survival. If T has distribution function F(t) or survival function
S(t) then U = F(T) or similarly U = S(T) will follow a uniform U(0, 1)
distribution.
Under model (A-2) the cumulative hazard function for T is
\[ G(t) = \Lambda(t \mid Z(s), 0 \le s \le t) = \int_0^t \lambda_0(y)\Psi(\beta, y)\,dy \tag{A-3} \]
so that
\[ S(t) = \exp(-G(t)). \tag{A-4} \]
Now, U = S(T) is U(0, 1). Therefore, solving U = exp(−G(T)) for T
gives what we want.
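When G has no closed-form inverse, the equation U = exp(−G(T)) can still be solved numerically. The sketch below uses plain bisection and is only an illustration of the principle: the simulations in this appendix use closed-form inverses instead, and the function names and tolerance are assumptions of the sketch.

```python
import math


def invert_cumulative_hazard(G, y, upper=1.0, tol=1e-10):
    """Solve G(t) = y for t >= 0 by bisection.

    Assumes G is continuous and non-decreasing with G(0) = 0,
    and that G eventually reaches y.
    """
    # Expand the bracket until it contains the solution
    while G(upper) < y:
        upper *= 2.0
        if upper > 1e12:
            raise ValueError("G does not appear to reach y")
    lower = 0.0
    while upper - lower > tol * max(1.0, upper):
        mid = 0.5 * (lower + upper)
        if G(mid) < y:
            lower = mid
        else:
            upper = mid
    return 0.5 * (lower + upper)


def draw_duration(G, u):
    """Duration T solving u = exp(-G(T)) for a uniform draw u in (0, 1]."""
    return invert_cumulative_hazard(G, -math.log(u))


# Example with a constant hazard of 1.2, where the exact answer is -log(u) / 1.2
print(draw_duration(lambda t: 1.2 * t, 0.5))
```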
Suppose T has hazard function
\[ \lambda(t \mid Z(t + \delta)) = \lambda_0(t)\Psi(\beta, t + \delta), \quad t > 0, \tag{A-5} \]
where λ(t | Z(t + δ)) is the hazard function for an individual with
covariate process Z, λ0(t) is the baseline hazard, parametrically
specified in this simulation, and Ψ(β, t + δ) is specified as
\[ \Psi(\beta, t + \delta) = \exp(\beta_1\eta + \beta_2(t + \delta)), \quad t > 0, \tag{A-6} \]
where β1 and β2 are parameters specified in the simulation, and η and δ
follow certain distributions.
The dual times T1 and T2 can be generated from model (A-5) after
specifying the baseline hazard function λ0. In this simulation, we specify
a constant hazard θ such that (A-4) has a closed form solution,
\[ \lambda(t \mid Z(t + \delta)) = \theta\exp(\beta_1\eta + \beta_2(t + \delta)), \quad t > 0. \tag{A-7} \]
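The cumulative hazard function for an individual with covariate process
Z is
\[
\begin{aligned}
G(t) &= \Lambda(t \mid Z(s), 0 \le s \le t) \\
     &= \int_0^t \theta\exp(\beta_1\eta + \beta_2(y + \delta))\,dy \\
     &= \theta e^{\beta_1\eta + \beta_2\delta}\left[\frac{e^{\beta_2 y}}{\beta_2}\right]_{y=0}^{t} \\
     &= \theta e^{\beta_1\eta + \beta_2\delta}\,\frac{e^{\beta_2 t} - 1}{\beta_2}.
\end{aligned}
\tag{A-8}
\]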
In the simulation study, we specify a constant hazard θ = 1.2, the true
coefficients β1 = 1.5 and β2 = 0 or 1, a zero-one fixed covariate
η ∼ Bernoulli(p = 0.45), and δ following either an exponential
distribution with rate 0.85 or U(0, 2). Using this
specification T1 and T2 can be generated through the inverse of G,
\[
G^{-1}(y) =
\begin{cases}
\dfrac{1}{\beta_2}\log\left(\dfrac{\beta_2 y}{\theta e^{\beta_1\eta + \beta_2\delta}} + 1\right) & \text{for } \beta_2 \neq 0, \\[2ex]
\dfrac{y}{\theta e^{\beta_1\eta}} & \text{for } \beta_2 = 0,
\end{cases}
\tag{A-9}
\]
and T1 = G^{-1}(−log(u)) with u ∼ U(0, 1); T2 = T1 + δ. Samples of size
n = 200 individuals were generated according to this procedure with 1000
replications.
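The generation of the dual times can be sketched as below, using the parameter values stated above. The function names and the uniform_delta switch are hypothetical, and the code is a sketch of the procedure rather than the program used for the thesis. The cumulative hazard from (A-8) is included only so that the inverse in (A-9) can be checked by a round trip.

```python
import math
import random

THETA = 1.2    # constant baseline hazard
BETA1 = 1.5    # coefficient of the fixed covariate eta
P_ETA = 0.45   # Bernoulli success probability for eta


def cumulative_hazard(t, eta, delta, beta2):
    """G(t) from (A-8); for beta2 = 0 it reduces to theta * exp(beta1 * eta) * t."""
    if beta2 == 0:
        return THETA * math.exp(BETA1 * eta) * t
    scale = THETA * math.exp(BETA1 * eta + beta2 * delta)
    return scale * (math.exp(beta2 * t) - 1.0) / beta2


def inverse_cumulative_hazard(y, eta, delta, beta2):
    """G^{-1}(y) from (A-9); intended for the simulated values beta2 = 0 or 1."""
    if beta2 == 0:
        return y / (THETA * math.exp(BETA1 * eta))
    scale = THETA * math.exp(BETA1 * eta + beta2 * delta)
    return math.log(beta2 * y / scale + 1.0) / beta2


def draw_dual_times(beta2, rng, uniform_delta=False):
    """Return (T1, T2) for one individual, with T2 = T1 + delta."""
    eta = 1 if rng.random() < P_ETA else 0
    # delta is either U(0, 2) or exponential with rate 0.85
    delta = rng.uniform(0.0, 2.0) if uniform_delta else rng.expovariate(0.85)
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    t1 = inverse_cumulative_hazard(-math.log(u), eta, delta, beta2)
    return t1, t1 + delta


rng = random.Random(0)  # arbitrary seed
sample = [draw_dual_times(beta2=1, rng=rng) for _ in range(200)]
```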
A-3 Simulating longitudinal measurements and event-time data
The simulation method in Section 5.4 uses the same principle as in A-2,
in which the event times are generated through the inverse of the cumulative hazard function. However, in this simulation a longitudinal model is
involved.
A-3.1 Time-dependent covariate model
This simulation is based on Equations (5.1), (5.2), and (5.3) (Section 5.2).
Specifically we have the longitudinal growth curve model
\[ Y_i^\star(t) = (\alpha_1 + a_{1i}) + (\alpha_2 + a_{2i})t, \quad t > 0,\ i = 1, \dots, n, \tag{A-10} \]
where $Y_i^\star(t)$ are longitudinal measurements. The random coefficients
$a_{1i}$ and $a_{2i}$ are assumed to follow a bivariate Gaussian distribution
with mean zero and variance-covariance matrix Σ.
The measurements are made intermittently for each individual i and
with error, therefore the simulated model for the growth curve is
\[ Y_{ij} = Y_i^\star(t_{ij}) + \epsilon_{ij}, \quad i = 1, \dots, n,\ j = 1, \dots, m, \tag{A-11} \]
where $t_{ij}$, i = 1, . . . , n, j = 1, . . . , m are time points of
measurement. The measurement errors $\epsilon_{ij}$ are assumed to be
mutually independent Gaussian distributed with mean zero and variance
$\sigma_\epsilon$.
The hazard function is modeled as a Cox model with constant baseline
hazard
\[ \lambda_i(t) = \theta\exp(\beta_1 Z_i + \beta_2 Y_i^\star(t)), \quad t > 0,\ i = 1, \dots, n. \tag{A-12} \]
Substituting $Y_i^\star(t)$ from Equation (A-10) and dropping the index i, the
cumulative hazard of (A-12) can be written as
\[ G(t) = K\,\frac{\exp(\beta_2(\alpha_2 + a_2)t) - 1}{\beta_2(\alpha_2 + a_2)}, \quad t > 0, \tag{A-13} \]
where $K = \theta\exp(\beta_1 Z + \beta_2\alpha_1 + \beta_2 a_1)$.
The event times are generated by G^{-1}(−log(u)) with u ∼ U(0, 1) (see
(A-9)). Since the simulation is for repeated events, for one individual we
assume that the inter-event times are generated by the same model but
the time origin is advanced by a certain random amount after each event
time. In the context of morbidity, we call this advancement of the time
origin the duration of illness. For this simulation we choose the
lognormal distribution as the distribution of illness duration.
In the simulation, we specified the parameters of the hazard model as
$\theta = 0.4$ and $\beta_1 = 1.2$, and varied $\beta_2$ over $0$ and $-0.1$; the illness duration
was lognormal(0, 0.3). In the growth curve model, we used the parameter
values $\alpha_1 = 6.5$, $\alpha_2 = 0.17$, $\sigma_\epsilon = 0.2$, and
$$\Sigma = \begin{pmatrix} 0.9 & -0.04 \\ -0.04 & 0.01 \end{pmatrix}.$$
These specified values are roughly equal to the parameter estimates obtained from the ZINAK study, especially those for the weight growth model. The age
time scale is used in the simulation, running from 6 to 12 months, which is
also roughly the same as in the ZINAK study. The counting process style
input, (start, stop], event, is used for the repeated events.
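The following Python sketch shows one possible way to lay out repeated events in the (start, stop], event format over the age window from 6 to 12 months. It is an illustration under stated assumptions, not the procedure used in the thesis: the function name `simulate_repeated_events` and the callable `draw_gap` are hypothetical, lognormal(0, 0.3) is read as log-scale mean 0 and log-scale standard deviation 0.3, and follow-up is assumed to end with a censored interval at 12 months.

```python
import random


def simulate_repeated_events(draw_gap, start=6.0, end=12.0, rng=None,
                             illness_mu=0.0, illness_sigma=0.3):
    """Return a list of (start, stop, event) rows for one individual.

    draw_gap(origin, rng) is a user-supplied function (hypothetical
    interface) returning the waiting time to the next event, measured
    from the current time origin.
    """
    rng = rng or random.Random()
    rows = []
    origin = start
    while origin < end:
        stop = origin + draw_gap(origin, rng)
        if stop >= end:
            # No further event before the end of follow-up: censored row.
            rows.append((origin, end, 0))
            break
        rows.append((origin, stop, 1))
        # Advance the time origin by a lognormal duration of illness.
        origin = stop + rng.lognormvariate(illness_mu, illness_sigma)
    return rows


# Usage: constant hazard 0.4, i.e. the special case beta_2 = 0 and Z = 0.
example = simulate_repeated_events(
    lambda origin, rng: rng.expovariate(0.4), rng=random.Random(3))
```

Passing the waiting-time generator as an argument keeps the sketch neutral about how the hazard depends on the current origin; the individual is simply not at risk during the illness period between one row's stop and the next row's start.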
The longitudinal measurements were generated at defined time
intervals. The measurement time points were $t_{i1}, t_{i2}, t_{i3}$ and were not
exactly the same for all individuals. This was achieved by adding a random
uniform $U(-0.4, 0.4)$ perturbation to the time points 6, 9, 12 for each individual. Samples
of size $n = 50$ individuals were generated according to this procedure, with
500 replications.
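As an illustration, the growth curve part of one replication, (A-10) and (A-11) with the parameter values given above, might be simulated as in the Python sketch below. The function names are hypothetical. Since the text describes $\sigma_\epsilon = 0.2$ as a variance, the error standard deviation is taken as its square root; this is an assumption and should be changed if 0.2 is meant as a standard deviation. The bivariate Gaussian draw uses a hand-written 2 x 2 Cholesky factor so that only the standard library is needed.

```python
import math
import random

ALPHA1, ALPHA2 = 6.5, 0.17          # fixed intercept and slope
S11, S12, S22 = 0.9, -0.04, 0.01    # entries of Sigma
ERROR_SD = math.sqrt(0.2)           # assumption: 0.2 is the error variance


def draw_random_effects(rng):
    """Draw (a1, a2) from N(0, Sigma) via the Cholesky factor of Sigma."""
    l11 = math.sqrt(S11)
    l21 = S12 / l11
    l22 = math.sqrt(S22 - l21 ** 2)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return l11 * z1, l21 * z1 + l22 * z2


def simulate_growth(n=50, seed=1):
    """Return rows (individual, time, observed value) for n individuals."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        a1, a2 = draw_random_effects(rng)
        for base in (6.0, 9.0, 12.0):
            # Individual-specific measurement time around 6, 9, 12 months.
            t = base + rng.uniform(-0.4, 0.4)
            y_true = (ALPHA1 + a1) + (ALPHA2 + a2) * t        # (A-10)
            y_obs = y_true + rng.gauss(0.0, ERROR_SD)         # (A-11)
            rows.append((i, t, y_obs))
    return rows
```

Calling `simulate_growth()` once gives the 150 measurements of a single sample of 50 individuals; repeating it with 500 different seeds would correspond to the 500 replications.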
A-3.2 Joint model
Simulation of the joint model is based on Equations (5.1), (5.2), and (5.4)
(Section 5.2). The procedure for the simulated longitudinal measurements
is similar to that of the time-dependent covariate model, with the following
modification:
$$Y_i(t) = (\alpha_1 + a_{1i}) + (\alpha_2 + a_{2i})\,t + \alpha_3 Z_i + \epsilon_i, \quad t > 0,\ i = 1, \dots, n, \tag{A-14}$$
where now we have $Z_i$ in the model.
The simulated event times were generated from the hazard function
$$\lambda_i(t) = \theta \exp\bigl(\beta_1 Z_i + \beta_2 (a_{1i} + a_{2i}\,t)\bigr), \quad t > 0. \tag{A-15}$$
The cumulative hazard of (A-15) is
$$G(t) = K\,\frac{\exp(\beta_2 a_2 t) - 1}{\beta_2 a_2}, \quad t > 0, \tag{A-16}$$
where $K = \theta \exp(\beta_1 Z + \beta_2 a_1)$. The event times are then generated by
$G^{-1}(-\log u)$ with $u \sim U(0, 1)$.
The duration of illness, $\theta$, $\beta_1$, $\sigma_\epsilon$, and $\Sigma$, as well as the schedule of
measurement times $t_{ij}$, were specified as in the time-dependent
covariate model. The $\alpha$'s were specified as $\alpha_1 = 6.5$, $\alpha_2 = 0.5$, $\alpha_3 = 1.5$,
and $\beta_2$ was varied over $0$ and $1$. Samples of size $n = 50$ individuals were generated
according to this procedure, with 500 replications.
Statistical Studies
issued by
Department of Statistics, Umeå University
SE–901 87 Umeå, Sweden
1. Gustafsson, Lennart: Några aspekter på stickprovsteorier
vid ändliga populationer med tillämpningar på tvåstegsurval
(1968).
2. Pollak, Kay: Variationsskattningar baserade på kvadratiska
former av ordnade variabler, några illustrationer (1969).
3. Cassel, Claes-Magnus: Inferensproblemet vid ändliga populationer, några synpunkter (1970).
4. Wretman, Jan-Håkan: Om inferens vid ändliga populationer
under superpopulationsantagande (1970).
5. Carlsson, Olle: Om fördelningen av en summa av vägda
oberoende Poissonvariabler med tillämpningar inom statistisk
inferensteori och stokastiska processer (1970).
6. Stenlund, Hans och Westlund, Anders: A Monte-Carlo Study
of Some Sampling Designs (1974).
7. Westlund, Anders: Estimation and Prediction Interdependent
Systems in the Presence of Specification Errors (1975).
8. Björnham, Åke och Wiklund, Dan-Erik: Analysis of Fetal
Heart Rate Variability During Labour: Registration, Estimation, and Decision (1976).
9. Hållberg, Bengt: Statistiska modeller för banbrottsfrekvens
hos tryckpapper (1976).
10. Freij, Lennart och Wall, Stig: Exploring Child Health and its
Ecology (1977).
11. Baudin, Anders: On the Application of Short-term Causal
Models (1977).
12. Brännäs, Kurt: On Estimation in Economic System in the
Presence of Time Varying Parameters (1980).
13. Nyquist, Hans: Recent Studies on Lp-Norm Estimation (1980).
14. Törnkvist, Birgitta: Quantifying Structural Change - A Model
Based Approach (1988).
15. Laitila, Thomas: Estimation in Truncated and Censored Regressions (1989).
16. Carlsson, Olle: On Quality Selection (1990).
17. Segerstedt, Bo: On Conditioning and Ridge Estimation in
Generalized Linear Models (1991).
18. Öhman, Marie-Louise: Contributions to Generalized Wilcoxon
Rank Tests (1992).
19. Wiklund, Stig-Johan: Control Charts and Process Adjustments (1994).
20. Arnoldsson, Göran: Generalised Linear Models and Optimal
Design (1994).
21. Öhman, Marie-Louise: Aspects of Analysis of Small-Sample
Right Censored Data Using Generalized Wilcoxon Rank Tests
(1994).
22. Arnoldsson, Göran: Optimal Design for Inference in Generalized Linear Models (1997).
23. Bränberg, Kenny: On Test Score Equating (1997).
24. Häggström, Jonas: The Minimax Approach to Optimum Design of Experiments (2000).
25. Lindkvist, Marie: Added Variable Plots and Influence in Cox’s
Regression Model (2000).
26. Pettersson, Hans: Optimum in Average and Minimax Designs
for Estimation of Generalized Linear Models (2001).
27. Häggström Lundevaller, Erling: Tests of Random Effects in
Linear and Non-Linear Models (2002).
28. Adler, John: Statistical Models for Estimating Career Mobility
(2003).
29. Wiberg, Marie: Computerized Achievement Tests - Sequential
and Fixed Length Tests (2003).
30. Danardono: Event History Analysis of Childhood Mortality
and Morbidity in Purworejo, Indonesia (2003).
31. Puu, Margareta: Optimum Experimental Designs for Generalized Linear Models with Multinomial Response (2003).
32. Appelgren, Jari: Locally D-optimal Designs for Bivariate Logistic Regression (2004).
33. Danardono: Multiple Time Scales and Longitudinal Measurements in Event History Analysis (2005).