# Supplement to the paper (PDF) ```SUPPLEMENT TO “CLUSTERING FOR MULTIVARIATE
CONTINUOUS AND DISCRETE LONGITUDINAL DATA”
By Arnošt Komárek and Lenka Komárková
Charles University in Prague and the University of Economics in Prague
This is a supplement to the paper published by The Annals of
Applied Statistics. It contains (A) more detailed description of the
assumed prior distribution for model parameters giving also some
guidelines for the selection of the hyperparameters to achieve a weakly
informative prior distribution; (B) more details on the posterior distribution and the sampling MCMC algorithm; (C) additional information to the analysis of the Mayo Clinic PBC data; (D) more detailed results of the simulation study.
This is a supplement to the paper “Clustering for multivariate continuous
and discrete longitudinal data” (Komárek and Komárková, 2012). Sections
in the supplement are considered as appendices to the main paper and are
numbered using capital letters. Cross-references in this supplement that do
not contain a capital letter refer to corresponding parts of the main paper.
Appendix A provides a detailed description of the prior distribution in the
Bayesian specification of the mixture multivariate generalized linear mixed
model as well as guidelines for selection of the fixed hyperparameters to
achieve weakly informative prior distribution. Appendix B gives more details on the posterior distribution and the sampling MCMC algorithm. Additional information to the analysis of the Mayo Clinic PBC data is given in
Appendix C. Finally, Appendix D shows additional results of the simulation
study.
Keywords and phrases: Classification, functional data, generalized linear mixed model,
multivariate longitudinal data, repeated observations
A. KOMÁREK AND L. KOMÁRKOVÁ
A. Prior distribution. In this appendix, we provide a detailed overview
of the prior distribution assumed in the Bayesian specification of the mixture
multivariate generalized linear mixed model. We also give some guidelines
for the selection of the hyperparameters to achieve a weakly informative
prior distribution.
A.1. Notation. First, we review and slightly extend the notation:
• $\alpha = (\alpha_1^\top, \ldots, \alpha_R^\top)^\top$: the fixed effects, where $\alpha_r$ is a $c_r \times 1$ vector ($r = 1, \ldots, R$) and $\alpha$ is a $c \times 1$ vector ($c = \sum_{r=1}^{R} c_r$);
• $\phi = (\phi_1, \ldots, \phi_R)^\top$: GLMM dispersion parameters of which some of them might be constants by definition, e.g., for Bernoulli response with a logit link (logistic regression), corresponding $\phi_r$ is equal to 1, and the same is true for Poisson response with a log link. Further, we denote by $\phi^{-1}$ a vector of inverted dispersion parameters, i.e., $\phi^{-1} = (\phi_1^{-1}, \ldots, \phi_R^{-1})^\top$;
• $B = (b_1, \ldots, b_N)$: individual (latent) values of random effects, where $b_i = (b_{i,1}^\top, \ldots, b_{i,R}^\top)^\top$, $b_{i,r}$ ($r = 1, \ldots, R$) are $d_r \times 1$ vectors and $b_i$ ($i = 1, \ldots, N$) are $d \times 1$ vectors ($d = \sum_{r=1}^{R} d_r$);
• $w = (w_1, \ldots, w_K)^\top$: mixture weights from the distribution of random effects;
• $\mu_1, \ldots, \mu_K$: $d \times 1$ vectors of mixture means;
• $D_1, \ldots, D_K$: $d \times d$ mixture covariance matrices;
• $s$: $d \times 1$ shift vector used to calculate the shifted-scaled random effects (see below);
• $S$: $d \times d$ diagonal scale matrix used to calculate the shifted-scaled random effects;
• $B^* = (b_1^*, \ldots, b_N^*)$: individual (latent) values of shifted-scaled random effects, where $b_i^* = (b_{i,1}^{*\top}, \ldots, b_{i,R}^{*\top})^\top$, $b_{i,r}^*$ ($r = 1, \ldots, R$) are $d_r \times 1$ vectors and $b_i^*$ ($i = 1, \ldots, N$) are $d \times 1$ vectors;
• $\mu_1^*, \ldots, \mu_K^*$: $d \times 1$ vectors of mixture means from the distribution of shifted-scaled random effects;
• $D_1^*, \ldots, D_K^*$: $d \times d$ mixture covariance matrices from the distribution of shifted-scaled random effects;
• $u = (u_1, \ldots, u_N)^\top$: (latent) component allocations;
• $\gamma_b = (\gamma_{b,1}, \ldots, \gamma_{b,d})^\top$: random hyperparameter introduced in an additional hierarchy level when specifying the prior distribution for the mixture covariance matrices (see below). Further, let $\gamma_b^{-1} = (\gamma_{b,1}^{-1}, \ldots, \gamma_{b,d}^{-1})^\top$;
• $\gamma_\phi = (\gamma_{\phi,1}, \ldots, \gamma_{\phi,R})^\top$: random hyperparameter introduced in an additional hierarchy level when specifying the prior distribution for the GLMM dispersion parameters (see below). Further, let $\gamma_\phi^{-1} = (\gamma_{\phi,1}^{-1}, \ldots, \gamma_{\phi,R}^{-1})^\top$.
The set of GLMM related parameters further composes a vector $\psi$, and the set of mixture related parameters further composes a vector $\theta$, i.e.,
$$ \psi = (\alpha^\top, \phi^\top)^\top, \qquad \theta = \bigl(w^\top, \mu_1^\top, \ldots, \mu_K^\top, \mathrm{vec}(D_1), \ldots, \mathrm{vec}(D_K)\bigr)^\top. $$
In the following, $\psi$ and $\theta$ are sometimes referenced as frequentist parameters as they are the same as unknown parameters of the corresponding frequentist model. Finally, let $p(\cdot)$ and $p(\cdot \mid \cdot)$ be generic symbols for (conditional) distributions and let $\varphi(\cdot \mid \mu, D)$ be a density of the (multivariate) normal distribution with a mean $\mu$ and a covariance matrix $D$.
A.2. Shifted-scaled random effects. To improve the mixing and numerical stability of the MCMC algorithm described in Appendix B, we introduce shifted-scaled random effects:
$$ b_i^* = S^{-1}(b_i - s), \qquad i = 1, \ldots, N, $$
where $s$ is a $d \times 1$ fixed shift vector and $S$ is a $d \times d$ fixed diagonal scale matrix. It follows from Eq. (4) that
$$ b_i^* \mid \theta^* \;\overset{\text{i.i.d.}}{\sim}\; \sum_{k=1}^{K} w_k\, \mathrm{MVN}(\mu_k^*, D_k^*), \qquad i = 1, \ldots, N, $$
where $\theta^* = \bigl(w^\top, \mu_1^{*\top}, \ldots, \mu_K^{*\top}, \mathrm{vec}(D_1^*), \ldots, \mathrm{vec}(D_K^*)\bigr)^\top$ and
$$ \mu_k^* = S^{-1}(\mu_k - s), \qquad D_k^* = S^{-1} D_k S^{-1}, \qquad k = 1, \ldots, K. \tag{A.1} $$
Without loss of generality, we may set $s = 0$ (zero vector) and $S = I_d$ ($d \times d$ identity matrix). However, for numerical stability of the MCMC which is subsequently used to obtain a sample from the corresponding posterior distribution, it is useful if the shifted-scaled random effects have approximately zero means and unit variances. We return to this point in Sec. A.5.
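The shift-scale transformation and the back-transformation (A.1) can be illustrated with a minimal sketch. It assumes that the diagonal matrix $S$ is stored as a plain list of its diagonal elements; the function names are hypothetical and serve only for illustration.

```python
def to_shifted_scaled(b, s, S_diag):
    """b* = S^{-1} (b - s) for a diagonal S given by its diagonal."""
    return [(bj - sj) / Sj for bj, sj, Sj in zip(b, s, S_diag)]


def from_shifted_scaled(b_star, s, S_diag):
    """Inverse transformation: b = s + S b*."""
    return [sj + Sj * bj for bj, sj, Sj in zip(b_star, s, S_diag)]


def mixture_to_original_scale(mu_star, D_star, s, S_diag):
    """Invert (A.1): mu_k = s + S mu*_k and D_k = S D*_k S."""
    d = len(s)
    mu = [s[j] + S_diag[j] * mu_star[j] for j in range(d)]
    D = [[S_diag[j] * D_star[j][l] * S_diag[l] for l in range(d)]
         for j in range(d)]
    return mu, D


# Illustrative values only (d = 2).
s = [2.0, -1.0]
S_diag = [4.0, 0.5]
b = [6.0, -0.5]
b_star = to_shifted_scaled(b, s, S_diag)
b_back = from_shifted_scaled(b_star, s, S_diag)
mu, D = mixture_to_original_scale([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]],
                                  s, S_diag)
```

Because $S$ is diagonal, both directions reduce to elementwise operations, which is why the posterior sample of $\theta^*$ can be mapped back to $\theta$ at negligible cost.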
A.3. Factorization of the prior distribution. With the MCMC algorithm,
we primarily obtain a sample from the posterior distribution of θ ∗ (parameters of the mixture distribution of shifted-scaled random effects) and then
use the transformation (A.1) to get a sample from the posterior distribution
of θ (parameters of the mixture distribution of the original random effects).
For this reason, all expressions explaining the assumed prior distribution will be expressed in terms of $\theta^*$ rather than $\theta$.
As is usual for hierarchically specified models, the joint prior distribution, which also involves the latent random effects $B$ and component allocations $u$, exploits conditional independence between sets of unrelated parameters. Namely, it is factorized as
$$ p(\psi, \theta^*, B, u) = p(B \mid \theta^*, u) \times p(u \mid \theta^*) \times p(\theta^*) \times p(\psi), \tag{A.2} $$
where
$$ p(B \mid \theta^*, u) = \prod_{i=1}^{N} p(b_i \mid \theta^*, u_i), \qquad p(u \mid \theta^*) = \prod_{i=1}^{N} p(u_i \mid w). \tag{A.3} $$
The products on the right-hand side of Eq. (A.3) follow from assumed independence between subjects/patients. For the prior distribution $p(\theta^*)$ of the mixture related parameters, we shall use a multivariate version of the classical proposal of Richardson and Green (1997) which exploits a factorization:
$$ p(\theta^*) = p(w) \times \prod_{k=1}^{K} \bigl\{ p(\mu_k^*) \times p(D_k^{*-1}) \bigr\}. \tag{A.4} $$
For $p(\psi)$, the last part of the prior expression related to the GLMM parameters, we shall adopt classically used priors in this context (see, e.g., Fong, Rue and Wakefield, 2010) with factorization:
$$ p(\psi) = p(\phi^{-1}) \times p(\alpha) = \prod_{r=1}^{R} p(\phi_r^{-1}) \times p(\alpha). \tag{A.5} $$
In the remainder of this section, we describe our particular choices for the factors in Eqs. (A.3), (A.4) and (A.5). Further, we explain data motivated guidelines for selection of the fixed hyperparameters which may be followed to obtain a weakly informative prior distribution. These guidelines are motivated by the classical proposal of Richardson and Green (1997) in the case of mixture related parameters and the classical proposal for GLMM related parameters (see, e.g., Fong, Rue and Wakefield, 2010).
A.4. Weakly informative prior distribution. To be able to follow the Richardson and Green’s guidelines for the selection of the fixed hyperparameters leading to the weakly informative prior distribution, we have to have some initial idea concerning the location, scale and individual values of the random effects in our multivariate GLMM. This can be obtained by using standard software and calculating separate maximum-likelihood estimates in each of $R$ GLMM models (3) under the classical assumption of normally distributed random effects. In the following, let for each $r = 1, \ldots, R$, $\beta^0_{ML,r}$ be so obtained initial estimate for the overall mean vector of random effects, and $d^0_{ML,r}$ analogously obtained initial estimates for the vector of overall standard deviations of random effects. Further, let for each $r = 1, \ldots, R$, $b^0_{EB,i,r}$, $i = 1, \ldots, N$, be empirical Bayes estimates of individual values of random effects in the $r$th ML estimated GLMM under the assumption of normality of random effects. Furthermore, let
$$ \beta^0_{ML} = \bigl(\beta^{0\top}_{ML,1}, \ldots, \beta^{0\top}_{ML,R}\bigr)^\top, \qquad d^0_{ML} = \bigl(d^{0\top}_{ML,1}, \ldots, d^{0\top}_{ML,R}\bigr)^\top, \qquad b^0_{EB,i} = \bigl(b^{0\top}_{EB,i,1}, \ldots, b^{0\top}_{EB,i,R}\bigr)^\top, $$
$i = 1, \ldots, N$. The right-hand side of the Expression (A.2) is then composed of the factors explained in Sec. A.5–A.11.
A.5. Prior for random effects. $p(b_i \mid \theta^*, u_i)$ is for each $i = 1, \ldots, N$ a (multivariate) normal distribution $N\bigl(s + S\mu^*_{u_i},\, S D^*_{u_i} S^\top\bigr)$, that is,
$$ p(b_i \mid \theta^*, u_i) \propto \bigl|S D^*_{u_i} S^\top\bigr|^{-1/2} \exp\Bigl\{ -\tfrac{1}{2} \bigl(b_i - s - S\mu^*_{u_i}\bigr)^\top \bigl(S D^*_{u_i} S^\top\bigr)^{-1} \bigl(b_i - s - S\mu^*_{u_i}\bigr) \Bigr\}. \tag{A.6} $$
This follows from the assumed mixture model (4). For numerical stability of the MCMC which is subsequently used to obtain a sample from the posterior distribution, it is useful if the shifted-scaled random effects
$$ b_i^* = S^{-1}(b_i - s), \qquad i = 1, \ldots, N, \tag{A.7} $$
have approximately zero means and unit variances. This can be achieved by setting $s = \beta^0_{ML}$ and $S = \mathrm{diag}(d^0_{ML})$. In the sequel, let further $b^{0*}_{EB,i} = S^{-1}(b^0_{EB,i} - s)$, $i = 1, \ldots, N$, be vectors of shifted and scaled values of the initial ML based empirical Bayes estimates of random effects, and let $R^{0*}_{EB} = (R^{0*}_{EB,1}, \ldots, R^{0*}_{EB,d})$ be a vector of sample standard deviations based on elements of $b^{0*}_{EB,i}$ multiplied by some factor $c$, where $c > 1$. In our applications we use $c = 6$. Finally, let $(R^{0*}_{EB})^2 = \bigl((R^{0*}_{EB,1})^2, \ldots, (R^{0*}_{EB,d})^2\bigr)^\top$.
A.6. Prior for component allocations. $p(u_i \mid w)$, $i = 1, \ldots, N$, also follows from the assumed mixture model for the random effects and coincides with the discrete distribution with
$$ P(u_i = k \mid w) = w_k, \qquad k = 1, \ldots, K. $$
A.7. Prior for mixture weights. $p(w)$ is a Dirichlet distribution $D(\delta, \ldots, \delta)$, where $\delta$ is a fixed hyperparameter (prior sample size). That is,
$$ p(w) = \frac{\Gamma(K\delta)}{\Gamma^K(\delta)} \prod_{k=1}^{K} w_k^{\delta - 1}. $$
Weakly informative prior is obtained by setting $\delta$ to a small positive value, e.g., $\delta = 1$.
A.8. Prior for mixture means. $p(\mu_k^*)$, $k = 1, \ldots, K$, is a (multivariate) normal distribution $N(\xi_b, C_b)$, where $\xi_b$ and $C_b$ are fixed prior mean and covariance matrix, respectively, that is,
$$ p(\mu_k^*) \propto |C_b|^{-1/2} \exp\Bigl\{ -\tfrac{1}{2} \bigl(\mu_k^* - \xi_b\bigr)^\top C_b^{-1} \bigl(\mu_k^* - \xi_b\bigr) \Bigr\}, \qquad k = 1, \ldots, K. $$
Following the suggestions of Richardson and Green (1997) adapted to the multivariate GLMM, weakly informative prior is obtained by taking $\xi_b$ to be a sample mean of $b^{0*}_{EB,i}$, $i = 1, \ldots, N$, and $C_b = \mathrm{diag}\bigl((R^{0*}_{EB})^2\bigr)$.
A.9. Prior for mixture covariance matrices. $p(D_k^{*-1} \mid \gamma_b)$, $k = 1, \ldots, K$, is a Wishart distribution $W(\zeta_b, \Xi_b)$, where $\zeta_b$ are fixed prior degrees of freedom and $\Xi_b = \mathrm{diag}(\gamma_b)$ is a diagonal scale matrix with random diagonal elements. That is,
$$ p(D_k^{*-1} \mid \gamma_b) \propto |\Xi_b|^{-\zeta_b/2}\, \bigl|D_k^{*-1}\bigr|^{(\zeta_b - d - 1)/2} \exp\Bigl\{ -\tfrac{1}{2}\, \mathrm{tr}\bigl(\Xi_b^{-1} D_k^{*-1}\bigr) \Bigr\}, \qquad k = 1, \ldots, K. \tag{A.8} $$
To get a weakly informative prior, we set $\zeta_b$ to a small positive number, e.g., $\zeta_b = d + 1$. Further observe that a priori, $E\bigl(D_k^{*-1} \mid \gamma_b\bigr) = \zeta_b \Xi_b$. As noted by Richardson and Green (1997) in the univariate i.i.d. context, it seems restrictive to suppose that a knowledge of the variability of the data, represented by $R^{0*}_{EB}$ in our case, implies much about the size of matrices $D_k^{*-1}$, $k = 1, \ldots, K$, and consequently it is not advisable to take a fixed scale matrix $\Xi_b$ based on $R^{0*}_{EB}$. For this reason, they suggest introducing an additional hierarchical level in the prior distribution. Their proposal adapted to our multivariate setting then takes $\Xi_b = \mathrm{diag}(\gamma_b)$ with independent gamma $G(g_{b,1}, h_{b,1}), \ldots, G(g_{b,d}, h_{b,d})$ priors for $\gamma_{b,1}^{-1}, \ldots, \gamma_{b,d}^{-1}$, the inverted diagonal elements of $\Xi_b$. That is,
$$ p(\gamma_b^{-1}) = \prod_{l=1}^{d} p(\gamma_{b,l}^{-1}) = \prod_{l=1}^{d} \frac{h_{b,l}^{g_{b,l}}}{\Gamma(g_{b,l})} \bigl(\gamma_{b,l}^{-1}\bigr)^{g_{b,l} - 1} \exp\bigl(-h_{b,l}\, \gamma_{b,l}^{-1}\bigr), $$
where $(g_{b,1}, h_{b,1}), \ldots, (g_{b,d}, h_{b,d})$ are fixed hyperparameters. Weakly informative prior is obtained by setting $g_{b,l}$ to a small positive value, e.g., $g_{b,l} = 0.2$, $l = 1, \ldots, d$, and setting $h_{b,l}$ to a moderate multiple of $1/(R^{0*}_{EB,l})^2$, e.g., $h_{b,l} = 10/(R^{0*}_{EB,l})^2$, $l = 1, \ldots, d$.
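The guidelines of Sec. A.5, A.8 and A.9 can be collected in a short sketch. It assumes that the shifted-scaled empirical Bayes estimates $b^{0*}_{EB,i}$ are already available as a list of $N$ vectors; the function name and the returned dictionary are hypothetical, and the defaults ($c = 6$, $g_{b,l} = 0.2$, multiple 10) are the example values quoted in the text.

```python
import statistics


def weakly_informative_hyperparameters(b_eb_star, c=6.0, g_b=0.2,
                                       h_multiple=10.0):
    """b_eb_star: list of N shifted-scaled EB estimate vectors of length d."""
    d = len(b_eb_star[0])
    columns = [[row[l] for row in b_eb_star] for l in range(d)]
    # xi_b: sample mean of the shifted-scaled EB estimates (Sec. A.8)
    xi_b = [statistics.mean(col) for col in columns]
    # R^{0*}_{EB,l}: sample standard deviation multiplied by c (Sec. A.5)
    R = [c * statistics.stdev(col) for col in columns]
    return {
        "xi_b": xi_b,
        "C_b_diag": [r * r for r in R],           # C_b = diag((R^{0*}_EB)^2)
        "zeta_b": d + 1,                          # Wishart degrees of freedom
        "g_b": [g_b] * d,                         # gamma shape parameters
        "h_b": [h_multiple / (r * r) for r in R]  # gamma rate parameters
    }


# Illustrative (made-up) EB estimates for N = 3 subjects and d = 2.
hp = weakly_informative_hyperparameters(
    [[-1.0, 0.0], [0.0, 2.0], [1.0, 4.0]])
```

Only diagonal elements are returned because both $C_b$ and $\Xi_b$ are diagonal in the specification above.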
A.10. Prior for GLMM dispersion parameters. $p(\phi_r^{-1} \mid \gamma_{\phi,r})$, $r = 1, \ldots, R$, is a nondegenerate distribution only for those $r$ for which the dispersion parameter $\phi_r$ of the corresponding GLMM is not constant by definition. With respect to the analysis of PBC910 data where we consider three types of GLMM’s: (i) Gaussian with identity link, (ii) Bernoulli with logit link, (iii) Poisson with log link, this only applies to the Gaussian GLMM with identity link (≡ LMM) for which $\phi_r$ is the residual variance $\sigma_r^2$. In such situations, $p(\phi_r^{-1} \mid \gamma_{\phi,r})$ is a gamma distribution $G\bigl(\zeta_{\phi,r}/2,\, \gamma_{\phi,r}^{-1}/2\bigr)$, that is,
$$ p(\phi_r^{-1} \mid \gamma_{\phi,r}) = \frac{2^{-\zeta_{\phi,r}/2}}{\Gamma\bigl(\zeta_{\phi,r}/2\bigr)}\, \gamma_{\phi,r}^{-\zeta_{\phi,r}/2}\, \bigl(\phi_r^{-1}\bigr)^{(\zeta_{\phi,r} - 2)/2} \exp\Bigl( -\tfrac{1}{2}\, \gamma_{\phi,r}^{-1}\, \phi_r^{-1} \Bigr). $$
Note that with this parameterization of the gamma distribution we have a direct correspondence to the univariate Wishart distribution with the same meaning for $\zeta$ and $\gamma$ parameters as in the prior distribution for the inverted mixture covariance matrices, see (A.8). Weakly informative prior distribution is obtained by setting $\zeta_{\phi,r}$ to a small positive number, e.g., $\zeta_{\phi,r} = 2$. Further, similarly to the prior distribution for the inverted mixture covariance matrices, we will assume that $\gamma_{\phi,r}^{-1}$ is random and assign it a gamma $G(g_{\phi,r}, h_{\phi,r})$ prior distribution, i.e.,
$$ p(\gamma_{\phi,r}^{-1}) = \frac{h_{\phi,r}^{g_{\phi,r}}}{\Gamma(g_{\phi,r})} \bigl(\gamma_{\phi,r}^{-1}\bigr)^{g_{\phi,r} - 1} \exp\bigl( -h_{\phi,r}\, \gamma_{\phi,r}^{-1} \bigr), $$
where $(g_{\phi,r}, h_{\phi,r})$ are fixed hyperparameters. This makes it possible to specify a weakly informative prior distribution while keeping the posterior distribution proper. Weakly informative prior is obtained by setting $g_{\phi,r}$ to a small positive value, e.g., $g_{\phi,r} = 0.2$, and setting $h_{\phi,r}$ to a moderate multiple of $1/(R^{0*}_{res,r})^2$, e.g., $h_{\phi,r} = 10/(R^{0*}_{res,r})^2$, where $R^{0*}_{res,r}$ is a range of residuals from the initial ML fit to the $r$th (G)LMM.
A.11. Prior for fixed effects. $p(\alpha)$ is a (multivariate) normal distribution $N(\xi_\alpha, C_\alpha)$, where $\xi_\alpha$ and $C_\alpha$ are fixed prior hyperparameters, i.e.,
$$ p(\alpha) \propto |C_\alpha|^{-1/2} \exp\Bigl\{ -\tfrac{1}{2} \bigl(\alpha - \xi_\alpha\bigr)^\top C_\alpha^{-1} \bigl(\alpha - \xi_\alpha\bigr) \Bigr\}. \tag{A.9} $$
As usual in the context of hierarchical regression models, weakly informative prior is obtained by setting $\xi_\alpha = 0$ (zero vector) and $C_\alpha$ to a diagonal matrix with large positive numbers, e.g., 10 000, on the diagonal. It is always useful to check that the variance of the (marginal) posterior distribution of $\alpha$ is indeed considerably smaller.
A.12. Joint prior distribution including random hyperparameters. When we incorporate also random hyperparameters $\gamma_b$ and $\gamma_\phi$ into the expression of the joint prior distribution, we obtain:
$$
\begin{aligned}
p(\psi, \theta^*, B, u, \gamma_b, \gamma_\phi)
&= p(B \mid \theta^*, u) \times p(u \mid \theta^*) \times p(\theta^* \mid \gamma_b) \times p(\gamma_b^{-1}) \times p(\psi \mid \gamma_\phi) \times p(\gamma_\phi^{-1}) \\
&= \prod_{i=1}^{N} \bigl\{ p(b_i \mid \theta^*, u_i) \times p(u_i \mid w) \bigr\} \times p(w) \\
&\quad \times \prod_{k=1}^{K} \bigl\{ p(\mu_k^*) \times p(D_k^{*-1} \mid \gamma_b) \bigr\} \times \prod_{l=1}^{d} p(\gamma_{b,l}^{-1}) \\
&\quad \times \prod_{r=1}^{R} \bigl\{ p(\phi_r^{-1} \mid \gamma_{\phi,r}) \times p(\gamma_{\phi,r}^{-1}) \bigr\} \times p(\alpha),
\end{aligned}
\tag{A.10}
$$
where all factors appearing in (A.10) have been explained in Sec. A.5–A.11.
B. Posterior distribution and the sampling algorithm. In this appendix, we provide a more detailed explanation of the Markov chain Monte Carlo (MCMC) algorithm mentioned in Komárek and Komárková (2012, Section 2.2) which is used to obtain the sample from the posterior distribution. As is usual in MCMC applications, we generate a sample from the joint posterior distribution of all model parameters:
$$ p(\psi, \theta^*, B, u, \gamma_b, \gamma_\phi \mid y) \propto p(\psi, \theta^*, B, u, \gamma_b, \gamma_\phi) \times L^{Bayes}(\psi, \theta^*, B, u, \gamma_b, \gamma_\phi), \tag{B.1} $$
where
$$ L^{Bayes}(\psi, \theta^*, B, u, \gamma_b, \gamma_\phi) = p(y \mid \psi, \theta^*, B, u, \gamma_b, \gamma_\phi) $$
is the likelihood of the Bayesian model.
B.1. Likelihood for the Bayesian model. Given the (Bayesian) model parameters, the likelihood is given by
$$ L^{Bayes}(\psi, \theta^*, B, u, \gamma_b, \gamma_\phi) = p(y \mid \psi, \theta^*, B, u, \gamma_b, \gamma_\phi) = p(y \mid \psi, B) = \prod_{i=1}^{N} \prod_{r=1}^{R} \prod_{j=1}^{n_{i,r}} p(y_{i,r,j} \mid \phi_r, \alpha_r, b_{i,r}), \tag{B.2} $$
where the form of $p(y_{i,r,j} \mid \phi_r, \alpha_r, b_{i,r})$ follows from a particular GLMM, in which
$$ h_r^{-1}\bigl\{ E(Y_{i,r,j} \mid \alpha_r, b_{i,r}) \bigr\} = \eta_{i,r,j} = x_{i,r,j}^\top \alpha_r + z_{i,r,j}^\top b_{i,r}, \qquad i = 1, \ldots, N,\; r = 1, \ldots, R,\; j = 1, \ldots, n_{i,r}, \tag{B.3} $$
where $h_r^{-1}$ is the link function for the $r$th response, $\eta_{i,r,j}$ is the linear predictor, and $x_{i,r,j}$, $z_{i,r,j}$ are $c_r \times 1$ and $d_r \times 1$ vectors of known covariates, respectively. It is assumed that $p(y_{i,r,j} \mid \phi_r, \alpha_r, b_{i,r})$ belongs to an exponential family. For notational convenience, we will assume that $h_r^{-1}$ is a canonical link function, so that
$$ p(y_{i,r,j} \mid \phi_r, \alpha_r, b_{i,r}) = p(y_{i,r,j} \mid \phi_r, \eta_{i,r,j}) = \exp\Bigl\{ \frac{y_{i,r,j}\, \eta_{i,r,j} - q_r(\eta_{i,r,j})}{\phi_r} + k_r(y_{i,r,j}, \phi_r) \Bigr\}, \tag{B.4} $$
where $k_r$ and $q_r$ are appropriate distribution specific functions, see Table B.1 for their form in three most common situations of the Gaussian, Bernoulli and Poisson distribution.
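The exponential family form (B.4) can be checked numerically for the three distributions of Table B.1. The following is a minimal sketch in which the dictionary layout and the function name `log_density` are illustrative assumptions; the functions $q$ and $k$ are those listed in the table, and the mean is obtained as the first derivative of $q$.

```python
import math

# q(eta), k(y, phi) and the mean dq/d(eta) for the families of Table B.1.
FAMILIES = {
    "gaussian": {
        "q": lambda eta: eta * eta / 2.0,
        "k": lambda y, phi: (-y * y / (2.0 * phi)
                             - 0.5 * math.log(2.0 * math.pi * phi)),
        "mean": lambda eta: eta,
    },
    "bernoulli": {
        "q": lambda eta: math.log1p(math.exp(eta)),
        "k": lambda y, phi: 0.0,
        "mean": lambda eta: math.exp(eta) / (1.0 + math.exp(eta)),
    },
    "poisson": {
        "q": lambda eta: math.exp(eta),
        "k": lambda y, phi: -math.lgamma(y + 1.0),   # -log(y!)
        "mean": lambda eta: math.exp(eta),
    },
}


def log_density(family, y, eta, phi=1.0):
    """Logarithm of (B.4): (y * eta - q(eta)) / phi + k(y, phi)."""
    f = FAMILIES[family]
    return (y * eta - f["q"](eta)) / phi + f["k"](y, phi)


# Poisson log-probability of y = 3 when the mean is lambda = 2.
lp_poisson = log_density("poisson", 3.0, math.log(2.0))
# Gaussian log-density of y = 1 with mean 0.5 and variance phi = 2.
lp_gauss = log_density("gaussian", 1.0, 0.5, phi=2.0)
```

For the Bernoulli and Poisson families the argument `phi` is ignored by $k$ and equals 1 in the first term, in agreement with the constant dispersion noted in Sec. A.1.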
Table B.1
Components of three most common generalized linear models. All distributions are parametrized such that the mean is $\lambda$.

| Distribution | Link $h^{-1}(\lambda)$ | Inverse link $h(\eta)$ | Dispersion $\phi$ | Variance $v(\phi, \eta)$ | $k(y, \phi)$ | $q(\eta)$ |
|---|---|---|---|---|---|---|
| Gaussian $N(\lambda, \sigma^2)$ | $\lambda$ | $\eta$ | $\sigma^2$ | $\phi = \sigma^2$ | $-\frac{\log(2\pi\phi)}{2} - \frac{y^2}{2\phi}$ | $\frac{\eta^2}{2}$ |
| Bernoulli $A(\lambda)$ | $\mathrm{logit}(\lambda) = \log\frac{\lambda}{1-\lambda}$ | $\mathrm{ilogit}(\eta) = \frac{\exp(\eta)}{1+\exp(\eta)}$ | 1 | $h(\eta)\{1 - h(\eta)\} = \lambda(1 - \lambda)$ | 0 | $\log\{1 + \exp(\eta)\}$ |
| Poisson $Po(\lambda)$ | $\log(\lambda)$ | $\exp(\eta)$ | 1 | $h(\eta) = \lambda$ | $-\log(y!)$ | $\exp(\eta)$ |

In the sequel, we shall use additional notation. Let
• $n_{\bullet r} = \sum_{i=1}^{N} n_{i,r}$ be the number of all observations of the $r$th marker;
• $y_{\bullet r} = (y_{1,r}^\top, \ldots, y_{N,r}^\top)^\top$ be the vector of all observations of the $r$th marker;
• $\eta_{i,r} = (\eta_{i,r,1}, \ldots, \eta_{i,r,n_{i,r}})^\top$ be the vector of linear predictors for the observations of the $r$th marker of the $i$th subject;
• $\eta_{\bullet r} = (\eta_{1,r}^\top, \ldots, \eta_{N,r}^\top)^\top$ be the vector of linear predictors for all observations of the $r$th marker;
• $\lambda_{i,r,j} = E(Y_{i,r,j} \mid \alpha_r, b_{i,r}) = E(Y_{i,r,j} \mid \eta_{i,r,j})$, $i = 1, \ldots, N$, $r = 1, \ldots, R$, $j = 1, \ldots, n_{i,r}$, be the mean of the $(i, r, j)$th response obtained from the GLMM (B.3) with the response density (B.4), which in general is given by $\lambda_{i,r,j} = \frac{\partial q_r}{\partial \eta}(\eta_{i,r,j})$;
• $\lambda_{i,r} = (\lambda_{i,r,1}, \ldots, \lambda_{i,r,n_{i,r}})^\top$, $i = 1, \ldots, N$, $r = 1, \ldots, R$, be the vector of means of the $r$th marker for the $i$th subject;
• $\lambda_{\bullet r} = (\lambda_{1,r}^\top, \ldots, \lambda_{N,r}^\top)^\top$, $r = 1, \ldots, R$, be the vector of the means for the $r$th marker;
• $v_{i,r,j} = \mathrm{var}(Y_{i,r,j} \mid \phi_r, \alpha_r, b_{i,r}) = \mathrm{var}(Y_{i,r,j} \mid \phi_r, \eta_{i,r,j})$, $i = 1, \ldots, N$, $r = 1, \ldots, R$, $j = 1, \ldots, n_{i,r}$, be the variance of the $(i, r, j)$th response obtained from the GLMM (B.3) with the response density (B.4), which in general is given by $v_{i,r,j} = \phi_r \frac{\partial^2 q_r}{\partial \eta^2}(\eta_{i,r,j})$, see Table B.1 for explicit formulas in three most common situations;
• $v_{i,r} = (v_{i,r,1}, \ldots, v_{i,r,n_{i,r}})^\top$, $i = 1, \ldots, N$, $r = 1, \ldots, R$, be the vector of variances of the $r$th marker for the $i$th subject;
• $V_{i,r} = \mathrm{diag}(v_{i,r})$, $i = 1, \ldots, N$, $r = 1, \ldots, R$, be the diagonal matrix with variances of the $r$th marker for the $i$th subject on a diagonal;
• $v_{\bullet r} = (v_{1,r}^\top, \ldots, v_{N,r}^\top)^\top$, $r = 1, \ldots, R$, be the vector of the variances for the $r$th marker;
• $V_{\bullet r} = \mathrm{diag}(v_{\bullet r})$, $r = 1, \ldots, R$, be the diagonal matrix with variances for the $r$th marker on a diagonal;
• $X_{i,r}$, $i = 1, \ldots, N$, $r = 1, \ldots, R$, be the matrix of covariates for the fixed effects from the GLMM (B.3) for the $r$th response of the $i$th subject, i.e., $X_{i,r} = \begin{pmatrix} x_{i,r,1}^\top \\ \vdots \\ x_{i,r,n_{i,r}}^\top \end{pmatrix}$;
• $X_{\bullet r}$, $r = 1, \ldots, R$, be the matrix of covariates for the fixed effects from the GLMM (B.3) for the $r$th response, i.e., $X_{\bullet r} = \begin{pmatrix} X_{1,r} \\ \vdots \\ X_{N,r} \end{pmatrix}$;
• $Z_{i,r}$, $i = 1, \ldots, N$, $r = 1, \ldots, R$, be the matrix of covariates for the random effects from the GLMM (B.3) for the $r$th response of the $i$th subject, i.e., $Z_{i,r} = \begin{pmatrix} z_{i,r,1}^\top \\ \vdots \\ z_{i,r,n_{i,r}}^\top \end{pmatrix}$;
• $q_{i,r,j} = q_r(\eta_{i,r,j})$, $i = 1, \ldots, N$, $r = 1, \ldots, R$, $j = 1, \ldots, n_{i,r}$;
• $q_{i,r} = (q_{i,r,1}, \ldots, q_{i,r,n_{i,r}})^\top$, $i = 1, \ldots, N$, $r = 1, \ldots, R$;
• $q_{\bullet r} = (q_{1,r}^\top, \ldots, q_{N,r}^\top)^\top$, $r = 1, \ldots, R$;
• $k_{i,r,j} = k_r(y_{i,r,j}, \phi_r)$, $i = 1, \ldots, N$, $r = 1, \ldots, R$, $j = 1, \ldots, n_{i,r}$;
• $k_{i,r} = (k_{i,r,1}, \ldots, k_{i,r,n_{i,r}})^\top$, $i = 1, \ldots, N$, $r = 1, \ldots, R$;
• $k_{\bullet r} = (k_{1,r}^\top, \ldots, k_{N,r}^\top)^\top$, $r = 1, \ldots, R$;
• $s_r$, $r = 1, \ldots, R$, be the part of the shift vector $s$, see Expression (A.7), which corresponds to the $r$th marker, i.e., $s = (s_1^\top, \ldots, s_R^\top)^\top$;
• $S_r$, $r = 1, \ldots, R$, be the diagonal block of the scale matrix $S$, see Expression (A.7), which corresponds to the $r$th marker, i.e., $S = \text{block-diag}(S_1, \ldots, S_R)$.
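Consequently, the likelihood (B.2) can be written as
$$
\begin{aligned}
L^{Bayes}(\psi, \theta^*, B, u, \gamma_b, \gamma_\phi) = L^{Bayes}(\psi, B)
&= \exp\Bigl[ \sum_{i=1}^{N} \sum_{r=1}^{R} \bigl\{ \phi_r^{-1} \bigl( y_{i,r}^\top \eta_{i,r} - 1^\top q_{i,r} \bigr) + 1^\top k_{i,r} \bigr\} \Bigr] \\
&= \exp\Bigl[ \sum_{r=1}^{R} \bigl\{ \phi_r^{-1} \bigl( y_{\bullet r}^\top \eta_{\bullet r} - 1^\top q_{\bullet r} \bigr) + 1^\top k_{\bullet r} \bigr\} \Bigr],
\end{aligned}
\tag{B.5}
$$
where $1$ is a vector of ones of appropriate dimension.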
B.2. Markov chain Monte Carlo. In this section, we introduce the MCMC procedure to obtain a sample
$$ S_M = \bigl\{ \bigl( \psi^{(m)}, \theta^{*(m)}, B^{(m)}, u^{(m)}, \gamma_b^{(m)}, \gamma_\phi^{(m)} \bigr) : m = 1, \ldots, M \bigr\} $$
from the joint posterior distribution (B.1). We exploit the Gibbs algorithm and update the parameters in blocks of related parameters by sampling from the appropriate full conditional distributions, further denoted as $p(\cdot \mid \cdots)$. In the remainder of this section, we derive these full conditional distributions and explain how to sample from them in situations when they do not have the form of any classical distribution. We exploit the following additional notation:
• $N_k = \sum_{i=1}^{N} I(u_i = k)$, $k = 1, \ldots, K$, will denote the number of subjects who are assigned into the $k$th mixture component according to the values of component allocations. Note that $\sum_{k=1}^{K} N_k = N$, the number of subjects;
• $b^*_{[k]}$, $k = 1, \ldots, K$, will be the sample mean of shifted-scaled values of random effects of those subjects assigned into the $k$th mixture component according to the values of component allocations $u$, i.e.,
$$ b^*_{[k]} = N_k^{-1} \sum_{i:\, u_i = k} b_i^*, \qquad k = 1, \ldots, K. $$
B.2.1. Mixture weights. Mixture weights form one block of the Gibbs algorithm. The corresponding full conditional distribution is the Dirichlet distribution $D(\delta + N_1, \ldots, \delta + N_K)$. That is,
$$ p(w \mid \cdots) = \frac{\Gamma(K\delta + N)}{\prod_{k=1}^{K} \Gamma(\delta + N_k)} \prod_{k=1}^{K} w_k^{\delta + N_k - 1}, $$
which is the distribution we can directly sample from.
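A draw from this Dirichlet full conditional can be sketched with the standard library alone, using the well-known construction of a Dirichlet vector from normalized independent gamma variates. The function name and the 0-based coding of the allocations are assumptions made for this illustration.

```python
import random


def sample_mixture_weights(u, K, delta, rng):
    """Draw w from D(delta + N_1, ..., delta + N_K).

    u   : component allocations coded 0, ..., K-1 (an assumption here)
    rng : instance of random.Random
    """
    counts = [0] * K
    for ui in u:
        counts[ui] += 1                     # N_k = number of u_i equal to k
    # Normalized independent Gamma(delta + N_k, 1) variates are Dirichlet.
    draws = [rng.gammavariate(delta + n_k, 1.0) for n_k in counts]
    total = sum(draws)
    return [g / total for g in draws], counts


rng = random.Random(2012)
w, counts = sample_mixture_weights([0, 0, 1, 2, 2, 2], K=3, delta=1.0,
                                   rng=rng)
```

An empty component ($N_k = 0$) is handled without special treatment because its weight is then drawn from the prior contribution $\delta$ only.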
B.2.2. Mixture means. The second block of the Gibbs algorithm is composed of mixture means. Their full conditional distribution is a product of $K$ independent normal distributions, i.e.,
$$ p(\mu_1^*, \ldots, \mu_K^* \mid \cdots) = \prod_{k=1}^{K} p(\mu_k^* \mid \cdots) \sim \prod_{k=1}^{K} N\bigl( E(\mu_k^* \mid \cdots),\, \mathrm{var}(\mu_k^* \mid \cdots) \bigr), $$
where
$$ \mathrm{var}(\mu_k^* \mid \cdots) = \bigl( N_k D_k^{*-1} + C_b^{-1} \bigr)^{-1}, \qquad E(\mu_k^* \mid \cdots) = \mathrm{var}(\mu_k^* \mid \cdots) \bigl( N_k D_k^{*-1} b^*_{[k]} + C_b^{-1} \xi_b \bigr), \qquad k = 1, \ldots, K. $$
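The structure of this update is easiest to see in the univariate special case $d = 1$, where all matrices reduce to scalars. The following minimal sketch covers only that special case (not the general multivariate update); the function name is hypothetical.

```python
import random


def mixture_mean_conditional(b_star_k, D_star_k, xi_b, C_b):
    """Mean and variance of p(mu*_k | ...) for d = 1.

    b_star_k : shifted-scaled random effects of the subjects with u_i = k
    D_star_k : mixture variance of the k-th component
    xi_b, C_b: prior mean and prior variance of mu*_k
    """
    N_k = len(b_star_k)
    precision = N_k / D_star_k + 1.0 / C_b
    variance = 1.0 / precision
    b_bar = sum(b_star_k) / N_k if N_k > 0 else 0.0
    mean = variance * (N_k / D_star_k * b_bar + xi_b / C_b)
    return mean, variance


rng = random.Random(1)
mean, variance = mixture_mean_conditional([1.0, 3.0], D_star_k=1.0,
                                          xi_b=0.0, C_b=1.0)
mu_star_k = rng.gauss(mean, variance ** 0.5)   # one Gibbs draw

# An empty component falls back to the prior N(xi_b, C_b).
prior_mean, prior_var = mixture_mean_conditional([], 1.0, 0.5, 36.0)
```

The conditional mean is a precision-weighted compromise between the component sample mean $b^*_{[k]}$ and the prior mean $\xi_b$.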
B.2.3. Mixture covariance matrices. The third block of the Gibbs algorithm is formed of the inversions of the mixture covariance matrices. Their full conditional distribution is a product of $K$ independent Wishart distributions, i.e.,
$$ p\bigl(D_1^{*-1}, \ldots, D_K^{*-1} \mid \cdots\bigr) = \prod_{k=1}^{K} p\bigl(D_k^{*-1} \mid \cdots\bigr) \sim \prod_{k=1}^{K} W\bigl(\tilde{\zeta}_{b,k}, \tilde{\Xi}_{b,k}\bigr), $$
where
$$ \tilde{\zeta}_{b,k} = \zeta_b + N_k, \qquad \tilde{\Xi}_{b,k} = \Bigl\{ \Xi_b^{-1} + \sum_{i:\, u_i = k} (b_i^* - \mu_k^*)(b_i^* - \mu_k^*)^\top \Bigr\}^{-1}, \qquad k = 1, \ldots, K. $$
B.2.4. Variance hyperparameters for mixture covariance matrices. In the fourth block of the Gibbs algorithm, we update the variance hyperparameters $\gamma_b^{-1}$ whose inversions form a random diagonal of the scale matrix $\Xi_b$ in the Wishart prior for the inversions of the mixture covariance matrices. Their full conditional distribution is a product of independent gamma distributions, i.e.,
$$ p(\gamma_b^{-1} \mid \cdots) = \prod_{l=1}^{d} p(\gamma_{b,l}^{-1} \mid \cdots) \sim \prod_{l=1}^{d} G\bigl(\tilde{g}_{b,l}, \tilde{h}_{b,l}\bigr), $$
where
$$ \tilde{g}_{b,l} = g_{b,l} + \frac{K \zeta_b}{2}, \qquad \tilde{h}_{b,l} = h_{b,l} + \frac{1}{2} \sum_{k=1}^{K} Q_k(l, l), \qquad l = 1, \ldots, d, $$
where $Q_k(l, l)$ is the $l$th diagonal element of the matrix $D_k^{*-1}$.
B.2.5. Component allocations. In the fifth block of the Gibbs algorithm, we update the latent component allocations $u$. Their full conditional distribution is a product over subjects ($i = 1, \ldots, N$) of the following discrete distributions:
$$ P(u_i = k \mid \cdots) = \frac{w_k\, \varphi\bigl(b_i^* \mid \mu_k^*, D_k^*\bigr)}{\sum_{l=1}^{K} w_l\, \varphi\bigl(b_i^* \mid \mu_l^*, D_l^*\bigr)}, \qquad k = 1, \ldots, K. $$
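The allocation probabilities can be evaluated stably on the logarithmic scale. The sketch below again restricts itself to the univariate case $d = 1$ (so that $\varphi$ is a univariate normal density) and uses 0-based component labels; both choices and the function names are assumptions of this illustration.

```python
import math
import random


def normal_logpdf(x, mean, var):
    """Logarithm of the univariate normal density."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def allocation_probabilities(b_star, w, mu_star, var_star):
    """P(u_i = k | ...) for one subject, k = 0, ..., K-1 (case d = 1)."""
    log_terms = [math.log(wk) + normal_logpdf(b_star, mk, vk)
                 for wk, mk, vk in zip(w, mu_star, var_star)]
    top = max(log_terms)                 # log-sum-exp guard against underflow
    unnorm = [math.exp(t - top) for t in log_terms]
    total = sum(unnorm)
    return [x / total for x in unnorm]


def sample_allocation(probs, rng):
    """Draw a component label with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(7)
probs = allocation_probabilities(0.0, w=[0.25, 0.75],
                                 mu_star=[-1.0, 1.0], var_star=[1.0, 1.0])
u_i = sample_allocation(probs, rng)
```

In this example the value $b_i^* = 0$ is equally distant from both component means, so the probabilities are driven by the weights alone.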
B.2.6. Fixed effects. The sixth block of the Gibbs algorithm consists
in updating the vector of fixed effects α. The full conditional distribution
factorizes with respect to the R longitudinal markers involved in a model,
i.e.,
$$ p(\alpha \mid \cdots) = \prod_{r=1}^{R} p(\alpha_r \mid \cdots). $$
To proceed, we introduce some additional notation. Let for each $r = 1, \ldots, R$,
$$ L^{Bayes}_{\alpha,r}(\psi, B) = \exp\Bigl\{ \phi_r^{-1} \bigl( y_{\bullet r}^\top \eta_{\bullet r} - 1^\top q_{\bullet r} \bigr) + 1^\top k_{\bullet r} \Bigr\} $$
be the part of the likelihood (B.5) pertaining to the $r$th marker. Combining this with the prior distribution (A.9) leads to the following full conditional distribution for the vector of the fixed effects from the $r$th GLMM:
$$ p(\alpha_r \mid \cdots) \propto \exp\Bigl\{ \phi_r^{-1} \bigl( y_{\bullet r}^\top \eta_{\bullet r} - 1^\top q_{\bullet r} \bigr) - \tfrac{1}{2} \bigl(\alpha_r - \xi_{\alpha,r}\bigr)^\top C_{\alpha,r}^{-1} \bigl(\alpha_r - \xi_{\alpha,r}\bigr) \Bigr\}, \tag{B.6} $$
$r = 1, \ldots, R$, where $\xi_{\alpha,r}$ and $C_{\alpha,r}$ are the appropriate subvector and submatrix of the prior mean $\xi_\alpha$ and prior covariance matrix $C_\alpha$, respectively. In a situation when the $r$th GLMM distribution (B.4) is normal $N(\eta_{i,r,j}, \phi_r)$, it can easily be shown that (B.6) is a (multivariate) normal distribution with the variance and mean given by
$$ \mathrm{var}(\alpha_r \mid \cdots) = \bigl( \phi_r^{-1} X_{\bullet r}^\top X_{\bullet r} + C_{\alpha,r}^{-1} \bigr)^{-1}, \tag{B.7} $$
$$ E(\alpha_r \mid \cdots) = \mathrm{var}(\alpha_r \mid \cdots) \bigl( \phi_r^{-1} X_{\bullet r}^\top e_{\alpha,r} + C_{\alpha,r}^{-1} \xi_{\alpha,r} \bigr), \tag{B.8} $$
where
$$ e_{\alpha,r} = \begin{pmatrix} e_{\alpha,1,r} \\ \vdots \\ e_{\alpha,N,r} \end{pmatrix} = \begin{pmatrix} y_{1,r} - Z_{1,r} b_{1,r} \\ \vdots \\ y_{N,r} - Z_{N,r} b_{N,r} \end{pmatrix}, $$
from which we can easily sample. Nevertheless, in general, (B.6) does not have the form of any classically used distribution. In that case, we follow
a proposal of Gamerman (1997) and update αr using a Metropolis-Hastings
step with a normal proposal based on appropriately tuned second-order
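approximation to (B.6) created using one Newton-Raphson step. Let in the $m$th MCMC iteration $\alpha_r^{(m-1)}$ denote the previously sampled value of $\alpha_r$, and $\lambda_{\bullet r}^{(m-1)}$ the vector of means $\lambda_{\bullet r}$ calculated using $\alpha_r = \alpha_r^{(m-1)}$. Further, let
$$ U_{\alpha,r}(\alpha_r) = \frac{\partial \log L^{Bayes}_{\alpha,r}(\psi, B)}{\partial \alpha_r} = \phi_r^{-1} X_{\bullet r}^\top (y_{\bullet r} - \lambda_{\bullet r}), $$
$$ J_{\alpha,r}(\alpha_r) = -\frac{\partial^2 \log L^{Bayes}_{\alpha,r}(\psi, B)}{\partial \alpha_r\, \partial \alpha_r^\top} = \phi_r^{-2} X_{\bullet r}^\top V_{\bullet r} X_{\bullet r}, $$
be the score and observed information matrix from the $r$th GLMM with respect to the vector of fixed effects $\alpha_r$. A new value of $\alpha_r$ is proposed by sampling from a (multivariate) normal distribution with the variance and mean given by
$$ \widehat{\mathrm{var}}(\alpha_r \mid \cdots) = \sigma_{\alpha,r} \bigl\{ J_{\alpha,r}\bigl(\alpha_r^{(m-1)}\bigr) + C_{\alpha,r}^{-1} \bigr\}^{-1}, \tag{B.9} $$
$$ \widehat{E}(\alpha_r \mid \cdots) = \alpha_r^{(m-1)} + \Delta_{\alpha,r}^{(m)}, \tag{B.10} $$
where
$$
\begin{aligned}
\Delta_{\alpha,r}^{(m)}
&= \bigl\{ J_{\alpha,r}\bigl(\alpha_r^{(m-1)}\bigr) + C_{\alpha,r}^{-1} \bigr\}^{-1} \bigl\{ U_{\alpha,r}\bigl(\alpha_r^{(m-1)}\bigr) - C_{\alpha,r}^{-1} \bigl(\alpha_r^{(m-1)} - \xi_{\alpha,r}\bigr) \bigr\} \\
&= \bigl( \phi_r^{-2} X_{\bullet r}^\top V_{\bullet r} X_{\bullet r} + C_{\alpha,r}^{-1} \bigr)^{-1} \bigl\{ \phi_r^{-1} X_{\bullet r}^\top \bigl(y_{\bullet r} - \lambda_{\bullet r}^{(m-1)}\bigr) - C_{\alpha,r}^{-1} \bigl(\alpha_r^{(m-1)} - \xi_{\alpha,r}\bigr) \bigr\},
\end{aligned}
\tag{B.11}
$$
and $\sigma_{\alpha,r} > 0$ is a tuning parameter. In practical analyses, we recommend to … In the case of normally distributed response, setting $\sigma_{\alpha,r} = 1$ leads to (B.9) which is identical to (B.7), and (B.10) coincides with (B.8).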
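To make the Metropolis-Hastings step concrete, the following minimal sketch treats the simplest non-normal case: a single Poisson marker with log link ($\phi_r = 1$) and a scalar fixed effect, so that $U_{\alpha,r}$, $J_{\alpha,r}$ and $C_{\alpha,r}$ are all scalars and random effects are left out. This is an illustration under these simplifying assumptions, not the authors' implementation; all function names and the data are hypothetical. Because the proposal depends on the current value, the acceptance ratio includes the proposal densities in both directions.

```python
import math
import random


def log_target(alpha, x, y, xi, C):
    """Log of (B.6) up to a constant: Poisson, log link, scalar alpha."""
    loglik = sum(yj * xj * alpha - math.exp(xj * alpha)
                 for xj, yj in zip(x, y))
    return loglik - 0.5 * (alpha - xi) ** 2 / C


def proposal_moments(alpha, x, y, xi, C, sigma=1.0):
    """Mean (B.10) and variance (B.9) of the normal proposal at alpha."""
    lam = [math.exp(xj * alpha) for xj in x]
    score = sum(xj * (yj - lj) for xj, yj, lj in zip(x, y, lam))    # U
    info = sum(xj * xj * lj for xj, lj in zip(x, lam))              # J
    precision = info + 1.0 / C
    step = (score - (alpha - xi) / C) / precision                   # (B.11)
    return alpha + step, sigma / precision


def normal_logpdf(z, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (z - mean) ** 2 / var


def mh_update(alpha, x, y, xi, C, rng, sigma=1.0):
    """One Metropolis-Hastings update of alpha; returns (value, accepted)."""
    m_fwd, v_fwd = proposal_moments(alpha, x, y, xi, C, sigma)
    proposal = rng.gauss(m_fwd, math.sqrt(v_fwd))
    m_rev, v_rev = proposal_moments(proposal, x, y, xi, C, sigma)
    log_ratio = (log_target(proposal, x, y, xi, C)
                 - log_target(alpha, x, y, xi, C)
                 + normal_logpdf(alpha, m_rev, v_rev)
                 - normal_logpdf(proposal, m_fwd, v_fwd))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal, True
    return alpha, False


# Made-up data: intercept-only model with four Poisson counts.
x = [1.0, 1.0, 1.0, 1.0]
y = [2.0, 3.0, 1.0, 2.0]
rng = random.Random(2012)
alpha, draws, n_accepted = 0.0, [], 0
for _ in range(2000):
    alpha, accepted = mh_update(alpha, x, y, xi=0.0, C=1e4, rng=rng)
    draws.append(alpha)
    n_accepted += accepted
```

With `sigma=1.0` the proposal is the Newton-Raphson based normal approximation itself; other values of the tuning parameter only rescale its variance.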
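Finally, we remark that both the proposal covariance matrix (B.9) (with $\sigma_{\alpha,r} = 1$) and the Newton-Raphson step (B.11) can efficiently be obtained as a least squares solution and QR decomposition to a particular problem. To show that, let $L_{\alpha,r}$ be the Cholesky decomposition of the inversion of the prior covariance matrix $C_{\alpha,r}$, i.e., $C_{\alpha,r}^{-1} = L_{\alpha,r} L_{\alpha,r}^\top$. Further, let
$$ \tilde{X}_{\bullet r} = \begin{pmatrix} \phi_r^{-1} V_{\bullet r}^{1/2} X_{\bullet r} \\ L_{\alpha,r}^\top \end{pmatrix}, \qquad \tilde{e}_{\bullet r} = \begin{pmatrix} V_{\bullet r}^{-1/2} \bigl(y_{\bullet r} - \lambda_{\bullet r}^{(m-1)}\bigr) \\ -L_{\alpha,r}^\top \bigl(\alpha_r^{(m-1)} - \xi_{\alpha,r}\bigr) \end{pmatrix}. $$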
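It follows from (B.11) that
$$ \Delta_{\alpha,r}^{(m)} = \bigl( \tilde{X}_{\bullet r}^\top \tilde{X}_{\bullet r} \bigr)^{-1} \tilde{X}_{\bullet r}^\top \tilde{e}_{\bullet r}, $$
i.e., $\Delta_{\alpha,r}^{(m)}$ is the least squares solution to $\tilde{X}_{\bullet r}\, \delta = \tilde{e}_{\bullet r}$. That is, if $\tilde{X}_{\bullet r} = \tilde{Q}_{\bullet r} \tilde{R}_{\bullet r}$ is the QR decomposition of the matrix $\tilde{X}_{\bullet r}$, we have
$$ \Delta_{\alpha,r}^{(m)} = \tilde{R}_{\bullet r}^{-1} \tilde{Q}_{\bullet r}^\top \tilde{e}_{\bullet r}, $$
and also
$$ \widehat{\mathrm{var}}(\alpha_r \mid \cdots)^{-1} = \sigma_{\alpha,r}^{-1} \tilde{R}_{\bullet r}^\top \tilde{R}_{\bullet r}. $$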
B.2.7. Random effects. The seventh block of the Gibbs algorithm consists in updating the vectors of random effects $b_1, \ldots, b_N$. The full conditional distribution factorizes with respect to $N$ subjects, i.e.,
$$ p(B \mid \cdots) = \prod_{i=1}^{N} p(b_i \mid \cdots). $$
The remainder of this part is similar to Paragraph B.2.6. First, we introduce additional notation. Let for each $i = 1, \ldots, N$,
$$ L^{Bayes}_{b,i}(\psi, B) = \exp\Bigl[ \sum_{r=1}^{R} \bigl\{ \phi_r^{-1} \bigl( y_{i,r}^\top \eta_{i,r} - 1^\top q_{i,r} \bigr) + 1^\top k_{i,r} \bigr\} \Bigr] \tag{B.12} $$
be the part of the likelihood (B.5) pertaining to the $i$th subject. If we combine this with the model (A.6) for random effects, we obtain that the full conditional distribution for the $i$th random effect vector $b_i$ is
$$ p(b_i \mid \cdots) \propto \exp\Bigl\{ \sum_{r=1}^{R} \phi_r^{-1} \bigl( y_{i,r}^\top \eta_{i,r} - 1^\top q_{i,r} \bigr) - \tfrac{1}{2} \bigl(b_i - s - S\mu^*_{u_i}\bigr)^\top \bigl(S D^*_{u_i} S^\top\bigr)^{-1} \bigl(b_i - s - S\mu^*_{u_i}\bigr) \Bigr\}. $$
When working with shifted-scaled random effects, see Expression (A.7), we obtain the following full conditional distribution
$$ p(b_i^* \mid \cdots) \propto \exp\Bigl\{ \sum_{r=1}^{R} \phi_r^{-1} \bigl( y_{i,r}^\top \eta_{i,r} - 1^\top q_{i,r} \bigr) - \tfrac{1}{2} \bigl(b_i^* - \mu^*_{u_i}\bigr)^\top D_{u_i}^{*-1} \bigl(b_i^* - \mu^*_{u_i}\bigr) \Bigr\}, \tag{B.13} $$
from which we have to sample during this block of the Gibbs algorithm. In a situation when all $R$ GLMM distributions (B.4) are normal $N(\eta_{i,r,j}, \phi_r)$, $r = 1, \ldots, R$, it can easily be shown that (B.13) is a (multivariate) normal distribution with the variance and mean given by
$$ \mathrm{var}(b_i^* \mid \cdots) = \bigl( J_{b,i} + D_{u_i}^{*-1} \bigr)^{-1}, \tag{B.14} $$
$$ E(b_i^* \mid \cdots) = \mathrm{var}(b_i^* \mid \cdots) \bigl( U_{b,i} + D_{u_i}^{*-1} \mu^*_{u_i} \bigr), \tag{B.15} $$
where
$$ U_{b,i} = \begin{pmatrix} \phi_1^{-1} S_1^\top Z_{i,1}^\top \bigl(y_{i,1} - X_{i,1}\alpha_1 - Z_{i,1} s_1\bigr) \\ \vdots \\ \phi_R^{-1} S_R^\top Z_{i,R}^\top \bigl(y_{i,R} - X_{i,R}\alpha_R - Z_{i,R} s_R\bigr) \end{pmatrix}, \qquad J_{b,i} = \text{block-diag}\bigl( \phi_r^{-1} S_r^\top Z_{i,r}^\top Z_{i,r} S_r : r = 1, \ldots, R \bigr), $$
$i = 1, \ldots, N$, and we can directly update the values of random effects by sampling from this multivariate normal distribution. Analogously to Paragraph B.2.6, the full conditional distribution (B.13) does not have the form of any classically used distribution as soon as at least one of the $R$ GLMM distributions (B.4) is not normal. In that case, we update each $b_i^*$ (and $b_i$) using a Metropolis-Hastings step with a normal proposal based on an appropriately tuned second-order approximation to (B.13) created using one Newton-Raphson step. Let in the $m$th MCMC iteration, $b_i^{*(m-1)}$ denote the previously sampled value of $b_i^*$. Further, let
$$ U_{b,i}(b_i^*) = \frac{\partial \log L^{Bayes}_{b,i}(\psi, B)}{\partial b_i^*} = \begin{pmatrix} \phi_1^{-1} S_1^\top Z_{i,1}^\top \bigl(y_{i,1} - \lambda_{i,1}^{(m-1)}\bigr) \\ \vdots \\ \phi_R^{-1} S_R^\top Z_{i,R}^\top \bigl(y_{i,R} - \lambda_{i,R}^{(m-1)}\bigr) \end{pmatrix}, $$
$$ J_{b,i}(b_i^*) = -\frac{\partial^2 \log L^{Bayes}_{b,i}(\psi, B)}{\partial b_i^*\, \partial b_i^{*\top}} = \text{block-diag}\bigl( \phi_r^{-2} S_r^\top Z_{i,r}^\top V_{i,r} Z_{i,r} S_r : r = 1, \ldots, R \bigr), $$
be the score and observed information matrix with respect to the shifted-scaled vector of random effects corresponding to the likelihood contribution (B.12). A new value of $b_i^*$ is proposed by sampling from a (multivariate) normal distribution with the variance and mean given by
$$ \widehat{\mathrm{var}}(b_i^* \mid \cdots) = \sigma_b \bigl\{ J_{b,i}\bigl(b_i^{*(m-1)}\bigr) + D_{u_i}^{*-1} \bigr\}^{-1}, \tag{B.16} $$
$$ \widehat{E}(b_i^* \mid \cdots) = b_i^{*(m-1)} + \Delta_{b,i}^{(m)}, \tag{B.17} $$
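where
\[
\boldsymbol{\Delta}^{(m)}_{b,i} = \Bigl\{ J_{b,i}\bigl(\boldsymbol{b}^{*(m-1)}_i\bigr) + {\boldsymbol{D}^*_{u_i}}^{-1} \Bigr\}^{-1} \Bigl\{ U_{b,i}\bigl(\boldsymbol{b}^{*(m-1)}_i\bigr) - {\boldsymbol{D}^*_{u_i}}^{-1}\bigl(\boldsymbol{b}^{*(m-1)}_i - \boldsymbol{\mu}^*_{u_i}\bigr) \Bigr\}, \tag{B.18}
\]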
and σb > 0 is a tuning parameter. Also in this case, we recommend setting σb = 1. In the case of normally distributed response, setting σb = 1 leads to (B.16) which is identical to (B.14), and (B.17) coincides with (B.15). Analogously to Paragraph B.2.6, both the proposal covariance matrix (B.16) (with σb = 1) and the Newton-Raphson step (B.18) can be obtained efficiently from a least squares solution and the QR decomposition of a particular problem.
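The proposal moments (B.16)-(B.18) can be illustrated with a minimal Python sketch for a single normally distributed response (R = 1). All numerical values, the function name `proposal_moments`, and the choice V = φI for the normal case are assumptions made for illustration only; the sketch uses a plain linear solve rather than the QR-based computation mentioned above, and it omits the Metropolis-Hastings acceptance step.

```python
import numpy as np


def proposal_moments(b_prev, score, info, D_inv, mu, sigma_b=1.0):
    """Mean and covariance of the normal proposal, following (B.16)-(B.18).

    b_prev : previously sampled shifted-scaled random effects
    score  : score vector evaluated at b_prev
    info   : observed information matrix evaluated at b_prev
    D_inv  : inverse of the mixture covariance matrix of the current component
    mu     : mixture mean of the current component
    """
    precision = info + D_inv
    # Newton-Raphson step (B.18), computed by solving a linear system
    delta = np.linalg.solve(precision, score - D_inv @ (b_prev - mu))
    cov = sigma_b * np.linalg.inv(precision)   # (B.16)
    mean = b_prev + delta                      # (B.17)
    return mean, cov


# Hypothetical toy data: one normal response, random intercept and slope.
rng = np.random.default_rng(1)
n, q = 6, 2
Z = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
S = np.diag([0.9, 0.1])                 # scale matrix (illustrative)
s = np.array([0.3, 0.01])               # shift vector (illustrative)
phi = 0.5                               # dispersion (illustrative)
y = rng.normal(size=n)                  # fixed-effect part taken as zero
D_inv = np.linalg.inv(np.array([[1.0, 0.2], [0.2, 2.0]]))
mu = np.array([0.5, -0.5])

ZS = Z @ S
b_prev = rng.normal(size=q)
lam = Z @ (s + S @ b_prev)              # normal case: mean equals linear predictor
score = ZS.T @ (y - lam) / phi          # score at b_prev
info = ZS.T @ ZS / phi                  # information with V = phi * I (assumption)

mean, cov = proposal_moments(b_prev, score, info, D_inv, mu)

# Closed-form full conditional (B.14)-(B.15) for comparison
U = ZS.T @ (y - Z @ s) / phi
exact_cov = np.linalg.inv(info + D_inv)
exact_mean = exact_cov @ (U + D_inv @ mu)

b_new = rng.multivariate_normal(mean, cov)   # a draw from the proposal
```

In this normal toy case the proposal mean and covariance agree with the closed-form moments whatever the value of `b_prev`, which is the coincidence of (B.16)-(B.17) with (B.14)-(B.15) noted in the text.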
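B.2.8. GLMM dispersion parameters. The eighth block of the Gibbs algorithm involves sampling the GLMM dispersion parameters $\phi_1, \ldots, \phi_R$ for those $r = 1, \ldots, R$ for which $\phi_r$ of the corresponding GLMM is not constant by definition. In these situations, the full conditional distribution of the inverted dispersion parameter is given by
\[
p(\phi_r^{-1} \mid \cdots) \propto \bigl(\phi_r^{-1}\bigr)^{\frac{\zeta_{\phi,r}-2}{2}} \exp\Bigl\{ -\frac{1}{2}\,\gamma_{\phi,r}^{-1}\,\phi_r^{-1} + \phi_r^{-1}\bigl(\boldsymbol{y}_{\bullet r}^\top \boldsymbol{\eta}_{\bullet r} - \boldsymbol{1}^\top \boldsymbol{q}_{\bullet r}\bigr) + \boldsymbol{1}^\top \boldsymbol{k}_{\bullet r} \Bigr\}. \tag{B.19}
\]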
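If we concentrate on the three most common situations described in Table B.1, the only case when it is necessary to sample the dispersion parameter $\phi_r^{-1}$ is a normally distributed response with $\phi_r = \sigma_r^2$ being the error variance of the corresponding linear mixed model. In this case, the full conditional distribution (B.19) is the gamma distribution, i.e.,
\[
p(\phi_r^{-1} \mid \cdots) \sim \mathcal{G}\bigl(\tilde{\zeta}_{\phi,r}/2,\; \tilde{\gamma}_{\phi,r}^{-1}/2\bigr),
\]
where
\[
\tilde{\zeta}_{\phi,r} = \zeta_{\phi,r} + n_{\bullet r}, \qquad
\tilde{\gamma}_{\phi,r} = \Bigl\{ \gamma_{\phi,r}^{-1} + \bigl(\boldsymbol{y}_{\bullet r} - \boldsymbol{\eta}_{\bullet r}\bigr)^\top \bigl(\boldsymbol{y}_{\bullet r} - \boldsymbol{\eta}_{\bullet r}\bigr) \Bigr\}^{-1}.
\]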
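The gamma form can be checked directly from (B.19). The following is a sketch only: it assumes the usual exponential-family representation of the normal density, $q(\eta) = \eta^2/2$ and $k(y, \phi) = -y^2/(2\phi) - \frac{1}{2}\log(2\pi\phi)$, which is not restated in this section, and it reads $\mathcal{G}(a, b)$ as a gamma distribution with shape $a$ and rate $b$.

```latex
\begin{align*}
\phi_r^{-1}\bigl(\boldsymbol{y}_{\bullet r}^\top \boldsymbol{\eta}_{\bullet r}
  - \boldsymbol{1}^\top \boldsymbol{q}_{\bullet r}\bigr)
  + \boldsymbol{1}^\top \boldsymbol{k}_{\bullet r}
&= -\frac{1}{2}\,\phi_r^{-1}
   \bigl(\boldsymbol{y}_{\bullet r} - \boldsymbol{\eta}_{\bullet r}\bigr)^\top
   \bigl(\boldsymbol{y}_{\bullet r} - \boldsymbol{\eta}_{\bullet r}\bigr)
   + \frac{n_{\bullet r}}{2}\log \phi_r^{-1}
   - \frac{n_{\bullet r}}{2}\log(2\pi), \\[4pt]
p(\phi_r^{-1} \mid \cdots)
&\propto \bigl(\phi_r^{-1}\bigr)^{\frac{\zeta_{\phi,r} + n_{\bullet r}}{2} - 1}
   \exp\Bigl[ -\frac{1}{2}\,\phi_r^{-1}
   \Bigl\{ \gamma_{\phi,r}^{-1}
   + \bigl(\boldsymbol{y}_{\bullet r} - \boldsymbol{\eta}_{\bullet r}\bigr)^\top
     \bigl(\boldsymbol{y}_{\bullet r} - \boldsymbol{\eta}_{\bullet r}\bigr) \Bigr\} \Bigr].
\end{align*}
```

The second line is the kernel of a gamma density with shape $(\zeta_{\phi,r} + n_{\bullet r})/2$ and rate $\{\gamma_{\phi,r}^{-1} + (\boldsymbol{y}_{\bullet r} - \boldsymbol{\eta}_{\bullet r})^\top(\boldsymbol{y}_{\bullet r} - \boldsymbol{\eta}_{\bullet r})\}/2$, in agreement with the parameters stated above.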
B.2.9. Variance hyperparameter for dispersions. The last, ninth block of the Gibbs algorithm consists of sampling the variance hyperparameters $\gamma_{\phi,r}^{-1}$ for those $r = 1, \ldots, R$ for which $\phi_r$ of the corresponding GLMM is not constant by definition. In these situations, the full conditional distribution of $\gamma_{\phi,r}^{-1}$ is the gamma distribution, i.e.,
\[
p(\gamma_{\phi,r}^{-1} \mid \cdots) \sim \mathcal{G}\bigl(\tilde{g}_{\phi,r},\; \tilde{h}_{\phi,r}\bigr),
\]
where
\[
\tilde{g}_{\phi,r} = g_{\phi,r} + \frac{\zeta_{\phi,r}}{2}, \qquad
\tilde{h}_{\phi,r} = h_{\phi,r} + \frac{\phi_r^{-1}}{2}.
\]
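The eighth and ninth blocks can be sketched in Python for one normally distributed response. This is an illustrative sketch, not the authors' implementation: the function names, the toy data and the starting value of γ⁻¹ are hypothetical, and G(a, b) is read as a gamma distribution with shape a and rate b, which is the reading implied by (B.19). NumPy's generator is parameterized by shape and scale, hence the `1.0 / rate`.

```python
import numpy as np


def update_dispersion(y, eta, zeta, gamma_inv, rng):
    """Draw the inverted dispersion from its gamma full conditional (normal response)."""
    resid = y - eta
    shape = 0.5 * (zeta + y.size)
    rate = 0.5 * (gamma_inv + resid @ resid)
    return rng.gamma(shape, 1.0 / rate), shape, rate   # numpy uses scale = 1 / rate


def update_gamma_inv(phi_inv, zeta, g, h, rng):
    """Draw the inverted variance hyperparameter from its gamma full conditional."""
    shape = g + 0.5 * zeta
    rate = h + 0.5 * phi_inv
    return rng.gamma(shape, 1.0 / rate), shape, rate


rng = np.random.default_rng(0)
y = np.array([0.2, -0.1, 0.4, 0.0])      # hypothetical stacked responses
eta = np.array([0.1, 0.0, 0.3, 0.1])     # hypothetical linear predictors
zeta, g, h = 2.0, 0.2, 2.76              # illustrative hyperparameter values
gamma_inv = 1.0                          # hypothetical current value

phi_inv, shape_phi, rate_phi = update_dispersion(y, eta, zeta, gamma_inv, rng)
gamma_inv, shape_g, rate_g = update_gamma_inv(phi_inv, zeta, g, h, rng)
```

Within one sweep of the sampler the two draws are made in this order, so the update of γ⁻¹ conditions on the freshly sampled inverted dispersion.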
C. Analysis of the Mayo Clinic PBC Data. This section supplements the main paper with some details on the analysis of the PBC910 data.
The MMGLMM for the clustering of patients included in the PBC910 data
is based on longitudinal measurements of
(1) logarithmic serum bilirubin (lbili, Yi,1,j ),
(2) platelet counts (platelet, Yi,2,j ),
(3) dichotomous presence of blood vessel malformations (spiders, Yi,3,j ),
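with assumed (1) Gaussian, (2) Poisson, and (3) Bernoulli distribution, respectively, and the mean structure (3) given by
\[
\begin{aligned}
\mathrm{E}(Y_{i,1,j} \mid \boldsymbol{b}_{i,1}) &= b_{i,1,1} + b_{i,1,2}\, t_{i,1,j}, \\
\log \mathrm{E}(Y_{i,2,j} \mid \boldsymbol{b}_{i,2}) &= b_{i,2,1} + b_{i,2,2}\, t_{i,2,j}, \\
\operatorname{logit} \mathrm{E}(Y_{i,3,j} \mid b_{i,3}, \alpha_3) &= b_{i,3} + \alpha_3\, t_{i,3,j},
\end{aligned}
\tag{C.1}
\]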
i = 1, . . . , N , j = 1, . . . , ni,r , r = 1, 2, 3, where 1 ≤ ni,r ≤ 5. In model (C.1),
ti,r,j is the time in months from the start of follow-up when the value of Yi,r,j
was obtained.
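The mean structure (C.1) can be written as a short Python sketch that maps the random effects of one patient to the conditional means of the three markers through the identity, log and logit links. The function name and the example visit times are hypothetical and serve only as an illustration.

```python
import numpy as np


def conditional_means(b, alpha3, t1, t2, t3):
    """Conditional means of the three markers in model (C.1) for one patient.

    b      : random effects (b11, b12, b21, b22, b3)
    alpha3 : fixed slope in the logit model
    t1, t2, t3 : visit times in months for each marker
    """
    b11, b12, b21, b22, b3 = b
    mean_lbili = b11 + b12 * np.asarray(t1, dtype=float)             # identity link
    mean_platelet = np.exp(b21 + b22 * np.asarray(t2, dtype=float))  # log link
    eta3 = b3 + alpha3 * np.asarray(t3, dtype=float)
    prob_spiders = 1.0 / (1.0 + np.exp(-eta3))                       # logit link
    return mean_lbili, mean_platelet, prob_spiders


# Hypothetical patient with all random effects and the slope set to zero.
m1, m2, m3 = conditional_means(np.zeros(5), 0.0, [0, 6, 12], [0, 6], [0, 12, 24])
```

The three markers may be observed at different times and a different number of times, which is why the visit times are passed separately for each marker.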
C.1. Summary of model parameters. In the main analysis, we shall try to
classify patients into two groups and hence a two-component mixture (K = 2)
will be considered in the distribution of random effects. In summary, the
considered model involves the following parameters of main interest.
• Random effect vectors bi = (bi,1,1 , bi,1,2 , bi,2,1 , bi,2,2 , bi,3 )⊤ , i = 1, . . . , N ,
where bi,1,1 and bi,1,2 are the random intercept and random slope from
the linear mixed model for logarithmic bilirubin, bi,2,1 and bi,2,2 are
the random intercept and random slope from the log-Poisson model
for platelet counts and bi,3 is the random intercept from the logit model
for presence of blood vessel malformations.
• Mixture parameters wk , µ∗k , D ∗k , k = 1, 2 and their counterparts µk =
s + Sµ∗k , Dk = SD ∗k S ⊤ , k = 1, 2 which represent the mean and the
covariance matrix for each cluster.
• Dispersion parameter φ1 from the linear mixed model for logarithmic
bilirubin. In this case, φ1 is the residual variance and we will denote
it also as σ12 = φ1 .
• Fixed effects α = α3 , where α3 is the slope from the logit model for
presence of blood vessel malformations.
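For the purpose of estimation, we have
\[
\boldsymbol{\psi} = \bigl(\sigma_1^2,\; \alpha_3\bigr)^\top, \qquad
\boldsymbol{\theta}^* = \bigl(\boldsymbol{w}^\top,\; \boldsymbol{\mu}_1^{*\top},\; \boldsymbol{\mu}_2^{*\top},\; \operatorname{vec}(\boldsymbol{D}^*_1),\; \operatorname{vec}(\boldsymbol{D}^*_2)\bigr)^\top.
\]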
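C.2. Fixed hyperparameters, shift vector, scale matrix. The values of fixed hyperparameters, the shift vector and the scale matrix chosen according to the guidelines given in Section A.4 were as follows.
\[
\begin{aligned}
\boldsymbol{s} &= (0.315,\; 0.00765,\; 5.53,\; -0.00663,\; -2.75)^\top, \\
\boldsymbol{S} &= \operatorname{diag}(0.864,\; 0.02008,\; 0.35,\; 0.01565,\; 3.23), \\
\delta &= 1, \\
\boldsymbol{\xi}_b &= (0,\; 0,\; 0,\; 0,\; 0)^\top, \qquad \boldsymbol{C}_b = \operatorname{diag}(36,\; 36,\; 36,\; 36,\; 36), \\
\zeta_b &= 6, \qquad g_{b,l} = 0.2, \qquad h_{b,l} = 0.278, \qquad l = 1, \ldots, 5, \\
\zeta_{\phi,1} &= 2, \qquad g_{\phi,1} = 0.2, \qquad h_{\phi,1} = 2.76, \\
\xi_\alpha &= 0, \qquad C_\alpha = 10\,000.
\end{aligned}
\]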
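With the shift vector and scale matrix in hand, the back-transformation µk = s + Sµ∗k, Dk = SD∗k S⊤ from Section C.1 is a one-liner. The Python sketch below uses the reported values of s and S; the function name and the example component (zero mean, identity covariance on the shifted-scaled level) are hypothetical.

```python
import numpy as np

# Shift vector and scale matrix as reported in Section C.2.
s = np.array([0.315, 0.00765, 5.53, -0.00663, -2.75])
S = np.diag([0.864, 0.02008, 0.35, 0.01565, 3.23])


def to_original_scale(mu_star, D_star):
    """Map mixture moments of the shifted-scaled effects to the original scale."""
    mu = s + S @ mu_star
    D = S @ D_star @ S.T
    return mu, D


# Hypothetical component: zero mean and identity covariance on the scaled level.
mu, D = to_original_scale(np.zeros(5), np.eye(5))
```

For this hypothetical component the cluster mean on the original scale equals s and the cluster standard deviations equal the diagonal of S.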
C.3. Monte Carlo Markov chain. The reported results are based on 10 000 iterations of 1:100 thinned MCMC obtained after a burn-in period of 1 000 iterations. This took 75 minutes on an Intel Core 2 Duo 3 GHz CPU with 3.25 GB RAM running Linux. Convergence of the MCMC was evaluated using the R package coda (Plummer et al., 2006). We illustrate the performance of the MCMC by a traceplot of the last 1 000 sampled values (1/10 of the sample used for the inference) of the model deviance D(ψ, θ) = −2 log L(ψ, θ) in the left panel of Figure C.1. Estimated autocorrelations for this quantity are shown in the right panel of Figure C.1.
[Figure C.1: left panel, sampled values plotted against iteration; right panel, autocorrelation plotted against lag (0 to 20).]
Fig C.1. PBC910 Data. Traceplot of last 1 000 sampled values of the observed data deviance
D(ψ, θ) and estimated autocorrelation based on 10 000 sampled values of the observed data
deviance D(ψ, θ).
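The kind of autocorrelation summary shown in Figure C.1 can be sketched in Python as follows. This is a generic sample autocorrelation estimator, not the routine used by the coda package, and the AR(1) series is a hypothetical stand-in for the sampled deviances.

```python
import numpy as np


def autocorrelation(chain, max_lag=20):
    """Sample autocorrelations of a scalar chain for lags 0..max_lag."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    denom = x @ x
    return np.array([(x[: x.size - k] @ x[k:]) / denom for k in range(max_lag + 1)])


# Hypothetical stand-in for 10 000 sampled deviances: an AR(1) series.
rng = np.random.default_rng(42)
chain = np.empty(10_000)
chain[0] = 0.0
for m in range(1, chain.size):
    chain[m] = 0.5 * chain[m - 1] + rng.normal()

acf = autocorrelation(chain)
last_tenth = chain[-1_000:]   # the portion a traceplot of the last 1 000 values shows
```

Autocorrelations that drop to about zero within a few lags indicate that the thinned chain mixes well for the monitored quantity.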
D. Simulation study. In this section, we provide more details on the
simulation study. Namely, Table D.1 shows the true values of the mixture
covariance matrices in the distributions of random effects used to generate
the data. Further, Figures D.1 and D.2 show observed longitudinal profiles
for one simulated dataset in settings with K = 2 and K = 3, respectively, with normally distributed random effects in each cluster and in total 200
subjects. Plots are also supplemented by the true cluster-specific marginal
mean evolutions of each outcome. Supplementary results of the simulation
study follow in Tables D.2 – D.6.
Table D.1
Simulation study. Standard deviations (on the diagonal) and correlations (off-diagonal
elements) for each mixture component derived from the true covariance matrices D 1 ,
D2 , D 3 .

                         Intercept   Slope     Intercept   Slope     Intercept
                         (Gaus.)     (Gaus.)   (Pois.)     (Pois.)   (Bernoulli)
k = 1
Intercept (Gaus.)        0.5         0.0       −0.3        0.0       0.3
Slope (Gaus.)                        0.01      0.0         −0.2      0.1
Intercept (Pois.)                              0.3         0.0       0.0
Slope (Pois.)                                              0.003     0.0
Intercept (Bernoulli)                                                2.0
k = 2
Intercept (Gaus.)        0.3         −0.2      0.1         −0.1      0.2
Slope (Gaus.)                        0.01      0.0         0.3       0.1
Intercept (Pois.)                              0.5         0.0       −0.2
Slope (Pois.)                                              0.005     0.0
Intercept (Bernoulli)                                                3.0
k = 3
Intercept (Gaus.)        0.7         0.0       0.0         0.0       0.0
Slope (Gaus.)                        0.03      0.0         0.0       0.0
Intercept (Pois.)                              0.7         0.0       0.0
Slope (Pois.)                                              0.010     0.0
Intercept (Bernoulli)                                                0.5
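A covariance matrix is recovered from the standard deviations and correlations of Table D.1 as D = diag(sd) · R · diag(sd). The Python sketch below does this for the first component, with the values as read from the k = 1 block of the table; the assignment of the extracted numbers to table cells is itself a reading of the flattened layout and should be treated as an assumption.

```python
import numpy as np

# Standard deviations and correlations as read from the k = 1 block of Table D.1.
# Order: intercept (Gaus.), slope (Gaus.), intercept (Pois.), slope (Pois.),
# intercept (Bernoulli).
sd = np.array([0.5, 0.01, 0.3, 0.003, 2.0])
corr = np.eye(5)
upper = {(0, 1): 0.0, (0, 2): -0.3, (0, 3): 0.0, (0, 4): 0.3,
         (1, 2): 0.0, (1, 3): -0.2, (1, 4): 0.1,
         (2, 3): 0.0, (2, 4): 0.0, (3, 4): 0.0}
for (i, j), r in upper.items():
    corr[i, j] = corr[j, i] = r          # fill both triangles symmetrically

D1 = np.diag(sd) @ corr @ np.diag(sd)    # covariance = diag(sd) * corr * diag(sd)
eigenvalues = np.linalg.eigvalsh(D1)     # all positive for a valid covariance matrix
```

Checking that all eigenvalues are positive is a quick way to confirm that the values read from the table define a valid covariance matrix.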
[Figure D.1: panels for Group 1 and Group 2, each showing the Gaussian response, the Poisson response and the Bernoulli response against time (months).]
Fig D.1. Simulation study. One simulated dataset for a model with K = 2 (N = (60, 40))
and normally distributed random effects in each cluster. Thick lines show true cluster-specific marginal mean evolution. The values of a dichotomous Bernoulli response are
vertically jittered.
[Figure D.2: panels for Group 1, Group 2 and Group 3, each showing the Gaussian response, the Poisson response and the Bernoulli response against time (months).]
Fig D.2. Simulation study. One simulated dataset for a model with K = 3 (N =
(60, 34, 6)) and normally distributed random effects in each cluster. Thick lines show true
cluster-specific marginal mean evolution. The values of a dichotomous Bernoulli response
are vertically jittered.
Table D.2
Simulation study: error rates of classification obtained from a model with correctly
specified number of clusters.

                         Overall error   Conditional error rate (%) given component
Setting                  rate (%)        1         2         3
Normal (K = 2)
  N = (30, 20)           15.8            7.9       27.6      N.A.
  N = (60, 40)           7.8             4.0       13.4      N.A.
  N = (120, 80)          5.8             4.0       8.4       N.A.
MVT5 (K = 2)
  N = (30, 20)           20.6            10.9      35.2      N.A.
  N = (60, 40)           10.8            3.5       21.9      N.A.
  N = (120, 80)          8.7             4.3       15.4      N.A.
Normal (K = 3)
  N = (30, 17, 3)        26.5            13.9      41.9      66.0
  N = (60, 34, 6)        17.4            11.9      20.0      56.8
  N = (120, 68, 12)      10.1            6.4       10.8      42.3
MVT5 (K = 3)
  N = (30, 17, 3)        23.8            11.2      39.3      62.3
  N = (60, 34, 6)        18.8            14.8      19.3      55.2
  N = (120, 68, 12)      9.7             6.4       11.9      30.4
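The columns of Table D.2 can be cross-checked against each other. The Python sketch below assumes that, because the cluster sizes N are fixed by design, the overall error rate is the size-weighted average of the conditional error rates; small discrepancies are expected from rounding of the reported values.

```python
import numpy as np

# Normal (K = 2), N = (30, 20): conditional error rates (%) from Table D.2.
sizes = np.array([30, 20])
conditional = np.array([7.9, 27.6])
overall = float(np.sum(sizes * conditional) / sizes.sum())

# Normal (K = 3), N = (30, 17, 3).
sizes3 = np.array([30, 17, 3])
conditional3 = np.array([13.9, 41.9, 66.0])
overall3 = float(np.sum(sizes3 * conditional3) / sizes3.sum())
```

Both weighted averages agree with the reported overall error rates (15.8 and 26.5) up to rounding, which also shows how strongly the small third cluster is misclassified without dominating the overall rate.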
Table D.3
Simulation study: mixture weights. Bias, standard deviation and mean squared error
(MSE) are based on posterior means as parameter estimates, the reported MSE is the
average MSE over the K mixture components.

Setting                Bias                       Std. Dev.                √MSE
w = (0.600, 0.400)
Normal (K = 2)
  N = (30, 20)         (0.053, −0.053)            (0.161, 0.161)           0.169
  N = (60, 40)         (0.030, −0.030)            (0.057, 0.057)           0.064
  N = (120, 80)        (0.008, −0.008)            (0.037, 0.037)           0.037
MVT5 (K = 2)
  N = (30, 20)         (0.071, −0.071)            (0.198, 0.198)           0.209
  N = (60, 40)         (0.064, −0.064)            (0.111, 0.111)           0.128
  N = (120, 80)        (0.033, −0.033)            (0.097, 0.097)           0.102
w = (0.600, 0.340, 0.060)
Normal (K = 3)
  N = (30, 17, 3)      (0.023, −0.051, 0.028)     (0.190, 0.127, 0.127)    0.154
  N = (60, 34, 6)      (0.003, 0.003, −0.006)     (0.113, 0.090, 0.051)    0.088
  N = (120, 68, 12)    (0.005, 0.003, −0.008)     (0.054, 0.056, 0.029)    0.048
MVT5 (K = 3)
  N = (30, 17, 3)      (0.045, −0.057, 0.012)     (0.133, 0.104, 0.070)    0.113
  N = (60, 34, 6)      (−0.029, 0.008, 0.021)     (0.158, 0.106, 0.086)    0.122
  N = (120, 68, 12)    (−0.007, −0.011, 0.017)    (0.065, 0.060, 0.034)    0.056
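The last column of Table D.3 can be reproduced from the other two. The Python sketch below assumes the usual decomposition of the MSE of each component into squared bias plus variance, averaged over the K components before taking the square root; the function name is hypothetical and the inputs are rounded table values, so agreement is only approximate.

```python
import numpy as np


def root_average_mse(bias, sd):
    """Square root of the MSE averaged over mixture components."""
    bias, sd = np.asarray(bias), np.asarray(sd)
    return float(np.sqrt(np.mean(bias**2 + sd**2)))


# Normal (K = 2), N = (30, 20) row of Table D.3.
value_k2 = root_average_mse([0.053, -0.053], [0.161, 0.161])
# Normal (K = 3), N = (30, 17, 3) row of Table D.3.
value_k3 = root_average_mse([0.023, -0.051, 0.028], [0.190, 0.127, 0.127])
```

The two values match the reported 0.169 and 0.154 to within rounding error of the tabulated inputs.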
Table D.4
Simulation study. Mixture means of the random effects in the model for Gaussian
response. Bias, standard deviation and mean squared error (MSE) are based on posterior
means as parameter estimates, the reported MSE is the average MSE over the K mixture
components.

Setting                Bias                            Std. Dev.                   √MSE
µ∗,1 = (0.000, 1.000) (Intercept, Gaussian resp.)
Normal (K = 2)
  N = (30, 20)         (0.140, −0.102)                 (0.194, 0.213)              0.237
  N = (60, 40)         (0.053, −0.020)                 (0.128, 0.162)              0.150
  N = (120, 80)        (0.014, 0.003)                  (0.074, 0.055)              0.066
MVT5 (K = 2)
  N = (30, 20)         (0.174, −0.140)                 (0.215, 0.738)              0.564
  N = (60, 40)         (0.075, −0.041)                 (0.145, 0.592)              0.433
  N = (120, 80)        (0.038, −0.085)                 (0.114, 0.516)              0.378
µ∗,1 = (0.000, 1.000, 1.300) (Intercept, Gaussian resp.)
Normal (K = 3)
  N = (30, 17, 3)      (0.184, −0.117, −0.427)         (0.181, 0.295, 0.495)       0.444
  N = (60, 34, 6)      (0.124, −0.086, −0.124)         (0.161, 0.209, 0.533)       0.360
  N = (120, 68, 12)    (0.030, −0.014, −0.040)         (0.084, 0.107, 0.428)       0.260
MVT5 (K = 3)
  N = (30, 17, 3)      (0.181, −0.107, −0.282)         (0.196, 0.336, 0.827)       0.563
  N = (60, 34, 6)      (0.095, −0.100, −0.362)         (0.151, 0.201, 0.574)       0.424
  N = (120, 68, 12)    (0.013, 0.017, −0.163)          (0.091, 0.086, 0.579)       0.353
µ∗,2 = (0.0100, 0.0100) (Slope, Gaussian resp.)
Normal (K = 2)
  N = (30, 20)         (0.0006, −0.0001)               (0.0036, 0.0058)            0.0048
  N = (60, 40)         (−0.0001, −0.0001)              (0.0029, 0.0039)            0.0034
  N = (120, 80)        (−0.0003, −0.0002)              (0.0018, 0.0026)            0.0022
MVT5 (K = 2)
  N = (30, 20)         (0.0000, 0.0023)                (0.0038, 0.0185)            0.0134
  N = (60, 40)         (0.0006, 0.0000)                (0.0025, 0.0144)            0.0103
  N = (120, 80)        (0.0002, −0.0006)               (0.0018, 0.0067)            0.0049
µ∗,2 = (0.0100, 0.0100, −0.0300) (Slope, Gaussian resp.)
Normal (K = 3)
  N = (30, 17, 3)      (−0.0005, −0.0056, 0.0200)      (0.0039, 0.0138, 0.0248)    0.0204
  N = (60, 34, 6)      (−0.0008, −0.0015, 0.0034)      (0.0031, 0.0049, 0.0279)    0.0165
  N = (120, 68, 12)    (−0.0001, −0.0008, −0.0028)     (0.0022, 0.0029, 0.0214)    0.0126
MVT5 (K = 3)
  N = (30, 17, 3)      (−0.0008, −0.0042, 0.0223)      (0.0041, 0.0173, 0.0294)    0.0237
  N = (60, 34, 6)      (−0.0007, −0.0024, 0.0093)      (0.0029, 0.0068, 0.0347)    0.0211
  N = (120, 68, 12)    (0.0002, −0.0016, 0.0051)       (0.0020, 0.0079, 0.0327)    0.0196
Table D.5
Simulation study: mixture means of the random effects in the model for Poisson response.
Bias, standard deviation and mean squared error (MSE) are based on posterior means as
parameter estimates, the reported MSE is the average MSE over the K mixture
components.

Setting                Bias                            Std. Dev.                   √MSE
µ∗,3 = (5.00, 5.00) (Intercept, Poisson resp.)
Normal (K = 2)
  N = (30, 20)         (0.00, −0.03)                   (0.06, 0.16)                0.12
  N = (60, 40)         (−0.01, 0.00)                   (0.05, 0.11)                0.09
  N = (120, 80)        (0.00, 0.00)                    (0.03, 0.06)                0.05
MVT5 (K = 2)
  N = (30, 20)         (0.00, −0.04)                   (0.05, 0.51)                0.36
  N = (60, 40)         (0.00, −0.01)                   (0.04, 0.67)                0.47
  N = (120, 80)        (0.00, −0.02)                   (0.03, 0.22)                0.16
µ∗,3 = (5.00, 5.00, 5.50) (Intercept, Poisson resp.)
Normal (K = 3)
  N = (30, 17, 3)      (0.02, 0.10, −0.19)             (0.10, 0.32, 0.40)          0.33
  N = (60, 34, 6)      (0.02, 0.03, −0.08)             (0.08, 0.15, 0.42)          0.26
  N = (120, 68, 12)    (0.01, 0.01, 0.02)              (0.06, 0.09, 0.31)          0.19
MVT5 (K = 3)
  N = (30, 17, 3)      (0.01, 0.16, −0.21)             (0.08, 0.52, 0.70)          0.52
  N = (60, 34, 6)      (0.02, 0.02, −0.12)             (0.06, 0.14, 0.51)          0.31
  N = (120, 68, 12)    (0.00, 0.01, −0.14)             (0.03, 0.10, 0.35)          0.22
µ∗,4 = (−0.0050, −0.0200) (Slope, Poisson resp.)
Normal (K = 2)
  N = (30, 20)         (−0.0017, 0.0003)               (0.0025, 0.0046)            0.0039
  N = (60, 40)         (−0.0006, −0.0005)              (0.0017, 0.0018)            0.0018
  N = (120, 80)        (−0.0002, −0.0002)              (0.0006, 0.0012)            0.0010
MVT5 (K = 2)
  N = (30, 20)         (−0.0026, 0.0019)               (0.0029, 0.0058)            0.0051
  N = (60, 40)         (−0.0011, 0.0007)               (0.0020, 0.0044)            0.0035
  N = (120, 80)        (−0.0006, 0.0001)               (0.0016, 0.0022)            0.0020
µ∗,4 = (−0.0050, −0.0200, 0.0000) (Slope, Poisson resp.)
Normal (K = 3)
  N = (30, 17, 3)      (−0.0023, 0.0049, −0.0042)      (0.0021, 0.0057, 0.0089)    0.0073
  N = (60, 34, 6)      (−0.0011, 0.0025, 0.0001)       (0.0020, 0.0042, 0.0078)    0.0054
  N = (120, 68, 12)    (−0.0002, 0.0005, 0.0004)       (0.0007, 0.0022, 0.0055)    0.0034
MVT5 (K = 3)
  N = (30, 17, 3)      (−0.0023, 0.0049, −0.0039)      (0.0027, 0.0062, 0.0097)    0.0078
  N = (60, 34, 6)      (−0.0010, 0.0027, −0.0038)      (0.0019, 0.0043, 0.0075)    0.0058
  N = (120, 68, 12)    (−0.0002, 0.0010, −0.0021)      (0.0011, 0.0036, 0.0071)    0.0048
Table D.6
Simulation study: mixture means of the random effects in the model for Bernoulli
response. Bias, standard deviation and mean squared error (MSE) are based on posterior
means as parameter estimates, the reported MSE is the average MSE over the K mixture
components.

Setting                Bias                       Std. Dev.             √MSE
µ∗,5 = (−3.00, −1.00) (Intercept, Bernoulli resp.)
Normal (K = 2)
  N = (30, 20)         (−0.06, −0.27)             (1.15, 2.33)          1.84
  N = (60, 40)         (−0.08, 0.02)              (0.61, 1.05)          0.86
  N = (120, 80)        (−0.08, 0.00)              (0.42, 0.50)          0.46
MVT5 (K = 2)
  N = (30, 20)         (0.09, 0.35)               (0.87, 4.63)          3.33
  N = (60, 40)         (−0.03, −0.18)             (0.63, 3.02)          2.17
  N = (120, 80)        (−0.11, 0.26)              (0.45, 2.02)          1.47
µ∗,5 = (−3.00, −1.00, −2.00) (Intercept, Bernoulli resp.)
Normal (K = 3)
  N = (30, 17, 3)      (−0.01, −0.54, −3.91)      (1.03, 2.69, 7.28)    5.04
  N = (60, 34, 6)      (0.09, −0.46, −2.74)       (0.75, 1.95, 4.06)    3.07
  N = (120, 68, 12)    (−0.02, 0.10, −1.28)       (0.47, 0.55, 2.32)    1.58
MVT5 (K = 3)
  N = (30, 17, 3)      (0.08, −1.10, −2.05)       (0.86, 3.98, 5.12)    3.99
  N = (60, 34, 6)      (0.02, −0.33, −1.83)       (0.83, 1.01, 4.04)    2.66
  N = (120, 68, 12)    (0.01, 0.07, −0.76)        (0.36, 0.52, 2.53)    1.56
References.
Fong, Y., Rue, H. and Wakefield, J. (2010). Bayesian inference for generalized linear
mixed models. Biostatistics 11 397–412.
Gamerman, D. (1997). Sampling from the posterior distribution in generalized linear
mixed models. Statistics and Computing 7 57–68.
Komárek, A. and Komárková, L. (2012). Clustering for multivariate continuous and
discrete longitudinal data. The Annals of Applied Statistics.
Plummer, M., Best, N., Cowles, K. and Vines, K. (2006). CODA: Convergence Diagnosis and Output Analysis for MCMC. R News 6 7–11.
Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with unknown number of components (with Discussion). Journal of the Royal Statistical Society, Series B 59 731–792.
A. Komárek
Faculty of Mathematics and Physics
Charles University in Prague
Sokolovská 83
CZ–186 75, Praha 8
Czech Republic
E-mail: [email protected]
URL: http://www.karlin.mff.cuni.cz/∼komarek
L. Komárková
Faculty of Management
the University of Economics in Prague
Jarošovská 1117