SUPPLEMENT TO “CLUSTERING FOR MULTIVARIATE CONTINUOUS AND DISCRETE LONGITUDINAL DATA”

By Arnošt Komárek and Lenka Komárková

Charles University in Prague and the University of Economics in Prague

This is a supplement to the paper published by The Annals of Applied Statistics. It contains (A) a more detailed description of the assumed prior distribution for the model parameters, together with guidelines for the selection of the hyperparameters to achieve a weakly informative prior distribution; (B) more details on the posterior distribution and the MCMC sampling algorithm; (C) additional information on the analysis of the Mayo Clinic PBC data; (D) more detailed results of the simulation study.

This is a supplement to the paper “Clustering for multivariate continuous and discrete longitudinal data” (Komárek and Komárková, 2012). Sections of the supplement are treated as appendices to the main paper and are numbered using capital letters. Cross-references in this supplement that do not contain a capital letter refer to the corresponding parts of the main paper. Appendix A provides a detailed description of the prior distribution in the Bayesian specification of the mixture multivariate generalized linear mixed model, as well as guidelines for the selection of the fixed hyperparameters to achieve a weakly informative prior distribution. Appendix B gives more details on the posterior distribution and the MCMC sampling algorithm. Additional information on the analysis of the Mayo Clinic PBC data is given in Appendix C. Finally, Appendix D shows additional results of the simulation study.

Keywords and phrases: Classification, functional data, generalized linear mixed model, multivariate longitudinal data, repeated observations.

A. Prior distribution. In this appendix, we provide a detailed overview of the prior distribution assumed in the Bayesian specification of the mixture multivariate generalized linear mixed model.
We also give some guidelines for the selection of the hyperparameters to achieve a weakly informative prior distribution.

A.1. Notation. First, we review and slightly extend the notation:

• $\alpha = (\alpha_1^\top, \ldots, \alpha_R^\top)^\top$: the fixed effects, where $\alpha_r$ ($r = 1, \ldots, R$) is a $c_r \times 1$ vector and $\alpha$ is a $c \times 1$ vector ($c = \sum_{r=1}^R c_r$);
• $\phi = (\phi_1, \ldots, \phi_R)^\top$: GLMM dispersion parameters, some of which may be constants by definition; e.g., for a Bernoulli response with a logit link (logistic regression) the corresponding $\phi_r$ is equal to 1, and the same is true for a Poisson response with a log link. Further, we denote by $\phi^{-1} = (\phi_1^{-1}, \ldots, \phi_R^{-1})^\top$ the vector of inverted dispersion parameters;
• $B = (b_1, \ldots, b_N)$: individual (latent) values of random effects, where $b_i = (b_{i,1}^\top, \ldots, b_{i,R}^\top)^\top$, $b_{i,r}$ ($r = 1, \ldots, R$) are $d_r \times 1$ vectors and $b_i$ ($i = 1, \ldots, N$) are $d \times 1$ vectors ($d = \sum_{r=1}^R d_r$);
• $w = (w_1, \ldots, w_K)^\top$: mixture weights from the distribution of random effects;
• $\mu_1, \ldots, \mu_K$: $d \times 1$ vectors of mixture means;
• $D_1, \ldots, D_K$: $d \times d$ mixture covariance matrices;
• $s$: $d \times 1$ shift vector used to calculate the shifted-scaled random effects (see below);
• $S$: $d \times d$ diagonal scale matrix used to calculate the shifted-scaled random effects;
• $B^* = (b^*_1, \ldots, b^*_N)$: individual (latent) values of shifted-scaled random effects, where $b^*_i = (b^{*\top}_{i,1}, \ldots, b^{*\top}_{i,R})^\top$, $b^*_{i,r}$ ($r = 1, \ldots, R$) are $d_r \times 1$ vectors and $b^*_i$ ($i = 1, \ldots, N$) are $d \times 1$ vectors;
• $\mu^*_1, \ldots, \mu^*_K$: $d \times 1$ vectors of mixture means from the distribution of shifted-scaled random effects;
• $D^*_1, \ldots, D^*_K$: $d \times d$ mixture covariance matrices from the distribution of shifted-scaled random effects;
• $u = (u_1, \ldots, u_N)^\top$: (latent) component allocations;
• $\gamma_b = (\gamma_{b,1}, \ldots, \gamma_{b,d})^\top$: random hyperparameter introduced at an additional hierarchy level when specifying the prior distribution for the mixture covariance matrices (see below). Further, let $\gamma_b^{-1} = (\gamma_{b,1}^{-1}, \ldots, \gamma_{b,d}^{-1})^\top$;
• $\gamma_\phi = (\gamma_{\phi,1}, \ldots, \gamma_{\phi,R})^\top$: random hyperparameter introduced at an additional hierarchy level when specifying the prior distribution for the GLMM dispersion parameters (see below). Further, let $\gamma_\phi^{-1} = (\gamma_{\phi,1}^{-1}, \ldots, \gamma_{\phi,R}^{-1})^\top$.

The set of GLMM related parameters composes a vector $\psi$, and the set of mixture related parameters composes a vector $\theta$, i.e.,
$$\psi = \big(\alpha^\top, \phi^\top\big)^\top, \qquad \theta = \big(w^\top, \mu_1^\top, \ldots, \mu_K^\top, \mathrm{vec}(D_1), \ldots, \mathrm{vec}(D_K)\big)^\top.$$
In the following, $\psi$ and $\theta$ are sometimes referred to as frequentist parameters, since they coincide with the unknown parameters of the corresponding frequentist model. Finally, let $p(\cdot)$ and $p(\cdot \mid \cdot)$ be generic symbols for (conditional) distributions, and let $\varphi(\cdot \mid \mu, D)$ be the density of the (multivariate) normal distribution with mean $\mu$ and covariance matrix $D$.

A.2. Shifted-scaled random effects. To improve the mixing and numerical stability of the MCMC algorithm described in Section B, we introduce shifted-scaled random effects
$$b^*_i = S^{-1}(b_i - s), \qquad i = 1, \ldots, N,$$
where $s$ is a $d \times 1$ fixed shift vector and $S$ is a $d \times d$ fixed diagonal scale matrix. It follows from Eq. (4) that
$$b^*_i \mid \theta^* \;\overset{\text{i.i.d.}}{\sim}\; \sum_{k=1}^K w_k\, \mathrm{MVN}(\mu^*_k, D^*_k), \qquad i = 1, \ldots, N,$$
where $\theta^* = \big(w^\top, \mu_1^{*\top}, \ldots, \mu_K^{*\top}, \mathrm{vec}(D^*_1), \ldots, \mathrm{vec}(D^*_K)\big)^\top$:

(A.1) $\mu^*_k = S^{-1}(\mu_k - s)$, $\quad D^*_k = S^{-1} D_k S^{-1}$, $\quad k = 1, \ldots, K$.

Without loss of generality, we may set $s = 0$ (zero vector) and $S = I_d$ ($d \times d$ identity matrix). However, for numerical stability of the MCMC algorithm which is subsequently used to obtain a sample from the corresponding posterior distribution, it is useful if the shifted-scaled random effects have approximately zero means and unit variances.
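The reparameterization (A.1) and its inverse can be sketched in a few lines of numpy; the shift `s`, scale `S_diag`, and mixture parameters below are hypothetical example values, not quantities from the paper's data analysis.

```python
import numpy as np

def shift_scale(mu_k, D_k, s, S_diag):
    """Map (mu_k, D_k) to (mu_k*, D_k*) via (A.1):
    mu* = S^{-1}(mu - s),  D* = S^{-1} D S^{-1}, with S diagonal."""
    S_inv = 1.0 / S_diag
    mu_star = S_inv * (mu_k - s)
    D_star = np.outer(S_inv, S_inv) * D_k
    return mu_star, D_star

# Hypothetical two-dimensional example (d = 2)
s = np.array([1.0, -0.5])
S_diag = np.array([2.0, 4.0])
mu = np.array([3.0, 1.5])
D = np.array([[4.0, 1.0], [1.0, 16.0]])
mu_star, D_star = shift_scale(mu, D, s, S_diag)

# The inverse map recovers the original parameters: mu = s + S mu*, D = S D* S
assert np.allclose(s + S_diag * mu_star, mu)
assert np.allclose(np.outer(S_diag, S_diag) * D_star, D)
```

Because $S$ is diagonal, $S^{-1} D S^{-1}$ reduces to an elementwise product with the outer product of the inverted scales, avoiding explicit matrix inversion.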
We return to this point in Sec. A.5.

A.3. Factorization of the prior distribution. With the MCMC algorithm, we primarily obtain a sample from the posterior distribution of $\theta^*$ (parameters of the mixture distribution of the shifted-scaled random effects) and then use the transformation (A.1) to get a sample from the posterior distribution of $\theta$ (parameters of the mixture distribution of the original random effects). For this reason, all expressions describing the assumed prior distribution are stated in terms of $\theta^*$ rather than $\theta$.

As is usual for hierarchically specified models, the joint prior distribution, which also involves the latent random effects $B$ and the component allocations $u$, exploits conditional independence between sets of unrelated parameters. Namely, it is factorized as

(A.2) $p(\psi, \theta^*, B, u) = p(B \mid \theta^*, u) \times p(u \mid \theta^*) \times p(\theta^*) \times p(\psi)$,

where

(A.3) $p(B \mid \theta^*, u) = \prod_{i=1}^N p(b_i \mid \theta^*, u_i), \qquad p(u \mid \theta^*) = \prod_{i=1}^N p(u_i \mid w)$.

The products on the right-hand side of Eq. (A.3) follow from the assumed independence between subjects/patients. For the prior distribution $p(\theta^*)$ of the mixture related parameters, we use a multivariate version of the classical proposal of Richardson and Green (1997), which exploits the factorization

(A.4) $p(\theta^*) = p(w) \times \prod_{k=1}^K \big\{ p(\mu^*_k) \times p(D^{*-1}_k) \big\}$.

For $p(\psi)$, the last part of the prior related to the GLMM parameters, we adopt priors classically used in this context (see, e.g., Fong, Rue and Wakefield, 2010) with the factorization

(A.5) $p(\psi) = p(\phi^{-1}) \times p(\alpha) = \prod_{r=1}^R p(\phi_r^{-1}) \times p(\alpha)$.

In the remainder of this section, we describe our particular choices for the factors in Eqs. (A.3), (A.4) and (A.5). Further, we explain data motivated guidelines for the selection of the fixed hyperparameters which may be followed to obtain a weakly informative prior distribution.
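The factorization (A.2)–(A.3) means the log joint prior is simply a sum of log factors; the following sketch illustrates that structure, with `log_p_theta_star` and `log_p_psi` as hypothetical callables standing in for the concrete priors of Sections A.7–A.11.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_joint_prior(B, u, w, mu_star, D_star, log_p_theta_star, log_p_psi):
    """Sum of the log factors in (A.2)-(A.3):
    log p(B | theta*, u) + log p(u | theta*) + log p(theta*) + log p(psi).
    `B` is a list of (shifted-scaled) random effect vectors, `u` the
    allocations; the last two arguments are hypothetical placeholders."""
    lp = 0.0
    for b_i, u_i in zip(B, u):
        # p(b_i | theta*, u_i): normal density of component u_i (Sec. A.5)
        lp += multivariate_normal.logpdf(b_i, mean=mu_star[u_i], cov=D_star[u_i])
        # p(u_i | w): discrete allocation probability (Sec. A.6)
        lp += np.log(w[u_i])
    return lp + log_p_theta_star() + log_p_psi()
```

The subject-wise loop mirrors the products in (A.3), which rest on the assumed independence between subjects.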
These guidelines are motivated by the classical proposal of Richardson and Green (1997) in the case of the mixture related parameters and by the classical proposals for the GLMM related parameters (see, e.g., Fong, Rue and Wakefield, 2010).

A.4. Weakly informative prior distribution. To be able to follow Richardson and Green's guidelines for the selection of the fixed hyperparameters leading to a weakly informative prior distribution, we need some initial idea of the location, scale and individual values of the random effects in our multivariate GLMM. This can be obtained by using standard software and calculating separate maximum-likelihood estimates in each of the $R$ GLMMs (3) under the classical assumption of normally distributed random effects. In the following, let for each $r = 1, \ldots, R$, $\beta^0_{ML,r}$ be the initial estimate of the overall mean vector of the random effects obtained in this way, and $d^0_{ML,r}$ the analogously obtained initial estimate of the vector of overall standard deviations of the random effects. Further, let for each $r = 1, \ldots, R$, $b^0_{EB,i,r}$, $i = 1, \ldots, N$, be the empirical Bayes estimates of the individual values of the random effects in the $r$th ML estimated GLMM under the assumption of normality of the random effects. Furthermore, let
$$\beta^0_{ML} = \big(\beta^{0\top}_{ML,1}, \ldots, \beta^{0\top}_{ML,R}\big)^\top, \qquad d^0_{ML} = \big(d^{0\top}_{ML,1}, \ldots, d^{0\top}_{ML,R}\big)^\top, \qquad b^0_{EB,i} = \big(b^{0\top}_{EB,i,1}, \ldots, b^{0\top}_{EB,i,R}\big)^\top, \quad i = 1, \ldots, N.$$
The right-hand side of Expression (A.2) is then composed of the factors explained in Sec. A.5–A.11.

A.5. Prior for random effects. $p(b_i \mid \theta^*, u_i)$ is, for each $i = 1, \ldots, N$, the (multivariate) normal distribution $\mathcal{N}\big(s + S\mu^*_{u_i},\; S D^*_{u_i} S^\top\big)$, that is,

(A.6) $p(b_i \mid \theta^*, u_i) \propto \big|S D^*_{u_i} S^\top\big|^{-\frac12} \exp\big\{-\tfrac12 \big(b_i - s - S\mu^*_{u_i}\big)^\top \big(S D^*_{u_i} S^\top\big)^{-1} \big(b_i - s - S\mu^*_{u_i}\big)\big\}$.

This follows from the assumed mixture model (4). For numerical stability of the MCMC algorithm which is subsequently used to obtain a sample from the posterior distribution, it is useful if the shifted-scaled random effects

(A.7) $b^*_i = S^{-1}(b_i - s)$, $\quad i = 1, \ldots, N$,

have approximately zero means and unit variances. This can be achieved by setting $s = \beta^0_{ML}$ and $S = \mathrm{diag}(d^0_{ML})$. In the sequel, let further $b^{0*}_{EB,i} = S^{-1}(b^0_{EB,i} - s)$, $i = 1, \ldots, N$, be the vectors of shifted and scaled values of the initial ML based empirical Bayes estimates of the random effects, and let $R^{0*}_{EB} = (R^{0*}_{EB,1}, \ldots, R^{0*}_{EB,d})^\top$ be the vector of sample standard deviations based on the elements of $b^{0*}_{EB,i}$, multiplied by some factor $c$, where $c > 1$. In our applications we use $c = 6$. Finally, let $(R^{0*}_{EB})^2 = \big((R^{0*}_{EB,1})^2, \ldots, (R^{0*}_{EB,d})^2\big)^\top$.

A.6. Prior for component allocations. $p(u_i \mid w)$, $i = 1, \ldots, N$, also follows from the assumed mixture model for the random effects and coincides with the discrete distribution with $\mathrm{P}(u_i = k \mid w) = w_k$, $k = 1, \ldots, K$.

A.7. Prior for mixture weights. $p(w)$ is a Dirichlet distribution $\mathcal{D}(\delta, \ldots, \delta)$, where $\delta$ is a fixed hyperparameter (prior sample size). That is,
$$p(w) = \frac{\Gamma(K\delta)}{\Gamma(\delta)^K} \prod_{k=1}^K w_k^{\delta - 1}.$$
A weakly informative prior is obtained by setting $\delta$ to a small positive value, e.g., $\delta = 1$.

A.8. Prior for mixture means. $p(\mu^*_k)$, $k = 1, \ldots, K$, is a (multivariate) normal distribution $\mathcal{N}(\xi_b, C_b)$, where $\xi_b$ and $C_b$ are the fixed prior mean and covariance matrix, respectively, that is,
$$p(\mu^*_k) \propto |C_b|^{-\frac12} \exp\big\{-\tfrac12 \big(\mu^*_k - \xi_b\big)^\top C_b^{-1} \big(\mu^*_k - \xi_b\big)\big\}, \qquad k = 1, \ldots, K.$$
Following the suggestions of Richardson and Green (1997) adapted to the multivariate GLMM, a weakly informative prior is obtained by taking $\xi_b$ to be the sample mean of $b^{0*}_{EB,i}$, $i = 1, \ldots, N$, and $C_b = \mathrm{diag}\big((R^{0*}_{EB})^2\big)$.

A.9. Prior for mixture covariance matrices. $p(D^{*-1}_k \mid \gamma_b)$, $k = 1, \ldots$
, K, is a Wishart distribution $\mathcal{W}(\zeta_b, \Xi_b)$, where $\zeta_b$ are fixed prior degrees of freedom and $\Xi_b = \mathrm{diag}(\gamma_b)$ is a diagonal scale matrix with random diagonal elements. That is,

(A.8) $p(D^{*-1}_k \mid \gamma_b) \propto \big|\Xi_b\big|^{-\frac{\zeta_b}{2}}\, \big|D^{*-1}_k\big|^{\frac{\zeta_b - d - 1}{2}} \exp\big\{-\tfrac12 \mathrm{tr}\big(\Xi_b^{-1} D^{*-1}_k\big)\big\}$, $\quad k = 1, \ldots, K$.

To get a weakly informative prior, we set $\zeta_b$ to a small positive number, e.g., $\zeta_b = d + 1$. Further observe that, a priori, $\mathrm{E}\big(D^{*-1}_k \mid \gamma_b\big) = \zeta_b\, \Xi_b$. As noted by Richardson and Green (1997) in the univariate i.i.d. context, it seems restrictive to suppose that knowledge of the variability of the data, represented by $R^{0*}_{EB}$ in our case, implies much about the size of the matrices $D^{*-1}_k$, $k = 1, \ldots, K$, and consequently it is not advisable to take a fixed scale matrix $\Xi_b$ based on $R^{0*}_{EB}$. For this reason, they suggest introducing an additional hierarchical level in the prior distribution. Their proposal, adapted to our multivariate setting, then takes $\Xi_b = \mathrm{diag}(\gamma_b)$ with independent gamma priors $\mathcal{G}(g_{b,1}, h_{b,1}), \ldots, \mathcal{G}(g_{b,d}, h_{b,d})$ for $\gamma^{-1}_{b,1}, \ldots, \gamma^{-1}_{b,d}$, the inverted diagonal elements of $\Xi_b$. That is,
$$p(\gamma_b^{-1}) = \prod_{l=1}^d p(\gamma^{-1}_{b,l}) = \prod_{l=1}^d \frac{h_{b,l}^{g_{b,l}}}{\Gamma(g_{b,l})}\, \big(\gamma^{-1}_{b,l}\big)^{g_{b,l}-1} \exp\big(-h_{b,l}\, \gamma^{-1}_{b,l}\big),$$
where $(g_{b,1}, h_{b,1}), \ldots, (g_{b,d}, h_{b,d})$ are fixed hyperparameters. A weakly informative prior is obtained by setting $g_{b,l}$ to a small positive value, e.g., $g_{b,l} = 0.2$, $l = 1, \ldots, d$, and setting $h_{b,l}$ to a moderate multiple of $1/(R^{0*}_{EB,l})^2$, e.g., $h_{b,l} = 10/(R^{0*}_{EB,l})^2$, $l = 1, \ldots, d$.

A.10. Prior for GLMM dispersion parameters. $p(\phi_r^{-1} \mid \gamma_{\phi,r})$, $r = 1, \ldots, R$, is a nondegenerate distribution only for those $r = 1, \ldots, R$ for which the dispersion parameter $\phi_r$ of the corresponding GLMM is not constant by definition.
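As a sanity check on the hyperparameter guidelines for the mixture covariance matrices, one can simulate a draw from this hierarchical prior; a sketch using scipy, where `R_EB` is a hypothetical stand-in for $R^{0*}_{EB}$ (scipy's Wishart is parametrized so that the mean is `df * scale`, matching $\mathrm{E}(D^{*-1}_k \mid \gamma_b) = \zeta_b \Xi_b$).

```python
import numpy as np
from scipy.stats import gamma, wishart

rng = np.random.default_rng(0)
d = 3
R_EB = np.array([1.2, 0.8, 2.0])   # hypothetical stand-in for R^{0*}_{EB}

# Weakly informative settings from Sec. A.9
zeta_b = d + 1
g_b = np.full(d, 0.2)
h_b = 10.0 / R_EB**2

# gamma_{b,l}^{-1} ~ G(g_{b,l}, h_{b,l})  (shape/rate parameterization)
gamma_inv = gamma.rvs(a=g_b, scale=1.0 / h_b, random_state=rng)
Xi_b = np.diag(1.0 / gamma_inv)    # Xi_b = diag(gamma_b)

# D_k^{*-1} | gamma_b ~ Wishart(zeta_b, Xi_b)
D_star_inv = wishart.rvs(df=zeta_b, scale=Xi_b, random_state=rng)
D_star = np.linalg.inv(D_star_inv)  # one prior draw of a mixture covariance
```

Repeating such draws for a grid of $h_{b,l}$ values gives a quick feel for how diffuse the implied prior on the cluster covariances actually is.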
With respect to the analysis of the PBC910 data, where we consider three types of GLMMs, (i) Gaussian with identity link, (ii) Bernoulli with logit link, and (iii) Poisson with log link, this only applies to the Gaussian GLMM with identity link ($\equiv$ LMM), for which $\phi_r$ is the residual variance $\sigma_r^2$. In such situations, $p(\phi_r^{-1} \mid \gamma_{\phi,r})$ is a gamma distribution $\mathcal{G}\big(\zeta_{\phi,r}/2,\; \gamma^{-1}_{\phi,r}/2\big)$, that is,
$$p(\phi_r^{-1} \mid \gamma_{\phi,r}) = \frac{2^{-\frac{\zeta_{\phi,r}}{2}}\, \gamma_{\phi,r}^{-\frac{\zeta_{\phi,r}}{2}}}{\Gamma\big(\frac{\zeta_{\phi,r}}{2}\big)}\, \big(\phi_r^{-1}\big)^{\frac{\zeta_{\phi,r}-2}{2}} \exp\Big(-\frac12\, \gamma^{-1}_{\phi,r}\, \phi_r^{-1}\Big).$$
Note that with this parameterization of the gamma distribution we have a direct correspondence to the univariate Wishart distribution, with the same meaning of the $\zeta$ and $\gamma$ parameters as in the prior distribution for the inverted mixture covariance matrices, see (A.8). A weakly informative prior distribution is obtained by setting $\zeta_{\phi,r}$ to a small positive number, e.g., $\zeta_{\phi,r} = 2$. Further, similarly to the prior distribution for the inverted mixture covariance matrices, we assume that $\gamma^{-1}_{\phi,r}$ is random and assign it a gamma $\mathcal{G}(g_{\phi,r}, h_{\phi,r})$ prior distribution, i.e.,
$$p(\gamma^{-1}_{\phi,r}) = \frac{h_{\phi,r}^{g_{\phi,r}}}{\Gamma(g_{\phi,r})}\, \big(\gamma^{-1}_{\phi,r}\big)^{g_{\phi,r}-1} \exp\big(-h_{\phi,r}\, \gamma^{-1}_{\phi,r}\big),$$
where $(g_{\phi,r}, h_{\phi,r})$ are fixed hyperparameters. This allows us to specify a weakly informative prior distribution while keeping the posterior distribution proper. A weakly informative prior is obtained by setting $g_{\phi,r}$ to a small positive value, e.g., $g_{\phi,r} = 0.2$, and setting $h_{\phi,r}$ to a moderate multiple of $1/(R^{0*}_{res,r})^2$, e.g., $h_{\phi,r} = 10/(R^{0*}_{res,r})^2$, where $R^{0*}_{res,r}$ is the range of the residuals from the initial ML fit to the $r$th (G)LMM.

A.11. Prior for fixed effects. $p(\alpha)$ is a (multivariate) normal distribution $\mathcal{N}(\xi_\alpha, C_\alpha)$, where $\xi_\alpha$ and $C_\alpha$ are fixed prior hyperparameters, i.e.,

(A.9) $p(\alpha) \propto |C_\alpha|^{-\frac12} \exp\big\{-\tfrac12 \big(\alpha - \xi_\alpha\big)^\top C_\alpha^{-1} \big(\alpha - \xi_\alpha\big)\big\}$.

As usual in the context of hierarchical regression models, a weakly informative prior is obtained by setting $\xi_\alpha = 0$ (zero vector) and $C_\alpha$ to a diagonal matrix with large positive numbers, e.g., 10 000, on the diagonal. It is always useful to check that the variance of the (marginal) posterior distribution of $\alpha$ is indeed considerably smaller.

A.12. Joint prior distribution including random hyperparameters. When we also incorporate the random hyperparameters $\gamma_b$ and $\gamma_\phi$ into the expression of the joint prior distribution, we obtain

(A.10)
$$p(\psi, \theta^*, B, u, \gamma_b, \gamma_\phi) = p(B \mid \theta^*, u) \times p(u \mid \theta^*) \times p(\theta^* \mid \gamma_b) \times p(\gamma_b^{-1}) \times p(\psi \mid \gamma_\phi) \times p(\gamma_\phi^{-1})$$
$$= \prod_{i=1}^N \big\{ p(b_i \mid \theta^*, u_i) \times p(u_i \mid w) \big\} \times p(w) \times \prod_{k=1}^K \big\{ p(\mu^*_k) \times p(D^{*-1}_k \mid \gamma_b) \big\} \times \prod_{l=1}^d p(\gamma^{-1}_{b,l}) \times \prod_{r=1}^R \big\{ p(\phi_r^{-1} \mid \gamma_{\phi,r}) \times p(\gamma^{-1}_{\phi,r}) \big\} \times p(\alpha),$$
where all factors appearing in (A.10) have been explained in Sec. A.5–A.11.

B. Posterior distribution and the sampling algorithm. In this appendix, we provide a more detailed explanation of the Markov chain Monte Carlo (MCMC) algorithm mentioned in Komárek and Komárková (2012, Section 2.2), which is used to obtain a sample from the posterior distribution. As is usual in MCMC applications, we generate a sample from the joint posterior distribution of all model parameters:

(B.1) $p\big(\psi, \theta^*, B, u, \gamma_b, \gamma_\phi \mid y\big) \propto p(\psi, \theta^*, B, u, \gamma_b, \gamma_\phi) \times L^{Bayes}(\psi, \theta^*, B, u, \gamma_b, \gamma_\phi)$,

where $L^{Bayes}(\psi, \theta^*, B, u, \gamma_b, \gamma_\phi) = p\big(y \mid \psi, \theta^*, B, u, \gamma_b, \gamma_\phi\big)$ is the likelihood of the Bayesian model.

B.1. Likelihood for the Bayesian model.
Given the (Bayesian) model parameters, the likelihood is given by

(B.2) $L^{Bayes}(\psi, \theta^*, B, u, \gamma_b, \gamma_\phi) = p(y \mid \psi, \theta^*, B, u, \gamma_b, \gamma_\phi) = p(y \mid \psi, B) = \prod_{i=1}^N \prod_{r=1}^R \prod_{j=1}^{n_{i,r}} p(y_{i,r,j} \mid \phi_r, \alpha_r, b_{i,r})$,

where the form of $p(y_{i,r,j} \mid \phi_r, \alpha_r, b_{i,r})$ follows from a particular GLMM, in which

(B.3) $h_r^{-1}\big\{\mathrm{E}(Y_{i,r,j} \mid \alpha_r, b_{i,r})\big\} = \eta_{i,r,j} = x_{i,r,j}^\top \alpha_r + z_{i,r,j}^\top b_{i,r}$, $\quad i = 1, \ldots, N$, $r = 1, \ldots, R$, $j = 1, \ldots, n_{i,r}$,

where $h_r^{-1}$ is the link function for the $r$th response, $\eta_{i,r,j}$ is the linear predictor, and $x_{i,r,j}$, $z_{i,r,j}$ are $c_r \times 1$ and $d_r \times 1$ vectors of known covariates, respectively. It is assumed that $p(y_{i,r,j} \mid \phi_r, \alpha_r, b_{i,r})$ belongs to an exponential family. For notational convenience, we assume that $h_r^{-1}$ is a canonical link function, which implies that

(B.4) $p(y_{i,r,j} \mid \phi_r, \alpha_r, b_{i,r}) = p(y_{i,r,j} \mid \phi_r, \eta_{i,r,j}) = \exp\Big\{\dfrac{y_{i,r,j}\, \eta_{i,r,j} - q_r(\eta_{i,r,j})}{\phi_r} + k_r(y_{i,r,j}, \phi_r)\Big\}$,

where $k_r$ and $q_r$ are appropriate distribution specific functions; see Table B.1 for their form in the three most common situations of the Gaussian, Bernoulli and Poisson distribution.

Table B.1
Components of the three most common generalized linear models. All distributions are parametrized such that the mean is $\lambda$.

                                      Gaussian $\mathcal{N}(\lambda, \sigma^2)$   Bernoulli $\mathcal{A}(\lambda)$                         Poisson $\mathrm{Po}(\lambda)$
Canonical link $h^{-1}(\lambda)$      $\lambda$                                   $\mathrm{logit}(\lambda) = \log\frac{\lambda}{1-\lambda}$   $\log(\lambda)$
Inverted canonical link $h(\eta)$     $\eta$                                      $\mathrm{ilogit}(\eta) = \frac{\exp(\eta)}{1+\exp(\eta)}$   $\exp(\eta)$
Dispersion $\phi$                     $\sigma^2$                                  $1$                                                       $1$
Variance $v(\phi, \eta)$              $\phi = \sigma^2$                           $h(\eta)\{1 - h(\eta)\} = \lambda(1-\lambda)$             $h(\eta) = \lambda$
$k(y, \phi)$                          $-\frac{y^2}{2\phi} - \frac{\log(2\pi\phi)}{2}$   $0$                                                 $-\log(y!)$
$q(\eta)$                             $\frac{\eta^2}{2}$                          $\log\{1 + \exp(\eta)\}$                                  $\exp(\eta)$

In the sequel, we shall exploit additional notation. Let

• $n_{\bullet r} = \sum_{i=1}^N n_{i,r}$ be the number of all observations of the $r$th marker;
• $y_{\bullet r} = \big(y_{1,r}^\top, \ldots, y_{N,r}^\top\big)^\top$ be the vector of all observations of the $r$th marker;
• $\eta_{i,r} = (\eta_{i,r,1}, \ldots, \eta_{i,r,n_{i,r}})^\top$ be the vector of linear predictors for the observations of the $r$th marker of the $i$th subject;
• $\eta_{\bullet r} = \big(\eta_{1,r}^\top, \ldots, \eta_{N,r}^\top\big)^\top$ be the vector of linear predictors for all observations of the $r$th marker;
• $\lambda_{i,r,j} = \mathrm{E}(Y_{i,r,j} \mid \alpha_r, b_{i,r}) = \mathrm{E}(Y_{i,r,j} \mid \eta_{i,r,j})$, $i = 1, \ldots, N$, $r = 1, \ldots, R$, $j = 1, \ldots, n_{i,r}$, be the mean of the $(i,r,j)$th response obtained from the GLMM (B.3) with the response density (B.4), which in general is given by
$$\lambda_{i,r,j} = \frac{\partial q_r}{\partial \eta}(\eta_{i,r,j}), \qquad i = 1, \ldots, N,\; r = 1, \ldots, R,\; j = 1, \ldots, n_{i,r};$$
• $\lambda_{i,r} = (\lambda_{i,r,1}, \ldots, \lambda_{i,r,n_{i,r}})^\top$, $i = 1, \ldots, N$, $r = 1, \ldots, R$, be the vector of means of the $r$th marker for the $i$th subject;
• $\lambda_{\bullet r} = \big(\lambda_{1,r}^\top, \ldots, \lambda_{N,r}^\top\big)^\top$, $r = 1, \ldots, R$, be the vector of means for the $r$th marker;
• $v_{i,r,j} = \mathrm{var}(Y_{i,r,j} \mid \phi_r, \alpha_r, b_{i,r}) = \mathrm{var}(Y_{i,r,j} \mid \phi_r, \eta_{i,r,j})$, $i = 1, \ldots, N$, $r = 1, \ldots, R$, $j = 1, \ldots, n_{i,r}$, be the variance of the $(i,r,j)$th response obtained from the GLMM (B.3) with the response density (B.4), which in general is given by
$$v_{i,r,j} = \phi_r\, \frac{\partial^2 q_r}{\partial \eta^2}(\eta_{i,r,j}), \qquad i = 1, \ldots, N,\; r = 1, \ldots, R,\; j = 1, \ldots, n_{i,r},$$
see Table B.1 for explicit formulas in the three most common situations;
• $v_{i,r} = (v_{i,r,1}, \ldots, v_{i,r,n_{i,r}})^\top$, $i = 1, \ldots, N$, $r = 1, \ldots, R$, be the vector of variances of the $r$th marker for the $i$th subject;
• $V_{i,r} = \mathrm{diag}(v_{i,r})$, $i = 1, \ldots, N$, $r = 1, \ldots, R$, be the diagonal matrix with the variances of the $r$th marker for the $i$th subject on the diagonal;
• $v_{\bullet r} = \big(v_{1,r}^\top, \ldots, v_{N,r}^\top\big)^\top$, $r = 1, \ldots, R$, be the vector of variances for the $r$th marker;
• $V_{\bullet r} = \mathrm{diag}(v_{\bullet r})$, $r = 1, \ldots, R$, be the diagonal matrix with the variances for the $r$th marker on the diagonal;
• $X_{i,r}$, $i = 1, \ldots, N$, $r = 1, \ldots, R$, be the matrix of covariates for the fixed effects from the GLMM (B.3) for the $r$th response of the $i$th subject, i.e., $X_{i,r} = \big(x_{i,r,1}, \ldots, x_{i,r,n_{i,r}}\big)^\top$;
• $X_{\bullet r}$, $r = 1, \ldots, R$, be the matrix of covariates for the fixed effects from the GLMM (B.3) for the $r$th response, i.e., $X_{\bullet r} = \big(X_{1,r}^\top, \ldots, X_{N,r}^\top\big)^\top$;
• $Z_{i,r}$, $i = 1, \ldots, N$, $r = 1, \ldots, R$, be the matrix of covariates for the random effects from the GLMM (B.3) for the $r$th response of the $i$th subject, i.e., $Z_{i,r} = \big(z_{i,r,1}, \ldots, z_{i,r,n_{i,r}}\big)^\top$;
• $q_{i,r,j} = q_r(\eta_{i,r,j})$, $i = 1, \ldots, N$, $r = 1, \ldots, R$, $j = 1, \ldots, n_{i,r}$;
• $q_{i,r} = \big(q_{i,r,1}, \ldots, q_{i,r,n_{i,r}}\big)^\top$, $i = 1, \ldots, N$, $r = 1, \ldots, R$;
• $q_{\bullet r} = \big(q_{1,r}^\top, \ldots, q_{N,r}^\top\big)^\top$, $r = 1, \ldots, R$;
• $k_{i,r,j} = k_r(y_{i,r,j}, \phi_r)$, $i = 1, \ldots, N$, $r = 1, \ldots, R$, $j = 1, \ldots, n_{i,r}$;
• $k_{i,r} = \big(k_{i,r,1}, \ldots, k_{i,r,n_{i,r}}\big)^\top$, $i = 1, \ldots, N$, $r = 1, \ldots, R$;
• $k_{\bullet r} = \big(k_{1,r}^\top, \ldots, k_{N,r}^\top\big)^\top$, $r = 1, \ldots, R$;
• $s_r$, $r = 1, \ldots, R$, be the part of the shift vector $s$, see Expression (A.7), which corresponds to the $r$th marker, i.e., $s = \big(s_1^\top, \ldots, s_R^\top\big)^\top$;
• $S_r$, $r = 1, \ldots, R$, be the diagonal block of the scale matrix $S$, see Expression (A.7), which corresponds to the $r$th marker, i.e., $S = \mathrm{block\text{-}diag}(S_1, \ldots, S_R)$.

Consequently, the likelihood (B.2) can be written as

(B.5) $L^{Bayes}(\psi, \theta^*, B, u, \gamma_b, \gamma_\phi) = L^{Bayes}(\psi, B) = \exp\Big\{\sum_{i=1}^N \sum_{r=1}^R \phi_r^{-1}\big(y_{i,r}^\top \eta_{i,r} - \mathbf{1}^\top q_{i,r}\big) + \mathbf{1}^\top k_{i,r}\Big\} = \exp\Big\{\sum_{r=1}^R \phi_r^{-1}\big(y_{\bullet r}^\top \eta_{\bullet r} - \mathbf{1}^\top q_{\bullet r}\big) + \mathbf{1}^\top k_{\bullet r}\Big\}$,

where $\mathbf{1}$ is a vector of ones of appropriate dimension.

B.2. Markov chain Monte Carlo. In this section, we introduce the MCMC procedure to obtain a sample
$$S_M = \big\{\big(\psi^{(m)}, \theta^{*(m)}, B^{(m)}, u^{(m)}, \gamma_b^{(m)}, \gamma_\phi^{(m)}\big): m = 1, \ldots, M\big\}$$
from the joint posterior distribution (B.1). We exploit the Gibbs algorithm and update the parameters in blocks of related parameters by sampling from the appropriate full conditional distributions, further denoted as $p(\cdot \mid \cdots)$.
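The block structure of the sampler can be sketched as a plain Gibbs loop; the update functions are hypothetical placeholders for the full conditional draws derived in the following paragraphs.

```python
# Schematic sketch of a blocked Gibbs sampler (cf. Sections B.2.1-B.2.9).
# `state` bundles (psi, theta*, B, u, gamma_b, gamma_phi); each element of
# `updates` is a hypothetical placeholder that draws one block from its full
# conditional p(. | ...) and returns the updated state.

def gibbs_sampler(state, updates, M, thin=1, burn_in=0):
    """Run the chain and keep every `thin`-th state after `burn_in`,
    returning a sample of size M."""
    sample = []
    for m in range(burn_in + M * thin):
        for update in updates:     # weights, means, covariances, gamma_b,
            state = update(state)  # allocations, alpha, B, phi, gamma_phi
        if m >= burn_in and (m - burn_in) % thin == 0:
            sample.append(state)
    return sample
```

Keeping the blocks as separate callables makes it easy to swap a direct conjugate draw for a Metropolis-Hastings step where a full conditional is not of classical form (Sections B.2.6 and B.2.7).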
In the remainder of this section, we derive these full conditional distributions and explain how to sample from them in situations when they do not have the form of any classical distribution. We exploit the following additional notation:

• $N_k = \sum_{i=1}^N I(u_i = k)$, $k = 1, \ldots, K$, denotes the number of subjects assigned to the $k$th mixture component according to the values of the component allocations. Note that $\sum_{k=1}^K N_k = N$, the number of subjects;
• $\bar{b}^*_{[k]}$, $k = 1, \ldots, K$, is the sample mean of the shifted-scaled values of the random effects of those subjects assigned to the $k$th mixture component according to the values of the component allocations $u$, i.e.,
$$\bar{b}^*_{[k]} = N_k^{-1} \sum_{i: u_i = k} b^*_i, \qquad k = 1, \ldots, K.$$

B.2.1. Mixture weights. The mixture weights form one block of the Gibbs algorithm. The corresponding full conditional distribution is the Dirichlet distribution $\mathcal{D}(\delta + N_1, \ldots, \delta + N_K)$. That is,
$$p(w \mid \cdots) = \frac{\Gamma(K\delta + N)}{\prod_{k=1}^K \Gamma(\delta + N_k)} \prod_{k=1}^K w_k^{\delta + N_k - 1},$$
which is a distribution we can directly sample from.

B.2.2. Mixture means. The second block of the Gibbs algorithm is composed of the mixture means. Their full conditional distribution is a product of $K$ independent normal distributions, i.e.,
$$p(\mu^*_1, \ldots, \mu^*_K \mid \cdots) = \prod_{k=1}^K p(\mu^*_k \mid \cdots) \sim \prod_{k=1}^K \mathcal{N}\big(\mathrm{E}(\mu^*_k \mid \cdots),\, \mathrm{var}(\mu^*_k \mid \cdots)\big),$$
where
$$\mathrm{var}(\mu^*_k \mid \cdots) = \big(N_k D^{*-1}_k + C_b^{-1}\big)^{-1}, \qquad \mathrm{E}(\mu^*_k \mid \cdots) = \mathrm{var}(\mu^*_k \mid \cdots)\big(N_k D^{*-1}_k \bar{b}^*_{[k]} + C_b^{-1} \xi_b\big), \qquad k = 1, \ldots, K.$$

B.2.3. Mixture covariance matrices. The third block of the Gibbs algorithm is formed by the inversions of the mixture covariance matrices. Their full conditional distribution is a product of $K$ independent Wishart distributions, i.e.,
$$p\big(D^{*-1}_1, \ldots, D^{*-1}_K \mid \cdots\big) = \prod_{k=1}^K p\big(D^{*-1}_k \mid \cdots\big) \sim \prod_{k=1}^K \mathcal{W}\big(\tilde{\zeta}_{b,k}, \tilde{\Xi}_{b,k}\big),$$
where
$$\tilde{\zeta}_{b,k} = \zeta_b + N_k, \qquad \tilde{\Xi}_{b,k} = \Big\{\Xi_b^{-1} + \sum_{i: u_i = k} \big(b^*_i - \mu^*_k\big)\big(b^*_i - \mu^*_k\big)^\top\Big\}^{-1}, \qquad k = 1, \ldots, K.$$

B.2.4. Variance hyperparameters for mixture covariance matrices. In the fourth block of the Gibbs algorithm, we update the variance hyperparameters $\gamma_b^{-1}$, whose inversions form the random diagonal of the scale matrix $\Xi_b$ in the Wishart prior for the inversions of the mixture covariance matrices. Their full conditional distribution is a product of independent gamma distributions, i.e.,
$$p(\gamma_b^{-1} \mid \cdots) = \prod_{l=1}^d p(\gamma^{-1}_{b,l} \mid \cdots) \sim \prod_{l=1}^d \mathcal{G}\big(\tilde{g}_{b,l}, \tilde{h}_{b,l}\big),$$
where
$$\tilde{g}_{b,l} = g_{b,l} + \frac{K \zeta_b}{2}, \qquad \tilde{h}_{b,l} = h_{b,l} + \frac12 \sum_{k=1}^K Q_k(l,l), \qquad l = 1, \ldots, d,$$
where $Q_k(l,l)$ is the $l$th diagonal element of the matrix $D^{*-1}_k$.

B.2.5. Component allocations. In the fifth block of the Gibbs algorithm, we update the latent component allocations $u$. Their full conditional distribution is a product over subjects ($i = 1, \ldots, N$) of the following discrete distributions:
$$\mathrm{P}(u_i = k \mid \cdots) = \frac{w_k\, \varphi\big(b^*_i \mid \mu^*_k, D^*_k\big)}{\sum_{l=1}^K w_l\, \varphi\big(b^*_i \mid \mu^*_l, D^*_l\big)}, \qquad k = 1, \ldots, K.$$

B.2.6. Fixed effects. The sixth block of the Gibbs algorithm consists of updating the vector of fixed effects $\alpha$. The full conditional distribution factorizes with respect to the $R$ longitudinal markers involved in the model, i.e.,
$$p(\alpha \mid \cdots) = \prod_{r=1}^R p(\alpha_r \mid \cdots).$$
To proceed, we introduce some additional notation. Let for each $r = 1, \ldots, R$,
$$L^{Bayes}_{\alpha,r}(\psi, B) = \exp\big\{\phi_r^{-1}\big(y_{\bullet r}^\top \eta_{\bullet r} - \mathbf{1}^\top q_{\bullet r}\big) + \mathbf{1}^\top k_{\bullet r}\big\}$$
be the part of the likelihood (B.5) pertaining to the $r$th marker. Combining this with the prior distribution (A.9) leads to the following full conditional distribution for the vector of fixed effects from the $r$th GLMM:

(B.6) $p(\alpha_r \mid \cdots) \propto \exp\big\{\phi_r^{-1}\big(y_{\bullet r}^\top \eta_{\bullet r} - \mathbf{1}^\top q_{\bullet r}\big) - \tfrac12 \big(\alpha_r - \xi_{\alpha,r}\big)^\top C_{\alpha,r}^{-1} \big(\alpha_r - \xi_{\alpha,r}\big)\big\}$, $\quad r = 1, \ldots, R$,

where $\xi_{\alpha,r}$ and $C_{\alpha,r}$ are the appropriate subvector and submatrix of the prior mean $\xi_\alpha$ and the prior covariance matrix $C_\alpha$, respectively.
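The conjugate blocks B.2.1 and B.2.2 above can be drawn directly; a sketch assuming `b_star` is the $N \times d$ matrix of shifted-scaled random effects, `u` the allocation vector, and the hyperparameters are given (all example values below are hypothetical).

```python
import numpy as np
from scipy.stats import dirichlet

def update_weights(u, K, delta, rng):
    """B.2.1: w | ... ~ Dirichlet(delta + N_1, ..., delta + N_K)."""
    N_k = np.bincount(u, minlength=K)
    return dirichlet.rvs(delta + N_k, random_state=rng)[0]

def update_means(b_star, u, K, D_star_inv, C_b_inv, xi_b, rng):
    """B.2.2: mu_k* | ... ~ N(E, var) with
    var = (N_k D_k*^{-1} + C_b^{-1})^{-1} and
    E   = var (N_k D_k*^{-1} bbar*_[k] + C_b^{-1} xi_b)."""
    d = b_star.shape[1]
    mu_star = np.empty((K, d))
    for k in range(K):
        members = b_star[u == k]
        N_k = len(members)
        bbar = members.mean(axis=0) if N_k > 0 else np.zeros(d)
        var = np.linalg.inv(N_k * D_star_inv[k] + C_b_inv)
        mean = var @ (N_k * D_star_inv[k] @ bbar + C_b_inv @ xi_b)
        mu_star[k] = rng.multivariate_normal(mean, var)
    return mu_star
```

The empty-component branch (`N_k = 0`) simply reduces the full conditional to the prior $\mathcal{N}(\xi_b, C_b)$-type draw, which is exactly what the formulas give when the data sum vanishes.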
In the situation when the $r$th GLMM distribution (B.4) is normal $\mathcal{N}(\eta_{i,r,j}, \phi_r)$, it can easily be shown that (B.6) is a (multivariate) normal distribution with variance and mean given by

(B.7) $\mathrm{var}(\alpha_r \mid \cdots) = \big(\phi_r^{-1} X_{\bullet r}^\top X_{\bullet r} + C_{\alpha,r}^{-1}\big)^{-1}$,

(B.8) $\mathrm{E}(\alpha_r \mid \cdots) = \mathrm{var}(\alpha_r \mid \cdots)\big(\phi_r^{-1} X_{\bullet r}^\top e_{\alpha,r} + C_{\alpha,r}^{-1} \xi_{\alpha,r}\big)$,

where
$$e_{\alpha,r} = \begin{pmatrix} e_{\alpha,1,r} \\ \vdots \\ e_{\alpha,N,r} \end{pmatrix} = \begin{pmatrix} y_{1,r} - Z_{1,r} b_{1,r} \\ \vdots \\ y_{N,r} - Z_{N,r} b_{N,r} \end{pmatrix},$$
from which we can easily sample. Nevertheless, in general, (B.6) does not have the form of any classically used distribution. In that case, we follow a proposal of Gamerman (1997) and update $\alpha_r$ using a Metropolis-Hastings step with a normal proposal based on an appropriately tuned second-order approximation to (B.6) created using one Newton-Raphson step. Let, in the $m$th MCMC iteration, $\alpha_r^{(m-1)}$ denote the previously sampled value of $\alpha_r$, and $\lambda_{\bullet r}^{(m-1)}$ the vector of means $\lambda_{\bullet r}$ calculated using $\alpha_r = \alpha_r^{(m-1)}$. Further, let
$$U_{\alpha,r}(\alpha_r) = \frac{\partial \log L^{Bayes}_{\alpha,r}(\psi, B)}{\partial \alpha_r} = \phi_r^{-1} X_{\bullet r}^\top \big(y_{\bullet r} - \lambda_{\bullet r}\big), \qquad J_{\alpha,r}(\alpha_r) = -\frac{\partial^2 \log L^{Bayes}_{\alpha,r}(\psi, B)}{\partial \alpha_r\, \partial \alpha_r^\top} = \phi_r^{-2} X_{\bullet r}^\top V_{\bullet r} X_{\bullet r},$$
be the score vector and the observed information matrix from the $r$th GLMM with respect to the vector of fixed effects $\alpha_r$. A new value of $\alpha_r$ is proposed by sampling from a (multivariate) normal distribution with variance and mean given by

(B.9) $\widehat{\mathrm{var}}(\alpha_r \mid \cdots) = \sigma_{\alpha,r} \big\{J_{\alpha,r}\big(\alpha_r^{(m-1)}\big) + C_{\alpha,r}^{-1}\big\}^{-1}$,

(B.10) $\widehat{\mathrm{E}}(\alpha_r \mid \cdots) = \alpha_r^{(m-1)} + \Delta_{\alpha,r}^{(m)}$,

where

(B.11) $\Delta_{\alpha,r}^{(m)} = \big\{J_{\alpha,r}\big(\alpha_r^{(m-1)}\big) + C_{\alpha,r}^{-1}\big\}^{-1} \big\{U_{\alpha,r}\big(\alpha_r^{(m-1)}\big) - C_{\alpha,r}^{-1}\big(\alpha_r^{(m-1)} - \xi_{\alpha,r}\big)\big\} = \big(\phi_r^{-2} X_{\bullet r}^\top V_{\bullet r} X_{\bullet r} + C_{\alpha,r}^{-1}\big)^{-1} \big\{\phi_r^{-1} X_{\bullet r}^\top \big(y_{\bullet r} - \lambda_{\bullet r}^{(m-1)}\big) - C_{\alpha,r}^{-1}\big(\alpha_r^{(m-1)} - \xi_{\alpha,r}\big)\big\}$,

and $\sigma_{\alpha,r} > 0$ is a tuning parameter. In practical analyses, we recommend starting with $\sigma_{\alpha,r} = 1$, which often leads to reasonable mixing. Note that in the case of a normally distributed response, setting $\sigma_{\alpha,r} = 1$ makes (B.9) identical to (B.7), and (B.10) coincides with (B.8).
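The one-step Newton-Raphson proposal of (B.9)–(B.11) can be sketched as follows for a Bernoulli-logit marker (so $\phi_r = 1$). Because the proposal depends on the current state, the acceptance ratio must include the proposal densities in both directions. This is a schematic illustration under those assumptions, not the packaged implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def proposal_moments(alpha, X, y, C_inv, xi, sigma=1.0):
    """Normal proposal from one Newton-Raphson step, cf. (B.9)-(B.11),
    for a Bernoulli-logit marker (phi_r = 1, lambda = ilogit(X alpha))."""
    lam = 1.0 / (1.0 + np.exp(-(X @ alpha)))   # means lambda_{.r}
    V = lam * (1.0 - lam)                      # variances, Table B.1
    J = (X * V[:, None]).T @ X                 # observed information
    U = X.T @ (y - lam)                        # score
    prec = J + C_inv
    delta = np.linalg.solve(prec, U - C_inv @ (alpha - xi))
    return alpha + delta, sigma * np.linalg.inv(prec)

def mh_update(alpha, X, y, C_inv, xi, rng, sigma=1.0):
    """One Metropolis-Hastings update of alpha_r in the spirit of
    Gamerman (1997), targeting the full conditional (B.6)."""
    def log_target(a):
        eta = X @ a
        # log of (B.6): y'eta - 1'q(eta) minus the normal prior penalty
        return y @ eta - np.sum(np.log1p(np.exp(eta))) \
            - 0.5 * (a - xi) @ C_inv @ (a - xi)
    m_fwd, S_fwd = proposal_moments(alpha, X, y, C_inv, xi, sigma)
    prop = rng.multivariate_normal(m_fwd, S_fwd)
    m_bwd, S_bwd = proposal_moments(prop, X, y, C_inv, xi, sigma)
    log_ratio = (log_target(prop) - log_target(alpha)
                 + multivariate_normal.logpdf(alpha, m_bwd, S_bwd)
                 - multivariate_normal.logpdf(prop, m_fwd, S_fwd))
    return prop if np.log(rng.uniform()) < log_ratio else alpha
```

For a Gaussian marker the proposal is exact, the forward and backward densities cancel against the target ratio, and the step reduces to a plain Gibbs draw from (B.7)–(B.8).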
Finally, we remark that both the proposal covariance matrix (B.9) (with $\sigma_{\alpha,r} = 1$) and the Newton-Raphson step (B.11) can efficiently be obtained as the least squares solution and QR decomposition of a particular problem. To show this, let $L_{\alpha,r}$ be the Cholesky decomposition of the inversion of the prior covariance matrix $C_{\alpha,r}$, i.e., $C_{\alpha,r}^{-1} = L_{\alpha,r} L_{\alpha,r}^\top$. Further, let
$$\widetilde{X}_{\bullet r} = \begin{pmatrix} \phi_r^{-1} V_{\bullet r}^{1/2} X_{\bullet r} \\ L_{\alpha,r}^\top \end{pmatrix}, \qquad \widetilde{e}_{\bullet r} = \begin{pmatrix} V_{\bullet r}^{-1/2}\big(y_{\bullet r} - \lambda_{\bullet r}^{(m-1)}\big) \\ -L_{\alpha,r}^\top \big(\alpha_r^{(m-1)} - \xi_{\alpha,r}\big) \end{pmatrix}.$$
It follows from (B.11) that
$$\Delta_{\alpha,r}^{(m)} = \big(\widetilde{X}_{\bullet r}^\top \widetilde{X}_{\bullet r}\big)^{-1} \widetilde{X}_{\bullet r}^\top \widetilde{e}_{\bullet r},$$
i.e., $\Delta_{\alpha,r}^{(m)}$ is the least squares solution to $\widetilde{X}_{\bullet r}\, \delta = \widetilde{e}_{\bullet r}$. That is, if $\widetilde{X}_{\bullet r} = \widetilde{Q}_{\bullet r} \widetilde{R}_{\bullet r}$ is the QR decomposition of the matrix $\widetilde{X}_{\bullet r}$, we have
$$\Delta_{\alpha,r}^{(m)} = \widetilde{R}_{\bullet r}^{-1} \widetilde{Q}_{\bullet r}^\top \widetilde{e}_{\bullet r},$$
and also
$$\widehat{\mathrm{var}}(\alpha_r \mid \cdots)^{-1} = \sigma_{\alpha,r}^{-1}\, \widetilde{R}_{\bullet r}^\top \widetilde{R}_{\bullet r}.$$

B.2.7. Random effects. The seventh block of the Gibbs algorithm consists of updating the vectors of random effects $b_1, \ldots, b_N$. The full conditional distribution factorizes with respect to the $N$ subjects, i.e.,
$$p(B \mid \cdots) = \prod_{i=1}^N p(b_i \mid \cdots).$$
The remainder of this part is similar to Paragraph B.2.6. First, we introduce additional notation. Let for each $i = 1, \ldots, N$,

(B.12) $L^{Bayes}_{b,i}(\psi, B) = \exp\big\{\sum_{r=1}^R \phi_r^{-1}\big(y_{i,r}^\top \eta_{i,r} - \mathbf{1}^\top q_{i,r}\big) + \mathbf{1}^\top k_{i,r}\big\}$

be the part of the likelihood (B.5) pertaining to the $i$th subject. If we combine this with the model (A.6) for the random effects, we obtain the following full conditional distribution for the $i$th random effect vector $b_i$:
$$p(b_i \mid \cdots) \propto \exp\Big\{\sum_{r=1}^R \phi_r^{-1}\big(y_{i,r}^\top \eta_{i,r} - \mathbf{1}^\top q_{i,r}\big) - \frac12 \big(b_i - s - S\mu^*_{u_i}\big)^\top \big(S D^*_{u_i} S^\top\big)^{-1} \big(b_i - s - S\mu^*_{u_i}\big)\Big\}.$$
When working with the shifted-scaled random effects, see Expression (A.7), we obtain the full conditional distribution

(B.13) $p(b_i \mid \cdots) \propto \exp\Big\{\sum_{r=1}^R \phi_r^{-1}\big(y_{i,r}^\top \eta_{i,r} - \mathbf{1}^\top q_{i,r}\big) - \frac12 \big(b^*_i - \mu^*_{u_i}\big)^\top D^{*-1}_{u_i} \big(b^*_i - \mu^*_{u_i}\big)\Big\}$,

from which we have to sample during this block of the Gibbs algorithm.
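The least squares/QR identity used for the fixed effects update (and, analogously, for the random effects update below) can be verified numerically; all quantities in this sketch are hypothetical stand-ins for $X_{\bullet r}$, $V_{\bullet r}$, and the prior pieces.

```python
import numpy as np

rng = np.random.default_rng(3)
n, c = 20, 3
phi = 0.7
X = rng.standard_normal((n, c))                      # X_{.r}
V = np.diag(rng.uniform(0.5, 2.0, size=n))           # V_{.r}
y_minus_lam = rng.standard_normal(n)                 # y_{.r} - lambda_{.r}^{(m-1)}
C_inv = np.eye(c) * 0.5                              # C_{alpha,r}^{-1}
L = np.linalg.cholesky(C_inv)                        # C^{-1} = L L^T
alpha_minus_xi = rng.standard_normal(c)              # alpha^{(m-1)} - xi_{alpha,r}

# Direct Newton-Raphson step (B.11)
J = X.T @ V @ X / phi**2
rhs = X.T @ y_minus_lam / phi - C_inv @ alpha_minus_xi
delta_direct = np.linalg.solve(J + C_inv, rhs)

# Augmented least squares / QR route
V_sqrt = np.sqrt(np.diag(V))
X_tilde = np.vstack([(V_sqrt[:, None] * X) / phi, L.T])
e_tilde = np.concatenate([y_minus_lam / V_sqrt, -L.T @ alpha_minus_xi])
Q, R = np.linalg.qr(X_tilde)
delta_qr = np.linalg.solve(R, Q.T @ e_tilde)

assert np.allclose(delta_direct, delta_qr)
```

The augmented rows play the role of prior pseudo-observations: stacking $L_{\alpha,r}^\top$ under the weighted design matrix reproduces $J_{\alpha,r} + C_{\alpha,r}^{-1}$ as the normal-equations matrix, so a single QR factorization yields both the step and the proposal precision.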
In the situation when all $R$ GLMM distributions (B.4) are normal $\mathcal{N}(\eta_{i,r,j}, \phi_r)$, $r = 1, \ldots, R$, it can easily be shown that (B.13) is a (multivariate) normal distribution with variance and mean given by

(B.14) $\mathrm{var}(b^*_i \mid \cdots) = \big(J_{b,i} + D^{*-1}_{u_i}\big)^{-1}$,

(B.15) $\mathrm{E}(b^*_i \mid \cdots) = \mathrm{var}(b^*_i \mid \cdots)\big(U_{b,i} + D^{*-1}_{u_i} \mu^*_{u_i}\big)$,

where
$$U_{b,i} = \begin{pmatrix} \phi_1^{-1} S_1^\top Z_{i,1}^\top \big(y_{i,1} - X_{i,1}\alpha_1 - Z_{i,1} s_1\big) \\ \vdots \\ \phi_R^{-1} S_R^\top Z_{i,R}^\top \big(y_{i,R} - X_{i,R}\alpha_R - Z_{i,R} s_R\big) \end{pmatrix}, \qquad J_{b,i} = \mathrm{block\text{-}diag}\big(\phi_r^{-1} S_r^\top Z_{i,r}^\top Z_{i,r} S_r : r = 1, \ldots, R\big),$$
$i = 1, \ldots, N$, and we can directly update the values of the random effects by sampling from this multivariate normal distribution.

Analogously to Paragraph B.2.6, the full conditional distribution (B.13) does not have the form of any classically used distribution as soon as at least one of the $R$ GLMM distributions (B.4) is not normal. In that case, we update each $b^*_i$ (and $b_i$) using a Metropolis-Hastings step with a normal proposal based on an appropriately tuned second-order approximation to (B.13) created using one Newton-Raphson step. Let, in the $m$th MCMC iteration, $b^{*(m-1)}_i$ denote the previously sampled value of $b^*_i$. Further, let
$$U_{b,i}(b^*_i) = \frac{\partial \log L^{Bayes}_{b,i}(\psi, B)}{\partial b^*_i} = \begin{pmatrix} \phi_1^{-1} S_1^\top Z_{i,1}^\top \big(y_{i,1} - \lambda_{i,1}^{(m-1)}\big) \\ \vdots \\ \phi_R^{-1} S_R^\top Z_{i,R}^\top \big(y_{i,R} - \lambda_{i,R}^{(m-1)}\big) \end{pmatrix},$$
$$J_{b,i}(b^*_i) = -\frac{\partial^2 \log L^{Bayes}_{b,i}(\psi, B)}{\partial b^*_i\, \partial b^{*\top}_i} = \mathrm{block\text{-}diag}\big(\phi_r^{-2} S_r^\top Z_{i,r}^\top V_{i,r} Z_{i,r} S_r : r = 1, \ldots, R\big),$$
be the score vector and the observed information matrix with respect to the shifted-scaled vector of random effects corresponding to the likelihood contribution (B.12). A new value of $b^*_i$ is proposed by sampling from a (multivariate) normal distribution with variance and mean given by

(B.16) $\widehat{\mathrm{var}}(b^*_i \mid \cdots) = \sigma_b \big\{J_{b,i}\big(b^{*(m-1)}_i\big) + D^{*-1}_{u_i}\big\}^{-1}$,

(B.17) $\widehat{\mathrm{E}}(b^*_i \mid \cdots) = b^{*(m-1)}_i + \Delta^{(m)}_{b,i}$,

where

(B.18) $\Delta^{(m)}_{b,i} = \big\{J_{b,i}\big(b^{*(m-1)}_i\big) + D^{*-1}_{u_i}\big\}^{-1} \big\{U_{b,i}\big(b^{*(m-1)}_i\big) - D^{*-1}_{u_i}\big(b^{*(m-1)}_i - \mu^*_{u_i}\big)\big\}$,

and $\sigma_b > 0$ is a tuning parameter.
Also in this case, we recommend starting with $\sigma_b = 1$, which often leads to reasonable mixing. Note that in the case of a normally distributed response, setting $\sigma_b = 1$ makes (B.16) identical to (B.14) and (B.17) coincide with (B.15). Analogously to Paragraph B.2.6, both the proposal covariance matrix (B.16) (with $\sigma_b = 1$) and the Newton-Raphson step (B.18) can efficiently be obtained as the least squares solution and QR decomposition of a particular problem.

B.2.8. GLMM dispersion parameters. The eighth block of the Gibbs algorithm involves sampling the GLMM dispersion parameters $\phi_1, \ldots, \phi_R$ for those $r = 1, \ldots, R$ for which $\phi_r$ of the corresponding GLMM is not constant by definition. In these situations, the full conditional distribution of the inverted dispersion parameter is given by
\[
\text{(B.19)}\qquad
p(\phi_r^{-1} \mid \cdots) \propto \bigl(\phi_r^{-1}\bigr)^{\frac{\zeta_{\phi,r} - 2}{2}} \exp\Bigl\{-\tfrac{1}{2}\,\gamma_{\phi,r}^{-1}\,\phi_r^{-1} + \phi_r^{-1}\bigl(y_{\bullet r}^\top \eta_{\bullet r} - \mathbf{1}^\top q_{\bullet r}\bigr) + \mathbf{1}^\top k_{\bullet r}\Bigr\}.
\]
If we concentrate on the three most common situations described in Table B.1, the only case in which it is necessary to sample the dispersion parameter $\phi_r^{-1}$ is the normally distributed response with $\phi_r = \sigma_r^2$ being the error variance of the corresponding linear mixed model. In this case the full conditional distribution (B.19) is a gamma distribution, i.e.,
\[
p(\phi_r^{-1} \mid \cdots) \sim \mathcal{G}\bigl(\widetilde{\zeta}_{\phi,r}/2,\; \widetilde{\gamma}_{\phi,r}^{-1}/2\bigr),
\qquad\text{where}\qquad
\widetilde{\zeta}_{\phi,r} = \zeta_{\phi,r} + n_{\bullet r},
\quad
\widetilde{\gamma}_{\phi,r} = \Bigl\{\gamma_{\phi,r}^{-1} + \bigl(y_{\bullet r} - \eta_{\bullet r}\bigr)^\top \bigl(y_{\bullet r} - \eta_{\bullet r}\bigr)\Bigr\}^{-1}.
\]

B.2.9. Variance hyperparameter for dispersions. The last, ninth block of the Gibbs algorithm consists of sampling the variance hyperparameters $\gamma_{\phi,r}^{-1}$ for those $r = 1, \ldots, R$ for which $\phi_r$ of the corresponding GLMM is not constant by definition. In these situations, the full conditional distribution of $\gamma_{\phi,r}^{-1}$ is a gamma distribution, i.e.,
\[
p(\gamma_{\phi,r}^{-1} \mid \cdots) \sim \mathcal{G}\bigl(\widetilde{g}_{\phi,r},\; \widetilde{h}_{\phi,r}\bigr),
\qquad\text{where}\qquad
\widetilde{g}_{\phi,r} = g_{\phi,r} + \frac{\widetilde{\zeta}_{\phi,r}}{2},
\quad
\widetilde{h}_{\phi,r} = h_{\phi,r} + \frac{\phi_r^{-1}}{2}.
\]

C. Analysis of the Mayo Clinic PBC Data.
This section supplements the main paper with some details on the analysis of the PBC910 data. The MMGLMM for the clustering of the patients included in the PBC910 data is based on longitudinal measurements of (1) logarithmic serum bilirubin (lbili, $Y_{i,1,j}$), (2) platelet counts (platelet, $Y_{i,2,j}$) and (3) the dichotomous presence of blood vessel malformations (spiders, $Y_{i,3,j}$), with assumed (1) Gaussian, (2) Poisson and (3) Bernoulli distributions, respectively, and the mean structure (3) given by
\[
\text{(C.1)}\qquad
\begin{aligned}
\mathrm{E}(Y_{i,1,j} \mid b_{i,1}) &= b_{i,1,1} + b_{i,1,2}\, t_{i,1,j}, \\
\log \mathrm{E}(Y_{i,2,j} \mid b_{i,2}) &= b_{i,2,1} + b_{i,2,2}\, t_{i,2,j}, \\
\mathrm{logit}\, \mathrm{E}(Y_{i,3,j} \mid b_{i,3}, \alpha_3) &= b_{i,3} + \alpha_3\, t_{i,3,j},
\end{aligned}
\]
$i = 1, \ldots, N$, $j = 1, \ldots, n_{i,r}$, $r = 1, 2, 3$, where $1 \leq n_{i,r} \leq 5$. In model (C.1), $t_{i,r,j}$ is the time in months from the start of follow-up at which the value of $Y_{i,r,j}$ was obtained.

C.1. Summary of model parameters. In the main analysis, we try to classify the patients into two groups and hence a two-component mixture ($K = 2$) is considered in the distribution of the random effects. In summary, the considered model involves the following parameters of main interest.

• Random effect vectors $b_i = (b_{i,1,1}, b_{i,1,2}, b_{i,2,1}, b_{i,2,2}, b_{i,3})^\top$, $i = 1, \ldots, N$, where $b_{i,1,1}$ and $b_{i,1,2}$ are the random intercept and random slope from the linear mixed model for logarithmic bilirubin, $b_{i,2,1}$ and $b_{i,2,2}$ are the random intercept and random slope from the log-Poisson model for platelet counts, and $b_{i,3}$ is the random intercept from the logit model for the presence of blood vessel malformations.
• Mixture parameters $w_k$, $\mu^*_k$, $D^*_k$, $k = 1, 2$, and their counterparts $\mu_k = s + S\mu^*_k$, $D_k = S D^*_k S^\top$, $k = 1, 2$, which represent the mean and the covariance matrix of each cluster.
• Dispersion parameter $\phi_1$ from the linear mixed model for logarithmic bilirubin. In this case, $\phi_1$ is the residual variance and we also denote it by $\sigma_1^2 = \phi_1$.
• Fixed effects $\alpha = \alpha_3$, where $\alpha_3$ is the slope from the logit model for the presence of blood vessel malformations.
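As a small illustration of model (C.1), the conditional mean evolutions of the three markers, given a random-effect vector and the fixed slope, can be evaluated as follows (a minimal sketch with illustrative names, not code from the analysis):

```python
import numpy as np

def conditional_means(b, alpha3, t):
    """Conditional mean evolutions of the three PBC910 markers under model
    (C.1), given the random-effect vector b = (b11, b12, b21, b22, b3) and
    the fixed slope alpha3, evaluated on a common grid of times t (months).
    Names are ours, for illustration only."""
    b11, b12, b21, b22, b3 = b
    m_lbili = b11 + b12 * t                                # identity link (Gaussian)
    m_platelet = np.exp(b21 + b22 * t)                     # log link (Poisson)
    m_spiders = 1.0 / (1.0 + np.exp(-(b3 + alpha3 * t)))   # logit link (Bernoulli)
    return m_lbili, m_platelet, m_spiders
```

The three link functions (identity, log, logit) mirror the Gaussian, Poisson and Bernoulli assumptions of the three outcomes.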
For the purpose of estimation, we have
\[
\psi = \bigl(\sigma_1^2,\, \alpha_3\bigr)^\top,
\qquad
\theta^* = \bigl(w^\top,\, \mu^{*\top}_1,\, \mu^{*\top}_2,\, \mathrm{vec}(D^*_1),\, \mathrm{vec}(D^*_2)\bigr)^\top.
\]

C.2. Fixed hyperparameters, shift vector, scale matrix. The values of the fixed hyperparameters, the shift vector and the scale matrix, chosen according to the guidelines given in Section A.4, were as follows:
\[
\begin{aligned}
s &= \bigl(0.315,\, 0.00765,\, 5.53,\, -0.00663,\, -2.75\bigr)^\top, \\
S &= \mathrm{diag}\bigl(0.864,\, 0.02008,\, 0.35,\, 0.01565,\, 3.23\bigr), \\
\delta &= 1, \qquad \xi_b = (0, 0, 0, 0, 0)^\top, \qquad C_b = \mathrm{diag}(36, 36, 36, 36, 36), \qquad \zeta_b = 6, \\
g_{b,l} &= 0.2, \qquad h_{b,l} = 0.278, \qquad l = 1, \ldots, 5, \\
\zeta_{\phi,1} &= 2, \qquad g_{\phi,1} = 0.2, \qquad h_{\phi,1} = 2.76, \\
\xi_\alpha &= 0, \qquad C_\alpha = 10\,000.
\end{aligned}
\]

C.3. Markov chain Monte Carlo. The reported results are based on 10 000 iterations of a 1:100 thinned MCMC obtained after a burn-in period of 1 000 iterations. This took 75 minutes on an Intel Core 2 Duo 3 GHz CPU with 3.25 GB RAM running a Linux OS. Convergence of the MCMC was evaluated using the R package coda (Plummer et al., 2006). We illustrate the performance of the MCMC with traceplots of the last 1 000 sampled values (1/10 of the sample used for the inference) of the model deviance $D(\psi, \theta) = -2 \log L(\psi, \theta)$ in the left panel of Figure C.1. Estimated autocorrelations for this quantity are shown in the right panel of Figure C.1.

Fig C.1. PBC910 Data. Traceplot of the last 1 000 sampled values of the observed data deviance $D(\psi, \theta)$ and estimated autocorrelation based on 10 000 sampled values of the observed data deviance $D(\psi, \theta)$.

D. Simulation study. In this section, we provide more details on the simulation study. Namely, Table D.1 shows the true values of the mixture covariance matrices in the distributions of random effects used to generate the data.
Further, Figures D.1 and D.2 show the observed longitudinal profiles for one simulated dataset in the settings with $K = 2$ and $K = 3$ clusters, respectively, normally distributed random effects in each cluster and in total 200 subjects. The plots are also supplemented by the true cluster-specific marginal mean evolutions of each outcome. Supplementary results of the simulation study follow in Tables D.2 – D.6.

Table D.1
Simulation study. Standard deviations (on the diagonal) and correlations (off-diagonal elements) for each mixture component, derived from the true covariance matrices $D_1$, $D_2$, $D_3$ (lower triangles shown).

k = 1
                        Int. (Gaus.)  Slope (Gaus.)  Int. (Pois.)  Slope (Pois.)  Int. (Bernoulli)
Intercept (Gaus.)           0.5
Slope (Gaus.)               0.0          0.01
Intercept (Pois.)          -0.3          0.0            0.3
Slope (Pois.)               0.0         -0.2            0.0           0.003
Intercept (Bernoulli)       0.3          0.1            0.0           0.0             2.0

k = 2
Intercept (Gaus.)           0.3
Slope (Gaus.)              -0.2          0.01
Intercept (Pois.)           0.1          0.0            0.5
Slope (Pois.)              -0.1          0.3            0.0           0.005
Intercept (Bernoulli)       0.2          0.1           -0.2           0.0             3.0

k = 3
Intercept (Gaus.)           0.7
Slope (Gaus.)               0.0          0.03
Intercept (Pois.)           0.0          0.0            0.7
Slope (Pois.)               0.0          0.0            0.0           0.010
Intercept (Bernoulli)       0.0          0.0            0.0           0.0             0.5

Fig D.1. Simulation study. One simulated dataset for a model with $K = 2$ ($N = (60, 40)$) and normally distributed random effects in each cluster. Panels show the Gaussian, Poisson and Bernoulli responses against time (months) for Groups 1 and 2; thick lines show the true cluster-specific marginal mean evolution. The values of the dichotomous Bernoulli response are vertically jittered.
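A covariance matrix $D_k$ can be recomposed from the standard deviations and correlations reported in Table D.1 as $D_k = \mathrm{diag}(sd)\, R\, \mathrm{diag}(sd)$; a minimal NumPy sketch (function name ours):

```python
import numpy as np

def cov_from_sd_corr(sd, corr):
    """Recover a covariance matrix from standard deviations (the diagonal of
    Table D.1) and a correlation matrix (its off-diagonal elements):
    D = diag(sd) R diag(sd), computed here as the elementwise product
    of R with the outer product of sd with itself."""
    sd = np.asarray(sd, dtype=float)
    R = np.asarray(corr, dtype=float)
    return np.outer(sd, sd) * R
```

For example, standard deviations $(2, 3)$ with correlation $0.5$ give the covariance matrix $\begin{pmatrix} 4 & 3 \\ 3 & 9 \end{pmatrix}$.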
Fig D.2. Simulation study. One simulated dataset for a model with $K = 3$ ($N = (60, 34, 6)$) and normally distributed random effects in each cluster. Panels show the Gaussian, Poisson and Bernoulli responses against time (months) for Groups 1 – 3; thick lines show the true cluster-specific marginal mean evolution. The values of the dichotomous Bernoulli response are vertically jittered.

Table D.2
Simulation study: error rates of classification obtained from a model with a correctly specified number of clusters.

                        Overall          Conditional error rate (%) given component
Setting                 error rate (%)        1        2        3
Normal (K = 2)
  N = (30, 20)              15.8             7.9     27.6     N.A.
  N = (60, 40)               7.8             4.0     13.4     N.A.
  N = (120, 80)              5.8             4.0      8.4     N.A.
MVT5 (K = 2)
  N = (30, 20)              20.6            10.9     35.2     N.A.
  N = (60, 40)              10.8             3.5     21.9     N.A.
  N = (120, 80)              8.7             4.3     15.4     N.A.
Normal (K = 3)
  N = (30, 17, 3)           26.5            13.9     41.9     66.0
  N = (60, 34, 6)           17.4            11.9     20.0     56.8
  N = (120, 68, 12)         10.1             6.4     10.8     42.3
MVT5 (K = 3)
  N = (30, 17, 3)           23.8            11.2     39.3     62.3
  N = (60, 34, 6)           18.8            14.8     19.3     55.2
  N = (120, 68, 12)          9.7             6.4     11.9     30.4

Table D.3
Simulation study: mixture weights. Bias, standard deviation and mean squared error (MSE) are based on posterior means as parameter estimates; the reported MSE is the average MSE over the $K$ mixture components.

Setting                Bias                        Std. Dev.                  √MSE
w = (0.600, 0.400)
Normal (K = 2)
  N = (30, 20)         (0.053, -0.053)             (0.161, 0.161)             0.169
  N = (60, 40)         (0.030, -0.030)             (0.057, 0.057)             0.064
  N = (120, 80)        (0.008, -0.008)             (0.037, 0.037)             0.037
MVT5 (K = 2)
  N = (30, 20)         (0.071, -0.071)             (0.198, 0.198)             0.209
  N = (60, 40)         (0.064, -0.064)             (0.111, 0.111)             0.128
  N = (120, 80)        (0.033, -0.033)             (0.097, 0.097)             0.102
w = (0.600, 0.340, 0.060)
Normal (K = 3)
  N = (30, 17, 3)      (0.023, -0.051, 0.028)      (0.190, 0.127, 0.127)      0.154
  N = (60, 34, 6)      (0.003, 0.003, -0.006)      (0.113, 0.090, 0.051)      0.088
  N = (120, 68, 12)    (0.005, 0.003, -0.008)      (0.054, 0.056, 0.029)      0.048
MVT5 (K = 3)
  N = (30, 17, 3)      (0.045, -0.057, 0.012)      (0.133, 0.104, 0.070)      0.113
  N = (60, 34, 6)      (-0.029, 0.008, 0.021)      (0.158, 0.106, 0.086)      0.122
  N = (120, 68, 12)    (-0.007, -0.011, 0.017)     (0.065, 0.060, 0.034)      0.056

Table D.4
Simulation study: mixture means of the random effects in the model for the Gaussian response. Bias, standard deviation and mean squared error (MSE) are based on posterior means as parameter estimates; the reported MSE is the average MSE over the $K$ mixture components.

Setting                Bias                        Std. Dev.                  √MSE
µ*,1 = (0.000, 1.000) (Intercept, Gaussian resp.)
Normal (K = 2)
  N = (30, 20)         (0.140, -0.102)             (0.194, 0.213)             0.237
  N = (60, 40)         (0.053, -0.020)             (0.128, 0.162)             0.150
  N = (120, 80)        (0.014, 0.003)              (0.074, 0.055)             0.066
MVT5 (K = 2)
  N = (30, 20)         (0.174, -0.140)             (0.215, 0.738)             0.564
  N = (60, 40)         (0.075, -0.041)             (0.145, 0.592)             0.433
  N = (120, 80)        (0.038, -0.085)             (0.114, 0.516)             0.378
µ*,1 = (0.000, 1.000, 1.300) (Intercept, Gaussian resp.)
Normal (K = 3)
  N = (30, 17, 3)      (0.184, -0.117, -0.427)     (0.181, 0.295, 0.495)      0.444
  N = (60, 34, 6)      (0.124, -0.086, -0.124)     (0.161, 0.209, 0.533)      0.360
  N = (120, 68, 12)    (0.030, -0.014, -0.040)     (0.084, 0.107, 0.428)      0.260
MVT5 (K = 3)
  N = (30, 17, 3)      (0.181, -0.107, -0.282)     (0.196, 0.336, 0.827)      0.563
  N = (60, 34, 6)      (0.095, -0.100, -0.362)     (0.151, 0.201, 0.574)      0.424
  N = (120, 68, 12)    (0.013, 0.017, -0.163)      (0.091, 0.086, 0.579)      0.353
µ*,2 = (0.0100, 0.0100) (Slope, Gaussian resp.)
Normal (K = 2)
  N = (30, 20)         (0.0006, -0.0001)           (0.0036, 0.0058)           0.0048
  N = (60, 40)         (-0.0001, -0.0001)          (0.0029, 0.0039)           0.0034
  N = (120, 80)        (-0.0003, -0.0002)          (0.0018, 0.0026)           0.0022
MVT5 (K = 2)
  N = (30, 20)         (0.0000, 0.0023)            (0.0038, 0.0185)           0.0134
  N = (60, 40)         (0.0006, 0.0000)            (0.0025, 0.0144)           0.0103
  N = (120, 80)        (0.0002, -0.0006)           (0.0018, 0.0067)           0.0049
µ*,2 = (0.0100, 0.0100, -0.0300) (Slope, Gaussian resp.)
Normal (K = 3)
  N = (30, 17, 3)      (-0.0005, -0.0056, 0.0200)  (0.0039, 0.0138, 0.0248)   0.0204
  N = (60, 34, 6)      (-0.0008, -0.0015, 0.0034)  (0.0031, 0.0049, 0.0279)   0.0165
  N = (120, 68, 12)    (-0.0001, -0.0008, -0.0028) (0.0022, 0.0029, 0.0214)   0.0126
MVT5 (K = 3)
  N = (30, 17, 3)      (-0.0008, -0.0042, 0.0223)  (0.0041, 0.0173, 0.0294)   0.0237
  N = (60, 34, 6)      (-0.0007, -0.0024, 0.0093)  (0.0029, 0.0068, 0.0347)   0.0211
  N = (120, 68, 12)    (0.0002, -0.0016, 0.0051)   (0.0020, 0.0079, 0.0327)   0.0196

Table D.5
Simulation study: mixture means of the random effects in the model for the Poisson response. Bias, standard deviation and mean squared error (MSE) are based on posterior means as parameter estimates; the reported MSE is the average MSE over the $K$ mixture components.

Setting                Bias                        Std. Dev.                  √MSE
µ*,3 = (5.00, 5.00) (Intercept, Poisson resp.)
Normal (K = 2)
  N = (30, 20)         (0.00, -0.03)               (0.06, 0.16)               0.12
  N = (60, 40)         (-0.01, 0.00)               (0.05, 0.11)               0.09
  N = (120, 80)        (0.00, 0.00)                (0.03, 0.06)               0.05
MVT5 (K = 2)
  N = (30, 20)         (0.00, -0.04)               (0.05, 0.51)               0.36
  N = (60, 40)         (0.00, -0.01)               (0.04, 0.67)               0.47
  N = (120, 80)        (0.00, -0.02)               (0.03, 0.22)               0.16
µ*,3 = (5.00, 5.00, 5.50) (Intercept, Poisson resp.)
Normal (K = 3)
  N = (30, 17, 3)      (0.02, 0.10, -0.19)         (0.10, 0.32, 0.40)         0.33
  N = (60, 34, 6)      (0.02, 0.03, -0.08)         (0.08, 0.15, 0.42)         0.26
  N = (120, 68, 12)    (0.01, 0.01, 0.02)          (0.06, 0.09, 0.31)         0.19
MVT5 (K = 3)
  N = (30, 17, 3)      (0.01, 0.16, -0.21)         (0.08, 0.52, 0.70)         0.52
  N = (60, 34, 6)      (0.02, 0.02, -0.12)         (0.06, 0.14, 0.51)         0.31
  N = (120, 68, 12)    (0.00, 0.01, -0.14)         (0.03, 0.10, 0.35)         0.22
µ*,4 = (-0.0050, -0.0200) (Slope, Poisson resp.)
Normal (K = 2)
  N = (30, 20)         (-0.0017, 0.0003)           (0.0025, 0.0046)           0.0039
  N = (60, 40)         (-0.0006, -0.0005)          (0.0017, 0.0018)           0.0018
  N = (120, 80)        (-0.0002, -0.0002)          (0.0006, 0.0012)           0.0010
MVT5 (K = 2)
  N = (30, 20)         (-0.0026, 0.0019)           (0.0029, 0.0058)           0.0051
  N = (60, 40)         (-0.0011, 0.0007)           (0.0020, 0.0044)           0.0035
  N = (120, 80)        (-0.0006, 0.0001)           (0.0016, 0.0022)           0.0020
µ*,4 = (-0.0050, -0.0200, 0.0000) (Slope, Poisson resp.)
Normal (K = 3)
  N = (30, 17, 3)      (-0.0023, 0.0049, -0.0042)  (0.0021, 0.0057, 0.0089)   0.0073
  N = (60, 34, 6)      (-0.0011, 0.0025, 0.0001)   (0.0020, 0.0042, 0.0078)   0.0054
  N = (120, 68, 12)    (-0.0002, 0.0005, 0.0004)   (0.0007, 0.0022, 0.0055)   0.0034
MVT5 (K = 3)
  N = (30, 17, 3)      (-0.0023, 0.0049, -0.0039)  (0.0027, 0.0062, 0.0097)   0.0078
  N = (60, 34, 6)      (-0.0010, 0.0027, -0.0038)  (0.0019, 0.0043, 0.0075)   0.0058
  N = (120, 68, 12)    (-0.0002, 0.0010, -0.0021)  (0.0011, 0.0036, 0.0071)   0.0048

Table D.6
Simulation study: mixture means of the random effects in the model for the Bernoulli response. Bias, standard deviation and mean squared error (MSE) are based on posterior means as parameter estimates; the reported MSE is the average MSE over the $K$ mixture components.

Setting                Bias                        Std. Dev.                  √MSE
µ*,5 = (-3.00, -1.00) (Intercept, Bernoulli resp.)
Normal (K = 2)
  N = (30, 20)         (-0.06, -0.27)              (1.15, 2.33)               1.84
  N = (60, 40)         (-0.08, 0.02)               (0.61, 1.05)               0.86
  N = (120, 80)        (-0.08, 0.00)               (0.42, 0.50)               0.46
MVT5 (K = 2)
  N = (30, 20)         (0.09, 0.35)                (0.87, 4.63)               3.33
  N = (60, 40)         (-0.03, -0.18)              (0.63, 3.02)               2.17
  N = (120, 80)        (-0.11, 0.26)               (0.45, 2.02)               1.47
µ*,5 = (-3.00, -1.00, -2.00) (Intercept, Bernoulli resp.)
Normal (K = 3)
  N = (30, 17, 3)      (-0.01, -0.54, -3.91)       (1.03, 2.69, 7.28)         5.04
  N = (60, 34, 6)      (0.09, -0.46, -2.74)        (0.75, 1.95, 4.06)         3.07
  N = (120, 68, 12)    (-0.02, 0.10, -1.28)        (0.47, 0.55, 2.32)         1.58
MVT5 (K = 3)
  N = (30, 17, 3)      (0.08, -1.10, -2.05)        (0.86, 3.98, 5.12)         3.99
  N = (60, 34, 6)      (0.02, -0.33, -1.83)        (0.83, 1.01, 4.04)         2.66
  N = (120, 68, 12)    (0.01, 0.07, -0.76)         (0.36, 0.52, 2.53)         1.56

References.

Fong, Y., Rue, H. and Wakefield, J. (2010). Bayesian inference for generalized linear mixed models. Biostatistics 11 397–412.
Gamerman, D. (1997). Sampling from the posterior distribution in generalized linear mixed models. Statistics and Computing 7 57–68.
Komárek, A. and Komárková, L. (2012). Clustering for multivariate continuous and discrete longitudinal data. The Annals of Applied Statistics.
Plummer, M., Best, N., Cowles, K. and Vines, K. (2006). CODA: Convergence diagnosis and output analysis for MCMC. R News 6 7–11.
Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with Discussion). Journal of the Royal Statistical Society, Series B 59 731–792.

A. Komárek
Faculty of Mathematics and Physics
Charles University in Prague
Sokolovská 83
CZ–186 75, Praha 8
Czech Republic
E-mail: [email protected]
URL: http://www.karlin.mff.cuni.cz/~komarek

L. Komárková
Faculty of Management
The University of Economics in Prague
Jarošovská 1117
CZ–377 01, Jindřichův Hradec
Czech Republic
E-mail: [email protected]
