Department of Probability and Mathematical Statistics

NMSA407 Linear Regression
Course Notes 2015–16

Arnošt Komárek

Last modified on January 14, 2016.

These course notes contain an overview of notation, definitions, theorems and comments covered by the course "NMSA407 Linear Regression", which is part of the curriculum of the Master's programs "Probability, Mathematical Statistics and Econometrics" and "Financial and Insurance Mathematics". This document undergoes continuing development. This version is dated January 14, 2016.

Arnošt Komárek
[email protected]

On Řečička, in Karlín, in Zvůle, from May 2015; partially based on lecture overheads used in fall 2013 and 2014.

Contents

1 Linear Model
  1.1 Regression analysis
    1.1.1 Basic setup
    1.1.2 Probabilistic model for the data
  1.2 Linear model: Basics
    1.2.1 Definition of a linear model
    1.2.2 Definition of a linear model using the error terms
    1.2.3 Rank of the model
    1.2.4 Independent observations
    1.2.5 Linear model with i.i.d. errors
    1.2.6 Regression function
    1.2.7 Transformations of covariates
    1.2.8 Linear model with intercept
    1.2.9 Interpretation of regression coefficients
    1.2.10 Fixed or random covariates
    1.2.11 Limitations of a linear model

2 Least Squares Estimation
  2.1 Regression and residual space, projections
    2.1.1 Regression and residual space
    2.1.2 Projections
  2.2 Fitted values, residuals, Gauss–Markov theorem
  2.3 Normal equations
  2.4 Estimable parameters
  2.5 Parameterizations of a linear model
    2.5.1 Equivalent linear models
    2.5.2 Full-rank parameterization of a linear model
  2.6 Matrix algebra and a method of least squares
    2.6.1 QR decomposition
    2.6.2 SVD decomposition

3 Normal Linear Model
  3.1 Normal linear model
  3.2 Properties of the least squares estimators under the normality
    3.2.1 Statistical inference in a full-rank normal linear model
    3.2.2 Statistical inference in a general rank normal linear model
  3.3 Confidence interval for the model based mean, prediction interval
  3.4 Distribution of the linear hypotheses test statistics under the alternative

4 Basic Regression Diagnostics
  4.1 (Normal) linear model assumptions
  4.2 Standardized residuals
  4.3 Graphical tools of regression diagnostics
    4.3.1 (A1) Correctness of the regression function
    4.3.2 (A2) Homoscedasticity of the errors
    4.3.3 (A3) Uncorrelated errors
    4.3.4 (A4) Normality

5 Submodels
  5.1 Submodel
    5.1.1 Projection considerations
    5.1.2 Properties of submodel related quantities
    5.1.3 Series of submodels
    5.1.4 Statistical test to compare nested models
  5.2 Omitting some covariates
  5.3 Linear constraints
    5.3.1 F-statistic to verify a set of linear constraints
    5.3.2 t-statistic to verify a linear constraint
  5.4 Coefficient of determination
    5.4.1 Intercept only model
    5.4.2 Models with intercept
    5.4.3 Evaluation of a prediction quality of the model
    5.4.4 Coefficient of determination
    5.4.5 Overall F-test

6 General Linear Model

7 Parameterizations of Covariates
  7.1 Linearization of the dependence of the response on the covariates
  7.2 Parameterization of a single covariate
    7.2.1 Parameterization
    7.2.2 Covariate types
  7.3 Numeric covariate
    7.3.1 Simple transformation of the covariate
    7.3.2 Raw polynomials
    7.3.3 Orthonormal polynomials
    7.3.4 Regression splines
  7.4 Categorical covariate
    7.4.1 Link to a G-sample problem
    7.4.2 Linear model parameterization of one-way classified group means
    7.4.3 ANOVA parameterization of one-way classified group means
    7.4.4 Full-rank parameterization of one-way classified group means

8 Additivity and Interactions
  8.1 Additivity and partial effect of a covariate
    8.1.1 Additivity
    8.1.2 Partial effect of a covariate
    8.1.3 Additivity, partial covariate effect and conditional independence
  8.2 Additivity of the effect of a numeric covariate
    8.2.1 Partial effect of a numeric covariate
  8.3 Additivity of the effect of a categorical covariate
    8.3.1 Partial effects of a categorical covariate
    8.3.2 Interpretation of the regression coefficients
  8.4 Effect modification and interactions
    8.4.1 Effect modification
    8.4.2 Interactions
    8.4.3 Interactions with the regression spline
    8.4.4 Linear model with interactions
    8.4.5 Rank of the interaction model
  8.5 Interaction of two numeric covariates
    8.5.1 Mutual effect modification
    8.5.2 Mutual effect modification with regression splines
  8.6 Interaction of a categorical and a numeric covariate
    8.6.1 Categorical effect modification
    8.6.2 Categorical effect modification with regression splines
  8.7 Interaction of two categorical covariates
    8.7.1 Linear model parameterization of two-way classified group means
    8.7.2 ANOVA parameterization of two-way classified group means
    8.7.3 Full-rank parameterization of two-way classified group means
    8.7.4 Relationship between the full-rank and ANOVA parameterizations
    8.7.5 Additive model
    8.7.6 Interpretation of model parameters for selected choices of (pseudo)contrasts
  8.8 Hierarchically well-formulated models, ANOVA tables
    8.8.1 Model terms
    8.8.2 Model formula
    8.8.3 Hierarchically well formulated model
    8.8.4 ANOVA tables

9 Analysis of Variance
  9.1 One-way classification
    9.1.1 Parameters of interest
    9.1.2 One-way ANOVA model
    9.1.3 Least squares estimation
    9.1.4 Within and between groups sums of squares, ANOVA F-test
  9.2 Two-way classification
    9.2.1 Parameters of interest
    9.2.2 Two-way ANOVA models
    9.2.3 Least squares estimation
    9.2.4 Sums of squares and ANOVA tables with balanced data

10 Checking Model Assumptions
  10.1 Model with added regressors
  10.2 Correct regression function
    10.2.1 Partial residuals
    10.2.2 Test for linearity of the effect
  10.3 Homoscedasticity
    10.3.1 Tests of homoscedasticity
    10.3.2 Score tests of homoscedasticity
    10.3.3 Some other tests of homoscedasticity
  10.4 Normality
    10.4.1 Tests of normality
  10.5 Uncorrelated errors
    10.5.1 Durbin–Watson test
  10.6 Transformation of response
    10.6.1 Prediction based on a model with transformed response
    10.6.2 Log-normal model

11 Consequences of a Problematic Regression Space
  11.1 Multicollinearity
    11.1.1 Singular value decomposition of a model matrix
    11.1.2 Multicollinearity and its impact on precision of the LSE
    11.1.3 Variance inflation factor and tolerance
    11.1.4 Basic treatment of multicollinearity
  11.2 Misspecified regression space
    11.2.1 Omitted and irrelevant regressors
    11.2.2 Prediction quality of the fitted model
    11.2.3 Omitted regressors
    11.2.4 Irrelevant regressors
    11.2.5 Summary

12 Simultaneous Inference in a Linear Model
  12.1 Basic simultaneous inference
  12.2 Multiple comparison procedures
    12.2.1 Multiple testing
    12.2.2 Simultaneous confidence intervals
    12.2.3 Multiple comparison procedure, P-values adjusted for multiple comparison
    12.2.4 Bonferroni simultaneous inference in a normal linear model
  12.3 Tukey's T-procedure
    12.3.1 Tukey's pairwise comparisons theorem
    12.3.2 Tukey's honest significance differences (HSD)
    12.3.3 Tukey's HSD in a linear model
  12.4 Hothorn–Bretz–Westfall procedure
    12.4.1 Max-abs-t distribution
    12.4.2 General multiple comparison procedure for a linear model
  12.5 Confidence band for the regression function

13 Asymptotic Properties of the LSE and Sandwich Estimator
  13.1 Assumptions and setup
  13.2 Consistency of LSE
  13.3 Asymptotic normality of LSE under homoscedasticity
    13.3.1 Asymptotic validity of the classical inference under homoscedasticity but non-normality
  13.4 Asymptotic normality of LSE under heteroscedasticity
    13.4.1 Heteroscedasticity consistent asymptotic inference

14 Unusual Observations
  14.1 Leave-one-out and outlier model
  14.2 Outliers
  14.3 Leverage points
  14.4 Influential diagnostics
    14.4.1 DFBETAS
    14.4.2 DFFITS
    14.4.3 Cook distance
    14.4.4 COVRATIO
    14.4.5 Final remarks

A Matrices
  A.1 Pseudoinverse of a matrix
  A.2 Kronecker product
  A.3 Additional theorems on matrices

B Distributions
  B.1 Non-central univariate distributions
  B.2 Multivariate distributions

C Asymptotic Theorems

Bibliography

Preface

• R software (R Core Team, 2015).
• Basic literature: Khuri (2010); Zvára (2008).
• Supplementary literature: Seber and Lee (2003); Draper and Smith (1998); Sun (2003); Weisberg (2005); Anděl (2007); Cipra (2008); Zvára (1989).
Notation and general conventions

• Vectors are understood as column vectors (matrices with one column). Nevertheless, to simplify notation, standalone vectors will also be written as rows, e.g., $\boldsymbol{z} = (z_1, \ldots, z_p)$. The transposition symbol ($\top$) will only be used if it is necessary to highlight the fact that the vector must be used as a row vector (a matrix with one row). This mainly concerns situations when

  (i) vectors are used within matrix multiplication, e.g.,

      $\mu = \boldsymbol{z}^\top \boldsymbol{\gamma}$, $\quad \boldsymbol{z} = (z_1, \ldots, z_p)$, $\quad \boldsymbol{\gamma} = (\gamma_1, \ldots, \gamma_p)$,

      means that $\mu$ is a product of a row vector $\boldsymbol{z}^\top$ and a column vector $\boldsymbol{\gamma}$ (i.e., a scalar product of the two vectors);

  (ii) vectors are used to indicate rows of a matrix, e.g.,

      $\mathbf{Z} = \begin{pmatrix} \boldsymbol{Z}_1^\top \\ \vdots \\ \boldsymbol{Z}_n^\top \end{pmatrix}$, $\quad \boldsymbol{Z}_i^\top = (Z_{i,1}, \ldots, Z_{i,p})$, $\ i = 1, \ldots, n$,

      means that the matrix $\mathbf{Z}$ has rows given by the (row) vectors $\boldsymbol{Z}_1^\top, \ldots, \boldsymbol{Z}_n^\top$, i.e.,

      $\mathbf{Z} = \begin{pmatrix} Z_{1,1} & \cdots & Z_{1,p} \\ \vdots & \ddots & \vdots \\ Z_{n,1} & \cdots & Z_{n,p} \end{pmatrix}$.

• Statements concerning equalities between two random quantities are understood as equalities almost surely, even if "almost surely" is not explicitly stated.

Chapter 1  Linear Model

1.1 Regression analysis

[Start of Lecture #1 (05/10/2015)]

Linear regression¹ is a basic method of so-called regression analysis², which covers a variety of methods to model how the distribution of one variable depends on one or more other variables. A principal tool of linear regression is then the so-called linear model³, which will be the main topic of this lecture.

1.1.1 Basic setup

Most methods of regression analysis assume that the data can be represented by $n$ random vectors $(Y_i, \boldsymbol{X}_i)$, $i = 1, \ldots, n$, where the $\boldsymbol{X}_i = (X_{i,0}, \ldots, X_{i,k-1})$, $i = 1, \ldots, n$, all have the same number $k < n$ of components. This will also be a basic assumption used throughout the whole lecture.

Notation and terminology (Response, covariates, response vector, model matrix).

• $Y_i$ is called the response⁴ or the dependent variable⁵.
• The components of $\boldsymbol{X}_i$ are called covariates⁶, explanatory variables⁷, predictors⁸, or independent variables⁹.

• The sample space¹⁰ of the covariates will be denoted by $\mathcal{X}$. That is, $\mathcal{X} \subseteq \mathbb{R}^k$, and among other things, $\mathsf{P}(\boldsymbol{X}_i \in \mathcal{X}) = 1$, $i = 1, \ldots, n$.

Further, let

    $\boldsymbol{Y} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}$, $\qquad \mathbf{X} = \begin{pmatrix} \boldsymbol{X}_1^\top \\ \vdots \\ \boldsymbol{X}_n^\top \end{pmatrix} = \begin{pmatrix} X_{1,0} & \cdots & X_{1,k-1} \\ \vdots & \ddots & \vdots \\ X_{n,0} & \cdots & X_{n,k-1} \end{pmatrix}$.

• The vector $\boldsymbol{Y}$ is called the response vector¹¹.

• The $n \times k$ matrix $\mathbf{X}$ is called the model matrix¹² or the regression matrix¹³.

¹ lineární regrese  ² regresní analýza  ³ lineární model  ⁴ odezva  ⁵ závisle proměnná  ⁶ left untranslated in Czech; the expression „kovariáty" is not to be used!  ⁷ vysvětlující proměnné  ⁸ prediktory  ⁹ nezávisle proměnné  ¹⁰ výběrový prostor  ¹¹ vektor odezvy  ¹² matice modelu  ¹³ regresní matice

Notation. The letter Y (or y) will always denote a response-related quantity. The letters X (or x) and Z (or z) will always denote quantities related to the covariates.

This lecture:

• The response Y is continuous.
• Interest lies in modelling the dependence of only the expected value (the mean) of Y on the covariates.
• Covariates can be of any type (numeric, categorical).

1.1.2 Probabilistic model for the data

Any statistical analysis is based on specifying a stochastic mechanism which is assumed to generate the data. In our situation, with data $(Y_i, \boldsymbol{X}_i)$, $i = 1, \ldots, n$, the data generating mechanism corresponds to a joint distribution of a "long" random vector $(Y_1, \ldots, Y_n, \boldsymbol{X}_1, \ldots, \boldsymbol{X}_n) \equiv (\boldsymbol{Y}, \mathbf{X})$, which can be given by a joint density

    $f_{Y,X}(y_1, \ldots, y_n, \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) \equiv f_{Y,X}(\boldsymbol{y}, \boldsymbol{x})$    (1.1)

(with respect to some $\sigma$-finite product measure $\lambda_Y \times \lambda_X$).

Note. For the purpose of this lecture, $\lambda_Y$ will always be the Lebesgue measure on $(\mathbb{R}^n, \mathcal{B}_n)$.

Further, it is known from basic courses on probability that any joint density can be decomposed into a product of a conditional and a marginal density as

    $f_{Y,X}(\boldsymbol{y}, \boldsymbol{x}) = f_{Y|X}(\boldsymbol{y} \mid \boldsymbol{x})\, f_X(\boldsymbol{x})$.    (1.2)
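The decomposition (1.2) also describes how data following such a model can be simulated: draw the covariates from the marginal density f_X first, then the response from the conditional density f_{Y|X}. A minimal sketch in Python/numpy (the course itself uses R); the particular choices of f_X and f_{Y|X} below are purely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2016)
n = 100_000

# Marginal density f_X: hypothetically, X_i ~ N(0, 1), i.i.d.
x = rng.normal(size=n)

# Conditional density f_{Y|X}: hypothetically, Y_i | X_i = x_i ~ N(1 + 2 x_i, 0.5^2)
y = rng.normal(loc=1.0 + 2.0 * x, scale=0.5)

# The pairs (y_i, x_i) then form a sample from the joint density
# f_{Y,X}(y, x) = f_{Y|X}(y | x) f_X(x); e.g., the sample mean of the y_i
# approximates E_{Y,X}(Y) = 1 + 2 E(X) = 1.
print(np.mean(y))   # close to 1
```

Nothing in the sketch depends on the particular marginal f_X; it only needs to be possible to draw from it, which mirrors the remark that f_X is treated as a nuisance.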
With regression analysis, and with linear regression in particular, the interest lies in revealing certain features of the conditional distribution $\boldsymbol{Y} \mid \mathbf{X}$ (given by the density $f_{Y|X}$) while considering the marginal distribution of the covariates $\mathbf{X}$ (given by the density $f_X$) as nuisance. It will be shown during the lecture that a valid statistical inference is possible for suitable characteristics of the conditional distribution of the response given the covariates while leaving the covariate distribution $f_X$ practically unspecified. Moreover, to infer on certain characteristics of the conditional distribution $\boldsymbol{Y} \mid \mathbf{X}$, e.g., on the conditional mean $\mathsf{E}(\boldsymbol{Y} \mid \mathbf{X})$, even the density $f_{Y|X}$ might be left practically unspecified for many tasks.

1.2 Linear model: Basics

1.2.1 Definition of a linear model

Definition 1.1 (Linear model). The data $(Y_i, \boldsymbol{X}_i)$, $i = 1, \ldots, n$, satisfy a linear model if

    $\mathsf{E}(\boldsymbol{Y} \mid \mathbf{X}) = \mathbf{X}\boldsymbol{\beta}$, $\qquad \mathsf{var}(\boldsymbol{Y} \mid \mathbf{X}) = \sigma^2 \mathbf{I}_n$,

where $\boldsymbol{\beta} \in \mathbb{R}^k$ and $0 < \sigma^2 < \infty$ are unknown parameters.

Terminology (Regression coefficients, residual variance and standard deviation).

• $\boldsymbol{\beta} = (\beta_0, \ldots, \beta_{k-1})$ is called the vector of regression coefficients¹⁴ or regression parameters¹⁵.
• $\sigma^2$ is called the residual variance¹⁶.
• $\sigma = \sqrt{\sigma^2}$ is called the residual standard deviation¹⁷.

Note. The linear model as specified by Definition 1.1 deals with specifying only the first two moments of the conditional distribution $\boldsymbol{Y} \mid \mathbf{X}$. For the rest, both the density $f_{Y|X}$ and the density $f_X$ from (1.2) can be arbitrary.

1.2.2 Definition of a linear model using the error terms

The linear model can equivalently be defined as follows.

Alternative to Definition 1.1. The data $(Y_i, \boldsymbol{X}_i)$, $i = 1, \ldots, n$, satisfy a linear model if

    $\boldsymbol{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$,

where $\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_n)$ is a random vector such that $\boldsymbol{\varepsilon}$ and $\mathbf{X}$ are independent, $\mathsf{E}(\boldsymbol{\varepsilon}) = \boldsymbol{0}_n$, $\mathsf{var}(\boldsymbol{\varepsilon}) = \sigma^2 \mathbf{I}_n$, and $\boldsymbol{\beta} \in \mathbb{R}^k$ and $0 < \sigma^2 < \infty$ are unknown parameters.

Terminology (Error terms). The random variables $\varepsilon_1, \ldots, \varepsilon_n$ are called the error terms (disturbances)¹⁸.

Notation.

• To indicate that a random vector $\boldsymbol{Y}$ follows some distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$, we will write $\boldsymbol{Y} \sim (\boldsymbol{\mu}, \boldsymbol{\Sigma})$.

• The fact that the data $(Y_i, \boldsymbol{X}_i)$, $i = 1, \ldots, n$, follow a linear model can now be indicated by writing $\boldsymbol{Y} \mid \mathbf{X} \sim (\mathbf{X}\boldsymbol{\beta},\, \sigma^2 \mathbf{I}_n)$.

• When using the error term vector $\boldsymbol{\varepsilon}$, the fact that the data $(Y_i, \boldsymbol{X}_i)$, $i = 1, \ldots, n$, follow a linear model can be written as $\boldsymbol{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, $\boldsymbol{\varepsilon} \sim (\boldsymbol{0}_n,\, \sigma^2 \mathbf{I}_n)$.

1.2.3 Rank of the model

The $k$-dimensional covariate vectors $\boldsymbol{X}_1, \ldots, \boldsymbol{X}_n$ (the $n \times k$ model matrix $\mathbf{X}$) are in general generated by some $(n \cdot k)$-dimensional joint distribution with a density $f_X(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = f_X(\boldsymbol{x})$. Next to assuming that $n > k$, we will additionally assume throughout the whole lecture that for a fixed $r \leq k$,

    $\mathsf{P}\{\mathrm{rank}(\mathbf{X}) = r\} = 1$.    (1.3)

That is, we will assume that the (column) rank of the model matrix is fixed rather than random. It should gradually become clear throughout the lecture that this assumption is not really restrictive for most practical applications of a linear model.

Convention. In the remainder of the lecture, we will only write $\mathrm{rank}(\mathbf{X}) = r$, which will mean that $\mathsf{P}\{\mathrm{rank}(\mathbf{X}) = r\} = 1$ if randomness of the covariates should be taken into account.

Notation. The fact that data $(Y_i, \boldsymbol{X}_i)$, $i = 1, \ldots, n$, follow a linear model where the $k$-dimensional covariates satisfy the condition (1.3) will be denoted as

    $\boldsymbol{Y} \mid \mathbf{X} \sim (\mathbf{X}\boldsymbol{\beta},\, \sigma^2 \mathbf{I}_n)$, $\quad \mathrm{rank}(\mathbf{X}_{n \times k}) = r$,

or as

    $\boldsymbol{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, $\quad \boldsymbol{\varepsilon} \sim (\boldsymbol{0}_n,\, \sigma^2 \mathbf{I}_n)$, $\quad \mathrm{rank}(\mathbf{X}_{n \times k}) = r$.

Definition 1.2 (Full-rank linear model). A full-rank linear model¹⁹ is a linear model in which $r = k$.

Note. In a full-rank linear model, the columns of the model matrix $\mathbf{X}$ are linearly independent vectors in $\mathbb{R}^n$ (almost surely).

1.2.4 Independent observations

In many areas of regression modelling, the data $(Y_i, \boldsymbol{X}_i)$, $i = 1, \ldots, n$, correspond to independently behaving units (experimental units, individuals sampled randomly from a certain population, ...). If it is so, the random vectors $(Y_i, \boldsymbol{X}_i)$ are independent for $i = 1, \ldots, n$, and the joint density (1.1) takes a product form

    $f_{Y,X}(\boldsymbol{y}, \boldsymbol{x}) = \prod_{i=1}^{n} f_{Y_i, X_i}(y_i, \boldsymbol{x}_i)$,

where $f_{Y_i, X_i}$ is the joint density of the random vector $(Y_i, \boldsymbol{X}_i)$. This can again be decomposed into a product of a conditional and a marginal density as

    $f_{Y_i, X_i}(y_i, \boldsymbol{x}_i) = f_{Y_i|X_i}(y_i \mid \boldsymbol{x}_i)\, f_{X_i}(\boldsymbol{x}_i)$, $\quad i = 1, \ldots, n$,

leading to the joint density of all data points of the form

    $f_{Y,X}(\boldsymbol{y}, \boldsymbol{x}) = \Big\{\prod_{i=1}^{n} f_{Y_i|X_i}(y_i \mid \boldsymbol{x}_i)\Big\}\, \Big\{\prod_{i=1}^{n} f_{X_i}(\boldsymbol{x}_i)\Big\}$,    (1.4)

where the first factor is $f_{Y|X}(\boldsymbol{y} \mid \boldsymbol{x})$ and the second factor is $f_X(\boldsymbol{x})$. Due to the fact that the covariate distribution is in fact a nuisance, it is not really necessary to assume that $f_X(\boldsymbol{x}) = \prod_{i=1}^{n} f_{X_i}(\boldsymbol{x}_i)$ to derive most results that will be presented in this lecture. In other words, the form of $f_X$ in the expression (1.4) might be left general, which leads to the joint density of the data points of the form

    $f_{Y,X}(\boldsymbol{y}, \boldsymbol{x}) = \Big\{\prod_{i=1}^{n} f_{Y_i|X_i}(y_i \mid \boldsymbol{x}_i)\Big\}\, f_X(\boldsymbol{x})$,

where the first factor is again $f_{Y|X}(\boldsymbol{y} \mid \boldsymbol{x})$.

Definition 1.3 (Independent observations in a regression context). In the regression context, we say that we deal with independent observations if the conditional density of the response vector $\boldsymbol{Y}$ given the covariates $\mathbf{X}$ takes the form

    $f_{Y|X}(\boldsymbol{y} \mid \boldsymbol{x}) = \prod_{i=1}^{n} f_{Y_i|X_i}(y_i \mid \boldsymbol{x}_i)$.    (1.5)

Note. When dealing with independent observations in a regression context according to Definition 1.3, no independence assumptions are imposed on the joint distribution $f_X$ of the covariates.

1.2.5 Linear model with i.i.d. errors

If we deal with independent observations in a regression context and the form (1.5) of the conditional distribution of $\boldsymbol{Y}$ given $\mathbf{X}$ can be assumed, it is often useful to assume not only a certain form of the first two moments of each conditional distribution $f_{Y_i|X_i}$, $i = 1, \ldots, n$, but even a common functional form for all those conditional distributions.

Definition 1.4 (Linear model with i.i.d. errors). The data $(Y_i, \boldsymbol{X}_i)$, $i = 1, \ldots, n$, satisfy a linear model with i.i.d. errors²⁰ if the joint conditional density of $\boldsymbol{Y}$ given $\mathbf{X}$ takes the form

    $f_{Y|X}(\boldsymbol{y} \mid \boldsymbol{x}) = \prod_{i=1}^{n} f_{Y|X}(y_i \mid \boldsymbol{x}_i)$, $\quad$ where $\quad f_{Y|X}(y_i \mid \boldsymbol{x}_i) = f_e(y_i - \boldsymbol{x}_i^\top \boldsymbol{\beta};\, \sigma^2)$, $\quad i = 1, \ldots, n$,

$\boldsymbol{\beta} \in \mathbb{R}^k$ and $0 < \sigma^2 < \infty$ are unknown parameters and $f_e(\cdot;\, \sigma^2)$ is some density of a continuous distribution with zero mean and variance $\sigma^2$.

Note. In a linear model with i.i.d. errors, $f_{Y|X}(y_i \mid \boldsymbol{x}_i) = f_e(y_i - \boldsymbol{x}_i^\top \boldsymbol{\beta};\, \sigma^2)$, $i = 1, \ldots, n$, and since $f_e(\cdot;\, \sigma^2)$ is a density of a zero-mean distribution with variance $\sigma^2$, we still have

    $\mathsf{E}(Y_i \mid \boldsymbol{X}_i) = \boldsymbol{X}_i^\top \boldsymbol{\beta}$, $\quad \mathsf{var}(Y_i \mid \boldsymbol{X}_i) = \sigma^2$, $\quad i = 1, \ldots, n$,
    $\mathsf{E}(\boldsymbol{Y} \mid \mathbf{X}) = \mathbf{X}\boldsymbol{\beta}$, $\quad \mathsf{var}(\boldsymbol{Y} \mid \mathbf{X}) = \sigma^2 \mathbf{I}_n$.

That is, the linear model with i.i.d. errors indeed implies a linear model according to the basic definition (Definition 1.1).

Analogously to Section 1.2.2, a linear model with i.i.d. errors can also alternatively be defined using the (i.i.d.) error terms as follows.

Definition 1.4 using the error terms. The data $(Y_i, \boldsymbol{X}_i)$, $i = 1, \ldots, n$, satisfy a linear model with i.i.d. errors if

    $Y_i = \boldsymbol{X}_i^\top \boldsymbol{\beta} + \varepsilon_i$, $\quad i = 1, \ldots, n$,

where $\varepsilon_1, \ldots, \varepsilon_n$ are independent and identically distributed (i.i.d.) random variables with a continuous distribution with zero mean and variance $\sigma^2$, with $\varepsilon_i$ and $\boldsymbol{X}_j$ independent for each $i, j = 1, \ldots, n$. The parameters $\boldsymbol{\beta} \in \mathbb{R}^k$ and $0 < \sigma^2 < \infty$ are unknown.

Example 1.1 (Normal linear model). The so-called normal linear model, which we will extensively work with starting from Chapter 3, is obtained by taking $f_e(\cdot;\, \sigma^2)$ to be the density of a normal distribution $\mathsf{N}(0, \sigma^2)$. We then have

    $Y_i \mid \boldsymbol{X}_i \sim \mathsf{N}(\boldsymbol{X}_i^\top \boldsymbol{\beta},\, \sigma^2)$, $\quad i = 1, \ldots, n$,
    $\boldsymbol{Y} \mid \mathbf{X} \sim \mathsf{N}_n(\mathbf{X}\boldsymbol{\beta},\, \sigma^2 \mathbf{I}_n)$,
    $\varepsilon_1, \ldots, \varepsilon_n$ i.i.d. $\sim \mathsf{N}(0, \sigma^2)$.

¹⁴ regresní koeficienty  ¹⁵ regresní parametry  ¹⁶ reziduální rozptyl  ¹⁷ reziduální směrodatná odchylka  ¹⁸ náhodné odchylky  ¹⁹ lineární model o plné hodnosti  ²⁰ lineární model s nezávislými stejně rozdělenými chybami
1.2.6 Regression function

Let $(Y, X)$ denote a (generic) random vector corresponding to a new data point generated by the same random mechanism as the data at hand. When using the linear model, we do not specify any characteristic of the random mechanism that generates the covariate value $X$. Nevertheless, as soon as a covariate value $x \in \mathcal{X}$ is given and i.i.d. errors are assumed, the linear model specifies (models) the first two moments of the conditional distribution $Y \mid X = x$ (with a density $f_{Y|X}(\cdot \mid x)$). Namely, the linear model claims that
$$ E(Y \mid X = x) = x^\top \beta, \qquad var(Y \mid X = x) = \sigma^2, $$
for some $\beta \in \mathbb{R}^k$ and some $\sigma^2$.

Terminology (Regression function). A function $m: \mathbb{R}^k \to \mathbb{R}$ given as
$$ m(x) = x^\top \beta = \beta_0 x_0 + \cdots + \beta_{k-1} x_{k-1}, \qquad x = (x_0, \ldots, x_{k-1})^\top \in \mathcal{X}, $$
is called the regression function.[21]

Note. The regression function is an assumed model for the evolution of the response expectation given the covariates, as those change. That is,
$$ m(x) = E(Y \mid X = x), \qquad x \in \mathcal{X}. $$

[21] regresní funkce

1.2.7 Transformations of covariates

In most practical situations, we primarily observe data $(Y_i, Z_i)$, $i = 1, \ldots, n$, where $Z_i = (Z_{i,1}, \ldots, Z_{i,p})^\top \in \mathcal{Z} \subseteq \mathbb{R}^p$. The main interest lies in modelling the conditional expectations $E(Y_i \mid Z_i)$, or generically the conditional expectation $E(Y \mid Z = z)$, $z \in \mathcal{Z}$. To allow for use of a linear model, the original covariates $Z_i$ must often be transformed into $X_i = (X_{i,0}, \ldots, X_{i,k-1})^\top$ as
$$ X_{i,0} = t_0(Z_i), \quad \ldots, \quad X_{i,k-1} = t_{k-1}(Z_i), $$
where $t_j: \mathbb{R}^p \to \mathbb{R}$, $j = 0, \ldots, k-1$, are some functions. When applying classical linear model methodology, it is assumed that the functions $t_0, \ldots, t_{k-1}$ are known. In the following, let for $z \in \mathcal{Z}$:
$$ x = (x_0, \ldots, x_{k-1})^\top = \bigl(t_0(z), \ldots, t_{k-1}(z)\bigr)^\top = t(z). $$
The linear model then specifies ($i = 1, \ldots, n$):
$$ E(Y_i \mid Z_i) = t^\top(Z_i)\,\beta = X_i^\top \beta = E(Y_i \mid X_i), $$
or generically
$$ E(Y \mid Z = z) = t^\top(z)\,\beta = x^\top \beta = E(Y \mid X = x). $$
The model matrix takes the form
$$ X = \begin{pmatrix} X_{1,0} & \ldots & X_{1,k-1} \\ \vdots & \ddots & \vdots \\ X_{n,0} & \ldots & X_{n,k-1} \end{pmatrix} = \begin{pmatrix} t_0(Z_1) & \ldots & t_{k-1}(Z_1) \\ \vdots & \ddots & \vdots \\ t_0(Z_n) & \ldots & t_{k-1}(Z_n) \end{pmatrix} = \begin{pmatrix} t^\top(Z_1) \\ \vdots \\ t^\top(Z_n) \end{pmatrix} = \begin{pmatrix} X_1^\top \\ \vdots \\ X_n^\top \end{pmatrix}, $$
and the corresponding regression function is
$$ m(z) = E(Y \mid Z = z) = t^\top(z)\,\beta = \beta_0 t_0(z) + \cdots + \beta_{k-1} t_{k-1}(z), \qquad z = (z_1, \ldots, z_p)^\top \in \mathcal{Z}. \tag{1.6} $$

Note. Even though some of the transformations $t_0, \ldots, t_{k-1}$ are often non-linear functions, a model with the regression function (1.6) still belongs to the area of linear regression. The reason is that the regression function is linear with respect to the unknown parameters, i.e., in the regression coefficients $\beta$.

1.2.8 Linear model with intercept

Often, $X_{i,0}$ is taken to be constantly equal to one. That is, the first element of the (transformed) covariate vectors corresponds to a random variable which is almost surely equal to one. In this case, the model matrix and the regression function take the forms
$$ X = \begin{pmatrix} 1 & X_{1,1} & \ldots & X_{1,k-1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n,1} & \ldots & X_{n,k-1} \end{pmatrix}, \qquad m(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_{k-1} x_{k-1}, \qquad x = (x_0, \ldots, x_{k-1})^\top \in \mathcal{X} \subseteq \mathbb{R}^k. $$

Definition 1.5 (Linear model with intercept). A linear model in which the first column of the model matrix $X$ is almost surely equal to a vector $1_n$ of ones is called the linear model with intercept.[22]

Terminology (Intercept). In the linear model with intercept, the first column $(X_{1,0}, \ldots, X_{n,0})^\top$ of the model matrix, which is almost surely equal to a vector $1_n$ of ones, is called the intercept column. The regression coefficient $\beta_0$ is called the intercept term[23] (or just the intercept) of the regression function.

[22] lineární model s absolutním členem  [23] absolutní člen

1.2.9 Interpretation of regression coefficients

The regression parameters express the influence of the covariates on the response. For a chosen $j \in \{0, 1, \ldots, k-1\}$, let
$$ x = (x_0, \ldots, x_j, \ldots, x_{k-1})^\top \in \mathcal{X}, \qquad x^{j(+1)} := (x_0, \ldots, x_j + 1, \ldots, x_{k-1})^\top \in \mathcal{X}. $$
We then have
$$ E\bigl(Y \mid X = x^{j(+1)}\bigr) - E(Y \mid X = x) = E(Y \mid X_0 = x_0, \ldots, X_j = x_j + 1, \ldots, X_{k-1} = x_{k-1}) - E(Y \mid X_0 = x_0, \ldots, X_j = x_j, \ldots, X_{k-1} = x_{k-1}) $$
$$ = \beta_0 x_0 + \cdots + \beta_j (x_j + 1) + \cdots + \beta_{k-1} x_{k-1} - \bigl( \beta_0 x_0 + \cdots + \beta_j x_j + \cdots + \beta_{k-1} x_{k-1} \bigr) = \beta_j. $$
That is, the regression coefficient $\beta_j$ expresses the change of the response expectation corresponding to a unit change of the $j$th regressor while keeping the remaining covariates unchanged. Further, for a fixed $\delta \in \mathbb{R}$, let
$$ x^{j(+\delta)} := (x_0, \ldots, x_j + \delta, \ldots, x_{k-1})^\top \in \mathcal{X}. $$
We then have
$$ E\bigl(Y \mid X = x^{j(+\delta)}\bigr) - E(Y \mid X = x) = E(Y \mid X_0 = x_0, \ldots, X_j = x_j + \delta, \ldots, X_{k-1} = x_{k-1}) - E(Y \mid X_0 = x_0, \ldots, X_j = x_j, \ldots, X_{k-1} = x_{k-1}) = \beta_j\,\delta. $$
It is thus implied by the assumed linear model that:
(i) The change of the response expectation corresponding to a constant change $\delta$ of the $j$th regressor does not depend on the value $x_j$ of the regressor that is changed by $\delta$.
(ii) The change of the response expectation corresponding to a constant change $\delta$ of the $j$th regressor does not depend on the values of the remaining regressors.

Terminology (Effect of the regressor). The regression coefficient $\beta_j$ is also called the effect of the $j$th regressor.

Linear model with intercept. In a model with intercept, where $X_{i,0}$ is almost surely equal to one, it does not make sense to consider a change of this covariate by any fixed value. The intercept $\beta_0$ then has the following interpretation. If $(x_0, x_1, \ldots, x_{k-1})^\top = (1, 0, \ldots, 0)^\top \in \mathcal{X}$, that is, if the non-intercept covariates may all attain zero values, we have
$$ \beta_0 = E(Y \mid X_1 = 0, \ldots, X_{k-1} = 0). $$

1.2.10 Fixed or random covariates

In certain application areas (e.g., designed experiments), all the covariates (or some of them) can be fixed rather than random variables. This means that the covariate values are determined/set by the analyst rather than being observed on (randomly selected) subjects. For the majority of the theory presented throughout this course, it does not really matter whether the covariates are considered as random or as fixed quantities.
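The coefficient interpretation of Section 1.2.9 can be illustrated numerically. The following sketch (simulated, purely illustrative data) builds a model matrix from transformed covariates $t_0(z) = 1$, $t_1(z) = z$, $t_2(z) = z^2$, fits it by least squares, and checks that a unit change in one regressor shifts the modelled mean by exactly the corresponding fitted coefficient:

```python
import numpy as np

rng = np.random.default_rng(2015)
n = 200
z = rng.uniform(0, 3, size=n)                 # original covariate Z
# transformed covariates: t0(z) = 1 (intercept), t1(z) = z, t2(z) = z**2
X = np.column_stack([np.ones(n), z, z**2])    # model matrix, k = 3
beta = np.array([1.0, 0.5, -0.2])             # illustrative "true" coefficients
y = X @ beta + rng.normal(0.0, 0.1, size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)     # least squares fit

# effect of the j-th regressor: a unit change in x_j shifts the modelled
# mean x'b by b_j, regardless of the values of the other regressors
x = np.array([1.0, 1.2, 0.7])
x_plus = x.copy()
x_plus[1] += 1.0                              # x^{1(+1)}
print(np.isclose(x_plus @ b - x @ b, b[1]))   # True
```

Note that the model is linear in $\beta$ even though $t_2$ is a non-linear transformation of $z$, which is exactly the point of Section 1.2.7.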
Many proofs proceed in almost exactly the same way in both situations. Nevertheless, especially when dealing with asymptotic properties of the estimators used in the context of a linear model (see Chapter 13), care must be taken as to whether the covariates are considered as random or as fixed.

[End of Lecture #1 (05/10/2015)]
[Start of Lecture #2 (08/10/2015)]

Convention. For the majority of the lecture, most expectations that we will work with will be conditional expectations with respect to the conditional distribution of the response $Y$ given the covariate values $X_1, \ldots, X_n$ (given the model matrix $X$). To simplify notation, we will use $E(\cdot)$ also for $E(\cdot \mid X)$ and $var(\cdot)$ also for $var(\cdot \mid X)$. That is, if $g = (g_1, \ldots, g_m)^\top: \mathbb{R}^{n(1+k)} \to \mathbb{R}^m$ is a measurable function, then $E\,g(Y, X)$ and $var\,g(Y, X)$ mean
$$ E\,g(Y, X) := E\bigl( g(Y, X) \mid X \bigr) = \Bigl( \int_{\mathbb{R}^n} g_j(y, X)\, f_{Y|X}(y \mid X)\, d\lambda_Y(y) \Bigr)_{j=1,\ldots,m}, \qquad var\,g(Y, X) := var\bigl( g(Y, X) \mid X \bigr). $$
Further, for $x \in \mathbb{R}^{n \cdot k}$, we will use
$$ E\,g(Y, x) := E\bigl( g(Y, X) \mid X = x \bigr), \qquad var\,g(Y, x) := var\bigl( g(Y, X) \mid X = x \bigr). $$
(Unconditional) expectations and covariance matrices with respect to the joint distribution of $Y$ and $X$ will be indicated by a subscript $Y, X$, that is,
$$ E_{Y,X}\,g(Y, X) = \Bigl( \int_{\mathbb{R}^{n(1+k)}} g_j(y, x)\, f_{Y,X}(y, x)\, d\lambda_Y(y)\, d\lambda_X(x) \Bigr)_{j=1,\ldots,m}. $$
Analogously, expectations with respect to the marginal distribution of $X$ will also be indicated by an appropriate subscript. That is, if $h = (h_1, \ldots, h_m)^\top: \mathbb{R}^{n \cdot k} \to \mathbb{R}^m$ is a measurable function, then
$$ E_X\,h(X) = \Bigl( \int_{\mathbb{R}^{n \cdot k}} h_j(x)\, f_X(x)\, d\lambda_X(x) \Bigr)_{j=1,\ldots,m}. $$

Note. To calculate $E_{Y,X}$, the basic relationship
$$ E_{Y,X}\,g(Y, X) = E_X\bigl[ E\,g(Y, X) \bigr] \tag{1.7} $$
known from probability courses can conveniently be used. Furthermore, it trivially follows from (1.7) that if $E\,g(Y, x) = c$ for $\lambda_X$-almost all $x \in \mathbb{R}^{n \cdot k}$, then also $E_{Y,X}\,g(Y, X) = c$.

1.2.11 Limitations of a linear model

“Essentially, all models are wrong, but some are useful. The practical question is how wrong do they have to be to not be useful.”
— George E. P. Box (1919–2013)

The linear model is indeed only one possibility (out of infinitely many) for modelling the dependence of the response on the covariates. The linear model as defined by Definition 1.1 is (possibly seriously) wrong if, for example,
• The expected value $E(Y \mid X = x)$, $x \in \mathcal{X}$, cannot be expressed as a linear function of $x$. ⇒ Incorrect regression function.
• The conditional variance $var(Y \mid X = x)$, $x \in \mathcal{X}$, is not constant. It may depend on $x$ as well as on other factors. ⇒ Heteroscedasticity.
• The response random variables are not conditionally uncorrelated/independent (the error terms are not uncorrelated/independent). This is often the case when the response is measured repeatedly (e.g., over time) on the $n$ subjects included in the study.
Additionally, the linear model deals with modelling only the first two (conditional) moments of the response. In many application areas, other characteristics of the conditional distribution $Y \mid X$ are of (primary) interest.

Chapter 2. Least Squares Estimation

We keep considering a set of $n$ random vectors $(Y_i, X_i)$, $X_i = (X_{i,0}, \ldots, X_{i,k-1})^\top$, $i = 1, \ldots, n$, that satisfy a linear model. That is,
$$ Y \mid X \sim \bigl( X\beta, \; \sigma^2 I_n \bigr), \qquad rank(X_{n \times k}) = r \le k < n, \tag{2.1} $$
where $Y = (Y_1, \ldots, Y_n)^\top$, $X$ is a matrix with vectors $X_1^\top, \ldots, X_n^\top$ in its rows, and $\beta = (\beta_0, \ldots, \beta_{k-1})^\top \in \mathbb{R}^k$ and $\sigma^2 > 0$ are unknown parameters. In this chapter, we introduce the method of least squares[1] to estimate the unknown parameters of the linear model (2.1). All results in this chapter will be derived without imposing any (parametric) distributional assumptions concerning the conditional distribution of the response given the covariates, and without assuming independent observations in the regression context or i.i.d. errors.

[1] metoda nejmenších čtverců

2.1 Regression and residual space, projections

2.1.1 Regression and residual space

Notation (Linear span of columns of the model matrix and its orthogonal complement).
For a given dataset and a linear model, the model matrix $X$ is a real $n \times k$ matrix. Let $x^0, \ldots, x^{k-1} \in \mathbb{R}^n$ denote its columns, i.e., $X = (x^0, \ldots, x^{k-1})$.
• The linear span[2] of the columns of $X$, i.e., the vector space generated by the vectors $x^0, \ldots, x^{k-1}$, will be denoted as $\mathcal{M}(X)$, that is,
$$ \mathcal{M}(X) = \Bigl\{ v: \; v = \sum_{j=0}^{k-1} \beta_j x^j, \; \beta = (\beta_0, \ldots, \beta_{k-1})^\top \in \mathbb{R}^k \Bigr\}. $$
• The orthogonal complement to $\mathcal{M}(X)$ will be denoted as $\mathcal{M}(X)^\perp$, that is,
$$ \mathcal{M}(X)^\perp = \bigl\{ u \in \mathbb{R}^n: \; v^\top u = 0 \text{ for all } v \in \mathcal{M}(X) \bigr\}. $$

Note. We know from linear algebra lectures that the linear span of the columns of $X$, $\mathcal{M}(X)$, is a vector subspace of dimension $r$ of the $n$-dimensional Euclidean space $\mathbb{R}^n$. Similarly, $\mathcal{M}(X)^\perp$ is a vector subspace of dimension $n - r$ of $\mathbb{R}^n$. We have
$$ \mathcal{M}(X) \oplus \mathcal{M}(X)^\perp = \mathbb{R}^n, \qquad \mathcal{M}(X) \cap \mathcal{M}(X)^\perp = \{0_n\}, \qquad v^\top u = 0 \text{ for any } v \in \mathcal{M}(X), \; u \in \mathcal{M}(X)^\perp. $$

Definition 2.1 (Regression and residual space of a linear model). Consider a linear model $Y \mid X \sim (X\beta, \sigma^2 I_n)$, $rank(X) = r$. The regression space[3] of the model is the vector space $\mathcal{M}(X)$. The residual space[4] of the model is the orthogonal complement of the regression space, i.e., the vector space $\mathcal{M}(X)^\perp$.

[2] lineární obal  [3] regresní prostor  [4] reziduální prostor

Notation (Orthonormal vector bases of the regression and residual space). Let $q_1, \ldots, q_r$ be (any) orthonormal vector basis of the regression space $\mathcal{M}(X)$ and let $n_1, \ldots, n_{n-r}$ be (any) orthonormal vector basis of the residual space $\mathcal{M}(X)^\perp$. That is, $q_1, \ldots, q_r, n_1, \ldots, n_{n-r}$ is an orthonormal vector basis of the $n$-dimensional Euclidean space $\mathbb{R}^n$. We will denote
• $Q_{n \times r} = (q_1, \ldots, q_r)$;
• $N_{n \times (n-r)} = (n_1, \ldots, n_{n-r})$;
• $P_{n \times n} = (q_1, \ldots, q_r, n_1, \ldots, n_{n-r}) = (Q, N)$.

Notes. The following properties follow from the linear algebra lectures.
• Properties of the columns of the $Q$ matrix: $q_j^\top q_j = 1$, $j = 1, \ldots, r$; $q_j^\top q_l = 0$, $j, l = 1, \ldots, r$, $j \ne l$.
• Properties of the columns of the $N$ matrix: $n_j^\top n_j = 1$, $j = 1, \ldots, n-r$; $n_j^\top n_l = 0$, $j, l = 1, \ldots, n-r$, $j \ne l$.
• Mutual properties of the columns of the $Q$ and $N$ matrices: $q_j^\top n_l = n_l^\top q_j = 0$, $j = 1, \ldots, r$, $l = 1, \ldots, n-r$.
• The above properties written in matrix form:
$$ Q^\top Q = I_r, \quad N^\top N = I_{n-r}, \quad Q^\top N = 0_{r \times (n-r)}, \quad N^\top Q = 0_{(n-r) \times r}, \quad P^\top P = I_n. \tag{2.2} $$
• It follows from (2.2) that $P^\top$ is the inverse of $P$ and hence
$$ I_n = P P^\top = (Q, N) \begin{pmatrix} Q^\top \\ N^\top \end{pmatrix} = Q Q^\top + N N^\top. $$
• It is also useful to recall that
$$ \mathcal{M}(X) = \mathcal{M}(Q), \qquad \mathcal{M}(X)^\perp = \mathcal{M}(N), \qquad \mathbb{R}^n = \mathcal{M}(P). $$

Notation. In the following, let
$$ H = Q Q^\top, \qquad M = N N^\top. $$

Note. The matrices $H$ and $M$ are symmetric and idempotent:
$$ H^\top = (Q Q^\top)^\top = Q Q^\top = H, \qquad H H = Q Q^\top Q Q^\top = Q\, I_r\, Q^\top = Q Q^\top = H, $$
$$ M^\top = (N N^\top)^\top = N N^\top = M, \qquad M M = N N^\top N N^\top = N\, I_{n-r}\, N^\top = N N^\top = M. $$

2.1.2 Projections

Let $y \in \mathbb{R}^n$. We can then write (using the identity from Expression (2.2))
$$ y = I_n y = (Q Q^\top + N N^\top)\, y = (H + M)\, y = H y + M y. $$
We have
• $\hat{y} := H y = Q Q^\top y \in \mathcal{M}(X)$;
• $u := M y = N N^\top y \in \mathcal{M}(X)^\perp$;
• $\hat{y}^\top u = y^\top Q Q^\top N N^\top y = y^\top Q\; 0_{r \times (n-r)}\; N^\top y = 0$.
That is, we have a decomposition of any $y \in \mathbb{R}^n$ into
$$ y = \hat{y} + u, \qquad \hat{y} \in \mathcal{M}(X), \quad u \in \mathcal{M}(X)^\perp, \quad \hat{y} \perp u. $$
In other words, $\hat{y}$ and $u$ are projections of $y$ into $\mathcal{M}(X)$ and $\mathcal{M}(X)^\perp$, respectively, and $H$ and $M$ are the corresponding projection matrices.

Notes. The following facts follow from the linear algebra lectures.
• The decomposition $y = \hat{y} + u$ is unique.
• The projection matrices $H$, $M$ are unique. That is, $H = Q Q^\top$ does not depend on the choice of the orthonormal vector basis of $\mathcal{M}(X)$ included in the $Q$ matrix, and $M = N N^\top$ does not depend on the choice of the orthonormal vector basis of $\mathcal{M}(X)^\perp$ included in the $N$ matrix.
• The vector $\hat{y} = (\hat{y}_1, \ldots, \hat{y}_n)^\top$ is the closest point (in the Euclidean metric) in the regression space $\mathcal{M}(X)$ to a given vector $y = (y_1, \ldots, y_n)^\top$, that is,
$$ \forall\, \tilde{y} = (\tilde{y}_1, \ldots, \tilde{y}_n)^\top \in \mathcal{M}(X): \qquad \| y - \hat{y} \|^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \le \sum_{i=1}^n (y_i - \tilde{y}_i)^2 = \| y - \tilde{y} \|^2. $$

Definition 2.2 (Hat matrix, residual projection matrix).
Consider a linear model $Y \mid X \sim (X\beta, \sigma^2 I_n)$, where $Q$ and $N$ are the orthonormal bases of the regression and the residual space, respectively.
1. The hat matrix[5] of the model is the matrix $Q Q^\top$, which is denoted as $H$.
2. The residual projection matrix[6] of the model is the matrix $N N^\top$, which is denoted as $M$.

[5] regresní projekční matice, lze však užívat též výrazu „hat matice“  [6] reziduální projekční matice

Lemma 2.1 (Expressions of the projection matrices using the model matrix). The hat matrix $H$ and the residual projection matrix $M$ can be expressed as
$$ H = X (X^\top X)^- X^\top, \qquad M = I_n - X (X^\top X)^- X^\top. $$

Proof.
• By the five matrices rule (Theorem A.2):
$$ X (X^\top X)^- X^\top X = X, \qquad \bigl\{ I_n - X (X^\top X)^- X^\top \bigr\} X = 0_{n \times k}. $$
• Let $\tilde{H} = X (X^\top X)^- X^\top$ and $\tilde{M} = I_n - X (X^\top X)^- X^\top = I_n - \tilde{H}$.
• We have $\tilde{M} X = 0_{n \times k}$, and both $\tilde{H}$ and $\tilde{M}$ are symmetric.
• We now have:
$$ y = I_n y = \tilde{H} y + (I_n - \tilde{H})\, y = \tilde{H} y + \tilde{M} y. $$
• Clearly, $\tilde{H} y = X \bigl\{ (X^\top X)^- X^\top y \bigr\} \in \mathcal{M}(X)$.
• For any $z = X b \in \mathcal{M}(X)$: $z^\top \tilde{M} y = b^\top X^\top \tilde{M} y = b^\top (\tilde{M} X)^\top y = 0$ (using the symmetry of $\tilde{M}$ and $\tilde{M} X = 0_{n \times k}$). Hence $\tilde{M} y \in \mathcal{M}(X)^\perp$.
• The uniqueness of projections and projection matrices then implies
$$ H = \tilde{H} = X (X^\top X)^- X^\top, \qquad M = \tilde{M} = I_n - X (X^\top X)^- X^\top. \qquad \blacksquare $$

Notes.
• The expression $X (X^\top X)^- X^\top$ does not depend on the choice of the pseudoinverse matrix $(X^\top X)^-$.
• If $r = rank(X_{n \times k}) = k$, then
$$ H = X (X^\top X)^{-1} X^\top, \qquad M = I_n - X (X^\top X)^{-1} X^\top. $$

2.2 Fitted values, residuals, Gauss–Markov theorem

Before starting to deal with the estimation of the principal parameters of the linear model, which are the regression coefficients $\beta$, we deal with the estimation of the full (conditional) mean $E(Y) = X\beta$ of the response vector $Y$ and its (conditional) covariance matrix $var(Y) = \sigma^2 I_n$, for which it is sufficient to estimate the residual variance $\sigma^2$.

Notation. We denote $\mu := X\beta = E(Y)$.
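The projection identities of Lemma 2.1 can be checked numerically. A minimal NumPy sketch (illustrative random data), which uses the Moore–Penrose pseudoinverse as one particular choice of $(X^\top X)^-$ and a deliberately rank-deficient model matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
X0 = rng.normal(size=(n, 4))                    # 4 linearly independent columns
X = np.column_stack([X0, X0[:, 0] + X0[:, 1]])  # add a dependent column: rank r = 4 < k = 5

H = X @ np.linalg.pinv(X.T @ X) @ X.T           # H = X (X'X)^- X'
M = np.eye(n) - H                               # residual projection matrix

# projection-matrix properties from Section 2.1
assert np.allclose(H, H.T) and np.allclose(H @ H, H)   # symmetric, idempotent
assert np.allclose(H @ X, X)                            # H X = X
assert np.allclose(M @ X, 0.0)                          # M X = 0

# H agrees with Q Q' for an orthonormal basis Q of M(X);
# the first 4 columns already span M(X), so their QR factor gives such a Q
Q, _ = np.linalg.qr(X[:, :4])
assert np.allclose(H, Q @ Q.T)
```

That the `pinv`-based $H$ coincides with $Q Q^\top$ illustrates the invariance of $X (X^\top X)^- X^\top$ to the choice of pseudoinverse and of orthonormal basis.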
By saying that we are now interested in the estimation of the full (conditional) expectation $E(Y)$, we mean that we want to estimate the parameter vector $\mu$ on its own, without the necessity of knowing its decomposition into $X\beta$.

Definition 2.3 (Fitted values, residuals, residual sum of squares). Consider a linear model $Y \mid X \sim (X\beta, \sigma^2 I_n)$.
1. The fitted values[7], or the vector of fitted values, of the model is the vector $H Y$, which will be denoted as $\hat{Y}$. That is, $\hat{Y} = (\hat{Y}_1, \ldots, \hat{Y}_n)^\top = H Y$.
2. The residuals[8], or the vector of residuals, of the model is the vector $M Y$, which will be denoted as $U$. That is, $U = (U_1, \ldots, U_n)^\top = M Y = Y - \hat{Y}$.
3. The residual sum of squares[9] of the model is the quantity $\| U \|^2$, which will be denoted as $SS_e$. That is,
$$ SS_e = \| U \|^2 = U^\top U = \sum_{i=1}^n U_i^2 = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = (Y - \hat{Y})^\top (Y - \hat{Y}) = \| Y - \hat{Y} \|^2. $$

[7] vyrovnané hodnoty  [8] rezidua  [9] reziduální součet čtverců

Notes.
• The fitted values $\hat{Y}$ and the residuals $U$ are projections of the response vector $Y$ into the regression space $\mathcal{M}(X)$ and the residual space $\mathcal{M}(X)^\perp$, respectively.
• Using the quantities and expressions introduced in Section 2.1, we can write
$$ \hat{Y} = H Y = Q Q^\top Y = X (X^\top X)^- X^\top Y, \qquad U = M Y = N N^\top Y = \bigl\{ I_n - X (X^\top X)^- X^\top \bigr\} Y = Y - \hat{Y}. $$
• It follows from the projection properties that the vector $\hat{Y} = (\hat{Y}_1, \ldots, \hat{Y}_n)^\top$ is the nearest point of the regression space $\mathcal{M}(X)$ to the response vector $Y = (Y_1, \ldots, Y_n)^\top$, that is,
$$ \forall\, \tilde{Y} = (\tilde{Y}_1, \ldots, \tilde{Y}_n)^\top \in \mathcal{M}(X): \qquad \| Y - \hat{Y} \|^2 = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 \le \sum_{i=1}^n (Y_i - \tilde{Y}_i)^2 = \| Y - \tilde{Y} \|^2. \tag{2.3} $$
• The Gauss–Markov theorem introduced below shows that $\hat{Y}$ is a suitable estimator of $\mu = X\beta$. Owing to (2.3), it is also called the least squares estimator (LSE).[10] The method of estimation is then called the method of least squares[11], or the method of ordinary least squares (OLS).

Theorem 2.2 (Gauss–Markov). Assume a linear model $Y \mid X \sim (X\beta, \sigma^2 I_n)$.
Then the vector of fitted values $\hat{Y}$ is the best linear unbiased estimator (BLUE)[12] of the vector parameter $\mu = E(Y)$. Further,
$$ var(\hat{Y}) = \sigma^2 H = \sigma^2 X (X^\top X)^- X^\top. $$

[10] odhad metodou nejmenších čtverců  [11] metoda nejmenších čtverců (MNČ)  [12] nejlepší lineární nestranný odhad

Proof. First, recall our notational convention that $E(\cdot) = E(\cdot \mid X)$ and $var(\cdot) = var(\cdot \mid X)$.

Linearity means that $\hat{Y}$ is a linear function of the response vector $Y$, which is clear from the expression $\hat{Y} = H Y$.

Unbiasedness. Let us calculate $E(\hat{Y})$:
$$ E(\hat{Y}) = E(H Y) = H\, E(Y) = H X \beta = X \beta = \mu. $$
The second-to-last equality holds because $H X$ is the projection of each column of $X$ into $\mathcal{M}(X)$, which is generated by those columns. That is, $H X = X$.

Optimality. Let $\tilde{Y} = a + B Y$ be some other linear unbiased estimator of $\mu = X\beta$.
• That is,
$$ \forall\, \beta \in \mathbb{R}^k: \quad E(\tilde{Y}) = X\beta, \qquad \text{i.e.,} \quad a + B\, E(Y) = X\beta, \qquad \text{i.e.,} \quad a + B X \beta = X \beta. $$
Using this equality with $\beta = 0_k$, it follows that $a = 0_n$.
• From unbiasedness we then have $B X \beta = X \beta$ for all $\beta \in \mathbb{R}^k$. Taking $\beta = (0, \ldots, 1, \ldots, 0)^\top$ while changing the position of the one, it follows that $B X = X$.
• We now have: $\tilde{Y} = a + B Y$ is an unbiased estimator of $\mu$ $\Longrightarrow$ $a = 0_n$ and $B X = X$. Trivially (though we will not need it here), the opposite implication also holds (if $\tilde{Y} = B Y$ with $B X = X$, then $\tilde{Y}$ is an unbiased estimator of $\mu = X\beta$). In other words,
$$ \tilde{Y} = a + B Y \text{ is an unbiased estimator of } \mu \iff a = 0_n \;\&\; B X = X. $$
• Let us now explore what can be concluded from the equality $B X = X$. Multiplying it from the right by $(X^\top X)^- X^\top$:
$$ B X (X^\top X)^- X^\top = X (X^\top X)^- X^\top, \qquad \text{i.e.,} \qquad B H = H, \tag{2.4} $$
and by transposition,
$$ H^\top B^\top = H^\top, \qquad \text{i.e.,} \qquad H B^\top = H. \tag{2.5} $$
• Let us calculate $var(\hat{Y})$:
$$ var(\hat{Y}) = var(H Y) = H\, var(Y)\, H^\top = H\, (\sigma^2 I_n)\, H^\top = \sigma^2 H H^\top = \sigma^2 H = \sigma^2 X (X^\top X)^- X^\top. $$
• Analogously, we calculate $var(\tilde{Y})$ for $\tilde{Y} = B Y$, where $B X = X$:
$$ var(\tilde{Y}) = var(B Y) = B\, var(Y)\, B^\top = B\, (\sigma^2 I_n)\, B^\top = \sigma^2 B B^\top = \sigma^2 (H + B - H)(H + B - H)^\top $$
$$ = \sigma^2 \bigl\{ \underbrace{H H^\top}_{H} + \underbrace{H (B - H)^\top}_{0_{n \times n}} + \underbrace{(B - H) H^\top}_{0_{n \times n}} + (B - H)(B - H)^\top \bigr\} = \sigma^2 H + \sigma^2 (B - H)(B - H)^\top, $$
where $H (B - H)^\top = (B - H) H^\top = 0_{n \times n}$ follows from (2.4) and (2.5) and from the fact that $H$ is symmetric and idempotent.
• Hence, finally,
$$ var(\tilde{Y}) - var(\hat{Y}) = \sigma^2 (B - H)(B - H)^\top, $$
which is a positive semidefinite matrix. That is, the estimator $\hat{Y}$ is not worse than the estimator $\tilde{Y}$. $\blacksquare$

Note. It follows from the Gauss–Markov theorem that $\hat{Y} \mid X \sim (X\beta, \sigma^2 H)$.

Historical remarks
• The method of least squares was used in astronomy and geodesy already at the beginning of the 19th century.
• 1805: First documented publication of least squares, by Adrien-Marie Legendre, in the appendix “Sur la méthode des moindres quarrés” (“On the method of least squares”) of the book Nouvelles Méthodes pour la Détermination des Orbites des Comètes (New Methods for the Determination of the Orbits of the Comets).
• 1809: Another (supposedly independent) publication of least squares, by Carl Friedrich Gauss, in Volume 2 of the book Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium (The Theory of the Motion of Heavenly Bodies Moving Around the Sun in Conic Sections).
• C. F. Gauss claimed he had been using the method of least squares since 1795 (which is probably true).
• The Gauss–Markov theorem was first proved by C. F. Gauss in 1821–1823.
• In 1912, A. A. Markov provided another version of the proof.
• In 1934, J. Neyman described Markov’s proof as “elegant” and stated that Markov’s contribution (written in Russian) had been overlooked in the West. ⇒ The name Gauss–Markov theorem.

Theorem 2.3 (Basic properties of the residuals and the residual sum of squares). Let $Y = X\beta + \varepsilon$, $\varepsilon \sim (0_n, \sigma^2 I_n)$, where $rank(X_{n \times k}) = r \le k < n$.
The following then holds:
(i) $U = M\varepsilon$.
(ii) $SS_e = Y^\top M Y = \varepsilon^\top M \varepsilon$.
(iii) $E(U) = 0_n$, $var(U) = \sigma^2 M$.
(iv) $E(SS_e) = E_{Y,X}(SS_e) = (n - r)\, \sigma^2$.

Notes.
• Point (i) of Theorem 2.3 says that the residuals can be obtained not only by projecting the response vector $Y$ into $\mathcal{M}(X)^\perp$, but also by projecting the vector of error terms of the linear model into $\mathcal{M}(X)^\perp$.
• Point (iii) of Theorem 2.3 can also be briefly written as $U \mid X \sim (0_n, \sigma^2 M)$.

Proof.
(i) $U = M Y = M (X\beta + \varepsilon) = \underbrace{M X}_{0_{n \times k}} \beta + M \varepsilon = M \varepsilon$.
(ii) $SS_e = U^\top U = (M Y)^\top M Y = Y^\top \underbrace{M^\top M}_{M} Y = Y^\top M Y \overset{(i)}{=} \varepsilon^\top M^\top M\, \varepsilon = \varepsilon^\top M\, \varepsilon$.
(iii) $E(U) = E(M Y) = \underbrace{M X}_{0_{n \times k}} \beta = 0_n$; $\quad var(U) = var(M Y) = M\, var(Y)\, M^\top = M\, \sigma^2 I_n\, M^\top = \sigma^2 M M^\top = \sigma^2 M$.
(iv)
$$ E(SS_e) = E\bigl( \varepsilon^\top M \varepsilon \bigr) = E\bigl\{ tr(\varepsilon^\top M \varepsilon) \bigr\} = E\bigl\{ tr(M \varepsilon \varepsilon^\top) \bigr\} = tr\bigl\{ M\, \underbrace{E(\varepsilon \varepsilon^\top)}_{var(\varepsilon)} \bigr\} = tr\bigl( M\, \sigma^2 I_n \bigr) = \sigma^2\, tr(M) $$
$$ = \sigma^2\, tr(N N^\top) = \sigma^2\, tr(N^\top N) = \sigma^2\, tr(I_{n-r}) = \sigma^2 (n - r), $$
where (as a reminder) $N$ denotes an $n \times (n-r)$ matrix whose columns form an orthonormal vector basis of the residual space $\mathcal{M}(X)^\perp$. Further,
$$ E_{Y,X}(SS_e) = \int_{\mathbb{R}^{n \cdot k}} E(SS_e)\, f_X(x)\, d\lambda_X(x) = \int_{\mathbb{R}^{n \cdot k}} \sigma^2 (n - r)\, f_X(x)\, d\lambda_X(x) = \sigma^2 (n - r). \qquad \blacksquare $$

Definition 2.4 (Residual mean square and residual degrees of freedom). Consider a linear model $Y \mid X \sim (X\beta, \sigma^2 I_n)$, $rank(X) = r$.
1. The residual mean square[13] of the model is the quantity $SS_e / (n - r)$ and will be denoted as $MS_e$. That is,
$$ MS_e = \frac{SS_e}{n - r}. $$
2. The residual degrees of freedom[14] of the model is the dimension of the residual space and will be denoted as $\nu_e$. That is, $\nu_e = n - r$.

[13] reziduální střední čtverec  [14] reziduální stupně volnosti

Theorem 2.4 (Unbiased estimator of the residual variance). The residual mean square $MS_e$ is an unbiased estimator (both conditionally given $X$ and also with respect to the joint distribution of $Y$ and $X$) of the residual variance $\sigma^2$ in a linear model
$$ Y \mid X \sim (X\beta, \sigma^2 I_n), \qquad rank(X_{n \times k}) = r \le k < n. $$

Proof. Direct consequence of Theorem 2.3, point (iv). $\blacksquare$

[End of Lecture #2 (08/10/2015)]
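Theorem 2.3(iv) and Theorem 2.4 can be illustrated by simulation: averaging $MS_e$ over many simulated responses should approach $\sigma^2$. A small Monte Carlo sketch in NumPy (all design choices and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(407)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # full rank: r = k = 3
beta = np.array([2.0, -1.0, 0.5])
sigma2 = 4.0
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix (full-rank formula)
M = np.eye(n) - H                      # residual projection matrix

mse = []
for _ in range(5000):
    y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)
    u = M @ y                          # residuals U = M Y
    mse.append(u @ u / (n - k))        # MS_e = SS_e / (n - r)

print(np.mean(mse))                    # close to sigma2 = 4.0
```

The average of the simulated $MS_e$ values settles near $\sigma^2 = 4$, as the unbiasedness result predicts, even though no normality was assumed in Theorem 2.4 (the normal errors here are merely a convenient simulation choice).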
2.3 Normal equations

[Start of Lecture #3 (08/10/2015)]

The vector of fitted values $\hat{Y} = H Y$ is a projection of the response vector into $\mathcal{M}(X)$. Hence, it must be possible to write $\hat{Y}$ as a linear combination of the columns of the model matrix $X$. That is, there exists $b \in \mathbb{R}^k$ such that
$$ \hat{Y} = X b. \tag{2.6} $$

Notes.
• In a full-rank model ($rank(X_{n \times k}) = k$), the linearly independent columns of $X$ form a vector basis of $\mathcal{M}(X)$. Hence, the $b \in \mathbb{R}^k$ such that $\hat{Y} = X b$ is unique.
• If $rank(X_{n \times k}) = r < k$, a vector $b \in \mathbb{R}^k$ such that $\hat{Y} = X b$ is not unique.

We already know from the Gauss–Markov theorem (Theorem 2.2) that $E(\hat{Y}) = X\beta$. Hence, if we manage to express $\hat{Y}$ as $\hat{Y} = X b$ with a unique $b$, we have a natural candidate for an estimator of the regression coefficients $\beta$. Nevertheless, before we proceed to the estimation of $\beta$, we derive the conditions that $b \in \mathbb{R}^k$ must satisfy in order to fulfill (2.6).

Definition 2.5 (Sum of squares). Consider a linear model $Y \mid X \sim (X\beta, \sigma^2 I_n)$. The function $SS: \mathbb{R}^k \to \mathbb{R}$ given as
$$ SS(\beta) = \| Y - X\beta \|^2 = (Y - X\beta)^\top (Y - X\beta), \qquad \beta \in \mathbb{R}^k, $$
will be called the sum of squares[15] of the model.

[15] součet čtverců

Theorem 2.5 (Least squares and normal equations). Assume a linear model $Y \mid X \sim (X\beta, \sigma^2 I_n)$. The vector of fitted values $\hat{Y}$ equals $X b$, $b \in \mathbb{R}^k$, if and only if $b$ solves the linear system
$$ X^\top X b = X^\top Y. \tag{2.7} $$

Proof.
$$ \hat{Y} = X b \text{ is the projection of } Y \text{ into } \mathcal{M}(X) \iff \hat{Y} = X b \text{ is the closest point to } Y \text{ in } \mathcal{M}(X) $$
$$ \iff \hat{Y} = X b, \text{ where } b \text{ minimizes } SS(\beta) = \| Y - X\beta \|^2 \text{ over } \beta \in \mathbb{R}^k. $$
Let us find the conditions under which $SS(\beta)$ attains its minimal value over $\beta \in \mathbb{R}^k$. To this end, the vector of first derivatives (the gradient) and the matrix of second derivatives (the Hessian) of $SS(\beta)$ are needed:
$$ SS(\beta) = \| Y - X\beta \|^2 = (Y - X\beta)^\top (Y - X\beta) = Y^\top Y - 2\, Y^\top X \beta + \beta^\top X^\top X \beta, $$
$$ \frac{\partial SS}{\partial \beta}(\beta) = -2\, X^\top Y + 2\, X^\top X \beta, \qquad \frac{\partial^2 SS}{\partial \beta\, \partial \beta^\top}(\beta) = 2\, X^\top X. \tag{2.8} $$
For any $\beta \in \mathbb{R}^k$, the Hessian (2.8) is a positive semidefinite matrix, and hence $b$ minimizes $SS(\beta)$ over $\beta \in \mathbb{R}^k$ if and only if
$$ \frac{\partial SS}{\partial \beta}(b) = 0_k, $$
that is, if and only if $X^\top X b = X^\top Y$. $\blacksquare$

Definition 2.6 (Normal equations). Consider a linear model $Y \mid X \sim (X\beta, \sigma^2 I_n)$. The system of normal equations[16], or concisely the normal equations[17], of the model is the linear system
$$ X^\top X b = X^\top Y, $$
or equivalently, the linear system $X^\top (Y - X b) = 0_k$.

[16] systém normálních rovnic  [17] normální rovnice

Note. In general, a linear system of the form (2.7) would not have to possess a solution. Nevertheless, in our case, the existence of a solution follows from the fact that it corresponds to the projection $\hat{Y}$ of $Y$ into the regression space $\mathcal{M}(X)$, and the existence of the projection $\hat{Y}$ is guaranteed by the projection properties known from the linear algebra lectures. On the other hand, we can also show quite easily that there exists a solution to the normal equations by using the following lemma.

Lemma 2.6 (Vector spaces generated by the rows of the model matrix). Let $X_{n \times k}$ be a real matrix. Then
$$ \mathcal{M}(X^\top X) = \mathcal{M}(X^\top). $$

Note. The vector space $\mathcal{M}(X^\top)$ is the vector space generated by the columns of the matrix $X^\top$, that is, the vector space generated by the rows of the matrix $X$.

Proof. First note that $\mathcal{M}(X^\top X) = \mathcal{M}(X^\top)$ is equivalent to $\mathcal{M}(X^\top X)^\perp = \mathcal{M}(X^\top)^\perp$. We show this by showing that for any $a \in \mathbb{R}^k$, $a \in \mathcal{M}(X^\top)^\perp$ if and only if $a \in \mathcal{M}(X^\top X)^\perp$.
(i) $a \in \mathcal{M}(X^\top)^\perp \;\Rightarrow\; a^\top X^\top = 0_n^\top \;\Rightarrow\; a^\top X^\top X = 0_k^\top \;\iff\; a \in \mathcal{M}(X^\top X)^\perp$.
(ii) $a \in \mathcal{M}(X^\top X)^\perp \;\Rightarrow\; a^\top X^\top X = 0_k^\top \;\Rightarrow\; a^\top X^\top X a = 0 \;\Rightarrow\; \| X a \| = 0 \;\iff\; X a = 0_n \;\iff\; a^\top X^\top = 0_n^\top \;\iff\; a \in \mathcal{M}(X^\top)^\perp$. $\blacksquare$

Note. The existence of a solution to the normal equations (2.7) follows from the fact that the right-hand side satisfies $X^\top Y \in \mathcal{M}(X^\top)$, and $\mathcal{M}(X^\top)$ is (by Lemma 2.6) the same space as the vector space generated by the columns of the matrix $X^\top X$ of the linear system.
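Non-uniqueness of the solutions of the normal equations, together with uniqueness of the fitted values, can be seen in a short NumPy sketch. It uses an overparameterized two-sample design (intercept plus two group indicators, so rank $r = 2 < k = 3$; the data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n1, n2 = 4, 3
g1 = np.r_[np.ones(n1), np.zeros(n2)]           # sample-1 indicator
X = np.column_stack([np.ones(n1 + n2), g1, 1.0 - g1])
y = rng.normal(size=n1 + n2)

b1 = np.linalg.lstsq(X, y, rcond=None)[0]       # one solution of X'X b = X'y
b2 = b1 + np.array([1.0, -1.0, -1.0])           # another: (1, -1, -1) lies in the null space of X

assert np.allclose(X.T @ X @ b1, X.T @ y)       # both satisfy the normal equations
assert np.allclose(X.T @ X @ b2, X.T @ y)
assert np.allclose(X @ b1, X @ b2)              # ... yet the fitted values X b are unique
```

The vector $(1, -1, -1)^\top$ is annihilated by $X$ because the intercept column equals the sum of the two indicator columns, so adding any multiple of it to a solution produces another solution with the same projection $\hat{Y}$.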
Notes.
• By Theorem A.1, all solutions to the normal equations, i.e., the set of points that minimize the sum of squares $SS(\beta)$, are given as $b = (X^\top X)^- X^\top Y$, where $(X^\top X)^-$ is any pseudoinverse of $X^\top X$ (if $rank(X_{n \times k}) = r < k$, this pseudoinverse is not unique).
• We also have that for any $b = (X^\top X)^- X^\top Y$: $SS_e = SS(b)$.

Notation.
• In the following, the symbol $b$ will be used exclusively to denote any solution to the normal equations, that is,
$$ b = (b_0, \ldots, b_{k-1})^\top = (X^\top X)^- X^\top Y. $$
• For a full-rank linear model ($rank(X_{n \times k}) = k$), the following holds:
  • The only pseudoinverse of $X^\top X$ is $(X^\top X)^- = (X^\top X)^{-1}$.
  • The only solution of the normal equations is $b = (X^\top X)^{-1} X^\top Y$, which is also the unique minimizer of the sum of squares $SS(\beta)$.
  In this case, we will denote the unique solution to the normal equations as $\hat{\beta}$. That is,
$$ \hat{\beta} = (\hat{\beta}_0, \ldots, \hat{\beta}_{k-1})^\top = (X^\top X)^{-1} X^\top Y. $$

2.4 Estimable parameters

We have seen in the previous section that the sum of squares $SS(\beta)$ does not necessarily attain a unique minimum. This happens if the model matrix $X_{n \times k}$ has linearly dependent columns (its rank $r < k$), and hence there exist (infinitely) many possibilities for expressing the vector of fitted values $\hat{Y} \in \mathcal{M}(X)$ as a linear combination of the columns of the model matrix $X$. In other words, there exist (infinitely) many vectors $b \in \mathbb{R}^k$ such that $\hat{Y} = X b$. This can also be interpreted as saying that there are (infinitely) many estimators of the regression parameters $\beta$ leading to the (unique) unbiased estimator of the response mean $\mu = E(Y) = X\beta$. It then does not make much sense to talk about estimation of the regression parameters $\beta$. To avoid such situations, we now define the notion of an estimable parameter[18] of a linear model.

Definition 2.7 (Estimable parameter). Consider a linear model $Y \mid X \sim (X\beta, \sigma^2 I_n)$. Let $l \in \mathbb{R}^k$. We say that the parameter $\theta = l^\top \beta$ is an estimable parameter of the model if for all $\mu \in \mathcal{M}(X)$ the expression $l^\top \beta$ does not depend on the choice of a solution to the linear system $X\beta = \mu$.

Notes.
• The definition of an estimable parameter is equivalent to the requirement
$$ \forall\, \beta_1, \beta_2 \in \mathbb{R}^k: \qquad X\beta_1 = X\beta_2 \;\Rightarrow\; l^\top \beta_1 = l^\top \beta_2. $$
That is, an estimable parameter is such a linear combination of the regression coefficients $\beta$ that does not depend on the choice of the $\beta$ leading to the same vector in the regression space $\mathcal{M}(X)$ (leading to the same vector of the response expectation $\mu$).
• In a full-rank model ($rank(X_{n \times k}) = k$), the columns of the model matrix $X$ form a vector basis of the regression space $\mathcal{M}(X)$. It then follows from the properties of a vector basis that for any $\mu \in \mathcal{M}(X)$ there exists a unique $\beta$ such that $X\beta = \mu$. Trivially, for any $l \in \mathbb{R}^k$, the expression $l^\top \beta$ then does not depend on the choice of a solution to the linear system $X\beta = \mu$, since there is only one such solution. In other words, in a full-rank model, any linear function of the regression coefficients $\beta$ is estimable.

[18] odhadnutelný parametr

Definition 2.8 (Estimable vector parameter). Consider a linear model $Y \mid X \sim (X\beta, \sigma^2 I_n)$. Let $l_1, \ldots, l_m \in \mathbb{R}^k$ and let $L$ be the $m \times k$ matrix having the vectors $l_1^\top, \ldots, l_m^\top$ in its rows. We say that the vector parameter $\theta = L\beta$ is an estimable vector parameter of the model if all the parameters $\theta_j = l_j^\top \beta$, $j = 1, \ldots, m$, are estimable.

Notes.
• The definition of an estimable vector parameter is equivalent to the requirement
$$ \forall\, \beta_1, \beta_2 \in \mathbb{R}^k: \qquad X\beta_1 = X\beta_2 \;\Rightarrow\; L\beta_1 = L\beta_2. $$
• Trivially, the vector parameter $\mu = E(Y) = X\beta$ is always estimable. We also already know its BLUE, which is the vector of fitted values $\hat{Y}$.
• In a full-rank model ($rank(X_{n \times k}) = k$), the regression coefficients vector $\beta$ is an estimable vector parameter.

Example 2.1 (Overparameterized two-sample problem). Consider a two-sample problem where $Y_1, \ldots, Y_{n_1}$ are assumed to be identically distributed random variables with $E(Y_1) = \cdots = E(Y_{n_1}) = \mu_1$ and $var(Y_1) = \cdots = var(Y_{n_1}) = \sigma^2$ (sample 1); $Y_{n_1+1}, \ldots, Y_{n_1+n_2}$ are also assumed to be identically distributed random variables with $E(Y_{n_1+1}) = \cdots = E(Y_{n_1+n_2}) = \mu_2$ and $var(Y_{n_1+1}) = \cdots = var(Y_{n_1+n_2}) = \sigma^2$ (sample 2); and $Y_1, \ldots, Y_{n_1}, Y_{n_1+1}, \ldots, Y_{n_1+n_2}$ are assumed to be independent. This situation can be described by a linear model $Y \sim (X\beta, \sigma^2 I_n)$, $n = n_1 + n_2$, where
$$ Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_{n_1} \\ Y_{n_1+1} \\ \vdots \\ Y_{n_1+n_2} \end{pmatrix}, \qquad X = \begin{pmatrix} 1 & 1 & 0 \\ \vdots & \vdots & \vdots \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ \vdots & \vdots & \vdots \\ 1 & 0 & 1 \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix}, \qquad \mu = X\beta = \begin{pmatrix} \beta_0 + \beta_1 \\ \vdots \\ \beta_0 + \beta_1 \\ \beta_0 + \beta_2 \\ \vdots \\ \beta_0 + \beta_2 \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_1 \\ \mu_2 \\ \vdots \\ \mu_2 \end{pmatrix}. $$
(i) The parameters $\mu_1 = \beta_0 + \beta_1$ and $\mu_2 = \beta_0 + \beta_2$ are (trivially) estimable.
(ii) None of the elements of the vector $\beta$ is estimable. For example, take $\beta_1 = (0, 1, 0)^\top$ and $\beta_2 = (1, 0, -1)^\top$. We have $X\beta_1 = X\beta_2 = (1, \ldots, 1, 0, \ldots, 0)^\top$, but none of the elements of $\beta_1$ and $\beta_2$ is equal. This corresponds to the fact that the two means $\mu_1$ and $\mu_2$ can be expressed in infinitely many ways using three numbers $\beta_0$, $\beta_1$, $\beta_2$ as $\mu_1 = \beta_0 + \beta_1$ and $\mu_2 = \beta_0 + \beta_2$.
(iii) A non-trivial estimable parameter is, e.g., $\theta = \mu_2 - \mu_1 = \beta_2 - \beta_1 = l^\top \beta$, $l = (0, -1, 1)^\top$. We have for $\beta_1 = (\beta_{1,0}, \beta_{1,1}, \beta_{1,2})^\top \in \mathbb{R}^3$ and $\beta_2 = (\beta_{2,0}, \beta_{2,1}, \beta_{2,2})^\top \in \mathbb{R}^3$:
$$ X\beta_1 = X\beta_2 \iff \beta_{1,0} + \beta_{1,1} = \beta_{2,0} + \beta_{2,1} \;\&\; \beta_{1,0} + \beta_{1,2} = \beta_{2,0} + \beta_{2,2} $$
$$ \Rightarrow \; \beta_{1,2} - \beta_{1,1} = \beta_{2,2} - \beta_{2,1} \iff l^\top \beta_1 = l^\top \beta_2. $$

Definition 2.9 (Contrast). Consider a linear model $Y \mid X \sim (X\beta, \sigma^2 I_n)$. An estimable parameter $\theta = c^\top \beta$, given by a real vector $c = (c_0, \ldots, c_{k-1})^\top$ which satisfies $c^\top 1_k = 0$, i.e.,
$$ \sum_{j=0}^{k-1} c_j = 0, $$
is called a contrast[19].

[19] kontrast

Definition 2.10 (Orthogonal contrasts). Consider a linear model $Y \mid X \sim (X\beta, \sigma^2 I_n)$. Contrasts $\theta = c^\top \beta$ and $\eta = d^\top \beta$ given by orthogonal vectors $c = (c_0, \ldots, c_{k-1})^\top$ and $d = (d_0, \ldots, d_{k-1})^\top$, i.e., by vectors $c$ and $d$ that satisfy $c^\top d = 0$, are called (mutually) orthogonal contrasts.

Theorem 2.7 (Estimable parameter, necessary and sufficient condition). Assume a linear model $Y \mid X \sim (X\beta, \sigma^2 I_n)$.
(i) Let $l \in \mathbb{R}^k$.
The parameter $\theta = l^\top \beta$ is an estimable parameter if and only if $l \in \mathcal{M}(X^\top)$.
(ii) The vector $\theta = L\beta$ is an estimable vector parameter if and only if $\mathcal{M}(L^\top) \subset \mathcal{M}(X^\top)$.

Proof.
(i)
$$ \theta = l^\top \beta \text{ is estimable} \iff \forall\, \beta_1, \beta_2 \in \mathbb{R}^k: \; X\beta_1 = X\beta_2 \Rightarrow l^\top \beta_1 = l^\top \beta_2 $$
$$ \iff \forall\, \beta_1, \beta_2 \in \mathbb{R}^k: \; X(\beta_1 - \beta_2) = 0_n \Rightarrow l^\top (\beta_1 - \beta_2) = 0 $$
$$ \iff \forall\, \gamma \in \mathbb{R}^k: \; X\gamma = 0_n \Rightarrow l^\top \gamma = 0 $$
$$ \iff \forall\, \gamma \in \mathbb{R}^k: \; \gamma \text{ orthogonal to all rows of } X \Rightarrow l^\top \gamma = 0 $$
$$ \iff \forall\, \gamma \in \mathbb{R}^k: \; \gamma \in \mathcal{M}(X^\top)^\perp \Rightarrow l^\top \gamma = 0 $$
$$ \iff l \in \mathcal{M}(X^\top). $$
(ii) Direct consequence of point (i). $\blacksquare$

Note. In a full-rank model ($rank(X_{n \times k}) = k < n$), $\mathcal{M}(X^\top) = \mathbb{R}^k$. That is, any linear function of $\beta$ is indeed estimable (a statement that we already concluded from the definition of an estimable parameter).

Theorem 2.8 (Gauss–Markov for estimable parameters). Let $\theta = l^\top \beta$ be an estimable parameter of a linear model $Y \mid X \sim (X\beta, \sigma^2 I_n)$. Let $b$ be any solution to the normal equations. The statistic $\hat{\theta} = l^\top b$ then satisfies:
(i) $\hat{\theta}$ does not depend on the choice of the solution $b$ of the normal equations, i.e., it does not depend on the choice of a pseudoinverse in $b = (X^\top X)^- X^\top Y$.
(ii) $\hat{\theta}$ is the best linear unbiased estimator (BLUE) of the parameter $\theta$.
(iii) $var(\hat{\theta}) = \sigma^2\, l^\top (X^\top X)^- l$, that is,
$$ \hat{\theta} \mid X \sim \bigl( \theta, \; \sigma^2\, l^\top (X^\top X)^- l \bigr), $$
where $l^\top (X^\top X)^- l$ does not depend on the choice of the pseudoinverse $(X^\top X)^-$. If additionally $l \ne 0_k$, then $l^\top (X^\top X)^- l > 0$.

Let further $\theta_1 = l_1^\top \beta$ and $\theta_2 = l_2^\top \beta$ be estimable parameters, and let $\hat{\theta}_1 = l_1^\top b$, $\hat{\theta}_2 = l_2^\top b$. Then
$$ cov\bigl( \hat{\theta}_1, \hat{\theta}_2 \bigr) = \sigma^2\, l_1^\top (X^\top X)^- l_2, $$
where $l_1^\top (X^\top X)^- l_2$ does not depend on the choice of the pseudoinverse $(X^\top X)^-$.

Proof.
(i) Let $b_1$, $b_2$ be two solutions to the normal equations, that is, $X^\top Y = X^\top X b_1 = X^\top X b_2$. By Theorem 2.5 (Least squares and normal equations), $\hat{Y} = X b_1$ and $\hat{Y} = X b_2$, that is, $X b_1 = X b_2$. Estimability of $\theta$ then gives $l^\top b_1 = l^\top b_2$.
(ii) The parameter $\theta = l^\top \beta$ is estimable. By Theorem 2.7, $l \in \mathcal{M}(X^\top)$, i.e., $l = X^\top a$ for some $a \in \mathbb{R}^n$. Hence $\hat{\theta} = l^\top b = a^\top X b = a^\top \hat{Y}$. That is, $\hat{\theta}$ is a linear function of $\hat{Y}$, which is the BLUE of $\mu = X\beta$.
It then follows that $\hat{\theta}$ is the BLUE of the parameter $a^\top \mu = a^\top X \beta = l^\top \beta = \theta$.
(iii) Proof/calculations were available on the blackboard in K1.
The last part (“Let further $\theta_1 = l_1^\top \beta$ and $\theta_2 = l_2^\top \beta$ be . . . ”): Proof/calculations were available on the blackboard in K1. $\blacksquare$

Theorem 2.9 (Gauss–Markov for estimable vector parameter). Let $\theta = L\beta$ be an estimable vector parameter of a linear model $Y \mid X \sim (X\beta, \sigma^2 I_n)$. Let $b$ be any solution to the normal equations. The statistic $\hat{\theta} = L b$ then satisfies:
(i) $\hat{\theta}$ does not depend on the choice of the solution $b$ of the normal equations.
(ii) $\hat{\theta}$ is the best linear unbiased estimator (BLUE) of the vector parameter $\theta$.
(iii) $var(\hat{\theta}) = \sigma^2\, L (X^\top X)^- L^\top$, that is,
$$ \hat{\theta} \mid X \sim \bigl( \theta, \; \sigma^2\, L (X^\top X)^- L^\top \bigr), $$
where $L (X^\top X)^- L^\top$ does not depend on the choice of the pseudoinverse $(X^\top X)^-$. If additionally $m \le r$ and the rows of the matrix $L$ are linearly independent, then $L (X^\top X)^- L^\top$ is a positive definite (invertible) matrix.

Proof. Direct consequence of Theorem 2.8, except for the positive definiteness of $L (X^\top X)^- L^\top$ in situations when $L$ has linearly independent rows. Positive definiteness of $L (X^\top X)^- L^\top$ if $L_{m \times k}$ has linearly independent rows: Proof/calculations were available on the blackboard in K1. $\blacksquare$

Consequence of Theorem 2.9. Assume a full-rank linear model $Y \mid X \sim (X\beta, \sigma^2 I_n)$, $rank(X_{n \times k}) = k < n$. The statistic
$$ \hat{\beta} = (X^\top X)^{-1} X^\top Y $$
then satisfies:
(i) $\hat{\beta}$ is the best linear unbiased estimator (BLUE) of the regression coefficients $\beta$.
(ii) $var(\hat{\beta}) = \sigma^2 (X^\top X)^{-1}$, that is,
$$ \hat{\beta} \mid X \sim \bigl( \beta, \; \sigma^2 (X^\top X)^{-1} \bigr). $$

Proof. Use $L = I_k$ in Theorem 2.9. $\blacksquare$

2.5 Parameterizations of a linear model

2.5.1 Equivalent linear models

Definition 2.11 (Equivalent linear models). Assume two linear models:
M1: $Y \mid X_1 \sim (X_1 \beta, \sigma^2 I_n)$, where $X_1$ is an $n \times k$ matrix with $rank(X_1) = r$, and
M2: $Y \mid X_2 \sim (X_2 \gamma, \sigma^2 I_n)$, where $X_2$ is an $n \times l$ matrix with $rank(X_2) = r$.
We say that the models M1 and M2 are equivalent if their regression spaces are the same, that is, if M(X1) = M(X2).

Notes.

• The two equivalent models
  • have the same hat matrix H = X1 (X1^T X1)^- X1^T = X2 (X2^T X2)^- X2^T and hence the same vector of fitted values Ŷ = HY;
  • have the same residual projection matrix M = I_n − H and the same vector of residuals U = MY;
  • have the same value of the residual sum of squares SS_e = U^T U, residual degrees of freedom ν_e = n − r and residual mean square MS_e = SS_e/(n − r).

• The two equivalent models provide two different parameterizations of one situation. Nevertheless, the practical interpretation of the regression coefficients β ∈ R^k and γ ∈ R^l in the two models might be different. In practice, both parameterizations might be useful, and this is also the reason why it often makes sense to deal with both parameterizations.

2.5.2 Full-rank parameterization of a linear model

Any linear model can be parameterized such that the model matrix has linearly independent columns, i.e., is of full rank. To see this, consider a linear model Y | X ∼ (Xβ, σ² I_n), where rank(X_{n×k}) = r ≤ k < n. If Q_{n×r} is a matrix with an orthonormal vector basis of M(X) in its columns (so that rank(Q) = r), the linear model

Y | Q ∼ (Qγ, σ² I_n)   (2.9)

is equivalent to the original model with the model matrix X. Nevertheless, the parameterization of a model using the orthonormal basis matrix Q is only rarely used in practice, since the interpretation of the regression coefficients γ in model (2.9) is usually quite awkward.

The parameterization using the orthonormal basis matrix Q is indeed not the only full-rank parameterization of a given linear model. There always exist infinitely many full-rank parameterizations, and in reasonable practical analyses it should always be possible to choose a full-rank parameterization (or even several of them) that also provides practically interpretable regression coefficients.
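The defining property of equivalent parameterizations — same regression space, hence same hat matrix, fitted values and residuals — can be checked numerically. The following sketch (illustrative only; simulated data, two codings of a two-sample problem) compares an overparameterized model matrix with its full-rank "group means" reparameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 3, 4
y = rng.normal(size=n1 + n2)

# Overparameterized two-sample model matrix: intercept + two group
# indicators; rank 2 although it has 3 columns.
Xa = np.column_stack([np.ones(n1 + n2),
                      np.r_[np.ones(n1), np.zeros(n2)],
                      np.r_[np.zeros(n1), np.ones(n2)]])
# Full-rank "group means" parameterization of the same situation.
Xb = Xa[:, 1:]

def hat(X):
    # H = X (X^T X)^- X^T; the Moore-Penrose pinv is one valid
    # pseudoinverse, and the projection H itself is unique.
    return X @ np.linalg.pinv(X.T @ X) @ X.T

Ha, Hb = hat(Xa), hat(Xb)
assert np.allclose(Ha, Hb)            # same hat matrix
assert np.allclose(Ha @ y, Hb @ y)    # same fitted values
U = y - Ha @ y                        # hence also the same residuals and SSe
```

The trace of the shared hat matrix equals the common rank r = 2, regardless of which parameterization produced it.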
Example 2.2 (Different parameterizations of a two-sample problem).
Let us again consider a two-sample problem (see also Example 2.1). That is,

Y = (Y1, . . . , Y_{n1}, Y_{n1+1}, . . . , Y_{n1+n2})^T,

where Y1, . . . , Y_{n1} are identically distributed random variables with E(Y1) = · · · = E(Y_{n1}) = µ1 and var(Y1) = · · · = var(Y_{n1}) = σ² (sample 1), Y_{n1+1}, . . . , Y_{n1+n2} are also identically distributed random variables with E(Y_{n1+1}) = · · · = E(Y_{n1+n2}) = µ2 and var(Y_{n1+1}) = · · · = var(Y_{n1+n2}) = σ² (sample 2), and Y1, . . . , Y_{n1}, Y_{n1+1}, . . . , Y_{n1+n2} are assumed to be independent.

This situation can be described by differently parameterized linear models Y | X ∼ (Xβ, σ² I_n), n = n1 + n2, where the model matrix X is always divided into two blocks stacked on top of each other, X = (X1; X2), where X1 is an n1 × k matrix having n1 identical rows x1^T and X2 is an n2 × k matrix having n2 identical rows x2^T. The response mean vector µ = E(Y) is then

µ = Xβ = (X1 β; X2 β) = (µ1, . . . , µ1, µ2, . . . , µ2)^T.

That is, a parameterization of the model is given by a choice of vectors x1 ≠ x2, x1 ≠ 0_k, x2 ≠ 0_k, leading to the expressions of the means of the two samples as

µ1 = x1^T β,   µ2 = x2^T β.

The rank of the model is always r = 2.

Overparameterized model: x1 = (1, 1, 0)^T, x2 = (1, 0, 1)^T, β = (β0, β1, β2)^T,
µ1 = β0 + β1,   µ2 = β0 + β2.

Orthonormal basis: x1 = (1/√n1, 0)^T, x2 = (0, 1/√n2)^T, β = (β1, β2)^T (the model matrix is then X = Q with orthonormal columns),
µ1 = β1/√n1,   µ2 = β2/√n2,   that is, β1 = √n1 µ1,   β2 = √n2 µ2.

Group means: x1 = (1, 0)^T, x2 = (0, 1)^T, β = (β1, β2)^T,
µ1 = β1,   µ2 = β2.
This could also be viewed as the overparameterized model constrained by the condition β0 = 0.

Group differences: x1 = (1, 1)^T, x2 = (1, 0)^T, β = (β0, β1)^T,
µ1 = β0 + β1,   µ2 = β0,   that is, β1 = µ1 − µ2.
This could also be viewed as the overparameterized model constrained by the condition β2 = 0.

Deviations from the mean of the means: x1 = (1, 1)^T, x2 = (1, −1)^T, β = (β0, β1)^T,
µ1 = β0 + β1,   µ2 = β0 − β1,   that is, β0 = (µ1 + µ2)/2,   β1 = µ1 − (µ1 + µ2)/2 = (µ1 − µ2)/2.
This could also be viewed as the overparameterized model constrained by the condition β1 + β2 = 0.

Except for the overparameterized model, all the above parameterizations are based on a model matrix of full rank r = 2.

End of Lecture #3 (08/10/2015)

2.6 Matrix algebra and a method of least squares

Start of Lecture #4 (15/10/2015)

We have seen in Section 2.5 that any linear model Y | X ∼ (Xβ, σ² I_n) can be reparameterized such that the model matrix X has linearly independent columns, that is, rank(X_{n×k}) = k. Recall now the expressions of some quantities that must be calculated when dealing with the least squares estimation of the parameters of the full-rank linear model:

H = X (X^T X)^{-1} X^T,      M = I_n − H = I_n − X (X^T X)^{-1} X^T,
Ŷ = HY = X (X^T X)^{-1} X^T Y,   var(Ŷ) = σ² H = σ² X (X^T X)^{-1} X^T,
U = MY = Y − Ŷ,           var(U) = σ² M = σ² {I_n − X (X^T X)^{-1} X^T},
β̂ = (X^T X)^{-1} X^T Y,      var(β̂) = σ² (X^T X)^{-1}.

The only non-trivial calculation involved in the above expressions is the calculation of the inverse (X^T X)^{-1}. Nevertheless, all the above expressions (and many others needed in the context of least squares estimation) can be calculated without explicitly evaluating the matrix X^T X. Some of the above expressions can even be evaluated without knowing explicitly the form of the matrix (X^T X)^{-1}. To this end, methods of matrix algebra can be used (and are used by all reasonable software routines dealing with the least squares estimation).
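As a small numerical illustration of this point (a sketch with simulated data; numpy's `lstsq` is one such orthogonalization-based routine that never forms X^T X explicitly):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # full-rank, k = 3
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# Naive route: explicitly form and invert X^T X.
b_naive = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically preferable route: an orthogonalization-based solver
# that works with X directly, without ever forming X^T X.
b_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(b_naive, b_ls)
```

For well-conditioned model matrices the two routes agree to machine precision; the advantage of the second route shows up when X^T X is nearly singular.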
Two methods, known from the course Fundamentals of Numerical Mathematics (NMNM201), that have direct usage in the context of least squares are:

• the QR decomposition;
• the singular value decomposition (SVD),

both applied to the model matrix X. Both of them can be used, among other things, to find an orthonormal vector basis of the regression space M(X) and to calculate the expressions mentioned above.

2.6.1 QR decomposition

The QR decomposition of the model matrix is used, for example, by the R software (R Core Team, 2015) to estimate a linear model by the method of least squares. If X_{n×k} is a real matrix with rank(X) = k < n, then we know from the course Fundamentals of Numerical Mathematics (NMNM201) that it can be decomposed as

X = QR,

where Q_{n×k} = (q1, . . . , qk), q_j ∈ R^n, j = 1, . . . , k, the columns q1, . . . , qk form an orthonormal basis of M(X), and R_{k×k} is an upper triangular matrix. That is,

Q^T Q = I_k,   QQ^T = H.

We then have

X^T X = R^T Q^T Q R = R^T R.   (2.10)

That is, R^T R is a Cholesky (square root) decomposition of the symmetric matrix X^T X. Note that the Cholesky decomposition is a special case of the LU decomposition for symmetric (positive definite) matrices. Decomposition (2.10) can now be used to obtain easily (i) the matrix (X^T X)^{-1}, (ii) the value of its determinant or the value of the determinant of X^T X, (iii) the solution to the normal equations.

(i) Matrix (X^T X)^{-1}. We have

(X^T X)^{-1} = (R^T R)^{-1} = R^{-1} (R^T)^{-1} = R^{-1} (R^{-1})^T.

That is, to invert the matrix X^T X, we only have to invert the upper triangular matrix R.

(ii) Determinant of X^T X and (X^T X)^{-1}. Let r1, . . . , rk denote the diagonal elements of the matrix R. We then have

det(X^T X) = det(R^T R) = {det(R)}² = ∏_{j=1}^{k} r_j²,
det{(X^T X)^{-1}} = {det(X^T X)}^{-1}.

(iii) Solution to the normal equations β̂ = (X^T X)^{-1} X^T Y. We can obtain β̂ by solving:

X^T X b = X^T Y
R^T R b = R^T Q^T Y
R b = Q^T Y.   (2.11)

That is, to get β̂, it is only necessary to solve a linear system with an upper triangular system matrix, which can easily be done by backward substitution.

Further, the right-hand side c = (c1, . . . , ck)^T := Q^T Y of the linear system (2.11) additionally serves to calculate the vector of fitted values. We have

Ŷ = HY = QQ^T Y = Qc = Σ_{j=1}^{k} c_j q_j.

That is, the vector c provides the coefficients of the linear combination of the orthonormal vector basis of the regression space M(X) that provides the fitted values Ŷ.

2.6.2 SVD decomposition

Use of the SVD decomposition for least squares will not be explained in detail in this course. It is covered by the Fundamentals of Numerical Mathematics (NMNM201) course.

Chapter 3 Normal Linear Model

Until now, none of the proved theorems posed any distributional assumptions on the random vectors (Yi, X_i), X_i = (X_{i,0}, . . . , X_{i,k−1})^T, i = 1, . . . , n, that represent the data. We only assumed a certain form of the (conditional) expectation and the (conditional) covariance matrix of Y = (Y1, . . . , Yn)^T given X_1, . . . , X_n (given the model matrix X). In this chapter, we will additionally assume that the response is conditionally normally distributed given the covariates, which will lead us to the normal linear model.

3.1 Normal linear model

Definition 3.1 Normal linear model.
The data (Yi, X_i), i = 1, . . . , n, satisfy a normal linear model1 if they satisfy a linear model with i.i.d. errors with

Yi | X_i ∼ N(X_i^T β, σ²),

where β ∈ R^k and 0 < σ² < ∞ are unknown parameters.

Notation.
It follows from Definition 3.1, from the definition of a linear model with i.i.d. errors (Definition 1.4) and from the properties of the normal distribution that the joint conditional distribution of Y given X is multivariate normal with mean Xβ and covariance matrix σ² I_n. Hence the fact that the data follow a normal linear model will be indicated by the notation

Y | X ∼ N_n(Xβ, σ² I_n).
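The identities (2.10) and (2.11) can be verified directly; the following sketch (simulated data, numpy's reduced QR decomposition) recovers the normal-equations solution and the fitted values without ever inverting X^T X:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # full-rank, k = 3
y = rng.normal(size=n)

Q, R = np.linalg.qr(X)   # "reduced" QR: Q is n x k with orthonormal columns,
                         # R is k x k upper triangular

# (2.10): X^T X = R^T R, so prod(diag(R)^2) gives det(X^T X)
assert np.allclose(X.T @ X, R.T @ R)
assert np.allclose(np.linalg.det(X.T @ X), np.prod(np.diag(R) ** 2))

# (2.11): solve R b = Q^T y; R is triangular, so backward substitution
# suffices (a dedicated triangular solver would be used in practice)
c = Q.T @ y
b = np.linalg.solve(R, c)
assert np.allclose(X.T @ X @ b, X.T @ y)   # b solves the normal equations

# Fitted values as the linear combination Q c of the orthonormal basis
yhat = Q @ c
assert np.allclose(yhat, X @ b)
```

Note that numpy may return a factor R with negative diagonal entries (the factorization is unique only up to column signs); all identities above hold regardless.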
Definition 3.1 using the error terms The data Yi , X i , i = 1, . . . , n, satisfy a normal linear model if Yi = X > i β + εi , i = 1, . . . , n, i.i.d. where ε1 , . . . , εn ∼ N (0, σ 2 ), ε = ε1 , . . . , εn and X independent, and β ∈ Rk and 0 < σ 2 < ∞ are unknown parameters. Notation. To indicate that data follow a normal linear model, we will also use notation Y = Xβ + ε, 1 normální lineární model ε ∼ Nn (0n , σ 2 In ). 3.2. PROPERTIES OF THE LEAST SQUARES ESTIMATORS UNDER THE NORMALITY 3.2 37 Properties of the least squares estimators under the normality Theorem 3.1 Least squares estimators under the normality. Let Y X ∼ Nn Xβ, σ 2 In , rank Xn×k = r. Let Lm×k is a real matrix with non-zero rows > > > > l> = l> = Lβ is an estimable parameter. 1 , . . . , lm such that θ = θ1 , . . . , θm 1 β, . . . , lm β > > > > b b b Let θ = θ1 , . . . , θm = l b, . . . , l b = Lb be its least squares estimator. Further, let 1 m − V = L X> X L> = vj,t j,t=1,...,m , 1 1 , ..., √ , D = diag √ v1,1 vm,m θbj − θj Tj = p , j = 1, . . . , m, MSe vj,j > 1 b−θ . T = T1 , . . . , Tm = √ D θ MSe The following then holds. (i) Yb X ∼ Nn Xβ, σ 2 H . (ii) U X ∼ Nn 0n , σ 2 M . b X ∼ Nm θ, σ 2 V . (iii) θ (iv) Statistics Yb and U are conditionally, given X, independent. b and SSe are conditionally, given X, independent. (v) Statistics θ Yb − Xβ 2 (vi) ∼ χ2r . σ2 (vii) SSe ∼ χ2n−r . σ2 (viii) For each j = 1, . . . , m, Tj ∼ tn−r . (ix) T | X ∼ mvtm,n−r DVD . (x) If additionally rank Lm×k = m ≤ r then the matrix V is invertible and > −1 1 b b − θ ∼ Fm, n−r . θ−θ MSe V θ m Proof. Proof/calculations were available on the blackboard in K1. k b = X> X −1 X> Y and under the normality assumption, In a full-rank linear model, we have β b of the regression coefficients β. Theorem 3.1 can be used to state additional properties of the LSE β 3.2. 
PROPERTIES OF THE LEAST SQUARES ESTIMATORS UNDER THE NORMALITY 38 Consequence of Theorem 3.1: Least squares estimator of the regression coefficients in a full-rank normal linear model. Let Y X ∼ Nn Xβ, σ 2 In , rank Xn×k = k. Further, let −1 V = X> X = vj,t j,t=0,...,k−1 , 1 1 D = diag √ , ..., √ . v0,0 vk−1,k−1 The following then holds. b X ∼ Nk β, σ 2 V . (i) β b and SSe are conditionally, given X, independent. (ii) Statistics β βbj − βj (iii) For each j = 0, . . . , k − 1, Tj := p ∼ tn−k . MSe vj,j > 1 b − β ∼ mvtk,n−k DVD . D β (iv) T := T0 , . . . , Tk−1 = √ MSe > 1 b > b (v) β − β MS−1 e X X β − β ∼ Fk, n−k . k Proof. Use L = Ik in Theorem 3.1 and realize that the only pseudoinverse to the matrix X> X in −1 a full-rank model is the inverse X> X . k Theorem 3.1 and its consequence can now be used to perform principal statistical inference, i.e., calculation of confidence intervals and regions, testing statistical hypotheses, in a normal linear model. 3.2.1 Statistical inference in a full-rank normal linear model Assume a full-rank normal linear model Y X ∼ Nn Xβ, σ 2 In , rank Xn×k = k and keep −1 denoting V = X> X = vj,t j,t=0,...,k−1 . Inference on a chosen regression coefficient First, take a chosen j ∈ 0, . . . , k − 1 . We then have the following. • Standard error of βbj and confidence interval for βj We have var βbj = σ 2 vj,j (Consequence of Theorem 2.9) which is unbiasedly estimated as MSe vj,j (Theorem 2.4). The square root of this quantity, i.e., estimated standard deviation of βbj is then called as standard error 2 of the estimator βbj . That is, p S.E. βbj = MSe vj,j . (3.1) 2 směrodatná, příp. standardní chyba 3.2. PROPERTIES OF THE LEAST SQUARES ESTIMATORS UNDER THE NORMALITY 39 The standard error (3.1) is also the denominator of the t-statistic Tj from point (iii) of Consequence of Theorem 3.1. Hence the lower and the upper bounds of the Wald-type (1 − α) 100% confidence interval for βj based on the statistic Tj are α βbj ± S.E. βbj tn−k 1 − . 
2 Analogously, also one-sided confidence interval can be calculated. • Test on a value of βj Suppose that for a given βj0 ∈ R, we aim in testing H0 : H1 : βj = βj0 , βj 6= βj0 . The Wald-type test based on point (iii) of Consequence of Theorem 3.1 proceeds as follows: βbj − βj0 βbj − βj0 . Test statistic: Tj,0 = =p MSe vj,j S.E. βbj Reject H0 if α |Tj,0 | ≥ tn−k 1 − . 2 P-value when Tj,0 = tj,0 : p = 2 CDFt, n−k − |tj,0 | . Analogously, also one-sided tests can be conducted. End of Lecture #4 (15/10/2015) Simultaneous inference on a vector of regression coefficients Start of When the interest lies in the inference for the full vector of the regression coefficients β, the Lecture #6 (22/10/2015) following procedures can be used. • Simultaneous confidence region3 for β It follows from point (v) of Consequence of Theorem 3.1 that the simultaneous (1 − α) 100% confidence region for β is the set n o b > MS−1 X> X β − β b < k Fk,n−k (1 − α) , β ∈ Rk : β − β e which is an ellipsoid with center: shape matrix: diameter: b β, −1 b , c β MSe X> X = var p k Fk,n−k (1 − α). Remember from the linear algebra and geometry lectures that the shape matrix determines the principal directions of the ellipsoid as those are given by the eigen vectors of this matrix. In this case, the principal directions of the confidence ellipsoid are given by the eigen vectors of the b . c β estimated covariance matrix var • Test on a value of β Suppose that for a given β 0 ∈ Rk , we aim in testing H0 : H1 : β = β0 , β 6= β 0 . The Wald-type test based on point (v) of Consequence of Theorem 3.1 proceeds as follows: > 1 b > 0 b Test statistic: Q0 = β − β 0 MS−1 e X X β−β . k 3 Reject H0 if Q0 ≥ Fk,n−k (1 − α). P-value when Q0 = q0 : p = 1 − CDFF , k,n−k q0 . simultánní konfidenční oblast 3.2. 
3.2.2 Statistical inference in a general-rank normal linear model

Let us now assume a general-rank normal linear model Y | X ∼ N_n(Xβ, σ² I_n), rank(X_{n×k}) = r ≤ k.

Inference on an estimable parameter

Let θ = l^T β, l ≠ 0_k, be an estimable parameter and let θ̂ = l^T b be its least squares estimator.

• Standard error of θ̂ and confidence interval for θ
We have var(θ̂) = σ² l^T (X^T X)^- l (Theorem 2.8), which is unbiasedly estimated by MS_e l^T (X^T X)^- l (Theorem 2.4). Hence the standard error of θ̂ is

S.E.(θ̂) = √{ MS_e l^T (X^T X)^- l }.   (3.2)

The standard error (3.2) is also the denominator of the appropriate t-statistic from point (viii) of Theorem 3.1. Hence the lower and upper bounds of the Wald-type (1 − α)·100% confidence interval for θ based on this t-statistic are

θ̂ ± S.E.(θ̂) t_{n−r}(1 − α/2).

Analogously, one-sided confidence intervals can be calculated.

• Test on a value of θ
Suppose that for a given θ0 ∈ R, we aim at testing

H0: θ = θ0,   H1: θ ≠ θ0.

The Wald-type test based on point (viii) of Theorem 3.1 proceeds as follows:

Test statistic: T0 = (θ̂ − θ0) / S.E.(θ̂) = (θ̂ − θ0) / √{ MS_e l^T (X^T X)^- l }.
Reject H0 if |T0| ≥ t_{n−r}(1 − α/2).
P-value when T0 = t0: p = 2 CDF_{t, n−r}(−|t0|).

Analogously, one-sided tests can be conducted.

Simultaneous inference on an estimable vector parameter

Finally, let θ = Lβ be an estimable vector parameter, where L is an m × k matrix with m ≤ r linearly independent rows. Let θ̂ = Lb be the least squares estimator of θ.

• Simultaneous confidence region for θ
It follows from point (x) of Theorem 3.1 that the simultaneous (1 − α)·100% confidence region for θ is the set

{ θ ∈ R^m : (θ − θ̂)^T { MS_e L (X^T X)^- L^T }^{-1} (θ − θ̂) < m F_{m,n−r}(1 − α) },

which is an ellipsoid with

center: θ̂,
shape matrix: MS_e L (X^T X)^- L^T = the estimated covariance matrix of θ̂,
diameter: √{ m F_{m,n−r}(1 − α) }.
PROPERTIES OF THE LEAST SQUARES ESTIMATORS UNDER THE NORMALITY • Test on a value of θ Suppose that for a given θ 0 ∈ Rm , we aim in testing H0 : H1 : 41 θ = θ0 , θ 6= θ 0 . The Wald-type test based on point (x) of Theorem 3.1 proceeds as follows: > n − o−1 1 b b − θ0 . Test statistic: Q0 = θ θ − θ0 MSe L X> X L> m Reject H0 if Q0 ≥ Fm,n−r (1 − α). P-value when Q0 = q0 : p = 1 − CDFF , m,n−r q0 . Note. Assume again a full-rank model (r = k) and take L as a submatrix of the identity matrix Ik by selecting some of its rows. The above procedures can then be used to infer simultaneously on a subvector of the regression coefficients β. Note. All tests, confidence intervals and confidence regions derived in this Section were derived under the assumption of a normal linear model. Nevertheless, we show in Chapter 13 that under certain conditions, all those methods of statistical inference remain asymptotically valid even if normality does not hold. 3.3. CONFIDENCE INTERVAL FOR THE MODEL BASED MEAN, PREDICTION INTERVAL 3.3 42 Confidence interval for the model based mean, prediction interval We keep assuming that the data Yi , X i , i = 1, . . . , n, follow a normal linear model. That is, Yi = X > i β + εi , In other words, i.i.d. εi ∼ N (0, σ 2 ). 2 Yi X i ∼ N (X > i β, σ ) and Y1 , . . . , Yn are conditionally independent given X 1 , . . . , X n . Remember that X ⊆ Rk denote a sample space of the covariate random vectors X 1 , . . . , X n . Let xnew ∈ X and let Ynew = x> new β + εnew , > where εnew ∼ N (0, σ 2 ) is independent of ε = ε1 , . . . , εn . A value of Ynew is thus a value of a “new” observation sampled from the conditional distribution 2 Ynew X new = xnew ∼ N (x> new β, σ ) independently of the “old” observations. We will now tackle two important problems: (i) Interval estimation of µnew := E Ynew X new = xnew = x> new β. (ii) Interval estimation of the value of the random variable Ynew itself, given the covariate vector X new = xnew . 
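Both interval estimates differ only in the standard error used, as the theorem below makes precise. A minimal numerical sketch (simulated data, full-rank simple regression assumed; the t quantile needed for the actual bounds would come from a routine such as scipy.stats.t.ppf and is not computed here):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
z = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), z])          # intercept + one covariate
y = 2 + 0.5 * z + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
res = y - X @ b
MSe = res @ res / (n - 2)                     # residual mean square, nu_e = n - 2

x_new = np.array([1.0, 5.0])                  # new covariate vector (with intercept)
mu_hat = x_new @ b                            # BLUE of the model-based mean at x_new

q = x_new @ XtX_inv @ x_new
se_mean = np.sqrt(MSe * q)                    # S.E. for the model-based mean
se_pred = np.sqrt(MSe * (1 + q))              # S.E. of prediction: extra "1 + " term
                                              # accounts for the new error eps_new

assert se_pred > se_mean                      # prediction interval is always wider
```

The squared standard errors differ exactly by MS_e, reflecting the additional variance σ² of the new observation itself.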
Solution to the outlined problems will be provided by the following theorem. Theorem 3.2 Confidence interval for the model based mean, prediction interval. Let Y = Xβ + ε, ε ∼ Nn (0n , σ 2 In ), rank Xn×k = r. Let xnew ∈ X ∩ M X> , xnew 6= 0k . Let εnew ∼ N (0, σ 2 ) is independent of ε. Finally, let Ynew = x> new β + εnew . The following then holds: (i) µnew = x> new β is estimable, µ bnew = x> new b is its best linear unbiased estimator (BLUE) with the standard error of q − > S.E. µ bnew = MSe x> xnew new X X and the lower and the upper bound of the (1 − α) 100% confidence interval for µnew are α µ bnew ± S.E. µ bnew tn−r 1 − . (3.3) 2 (ii) A (random) interval with the bounds α µ bnew ± S.E.P. xnew tn−r 1 − , (3.4) 2 where r S.E.P. xnew = n o >X −x , MSe 1 + x> X new new covers with the probability of (1 − α) the value of Ynew . (3.5) 3.3. CONFIDENCE INTERVAL FOR THE MODEL BASED MEAN, PREDICTION INTERVAL 43 Proof. Proof/calculations were available on the blackboard in K1. k Terminology (Confidence interval for the model based mean, prediction interval, standard error of prediction). • The interval with the bounds (3.3) is called the confidence interval for the model based mean. • The interval with the bounds (3.4) is called the prediction interval. • The quantity (3.5) is called the standard error of prediction. Terminology (Fitted regression function). b of the regression Suppose that the corresponding linear model is of full-rank with the LSE β coefficients. The function b m(x) b = x> β, x ∈ X, which, by Theorem 3.2, provides BLUE’s of the values of µ(x) := E Ynew X new = x = x> β and also provides predictions for Ynew = x> β + εnew , is called the fitted regression function.4 Terminology (Confidence band around the regression function, prediction band). As was explained in Section 1.2.7, the covariates X i ∈ X ⊆ Rk used in the linear model are often obtained by transforming some original covariates Z i ∈ Z ⊆ Rp . 
Common situation is that Z ⊆ R is an interval and > > X i = Xi,0 , . . . , Xi,k−1 = t0 (Zi ), . . . , tk−1 (Zi ) = t(Zi ), i = 1, . . . , n, where t : R −→ Rk is a suitable transformation such that E Yi Zi = t> (Zi )β = X > i β. b of the regression Suppose again that the corresponding linear model is of full-rank with the LSE β coefficients. Confidence intervals for the model based mean or prediction intervals can then be calculated for an (equidistant) sequence of values znew,1 , . . . , znew,N ∈ Z and then drawn over a scatterplot of observed data Y1 , Z1 , . . . , Yn , Zn . In this way, two different bands with a fitted regression function b m(z) b = t> (z)β, z ∈ Z, going through the middle of both the bands, are obtained. In this context, (i) The band based on the confidence intervals for the model based mean (Eq. 3.3) is called the confidence band around the regression function;5 (ii) The band based on the prediction intervals (Eq. 3.4) is called the prediction band.6 4 odhadnutá regresní funkce 5 pás spolehlivosti okolo regresní funkce 6 predikční pás 3.4. DISTRIBUTION OF THE LINEAR HYPOTHESES TEST STATISTICS UNDER THE ALTERNATIVE 3.4 44 Distribution of the linear hypotheses test statistics under the alternative Beginning of Section 3.2 provided classical tests of the linear hypotheses (hypotheses on the values of estimable skipped part parameters). To allow for power or sample size calculations, we additionally need distribution of the test statistics under the alternatives. Theorem 3.3 Distribution of the linear hypothesis test statistics under the alternative. Let Y X ∼ Nn Xβ, σ 2 In , rank(Xn×k ) = r ≤ k. Let l 6= 0k such that θ = l> β is estimable. Let θb = l> b be its LSE. Let θ0 , θ1 ∈ R, θ0 = 6 θ1 and let T0 = q θb − θ0 MSe l > − X> X l . Then under the hypothesis θ = θ1 , T0 X ∼ tn−r (λ), θ1 − θ0 λ= q − . σ 2 l > X> X l Note. The statistic T0 is the test statistic to test the null hypothesis H0 : θ = θ0 using point (viii) of Theorem 3.1. Proof. 
Proof/calculations were skipped and are not requested for the exam. k Theorem 3.4 Distribution of the linear hypotheses test statistics under the alternative. Let Y X ∼ Nn Xβ, σ 2 In , rank(Xn×k ) = r ≤ k. Let Lm×k be a real matrix with m ≤ r linearly b = Lb be its LSE. Let θ 0 , θ 1 ∈ Rm , θ 0 6= θ 1 independent rows such that θ = Lβ is estimable. Let θ and let > n − o−1 1 b b − θ0 . Q0 = θ − θ0 MSe L X> X L> θ m Then under the hypothesis θ = θ 1 , Q0 X ∼ Fm,n−r (λ), λ = θ1 − θ0 > n 2 − o−1 1 σ L X> X L> θ − θ0 . Note. The statistic Q0 is the test statistic to test the null hypothesis H0 : θ = θ0 using point (x) of Theorem 3.1. 3.4. DISTRIBUTION OF THE LINEAR HYPOTHESES TEST STATISTICS UNDER THE ALTERNATIVE Proof. Proof/calculations were skipped and are not requested for the exam. 45 k Note. We derived only a conditional (given the covariates) distribution of the test statistics at hand. This corresponds to the fact that power and sample size calculations for linear models are mainly used in the area of designed experiments7 where the covariate values, i.e., the model matrix X is assumed to be fixed and not random. A problem of the sample size calculation then involves not only calculation of needed sample size n but also determination of the form of the model matrix X. More can be learned in the course Experimental Design (NMST436).8 7 navržené experimenty 8 Návrhy experimentů (NMST436) End of skipped part Chapter 4 Basic Regression Diagnostics We will now assume that data are represented by n random vectors Yi , Z i , Z i = Zi,1 , . . . , Zi,p ∈ Z ⊆ Rp i = 1, . . . , n. We keep considering that the principal aim of the statistical analysis is to find a suitable model to express the (conditional) response expectation E Y := E Y Z , where Z is a matrix with vectors Z 1 , . . ., Z n in its rows. Suppose that t : Z −→ X ⊆ Rk is a transformation of covariates leading to the model matrix X> t> (Z 1 ) 1 . . . . X= rank Xn×k = r ≤ k. . = . 
=: t(Z), X> t> (Z n ) n 46 4.1. (NORMAL) LINEAR MODEL ASSUMPTIONS 4.1 47 (Normal) linear model assumptions Basis for statistical inference shown for the by now was derived whilek assuming a linear model 2 data, i.e., while assuming that E Y Z = Xβ for some β ∈ R and var Y Z = σ In . For the data Yi , X i , i = 1, . . . , n, where we directly work with the transformed covariate vectors, this means the following assumptions (i = 1, . . . , n): (A1) E Yi X i = x = x> β for some β ∈ Rk and any x ∈ X . ≡ Correct regression function m(z) = t> (z)β, z ∈ Z, correct choice of transformation t of the original covariates leading to linearity of the (conditional) response expectation. (A2) var Yi X i = x = σ 2 for some σ 2 irrespective of the value of x ∈ X . ≡ The conditional response variance is constant (does not depend on the covariates or other factors) ≡ homoscedasticity 1 of the response. (A3) cov Yi , Yl X = 0, i 6= l. ≡ The responses are conditionally uncorrelated. Some of our results (especially those shown in Chapter 3) were derived while additionally assuming normality of the response, i.e., while assuming (A4) Yi | X i = x ∼ N x> β, σ 2 , x ∈ X . ≡ Normality of the response. If we use a definition of the linear model using the error terms, i.e., while assuming that Y = Xβ + ε for some β ∈ Rk , the linear model assumptions are all transferred into assumptions on the error terms ε = ε1 , . . . , εn . Namely (i = 1, . . . , n): (A1) E εi = 0. ≡ This again means that a structural part of the model stating that E Y X = Xβ for some β ∈ Rk is correctly specified, or in other words, that the regression function of the model is correctly specified. (A2) var εi = σ 2 for some σ 2 which is constant irrespective of the value if i. ≡ The error variance is constant ≡ homoscedasticity of the errors. (A3) cov εi , εl = 0, i 6= l. ≡ The errors are uncorrelated. Possible assumption of normality is transferred into the errors as (A4) εi ∼ N 0, σ 2 i.i.d. 
≡ The errors are normally distributed and, owing to the previous assumptions, ε1, . . . , εn are i.i.d. N(0, σ²).

Remember now that many important results, especially those already derived in Chapter 2, are valid even without assuming normality of the response/errors. Moreover, we shall show in Chapter 13 that the majority of the inferential tools based on the results of Chapters 3 and 5 are, under certain conditions, asymptotically valid even if normality does not hold.

1 homoskedasticita

In general, if inferential tools based on a statistical model with certain properties (assumptions) are to be used, we should verify, at least to some extent, the validity of those assumptions with a particular dataset. In the context of regression models, the tools used to verify the model assumptions are usually referred to as regression diagnostic tools. In this chapter, we provide only the most basic graphical methods. Additional, more advanced tools of regression diagnostics will be provided in Chapters 10 and 14.

As already mentioned above, the assumptions (A1)–(A4) are not equally important. Some of them are not needed to justify the usage of a particular inferential tool (estimator, statistical test, . . . ); see the assumptions and proofs of the corresponding theorems. This should be taken into account when using regression diagnostics. It is indeed not necessary to verify those assumptions that are not needed for a specific task.

It should finally be mentioned that, with respect to the importance of the assumptions (A1)–(A4), by far the most important is assumption (A1) concerning a correct specification of the regression function. Remember that practically all theorems in this lecture that are related to the inference in a linear model use in their proofs, in some sense, the assumption on the parameters that E(Y | X) = Xβ, i.e., E(Y | X) ∈ M(X). Hence, if this is not satisfied, the majority of the traditional statistical inference is not correct.
In other words, special attention in any data analysis should be devoted to verifying the assumption (A1) related to a correct specification of the regression function.

As we shall show, the assumptions of the linear model are basically checked through exploration of the properties of the residuals U of the model, where

U = MY,   M = I_n − X (X^T X)^- X^T = (m_{i,l})_{i,l=1,...,n}.

When doing so, it is exploited that each of the assumptions (A1)–(A4) implies a certain property of the residuals, stated earlier in Theorem 2.3 (Basic properties of the residuals and the residual sum of squares) or in Theorem 3.1 (Properties of the LSE under the normality). The following follows from those theorems (or their proofs):

1. (A1) ⟹ E(U | X) =: E(U) = 0_n.
2. (A1) & (A2) & (A3) ⟹ var(U | X) =: var(U) = σ² M.
3. (A1) & (A2) & (A3) & (A4) ⟹ U | X ∼ N_n(0_n, σ² M).

Usually, the right-hand side of an implication is verified, and if it is found not to be satisfied, we know that the left-hand side of the implication (a particular assumption or a set of assumptions) is not fulfilled either. Clearly, if we conclude that the right-hand side of the implication is fulfilled, we still do not know whether the left-hand side (a model assumption) is valid. Nevertheless, it is common to most statistical diagnostic tools that they are only able to reveal violated model assumptions but are never able to confirm their validity.

An uncomfortable property of the residuals of the linear model is the fact that even if the errors ε are homoscedastic (var(εi) = σ² for all i = 1, . . . , n), the residuals U are, in general, heteroscedastic (have unequal variances). Indeed, even if the assumption (A2) is fulfilled, we have

var(U) = σ² M,   var(Ui) = σ² m_{i,i}   (i = 1, . . . , n),

where the residual projection matrix M, in general, does not have a constant diagonal m_{1,1}, . . . , m_{n,n}. Moreover, the matrix M is not even a diagonal matrix. That is, even if the errors ε1, . . .
, εn are uncorrelated, the residuals U1, . . . , Un are, in general, correlated. This must be taken into account when the residuals U are used to check the validity of the assumption (A2). The problem of heteroscedasticity of the residuals U is then partly solved by defining the so-called standardized residuals.

2 regresní diagnostika

4.2 Standardized residuals

Consider a linear model Y | X ∼ (Xβ, σ² I_n) with the vector of residuals U = (U1, . . . , Un)^T, the residual mean square MS_e, and the residual projection matrix M having the diagonal m_{1,1}, . . . , m_{n,n}. The following definition is motivated by the properties of the residuals shown in Theorem 2.3:

E(U) = 0_n,   var(U) = σ² M,
E{ Ui / √(σ² m_{i,i}) } = 0,   var{ Ui / √(σ² m_{i,i}) } = 1,   if m_{i,i} > 0, i = 1, . . . , n.   (4.1)

Definition 4.1 Standardized residuals.
The standardized residuals3, or the vector of standardized residuals of the model, is the vector U^std = (U1^std, . . . , Un^std)^T, where

Ui^std = Ui / √(MS_e m_{i,i}) if m_{i,i} > 0, and is undefined if m_{i,i} = 0,   i = 1, . . . , n.

Notes.
• It will be shown in Section 10.4 that if a normal linear model is assumed, i.e., if Y | X ∼ N_n(Xβ, σ² I_n), and if for a given i ∈ {1, . . . , n} we have m_{i,i} > 0, then, analogously to (4.1), E(Ui^std) = 0, var(Ui^std) = 1.
• Unfortunately, even in a normal linear model, the standardized residuals U1^std, . . . , Un^std are, in general,
  • neither normally distributed;
  • nor uncorrelated.
• In some literature (and some software packages), the standardized residuals are called studentized residuals4.
• In other literature, including these course notes (and many software packages including R), the term studentized residuals is reserved for a different quantity, which we shall define in Chapter 14.

3 standardizovaná rezidua
4 studentizovaná rezidua

4.3 Graphical tools of regression diagnostics

In the whole section, the columns of the model matrix X (the regressors) will be denoted as X 0 , . . .
, X k−1 , i.e., X = X 0 , . . . , X k−1 . Remember that usually X 0 = 1, . . . , 1 is an intercept column. Further, in many situations, see Section 5.2 dealing with a submodel obtained by omitting some regressors, the current model matrix X is the model matrix of just a candidate submodel (playing the role of the model matrix X0 in Section regressors are available to model the response expectation 5.2) and perhaps additional 1 E Y Z . Let us denote them as V , . . . , V m . That is, in the notation of Section 5.2, X1 = V 1 , . . . , V m . The reminder of this section provides purely an overview of basic residual plots that are used as basic diagnostic tools in the context of a linear regression. More explanation on use of those plots will be/was provided during the lecture and the exercise classes. 4.3.1 (A1) Correctness of the regression function To detect: Overall inappropriateness of the regression function ⇒ scatterplot Yb , U of residuals versus fitted values. Nonlinearity of the regression function with respect to a particular regressor X j ⇒ scatterplot X j , U of residuals versus that regressor. Possibly omitted regressor V ⇒ scatterplot V , U of residuals versus that regressor. For all proposed plots, a slightly better insight is obtained if standardized residuals U std are used instead of the raw residuals U . 4.3.2 (A2) Homoscedasticity of the errors To detect Residual variance that depends on the response expectation ⇒ scatterplot Yb , U of residuals versus fitted values. Residual variance that depends on a particular regressor X j ⇒ scatterplot X j , U of residuals versus that regressor. Residual variance that depend on a regressor V not included in the model ⇒ scatterplot V , U of residuals versus that regressor. 4.3. GRAPHICAL TOOLS OF REGRESSION DIAGNOSTICS 51 For all proposed plots, a better insight is obtained if standardized residuals U std are used instead of the raw residuals U . 
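The plotting coordinates used above can be sketched numerically. A minimal sketch on hypothetical toy data, using the closed-form diagonal m_{i,i} = 1 − 1/n − (x_i − x̄)²/S_xx, which is valid for the special case of simple regression with intercept:

```python
import math

# Hypothetical toy data; a simple regression with intercept fitted by
# least squares, giving the coordinates for the residual plots above.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]
n, r = len(x), 2                          # r = rank(X): intercept + slope
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

yhat = [b0 + b1 * xi for xi in x]                 # fitted values
u = [yi - yh for yi, yh in zip(y, yhat)]          # raw residuals
mse = sum(ui ** 2 for ui in u) / (n - r)          # residual mean square MS_e
# Diagonal of M for simple regression: m_ii = 1 - 1/n - (x_i - xbar)^2 / Sxx.
m_diag = [1.0 - 1.0 / n - (xi - xbar) ** 2 / sxx for xi in x]
u_std = [ui / math.sqrt(mse * mi) for ui, mi in zip(u, m_diag)]
```

Plotting the pairs (yhat, u_std) then gives the residual-versus-fitted plot; note that the values in m_diag differ across i, which is exactly why the raw residuals are heteroscedastic even under (A2).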
This is due to the fact that even if homoscedasticity of the errors is fulfilled, the raw residuals U are not necessarily homoscedastic (var(U | Z) = σ² M), whereas the standardized residuals are homoscedastic, all having unit variance, if additionally normality of the response holds.

So-called scale-location plots are obtained if, in the plots proposed above, the vector of raw residuals U is replaced by the vector (√|U_1^std|, ..., √|U_n^std|)^T.

4.3.3 (A3) Uncorrelated errors

[End of Lecture #6 (22/10/2015); start of Lecture #8 (29/10/2015).]

The assumption of uncorrelated errors is often justified by the data gathering mechanism used (e.g., observations/measurements performed on clearly independently behaving units/individuals). In that case, it does not make much sense to verify this assumption. Two typical situations where uncorrelated errors cannot be taken for granted are

(i) repeated observations performed on N independently behaving units/subjects;
(ii) observations performed sequentially in time, where the i-th response value Y_i is obtained at time t_i and the observational occasions t_1 < ··· < t_n form an equidistant sequence.

In the following, we will not discuss the case (i) of repeated observations any further. In that case, a simple linear model is in most cases fully inappropriate for statistical inference and more advanced models and methods must be used; see the course Advanced Regression Models (NMST432).

In case (ii), the errors ε_1, ..., ε_n can be considered a time series (Czech: časová řada). The assumptions (A1)–(A3) of the linear model then state that this time series (the errors of the model) forms a white noise (Czech: bílý šum). Possible serial correlation (autocorrelation) between the error terms is then usually considered a possible violation of assumption (A3) of uncorrelated errors.

As stated above, even if the errors are uncorrelated and assumption (A3) is fulfilled, the residuals U are in general correlated. Nevertheless, the correlation is usually rather low and the residuals are typically used to check assumption (A3) and possibly to detect the form of the serial correlation present in the data at hand. See the Stochastic Processes 2 (NMSA409) course for basic diagnostic methods, which include:

• the autocorrelation and partial autocorrelation plot based on the residuals U;
• the plot of delayed residuals, that is, a scatterplot based on the points (U_1, U_2), (U_2, U_3), ..., (U_{n−1}, U_n).

4.3.4 (A4) Normality

To detect possible non-normality of the errors, standard tools used to check normality of a random sample, known from the course Mathematical Statistics 1 (NMSA331), are used, now with the vector of residuals U or standardized residuals U^std in place of the random sample whose normality is to be checked. A basic graphical tool to check the normality of a sample is then

• the normal probability plot (the QQ plot).

Usage of both the raw residuals U and the standardized residuals U^std to check the normality assumption (A4) bears certain inconveniences. If all assumptions of the normal linear model are fulfilled, then:

• The raw residuals U satisfy U | Z ∼ N_n(0_n, σ² M). That is, they maintain normality; nevertheless, they are, in general, not homoscedastic (var U_i = σ² m_{i,i}, i = 1, ..., n). Hence a seeming non-normality of the “sample” U_1, ..., U_n might be caused by the fact that the residuals are subject to different variability.

• The standardized residuals U^std satisfy E(U_i^std | Z) = 0, var(U_i^std | Z) = 1 for all i = 1, ..., n. That is, the standardized residuals are homoscedastic (with a known variance of one); nevertheless, they are not necessarily normally distributed.
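The QQ-plot coordinates can be sketched as follows; a minimal, illustrative computation that pairs the ordered standardized residuals with standard normal quantiles at the plotting positions (i − 0.5)/n (one of several common conventions):

```python
from statistics import NormalDist

def qq_coordinates(u_std):
    """Return (theoretical quantile, ordered residual) pairs for a QQ plot."""
    n = len(u_std)
    ordered = sorted(u_std)
    # Plotting positions (i - 0.5)/n; other conventions exist.
    theo = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theo, ordered))

# Hypothetical standardized residuals.
pairs = qq_coordinates([0.8, -1.6, 1.4, -0.9, 0.3])
print(pairs)
```

Points falling close to the identity line support (A4); systematic departures (heavy tails, skewness) suggest non-normal errors.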
On the other hand, the deviation of the distributional shape of the standardized residuals from the distributional shape of the errors ε is usually rather minor and hence the standardized residuals are usually useful in detecting non-normality of the errors.

Chapter 5: Submodels

In this chapter, we will again consider data represented by n random vectors (Y_i, Z_i), Z_i = (Z_{i,1}, ..., Z_{i,p})^T ∈ 𝒵 ⊆ R^p, i = 1, ..., n. The main aim is still to find a suitable model to express the (conditional) response expectation E Y := E(Y | Z), where Z is the matrix with the vectors Z_1, ..., Z_n in its rows.

Suppose that t^0: R^p → R^{k0} and t: R^p → R^k are two transformations of the covariates leading to the model matrices X0 and X, with rows

(X^0_i)^T, X^0_i = t^0(Z_i), and X_i^T, X_i = t(Z_i), i = 1, ..., n,   (5.1)

respectively. Briefly, we will write X0 = t^0(Z), X = t(Z). Let

rank(X0) = r0,   rank(X) = r,   (5.2)

where 0 < r0 ≤ k0 < n, 0 < r ≤ k < n. We will now deal with a situation where the matrices X0 and X determine two linear models:

Model M0: Y | Z ∼ (X0 β0, σ² I_n),
Model M: Y | Z ∼ (Xβ, σ² I_n),

and the task is to decide whether one of the two models fits the data “better”. In this chapter, we limit ourselves to the situation where M0 is a so-called submodel of the model M.

5.1 Submodel

Definition 5.1 (Submodel). We say that the model M0 is a submodel (Czech: podmodel), or the nested model (Czech: vnořený model), of the model M if M(X0) ⊂ M(X) with r0 < r.

Notation. The situation that a model M0 is a submodel of a model M will be denoted M0 ⊂ M.

Notes.
• A submodel provides a more parsimonious expression of the response expectation E Y.
• The fact that the submodel M0 holds means E Y ∈ M(X0) ⊂ M(X). That is, if the submodel M0 holds, then the larger model M holds as well. That is, there exist β0 ∈ R^{k0} and β ∈ R^k such that E Y = X0 β0 = Xβ.
• The fact that the submodel M0 does not hold but the model M holds means that E Y ∈ M(X) \ M(X0). That is, there exists no β0 ∈ R^{k0} such that E Y = X0 β0.

5.1.1 Projection considerations

Decomposition of the n-dimensional Euclidean space

Since M(X0) ⊂ M(X) ⊆ R^n, it is possible to construct an orthonormal vector basis P_{n×n} = (p_1, ..., p_n) of the n-dimensional Euclidean space as P = (Q0, Q1, N), where

• Q0 (n × r0): an orthonormal vector basis of the submodel regression space, i.e., M(X0) = M(Q0);
• Q1 (n × (r − r0)): orthonormal vectors such that Q := (Q0, Q1) is an orthonormal vector basis of the model regression space, i.e., M(X) = M(Q) = M(Q0, Q1);
• N (n × (n − r)): an orthonormal vector basis of the model residual space, i.e., M(X)^⊥ = M(N).

Further,
• N0 (n × (n − r0)) := (Q1, N): an orthonormal vector basis of the submodel residual space, i.e., M(X0)^⊥ = M(N0) = M(Q1, N).

It follows from the orthonormality of the columns of the matrix P that

I_n = P^T P = P P^T = Q0 Q0^T + Q1 Q1^T + N N^T = Q Q^T + N N^T = Q0 Q0^T + N0 N0^T.

Notation. In the following, let

H0 = Q0 Q0^T,   M0 = N0 N0^T = Q1 Q1^T + N N^T.

Notes.
• The matrices H0 and M0, which are symmetric and idempotent, are the projection matrices into the regression and the residual space, respectively, of the submodel.
• The hat matrix and the residual projection matrix of the model can now also be written as

H = Q Q^T = Q0 Q0^T + Q1 Q1^T = H0 + Q1 Q1^T,
M = N N^T = M0 − Q1 Q1^T.

Projections into subspaces of the n-dimensional Euclidean space

Let y ∈ R^n. We can then write

y = I_n y = (Q0 Q0^T + Q1 Q1^T + N N^T) y = Q0 Q0^T y + Q1 Q1^T y + N N^T y,

grouping the terms in two ways: the first two terms sum to ŷ and the last term gives u; alternatively, the first term gives ŷ^0 and the last two sum to u^0. We have

• ŷ = (Q0 Q0^T + Q1 Q1^T) y = H y ∈ M(X);
• u = N N^T y = M y ∈ M(X)^⊥;
• ŷ^0 := Q0 Q0^T y = H0 y ∈ M(X0);
• u^0 := (Q1 Q1^T + N N^T) y = M0 y ∈ M(X0)^⊥;
• d := Q1 Q1^T y = ŷ − ŷ^0 = u^0 − u.

5.1.2 Properties of submodel related quantities

Notation (Quantities related to a submodel). When dealing with a pair of a model and a submodel, quantities related to the submodel will be denoted by a superscript (or by a subscript) 0. In particular:

• Ŷ^0 = H0 Y = Q0 Q0^T Y: the fitted values in the submodel (the projection of Y into the submodel regression space);
• U^0 = Y − Ŷ^0 = M0 Y = (Q1 Q1^T + N N^T) Y: the residuals of the submodel;
• SS_e^0 = ‖U^0‖²: the residual sum of squares of the submodel;
• ν_e^0 = n − r0: the submodel residual degrees of freedom;
• MS_e^0 = SS_e^0 / ν_e^0: the submodel residual mean square.

Additionally, we denote by D the projection of the response vector Y into the space M(Q1), i.e.,

D = Q1 Q1^T Y = Ŷ − Ŷ^0 = U^0 − U.   (5.3)

Theorem 5.1 (On a submodel). Consider two linear models M: Y | Z ∼ (Xβ, σ² I_n) and M0: Y | Z ∼ (X0 β0, σ² I_n) such that M0 ⊂ M. Let the submodel M0 hold, i.e., let E Y ∈ M(X0). Then

(i) Ŷ^0 is the best linear unbiased estimator (BLUE) of the vector parameter μ0 = X0 β0 = E Y.
(ii) The submodel residual mean square MS_e^0 is an unbiased estimator of the residual variance σ².
(iii) The statistics Ŷ^0 and U^0 are conditionally, given Z, uncorrelated.
(iv) The random vector D = Ŷ − Ŷ^0 = U^0 − U satisfies ‖D‖² = SS_e^0 − SS_e.
(v) If, additionally, a normal linear model is assumed, i.e., if Y | Z ∼ N_n(X0 β0, σ² I_n), then the statistics Ŷ^0 and U^0 are conditionally, given Z, independent and

F0 = [ (SS_e^0 − SS_e)/(ν_e^0 − ν_e) ] / [ SS_e/ν_e ] = [ (SS_e^0 − SS_e)/(r − r0) ] / [ SS_e/(n − r) ] ∼ F_{r−r0, n−r} = F_{ν_e^0−ν_e, ν_e}.   (5.4)

Proof. The proof/calculations were available on the blackboard in K1. ∎

5.1.3 Series of submodels

When looking for a suitable model to express E Y, a series of submodels is often considered. Let us now assume a series of models

Model M0: Y | Z ∼ (X0 β0, σ² I_n),
Model M1: Y | Z ∼ (X1 β1, σ² I_n),
Model M: Y | Z ∼ (Xβ, σ² I_n),

where, analogously to (5.1), the n × k1 matrix X1 has rows (X^1_i)^T, X^1_i = t^1(Z_i), i = 1, ..., n, for some transformation t^1: R^p → R^{k1} of the original covariates Z_1, ..., Z_n, which we briefly write as X1 = t^1(Z). Analogously to (5.2), we will assume that rank(X1) = r1 for some 0 < r1 ≤ k1 < n. Finally, we will assume that the three considered models are mutually submodels. That is, we will assume that M(X0) ⊂ M(X1) ⊂ M(X) with r0 < r1 < r, which we denote M0 ⊂ M1 ⊂ M.

Notation. Quantities derived while assuming a particular model will be denoted by the corresponding superscript (or by no superscript in the case of the model M). That is:

• Ŷ^0, U^0, SS_e^0, ν_e^0, MS_e^0: quantities based on the (sub)model M0: Y | Z ∼ (X0 β0, σ² I_n);
• Ŷ^1, U^1, SS_e^1, ν_e^1, MS_e^1: quantities based on the (sub)model M1: Y | Z ∼ (X1 β1, σ² I_n);
• Ŷ, U, SS_e, ν_e, MS_e: quantities based on the model M: Y | Z ∼ (Xβ, σ² I_n).

Theorem 5.2 (On submodels). Consider three normal linear models M: Y | Z ∼ N_n(Xβ, σ² I_n), M1: Y | Z ∼ N_n(X1 β1, σ² I_n), M0: Y | Z ∼ N_n(X0 β0, σ² I_n) such that M0 ⊂ M1 ⊂ M. Let the (smallest) submodel M0 hold, i.e., let E Y ∈ M(X0). Then

F_{0,1} = [ (SS_e^0 − SS_e^1)/(ν_e^0 − ν_e^1) ] / [ SS_e/ν_e ] = [ (SS_e^0 − SS_e^1)/(r1 − r0) ] / [ SS_e/(n − r) ] ∼ F_{r1−r0, n−r} = F_{ν_e^0−ν_e^1, ν_e}.   (5.5)

Proof. The proof/calculations were available on the blackboard in K1. ∎

Note. Both F-statistics (5.4) and (5.5) contain:

• in the numerator: the difference of the residual sums of squares of two models, one of which is a submodel of the other, divided by the difference of the residual degrees of freedom of those two models;
• in the denominator: the residual sum of squares of a model which is larger than or equal to either of the two models whose quantities appear in the numerator, divided by the corresponding degrees of freedom.
• To obtain an F-distribution of the F-statistic (5.4) or (5.5), the smallest model whose quantities appear in that F-statistic must hold, which implies that any other, larger model holds as well.

Notation (Differences when dealing with a submodel). Let M^A and M^B be two models distinguished by the symbols “A” and “B” such that M^A ⊂ M^B.
Let Ŷ^A and Ŷ^B, U^A and U^B, SS_e^A and SS_e^B denote the fitted values, the vectors of residuals and the residual sums of squares based on the models M^A and M^B, respectively. The following notation will be used whenever it becomes necessary to indicate which two models are related to the vector D or to the difference of the sums of squares:

D(M^B | M^A) = D(B | A) := Ŷ^B − Ŷ^A = U^A − U^B,
SS(M^B | M^A) = SS(B | A) := SS_e^A − SS_e^B.

Notes.
• Both F-statistics (5.4) and (5.5) contain a certain SS(B | A) in their numerators.
• Point (iv) of Theorem 5.1 gives SS(B | A) = ‖D(B | A)‖².

5.1.4 Statistical test to compare nested models

[End of Lecture #8 (29/10/2015); start of Lecture #10 (05/11/2015).]

Theorems 5.1 and 5.2 provide a way to compare two nested models by means of a statistical test.

F-test on a submodel based on Theorem 5.1

Consider two normal linear models:

Model M0: Y | Z ∼ N_n(X0 β0, σ² I_n),
Model M: Y | Z ∼ N_n(Xβ, σ² I_n),

where M0 ⊂ M, and the set of statistical hypotheses

H0: E Y ∈ M(X0),
H1: E Y ∈ M(X) \ M(X0),

that aim at answering the questions:

• Is the model M significantly better than the model M0?
• Does the (larger) regression space M(X) provide a significantly better expression for E Y than the (smaller) regression space M(X0)?

The F-statistic (5.4) from Theorem 5.1 now provides a way to test the above hypotheses as follows:

Test statistic: F0 = [ (SS_e^0 − SS_e)/(r − r0) ] / [ SS_e/(n − r) ] = [ SS(M | M0)/(r − r0) ] / [ SS_e/(n − r) ].
Reject H0 if F0 ≥ F_{r−r0, n−r}(1 − α).
P-value when F0 = f0: p = 1 − CDF_{F, r−r0, n−r}(f0).

F-test on a submodel based on Theorem 5.2

Consider three normal linear models:

Model M0: Y | Z ∼ N_n(X0 β0, σ² I_n),
Model M1: Y | Z ∼ N_n(X1 β1, σ² I_n),
Model M: Y | Z ∼ N_n(Xβ, σ² I_n),

where M0 ⊂ M1 ⊂ M, and the set of statistical hypotheses

H0: E Y ∈ M(X0),
H1: E Y ∈ M(X1) \ M(X0),

that aim at answering the questions:

• Is the model M1 significantly better than the model M0?
• Does the (larger) regression space M(X1) provide a significantly better expression for E Y than the (smaller) regression space M(X0)?

The F-statistic (5.5) from Theorem 5.2 now provides a way to test the above hypotheses as follows:

Test statistic: F_{0,1} = [ (SS_e^0 − SS_e^1)/(r1 − r0) ] / [ SS_e/(n − r) ] = [ SS(M1 | M0)/(r1 − r0) ] / [ SS_e/(n − r) ].
Reject H0 if F_{0,1} ≥ F_{r1−r0, n−r}(1 − α).
P-value when F_{0,1} = f_{0,1}: p = 1 − CDF_{F, r1−r0, n−r}(f_{0,1}).

5.2 Omitting some covariates

The most common couple (model – submodel) is

Model M: Y | Z ∼ (Xβ, σ² I_n),
Submodel M0: Y | Z ∼ (X0 β0, σ² I_n),

where the submodel matrix X0 is obtained by omitting selected columns from the model matrix X. In other words, some covariates are omitted from the original covariate vectors X_1, ..., X_n to get the submodel and the matrix X0. In the following, without loss of generality, let

X = (X0, X1),   0 < rank(X0) = r0 < r = rank(X) < n.

The corresponding submodel F-test then evaluates whether, given the knowledge of the covariates included in the submodel matrix X0, the covariates included in the matrix X1 have an impact on the response expectation.

Theorem 5.3 (Effect of omitting some covariates). Consider a couple (model – submodel), where the submodel is obtained by omitting some covariates from the model. Then

(i) D ≠ 0_n and SS_e^0 − SS_e > 0.
(ii) If M(X1) ⊥ M(X0), then

D = X1 (X1^T X1)^− X1^T Y =: Ŷ^1,

which are the fitted values from the linear model Y | Z ∼ (X1 β1, σ² I_n).

Proof. The proof/calculations were available on the blackboard in K1. ∎

Note. If we take the residual sum of squares as a measure of the quality of the model, point (i) of Theorem 5.3 says that the model always gets worse if some covariates are removed. Nevertheless, in practice, it is always a question whether this worsening is statistically significant (the submodel F-test answers this) or practically important (additional reasoning is needed).
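The submodel F-test above can be sketched numerically. A minimal sketch on hypothetical data: a simple-regression model M (intercept and one regressor, r = 2) against the submodel M0 that omits the regressor (intercept only, r0 = 1); the 5% critical value F_{1,3}(0.95) ≈ 10.13 is quoted from standard tables:

```python
# Hypothetical data; model M: intercept + slope, submodel M0: intercept only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]
n, r0, r = len(x), 1, 2

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # SS_e in M
ss0e = sum((yi - ybar) ** 2 for yi in y)                       # SS_e^0 in M0
f0 = ((ss0e - sse) / (r - r0)) / (sse / (n - r))               # statistic (5.4)

# F_{1,3}(0.95) = t_3(0.975)^2 = 3.182^2, approximately 10.13.
print(f0, f0 >= 10.13)
```

Consistently with point (i) of Theorem 5.3, ss0e − sse is strictly positive; here F0 far exceeds the 5% critical value, so H0 (the submodel) is rejected and the regressor is retained.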
5.3 Linear constraints

Suppose that a linear model Y | Z ∼ (Xβ, σ² I_n), rank(X_{n×k}) = r, is given and it is our aim to verify whether the response expectation E Y lies in a constrained regression space

M(X; Lβ = θ0) := { v : v = Xβ, β ∈ R^k, Lβ = θ0 },   (5.6)

where L_{m×k} is a given real matrix and θ0 ∈ R^m is a given vector. In other words, verification of whether the response expectation lies in the space M(X; Lβ = θ0) corresponds to verification of whether the regression coefficients satisfy the linear constraint Lβ = θ0.

Lemma 5.4 (Regression space given by linear constraints). Consider a linear model Y | Z ∼ (Xβ, σ² I_n), rank(X_{n×k}) = r ≤ k < n. Let L_{m×k} be a real matrix with m ≤ r rows such that

(i) rank(L) = m (i.e., L is a matrix with linearly independent rows);
(ii) θ = Lβ is an estimable parameter of the considered linear model.

The space M(X; Lβ = 0_m) is then a vector subspace of dimension r − m of the regression space M(X).

Proof. The proof/calculations were available on the blackboard in K1. ∎

Notes.
• The space M(X; Lβ = θ0) is a vector space only if θ0 = 0_m, since otherwise 0_n ∉ M(X; Lβ = θ0). Nevertheless, for the purpose of statistical analysis, it is possible (and in practice also necessary) to work with θ0 ≠ 0_m as well.
• With m = r, M(X; Lβ = 0_m) = {0_n}.

Definition 5.2 (Submodel given by linear constraints). We say that the model M0 is a submodel given by the linear constraints (Czech: podmodel zadaný lineárními omezeními) Lβ = θ0 of the model M: Y | Z ∼ (Xβ, σ² I_n), rank(X_{n×k}) = r, if the matrix L satisfies the conditions of Lemma 5.4, m < r, and the response expectation E Y under the model M0 is assumed to lie in the space M(X; Lβ = θ0).

Notation. A submodel given by linear constraints will be denoted M0: Y | Z ∼ (Xβ, σ² I_n, Lβ = θ0).

Since with θ0 ≠ 0_m the space M(X; Lβ = θ0) is not a vector space, we cannot, in general, talk about projections in the sense of linear algebra when deriving the fitted values, the residuals and other quantities related to the submodel given by linear constraints. Hence we introduce the following definition.

Definition 5.3 (Fitted values, residuals, residual sum of squares, rank of the model and residual degrees of freedom in a submodel given by linear constraints). Let b0 ∈ R^k minimize SS(β) = ‖Y − Xβ‖² over β ∈ R^k subject to Lβ = θ0. For the submodel M0: Y | Z ∼ (Xβ, σ² I_n, Lβ = θ0), the following quantities are defined:

Fitted values: Ŷ^0 := X b0.
Residuals: U^0 := Y − Ŷ^0.
Residual sum of squares: SS_e^0 := ‖U^0‖².
Rank of the model: r0 := r − m.
Residual degrees of freedom: ν_e^0 := n − r0.

Note. The fitted values could also be defined as

Ŷ^0 = argmin_{Ỹ ∈ M(X; Lβ = θ0)} ‖Y − Ỹ‖².

That is, the fitted values are (still) the point closest to Y in the constrained regression space M(X; Lβ = θ0).

Theorem 5.5 (On a submodel given by linear constraints). Let M0: Y | Z ∼ (Xβ, σ² I_n, Lβ = θ0) be a submodel given by linear constraints of a model M: Y | Z ∼ (Xβ, σ² I_n). Then

(i) The fitted values Ŷ^0 and consequently also the residuals U^0 and the residual sum of squares SS_e^0 are unique.
(ii) b0 minimizes SS(β) = ‖Y − Xβ‖² subject to Lβ = θ0 if and only if

b0 = b − (X^T X)^− L^T { L (X^T X)^− L^T }^{−1} (L b − θ0),

where b = (X^T X)^− X^T Y is (any) solution of the system of normal equations X^T X b = X^T Y.

(iii) The fitted values Ŷ^0 can be expressed as

Ŷ^0 = Ŷ − X (X^T X)^− L^T { L (X^T X)^− L^T }^{−1} (L b − θ0).

(iv) The vector D = Ŷ − Ŷ^0 satisfies

‖D‖² = SS_e^0 − SS_e = (L b − θ0)^T { L (X^T X)^− L^T }^{−1} (L b − θ0).   (5.7)

Proof. First note that, under our assumptions, the matrix L (X^T X)^− L^T (i) is invertible and (ii) does not depend on the choice of the pseudoinverse (X^T X)^−. This follows from Theorem 2.9 (Gauss–Markov for an estimable vector parameter).

Second, we look for Ŷ^0 = X b0 such that b0 minimizes SS(β) = ‖Y − Xβ‖² over β ∈ R^k subject to Lβ = θ0, using the method of Lagrange multipliers. Let

φ(β, λ) = ‖Y − Xβ‖² + 2 λ^T (Lβ − θ0) = (Y − Xβ)^T (Y − Xβ) + 2 λ^T (Lβ − θ0),

where the factor of 2 in the second part of the expression of the Lagrange function φ is included only to simplify subsequent expressions. The first derivatives of φ are

∂φ/∂β (β, λ) = −2 X^T (Y − Xβ) + 2 L^T λ,
∂φ/∂λ (β, λ) = 2 (Lβ − θ0).

Realize now that ∂φ/∂β (β, λ) = 0_k if and only if

X^T X β = X^T Y − L^T λ.   (5.8)

Note that the linear system (5.8) is consistent for any λ ∈ R^m and any Y ∈ R^n. This follows from the fact that, due to the estimability of the parameter Lβ, we have M(L^T) ⊂ M(X^T) (Theorem 2.7). Hence the right-hand side of the system (5.8) lies in M(X^T) for any λ ∈ R^m and any Y ∈ R^n. The left-hand side of the system (5.8) lies in M(X^T X) for any β ∈ R^k. We already know that M(X^T) = M(X^T X) (Lemma 2.6), which proves that there always exists a solution of the linear system (5.8).

Let b0(λ) be any solution of X^T X β = X^T Y − L^T λ. That is,

b0(λ) = (X^T X)^− X^T Y − (X^T X)^− L^T λ = b − (X^T X)^− L^T λ,

which depends on the choice of (X^T X)^−. Further, ∂φ/∂λ (β, λ) = 0_m if and only if L b0(λ) = θ0, i.e.,

L b − L (X^T X)^− L^T λ = θ0,   i.e.,   L (X^T X)^− L^T λ = L b − θ0,

where the matrix L (X^T X)^− L^T is invertible, as we already know. That is,

λ = { L (X^T X)^− L^T }^{−1} (L b − θ0).

Finally,

b0 = b − (X^T X)^− L^T { L (X^T X)^− L^T }^{−1} (L b − θ0),
Ŷ^0 = X b0 = Ŷ − X (X^T X)^− L^T { L (X^T X)^− L^T }^{−1} (L b − θ0).

Realize again that M(L^T) ⊂ M(X^T). That is, there exists a matrix A such that L^T = X^T A^T, i.e., L = A X. Under our assumptions, the matrix A is even unique. The vector Ŷ^0 can now be written as

Ŷ^0 = Ŷ − X (X^T X)^− X^T A^T { L (X^T X)^− L^T }^{−1} (L b − θ0),   (5.9)

where each factor on the right-hand side is unique.

To show point (iv), use (5.9) to express the vector D = Ŷ − Ŷ^0:

D = X (X^T X)^− X^T A^T { L (X^T X)^− L^T }^{−1} (L b − θ0).

That is,

‖D‖² = (L b − θ0)^T { L (X^T X)^− L^T }^{−1} A X (X^T X)^− X^T X (X^T X)^− X^T A^T { L (X^T X)^− L^T }^{−1} (L b − θ0)
     = (L b − θ0)^T { L (X^T X)^− L^T }^{−1} A X (X^T X)^− X^T A^T { L (X^T X)^− L^T }^{−1} (L b − θ0)   [by the five matrices rule]
     = (L b − θ0)^T { L (X^T X)^− L^T }^{−1} L (X^T X)^− L^T { L (X^T X)^− L^T }^{−1} (L b − θ0)
     = (L b − θ0)^T { L (X^T X)^− L^T }^{−1} (L b − θ0).

It remains to be shown that ‖D‖² = SS_e^0 − SS_e. We have

SS_e^0 = ‖Y − Ŷ^0‖² = ‖ (Y − Ŷ) + X (X^T X)^− L^T { L (X^T X)^− L^T }^{−1} (L b − θ0) ‖²,

where U = Y − Ŷ ∈ M(X)^⊥ and D ∈ M(X), so that

SS_e^0 = ‖U‖² + ‖D‖² = SS_e + ‖D‖². ∎

5.3.1 F-statistic to verify a set of linear constraints

Let us take the expression (5.7) for the difference between the residual sums of squares of the model and the submodel given by linear constraints and derive the submodel F-statistic (5.4):

F0 = [ (SS_e^0 − SS_e)/(r − r0) ] / [ SS_e/(n − r) ]
   = [ (L b − θ0)^T { L (X^T X)^− L^T }^{−1} (L b − θ0) / m ] / [ SS_e/(n − r) ]
   = (1/m) (L b − θ0)^T { MS_e L (X^T X)^− L^T }^{−1} (L b − θ0)
   = (1/m) (θ̂ − θ0)^T { MS_e L (X^T X)^− L^T }^{−1} (θ̂ − θ0),   (5.10)

where θ̂ = L b is the LSE of the estimable vector parameter θ = Lβ in the linear model Y | X ∼ (Xβ, σ² I_n) without constraints. Note now that (5.10) is exactly equal to the Wald-type statistic Q0 (see page 41) that we used in Section 3.2.2 to test the null hypothesis H0: θ = θ0 on an estimable vector parameter θ in a normal linear model Y | Z ∼ N_n(Xβ, σ² I_n). If normality can be assumed, point (x) of Theorem 3.1 then provides that, under the null hypothesis H0: θ = θ0, that is, under the validity of the submodel given by the linear constraints Lβ = θ0, the statistic F0 follows the usual F-distribution F_{m, n−r}. This shows that the Wald-type test on an estimable vector parameter in a normal linear model based on Theorem 3.1 is equivalent to the submodel F-test based on Theorem 5.1.
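This equivalence can be illustrated numerically for a single constraint. A minimal sketch on hypothetical data with l = (0, 1)^T and θ0 = 0 (zero slope), using the fact that in simple regression l^T (X^T X)^{−1} l = 1/S_xx:

```python
import math

# Hypothetical data; single constraint l = (0, 1)^T, theta0 = 0 (zero slope).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]
n, r = len(x), 2

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - r)

# Wald-type statistic for H0: slope = 0; l^T (X^T X)^{-1} l = 1/Sxx here.
t0 = (b1 - 0.0) / math.sqrt(mse / sxx)
f0 = t0 ** 2                              # statistic (5.10) with m = 1

# The same value as the submodel F-statistic against the intercept-only model:
ss0e = sum((yi - ybar) ** 2 for yi in y)
f_sub = ((ss0e - sse) / 1) / (sse / (n - r))
print(round(f0, 6), round(f_sub, 6))
```

The two statistics agree up to floating-point rounding, which is exactly the equivalence of the Wald-type test and the submodel F-test stated above.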
5.3.2 t-statistic to verify a linear constraint

Consider L = l^T, l ∈ R^k, l ≠ 0_k, such that θ = l^T β is an estimable parameter of the normal linear model Y | Z ∼ N_n(Xβ, σ² I_n). Take θ0 ∈ R and consider the submodel given by the m = 1 linear constraint l^T β = θ0. Let θ̂ = l^T b, where b is any solution of the normal equations in the model without constraints. The statistic (5.10) then takes the form

F0 = (θ̂ − θ0) { MS_e l^T (X^T X)^− l }^{−1} (θ̂ − θ0) = ( (θ̂ − θ0) / √( MS_e l^T (X^T X)^− l ) )² = T0²,

where

T0 = (θ̂ − θ0) / √( MS_e l^T (X^T X)^− l )

is the Wald-type test statistic introduced in Section 3.2.2 (on page 40) to test the null hypothesis H0: θ = θ0 in a normal linear model Y | Z ∼ N_n(Xβ, σ² I_n). Point (viii) of Theorem 3.1 provides that, under the null hypothesis H0: θ = θ0, the statistic T0 follows the Student t-distribution t_{n−r}, which is indeed in agreement with the fact that T0² = F0 follows the F-distribution F_{1, n−r}.

5.4 Coefficient of determination

5.4.1 Intercept-only model

Notation (Response sample mean). The sample mean of the response vector Y = (Y_1, ..., Y_n)^T will be denoted Ȳ. That is,

Ȳ = (1/n) Σ_{i=1}^n Y_i = (1/n) Y^T 1_n.

Definition 5.4 (Regression and total sums of squares in a linear model). Consider a linear model Y | X ∼ (Xβ, σ² I_n), rank(X_{n×k}) = r ≤ k. The following expressions define the following quantities:

(i) The regression sum of squares (Czech: regresní součet čtverců) and the corresponding degrees of freedom:

SS_R = ‖Ŷ − Ȳ 1_n‖² = Σ_{i=1}^n (Ŷ_i − Ȳ)²,   ν_R = r − 1.

(ii) The total sum of squares (Czech: celkový součet čtverců) and the corresponding degrees of freedom:

SS_T = ‖Y − Ȳ 1_n‖² = Σ_{i=1}^n (Y_i − Ȳ)²,   ν_T = n − 1.

Lemma 5.6 (Model with intercept only). Let Y ∼ (1_n γ, ζ² I_n). Then

(i) Ŷ = Ȳ 1_n = (Ȳ, ..., Ȳ)^T.
(ii) SS_e = SS_T.

Proof. This is a full-rank model with X = 1_n. Further,

(X^T X)^{−1} = (1_n^T 1_n)^{−1} = 1/n,   X^T Y = 1_n^T Y = Σ_{i=1}^n Y_i.

Hence γ̂ = (1/n) Σ_{i=1}^n Y_i = Ȳ and Ŷ = X γ̂ = 1_n Ȳ = Ȳ 1_n. ∎

5.4.2 Models with intercept

Lemma 5.7 (Identity in a linear model with intercept). Let Y | X ∼ (Xβ, σ² I_n), where 1_n ∈ M(X). Then

1_n^T Y = Σ_{i=1}^n Y_i = Σ_{i=1}^n Ŷ_i = 1_n^T Ŷ.

Proof.
• The identity follows directly from the normal equations if 1_n is one of the columns of the matrix X.
• A general proof:

1_n^T Ŷ = Ŷ^T 1_n = (H Y)^T 1_n = Y^T H 1_n = Y^T 1_n,

since H 1_n = 1_n due to the fact that 1_n ∈ M(X). ∎

Theorem 5.8 (Breakdown of the total sum of squares in a linear model with intercept). Let Y | X ∼ (Xβ, σ² I_n), where 1_n ∈ M(X). Then

SS_T = SS_e + SS_R,   i.e.,   Σ_{i=1}^n (Y_i − Ȳ)² = Σ_{i=1}^n (Y_i − Ŷ_i)² + Σ_{i=1}^n (Ŷ_i − Ȳ)².

Proof. The identity SS_T = SS_e + SS_R follows trivially if r = rank(X) = 1, since then M(X) = M(1_n) and hence (by Lemma 5.6) Ŷ = Ȳ 1_n. Then SS_T = SS_e and SS_R = 0.

In the following, let r = rank(X) > 1. Then the model Y | X ∼ (1_n β0, σ² I_n) is a submodel of the model Y | X ∼ (Xβ, σ² I_n) and, by Lemma 5.6, SS_T = SS_e^0. Further, from the definition of SS_R, it equals SS_R = ‖D‖², where D = Ŷ − Ŷ^0. By point (iv) of Theorem 5.1 (On a submodel), ‖D‖² = SS_e^0 − SS_e. In other words, SS_R = SS_T − SS_e. ∎

The identity SS_T = SS_e + SS_R can also be shown directly using a little algebra. We have

SS_T = Σ_{i=1}^n (Y_i − Ȳ)² = Σ_{i=1}^n (Y_i − Ŷ_i + Ŷ_i − Ȳ)²
     = Σ_{i=1}^n (Y_i − Ŷ_i)² + Σ_{i=1}^n (Ŷ_i − Ȳ)² + 2 Σ_{i=1}^n (Y_i − Ŷ_i)(Ŷ_i − Ȳ)
     = SS_e + SS_R + 2 { Σ_{i=1}^n Y_i Ŷ_i − Ȳ Σ_{i=1}^n Y_i + Ȳ Σ_{i=1}^n Ŷ_i − Σ_{i=1}^n Ŷ_i² }
     = SS_e + SS_R,

since the term in braces is zero: Σ_{i=1}^n Y_i = Σ_{i=1}^n Ŷ_i and additionally

Σ_{i=1}^n Y_i Ŷ_i = Y^T Ŷ = Y^T H Y,   Σ_{i=1}^n Ŷ_i² = Ŷ^T Ŷ = Y^T H H Y = Y^T H Y. ∎

5.4.3 Evaluation of the prediction quality of the model

[End of Lecture #10 (05/11/2015); start of Lecture #12 (12/11/2015).]

One of the usual aims of regression modelling is so-called prediction, in which case the model based mean is used as the predicted response value. In such situations, it is assumed that the data (Y_i, X_i), i = 1, ..., n, are a random sample from some joint distribution of a generic random vector (Y, X) and that the conditional distribution Y | X can be described by a linear model Y | X ∼ (Xβ, σ² I_n), rank(X) = r, for the data. That is, the mean and the variance of the conditional distribution Y | X are given as E(Y | X) = X^T β and var(Y | X) = σ², respectively.

In the following, let E_Y and var_Y denote the expectation and the variance with respect to the marginal distribution of Y. The intercept-only model Y ∼ (1_n γ, ζ² I_n) for the data then corresponds to the marginal distribution of the response Y with E_Y Y = γ and var_Y Y = ζ².

Suppose now that a not yet observed random vector (Y_new, X_new) is also distributed as the generic random vector (Y, X), and assume that all parameters of the considered models are known. The aim is to provide a prediction Ŷ_new of the value of Y_new. To this end, the quality of the prediction is, most classically, evaluated by the mean squared error of prediction (MSEP; Czech: střední čtvercová chyba predikce) defined as

MSEP(Ŷ_new) = E_{Y,X} (Ŷ_new − Y_new)²,   (5.11)

where the symbol E_{Y,X} denotes the expectation with respect to the joint distribution of the random vector (Y, X). To predict the value of Y_new, we have basically two options, depending on whether the value of X_new (the covariates of the new observation) is or is not available to construct the prediction.

(i) If the value of X_new is not available, the prediction can only be based on the marginal (intercept-only) model, where the model based mean equals γ, which is also the prediction Ŷ_new^M of Y_new. That is, Ŷ_new^M = γ, and we get

MSEP(Ŷ_new^M) = E_{Y,X} (γ − Y_new)² = E_Y (γ − Y_new)² = var_Y Y_new = ζ².

(ii) If the value of X_new is available, the conditional (regression) model can be used, leading to the prediction of Y_new of the form

Ŷ_new^C = X_new^T β.

Let E_X denote the expectation with respect to the marginal distribution of the covariates X.
The MSEP then equals
\[
\mathrm{MSEP}\bigl(\widehat{Y}^{C}_{new}\bigr)
= \mathsf{E}_{Y,\boldsymbol{X}} \bigl(\boldsymbol{X}_{new}^{\top}\boldsymbol{\beta} - Y_{new}\bigr)^2
= \mathsf{E}_{\boldsymbol{X}} \Bigl[ \mathsf{E}\bigl\{ \bigl(\boldsymbol{X}_{new}^{\top}\boldsymbol{\beta} - Y_{new}\bigr)^2 \,\big|\, \boldsymbol{X}_{new} \bigr\} \Bigr]
= \mathsf{E}_{\boldsymbol{X}} \bigl\{ \operatorname{var}\bigl(Y_{new} \mid \boldsymbol{X}_{new}\bigr) \bigr\}
= \mathsf{E}_{\boldsymbol{X}}\, \sigma^2 = \sigma^2.
\]
Finally, we get
\[
\frac{\mathrm{MSEP}\bigl(\widehat{Y}^{C}_{new}\bigr)}{\mathrm{MSEP}\bigl(\widehat{Y}^{M}_{new}\bigr)} = \frac{\sigma^2}{\zeta^2}.
\]
That is, the ratio $\sigma^2/\zeta^2$ quantifies the advantage of using the prediction $\widehat{Y}^{C}_{new}$, based on the regression model and the covariate values $\boldsymbol{X}_{new}$, compared to using the prediction $\widehat{Y}^{M}_{new}$, which equals the marginal response expectation.

5.4.4 Coefficient of determination

To estimate the ratio $\sigma^2/\zeta^2$ between the conditional and the marginal response variances, we can straightforwardly consider the following. First,
\[
\frac{1}{\nu_T}\,\mathrm{SS}_T = \frac{1}{n-1} \sum_{i=1}^{n} \bigl(Y_i - \overline{Y}\bigr)^2
\]
is the standard sample variance based on the random sample $Y_1, \ldots, Y_n$. That is, it is an unbiased estimator of the marginal variance $\zeta^2$. Note that it is also the residual mean square from the intercept-only model $\bigl(\boldsymbol{1}_n\gamma,\, \zeta^2\mathbf{I}_n\bigr)$. Further,
\[
\frac{1}{\nu_e}\,\mathrm{SS}_e = \frac{1}{n-r} \sum_{i=1}^{n} \bigl(Y_i - \widehat{Y}_i\bigr)^2,
\]
which is the residual mean square from the considered linear model $\bigl(\mathbf{X}\boldsymbol{\beta},\, \sigma^2\mathbf{I}_n\bigr)$, is an unbiased estimator of the conditional variance $\sigma^2$. That is, a suitable estimator of the ratio $\sigma^2/\zeta^2$ is
\[
\frac{\frac{1}{n-r}\,\mathrm{SS}_e}{\frac{1}{n-1}\,\mathrm{SS}_T} = \frac{n-1}{n-r} \cdot \frac{\mathrm{SS}_e}{\mathrm{SS}_T}. \tag{5.12}
\]

Alternatively, if $Y \sim \mathcal{N}(\gamma, \zeta^2)$, that is, if $Y_1, \ldots, Y_n$ is a random sample from $\mathcal{N}(\gamma, \zeta^2)$, it can be (and was) easily derived that the quantity
\[
\frac{1}{n}\,\mathrm{SS}_T = \frac{1}{n} \sum_{i=1}^{n} \bigl(Y_i - \overline{Y}\bigr)^2
\]
is the maximum-likelihood estimator (MLE; in Czech: maximálně věrohodný odhad) of the marginal variance $\zeta^2$. Analogously, if $Y \mid \boldsymbol{X} \sim \mathcal{N}\bigl(\boldsymbol{X}^{\top}\boldsymbol{\beta},\, \sigma^2\bigr)$, it can be derived (see the exercise class) that the quantity
\[
\frac{1}{n}\,\mathrm{SS}_e = \frac{1}{n} \sum_{i=1}^{n} \bigl(Y_i - \widehat{Y}_i\bigr)^2
\]
is the MLE of the conditional variance $\sigma^2$. An alternative estimator of the ratio $\sigma^2/\zeta^2$ is then
\[
\frac{\frac{1}{n}\,\mathrm{SS}_e}{\frac{1}{n}\,\mathrm{SS}_T} = \frac{\mathrm{SS}_e}{\mathrm{SS}_T}. \tag{5.13}
\]

Remember that in the model $\boldsymbol{Y} \mid \mathbf{X} \sim \bigl(\mathbf{X}\boldsymbol{\beta},\, \sigma^2\mathbf{I}_n\bigr)$ with intercept ($\boldsymbol{1}_n \in \mathcal{M}(\mathbf{X})$), we have
\[
\underbrace{\sum_{i=1}^{n} \bigl(Y_i - \overline{Y}\bigr)^2}_{\mathrm{SS}_T}
= \underbrace{\sum_{i=1}^{n} \bigl(Y_i - \widehat{Y}_i\bigr)^2}_{\mathrm{SS}_e}
+ \underbrace{\sum_{i=1}^{n} \bigl(\widehat{Y}_i - \overline{Y}\bigr)^2}_{\mathrm{SS}_R},
\]
where the three sums of squares represent different sources of the response variability:

$\mathrm{SS}_T$ (total sum of squares): the original (marginal) variability of the response;

$\mathrm{SS}_e$ (residual sum of squares): the variability not explained by the regression model (residual variability, conditional variability);

$\mathrm{SS}_R$ (regression sum of squares): the variability explained by the regression model.

Expressions (5.12) and (5.13) then motivate the following definition.

Definition 5.5 (Coefficients of determination). Consider a linear model $\boldsymbol{Y} \mid \mathbf{X} \sim \bigl(\mathbf{X}\boldsymbol{\beta},\, \sigma^2\mathbf{I}_n\bigr)$, $\operatorname{rank}(\mathbf{X}) = r$, where $\boldsymbol{1}_n \in \mathcal{M}(\mathbf{X})$. The value
\[
R^2 = 1 - \frac{\mathrm{SS}_e}{\mathrm{SS}_T}
\]
is called the coefficient of determination (koeficient determinace) of the linear model. The value
\[
R^2_{adj} = 1 - \frac{n-1}{n-r} \cdot \frac{\mathrm{SS}_e}{\mathrm{SS}_T}
\]
is called the adjusted coefficient of determination (upravený koeficient determinace) of the linear model.

Notes.

• By Theorem 5.8, $\mathrm{SS}_T = \mathrm{SS}_e + \mathrm{SS}_R$, where both $\mathrm{SS}_e \geq 0$ and $\mathrm{SS}_R \geq 0$. Hence $0 \leq R^2 \leq 1$ and $R^2_{adj} \leq 1$ (note that $R^2_{adj}$ may be negative since $(n-1)/(n-r) \geq 1$), and $R^2$ can also be expressed as
\[
R^2 = \frac{\mathrm{SS}_R}{\mathrm{SS}_T}.
\]

• Both $R^2$ and $R^2_{adj}$ are often reported as $R^2 \cdot 100\,\%$ and $R^2_{adj} \cdot 100\,\%$, which can be interpreted as the percentage of the response variability explained by the regression model.

• Both $R^2$ and $R^2_{adj}$ quantify the relative improvement of the quality of prediction when the regression model and the conditional distribution of the response given the covariates are used, compared to the prediction based on the marginal distribution of the response.

• Both coefficients of determination quantify only the predictive ability of the model. They do not say much about the quality of the model with respect to its capability to capture correctly the conditional mean $\mathsf{E}\bigl(Y \mid \boldsymbol{X}\bigr)$. Even a model with a low value of $R^2$ ($R^2_{adj}$) might be useful with respect to modelling the conditional mean $\mathsf{E}\bigl(Y \mid \boldsymbol{X}\bigr)$; the model is perhaps only useless for prediction purposes.

5.4.5 Overall F-test

Lemma 5.9 (Overall F-test). Assume a normal linear model $\boldsymbol{Y} \mid \mathbf{X} \sim \mathcal{N}_n\bigl(\mathbf{X}\boldsymbol{\beta},\, \sigma^2\mathbf{I}_n\bigr)$, $\operatorname{rank}(\mathbf{X}_{n\times k}) = r > 1$, where $\boldsymbol{1}_n \in \mathcal{M}(\mathbf{X})$. Let $R^2$ be its coefficient of determination. The submodel F-statistic to compare the model $\mathsf{M}\!: \boldsymbol{Y} \mid \mathbf{X} \sim \mathcal{N}_n\bigl(\mathbf{X}\boldsymbol{\beta},\, \sigma^2\mathbf{I}_n\bigr)$ and the intercept-only model $\mathsf{M}_0\!: \boldsymbol{Y} \mid \mathbf{X} \sim \mathcal{N}_n\bigl(\boldsymbol{1}_n\gamma,\, \sigma^2\mathbf{I}_n\bigr)$ takes the form
\[
F_0 = \frac{R^2}{1 - R^2} \cdot \frac{n-r}{r-1}. \tag{5.14}
\]

Proof.

• $R^2 = 1 - \dfrac{\mathrm{SS}_e}{\mathrm{SS}_T}$ and, according to Lemma 5.6, $\mathrm{SS}_T = \mathrm{SS}_e^0$.

• Hence
\[
R^2 = 1 - \frac{\mathrm{SS}_e}{\mathrm{SS}_e^0} = \frac{\mathrm{SS}_e^0 - \mathrm{SS}_e}{\mathrm{SS}_e^0},
\qquad
1 - R^2 = \frac{\mathrm{SS}_e}{\mathrm{SS}_e^0}.
\]

• At the same time,
\[
F_0 = \frac{\frac{\mathrm{SS}_e^0 - \mathrm{SS}_e}{r-1}}{\frac{\mathrm{SS}_e}{n-r}}
= \frac{n-r}{r-1} \cdot \frac{\mathrm{SS}_e^0 - \mathrm{SS}_e}{\mathrm{SS}_e}
= \frac{n-r}{r-1} \cdot \frac{\;\frac{\mathrm{SS}_e^0 - \mathrm{SS}_e}{\mathrm{SS}_e^0}\;}{\frac{\mathrm{SS}_e}{\mathrm{SS}_e^0}}
= \frac{n-r}{r-1} \cdot \frac{R^2}{1 - R^2}. \qquad \square
\]

Note. The F-test with the test statistic (5.14) is sometimes (especially in some software packages) referred to as an overall goodness-of-fit test. Nevertheless, be cautious when interpreting the results of such a test: it says practically nothing about the quality of the model and its “goodness-of-fit”!

Chapter 6 General Linear Model

We still assume that the data are represented by a set of $n$ random vectors $(Y_i, \boldsymbol{X}_i)$, $\boldsymbol{X}_i = \bigl(X_{i,0}, \ldots, X_{i,k-1}\bigr)^{\top}$, $i = 1, \ldots, n$, and use the symbol $\boldsymbol{Y}$ for the vector $\bigl(Y_1, \ldots, Y_n\bigr)^{\top}$ and $\mathbf{X}$ for the $n \times k$ matrix with rows given by the vectors $\boldsymbol{X}_1^{\top}, \ldots, \boldsymbol{X}_n^{\top}$. In this chapter, we mildly extend the linear model by allowing for a covariance matrix of a different form than the $\sigma^2\mathbf{I}_n$ assumed up to now.

Definition 6.1 (General linear model). The data $(Y_i, \boldsymbol{X}_i)$, $i = 1, \ldots, n$, satisfy a general linear model (obecný lineární model) if
\[
\mathsf{E}\bigl(\boldsymbol{Y} \mid \mathbf{X}\bigr) = \mathbf{X}\boldsymbol{\beta},
\qquad
\operatorname{var}\bigl(\boldsymbol{Y} \mid \mathbf{X}\bigr) = \sigma^2 \mathbf{W}^{-1},
\]
where $\boldsymbol{\beta} \in \mathbb{R}^k$ and $0 < \sigma^2 < \infty$ are unknown parameters and $\mathbf{W}$ is a known positive definite matrix.

Notes.

• The fact that the data follow a general linear model is denoted as $\boldsymbol{Y} \mid \mathbf{X} \sim \bigl(\mathbf{X}\boldsymbol{\beta},\, \sigma^2\mathbf{W}^{-1}\bigr)$.
• The general linear model should not be confused with a generalized linear model (zobecněný lineární model), which is something different (see the Advanced Regression Models (NMST432) course). In the literature, the abbreviation “GLM” is (unfortunately) used for both the general and the generalized linear model. It must be clear from context which of the two is meant.

Example 6.1 (Regression based on sample means). Suppose that the data are represented by random vectors
\[
\bigl(\widetilde{Y}_{1,1}, \ldots, \widetilde{Y}_{1,w_1},\, \boldsymbol{X}_1\bigr), \;\ldots,\; \bigl(\widetilde{Y}_{n,1}, \ldots, \widetilde{Y}_{n,w_n},\, \boldsymbol{X}_n\bigr)
\]
such that for each $i = 1, \ldots, n$, the random variables $\widetilde{Y}_{i,1}, \ldots, \widetilde{Y}_{i,w_i}$ are uncorrelated with a common conditional (given $\boldsymbol{X}_i$) variance $\sigma^2$. Suppose that we are only able to observe the sample means of the “$\widetilde{Y}$” variables, leading to the response variables $Y_1, \ldots, Y_n$, where
\[
Y_1 = \frac{1}{w_1} \sum_{j=1}^{w_1} \widetilde{Y}_{1,j}, \quad \ldots, \quad Y_n = \frac{1}{w_n} \sum_{j=1}^{w_n} \widetilde{Y}_{n,j}.
\]
The covariance matrix (conditional, given $\mathbf{X}$) of the random vector $\boldsymbol{Y} = \bigl(Y_1, \ldots, Y_n\bigr)^{\top}$ is then
\[
\operatorname{var}(\boldsymbol{Y}) := \operatorname{var}\bigl(\boldsymbol{Y} \mid \mathbf{X}\bigr)
= \sigma^2 \underbrace{\begin{pmatrix} \frac{1}{w_1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{1}{w_n} \end{pmatrix}}_{\mathbf{W}^{-1}}.
\]

Theorem 6.1 (Generalized least squares). Assume a general linear model $\boldsymbol{Y} \mid \mathbf{X} \sim \bigl(\mathbf{X}\boldsymbol{\beta},\, \sigma^2\mathbf{W}^{-1}\bigr)$, where $\operatorname{rank}(\mathbf{X}_{n\times k}) = r \leq k < n$. The following then holds:

(i) The vector
\[
\widehat{\boldsymbol{Y}}_G := \mathbf{X}\bigl(\mathbf{X}^{\top}\mathbf{W}\mathbf{X}\bigr)^{-}\mathbf{X}^{\top}\mathbf{W}\boldsymbol{Y}
\]
is the best linear unbiased estimator (BLUE) of the vector parameter $\boldsymbol{\mu} := \mathsf{E}(\boldsymbol{Y}) = \mathbf{X}\boldsymbol{\beta}$, and
\[
\operatorname{var}\bigl(\widehat{\boldsymbol{Y}}_G\bigr) = \sigma^2 \mathbf{X}\bigl(\mathbf{X}^{\top}\mathbf{W}\mathbf{X}\bigr)^{-}\mathbf{X}^{\top}.
\]
Neither $\widehat{\boldsymbol{Y}}_G$ nor $\operatorname{var}\bigl(\widehat{\boldsymbol{Y}}_G\bigr)$ depends on the choice of the pseudoinverse $\bigl(\mathbf{X}^{\top}\mathbf{W}\mathbf{X}\bigr)^{-}$. If further $\boldsymbol{Y} \mid \mathbf{X} \sim \mathcal{N}_n\bigl(\mathbf{X}\boldsymbol{\beta},\, \sigma^2\mathbf{W}^{-1}\bigr)$, then
\[
\widehat{\boldsymbol{Y}}_G \mid \mathbf{X} \sim \mathcal{N}_n\Bigl(\mathbf{X}\boldsymbol{\beta},\; \sigma^2 \mathbf{X}\bigl(\mathbf{X}^{\top}\mathbf{W}\mathbf{X}\bigr)^{-}\mathbf{X}^{\top}\Bigr).
\]

(ii) Let $\boldsymbol{l} \in \mathbb{R}^k$, $\boldsymbol{l} \neq \boldsymbol{0}_k$, be such that $\theta = \boldsymbol{l}^{\top}\boldsymbol{\beta}$ is an estimable parameter of the model, and let $\boldsymbol{b}_G := \bigl(\mathbf{X}^{\top}\mathbf{W}\mathbf{X}\bigr)^{-}\mathbf{X}^{\top}\mathbf{W}\boldsymbol{Y}$. Then $\widehat{\theta}_G = \boldsymbol{l}^{\top}\boldsymbol{b}_G$ does not depend on the choice of the pseudoinverse used to calculate $\boldsymbol{b}_G$, and $\widehat{\theta}_G$ is the best linear unbiased estimator (BLUE) of $\theta$ with
\[
\operatorname{var}\bigl(\widehat{\theta}_G\bigr) = \sigma^2 \boldsymbol{l}^{\top}\bigl(\mathbf{X}^{\top}\mathbf{W}\mathbf{X}\bigr)^{-}\boldsymbol{l},
\]
which also does not depend on the choice of the pseudoinverse. If further $\boldsymbol{Y} \mid \mathbf{X} \sim \mathcal{N}_n\bigl(\mathbf{X}\boldsymbol{\beta},\, \sigma^2\mathbf{W}^{-1}\bigr)$, then
\[
\widehat{\theta}_G \mid \mathbf{X} \sim \mathcal{N}\Bigl(\theta,\; \sigma^2 \boldsymbol{l}^{\top}\bigl(\mathbf{X}^{\top}\mathbf{W}\mathbf{X}\bigr)^{-}\boldsymbol{l}\Bigr).
\]
(iii) If further $r = k$ (a full-rank general linear model), then
\[
\widehat{\boldsymbol{\beta}}_G := \bigl(\mathbf{X}^{\top}\mathbf{W}\mathbf{X}\bigr)^{-1}\mathbf{X}^{\top}\mathbf{W}\boldsymbol{Y}
\]
is the best linear unbiased estimator (BLUE) of $\boldsymbol{\beta}$ with
\[
\operatorname{var}\bigl(\widehat{\boldsymbol{\beta}}_G\bigr) = \sigma^2 \bigl(\mathbf{X}^{\top}\mathbf{W}\mathbf{X}\bigr)^{-1}.
\]
If additionally $\boldsymbol{Y} \mid \mathbf{X} \sim \mathcal{N}_n\bigl(\mathbf{X}\boldsymbol{\beta},\, \sigma^2\mathbf{W}^{-1}\bigr)$, then
\[
\widehat{\boldsymbol{\beta}}_G \mid \mathbf{X} \sim \mathcal{N}_k\Bigl(\boldsymbol{\beta},\; \sigma^2 \bigl(\mathbf{X}^{\top}\mathbf{W}\mathbf{X}\bigr)^{-1}\Bigr).
\]

(iv) The statistic
\[
\mathrm{MS}_{e,G} := \frac{\mathrm{SS}_{e,G}}{n-r},
\qquad \text{where} \quad
\mathrm{SS}_{e,G} := \bigl\| \mathbf{W}^{\frac{1}{2}} \bigl(\boldsymbol{Y} - \widehat{\boldsymbol{Y}}_G\bigr) \bigr\|^2
= \bigl(\boldsymbol{Y} - \widehat{\boldsymbol{Y}}_G\bigr)^{\top} \mathbf{W} \bigl(\boldsymbol{Y} - \widehat{\boldsymbol{Y}}_G\bigr),
\]
is an unbiased estimator of the residual variance $\sigma^2$. If additionally $\boldsymbol{Y} \mid \mathbf{X} \sim \mathcal{N}_n\bigl(\mathbf{X}\boldsymbol{\beta},\, \sigma^2\mathbf{W}^{-1}\bigr)$, then
\[
\frac{\mathrm{SS}_{e,G}}{\sigma^2} \sim \chi^2_{n-r},
\]
and the statistics $\mathrm{SS}_{e,G}$ and $\widehat{\boldsymbol{Y}}_G$ are conditionally, given $\mathbf{X}$, independent.

Proof. The proof/calculations were available on the blackboard in K1. $\square$

Note. Note also that, as a consequence of the above theorem, all classical tests, confidence intervals, etc., work in the same way as in the OLS case.

Terminology (Generalized fitted values, residual sum of squares, mean square, least squares estimator).

• The statistic $\widehat{\boldsymbol{Y}}_G = \mathbf{X}\bigl(\mathbf{X}^{\top}\mathbf{W}\mathbf{X}\bigr)^{-}\mathbf{X}^{\top}\mathbf{W}\boldsymbol{Y}$ is called the vector of generalized fitted values (zobecněné vyrovnané hodnoty).

• The statistic $\mathrm{SS}_{e,G} = \bigl\| \mathbf{W}^{\frac{1}{2}} \bigl(\boldsymbol{Y} - \widehat{\boldsymbol{Y}}_G\bigr) \bigr\|^2 = \bigl(\boldsymbol{Y} - \widehat{\boldsymbol{Y}}_G\bigr)^{\top} \mathbf{W} \bigl(\boldsymbol{Y} - \widehat{\boldsymbol{Y}}_G\bigr)$ is called the generalized residual sum of squares (zobecněný reziduální součet čtverců).

• The statistic $\mathrm{MS}_{e,G} = \dfrac{\mathrm{SS}_{e,G}}{n-r}$ is called the generalized mean square (zobecněný střední čtverec).

• The statistic $\widehat{\boldsymbol{\beta}}_G = \bigl(\mathbf{X}^{\top}\mathbf{W}\mathbf{X}\bigr)^{-1}\mathbf{X}^{\top}\mathbf{W}\boldsymbol{Y}$ in a full-rank general linear model is called the generalized least squares (GLS) estimator (odhad metodou zobecněných nejmenších čtverců) of the regression coefficients.

Note. The most common use of generalized least squares is the situation described in Example 6.1, where
\[
\mathbf{W}^{-1} = \begin{pmatrix} \frac{1}{w_1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{1}{w_n} \end{pmatrix}.
\]
We then get
\[
\mathbf{X}^{\top}\mathbf{W}\boldsymbol{Y} = \sum_{i=1}^{n} w_i Y_i \boldsymbol{X}_i,
\qquad
\mathbf{X}^{\top}\mathbf{W}\mathbf{X} = \sum_{i=1}^{n} w_i \boldsymbol{X}_i \boldsymbol{X}_i^{\top},
\qquad
\mathrm{SS}_{e,G} = \sum_{i=1}^{n} w_i \bigl(Y_i - \widehat{Y}_{G,i}\bigr)^2.
\]
The method of generalized least squares is then usually referred to as the method of weighted least squares (WLS; vážené nejmenší čtverce).

[End of Lecture #12 (12/11/2015)]

Chapter 7 Parameterizations of Covariates

[Start of Lecture #5 (15/10/2015)]

7.1 Linearization of the dependence of the response on the covariates

As usual in this lecture, we represent the data by $n$ random vectors $(Y_i, \boldsymbol{Z}_i)$, $\boldsymbol{Z}_i = \bigl(Z_{i,1}, \ldots, Z_{i,p}\bigr)^{\top} \in \mathcal{Z} \subseteq \mathbb{R}^p$, $i = 1, \ldots, n$. The principal problem we consider is to find a suitable model to express the (conditional) response expectation $\mathsf{E}(\boldsymbol{Y}) := \mathsf{E}\bigl(\boldsymbol{Y} \mid \mathbf{Z}\bigr)$, where $\mathbf{Z}$ is the matrix with the vectors $\boldsymbol{Z}_1^{\top}, \ldots, \boldsymbol{Z}_n^{\top}$ in its rows. To this end, we consider a linear model where $\mathsf{E}(\boldsymbol{Y})$ can be expressed as $\mathsf{E}(\boldsymbol{Y}) = \mathbf{X}\boldsymbol{\beta}$ for some $\boldsymbol{\beta} \in \mathbb{R}^k$, where
\[
\mathbf{X} = \begin{pmatrix} \boldsymbol{X}_1^{\top} \\ \vdots \\ \boldsymbol{X}_n^{\top} \end{pmatrix},
\qquad
\boldsymbol{X}_i = \bigl(X_{i,0}, \ldots, X_{i,k-1}\bigr)^{\top} = \boldsymbol{t}(\boldsymbol{Z}_i), \quad i = 1, \ldots, n,
\]
and
\[
\boldsymbol{t}: \mathcal{Z} \longrightarrow \mathcal{X} \subseteq \mathbb{R}^k,
\qquad
\boldsymbol{t}(\boldsymbol{z}) = \bigl(t_0(\boldsymbol{z}), \ldots, t_{k-1}(\boldsymbol{z})\bigr)^{\top} = \bigl(x_0, \ldots, x_{k-1}\bigr)^{\top} = \boldsymbol{x},
\]
is a suitable transformation of the original covariates that linearizes the relationship between the response expectation and the covariates. The corresponding regression function is then
\[
m(\boldsymbol{z}) = \beta_0 t_0(\boldsymbol{z}) + \cdots + \beta_{k-1} t_{k-1}(\boldsymbol{z}), \qquad \boldsymbol{z} \in \mathcal{Z}. \tag{7.1}
\]
One of the main problems of a regression analysis is to find a reasonable form of the transformation $\boldsymbol{t}$, so as to obtain a model that is perhaps wrong but at least useful to capture sufficiently the form of $\mathsf{E}(\boldsymbol{Y})$ and, in general, to express $\mathsf{E}\bigl(Y \mid \boldsymbol{Z} = \boldsymbol{z}\bigr)$, $\boldsymbol{z} \in \mathcal{Z}$, for a generic response $Y$ generated, given the covariate value $\boldsymbol{Z} = \boldsymbol{z}$, by the same probabilistic mechanism as the original data.

7.2 Parameterization of a single covariate

In this and the two following sections, we first limit ourselves to the situation of a single covariate, i.e., $p = 1$, $\mathcal{Z} \subseteq \mathbb{R}$, and show some classical choices of transformations that are used in practical analyses when attempting to find a useful linear model.
7.2.1 Parameterization

Our aim is to propose transformations $\boldsymbol{t}: \mathcal{Z} \longrightarrow \mathbb{R}^k$, $\boldsymbol{t}(z) = \bigl(t_0(z), \ldots, t_{k-1}(z)\bigr)^{\top}$, such that the regression function (7.1) can possibly provide a useful model for the response expectation $\mathsf{E}\bigl(Y \mid Z = z\bigr)$. Furthermore, in most cases we limit ourselves to transformations that lead to a linear model with intercept. In such cases, the regression function will be
\[
m(z) = \beta_0 + \beta_1 s_1(z) + \cdots + \beta_{k-1} s_{k-1}(z), \qquad z \in \mathcal{Z}, \tag{7.2}
\]
where the non-intercept part of the transformation $\boldsymbol{t}$ is denoted by $\boldsymbol{s}$. That is, for $z \in \mathcal{Z}$, $j = 1, \ldots, k-1$,
\[
\boldsymbol{s}: \mathcal{Z} \longrightarrow \mathbb{R}^{k-1},
\qquad
\boldsymbol{s}(z) = \bigl(s_1(z), \ldots, s_{k-1}(z)\bigr)^{\top} = \bigl(t_1(z), \ldots, t_{k-1}(z)\bigr)^{\top},
\qquad
s_j(z) = t_j(z).
\]

Definition 7.1 (Parameterization of a covariate). Let $Z_1, \ldots, Z_n$ be the values of a given univariate covariate $Z \in \mathcal{Z} \subseteq \mathbb{R}$. By a parameterization of this covariate we mean

(i) a function $\boldsymbol{s}: \mathcal{Z} \longrightarrow \mathbb{R}^{k-1}$, $\boldsymbol{s}(z) = \bigl(s_1(z), \ldots, s_{k-1}(z)\bigr)^{\top}$, $z \in \mathcal{Z}$, where all $s_1, \ldots, s_{k-1}$ are non-constant functions on $\mathcal{Z}$, and

(ii) an $n \times (k-1)$ matrix $\mathbf{S}$, where
\[
\mathbf{S} = \begin{pmatrix} \boldsymbol{s}^{\top}(Z_1) \\ \vdots \\ \boldsymbol{s}^{\top}(Z_n) \end{pmatrix}
= \begin{pmatrix} s_1(Z_1) & \ldots & s_{k-1}(Z_1) \\ \vdots & \ddots & \vdots \\ s_1(Z_n) & \ldots & s_{k-1}(Z_n) \end{pmatrix}.
\]

Terminology (Reparameterizing matrix, regressors). The matrix $\mathbf{S}$ from Definition 7.1 is called the reparameterizing matrix (reparametrizační matice) of a covariate. Its columns, i.e., the vectors
\[
\boldsymbol{X}^1 = \begin{pmatrix} s_1(Z_1) \\ \vdots \\ s_1(Z_n) \end{pmatrix},
\quad \ldots, \quad
\boldsymbol{X}^{k-1} = \begin{pmatrix} s_{k-1}(Z_1) \\ \vdots \\ s_{k-1}(Z_n) \end{pmatrix},
\]
are called regressors (regresory).

Notes.

• The model matrix $\mathbf{X}$ of the model with the regression function (7.2) is
\[
\mathbf{X} = \bigl(\boldsymbol{1}_n,\, \mathbf{S}\bigr)
= \begin{pmatrix} 1 & X_{1,1} & \ldots & X_{1,k-1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n,1} & \ldots & X_{n,k-1} \end{pmatrix}
= \bigl(\boldsymbol{1},\, \boldsymbol{X}^1, \ldots, \boldsymbol{X}^{k-1}\bigr),
\]
with rows $\boldsymbol{X}_i^{\top} = \bigl(1, \boldsymbol{s}^{\top}(Z_i)\bigr)$ and elements $X_{i,j} = s_j(Z_i)$, $i = 1, \ldots, n$, $j = 1, \ldots, k-1$.

• Definition 7.1 is such that an intercept vector $\boldsymbol{1}_n$ (or a vector $c\,\boldsymbol{1}_n$, $c \in \mathbb{R}$) is (with a positive probability) not included in the reparameterizing matrix $\mathbf{S}$.
Nevertheless, it will be useful in some situations to consider parameterizations that (almost surely) include an intercept term in the corresponding regression space. That is, for some parameterizations (see the regression splines in Section 7.3.4), we will have $\boldsymbol{1}_n \in \mathcal{M}(\mathbf{S})$.

7.2.2 Covariate types

The covariate space $\mathcal{Z}$ and the corresponding univariate covariates $Z_1, \ldots, Z_n$ are usually of one of two types, and different parameterizations are useful depending on the covariate type. The two types are the following.

Numeric covariates

Numeric (numerické, příp. kvantitativní) covariates are covariates for which the ratio of two covariate values makes sense and a unit increase of the covariate value has an unambiguous meaning. A numeric covariate is usually of one of the two following subtypes:

(i) continuous, in which case $\mathcal{Z}$ is mostly an interval in $\mathbb{R}$. Such covariates usually have a physical interpretation and some units, whose choice must be taken into account when interpreting the results of the statistical analysis. Continuous numeric covariates are mostly (but not necessarily) represented by continuous random variables.

(ii) discrete, in which case $\mathcal{Z}$ is a countably infinite or a finite (but “large”) subset of $\mathbb{R}$. The most common instance of a discrete numeric covariate is a count (počet) with $\mathcal{Z} \subseteq \mathbb{N}_0$. Discrete numeric covariates are represented by discrete random variables.

Categorical covariates

Categorical (kategoriální, příp. kvalitativní) covariates (in the R software referred to as factors) are covariates for which the ratio of two covariate values does not necessarily make sense and a unit increase of the covariate value does not necessarily have an unambiguous meaning. The sample space $\mathcal{Z}$ is a finite (and mostly “small”) set, i.e., $\mathcal{Z} = \{\omega_1, \ldots, \omega_G\}$, where the values $\omega_1 < \cdots < \omega_G$ are somewhat arbitrarily chosen labels of the categories, used purely to obtain a mathematical representation of the covariate values. A categorical covariate is always represented by a discrete random variable.
Even for categorical covariates, it is useful to distinguish two subtypes:

(i) nominal (nominální), where from a practical point of view the chosen values $\omega_1, \ldots, \omega_G$ are completely arbitrary. Consequently, practically interpretable results and conclusions of any sensible statistical analysis should be invariant to the choice of $\omega_1, \ldots, \omega_G$. A nominal categorical covariate mostly represents membership in one of several groups (a group label), e.g., the region of residence.

(ii) ordinal (ordinální), where the ordering $\omega_1 < \cdots < \omega_G$ makes sense also from a practical point of view. An example is a school grade.

Notes.

• From the practical point of view, it is mainly important to distinguish numeric and categorical covariates.

• Often, an ordinal categorical covariate can also be viewed as a discrete numeric one. Whatever is applied in this lecture to discrete numeric covariates can also be applied to ordinal categorical covariates if it makes sense to interpret, at least to some extent, a unit increase of the covariate (and not only the ordering of the covariate values).

7.3 Numeric covariate

It is now assumed that $Z_i \in \mathcal{Z} \subseteq \mathbb{R}$, $i = 1, \ldots, n$, are numeric covariates. Our aim is now to propose sensible parameterizations of them.

7.3.1 Simple transformation of the covariate

The regression function is
\[
m(z) = \beta_0 + \beta_1 s(z), \qquad z \in \mathcal{Z}, \tag{7.3}
\]
where $s: \mathcal{Z} \longrightarrow \mathbb{R}$ is a suitable non-constant function. The corresponding reparameterizing matrix is
\[
\mathbf{S} = \begin{pmatrix} s(Z_1) \\ \vdots \\ s(Z_n) \end{pmatrix}.
\]
Due to interpretability issues, “simple” functions such as the identity, logarithm, exponential, square root, reciprocal, etc., are considered in place of the transformation $s$.
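As a numerical illustration, the model matrix $\mathbf{X} = (\boldsymbol{1}_n, \mathbf{S})$ with $s = \log$ can be built and fitted by least squares as in the following minimal sketch. It is written in Python with NumPy (the course itself works in R), and the data are simulated purely for illustration; all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)           # hypothetical simulated data
n = 200
Z = rng.uniform(1.0, 10.0, size=n)       # a single positive numeric covariate
beta0, beta1, sigma = 2.0, 1.5, 0.3
Y = beta0 + beta1 * np.log(Z) + rng.normal(0.0, sigma, size=n)

# reparameterizing matrix S = (s(Z_1), ..., s(Z_n))^T with s = log,
# stacked with the intercept into the model matrix X = (1_n, S)
X = np.column_stack([np.ones(n), np.log(Z)])

# least squares estimate of (beta_0, beta_1)
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

With at least two distinct covariate values among $Z_1, \ldots, Z_n$ and a strictly monotone $s$, the matrix `X` has full rank $r = k = 2$, in line with the rank condition discussed in this section, and `b` approaches $(\beta_0, \beta_1)$ as $n$ grows.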
Evaluation of the effect of the original covariate

An advantage of a model with the regression function (7.3) is the fact that a single regression coefficient $\beta_1$ (the slope in a model with the regression line in $x = s(z)$) quantifies the effect of the covariate on the response expectation, which can then be easily summarized by a single point estimate and a confidence interval. Evaluation of the statistical significance of the effect of the original covariate on the response expectation is achieved by testing the null hypothesis $\mathsf{H}_0\!: \beta_1 = 0$. A possible test procedure was introduced in Section 3.2.

Interpretation of the regression coefficients

A disadvantage is the fact that the slope $\beta_1$ expresses the change of the response expectation that corresponds to a unit change of the transformed covariate $X = s(Z)$, i.e., for $z \in \mathcal{Z}$,
\[
\beta_1 = \mathsf{E}\bigl(Y \mid X = s(z) + 1\bigr) - \mathsf{E}\bigl(Y \mid X = s(z)\bigr),
\]
which is not always easily interpretable. Moreover, unless the transformation $s$ is a linear function, the change in the response expectation that corresponds to a unit change of the original covariate is a function of that covariate:
\[
\mathsf{E}\bigl(Y \mid Z = z + 1\bigr) - \mathsf{E}\bigl(Y \mid Z = z\bigr) = \beta_1 \bigl\{ s(z+1) - s(z) \bigr\}, \qquad z \in \mathcal{Z}.
\]
In other words, a model with the regression function (7.3) and a non-linear transformation $s$ expresses the fact that the original covariate influences the response expectation differently depending on the value of this covariate.

Note. It is easily seen that if $n > k = 2$, the transformation $s$ is strictly monotone and the data contain at least two different values among $Z_1, \ldots, Z_n$ (which happens with probability one if the covariates $Z_i$ are sampled from a continuous distribution), then the model matrix $\mathbf{X} = \bigl(\boldsymbol{1}_n, \mathbf{S}\bigr)$ is of full rank $r = k = 2$.

7.3.2 Raw polynomials

The regression function is a polynomial of a chosen degree $k - 1$, i.e.,
\[
m(z) = \beta_0 + \beta_1 z + \cdots + \beta_{k-1} z^{k-1}, \qquad z \in \mathcal{Z}. \tag{7.4}
\]
The parameterization is
\[
\boldsymbol{s}: \mathcal{Z} \longrightarrow \mathbb{R}^{k-1},
\qquad
\boldsymbol{s}(z) = \bigl(z, \ldots, z^{k-1}\bigr)^{\top}, \quad z \in \mathcal{Z},
\]
and the corresponding reparameterizing matrix is
\[
\mathbf{S} = \begin{pmatrix} Z_1 & \ldots & Z_1^{k-1} \\ \vdots & \ddots & \vdots \\ Z_n & \ldots & Z_n^{k-1} \end{pmatrix}.
\]

Evaluation of the effect of the original covariate

The effect of the original covariate on the response expectation is now quantified by a set of $k - 1$ regression coefficients $\boldsymbol{\beta}^Z := \bigl(\beta_1, \ldots, \beta_{k-1}\bigr)^{\top}$. To evaluate the statistical significance of the effect of the original covariate on the response expectation, we have to test the null hypothesis $\mathsf{H}_0\!: \boldsymbol{\beta}^Z = \boldsymbol{0}_{k-1}$. An appropriate test procedure was introduced in Section 3.2.

Interpretation of the regression coefficients

With $k > 2$ (at least a quadratic regression function), the individual regression coefficients $\beta_1, \ldots, \beta_{k-1}$ only occasionally have a direct, reasonable interpretation. Analogously to a simple non-linear transformation of the covariate, the change in the response expectation that corresponds to a unit change of the original covariate is a function of that covariate:
\[
\mathsf{E}\bigl(Y \mid Z = z + 1\bigr) - \mathsf{E}\bigl(Y \mid Z = z\bigr)
= \beta_1 + \beta_2 \bigl\{(z+1)^2 - z^2\bigr\} + \cdots + \beta_{k-1} \bigl\{(z+1)^{k-1} - z^{k-1}\bigr\}, \qquad z \in \mathcal{Z}.
\]

Note. It is again easily seen that if $n > k$ and the data contain at least $k$ different values among $Z_1, \ldots, Z_n$ (which happens with probability one if the covariates $Z_i$ are sampled from a continuous distribution), then the model matrix $\bigl(\boldsymbol{1}_n, \mathbf{S}\bigr)$ is of full rank $r = k$.

Degree of the polynomial

A test on a subset of the regression coefficients (Section 3.2) or a submodel test (Section 5.2) can be used to infer on the degree of the polynomial in the regression function (7.4). The belief that the regression function is a polynomial of degree $d - 1$, for $d < k$, corresponds to the null hypothesis
\[
\mathsf{H}_0\!: \beta_d = 0 \;\&\; \ldots \;\&\; \beta_{k-1} = 0.
\]

7.3.3 Orthonormal polynomials

The regression function is again a polynomial of a chosen degree $k - 1$; nevertheless, a different basis of the regression space, i.e., a different parameterization of the polynomial, is used.
Namely, the regression function is
\[
m(z) = \beta_0 + \beta_1 P^1(z) + \cdots + \beta_{k-1} P^{k-1}(z), \qquad z \in \mathcal{Z}, \tag{7.5}
\]
where $P^j$ is an orthonormal polynomial of degree $j$, $j = 1, \ldots, k-1$, built above the set of covariate data points $Z_1, \ldots, Z_n$. That is,
\[
P^j(z) = a_{j,0} + a_{j,1} z + \cdots + a_{j,j} z^j, \qquad j = 1, \ldots, k-1, \tag{7.6}
\]
and the polynomial coefficients $a_{j,l}$, $j = 1, \ldots, k-1$, $l = 0, \ldots, j$, are such that the vectors
\[
\boldsymbol{P}^j = \begin{pmatrix} P^j(Z_1) \\ \vdots \\ P^j(Z_n) \end{pmatrix}, \qquad j = 1, \ldots, k-1,
\]
are all orthonormal and also orthogonal to the intercept vector $\boldsymbol{P}^0 = \bigl(1, \ldots, 1\bigr)^{\top}$. The corresponding reparameterizing matrix is
\[
\mathbf{S} = \bigl(\boldsymbol{P}^1, \ldots, \boldsymbol{P}^{k-1}\bigr)
= \begin{pmatrix} P^1(Z_1) & \ldots & P^{k-1}(Z_1) \\ \vdots & \ddots & \vdots \\ P^1(Z_n) & \ldots & P^{k-1}(Z_n) \end{pmatrix}, \tag{7.7}
\]
which leads to the model matrix $\mathbf{X} = \bigl(\boldsymbol{1}_n, \mathbf{S}\bigr)$ that has all columns mutually orthogonal, the non-intercept columns having moreover a unit norm. For methods of calculating the coefficients of the polynomials (7.6), see lectures on linear algebra. Let us only mention here that as soon as the data contain at least $k$ different values among $Z_1, \ldots, Z_n$, those polynomial coefficients exist and are unique.

Note. For a given dataset and a given polynomial degree $k - 1$, the model matrix $\mathbf{X} = \bigl(\boldsymbol{1}_n, \mathbf{S}\bigr)$ based on the orthonormal polynomials provides the same regression space as the model matrix based on the raw polynomials. Hence, the two model matrices determine two equivalent linear models.

Advantages of orthonormal polynomials compared to raw polynomials

• All non-intercept columns of the model matrix have the same (unit) norm. Consequently, all non-intercept regression coefficients $\beta_1, \ldots, \beta_{k-1}$ are on the same scale. This may be helpful when evaluating the practical (not statistical!) importance of higher-degree polynomial terms.

• The matrix $\mathbf{X}^{\top}\mathbf{X}$ is the diagonal matrix $\operatorname{diag}(n, 1, \ldots, 1)$. Consequently, the covariance matrix $\operatorname{var}\bigl(\widehat{\boldsymbol{\beta}}\bigr)$ is also a diagonal matrix, i.e., the LSE of the regression coefficients are uncorrelated.
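One way to obtain such a basis numerically (a sketch in Python with NumPy; in R, the function poly() produces an orthonormal polynomial basis directly) is a QR decomposition of the raw (Vandermonde) polynomial matrix; the covariate values below are simulated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 4                               # cubic polynomial, degree k - 1 = 3
Z = np.sort(rng.uniform(-1.0, 1.0, size=n))

# raw polynomial model matrix: columns 1, z, z^2, z^3 (Vandermonde)
V = np.vander(Z, k, increasing=True)

# QR decomposition: the columns of Q are orthonormal and span the same
# regression space; the first column is proportional to 1_n, so dropping it
# leaves the reparameterizing matrix S = (P^1, ..., P^{k-1})
Q, _ = np.linalg.qr(V)
S = Q[:, 1:]
X = np.column_stack([np.ones(n), S])       # model matrix (1_n, S)

XtX = X.T @ X                              # equals diag(n, 1, ..., 1)
```

Since the non-intercept columns are orthonormal and orthogonal to $\boldsymbol{1}_n$, `XtX` is $\operatorname{diag}(n, 1, 1, 1)$ up to rounding, so the LSE of the non-intercept coefficients are uncorrelated, as stated above (the columns are determined only up to sign, which does not affect this property).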
Evaluation of the effect of the original covariate

The effect of the original covariate on the response expectation is again quantified by the set of $k - 1$ regression coefficients $\boldsymbol{\beta}^Z := \bigl(\beta_1, \ldots, \beta_{k-1}\bigr)^{\top}$. To evaluate the statistical significance of the effect of the original covariate on the response expectation, we have to test the null hypothesis $\mathsf{H}_0\!: \boldsymbol{\beta}^Z = \boldsymbol{0}_{k-1}$. See Section 3.2 for a possible test procedure.

Interpretation of the regression coefficients

The individual regression coefficients $\beta_1, \ldots, \beta_{k-1}$ do not usually have a direct, reasonable interpretation.

Degree of the polynomial

A test on a subset of the regression coefficients or a test on submodels (introduced in Sections 3.2 and 5.2, respectively) can again be used to infer on the degree of the polynomial in the regression function (7.5), in the same way as with the raw polynomials. The belief that the regression function is a polynomial of degree $d - 1$, for $d < k$, corresponds to the null hypothesis
\[
\mathsf{H}_0\!: \beta_d = 0 \;\&\; \ldots \;\&\; \beta_{k-1} = 0.
\]

7.3.4 Regression splines

Basis splines

The advantage of a polynomial regression function as introduced in Sections 7.3.2 and 7.3.3 is that it is smooth (has continuous derivatives of all orders) on the whole real line. Nevertheless, with least squares estimation, each data point affects the fitted regression function globally. This often leads to undesirable boundary effects, when the fitted regression function only poorly approximates the response expectation $\mathsf{E}\bigl(Y \mid Z = z\bigr)$ for values of $z$ close to the boundaries of the covariate space $\mathcal{Z}$. This can be avoided with so-called regression splines.

Definition 7.2 (Basis spline with distinct knots). Let $d \in \mathbb{N}_0$ and $\boldsymbol{\lambda} = \bigl(\lambda_1, \ldots, \lambda_{d+2}\bigr)^{\top} \in \mathbb{R}^{d+2}$, where $-\infty < \lambda_1 < \cdots < \lambda_{d+2} < \infty$. The basis spline of degree $d$ with distinct knots (bazický spline stupně $d$ se vzájemně různými uzly) $\boldsymbol{\lambda}$ is a function $B^d(z; \boldsymbol{\lambda})$, $z \in \mathbb{R}$, such that

(i) $B^d(z; \boldsymbol{\lambda}) = 0$ for $z \leq \lambda_1$ and $z \geq \lambda_{d+2}$;

(ii) on each of the intervals $(\lambda_j, \lambda_{j+1})$, $j = 1, \ldots, d+1$, $B^d(\cdot\,; \boldsymbol{\lambda})$ is a polynomial of degree $d$;

(iii) $B^d(\cdot\,; \boldsymbol{\lambda})$ has continuous derivatives up to order $d - 1$ on $\mathbb{R}$.

Notes.

• The basis spline with distinct knots is a piecewise (po částech) polynomial of degree $d$ on $(\lambda_1, \lambda_{d+2})$.

• The polynomial pieces are connected smoothly (to order $d - 1$) at the inner knots $\lambda_2, \ldots, \lambda_{d+1}$.

• At the boundary ($\lambda_1$ and $\lambda_{d+2}$), the polynomial pieces are connected smoothly (to order $d - 1$) with the zero constant.

Definition 7.3 (Basis spline with coincident left boundary knots). Let $d \in \mathbb{N}_0$, $1 < r < d + 2$ and $\boldsymbol{\lambda} = \bigl(\lambda_1, \ldots, \lambda_{d+2}\bigr)^{\top} \in \mathbb{R}^{d+2}$, where $-\infty < \lambda_1 = \cdots = \lambda_r < \cdots < \lambda_{d+2} < \infty$. The basis spline of degree $d$ with $r$ coincident left boundary knots (bazický spline stupně $d$ s $r$ překrývajícími se levými uzly) $\boldsymbol{\lambda}$ is a function $B^d(z; \boldsymbol{\lambda})$, $z \in \mathbb{R}$, such that

(i) $B^d(z; \boldsymbol{\lambda}) = 0$ for $z \leq \lambda_r$ and $z \geq \lambda_{d+2}$;

(ii) on each of the intervals $(\lambda_j, \lambda_{j+1})$, $j = r, \ldots, d+1$, $B^d(\cdot\,; \boldsymbol{\lambda})$ is a polynomial of degree $d$;

(iii) $B^d(\cdot\,; \boldsymbol{\lambda})$ has continuous derivatives up to order $d - 1$ on $(\lambda_r, \infty)$;

(iv) $B^d(\cdot\,; \boldsymbol{\lambda})$ has continuous derivatives up to order $d - r$ at $\lambda_r$.

Notes.

• The only qualitative difference between the basis spline with coincident left boundary knots and the basis spline with distinct knots is the fact that the former is smooth at the left boundary only to order $d - r$, compared to order $d - 1$ in the case of the basis spline with distinct knots.

• By mirroring Definition 7.3 to the right boundary, the basis spline with coincident right boundary knots is defined.

Basis B-splines

There are many ways to construct basis splines that satisfy the conditions of Definitions 7.2 and 7.3; see the Fundamentals of Numerical Mathematics (NMNM201) course. In statistics, so-called B-splines have proved to be extremely useful for regression purposes.
It goes beyond the scope of this lecture to explain their construction in detail; it is fully covered by the landmark books de Boor (1978, 2001) and Dierckx (1993), or in a compact way, e.g., by the paper Eilers and Marx (1996). For the purpose of this lecture, it is assumed that a routine is available to construct the basis B-splines of a given degree with given knots (e.g., the R function bs from the recommended package splines). An important property of the basis B-splines is that they are positive inside their support interval (a general basis spline can also attain negative values inside the support interval). That is, if $\boldsymbol{\lambda} = \bigl(\lambda_1, \ldots, \lambda_{d+2}\bigr)^{\top}$ is a set of knots (either distinct or coincident left or right) and $B^d(\cdot\,; \boldsymbol{\lambda})$ is a basis B-spline of degree $d$ built above the knots $\boldsymbol{\lambda}$, then
\[
B^d(z; \boldsymbol{\lambda}) > 0, \quad \lambda_1 < z < \lambda_{d+2},
\qquad
B^d(z; \boldsymbol{\lambda}) = 0, \quad z \leq \lambda_1 \text{ or } z \geq \lambda_{d+2}.
\]

Spline basis

Definition 7.4 (Spline basis). Let $d \in \mathbb{N}_0$, $k \geq d + 1$ and $\boldsymbol{\lambda} = \bigl(\lambda_1, \ldots, \lambda_{k-d+1}\bigr)^{\top} \in \mathbb{R}^{k-d+1}$, where $-\infty < \lambda_1 < \cdots < \lambda_{k-d+1} < \infty$. The spline basis (splinová báze) of degree $d$ with knots $\boldsymbol{\lambda}$ is the set of basis splines $B_1, \ldots, B_k$, where for $z \in \mathbb{R}$,
\[
\begin{aligned}
B_1(z) &= B^d\bigl(z;\, \underbrace{\lambda_1, \ldots, \lambda_1}_{(d+1)\times},\, \lambda_2\bigr),\\
B_2(z) &= B^d\bigl(z;\, \underbrace{\lambda_1, \ldots, \lambda_1}_{d\times},\, \lambda_2,\, \lambda_3\bigr),\\
&\;\;\vdots\\
B_d(z) &= B^d\bigl(z;\, \lambda_1, \lambda_1, \lambda_2, \ldots, \lambda_{d+1}\bigr),\\
B_{d+1}(z) &= B^d\bigl(z;\, \lambda_1, \lambda_2, \ldots, \lambda_{d+2}\bigr),\\
B_{d+2}(z) &= B^d\bigl(z;\, \lambda_2, \ldots, \lambda_{d+3}\bigr),\\
&\;\;\vdots\\
B_{k-d}(z) &= B^d\bigl(z;\, \lambda_{k-2d}, \ldots, \lambda_{k-d+1}\bigr),\\
B_{k-d+1}(z) &= B^d\bigl(z;\, \lambda_{k-2d+1}, \ldots, \underbrace{\lambda_{k-d+1}, \lambda_{k-d+1}}_{2\times}\bigr),\\
&\;\;\vdots\\
B_{k-1}(z) &= B^d\bigl(z;\, \lambda_{k-d-1}, \lambda_{k-d}, \underbrace{\lambda_{k-d+1}, \ldots, \lambda_{k-d+1}}_{d\times}\bigr),\\
B_k(z) &= B^d\bigl(z;\, \lambda_{k-d}, \underbrace{\lambda_{k-d+1}, \ldots, \lambda_{k-d+1}}_{(d+1)\times}\bigr).
\end{aligned}
\]

Properties of the B-spline basis

If $k \geq d + 1$, a set of knots $\boldsymbol{\lambda} = \bigl(\lambda_1, \ldots, \lambda_{k-d+1}\bigr)^{\top}$, $-\infty < \lambda_1 < \cdots < \lambda_{k-d+1} < \infty$, is given and $B_1, \ldots, B_k$ is the spline basis of degree $d$ with knots $\boldsymbol{\lambda}$ composed of basis B-splines, then

(a)
\[
\sum_{j=1}^{k} B_j(z) = 1 \quad \text{for all } z \in \bigl[\lambda_1, \lambda_{k-d+1}\bigr]; \tag{7.8}
\]

(b) for each $m \leq d$ there exists a set of coefficients $\gamma_1^m, \ldots, \gamma_k^m$ such that
\[
\sum_{j=1}^{k} \gamma_j^m B_j(z) \text{ is, on } \bigl(\lambda_1, \lambda_{k-d+1}\bigr), \text{ a polynomial in } z \text{ of degree } m. \tag{7.9}
\]

Regression spline

It will now be assumed that the covariate space is a bounded interval, i.e., $\mathcal{Z} = \bigl[z_{min}, z_{max}\bigr]$, $-\infty < z_{min} < z_{max} < \infty$. The regression function that exploits regression splines is
\[
m(z) = \beta_1 B_1(z) + \cdots + \beta_k B_k(z), \qquad z \in \mathcal{Z}, \tag{7.10}
\]
where $B_1, \ldots, B_k$ is the spline basis of a chosen degree $d \in \mathbb{N}_0$, composed of basis B-splines built above a set of chosen knots $\boldsymbol{\lambda} = \bigl(\lambda_1, \ldots, \lambda_{k-d+1}\bigr)^{\top}$, $z_{min} = \lambda_1 < \cdots < \lambda_{k-d+1} = z_{max}$. The corresponding reparameterizing matrix coincides with the model matrix and is
\[
\mathbf{X} = \mathbf{S} = \begin{pmatrix} B_1(Z_1) & \ldots & B_k(Z_1) \\ \vdots & \ddots & \vdots \\ B_1(Z_n) & \ldots & B_k(Z_n) \end{pmatrix} =: \mathbf{B}. \tag{7.11}
\]

[End of Lecture #5 (15/10/2015); start of Lecture #7 (22/10/2015)]

Notes.

• It follows from (7.8) that $\boldsymbol{1}_n \in \mathcal{M}(\mathbf{B})$. This is also the reason why we do not explicitly include the intercept term in the regression function: it is implicitly included in the regression space. For clarity of notation, the regression coefficients are now indexed from 1 to $k$; that is, the vector of regression coefficients is $\boldsymbol{\beta} = \bigl(\beta_1, \ldots, \beta_k\bigr)^{\top}$.

• It also follows from (7.9) that for any $m \leq d$, a linear model with the regression function based on either raw or orthonormal polynomials of degree $m$ is a submodel of the linear model with the regression function given by a regression spline and the model matrix $\mathbf{B}$.

• With $d = 0$, the regression spline (7.10) is simply a piecewise constant function.

• In practice, not much attention is paid to the choice of the degree $d$ of the regression spline. Usually $d = 2$ (quadratic spline) or $d = 3$ (cubic spline) is used, which provides continuous first or second derivatives, respectively, of the regression function inside the covariate domain $\mathcal{Z}$.
• On the other hand, the placement of the knots (selection of the values $\lambda_1, \ldots, \lambda_{k-d+1}$) is quite important for obtaining a regression function that sufficiently well approximates the response expectations $\mathsf{E}\bigl(Y \mid Z = z\bigr)$, $z \in \mathcal{Z}$. Unfortunately, only relatively ad-hoc methods of knot selection will be demonstrated during this lecture, as profound methods of knot selection go far beyond the scope of this course.

Advantages of regression splines compared to raw/orthonormal polynomials

• Each data point influences the LSE of the regression coefficients, and hence the fitted regression function, only locally. Indeed, only the LSE of those regression coefficients that correspond to basis splines whose supports cover a specific data point are influenced by that data point.

• Regression splines of even a low degree $d$ (2 or 3) are, with a suitable choice of knots, able to approximate sufficiently well even functions with a highly variable curvature, and that globally on the whole interval $\mathcal{Z}$.

Evaluation of the effect of the original covariate

To evaluate the statistical significance of the effect of the original covariate on the response expectation, we have to test the null hypothesis $\mathsf{H}_0\!: \beta_1 = \cdots = \beta_k$. Due to property (7.8), this null hypothesis corresponds to assuming that $\mathsf{E}\bigl(\boldsymbol{Y} \mid \boldsymbol{Z}\bigr) \in \mathcal{M}(\boldsymbol{1}_n) \subset \mathcal{M}(\mathbf{B})$. Consequently, it is possible to use the test on submodels introduced in Section 5.1 to test the above null hypothesis.

Interpretation of the regression coefficients

The individual regression coefficients $\beta_1, \ldots, \beta_k$ do not usually have a direct, reasonable interpretation.

7.4 Categorical covariate

In this section, it is assumed that $Z_i \in \mathcal{Z}$, $i = 1, \ldots, n$, are categorical covariates. That is, the covariate sample space $\mathcal{Z}$ is finite and its elements are understood only as labels. Without loss of generality, we will use, unless stated otherwise, the simple sequence $1, \ldots, G$ for those labels, i.e., $\mathcal{Z} = \{1, \ldots, G\}$.
Unless explicitely stated (in Section 7.4.4), even the ordering of the labels 1 < · · · < G will not be used for any but notational purposes and the methodology described below is then suitable for both nominal and ordinal categorical covariates. The regression function, m : Z −→ R is now a function defined on a finite set aiming in parameterizing just G (conditional) response expectations E Y Z = 1 , . . . , E Y Z = G . For some clarity in notation, we will also use symbols m1 , . . . , mG for those expectations, i.e., m(1) = E Y Z = 1 =: m1 , .. .. . . m(G) = E Y Z = G =: mG . Notation and terminology (One-way classified group means). Since a categorical covariate often indicates pertinence to one of G groups, we will call m1 , . . . , mG as group means12 or one-way classified group means A vector m = m1 , . . . , mG will be called a vector of group means,13 or a vector of one-way classified group means. Note. Perhaps appealing simple regression function of the form m(z) = β0 + β1 z, z = 1, . . . , G, is in most cases fully inappropriate. First, it orders ad-hoc the group means to form a monotone sequence (increasing if β1 > 0, decreasing if β1 < 0). Second, it ad-hoc assumes a linear relationship between the group means. Both those properties also depend on the ordering or even the values of the labels (1, . . . , G in our case) assigned to the G categories at hand. With a nominal categorical covariate, none of it is justifiable, with an ordinal categorical covariate, such assumptions should, at least, never be taken for granted and used without proper verification. 7.4.1 Link to a G-sample problem For following considerations, we will additionally assume (again without loss of generality) that the data Yi , Zi , i = 1, . . . , n, are sorted according to the covariate values Z1 , . . . , Zn . 
Furthermore, we will also interchangeably use a double subscript with the response, where the first subscript indicates the covariate value, i.e.,

Z = (Z1 , . . . , Zn)⊤ = (1, . . . , 1, . . . , G, . . . , G)⊤  (the value g repeated ng times, g = 1, . . . , G),

Y = (Y1 , . . . , Yn)⊤ = (Y1,1 , . . . , Y1,n1 , . . . , YG,1 , . . . , YG,nG)⊤.

12 skupinové střední hodnoty  13 vektor skupinových středních hodnot

Finally, let Y_g = (Yg,1 , . . . , Yg,ng)⊤, g = 1, . . . , G, denote the subvector of the response vector that corresponds to the observations with the covariate value equal to g. That is, Y = (Y1 , . . . , Yn)⊤ = (Y_1⊤ , . . . , Y_G⊤)⊤. A related regression model written using the above introduced notation and the error terms is then

Yg,j = mg + εg,j ,  ε := (ε1,1 , . . . , εG,nG)⊤,  E(ε) = 0n ,  var(ε) = σ² In .  (7.12)

Notes.
• If the covariates Z1 , . . . , Zn are random then n1 , . . . , nG are random as well.
• If the covariates Z1 , . . . , Zn are fixed and the errors ε1,1 , . . . , εG,nG in (7.12) are assumed to be independent (possibly also identically distributed) then (7.12) is a "regression" parameterization of a classical G-sample problem, where n1 , . . . , nG are the sample sizes of the individual samples. The observations of sample g, g = 1, . . . , G, are given by the vector Y_g, the expected value in sample g is given by mg and the variance in sample g is given by σ². Note that if model (7.12) is assumed then it is assumed that the G samples are homoscedastic, i.e., all have the same variance.
• In the following, it is always assumed that n1 > 0, . . . , nG > 0.

7.4.2 Linear model parameterization of one-way classified group means

As usual, let µ be the (conditional) response expectation, i.e.,

µ := E(Y | Z) = (µ1,1 , . . . , µ1,n1 , . . . , µG,1 , . . . , µG,nG)⊤ = (m1 1n1⊤ , . . . , mG 1nG⊤)⊤.  (7.13)

Notation and terminology (Regression space of a categorical covariate). The vector space

{ (m1 1n1⊤ , . . . , mG 1nG⊤)⊤ : m1 , . . . , mG ∈ R } ⊆ Rn

will be called the regression space of a categorical covariate (factor) with level frequencies n1 , . . . , nG and will be denoted as MF(n1 , . . . , nG).

Note. Obviously, with n1 > 0, . . . , nG > 0, the vector dimension of MF(n1 , . . . , nG) is equal to G and a possible (orthogonal) vector basis is given by the columns of the n × G matrix

Q = (1n1 ⊗ (1, 0, . . . , 0) ; . . . ; 1nG ⊗ (0, . . . , 0, 1)),  (7.14)

i.e., the g-th column of Q is the 0/1 indicator of the g-th group.

When using the linear model, we are trying to express the response expectation µ, i.e., a vector from MF(n1 , . . . , nG), as a linear combination of the columns of a suitable n × k matrix X, i.e., as µ = Xβ, β ∈ Rk. It is obvious that any model matrix that parameterizes the regression space MF(n1 , . . . , nG) must have at least G columns, i.e., k ≥ G, and must be of the type

X = (1n1 ⊗ x1⊤ ; . . . ; 1nG ⊗ xG⊤),  (7.15)

where x1 , . . . , xG ∈ Rk are suitable vectors. The problem of parameterizing a categorical covariate with G levels thus simplifies into selecting the G × k matrix X̃ whose rows are x1⊤, . . . , xG⊤, i.e., X̃ = (x1⊤ ; . . . ; xG⊤). Clearly,

rank(X) = rank(X̃).

Hence, to be able to parameterize the regression space MF(n1 , . . . , nG), which has vector dimension G, the matrix X̃ must satisfy rank(X̃) = G. The group means then depend on the vector β = (β0 , . . . , βk−1)⊤ of the regression coefficients as

mg = xg⊤ β,  g = 1, . . . , G,  m = X̃ β.

A possible (full-rank) linear model parameterization of the regression space of a categorical covariate uses the matrix Q from (7.14) as the model matrix X. In that case, X̃ = IG and we have µ = Q β, m = β.
(7.16)

Even though parameterization (7.16) seems appealing, since the regression coefficients are directly equal to the group means, it is only rarely considered in practice for reasons that will become clear later on. Still, it is useful for some theoretical derivations.

7.4.3 ANOVA parameterization of one-way classified group means

In practice, and especially in the area of designed experiments, the group means are parameterized as

mg = α0 + αg ,  g = 1, . . . , G,  (7.17)

m = (1G , IG) α = α0 1G + αZ ,

where α = (α0 , α1 , . . . , αG)⊤ is a vector of regression coefficients and αZ = (α1 , . . . , αG)⊤ is its non-intercept subvector. That is, the matrix X̃ is

X̃ = (1G , IG).

The model matrix is then

X = (1n1 ⊗ (1, 1, 0, . . . , 0) ; . . . ; 1nG ⊗ (1, 0, . . . , 0, 1)),  (7.18)

which has G + 1 columns but its rank is G (as required). That is, the linear model Y | Z ∼ (Xα, σ² In) is less than full rank. In other words, for a given µ ∈ M(X) = MF(n1 , . . . , nG), a vector α ∈ RG+1 such that µ = Xα is not unique. Consequently, a solution to the related normal equations is not unique either. Nevertheless, a unique solution can be obtained if suitable identifying constraints14 are imposed on the vector of regression coefficients α.

Terminology (Effects of a categorical covariate). The values α1 , . . . , αG (the vector αZ) are called the effects of a categorical covariate.

Note. The effects of a categorical covariate are not unique. Hence their interpretation depends on the chosen identifying constraints.

Identification in a less-than-full-rank linear model

In the following, a linear model Y | X ∼ (Xβ, σ² In), rank(Xn×k) = r, will be assumed (in our general notation), where r < k. We shall consider linear constraints on the vector of regression coefficients, i.e., constraints of the type Aβ = 0m , where A is an m × k matrix.
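The rank deficiency just described can be checked numerically. The following is a minimal sketch (ours, not part of the notes; the helper names are invented) that builds the ANOVA model matrix (7.18) over exact rationals and verifies that it has G + 1 columns but rank G only:

```python
from fractions import Fraction

def rank(rows):
    """Matrix rank via Gauss-Jordan elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def anova_design(ns):
    """ANOVA model matrix (7.18): an intercept column followed by one
    0/1 group indicator column per group; group g repeated ns[g] times."""
    G = len(ns)
    return [[1] + [1 if j == g else 0 for j in range(G)]
            for g in range(G) for _ in range(ns[g])]

X = anova_design([3, 2, 4])   # G = 3 groups, n = 9 observations
print(len(X[0]))              # G + 1 = 4 columns ...
print(rank(X))                # ... but rank G = 3 only
```

Dropping any single indicator column (or the intercept column) restores full column rank, which is exactly what the identifying constraints discussed next achieve on the coefficient side.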
Definition 7.5 Identifying constraints. We say that a constraint Aβ = 0m identifies a vector β in a linear model Y X ∼ Xβ, σ 2 In if and only if for each µ ∈ M X there exists only one vector β which satisfies at the same time µ = Xβ 14 identifikační omezení and Aβ = 0m . 7.4. CATEGORICAL COVARIATE 93 Note. If a matrix A determines the identifying constraints, then, due to Theorem 2.5 (least squares and normal equations), it also uniquely determines the solution to normal equations. That is, there b that jointly solves linear systems is a unique solution b = β X> Xb = X> Y , Ab = 0m , or written differently, there is a unique solution to a linear system ! ! X> X X> Y . b= A 0m The question is now, what are the conditions for a matrix A to determinean identifying constraint. Remember (Theorem 2.7): If a matrix Lm×k satisfies M L> ⊂ M X> then a parameter vector θ = Lβ is estimable which also means that for all real vectors β 1 , β 2 the following holds: Xβ 1 = Xβ 2 =⇒ Lβ 1 = Lβ 2 . That is, if two different solutions of normal equations are taken and one of them satisfies the constraint then do the both. It was also shown in Section 5.3 that if further L has linearly independent rows then a set of linear constraints Lβ = 0 determines a so called submodel (Lemma 5.4). It follows from above that for identification, we cannot use such a matrix L for identification. Theorem 7.1 Scheffé on identification in a linear model. X ∼ Constraint Aβ = 0 with a real matrix A identifies a vector β in a linear model Y m m×k 2 Xβ, σ In , rank(Xn×k ) = r < k ≤ n if and only if M A> ∩ M X> = 0 , rank(X) + rank(A) = k. Proof. We have to show that, for any µ ∈ M X , the conditions stated in the theorem are equivalent to existence of the unique solution to a linear system Xβ = µ that satisfies Aβ = 0m . Existence of the solution ⇔ ∀µ ∈ M X there exists a vector β ∈ Rk such that Xβ = µ & Aβ = 0m . ⇔ ∀µ ∈ M X there exists a vector β ∈ Rk such that ! ! Xn×k µ β = . 0m Am×k | {z } D 7.4. 
CATEGORICAL COVARIATE ⇔ ⇔ ⇔ ⇔ ∀µ ∈ M X there exists a vector β ∈ Rk such that ! µ . Dβ = 0m n > o ⊆M D . µ> , 0> , µ ∈ M X m n o⊥ n > > > o⊥ ⊆ µ , 0m , µ ∈ M X . M D ∀v 1 ∈ Rn , v 2 ∈ Rm ∀µ ∈ M X v> 1, ⇔ > v2 D = ∀v 1 ∈ Rn , v 2 ∈ Rm v> 1, ⇔ 94 v> 2 n ∀v 1 ∈ Rn , v 2 ∈ Rm 0> k =0 . ! =0 . ⇒ v> 1, v2 ⇒ v> 1, v> 2 Xβ 0m ∀β ∈ Rk D= 0> k ∀β ∈ Rk > v> 1 X = −v 2 A ⇔ n ∀v 1 ∈ Rn , v 2 ∈ Rm n ∀v 1 ∈ Rn , v 2 ∈ Rm ⇔ n ∀u ∈ Rk ⇔ M A> ∩ M X> = 0 . ⇔ ! µ 0m > ⇒ o v> Xβ = 0 . 1 > v> 1 X = −v 2 A ⇒ o > v> X = 0 1 k . o X> v 1 = −A> v 2 ⇒ X> v 1 = 0k . | {z } o u > u∈M X ∩ M A> ⇒ u = 0k . Uniqueness of the solution ⇔ ∀µ ∈ M X there exists a unique vector β ∈ Rk such that Xβ = µ & Aβ = 0m . ⇔ ⇔ ⇔ ∀µ ∈ M X there exists a unique vector β ∈ Rk such that ! ! Xn×k µ β = . Am×k 0m | {z } D rank(D) = k. n o A has rows such that dim M A> = k − r (since rank(X) = r) and all rows in A are linearly independent with rows in X. ⇔ rank(A) = k − r (since we already have a condition M X> ∩ M A> = 0} needed for existence of the solution). 7.4. CATEGORICAL COVARIATE 95 k Notes. 1. Matrix Am×k used for identification must satisfy rank(A) = k − r. In practice, the number of identifying constraints (the number of rows of the matrix A) is usually the lowest possible, i.e., m = k − r. 2. Theorem 7.1 further states that the matrix A must be such that a vector parameter θ = Aβ is not estimable in a given model. 3. In practice, a vector µ, for which we look for a unique β such that µ = Xβ, Aβ = 0m b ∈ Rk (that, since is equal to the vector of fitted values Yb . That is, we look for a unique β being unique, can be considered as the LSE of the regression coefficients) such that b = Yb Xβ & b = 0m . Aβ b if and only if β b solves By Theorem 2.5 (Least squares and normal equations), Yb = Xβ > > b =X Y. normal equations X Xβ Suppose now that rank A = m = k − r, i.e., the regression parameters are identified by b we have to solve a set of m = k − r linearly independent linear constraints. 
To get β, a linear system b = X> Y , X> Xβ b = 0m , Aβ which can be solved by solving b = X> Y , X> Xβ b = 0m , A> Aβ or using a linear system which written differently is b = X> Y , X> X + A> A β b = X> Y , D> Dβ with D= ! X . A Matrix D> D is now an invertible k × k matrix and hence the unique solution is b = D> D β −1 X> Y . 7.4. CATEGORICAL COVARIATE 96 Identification in a one-way ANOVA model As example of use of Scheffé’s Theorem 7.1, consider a model matrix X given by (7.18) that provides an ANOVA parameterization of a single categorical covariate, i.e., a linear model for the one-way classified group means parameterized as mg = α0 + αg , g = 1, . . . , G. We have rank Xn×(G+1) = G with a vector α = α0 , α1 , . . . , αG of the regression coefficients. The smallest matrix Am×(G+1) that identifies α with respect to the regression space M X = MF (n1 , . . . , nG ) is hence a non-zero matrix with m = 1 row, i.e., A = a> = a0 , a1 , . . . , aG 6= 0G+1 such that a ∈ / M X> , i.e., such that θ = a> α is not estimable in the linear model Y X ∼ Xα, σ 2 In . It is seen from a structure of the matrix X given by (7.18) that a ∈ M X> ⇐⇒ a= G X cg , c1 , . . . , cG g=1 for some c = c1 , . . . , cG ∈ RG , c 6= 0G . That is, for identification of α in the linear model Y X ∼ Xα, σ 2 In with the model matrix (7.18), we can use any vector a = a0 , a1 , . . . , aG 6= 0G−1 that satisfy G X a0 6= ag . g=1 Commonly used identifying constraints include: Sum constraint: A1 = a> 1 = 0, 1, . . . , 1 ⇐⇒ G X αg = 0 g=1 that imply the following interpretation of the model parameters: G α0 = 1 X mg G =: m, g=1 α1 = m1 − m, .. . αG = mG − m. Weighted sum constraint: A2 = a> 2 = 0, n1 , . . . , nG ⇐⇒ G X ng αg = 0 g=1 that implies G α0 = 1X ng mg n g=1 α1 = m1 − mW , .. . αG = mG − mW . =: mW , 7.4. CATEGORICAL COVARIATE 97 Reference group constraint (l ∈ {1, . . . , G}): A3 = a> . . . , 1, . . . 
, 0 3 = 0, 0, | {z } 1 on lth place ⇐⇒ αl = 0, which corresponds to omitting one of the non-intercept columns in the model matrix X given by (7.18) and using the resulting full-rank parameterization. It implies α0 = m l , α1 = m 1 − m l , .. . αG = mG − ml . No intercept: A4 = a> 4 = 1, 0, . . . , 0 ⇐⇒ α0 = 0, which corresponds to omitting the intercept column in the model matrix X given by (7.18) and using the full-rank parameterization with the matrix Q given by (7.14). That is, α0 = 0, α1 = m 1 , .. . αG = mG . Note. Identifying constraints given by vectors a1 , a2 , a3 (sum, weighted sum and reference group constraint) correspond to one of commonly used full-rank parameterizations that will be introduced in Section 7.4.4 where we shall also discuss interpretation of the effects αZ = α1 , . . . , αG if different identifying constraints are used. 7.4.4 End of Lecture #7 (22/10/2015) Full-rank parameterization of one-way classified group means Start of In the following, we limit ourselves to full-rank parameterizations that involve an intercept column. Lecture #9 That is, the model matrix will be an n × G matrix (29/10/2015) 1 c> 1 . .. .. n1 -times . 1 c> 1 −−− 1n1 ⊗ 1, c> 1 . .. .. , . = X= . . . > −−− 1nG ⊗ 1, cG 1 c> G . . . . nG -times . . 1 c> G where c1 , . . . , cG ∈ are suitable vectors. In the following, let C be an G × (G − 1) matrix with those vectors as rows, i.e., c> 1 . . C= . . > cG RG−1 7.4. CATEGORICAL COVARIATE 98 e is thus a G × G matrix A matrix X e = 1G , C . X If β = β0 , . . . , βG−1 ∈ RG denote, as usual, a vector of regression coefficients, the group means m are parameterized as where β Z know, Z mg = β0 + c> g = 1, . . . , G, gβ , (7.19) e = 1G , C β = β0 1G + Cβ Z , m = Xβ = β1 , . . . , βG−1 is a non-intercept subvector of the regression coefficients. As we e = rank (1G , C) . rank X = rank X Hence, to get the model matrix X of a full-rank (rank X = G), the matrix C must satisfy rank C = G − 1 and 1G ∈ / M C . 
That is, the columns of C must be (i) G − 1 linearly independent vectors from RG that (ii) together with the vector of ones 1G still form a linearly independent set (equivalently, 1G ∉ M(C)).

Definition 7.6 Full-rank parameterization of a categorical covariate.
A full-rank parameterization of a categorical covariate with G levels (G = card(Z)) is a choice of the G × (G − 1) matrix C that satisfies rank(C) = G − 1, 1G ∉ M(C).

Terminology ((Pseudo)contrast matrix). The columns of the matrix C are often chosen to form a set of G − 1 linearly independent contrasts from RG. In this case, we will call the matrix C a contrast matrix.15 In other cases, the matrix C will be called a pseudocontrast matrix.16

Note. The (pseudo)contrast matrix C also determines a parameterization of a categorical covariate according to Definition 7.1. The corresponding function s : Z −→ RG−1 is s(z) = cz , z = 1, . . . , G, and the reparameterizing matrix S is the n × (G − 1) matrix

S = (1n1 ⊗ c1⊤ ; . . . ; 1nG ⊗ cG⊤).

15 kontrastová matice  16 pseudokontrastová matice

Evaluation of the effect of the categorical covariate

With a given full-rank parameterization of a categorical covariate, evaluation of the statistical significance of its effect on the response expectation corresponds to testing the null hypothesis

H0 : β1 = 0 & · · · & βG−1 = 0,

or written concisely,

H0 : βZ = 0G−1 .  (7.20)

This null hypothesis indeed also corresponds to a submodel where only the intercept is included in the model matrix. Finally, it can be mentioned that the null hypothesis (7.20) is indeed equivalent to the hypothesis of equality of the group means

H0 : m1 = · · · = mG .  (7.21)

If normality of the response is assumed, equivalently an F-test on a submodel (Theorem 5.1) or a test on a value of a subvector of the regression coefficients (F-test if G ≥ 2, t-test if G = 2, see Theorem 3.1) can be used. The following can be shown with only a little algebra:

• If G = 2, β = (β0 , β1)⊤.
The (usual) t-statistic to test the hypothesis H0 : β1 = 0 using point (viii) of Theorem 3.1, i.e., the statistic based on the LSE of β, is the same as the statistic of a standard two-sample t-test.

Notes.
• If G ≥ 2, the (usual) F-statistic to test the null hypothesis (7.20) using point (x) of Theorem 3.1, which is the same as the (usual) F-statistic on a submodel where the submodel is the intercept-only model, is the same as the F-statistic used classically in one-way analysis of variance (ANOVA) to test the null hypothesis (7.21).

Reference group pseudocontrasts

C = (0G−1⊤ ; IG−1), i.e., a zero first row on top of the identity matrix of order G − 1.  (7.22)

The regression coefficients have the following interpretation:

m1 = β0 ,  β0 = m1 ,
m2 = β0 + β1 ,  β1 = m2 − m1 ,
. . .
mG = β0 + βG−1 ,  βG−1 = mG − m1 .  (7.23)

That is, the intercept β0 is equal to the mean of the first (reference) group, and the elements of βZ = (β1 , . . . , βG−1)⊤ (the effects of Z) provide the differences between the means of the remaining groups and the reference one. The regression function can be written as

m(z) = β0 + β1 I(z = 2) + · · · + βG−1 I(z = G),  z = 1, . . . , G.

It is seen from (7.23) that the full-rank parameterization using the reference group pseudocontrasts is equivalent to the less-than-full-rank (ANOVA) parameterization mg = α0 + αg , g = 1, . . . , G, where α = (α0 , α1 , . . . , αG)⊤ is identified by the reference group constraint α1 = 0.

Notes.
• With the pseudocontrast matrix C given by (7.22), the group labeled by Z = 1 is chosen as the reference for which the intercept β0 provides the group mean. In practice, any other group can be taken as the reference by moving the zero row of the C matrix.
• In the R software, the reference group pseudocontrasts with the C matrix of the form (7.22) are used by default to parameterize categorical covariates (factors). Explicitly, this choice is indicated by the contr.treatment function.
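The interpretation (7.23) is easy to verify numerically. Below is a small sketch of ours (not from the notes; the function names and coefficient values are invented for illustration) that builds the pseudocontrast matrix (7.22) and checks that β0 recovers m1 while the remaining coefficients recover the differences to the reference group:

```python
def ref_pseudocontrasts(G):
    """Reference group pseudocontrast matrix (7.22):
    a zero first row on top of the (G-1)x(G-1) identity."""
    return [[1 if g == j + 1 else 0 for j in range(G - 1)] for g in range(G)]

def group_means(beta0, beta_Z, C):
    """Group means m = beta0 * 1_G + C beta_Z, cf. (7.19)."""
    return [beta0 + sum(c * b for c, b in zip(row, beta_Z)) for row in C]

G = 4
C = ref_pseudocontrasts(G)
beta0, beta_Z = 10.0, [1.5, -2.0, 0.5]    # hypothetical coefficient values
m = group_means(beta0, beta_Z, C)
print(m)                                   # [10.0, 11.5, 8.0, 10.5]
# interpretation (7.23): beta0 = m_1 and beta_g = m_{g+1} - m_1
print(m[0] == beta0, [x - m[0] for x in m[1:]] == beta_Z)
```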
Alternatively, the contr.SAS function provides a pseudocontrast matrix in which the last Gth group serves as the reference, i.e., the C matrix has zeros on its last row. Sum contrasts 1 ... 0 . . .. .. .. . = C= 1 0 ... −1 . . . −1 Let IG−1 − 1> G−1 ! (7.24) G 1 X m= mg . G g=1 The regression coefficients have the following interpretation β0 = m, β1 = m1 − m, m1 = β0 + β1 , mG−1 .. . = β0 + βG−1 , mG = β0 − G−1 X βG−1 .. . = mG−1 − m. (7.25) βg , g=1 The regression function can be written as m(z) = β0 + β1 I(z = 1) + · · · + βG−1 I(z = G − 1) − G−1 X βg I(z = G), g=1 z = 1, . . . , G. If we consider the less-than-full-rank ANOVA parameterization of the group means as mg = α0 +αg , g = 1, . . . , G, it is seen from (7.25) that the full-rank parameterization using the contrast matrix (7.24) links the regression coefficients of the two models as = m, = µ1 − m, .. . α0 = β0 α1 = β1 .. . αG−1 = βG−1 G−1 X αG = − βg g=1 = µG−1 − m, = µG − m. (7.26) 7.4. CATEGORICAL COVARIATE 101 At the same time, the vector α satisfies G X (7.27) αg = 0. g=1 That is, the full-rank parameterization using the sum contrasts (7.25) is equivalent to the lessthan-full-rank ANOVA parameterization, where the regression coefficients are identified by the sum constraint (7.27). The intercepts α0 = β0 equal to the mean of the group means and the elements of β Z = β1 , . . . , βG−1 = α1 , . . . , αG−1 are equal to the differences between the corresponding group mean and the means of the P group means. The same quantity for the last, Gth group, αG is calculated from β Z as αG = − G−1 g=1 βg . Note. In the R software, the sum contrasts with the C matrix being of the form (7.24) can be used by the mean of the function contr.sum. Weighted sum contrasts 1 .. . C= 0 n1 − nG Let ... .. . ... ... − 0 .. . 1 nG−1 (7.28) nG G mW 1X = ng mg . n g=1 The regression coefficients have the following interpretation β0 = mW , β1 = m1 − mW , m1 = β0 + β1 , mG−1 .. . = β0 + βG−1 , mG = β0 − G−1 X g=1 βG−1 .. . 
= mG−1 − mW . (7.29) ng βg , nG The regression function can be written as m(z) = β0 + β1 I(z = 1) + · · · + βG−1 I(z = G − 1) − G−1 X ng βg I(z = G), nG g=1 z = 1, . . . , G. If we consider the less-than-full-rank ANOVA parameterization of the group means as mg = α0 +αg , g = 1, . . . , G, it is seen from (7.29) that the full-rank parameterization using the contrast matrix 7.4. CATEGORICAL COVARIATE 102 (7.28) links the regression coefficients of the two models as α0 = β0 α1 = β1 .. . = mW , = m1 − mW , .. . = mG−1 − mW , αG−1 = βG−1 G−1 X ng βg αG = − nG = mG − mW . g=1 At the same time, the vector α satisfies G X (7.30) ng αg = 0. g=1 That is, the full-rank parameterization using the weighted sum pseudocontrasts (7.29) is equivalent to the less-than-full-rank ANOVA parameterization, where the regression coefficients are identified by the weighted sum constraint (7.30). The intercepts α0 = β0 equal to the weighted mean of the group means and the elements of β Z = β1 , . . . , βG−1 = α1 , . . . , αG−1 are equal to the differences between the corresponding group mean and the weighted means of the group means. PG−1 n Z The same quantity for the last, Gth group, αG is calculated from β as αG = − g=1 nGg βg . Helmert contrasts −1 −1 . . . −1 1 −1 . . . −1 2 ... −1 C= 0 . . . . .. .. .. .. 0 0 ... G − 1 The group means are obtained from the regression coefficients as m1 = β0 − G−1 X βg , g=1 m2 = β0 + β1 − G−1 X βg , g=2 G−1 X m3 = β0 + 2 β2 − βg , g=3 mG−1 .. . = β0 + (G − 2) βG−2 − βG−1 , mG = β0 + (G − 1) βG−1 . (7.31) 7.4. CATEGORICAL COVARIATE 103 Inversely, the regression coefficients are linked to the group means as G β0 = β1 = β2 = β3 = .. . βG−1 = 1 X mg G g=1 1 (m2 − m1 ), 2 o 1 1n m3 − (m1 + m2 ) , 3 2 n o 1 1 m4 − (m1 + m2 + m3 ) , 4 3 =: m, G−1 o 1n 1 X mG − mg . G G−1 g=1 which provide their (slightly awkward) interpretation: βg , g = 1, . . . 
, G − 1, is 1/(g + 1) times the difference between the mean of group g + 1 and the mean of the means of the previous groups 1, . . . , g. Note. In the R software, the Helmert contrasts with the C matrix being of the form (7.31) can be used by the mean of the function contr.helmert. Orthonormal polynomial contrasts P 1 (ω1 ) 1 P (ω2 ) C= .. . P 2 (ω1 ) P 2 (ω2 ) .. . ... ... .. . P G−1 (ω1 ) P G−1 (ω2 ) , .. . (7.32) P 1 (ωG ) P 2 (ωG ) . . . P G−1 (ωG ) where ω1 < · · · < ωG is an equidistant (arithmetic) sequence of the group labels and P j (z) = aj,0 + aj,1 z + · · · + aj,j z j , j = 1, . . . , G − 1, are orthonormal polynomials of degree 1, . . . , G − 1 built above a sequence of the group labels. Note. It can be shown that the polynomial coefficients aj,l , j = 1, . . . , G − 1, l = 0, . . . , j and hence the C matrix (7.32) is for given G invariant towards the choice of the group labels as soon 7.4. CATEGORICAL COVARIATE 104 as they form an equidistant (arithmetic) sequence. For example, for G = 2, 3, 4 the C matrix is G=2 G=3 1 √ − 2 C= 0 1 √ 2 1 −√ 2 C= 1 , √ 2 G=4 3 − √ 2 5 1 − √ 2 5 C= 1 √ 2 5 3 √ 2 5 1 2 1 − 2 1 − 2 1 2 1 √ 6 2 −√ , 6 1 √ 6 1 − √ 2 5 3 √ 2 5 . 3 − √ 2 5 1 √ 2 5 The group means are then obtained as m1 = m(ω1 ) = β0 + β1 P 1 (ω1 ) + · · · + βG−1 P G−1 (ω1 ), m2 = m(ω2 ) = β0 + β1 P 1 (ω2 ) + · · · + βG−1 P G−1 (ω2 ), .. . mG = m(ωG ) = β0 + β1 P 1 (ωG ) + · · · + βG−1 P G−1 (ωG ), where m(z) = β0 + β1 P 1 (z) + · · · + βG−1 P G−1 (z), z ∈ ω1 , . . . , ωG is the regression function. The regression coefficients β now do not have any direct interpretation. That is why, even though the parameterization with the contrast matrix (7.32) can be used with the categorical nominal covariate, it is only rarely done so. 
Nevertheless, in case of the categorical ordinal covariate where the ordered group labels ω1 < · · · < ωG have also practical interpretability, parameterization (7.32) can be used to reveal possible polynomial trends in the evolution of the group means m1 , . . . , mG and to evaluate whether it may make sense to consider that covariate as numeric rather than categorical. Indeed, for d < G, the null hypothesis H0 : βd = 0 & . . . & βG−1 = 0 corresponds to the hypothesis that the covariate at hand can be considered as numeric (with values ω1 , . . . , ωG of the form of an equidistant sequence) and the evolution of the group means can be described by a polynomial of degree d − 1. Note. In the R software, the orthonormal polynomial contrasts with the C matrix being of the form (7.32) can be used by the mean of the function contr.poly. It is also a default choice if the covariate is coded as categorical ordinal (ordered). End of Lecture #9 (29/10/2015) Chapter 8 Additivity and Interactions 8.1 Additivity and partial effect of a covariate Suppose now that the covariate vectors are Z1 , V 1 , . . . , Zn , V n ∈ Z × V, Z ⊆ R, V ⊆ Rp−1 , As usual, let Z, V ∈ Z × V denote a generic covariate, and let Z1 V> 1 . . .. .. , Z= V = Zn V> n Start of Lecture #11 (05/11/2015) p ≥ 2. be matrices covering the observed values of the two sets of covariates. 8.1.1 Additivity Definition 8.1 Additivity of the covariate effect. We say that a covariate Z ∈ Z acts additively in the regression model with covariates Z, V ∈ Z×V, if the regression function is of the form m(z, v) = mZ (z) + mV (v), z, v ∈ Z × V, (8.1) where mZ : Z −→ R and mV : V −→ R are some functions. 8.1.2 Partial effect of a covariate If the effect of Z ∈ Z acts additively in a regression model, we have for any fixed v ∈ V: E Y Z = z + 1, V = v − E Y Z = z, V = v = mZ (z + 1) − mZ (z), z ∈ Z. (8.2) That is, the influence (effect) of the covariate Z on the response expectation is the same with any value of V ∈ V. 105 8.1. 
ADDITIVITY AND PARTIAL EFFECT OF A COVARIATE 106 Terminology (Partial effect of a covariate). If a covariate Z ∈ Z acts additively in the regression model with covariates (Z, V ) ∈ Z × V, quantity (8.2) expresses so called partial effect 1 of the covariate Z on the response given the value of V . 8.1.3 Additivity, partial covariate effect and conditional independence In a context of a linear model, both mZ and mV are chosen to be linear in unknown (regression) parameters and the corresponding model matrix is decomposed as X = XZ , XV , where XZ corresponds to the regression function mZ and depends only on the covariate values Z, and XV corresponds to the regression function mV and depends only on the covariate values V. That is, the response expectation is assumed to be E Y Z, V = XZ β + XV γ, for some real vectors of regression coefficients β and γ. Matrix XZ and the regression function mZ then correspond to parameterization of a single covariate for which any choice out of those introduced in Sections 7.3 and 7.4 (or others notdiscussed here) can be used. Further, matrix XZ is/can be usually chosen such that 1n ∈ M XZ in which case a hypothesis of no partial effect of the Z covariate corresponds to testing a submodel with the model matrix X 0 = 1n , X V againts a model with the model matrix X = XZ , XV . Note that if it can be assumed that the covariates at hand influence only the (conditional) response expectation and not other characteristics of the conditional distribution of the response given the covariates, then testing a submodel with the model matrix X0 against a model with the model matrix X corresponds to testing a conditional independence of the response and the Z covariate given the remaining covariates V . 1 parciální efekt 8.2. ADDITIVITY OF THE EFFECT OF A NUMERIC COVARIATE 8.2 107 Additivity of the effect of a numeric covariate Assume that Z is a numeric covariate with Z ⊆ R. 
While limiting ourselves to the parameterizations discussed in Section 7.3, the matrix XZ can be

(i) XZ = (1n , SZ), where SZ is the reparameterizing matrix of a parameterization sZ = (s1 , . . . , sk−1)⊤ : Z −→ Rk−1 having the form of either
(a) a simple transformation (Section 7.3.1);
(b) raw polynomials (Section 7.3.2);
(c) orthonormal polynomials (Section 7.3.3).

If we denote the regression coefficients related to the model matrix XZ as β = (β0 , β1 , . . . , βk−1)⊤, the regression function is

m(z, v) = β0 + β1 s1(z) + · · · + βk−1 sk−1(z) + mV(v),  (z, v) ∈ Z × V,  (8.3)

which can also be interpreted as

m(z, v) = γ0(v) + β1 s1(z) + · · · + βk−1 sk−1(z),  (z, v) ∈ Z × V,

where γ0(v) = β0 + mV(v). In other words, if a certain covariate acts additively and its effect on the response is described by the parameterization sZ, then the remaining covariates V only modify the intercept term in the relationship between the response and the covariate Z.

(ii) XZ = BZ, where BZ is the model matrix (7.11) of the regression splines B1 , . . . , Bk. With the regression coefficients related to the model matrix BZ denoted as β = (β1 , . . . , βk)⊤, the regression function becomes

m(z, v) = β1 B1(z) + · · · + βk Bk(z) + mV(v),  (z, v) ∈ Z × V,  (8.4)

where the term mV(v) can again be interpreted as an intercept, γ0(v) = mV(v), in the relationship between the response and the covariate Z, whose value depends on the remaining covariates V.

8.2.1 Partial effect of a numeric covariate

With the regression function (8.3), the partial effect of the Z covariate on the response is determined by the set of non-intercept regression coefficients βZ := (β1 , . . . , βk−1)⊤. The null hypothesis H0 : βZ = 0k−1 then expresses the hypothesis that the covariate Z has, conditionally given a fixed (even though arbitrary) value of V, no effect on the response expectation. That is, it is a hypothesis of no partial effect of the covariate Z on the response expectation.
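Spelled out, testing no partial effect is the submodel F-test of Section 5.1 comparing X0 = (1n , XV) against X = (XZ , XV). As a sketch in our own notation (q denotes the number of columns of XV, and both model matrices are assumed to have full column rank, which need not hold in general), the statistic for the parameterization (8.3) is

```latex
F \;=\; \frac{\bigl\{\mathrm{SS}_e(\mathbb{X}_0) - \mathrm{SS}_e(\mathbb{X})\bigr\}\,/\,(k-1)}
             {\mathrm{SS}_e(\mathbb{X})\,/\,(n-k-q)},
\qquad
\mathbb{X}_0 = \bigl(\mathbf{1}_n,\, \mathbb{X}^{V}\bigr),\quad
\mathbb{X}   = \bigl(\mathbb{X}^{Z},\, \mathbb{X}^{V}\bigr),
```

where SSe(·) denotes the residual sum of squares under the respective model matrix; under H0 : βZ = 0k−1 and normal errors, F follows the F-distribution with k − 1 and n − k − q degrees of freedom.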
With the spline-based regression function (8.4), the partial effect of the Z covariate is expressed by (all) spline-related regression coefficients β1 , . . . , βk . Nevertheless, due to the B-splines property (7.8), the null hypothesis of no partial effect of the Z covariate is now H0 : β 1 = · · · = β k . 8.3. ADDITIVITY OF THE EFFECT OF A CATEGORICAL COVARIATE 8.3 108 Additivity of the effect of a categorical covariate Assume that Z is a categorical covariate with Z = {1, . . . , G} where Z = g, g = 1, . . . , G, is repeated ng -times in the data which are assumed to be sorted according to the values of this covariate. The group means used in Section 7.4 must now be understood as conditional group means, given a value of the covariates V , and the regression function (8.1) parameterizes those conditional group means, i.e., for v ∈ V: m(1, v) = E Y Z = 1, V = v =: m1 (v), .. .. . . m(G, v) = E Y Z = G, V = v =: mG (v). (8.5) Let m(v) = m1 (v), . . . , mG (v) be a vector of those conditional group means. The matrix XZ can be any of the model matrices discussed in Section 7.4. If we restrict ourselves to the full-rank parameterizations introduced in Section 7.4.4, the matrix XZ is 1n1 ⊗ c> 1 .. , XZ = 1n , SZ , SZ = . > 1nG ⊗ cG where c1 , . . . , cG ∈ RG−1 are rows of a chosen (pseudo)contrast matrix c> 1 . .. . C= c> G If β = β0 , β1 , . . . , βG−1 denotes the regression coefficients related to the model matrix XZ = 1n , SZ and we further denote β Z = β1 , . . . , βG−1 , the conditional group means are, for v ∈ V, given as Z > Z m1 (v) = β0 + c> 1 β + mV (v) = γ0 (v) + c1 β , .. .. . . Z Z > mG (v) = β0 + cG β + mV (v) = γ0 (v) + c> Gβ , (8.6) where γ0 (v) = β0 + mV (v), v ∈ V. In a matrix notation, (8.6) becomes m(v) = γ0 (v) 1G + Cβ Z . 
8.3.1 Partial effects of a categorical covariate

In agreement with the general expression (8.2), we have for arbitrary v ∈ V and arbitrary g_1, g_2 ∈ Z:

    E(Y | Z = g_1, V = v) − E(Y | Z = g_2, V = v) = m_{g_1}(v) − m_{g_2}(v) = (c_{g_1} − c_{g_2})⊤ β^Z,        (8.8)

which does not depend on the value of V = v. That is, the difference between the two conditional group means is the same for all values of the covariates in V.

Terminology (Partial effects of a categorical covariate). If additivity of the categorical covariate Z and the covariates V can be assumed, the vector of coefficients β^Z from the parameterization (Eqs. 8.6, 8.7) of the conditional group means will be referred to as the partial effects of the categorical covariate.

Note. It should be clear from (8.8) that the interpretation of the partial effects of a categorical covariate depends on the chosen parameterization (the chosen (pseudo)contrast matrix C).

If the Z covariate acts additively with the V covariates, it makes sense to ask whether all G conditional group means are, for a given v ∈ V, equal, i.e., whether all partial effects of the Z covariate are equal to zero. In general, this corresponds to the null hypothesis

    H_0: m_1(v) = · · · = m_G(v),  v ∈ V.        (8.9)

If the regression function is parameterized as (8.6), the null hypothesis (8.9) is expressed using the partial effects as

    H_0: β^Z = 0_{G−1}.

8.3.2 Interpretation of the regression coefficients

Note that (8.6) and (8.7) are basically the same expressions as those in (7.19) in Section 7.4.4. The only difference is the dependence of the group means and the intercept term on the value of the covariates V. Hence the interpretation of the individual coefficients β_0 and β^Z = (β_1, …, β_{G−1})⊤ depends on the chosen pseudocontrast matrix C; nevertheless, it is basically the same as in the case of a single categorical covariate in Section 7.4.4, with the only differences that

(i) the non-intercept coefficients in β^Z have the same interpretation as in Section 7.4.4, but always conditionally, given a chosen (even though arbitrary) value v ∈ V;

(ii) the intercept β_0 has the interpretation given in Section 7.4.4 only for such v ∈ V for which m_V(v) = 0.

This follows from the fact that, again, for a chosen v ∈ V, the expression (8.6) of the conditional group means is the same as in Section 7.4.4. Nevertheless, only for v such that m_V(v) = 0 do we have β_0 = γ_0(v).

Example 8.1 (Reference group pseudocontrasts). If C is the reference group pseudocontrast matrix (7.22), we obtain analogously to (7.23), but now for a chosen v ∈ V, the following:

    β_0 + m_V(v) = γ_0(v) = m_1(v),
    β_1 = m_2(v) − m_1(v),
      ⋮
    β_{G−1} = m_G(v) − m_1(v).

Example 8.2 (Sum contrasts). If C is the sum contrast matrix (7.24), we obtain analogously to (7.25), but now for a chosen v ∈ V, the following:

    β_0 + m_V(v) = γ_0(v) = m̄(v),
    β_1 = m_1(v) − m̄(v),
      ⋮
    β_{G−1} = m_{G−1}(v) − m̄(v),

where

    m̄(v) = (1/G) Σ_{g=1}^{G} m_g(v),  v ∈ V.

If we additionally define α_G = −Σ_{g=1}^{G−1} β_g, we get, in agreement with (7.26),

    α_G = −Σ_{g=1}^{G−1} β_g = m_G(v) − m̄(v).

8.4 Effect modification and interactions

8.4.1 Effect modification

Suppose now that the covariate vectors are

    (Z_1, W_1, V_1), …, (Z_n, W_n, V_n) ∈ Z × W × V,   Z ⊆ R,  W ⊆ R,  V ⊆ R^{p−2},  p ≥ 2.

As usual, let

    Z = (Z_1, …, Z_n)⊤,  W = (W_1, …, W_n)⊤,  V = (V_1⊤; …; V_n⊤)

denote the matrices collecting the observed covariate values and, finally, let (Z, W, V) ∈ Z × W × V denote a generic covariate.
Suppose now that the regression function is

    m(z, w, v) = m_ZW(z, w) + m_V(v),  (z, w, v) ∈ Z × W × V,        (8.10)

where m_V: V → R is some function and m_ZW: Z × W → R is a function that cannot be factorized as m_ZW(z, w) = m_Z(z) + m_W(w). We then have, for any fixed v ∈ V,

    E(Y | Z = z + 1, W = w, V = v) − E(Y | Z = z, W = w, V = v) = m_ZW(z + 1, w) − m_ZW(z, w),        (8.11)
    E(Y | Z = z, W = w + 1, V = v) − E(Y | Z = z, W = w, V = v) = m_ZW(z, w + 1) − m_ZW(z, w),        (8.12)

z ∈ Z, w ∈ W, where (8.11), i.e., the effect of the covariate Z on the response expectation, possibly depends on the value of W = w, and also (8.12), i.e., the effect of the covariate W on the response expectation, possibly depends on the value of Z = z. We then say that the covariates Z and W are mutual effect modifiers² (² Czech: modifikátory efektu).

In the context of a linear model, both m_ZW and m_V are chosen to be linear in unknown (regression) parameters and the corresponding model matrix is decomposed as

    X = (X^{ZW}, X^V),        (8.13)

where X^{ZW} corresponds to the regression function m_ZW and depends only on the covariate values Z and W, and X^V corresponds to the regression function m_V and depends only on the covariate values V. In the rest of this section and in Sections 8.5, 8.6 and 8.7, we show classical choices of the matrix X^{ZW} based on so-called interactions derived from the covariate parameterizations introduced in Sections 7.3 and 7.4.

[End of Lecture #11 (05/11/2015)]
[Start of Lecture #13 (12/11/2015)]

8.4.2 Interactions

Suppose that the covariate Z is parameterized using a parameterization

    s^Z = (s^Z_1, …, s^Z_{k−1}): Z → R^{k−1}        (8.14)

and the covariate W is parameterized using a parameterization

    s^W = (s^W_1, …, s^W_{l−1}): W → R^{l−1},        (8.15)

and let S^Z and S^W be the corresponding reparameterizing matrices:

    S^Z = ( s^Z(Z_1)⊤ ; … ; s^Z(Z_n)⊤ ) = (S^Z_1, …, S^Z_{k−1}),
    S^W = ( s^W(W_1)⊤ ; … ; s^W(W_n)⊤ ) = (S^W_1, …, S^W_{l−1}).

Definition 8.2 (Interaction terms). Let (Z_1, W_1), …, (Z_n, W_n) ∈ Z × W ⊆ R² be values of two covariates parameterized using the reparameterizing matrices S^Z and S^W. By the interaction terms³ (³ Czech: interakční členy) based on the reparameterizing matrices S^Z and S^W we mean the columns of the matrix

    S^{ZW} := S^Z : S^W.

Note. See Definition A.5 for the definition of the columnwise product of two matrices. We have

    S^{ZW} = S^Z : S^W = ( S^Z_1 : S^W_1, …, S^Z_{k−1} : S^W_1, …, S^Z_1 : S^W_{l−1}, …, S^Z_{k−1} : S^W_{l−1} )
           = ( s^W(W_1)⊤ ⊗ s^Z(Z_1)⊤ ; … ; s^W(W_n)⊤ ⊗ s^Z(Z_n)⊤ ).

8.4.3 Interactions with the regression spline

Either the Z covariate or/and the W covariate can also be parameterized by regression splines. In that case, the interaction terms are defined in the same way as in Definition 8.2. For example, if the Z covariate is parameterized by the regression splines B^Z = (B^Z_1, …, B^Z_k) with the related model matrix

    B^Z = ( B^Z(Z_1)⊤ ; … ; B^Z(Z_n)⊤ ) = (B^Z_1, …, B^Z_k),

and the W covariate by the parameterization (8.15) with the reparameterizing matrix S^W as above, then by the interaction terms we mean the columns of the matrix

    B^{ZW} = B^Z : S^W = ( B^Z_1 : S^W_1, …, B^Z_k : S^W_1, …, B^Z_1 : S^W_{l−1}, …, B^Z_k : S^W_{l−1} )
           = ( s^W(W_1)⊤ ⊗ B^Z(Z_1)⊤ ; … ; s^W(W_n)⊤ ⊗ B^Z(Z_n)⊤ ).

8.4.4 Linear model with interactions

Interaction terms are used in a linear model to express a certain form of effect modification. If 1_n ∉ M(S^Z) and 1_n ∉ M(S^W), the matrix X^{ZW} from (8.13) is usually chosen as

    X^{ZW} = (1_n, S^Z, S^W, S^{ZW}),        (8.16)

which, as will be shown, corresponds to a certain form of effect modification. Let the related regression coefficients be denoted as

    β = (β_0, β^Z_1, …, β^Z_{k−1}, β^W_1, …, β^W_{l−1}, β^{ZW}_{1,1}, …, β^{ZW}_{k−1,1}, …, β^{ZW}_{1,l−1}, …, β^{ZW}_{k−1,l−1})⊤,

where β^Z := (β^Z_1, …, β^Z_{k−1})⊤, β^W := (β^W_1, …, β^W_{l−1})⊤ and β^{ZW} := (β^{ZW}_{1,1}, …, β^{ZW}_{k−1,l−1})⊤.

Main and interaction effects

Terminology (Main and interaction effects).
The coefficients in β^Z and β^W are called the main effects⁴ of the covariates Z and W, respectively. The coefficients in β^{ZW} are called the interaction effects⁵. (⁴ Czech: hlavní efekty; ⁵ Czech: interakční efekty)

The related regression function (8.10) is

    m(z, w, v) = β_0 + s^Z(z)⊤ β^Z + s^W(w)⊤ β^W + s^{ZW}(z, w)⊤ β^{ZW} + m_V(v),  (z, w, v) ∈ Z × W × V,        (8.17)

where s^{ZW}: Z × W → R^{(k−1)(l−1)},

    s^{ZW}(z, w) = s^W(w) ⊗ s^Z(z)
                 = ( s^Z_1(z) s^W_1(w), …, s^Z_{k−1}(z) s^W_1(w), …, s^Z_1(z) s^W_{l−1}(w), …, s^Z_{k−1}(z) s^W_{l−1}(w) )⊤,  (z, w) ∈ Z × W.

The effects of the covariates Z or W, given the remaining covariates, are then expressed as

    E(Y | Z = z + 1, W = w, V = v) − E(Y | Z = z, W = w, V = v)
        = ( s^Z(z + 1) − s^Z(z) )⊤ β^Z + ( s^{ZW}(z + 1, w) − s^{ZW}(z, w) )⊤ β^{ZW},        (8.18)

    E(Y | Z = z, W = w + 1, V = v) − E(Y | Z = z, W = w, V = v)
        = ( s^W(w + 1) − s^W(w) )⊤ β^W + ( s^{ZW}(z, w + 1) − s^{ZW}(z, w) )⊤ β^{ZW},        (8.19)

z ∈ Z, w ∈ W. That is, the effect (8.18) of the covariate Z is determined by the main effects β^Z of this covariate as well as by the interaction effects β^{ZW}. Analogously, the effect (8.19) of the covariate W is determined by its main effects β^W as well as by the interaction effects β^{ZW}.

Hypothesis of no effect modification

If the factor m_ZW(z, w) of the regression function (8.10) is parameterized by the matrix X^{ZW} given by (8.16), then the hypothesis of no effect modification is expressed by considering a submodel in which the matrix X^{ZW} is replaced by the matrix

    X^{Z+W} = (1_n, S^Z, S^W).

Interaction model with regression splines

Suppose that the matrix S^Z = B^Z, where B^Z corresponds to the parameterization of the Z covariate using regression splines. We then have 1_n ∈ M(B^Z) and also M(S^W) ⊆ M(B^{ZW}) for B^{ZW} = B^Z : S^W (see Section 8.5.2). That is,

    M(1_n, B^Z, S^W, B^{ZW}) = M(B^Z, B^{ZW}).

It is thus sufficient (with respect to the obtained regression space) to choose the matrix X^{ZW} as

    X^{ZW} = (B^Z, B^{ZW}).        (8.20)
The hypothesis of no effect modification then corresponds to a submodel in which the matrix (8.20) is replaced by the matrix

    X^{Z+W} = (B^Z, S^W).

8.4.5 Rank of the interaction model

Lemma 8.1 (Rank of the interaction model).

(i) Let rank(S^Z, S^W) = k + l − 2, i.e., all columns of the matrices S^Z and S^W are linearly independent and the matrix (S^Z, S^W) is of full rank. Then the matrix S^{ZW} = S^Z : S^W is of full rank as well, i.e., rank(S^{ZW}) = (k − 1)(l − 1).

(ii) Let additionally 1_n ∉ M(S^Z), 1_n ∉ M(S^W). Then also the matrix X^{ZW} = (1_n, S^Z, S^W, S^{ZW}) is of full rank, i.e.,

    rank(X^{ZW}) = 1 + (k − 1) + (l − 1) + (k − 1)(l − 1) = k l.

Proof. Left as an exercise in linear algebra. The proof/calculations were skipped and are not requested for the exam.

Note (Hypothesis of no effect modification). Under the conditions of Lemma 8.1, we have for X^{ZW} = (1_n, S^Z, S^W, S^{ZW}) and X^{Z+W} = (1_n, S^Z, S^W):

    rank(X^{ZW}) = k l,
    rank(X^{Z+W}) = 1 + (k − 1) + (l − 1) = k + l − 1,
    M(X^{Z+W}) ⊂ M(X^{ZW}).

If the hypothesis of no effect modification is tested by a submodel F-test, then its numerator degrees of freedom are k l − k − l + 1 = (k − 1) · (l − 1). The corresponding null hypothesis can also be specified as a hypothesis on the zero value of an estimable vector of all interaction effects:

    H_0: β^{ZW} = 0_{(k−1)·(l−1)}.

8.5 Interaction of two numeric covariates

Let us now consider the situation when both Z and W are numeric covariates with Z ⊆ R, W ⊆ R.

8.5.1 Mutual effect modification

Linear effect modification

Suppose first that S^Z is the reparameterizing matrix that corresponds to the simple identity transformation of the covariate Z. For the second covariate, W, assume that S^W is an n × (l − 1) reparameterizing matrix that corresponds to the general parameterization (8.15), e.g., any of the reparameterizing matrices discussed in Sections 7.3.1, 7.3.2 and 7.3.3. That is,

    s^Z(z) = z,  z ∈ Z,        s^W(w) = ( s^W_1(w), …, s^W_{l−1}(w) )⊤,  w ∈ W,

    S^Z = (Z_1, …, Z_n)⊤,      S^W = ( s^W(W_1)⊤ ; … ; s^W(W_n)⊤ ).

The matrix X^{ZW} of Eq. (8.16) is then the matrix with rows

    ( 1, Z_i, s^W_1(W_i), …, s^W_{l−1}(W_i), Z_i s^W_1(W_i), …, Z_i s^W_{l−1}(W_i) ),  i = 1, …, n,        (8.21)

with the blocks 1_n, S^Z, S^W, S^{ZW}, and the related regression coefficients are

    β = (β_0, β^Z, β^W_1, …, β^W_{l−1}, β^{ZW}_1, …, β^{ZW}_{l−1})⊤,

where β^W := (β^W_1, …, β^W_{l−1})⊤ and β^{ZW} := (β^{ZW}_1, …, β^{ZW}_{l−1})⊤. The regression function (8.17) then becomes

    m(z, w, v) = β_0 + β^Z z + β^W_1 s^W_1(w) + · · · + β^W_{l−1} s^W_{l−1}(w)
                 + β^{ZW}_1 z s^W_1(w) + · · · + β^{ZW}_{l−1} z s^W_{l−1}(w) + m_V(v)        (8.22)

               = β_0 + s^W(w)⊤ β^W + m_V(v) + ( β^Z + s^W(w)⊤ β^{ZW} ) z,
                 with γ^Z_0(w, v) := β_0 + s^W(w)⊤ β^W + m_V(v),  γ^Z_1(w) := β^Z + s^W(w)⊤ β^{ZW},        (8.23)

               = β_0 + β^Z z + m_V(v) + ( β^W_1 + β^{ZW}_1 z ) s^W_1(w) + · · · + ( β^W_{l−1} + β^{ZW}_{l−1} z ) s^W_{l−1}(w),
                 with γ^W_0(z, v) := β_0 + β^Z z + m_V(v),  γ^W_j(z) := β^W_j + β^{ZW}_j z, j = 1, …, l − 1,        (8.24)

(z, w, v) ∈ Z × W × V.

The regression function (8.22) can be interpreted in two ways.

(i) Expression (8.23) shows that for any fixed w ∈ W, the covariates Z and V act additively and the effect of Z on the response expectation is expressed by a line. Nevertheless, both the intercept γ^Z_0 and the slope γ^Z_1 of this line depend on w, and this dependence is described by the parameterization s^W. The intercept is further additively modified by the factor m_V(v). With respect to interpretation, this shows that the main effect β^Z has the interpretation of the slope of the line that, for a given V = v, describes the influence of Z on the response if W = w is such that s^W(w) = 0_{l−1}. This also shows that a test of the null hypothesis H_0: β^Z = 0 does not evaluate the statistical significance of the influence of the covariate Z on the response expectation in general; it evaluates it only for values of W = w for which s^W(w) = 0_{l−1}.
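The w-dependent slope of Z can be illustrated numerically. The following is a minimal sketch, not from the notes, using the simplest choice s^W(w) = (w) (so l − 1 = 1), v held fixed with m_V(v) = 0, and hypothetical coefficient values.

```python
import numpy as np

# Minimal sketch of Eqs. (8.22)-(8.23): the "slope" of Z is
# gamma1_Z(w) = beta_Z + s_W(w)' beta_ZW, hence it changes with w.
# s_W(w) = (w,), m_V(v) = 0; all coefficient values are hypothetical.
beta0, beta_Z = 1.0, 0.5
beta_W = np.array([2.0])
beta_ZW = np.array([-0.3])

def s_W(w):
    return np.array([w])

def m(z, w):
    # beta0 + s_W(w)' beta_W + (beta_Z + s_W(w)' beta_ZW) * z
    return beta0 + s_W(w) @ beta_W + (beta_Z + s_W(w) @ beta_ZW) * z

print(m(1.0, 0.0) - m(0.0, 0.0))   # -> 0.5 (= beta_Z, since s_W(0) = 0)
print(m(1.0, 2.0) - m(0.0, 2.0))   # approx -0.1 (= beta_Z + 2 * beta_ZW[0])
```

The second difference illustrates the warning above: β^Z alone (here 0.5) describes the effect of Z only at w with s^W(w) = 0.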
(ii) Analogously, expression (8.24) shows that for any fixed z ∈ Z, the covariates W and V act additively and the effect of W on the response expectation is expressed by its parameterization s^W. Nevertheless, the related coefficients (γ^W_0, γ^W_1, …, γ^W_{l−1}) depend in a linear way on z. The intercept term γ^W_0 is further additively modified by the factor m_V(v). With respect to interpretation, this shows that the main effects β^W have the interpretation of the coefficients of the influence of the covariate W on the response if Z = 0. This also shows that a test of the null hypothesis H_0: β^W = 0_{l−1} does not evaluate the statistical significance of the influence of the covariate W on the response expectation in general; it evaluates it only under the condition Z = 0.

More complex effect modification

More complex effect modifications can be obtained by choosing a more complex reparameterizing matrix S^Z for the Z covariate. The interpretation of such a model is then a straightforward generalization of the above situation.

8.5.2 Mutual effect modification with regression splines

Linear effect modification with regression splines

Let us again assume that the covariate Z is parameterized using the simple identity transformation and the reparameterizing matrix S^Z is given as in (8.21). Nevertheless, for the covariate W, let us assume its parameterization using the regression splines B^W = (B^W_1, …, B^W_l) with the related model matrix

    B^W = ( B^W(W_1)⊤ ; … ; B^W(W_n)⊤ ),  with rows ( B^W_1(W_i), …, B^W_l(W_i) ),  i = 1, …, n.

Analogously to the previous usage of the matrix S^{ZW}, let the matrix B^{ZW} be defined as

    B^{ZW} = S^Z : B^W,  with rows ( Z_i B^W_1(W_i), …, Z_i B^W_l(W_i) ),  i = 1, …, n.

Remember that for any w ∈ W, Σ_{j=1}^{l} B^W_j(w) = 1, from which it follows that

(i) 1_n ∈ M(B^W);
(ii) M(S^Z) ⊆ M(B^{ZW}).

That is,

    M(1_n, S^Z, B^W, B^{ZW}) = M(B^W, B^{ZW}).        (8.25)

Hence, if a full-rank linear model is to be obtained that includes an interaction between a covariate parameterized using the regression splines and a covariate parameterized using the reparameterizing matrix S^Z = (Z_1, …, Z_n)⊤, the model matrix X^{ZW} must be of the form

    X^{ZW} = (B^W, B^{ZW}),  with rows ( B^W_1(W_i), …, B^W_l(W_i), Z_i B^W_1(W_i), …, Z_i B^W_l(W_i) ),  i = 1, …, n.

If we denote the related regression coefficients as

    β = (β^W_1, …, β^W_l, β^{ZW}_1, …, β^{ZW}_l)⊤,

with β^W := (β^W_1, …, β^W_l)⊤ and β^{ZW} := (β^{ZW}_1, …, β^{ZW}_l)⊤, the regression function (8.17) becomes

    m(z, w, v) = β^W_1 B^W_1(w) + · · · + β^W_l B^W_l(w)
                 + β^{ZW}_1 z B^W_1(w) + · · · + β^{ZW}_l z B^W_l(w) + m_V(v)        (8.26)

               = B^W(w)⊤ β^W + m_V(v) + ( B^W(w)⊤ β^{ZW} ) z,
                 with γ^Z_0(w, v) := B^W(w)⊤ β^W + m_V(v),  γ^Z_1(w) := B^W(w)⊤ β^{ZW},        (8.27)

               = ( β^W_1 + β^{ZW}_1 z ) B^W_1(w) + · · · + ( β^W_l + β^{ZW}_l z ) B^W_l(w) + m_V(v),
                 with γ^W_j(z) := β^W_j + β^{ZW}_j z, j = 1, …, l,        (8.28)

(z, w, v) ∈ Z × W × V.

The regression function (8.26) can again be interpreted in two ways.

(i) Expression (8.27) shows that for any fixed w ∈ W, the covariates Z and V act additively and the effect of Z on the response expectation is expressed by a line. Nevertheless, both the intercept γ^Z_0 and the slope γ^Z_1 of this line depend on w, and this dependence is described by the regression splines B^W. The intercept is further additively modified by the factor m_V(v).

(ii) Analogously, expression (8.28) shows that for any fixed z ∈ Z, the covariates W and V act additively and the effect of W on the response expectation is expressed by the regression splines B^W. Nevertheless, the related spline coefficients (γ^W_1(z), …, γ^W_l(z)) depend in a linear way on z. With respect to interpretation, this shows that the main effects β^W have the interpretation of the spline coefficients of the influence of the covariate W on the response if Z = 0.
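The partition-of-unity consequence, property (ii), can be checked numerically. A minimal sketch, using a toy row-stochastic matrix in place of an actual B-spline basis (only the row-sums-to-one property matters here) and a generic implementation of the columnwise product from Definition 8.2:

```python
import numpy as np

# Sketch: with a basis whose rows sum to one (the B-spline partition-of-unity
# property), summing the blocks of B_ZW = S_Z : B_W recovers S_Z, hence
# M(S_Z) is contained in M(B_ZW). The "basis" B_W below is a toy
# row-stochastic matrix, NOT an actual B-spline basis.
def colwise_product(A, B):
    # row i of the result is the Kronecker product B[i] (x) A[i]
    return np.hstack([A * B[:, [j]] for j in range(B.shape[1])])

rng = np.random.default_rng(0)
n, l = 8, 3
B_W = rng.random((n, l))
B_W /= B_W.sum(axis=1, keepdims=True)   # rows sum to 1
S_Z = rng.random((n, 1))                # stands in for the Z column (k - 1 = 1)

B_ZW = colwise_product(S_Z, B_W)        # n x l here
assert np.allclose(B_ZW[0], np.kron(B_W[0], S_Z[0]))
assert np.allclose(B_ZW.sum(axis=1, keepdims=True), S_Z)
```

The second assertion is exactly property (ii): the columns of S^Z are linear combinations (here, sums of blocks) of the columns of B^{ZW}.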
More complex effect modification with regression splines

Also with regression splines, a more complex reparameterizing matrix S^Z can be chosen. The property (8.25) still holds, and the matrix X^{ZW} can still be chosen as (B^W, B^{ZW}), B^{ZW} = S^Z : B^W. The interpretation of the model is again a straightforward generalization of the above situation.

With respect to the rank of the resulting model, statements analogous to those of Lemma 8.1 hold. Suppose that the matrix S^Z has k − 1 columns. If rank(S^Z, B^W) = k − 1 + l, i.e., all columns of the matrices S^Z and B^W are linearly independent and the matrix (S^Z, B^W) is of full rank, then both B^{ZW} = S^Z : B^W and X^{ZW} = (B^W, B^{ZW}) are of full rank, i.e.,

    rank(B^{ZW}) = (k − 1) l,        rank(X^{ZW}) = l + (k − 1) l = k l.

8.6 Interaction of a categorical and a numeric covariate

Consider now the situation when Z is a categorical covariate with Z = {1, …, G}, where the value Z = g, g = 1, …, G, is repeated n_g times in the data, and W is a numeric covariate with W ⊆ R. We will assume (without loss of generality) that the data are sorted according to the values of the categorical covariate Z. For clarity of notation, we will, analogously to Section 7.4, also use a double subscript to number the individual observations, where the first subscript indicates the value of the covariate Z. That is, we will use

    Z = (Z_1, …, Z_n)⊤ = (Z_{1,1}, …, Z_{1,n_1}, …, Z_{G,1}, …, Z_{G,n_G})⊤ = (1, …, 1, …, G, …, G)⊤,
    W = (W_1, …, W_n)⊤ = (W_{1,1}, …, W_{1,n_1}, …, W_{G,1}, …, W_{G,n_G})⊤.

If the categorical covariate Z can be interpreted as a label that indicates pertinence to one of G groups, the regression function (8.17) in which the value of z is fixed at z = g, g = 1, …, G, can be viewed as a regression function that parameterizes the dependence of the response expectation on the numeric covariate W and possibly other covariates V in group g. We have, for w ∈ W, v ∈ V:

    m(1, w, v) = E(Y | Z = 1, W = w, V = v) =: m_1(w, v),
      ⋮
    m(G, w, v) = E(Y | Z = G, W = w, V = v) =: m_G(w, v).        (8.29)

The functions m_1, …, m_G are then conditional (given a value of Z) regression functions describing the dependence of the response expectation on the covariates W and V. Alternatively, for fixed w ∈ W and v ∈ V, the vector m(w, v) = (m_1(w, v), …, m_G(w, v))⊤ can be interpreted as a vector of conditional (given W and V) group means.

In the following, assume that the categorical covariate Z is parameterized by means of a chosen (pseudo)contrast matrix

    C = ( c_1⊤ ; … ; c_G⊤ ),   c_g = (c_{g,1}, …, c_{g,G−1})⊤,  g = 1, …, G,

that is,

    S^Z = ( 1_{n_1} ⊗ c_1⊤ ; … ; 1_{n_G} ⊗ c_G⊤ ).

8.6.1 Categorical effect modification

First suppose that the numeric covariate W is parameterized using a parameterization

    s^W = (s^W_1, …, s^W_{l−1}): W → R^{l−1},

and S^W is the corresponding n × (l − 1) reparameterizing matrix. Let S^W_{(1)}, …, S^W_{(G)} be the blocks of the reparameterizing matrix S^W that correspond to the datapoints with Z = 1, …, Z = G. That is, for g = 1, …, G, S^W_{(g)} is the n_g × (l − 1) matrix

    S^W_{(g)} = ( s^W(W_{g,1})⊤ ; … ; s^W(W_{g,n_g})⊤ ),   and   S^W = ( S^W_{(1)} ; … ; S^W_{(G)} ).

The matrix X^{ZW} that parameterizes the term m_ZW(z, w) in the regression function (8.10) is again

    X^{ZW} = (1_n, S^Z, S^W, S^{ZW}),

where the interaction matrix S^{ZW} = S^Z : S^W is the n × (G − 1)(l − 1) matrix with rows

    ( s^W_1(W_{g,i}) c_g⊤, …, s^W_{l−1}(W_{g,i}) c_g⊤ ),  i = 1, …, n_g,  g = 1, …, G,

i.e., in blocks,

    S^{ZW} = ( S^W_{(1)} ⊗ c_1⊤ ; … ; S^W_{(G)} ⊗ c_G⊤ ).

The regression coefficients related to the model matrix X^{ZW} are now

    β = (β_0, β^Z_1, …, β^Z_{G−1}, β^W_1, …, β^W_{l−1}, β^{ZW}_{1,1}, …, β^{ZW}_{G−1,1}, …, β^{ZW}_{1,l−1}, …, β^{ZW}_{G−1,l−1})⊤,

with the subvectors β^Z, β^W and β^{ZW} defined as indicated. For the following considerations, it will be useful to denote the subvectors of the interaction effects β^{ZW} as

    β^{ZW}_{•j} = (β^{ZW}_{1,j}, …, β^{ZW}_{G−1,j})⊤,  j = 1, …, l − 1.

The values of the regression function (8.17) for z = g, g = 1, …, G, w ∈ W and v ∈ V, i.e., the values of the conditional regression functions (8.29), can then be written as

    m_g(w, v) = β_0 + c_g⊤ β^Z + β^W_1 s^W_1(w) + · · · + β^W_{l−1} s^W_{l−1}(w)
                + s^W_1(w) c_g⊤ β^{ZW}_{•1} + · · · + s^W_{l−1}(w) c_g⊤ β^{ZW}_{•l−1} + m_V(v).        (8.30)
A useful interpretation of the regression function (8.30) is obtained if we view m_g(w, v) as a conditional (given Z = g) regression function that describes the dependence of the response expectation on the covariates W and V in group g and write it as

    m_g(w, v) = γ^W_{g,0}(v) + γ^W_{g,1} s^W_1(w) + · · · + γ^W_{g,l−1} s^W_{l−1}(w),        (8.31)

with γ^W_{g,0}(v) := β_0 + c_g⊤ β^Z + m_V(v) and γ^W_{g,j} := β^W_j + c_g⊤ β^{ZW}_{•j}.

Expression (8.31) shows that a linear model with an interaction between a numeric covariate parameterized using the parameterization s^W and a categorical covariate can be interpreted such that, for any fixed Z = g, the covariates W and V act additively and the effect of W on the response expectation is expressed by its parameterization s^W. Nevertheless, the related coefficients depend on the value of the categorical covariate Z. The intercept term is further additively modified by the factor m_V(v).

In other words, if the categorical covariate Z expresses pertinence of a subject/experimental unit to one of G groups (internally labeled by numbers 1, …, G), the regression function (8.31) of the interaction model parameterizes a situation where, given the remaining covariates V, the dependence of the response expectation on the numeric covariate W can be expressed in each of the G groups by the same linear model (parameterized by the parameterization s^W); nevertheless, the regression coefficients of the G linear models may differ. It follows from (8.31) that, given Z = g (and given V = v), the regression coefficients for the dependence of the response on the numeric covariate W expressed by the parameterization s^W are

    γ^W_{g,0}(v) = β_0 + c_g⊤ β^Z + m_V(v),        (8.32)
    γ^W_{g,j} = β^W_j + c_g⊤ β^{ZW}_{•j},  j = 1, …, l − 1.        (8.33)

The chosen (pseudo)contrasts that parameterize the categorical covariate Z then determine the interpretation of the intercept β_0, of both sets of main effects β^Z and β^W, and also of the interaction effects β^{ZW}. This interpretation is now a straightforward generalization of the derivations shown earlier in Sections 7.4.4 and 8.3.

• The interpretation of the intercept term β_0 and of the main effects β^Z of the categorical covariate Z is obtained by noting the correspondence between the expression of the group-specific intercepts γ^W_{1,0}(v), …, γ^W_{G,0}(v) given by (8.32) and the conditional group means (8.6) in Section 8.3.

• Analogously, the interpretation of the main effects β^W and of the interaction effects β^{ZW} is obtained by noting that for each j = 1, …, l − 1, the group-specific "slopes" γ^W_{1,j}, …, γ^W_{G,j} given by (8.33) play the role of the group-specific means (7.19) in Section 7.4.4.

Example 8.3 (Reference group pseudocontrasts). Suppose that C is the reference group pseudocontrast matrix (7.22). While viewing the group-specific intercepts (8.32) as the conditional (given V = v) group means (8.6), we obtain, analogously to Example 8.1, the following interpretation of the intercept term β_0 and of the main effects β^Z = (β^Z_1, …, β^Z_{G−1})⊤ of the categorical covariate:

    β_0 + m_V(v) = γ^W_{1,0}(v),
    β^Z_1 = γ^W_{2,0}(v) − γ^W_{1,0}(v),
      ⋮
    β^Z_{G−1} = γ^W_{G,0}(v) − γ^W_{1,0}(v).

If, for given j = 1, …, l − 1, the group-specific "slopes" γ^W_{1,j}, …, γ^W_{G,j} given by (8.33) are viewed as the group-specific means (7.19) in Section 7.4.4, the interpretation of the jth main effect β^W_j of the numeric covariate and of the jth set of interaction effects β^{ZW}_{•j} = (β^{ZW}_{1,j}, …, β^{ZW}_{G−1,j})⊤ is analogous to the expressions (7.23):

    β^W_j = γ^W_{1,j},
    β^{ZW}_{1,j} = γ^W_{2,j} − γ^W_{1,j},
      ⋮
    β^{ZW}_{G−1,j} = γ^W_{G,j} − γ^W_{1,j}.

Example 8.4 (Sum contrasts). Suppose now that C is the sum contrast matrix (7.24).
Again, while viewing the group-specific intercepts (8.32) as the conditional (given V = v) group means (8.6), we obtain, now analogously to Example 8.2, the following interpretation of the intercept term β_0 and of the main effects β^Z = (β^Z_1, …, β^Z_{G−1})⊤ of the categorical covariate:

    β_0 + m_V(v) = γ̄^W_0(v),
    β^Z_1 = γ^W_{1,0}(v) − γ̄^W_0(v),
      ⋮
    β^Z_{G−1} = γ^W_{G−1,0}(v) − γ̄^W_0(v),

where

    γ̄^W_0(v) = (1/G) Σ_{g=1}^{G} γ^W_{g,0}(v),  v ∈ V.

If, for given j = 1, …, l − 1, the group-specific "slopes" γ^W_{1,j}, …, γ^W_{G,j} given by (8.33) are viewed as the group-specific means (7.19) in Section 7.4.4, the interpretation of the jth main effect β^W_j of the numeric covariate and of the jth set of interaction effects β^{ZW}_{•j} = (β^{ZW}_{1,j}, …, β^{ZW}_{G−1,j})⊤ is analogous to the expression (7.26):

    β^W_j = γ̄^W_j,
    β^{ZW}_{1,j} = γ^W_{1,j} − γ̄^W_j,
      ⋮
    β^{ZW}_{G−1,j} = γ^W_{G−1,j} − γ̄^W_j,

where

    γ̄^W_j = (1/G) Σ_{g=1}^{G} γ^W_{g,j}.

An alternative interpretation of the regression function (8.30) is obtained if, for fixed w ∈ W and v ∈ V, the values m_g(w, v), g = 1, …, G, are viewed as conditional (given W and V) group means. Expression (8.30) can then be rewritten as

    m_g(w, v) = γ^Z_0(w, v) + c_g⊤ γ^{Z⋆}(w),        (8.34)

with γ^Z_0(w, v) := β_0 + s^W(w)⊤ β^W + m_V(v) and γ^{Z⋆}(w) := β^Z + s^W_1(w) β^{ZW}_{•1} + · · · + s^W_{l−1}(w) β^{ZW}_{•l−1}. That is, the vector m(w, v) is parameterized as

    m(w, v) = γ^Z_0(w, v) 1_G + C γ^{Z⋆}(w),

and the related coefficients γ^Z_0(w, v), γ^{Z⋆}(w) depend on w through a linear model parameterized by the parameterization s^W; the intercept term is further additively modified by m_V(v). Expression (8.34) perhaps provides a way to interpret the intercept term β_0 and the main effects β^Z. Nevertheless, attempts to use (8.34) to interpret the main effects β^W and the interaction effects β^{ZW} are usually quite awkward.
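The group-specific coefficients (8.32)–(8.33) can be sketched numerically. A minimal illustration, not from the notes, with G = 3 groups, the simplest numeric parameterization s^W(w) = (w) (group-specific lines), reference-group pseudocontrasts, v fixed with m_V(v) = 0, and hypothetical coefficient values.

```python
import numpy as np

# Minimal sketch of Eqs. (8.32)-(8.33): group-specific regression lines
# m_g(w) = gamma_{g,0} + gamma_{g,1} w; reference-group pseudocontrasts
# (rows c_1, c_2, c_3); all coefficient values are hypothetical.
C = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])

beta0 = 1.0
beta_Z = np.array([0.5, -0.5])     # main effects of Z
beta_W1 = 2.0                      # main effect of W = slope in group 1
beta_ZW1 = np.array([0.1, 0.3])    # interaction effects beta^ZW_{.,1}

gamma0 = beta0 + C @ beta_Z        # group-specific intercepts, Eq. (8.32)
gamma1 = beta_W1 + C @ beta_ZW1    # group-specific slopes, Eq. (8.33)
print(gamma0)                      # -> [1.  1.5 0.5]
print(gamma1)                      # -> [2.  2.1 2.3]
```

With these pseudocontrasts, β^W_1 is the slope in group 1 and the interaction effects are slope differences against group 1, as in Example 8.3.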
8.6.2 Categorical effect modification with regression splines

Suppose now that the numeric covariate W is parameterized using the regression splines B^W = (B^W_1, …, B^W_l) with the related model matrix B^W that we again factorize into blocks B^W_{(1)}, …, B^W_{(G)} corresponding to the datapoints with Z = 1, …, Z = G. That is, for g = 1, …, G, B^W_{(g)} is the n_g × l matrix

    B^W_{(g)} = ( B^W(W_{g,1})⊤ ; … ; B^W(W_{g,n_g})⊤ ),   and   B^W = ( B^W_{(1)} ; … ; B^W_{(G)} ).

Let B^{ZW} = S^Z : B^W, which is the n × (G − 1) l matrix with rows

    ( B^W_1(W_{g,i}) c_g⊤, …, B^W_l(W_{g,i}) c_g⊤ ),  i = 1, …, n_g,  g = 1, …, G,

i.e., in blocks,

    B^{ZW} = ( B^W_{(1)} ⊗ c_1⊤ ; … ; B^W_{(G)} ⊗ c_G⊤ ).

As in Section 8.5.2, remember that for any w ∈ W, Σ_{j=1}^{l} B^W_j(w) = 1, from which it follows that

(i) 1_n ∈ M(B^W);
(ii) M(S^Z) ⊆ M(B^{ZW}).

Hence M(1_n, S^Z, B^W, B^{ZW}) = M(B^W, B^{ZW}), and if a full-rank linear model is to be obtained that includes an interaction between a numeric covariate parameterized using the regression splines and a categorical covariate parameterized by the reparameterizing matrix S^Z derived from a (pseudo)contrast matrix C, the model matrix X^{ZW} that parameterizes the term m_ZW(z, w) in the regression function (8.10) is

    X^{ZW} = (B^W, B^{ZW}) = ( B^W_{(1)}, B^W_{(1)} ⊗ c_1⊤ ; … ; B^W_{(G)}, B^W_{(G)} ⊗ c_G⊤ ).

The regression coefficients related to the model matrix X^{ZW} are

    β = (β^W_1, …, β^W_l, β^{ZW}_{1,1}, …, β^{ZW}_{G−1,1}, …, β^{ZW}_{1,l}, …, β^{ZW}_{G−1,l})⊤,

with β^W := (β^W_1, …, β^W_l)⊤ and β^{ZW} composed of the subvectors β^{ZW}_{•j} = (β^{ZW}_{1,j}, …, β^{ZW}_{G−1,j})⊤, j = 1, …, l. The values of the regression function (8.17) for z = g, g = 1, …, G, w ∈ W and v ∈ V, i.e., the values of the conditional regression functions (8.29), can then be written as

    m_g(w, v) = β^W_1 B^W_1(w) + · · · + β^W_l B^W_l(w)
                + B^W_1(w) c_g⊤ β^{ZW}_{•1} + · · · + B^W_l(w) c_g⊤ β^{ZW}_{•l} + m_V(v).

A useful interpretation is obtained if we write it as

    m_g(w, v) = ( β^W_1 + c_g⊤ β^{ZW}_{•1} ) B^W_1(w) + · · · + ( β^W_l + c_g⊤ β^{ZW}_{•l} ) B^W_l(w) + m_V(v),
    with γ^W_{g,j} := β^W_j + c_g⊤ β^{ZW}_{•j},  j = 1, …, l,

which shows that the underlying linear model assumes that, given Z = g, the covariates W and V act additively and the effect of the numeric covariate W on the response expectation is described by a regression spline whose coefficients γ^W_{g,1}, …, γ^W_{g,l}, however, depend on the value of the categorical covariate Z. Analogously to Section 8.6.1, the interpretation of the regression coefficients β^W and β^{ZW} depends on the chosen (pseudo)contrasts used to parameterize the categorical covariate Z.

[End of Lecture #13 (12/11/2015)]
[Start of Lecture #14 (19/11/2015)]

8.7 Interaction of two categorical covariates

Finally, consider the situation when both Z and W are categorical covariates with Z = {1, …, G}, W = {1, …, H}.
Let a combination (Z, W ) = (g, h) be repeated ng,h -times in the data, g = 1, . . . , G, h = 1, . . . , H and assume that ng,h > 0 for each g and h. For the clarity of notation, we will now use also a triple subscript to index the individual observations. The first subscript will indicate a value of the covariate Z, the second subscript will indicate a value of the covariate W and the third subscript will consecutively number the observations with the same (Z, W ) combination. Finally, without loss of generality, we will assume that data are sorted primarily with respect to the value of the covariate W and secondarily with respect to the value of the covariate Z. That is, we assume and denote Z1 .. . .. . .. . .. . .. . Zn W1 .. . .. . .. . .. . .. . Wn = Z1,1,1 W1,1,1 .. .. . . W1,1,n1,1 Z1,1,n1,1 − − − − − − − − −− .. .. . . − − − − − − − − −− ZG,1,1 WG,1,1 .. .. . . ZG,1,nG,1 WG,1,nG,1 − − − − − − − − −− .. .. . . .. .. . . − − − − − − − − −− Z1,H,1 W1,H,1 .. .. . . Z1,H,n1,H W1,H,n1,H − − − − − − − − −− .. .. . . − − − − − − − − −− ZG,H,1 WG,H,1 .. .. . . ZG,H,nG,H WG,H,nG,H = 1 1 .. .. . . 1 1 −−− .. .. . . −−− G 1 .. .. . . G 1 −−− .. .. . . .. .. . . −−− 1 H .. .. . . 1 H −−− .. .. . . −−− G H .. .. . . G H , Y = Y1 .. . .. . .. . .. . .. . Yn = Y1,1,1 .. . Y1,1,n1,1 − − −− .. . − − −− YG,1,1 .. . YG,1,nG,1 − − −− .. . .. . − − −− Y1,H,1 .. . Y1,H,n1,H − − −− .. . − − −− YG,H,1 .. . YG,H,nG,H . In the same way, a triple subscript will also be used with the covariates V , i.e., V 1 , . . . , V n ≡ V 1,1,1 , . . . , V G,H,nG,H . Further, let ng• = H X ng,h , g = 1, . . . , G h=1 denote the number of datapoints with Z = g and similarly, let n•h = G X g=1 ng,h , h = 1, . . . , H 8.7. INTERACTION OF TWO CATEGORICAL COVARIATES 127 denote the number of datapoints with W = h. Analogously to Section 8.3, it will be useful to view, for a fixed v ∈ V, the values of the regression function (8.17) for z ∈ {1, . . . , G} and w ∈ {1, . . . 
, H} as a set of G·H conditional (given V) group means:

m(g, h, v) = E(Y | Z = g, W = h, V = v) =: m_{g,h}(v), g = 1, ..., G, h = 1, ..., H. (8.35)

Let, for v ∈ V, m(v) be a vector with elements (8.35):

m(v) = (m_{1,1}(v), ..., m_{G,1}(v), ......, m_{1,H}(v), ..., m_{G,H}(v))^⊤.

Note that under our assumptions concerning the factorization (8.10) of the regression function and the additive effect of the covariates V, we have

m_{g,h}(v) = m_{g,h} + m^V(v), g = 1, ..., G, h = 1, ..., H, i.e., m(v) = m + m^V(v), (8.36)

where m_{g,h} := m^{ZW}(g, h) in (8.10) and m = (m_{1,1}, ..., m_{G,1}, ......, m_{1,H}, ..., m_{G,H})^⊤. Since m = m(v) whenever v ∈ V is such that m^V(v) = 0, we will call the vector m in this section the vector of baseline group means. Further, since these means correspond to a division of the data according to the values of two covariates (factors), we will also call them two-way classified (baseline) group means.

Notation. In the following, we will additionally use the notation

m̄ := (1/(G·H)) Σ_{g=1}^G Σ_{h=1}^H m_{g,h},
m_{g•} := (1/H) Σ_{h=1}^H m_{g,h}, g = 1, ..., G,
m_{•h} := (1/G) Σ_{g=1}^G m_{g,h}, h = 1, ..., H.

8.7.1 Linear model parameterization of two-way classified group means

Let, as usual, μ denote the vector of response expectations, where again a triple subscript will be used, i.e., μ = E(Y | Z, W, V) = (μ_{1,1,1}, ..., μ_{G,H,n_{G,H}})^⊤. Due to (8.36), this response expectation factorizes elementwise as μ_{g,h,i} = m_{g,h}(V_{g,h,i}) = m_{g,h} + m^V(V_{g,h,i}),
that is,

μ = μ^{ZW} + μ^V, (8.37)

where the vector μ^{ZW} stacks the baseline group means (the value m_{g,h} being repeated n_{g,h} times, in the order (1,1), ..., (G,1), ......, (1,H), ..., (G,H)) and μ^V stacks the values m^V(V_{g,h,i}). Note that the vector μ^{ZW} from (8.37) corresponds to the factor m^{ZW} in the regression function (8.10). It is now our aim to parameterize μ^{ZW} for use in a linear model, i.e., as μ^{ZW} = X^{ZW} β, where β is a vector of regression coefficients. Further note that only the baseline group means appear in the expression of μ^{ZW}:

μ^{ZW} = (m_{1,1} 1_{n_{1,1}}^⊤, ..., m_{G,1} 1_{n_{G,1}}^⊤, ......, m_{1,H} 1_{n_{1,H}}^⊤, ..., m_{G,H} 1_{n_{G,H}}^⊤)^⊤.

The situation is basically the same as in the case of a single categorical covariate in Section 7.4 if we view each of the G·H combinations of the Z and W covariates as one of the values of a new categorical covariate with G·H levels labeled by the double indices (1, 1), ..., (G, H). The following facts then follow directly from Section 7.4 (given our assumption that n_{g,h} > 0 for all (g, h)).

• The matrix X^{ZW} must have rank G·H, i.e., at least k = G·H columns, and its choice simplifies into selecting a (G·H) × k matrix X̃^{ZW} with rows x_{g,h}^⊤ (ordered as (1,1), ..., (G,1), ......, (1,H), ..., (G,H)), leading to X^{ZW} whose block corresponding to the combination (g, h) is 1_{n_{g,h}} ⊗ x_{g,h}^⊤.

• rank(X^{ZW}) = rank(X̃^{ZW}).

• The matrix X̃^{ZW} parameterizes the baseline group means as

m_{g,h} = x_{g,h}^⊤ β, g = 1, ..., G, h = 1, ..., H, i.e., m = X̃^{ZW} β.

If purely the parameterization of the vector m of baseline group means is of interest, the matrix X̃ can be chosen using the methods discussed in Section 7.4 applied to a combined categorical covariate with G·H levels. Nevertheless, it is often of interest to decompose, in a certain sense, the influence of the original covariates Z and W on the response expectation, and this will be provided, as we shall show, by the interaction model built using the common guidelines introduced in Section 8.4.
8.7.2 ANOVA parameterization of two-way classified group means

We first mention a parameterization of the two-way classified group means which only partly follows the common guidelines of Section 8.4; nevertheless, it is often encountered in practice and it directly generalizes the ANOVA parameterization of one-way classified group means introduced in Section 7.4.3. The ANOVA parameterization of two-way classified group means is given as

m_{g,h} = α_0 + α^Z_g + α^W_h + α^{ZW}_{g,h}, g = 1, ..., G, h = 1, ..., H, (8.38)

where the vector of regression coefficients α = (α_0, α^{Z⊤}, α^{W⊤}, α^{ZW⊤})^⊤ is composed of

• the intercept term α_0;
• the main effects α^Z = (α^Z_1, ..., α^Z_G)^⊤ of the covariate Z;
• the main effects α^W = (α^W_1, ..., α^W_H)^⊤ of the covariate W;
• the interaction effects α^{ZW} = (α^{ZW}_{1,1}, ..., α^{ZW}_{G,1}, ......, α^{ZW}_{1,H}, ..., α^{ZW}_{G,H})^⊤.

Interaction effects and no effect modification

The interaction effects α^{ZW} allow for possible mutual effect modification of the two categorical covariates Z and W. With parameterization (8.38) of the baseline group means, we have for any g_1, g_2 ∈ {1, ..., G}, h ∈ {1, ..., H} and any v ∈ V:

E(Y | Z = g_1, W = h, V = v) − E(Y | Z = g_2, W = h, V = v) = m_{g_1,h}(v) − m_{g_2,h}(v) = m_{g_1,h} − m_{g_2,h} = α^Z_{g_1} − α^Z_{g_2} + α^{ZW}_{g_1,h} − α^{ZW}_{g_2,h}.

That is, if the value of the covariate Z is changed, the change of the response expectation possibly depends, through the interaction terms, on the value of the covariate W. Similarly, we may express a change in the response expectation due to a change in the second categorical covariate W. For any h_1, h_2 ∈ {1, ..., H}, g ∈ {1, ..., G} and any v ∈ V:

E(Y | Z = g, W = h_1, V = v) − E(Y | Z = g, W = h_2, V = v) = m_{g,h_1}(v) − m_{g,h_2}(v) = m_{g,h_1} − m_{g,h_2} = α^W_{h_1} − α^W_{h_2} + α^{ZW}_{g,h_1} − α^{ZW}_{g,h_2}.

The above expressions also show that the hypothesis of no effect modification is given by equality of all interaction effects, i.e.,

H_0: α^{ZW}_{1,1} = ··· = α^{ZW}_{G,H}.
(8.39)

Linear model parameterization

In matrix form, parameterization (8.38) is m = X̃^{ZW}_α α with α = (α_0, α^{Z⊤}, α^{W⊤}, α^{ZW⊤})^⊤, where the matrix X̃^{ZW}_α is of dimension (G·H) × (1 + G + H + G·H) and hence its rank is at most G·H (it is indeed precisely equal to G·H). That is, the matrix X̃^{ZW}_α provides a less-than-full-rank parameterization of the two-way classified group means. Note that the matrix X̃^{ZW}_α can concisely be written as

X̃^{ZW}_α = (1_H ⊗ 1_G, 1_H ⊗ I_G, I_H ⊗ 1_G, I_H ⊗ I_G) (8.40)
= (1_{G·H}, 1_H ⊗ C̃, D̃ ⊗ 1_G, D̃ ⊗ C̃), (8.41)

where C̃ = I_G, D̃ = I_H. That is, we have

m = α_0 1_{G·H} + (1_H ⊗ C̃) α^Z + (D̃ ⊗ 1_G) α^W + (D̃ ⊗ C̃) α^{ZW}. (8.42)

Lemma 8.2 Column rank of a matrix that parameterizes two-way classified group means.
A matrix X̃^{ZW} divided into blocks as

X̃^{ZW} = (1_H ⊗ 1_G, 1_H ⊗ C̃, D̃ ⊗ 1_G, D̃ ⊗ C̃)

has column rank given by the product of the column ranks of the matrices (1_G, C̃) and (1_H, D̃). That is,

col-rank(X̃^{ZW}) = col-rank(1_G, C̃) · col-rank(1_H, D̃).

Proof. The proof/calculations below are shown only for those who are interested. By point (x) of Theorem A.3, the matrix X̃^{ZW} is, upon a suitable reordering of columns (which has no influence on the rank of the matrix), equal to the matrix

X̃^{ZW}_reord = (1_H ⊗ (1_G, C̃), D̃ ⊗ (1_G, C̃)).

Further, by point (ix) of Theorem A.3: X̃^{ZW}_reord = (1_H, D̃) ⊗ (1_G, C̃). Finally, by point (xi) of Theorem A.3:

col-rank(X̃^{ZW}) = col-rank(X̃^{ZW}_reord) = col-rank(1_G, C̃) · col-rank(1_H, D̃). k

Lemma 8.2 can now easily be used to show that the rank of the matrix X̃^{ZW}_α given by (8.40) is indeed G·H, and hence it can be used to parameterize the G·H two-way classified group means.
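Lemma 8.2 is easy to check numerically. The sketch below (illustrative names, not part of the notes) builds the block matrix of (8.40) and (8.41) with C̃ = I_G, D̃ = I_H and verifies that its rank equals col-rank(1_G, C̃) · col-rank(1_H, D̃) = G·H.

```python
import numpy as np

def two_way_anova_matrix(Ct, Dt):
    """X~ZW = (1_H (x) 1_G, 1_H (x) Ct, Dt (x) 1_G, Dt (x) Ct), cf. (8.40)-(8.41)."""
    G, H = Ct.shape[0], Dt.shape[0]
    oG, oH = np.ones((G, 1)), np.ones((H, 1))
    return np.hstack([np.kron(oH, oG), np.kron(oH, Ct),
                      np.kron(Dt, oG), np.kron(Dt, Ct)])

G, H = 3, 4
Ct, Dt = np.eye(G), np.eye(H)      # the ANOVA choice: Ct = I_G, Dt = I_H
X = two_way_anova_matrix(Ct, Dt)
rC = np.linalg.matrix_rank(np.hstack([np.ones((G, 1)), Ct]))
rD = np.linalg.matrix_rank(np.hstack([np.ones((H, 1)), Dt]))
# Lemma 8.2: col-rank(X) = col-rank(1_G, Ct) * col-rank(1_H, Dt) = G * H
assert np.linalg.matrix_rank(X) == rC * rD == G * H
```

The same check works for any other choice of C̃ and D̃, e.g., (pseudo)contrast matrices, which is how the full-rank parameterization of Section 8.7.3 is obtained.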
Sum constraints identification

The deficiency in the rank of the matrix X̃^{ZW}_α is 1 + G + H. By Scheffé's theorem on identification in a linear model (Theorem 7.1), (1 + G + H) (or more) linear constraints on the regression coefficients α are needed to identify the vector α in the related linear model. In practice, the following set of (2 + G + H) constraints is often used:

Σ_{g=1}^G α^Z_g = 0,  Σ_{h=1}^H α^W_h = 0,
Σ_{h=1}^H α^{ZW}_{g,h} = 0, g = 1, ..., G,  Σ_{g=1}^G α^{ZW}_{g,h} = 0, h = 1, ..., H, (8.43)

which in matrix notation is written as A α = 0_{2+G+H},

A = ( 0, 1_G^⊤, 0_H^⊤, 0_{G·H}^⊤ ;
      0, 0_G^⊤, 1_H^⊤, 0_{G·H}^⊤ ;
      0_G, 0_{G×G}, 0_{G×H}, 1_H^⊤ ⊗ I_G ;
      0_H, 0_{H×G}, 0_{H×H}, I_H ⊗ 1_G^⊤ ).

We leave it as an exercise in linear algebra to verify that the (2 + G + H) × (1 + G + H + G·H) matrix A satisfies the conditions of Scheffé's theorem, i.e.,

rank(A) = 1 + G + H,  M(A^⊤) ∩ M(X̃^{ZW⊤}_α) = {0}.

The coefficients α identified by the set of constraints (8.43) have the following useful interpretation (easy to see using simple algebra with expressions (8.38) while taking the constraints into account):

α_0 = m̄,
α^Z_g = m_{g•} − m̄, g = 1, ..., G,
α^W_h = m_{•h} − m̄, h = 1, ..., H,
α^{ZW}_{g,h} = m_{g,h} − m_{g•} − m_{•h} + m̄, g = 1, ..., G, h = 1, ..., H,

from which it also follows that

α^Z_{g_1} − α^Z_{g_2} = m_{g_1•} − m_{g_2•}, g_1, g_2 = 1, ..., G, (8.44)
α^W_{h_1} − α^W_{h_2} = m_{•h_1} − m_{•h_2}, h_1, h_2 = 1, ..., H. (8.45)

That is, the difference between two main effects of the Z covariate, Eq. (8.44), provides the mean effect of changing the Z covariate if the mean is taken over the possible values of its effect modifier W. Analogously, the difference between two main effects of the W covariate, Eq. (8.45), provides the mean effect of changing the W covariate if the mean is taken over the possible values of its effect modifier Z. Both (8.44) and (8.45) are important quantities especially in the area of designed (industrial, agricultural, . . .
) experiments aimed at evaluating the effect of two factors on the response.

8.7.3 Full-rank parameterization of two-way classified group means

Suppose that the categorical covariate Z is parameterized by means of a chosen G × (G − 1) (pseudo)contrast matrix C with rows c_g^⊤ = (c_{g,1}, ..., c_{g,G−1}), g = 1, ..., G, and that the categorical covariate W is parameterized by means of a chosen H × (H − 1) (pseudo)contrast matrix D with rows d_h^⊤ = (d_{h,1}, ..., d_{h,H−1}), h = 1, ..., H. Note that we do not require that the matrices C and D are based on (pseudo)contrasts of the same type. Let

X̃^{ZW}_β = (1_H ⊗ 1_G, 1_H ⊗ C, D ⊗ 1_G, D ⊗ C), (8.46)

whose (g, h)th row (rows ordered as (1,1), ..., (G,1), ......, (1,H), ..., (G,H)) is

(1, c_g^⊤, d_h^⊤, d_h^⊤ ⊗ c_g^⊤). (8.47)

This is a matrix with G·H rows and 1 + (G − 1) + (H − 1) + (G − 1)(H − 1) = G·H columns, and its structure is the same as the structure of the matrix (8.41), where we now take C̃ = C, D̃ = D. Using Lemma 8.2 and the properties of (pseudo)contrast matrices, we have

col-rank(X̃^{ZW}_β) = col-rank(1_G, C) · col-rank(1_H, D) = G·H.

That is, the matrix X̃^{ZW}_β is of full rank G·H and hence can be used to parameterize the two-way classified group means as

m = X̃^{ZW}_β β, β = (β_0, β^{Z⊤}, β^{W⊤}, β^{ZW⊤})^⊤,

where

β^Z = (β^Z_1, ..., β^Z_{G−1})^⊤, β^W = (β^W_1, ..., β^W_{H−1})^⊤, β^{ZW} = (β^{ZW}_{1,1}, ..., β^{ZW}_{G−1,1}, ......, β^{ZW}_{1,H−1}, ..., β^{ZW}_{G−1,H−1})^⊤.

We can also write

m = β_0 1_{G·H} + (1_H ⊗ C) β^Z + (D ⊗ 1_G) β^W + (D ⊗ C) β^{ZW},
m_{g,h} = β_0 + c_g^⊤ β^Z + d_h^⊤ β^W + (d_h^⊤ ⊗ c_g^⊤) β^{ZW}, g = 1, ..., G, h = 1, ..., H. (8.48)

Different choices of the (pseudo)contrast matrices C and D lead to different interpretations of the regression coefficients β.
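The full-rank property of X̃^{ZW}_β in (8.46) can be verified numerically. The sketch below (illustrative names) uses reference group pseudocontrasts for concreteness; with that choice the solved coefficients pick up the baseline-difference interpretation discussed later in Section 8.7.6.

```python
import numpy as np

def full_rank_two_way(C, D):
    """X~ZW_beta of Eq. (8.46); rows ordered (1,1), ..., (G,1), ..., (G,H)."""
    G, H = C.shape[0], D.shape[0]
    oG, oH = np.ones((G, 1)), np.ones((H, 1))
    return np.hstack([np.ones((G * H, 1)), np.kron(oH, C),
                      np.kron(D, oG), np.kron(D, C)])

G, H = 3, 4
C = np.vstack([np.zeros((1, G - 1)), np.eye(G - 1)])   # reference pseudocontrasts
D = np.vstack([np.zeros((1, H - 1)), np.eye(H - 1)])
X = full_rank_two_way(C, D)
assert X.shape == (G * H, G * H) and np.linalg.matrix_rank(X) == G * H

rng = np.random.default_rng(0)
M = rng.normal(size=(G, H))              # arbitrary baseline group means m_{g,h}
beta = np.linalg.solve(X, M.T.ravel())   # unique beta with m = X~ beta
assert np.isclose(beta[0], M[0, 0])                        # beta0 = m_{1,1}
assert np.allclose(beta[1:G], M[1:, 0] - M[0, 0])          # betaZ: m_{g,1} - m_{1,1}
assert np.allclose(beta[G:G + H - 1], M[0, 1:] - M[0, 0])  # betaW: m_{1,h} - m_{1,1}
```

Because X̃^{ZW}_β is square and nonsingular, any vector of G·H group means determines β uniquely, which is exactly what "full-rank parameterization" means here.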
Link to the general interaction model

Taking expression (8.47) for the rows of the matrix X̃^{ZW}_β, it is directly seen that it can also be written as

X̃^{ZW}_β = (1_{G·H}, S̃^Z, S̃^W, S̃^{ZW}),

where S̃^Z = 1_H ⊗ C, S̃^W = D ⊗ 1_G, S̃^{ZW} = S̃^Z : S̃^W. Similarly, the matrix X^{ZW}_β which parameterizes the vector μ^{ZW} (in which a value m_{g,h} is repeated n_{g,h} times) factorizes as

X^{ZW}_β = (1_n, S^Z, S^W, S^{ZW}), (8.49)

where S^Z and S^W are obtained from the matrices S̃^Z and S̃^W, respectively, by appropriately repeating their rows, and S^{ZW} = S^Z : S^W. That is, the model matrix (8.49) is precisely of the form given in (8.16) that we used to parameterize a linear model with interactions. In the context of this chapter, the (pseudo)contrast matrices C and D play the role of the parameterizations s^Z in (8.14) and s^W in (8.15), respectively.

8.7.4 Relationship between the full-rank and ANOVA parameterizations

With the full-rank parameterization of the two-way classified group means, expression (8.48) shows that we can also write

m_{g,h} = α_0 + α^Z_g + α^W_h + α^{ZW}_{g,h}, g = 1, ..., G, h = 1, ..., H, (8.50)

where

α_0 := β_0,
α^Z_g := c_g^⊤ β^Z, g = 1, ..., G,
α^W_h := d_h^⊤ β^W, h = 1, ..., H,
α^{ZW}_{g,h} := (d_h^⊤ ⊗ c_g^⊤) β^{ZW}, g = 1, ..., G, h = 1, ..., H. (8.51)

That is, the chosen full-rank parameterization of the two-way classified group means corresponds to the ANOVA parameterization (8.50) in which the (1 + G + H + G·H) regression coefficients α are uniquely obtained from the G·H coefficients β of the full-rank parameterization using the relationships (8.51). In other words, expressions (8.51) correspond to identifying constraints on α in the less-than-full-rank parameterization. Note also that, in matrix notation, (8.51) can be written as

α_0 := β_0, α^Z := C β^Z, α^W := D β^W, α^{ZW} := (D ⊗ C) β^{ZW}. (8.52)
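Relationships (8.52) can be checked numerically. In the sketch below (illustrative code, not from the notes), sum contrasts are used, so the resulting α automatically satisfies the sum constraints (8.43) and therefore inherits the m̄-interpretation of Section 8.7.2.

```python
import numpy as np

G, H = 3, 4
C = np.vstack([np.eye(G - 1), -np.ones((1, G - 1))])   # sum contrasts for Z
D = np.vstack([np.eye(H - 1), -np.ones((1, H - 1))])   # sum contrasts for W

rng = np.random.default_rng(1)
b0 = rng.normal()
bZ = rng.normal(size=G - 1)
bW = rng.normal(size=H - 1)
bZW = rng.normal(size=(G - 1) * (H - 1))

# (8.52): alpha from beta
aZ, aW = C @ bZ, D @ bW
aZW = (np.kron(D, C) @ bZW).reshape(H, G).T            # entry (g, h)

# group means via the ANOVA form (8.50)
M = b0 + aZ[:, None] + aW[None, :] + aZW

# alpha satisfies the sum constraints (8.43) ...
assert np.isclose(aZ.sum(), 0) and np.isclose(aW.sum(), 0)
assert np.allclose(aZW.sum(axis=0), 0) and np.allclose(aZW.sum(axis=1), 0)
# ... and hence carries the m-bar interpretation of Section 8.7.2
assert np.isclose(b0, M.mean())
assert np.allclose(aZ, M.mean(axis=1) - M.mean())
assert np.allclose(aW, M.mean(axis=0) - M.mean())
```

Repeating the same computation with reference group pseudocontrasts instead reproduces the zero-baseline identification discussed in Section 8.7.6.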
Interaction effects and no effect modification

The hypothesis of no effect modification for the covariates Z and W is given as H_0: α^{ZW}_{1,1} = ··· = α^{ZW}_{G,H}, see (8.39), which can also be written as

H_0: α^{ZW} = a 1_{G·H} for some a ∈ R.

Taking (8.52) into account, α^{ZW} = a 1 for some a ∈ R if and only if (D ⊗ C) β^{ZW} = a 1. Due to the fact that 1 ∉ M(D ⊗ C) (if both C and D are (pseudo)contrast matrices), this is only possible with a = 0 and β^{ZW} = 0. Hence with the full-rank parameterization of the two-way classified group means, the hypothesis of no effect modification is

H_0: β^{ZW} = 0_{(G−1)(H−1)}.

8.7.5 Additive model

Suppose that the categorical covariates Z and W act additively. That is, it can be assumed that α^{ZW} = a 1 for some a ∈ R and hence the baseline group means can be parameterized as

m_{g,h} = α_0 + α^Z_g + α^W_h, g = 1, ..., G, h = 1, ..., H, (8.53)

where, as before, α^Z = (α^Z_1, ..., α^Z_G)^⊤ is a vector of main effects of the covariate Z, α^W = (α^W_1, ..., α^W_H)^⊤ is a vector of main effects of the covariate W, and α = (α_0, α^{Z⊤}, α^{W⊤})^⊤ is a vector of regression coefficients. Even though this situation was in fact treated in a more general way in Section 8.3 (the covariate W in this section plays the role of the covariate vector V in Section 8.3), it will be useful to revisit it in this specific case when both covariates at hand are categorical. Parameterization (8.53) written in matrix notation is

m = X̃^{Z+W}_α α, X̃^{Z+W}_α = (1_H ⊗ 1_G, 1_H ⊗ I_G, I_H ⊗ 1_G).

The matrix X̃^{Z+W}_α has 1 + G + H columns but its rank is only

rank(X̃^{Z+W}_α) = 1 + (G − 1) + (H − 1) = G + H − 1.

Analogously to the interaction model and also analogously to Section 8.3, a full-rank parameterization of the two-way classified group means (8.53) under the additivity assumption is

m_{g,h} = β_0 + c_g^⊤ β^Z + d_h^⊤ β^W, g = 1, ..., G, h = 1, ..., H, (8.54)

where c_1^⊤, . . .
, c_G^⊤ are the rows of a G × (G − 1) (pseudo)contrast matrix C that parameterizes the categorical covariate Z, d_1^⊤, ..., d_H^⊤ are the rows of an H × (H − 1) (pseudo)contrast matrix D that parameterizes the categorical covariate W, and

β = (β_0, β^Z_1, ..., β^Z_{G−1}, β^W_1, ..., β^W_{H−1})^⊤, with β^Z := (β^Z_1, ..., β^Z_{G−1})^⊤, β^W := (β^W_1, ..., β^W_{H−1})^⊤,

are regression coefficients. Parameterization (8.54) written in matrix form is

m = X̃^{Z+W}_β β, X̃^{Z+W}_β = (1_H ⊗ 1_G, 1_H ⊗ C, D ⊗ 1_G). (8.55)

Since X̃^{Z+W}_β is a matrix with 1 + (G − 1) + (H − 1) = G + H − 1 columns obtained from the full-rank matrix X̃^{ZW}_β, Eq. (8.46), by omitting its last (G − 1)(H − 1) columns, the matrix X̃^{Z+W}_β is indeed of full rank G + H − 1. Analogously to the interaction model, the coefficients β from the full-rank parameterization correspond to certain identifying constraints imposed on the coefficients α of the less-than-full-rank parameterization. Their mutual relationship is basically the same as in Section 8.7.4 and is given by the first three expressions from (8.51).

Partial effects

If the two-way classified baseline group means satisfy additivity, we have for any g_1, g_2 ∈ {1, ..., G}, h ∈ {1, ..., H} and any v ∈ V:

E(Y | Z = g_1, W = h, V = v) − E(Y | Z = g_2, W = h, V = v) = m_{g_1,h}(v) − m_{g_2,h}(v) = m_{g_1,h} − m_{g_2,h} = α^Z_{g_1} − α^Z_{g_2} = (c_{g_1} − c_{g_2})^⊤ β^Z.

In agreement with Section 8.3.1, the main effects α^Z or β^Z are referred to as partial effects of the categorical covariate Z. Recall that the interpretation of the particular values α^Z_1, ..., α^Z_G depends on the considered identifying constraints, and the interpretation of the particular values β^Z_1, ..., β^Z_{G−1} depends on the choice of the (pseudo)contrast matrix C. Due to additivity, we also have, still for arbitrary h ∈ {1, ..., H}:

m_{g_1,h} − m_{g_2,h} = m_{g_1•} − m_{g_2•} = α^Z_{g_1} − α^Z_{g_2} = (c_{g_1} − c_{g_2})^⊤ β^Z.

The interpretation of the individual coefficients is then the same as already explained in Section 8.3.2. Similarly, we have for any h_1, h_2 ∈ {1, . . .
, H}, g ∈ {1, ..., G} and any v ∈ V:

E(Y | Z = g, W = h_1, V = v) − E(Y | Z = g, W = h_2, V = v) = m_{g,h_1}(v) − m_{g,h_2}(v) = m_{g,h_1} − m_{g,h_2} = m_{•h_1} − m_{•h_2} = α^W_{h_1} − α^W_{h_2} = (d_{h_1} − d_{h_2})^⊤ β^W.

8.7.6 Interpretation of model parameters for selected choices of (pseudo)contrasts

The chosen (pseudo)contrast matrices C and D determine the interpretation of the coefficients β of the full-rank parameterization (8.48) of the two-way classified group means and also of the coefficients α given by (8.51), and they determine an ANOVA parameterization with a certain identification. For interpretation, it is useful to view the two-way classified group means as entries of the G × H matrix

M = ( m_{1,1} ... m_{1,H} ; ... ; m_{G,1} ... m_{G,H} ).

The corresponding sample sizes n_{g,h}, g = 1, ..., G, h = 1, ..., H, form a G × H contingency table based on the values of the two categorical covariates Z and W. With the ANOVA parameterization (8.50), the main effects α^Z = C β^Z and α^W = D β^W can be interpreted as the row and column effects in the matrix M, respectively. The interaction effects α^{ZW} = (D ⊗ C) β^{ZW} can be put into a matrix

A^{ZW} = ( α^{ZW}_{1,1} ... α^{ZW}_{1,H} ; ... ; α^{ZW}_{G,1} ... α^{ZW}_{G,H} ), with α^{ZW}_{g,h} = (d_h^⊤ ⊗ c_g^⊤) β^{ZW},

whose entries can be interpreted as cell effects in the matrix M. In other words, each of the values in M is obtained as the sum of the intercept term α_0 and the corresponding row, column and cell effects. As was already mentioned, the two (pseudo)contrast matrices C and D may be of different types, e.g., matrix C being the reference group pseudocontrast matrix and matrix D being the sum contrast matrix. Nevertheless, in practice both of them are mostly chosen to be of the same type. In the remainder of this section, we shall discuss the interpretation of the model parameters for the two most common choices of (pseudo)contrasts, which are (i) the reference group pseudocontrasts and (ii) the sum contrasts.
Reference group pseudocontrasts

Suppose that both C and D are reference group pseudocontrasts, i.e.,

C = ( 0_{G−1}^⊤ ; I_{G−1} ), D = ( 0_{H−1}^⊤ ; I_{H−1} ),

so that c_1^⊤ = 0^⊤, d_1^⊤ = 0^⊤ and the remaining rows are the canonical basis vectors. We have

α^Z = C β^Z = (0, β^Z_1, ..., β^Z_{G−1})^⊤, α^W = D β^W = (0, β^W_1, ..., β^W_{H−1})^⊤. (8.56)

To get the link between the full-rank interaction terms β^{ZW} and their ANOVA counterparts α^{ZW}, we have to explore the form of the vectors d_h^⊤ ⊗ c_g^⊤, g = 1, ..., G, h = 1, ..., H. With the reference group pseudocontrasts, we easily see that

d_h^⊤ ⊗ c_1^⊤ = 0^⊤ for all h = 1, ..., H,
d_1^⊤ ⊗ c_g^⊤ = 0^⊤ for all g = 1, ..., G,
d_h^⊤ ⊗ c_g^⊤ = (0, ..., 1, ..., 0) if g ≠ 1 and h ≠ 1, with the 1 on the place that in (d_h^⊤ ⊗ c_g^⊤) β^{ZW} multiplies β^{ZW}_{g−1,h−1},

which leads to

A^{ZW} = ( 0  0 ... 0 ;
           0  β^{ZW}_{1,1} ... β^{ZW}_{1,H−1} ;
           ... ;
           0  β^{ZW}_{G−1,1} ... β^{ZW}_{G−1,H−1} ),

i.e., the first row and the first column of A^{ZW} are zero. In summary, the ANOVA coefficients are identified by a set of 3 + (H − 1) + (G − 1) = G + H + 1 constraints:

α^Z_1 = 0, α^W_1 = 0, α^{ZW}_{1,1} = 0, α^{ZW}_{1,h} = 0 (h = 2, ..., H), α^{ZW}_{g,1} = 0 (g = 2, ..., G).

The first two constraints come from (8.56); the remaining ones correspond to the zeros in the matrix A^{ZW}. Taking into account the parameterization m_{g,h} = α_0 + α^Z_g + α^W_h + α^{ZW}_{g,h}, g = 1, ..., G, h = 1, ..., H, of the two-way classified group means, where the parameters satisfy the above constraints, we get their following interpretation:

α_0 = β_0 = m_{1,1},
α^Z_g = β^Z_{g−1} = m_{g,1} − m_{1,1}, g = 2, ..., G,
α^W_h = β^W_{h−1} = m_{1,h} − m_{1,1}, h = 2, ..., H,
α^{ZW}_{g,h} = β^{ZW}_{g−1,h−1} = m_{g,h} − m_{g,1} − m_{1,h} + m_{1,1}, g = 2, ..., G, h = 2, ..., H.

Note (Reference group pseudocontrasts in the additive model).
If the additive model is assumed, where m_{g,h} = α_0 + α^Z_g + α^W_h, g = 1, ..., G, h = 1, ..., H, and the reference group pseudocontrasts are used in the full-rank parameterization (8.55), the ANOVA coefficients α = (α_0, α^{Z⊤}, α^{W⊤})^⊤ are obtained from the full-rank coefficients β = (β_0, β^{Z⊤}, β^{W⊤})^⊤ again by (8.56), that is, they are identified by the two constraints α^Z_1 = 0, α^W_1 = 0. Their interpretation becomes

α_0 = β_0 = m_{1,1},
α^Z_g = β^Z_{g−1} = m_{g,h} − m_{1,h} (arbitrary h ∈ {1, ..., H}) = m_{g•} − m_{1•}, g = 2, ..., G,
α^W_h = β^W_{h−1} = m_{g,h} − m_{g,1} (arbitrary g ∈ {1, ..., G}) = m_{•h} − m_{•1}, h = 2, ..., H.

Sum contrasts

Suppose now that both C and D are sum contrasts, i.e.,

C = ( I_{G−1} ; −1_{G−1}^⊤ ), D = ( I_{H−1} ; −1_{H−1}^⊤ ),

so that c_g^⊤ = e_g^⊤ for g = 1, ..., G − 1, c_G^⊤ = −1_{G−1}^⊤, and analogously for the rows d_h^⊤ of D. We have

α^Z = C β^Z = (β^Z_1, ..., β^Z_{G−1}, −Σ_{g=1}^{G−1} β^Z_g)^⊤, α^W = D β^W = (β^W_1, ..., β^W_{H−1}, −Σ_{h=1}^{H−1} β^W_h)^⊤. (8.57)

The form of the vectors d_h^⊤ ⊗ c_g^⊤, g = 1, ..., G, h = 1, ..., H, needed to calculate the interaction terms α^{ZW}_{g,h}, is the following:

d_h^⊤ ⊗ c_g^⊤ = (0, ..., 1, ..., 0) for g = 1, ..., G − 1, h = 1, ..., H − 1, with the 1 on the place that in (d_h^⊤ ⊗ c_g^⊤) β^{ZW} multiplies β^{ZW}_{g,h};
d_h^⊤ ⊗ c_G^⊤ = (0_{G−1}^⊤, ..., −1_{G−1}^⊤, ..., 0_{G−1}^⊤) for all h = 1, ..., H − 1, with the −1_{G−1}^⊤ block on the places that in (d_h^⊤ ⊗ c_G^⊤) β^{ZW} multiply β^{ZW}_{•h};
d_H^⊤ ⊗ c_g^⊤ = (0, ..., −1, ..., 0, ......, 0, ..., −1, ..., 0) for all g = 1, ..., G − 1, with the −1's on the places that in (d_H^⊤ ⊗ c_g^⊤) β^{ZW} multiply β^{ZW}_{g•};
d_H^⊤ ⊗ c_G^⊤ = (1, ..., 1) = 1_{(G−1)(H−1)}^⊤.

This leads to the matrix A^{ZW} whose upper-left (G − 1) × (H − 1) block contains the coefficients β^{ZW}_{g,h}, whose last column has entries α^{ZW}_{g,H} = −Σ_{h=1}^{H−1} β^{ZW}_{g,h} (g = 1, ..., G − 1), whose last row has entries α^{ZW}_{G,h} = −Σ_{g=1}^{G−1} β^{ZW}_{g,h} (h = 1, ..., H − 1), and whose bottom-right entry is α^{ZW}_{G,H} = Σ_{g=1}^{G−1} Σ_{h=1}^{H−1} β^{ZW}_{g,h}.
Note that the entries in each row of the matrix A^{ZW}, and also in each of its columns, sum up to zero. Similarly, (8.57) shows that the elements of the main effects α^Z and also the elements of the main effects α^W sum up to zero. The identifying constraints for the ANOVA coefficients α that correspond to the considered sum contrast full-rank parameterization are hence

Σ_{g=1}^G α^Z_g = 0, Σ_{h=1}^H α^W_h = 0,
Σ_{h=1}^H α^{ZW}_{g,h} = 0 for each g = 1, ..., G, Σ_{g=1}^G α^{ZW}_{g,h} = 0 for each h = 1, ..., H. (8.58)

Note that in the set of G + H constraints on the interaction terms α^{ZW}_{g,h}, one constraint is redundant, and the last two rows could also be replaced by the set of (G − 1) + (H − 1) + 1 = G + H − 1 constraints:

Σ_{h=1}^H α^{ZW}_{g,h} = 0 for each g = 1, ..., G − 1,
Σ_{g=1}^G α^{ZW}_{g,h} = 0 for each h = 1, ..., H − 1,
Σ_{g=1}^{G−1} Σ_{h=1}^{H−1} α^{ZW}_{g,h} = α^{ZW}_{G,H}.

We see that the set of equations (8.58) exactly corresponds to identification by the sum constraints (8.43), see Section 8.7.2. Hence the interpretation of the regression coefficients is the same as derived there, namely

α_0 = m̄,
α^Z_g = m_{g•} − m̄, g = 1, ..., G,
α^W_h = m_{•h} − m̄, h = 1, ..., H,
α^{ZW}_{g,h} = m_{g,h} − m_{g•} − m_{•h} + m̄, g = 1, ..., G, h = 1, ..., H.

Additionally,

α^Z_{g_1} − α^Z_{g_2} = m_{g_1•} − m_{g_2•}, g_1, g_2 = 1, ..., G,
α^W_{h_1} − α^W_{h_2} = m_{•h_1} − m_{•h_2}, h_1, h_2 = 1, ..., H.

Note (Sum contrasts in the additive model). If the additive model is assumed, where m_{g,h} = α_0 + α^Z_g + α^W_h, g = 1, ..., G, h = 1, ..., H, and the sum contrasts are used in the full-rank parameterization (8.55), the ANOVA coefficients α = (α_0, α^{Z⊤}, α^{W⊤})^⊤ are obtained from the full-rank coefficients β = (β_0, β^{Z⊤}, β^{W⊤})^⊤ again by (8.57), that is, they are identified by the two constraints Σ_{g=1}^G α^Z_g = 0, Σ_{h=1}^H α^W_h = 0. Their interpretation becomes: α_0 = m̄, and, for g = 1, ..., G and arbitrary h ∈ {1, ..., H}, α^Z_g = m_{g,h} − m_{•h}, while for h = 1, ..., H and arbitrary g ∈ {1, . . .
, G}, α^W_h = m_{g,h} − m_{g•}; equivalently, α^Z_g = m_{g•} − m̄, g = 1, ..., G, and α^W_h = m_{•h} − m̄, h = 1, ..., H. Additionally,

α^Z_{g_1} − α^Z_{g_2} = m_{g_1,h} − m_{g_2,h} (arbitrary h ∈ {1, ..., H}) = m_{g_1•} − m_{g_2•}, g_1, g_2 = 1, ..., G,
α^W_{h_1} − α^W_{h_2} = m_{g,h_1} − m_{g,h_2} (arbitrary g ∈ {1, ..., G}) = m_{•h_1} − m_{•h_2}, h_1, h_2 = 1, ..., H.

End of Lecture #14 (19/11/2015)

8.8 Hierarchically well-formulated models, ANOVA tables

Start of Lecture #15 (19/11/2015)

8.8.1 Model terms

In the majority of applications of a linear model, a particular covariate Z ∈ Z ⊆ R enters the regression function using one of the parameterizations described in Sections 7.3 and 7.4, or inside an interaction (see Definition 8.2), or inside a so-called higher order interaction (to be defined shortly). As a summary, depending on whether the covariate is numeric or categorical, several parameterizations s were introduced in Sections 7.3 and 7.4 that, with the covariate values Z_1, ..., Z_n in the data, lead to a reparameterizing matrix

S = ( s^⊤(Z_1) ; ... ; s^⊤(Z_n) ) = ( X_1^⊤ ; ... ; X_n^⊤ ),

where X_1 = s(Z_1), ..., X_n = s(Z_n) are the regressors used in the linear model. The considered parameterizations were the following.

Numeric covariate

(i) Simple transformation: s = s: Z → R with

S = ( s(Z_1) ; ... ; s(Z_n) ) = S,  X_i = X_i = s(Z_i), i = 1, ..., n. (8.59)

(ii) Polynomial: s = (s_1, ..., s_{k−1}) such that s_j(z) = P^j(z) is a polynomial in z of degree j, j = 1, ..., k − 1. This leads to

S = (P^1, ..., P^{k−1}) with rows (P^1(Z_i), ..., P^{k−1}(Z_i)),  X_i = (P^1(Z_i), ..., P^{k−1}(Z_i))^⊤. (8.60)

For a particular form of the basis polynomials P^1, ..., P^{k−1}, raw or orthonormal polynomials have been suggested in Sections 7.3.2 and 7.3.3. Other choices are possible as well.

(iii) Regression spline: s = (s_1, ..., s_k) such that s_j(z) = B_j(z), j = 1, ..., k, where B_1, . . .
, B_k is the spline basis of chosen degree d ∈ N_0 composed of basis B-splines built above a set of chosen knots λ = (λ_1, ..., λ_{k−d+1})^⊤. This leads to

S = B = (B^1, ..., B^k) with rows (B_1(Z_i), ..., B_k(Z_i)),  X_i = (B_1(Z_i), ..., B_k(Z_i))^⊤. (8.61)

Categorical covariate with Z = {1, ..., G}. The parameterization s is s(z) = c_z, z ∈ Z, where c_1, ..., c_G ∈ R^{G−1} are the rows of a chosen (pseudo)contrast matrix C_{G×(G−1)}. This leads to

S = ( c_{Z_1}^⊤ ; ... ; c_{Z_n}^⊤ ) = (C^1, ..., C^{G−1}),  X_i = c_{Z_i}. (8.62)

Main effect model terms

In the following, we restrict ourselves to situations when the considered covariates are parameterized in one of the above-mentioned ways. The following definitions introduce sets of columns of a possible model matrix which will be called the model terms and which are useful to always be considered "together" when proposing a linear model for a problem at hand.

Definition 8.3 The main effect model term.
Depending on the chosen parameterization, the main effect model term⁶ (of order one) of a given covariate Z is defined as a matrix T with columns:

Numeric covariate
(i) Simple transformation: the (only) column S of the reparameterizing matrix S given by (8.59), i.e., T = (S).
(ii) Polynomial: the first column P^1 of the reparameterizing matrix S (given by Eq. 8.60) that corresponds to the linear transformation of the covariate Z, i.e., T = (P^1).
(iii) Regression spline: (all) columns B^1, ..., B^k of the reparameterizing matrix S = B given by (8.61), i.e., T = (B^1, ..., B^k).

Categorical covariate: (all) columns C^1, ..., C^{G−1} of the reparameterizing matrix S given by (8.62), i.e., T = (C^1, ..., C^{G−1}).

Definition 8.4 The main effect model term of order j.
If a numeric covariate Z is parameterized using a polynomial of degree k − 1, then the main effect model term of order j, j = 2, . . .
, k − 1, is the matrix T^j whose only column is the jth column P^j of the reparameterizing matrix S (given by Eq. 8.60), the one corresponding to the polynomial of degree j, i.e., T^j = (P^j).

⁶ hlavní efekt

Note. The terms T, ..., T^{j−1} are called the lower order terms included in the term T^j.

Two-way interaction model terms

In the following, consider two covariates Z and W and their main effect model terms T^Z and T^W.

Definition 8.5 The two-way interaction model term.
The two-way interaction⁷ model term is the matrix T^{ZW}, where T^{ZW} := T^Z : T^W.

Notes.
• The main effect model term T^Z and/or the main effect model term T^W that enter the two-way interaction may also be of a degree j > 1.
• Both main effect model terms T^Z and T^W are called the lower order terms included in the two-way interaction term T^Z : T^W.

Higher order interaction model terms

In the following, consider three covariates Z, W and V and their main effect model terms T^Z, T^W, T^V.

Definition 8.6 The three-way interaction model term.
The three-way interaction⁸ model term is the matrix T^{ZWV}, where T^{ZWV} := T^Z : T^W : T^V.

Notes.
• Any of the main effect model terms T^Z, T^W, T^V that enter the three-way interaction may also be of a degree j > 1.
• All main effect terms T^Z, T^W and T^V and also all two-way interaction terms T^Z : T^W, T^Z : T^V and T^W : T^V are called the lower order terms included in the three-way interaction term T^{ZWV}.
• By induction, we could also define four-way, five-way, ..., i.e., higher order interaction model terms, and the notion of the corresponding lower order nested terms.

⁷ dvojná interakce  ⁸ trojná interakce

8.8.2 Model formula

To write linear models based on several covariates concisely, the model formula will be used.
The symbols in the model formula have the following meaning:

• 1: the intercept term if this is the only term in the model (i.e., an intercept-only model).
• Letter or abbreviation: the main effect of order one of a particular covariate (which is identified by the letter or abbreviation). It is assumed that the chosen parameterization is either known from context or is indicated in some way (e.g., by the used abbreviation). Letters or abbreviations will also be used to indicate a response variable.
• Power of j, j > 1 (above a letter or abbreviation): the main effect of order j of a particular covariate.
• Colon (:) between two or more letters or abbreviations: the interaction term based on the particular covariates.
• Plus sign (+): a delimiter of the model terms.
• Tilde (∼): a delimiter between the response and the description of the regression function.

Further, when using a model formula, it is assumed that the intercept term is explicitly included in the regression function. If the explicit intercept should not be included, this will be indicated by writing −1 among the model terms.

8.8.3 Hierarchically well formulated model

Definition 8.7 Hierarchically well formulated model.
A hierarchically well formulated (HWF) model⁹ is a model that contains an intercept term (possibly implicitly) and, with each model term, also all lower order terms that are nested in this term.

⁹ hierarchicky dobře formulovaný model

Notes.
• Unless there is some well-defined specific reason, models used in practice should be hierarchically well formulated.
• The reason for using HWF models is the fact that the regression space of such models is invariant under linear (location-scale) transformations of the regressors, where invariance is meant with respect to the possibility of obtaining equivalent linear models.

Example 8.5. Consider a quadratic regression function m_x(x) = β_0 + β_1 x + β_2 x² and perform a linear transformation of the regressor:

x = δ (t − ϕ),  t = ϕ + x/δ, (8.63)
where δ ≠ 0 and ϕ ≠ 0 are pre-specified constants and t is a new regressor. The regression function in t is mt (t) = γ0 + γ1 t + γ2 t2 , where

γ0 = β0 − β1 δϕ + β2 δ2 ϕ2 ,    γ1 = β1 δ − 2β2 δ2 ϕ,    γ2 = β2 δ2 .

With at least three different x values in the data, both regression functions lead to two equivalent linear models of rank 3.
Suppose now that the initial regression function mx did not include a linear term, i.e., it was mx (x) = β0 + β2 x2 , which leads to a linear model of rank 2 (with at least three or even just two different covariate values in the data). Upon performing the linear transformation (8.63) of the regressor x, the regression function becomes mt (t) = γ0 + γ1 t + γ2 t2 with

γ0 = β0 + β2 δ2 ϕ2 ,    γ1 = −2β2 δ2 ϕ,    γ2 = β2 δ2 .

With at least three different covariate values in the data, this leads to a linear model of rank 3.
To use a non-HWF model in practice, there should always be a (physical, . . .) reason for that. For example,
• No intercept in the model ≡ it can be assumed that the response expectation is zero if all regressors in a chosen parameterization take zero values.
• No linear term in a model with a quadratic regression function m(x) = β0 + β2 x2 ≡ it can be assumed that the regression function is a parabola with its vertex at the point (0, β0 ) with respect to the x parameterization.
• No main effect of one covariate in an interaction model with two numeric covariates and a regression function m(x, z) = β0 + β1 z + β2 x z ≡ it can be assumed that with z = 0, the response expectation does not depend on the value of x, i.e., E(Y | X = x, Z = 0) = β0 (a constant).

8.8.4 ANOVA tables
For a particular linear model, so called ANOVA tables are often produced to help the analyst decide which model terms are important with respect to their influence on the response expectation.
Similarly to the well known one-way ANOVA table (see any introductory statistical course), ANOVA tables produced in the context of linear models provide on each row the input of a certain F-statistic, now based on Theorem 5.2. The last row of the table (often labeled Residual, Error or Within) provides
(i) the residual degrees of freedom νe of the considered model;
(ii) the residual sum of squares SSe of the considered model;
(iii) the residual mean square MSe = SSe /νe of the considered model.
Each of the remaining rows of the ANOVA table provides input for the numerator of the F-statistic that corresponds to a comparison of certain two models M1 ⊂ M2 which are both submodels of the considered model (or M2 is the considered model itself) and which have residual degrees of freedom ν1 and ν2 , respectively. The following quantities are provided on each of the remaining rows of the ANOVA table:
(i) degrees of freedom for the numerator of the F-statistic (effect degrees of freedom νE = ν1 − ν2 );
(ii) the difference in the residual sums of squares of the two models (effect sum of squares SSE = SS(M2 | M1 ));
(iii) the ratio of the above two values, which is the numerator of the F-statistic (effect mean square MSE = SSE /νE );
(iv) the value of the F-statistic FE = MSE /MSe ;
(v) a p-value based on the F-statistic FE and the FνE, νe distribution.
Several types of ANOVA tables are distinguished which differ by the definition of the pair of models M1 and M2 that are compared on a particular row. Consequently, the interpretation of the results provided by ANOVA tables of different types differs. Further, it is important to know that in all ANOVA tables, the lower order terms always appear on earlier rows of the table than the higher order terms that include them.
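The quantities (i)–(v) above can be reproduced from the residual sums of squares of two nested least squares fits. The following minimal sketch (illustrative, not part of the original notes) uses numpy; for simplicity, MSe is computed from M2, i.e., M2 plays the role of the considered model:

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def anova_row(X1, X2, y):
    """One ANOVA table row for nested models M1 (matrix X1) and M2 (matrix X2),
    M1 a submodel of M2.  Returns (nu_E, SS_E, MS_E, F)."""
    n = y.size
    nu1 = n - np.linalg.matrix_rank(X1)      # residual df of M1
    nu2 = n - np.linalg.matrix_rank(X2)      # residual df of M2
    nu_E = nu1 - nu2                         # (i)  effect degrees of freedom
    ss_E = rss(X1, y) - rss(X2, y)           # (ii) effect sum of squares SS(M2 | M1)
    ms_E = ss_E / nu_E                       # (iii) effect mean square
    f_E = ms_E / (rss(X2, y) / nu2)          # (iv) F-statistic; (v) the p-value
    return nu_E, ss_E, ms_E, f_E             #      would come from F(nu_E, nu2)

# toy example: M1 = intercept only, M2 = one-way classification with 2 groups
y = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])
X1 = np.ones((6, 1))
X2 = np.column_stack([np.ones(6), np.repeat([0.0, 1.0], 3)])
print(anova_row(X1, X2, y))   # nu_E = 1, SS_E = 54, MS_E = 54, F = 54 (up to rounding)
```

Here the group means are 2 and 8, so SSe = 4 with νe = 4 and the effect sum of squares is 58 − 4 = 54, giving F = 54.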
Finally, for some ANOVA tables, a different interpretation of the results is obtained for a different ordering of the rows with the terms of the same hierarchical level, e.g., for a different ordering of the main effect terms.
We introduce ANOVA tables of three types which are labeled by the R software (and by many other packages as well) as tables of type I, II or III (Arabic numerals can be used as well). Nevertheless, note that there exist software packages and literature that use a different typology. In the remainder of this section we assume that the intercept term is included in the considered model.
In the following, we illustrate each type of the ANOVA table on a linear model based on two covariates whose main effect terms will be denoted as A and B. Next to the main effects, the model will also include an interaction term A : B. That is, the model formula of the considered model, denoted MAB , is ∼ A + B + A : B. In total, the following (sub)models of this model will appear in the ANOVA tables:

M0 : ∼ 1,   MA : ∼ A,   MB : ∼ B,   MA+B : ∼ A + B,   MAB : ∼ A + B + A : B.

The symbol SS(F2 | F1) will denote the difference in the residual sums of squares of the models with model formulas F1 and F2.

Type I (sequential) ANOVA table

Example 8.6 (Type I ANOVA table for model MAB : ∼ A + B + A : B). In the type I ANOVA table, the presented results depend on the ordering of the rows with the terms of the same hierarchical level. In this example, those are the rows that correspond to the main effect terms A and B.

Order A + B + A:B

  Effect (Term)   Df    Effect sum of squares      Effect mean square   F-stat.   P-value
  A               ?     SS(A | 1)                  ?                    ?         ?
  B               ?     SS(A + B | A)              ?                    ?         ?
  A:B             ?     SS(A + B + A:B | A + B)    ?                    ?         ?
  Residual        νe    SSe                        MSe

Order B + A + A:B

  Effect (Term)   Df    Effect sum of squares      Effect mean square   F-stat.   P-value
  B               ?     SS(B | 1)                  ?                    ?         ?
  A               ?     SS(A + B | B)              ?                    ?         ?
  A:B             ?     SS(A + B + A:B | A + B)    ?                    ?         ?
  Residual        νe    SSe                        MSe
The row of the effect (term) E in the type I ANOVA table has in general the following interpretation and properties.
• It compares two models M1 ⊂ M2 , where
  • M1 contains all terms included in the rows that precede the row of the term E;
  • M2 contains the terms of model M1 and additionally the term E.
• The sum of squares shows the increase of the explained variability of the response due to the term E on top of the terms shown on the preceding rows.
• The p-value quantifies the significance of the influence of the term E on the response while controlling (adjusting) for all terms shown on the preceding rows.
• The interpretation of the F-tests is different for the rows labeled equally A in the two tables in Example 8.6. Similarly, the interpretation of the F-tests is different for the rows labeled equally B in the two tables in Example 8.6.
• The sum of all sums of squares shown in the type I ANOVA table gives the total sum of squares SST of the considered model. This follows from the construction of the table, where the terms are added sequentially one-by-one, and from a sequential use of Theorem 5.8 (Breakdown of the total sum of squares in a linear model with intercept).

Type II ANOVA table

Example 8.7 (Type II ANOVA table for model MAB : ∼ A + B + A : B). In the type II ANOVA table, the presented results do not depend on the ordering of the rows with the terms of the same hierarchical level, as should become clear from the subsequent explanation.

  Effect (Term)   Df    Effect sum of squares      Effect mean square   F-stat.   P-value
  A               ?     SS(A + B | B)              ?                    ?         ?
  B               ?     SS(A + B | A)              ?                    ?         ?
  A:B             ?     SS(A + B + A:B | A + B)    ?                    ?         ?
  Residual        νe    SSe                        MSe

The row of the effect (term) E in the type II ANOVA table has in general the following interpretation and properties.
• It compares two models M1 ⊂ M2 , where
  • M1 is the considered (full) model without the term E and also without all higher order terms that include E.
  • M2 contains the terms of model M1 and additionally the term E (this is the same as in the type I ANOVA table).
• The sum of squares shows the increase of the explained variability of the response due to the term E on top of all other terms that do not include the term E.
• The p-value quantifies the significance of the influence of the term E on the response while controlling (adjusting) for all other terms that do not include E.
• For practical purposes, this is probably the most useful ANOVA table.

Type III ANOVA table

Example 8.8 (Type III ANOVA table for model MAB : ∼ A + B + A : B). Also in the type III ANOVA table, the presented results do not depend on the ordering of the rows with the terms of the same hierarchical level, as should become clear from the subsequent explanation.

  Effect (Term)   Df    Effect sum of squares          Effect mean square   F-stat.   P-value
  A               ?     SS(A + B + A:B | B + A:B)      ?                    ?         ?
  B               ?     SS(A + B + A:B | A + A:B)      ?                    ?         ?
  A:B             ?     SS(A + B + A:B | A + B)        ?                    ?         ?
  Residual        νe    SSe                            MSe

The row of the effect (term) E in the type III ANOVA table has in general the following interpretation and properties.
• It compares two models M1 ⊂ M2 , where
  • M1 is the considered (full) model without the term E;
  • M2 contains the terms of model M1 and additionally the term E (this is the same as in the type I and type II ANOVA tables). Due to the construction of M1 , the model M2 is always equal to the considered (full) model.
• The submodel M1 is not necessarily hierarchically well formulated. If M1 is not HWF, the interpretation of its comparison to the model M2 depends on the parameterization of the term E. Consequently, the interpretation of the F-test also depends on the used parameterization.
• For general practical purposes, most rows of the type III ANOVA table are often useless.

Chapter 9 Analysis of Variance

In this chapter, we examine several specific issues of linear models where all covariates are categorical.
That is, the covariate vector Z is Z = Z1 , . . . , Zp , Zj ∈ Zj , j = 1, . . . , p, and each Zj is a finite set (with usually a “low” cardinality). The corresponding linear models are traditionally used in the area of designed (industrial, agricultural, . . . ) experiments or controlled clinical studies. The elements of the covariate vector Z then correspond to p factors whose influence on the response Y is of interest. The values of those factors for experimental units/subjects are typically within the control of an experimenter in which case the covariates are fixed rather than being random. Nevertheless, since the whole theory presented in this chapter is based on statements on the conditional distribution of the response given the covariate values, everything applies for both fixed and random covariates. 152 9.1. ONE-WAY CLASSIFICATION 9.1 153 One-way classification One-way classification corresponds to situation of one categorical covariate Z ∈ Z = {1, . . . , G}, see also Section 7.4. A linear model is then used to parameterize a set of G (conditional) response expectations E Y Z = 1 , . . ., E Y Z = G that we call as one-way classified group means: g = 1, . . . , G. m(g) = E Y Z = g =: mg , Without loss of generality, we can assume that the response random variables Y1 , . . . , Yn are sorted such that Z1 = · · · = Zn1 = 1, Zn1 +1 = · · · = Zn1 +n2 = 2, . . ., Zn1 +···+nG−1 +1 = · · · = Zn = G. For notational clarity in theoretical derivations, it is useful to use a double subscript to index the individual observations and to merge responses with a common covariate value Z = g, g = 1, . . . , G, into response subvectors Y g : Z=1: .. . Y 1 = Y1,1 , . . . , Y1,n1 , .. . Z = G : Y G = YG,1 , . . . , YG,nG . > The full response vector is Y = Y 1 , . . . , Y G and its (conditional, given Z = Z1 , . . . , Zn ) mean is m1 1n1 .. . E Y Z = µ = . mG 1nG A standard linear model then additionally assumes var Y Z = σ 2 In . If moreover, a linear model with i.i.d. 
errors is assumed, we get

Yg,j = mg + εg,j ,   g = 1, . . . , G,  j = 1, . . . , ng ,

where the εg,j are i.i.d. with εg,j ∼ D(0, σ2 ). Finally, if moreover the covariates are fixed rather than random, the data form G independent random samples (with common variances):

Sample 1 :  Y 1 = (Y1,1 , . . . , Y1,n1 ),   Y1,j i.i.d. ∼ D(m1 , σ2 ),  j = 1, . . . , n1 ,
. . .
Sample G :  Y G = (YG,1 , . . . , YG,nG ),   YG,j i.i.d. ∼ D(mG , σ2 ),  j = 1, . . . , nG .

A linear model and the related methodology can then be used for inference on the group means m1 , . . . , mG or on their linear combinations.

9.1.1 Parameters of interest

Differences between the group means
The principal inferential interest with one-way classification lies in estimation of and tests on the parameters

θg,h = mg − mh ,   g, h = 1, . . . , G,  g ≠ h,

which are the differences between the group means. Since each θg,h is a linear combination of the elements of the mean vector µ = E(Y | Z), it is trivially an estimable parameter of the underlying linear model irrespective of its parameterization. The LSE of each θg,h is then the difference between the corresponding fitted values.
The principal null hypothesis tested in the context of the one-way classification is the null hypothesis on equality of the group means, i.e., the null hypothesis

H0 : m1 = · · · = mG ,

which, written in terms of the differences between the group means, is

H0 : θg,h = 0,   g, h = 1, . . . , G,  g ≠ h.

Factor effects
One-way classification often corresponds to a designed experiment which aims at evaluating the effect of a certain factor on the response. In that case, the following quantities, called factor effects, are usually of primary interest.

Definition 9.1 Factor effects in a one-way classification. By factor effects in case of a one-way classification we understand the quantities η1 , . . . , ηG defined as

ηg = mg − (1/G) ∑_{h=1}^{G} mh ,   g = 1, . . . , G.

Notes.
• The effects are again linear combinations of the elements of the mean vector µ = factor E Y Z and hence all are estimable parameters of the underlying linear model with the LSE being equal to the appropriate linear combination of the fitted values. • The null hypothesis H0 : ηg = 0, g = 1, . . . , G, is equivalent to the null hypothesis H0 : m1 = · · · = mG on the equality of the group means. 9.1.2 One-way ANOVA model As a reminder from Section 7.4.2, the regression space of the one-way classification is m 1 1 n1 . : m1 , . . . , mG ∈ R ⊆ Rn . .. m 1 G nG While assuming ng > 0, g = 1, . . . , G, n > G, its vector dimension is G. In Sections 7.4.3 and 7.4.4, we introduced two classical classes of parameterizations of this regression space and of the response mean vector µ as µ = Xβ, β ∈ Rk . 9.1. ONE-WAY CLASSIFICATION 155 ANOVA (less-than-full rank) parameterization mg = α0 + αg , with k = G + 1, β =: α = α0 , |{z} αZ . α1 , . . . , αG g = 1, . . . , G Full-rank parameterization Z mg = β0 + c> gβ , g = 1, . . . , G with k = G, β = β0 , β Z , where |{z} β1 , . . . , βG−1 c> 1 . . C= . c> G is a chosen G × (G − 1) (pseudo)contrast matrix. Note. If the parameters in the ANOVA parameterization are identified by the sum constraint P G g=1 αg = 0, we get G α0 = 1 X mg , G g=1 αg = ηg = mg − H 1 X mh , H h=1 that is, parameters α1 , . . . , αG are then equal to the factor effects. Terminology. Related linear model is referred to as one-way ANOVA model 1 Notes. • Depending on chosen parameterization (ANOVA or full-rank) the differences between the group means, parameters θg,h , are expressed as θg,h = αg − αh = cg − ch > βZ , g 6= h. The null hypothesis H0 : m1 = · · · mG on equality of the group means is the expressed as (a) H0 : α1 = · · · = αG . (b) H0 : β1 = 0 & . . . & βG−1 = 0, i.e., H0 : β Z = 0G−1 . 
• If a normal linear model is assumed, test on a value of the estimable vector parameter or a submodel test which compares the one-way ANOVA model with the intercept-only model can be used to test the above null hypotheses. The corresponding F-test is indeed a well known one-way ANOVA F-test. End of Lecture #15 (19/11/2015) 9.1. ONE-WAY CLASSIFICATION 9.1.3 156 Least squares estimation Start of In case of a one-way ANOVA linear model, explicit formulas for the LSE related quantities can easily Lecture #16 be derived. (26/11/2015) Theorem 9.1 Least squares estimation in one-way ANOVA linear model. The fitted values and the LSE of the group means in a one-way ANOVA linear model are equal to the group sample means: ng 1 X b m b g = Yg,j = Yg,l =: Y g• , ng g = 1, . . . , G, j = 1, . . . , ng . l=1 That is, Y 1• m b1 . . . . c := m . = . , Y G• m bG Y 1• 1n1 .. . Yb = . Y G• 1nG c | Z ∼ NG m, σ 2 V , where If additionally normality is assumed then m 1 . . . 0 n.1 . .. . .. V= . . . 0 . . . n1G Proof. Use a full-rank parameterization µ = Xβ with 1n1 . . . 0n1 . .. .. .. , X= β = m , . . . , m . . . 1 G .. 0nG . 1nG We have n1 . . . 0 . . .. . .. X> X = . . , 0 . . . nG P n1 j=1 Y1,j .. X> Y = . PnG , X> X −1 j=1 YG,j 1 n1 . . = . 0 ... .. . ... 0 .. . , 1 nG −1 > b=m c= m β b 1, . . . , m b G = X> X X Y = Y 1• , . . . , Y G• . Finally, m b 1 1n1 Y 1• 1n1 .. b = ... = . Yb = Xβ . Y G• 1nG m b G 1nG c follows from a general LSE theory. Normality and the form of the covariance matrix of m 1 model analýzy rozptylu jednoduchého třídění k 9.1. ONE-WAY CLASSIFICATION 157 LSE of regression coefficients and estimable parameters With a full-rank parameterization, a vector m is linked to the regression coefficients β = β0 , β Z , β Z = β1 , . . . , βG−1 , by the relationship m = β0 1G + Cβ Z . 
b where X is a model matrix derived from the (pseudo)contrast matrix Due to the fact that Yb = Xβ, Z b = βb0 , β b of the regression coefficients in a full-rank parameterization satisfy C, the LSE β bZ , c = βb0 1G + Cβ m which is a regular linear system with the solution βb0 bZ β ! Y 1• −1 ... . = 1G , C Y G• That is, the LSE of the regression coefficients is always a linear combination of the group sample means. The same then holds for any estimable parameter. For example, the LSE of the differences between the group means θg,h = mg − mh , g, h = 1, . . . , G, are θbg,h = Y g• − Y h• , g, h = 1, . . . , G. P Analogously, the LSE of the factor effects ηg = mg − G1 G h=1 mh , g = 1, . . . , G, are G ηbg = Y g• − 1 X Y h• , G g = 1, . . . , G. h=1 9.1.4 Within and between groups sums of squares, ANOVA Ftest Sums of squares Let as usual, Y denote a sample mean based on the response vector Y , i.e., ng G G 1 XX 1X Y = ng Y g• . Yg,j = n n g=1 j=1 g=1 In a one-way ANOVA linear model, the residual and the regression sums of squares and corresponding degrees of freedom are n n g g G X G X 2 X 2 X 2 SSe = Y − Yb = Yg,j − Ybg,j = Yg,j − Y g• , g=1 j=1 g=1 j=1 νe = n − G, n g G X G 2 X 2 X 2 SSR = Yb − Y 1n = Ybg,j − Y = ng Y g• − Y , g=1 j=1 g=1 νR = G − 1. In this context, the residual sum of squares SSe is also called the within groups sum of squares2 , the regression sum of squares SSR is called the between groups sum of squares3 . 2 vnitroskupinový součet čtverců 3 meziskupinový součet čtverců 9.1. ONE-WAY CLASSIFICATION 158 One-way ANOVA F-test Let us assume normality of the response and consider a submodel Y | Z ∼ Nn 1n β0 , σ 2 In of the one-way ANOVA model. A residual sum of squares of the submodel is n SS0e g G X 2 X 2 = SST = Y − Y 1n = Yg,j − Y . g=1 j=1 Breakdown of the total sum of squares (Theorem 5.8) gives SSR = SST − SSe = SS0e − SSe and hence the statistic of the F-test on a submodel is F = SSR G−1 SSe n−G = MSR , MSe (9.1) where SSR SSe , MSe = . 
MSR = SSR /(G − 1),    MSe = SSe /(n − G).

The F-statistic (9.1) is indeed the classical one-way ANOVA F-statistic which, under the null hypothesis of validity of the submodel, i.e., under the null hypothesis of equality of the group means, follows an FG−1, n−G distribution.
The above quantities, together with the P-value derived from the FG−1, n−G distribution, are often recorded in the form of the ANOVA table:

  Effect (Term)   Df       Effect sum of squares   Effect mean square   F-stat.   P-value
  Factor          G − 1    SSR                     MSR                  F         p
  Residual        n − G    SSe                     MSe

Consider the terminology introduced in Section 8.8 and denote as Z the main effect terms that correspond to the covariate Z. We have SSR = SS(Z | 1) and the above ANOVA table is then a type I as well as a type II ANOVA table. If the intercept is explicitly included in the model matrix, then it is also the type III ANOVA table.

9.2 Two-way classification

Two-way classification corresponds to the situation of two categorical covariates Z ∈ Z = {1, . . . , G} and W ∈ W = {1, . . . , H}, see also Section 8.7. A linear model is then used to parameterize a set of G · H (conditional) response expectations E(Y | Z = g, W = h), g = 1, . . . , G, h = 1, . . . , H, that we will call two-way classified group means:

m(g, h) = E(Y | Z = g, W = h) =: mg,h ,   g = 1, . . . , G,  h = 1, . . . , H.

Analogously to Section 8.7 and without loss of generality, we can assume that the response variables Y1 , . . . , Yn are sorted as indicated in that section. That is, the first n1,1 responses correspond to (Z, W ) = (1, 1), the following n2,1 responses to (Z, W ) = (2, 1), etc. Analogously to the one-way classification, we will now use a triple subscript to index the individual observations and merge responses with a common value of the two covariates into response subvectors Y g,h , g = 1, . . . , G, h = 1, . . . , H, as indicated in the following table:

Z = 1 :  Y 1,1 = (Y1,1,1 , . . . , Y1,1,n1,1 ),   . . . ,   Y 1,H = (Y1,H,1 , . . . , Y1,H,n1,H ),
. . .
Z = G :  Y G,1 = (YG,1,1 , . . . , YG,1,nG,1 ),   . . . ,   Y G,H = (YG,H,1 , . . . , YG,H,nG,H ).
The overall response vector Y is assumed to be taken from the columns of the above table, i.e.,

Y = (Y 1,1 , . . . , Y G,1 , . . . , Y 1,H , . . . , Y G,H ).

In the same way, we define a vector m composed of the two-way classified group means as

m = (m1,1 , . . . , mG,1 , . . . , m1,H , . . . , mG,H ).

Finally, we keep using the dot notation for collapsed sample sizes and for means of the group means. That is,

n = ∑_{g=1}^{G} ∑_{h=1}^{H} ng,h ,
ng• = ∑_{h=1}^{H} ng,h ,  g = 1, . . . , G,      n•h = ∑_{g=1}^{G} ng,h ,  h = 1, . . . , H,
mg• = (1/H) ∑_{h=1}^{H} mg,h ,  g = 1, . . . , G,      m•h = (1/G) ∑_{g=1}^{G} mg,h ,  h = 1, . . . , H,
m = (1/(G · H)) ∑_{g=1}^{G} ∑_{h=1}^{H} mg,h = (1/G) ∑_{g=1}^{G} mg• = (1/H) ∑_{h=1}^{H} m•h ,

which can be summarized in a tabular form as

  Group means                              Sample sizes
  Z \ W   1      ...   H      •            Z \ W   1      ...   H      •
  1       m1,1   ...   m1,H   m1•          1       n1,1   ...   n1,H   n1•
  ...                                      ...
  G       mG,1   ...   mG,H   mG•          G       nG,1   ...   nG,H   nG•
  •       m•1    ...   m•H    m            •       n•1    ...   n•H    n

Note. The above defined quantities mg• , m•h , m are means of the group means which are not weighted by the corresponding sample sizes (which are moreover random if the covariates are random). As such, all the above defined means are always real constants and never random variables (irrespective of whether the covariates are considered as fixed or random).
The (conditional, given Z = (Z1 , . . . , Zn ) and W = (W1 , . . . , Wn )) mean of the response vector Y is E(Y | Z, W) = µ, where µ stacks the blocks mg,h 1ng,h in the same order as the subvectors Y g,h appear in Y .
A standard linear model then additionally assumes var(Y | Z, W) = σ2 In . If moreover a linear model with i.i.d. errors is assumed, we get

Yg,h,j = mg,h + εg,h,j ,   g = 1, . . . , G,  h = 1, . . . , H,  j = 1, . . . , ng,h ,

where the εg,h,j are i.i.d. with εg,h,j ∼ D(0, σ2 ). Finally, if moreover the covariates are fixed rather than random, the data form G · H independent random samples (with common variances):

Sample (1, 1) :  Y 1,1 = (Y1,1,1 , . . . , Y1,1,n1,1 ),
Y1,1,j i.i.d. ∼ D(m1,1 , σ2 ),  j = 1, . . . , n1,1 ,
. . .
Sample (G, H) :  Y G,H = (YG,H,1 , . . . , YG,H,nG,H ),   YG,H,j i.i.d. ∼ D(mG,H , σ2 ),  j = 1, . . . , nG,H .

A linear model and the related methodology can then be used for inference on the group means m1,1 , . . . , mG,H or on their linear combinations. On top of that, a linear model can reveal the structure of the relationship in which the two covariates – the two factors – influence the two-way classified group means.

9.2.1 Parameters of interest

Means of the means and their differences
Various quantities, all being linear combinations of the two-way classified group means, i.e., all being estimable in any parameterization of the two-way classification, are classically of interest. They include:
(i) The mean of the group means m.
(ii) The means of the means by the first or the second factor, i.e., the parameters m1• , . . . , mG• , and m•1 , . . . , m•H .
(iii) Differences between the means of the means by the first or the second factor, i.e., the parameters

θg1,g2• := mg1• − mg2• ,   g1 , g2 = 1, . . . , G,  g1 ≠ g2 ,
θ•h1,h2 := m•h1 − m•h2 ,   h1 , h2 = 1, . . . , H,  h1 ≠ h2 .

Those, in a certain sense, quantify the mean effect of the first or the second factor on the response.

Main effects
Analogously to the one-way classification, also the two-way classification often corresponds to a designed experiment, in this case aiming at evaluating the effect of two factors represented by the covariates Z and W . The following quantities, called the main effects of the factors Z and W , respectively, are then usually of primary interest.

Definition 9.2 Main effects in a two-way classification. Consider a two-way classification based on factors Z and W . By main effects of the factor Z, we understand the quantities ηZ1 , . . . , ηZG defined as

ηZg := mg• − m,   g = 1, . . . , G.

By main effects of the factor W , we understand the quantities ηW1 , . . . , ηWH defined as

ηWh := m•h − m,   h = 1, . . . , H.

Note.
Differences between the means of the means are also equal to the differences between the main effects:

θg1,g2• = mg1• − mg2• = ηZg1 − ηZg2 ,   g1 , g2 = 1, . . . , G,  g1 ≠ g2 ,
θ•h1,h2 = m•h1 − m•h2 = ηWh1 − ηWh2 ,   h1 , h2 = 1, . . . , H,  h1 ≠ h2 .

Interaction effects
Suppose now that the factors Z and W act additively on the response expectation. In this case, the differences mg1,h − mg2,h , g1 , g2 = 1, . . . , G, do not depend on the value of h = 1, . . . , H. Consequently, for any h,

mg1,h − mg2,h = mg1• − mg2• ,   g1 , g2 = 1, . . . , G.

This implies

mg1,h − mg1• = mg2,h − mg2• ,   g1 , g2 = 1, . . . , G,  h = 1, . . . , H.

In other words, additivity implies that for any h = 1, . . . , H, the differences ∆(g, h) = mg,h − mg• do not depend on the value of g = 1, . . . , G. Then (for any g = 1, . . . , G and h = 1, . . . , H)

mg,h − mg• = ∆(g, h) = (1/G) ∑_{g′=1}^{G} ∆(g′, h) = (1/G) ∑_{g′=1}^{G} (mg′,h − mg′• ) = m•h − m.

Clearly, we would arrive at the same conclusion if we started instead from assuming that mg,h1 − mg,h2 , h1 , h2 = 1, . . . , H, do not depend on the value of g = 1, . . . , G. That is, additivity implies

mg,h − mg• − m•h + m = 0,   g = 1, . . . , G,  h = 1, . . . , H.

Easily, we find that this is also a sufficient condition for additivity.

Definition 9.3 Interaction effects in a two-way classification. Consider a two-way classification based on factors Z and W . By interaction effects of the two factors, we understand the quantities ηZW1,1 , . . . , ηZWG,H defined as

ηZWg,h := mg,h − mg• − m•h + m,   g = 1, . . . , G,  h = 1, . . . , H.

Notes.
• Taking into account the definitions of the main and interaction effects, the two-way classified group means are given as

mg,h = m + ηZg + ηWh + ηZWg,h ,   g = 1, . . . , G,  h = 1, . . . , H.

• The hypothesis of additivity of the effects of the factors Z and W on the response expectation is given as

H0 : ηZWg,h = 0,   g = 1, . . . , G,  h = 1, . . . , H.
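The definitions of the main and interaction effects, the decomposition of mg,h, and the vanishing of the interaction effects under additivity can be checked with a short numerical sketch (illustrative, not part of the original notes), starting from a G × H table of group means m[g, h]:

```python
import numpy as np

# Main and interaction effects of a two-way classification, computed from
# a G x H table of group means m[g, h] (Definitions 9.2 and 9.3).
def effects(m):
    mbar = m.mean()                           # mean of the group means
    eta_Z = m.mean(axis=1) - mbar             # main effects of factor Z
    eta_W = m.mean(axis=0) - mbar             # main effects of factor W
    eta_ZW = (m - m.mean(axis=1, keepdims=True)
                - m.mean(axis=0, keepdims=True) + mbar)  # interaction effects
    return mbar, eta_Z, eta_W, eta_ZW

# additive table m[g, h] = a[g] + b[h]: all interaction effects vanish
a = np.array([1.0, 2.0, 4.0])
b = np.array([0.0, 3.0])
_, _, _, eta_ZW_add = effects(a[:, None] + b[None, :])
print(np.allclose(eta_ZW_add, 0.0))           # True

# the decomposition m = mbar + eta_Z + eta_W + eta_ZW holds for any table
m_any = np.array([[1.0, 5.0], [2.0, 2.0], [0.0, 7.0]])
mbar, eta_Z, eta_W, eta_ZW = effects(m_any)
rebuilt = mbar + eta_Z[:, None] + eta_W[None, :] + eta_ZW
print(np.allclose(rebuilt, m_any))            # True
```

Note that the main effects sum to zero over their index by construction, which matches the sum constraints used later for the two-way ANOVA parameterizations.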
9.2.2 Two-way ANOVA models The following linear models, referred to as two-way ANOVA models,4 are traditionally considered. Each of them corresponds to different structure for the two-way classified group means. 4 modely analýzy rozptylu dvojného třídění 9.2. TWO-WAY CLASSIFICATION 163 Interaction model No structure is imposed on the group means that can in general be written as ZW mg,h = α0 + αgZ + αhW + αg,h , g = 1, . . . , G, h = 1, . . . , H, (9.2) where α0 , Z αZ = α1Z , . . . , αG , αZW W αW = α1W , . . . , αH , ZW ZW ZW ZW = α1,1 , . . . , αG,1 , . . . , α1,H , . . . , αG,H are the regression parameters. If ng,h > 0 for all g, h, the rank of the related linear model is G · H, see Section 8.7. This explains why the interaction model is also called as the saturated model 5 . The reason is that its regression space has maximal possible vector dimension equal to the number of the group means. Identification of the regression coefficients is possibly achieved by the sum constraints (see Section 8.7.2) G X αgZ = 0, g=1 H X H X αhW = 0, h=1 (9.3) ZW = 0, g = 1, . . . , G, αg,h h=1 G X ZW = 0, h = 1, . . . , H. αg,h g=1 Having imposed the sum constraints (9.3), the regression coefficients coincide with the mean of the means m, with the main and interaction effects respectively (see Section 8.7.2 for corresponding derivations), that is, α0 = m, αgZ = ηgZ = mg• − m, αhW ZW αg,h = = g = 1, . . . , G, ηhW = m•h − m, ZW = mg,h − mg• ηg,h h = 1, . . . , H, − m•h + m, g = 1, . . . , G, h = 1, . . . , H. Section 8.7 also explains possible full-rank parameterization of the underlying linear model, which parameterizes the two-way classified group means as ZW > Z > W mg,h = β0 + c> + d> g = 1, . . . , G, h = 1, . . . , H, (9.4) g β + dh β h ⊗ cg β where CG×(G−1) c> 1 . .. , = c> G DH×(H−1) d> 1 . .. , = d> H are chosen (pseudo)contrast matrices, and β0 , β ZW Z β Z = β1Z , . . . , βG−1 , W β W = β1W , . . . , βH−1 , ZW ZW ZW ZW = β1,1 , . . . , βG−1,1 , . . . , β1,H−1 , . . . 
, βG−1,H−1 are the regression parameters. 5 saturovaný model 9.2. TWO-WAY CLASSIFICATION 164 In the following, let symbols Z and W denote the terms in the model matrix that correspond to the main effects β Z of the covariate Z, and β W of the covariate W , respectively. Let further Z : W denote the terms corresponding to the interaction effects β ZW . The interaction model will then symbolically be written as MZW : ∼ Z + W + Z : W. Additive model It is obtained as a submodel of the interaction model (9.2) where it is requested ZW ZW α1,1 = · · · = αG,H , which in the full-rank parameterization (9.4) corresponds to requesting β ZW = 0(G−1)·(H−1) . Hence the group means can be written as mg,h = α0 + αgZ + αhW , g = 1, . . . , G, h = 1, . . . , H, Z > W = β0 + c> g β + dh β , (9.5) (9.6) In Section 8.7.5, we showed that if ng,h > 0 for all g, h, the rank of the linear model with the two-way classified group means that satisfy (9.5), is G+H −1. The additive model will symbolically be written as MZ+W : ∼ Z + W. Note. It can easily be shown that ng• for all g = 1, . . . , G and n•h for all h = 1, . . . , H suffice to get a rank of the related linear model being still G+H −1. This guarantees, among other things, that all parameters that are estimable in the additive model with ng,h > 0 for all g, h, are still estimable under a weaker requirement ng• for all g = 1, . . . , G and n•h for all h = 1, . . . , H. That is, if the additive model can be assumed, it is not necessary to have observations for all possible combinations of the values of the two covariates (factors) and the same types of the statistical inference are possible. This is often exploited in the area of designed experiments where it might be impractical or even impossible to get observations under all possible covariate combinations. See Section 8.7.5 what the additive model implies for the two-way classified group means. Most importantly, (i) For each g1 6= g2 , g1 , g2 ∈ {1, . . . 
, G}, the difference mg1,h − mg2,h does not depend on the value of h ∈ {1, . . . , H} and is equal to the difference between the corresponding means of the means by the first factor, i.e.,

mg1,h − mg2,h = mg1• − mg2• = θg1,g2• ,

which is expressed using the parameterizations (9.5) and (9.6) as

θg1,g2• = αZg1 − αZg2 = (cg1 − cg2 )> βZ .

(ii) For each h1 ≠ h2 , h1 , h2 ∈ {1, . . . , H}, the difference mg,h1 − mg,h2 does not depend on the value of g ∈ {1, . . . , G} and is equal to the difference between the corresponding means of the means by the second factor, i.e.,

mg,h1 − mg,h2 = m•h1 − m•h2 = θ•h1,h2 ,

which is expressed using the parameterizations (9.5) and (9.6) as

θ•h1,h2 = αWh1 − αWh2 = (dh1 − dh2 )> βW .

Model of effect of Z only
It is obtained as a submodel of the additive model (9.5) by requesting

αW1 = · · · = αWH ,

which in the full-rank parameterization (9.6) corresponds to requesting βW = 0H−1 . Hence the group means can be written as

mg,h = α0 + αZg = β0 + cg> βZ ,   g = 1, . . . , G,  h = 1, . . . , H.   (9.7)

This is in fact a linear model for the one-way classified (by the values of the covariate Z) group means, whose rank is G as soon as ng• > 0 for all g = 1, . . . , G. The model of effect of Z only will symbolically be written as

MZ : ∼ Z.

The two-way classified group means then satisfy
(i) For each g = 1, . . . , G, mg,1 = · · · = mg,H = mg• .
(ii) m•1 = · · · = m•H .

Model of effect of W only
It is the same as the model of effect of Z only with the meaning of Z and W exchanged. That is, the model of effect of W only is obtained as a submodel of the additive model (9.5) by requesting

αZ1 = · · · = αZG ,

which in the full-rank parameterization (9.6) corresponds to requesting βZ = 0G−1 . Hence the group means can be written as

mg,h = α0 + αWh = β0 + dh> βW ,   g = 1, . . . , G,  h = 1, . . . , H.

The model of effect of W only will symbolically be written as

MW : ∼ W.
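The ranks of these two-way classification models (and of the intercept only model ∼ 1) can be verified numerically; a small sketch (illustrative, not part of the original notes) with G = 3, H = 2 and one observation per cell, using indicator (dummy) columns:

```python
import numpy as np

# Model matrices of the two-way classification models and their ranks.
# Expected: G*H for ~ Z + W + Z:W, G+H-1 for ~ Z + W, G for ~ Z, H for ~ W,
# and 1 for ~ 1 (cf. the rank requirements ng,h > 0, ng. > 0, n.h > 0).
G, H = 3, 2
Z = np.repeat(np.arange(G), H)                 # cell (g, h), one row per cell
W = np.tile(np.arange(H), G)
IZ = np.eye(G)[Z]                              # indicator columns for Z
IW = np.eye(H)[W]                              # indicator columns for W
one = np.ones((G * H, 1))

inter = (IZ[:, :, None] * IW[:, None, :]).reshape(G * H, -1)  # Z:W indicators
X = {
    "MZW":  np.hstack([one, IZ, IW, inter]),   # ~ Z + W + Z:W
    "MZ+W": np.hstack([one, IZ, IW]),          # ~ Z + W
    "MZ":   np.hstack([one, IZ]),              # ~ Z
    "MW":   np.hstack([one, IW]),              # ~ W
    "M0":   one,                               # ~ 1
}
ranks = {name: int(np.linalg.matrix_rank(M)) for name, M in X.items()}
print(ranks)   # ranks 6, 4, 3, 2, 1 respectively
```

The model matrices are deliberately less-than-full-rank (ANOVA parameterization with all indicators), yet the ranks of the regression spaces come out as stated in the text.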
Intercept only model
This is a submodel of either the model (9.7) of effect of Z only, where it is additionally required that α^Z_1 = · · · = α^Z_G, or the model (9.8) of effect of W only, where it is additionally required that α^W_1 = · · · = α^W_H. Hence the group means can be written as

m_{g,h} = α_0,  g = 1, …, G, h = 1, …, H.

As usual, this model will symbolically be denoted as

M0 : ∼ 1.

Summary
In summary, we consider the following models for the two-way classification:

Model                         Rank       Requirement for the rank
MZW : ∼ Z + W + Z:W           G·H        n_{g,h} > 0 for all g = 1, …, G, h = 1, …, H
MZ+W : ∼ Z + W                G+H−1      n_{g•} > 0 for all g = 1, …, G, n_{•h} > 0 for all h = 1, …, H
MZ : ∼ Z                      G          n_{g•} > 0 for all g = 1, …, G
MW : ∼ W                      H          n_{•h} > 0 for all h = 1, …, H
M0 : ∼ 1                      1          n > 0

The considered models form two sequences of nested submodels:
(i) M0 ⊂ MZ ⊂ MZ+W ⊂ MZW;
(ii) M0 ⊂ MW ⊂ MZ+W ⊂ MZW.

Related submodel testing then corresponds to evaluating whether the two-way classified group means satisfy a particular structure invoked by the submodel at hand. If normality of the error terms is assumed, the testing can be performed by the methodology of Chapter 5 (F-tests on submodels).

9.2.3 Least squares estimation
Also with the two-way classification, explicit formulas for some of the LSE related quantities can be derived and certain properties of the least squares based inference then drawn.

Notation (Sample means in two-way classification).

Ȳ_{g,h•} := (1/n_{g,h}) Σ_{j=1}^{n_{g,h}} Y_{g,h,j},  g = 1, …, G, h = 1, …, H,

Ȳ_{g•} := (1/n_{g•}) Σ_{h=1}^{H} Σ_{j=1}^{n_{g,h}} Y_{g,h,j} = (1/n_{g•}) Σ_{h=1}^{H} n_{g,h} Ȳ_{g,h•},  g = 1, …, G,

Ȳ_{•h} := (1/n_{•h}) Σ_{g=1}^{G} Σ_{j=1}^{n_{g,h}} Y_{g,h,j} = (1/n_{•h}) Σ_{g=1}^{G} n_{g,h} Ȳ_{g,h•},  h = 1, …, H,

Ȳ := (1/n) Σ_{g=1}^{G} Σ_{h=1}^{H} Σ_{j=1}^{n_{g,h}} Y_{g,h,j} = (1/n) Σ_{g=1}^{G} n_{g•} Ȳ_{g•} = (1/n) Σ_{h=1}^{H} n_{•h} Ȳ_{•h}.

As usual, m̂_{g,h}, g = 1, …, G, h = 1, …, H, denote the LSE of the two-way classified group means and m̂ = (m̂_{1,1}, …, m̂_{G,H})^⊤.
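The ranks listed in the summary table above can be verified numerically. The following numpy sketch builds dummy (indicator) model matrices for a hypothetical balanced layout (G = 3, H = 4, J = 2 observations per cell — the sizes are illustrative only) and checks the rank of each model matrix:

```python
import numpy as np
from numpy.linalg import matrix_rank

# Hypothetical balanced layout: G levels of Z, H levels of W, J replicates per cell.
G, H, J = 3, 4, 2
n = G * H * J
g_idx = np.repeat(np.arange(G), H * J)          # level of Z for each observation
h_idx = np.tile(np.repeat(np.arange(H), J), G)  # level of W for each observation

one = np.ones((n, 1))
Z = (g_idx[:, None] == np.arange(G)).astype(float)    # dummy columns for Z
W = (h_idx[:, None] == np.arange(H)).astype(float)    # dummy columns for W
ZW = np.einsum('ig,ih->igh', Z, W).reshape(n, G * H)  # cell (interaction) dummies

ranks = {
    'MZW':  matrix_rank(np.hstack([one, Z, W, ZW])),  # expected G*H
    'MZ+W': matrix_rank(np.hstack([one, Z, W])),      # expected G+H-1
    'MZ':   matrix_rank(np.hstack([one, Z])),         # expected G
    'MW':   matrix_rank(np.hstack([one, W])),         # expected H
    'M0':   matrix_rank(one),                         # expected 1
}
assert ranks == {'MZW': G * H, 'MZ+W': G + H - 1, 'MZ': G, 'MW': H, 'M0': 1}
print(ranks)
```

The over-parameterized dummy coding is used deliberately: the rank deficiency of the stacked matrices reproduces exactly the ranks from the table, independently of the chosen (pseudo)contrasts.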
Theorem 9.2 Least squares estimation in two-way ANOVA linear models.
The fitted values and the LSE of the group means in two-way ANOVA linear models are given as follows (always for g = 1, …, G, h = 1, …, H, j = 1, …, n_{g,h}).

(i) Interaction model MZW : ∼ Z + W + Z:W:
  m̂_{g,h} = Ŷ_{g,h,j} = Ȳ_{g,h•}.
(ii) Additive model MZ+W : ∼ Z + W:
  m̂_{g,h} = Ŷ_{g,h,j} = Ȳ_{g•} + Ȳ_{•h} − Ȳ, but only in the case of balanced data⁶ (n_{g,h} = J for all g = 1, …, G, h = 1, …, H).
(iii) Model of effect of Z only MZ : ∼ Z:
  m̂_{g,h} = Ŷ_{g,h,j} = Ȳ_{g•}.
(iv) Model of effect of W only MW : ∼ W:
  m̂_{g,h} = Ŷ_{g,h,j} = Ȳ_{•h}.
(v) Intercept only model M0 : ∼ 1:
  m̂_{g,h} = Ŷ_{g,h,j} = Ȳ.

Note. There exists no simple expression to calculate the fitted values in the additive model in the case of unbalanced data. See Searle (1987, Section 4.9) for more details.

⁶ vyvážená data

Proof. Only the fitted values in the additive model must be derived now. Models MZW, MZ, MW are, in fact, one-way ANOVA models where we already know that the fitted values are equal to the corresponding group means. Also, model M0 is nothing new. The fitted values in the additive model can be calculated by solving the normal equations corresponding to the parameterization

m_{g,h} = α_0 + α^Z_g + α^W_h,  g = 1, …, G, h = 1, …, H,

while imposing the identifying constraints

Σ_{g=1}^{G} α^Z_g = 0,  Σ_{h=1}^{H} α^W_h = 0.

For the additive model with balanced data (n_{g,h} = J for all g = 1, …, G, h = 1, …, H):

• Sum of squares to be minimized:
SS(α) = Σ_g Σ_h Σ_j (Y_{g,h,j} − α_0 − α^Z_g − α^W_h)².

• Normal equations ≡ derivatives of SS(α) divided by (−2) and set to zero:
Σ_g Σ_h Σ_j Y_{g,h,j} − GHJ α_0 − HJ Σ_g α^Z_g − GJ Σ_h α^W_h = 0,
Σ_h Σ_j Y_{g,h,j} − HJ α_0 − HJ α^Z_g − J Σ_h α^W_h = 0,  g = 1, …, G,
Σ_g Σ_j Y_{g,h,j} − GJ α_0 − J Σ_g α^Z_g − GJ α^W_h = 0,  h = 1, …, H.

• After exploiting the identifying constraints:
Σ_g Σ_h Σ_j Y_{g,h,j} − GHJ α_0 = 0,
Σ_h Σ_j Y_{g,h,j} − HJ α_0 − HJ α^Z_g = 0,  g = 1, …, G,
Σ_g Σ_j Y_{g,h,j} − GJ α_0 − GJ α^W_h = 0,  h = 1, …, H.

• Hence
α̂_0 = Ȳ,  α̂^Z_g = Ȳ_{g•} − Ȳ,  α̂^W_h = Ȳ_{•h} − Ȳ,  g = 1, …, G, h = 1, …, H.

• And then
m̂_{g,h} = α̂_0 + α̂^Z_g + α̂^W_h = Ȳ_{g•} + Ȳ_{•h} − Ȳ,  g = 1, …, G, h = 1, …, H. k

End of Lecture #16 (26/11/2015)

9.2. TWO-WAY CLASSIFICATION 169

Start of Lecture #17 (26/11/2015)

Consequence of Theorem 9.2: LSE of the means of the means in the interaction and the additive model with balanced data.
With balanced data (n_{g,h} = J for all g = 1, …, G, h = 1, …, H), the LSE of the means of the means by the first factor (parameters m_{1•}, …, m_{G•}) or by the second factor (parameters m_{•1}, …, m_{•H}) satisfy, in both the interaction and the additive two-way ANOVA linear model, the following:

m̂_{g•} = Ȳ_{g•},  m̂_{•h} = Ȳ_{•h},  g = 1, …, G, h = 1, …, H.

If additionally normality is assumed, then m̂^Z := (m̂_{1•}, …, m̂_{G•})^⊤ and m̂^W := (m̂_{•1}, …, m̂_{•H})^⊤ satisfy

m̂^Z | Z, W ∼ N_G(m^Z, σ² V_Z),  m̂^W | Z, W ∼ N_H(m^W, σ² V_W),

where

m^Z = (m_{1•}, …, m_{G•})^⊤,  m^W = (m_{•1}, …, m_{•H})^⊤,  V_Z = (1/(JH)) I_G,  V_W = (1/(JG)) I_H.

Proof. All parameters m_{g•}, g = 1, …, G, and m_{•h}, h = 1, …, H, are linear combinations of the group means (of the response mean vector μ = E(Y | Z, W)) and hence are estimable, with the LSE being the appropriate linear combination of the LSE of the group means. With balanced data, we get for the considered models (the calculation is shown only for the LSE of m_{g•}, g = 1, …, G):

(i) Interaction model:
m̂_{g•} = (1/H) Σ_{h=1}^{H} m̂_{g,h} = (1/H) Σ_{h=1}^{H} Ȳ_{g,h•} = (1/(HJ)) Σ_{h=1}^{H} J Ȳ_{g,h•} = (1/n_{g•}) Σ_{h=1}^{H} n_{g,h} Ȳ_{g,h•} = Ȳ_{g•}.

(ii) Additive model:
m̂_{g•} = (1/H) Σ_{h=1}^{H} m̂_{g,h} = (1/H) Σ_{h=1}^{H} (Ȳ_{g•} + Ȳ_{•h} − Ȳ)
= Ȳ_{g•} + (1/H) Σ_{h=1}^{H} Ȳ_{•h} − Ȳ = Ȳ_{g•} + (1/n) Σ_{h=1}^{H} n_{•h} Ȳ_{•h} − Ȳ = Ȳ_{g•} + Ȳ − Ȳ = Ȳ_{g•}.

Further, E(Ȳ_{g•} | Z, W) = m_{g•} follows from the properties of the LSE, which are unbiased, or from a direct calculation.
Next,

var(Ȳ_{g•} | Z, W) = var( (1/(JH)) Σ_{h=1}^{H} Σ_{j=1}^{J} Y_{g,h,j} | Z, W ) = σ²/(JH)

follows from the linear model assumption var(Y | Z, W) = σ² I_n. Finally, normality of Ȳ_{g•} in the case of a normal linear model follows from the general LSE theory. k

9.2.4 Sums of squares and ANOVA tables with balanced data

Sums of squares
As already mentioned in Section 9.2.2, the considered models form two sequences of nested submodels:
(i) M0 ⊂ MZ ⊂ MZ+W ⊂ MZW;
(ii) M0 ⊂ MW ⊂ MZ+W ⊂ MZW.

Corresponding differences in the residual sums of squares (that enter the numerator of the respective F-statistic) are given as squared Euclidean norms of the differences of the fitted values from the models being compared (Section 5.1). In particular, in the case of balanced data (n_{g,h} = J, g = 1, …, G, h = 1, …, H), we get

SS(Z + W + Z:W | Z + W) = Σ_{g=1}^{G} Σ_{h=1}^{H} J (Ȳ_{g,h•} − Ȳ_{g•} − Ȳ_{•h} + Ȳ)²,

SS(Z + W | W) = Σ_{g=1}^{G} Σ_{h=1}^{H} J (Ȳ_{g•} + Ȳ_{•h} − Ȳ − Ȳ_{•h})² = Σ_{g=1}^{G} Σ_{h=1}^{H} J (Ȳ_{g•} − Ȳ)²,

SS(Z + W | Z) = Σ_{g=1}^{G} Σ_{h=1}^{H} J (Ȳ_{g•} + Ȳ_{•h} − Ȳ − Ȳ_{g•})² = Σ_{g=1}^{G} Σ_{h=1}^{H} J (Ȳ_{•h} − Ȳ)²,

SS(Z | 1) = Σ_{g=1}^{G} Σ_{h=1}^{H} J (Ȳ_{g•} − Ȳ)²,

SS(W | 1) = Σ_{g=1}^{G} Σ_{h=1}^{H} J (Ȳ_{•h} − Ȳ)².

We see that

SS(Z + W | W) = SS(Z | 1),  SS(Z + W | Z) = SS(W | 1).

Nevertheless, note that this does not hold in the case of unbalanced data.

Notation (Sums of squares in two-way classification). In the case of two-way classification and balanced data, we will use the following notation:

SS_Z := Σ_{g=1}^{G} Σ_{h=1}^{H} J (Ȳ_{g•} − Ȳ)²,

SS_W := Σ_{g=1}^{G} Σ_{h=1}^{H} J (Ȳ_{•h} − Ȳ)²,

SS_{ZW} := Σ_{g=1}^{G} Σ_{h=1}^{H} J (Ȳ_{g,h•} − Ȳ_{g•} − Ȳ_{•h} + Ȳ)²,

SS_T := Σ_{g=1}^{G} Σ_{h=1}^{H} Σ_{j=1}^{J} (Y_{g,h,j} − Ȳ)²,

SS_e^{ZW} := Σ_{g=1}^{G} Σ_{h=1}^{H} Σ_{j=1}^{J} (Y_{g,h,j} − Ȳ_{g,h•})².

Notes.
• Quantities SS_Z, SS_W, SS_{ZW} are differences of the residual sums of squares of two models that differ by the terms Z, W or Z:W, respectively.
• Quantity SS_T is the classical total sum of squares.
• Quantity SS_e^{ZW} is the residual sum of squares from the interaction model.

Lemma 9.3 Breakdown of the total sum of squares in a balanced two-way classification.
In the case of a balanced two-way classification, the following identity holds:

SS_T = SS_Z + SS_W + SS_{ZW} + SS_e^{ZW}.

Proof. The decomposition in the lemma corresponds to the numerator sums of squares of the F-statistics when testing the series of submodels M0 ⊂ MZ ⊂ MZ+W ⊂ MZW or the series of submodels M0 ⊂ MW ⊂ MZ+W ⊂ MZW. Let M0, MZ, MW, MZ+W, MZW be the regression spaces of the models M0, MZ, MW, MZ+W, MZW, respectively. That is, SS_T = ‖U⁰‖², where U⁰ are the residuals of model M0 and

U⁰ = D₁ + D₂ + D₃ + U^{ZW},

where D₁, D₂, D₃, U^{ZW} are mutually orthogonal projections of Y into subspaces of Rⁿ:
(i) D₁: projection into MZ \ M0, ‖D₁‖² = SS_Z.
(ii) D₂: projection into MZ+W \ MZ, ‖D₂‖² = SS_W.
(iii) D₃: projection into MZW \ MZ+W, ‖D₃‖² = SS_{ZW}.
(iv) U^{ZW}: projection into Rⁿ \ MZW (residual space of MZW).
From orthogonality: SS_T = SS_Z + SS_W + SS_{ZW} + SS_e^{ZW}. k

ANOVA tables
As a consequence of the above considerations, it holds for balanced data that:
(i) Equally labeled rows in the type I ANOVA table are the same irrespective of whether the table is formed in the order Z + W + Z:W or in the order W + Z + Z:W.
(ii) Type I and type II ANOVA tables are the same.
Both the type I and the type II ANOVA table then take the form

Effect (Term)   Degrees of freedom   Effect sum of squares   Effect mean square   F-stat.   P-value
Z               G − 1                SS_Z                    ?                    ?         ?
W               H − 1                SS_W                    ?                    ?         ?
Z:W             GH − G − H + 1       SS_{ZW}                 ?                    ?         ?
Residual        n − GH               SS_e^{ZW}               ?

Chapter 10 Checking Model Assumptions

In Chapter 4, we introduced some basic, mostly graphical methods to check the model assumptions. Now, we introduce some additional methods, mostly based on statistical tests. As in Chapter 4, we assume that the data are represented by n random vectors (Y_i, Z_i^⊤)^⊤, Z_i = (Z_{i,1}, …, Z_{i,p})^⊤ ∈ Z ⊆ R^p, i = 1, …, n. Possibly two sets of regressors are available:

(i) X_i, i = 1, …, n, where X_i = t_X(Z_i) for some transformation t_X : R^p → R^k. They give rise to the model matrix

X_{n×k} = (X_1, …, X_n)^⊤ = (X^0, X^1, …, X^{k−1}).

For most practical problems, X^0 = (1, …, 1)^⊤ (almost surely).

(ii) V_i, i = 1, …, n, where V_i = t_V(Z_i) for some transformation t_V : R^p → R^l. They give rise to the model matrix

V_{n×l} = (V_1, …, V_n)^⊤ = (V^1, …, V^l).

Primarily, we will assume that the model matrix X is sufficient in the sense that E(Y | Z) = E(Y | X) = Xβ for some β = (β_0, …, β_{k−1})^⊤ ∈ R^k. That is, we will depart from assuming

Y | Z ∼ (Xβ, σ² I_n),

or even from assuming normality, i.e.,

Y | Z ∼ N_n(Xβ, σ² I_n).

The task is now to verify the appropriateness of those assumptions which, in principle, consist of four subassumptions outlined in Chapter 4:
(A1) Correct regression function (errors with a zero mean).
(A2) Homoscedasticity of errors.
(A3) Uncorrelated/independent errors.
(A4) Normal errors.

10.1 Model with added regressors

In this section, we technically derive some expressions that will be useful in later sections of this chapter and also in Chapter 14. We will deal with two models:
(i) Model M: Y | Z ∼ (Xβ, σ² I_n).
(ii) Model Mg: Y | Z ∼ (Xβ + Vγ, σ² I_n), whose model matrix is the n × (k + l) matrix G = (X, V).

Notation (Quantities derived under the two models).
(i) Quantities derived while assuming model M will be denoted as usual. In particular:
• (Any) solution to the normal equations: b = (X^⊤X)^− X^⊤ Y. In the case of a full-rank model matrix X, b = β̂ = (X^⊤X)^{−1} X^⊤ Y is the LSE of the vector β in model M;
• Hat matrix (projection matrix into the regression space M(X)): H = X (X^⊤X)^− X^⊤ = (h_{i,t})_{i,t=1,…,n};
• Fitted values: Ŷ = HY = (Ŷ_1, …, Ŷ_n)^⊤;
• Projection matrix into the residual space M(X)^⊥: M = I_n − H = (m_{i,t})_{i,t=1,…,n};
• Residuals: U = Y − Ŷ = MY = (U_1, …, U_n)^⊤;
• Residual sum of squares: SS_e = ‖U‖².
(ii) Analogous quantities derived while assuming model Mg will be indicated by a subscript g:
• (Any) solution to the normal equations: (b_g^⊤, c_g^⊤)^⊤ = (G^⊤G)^− G^⊤ Y.
In case of a full-rank model matrix G: −1 > > b ,γ β G Y g bg = G G provides the LSE of vectors β and γ in model Mg ; • Hat matrix (projection matrix into the regression space M G ): − Hg = G G> G G> = hg,i,t i,t=1,...,n ; • Fitted values Yb g = Hg Y = Ybg,1 , . . . , Ybg,n ; ⊥ • Projection matrix into the residual space M G : Mg = In − Hg = mg,i,t i,t=1,...,n • Residuals: U g = Y − Yb g = Mg Y = Ug,1 , . . . , Ug,n ; 2 • Residual sum of squares: SSe,g = U g . ; 10.1. MODEL WITH ADDED REGRESSORS 175 Lemma 10.1 Model with added regressors. Quantities derived while assuming model M : Y Z ∼ Xβ, σ 2 In and quantities derived while assuming model Mg : Y Z ∼ Xβ + Vγ, σ 2 In are mutually in the following relationship. Yb g = Yb + MV V> MV − V> U for some bg ∈ Rk , cg ∈ Rl . = Xbg + Vcg , Vector bg and cg such that Yb g = Xbg + Vcg satisfy: cg = − V> MV V> U , − bg = b − X> X X> Vcg Finally for some b = X> X − X> Y . 2 SSe − SSe,g = MVcg . Proof. • Yb g is a projection of Y into M X, V = M X, MV . − • Use “H = X X> X X> ”: − > X MV} {z | 0 Hg = X, MV V> MX V> MV | {z } 0 = X, MV = X X> X − X> X X> X 0 − 0 − > V MV X> + MV V> MV − X> V> M ! ! X> V> M ! V> M. • So that, − − Yb g = Hg Y = X X> X X> Y + MV V> MV V> MY |{z} | {z } U Yb − = Yb + MV V> MV V> U ® • Theorem 2.5: It must be possible to write Yb g as Yb g = Xbg + Vcg , where bg , cg solves normal equations based on a model matrix X, V . • We rewrite ® to see what bg and cg could be. 10.1. MODEL WITH ADDED REGRESSORS • Remember that Yb = Xb for any b = X> X 176 − X> Y . Take now ® and further calculate: − − Yb g = |{z} Xb + In − X X> X X> V V> MV V> U {z } | Yb M − − − = Xb + V V> MV V> U − X X> X X> V V> MV V> U − − − = X b − X> X X> V V> MV V> U + V V> MV V> U . | {z } | {z } cg bg − • That is, cg = V> MV V> U , − bg = b − X> X X> Vcg . • Finally 2 2 2 − SSe − SSe,g = Yb g − Yb = MV V> MV V> U = MVcg . k End of Lecture #17 (26/11/2015) 10.2. 
Correct regression function

We are now assuming a linear model M: Y | Z ∼ (Xβ, σ² I_n), which written using the error terms is

M: Y = Xβ + ε,  E(ε | Z) = E(ε) = 0_n,  var(ε | Z) = var(ε) = σ² I_n.

The assumption (A1) of a correct regression function is, in particular,

E(Y | Z) = Xβ for some β ∈ R^k,  E(Y | Z) ∈ M(X),  E(ε | Z) = E(ε) = 0_n.

As (also) explained in Section 4.1, assumption (A1) implies E(U | Z) = 0_n, and this property is exploited by a basic diagnostic tool, which is a plot of the residuals against possible factors derived from the covariates Z that may influence the expectation of the residuals. Factors traditionally considered are
(i) fitted values Ŷ;
(ii) regressors included in the model M (columns of the model matrix X);
(iii) regressors not included in the model M (columns of the model matrix V).

Assumptions. For the rest of this section, we assume that model M is a model of general rank r with intercept, that is,

rank(X) = r ≤ k < n,  X = (X^0, …, X^{k−1}),  X^0 = 1_n.

In the following, we develop methods to examine whether, for a given j (j ∈ {1, …, k − 1}), the jth regressor, i.e., the column X^j, is correctly included in the model matrix X. In other words, we will aim at examining whether the jth regressor is possibly responsible for a violation of the assumption (A1).

10.2.1 Partial residuals

Notation (Model with a removed regressor). For j ∈ {1, …, k − 1}, let X^{(−j)} denote the model matrix X without the column X^j and let β^{(−j)} = (β_0, …, β_{j−1}, β_{j+1}, …, β_{k−1})^⊤ denote the regression coefficients vector without the jth element. The model with the removed jth regressor will be the linear model

M^{(−j)}: Y | Z ∼ (X^{(−j)} β^{(−j)}, σ² I_n).

Start of Lecture #18 (03/12/2015)

All quantities related to the model M^{(−j)} will be indicated by a superscript (−j).
In particular, − > > M(−j) = In − X(−j) X(−j) X(−j) X(−j) is a projection matrix into the residual space M X(−j) ⊥ ; U (−j) = M(−j) Y is a vector of residuals of the model M(−j) . Assumptions. We will assume rank X(−j) = r − 1 which implies that (i) X j ∈ / M X(−j) ; (ii) X j 6= 0n ; (iii) X j is not a multiple of a vector 1n . Derivations towards partial residuals Model M is now a model with one added regressor to a model M(−j) and the two models form a pair (model–submodel). Let b = b0 , . . . , bj−1 , bj , bj+1 , . . . , bk−1 > be (any) solution to normal equations in model M. Lemma 10.1 (Model with added regressors) provides − > > bj = X j M(−j) X j X j U (−j) . (10.1) Further, since a matrix M(−j) is idempotent, we have 2 > X j M(−j) X j = M(−j) X j . > At the same time, M(−j) X j 6= 0n since X j ∈ / M X(−j) , X j 6= 0n . Hence, X j M(−j) X j > 0 and a pseudoinverse in (10.1) can be replaced by an inverse. That is, > bj = βbj = X j M(−j) X j −1 X j> > U (−j) = X j U (−j) > X j M(−j) X j is the LSE of the estimable parameter βj of model M (which is its BLUE). In summary, under the assumptions used to perform derivations above, i.e., while assuming that 0 X = 1n and for chosen j ∈ 1, . . . , k − 1 , the regression coefficient βj is estimable. Consequently, we define a vector of jth partial residuals of model M as follows. 10.2. CORRECT REGRESSION FUNCTION 179 Definition 10.1 Partial residuals. A vector of jth partial residuals1 of model M is a vector U part,j U1 + βbj X1,j .. . = U + βbj X j = . Un + βbj Xn,j Note. We have U part,j = U + βbj X j = Y − Xb − βbj X j = Y − Yb − βbj X j . That is, the jth partial residuals are calculated as (classical) residuals where, however, the fitted values subtract a part that corresponds to the column X j of the model matrix. Theorem 10.2 Property of partial residuals. Let Y Z ∼ Xβ, σ 2 In , rank(Xn×k ) = r ≤ k, X 0 = 1n , β = β0 , . . . , βk−1 . Let j ∈ 1, . . . 
, k − 1 be such that rank X(−j) = r − 1 and let βbj be the LSE of βj . Let us consider a linear model (regression line with covariates X j ) with • the jth partial residuals U part,j as response; • a matrix 1n , X j as the model matrix; • regression coefficients γ j = γj,0 , γj,1 . The least squares estimators of parameters γj,0 and γj,1 are γ bj,0 = 0, γ bj,1 = βbj . Proof. • U part,j = U + βbj X j . 2 2 part,j j j b • Hence U − γj,0 1n − γj,1 X = U − γj,0 1n + (γj,1 − βj )X = ®. ⊥ • Since 1n ∈ M X , X j ∈ M X , U ∈ M X , we have ® 2 2 2 = U + γj,0 1n + γj,1 − βbj X j ≥ U with equality if and only if γj,0 = 0 & γj,1 = βbj . 1 vektor jtých parciálních reziduí k 10.2. CORRECT REGRESSION FUNCTION 180 Shifted partial residuals Notation (Response, regressor and partial residuals means). Let n Y = n 1X Yi , n j X = i=1 1X Xi,j , n n U part,j i=1 = 1 X part,j Ui . n i=1 If X 0 = 1n (model with intercept), we have 0= n X Ui = i=1 1 n n X n X Uipart,j + βbj Xi,j , i=1 Uipart,j i=1 n 1 X b Xi,j , = βj n i=1 U part,j j = βbj X . Especially for purpose of visualization by plotting the partial residuals against the regressors a shifted partial residuals are sometimes used. Note that this only changes the estimated intercept of the regression line of dependence of partial residuals on the regressor. Definition 10.2 Shifted partial residuals. A vector of jth response-mean partial residuals of model M is a vector j U part,j,Y = U part,j + Y − βbj X 1n . A vector of jth zero-mean partial residuals of model M is a vector j U part,j,0 = U part,j − βbj X 1n . Notes. • A mean of the response-mean partial residuals is the response sample mean Y , i.e., n 1 X part,j,Y Ui =Y. n i=1 • A mean of the zero-mean partial residuals is zero, i.e., n 1 X part,j,0 Ui = 0. n i=1 The zero-mean partial residuals are calculated by the R function residuals with its type argument being set to partial. 10.2. CORRECT REGRESSION FUNCTION 181 Notes (Use of partial residuals). 
A vector of partial residuals can be interpreted as a response vector from which we have removed the possible effect of all the remaining regressors. Hence, the dependence of U^{part,j} on X^j shows
• the net effect of the jth regressor on the response;
• the partial effect of the jth regressor on the response, adjusted for the effect of the remaining regressors.

The partial residuals are then mainly used in two ways:

Diagnostic tool. As a (graphical) diagnostic tool, the scatterplot (X^j, U^{part,j}) is used. In case the jth regressor is correctly included in the original regression model M, i.e., if no transformation of the regressor X^j is required to achieve E(Y | Z) ∈ M(X), the points in the scatterplot (X^j, U^{part,j}) should lie along a line.

Visualization. The property that the estimated slope of the regression line in a model U^{part,j} ∼ X^j is the same as the jth estimated regression coefficient in the multiple regression model Y ∼ X is also used to visualize the dependence of the response on the jth regressor by showing the scatterplot (X^j, U^{part,j}) equipped with a line with zero intercept and slope equal to β̂_j.

10.2.2 Test for linearity of the effect

To examine the appropriateness of the linearity of the effect of the jth regressor X^j on the response expectation E(Y | Z) by a statistical test, we can use a test on a submodel (which, per se, requires the additional assumption of normality). Without loss of generality, assume that the jth regressor X^j is the last column of the model matrix X and denote the remaining non-intercept columns of the matrix X as X_0. That is, assume that X = (1_n, X_0, X^j). Two classical choices of a pair model–submodel being tested in this context are the following.

More general parameterization of the jth regressor
The submodel is the model M with the model matrix X. The (larger) model is the model Mg obtained by replacing the column X^j in the model matrix X by a matrix V such that X^j ∈ M(V), rank(V) ≥ 2.
That is, the model matrices of the submodel and the (larger) model are

Submodel M: (1_n, X_0, X^j) = X;
(Larger) model Mg: (1_n, X_0, V).

Classical choices of the matrix V are such that it corresponds to:
(i) a polynomial of degree d ≥ 2 based on the regressor X^j;
(ii) a regression spline of degree d ≥ 1 based on the regressor X^j. In this case, 1_n ∈ M(V) and hence, for practical calculations, the larger model Mg is usually estimated using a model matrix (X_0, V) that does not explicitly include the intercept term, which is included implicitly.

Categorization of the jth regressor
Let −∞ < x^low_j < x^upp_j < ∞ be chosen such that the interval (x^low_j, x^upp_j) covers the values X_{1,j}, …, X_{n,j} of the jth regressor. That is,

x^low_j < min_i X_{i,j},  max_i X_{i,j} < x^upp_j.

Let I_1, …, I_H be H > 1 subintervals of (x^low_j, x^upp_j) based on a grid

x^low_j < λ_1 < · · · < λ_{H−1} < x^upp_j.

Let x_h ∈ I_h, h = 1, …, H, be chosen representative values for each of the subintervals I_1, …, I_H (e.g., their midpoints), and let X^{j,cut} = (X^{j,cut}_1, …, X^{j,cut}_n)^⊤ be obtained by categorization of the jth regressor using the division I_1, …, I_H and the representatives x_1, …, x_H, i.e. (i = 1, …, n):

X^{j,cut}_i = x_h  ≡  X^j_i ∈ I_h,  h = 1, …, H.

In this way, we obtain a categorical ordinal regressor X^{j,cut} whose values x_1, …, x_H can be considered as collapsed values of the original regressor X^j. Consequently, if linearity with respect to the original regressor X^j holds, then it also holds (approximately, depending on the chosen division I_1, …, I_H and the representatives x_1, …, x_H) with respect to the ordinal categorical regressor X^{j,cut} if this is viewed as a numeric one. Let V be an n × (H − 1) model matrix corresponding to some (pseudo)contrast parameterization of the covariate X^{j,cut} if this is viewed as categorical with H levels.
We have X^{j,cut} ∈ M(V), and a test for linearity of the jth regressor is obtained by considering the following model matrices in the submodel and the (larger) model:

Submodel M: (1_n, X_0, X^{j,cut});
(Larger) model Mg: (1_n, X_0, V).

Drawback of tests for linearity of the effect
Recall that the hypothesis of linearity of the effect of the jth regressor always forms the null hypothesis of the proposed submodel tests. Hence, we are only able to confirm non-linearity of the effect (if the submodel is rejected) but are never able to confirm linearity.

10.3 Homoscedasticity

We are again assuming a linear model M: Y | Z ∼ (Xβ, σ² I_n), which written using the error terms is

M: Y = Xβ + ε,  E(ε | Z) = E(ε) = 0_n,  var(ε | Z) = var(ε) = σ² I_n.

The assumption (A2) of homoscedasticity is, in particular,

var(Y | Z) = σ² I_n,  var(ε | Z) = var(ε) = σ² I_n,

where σ² is unknown but, most importantly, constant.

10.3.1 Tests of homoscedasticity

Many tests of homoscedasticity can be found in the literature. They mostly consider the following null and alternative hypotheses:

H0: σ² = const,
H1: σ² = a certain function of some factor(s).

A particular test is then sensitive (powerful) to detect heteroscedasticity if this expresses itself such that the variance σ² is the certain function of the factor(s) specified by the alternative hypothesis. The test is possibly weak to detect heteroscedasticity (weak to reject the null hypothesis of homoscedasticity) if heteroscedasticity expresses itself in a different way than the considered alternative hypothesis.

10.3.2 Score tests of homoscedasticity

A wide range of tests of homoscedasticity can be derived by assuming a (full-rank) normal linear model, basing the alternative hypothesis on a particular general linear model and then using (asymptotic) maximum-likelihood theory to derive a testing procedure.

Assumptions.
For the rest of this section, we assume that model M (the model under the null hypothesis) is normal of full rank, i.e.,

M: Y | Z ∼ N_n(Xβ, σ² I_n),  rank(X_{n×k}) = k,

and the alternative model is a generalization of a general normal linear model

M_hetero: Y | Z ∼ N_n(Xβ, σ² W^{−1}),

where W = diag(w_1, …, w_n), w_i^{−1} = τ(λ, β, Z_i), i = 1, …, n, and τ is a known function of λ ∈ R^q, β ∈ R^k (regression coefficients) and z ∈ R^p (covariates) such that

τ(0, β, z) = 1 for all β ∈ R^k, z ∈ R^p.

Model M_hetero is then a model with unknown parameters β, λ, σ² which, with λ = 0, simplifies into model M. In other words, model M is a nested² model of model M_hetero, and a test of homoscedasticity corresponds to testing

H0: λ = 0,  H1: λ ≠ 0.   (10.2)

Having assumed normality, both models M and M_hetero are fully parametric models, and standard (asymptotic) maximum-likelihood theory can now be used to derive a test of (10.2). A family of score tests based on specific choices of the weight function τ is derived by Cook and Weisberg (1983).

² vnořený

Breusch-Pagan test
A particular score test of homoscedasticity was also derived by Breusch and Pagan (1979), who consider the following weight function (x = t_X(z) is the transformation of the original covariates that determines the regressors of model M):

τ(λ, β, z) = τ(λ, β, x) = exp(λ x^⊤ β).

That is, under the heteroscedastic model, for i = 1, …, n,

var(Y_i | Z_i) = var(ε_i | Z_i) = σ² exp(λ X_i^⊤ β) = σ² exp(λ E(Y_i | Z_i)),   (10.3)

and the test of homoscedasticity is testing H0: λ = 0 against H1: λ ≠ 0. It is seen from the model (10.3) that the Breusch-Pagan test is sensitive (powerful to detect heteroscedasticity) if the residual variance is a monotone function of the response expectation.

Note (One-sided tests of homoscedasticity).
In practical situations, if it can be assumed that the residual variance is possibly a monotone function of the response expectation then it can mostly be also assumed that it is its increasing function. A more powerful test of homoscedasticity is then obtained by considering the one-sided alternative H1 : λ > 0. Analogously, a test that is sensitive towards alternative of a residual variance which decreases with the response expectation is obtained by considering the alternative H1 : λ < 0. Note (Koenker’s studentized Breusch-Pagan test). The original Breusch-Pagan test is derived using standard maximum-likelihood theory while departing from assumption of a normal linear model. It has been shown in the literature that the test is not robust towards non-normality. For this reason, Koenker (1981) derived a slightly modified version of the Breusch-Pagan test which is robust towards non-normality. It is usually referred to as (Koenker’s) studentized Breusch-Pagan test and its use is preferred to the original test. 2 vnořený 10.3. HOMOSCEDASTICITY 185 Linear dependence on the regressors Let tW : Rp −→ Rq be a given transformation, w := tW (z), W i = tW (Z i ), i = 1, . . . , n. The following choice of the weight function can be considered: τ (β, λ, z) = τ (λ, w) = exp λ> w . That is, under the heteroscedastic model, for i = 1, . . . , n, = var Yi Z i = var εi Z i = var εi On a log-scale: σ 2 exp λ> W i . log var Yi Z i = log(σ 2 ) + λ> W i . | {z } λ0 In other words, the residual variance follows on a log-scale a linear model with regressors given by vectors W i . If tW is a univariate transformation leading to w = tW (z), one-sided alternatives are again possible reflecting assumption that under heteroscedasticity, the residual variance increases/decreases with a value of W = tW (Z). The most common use is then such that tW (z) and related values of W1 = tW (Z 1 ), . . 
., Wn = tW (Z n ) correspond to one of the (non-intercept) regressors from either the model matrix X (regressors included in the model), or from the matrix V that contains regressors currently not included in the model. The corresponding score test of homoscedasticity then examines whether the residual variance changes/increases/decreases (depending on chosen alternative) with that regressor. Note (Score tests of homoscedasticity in the R software). In the R software, the score tests of homoscedasticity are provided by functions: (i) ncvTest (abbreviation for a “non-constant variance test”) from package car; (ii) bptest from package lmtest. The Koenker’s studentized variant of the test is only possible with the bptest function. 10.3.3 Some other tests of homoscedasticity Some other tests of homoscedasticity that can be encountered in practice include the following Goldfeld-Quandt test is an adaptation of a classical F-test of equality of the variances of the two independent samples into a regression context proposed by Goldfeld and Quandt (1965). It is applicable in linear models with both numeric and categorical covariates and under the alternative, heteroscedasticity is expressed by a monotone dependence of the residual variance on a prespecified ordering of the observations. G-sample tests of homoscedasticity are tests applicable for linear models with only categorical covariates (ANOVA models). They require repeated observations for each combination of values of the covariates and basically test equality of variances of G independent random samples. The most common tests of this type include: Bartlett test by Bartlett (1937) which, however, is quite sensitive towards non-normality and hence its use is not recommended. It is implemented in the R function bartlett.test; 10.3. 
HOMOSCEDASTICITY 186 Levene test by Levene (1960), implemented in the R function leveneTest from package car or in the R function levene.test from package lawstat; Brown-Forsythe test by Brown and Forsythe (1974) which is a robustified version of the Levene test and is implemented in the R function levene.test from package lawstat; Fligner-Killeen test by Fligner and Killeen (1976) which is implemented in the R function fligner.test. End of Lecture #18 (03/12/2015) 10.4. NORMALITY 10.4 187 Normality Start of Lecture #19 (03/12/2015) In this section, we are assuming a normal linear model M : Y Z ∼ Nn Xβ, σ 2 In , rank(X) = r, that written using the error terms is i.i.d. M : Yi = X > i β + εi , εi ∼ N (0, σ 2 ), i = 1, . . . , n. (10.4) Our interest now lies in verifying assumption (A4) of normality of the error terms εi , i = 1, . . . , n. Let us remind our standard notation needed in this section: (i) Hat matrix (projection matrix into the regression space M X ): − H = X X> X X> = hi,t i,t=1,...,n ; (ii) Projection matrix into the residual space M X ⊥ : M = In − H = mi,t i,t=1,...,n ; (iii) Residuals: U = Y − Yb = MY = U1 , . . . , Un ; 2 (iv) Residual sum of squares: SSe = U ; (v) Residual mean square: MSe = 1 n−r SSe . (vi) Standardized residuals: U std = U1std , . . . , Unstd , where Ui Uistd = p , MSe mi,i i = 1, . . . , n (if mi,i > 0). Lemma 10.3 Property of normal distribution. Let Z ∼ Nn (0, σ 2 In ). Let T : Rn −→ R be a measurable function satisfying T (cz) = T (z) for all c > 0 and z ∈ Rn . The random variables T (Z) and kZk are then independent. Proof. Proof/calculations were skipped and are not requested for the exam. Proof/calculations below are shown only for those who are interested. • Consider spherical coordinates: Z1 = R cos(φ1 ), Z2 = R sin(φ1 ) cos(φ2 ), Z3 = R sin(φ1 ) sin(φ2 ) cos(φ3 ), .. . Zn−1 = R sin(φ1 ) · · · sin(φn−2 ) cos(φn−1 ), Zn = R sin(φ1 ) · · · sin(φn−2 ) sin(φn−1 ). 10.4. NORMALITY 188 • Distance from origin: R = kZk. 
• Direction: φ = (φ_1, …, φ_{n−1})^T.

• Exercise for 3rd-year bachelor students: if Z ~ N_n(0, σ² I_n), then the distance R from the origin and the direction φ are independent.

• R = ||Z|| (the distance from the origin itself), while T(Z) depends on the direction only (since T(Z) = T(cZ) for all c > 0); hence ||Z|| and T(Z) are independent. □

Theorem 10.4 (Moments of standardized residuals under normality). Let Y | X ~ N_n(Xβ, σ² I_n) and let m_{i,i} > 0 for a chosen i ∈ {1, …, n}. Then E(U_i^std | X) = 0, var(U_i^std | X) = 1.

Proof. Proof/calculations were available on the blackboard in K1. □

Notes. If the normal linear model (10.4) holds, then Theorems 3.1 and 10.4 provide:

(i) For the (raw) residuals: U | Z ~ N_n(0_n, σ² M). That is, the (raw) residuals also follow a normal distribution; nevertheless, the variances of the individual residuals U_1, …, U_n differ (the diagonal of the projection matrix M is not necessarily constant). On top of that, the residuals are not necessarily independent (the projection matrix M is not necessarily a diagonal matrix).

(ii) For the standardized residuals (if m_{i,i} > 0 for all i = 1, …, n, which is always the case in a full-rank model): E(U_i^std | Z) = 0, var(U_i^std | Z) = 1, i = 1, …, n. That is, the standardized residuals have the same mean and also the same variance but are neither necessarily normally distributed nor necessarily independent.

In summary, in a normal linear model, neither the raw residuals nor the standardized residuals form a random sample (a set of i.i.d. random variables) from a normal distribution.

10.4.1 Tests of normality

There exist formal tests of the null hypothesis of normality of the error terms:

H_0: the distribution of ε_1, …, ε_n is normal,  (10.5)

where the distribution of the test statistic is known exactly under the null hypothesis of normality. Nevertheless, those tests have quite low power and hence are only rarely used in practice.
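The projections and standardized residuals recalled in this section are straightforward to compute directly. The following numpy sketch (with illustrative simulated data, not from the course) verifies that the residual projector M is idempotent, annihilates the columns of X, and has a non-constant diagonal, which is exactly why the raw residuals are heteroscedastic even under model (10.4):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

# Hat matrix H and residual projector M (X has full rank here,
# so the generalized inverse is the ordinary inverse)
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H

beta = np.array([1.0, 2.0, -1.0])
Y = X @ beta + 0.5 * rng.normal(size=n)

U = M @ Y                              # raw residuals
MSe = (U @ U) / (n - k)                # residual mean square
U_std = U / np.sqrt(MSe * np.diag(M))  # standardized residuals

# M is a projection: idempotent and orthogonal to the columns of X
print(np.allclose(M @ M, M), np.allclose(M @ X, 0))
# diag(M) is not constant => raw residuals have unequal variances
print(np.diag(M).min() < np.diag(M).max())
```

With an intercept in the model, the residuals also sum to zero, since 1_n is a column of X.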
In practice, approximate approaches are used that apply standard tests of normality to either the raw residuals U or the standardized residuals U^std (even though, under the null hypothesis (10.5), neither of them forms a random sample from a normal distribution). Several empirical studies have shown that such approaches maintain the significance level of the test quite well at the requested value. At the same time, they mostly recommend using the raw residuals U rather than the standardized residuals U^std.

Classical tests of normality include the following:

Shapiro-Wilk test implemented in the R function shapiro.test.
Lilliefors test implemented in the R function lillie.test from package nortest.
Anderson-Darling test implemented in the R function ad.test from package nortest.

10.5 Uncorrelated errors

In this section, we assume a linear model

M: Y_i = x_i^T β + ε_i, E(ε_i | X_i) = E(ε_i) = 0, var(ε_i | X_i) = var(ε_i) = σ², i = 1, …, n, cor(ε_i, ε_l) = 0, i ≠ l.  (10.6)

Our interest now lies in verifying assumption (A3), namely whether the error terms ε_i, i = 1, …, n, are uncorrelated. The fact that the errors are uncorrelated often follows from the design of the study/data collection (measurements on independently behaving units, …), and then there is no need to check this assumption. A situation where uncorrelated errors cannot be taken for granted arises when the observations are obtained sequentially. Typical examples are (i) time series (time does not have to be a covariate of the model), which may lead to so-called serial dependence among the error terms of the linear model; (ii) repeated measurements performed with one measurement unit or on one subject.

In the following, we introduce a classical procedure used to test the null hypothesis of uncorrelated errors against the alternative of serial dependence expressed by a first-order autoregressive process.

10.5.1 Durbin-Watson test

Assumptions.
It is assumed that the ordering of the observations, expressed by their indices 1, …, n, has a practical meaning and may induce dependence between the error terms ε_1, …, ε_n of the model. One of the simplest stochastic processes that capture a certain form of serial dependence is the first-order autoregressive process AR(1). Assuming it for the error terms ε_1, …, ε_n of the linear model (10.6) leads to a more general model

M_AR: Y_i = x_i^T β + ε_i, i = 1, …, n,
ε_1 = η_1, ε_i = ρ ε_{i−1} + η_i, i = 2, …, n,
E(η_i | X_i) = E(η_i) = 0, var(η_i | X_i) = var(η_i) = σ², i = 1, …, n, cor(η_i, η_l) = 0, i ≠ l,

where −1 < ρ < 1 is an additional unknown parameter of the model. It has been shown in the course Stochastic Processes 2 (NMSA409):

• ε_1, …, ε_n is a stationary process if and only if −1 < ρ < 1.

Notes.
• For each m ≥ 0: cor(ε_i, ε_{i−m}) = ρ^m, i = m + 1, …, n. In particular,
ρ = cor(ε_i, ε_{i−1}), i = 2, …, n.  (10.7)

A test of uncorrelated errors in model M can now be based on testing H_0: ρ = 0 against H_1: ρ ≠ 0 in model M_AR. Since positive autocorrelation (ρ > 0) is more common in practice, one-sided tests (with H_1: ρ > 0) are used frequently as well.

Let U = (U_1, …, U_n)^T be the residuals from model M, which corresponds to the null hypothesis. The test statistic proposed by Durbin and Watson (1950, 1951, 1971) takes the form

DW = Σ_{i=2}^n (U_i − U_{i−1})² / Σ_{i=1}^n U_i².

The testing procedure is based on observing that the statistic DW is approximately equal to 2(1 − ρ̂), where ρ̂ is an estimator of the autoregression parameter ρ from model M_AR.

Calculations. First remember that E(U_i) := E(U_i | X) = 0, i = 1, …, n, and this property is maintained even if the error terms of the model are not uncorrelated (see the proof of Theorem 2.3). As the residuals can be considered as predictions of the error terms ε_1, …, ε_n, a suitable estimator of their lag-1 covariance is

σ̂_{1,2} = côv(ε_l, ε_{l−1}) = (1/(n−1)) Σ_{i=2}^n U_i U_{i−1}.

Similarly, three possible estimators of the variance σ² of the error terms ε_1, …, ε_n are

σ̂² = vâr(ε_l) = (1/(n−1)) Σ_{i=1}^{n−1} U_i²  or  (1/(n−1)) Σ_{i=2}^n U_i²  or  (1/n) Σ_{i=1}^n U_i².

Then

DW = Σ_{i=2}^n (U_i − U_{i−1})² / Σ_{i=1}^n U_i²
   = { Σ_{i=2}^n U_i² + Σ_{i=2}^n U_{i−1}² − 2 Σ_{i=2}^n U_i U_{i−1} } / Σ_{i=1}^n U_i²
   ≈ (σ̂² + σ̂² − 2 σ̂_{1,2}) / σ̂² = 2 (1 − σ̂_{1,2}/σ̂²) = 2 (1 − ρ̂).

Use of the test statistic DW for tests of H_0: ρ = 0 is complicated by the fact that the distribution of DW under the null hypothesis depends on the model matrix X. It is hence not possible to derive (and tabulate) critical values in full generality. In practice, two approaches are used to calculate approximate critical values and p-values:

(i) the numerical algorithm of Farebrother (1980, 1984), which is implemented in the R function dwtest from package lmtest;
(ii) the general simulation method bootstrap (introduced by Efron, 1979), whose use for the Durbin-Watson test is implemented in the R function durbinWatsonTest from package car.

10.6 Transformation of response

Especially in situations when homoscedasticity and/or normality does not hold, it is often possible to achieve a linear model in which both assumptions are fulfilled by a suitable (non-linear) transformation t: R → R of the response. That is, one works with a normal linear model

t(Y_i) = x_i^T β + ε_i, i = 1, …, n, ε_i ~ i.i.d. N(0, σ²),  (10.8)

where it is assumed that both homoscedasticity and normality hold. A disadvantage of a model with transformed response is that the corresponding regression function m(x) = x^T β provides a model for the expectation of the transformed response and not of the original response, i.e., for x ∈ X (sample space of the regressors):

m(x) = E(t(Y) | X = x) ≠ t(E(Y | X = x)),

unless the transformation t is a linear function.
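The approximation DW ≈ 2(1 − ρ̂) from the calculation above is easy to check numerically. A small numpy sketch (simulated AR(1) errors with ρ = 0.6; all data-generating values are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# Simulate AR(1) errors with rho = 0.6
rho = 0.6
eps = np.empty(n)
eps[0] = rng.normal()
for i in range(1, n):
    eps[i] = rho * eps[i - 1] + rng.normal()
Y = X @ np.array([1.0, 2.0]) + eps

# Residuals from the OLS fit of the (null) model M
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
U = Y - X @ beta_hat

# Durbin-Watson statistic and the approximation DW ≈ 2(1 - rho_hat)
DW = np.sum(np.diff(U) ** 2) / np.sum(U ** 2)
rho_hat = np.sum(U[1:] * U[:-1]) / np.sum(U ** 2)
print(DW, 2 * (1 - rho_hat))  # both close to 2(1 - 0.6) = 0.8
```

The two printed values differ only by the boundary term (U_1² + U_n²)/ΣU_i², which is of order 1/n.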
Similarly, the regression coefficients now have the interpretation of an expected change of the transformed response t(Y) related to a unit increase of the regressor.

10.6.1 Prediction based on a model with transformed response

Nevertheless, the above-mentioned interpretational issue is not a problem in a situation when prediction of a new value of the response Y_new, given X_new = x_new, is of interest. If this is the case, we can base the prediction on the model (10.8) for the transformed response. In the following, we assume that t is strictly increasing; nevertheless, the procedure can be adjusted for a decreasing or even non-monotone t as well:

• Construct a prediction Ŷ_new^trans and a (1 − α)100% prediction interval (Ŷ_new^{trans,L}, Ŷ_new^{trans,U}) for Y_new^trans = t(Y_new) based on the model (10.8).

• Trivially, the interval

(Ŷ_new^L, Ŷ_new^U) = ( t^{-1}(Ŷ_new^{trans,L}), t^{-1}(Ŷ_new^{trans,U}) )  (10.9)

covers the value of Y_new with probability 1 − α.

• The value Ŷ_new = t^{-1}(Ŷ_new^trans) lies inside the prediction interval (10.9) and can be considered as a point prediction of Y_new. Only note that the prediction interval (Ŷ_new^L, Ŷ_new^U) is not necessarily centered around the value of Ŷ_new.

10.6.2 Log-normal model

A suitably interpretable model is obtained if the response is logarithmically transformed. Suppose that the following model (normal linear model for the log-transformed response) holds:

log(Y_i) = x_i^T β + ε_i, i = 1, …, n, ε_i ~ i.i.d. N(0, σ²).

We then have

Y_i = exp(x_i^T β) η_i, η_i = exp(ε_i) ~ i.i.d. LN(0, σ²), i = 1, …, n,  (10.10)

where LN(0, σ²) denotes a log-normal distribution with location parameter 0 and scale parameter σ. That is, under validity of the model (10.10) for the log-transformed response, the errors in a model for the original response are combined multiplicatively with the regression function. We can easily calculate the first two moments of the log-normal distribution, which provides (for i = 1, …, n)

M := E(η_i) = exp(σ²/2) > 1 (with σ² > 0),
V := var(η_i) = { exp(σ²) − 1 } exp(σ²).

Hence, for x ∈ X:

E(Y_i | X_i = x) = M exp(x^T β),
var(Y_i | X_i = x) = V exp(2 x^T β) = (V/M²) { E(Y_i | X_i = x) }².  (10.11)

A log-normal model (10.10) is thus suitable in two typical situations that cause non-normality and/or heteroscedasticity of a linear model for the original response Y:

(i) the conditional distribution of Y given X = x is skewed. If this is the case, the log-normal distribution, which is skewed as well, may provide a satisfactory model for this distribution;

(ii) the conditional variance var(Y | X = x) increases with the conditional expectation E(Y | X = x). This feature is captured by the log-normal model, as shown by (10.11). Indeed, under the log-normal model, var(Y | X = x) increases with E(Y | X = x). It is then said that the logarithmic transformation stabilizes the variance.

Interpretation of regression coefficients

With a log-normal model (10.10), the (non-intercept) regression coefficients have the following interpretation. Let, for j ∈ {1, …, k − 1}, x = (x_0, …, x_j, …, x_{k−1})^T ∈ X and x^{j(+1)} := (x_0, …, x_j + 1, …, x_{k−1})^T ∈ X, and suppose that β = (β_0, …, β_{k−1})^T. We then have

E(Y | X = x^{j(+1)}) / E(Y | X = x) = M exp( (x^{j(+1)})^T β ) / { M exp(x^T β) } = exp(β_j).

Notes.
• If an ANOVA linear model with log-transformed response is fitted, estimated differences between the group means of the log-response are equal to estimated log-ratios between the group means of the original response.
• If a linear model with logarithmically transformed response is fitted, the estimated regression coefficients, estimates of estimable parameters, etc., and the corresponding confidence intervals are often reported back-transformed (exponentiated) due to the above interpretation.
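The log-normal moment formulas and the back-transformed prediction interval (10.9) can be checked with a few lines of Python. All numbers below (the fitted log-scale prediction and its interval bounds) are hypothetical, chosen only to illustrate the mechanics:

```python
import math

# Moments of LN(0, sigma^2): M = E(eta), V = var(eta)
sigma2 = 0.25
M = math.exp(sigma2 / 2)
V = (math.exp(sigma2) - 1) * math.exp(sigma2)

# Suppose a fitted point prediction and prediction interval for
# t(Y_new) = log(Y_new) on the log scale (hypothetical values):
pred_log, lo_log, hi_log = 2.0, 1.2, 2.8

# Back-transform with t^{-1} = exp, cf. (10.9)
pred, lo, hi = math.exp(pred_log), math.exp(lo_log), math.exp(hi_log)

# The point prediction lies inside the interval ...
print(lo < pred < hi)                       # True
# ... but the interval is not centered around it (exp is convex)
print(abs(pred - lo) != abs(hi - pred))     # True
```

Because exp is strictly increasing, the coverage probability of the back-transformed interval is exactly that of the log-scale interval.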
Evaluation of the impact of the regressors on the response

Evaluation of the impact of the regressors on the response requires performing statistical tests on the regression coefficients or estimable parameters of a linear model. Homoscedasticity, and for small samples also normality, is needed to be able to use the standard t- or F-tests. Both homoscedasticity and normality can be achieved by a log transformation of the response. The statistical tests performed subsequently still have a reasonable practical interpretation as tests on ratios of two expectations of the (original) response.

Chapter 11 Consequences of a Problematic Regression Space

As in Chapter 10, we assume that the data are represented by n random vectors (Y_i, Z_i), Z_i = (Z_{i,1}, …, Z_{i,p})^T ∈ Z ⊆ R^p, i = 1, …, n. As usual, let Y = (Y_1, …, Y_n)^T and let Z_{n×p} denote the matrix with the covariate vectors Z_1, …, Z_n in its rows. Finally, let X_i, i = 1, …, n, where X_i = t_X(Z_i) for some transformation t_X: R^p → R^k, be the regressors that give rise to the model matrix

X_{n×k} = (x_1^T; …; x_n^T) = (X^0, …, X^{k−1}).

It will be assumed that X^0 = (1, …, 1)^T (almost surely), leading to the model matrix X_{n×k} = (1_n, X^1, …, X^{k−1}) with an explicitly included intercept term. Primarily, we will assume that the model matrix X is sufficient, so that E(Y | Z) = E(Y | X) = Xβ for some β = (β_0, …, β_{k−1})^T ∈ R^k. That is, we will start from assuming Y | Z ~ (Xβ, σ² I_n). It will finally be assumed in the whole chapter that the model matrix X is of full rank, i.e., rank(X) = k < n.

11.1 Multicollinearity

A principal assumption of any regression model is a correct specification of the regression function. While assuming a linear model Y | Z ~ (Xβ, σ² I_n), this means that E(Y | Z) ∈ M(X). To guarantee this, it seems optimal to choose the regression space M(X) as rich as possible.
In other words, if many covariates are available, it seems optimal to include a high number k of columns in the model matrix X. Nevertheless, as we show in this section, this approach bears certain complications.

11.1.1 Singular value decomposition of a model matrix

We assume rank(X_{n×k}) = k < n. As was shown in the course Fundamentals of Numerical Mathematics (NMNM201), the matrix X can be decomposed as

X = U D V^T = Σ_{j=0}^{k−1} d_j u_j v_j^T, D = diag(d_0, …, d_{k−1}),

where
• U_{n×k} = (u_0, …, u_{k−1}) are the first k orthonormal eigenvectors of the n × n matrix X X^T;
• V_{k×k} = (v_0, …, v_{k−1}) are (all) orthonormal eigenvectors of the k × k (invertible) matrix X^T X;
• d_j = sqrt(λ_j), j = 0, …, k − 1, where λ_0 ≥ ⋯ ≥ λ_{k−1} > 0 are
  • the first k eigenvalues of the matrix X X^T;
  • (all) eigenvalues of the matrix X^T X, i.e.,

X^T X = Σ_{j=0}^{k−1} λ_j v_j v_j^T = V Λ V^T, Λ = diag(λ_0, …, λ_{k−1}),
      = Σ_{j=0}^{k−1} d_j² v_j v_j^T = V D² V^T.

The numbers d_0 ≥ ⋯ ≥ d_{k−1} > 0 are called the singular values¹ of the matrix X. We then have

(X^T X)^{-1} = Σ_{j=0}^{k−1} (1/d_j²) v_j v_j^T = V D^{-2} V^T,  tr{ (X^T X)^{-1} } = Σ_{j=0}^{k−1} 1/d_j².  (11.1)

Note (Moore-Penrose pseudoinverse of the matrix X^T X). The singular value decomposition of the model matrix X also provides a way to calculate the Moore-Penrose pseudoinverse of the matrix X^T X if X is of less-than-full rank. If rank(X_{n×k}) = r < k, then d_0 ≥ ⋯ ≥ d_{r−1} > d_r = ⋯ = d_{k−1} = 0. The Moore-Penrose pseudoinverse of X^T X is obtained as

(X^T X)^+ = Σ_{j=0}^{r−1} (1/d_j²) v_j v_j^T.

¹ singulární hodnoty

11.1.2 Multicollinearity and its impact on precision of the LSE

It is seen from (11.1) that with d_{k−1} → 0:

(i) the matrix X^T X tends to a singular matrix, i.e., the columns of the model matrix X tend to being linearly dependent;
(ii) tr{ (X^T X)^{-1} } → ∞.

The situation when the columns of the (full-rank) model matrix X are close to being linearly dependent is referred to as multicollinearity.
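The blow-up of tr{(X^T X)^{-1}} as the smallest singular value approaches zero can be seen directly with numpy (the model matrix below is illustrative; the near-collinear column is constructed on purpose):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
x1 = rng.normal(size=n)

def trace_inv_gram(delta):
    # Second regressor nearly collinear with the first for small delta
    x2 = x1 + delta * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    d = np.linalg.svd(X, compute_uv=False)   # singular values of X
    # tr{(X^T X)^{-1}} = sum of 1/d_j^2, cf. (11.1)
    return np.sum(1.0 / d ** 2)

t_weak, t_strong = trace_inv_gram(1.0), trace_inv_gram(1e-3)
print(t_strong > 100 * t_weak)   # near-collinearity inflates the trace
```

The identity in (11.1) itself can be verified on any full-column-rank matrix by comparing the sum of 1/d_j² with np.trace of the explicit inverse of X^T X.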
If a linear model Y | Z ~ (Xβ, σ² I_n), rank(X_{n×k}) = k, is assumed, then we know from the Gauss-Markov theorem that:

(i) the fitted values Ŷ = (Ŷ_1, …, Ŷ_n)^T = H Y, where H = X (X^T X)^{-1} X^T, are the best linear unbiased estimator (BLUE) of the vector parameter μ = Xβ = E(Y | Z), with var(Ŷ | Z) = σ² H;

(ii) the least squares estimator β̂ = (β̂_0, …, β̂_{k−1})^T = (X^T X)^{-1} X^T Y is the BLUE of the vector of regression coefficients β, with var(β̂ | Z) = σ² (X^T X)^{-1}.

It then follows that

Σ_{i=1}^n var(Ŷ_i | Z) = tr{ var(Ŷ | Z) } = tr(σ² H) = σ² tr(H) = σ² k,
Σ_{j=0}^{k−1} var(β̂_j | Z) = tr{ var(β̂ | Z) } = tr{ σ² (X^T X)^{-1} } = σ² tr{ (X^T X)^{-1} }.

This shows that multicollinearity

(i) does not have any impact on the precision of the LSE of the response expectation μ = Xβ;
(ii) may have a serious impact on the precision of the LSE of the regression coefficients β.

At the same time, since the LSE is BLUE, there exists no better linear unbiased estimator of β. If additionally normality is assumed, there exists no better unbiased estimator at all.

The impact of multicollinearity can also be expressed by considering the problem of estimating the squared Euclidean norms of μ = Xβ and of β, respectively. Natural estimators of those squared norms are the squared norms of the corresponding LSEs, i.e., ||Ŷ||² and ||β̂||², respectively. As we show, those estimators are biased; the amount of bias, nevertheless, does not depend on the degree of multicollinearity in the case of ||Ŷ||² but does depend on it in the case of ||β̂||².

Lemma 11.1 (Bias in estimation of the squared norms). Let Y | Z ~ (Xβ, σ² I_n), rank(X_{n×k}) = k. The following then holds:

E[ ||Ŷ − Xβ||² | Z ] = σ² k,
E[ ||β̂ − β||² | Z ] = σ² tr{ (X^T X)^{-1} }.

Proof. In accordance with our convention, conditioning will be omitted from the notation of all expectations and variances. Nevertheless, all of them are still understood as conditional expectations and variances given the covariate values Z.
E[ ||Ŷ − Xβ||² | Z ]:

• Let us calculate:
E||Ŷ − Xβ||² = E{ Σ_{i=1}^n (Ŷ_i − x_i^T β)² } = Σ_{i=1}^n var(Ŷ_i) = tr{ var(Ŷ) } = tr(σ² H) = σ² tr(H) = σ² k.

• At the same time (using E(Ŷ) = Xβ):
E||Ŷ − Xβ||² = E(Ŷ − Xβ)^T (Ŷ − Xβ) = E||Ŷ||² + ||Xβ||² − 2 β^T X^T E(Ŷ)
= E||Ŷ||² + ||Xβ||² − 2 ||Xβ||² = E||Ŷ||² − ||Xβ||².

• So that
E||Ŷ − Xβ||² = σ² k,  E||Ŷ||² = ||Xβ||² + σ² k.

E[ ||β̂ − β||² | Z ]:

• Let us start in a similar way:
E||β̂ − β||² = E{ Σ_{j=0}^{k−1} (β̂_j − β_j)² } = Σ_{j=0}^{k−1} var(β̂_j) = tr{ var(β̂) } = tr{ σ² (X^T X)^{-1} } = σ² tr{ (X^T X)^{-1} }.

• At the same time (using E(β̂) = β):
E||β̂ − β||² = E(β̂ − β)^T (β̂ − β) = E||β̂||² + ||β||² − 2 β^T E(β̂)
= E||β̂||² + ||β||² − 2 ||β||² = E||β̂||² − ||β||².

• So that
E||β̂ − β||² = σ² tr{ (X^T X)^{-1} } = Σ_{j=0}^{k−1} var(β̂_j),
E||β̂||² = ||β||² + σ² tr{ (X^T X)^{-1} }. □

11.1.3 Variance inflation factor and tolerance

Notation. For a given linear model Y | Z ~ (Xβ, σ² I_n), rank(X_{n×k}) = k, where Y = (Y_1, …, Y_n)^T, X = (1_n, X^1, …, X^{k−1}), X^j = (X_{1,j}, …, X_{n,j})^T, j = 1, …, k − 1, the following (partly standard) notation will be used:

Response sample mean: Ȳ = (1/n) Σ_{i=1}^n Y_i;
Square root of the total sum of squares: T_Y = sqrt{ Σ_{i=1}^n (Y_i − Ȳ)² } = ||Y − Ȳ 1_n||;
Fitted values: Ŷ = (Ŷ_1, …, Ŷ_n)^T;
Coefficient of determination: R² = 1 − ||Y − Ŷ||² / ||Y − Ȳ 1_n||² = 1 − ||Y − Ŷ||² / T_Y²;
Residual mean square: MS_e = ||Y − Ŷ||² / (n − k).

Further, for each j = 1, …, k − 1, consider a linear model M_j, where the vector X^j acts as the response and the model matrix is X_{(−j)} = (1_n, X^1, …, X^{j−1}, X^{j+1}, …, X^{k−1}). The following notation will be used:

End of Lecture #19 (03/12/2015)
Start of Lecture #20 (10/12/2015)

Column sample mean: X̄^j = (1/n) Σ_{i=1}^n X_{i,j};
Square root of the total sum of squares from model M_j: T_j = sqrt{ Σ_{i=1}^n (X_{i,j} − X̄^j)² } = ||X^j − X̄^j 1_n||;
Fitted values from model M_j: X̂^j = (X̂_{1,j}, …, X̂_{n,j})^T;
Coefficient of determination from model M_j: R_j² = 1 − ||X^j − X̂^j||² / T_j² = 1 − ||X^j − X̂^j||² / ||X^j − X̄^j 1_n||².

Notes.

(i) If the data (response random variables and non-intercept covariates) (Y_i, X_{i,1}, …, X_{i,k−1}), i = 1, …, n, are a random sample from the distribution of a generic random vector (Y, X_1, …, X_{k−1}), then
• the coefficient of determination R² is also the squared value of the sample coefficient of multiple correlation between Y and X := (X_1, …, X_{k−1});
• for each j = 1, …, k − 1, the coefficient of determination R_j² is also the squared value of the sample coefficient of multiple correlation between X_j and X_{(−j)} := (X_1, …, X_{j−1}, X_{j+1}, …, X_{k−1}).

(ii) For a given j = 1, …, k − 1:
• A value of R_j² close to 1 means that the jth column X^j is almost equal to some linear combination of the columns of the matrix X_{(−j)} (the remaining columns of the model matrix). We then say that X^j is collinear with the remaining columns of the model matrix.
• A value of R_j² = 0 means that
  • the column X^j is orthogonal to all remaining non-intercept regressors (non-intercept columns of the matrix X_{(−j)});
  • the jth regressor represented by the random variable X_j is multiply uncorrelated with the remaining regressors represented by the random vector X_{(−j)}.

For a given linear model Y | Z ~ (Xβ, σ² I_n), rank(X_{n×k}) = k,

vâr(β̂ | Z) = MS_e (X^T X)^{-1}.

The following theorem shows that the diagonal elements of the matrix MS_e (X^T X)^{-1}, i.e., the values vâr(β̂_j | Z), can also be calculated, for j = 1, …, k − 1, using the above defined quantities T_Y, T_j, R², R_j².

Theorem 11.2 (Estimated variances of the LSE of the regression coefficients). For a given linear model Y | Z ~ (Xβ, σ² I_n), rank(X_{n×k}) = k, the diagonal elements of the matrix vâr(β̂ | Z) are, for j = 1, …, k − 1,

vâr(β̂_j | Z) = (T_Y / T_j)² · (1 − R²)/(n − k) · 1/(1 − R_j²).

Proof. See Zvára (2008, Chapter 11). Proof/calculations were skipped and are not requested for the exam. □

Definition 11.1 (Variance inflation factor and tolerance). For a given j = 1, …, k − 1, the variance inflation factor² and the tolerance³ of the jth regressor of the linear model Y | Z ~ (Xβ, σ² I_n), rank(X_{n×k}) = k, are the values VIF_j and Toler_j, respectively, defined as

VIF_j = 1/(1 − R_j²),  Toler_j = 1 − R_j² = 1/VIF_j.

² varianční inflační faktor  ³ tolerance

Notes.
• With R_j = 0 (the jth regressor orthogonal to all remaining regressors, the jth regressor multiply uncorrelated with the remaining ones), VIF_j = 1.
• With R_j → 1 (the jth regressor collinear with the remaining regressors, the jth regressor almost perfectly multiply correlated with the remaining ones), VIF_j → ∞.

Interpretation and use of VIF

• Taking into account the statement of Theorem 11.2, the VIF of the jth regressor (j = 1, …, k − 1) can be interpreted as the factor by which the (estimated) variance of β̂_j is multiplied (inflated) compared to the optimal situation when the jth regressor is orthogonal to (multiply uncorrelated with) the remaining regressors included in the model. Hence the term variance inflation factor.

• Under the assumption of normality, the confidence interval for β_j with coverage 1 − α has the lower and upper bounds given as

β̂_j ± t_{n−k}(1 − α/2) sqrt{ vâr(β̂_j) }.

Using the statement of Theorem 11.2, the lower and upper bounds of the confidence interval for β_j can also be written as

β̂_j ± t_{n−k}(1 − α/2) (T_Y / T_j) sqrt{ (1 − R²)/(n − k) } sqrt{ VIF_j }.

That is, the (square root of the) VIF also provides the factor by which the half-length (radius) of the confidence interval is inflated compared to the optimal situation when the jth regressor is orthogonal to (multiply uncorrelated with) the remaining regressors included in the model, namely,

VIF_j = (Vol_j / Vol_{0,j})²,  (11.2)

where
Vol_j = length (volume) of the confidence interval for β_j;
Vol_{0,j} = length (volume) of the confidence interval for β_j if R_j² were 0.

• Regressors with a high VIF are possibly responsible for multicollinearity.
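VIF_j can be computed exactly as in Definition 11.1, by fitting the auxiliary regression M_j of each column on the remaining ones. The numpy sketch below (illustrative data) does this and also checks the well-known equivalent expression of VIF_j as the jth diagonal element of the inverse sample correlation matrix of the non-intercept regressors:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)   # strongly correlated with x1
x3 = rng.normal(size=n)                     # roughly independent of the rest
W = np.column_stack([x1, x2, x3])           # non-intercept regressors

def vif(W, j):
    # Regress the j-th column on the remaining ones (with intercept),
    # then VIF_j = 1 / (1 - R_j^2) as in Definition 11.1
    y = W[:, j]
    X_rest = np.column_stack([np.ones(len(y)), np.delete(W, j, axis=1)])
    yhat = X_rest @ np.linalg.lstsq(X_rest, y, rcond=None)[0]
    R2_j = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - R2_j)

vifs = np.array([vif(W, j) for j in range(W.shape[1])])
# Equivalent: diagonal of the inverse sample correlation matrix
vifs_alt = np.diag(np.linalg.inv(np.corrcoef(W, rowvar=False)))
print(np.allclose(vifs, vifs_alt))
print(vifs)   # VIF of x1 and x2 large, VIF of x3 close to 1
```

In R, the same numbers would be produced by vif from package car applied to the fitted model.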
Nevertheless, the VIF does not reveal which regressors are mutually collinear.

Generalized variance inflation factor

Beginning of skipped part

A generalized variance inflation factor was derived by Fox and Monette (1992) to evaluate the degree of collinearity between a specified group of regressors and the remaining regressors. Let

• J ⊂ {1, …, k − 1}, |J| = m;
• β_[J] be the subvector of β having the elements indexed by j ∈ J.

Under normality, a confidence ellipsoid for β_[J] with coverage 1 − α is

{ β_[J] ∈ R^m : (β̂_[J] − β_[J])^T { MS_e V_[J] }^{-1} (β̂_[J] − β_[J]) < m F_{m,n−k}(1 − α) },
V_[J] = (J × J) block of the matrix (X^T X)^{-1}.  (11.3)

Let
Vol_J: volume of the confidence ellipsoid (11.3);
Vol_{0,J}: volume that the confidence ellipsoid (11.3) would have if all columns of X corresponding to β_[J] were orthogonal to the remaining columns of X.

The definition of the generalized variance inflation factor gVIF is motivated by (11.2), as it is given by

gVIF_J = (Vol_J / Vol_{0,J})².

It is seen that with J = {j} for some j = 1, …, k − 1, the generalized VIF simplifies to the standard VIF, i.e., gVIF_{j} = VIF_j.

Notes.
• The generalized VIF is especially useful if J relates to the regression coefficients corresponding to the reparameterizing (pseudo)contrasts of one categorical covariate. It can then be shown that gVIF_J does not depend on the choice of the (pseudo)contrasts. gVIF_J then evaluates the magnitude of the linear dependence of a categorical variable and the remaining regressors.

• When comparing gVIFs over index sets J of different cardinality m, the quantities

gVIF_J^{1/(2m)} = (Vol_J / Vol_{0,J})^{1/m}  (11.4)

should be compared, which all relate to volume units in 1D.

• Generalized VIFs (and standard VIFs if m = 1) together with (11.4) are calculated by the R function vif from the package car.

End of skipped part
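Fox and Monette (1992) also give a determinant formula for gVIF in terms of the sample correlation matrix R of the regressors: gVIF_J = det(R_11) det(R_22) / det(R), where R_11 is the block for the group J and R_22 the block for the remaining regressors. This formula is not stated in the text above, so treat the sketch below (illustrative data) as an external cross-check; it verifies that for |J| = 1 the formula reduces to the ordinary VIF:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.5 * rng.normal(size=n)
x3 = rng.normal(size=n)
W = np.column_stack([x1, x2, x3])
R = np.corrcoef(W, rowvar=False)   # sample correlation matrix of regressors

def gvif(R, J):
    # Fox-Monette determinant formula: det(R11) * det(R22) / det(R)
    J = list(J)
    rest = [j for j in range(R.shape[0]) if j not in J]
    return (np.linalg.det(R[np.ix_(J, J)]) *
            np.linalg.det(R[np.ix_(rest, rest)]) / np.linalg.det(R))

# |J| = 1 reduces to the ordinary VIF (diagonal of the inverse corr. matrix)
print(np.isclose(gvif(R, [0]), np.diag(np.linalg.inv(R))[0]))
print(gvif(R, [0, 1]) >= 1.0)      # gVIF is always at least 1
```

The inequality gVIF_J ≥ 1 follows from the Hadamard-Fischer determinant inequality for positive definite matrices.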
11.1.4 Basic treatment of multicollinearity

Especially in situations when inference on the regression coefficients is of interest, i.e., when the primary purpose of the regression modelling is to evaluate which variables influence the response expectation significantly and which do not, multicollinearity is a serious problem. Basic treatment of multicollinearity consists of a preliminary exploration of the mutual relationships between all covariates, followed by choosing only suitable representatives of each group of mutually multiply correlated covariates.

A very basic decision can be based on pairwise correlation coefficients. In some (especially "cook-book") literature, rules of thumb are applied such as "Covariates with a correlation (in absolute value) higher than 0.80 should not be included together in one model." Nevertheless, such rules should never be applied in an automatic manner (why just 0.80 and not 0.79, …?). Decisions on which covariates cause multicollinearity can additionally be based on (generalized) variance inflation factors. Nevertheless, those too should be used judiciously. In general, if a large set of covariates is available to relate to the response expectation, a deep (and often lengthy) analysis of their mutual relationships and their understanding must precede any regression modelling that is to lead to useful results.

11.2 Misspecified regression space

We are often in a situation when a large (potentially enormous) number p of candidate regressors is available. The question is then which of them should be included in a linear model. As shown in Section 11.1, inclusion of all possible regressors in the model is not necessarily optimal and may even have a seriously negative impact on the statistical inference we would like to draw using the linear model.
In this section, we explore some (additional) properties of the least squares estimators and of the related prediction in two situations: (i) omitted important regressors; (ii) irrelevant regressors included in a model.

11.2.1 Omitted and irrelevant regressors

We will assume that possibly two sets of regressors are available:

(i) X_i, i = 1, …, n, where X_i = t_X(Z_i) for some transformation t_X: R^p → R^k. They give rise to the model matrix X_{n×k} = (x_1^T; …; x_n^T) = (X^0, …, X^{k−1}). It will still be assumed that X^0 = (1, …, 1)^T (almost surely), leading to the model matrix X_{n×k} = (1_n, X^1, …, X^{k−1}) with an explicitly included intercept term.

(ii) V_i, i = 1, …, n, where V_i = t_V(Z_i) for some transformation t_V: R^p → R^l. They give rise to the model matrix V_{n×l} = (v_1^T; …; v_n^T) = (V^1, …, V^l).

We will assume that both matrices X and V are of full column rank and their columns are linearly independent, i.e., we assume

rank(X_{n×k}) = k, rank(V_{n×l}) = l, and for G_{n×(k+l)} := (X, V), rank(G) = k + l < n.

The matrices X and G give rise to two nested linear models:

Model M_X: Y | Z ~ (Xβ, σ² I_n);
Model M_XV: Y | Z ~ (Xβ + Vγ, σ² I_n).

Depending on which of the two models is the correct one and which model is used for inference, we face two situations:

Omitted important regressors means that the larger model M_XV is correct (with γ ≠ 0_l) but we base inference on model M_X. In particular,
• β is estimated using model M_X;
• σ² is estimated using model M_X;
• prediction is based on the fitted model M_X.

Irrelevant regressors included in a model means that the smaller model M_X is correct but we base inference on model M_XV. In particular,
• β is estimated (together with γ) using model M_XV;
• σ² is estimated using model M_XV;
• prediction is based on the fitted model M_XV.

Note that if M_X is correct then M_XV is correct as well. Nevertheless, it includes the redundant parameters γ which are known to be equal to zero.
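The first scenario (omitted important regressors) can be contrasted with the correct fit in a small Monte Carlo sketch. All data-generating values below are illustrative; the true model uses both x and a regressor v correlated with x, so dropping v biases the slope of x towards β + γ·cov(x, v)/var(x):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 2000
beta, gamma = 1.5, 2.0

est_small, est_big = [], []
for _ in range(reps):
    x = rng.normal(size=n)
    v = 0.6 * x + rng.normal(size=n)              # correlated with x
    y = beta * x + gamma * v + rng.normal(size=n)  # true model uses both
    X_small = np.column_stack([np.ones(n), x])
    X_big = np.column_stack([np.ones(n), x, v])
    est_small.append(np.linalg.lstsq(X_small, y, rcond=None)[0][1])
    est_big.append(np.linalg.lstsq(X_big, y, rcond=None)[0][1])

# Omitting v: the slope estimate centres on beta + gamma*0.6 = 2.7, not 1.5
print(round(float(np.mean(est_small)), 1))
# Fitting both regressors: the slope estimate is unbiased for beta = 1.5
print(round(float(np.mean(est_big)), 1))
```

The reverse scenario (γ = 0 but v included) would leave the estimator unbiased; only its variance is inflated, which is the content of Theorem 11.3 below the notation that follows.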
Notation (Quantities derived under the two models). Quantities derived while assuming model M_X will be indicated by the subscript X; quantities derived while assuming model M_XV will be indicated by the subscript XV. Namely,

(i) Quantities derived while assuming model M_X:
• least squares estimator of β: β̂_X = (X^T X)^{-1} X^T Y = (β̂_{X,0}, …, β̂_{X,k−1})^T;
• projection matrices onto the regression space M(X) and onto the residual space M(X)^⊥: H_X = X (X^T X)^{-1} X^T, M_X = I_n − H_X;
• fitted values (LSE of the vector Xβ): Ŷ_X = H_X Y = X β̂_X = (Ŷ_{X,1}, …, Ŷ_{X,n})^T;
• residuals: U_X = M_X Y = Y − Ŷ_X = (U_{X,1}, …, U_{X,n})^T;
• residual sum of squares and residual mean square: SS_{e,X} = ||U_X||², MS_{e,X} = SS_{e,X}/(n − k).

(ii) Quantities derived while assuming model M_XV:
• least squares estimator of (β^T, γ^T)^T: (β̂_XV^T, γ̂_XV^T)^T = (G^T G)^{-1} G^T Y, β̂_XV = (β̂_{XV,0}, …, β̂_{XV,k−1})^T, γ̂_XV = (γ̂_{XV,1}, …, γ̂_{XV,l})^T;
• projection matrices onto the regression space M(G) and onto the residual space M(G)^⊥: H_XV = G (G^T G)^{-1} G^T, M_XV = I_n − H_XV;
• fitted values (LSE of the vector Xβ + Vγ): Ŷ_XV = H_XV Y = X β̂_XV + V γ̂_XV = (Ŷ_{XV,1}, …, Ŷ_{XV,n})^T;
• residuals: U_XV = M_XV Y = Y − Ŷ_XV = (U_{XV,1}, …, U_{XV,n})^T;
• residual sum of squares and residual mean square: SS_{e,XV} = ||U_XV||², MS_{e,XV} = SS_{e,XV}/(n − k − l).

Consequence of Theorem 10.1 (Relationship between the quantities derived while assuming the two models). Quantities derived while assuming models M_X and M_XV are mutually in the following relationships:

γ̂_XV = (V^T M_X V)^{-1} V^T U_X,
Ŷ_XV − Ŷ_X = M_X V (V^T M_X V)^{-1} V^T U_X = X(β̂_XV − β̂_X) + V γ̂_XV,
β̂_XV − β̂_X = −(X^T X)^{-1} X^T V γ̂_XV,
SS_{e,X} − SS_{e,XV} = ||M_X V γ̂_XV||²,
H_XV = H_X + M_X V (V^T M_X V)^{-1} V^T M_X.

Proof. Direct use of Lemma 10.1, taking into account the fact that now all involved model matrices are of full rank. The relationship H_XV = H_X + M_X V (V^T M_X V)^{-1} V^T M_X was shown inside the proof of Lemma 10.1. It also easily follows from the general expression of the hat matrix if we realize that M(X, V) = M(X, M_X V) and that X^T M_X V = 0_{k×l}. □

Theorem 11.3 (Variance of the LSE in the two models). Irrespective of whether M_X or M_XV holds, the covariance matrices of the fitted values and of the LSE of the regression coefficients satisfy

var(Ŷ_XV | Z) − var(Ŷ_X | Z) ≥ 0,
var(β̂_XV | Z) − var(β̂_X | Z) ≥ 0.

Notes.
• The estimator of the response mean vector μ = E(Y | Z) based on the (smaller) model M_X is always (no matter which model is correct) less or equally variable than the estimator based on the (richer) model M_XV.
• The estimators of the regression coefficients β based on the (smaller) model M_X always have lower (or equal, if X^T V = 0_{k×l}) standard errors than the estimators based on the (richer) model M_XV.

Proof. In accordance with our convention, conditioning will be omitted from the notation of all expectations and variances. Nevertheless, all of them are still understood as conditional expectations and variances given the covariate values Z.

var(Ŷ_XV) − var(Ŷ_X) ≥ 0:

We have (even if M_X is not correct)
var(Ŷ_X) = var(H_X Y) = H_X (σ² I_n) H_X = σ² H_X,
var(Ŷ_XV) = var(H_XV Y) = σ² H_XV = σ² { H_X + M_X V (V^T M_X V)^{-1} V^T M_X }
          = var(Ŷ_X) + σ² M_X V (V^T M_X V)^{-1} V^T M_X,
where the second term is a positive semidefinite matrix.

var(β̂_XV) − var(β̂_X) ≥ 0:

Proof/calculations for this part were skipped and are not requested for the exam. Proof/calculations below are shown only for those who are interested.

First, use the formula for the inverse of a matrix divided into blocks (Theorem A.4):

var( (β̂_XV^T, γ̂_XV^T)^T ) = σ² ( X^T X, X^T V; V^T X, V^T V )^{-1},

so that var(β̂_XV) = σ² { X^T X − X^T V (V^T V)^{-1} V^T X }^{-1}. Further (even if M_X is not correct),

var(β̂_X) = var{ (X^T X)^{-1} X^T Y } = (X^T X)^{-1} X^T (σ² I_n) X (X^T X)^{-1} = σ² (X^T X)^{-1}.

The property of positive definite matrices ("A − B ≥ 0 ⇔ B^{-1} − A^{-1} ≥ 0") finalizes the proof. □
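The hat-matrix decomposition and the positive semidefiniteness used in the proof above can be checked numerically on illustrative matrices:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, l = 30, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
V = rng.normal(size=(n, l))
G = np.column_stack([X, V])

def hat(A):
    # Projection matrix onto the column space of A (full column rank)
    return A @ np.linalg.inv(A.T @ A) @ A.T

HX, HXV = hat(X), hat(G)
MX = np.eye(n) - HX

# H_XV = H_X + M_X V (V^T M_X V)^{-1} V^T M_X
extra = MX @ V @ np.linalg.inv(V.T @ MX @ V) @ V.T @ MX
print(np.allclose(HXV, HX + extra))

# var(Yhat_XV) - var(Yhat_X) = sigma^2 * extra is positive semidefinite
# (it is itself a projection matrix, hence its eigenvalues are 0 or 1)
print(np.linalg.eigvalsh(extra).min() > -1e-10)
```

Since extra is the projection onto M(M_X V), it is idempotent, which can also be verified directly.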
11.2.2 Prediction quality of the fitted model

To evaluate the prediction quality of the fitted model, we will assume that the data $(Y_i, Z_i)$, $Z_i = (Z_{i,1}, \ldots, Z_{i,p})^\top \in \mathcal{Z} \subseteq \mathbb{R}^p$, $i = 1, \ldots, n$, are a random sample from the distribution of a generic random vector $(Y, Z)$, $Z = (Z_1, \ldots, Z_p)^\top$. Let the conditional distribution of $Y$ given the covariates $Z$ satisfy
\[
  E(Y \mid Z) = m(Z), \qquad \operatorname{var}(Y \mid Z) = \sigma^2, \tag{11.5}
\]
for some (regression) function $m$ and some $\sigma^2 > 0$.

Replicated response

Let $z_1, \ldots, z_n$ be the values of the covariate vectors $Z_1, \ldots, Z_n$ in the original data that are available to estimate the parameters of the model (11.5). Further, let $(Y_{n+i}, Z_{n+i})$, $i = 1, \ldots, n$, be independent random vectors (new or future data) distributed as the generic random vector $(Y, Z)$ and independent of the original data $(Y_i, Z_i)$, $i = 1, \ldots, n$. Suppose that our aim is to predict the values of $Y_{n+i}$, $i = 1, \ldots, n$, under the condition that the new covariate values are equal to the old ones. That is, we want to predict, for $i = 1, \ldots, n$, the value of $Y_{n+i}$ given $Z_{n+i} = z_i$.

Terminology (Replicated response).
A random vector $Y_{\mathrm{new}} = (Y_{n+1}, \ldots, Y_{n+n})^\top$, where $Y_{n+i}$ is supposed to come from the conditional distribution $Y \mid Z = z_i$, $i = 1, \ldots, n$, is called the replicated data.

Notes.
• The original (old) response vector $Y$ and the replicated response vector $Y_{\mathrm{new}}$ are assumed to be independent.
• Both $Y$ and $Y_{\mathrm{new}}$ are assumed to be generated by the same conditional distribution (given $Z$), where
\[
  E\big(Y \mid Z_1 = z_1, \ldots, Z_n = z_n\big) = \mu = E\big(Y_{\mathrm{new}} \mid Z_{n+1} = z_1, \ldots, Z_{n+n} = z_n\big),
\]
\[
  \operatorname{var}\big(Y \mid Z_1 = z_1, \ldots, Z_n = z_n\big) = \sigma^2 I_n = \operatorname{var}\big(Y_{\mathrm{new}} \mid Z_{n+1} = z_1, \ldots, Z_{n+n} = z_n\big),
\]
for some $\sigma^2 > 0$, and $\mu = \big(m(z_1), \ldots, m(z_n)\big)^\top = (\mu_1, \ldots, \mu_n)^\top$.

Prediction of replicated response

Let $\widehat{Y}_{\mathrm{new}} = (\widehat{Y}_{n+1}, \ldots, \widehat{Y}_{n+n})^\top$ be the prediction of the vector $Y_{\mathrm{new}}$ based on the regression model (11.5) estimated using the original data $Y$. Analogously to Section 5.4.3, we shall evaluate the quality of the prediction by the mean squared error of prediction (MSEP). Nevertheless, in contrast to Section 5.4.3, the following issues will be different:

(i) A value of a random vector rather than a value of a random variable (as in Section 5.4.3) is predicted now. The MSEP will now be given as the sum of the MSEPs of the elements of the random vector being predicted.

(ii) Since we are now interested in predicting new response values given covariate values equal to the covariate values in the original data, the MSEP will now be based on the conditional distribution of the responses given $Z$ (given $Z_i = Z_{n+i} = z_i$, $i = 1, \ldots, n$). In contrast, the variability of the covariates was taken into account in Section 5.4.3.

(iii) The variability of the prediction induced by the estimation of the model parameters (estimation of the regression function) from the original data $Y$ will also be taken into account now. In contrast, the model parameters were assumed to be known when deriving the MSEP in Section 5.4.3.

Definition 11.2 (Quantification of the prediction quality of the fitted regression model).
The prediction quality of the fitted regression model will be evaluated by the mean squared error of prediction (MSEP; Czech: střední čtvercová chyba predikce), defined as
\[
  \mathrm{MSEP}\big(\widehat{Y}_{\mathrm{new}}\big) = \sum_{i=1}^{n} E\Big[ \big(\widehat{Y}_{n+i} - Y_{n+i}\big)^2 \,\Big|\, Z \Big], \tag{11.6}
\]
where the expectation is with respect to the $(n+n)$-dimensional conditional distribution of $(Y, Y_{\mathrm{new}})$ given
\[
  Z = \begin{pmatrix} Z_1^\top \\ \vdots \\ Z_n^\top \end{pmatrix}
    = \begin{pmatrix} Z_{n+1}^\top \\ \vdots \\ Z_{n+n}^\top \end{pmatrix}.
\]
Additionally, we define the averaged mean squared error of prediction (AMSEP; Czech: průměrná střední čtvercová chyba predikce) as
\[
  \mathrm{AMSEP}\big(\widehat{Y}_{\mathrm{new}}\big) = \frac{1}{n}\, \mathrm{MSEP}\big(\widehat{Y}_{\mathrm{new}}\big).
\]

Prediction of replicated response in a linear model

[End of Lecture #20 (10/12/2015); start of Lecture #21 (10/12/2015).]

With a linear model, it is assumed that $m(z) = x^\top \beta$ for some (known) transformation $x = t_X(z)$ and a vector of (unknown) parameters $\beta$. Hence, it is assumed that
\[
  \mu = X\beta = \big(x_1^\top \beta, \ldots, x_n^\top \beta\big)^\top = (\mu_1, \ldots, \mu_n)^\top,
\]
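Under a correctly specified linear model, the fitted values satisfy $\sum_i \operatorname{var}(\widehat{Y}_i \mid Z) = \sigma^2 \operatorname{tr}(H) = \sigma^2 k$, the quantity that turns $n\sigma^2 + \sum_i \mathrm{MSE}(\widehat{Y}_i)$ into $(n+k)\sigma^2$ in the MSEP formulas that follow. A minimal pure-Python check of the trace identity for a straight-line fit ($k = 2$); the closed form $h_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx}$ is standard, and the $x$-values below are made up:

```python
# Leverages of a straight-line fit (intercept + slope, so k = 2):
#   h_ii = 1/n + (x_i - xbar)^2 / Sxx,   sum_i h_ii = tr(H) = k.
x = [0.5, 1.0, 2.0, 3.5, 5.0, 6.0]   # made-up regressor values
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
print(sum(h))   # 2.0 up to rounding: the rank k of the model matrix
```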
for a model matrix $X$ based on the (transformed) covariate values $x_i = t_X(z_i)$, $i = 1, \ldots, n$.

If we restrict our attention to unbiased linear predictions of $Y_{\mathrm{new}}$, i.e., to predictions of the form $\widehat{Y}_{\mathrm{new}} = A Y$ for some matrix $A$, a variant of the Gauss–Markov theorem shows that (11.6) is minimized for
\[
  \widehat{Y}_{\mathrm{new}} = \widehat{Y}, \qquad \widehat{Y} = X \big(X^\top X\big)^{-} X^\top Y, \qquad
  \widehat{Y}_{n+i} = \widehat{Y}_i, \quad i = 1, \ldots, n.
\]
That is, for $\widehat{Y}_{\mathrm{new}}$ equal to the fitted values of the model estimated using the original data. Note also that
\[
  \widehat{Y}_{\mathrm{new}} = \widehat{Y} =: \widehat{\mu},
\]
where $\widehat{\mu}$ is the LSE of the vector $\mu = E(Y \mid Z_1 = z_1, \ldots, Z_n = z_n) = E(Y_{\mathrm{new}} \mid Z_{n+1} = z_1, \ldots, Z_{n+n} = z_n)$.

Lemma 11.4 (Mean squared error of prediction in a linear model).
In a linear model, the mean squared error of prediction can be expressed as
\[
  \mathrm{MSEP}\big(\widehat{Y}_{\mathrm{new}}\big) = n \sigma^2 + \sum_{i=1}^{n} \mathrm{MSE}\big(\widehat{Y}_i\big),
\]
where
\[
  \mathrm{MSE}\big(\widehat{Y}_i\big) = E\Big[ \big(\widehat{Y}_i - \mu_i\big)^2 \,\Big|\, Z \Big], \qquad i = 1, \ldots, n,
\]
is the mean squared error (Czech: střední čtvercová chyba) of $\widehat{Y}_i$ viewed as an estimator of $\mu_i$, $i = 1, \ldots, n$.

Proof. In accordance with our convention, conditioning on $Z$ will be omitted from the notation of all expectations and variances; nevertheless, all are still understood as conditional expectations and variances given the covariate values $Z$. We have, for $i = 1, \ldots, n$ (remember that $\widehat{Y}_{n+i} = \widehat{Y}_i$),
\[
  E\big(\widehat{Y}_{n+i} - Y_{n+i}\big)^2 = E\big(\widehat{Y}_i - Y_{n+i}\big)^2
  = E\bigl\{ \big(\widehat{Y}_i - \mu_i\big) - \big(Y_{n+i} - \mu_i\big) \bigr\}^2
\]
\[
  = E\big(\widehat{Y}_i - \mu_i\big)^2 + E\big(Y_{n+i} - \mu_i\big)^2
    - 2\, \underbrace{E\big(\widehat{Y}_i - \mu_i\big)\big(Y_{n+i} - \mu_i\big)}_{= E(\widehat{Y}_i - \mu_i)\, E(Y_{n+i} - \mu_i) = E(\widehat{Y}_i - \mu_i) \cdot 0}
\]
\[
  = E\big(\widehat{Y}_i - \mu_i\big)^2 + E\big(Y_{n+i} - \mu_i\big)^2 = \mathrm{MSE}\big(\widehat{Y}_i\big) + \sigma^2,
\]
where the cross term vanishes by the independence of $Y$ and $Y_{\mathrm{new}}$. So that
\[
  \mathrm{MSEP}\big(\widehat{Y}_{\mathrm{new}}\big) = \sum_{i=1}^{n} E\big(\widehat{Y}_{n+i} - Y_{n+i}\big)^2
  = n\sigma^2 + \sum_{i=1}^{n} \mathrm{MSE}\big(\widehat{Y}_i\big).
\]

Notes.
• We can also write
\[
  \sum_{i=1}^{n} \mathrm{MSE}\big(\widehat{Y}_i\big) = E\Big[ \bigl\| \widehat{Y} - \mu \bigr\|^2 \,\Big|\, Z \Big].
\]
Hence,
\[
  \mathrm{MSEP}\big(\widehat{Y}_{\mathrm{new}}\big) = n\sigma^2 + E\Big[ \bigl\| \widehat{Y} - \mu \bigr\|^2 \,\Big|\, Z \Big].
\]
• If the assumed linear model is a correct model for the data at hand, the Gauss–Markov theorem states that $\widehat{Y}$ is the BLUE of the vector $\mu$, in which case
\[
  \mathrm{MSE}\big(\widehat{Y}_i\big) = E\Big[ \big(\widehat{Y}_i - \mu_i\big)^2 \,\Big|\, Z \Big] = \operatorname{var}\big(\widehat{Y}_i \mid Z\big), \qquad i = 1, \ldots, n.
\]
• Nevertheless, if the assumed linear model is not a correct model for the data at hand, the estimator $\widehat{Y}$ might be a biased estimator of the vector $\mu$, in which case
\[
  \mathrm{MSE}\big(\widehat{Y}_i\big)
  = \operatorname{var}\big(\widehat{Y}_i \mid Z\big) + \bigl\{ \mathrm{bias}\big(\widehat{Y}_i\big) \bigr\}^2
  = \operatorname{var}\big(\widehat{Y}_i \mid Z\big) + \bigl\{ E\big(\widehat{Y}_i \mid Z\big) - \mu_i \bigr\}^2, \qquad i = 1, \ldots, n.
\]
• In the expression
\[
  \mathrm{MSEP}\big(\widehat{Y}_{\mathrm{new}}\big) = n\sigma^2 + \sum_{i=1}^{n} \mathrm{MSE}\big(\widehat{Y}_i\big)
  = n\sigma^2 + E\Big[ \bigl\| \widehat{Y} - \mu \bigr\|^2 \,\Big|\, Z \Big],
\]
the specification of a model for the conditional response expectation, i.e., of a model for $\mu$, influences only the second term $E\big[ \| \widehat{Y} - \mu \|^2 \mid Z \big]$. The first term ($n\sigma^2$) reflects the true (conditional) variability of the response, which does not depend on the specification of the model for the expectation. Hence, when evaluating the prediction quality of a linear model with respect to its ability to predict replicated data, the only term that matters is
\[
  \sum_{i=1}^{n} \mathrm{MSE}\big(\widehat{Y}_i\big) = E\Big[ \bigl\| \widehat{Y} - \mu \bigr\|^2 \,\Big|\, Z \Big],
\]
which relates to the error of the fitted values considered as an estimator of the vector $\mu$.

11.2.3 Omitted regressors

In this section, we will assume that the correct model is model $\mathcal{M}_{XV}$: $Y \mid Z \sim (X\beta + V\gamma, \sigma^2 I_n)$ with $\gamma \neq 0_l$. Hence all estimators derived under model $\mathcal{M}_{XV}$ are derived under the correct model and have the usual properties of the LSE, namely
\[
  E\big(\widehat{\beta}_{XV} \mid Z\big) = \beta, \qquad E\big(\widehat{Y}_{XV} \mid Z\big) = X\beta + V\gamma =: \mu,
\]
\[
  \sum_{i=1}^{n} \mathrm{MSE}\big(\widehat{Y}_{XV,i}\big) = \sum_{i=1}^{n} \operatorname{var}\big(\widehat{Y}_{XV,i} \mid Z\big)
  = \operatorname{tr}\bigl\{\operatorname{var}\big(\widehat{Y}_{XV} \mid Z\big)\bigr\} = \operatorname{tr}\big(\sigma^2 H_{XV}\big) = \sigma^2 (k + l),
\]
\[
  E\big(MS_{e,XV} \mid Z\big) = \sigma^2. \tag{11.7}
\]
Nevertheless, all estimators derived under model $\mathcal{M}_X$: $Y \mid Z \sim (X\beta, \sigma^2 I_n)$ are calculated while assuming a misspecified model with omitted important regressors, and their properties do not coincide with the properties of the LSE calculated under the correct model.
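Theorem 11.5 below quantifies these properties. The bias of $\widehat{\beta}_X$ can already be sketched with a toy computation: in a simple regression (intercept plus one kept regressor $x$), the slope component of the bias term $(X^\top X)^{-1} X^\top V \gamma$ reduces to $\gamma$ times the LS slope of the omitted regressor $v$ regressed on $x$. A minimal pure-Python illustration; all numbers are invented, and the data are noiseless so the identity holds exactly:

```python
# Omitted-variable bias sketch: the truth is y = b0 + b1*x + gamma*v, but we
# fit y on x alone.  The fitted slope then equals b1 + gamma * slope(v ~ x),
# which is the slope component of (X'X)^{-1} X'V gamma.
n = 8
x = [float(i) for i in range(n)]                                   # kept regressor
v = [xi + (1.0 if i % 2 == 0 else -1.0) for i, xi in enumerate(x)] # omitted, correlated with x
beta0, beta1, gamma = 1.0, 2.0, 0.5
y = [beta0 + beta1 * xi + gamma * vi for xi, vi in zip(x, v)]      # noiseless response

def ols_slope(x, y):
    """Simple-regression LS slope: Sxy / Sxx."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

b1_small = ols_slope(x, y)             # LSE of the slope under the misspecified M_X
bias = gamma * ols_slope(x, v)         # gamma * slope of v regressed on x
print(b1_small, beta1 + bias)          # the two numbers agree
```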
Theorem 11.5 (Properties of the LSE in a model with omitted regressors).
Let $\mathcal{M}_{XV}$: $Y \mid Z \sim (X\beta + V\gamma, \sigma^2 I_n)$ hold, i.e., $\mu := E(Y \mid Z)$ satisfies $\mu = X\beta + V\gamma$ for some $\beta \in \mathbb{R}^k$, $\gamma \in \mathbb{R}^l$. Then the least squares estimators derived while assuming model $\mathcal{M}_X$: $Y \mid Z \sim (X\beta, \sigma^2 I_n)$ attain the following properties:
\[
  E\big(\widehat{\beta}_X \mid Z\big) = \beta + \big(X^\top X\big)^{-1} X^\top V \gamma,
\]
\[
  E\big(\widehat{Y}_X \mid Z\big) = \mu - M_X V \gamma,
\]
\[
  \sum_{i=1}^{n} \mathrm{MSE}\big(\widehat{Y}_{X,i}\big) = k\sigma^2 + \bigl\| M_X V \gamma \bigr\|^2,
\]
\[
  E\big(MS_{e,X} \mid Z\big) = \sigma^2 + \frac{\bigl\| M_X V \gamma \bigr\|^2}{n - k}.
\]

Proof. In accordance with our convention, conditioning on $Z$ will be omitted from the notation of all expectations and variances; nevertheless, all are still understood as conditional expectations and variances given the covariate values $Z$.

$E(\widehat{\beta}_X \mid Z)$: By Theorem 10.1, $\widehat{\beta}_{XV} - \widehat{\beta}_X = -(X^\top X)^{-1} X^\top V \widehat{\gamma}_{XV}$. Hence
\[
  E\,\widehat{\beta}_X = E\bigl\{ \widehat{\beta}_{XV} + (X^\top X)^{-1} X^\top V \widehat{\gamma}_{XV} \bigr\}
  = \beta + \big(X^\top X\big)^{-1} X^\top V \gamma, \qquad
  \mathrm{bias}\big(\widehat{\beta}_X\big) = \big(X^\top X\big)^{-1} X^\top V \gamma.
\]

$E(\widehat{Y}_X \mid Z)$: By Theorem 10.1, $\widehat{Y}_{XV} - \widehat{Y}_X = X(\widehat{\beta}_{XV} - \widehat{\beta}_X) + V \widehat{\gamma}_{XV}$. Hence
\[
  E\,\widehat{Y}_X = E\bigl\{ \widehat{Y}_{XV} - X\widehat{\beta}_{XV} + X\widehat{\beta}_X - V\widehat{\gamma}_{XV} \bigr\}
  = \mu - X\beta + X\beta + X\big(X^\top X\big)^{-1} X^\top V \gamma - V\gamma
\]
\[
  = \mu + \bigl\{ X\big(X^\top X\big)^{-1} X^\top - I_n \bigr\} V\gamma = \mu - M_X V\gamma, \qquad
  \mathrm{bias}\big(\widehat{Y}_X\big) = - M_X V\gamma.
\]

$\sum_{i=1}^n \mathrm{MSE}(\widehat{Y}_{X,i})$: Let us first calculate the matrix $\mathrm{MSE}(\widehat{Y}_X) = E\bigl\{ (\widehat{Y}_X - \mu)(\widehat{Y}_X - \mu)^\top \bigr\}$:
\[
  \mathrm{MSE}\big(\widehat{Y}_X\big) = \operatorname{var}\big(\widehat{Y}_X\big) + \mathrm{bias}\big(\widehat{Y}_X\big)\, \mathrm{bias}^\top\big(\widehat{Y}_X\big)
  = \sigma^2 H_X + M_X V \gamma \gamma^\top V^\top M_X.
\]
Hence,
\[
  \sum_{i=1}^{n} \mathrm{MSE}\big(\widehat{Y}_{X,i}\big) = \operatorname{tr}\bigl\{ \mathrm{MSE}\big(\widehat{Y}_X\big) \bigr\}
  = \operatorname{tr}\big(\sigma^2 H_X\big) + \operatorname{tr}\big( M_X V \gamma \gamma^\top V^\top M_X \big)
  = \sigma^2 k + \operatorname{tr}\big( \gamma^\top V^\top M_X M_X V \gamma \big)
  = \sigma^2 k + \bigl\| M_X V \gamma \bigr\|^2.
\]

$E(MS_{e,X} \mid Z)$: The calculations for this part were skipped and are not requested for the exam; they are shown only for those who are interested. Let us first calculate $E\,SS_{e,X} := E(SS_{e,X} \mid Z)$. To do that, write the linear model $\mathcal{M}_{XV}$ using the error terms as
\[
  Y = X\beta + V\gamma + \varepsilon, \qquad E(\varepsilon \mid Z) = 0_n, \qquad \operatorname{var}(\varepsilon \mid Z) = \sigma^2 I_n.
\]
Then
\[
  E\,SS_{e,X} = E \bigl\| M_X Y \bigr\|^2 = E \bigl\| M_X (X\beta + V\gamma + \varepsilon) \bigr\|^2
  = E \bigl\| M_X V\gamma + M_X \varepsilon \bigr\|^2
  = \bigl\| M_X V\gamma \bigr\|^2 + E \bigl\| M_X \varepsilon \bigr\|^2 + 2\, \underbrace{E\big( \gamma^\top V^\top M_X M_X \varepsilon \big)}_{= \gamma^\top V^\top M_X\, E\varepsilon = 0},
\]
where
\[
  E \bigl\| M_X \varepsilon \bigr\|^2 = E\big( \varepsilon^\top M_X \varepsilon \big)
  = E \operatorname{tr}\big( \varepsilon^\top M_X \varepsilon \big) = \operatorname{tr}\bigl\{ M_X\, E\big(\varepsilon \varepsilon^\top\big) \bigr\}
  = \operatorname{tr}\big( \sigma^2 M_X \big) = \sigma^2 (n - k).
\]
Hence $E\,SS_{e,X} = \| M_X V\gamma \|^2 + \sigma^2 (n - k)$, so that
\[
  E\,MS_{e,X} = E\,\frac{SS_{e,X}}{n-k} = \sigma^2 + \frac{\bigl\| M_X V\gamma \bigr\|^2}{n-k}, \qquad
  \mathrm{bias}\big(MS_{e,X}\big) = \frac{\bigl\| M_X V\gamma \bigr\|^2}{n-k}.
\]

Least squares estimators

Theorem 11.5 shows that $\mathrm{bias}(\widehat{\beta}_X) = (X^\top X)^{-1} X^\top V \gamma$; the estimator $\widehat{\beta}_X$ is thus not necessarily unbiased. Let us consider two situations.

(i) $X^\top V = 0_{k \times l}$, which means that each column of $X$ is orthogonal to each column of $V$. In other words, the regressors included in the matrix $X$ are uncorrelated with the regressors included in the matrix $V$. Then
• $\widehat{\beta}_X = \widehat{\beta}_{XV}$ and $\mathrm{bias}(\widehat{\beta}_X) = 0_k$.
• Hence $\beta$ can be estimated using the smaller model $\mathcal{M}_X$ without any impact on the quality of the estimator.

(ii) $X^\top V \neq 0_{k \times l}$:
• $\widehat{\beta}_X$ is a biased estimator of $\beta$.

Further, for the fitted values $\widehat{Y}_X$ considered as an estimator of the response expectation $\mu = X\beta + V\gamma$, we have $\mathrm{bias}(\widehat{Y}_X) = -M_X V\gamma$. In this case, all elements of the bias vector would be equal to zero only if $M_X V = 0_{n \times l}$. Nevertheless, this would mean $\mathcal{M}(V) \subseteq \mathcal{M}(X)$, which contradicts our assumption $\operatorname{rank}(X, V) = k + l$. That is, if the omitted covariates (included in the matrix $V$) are linearly independent of (not perfectly multiply correlated with) the covariates included in the model matrix $X$, the fitted values $\widehat{Y}_X$ always provide a biased estimator of the response expectation.

Prediction

Let us compare the predictions $\widehat{Y}_{\mathrm{new},X} = \widehat{Y}_X$ based on the (misspecified) model $\mathcal{M}_X$ with the predictions $\widehat{Y}_{\mathrm{new},XV} = \widehat{Y}_{XV}$ based on the (correct) model $\mathcal{M}_{XV}$. The properties of the fitted values in a correct model (Expressions (11.7)) together with the results of Lemma 11.4 and Theorem 11.5 give
\[
  \mathrm{MSEP}\big(\widehat{Y}_{\mathrm{new},XV}\big) = n\sigma^2 + k\sigma^2 + l\sigma^2, \qquad
  \mathrm{MSEP}\big(\widehat{Y}_{\mathrm{new},X}\big) = n\sigma^2 + k\sigma^2 + \bigl\| M_X V\gamma \bigr\|^2.
\]
That is, the averaged mean squared errors of prediction are
\[
  \mathrm{AMSEP}\big(\widehat{Y}_{\mathrm{new},XV}\big) = \sigma^2 + \frac{k}{n}\sigma^2 + \frac{l}{n}\sigma^2, \qquad
  \mathrm{AMSEP}\big(\widehat{Y}_{\mathrm{new},X}\big) = \sigma^2 + \frac{k}{n}\sigma^2 + \frac{1}{n}\bigl\| M_X V\gamma \bigr\|^2.
\]
We can now conclude the following.
• The term $\| M_X V\gamma \|^2$ might be huge compared to $l\sigma^2$, in which case the prediction using the model with omitted important covariates is (much) worse than the prediction using the (correct) model.
• Additionally, $\frac{l}{n}\sigma^2 \to 0$ as $n \to \infty$ (while increasing the number of predictions).
• On the other hand, $\frac{1}{n}\| M_X V\gamma \|^2$ does not necessarily tend to zero as $n \to \infty$.

Estimator of the residual variance

Theorem 11.5 shows that the residual mean square $MS_{e,X}$ in the misspecified model $\mathcal{M}_X$ is a biased estimator of the residual variance $\sigma^2$ with the bias amounting to
\[
  \mathrm{bias}\big(MS_{e,X}\big) = \frac{\bigl\| M_X V\gamma \bigr\|^2}{n-k}.
\]
Also in this case, the bias does not necessarily tend to zero as $n \to \infty$.

11.2.4 Irrelevant regressors

In this section, we will assume that the correct model is model $\mathcal{M}_X$: $Y \mid Z \sim (X\beta, \sigma^2 I_n)$. This means that model $\mathcal{M}_{XV}$: $Y \mid Z \sim (X\beta + V\gamma, \sigma^2 I_n)$ also holds, nevertheless with $\gamma = 0_l$, and hence the regressors from the matrix $V$ are irrelevant. Since both models $\mathcal{M}_X$ and $\mathcal{M}_{XV}$ hold, the estimators derived under both models have the usual properties of the LSE, namely
\[
  E\big(\widehat{\beta}_X \mid Z\big) = E\big(\widehat{\beta}_{XV} \mid Z\big) = \beta, \qquad
  E\big(\widehat{Y}_X \mid Z\big) = E\big(\widehat{Y}_{XV} \mid Z\big) = X\beta =: \mu,
\]
\[
  \sum_{i=1}^{n} \mathrm{MSE}\big(\widehat{Y}_{X,i}\big)
  = \sum_{i=1}^{n} \operatorname{var}\big(\widehat{Y}_{X,i} \mid Z\big)
  = \operatorname{tr}\bigl\{\operatorname{var}\big(\widehat{Y}_X \mid Z\big)\bigr\} = \operatorname{tr}\big(\sigma^2 H_X\big) = \sigma^2 k,
\]
\[
  \sum_{i=1}^{n} \mathrm{MSE}\big(\widehat{Y}_{XV,i}\big)
  = \sum_{i=1}^{n} \operatorname{var}\big(\widehat{Y}_{XV,i} \mid Z\big)
  = \operatorname{tr}\bigl\{\operatorname{var}\big(\widehat{Y}_{XV} \mid Z\big)\bigr\} = \operatorname{tr}\big(\sigma^2 H_{XV}\big) = \sigma^2 (k + l),
\]
\[
  E\big(MS_{e,X} \mid Z\big) = E\big(MS_{e,XV} \mid Z\big) = \sigma^2.
\]

Least squares estimators

Both estimators $\widehat{\beta}_X$ and $\widehat{\beta}_{XV}$ are unbiased estimators of the vector $\beta$. Nevertheless, as stated in Theorem 11.3, their quality, expressed by the mean squared error (which in this case coincides with the covariance matrix), may differ, since
\[
  \mathrm{MSE}\big(\widehat{\beta}_{XV}\big) - \mathrm{MSE}\big(\widehat{\beta}_X\big)
  = E\Big[ \big(\widehat{\beta}_{XV} - \beta\big)\big(\widehat{\beta}_{XV} - \beta\big)^\top \,\Big|\, Z \Big]
    - E\Big[ \big(\widehat{\beta}_X - \beta\big)\big(\widehat{\beta}_X - \beta\big)^\top \,\Big|\, Z \Big]
  = \operatorname{var}\big(\widehat{\beta}_{XV} \mid Z\big) - \operatorname{var}\big(\widehat{\beta}_X \mid Z\big) \geq 0.
\]
In particular, we derived during the proof of Theorem 11.3 that
\[
  \operatorname{var}\big(\widehat{\beta}_{XV} \mid Z\big) - \operatorname{var}\big(\widehat{\beta}_X \mid Z\big)
  = \sigma^2 \Bigl[ \bigl\{ X^\top X - X^\top V \big(V^\top V\big)^{-1} V^\top X \bigr\}^{-1} - \big(X^\top X\big)^{-1} \Bigr].
\]
Let us again consider two situations.

(i) $X^\top V = 0_{k \times l}$, which means that each column of $X$ is orthogonal to each column of $V$. In other words, the regressors included in the matrix $X$ are uncorrelated with the regressors included in the matrix $V$. Then
• $\widehat{\beta}_X = \widehat{\beta}_{XV}$ and $\operatorname{var}(\widehat{\beta}_X \mid Z) = \operatorname{var}(\widehat{\beta}_{XV} \mid Z)$.
• Hence $\beta$ can be estimated using the model $\mathcal{M}_{XV}$ with the irrelevant covariates included, without any impact on the quality of the estimator.

(ii) $X^\top V \neq 0_{k \times l}$:
• The estimator $\widehat{\beta}_{XV}$ is worse than the estimator $\widehat{\beta}_X$ in terms of its variability.
• If we also take into account the fact that by including more regressors in the model we increase the danger of multicollinearity, the difference between the variability of $\widehat{\beta}_{XV}$ and that of $\widehat{\beta}_X$ may become huge.

Prediction

Let us now compare the predictions $\widehat{Y}_{\mathrm{new},X} = \widehat{Y}_X$ based on the correct model $\mathcal{M}_X$ with the predictions $\widehat{Y}_{\mathrm{new},XV} = \widehat{Y}_{XV}$ based on the also correct model $\mathcal{M}_{XV}$, where, however, irrelevant covariates were included. The properties of the fitted values in a correct model together with the results of Lemma 11.4 give
\[
  \mathrm{MSEP}\big(\widehat{Y}_{\mathrm{new},XV}\big) = n\sigma^2 + (k + l)\sigma^2, \qquad
  \mathrm{MSEP}\big(\widehat{Y}_{\mathrm{new},X}\big) = n\sigma^2 + k\sigma^2.
\]
That is, the averaged mean squared errors of prediction are
\[
  \mathrm{AMSEP}\big(\widehat{Y}_{\mathrm{new},XV}\big) = \sigma^2 + \frac{k+l}{n}\sigma^2, \qquad
  \mathrm{AMSEP}\big(\widehat{Y}_{\mathrm{new},X}\big) = \sigma^2 + \frac{k}{n}\sigma^2.
\]
The following can now be concluded.
• As $n \to \infty$, both $\mathrm{AMSEP}(\widehat{Y}_{\mathrm{new},XV})$ and $\mathrm{AMSEP}(\widehat{Y}_{\mathrm{new},X})$ tend to $\sigma^2$. Hence, on average, if a sufficiently large number of predictions is needed, both models provide predictions of practically the same quality.
• On the other hand, by using the richer model $\mathcal{M}_{XV}$ (which for finite $n$ provides worse predictions than the smaller model $\mathcal{M}_X$), we eliminate the possible problem of omitted important covariates, which leads to biased predictions with possibly even worse MSEP and AMSEP than those of model $\mathcal{M}_{XV}$.
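The variance inflation in situation (ii) can be made concrete. For centered regressors $x$ and $v$ and known $\sigma^2$, the block-inverse expression from the proof of Theorem 11.3 reduces, for the coefficient of $x$, to $\operatorname{var}(\widehat{\beta}_1 \text{ under } \mathcal{M}_{XV}) = \sigma^2 / \{S_{xx}(1 - r^2)\}$, with $r$ the sample correlation of $x$ and $v$; the factor $1/(1-r^2)$ is the usual variance inflation factor. A small pure-Python sketch with made-up numbers (both regressors are centered, so no intercept terms appear):

```python
# Variance inflation from an irrelevant but correlated regressor:
#   var(beta1 under M_X)  = sigma^2 / Sxx
#   var(beta1 under M_XV) = sigma^2 / {Sxx - Sxv^2 / Svv}
#                         = (sigma^2 / Sxx) * 1/(1 - r^2)   (the VIF)
x = [-3.0, -1.0, 0.0, 1.0, 3.0]     # centered regressor of interest
v = [-2.0, -1.5, 0.5, 1.0, 2.0]     # centered, correlated, irrelevant (gamma = 0)
sigma2 = 1.0

sxx = sum(a * a for a in x)
svv = sum(b * b for b in v)
sxv = sum(a * b for a, b in zip(x, v))
r2 = sxv ** 2 / (sxx * svv)

var_small = sigma2 / sxx                       # model M_X
var_large = sigma2 / (sxx - sxv ** 2 / svv)    # model M_XV, via the block inverse
vif = 1.0 / (1.0 - r2)
print(var_small, var_large, vif)               # var_large = var_small * vif
```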
11.2.5 Summary

Interest in estimation of the regression coefficients and inference on them

If interest lies in estimation of and inference on the regression coefficients $\beta$ related to the regressors included in the model matrix $X$, the following was derived in Sections 11.2.3 and 11.2.4.

(i) If we omit important regressors which are (multiply) correlated with the regressors of main interest included in the matrix $X$, the LSE of the regression coefficients is biased.

(ii) If we include irrelevant regressors which are (multiply) correlated with the regressors of main interest in the matrix $X$, we face the danger of multicollinearity and the related inflation of the standard errors of the LSE of the regression coefficients.

(iii) Regressors which are (multiply) uncorrelated with the regressors of main interest influence neither the bias nor the variability of $\widehat{\beta}$, irrespective of whether they are omitted or irrelevantly included.

Consequently, if the primary task of the analysis is to evaluate whether and how much the primary regressors included in the model matrix $X$ influence the response expectation, detailed exploration and understanding of the mutual relationships among all potential regressors, and also between the regressors and the response, is needed. In particular, regressors which are (multiply) correlated with the regressors from the model matrix $X$ and at the same time do not have any influence on the response expectation should not be included in the model. On the other hand, regressors which are (multiply) uncorrelated with the regressors of primary interest can be included in the model without any harm. In general, it is necessary to find a trade-off between a too poor and a too rich model.

Interest in prediction

If prediction is the primary purpose of the regression analysis, the results derived in Sections 11.2.3 and 11.2.4 dictate the strategy of including all available covariates in the model. The reasons are the following.

(i) If we omit important regressors, the predictions become biased and the averaged mean squared error of prediction possibly does not tend to the optimal value of $\sigma^2$ as $n \to \infty$.

(ii) If we include irrelevant regressors in the model, this has, especially as $n \to \infty$, a negligible effect on the quality of the prediction. The averaged mean squared error of prediction still tends to the optimal value of $\sigma^2$.

Chapter 12: Simultaneous Inference in a Linear Model

In this chapter, we will assume that the data are represented by a set of $n$ random vectors $(Y_i, X_i)$, $X_i = (X_{i,0}, \ldots, X_{i,k-1})^\top$, $i = 1, \ldots, n$, that satisfy a normal linear model. That is,
\[
  Y \mid X \sim \mathcal{N}_n\big(X\beta, \sigma^2 I_n\big), \qquad \operatorname{rank}\big(X_{n \times k}\big) = r \leq k < n,
\]
where $Y = (Y_1, \ldots, Y_n)^\top$, $X$ is the matrix with the vectors $X_1^\top, \ldots, X_n^\top$ in its rows, and $\beta = (\beta_0, \ldots, \beta_{k-1})^\top \in \mathbb{R}^k$ and $\sigma^2 > 0$ are unknown parameters. Further, we will assume that a matrix $L_{m \times k}$ ($m > 1$) with rows $l_1^\top, \ldots, l_m^\top$ is given such that
\[
  \theta = L\beta = \big(l_1^\top \beta, \ldots, l_m^\top \beta\big)^\top = (\theta_1, \ldots, \theta_m)^\top
\]
is an estimable vector parameter of the linear model. Our interest will lie in simultaneous inference on the elements of the parameter $\theta$. This means we will be interested in

(i) deriving confidence regions for the vector parameter $\theta$;
(ii) testing the null hypothesis $H_0$: $\theta = \theta^0$ for a given $\theta^0 \in \mathbb{R}^m$.

12.1 Basic simultaneous inference

If the matrix $L_{m \times k}$ is such that
(i) $m \leq r$;
(ii) its rows, i.e., the vectors $l_1, \ldots, l_m \in \mathbb{R}^k$, are linearly independent,
then we already have a tool for simultaneous inference on $\theta = L\beta$. It is based on point (x) of Theorem 3.1 (Least squares estimators under the normality), which provides a confidence region for $\theta$ with a coverage of $1 - \alpha$:
\[
  \Bigl\{ \theta \in \mathbb{R}^m :
  \big(\widehat{\theta} - \theta\big)^\top \bigl\{ MS_e\, L \big(X^\top X\big)^{-} L^\top \bigr\}^{-1} \big(\widehat{\theta} - \theta\big)
  < m\, F_{m,\, n-r}(1 - \alpha) \Bigr\}, \tag{12.1}
\]
where $\widehat{\theta} = L b$ is the LSE of $\theta$.
The null hypothesis $H_0$: $\theta = \theta^0$ is tested using the statistic
\[
  Q_0 = \frac{1}{m} \big(\widehat{\theta} - \theta^0\big)^\top \bigl\{ MS_e\, L \big(X^\top X\big)^{-} L^\top \bigr\}^{-1} \big(\widehat{\theta} - \theta^0\big), \tag{12.2}
\]
which under the null hypothesis follows the $F_{m,\, n-r}$ distribution, and the critical region of a test on the level $\alpha$ is
\[
  \mathcal{C}(\alpha) = \big( F_{m,\, n-r}(1 - \alpha),\, \infty \big). \tag{12.3}
\]
The P-value if $Q_0 = q_0$ is then given as $p = 1 - \mathrm{CDF}_{F,\, m,\, n-r}(q_0)$.

Note that the confidence region (12.1) and the test based on the statistic $Q_0$ with the critical region (12.3) are mutually dual. That is, the null hypothesis is rejected on level $\alpha$ if and only if $\theta^0$ is not covered by the confidence region (12.1) with coverage $1 - \alpha$.

12.2 Multiple comparison procedures

12.2.1 Multiple testing

The null hypothesis $H_0$: $\theta = \theta^0$ ($\theta^0 = (\theta_1^0, \ldots, \theta_m^0)^\top$) on the vector parameter $\theta = (\theta_1, \ldots, \theta_m)^\top$ can also be written as $H_0$: $\theta_1 = \theta_1^0 \;\&\; \cdots \;\&\; \theta_m = \theta_m^0$.

Definition 12.1 (Multiple testing problem, elementary null hypotheses, global null hypothesis).
A testing problem with the null hypothesis
\[
  H_0: \theta_1 = \theta_1^0 \;\&\; \ldots \;\&\; \theta_m = \theta_m^0 \tag{12.4}
\]
is called the multiple testing problem (Czech: problém vícenásobného testování) with the $m$ elementary hypotheses (Czech: elementární hypotézy)
\[
  H_1: \theta_1 = \theta_1^0, \quad \ldots, \quad H_m: \theta_m = \theta_m^0.
\]
The hypothesis $H_0$ is also called, in this context, the global null hypothesis.

Note. The above definition of the multiple testing problem is a simplified definition of the general multiple testing problem, in which the elementary null hypotheses are not necessarily simple hypotheses. Further, general multiple testing procedures also consider problems where the null hypothesis $H_0$ is not necessarily given as a conjunction of the elementary hypotheses. Nevertheless, for our purposes in the context of this lecture, Definition 12.1 will suffice. The subsequent theory of multiple comparison procedures will likewise be provided in a simplified way, to the extent needed for its use in the context of the multiple testing problem of Definition 12.1 and in the context of a linear model.

Notation.
• When dealing with a multiple testing problem, we will also write
\[
  H_0 \equiv H_1 \;\&\; \ldots \;\&\; H_m, \qquad \text{or} \qquad H_0 = \bigcap_{j=1}^{m} H_j.
\]
• In the context of multiple testing, the subscript 1 in $H_1$ will never indicate an alternative hypothesis. A symbol $\complement$ will rather be used to indicate an alternative hypothesis.
• The alternative hypothesis of a multiple testing problem with the null hypothesis (12.4) will always be given by the complement of the parameter space under the global null hypothesis, i.e.,
\[
  H_0^\complement: \theta_1 \neq \theta_1^0 \;\text{ OR }\; \ldots \;\text{ OR }\; \theta_m \neq \theta_m^0,
  \qquad \text{i.e.,} \qquad
  H_0^\complement \equiv H_1^\complement \;\text{ OR }\; \ldots \;\text{ OR }\; H_m^\complement,
\]
where $H_j^\complement$: $\theta_j \neq \theta_j^0$, $j = 1, \ldots, m$. We will also write $H_0^\complement = \bigcup_{j=1}^{m} H_j^\complement$.
• Different ways of indexing the elementary null hypotheses will also be used (e.g., a double subscript), depending on the problem at hand.

Example 12.1 (Multiple testing problem for one-way classified group means).
Suppose that a normal linear model $Y \mid X \sim \mathcal{N}_n(X\beta, \sigma^2 I_n)$ is used to model the dependence of the response $Y$ on a single categorical covariate $Z$ with sample space $\mathcal{Z} = \{1, \ldots, G\}$, where the regression space $\mathcal{M}(X)$ of dimension $G$ parameterizes the one-way classified group means $m_1 := E(Y \mid Z = 1), \ldots, m_G := E(Y \mid Z = G)$. If we restrict ourselves to full-rank parameterizations (see Section 7.4.4), the regression coefficient vector is $\beta = (\beta_0, \beta^{Z\top})^\top$, $\beta^Z = (\beta_1, \ldots, \beta_{G-1})^\top$, and the group means are parameterized as
\[
  m_g = \beta_0 + c_g^\top \beta^Z, \qquad g = 1, \ldots, G,
\]
where
\[
  C = \begin{pmatrix} c_1^\top \\ \vdots \\ c_G^\top \end{pmatrix}
\]
is a chosen $G \times (G-1)$ (pseudo)contrast matrix. The null hypothesis $H_0$: $m_1 = \cdots = m_G$ on the equality of the $G$ group means can be specified as a multiple testing problem with $m = \binom{G}{2}$ elementary hypotheses (a double subscript will be used to index them):
\[
  H_{1,2}: m_1 = m_2, \quad \ldots, \quad H_{G-1,G}: m_{G-1} = m_G.
\]
The elementary null hypotheses can be written in terms of the vector estimable parameter $\theta = (\theta_{1,2}, \ldots, \theta_{G-1,G})^\top$, defined by
\[
  \theta_{g,h} = m_g - m_h = (c_g - c_h)^\top \beta^Z, \qquad g = 1, \ldots, G-1, \quad h = g+1, \ldots, G.
\]
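The elementary hypotheses of Example 12.1 are indexed by the $m = G(G-1)/2$ unordered pairs of groups; enumerating them is a one-liner ($G = 5$ chosen arbitrarily for illustration):

```python
# Enumerate the m = G*(G-1)/2 elementary hypotheses H_{g,h}: m_g = m_h of
# Example 12.1.  Groups are labelled 1..G; each pair (g, h), g < h, yields one
# row (c_g - c_h)' of the contrast matrix L.
from itertools import combinations

G = 5
pairs = list(combinations(range(1, G + 1), 2))
m = len(pairs)
print(m, pairs[:3])   # 10 (1,2) (1,3) (1,4) ...
```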
In terms of $\theta$, the elementary hypotheses read
\[
  H_{1,2}: \theta_{1,2} = 0, \quad \ldots, \quad H_{G-1,G}: \theta_{G-1,G} = 0,
\]
or, written directly in terms of the group means,
\[
  H_{1,2}: m_1 - m_2 = 0, \quad \ldots, \quad H_{G-1,G}: m_{G-1} - m_G = 0.
\]
The global null hypothesis is $H_0$: $\theta = 0$, where $\theta = L\beta$. Here, $L$ is the $\binom{G}{2} \times G$ matrix
\[
  L = \begin{pmatrix} 0 & (c_1 - c_2)^\top \\ \vdots & \vdots \\ 0 & (c_{G-1} - c_G)^\top \end{pmatrix}.
\]
Since $\operatorname{rank}(C) = G - 1$, we have $\operatorname{rank}(L) = G - 1$. We then have:
• For $G \geq 4$, $m = \binom{G}{2} > G$. That is, in this case the number of elementary null hypotheses is higher than the rank of the underlying linear model.
• For $G \geq 3$, the matrix $L$ has linearly dependent rows.
That is, for $G \geq 3$ we can
(i) neither calculate the simultaneous confidence region (12.1) for $\theta$;
(ii) nor use the test statistic (12.2) to test $H_0$: $\theta = 0$.

In this chapter,
(i) we develop procedures that allow us to test the null hypothesis $H_0$: $L\beta = \theta^0$ and to provide a simultaneous confidence region for $\theta = L\beta$ even if the rows of the matrix $L$ are linearly dependent or its rank is higher than the rank of the underlying linear model;
(ii) the test procedure will also decide which of the elementary hypotheses is/are responsible (in a certain sense) for the rejection of the global null hypothesis;
(iii) the developed confidence regions will have the more appealing form of a product of intervals.

12.2.2 Simultaneous confidence intervals

Suppose that the distribution of a random vector $D$ depends on a (vector) parameter $\theta = (\theta_1, \ldots, \theta_m)^\top \in \Theta_1 \times \cdots \times \Theta_m = \Theta \subseteq \mathbb{R}^m$.

Definition 12.2 (Simultaneous confidence intervals).
(Random) intervals $(\theta_j^L, \theta_j^U)$, $j = 1, \ldots, m$, where $\theta_j^L = \theta_j^L(D)$ and $\theta_j^U = \theta_j^U(D)$, $j = 1, \ldots, m$, are called simultaneous confidence intervals (Czech: simultánní intervaly spolehlivosti) for the parameter $\theta$ with a coverage of $1 - \alpha$ if for any $\theta^0 = (\theta_1^0, \ldots, \theta_m^0)^\top \in \Theta$,
\[
  P\Bigl( \big(\theta_1^L, \theta_1^U\big) \times \cdots \times \big(\theta_m^L, \theta_m^U\big) \ni \theta^0;\; \theta = \theta^0 \Bigr) \geq 1 - \alpha.
\]

Notes.
• The condition in the definition can also be written as
\[
  P\Bigl( \forall\, j = 1, \ldots, m: \big(\theta_j^L, \theta_j^U\big) \ni \theta_j^0;\; \theta = \theta^0 \Bigr) \geq 1 - \alpha.
\]
• The product of the simultaneous confidence intervals indeed forms a confidence region in the classical sense.

Example 12.2 (Bonferroni simultaneous confidence intervals).
Let, for each $j = 1, \ldots, m$, $(\theta_j^L, \theta_j^U)$ be a classical confidence interval for $\theta_j$ with a coverage of $1 - \frac{\alpha}{m}$. That is,
\[
  \forall\, j = 1, \ldots, m,\; \forall\, \theta_j^0 \in \Theta_j:\quad
  P\Bigl( \big(\theta_j^L, \theta_j^U\big) \ni \theta_j^0;\; \theta_j = \theta_j^0 \Bigr) \geq 1 - \frac{\alpha}{m}.
\]
We then have
\[
  \forall\, j = 1, \ldots, m,\; \forall\, \theta_j^0 \in \Theta_j:\quad
  P\Bigl( \big(\theta_j^L, \theta_j^U\big) \not\ni \theta_j^0;\; \theta_j = \theta_j^0 \Bigr) \leq \frac{\alpha}{m}.
\]
Further, using an elementary property of probability (for any $\theta^0 \in \Theta$),
\[
  P\Bigl( \exists\, j = 1, \ldots, m: \big(\theta_j^L, \theta_j^U\big) \not\ni \theta_j^0;\; \theta = \theta^0 \Bigr)
  \leq \sum_{j=1}^{m} P\Bigl( \big(\theta_j^L, \theta_j^U\big) \not\ni \theta_j^0;\; \theta = \theta^0 \Bigr)
  \leq \sum_{j=1}^{m} \frac{\alpha}{m} = \alpha.
\]
Hence,
\[
  P\Bigl( \forall\, j = 1, \ldots, m: \big(\theta_j^L, \theta_j^U\big) \ni \theta_j^0;\; \theta = \theta^0 \Bigr) \geq 1 - \alpha.
\]
That is, the intervals $(\theta_j^L, \theta_j^U)$, $j = 1, \ldots, m$, are simultaneous confidence intervals for the parameter $\theta$ with a coverage of $1 - \alpha$. Simultaneous confidence intervals constructed in this way from univariate confidence intervals are called Bonferroni simultaneous confidence intervals. Their disadvantage is that they are often seriously conservative, i.e., they have a coverage (much) higher than the requested $1 - \alpha$.

12.2.3 Multiple comparison procedure, P-values adjusted for multiple comparison

Suppose again that the distribution of a random vector $D$ depends on a (vector) parameter $\theta = (\theta_1, \ldots, \theta_m)^\top \in \Theta_1 \times \cdots \times \Theta_m = \Theta \subseteq \mathbb{R}^m$. Let for each $0 < \alpha < 1$ a procedure be given to construct simultaneous confidence intervals $(\theta_j^L(\alpha), \theta_j^U(\alpha))$, $j = 1, \ldots, m$, for the parameter $\theta$ with a coverage of $1 - \alpha$. Let for each $j = 1, \ldots, m$ the procedure create intervals satisfying the monotonicity condition
\[
  1 - \alpha_1 < 1 - \alpha_2 \;\Longrightarrow\; \big(\theta_j^L(\alpha_1), \theta_j^U(\alpha_1)\big) \subseteq \big(\theta_j^L(\alpha_2), \theta_j^U(\alpha_2)\big).
\]

Definition 12.3 (Multiple comparison procedure).
The multiple comparison procedure (MCP; Czech: procedura vícenásobného srovnávání) for a multiple testing problem with the elementary null hypotheses $H_j$: $\theta_j = \theta_j^0$, $j = 1, \ldots, m$, based on a given procedure for the construction of simultaneous confidence intervals for the parameter $\theta$, is the testing procedure that for a given $0 < \alpha < 1$:
(i) rejects the global null hypothesis $H_0$: $\theta = \theta^0$ if and only if
\[
  \big(\theta_1^L(\alpha), \theta_1^U(\alpha)\big) \times \cdots \times \big(\theta_m^L(\alpha), \theta_m^U(\alpha)\big) \not\ni \theta^0;
\]
(ii) for $j = 1, \ldots, m$, rejects the $j$th elementary hypothesis $H_j$: $\theta_j = \theta_j^0$ if and only if
\[
  \big(\theta_j^L(\alpha), \theta_j^U(\alpha)\big) \not\ni \theta_j^0.
\]

Note. Since $(\theta_1^L(\alpha), \theta_1^U(\alpha)) \times \cdots \times (\theta_m^L(\alpha), \theta_m^U(\alpha)) \not\ni \theta^0$ if and only if there exists $j = 1, \ldots, m$ such that $(\theta_j^L(\alpha), \theta_j^U(\alpha)) \not\ni \theta_j^0$, the MCP rejects, for a given $0 < \alpha < 1$, the global null hypothesis $H_0$: $\theta = \theta^0$ if and only if it rejects at least one of the $m$ elementary null hypotheses.

Note (Control of the type-I error rate). The classical duality between confidence regions and testing procedures provides that for any $0 < \alpha < 1$, the multiple comparison procedure defines a statistical test which
(i) controls the type-I error rate with respect to the global null hypothesis $H_0$: $\theta = \theta^0$, i.e., $P(H_0 \text{ rejected};\; \theta = \theta^0) \leq \alpha$;
(ii) at the same time, for each $j = 1, \ldots, m$, controls the type-I error rate with respect to the elementary hypothesis $H_j$: $\theta_j = \theta_j^0$, i.e., $P(H_j \text{ rejected};\; \theta_j = \theta_j^0) \leq \alpha$.

[End of Lecture #21 (10/12/2015); start of Lecture #22 (17/12/2015).]

Definition 12.4 (P-values adjusted for multiple comparison).
P-values adjusted for multiple comparison, for a multiple testing problem with the elementary null hypotheses $H_j$: $\theta_j = \theta_j^0$, $j = 1, \ldots, m$, based on a given procedure for the construction of simultaneous confidence intervals for the parameter $\theta$, are the values $p_1^{\mathrm{adj}}, \ldots, p_m^{\mathrm{adj}}$ defined as
\[
  p_j^{\mathrm{adj}} = \inf\Bigl\{ \alpha: \big(\theta_j^L(\alpha), \theta_j^U(\alpha)\big) \not\ni \theta_j^0 \Bigr\}, \qquad j = 1, \ldots, m.
\]

Notes. The following is clear from the construction:
• The multiple comparison procedure rejects, for a given $0 < \alpha < 1$, the $j$th elementary hypothesis $H_j$: $\theta_j = \theta_j^0$ ($j = 1, \ldots, m$) if and only if $p_j^{\mathrm{adj}} \leq \alpha$.
• Since the global null hypothesis $H_0$: $\theta = \theta^0$ is rejected by the MCP if and only if at least one elementary hypothesis is rejected, the global null hypothesis is, for a given $\alpha$, rejected if and only if $\min\big( p_1^{\mathrm{adj}}, \ldots, p_m^{\mathrm{adj}} \big) \leq \alpha$. That is,
\[
  p^{\mathrm{adj}} := \min\big( p_1^{\mathrm{adj}}, \ldots, p_m^{\mathrm{adj}} \big)
\]
is the P-value of the test of the global null hypothesis based on the considered MCP.

Example 12.3 (Bonferroni multiple comparison procedure, Bonferroni adjusted P-values).
Let for $0 < \alpha < 1$, $(\theta_j^L(\alpha), \theta_j^U(\alpha))$, $j = 1, \ldots, m$, be the confidence intervals for the parameters $\theta_1, \ldots, \theta_m$, each with a (univariate) coverage of $1 - \frac{\alpha}{m}$. That is,
\[
  \forall\, j = 1, \ldots, m,\; \forall\, \theta_j^0 \in \Theta_j:\quad
  P\Bigl( \big(\theta_j^L(\alpha), \theta_j^U(\alpha)\big) \ni \theta_j^0;\; \theta_j = \theta_j^0 \Bigr) \geq 1 - \frac{\alpha}{m}.
\]
As shown in Example 12.2, $(\theta_j^L(\alpha), \theta_j^U(\alpha))$, $j = 1, \ldots, m$, are the Bonferroni simultaneous confidence intervals for the parameter $\theta = (\theta_1, \ldots, \theta_m)^\top$ with a coverage of $1 - \alpha$.

Let for $j = 1, \ldots, m$, $p_j^{\mathrm{uni}}$ be the P-value related to the (single) test of the ($j$th elementary) hypothesis $H_j$: $\theta_j = \theta_j^0$ that is dual to the confidence interval $(\theta_j^L(\alpha), \theta_j^U(\alpha))$. That is,
\[
  p_j^{\mathrm{uni}} = \inf\Bigl\{ \frac{\alpha}{m}: \big(\theta_j^L(\alpha), \theta_j^U(\alpha)\big) \not\ni \theta_j^0 \Bigr\}.
\]
Hence,
\[
  \min\big( m\, p_j^{\mathrm{uni}},\, 1 \big) = \inf\Bigl\{ \alpha: \big(\theta_j^L(\alpha), \theta_j^U(\alpha)\big) \not\ni \theta_j^0 \Bigr\}.
\]
That is, the P-values adjusted for multiple comparison based on the Bonferroni simultaneous confidence intervals are
\[
  p_j^{B} = \min\big( m\, p_j^{\mathrm{uni}},\, 1 \big), \qquad j = 1, \ldots, m.
\]
The related multiple comparison procedure is called the Bonferroni MCP. The conservativeness of the Bonferroni MCP is seen, for instance, in the fact that the global null hypothesis $H_0$: $\theta = \theta^0$ is rejected for a given $0 < \alpha < 1$ if and only if at least one of the elementary hypotheses is rejected by its single test on a significance level of $\alpha/m$, which approaches zero as $m$, the number of elementary hypotheses, increases.

12.2.4 Bonferroni simultaneous inference in a normal linear model

Consider a normal linear model
\[
  Y \mid X \sim \mathcal{N}_n\big(X\beta, \sigma^2 I_n\big), \qquad \operatorname{rank}\big(X_{n \times k}\big) = r \leq k < n,
\]
and let $\theta = L\beta = (l_1^\top \beta, \ldots, l_m^\top \beta)^\top = (\theta_1, \ldots, \theta_m)^\top$ be an estimable vector parameter of the linear model.
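The Bonferroni adjustment of Example 12.3 is trivial to compute; a sketch with invented univariate P-values:

```python
# Bonferroni adjusted P-values: p_j^B = min(m * p_j^uni, 1).  The global null
# is rejected at level alpha iff min_j p_j^B <= alpha.  The univariate
# P-values below are made up.
p_uni = [0.004, 0.020, 0.310, 0.700]
m = len(p_uni)
p_bonf = [min(m * p, 1.0) for p in p_uni]
p_global = min(p_bonf)       # P-value of the test of the global null hypothesis
print(p_bonf, p_global)      # [0.016, 0.08, 1.0, 1.0] 0.016
```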
, θm be an estimable vector parameter of the linear model. At this point, we shall only require that lj 6= 0k for each j = 1, . . . , m. Nevertheless, we allow for m > r and also for possibly linearly dependent vectors l1 , . . . , lm . b = Lb = l> b, . . . , l> b = θb1 , . . . , θbm be the LSE of the vector θ and let MSe As usual, let θ 1 m be the residual mean square of the model. α It follows from properties of the LSE under normality that for given α, the 1 − 100% m confidence intervals for parameters θ1 , . . . , θm have the lower and the upper bounds given as q − α > > L > θj (α) = lj b − MSe lj X X lj tn−r 1 − , 2m (12.5) q − α > > U > θj (α) = lj b + MSe lj X X lj tn−r 1 − , j = 1, . . . , m. 2m By the Bonferroni principle, intervals θjL (α), θjU (α) , j = 1, . . . , m, are simultaneous confidence intervals for parameter θ with a coverage of 1 − α. For each j = 1, . . . , m, the confidence interval (12.5) is dual to the (single) test of the (jth elementary) hypothesis Hj : θj = θj0 based on the statistic 0 l> j b − θj Tj (θj0 ) = q , >X −l MSe l> X j j 12.2. MULTIPLE COMPARISON PROCEDURES 227 which under the hypothesis Hj follows the Student tn−r distribution. The univariate P-values are then calculated as puni = 2 CDFt, n−r − |tj,0 | , j where tj,0 is the value of the statistic Tj (θj0 ) attained with given data. Hence the Bonferroni adjusted P-values for a multiple testing problem with the elementary null hypotheses Hj : θj = θj0 , j = 1, . . . , m, are n o pB = min 2 m CDF − |t | j = 1, . . . , m. t, n−r j,0 , 1 , j 12.3. TUKEY’S T-PROCEDURE 12.3 228 Tukey’s T-procedure Method presented in this section is due to John Wilder Tukey (1915 – 2000) who published the initial version of the method in 1949 (Tukey, 1949). 12.3.1 Tukey’s pairwise comparisons theorem Lemma 12.1 Studentized range. Let T1 , . . . , Tm be a random sample from N (µ, σ 2 ), σ 2 > 0. Let R = max Tj − j=1,...,m min Tj j=1,...,m be the range of the sample. 
Let $S^2$ be an estimator of $\sigma^2$ such that $S^2$ and $T = (T_1, \ldots, T_m)^\top$ are independent and
\[ \frac{\nu S^2}{\sigma^2} \sim \chi^2_\nu \quad \text{for some } \nu > 0. \]
Let
\[ Q = \frac{R}{S}. \]
The distribution of the random variable $Q$ then depends on neither $\mu$ nor $\sigma$.

Proof.
• We can write
\[ \frac{R}{S} = \frac{\frac{1}{\sigma}\bigl\{\max_j (T_j - \mu) - \min_j (T_j - \mu)\bigr\}}{\frac{S}{\sigma}} = \frac{\max_j \frac{T_j - \mu}{\sigma} - \min_j \frac{T_j - \mu}{\sigma}}{\frac{S}{\sigma}}. \]
• The distributions of both the numerator and the denominator depend on neither $\mu$ nor $\sigma$, since
  – for all $j = 1, \ldots, m$: $\frac{T_j - \mu}{\sigma} \sim N(0, 1)$;
  – the distribution of $\frac{S}{\sigma}$ is a transformation of the $\chi^2_\nu$ distribution. □

Note. The distribution of the random variable $Q = \frac{R}{S}$ from Lemma 12.1 still depends on $m$ (the sample size of $T$) and $\nu$ (the degrees of freedom of the $\chi^2$ distribution related to the variance estimator $S^2$).

Definition 12.5 (Studentized range). The random variable $Q = \frac{R}{S}$ from Lemma 12.1 will be called the studentized range (in Czech: studentizované rozpětí) of a sample of size $m$ with $\nu$ degrees of freedom, and its distribution will be denoted $q_{m,\nu}$.

Notation.
• For $0 < p < 1$, the $p\,100\%$ quantile of a random variable $Q$ with distribution $q_{m,\nu}$ will be denoted $q_{m,\nu}(p)$.
• The distribution function of a random variable $Q$ with distribution $q_{m,\nu}$ will be denoted $\mathrm{CDF}_{q,m,\nu}(\cdot)$.

Theorem 12.2 (Tukey's pairwise comparisons theorem, balanced version). Let $T_1, \ldots, T_m$ be independent random variables with $T_j \sim N(\mu_j, v^2\sigma^2)$, $j = 1, \ldots, m$, where $v^2 > 0$ is a known constant. Let $S^2$ be an estimator of $\sigma^2$ such that $S^2$ and $T = (T_1, \ldots, T_m)^\top$ are independent and $\frac{\nu S^2}{\sigma^2} \sim \chi^2_\nu$ for some $\nu > 0$. Then
\[ P\Bigl(\text{for all } j \ne l:\ \bigl|T_j - T_l - (\mu_j - \mu_l)\bigr| < q_{m,\nu}(1-\alpha)\,\sqrt{v^2 S^2}\Bigr) = 1 - \alpha. \]

Proof.
• It follows from the assumptions that the random variables $\frac{T_j - \mu_j}{v}$, $j = 1, \ldots, m$, are i.i.d. with distribution $N(0, \sigma^2)$.
• Let $R = \max_j \frac{T_j - \mu_j}{v} - \min_j \frac{T_j - \mu_j}{v}$. Then $\frac{R}{S} \sim q_{m,\nu}$.
• Hence for any $0 < \alpha < 1$ (since $q_{m,\nu}$ is a continuous distribution):
\begin{align*}
1 - \alpha &= P\biggl(\frac{\max_j \frac{T_j - \mu_j}{v} - \min_j \frac{T_j - \mu_j}{v}}{S} < q_{m,\nu}(1-\alpha)\biggr) \\
&= P\biggl(\frac{\max_j (T_j - \mu_j) - \min_j (T_j - \mu_j)}{v\,S} < q_{m,\nu}(1-\alpha)\biggr) \\
&= P\Bigl(\max_j (T_j - \mu_j) - \min_j (T_j - \mu_j) < v\,S\,q_{m,\nu}(1-\alpha)\Bigr) \\
&= P\Bigl(\text{for all } j \ne l:\ \bigl|(T_j - \mu_j) - (T_l - \mu_l)\bigr| < v\,S\,q_{m,\nu}(1-\alpha)\Bigr) \\
&= P\Bigl(\text{for all } j \ne l:\ \bigl|T_j - T_l - (\mu_j - \mu_l)\bigr| < q_{m,\nu}(1-\alpha)\sqrt{v^2 S^2}\Bigr). \quad \square
\end{align*}

Theorem 12.3 (Tukey's pairwise comparisons theorem, general version). Let $T_1, \ldots, T_m$ be independent random variables with $T_j \sim N(\mu_j, v_j^2\sigma^2)$, $j = 1, \ldots, m$, where $v_j^2 > 0$, $j = 1, \ldots, m$, are known constants. Let $S^2$ be an estimator of $\sigma^2$ such that $S^2$ and $T = (T_1, \ldots, T_m)^\top$ are independent and $\frac{\nu S^2}{\sigma^2} \sim \chi^2_\nu$ for some $\nu > 0$. Then
\[ P\Biggl(\text{for all } j \ne l:\ \bigl|T_j - T_l - (\mu_j - \mu_l)\bigr| < q_{m,\nu}(1-\alpha)\,\sqrt{\frac{v_j^2 + v_l^2}{2}\,S^2}\Biggr) \ge 1 - \alpha. \]

Proof. The proof/calculations were skipped and are not requested for the exam. See Hayter (1984). □

Notes.
• Tukey suggested already in 1953 (in an unpublished manuscript, Tukey, 1953) that the statement of Theorem 12.3 holds, without proving it. Independently, it was also suggested by Kramer (1956). Consequently, the statement of Theorem 12.3 was called the Tukey–Kramer conjecture.
• The proof is not an easy adaptation of the proof of the balanced version.

12.3.2 Tukey's honest significance differences (HSD)

The method of multiple comparison that will now be developed appears under several different names in the literature: Tukey's method, Tukey–Kramer method, Tukey's range test, Tukey's honest significance differences (HSD) test.

Assumptions. In the following, we assume that $T = (T_1, \ldots, T_m)^\top \sim N_m(\mu, \sigma^2 V)$, where
• $\mu = (\mu_1, \ldots, \mu_m)^\top \in \mathbb{R}^m$ and $\sigma^2 > 0$ are unknown parameters;
• $V$ is a known diagonal matrix with $v_1^2, \ldots, v_m^2$ on the diagonal.
That is, $T_1, \ldots, T_m$ are independent and $T_j \sim N(\mu_j, \sigma^2 v_j^2)$, $j = 1, \ldots, m$.
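Note (numerical illustration). Quantiles $q_{m,\nu}(p)$ of the studentized range distribution just defined are available in standard software, e.g., in R via `qtukey()` and in Python via `scipy.stats.studentized_range` (SciPy ≥ 1.7). The concrete values $m = 4$, $\nu = 20$ below are our illustrative choice. For $m = 2$ the range is $|T_1 - T_2|$ and $(T_1 - T_2)/(\sqrt{2}\,S) \sim t_\nu$, so $q_{2,\nu}(1-\alpha) = \sqrt{2}\, t_\nu(1-\alpha/2)$, which the sketch also checks:

```python
import numpy as np
from scipy import stats

# quantile q_{m,nu}(1 - alpha) of the studentized range distribution
m, nu, alpha = 4, 20, 0.05
q = stats.studentized_range.ppf(1 - alpha, m, nu)

# sanity check for m = 2: q_{2,nu}(1 - alpha) = sqrt(2) * t_nu(1 - alpha/2)
q2 = stats.studentized_range.ppf(1 - alpha, 2, nu)
t2 = np.sqrt(2.0) * stats.t.ppf(1 - alpha / 2, nu)
```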
Further, we will assume that an estimator $S^2$ of $\sigma^2$ is available which is independent of $T$ and which satisfies $\nu S^2/\sigma^2 \sim \chi^2_\nu$ for some $\nu > 0$.

Multiple comparison problem. The multiple comparison procedure that will be developed aims at testing $m^\star = \binom{m}{2}$ elementary hypotheses on all pairwise differences between the means $\mu_1, \ldots, \mu_m$. Let
\[ \theta_{j,l} = \mu_j - \mu_l, \quad j = 1, \ldots, m-1,\ l = j+1, \ldots, m, \qquad \theta = \bigl(\theta_{1,2}, \theta_{1,3}, \ldots, \theta_{m-1,m}\bigr)^\top. \]
The elementary hypotheses of the multiple testing problem that we shall consider are
\[ H_{j,l}: \theta_{j,l}\ (= \mu_j - \mu_l) = \theta^0_{j,l}, \qquad j = 1, \ldots, m-1,\ l = j+1, \ldots, m, \]
for some $\theta^0 = \bigl(\theta^0_{1,2}, \theta^0_{1,3}, \ldots, \theta^0_{m-1,m}\bigr)^\top \in \mathbb{R}^{m^\star}$. The global null hypothesis is, as usual, $H_0: \theta = \theta^0$.

Note. The most common multiple testing problem in this context is with $\theta^0 = 0_{m^\star}$, which corresponds to all pairwise comparisons of the means $\mu_1, \ldots, \mu_m$. The global null hypothesis then states that all the means are equal.

Some derivations

Using either of Tukey's pairwise comparison theorems (Theorems 12.2 and 12.3), we have, for chosen $0 < \alpha < 1$:
\[ P\Biggl(\text{for all } j \ne l:\ \bigl|T_j - T_l - (\mu_j - \mu_l)\bigr| < q_{m,\nu}(1-\alpha)\sqrt{\frac{v_j^2 + v_l^2}{2}\,S^2}\Biggr) \ge 1 - \alpha, \]
with equality of the above probability to $1 - \alpha$ in the balanced case $v_1^2 = \cdots = v_m^2$. That is, we have
\[ P\Biggl(\text{for all } j \ne l:\ \frac{\bigl|T_j - T_l - (\mu_j - \mu_l)\bigr|}{\sqrt{\frac{v_j^2 + v_l^2}{2}\,S^2}} < q_{m,\nu}(1-\alpha)\Biggr) \ge 1 - \alpha. \]
Let, for $j \ne l$ and $\theta^0_{j,l} \in \mathbb{R}$,
\[ T_{j,l}(\theta^0_{j,l}) := \frac{T_j - T_l - \theta^0_{j,l}}{\sqrt{\frac{v_j^2 + v_l^2}{2}\,S^2}}. \]
That is,
\begin{align}
1 - \alpha &\le P\Bigl(\text{for all } j \ne l:\ \bigl|T_{j,l}(\theta^0_{j,l})\bigr| < q_{m,\nu}(1-\alpha);\ \theta = \theta^0\Bigr) \nonumber\\
&= P\Biggl(\text{for all } j \ne l:\ \frac{\bigl|T_j - T_l - \theta^0_{j,l}\bigr|}{\sqrt{\frac{v_j^2 + v_l^2}{2}\,S^2}} < q_{m,\nu}(1-\alpha);\ \theta = \theta^0\Biggr) \nonumber\\
&= P\Bigl(\text{for all } j \ne l:\ \bigl(\theta^{TL}_{j,l}(\alpha),\ \theta^{TU}_{j,l}(\alpha)\bigr) \ni \theta^0_{j,l};\ \theta = \theta^0\Bigr), \tag{12.6}
\end{align}
where
\[ \theta^{TL}_{j,l}(\alpha) = T_j - T_l - q_{m,\nu}(1-\alpha)\sqrt{\frac{v_j^2 + v_l^2}{2}\,S^2}, \tag{12.7} \]
\[ \theta^{TU}_{j,l}(\alpha) = T_j - T_l + q_{m,\nu}(1-\alpha)\sqrt{\frac{v_j^2 + v_l^2}{2}\,S^2}, \qquad j < l. \tag{12.8} \]

Theorem 12.4 (Tukey's honest significance differences). Random intervals given by (12.7) and (12.8) are simultaneous confidence intervals for the parameters $\theta_{j,l} = \mu_j - \mu_l$, $j = 1, \ldots, m-1$, $l = j+1, \ldots, m$, with a coverage of $1 - \alpha$. In the balanced case $v_1^2 = \cdots = v_m^2$, the coverage is exactly equal to $1 - \alpha$, i.e., for any $\theta^0 \in \mathbb{R}^{m^\star}$,
\[ P\Bigl(\text{for all } j \ne l:\ \bigl(\theta^{TL}_{j,l}(\alpha), \theta^{TU}_{j,l}(\alpha)\bigr) \ni \theta^0_{j,l};\ \theta = \theta^0\Bigr) = 1 - \alpha. \]
The related P-values for the multiple testing problem with elementary hypotheses $H_{j,l}: \theta_{j,l} = \theta^0_{j,l}$, $\theta^0_{j,l} \in \mathbb{R}$, $j < l$, adjusted for multiple comparison, are given by
\[ p^T_{j,l} = 1 - \mathrm{CDF}_{q,m,\nu}\bigl(|t^0_{j,l}|\bigr), \qquad j < l, \]
where $t^0_{j,l}$ is the value of $T_{j,l}(\theta^0_{j,l}) = \dfrac{T_j - T_l - \theta^0_{j,l}}{\sqrt{\frac{v_j^2 + v_l^2}{2}\,S^2}}$ attained with the given data.

Proof. The fact that $\bigl(\theta^{TL}_{j,l}(\alpha), \theta^{TU}_{j,l}(\alpha)\bigr)$, $j < l$, are simultaneous confidence intervals for the parameters $\theta_{j,l} = \mu_j - \mu_l$ with a coverage of $1 - \alpha$ follows from (12.7) and (12.8). The fact that the coverage is exactly equal to $1 - \alpha$ in the balanced case follows from the fact that the inequality in (12.6) is an equality in the balanced case.

Calculation of the P-values adjusted for multiple comparison related to the multiple testing problem with elementary hypotheses $H_{j,l}: \theta_{j,l} = \theta^0_{j,l}$, $j < l$, follows from noting the following (for each $j < l$):
\[ \bigl(\theta^{TL}_{j,l}(\alpha), \theta^{TU}_{j,l}(\alpha)\bigr) \not\ni \theta^0_{j,l} \iff \bigl|T_{j,l}(\theta^0_{j,l})\bigr| \ge q_{m,\nu}(1-\alpha). \]
It now follows from the monotonicity of the quantiles of the continuous studentized range distribution that
\[ p^T_{j,l} = \inf\bigl\{\alpha:\ \bigl(\theta^{TL}_{j,l}(\alpha), \theta^{TU}_{j,l}(\alpha)\bigr) \not\ni \theta^0_{j,l}\bigr\} = \inf\bigl\{\alpha:\ \bigl|T_{j,l}(\theta^0_{j,l})\bigr| \ge q_{m,\nu}(1-\alpha)\bigr\} \]
is attained for $p^T_{j,l}$ satisfying $\bigl|T_{j,l}(\theta^0_{j,l})\bigr| = q_{m,\nu}\bigl(1 - p^T_{j,l}\bigr)$. That is, if $t^0_{j,l}$ is the value of the statistic $T_{j,l}(\theta^0_{j,l})$ attained with the given data, we have $p^T_{j,l} = 1 - \mathrm{CDF}_{q,m,\nu}\bigl(|t^0_{j,l}|\bigr)$. □

12.3.3 Tukey's HSD in a linear model

In the context of a normal linear model
\[ Y \mid X \sim N_n\bigl(X\beta,\ \sigma^2 I_n\bigr), \qquad \mathrm{rank}\bigl(X_{n \times k}\bigr) = r \le k < n, \]
Tukey's honest significance differences are applicable in the following situation.
• $L_{m \times k}$ is a matrix with non-zero rows $l_1^\top, \ldots, l_m^\top$ such that the parameter
\[ \eta = L\beta = \bigl(l_1^\top\beta, \ldots, l_m^\top\beta\bigr)^\top = \bigl(\eta_1, \ldots, \eta_m\bigr)^\top \]
is estimable.
• The matrix $L$ is such that $V := L\,(X^\top X)^-\,L^\top = \bigl(v_{j,l}\bigr)_{j,l=1,\ldots,m}$ is a diagonal matrix; write $v_j^2 := v_{j,j}$, $j = 1, \ldots, m$.

With $b = (X^\top X)^- X^\top Y$ and the residual mean square $MS_e$ of the fitted linear model, we have (conditionally, given the model matrix $X$):
\[ T := \widehat{\eta} = \bigl(l_1^\top b, \ldots, l_m^\top b\bigr)^\top = Lb \sim N_m\bigl(\eta,\ \sigma^2 V\bigr), \qquad \frac{(n-r)\,MS_e}{\sigma^2} \sim \chi^2_{n-r}, \]
with $\widehat{\eta}$ and $MS_e$ independent. Hence Tukey's T-procedure can be used for a multiple comparison problem on the (also estimable) parameters
\[ \theta_{j,l} = \eta_j - \eta_l = \bigl(l_j - l_l\bigr)^\top\beta, \qquad j < l. \]
Tukey's simultaneous confidence intervals for the parameters $\theta_{j,l}$, $j < l$, with a coverage of $1-\alpha$ then have the lower and upper bounds
\[ \theta^{TL}_{j,l}(\alpha) = \widehat{\eta}_j - \widehat{\eta}_l - q_{m,n-r}(1-\alpha)\sqrt{\frac{v_j^2 + v_l^2}{2}\,MS_e}, \qquad \theta^{TU}_{j,l}(\alpha) = \widehat{\eta}_j - \widehat{\eta}_l + q_{m,n-r}(1-\alpha)\sqrt{\frac{v_j^2 + v_l^2}{2}\,MS_e}, \qquad j < l. \]
Calculation of the P-values adjusted for multiple comparison related to the multiple testing problem with elementary hypotheses $H_{j,l}: \theta_{j,l} = \theta^0_{j,l}$, $j < l$, for chosen $\theta^0_{j,l} \in \mathbb{R}$, is based on the statistics
\[ T_{j,l}(\theta^0_{j,l}) = \frac{\widehat{\eta}_j - \widehat{\eta}_l - \theta^0_{j,l}}{\sqrt{\frac{v_j^2 + v_l^2}{2}\,MS_e}}, \qquad j < l. \]

The above procedure is in particular applicable if all involved covariates are categorical and the model corresponds to a one-way, two-way or higher-way classification. If normal and homoscedastic errors in the underlying linear model are assumed, Tukey's HSD method can then be used to develop a multiple comparison procedure for differences between the group means or between the means of the group means.

One-way classification

Let $Y = \bigl(Y_{1,1}, \ldots, Y_{G,n_G}\bigr)^\top$, $n = \sum_{g=1}^G n_g$, and
\[ Y_{g,j} = m_g + \varepsilon_{g,j}, \quad g = 1, \ldots, G,\ j = 1, \ldots, n_g, \qquad \varepsilon_{g,j} \overset{i.i.d.}{\sim} N(0, \sigma^2). \]
We then have (see Theorem 9.1, with random covariates conditionally given the covariate values)
\[ T := \begin{pmatrix} \overline{Y}_1 \\ \vdots \\ \overline{Y}_G \end{pmatrix} \sim N_G\left(\begin{pmatrix} m_1 \\ \vdots \\ m_G \end{pmatrix},\ \sigma^2 \begin{pmatrix} \frac{1}{n_1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{1}{n_G} \end{pmatrix}\right). \]
Moreover, the mean square error $MS_e$ of the underlying one-way ANOVA linear model satisfies, with $\nu_e = n - G$,
\[ \frac{\nu_e\, MS_e}{\sigma^2} \sim \chi^2_{\nu_e}, \qquad MS_e \text{ and } T \text{ independent} \]
(due to the fact that $T$ is the LSE of the vector of group means $m = (m_1, \ldots, m_G)^\top$). Hence Tukey's simultaneous confidence intervals for $\theta_{g,h} = m_g - m_h$, $g = 1, \ldots, G-1$, $h = g+1, \ldots, G$, with a coverage of $1-\alpha$, have the lower and upper bounds
\[ \overline{Y}_g - \overline{Y}_h \pm q_{G,n-G}(1-\alpha)\sqrt{\frac{1}{2}\Bigl(\frac{1}{n_g} + \frac{1}{n_h}\Bigr)MS_e}, \qquad g < h. \]
In the case of balanced data ($n_1 = \cdots = n_G$), the coverage of these intervals is even exactly equal to $1-\alpha$; otherwise, the intervals are conservative (having a coverage greater than $1-\alpha$).

Calculation of the P-values adjusted for multiple comparison related to the multiple testing problem with elementary hypotheses $H_{g,h}: \theta_{g,h} = \theta^0_{g,h}$, $g < h$, for chosen $\theta^0_{g,h} \in \mathbb{R}$, is based on the statistics
\[ T_{g,h}(\theta^0_{g,h}) = \frac{\overline{Y}_g - \overline{Y}_h - \theta^0_{g,h}}{\sqrt{\frac{1}{2}\bigl(\frac{1}{n_g} + \frac{1}{n_h}\bigr)MS_e}}, \qquad g < h. \]

Note. The R function TukeyHSD applied to objects obtained using the function aov (which performs LSE based inference for linear models involving only categorical covariates) provides a software implementation of the Tukey T multiple comparison procedure described here.

Two-way classification

Let $Y = \bigl(Y_{1,1,1}, \ldots, Y_{G,H,n_{G,H}}\bigr)^\top$, $n = \sum_{g=1}^G \sum_{h=1}^H n_{g,h}$, and
\[ Y_{g,h,j} = m_{g,h} + \varepsilon_{g,h,j}, \quad g = 1, \ldots, G,\ h = 1, \ldots, H,\ j = 1, \ldots, n_{g,h}, \qquad \varepsilon_{g,h,j} \overset{i.i.d.}{\sim} N(0, \sigma^2). \]
Let, as usual,
\[ n_{g\bullet} = \sum_{h=1}^H n_{g,h}, \qquad \overline{Y}_{g\bullet} = \frac{1}{n_{g\bullet}}\sum_{h=1}^H\sum_{j=1}^{n_{g,h}} Y_{g,h,j}, \qquad m_{g\bullet} = \frac{1}{H}\sum_{h=1}^H m_{g,h}, \qquad m^{wt}_{g\bullet} = \frac{1}{n_{g\bullet}}\sum_{h=1}^H n_{g,h}\, m_{g,h}, \qquad g = 1, \ldots, G. \]

Balanced data

In the case of balanced data ($n_{g,h} = J$ for all $g, h$), we have $n_{g\bullet} = J H$ and $m^{wt}_{g\bullet} = m_{g\bullet}$. Further,
\[ T := \begin{pmatrix} \overline{Y}_{1\bullet} \\ \vdots \\ \overline{Y}_{G\bullet} \end{pmatrix} \sim N_G\left(\begin{pmatrix} m_{1\bullet} \\ \vdots \\ m_{G\bullet} \end{pmatrix},\ \sigma^2 \begin{pmatrix} \frac{1}{JH} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{1}{JH} \end{pmatrix}\right), \]
see the Consequence of Theorem 9.2.
Further, let $MS_e^{ZW}$ and $MS_e^{Z+W}$ be the residual mean squares from the interaction model and the additive model, with $\nu_e^{ZW} = n - GH$ and $\nu_e^{Z+W} = n - G - H + 1$ degrees of freedom, respectively. We have shown in the proof of the Consequence of Theorem 9.2 that for both the interaction model and the additive model, the sample means $\overline{Y}_{1\bullet}, \ldots, \overline{Y}_{G\bullet}$ are LSE's of the estimable parameters $m_{1\bullet}, \ldots, m_{G\bullet}$ and hence, for both models, the vector $T$ is independent of the corresponding residual mean square. Further, depending on whether the interaction model or the additive model is assumed, we have
\[ \frac{\nu_e^\star\, MS_e^\star}{\sigma^2} \sim \chi^2_{\nu_e^\star}, \]
where $MS_e^\star$ is the residual mean square of the model that is assumed ($MS_e^{ZW}$ or $MS_e^{Z+W}$) and $\nu_e^\star$ the corresponding degrees of freedom ($\nu_e^{ZW}$ or $\nu_e^{Z+W}$).

Hence Tukey's simultaneous confidence intervals for $\theta_{g_1,g_2} = m_{g_1\bullet} - m_{g_2\bullet}$, $g_1 = 1, \ldots, G-1$, $g_2 = g_1+1, \ldots, G$, have the lower and upper bounds
\[ \overline{Y}_{g_1\bullet} - \overline{Y}_{g_2\bullet} \pm q_{G,\nu_e^\star}(1-\alpha)\sqrt{\frac{1}{JH}\,MS_e^\star}, \]
and the coverage of these intervals is even exactly equal to $1-\alpha$.

Calculation of the P-values adjusted for multiple comparison related to the multiple testing problem with elementary hypotheses
\[ H_{g_1,g_2}: \theta_{g_1,g_2} = \theta^0_{g_1,g_2}, \qquad g_1 < g_2, \]
for chosen $\theta^0_{g_1,g_2} \in \mathbb{R}$, is based on the statistics
\[ T_{g_1,g_2}(\theta^0_{g_1,g_2}) = \frac{\overline{Y}_{g_1\bullet} - \overline{Y}_{g_2\bullet} - \theta^0_{g_1,g_2}}{\sqrt{\frac{1}{JH}\,MS_e^\star}}, \qquad g_1 < g_2. \]

End of Lecture #22 (17/12/2015)

Unbalanced data

With unbalanced data, a direct calculation shows that
\[ T := \begin{pmatrix} \overline{Y}_{1\bullet} \\ \vdots \\ \overline{Y}_{G\bullet} \end{pmatrix} \sim N_G\left(\begin{pmatrix} m^{wt}_{1\bullet} \\ \vdots \\ m^{wt}_{G\bullet} \end{pmatrix},\ \sigma^2 \begin{pmatrix} \frac{1}{n_{1\bullet}} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{1}{n_{G\bullet}} \end{pmatrix}\right). \]
Further, the sample means $\overline{Y}_{1\bullet}, \ldots, \overline{Y}_{G\bullet}$ are LSE's of the estimable parameters $m^{wt}_{1\bullet}, \ldots, m^{wt}_{G\bullet}$ in both the interaction and the additive model. This is obvious for the interaction model, since there we know the fitted values ($\equiv$ LSE's of the group means $m_{g,h}$). Those are $\widehat{Y}_{g,h,j} = \overline{Y}_{g,h\bullet}$, $g = 1, \ldots, G$, $h = 1, \ldots, H$, $j = 1, \ldots, n_{g,h}$ (Theorem 9.2). Hence the sample means $\overline{Y}_{1\bullet}, \ldots, \overline{Y}_{G\bullet}$, which are their linear combinations, are LSE's of the corresponding linear combinations of the group means $m_{g,h}$. Those are the weighted means of the means $m^{wt}_{1\bullet}, \ldots, m^{wt}_{G\bullet}$. To show that the sample means $\overline{Y}_{1\bullet}, \ldots, \overline{Y}_{G\bullet}$ are the LSE's of the estimable parameters $m^{wt}_{1\bullet}, \ldots, m^{wt}_{G\bullet}$ in the additive model would, nevertheless, require additional derivations.

For the rest, we can proceed in the same way as in the balanced case. That is, let $MS_e^\star$ and $\nu_e^\star$ denote the residual mean square and the residual degrees of freedom of the model that is assumed (interaction or additive). Owing to the fact that $T$ is a vector of the LSE's of estimable parameters for both models, it is independent of $MS_e^\star$.

Tukey's T multiple comparison procedure is now applicable for inference on the parameters
\[ \theta^{wt}_{g_1,g_2} = m^{wt}_{g_1\bullet} - m^{wt}_{g_2\bullet}, \qquad g_1 = 1, \ldots, G-1,\ g_2 = g_1+1, \ldots, G. \]
Tukey's simultaneous confidence intervals for $\theta^{wt}_{g_1,g_2}$, $g_1 < g_2$, with a coverage of $1-\alpha$ have the lower and upper bounds
\[ \overline{Y}_{g_1\bullet} - \overline{Y}_{g_2\bullet} \pm q_{G,\nu_e^\star}(1-\alpha)\sqrt{\frac{1}{2}\Bigl(\frac{1}{n_{g_1\bullet}} + \frac{1}{n_{g_2\bullet}}\Bigr)MS_e^\star}. \]
Calculation of the P-values adjusted for multiple comparison related to the multiple testing problem with elementary hypotheses $H_{g_1,g_2}: \theta^{wt}_{g_1,g_2} = \theta^{wt,0}_{g_1,g_2}$, $g_1 < g_2$, for chosen $\theta^{wt,0}_{g_1,g_2} \in \mathbb{R}$, is based on the statistics
\[ T_{g_1,g_2}(\theta^{wt,0}_{g_1,g_2}) = \frac{\overline{Y}_{g_1\bullet} - \overline{Y}_{g_2\bullet} - \theta^{wt,0}_{g_1,g_2}}{\sqrt{\frac{1}{2}\bigl(\frac{1}{n_{g_1\bullet}} + \frac{1}{n_{g_2\bullet}}\bigr)MS_e^\star}}, \qquad g_1 < g_2. \]

Notes.
• An analogous procedure applies to inference on the means of the means
\[ m_{\bullet h} = \frac{1}{G}\sum_{g=1}^G m_{g,h}, \qquad m^{wt}_{\bullet h} = \frac{1}{n_{\bullet h}}\sum_{g=1}^G n_{g,h}\, m_{g,h}, \qquad h = 1, \ldots, H, \]
by the second factor of the two-way classification.
• The weighted means of the means $m^{wt}_{g\bullet}$ or $m^{wt}_{\bullet h}$ have a reasonable interpretation only in certain special situations. If this is not the case, Tukey's multiple comparison with unbalanced data does not make much sense.

Beginning of skipped part

• Even with unbalanced data, we can, of course, calculate the LSE's of the (unweighted) means of the means $m_{g\bullet}$ or $m_{\bullet h}$. Nevertheless, those LSE's are correlated with unbalanced data and hence we cannot apply Tukey's procedure.

Note (Tukey's HSD in the R software). The R function TukeyHSD provides the Tukey T-procedure also for the two-way classification (for both the additive and the interaction model). For balanced data, it performs simultaneous inference on the parameters $\theta_{g_1,g_2} = m_{g_1\bullet} - m_{g_2\bullet}$ (and analogous parameters with respect to the second factor) in the way described here. For unbalanced data, it performs simultaneous inference on the parameters $\theta^{wt}_{g_1,g_2} = m^{wt}_{g_1\bullet} - m^{wt}_{g_2\bullet}$ as described here, nevertheless, only for the first factor mentioned in the model formula. Inference on different parameters is provided with respect to the second factor in the model formula. That is, with unbalanced data, the output of the R function TukeyHSD and the interpretation of the results depend on the order of the factors in the model formula.

With a two-way classification, TukeyHSD works, for the second factor, with "new" observations that adjust for the effect of the first factor. That is, it works with the "new" observations $Y^\star_{g,h,j}$ given as
\[ Y^\star_{g,h,j} = Y_{g,h,j} - \overline{Y}_{g\bullet} + \overline{Y}, \qquad g = 1, \ldots, G,\ h = 1, \ldots, H,\ j = 1, \ldots, n_{g,h}. \]
The Tukey T-procedure is then applied to the sample means
\[ \overline{Y}^\star_{\bullet h} = \overline{Y}_{\bullet h} - \frac{1}{n_{\bullet h}}\sum_{g=1}^G n_{g,h}\,\overline{Y}_{g\bullet} + \overline{Y}, \qquad h = 1, \ldots, H, \]
whose expectations are
\[ m^{wt}_{\bullet h} - \frac{1}{n_{\bullet h}}\sum_{g=1}^G n_{g,h}\, m^{wt}_{g\bullet} + \frac{1}{n}\sum_{g=1}^G\sum_{h_2=1}^H n_{g,h_2}\, m_{g,h_2}, \qquad h = 1, \ldots, H, \]
which, with unbalanced data, are not equal to $m^{wt}_{\bullet h}$.

End of skipped part
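Note (numerical illustration). The one-way Tukey HSD bounds $\overline{Y}_g - \overline{Y}_h \pm q_{G,n-G}(1-\alpha)\sqrt{\frac{1}{2}\bigl(\frac{1}{n_g} + \frac{1}{n_h}\bigr)MS_e}$ can be computed directly from the group samples. A minimal Python sketch follows (the helper name `tukey_hsd_ci` and the simulated data are ours; it assumes `scipy.stats.studentized_range`, SciPy ≥ 1.7; in R one would simply use `TukeyHSD(aov(...))`):

```python
import numpy as np
from scipy import stats

def tukey_hsd_ci(groups, alpha=0.05):
    """Tukey simultaneous CIs for all pairwise differences of group means.

    groups: list of 1-D arrays (possibly unbalanced).  Returns a dict
    {(g, h): (lower, upper)} for g < h; conservative if unbalanced."""
    G = len(groups)
    n = sum(len(y) for y in groups)
    means = [np.mean(y) for y in groups]
    # residual sum of squares of the one-way ANOVA model, nu_e = n - G
    sse = sum(np.sum((y - mu) ** 2) for y, mu in zip(groups, means))
    mse, nu_e = sse / (n - G), n - G
    q = stats.studentized_range.ppf(1 - alpha, G, nu_e)
    ci = {}
    for g in range(G):
        for h in range(g + 1, G):
            d = means[g] - means[h]
            half = q * np.sqrt(0.5 * (1 / len(groups[g]) + 1 / len(groups[h])) * mse)
            ci[(g, h)] = (d - half, d + half)
    return ci

rng = np.random.default_rng(1)
groups = [rng.normal(mu, 1.0, size=12) for mu in (0.0, 0.0, 2.0)]
bands = tukey_hsd_ci(groups)
```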
12.4 Hothorn-Bretz-Westfall procedure

Start of Lecture #23 (17/12/2015)

The multiple comparison procedure presented in this section is applicable in any parametric model where the parameter estimators follow either exactly (as in the case of a normal linear model) or at least asymptotically a (multivariate) normal or t-distribution. In full generality, it was published only rather recently (Hothorn et al., 2008, 2011); nevertheless, the principal ideas behind the method are some decades older.

12.4.1 Max-abs-t distribution

Definition 12.6 (Max-abs-t-distribution). Let $T = (T_1, \ldots, T_m)^\top \sim \mathrm{mvt}_{m,\nu}(\Sigma)$, where $\Sigma$ is a positive semidefinite matrix. The distribution of the random variable
\[ H = \max_{j=1,\ldots,m}|T_j| \]
will be called the max-abs-t-distribution of dimension $m$ with $\nu$ degrees of freedom and scale matrix $\Sigma$, and will be denoted $h_{m,\nu}(\Sigma)$.

Notation.
• For $0 < p < 1$, the $p\,100\%$ quantile of the distribution $h_{m,\nu}(\Sigma)$ will be denoted $h_{m,\nu}(p; \Sigma)$. That is, $h_{m,\nu}(p; \Sigma)$ is the number satisfying
\[ P\Bigl(\max_{j=1,\ldots,m}|T_j| \le h_{m,\nu}(p; \Sigma)\Bigr) = p. \]
• The distribution function of a random variable with distribution $h_{m,\nu}(\Sigma)$ will be denoted $\mathrm{CDF}_{h,m,\nu}(\cdot; \Sigma)$.

Notes.
• If the scale matrix $\Sigma$ is positive definite (invertible), the random vector $T \sim \mathrm{mvt}_{m,\nu}(\Sigma)$ has a density w.r.t. the Lebesgue measure:
\[ f_T(t) = \frac{\Gamma\bigl(\frac{\nu+m}{2}\bigr)}{\Gamma\bigl(\frac{\nu}{2}\bigr)\,\nu^{\frac{m}{2}}\,\pi^{\frac{m}{2}}\,\bigl|\Sigma\bigr|^{\frac{1}{2}}}\Bigl(1 + \frac{t^\top\Sigma^{-1}t}{\nu}\Bigr)^{-\frac{\nu+m}{2}}, \qquad t \in \mathbb{R}^m. \]
• The distribution function $\mathrm{CDF}_{h,m,\nu}(\cdot; \Sigma)$ of a random variable $H = \max_{j=1,\ldots,m}|T_j|$ is then (for $h > 0$):
\[ \mathrm{CDF}_{h,m,\nu}(h; \Sigma) = P\Bigl(\max_{j=1,\ldots,m}|T_j| \le h\Bigr) = P\bigl(\forall\, j = 1, \ldots, m:\ |T_j| \le h\bigr) = \int_{-h}^{h}\cdots\int_{-h}^{h} f_T(t_1, \ldots, t_m)\,dt_1\cdots dt_m. \]
• That is, calculating the CDF of a random variable with the max-abs-t distribution requires integration of the density of a multivariate t-distribution.
• Computationally efficient methods were not available until the 1990s.
• Nowadays, see, e.g., Genz and Bretz (2009) and the R packages mvtnorm or mnormt.
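Note (numerical illustration). As noted above, evaluating $\mathrm{CDF}_{h,m,\nu}(\cdot; \Sigma)$ amounts to a multivariate t integral. A crude alternative to the deterministic algorithms of Genz and Bretz (2009) is plain Monte Carlo based on the scale-mixture representation $T = Z/\sqrt{W/\nu}$ with $Z \sim N_m(0, \Sigma)$ and $W \sim \chi^2_\nu$ independent. A minimal Python sketch (the helper name `max_abs_t_quantile` is ours, not from any package):

```python
import numpy as np

def max_abs_t_quantile(scale, df, p, n_draws=200_000, seed=0):
    """Monte Carlo approximation of the quantile h_{m,df}(p; scale).

    Simulates T = Z / sqrt(W/df), Z ~ N_m(0, scale), W ~ chi^2_df,
    and returns the empirical p-quantile of H = max_j |T_j|."""
    rng = np.random.default_rng(seed)
    m = scale.shape[0]
    z = rng.multivariate_normal(np.zeros(m), scale, size=n_draws)
    w = rng.chisquare(df, size=n_draws)
    h = np.max(np.abs(z / np.sqrt(w / df)[:, None]), axis=1)
    return np.quantile(h, p)

# Sanity check: for m = 1 and unit scale, h_{1,df}(1 - alpha) is the
# two-sided Student t quantile t_df(1 - alpha/2).
h1 = max_abs_t_quantile(np.eye(1), df=30, p=0.95)
```

With a singular scale matrix $\Sigma$, the same simulation still works as long as $Z$ is drawn from the corresponding degenerate normal distribution.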
• Calculation of $\mathrm{CDF}_{h,m,\nu}(\cdot; \Sigma)$ is also possible with a singular scale matrix $\Sigma$.

12.4.2 General multiple comparison procedure for a linear model

Assumptions. In the following, we consider a normal linear model
\[ Y \mid X \sim N_n\bigl(X\beta,\ \sigma^2 I_n\bigr), \qquad \mathrm{rank}\bigl(X_{n \times k}\bigr) = r \le k. \]
Further, let
\[ L_{m \times k} = \begin{pmatrix} l_1^\top \\ \vdots \\ l_m^\top \end{pmatrix} \]
be a matrix such that
\[ \theta = L\beta = \bigl(l_1^\top\beta, \ldots, l_m^\top\beta\bigr)^\top = \bigl(\theta_1, \ldots, \theta_m\bigr)^\top \]
is an estimable vector parameter with $l_1 \ne 0_k, \ldots, l_m \ne 0_k$.

Notes.
• The number $m$ of the estimable parameters of interest may be arbitrary, i.e., even greater than $r$ or $k$.
• The rows of the matrix $L$ may be linearly dependent vectors.

Multiple comparison problem. The multiple comparison procedure that will be developed aims at providing simultaneous inference on the $m$ estimable parameters $\theta_1, \ldots, \theta_m$, with the multiple testing problem composed of $m$ elementary hypotheses
\[ H_j: \theta_j = \theta^0_j, \qquad j = 1, \ldots, m, \]
for some $\theta^0 = \bigl(\theta^0_1, \ldots, \theta^0_m\bigr)^\top \in \mathbb{R}^m$. The global null hypothesis is, as usual, $H_0: \theta = \theta^0$.

Notation. In the following, the following (standard) notation will be used:
• $b = (X^\top X)^- X^\top Y$ (any solution to the normal equations $X^\top X b = X^\top Y$);
• $\widehat{\theta} = Lb = \bigl(l_1^\top b, \ldots, l_m^\top b\bigr)^\top = \bigl(\widehat{\theta}_1, \ldots, \widehat{\theta}_m\bigr)^\top$: the LSE of $\theta$;
• $V = L\,(X^\top X)^-\,L^\top = \bigl(v_{j,l}\bigr)_{j,l=1,\ldots,m}$ (which does not depend on the choice of the pseudoinverse $(X^\top X)^-$);
• $D = \mathrm{diag}\Bigl(\frac{1}{\sqrt{v_{1,1}}}, \ldots, \frac{1}{\sqrt{v_{m,m}}}\Bigr)$;
• $MS_e$: the residual mean square of the model, with $\nu_e = n - r$ degrees of freedom.

Reminders from Chapter 3
• For $j = 1, \ldots, m$ (both conditionally given $X$ and unconditionally as well):
\[ Z_j := \frac{\widehat{\theta}_j - \theta_j}{\sqrt{\sigma^2\, v_{j,j}}} \sim N(0, 1), \qquad T_j := \frac{\widehat{\theta}_j - \theta_j}{\sqrt{MS_e\, v_{j,j}}} \sim t_{n-r}. \]
• Further (conditionally given $X$):
\[ Z = (Z_1, \ldots, Z_m)^\top = \frac{1}{\sqrt{\sigma^2}}\,D\bigl(\widehat{\theta} - \theta\bigr) \sim N_m\bigl(0_m,\ DVD\bigr), \qquad T = (T_1, \ldots, T_m)^\top = \frac{1}{\sqrt{MS_e}}\,D\bigl(\widehat{\theta} - \theta\bigr) \sim \mathrm{mvt}_{m,n-r}\bigl(DVD\bigr). \]

Notes.
• The matrices $V$ and $DVD$ are not necessarily invertible.
• If $\mathrm{rank}(L) = m \le r$, then both matrices $V$ and $DVD$ are invertible and Theorem 3.1 further provides (both conditionally given $X$ and unconditionally as well) that under $H_0: \theta = \theta^0$:
\[ Q_0 = \frac{1}{m}\bigl(\widehat{\theta} - \theta^0\bigr)^\top\bigl(MS_e\, V\bigr)^{-1}\bigl(\widehat{\theta} - \theta^0\bigr) = \frac{1}{m}\,T^\top\bigl(DVD\bigr)^{-1}T \sim F_{m,n-r}. \]
This was used to test the global null hypothesis $H_0: \theta = \theta^0$ and to derive the elliptical confidence sets for $\theta$.
• It can also be shown that if $m_0 = \mathrm{rank}(L)$, then under $H_0: \theta = \theta^0$:
\[ Q_0 = \frac{1}{m_0}\bigl(\widehat{\theta} - \theta^0\bigr)^\top\bigl(MS_e\, V\bigr)^{+}\bigl(\widehat{\theta} - \theta^0\bigr) = \frac{1}{m_0}\,T^\top\bigl(DVD\bigr)^{+}T \sim F_{m_0,n-r} \]
(both conditionally given $X$ and unconditionally), where the symbol $+$ denotes the Moore–Penrose pseudoinverse.

Some derivations

Let, for $\theta^0_j \in \mathbb{R}$, $j = 1, \ldots, m$,
\[ T_j(\theta^0_j) = \frac{\widehat{\theta}_j - \theta^0_j}{\sqrt{MS_e\, v_{j,j}}}, \qquad j = 1, \ldots, m. \]
Then, under $H_0: \theta = \theta^0$:
\[ T(\theta^0) := \bigl(T_1(\theta^0_1), \ldots, T_m(\theta^0_m)\bigr)^\top \sim \mathrm{mvt}_{m,n-r}\bigl(DVD\bigr). \]
We then have, for $0 < \alpha < 1$:
\begin{align}
1 - \alpha &= P\Bigl(\max_{j=1,\ldots,m}\bigl|T_j(\theta^0_j)\bigr| < h_{m,n-r}(1-\alpha; DVD);\ \theta = \theta^0\Bigr) \nonumber\\
&= P\Bigl(\text{for all } j = 1, \ldots, m:\ \bigl|T_j(\theta^0_j)\bigr| < h_{m,n-r}(1-\alpha; DVD);\ \theta = \theta^0\Bigr) \nonumber\\
&= P\Biggl(\text{for all } j = 1, \ldots, m:\ \frac{\bigl|\widehat{\theta}_j - \theta^0_j\bigr|}{\sqrt{MS_e\, v_{j,j}}} < h_{m,n-r}(1-\alpha; DVD);\ \theta = \theta^0\Biggr) \nonumber\\
&= P\Bigl(\text{for all } j = 1, \ldots, m:\ \bigl(\theta^{HL}_j(\alpha), \theta^{HU}_j(\alpha)\bigr) \ni \theta^0_j;\ \theta = \theta^0\Bigr), \tag{12.9}
\end{align}
where
\[ \theta^{HL}_j(\alpha) = \widehat{\theta}_j - h_{m,n-r}(1-\alpha; DVD)\sqrt{MS_e\, v_{j,j}}, \qquad \theta^{HU}_j(\alpha) = \widehat{\theta}_j + h_{m,n-r}(1-\alpha; DVD)\sqrt{MS_e\, v_{j,j}}, \qquad j = 1, \ldots, m. \tag{12.10} \]

Theorem 12.5 (Hothorn-Bretz-Westfall MCP for linear hypotheses in a normal linear model). Random intervals given by (12.10) are simultaneous confidence intervals for the parameters $\theta_j = l_j^\top\beta$, $j = 1, \ldots, m$, with an exact coverage of $1 - \alpha$, i.e., for any $\theta^0 = (\theta^0_1, \ldots, \theta^0_m)^\top \in \mathbb{R}^m$,
\[ P\Bigl(\text{for all } j = 1, \ldots, m:\ \bigl(\theta^{HL}_j(\alpha), \theta^{HU}_j(\alpha)\bigr) \ni \theta^0_j;\ \theta = \theta^0\Bigr) = 1 - \alpha. \]
Related P-values for a multiple testing problem with elementary hypotheses $H_j: \theta_j = \theta^0_j$, $\theta^0_j \in \mathbb{R}$, $j = 1, \ldots, m$, adjusted for multiple comparison, are given by
\[ p^H_j = 1 - \mathrm{CDF}_{h,m,n-r}\bigl(|t^0_j|; DVD\bigr), \qquad j = 1, \ldots, m, \]
where $t^0_j$ is the value of $T_j(\theta^0_j) = \frac{\widehat{\theta}_j - \theta^0_j}{\sqrt{MS_e\, v_{j,j}}}$ attained with the given data.

Proof. The fact that $\bigl(\theta^{HL}_j(\alpha), \theta^{HU}_j(\alpha)\bigr)$, $j = 1, \ldots, m$, are simultaneous confidence intervals for the parameters $\theta_j = l_j^\top\beta$ with an exact coverage of $1 - \alpha$ follows from (12.9).

Calculation of the P-values adjusted for multiple comparison related to the multiple testing problem with the elementary hypotheses $H_j: \theta_j = \theta^0_j$, $j = 1, \ldots, m$, follows from noting the following (for each $j = 1, \ldots, m$):
\[ \bigl(\theta^{HL}_j(\alpha), \theta^{HU}_j(\alpha)\bigr) \not\ni \theta^0_j \iff \bigl|T_j(\theta^0_j)\bigr| \ge h_{m,n-r}(1-\alpha; DVD). \]
It now follows from the monotonicity of the quantiles of the continuous max-abs-t-distribution that
\[ p^H_j = \inf\bigl\{\alpha:\ \bigl(\theta^{HL}_j(\alpha), \theta^{HU}_j(\alpha)\bigr) \not\ni \theta^0_j\bigr\} = \inf\bigl\{\alpha:\ \bigl|T_j(\theta^0_j)\bigr| \ge h_{m,n-r}(1-\alpha; DVD)\bigr\} \]
is attained for $p^H_j$ satisfying $\bigl|T_j(\theta^0_j)\bigr| = h_{m,n-r}\bigl(1 - p^H_j; DVD\bigr)$. That is, if $t^0_j$ is the value of the statistic $T_j(\theta^0_j)$ attained with the given data, we have $p^H_j = 1 - \mathrm{CDF}_{h,m,n-r}\bigl(|t^0_j|; DVD\bigr)$. □

Note (Hothorn-Bretz-Westfall MCP in the R software). In R, the Hothorn-Bretz-Westfall MCP for linear hypotheses on parameters of (generalized) linear models is implemented in the package multcomp. After fitting a model (by the function lm), it is necessary to call sequentially the following functions:
(i) glht. One of its arguments specifies the linear hypothesis of interest (specification of the $L$ matrix). Note that for some common hypotheses, certain keywords can be used. For example, pairwise comparison of all group means in the context of ANOVA models is achieved by specifying the keyword "Tukey". Nevertheless, note that the invoked MCP is still that of Hothorn-Bretz-Westfall and is not based on Tukey's procedure. The "Tukey" keyword only specifies what should be compared, not how it should be compared.
(ii) summary (applied to an object of class glht) provides P-values adjusted for multiple comparison.
(iii) confint (applied to an object of class glht) provides simultaneous confidence intervals, which, among other things, requires calculation of the critical value $h_{m,n-r}(1-\alpha)$; that value is also available in the output.

Note that both the calculation of the P-values adjusted for multiple comparison and the calculation of the critical value $h_{m,n-r}(1-\alpha)$ needed for the simultaneous confidence intervals require evaluation of a multivariate t integral. This is calculated by Monte Carlo integration (i.e., based on a certain stochastic simulation), and hence the results differ slightly when repeatedly calculated on different occasions. Setting a seed of the random number generator (set.seed()) is hence recommended for full reproducibility of the results.

12.5 Confidence band for the regression function

In this section, we shall assume that the data are represented by i.i.d. random vectors $(Y_i, Z_i)$, $i = 1, \ldots, n$, sampled from the distribution of a generic random vector $(Y, Z) \in \mathbb{R}^{1+p}$. It is further assumed that for some known transformation $t: \mathbb{R}^p \to \mathbb{R}^k$, a normal linear model with regressors $X_i = t(Z_i)$, $i = 1, \ldots, n$, holds. That is, it is assumed that
\[ Y_i = X_i^\top\beta + \varepsilon_i, \qquad \varepsilon_i \overset{i.i.d.}{\sim} N(0, \sigma^2), \]
for some $\beta \in \mathbb{R}^k$, $\sigma^2 > 0$. The corresponding regression function is
\[ E\bigl(Y \mid X = t(z)\bigr) = E\bigl(Y \mid Z = z\bigr) = m(z) = t^\top(z)\beta, \qquad z \in \mathbb{R}^p. \]
It will further be assumed that the corresponding model matrix
\[ X = \begin{pmatrix} X_1^\top \\ \vdots \\ X_n^\top \end{pmatrix} = \begin{pmatrix} t^\top(Z_1) \\ \vdots \\ t^\top(Z_n) \end{pmatrix} \]
is of full rank (almost surely), i.e., $\mathrm{rank}\bigl(X_{n \times k}\bigr) = k$. As usual, $\widehat{\beta}$ will be the LSE of the vector $\beta$ and $MS_e$ the residual mean square.

Reminder from Section 3.3

Let $z \in \mathbb{R}^p$ be given. Theorem 3.2 then states that a random interval with the lower and upper bounds
\[ t^\top(z)\widehat{\beta} \pm t_{n-k}\Bigl(1 - \frac{\alpha}{2}\Bigr)\sqrt{MS_e\, t^\top(z)\bigl(X^\top X\bigr)^{-1}t(z)} \]
is a confidence interval for $m(z) = t^\top(z)\beta$ with a coverage of $1 - \alpha$.
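Note (numerical illustration). The pointwise interval just recalled is easy to reproduce numerically. A minimal Python sketch for a simple line fit with $t(z) = (1, z)^\top$, $k = 2$ (all numeric choices, the data, and the helper name `pointwise_ci` are our illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, alpha = 40, 0.05
z = rng.uniform(0.0, 2.0, size=n)
X_mat = np.column_stack([np.ones(n), z])        # rows t(z_i)' = (1, z_i)
beta_true = np.array([1.0, 0.5])
y = X_mat @ beta_true + rng.normal(0.0, 0.3, size=n)

XtX_inv = np.linalg.inv(X_mat.T @ X_mat)
beta_hat = XtX_inv @ X_mat.T @ y                # LSE of beta
mse = np.sum((y - X_mat @ beta_hat) ** 2) / (n - 2)

def pointwise_ci(z0):
    """CI for m(z0) = t(z0)' beta with pointwise coverage 1 - alpha."""
    t0 = np.array([1.0, z0])
    half = stats.t.ppf(1 - alpha / 2, n - 2) * np.sqrt(mse * t0 @ XtX_inv @ t0)
    return t0 @ beta_hat - half, t0 @ beta_hat + half

lo, hi = pointwise_ci(1.0)
```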
That is, for given $z \in \mathbb{R}^p$ and any $\beta^0 \in \mathbb{R}^k$,
\[ P\Bigl(t^\top(z)\widehat{\beta} \pm t_{n-k}\bigl(1 - \tfrac{\alpha}{2}\bigr)\sqrt{MS_e\, t^\top(z)\bigl(X^\top X\bigr)^{-1}t(z)} \ni t^\top(z)\beta^0;\ \beta = \beta^0\Bigr) = 1 - \alpha. \]

Theorem 12.6 (Confidence band for the regression function). Let $(Y_i, Z_i)$, $i = 1, \ldots, n$, $Z_i \in \mathbb{R}^p$, be such that
\[ Y_i = X_i^\top\beta + \varepsilon_i, \qquad \varepsilon_i \overset{i.i.d.}{\sim} N(0, \sigma^2), \]
where $X_i = t(Z_i)$, $i = 1, \ldots, n$, for a known transformation $t: \mathbb{R}^p \to \mathbb{R}^k$, and $\beta \in \mathbb{R}^k$ and $\sigma^2 > 0$ are unknown parameters. Let $t(z) \ne 0_k$ for all $z \in \mathbb{R}^p$ and let $\mathrm{rank}\bigl(X_{n \times k}\bigr) = k$, where $X$ is the matrix with vectors $X_1, \ldots, X_n$ in rows. Then for any $\beta^0 \in \mathbb{R}^k$,
\[ P\Bigl(\text{for all } z \in \mathbb{R}^p:\ t^\top(z)\widehat{\beta} \pm \sqrt{k\, F_{k,n-k}(1-\alpha)}\,\sqrt{MS_e\, t^\top(z)\bigl(X^\top X\bigr)^{-1}t(z)} \ni t^\top(z)\beta^0;\ \beta = \beta^0\Bigr) = 1 - \alpha. \]

Note. The requirement $t(z) \ne 0_k$ for all $z \in \mathbb{R}^p$ is not too restrictive from a practical point of view, as it is satisfied, e.g., by all linear models with intercept.

Proof. Let (for $0 < \alpha < 1$)
\[ K = \Bigl\{\beta \in \mathbb{R}^k:\ \bigl(\beta - \widehat{\beta}\bigr)^\top X^\top X\bigl(\beta - \widehat{\beta}\bigr) \le k\, MS_e\, F_{k,n-k}(1-\alpha)\Bigr\}. \]
By Section 3.2, $K$ is a confidence ellipsoid for $\beta$ with a coverage of $1 - \alpha$, that is, for any $\beta^0 \in \mathbb{R}^k$,
\[ P\bigl(K \ni \beta^0;\ \beta = \beta^0\bigr) = 1 - \alpha. \]
$K$ is an ellipsoid in $\mathbb{R}^k$, that is, a bounded, convex and (with our definition) also closed subset of $\mathbb{R}^k$. Let, for $z \in \mathbb{R}^p$,
\[ L(z) = \inf_{\beta \in K} t^\top(z)\beta, \qquad U(z) = \sup_{\beta \in K} t^\top(z)\beta. \]
By construction,
\[ \beta \in K \;\Rightarrow\; \forall\, z \in \mathbb{R}^p:\ L(z) \le t^\top(z)\beta \le U(z). \]
Due to the fact that $K$ is bounded, convex and closed, we also have
\[ \bigl(\forall\, z \in \mathbb{R}^p:\ L(z) \le t^\top(z)\beta \le U(z)\bigr) \;\Rightarrow\; \beta \in K. \]
That is,
\[ \beta \in K \iff \forall\, z \in \mathbb{R}^p:\ L(z) \le t^\top(z)\beta \le U(z), \]
and hence, for any $\beta^0 \in \mathbb{R}^k$,
\[ 1 - \alpha = P\bigl(K \ni \beta^0;\ \beta = \beta^0\bigr) = P\bigl(\text{for all } z \in \mathbb{R}^p:\ L(z) \le t^\top(z)\beta^0 \le U(z);\ \beta = \beta^0\bigr). \tag{12.11} \]
Further, since $t^\top(z)\beta$ is a linear function of $\beta$ and $K$ is bounded, convex and closed, we have
\[ L(z) = \inf_{\beta \in K} t^\top(z)\beta = \min_{\beta \in K} t^\top(z)\beta, \qquad U(z) = \sup_{\beta \in K} t^\top(z)\beta = \max_{\beta \in K} t^\top(z)\beta, \]
and both extremes must lie on the boundary of $K$, that is, both extremes are attained for $\beta$ satisfying
\[ \bigl(\beta - \widehat{\beta}\bigr)^\top X^\top X\bigl(\beta - \widehat{\beta}\bigr) = k\, MS_e\, F_{k,n-k}(1-\alpha). \]
Method of Lagrange multipliers:
\[ \varphi(\beta, \lambda) = t^\top(z)\beta + \frac{\lambda}{2}\Bigl\{\bigl(\beta - \widehat{\beta}\bigr)^\top X^\top X\bigl(\beta - \widehat{\beta}\bigr) - k\, MS_e\, F_{k,n-k}(1-\alpha)\Bigr\} \]
(the factor $\frac{1}{2}$ is only included to simplify subsequent expressions). Derivatives of $\varphi$:
\[ \frac{\partial\varphi}{\partial\beta}(\beta, \lambda) = t(z) + \lambda\, X^\top X\bigl(\beta - \widehat{\beta}\bigr), \qquad \frac{\partial\varphi}{\partial\lambda}(\beta, \lambda) = \frac{1}{2}\Bigl\{\bigl(\beta - \widehat{\beta}\bigr)^\top X^\top X\bigl(\beta - \widehat{\beta}\bigr) - k\, MS_e\, F_{k,n-k}(1-\alpha)\Bigr\}. \]
With given $\lambda$, the first set of equations is solved (with respect to $\beta$) by
\[ \beta(\lambda) = \widehat{\beta} - \frac{1}{\lambda}\bigl(X^\top X\bigr)^{-1}t(z). \]
Using $\beta(\lambda)$ in the second equation:
\[ \frac{1}{\lambda^2}\, t^\top(z)\bigl(X^\top X\bigr)^{-1}X^\top X\bigl(X^\top X\bigr)^{-1}t(z) = k\, MS_e\, F_{k,n-k}(1-\alpha), \qquad \lambda = \pm\sqrt{\frac{t^\top(z)\bigl(X^\top X\bigr)^{-1}t(z)}{k\, MS_e\, F_{k,n-k}(1-\alpha)}}. \]
Hence, the $\beta$ which minimizes/maximizes $t^\top(z)\beta$ subject to
\[ \bigl(\beta - \widehat{\beta}\bigr)^\top X^\top X\bigl(\beta - \widehat{\beta}\bigr) = k\, MS_e\, F_{k,n-k}(1-\alpha) \]
is given as
\[ \beta^{min} = \widehat{\beta} - \sqrt{\frac{k\, MS_e\, F_{k,n-k}(1-\alpha)}{t^\top(z)\bigl(X^\top X\bigr)^{-1}t(z)}}\,\bigl(X^\top X\bigr)^{-1}t(z), \qquad \beta^{max} = \widehat{\beta} + \sqrt{\frac{k\, MS_e\, F_{k,n-k}(1-\alpha)}{t^\top(z)\bigl(X^\top X\bigr)^{-1}t(z)}}\,\bigl(X^\top X\bigr)^{-1}t(z). \]
Note that, with our assumption $t(z) \ne 0_k$, we never divide by zero, since $\bigl(X^\top X\bigr)^{-1}$ is a positive definite matrix. That is,
\[ L(z) = t^\top(z)\beta^{min} = t^\top(z)\widehat{\beta} - \sqrt{MS_e\, t^\top(z)\bigl(X^\top X\bigr)^{-1}t(z)}\,\sqrt{k\, F_{k,n-k}(1-\alpha)}, \]
\[ U(z) = t^\top(z)\beta^{max} = t^\top(z)\widehat{\beta} + \sqrt{MS_e\, t^\top(z)\bigl(X^\top X\bigr)^{-1}t(z)}\,\sqrt{k\, F_{k,n-k}(1-\alpha)}. \]
The proof is finalized by looking back at expression (12.11) and realizing that, due to continuity,
\begin{align*}
1 - \alpha &= P\bigl(\text{for all } z \in \mathbb{R}^p:\ L(z) \le t^\top(z)\beta^0 \le U(z);\ \beta = \beta^0\bigr) \\
&= P\bigl(\text{for all } z \in \mathbb{R}^p:\ L(z) < t^\top(z)\beta^0 < U(z);\ \beta = \beta^0\bigr) \\
&= P\Bigl(\text{for all } z \in \mathbb{R}^p:\ t^\top(z)\widehat{\beta} \pm \sqrt{k\, F_{k,n-k}(1-\alpha)}\,\sqrt{MS_e\, t^\top(z)\bigl(X^\top X\bigr)^{-1}t(z)} \ni t^\top(z)\beta^0;\ \beta = \beta^0\Bigr). \quad \square
\end{align*}

Terminology (Confidence band for the regression function). If the covariates $Z_1, \ldots, Z_n \in \mathbb{R}$, confidence intervals according to Theorem 12.6 are often calculated for an (equidistant) sequence of values $z_1, \ldots, z_N \in \mathbb{R}$ and then plotted together with the fitted regression function $\widehat{m}(z) = t^\top(z)\widehat{\beta}$, $z \in \mathbb{R}$. The band obtained in this way is called the confidence band for the regression function (in Czech: pás spolehlivosti pro regresní funkci), as it covers jointly all true values of the regression function with a given probability of $1 - \alpha$.

Note (Confidence band for and around the regression function). For given $z \in \mathbb{R}$:
• The half width of the confidence band FOR the regression function (overall coverage) is
\[ \sqrt{k\, F_{k,n-k}(1-\alpha)}\,\sqrt{MS_e\, t^\top(z)\bigl(X^\top X\bigr)^{-1}t(z)}. \]
• The half width of the confidence band AROUND the regression function (pointwise coverage) is
\[ t_{n-k}\Bigl(1 - \frac{\alpha}{2}\Bigr)\sqrt{MS_e\, t^\top(z)\bigl(X^\top X\bigr)^{-1}t(z)} = \sqrt{F_{1,n-k}(1-\alpha)}\,\sqrt{MS_e\, t^\top(z)\bigl(X^\top X\bigr)^{-1}t(z)}, \]
since for any $\nu > 0$, $t^2_\nu\bigl(1 - \frac{\alpha}{2}\bigr) = F_{1,\nu}(1-\alpha)$.
For $k \ge 2$ and any $\nu > 0$, $k\, F_{k,\nu}(1-\alpha) > F_{1,\nu}(1-\alpha)$, and hence the confidence band for the regression function is indeed wider than the confidence band around the regression function. Their widths are the same only if $k = 1$.

Chapter 13 Asymptotic Properties of the LSE and Sandwich Estimator

13.1 Assumptions and setup

Assumption (A0).
(i) Let $(Y_1, X_1), (Y_2, X_2), \ldots$ be a sequence of $(1+k)$-dimensional independent and identically distributed (i.i.d.) random vectors, distributed as a generic random vector $(Y, X)$, where $X = (X_0, X_1, \ldots, X_{k-1})^\top$ and $X_i = (X_{i,0}, X_{i,1}, \ldots, X_{i,k-1})^\top$, $i = 1, 2, \ldots$;
(ii) Let $\beta = (\beta_0, \ldots, \beta_{k-1})^\top$ be an unknown $k$-dimensional real parameter;
(iii) Let $E\bigl(Y \mid X\bigr) = X^\top\beta$.

Notation (Error terms). We denote $\varepsilon = Y - X^\top\beta$, $\varepsilon_i = Y_i - X_i^\top\beta$, $i = 1, 2, \ldots$.

Convention. In this chapter, all unconditional expectations are understood as expectations with respect to the joint distribution of the random vector $(Y, X)$ (which depends on the vector $\beta$).

Note. From assumption (A0), the error terms $\varepsilon_1, \varepsilon_2, \ldots$ are i.i.d. with the distribution of a generic error term $\varepsilon$. The following can be concluded about their first two (conditional) moments:
\[ E\bigl(\varepsilon \mid X\bigr) = E\bigl(Y - X^\top\beta \mid X\bigr) = 0, \qquad \mathrm{var}\bigl(\varepsilon \mid X\bigr) = \mathrm{var}\bigl(Y - X^\top\beta \mid X\bigr) = \mathrm{var}\bigl(Y \mid X\bigr) =: \sigma^2(X), \]
\[
\mathsf{E}\,\varepsilon = \mathsf{E}\big\{\mathsf{E}(\varepsilon \mid X)\big\} = \mathsf{E}\,0 = 0, \qquad
\mathrm{var}\,\varepsilon = \mathrm{var}\big\{\mathsf{E}(\varepsilon \mid X)\big\} + \mathsf{E}\big\{\mathrm{var}(\varepsilon \mid X)\big\} = \mathrm{var}\,0 + \mathsf{E}\,\sigma^2(X) = \mathsf{E}\,\sigma^2(X).
\]

Assumption (A1). Let the covariate random vector $X = (X_0, \ldots, X_{k-1})^\top$ satisfy
(i) $\mathsf{E}|X_j X_l| < \infty$, $j, l = 0, \ldots, k-1$;
(ii) $\mathsf{E}\big(XX^\top\big) = \mathbb{W}$, where $\mathbb{W}$ is a positive definite matrix.

Notation (Covariates second and first mixed moments). Let $\mathbb{W} = \big(w_{j,l}\big)_{j,l=0,\ldots,k-1}$. We have
\[
w_j^2 := w_{j,j} = \mathsf{E}\,X_j^2, \quad j = 0, \ldots, k-1, \qquad
w_{j,l} = \mathsf{E}\,X_j X_l, \quad j \neq l.
\]
Let $\mathbb{V} := \mathbb{W}^{-1} = \big(v_{j,l}\big)_{j,l=0,\ldots,k-1}$.

Notation (Data of size n). For $n \geq 1$:
\[
\mathbf{Y}_n := \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \qquad
\mathbf{X}_n := \begin{pmatrix} X_1^\top \\ \vdots \\ X_n^\top \end{pmatrix}, \qquad
\mathbb{W}_n := \mathbf{X}_n^\top\mathbf{X}_n = \sum_{i=1}^n X_i X_i^\top, \qquad
\mathbb{V}_n := \big(\mathbf{X}_n^\top\mathbf{X}_n\big)^{-1} \ \text{(if it exists)}.
\]

Lemma 13.1 (Consistent estimator of the second and first mixed moments of the covariates). Let assumptions (A0) and (A1) hold. Then
\[
\frac{1}{n}\,\mathbb{W}_n \xrightarrow{\text{a.s.}} \mathbb{W} \quad\text{and}\quad n\,\mathbb{V}_n \xrightarrow{\text{a.s.}} \mathbb{V} \qquad \text{as } n \to \infty.
\]
Proof. The statement of the Lemma follows from applying, for each $j = 0, \ldots, k-1$ and $l = 0, \ldots, k-1$, the strong law of large numbers for i.i.d. random variables (Theorem C.2) to the sequence $Z_{i,j,l} = X_{i,j} X_{i,l}$, $i = 1, 2, \ldots$.

[End of Lecture #23 (17/12/2015); Start of Lecture #24 (07/01/2016)]

LSE based on data of size n

Since $\frac{1}{n}\mathbf{X}_n^\top\mathbf{X}_n \xrightarrow{\text{a.s.}} \mathbb{W}$, which is positive definite, we have
\[
P\big(\text{there exists } n_0 > k \text{ such that for all } n \geq n_0:\ \mathrm{rank}(\mathbf{X}_n) = k\big) = 1,
\]
and we define (for $n \geq n_0$)
\[
\widehat{\beta}_n = \big(\mathbf{X}_n^\top\mathbf{X}_n\big)^{-1}\mathbf{X}_n^\top\mathbf{Y}_n = \Big(\sum_{i=1}^n X_iX_i^\top\Big)^{-1}\sum_{i=1}^n X_iY_i, \qquad
\mathrm{MS}_{e,n} = \frac{1}{n-k}\,\big\|\mathbf{Y}_n - \mathbf{X}_n\widehat{\beta}_n\big\|^2 = \frac{1}{n-k}\sum_{i=1}^n \big(Y_i - X_i^\top\widehat{\beta}_n\big)^2,
\]
which are the LSE of $\beta$ and the residual mean square based on the assumed linear model for data of size $n$,
\[
\mathcal{M}_n:\ \mathbf{Y}_n \mid \mathbf{X}_n \sim \big(\mathbf{X}_n\beta,\; \sigma^2\mathbf{I}_n\big).
\]
Further, for $n \geq n_0$ any non-trivial linear combination of the regression coefficients is an estimable parameter of model $\mathcal{M}_n$.
• For a given real vector $l = (l_0, l_1, \ldots, l_{k-1})^\top \neq 0_k$ we denote $\theta = l^\top\beta$, $\widehat{\theta}_n = l^\top\widehat{\beta}_n$.
• For a given $m \times k$ matrix $\mathbf{L}$ with rows $l_1^\top \neq 0_k^\top, \ldots, l_m^\top \neq 0_k^\top$ we denote $\xi = \mathbf{L}\beta$, $\widehat{\xi}_n = \mathbf{L}\widehat{\beta}_n$. It will be assumed that $m \leq k$ and that the rows of $\mathbf{L}$ are linearly independent.

We are interested in the asymptotic behavior (as $n \to \infty$) of
(i) $\widehat{\beta}_n$;
(ii) $\mathrm{MS}_{e,n}$;
(iii) $\widehat{\theta}_n = l^\top\widehat{\beta}_n$ for given $l \neq 0_k$;
(iv) $\widehat{\xi}_n = \mathbf{L}\widehat{\beta}_n$ for a given $m \times k$ matrix $\mathbf{L}$ with linearly independent rows;
under two different scenarios (two different truths):
(i) homoscedastic errors (i.e., model $\mathcal{M}_n:\ \mathbf{Y}_n \mid \mathbf{X}_n \sim (\mathbf{X}_n\beta, \sigma^2\mathbf{I}_n)$ is correct);
(ii) heteroscedastic errors, where $\mathrm{var}(\varepsilon \mid X)$ is not necessarily constant and possibly depends on the covariate values $X$ (i.e., model $\mathcal{M}_n$ is not necessarily fully correct).
Normality of the errors will not be assumed.

Assumption (A2 homoscedastic). Let the conditional variance of the response satisfy $\sigma^2(X) := \mathrm{var}(Y \mid X) = \sigma^2$, where $0 < \sigma^2 < \infty$ is an unknown parameter.

Assumption (A2 heteroscedastic). Let $\sigma^2(X) := \mathrm{var}(Y \mid X)$ satisfy, for each $j, l = 0, \ldots, k-1$, the condition $\mathsf{E}\big|\sigma^2(X)\,X_jX_l\big| < \infty$.

Notes.
• Condition (A2 heteroscedastic) states that the matrix $\mathbb{W}^F := \mathsf{E}\big\{\sigma^2(X)\,XX^\top\big\}$ is a real matrix (with all elements being finite).
• If (A0) and (A1) are assumed, then (A2 homoscedastic) $\Longrightarrow$ (A2 heteroscedastic). Hence everything that will be proved under (A2 heteroscedastic) holds also under (A2 homoscedastic).
• Under assumptions (A0) and (A2 homoscedastic), we have $\mathsf{E}(Y_i \mid X_i) = X_i^\top\beta$, $\mathrm{var}(Y_i \mid X_i) = \mathrm{var}(\varepsilon_i \mid X_i) = \sigma^2$, $i = 1, 2, \ldots$, and for each $n > 1$, $Y_1, \ldots, Y_n$ are, given $\mathbf{X}_n$, independent and satisfy a linear model $\mathbf{Y}_n \mid \mathbf{X}_n \sim (\mathbf{X}_n\beta, \sigma^2\mathbf{I}_n)$.
• Under assumptions (A0) and (A2 heteroscedastic), we have $\mathsf{E}(Y_i \mid X_i) = X_i^\top\beta$, $\mathrm{var}(Y_i \mid X_i) = \mathrm{var}(\varepsilon_i \mid X_i) = \sigma^2(X_i)$, $i = 1, 2, \ldots$, and for each $n > 1$, $Y_1, \ldots, Y_n$ are, given $\mathbf{X}_n$, independent with
\[
\mathsf{E}\big(\mathbf{Y}_n \mid \mathbf{X}_n\big) = \mathbf{X}_n\beta, \qquad
\mathrm{var}\big(\mathbf{Y}_n \mid \mathbf{X}_n\big) = \mathrm{diag}\big(\sigma^2(X_1), \ldots, \sigma^2(X_n)\big).
\]

13.2 Consistency of LSE

We shall show in this section:
(i) Strong consistency of $\widehat{\beta}_n$, $\widehat{\theta}_n$, $\widehat{\xi}_n$ (LSE's of the regression coefficients or their linear combinations).
• No need of normality;
• No need of homoscedasticity.
(ii) Strong consistency of $\mathrm{MS}_{e,n}$ (unbiased estimator of the residual variance).
• No need of normality.

Theorem 13.2 (Strong consistency of LSE). Let assumptions (A0), (A1) and (A2 heteroscedastic) hold. Then, as $n \to \infty$,
\[
\widehat{\beta}_n \xrightarrow{\text{a.s.}} \beta, \qquad
\widehat{\theta}_n = l^\top\widehat{\beta}_n \xrightarrow{\text{a.s.}} \theta = l^\top\beta, \qquad
\widehat{\xi}_n = \mathbf{L}\widehat{\beta}_n \xrightarrow{\text{a.s.}} \xi = \mathbf{L}\beta.
\]
Proof. It is sufficient to show that $\widehat{\beta}_n \xrightarrow{\text{a.s.}} \beta$; the remaining two statements follow from properties of almost sure convergence. We have
\[
\widehat{\beta}_n = \big(\mathbf{X}_n^\top\mathbf{X}_n\big)^{-1}\mathbf{X}_n^\top\mathbf{Y}_n
= \underbrace{\Big(\frac{1}{n}\,\mathbf{X}_n^\top\mathbf{X}_n\Big)^{-1}}_{\mathbf{A}_n}\;\underbrace{\frac{1}{n}\,\mathbf{X}_n^\top\mathbf{Y}_n}_{\mathbf{B}_n},
\]
where $\mathbf{A}_n \xrightarrow{\text{a.s.}} \mathbb{W}^{-1}$ by Lemma 13.1. Further,
\[
\mathbf{B}_n = \frac{1}{n}\sum_{i=1}^n X_i\big(Y_i - X_i^\top\beta + X_i^\top\beta\big)
= \underbrace{\frac{1}{n}\sum_{i=1}^n X_i\varepsilon_i}_{\mathbf{C}_n} + \underbrace{\frac{1}{n}\sum_{i=1}^n X_iX_i^\top\beta}_{\mathbf{D}_n}.
\]
(a) $\mathbf{C}_n \xrightarrow{\text{a.s.}} 0_k$ due to the SLLN (i.i.d., Theorem C.2). This is justified as follows.
• The $j$th ($j = 0, \ldots, k-1$) element of the vector $\frac{1}{n}\sum_{i=1}^n X_i\varepsilon_i$ is $\frac{1}{n}\sum_{i=1}^n X_{i,j}\varepsilon_i$.
• The random variables $X_{i,j}\varepsilon_i$, $i = 1, 2, \ldots$, are i.i.d. by (A0).
• $\mathsf{E}\,X_{i,j}\varepsilon_i = \mathsf{E}\big\{\mathsf{E}(X_{i,j}\varepsilon_i \mid X_i)\big\} = \mathsf{E}\big\{X_{i,j}\,\mathsf{E}(\varepsilon_i \mid X_i)\big\} = \mathsf{E}\big(X_{i,j}\cdot 0\big) = 0$.
• $\mathrm{var}\big(X_{i,j}\varepsilon_i\big) = \mathrm{var}\big\{\mathsf{E}(X_{i,j}\varepsilon_i \mid X_i)\big\} + \mathsf{E}\big\{\mathrm{var}(X_{i,j}\varepsilon_i \mid X_i)\big\} = \mathrm{var}\,0 + \mathsf{E}\big\{X_{i,j}^2\,\mathrm{var}(\varepsilon_i \mid X_i)\big\} = \mathsf{E}\big\{X_{i,j}^2\,\sigma^2(X_i)\big\} < \infty$ by (A2 heteroscedastic), which implies $\mathsf{E}|X_{i,j}\varepsilon_i| < \infty$.
(b) $\mathbf{D}_n = \frac{1}{n}\,\mathbb{W}_n\beta \xrightarrow{\text{a.s.}} \mathbb{W}\beta$ by Lemma 13.1.
In summary, $\widehat{\beta}_n = \mathbf{A}_n\big(\mathbf{C}_n + \mathbf{D}_n\big)$, where $\mathbf{A}_n \xrightarrow{\text{a.s.}} \mathbb{W}^{-1}$, $\mathbf{C}_n \xrightarrow{\text{a.s.}} 0_k$, $\mathbf{D}_n \xrightarrow{\text{a.s.}} \mathbb{W}\beta$. Hence $\widehat{\beta}_n \xrightarrow{\text{a.s.}} \mathbb{W}^{-1}\mathbb{W}\beta = \beta$.

Theorem 13.3 (Strong consistency of the mean squared error). Let assumptions (A0), (A1), (A2 homoscedastic) hold. Then
\[
\mathrm{MS}_{e,n} \xrightarrow{\text{a.s.}} \sigma^2 \qquad \text{as } n \to \infty.
\]
Proof. We have
\[
\mathrm{MS}_{e,n} = \frac{1}{n-k}\,\mathrm{SS}_{e,n} = \frac{n}{n-k}\cdot\frac{1}{n}\sum_{i=1}^n\big(Y_i - X_i^\top\widehat{\beta}_n\big)^2.
\]
Since $\lim_{n\to\infty}\frac{n}{n-k} = 1$, it is sufficient to show that $\frac{1}{n}\sum_{i=1}^n\big(Y_i - X_i^\top\widehat{\beta}_n\big)^2 \xrightarrow{\text{a.s.}} \sigma^2$ as $n \to \infty$. We have
\[
\frac{1}{n}\sum_{i=1}^n\big(Y_i - X_i^\top\widehat{\beta}_n\big)^2
= \frac{1}{n}\sum_{i=1}^n\big(Y_i - X_i^\top\beta + X_i^\top\beta - X_i^\top\widehat{\beta}_n\big)^2
= \underbrace{\frac{1}{n}\sum_{i=1}^n\big(Y_i - X_i^\top\beta\big)^2}_{A_n}
+ \underbrace{\frac{1}{n}\sum_{i=1}^n\big\{X_i^\top\big(\beta - \widehat{\beta}_n\big)\big\}^2}_{B_n}
+ \underbrace{\frac{2}{n}\sum_{i=1}^n\big(Y_i - X_i^\top\beta\big)\,X_i^\top\big(\beta - \widehat{\beta}_n\big)}_{C_n}.
\]
(a) $A_n = \frac{1}{n}\sum_{i=1}^n\varepsilon_i^2 \xrightarrow{\text{a.s.}} \sigma^2$ due to the SLLN (i.i.d., Theorem C.2). This is justified by noting the following.
• The random variables $\varepsilon_i^2$, $i = 1, 2, \ldots$, are i.i.d. by (A0).
• $\mathsf{E}\,\varepsilon_i = 0 \Longrightarrow \mathsf{E}\,\varepsilon_i^2 = \mathrm{var}\,\varepsilon_i = \mathsf{E}\,\sigma^2(X_i) = \sigma^2 < \infty$ by assumption (A2 homoscedastic).
(b) $B_n = \big(\beta - \widehat{\beta}_n\big)^\top\Big(\frac{1}{n}\,\mathbf{X}_n^\top\mathbf{X}_n\Big)\big(\beta - \widehat{\beta}_n\big) \xrightarrow{\text{a.s.}} 0$, since $\beta - \widehat{\beta}_n \xrightarrow{\text{a.s.}} 0_k$ by Theorem 13.2 and $\frac{1}{n}\,\mathbf{X}_n^\top\mathbf{X}_n \xrightarrow{\text{a.s.}} \mathbb{W}$ by Lemma 13.1, so that $B_n \xrightarrow{\text{a.s.}} 0_k^\top\,\mathbb{W}\,0_k = 0$.
(c) $C_n = 2\Big(\frac{1}{n}\sum_{i=1}^n\varepsilon_i X_i^\top\Big)\big(\beta - \widehat{\beta}_n\big) \xrightarrow{\text{a.s.}} 0$, since $\frac{1}{n}\sum_{i=1}^n\varepsilon_i X_i^\top \xrightarrow{\text{a.s.}} 0_k^\top$ as was shown in the proof of Theorem 13.2, and $\beta - \widehat{\beta}_n \xrightarrow{\text{a.s.}} 0_k$ by Theorem 13.2.
In summary, $\mathrm{MS}_{e,n} = \frac{n}{n-k}\big(A_n + B_n + C_n\big)$, where $\frac{n}{n-k} \to 1$, $A_n \xrightarrow{\text{a.s.}} \sigma^2$, $B_n \xrightarrow{\text{a.s.}} 0$, $C_n \xrightarrow{\text{a.s.}} 0$. Hence $\mathrm{MS}_{e,n} \xrightarrow{\text{a.s.}} \sigma^2$.

13.3 Asymptotic normality of LSE under homoscedasticity

We shall show in this section: asymptotic normality of $\widehat{\beta}_n$, $\widehat{\theta}_n$, $\widehat{\xi}_n$ (LSE's of the regression coefficients or their linear combinations) when homoscedasticity of the errors is assumed but not their normality.

Reminder. $\mathbb{V} = \big\{\mathsf{E}\big(XX^\top\big)\big\}^{-1}$.

Theorem 13.4 (Asymptotic normality of LSE in the homoscedastic case). Let assumptions (A0), (A1), (A2 homoscedastic) hold. Then, as $n \to \infty$,
\[
\sqrt{n}\,\big(\widehat{\beta}_n - \beta\big) \xrightarrow{\mathcal{D}} \mathcal{N}_k\big(0_k,\; \sigma^2\mathbb{V}\big), \qquad
\sqrt{n}\,\big(\widehat{\theta}_n - \theta\big) \xrightarrow{\mathcal{D}} \mathcal{N}_1\big(0,\; \sigma^2\, l^\top\mathbb{V}l\big), \qquad
\sqrt{n}\,\big(\widehat{\xi}_n - \xi\big) \xrightarrow{\mathcal{D}} \mathcal{N}_m\big(0_m,\; \sigma^2\,\mathbf{L}\mathbb{V}\mathbf{L}^\top\big).
\]
Proof. Will be done jointly with Theorem 13.5.
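Theorems 13.2–13.4 can be illustrated by a small simulation: the LSE computed from the normal equations approaches the true $\beta$, and $\mathrm{MS}_{e,n}$ approaches $\sigma^2$, as $n$ grows. A sketch (the design, coefficients, and sample size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
beta_true = np.array([1.0, -0.5, 2.0])   # true coefficients, k = 3
sigma2 = 4.0                              # true residual variance

def fit(n):
    """LSE (via the normal equations) and MSe for one sample of size n."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    # Errors need not be normal for Theorems 13.2-13.4; normal errors are
    # used here only for convenience.
    eps = rng.normal(scale=np.sqrt(sigma2), size=n)
    Y = X @ beta_true + eps
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    MSe = np.sum((Y - X @ beta_hat) ** 2) / (n - len(beta_true))
    return beta_hat, MSe

beta_hat_big, MSe_big = fit(200_000)
```

With $n = 200{,}000$ the estimates are already very close to the true values, in line with the strong consistency statements.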
13.3.1 Asymptotic validity of the classical inference under homoscedasticity but non-normality

For given $n \geq n_0 > k$, the following statistics are used to infer on estimable parameters of the linear model $\mathcal{M}_n$ based on the response vector $\mathbf{Y}_n$ and the model matrix $\mathbf{X}_n$ (see Chapter 3):
\[
T_n := \frac{\widehat{\theta}_n - \theta}{\sqrt{\mathrm{MS}_{e,n}\; l^\top\big(\mathbf{X}_n^\top\mathbf{X}_n\big)^{-1} l}}, \tag{13.1}
\]
\[
Q_n := \frac{1}{m}\,\big(\widehat{\xi}_n - \xi\big)^\top\Big\{\mathrm{MS}_{e,n}\,\mathbf{L}\big(\mathbf{X}_n^\top\mathbf{X}_n\big)^{-1}\mathbf{L}^\top\Big\}^{-1}\big(\widehat{\xi}_n - \xi\big). \tag{13.2}
\]

Reminder.
• $\mathbb{V}_n = \big(\mathbf{X}_n^\top\mathbf{X}_n\big)^{-1}$.
• Under assumptions (A0) and (A1): $n\,\mathbb{V}_n \xrightarrow{\text{a.s.}} \mathbb{V}$ as $n \to \infty$.

Consequence of Theorem 13.4 (Asymptotic distribution of the t- and F-statistics). Under the assumptions of Theorem 13.4:
\[
T_n \xrightarrow{\mathcal{D}} \mathcal{N}_1(0, 1), \qquad m\,Q_n \xrightarrow{\mathcal{D}} \chi^2_m \qquad \text{as } n \to \infty.
\]
Proof. It follows directly from Lemma 13.1, Theorem 13.4 and the Cramér–Slutsky theorem (Theorem C.7) as follows:
\[
T_n = \underbrace{\frac{\sqrt{n}\,\big(l^\top\widehat{\beta}_n - l^\top\beta\big)}{\sqrt{\sigma^2\, l^\top\mathbb{V}l}}}_{\xrightarrow{\mathcal{D}}\ \mathcal{N}(0,1)}\;\cdot\;\underbrace{\sqrt{\frac{\sigma^2\, l^\top\mathbb{V}l}{\mathrm{MS}_{e,n}\, l^\top\big\{n\big(\mathbf{X}_n^\top\mathbf{X}_n\big)^{-1}\big\}l}}}_{\xrightarrow{P}\ 1},
\]
\[
m\,Q_n = \underbrace{\Big\{\sqrt{n}\,\big(\mathbf{L}\widehat{\beta}_n - \mathbf{L}\beta\big)\Big\}^\top}_{\xrightarrow{\mathcal{D}}\ \mathcal{N}_m(0_m,\,\sigma^2\mathbf{L}\mathbb{V}\mathbf{L}^\top)}\;\underbrace{\Big\{\mathrm{MS}_{e,n}\,\mathbf{L}\big(n\,\mathbb{V}_n\big)\mathbf{L}^\top\Big\}^{-1}}_{\xrightarrow{P}\ \big(\sigma^2\mathbf{L}\mathbb{V}\mathbf{L}^\top\big)^{-1}}\;\underbrace{\sqrt{n}\,\big(\mathbf{L}\widehat{\beta}_n - \mathbf{L}\beta\big)}_{\xrightarrow{\mathcal{D}}\ \mathcal{N}_m(0_m,\,\sigma^2\mathbf{L}\mathbb{V}\mathbf{L}^\top)}.
\]
Convergence of $m\,Q_n$ to the $\chi^2_m$ distribution then follows from a property of the (multivariate) normal distribution concerning the distribution of a quadratic form.

If additionally normality is assumed, i.e., if $\mathbf{Y}_n \mid \mathbf{X}_n \sim \mathcal{N}_n\big(\mathbf{X}_n\beta, \sigma^2\mathbf{I}_n\big)$, then Theorem 3.1 (LSE under normality) provides
\[
T_n \sim t_{n-k}, \qquad Q_n \sim F_{m,\,n-k}.
\]
This is then used for inference (derivation of confidence intervals and regions, construction of tests) on the estimable parameters of a linear model under the assumption of normality. The following holds in general:
\[
\text{if } T_\nu \sim t_\nu \text{ then } T_\nu \xrightarrow{\mathcal{D}} \mathcal{N}(0,1) \text{ as } \nu \to \infty; \qquad
\text{if } Q_\nu \sim F_{m,\nu} \text{ then } m\,Q_\nu \xrightarrow{\mathcal{D}} \chi^2_m \text{ as } \nu \to \infty. \tag{13.3}
\]
This, together with the Consequence of Theorem 13.4, justifies the asymptotic validity of the classical inference based on the statistics $T_n$ (Eq. 13.1) and $Q_n$ (Eq.
13.2), respectively, and the Student t- and F-distribution, respectively, even if normality of the error terms of the linear model does not hold. The only requirements are the assumptions of Theorem 13.4. That is, for example, both intervals
\[
\text{(i)}\quad \mathcal{I}_n^N := \Big(\widehat{\theta}_n - u\big(1-\tfrac{\alpha}{2}\big)\sqrt{\mathrm{MS}_{e,n}\, l^\top\big(\mathbf{X}_n^\top\mathbf{X}_n\big)^{-1}l},\;\ \widehat{\theta}_n + u\big(1-\tfrac{\alpha}{2}\big)\sqrt{\mathrm{MS}_{e,n}\, l^\top\big(\mathbf{X}_n^\top\mathbf{X}_n\big)^{-1}l}\Big);
\]
\[
\text{(ii)}\quad \mathcal{I}_n^t := \Big(\widehat{\theta}_n - t_{n-k}\big(1-\tfrac{\alpha}{2}\big)\sqrt{\mathrm{MS}_{e,n}\, l^\top\big(\mathbf{X}_n^\top\mathbf{X}_n\big)^{-1}l},\;\ \widehat{\theta}_n + t_{n-k}\big(1-\tfrac{\alpha}{2}\big)\sqrt{\mathrm{MS}_{e,n}\, l^\top\big(\mathbf{X}_n^\top\mathbf{X}_n\big)^{-1}l}\Big),
\]
satisfy, for any $\theta^0 \in \mathbb{R}$ (even without normality of the error terms),
\[
P\big(\mathcal{I}_n^N \ni \theta^0;\ \theta = \theta^0\big) \longrightarrow 1-\alpha, \qquad
P\big(\mathcal{I}_n^t \ni \theta^0;\ \theta = \theta^0\big) \longrightarrow 1-\alpha \qquad \text{as } n \to \infty.
\]
Analogously, due to the general asymptotic property of the F-distribution (Eq. 13.3), asymptotically valid inference on an estimable vector parameter $\xi = \mathbf{L}\beta$ of a linear model can be based either on the statistic $m\,Q_n$ and the $\chi^2_m$ distribution, or on the statistic $Q_n$ and the $F_{m,\,n-k}$ distribution. For example, for both ellipsoids
\[
\text{(i)}\quad \mathcal{K}_n^{\chi} := \Big\{\xi \in \mathbb{R}^m:\ \big(\xi - \widehat{\xi}_n\big)^\top\Big\{\mathrm{MS}_{e,n}\,\mathbf{L}\big(\mathbf{X}_n^\top\mathbf{X}_n\big)^{-1}\mathbf{L}^\top\Big\}^{-1}\big(\xi - \widehat{\xi}_n\big) < \chi^2_m(1-\alpha)\Big\};
\]
\[
\text{(ii)}\quad \mathcal{K}_n^{F} := \Big\{\xi \in \mathbb{R}^m:\ \big(\xi - \widehat{\xi}_n\big)^\top\Big\{\mathrm{MS}_{e,n}\,\mathbf{L}\big(\mathbf{X}_n^\top\mathbf{X}_n\big)^{-1}\mathbf{L}^\top\Big\}^{-1}\big(\xi - \widehat{\xi}_n\big) < m\,F_{m,n-k}(1-\alpha)\Big\},
\]
we have for any $\xi^0 \in \mathbb{R}^m$ (under the assumptions of Theorem 13.4):
\[
P\big(\mathcal{K}_n^{\chi} \ni \xi^0;\ \xi = \xi^0\big) \longrightarrow 1-\alpha, \qquad
P\big(\mathcal{K}_n^{F} \ni \xi^0;\ \xi = \xi^0\big) \longrightarrow 1-\alpha \qquad \text{as } n \to \infty.
\]

13.4 Asymptotic normality of LSE under heteroscedasticity

We shall show in this section: asymptotic normality of $\widehat{\beta}_n$, $\widehat{\theta}_n$, $\widehat{\xi}_n$ (LSE's of the regression coefficients or their linear combinations) when not even homoscedasticity of the errors is assumed.

Reminder.
• $\mathbb{V} = \big\{\mathsf{E}\big(XX^\top\big)\big\}^{-1}$.
• $\mathbb{W}^F = \mathsf{E}\big\{\sigma^2(X)\,XX^\top\big\}$.

Theorem 13.5 (Asymptotic normality of LSE in the heteroscedastic case). Let assumptions (A0), (A1), (A2 heteroscedastic) hold. Then, as $n \to \infty$,
\[
\sqrt{n}\,\big(\widehat{\beta}_n - \beta\big) \xrightarrow{\mathcal{D}} \mathcal{N}_k\big(0_k,\; \mathbb{V}\mathbb{W}^F\mathbb{V}\big), \qquad
\sqrt{n}\,\big(\widehat{\theta}_n - \theta\big) \xrightarrow{\mathcal{D}} \mathcal{N}_1\big(0,\; l^\top\mathbb{V}\mathbb{W}^F\mathbb{V}\,l\big), \qquad
\sqrt{n}\,\big(\widehat{\xi}_n - \xi\big) \xrightarrow{\mathcal{D}} \mathcal{N}_m\big(0_m,\; \mathbf{L}\mathbb{V}\mathbb{W}^F\mathbb{V}\mathbf{L}^\top\big).
\]
Proof. We will jointly prove also Theorem 13.4.
We have b = β n X> Xn | n {z −1 X> nY n } Vn = Vn n X X i Yi i=1 = Vn n X Xi X> i β + εi i=1 = Vn X n X iX > i β + Vn {z } V−1 n = β + Vn X i εi i=1 i=1 | n X n X X i εi . i=1 That is, b − β = Vn β n n X X i εi = n V n i=1 n 1 X X i εi . n (13.4) i=1 a.s. By Lemma 13.1, n Vn −→ V which implies P n Vn −→ V as n → ∞. (13.5) 13.4. ASYMPTOTIC NORMALITY OF LSE UNDER HETEROSCEDASTICITY 259 Pn In the following, let us explore asymptotic behavior of the term n1 i=1 X i εi . 1 Pn From assumption (A0), the term n i=1 X i εi is a sample mean of i.i.d. random vector X i εi , i = 1, . . . , n. The mean and the covariance matrix of the distribution of those random vectors are E Xε = 0k (was shown in the proof of Theorem 13.2), + var E Xε X var Xε = E var Xε X = E X var ε X X > + var X E ε X | {z } | {z } 0 σ 2 (X) = E σ 2 (X)XX > . Depending, on whether (A2 homoscedastic) or (A2 heteroscedastic) is assumed, we have σ 2 E XX > = σ 2 W, (A2 homoscedastic), var Xε = E σ 2 (X)XX > = WF , (A2 heteroscedastic). (13.6) Under both (A2 homoscedastic) and (A2 heteroscedastic) all elements of the covariance matrix var Xε are finite. Hence by Theorem C.5 (multivariate CLT for i.i.d. random vectors): n n √ 1 X 1 X D X i εi = √ n X i εi −→ Nk 0k , E σ 2 (X)XX > n n i=1 as n → ∞. i=1 From (13.4) and (13.5), we now have, b − β β n n 1 X √ X i εi n i=1 {z } | = n Vn | {z } P −→V D 1 √ . n −→Nk 0k , E σ 2 (X)XX > That is, √ b − β n β n = n Vn | {z } P −→V D n 1 X √ X i εi n i=1 | {z } −→Nk 0k , E σ 2 (X)XX > Finally, by applying Theorem C.7 (Cramér–Slutsky): D √ b − β −→ n β Nk 0k , V E σ 2 (X)XX > V> n . as n → ∞. By using (13.6) and realizing that V> = V, we get Under (A2 homoscedastic) V E σ 2 (X)XX > V> = V σ 2 W V = σ 2 V V−1 V = σ 2 V and hence √ b − β n β n D −→ Nk 0k , σ 2 V as n → ∞. 13.4. ASYMPTOTIC NORMALITY OF LSE UNDER HETEROSCEDASTICITY 260 Under (A2 heteroscedastic) V E σ 2 (X)XX > V> = V WF V and hence D √ F b − β −→ n β N 0 , V W V k k n as n → ∞. 
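The heteroscedastic limit $\mathcal{N}_k\big(0_k, \mathbb{V}\mathbb{W}^F\mathbb{V}\big)$ suggests estimating the covariance of $\widehat{\beta}_n$ by plugging sample analogues into the sandwich $\mathbb{V}\mathbb{W}^F\mathbb{V}$, which is developed below as $(\mathbf{X}_n^\top\mathbf{X}_n)^{-1}\mathbf{X}_n^\top\Omega_n\mathbf{X}_n(\mathbf{X}_n^\top\mathbf{X}_n)^{-1}$ with $\Omega_n = \mathrm{diag}(U_{n,i}^2)$. A minimal numerical sketch of this HC0 plug-in (the heteroscedastic design is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 2
x = rng.uniform(1.0, 3.0, size=n)
X = np.column_stack([np.ones(n), x])
# Heteroscedastic errors: the conditional sd grows with x (illustrative).
Y = 0.5 + 1.5 * x + rng.normal(size=n) * x

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
U = Y - X @ beta_hat                      # residuals
XtX_inv = np.linalg.inv(X.T @ X)

# HC0 sandwich: (X'X)^{-1} X' diag(U_i^2) X (X'X)^{-1}
meat = (X * (U ** 2)[:, None]).T @ X
V_hc0 = XtX_inv @ meat @ XtX_inv

# Classical (homoscedasticity-based) estimate, for comparison only.
MSe = np.sum(U ** 2) / (n - k)
V_classical = MSe * XtX_inv
```

Comparing `np.sqrt(np.diag(V_hc0))` with `np.sqrt(np.diag(V_classical))` shows how ignoring the heteroscedasticity misstates the standard errors.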
b and of ξ = Lβ b follows now from Theorem C.6 (Cramér– Asymptotic normality of θbn = l> β n n n Wold). k Notation (Residuals and related quantities based on a model for data of size n). For n ≥ n0 > k, the following notation will be used for quantities based on the model Mn : Y n Xn ∼ Xn β, σ 2 In . −1 • Hat matrix: Hn = Xn X> n Xn • Residual projection matrix: Mn = In − Hn ; • Diagonal elements of matrix Hn : hn,1 , . . . , hn,n ; • Diagonal elements of matrix Mn : mn,1 = 1 − hn,1 , . . . , mn,n = 1 − hn,n ; U n = Mn Y n = Un,1 , . . . , Un,n . • Residuals: X> n; Reminder. • Vn = n X X iX i> −1 = X> n Xn −1 . i=1 a.s. • Under assumptions (A0) and (A1): n Vn −→ V as n → ∞. Theorem 13.6 Sandwich estimator of the covariance matrix. Let assumptions (A0), (A1), (A2 heteroscedastic) hold. Let additionally, for each s, t, j, l = 0, . . . , k−1 Eε2 Xj Xl < ∞, Then Eε Xs Xj Xl < ∞, a.s. F n Vn WF n Vn −→ V W V EXs Xt Xj Xl < ∞. as n → ∞, where for n = 1, 2, . . ., WF n = n X 2 > Un,i X iX > i = Xn Ωn Xn , i=1 Ωn = diag ωn,1 , . . . , ωn,n , 2 ωn,i = Un,i , i = 1, . . . , n. End of Lecture #24 (07/01/2016) Start of Lecture #26 (14/01/2016) 13.4. ASYMPTOTIC NORMALITY OF LSE UNDER HETEROSCEDASTICITY 261 Proof. First, remind that n n o−1 o−1 , E σ 2 (X) XX > E XX > E XX > V WF V = and we know from Lemma 13.1 that −1 a.s. n o−1 > =V n Vn = n X> X −→ E XX n n as n → ∞. Hence, if we show that n 1 X 2 1 F a.s. Wn = Un,i X i X > −→ E σ 2 (X) XX > = WF i n n as n → ∞, i=1 the statement of Theorem will be proven. Remember, σ 2 (X) = var ε X = E ε2 X . From here, for each j, l = 0, . . . , k − 1 E ε2 Xj Xl = E E ε2 Xj Xl X = E Xj Xl E ε2 X = E σ 2 (X) Xj Xl . For each j, l = 0, . . . , k − 1, Eε2 Xj Xl < ∞ by assumptions of Theorem. By assumption (A0), εi Xi,j Xi,l , i = 1, 2, . . ., is a sequence of i.i.d. random variables. Hence by Theorem C.2 (SLLN, i.i.d.), n 1 X 2 a.s. εi Xi,j Xi,l −→ E σ 2 (X) Xj Xl n as n → ∞. i=1 That is, in a matrix form, n 1 X 2 a.s. 
εi X i X > −→ E σ 2 (X) XX > = WF i n as n → ∞. (13.7) i=1 In the following, we show that (unobservable) squared error terms ε2i in (13.7) can be replaced by 2 = Y − X >β b 2 while keeping the same limitting matrix WF as in (13.7). squared residuals Un,i i i n We have n n 1 X 2 1 X b 2 Xi X> Un,i X i X > = Yi − X > i i βn i n n i=1 i=1 | {z } WF n 13.4. ASYMPTOTIC NORMALITY OF LSE UNDER HETEROSCEDASTICITY 262 n 1 X > >b 2 Xi X> = Yi − X > i i β + X i β − X i βn | {z } n i=1 εi = n n 1 X 2 1 X b > X iX > β − β b X iX > + εi X i X > β−β n n i i i n n i=1 i=1 {z } {z } | | An Bn + (a) An = n 2 X b > X i εi X i X > . β−β n i n i=1 | {z } Cn n 1 X 2 a.s. 2 > εi X i X > = WF due to (13.7). i −→ E σ (X) XX n i=1 n 1 X b > X iX > β − β b X i X > , we can realize that β − β−β n n i i n i=1 b is a scalar quantity. Hence β−β n (b) To work with Bn = b β n > Xi = X> i Bn n 1 X b > X i X iX > X > β − β b β−β = n n i i n i=1 and the (j, l)th element of matrix Bn (j, l = 0, . . . , k − 1) is Bn (j, l) = n 1 X b > X i (Xi,j Xi,l ) X > β − β b β−β n n i n i=1 = b β−β n > n 1 X > b . (Xi,j Xi,l ) X i X i β−β n n i=1 a.s. b • From Theorem 13.2: β − β n −→ 0k as n → ∞. • Due to assumption (A0) and assumption EXs Xt Xj Xl < ∞ for any s, t, j, l = 0, . . . , k − 1, by Theorem C.2 (SLLN, i.i.d.), for any j, l = 0, . . . , k − 1: n 1 X a.s. > (Xi,j Xi,l ) X i X > . i −→ E Xj Xl XX n i=1 a.s. > • Hence, for any j, l = 0, . . . , k − 1, Bn (j, l) −→ 0> 0k = 0 and finally, k E Xj Xl XX a.s. as n → ∞. Bn −→ 0k×k n 2 X b > X i εi X i X > and the (j, l)th element of matrix Cn (j, l = (c) Cn = β−β n i n i=1 0, . . . , k − 1) is Cn (j, l) = n 2 X b > X i εi Xi,j Xi,l β−β n n i=1 b = 2 β−β n > n 1 X X i εi Xi,j Xi,l . n i=1 13.4. ASYMPTOTIC NORMALITY OF LSE UNDER HETEROSCEDASTICITY 263 a.s. b • From Theorem 13.2: β − β n −→ 0k as n → ∞. • Due to assumption (A0) and assumption Eε Xs Xj Xl < ∞ for any s, j, l = 0, . . . , k −1, by Theorem C.2 (SLLN, i.i.d.), for any j, l = 0, . . . 
, k − 1: n 1 X a.s. X i εi Xi,j Xi,l −→ E Xε Xj Xl . n i=1 a.s. • Hence, for any j, l = 0, . . . , k − 1, Cn (j, l) −→ 2 0> k E Xε Xj Xl = 0 and finally, a.s. Cn −→ 0k×k as n → ∞. In summary: n Vn WF n Vn = n Vn 1 n WF n Vn n = n Vn An + Bn + Cn n Vn , a.s. where n Vn −→ V, a.s. An −→ WF , a.s. Bn −→ 0k×k , a.s. Cn −→ 0k×k . Hence a.s. F n Vn WF + 0k×k + 0k×k V = VWF V n Vn −→ V W as n → ∞. k Terminology (Heteroscedasticity consistent (sandwich) estimator of the covariance matrix). Matrix > Vn W F n Vn = Xn Xn −1 > X> n Ωn Xn Xn Xn −1 (13.8) b of is called the heteroscedasticity consistent (HC) estimator of the covariance matrix of the LSE β n the regression coefficients. Due to its form, the matrix (13.8) is also called as the sandwich estimator −1 > composed of a bread X> Xn and a meat Ωn . n Xn Notes (Alternative sorts of meat for the sandwich). • It is directly seen that the meat matrix Ωn can, for a chosen sequence νn , such that n → ∞, be replaced by a matrix n Ωn , νn n νn → 1 as and the statement of Theorem 13.6 remains valid. A value νn is then called degrees of freedom of the sandwich. 13.4. ASYMPTOTIC NORMALITY OF LSE UNDER HETEROSCEDASTICITY 264 • It can also be shown (see references below) that the meat matrix Ωn can, for a chosen sequence νn , such that νnn → 1 as n → ∞ and a suitable sequence δ n = δn,1 , . . . , δn,n , n = 1, 2, . . ., be replaced by a matrix ΩHC := diag ωn,1 , . . . , ωn,n , n ωn,i = 2 n Un,i , νn mδn,i n,i i = 1, . . . , n. • The following choices of sequences νn and δ n have appeared in the literature (n = 1, 2, . . ., i = 1, . . . , n): HC0: νn = n, δn,i = 0, that is, 2 ωn,i = Un,i . This is the choice due to White (1980) who was the first who proposed the sandwich estimator of the covariance matrix. This choice was also used in Theorem 13.6. HC1: νn = n − k, δn,i = 0, that is, n U2 . n − k n,i This choice was suggested by MacKinnon and White (1985). ωn,i = HC2: νn = n, δn,i = 1, that is, ωn,i = 2 Un,i . 
mn,i This is the second proposal of MacKinnon and White (1985). HC3: νn = n, δn,i = 2, that is, ωn,i = 2 Un,i m2n,i . This is the third proposal of MacKinnon and White (1985). HC4: νn = n, δn,i = min 4, n hn,i /k , that is, ωn,i = 2 Un,i δ n,i mn,i . This was proposed relatively recently by Cribari-Neto (2004). Note that k = hence n n h o 1 X n,i δn,i = min 4, , hn = hn,i . n hn i=1 Pn i=1 hn,i , and • An extensive study towards small sample behavior of different sandwich estimators was carried out by Long and Ervin (2000) who recommended usage of the HC3 estimator. Even better small sample behavior, especially in presence of influential observations was later concluded by Cribari-Neto (2004) for the HC4 estimator. • Labels HC0, HC1, HC2, HC3, HC4 for the above sandwich estimators are used by the R package sandwich (Zeileis, 2004) that enables for their easy calculation based on the fitted linear model. 13.4. ASYMPTOTIC NORMALITY OF LSE UNDER HETEROSCEDASTICITY 13.4.1 265 Heteroscedasticity consistent asymptotic inference Let for given sequences νn and δ n , n = 1, 2, . . ., ΩHC be a sequence of the meat matrices that n b . Let for lead to the heteroscedasticity consistent estimator of the covariance matrix of the LSE β n given n ≥ n0 > k, −1 > HC −1 VHC := X> Xn Ωn Xn X> . n n Xn n Xn Finally, let the statistics TnHC and QHC be defined as n θbn − θ , TnHC := q l> VHC l n QHC := n > 1 b > −1 b ξn − ξ LVHC ξn − ξ . n L m Note that the statistics TnHC and QHC are the usual statistics Tn (Eq. 13.1) and Qn n , respectively, −1 > (13.2), respectively, in which the term MSe,n Xn Xn is replaced by the sandwich estimator VHC . n Consequence of Theorems 13.5 and 13.6: Heteroscedasticity consistent asymptotic inference. Under assumptions of Theorem 13.5 and 13.6: TnHC m QHC n D as n → ∞, D as n → ∞. −→ N1 (0, 1) −→ χ2m Proof. Proof/calculations were available on the blackboard in K1. k Due to a general asymptotic property of the Student t-distribution (Eq. 
13.3), asymptotically valid inference on the estimable parameter θ = l> β of a linear model where neither normality, nor homoscedasticity is necessarily satisfied, can be based on the statistic TnHC and either a Student tn−k or a standard normal distribution. Under assumptions of Theorems 13.5 and 13.6, both intervals q q bn + u(1 − α/2) l> VHC l ; l, θ (i) InN := θbn − u(1 − α/2) l> VHC n n q q bn + tn−k (1 − α/2) l> VHC l , (ii) Int := θbn − tn−k (1 − α/2) l> VHC l, θ n n satisfy, for any θ0 ∈ R: P InN 3 θ0 ; θ = θ0 −→ 1 − α P Int 3 θ0 ; θ = θ0 −→ 1 − α as n → ∞, as n → ∞. 13.4. ASYMPTOTIC NORMALITY OF LSE UNDER HETEROSCEDASTICITY 266 Analogously, due to a general asymptotic property of the F-distribution (Eq. 13.3), asymptotically valid inference on the estimable vector parameter ξ = Lβ of a linear model can be based either on the statistic m QHC and the χ2m distribution or on the statistic QHC and the Fm, n−k distribution. n n For example, for both ellipsoids n o > m HC > −1 2 b b (i) := ξ ∈ R : ξ − ξ L Vn L ξ − ξ < χm (1 − α) ; n o > > −1 b (ii) KnF := ξ ∈ Rm : ξ − b ξ L VHC L ξ − ξ < m F (1 − α) , m,n−k n Knχ we have for any ξ 0 ∈ Rm (under assumptions of Theorems 13.5 and 13.6): P Knχ 3 ξ 0 ; ξ = ξ 0 −→ 1 − α as n → ∞, P KnF 3 ξ 0 ; ξ = ξ 0 −→ 1 − α as n → ∞. End of Lecture #26 (14/01/2016) Chapter 14 Unusual Observations In the whole chapter, we assume a linear model M : Y X ∼ Xβ, σ 2 In , Start of Lecture #25 (07/01/2016) rank(Xn×k ) = r ≤ k, where standard notation is considered. That is, − • b = X> X X> Y = b0 , . . . , bk−1 : any solution to normal equations; − • H = X X> X X> = hi,t i,t=1,...,n : the hat matrix; • M = In − H = mi,t i,t=1,...,n : the residual projection matrix; • Yb = HY = Xb = Yb1 , . . . , Ybn : the vector of fitted values; • U = MY = Y − Yb = U1 , . . . , Un : the residuals; 2 • SSe = U : the residual sum of squares; SSe is the residual mean square; = U1std , . . . 
, $U_n^{\mathrm{std}}\big)^\top$: the vector of standardized residuals,
\[
U_i^{\mathrm{std}} = \frac{U_i}{\sqrt{\mathrm{MS}_e\, m_{i,i}}}, \qquad i = 1, \ldots, n,
\]
where $\mathrm{MS}_e = \frac{\mathrm{SS}_e}{n-r}$ denotes the residual mean square.

The whole chapter will deal with the identification of "unusual" observations in a particular dataset. Any probabilistic statements will hence be conditioned on the realized covariate values $X_1 = x_1, \ldots, X_n = x_n$. The same symbol $\mathbf{X}$ will be used for the (in general random) model matrix and its realized counterpart, i.e.,
\[
\mathbf{X} = \begin{pmatrix} X_1^\top \\ \vdots \\ X_n^\top \end{pmatrix} = \begin{pmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{pmatrix}.
\]

14.1 Leave-one-out and outlier model

Notation. For chosen $t \in \{1, \ldots, n\}$, we will use the following notation.
• $\mathbf{Y}_{(-t)}$: vector $\mathbf{Y}$ without the $t$th element;
• $x_t$: the $t$th row (understood as a column vector) of the matrix $\mathbf{X}$;
• $\mathbf{X}_{(-t)}$: matrix $\mathbf{X}$ without the $t$th row;
• $j_t$: vector $(0, \ldots, 0, 1, 0, \ldots, 0)^\top$ of length $n$ with 1 in the $t$th place.

Definition 14.1 (Leave-one-out model). The $t$th leave-one-out model¹ is a linear model
\[
\mathcal{M}_{(-t)}:\ \mathbf{Y}_{(-t)} \mid \mathbf{X}_{(-t)} \sim \big(\mathbf{X}_{(-t)}\beta,\; \sigma^2\mathbf{I}_{n-1}\big).
\]

Definition 14.2 (Outlier model). The $t$th outlier model² is a linear model
\[
\mathcal{M}_t^{\mathrm{out}}:\ \mathbf{Y} \mid \mathbf{X} \sim \big(\mathbf{X}\beta + j_t\gamma_t^{\mathrm{out}},\; \sigma^2\mathbf{I}_n\big).
\]

Notation (Quantities related to the leave-one-out and outlier models).
• Quantities related to model $\mathcal{M}_{(-t)}$ will be recognized by the subscript $(-t)$, i.e., $b_{(-t)}$, $\widehat{\mathbf{Y}}_{(-t)}$, $\mathrm{SS}_{e,(-t)}$, $\mathrm{MS}_{e,(-t)}$, ….
• Quantities related to model $\mathcal{M}_t^{\mathrm{out}}$ will be recognized by the subscript $t$ and the superscript out, i.e., $b_t^{\mathrm{out}}$, $\widehat{\mathbf{Y}}_t^{\mathrm{out}}$, $\mathrm{SS}_{e,t}^{\mathrm{out}}$, $\mathrm{MS}_{e,t}^{\mathrm{out}}$, ….
• Solutions to the normal equations in model $\mathcal{M}_t^{\mathrm{out}}$ will be denoted as $\big(b_t^{\mathrm{out}},\, c_t^{\mathrm{out}}\big)$.
• If $\gamma_t^{\mathrm{out}}$ is an estimable parameter of model $\mathcal{M}_t^{\mathrm{out}}$, then its LSE will be denoted as $\widehat{\gamma}_t^{\mathrm{out}}$.

Theorem 14.1 (Four equivalent statements). The following four statements are equivalent:
(i) $\mathrm{rank}(\mathbf{X}) = \mathrm{rank}\big(\mathbf{X}_{(-t)}\big)$, i.e., $x_t \in \mathcal{M}\big(\mathbf{X}_{(-t)}^\top\big)$;

¹ model vynechaného $t$-tého pozorování
² model $t$-tého odlehlého pozorování
LEAVE-ONE-OUT AND OUTLIER MODEL 269 (ii) mt,t > 0; (iii) γtout is an estimable parameter of model Mout t ; (iv) µt := E Yt X t = xt = x> t β is an estimable parameter of model M(−t) . Proof. (ii) ⇔ (i) • We will show this by showing non(i) ⇔ non(ii). > . • non(i) means that xt ∈ / M X> (−t) ⊂ M X > M X> and M X(−t) 6= M X> . (−t) ⊂ M X ⊥ ⊥ ⊥ ⊥ ⇔ M X> ⊂ M X> and M X> 6= M X> . (−t) (−t) ⊥ ⊥ > such that a ∈ / M X> . • That is, ⇔ ∃a ∈ M X(−t) > ⇔ ∃a ∈ Rk such that a> X> & a> X> 6= 0> . (−t) = 0 ⇔ ∃a ∈ Rk such that X(−t) a = 0 & Xa 6= 0. It must be Xa = 0, . . ., 0, c, 0, . . ., 0 > = c jt for some c 6= 0. ⇔ ∃a ∈ Rk such that Xa = cj t , c 6= 0. ⇔ jt ∈ M X . ⇔ Mj t |{z} = 0. tth column of M ⇔ mt = 0. 2 ⇔ mt = mt,t = 0. ⇔ non(ii). mt denotes the tth row of M (and also its t column since M is symmetric). (iii) ⇔ (i) > • γtout = 0> ,1 | k {z } l> ! β . γtout 14.1. LEAVE-ONE-OUT AND OUTLIER MODEL 270 > • γtout is estimable parameter of Mout ⇔ l ∈ M X, j . t t ⇔ ∃a ∈ Rn such that 0> , 1 = a> X, j t . ⇔ ∃a ∈ Rn such that 0> = a> X & 1 = a> j t . ⇔ ∃a ∈ Rn such that 0> = a> X & at = 1. > ⇔ ∃a ∈ Rn such that x> t = − a(−t) X(−t) . ⇔ xt ∈ M X> (−t) . ⇔ (i). (iv) ⇔ (i) • Follows directly from Theorem 2.7. k Theorem 14.2 Equivalence of the outlier model and the leave-one-out model. are the same, i.e., 1. The residual sums of squares in models M(−t) and Mout t SSe,(−t) = SSout e,t . out 2. Vector b(−t) solves the normal equations of model M(−t) if and only if a vector bout t , ct out solves the normal equations of model Mt , where bout = b(−t) , t cout = Yt − xt > b(−t) . t Proof. Solution to normal equations minimizes the corresponding sum of squares. The sum of squares to be minimized w.r.t. β and γtout in the outlier model Mout is t 2 SSout β, γtout = Y − Xβ − j t γtout separate the tth element of the sum t 2 out 2 = Y (−t) − X(−t) β + Yt − x> t β − γt out 2 = SS(−t) (β) + Yt − x> , t β − γt where SS(−t) (β) is the sum of squares to be minimized w.r.t. 
β in the leave-one-out model M(−t) . out The term Yt − x> t β − γt 2 can for any β ∈ Rk be equal to zero if we, for given β ∈ Rk , take γtout = Yt − x> t β. 14.1. LEAVE-ONE-OUT AND OUTLIER MODEL 271 That is out (i) min SSout t (β, γt ) = min SS(−t) (β); β β, γtout | {z } | {z } out SS e,(−t) SSe,t (ii) A vector b(−t) ∈ Rk minimizes SS(−t) (β) if and only if a vector k+1 b(−t) , Yt − x> t b(−t) ∈ R {z } | {z } | bout cout t t out minimizes SSout t (β, γt ). k Notation (Leave-one-out least squares estimators of the response expectations). If mt,t > 0 for all t = 1, . . . , n, we will use the following notation: Yb[t] := x> t b(−t) , t = 1, . . . , n, which is the LSE of the parameter µt = E Yt X t = xt = x> t β based on the leave-one-out model M(−t) ; Yb [•] := Yb[1] , . . . , Yb[n] , which is an estimator of the parameter µ = µ1 , . . . , µn = E Y X , where each element is estimated using the linear model based on data with the corresponding observation being left out. Calculation of quantities of the outlier and the leave-one-out models Model Mout is a model with added regressor for model M. Suppose that mt,t > 0 for given t t = 1, . . . , n. By applying Lemma 10.1, we can express the LSE of the parameter γtout as γ btout = j> t Mj t − − −1 j> t U = (mt,t ) Ut = (mt,t ) Ut = Ut . mt,t Analogously, other quantities of the outlier model can be expressed using the quantities of model M. Namely, − Ut bout = b− X> X x t , t mt,t out Ut Yb t = Yb + mt , mt,t 2 Ut2 SSe − SSout = = MSe Utstd , e,t mt,t where mt denotes the tth column (and row as well) of the residual projection matrix M. 14.1. LEAVE-ONE-OUT AND OUTLIER MODEL 272 Lemma 14.3 Quantities of the outlier and leave-one-out model expressed using quantities of the original model. Suppose that for given t ∈ {1, . . . , n}, mt,t > 0. The following quantities of the outlier model Mout t and the leave-one-out model M(−t) are expressable using the quantities of the original model M as follows. 
Ut b γ btout = Yt − x> , t b(−t) = Yt − Y[t] = mt,t − Ut b(−t) = bout = b− X> X x t , t mt,t (14.1) Ut2 std 2 out = SSe − MSe Ut SSe,(−t) = SSe,t = SSe − , mt,t 2 MSout MSe,(−t) n − r − Utstd e,t = = . MSe MSe n−r−1 Proof. Equality between the quantities of the outlier and the leave-one-out model follows from Theorem 14.2. Remaining expressions follow from previously conducted calculations. To see the last equality in (14.1), remember that the residual degrees of freedom of both the outlier and the leave-one-out models are equal to n − r − 1. That is, whereas in model M, MSe = SSe , n−r in the outlier and the leave-one-out model, MSe,(−t) = SSout SSe,(−t) e,t = = MSout e,t . n−r−1 n−r−1 k Notes. • Expressions in Lemma 14.3 quantify the influence of the tth observation on (i) the LSE of a vector β of the regression coefficients (in case they are estimable); (ii) the estimate of the residual variance. • Lemma 14.3 also shows that it is not necessary to fit n leave-one-out (or outlier models) to calculate their LSE-related quantities. All important quantities can be calculated directly from the LSE-related quantities of the original model M. 14.1. LEAVE-ONE-OUT AND OUTLIER MODEL 273 Definition 14.3 Deleted residual. If mt,t > 0, then the quantity γ btout = Yt − Yb[t] = is called the tth deleted residual of the model M. Ut mt,t 14.2. OUTLIERS 14.2 274 Outliers By outliers3 of the model M, we shall understand observations for which the response expectation does not follow the assumed model, i.e., the tth observation (t ∈ {1, . . . , n}) is an outlier if E Yt X t = xt 6= x> t β, in which case we can write out E Yt X t = xt = x> t β + γt . As such, an outlier can be characterized as an observation with unusual response (y) value. 
If mt,t > 0, γtout is an estimable parameter of the tth outlier model Mout (for which the model M t is a submodel) and decision on whether the tth observation is an outlier can be transferred into a problem of testing H0 : γtout = 0 in the tth outlier model Mout t . Note that the above null hypothesis also expresses the fact that the submodel M of the model Mout holds. t If normality is assumed, this null hypothesis can be tested using a classical t-test on a value of the estimable parameter. The corresponding t-statistic has a standard form γ bout Tt = q t c γ var btout and under the null hypothesis follows the Student t distribution with n − r − 1 degrees of freedom (residual degrees of freedom of the outlier model). From Section 14.1, we have γ btout = Ut = Yt − Yb[t] . mt,t Hence (the variance is conditional given the covariate values), (?) 1 2 Ut 1 σ2 out var γ bt = var = var U = σ m = . t t,t mt,t mt,t m2t,t m2t,t (?) The equality = holds irrespective of whether γtout = 0 (and model M holds) or γtout 6= 0 (and model Mout holds). t The estimator γ btout is the LSE of a parameter of the outlier model and hence MSout e,t c γ var btout = , mt,t and finally, γ bout Tt = r t . MSout e,t mt,t Two useful expressions of the statistic Tt are obtained by remembering from Section 14.1 (a) t MSout btout = Yt − Yb[t] = γ btout = mUt,t . This leads e,t = MSe,(−t) and (b) two expressions of γ to Yt − Yb[t] √ Ut Tt = p mt,t = p . MSe,(−t) MSe,(−t) mt,t 3 odlehlá pozorování 14.2. OUTLIERS 275 Definition 14.4 Studentized residual. If mt,t > 0, then the quantity Yt − Yb[t] √ Ut Tt = p mt,t = p MSe,(−t) MSe,(−t) mt,t is called the tth studentized residual4 of the model M. Notes. • Using the last equality in (14.1), we can derive one more expression of the studentized residual using the standardized residual Ut Utstd = p . MSe mt,t Namely, s Tt = n−r−1 n−r− 2 Utstd Utstd . 
This relation directly shows that it is not necessary to fit the leave-one-out or the outlier model to calculate the studentized residuals of the initial model M.

Theorem 14.4 On studentized residuals.
Let Y | X ∼ N_n(Xβ, σ²I_n), where rank(X_{n×k}) = r ≤ k < n. Let further n > r + 1 and, for a given t ∈ {1, …, n}, m_{t,t} > 0. Then
1. the tth studentized residual T_t follows the Student t-distribution with n − r − 1 degrees of freedom;
2. if additionally n > r + 2, then E(T_t) = 0;
3. if additionally n > r + 3, then var(T_t) = (n − r − 1)/(n − r − 3).

Proof. Point 1 follows from the preceding derivations; points 2 and 3 follow from the properties of the Student t-distribution. □

Test for outliers
The studentized residual T_t of the model M is the test statistic (with the t_{n−r−1} distribution under the null hypothesis) of the test
\[
H_0:\; \gamma_t^{out} = 0, \qquad H_1:\; \gamma_t^{out} \neq 0
\]
in the tth outlier model M_t^{out}: Y | X ∼ N_n(Xβ + j_t γ_t^{out}, σ²I_n). The above testing problem can also be interpreted as a test of
H₀: the tth observation is not an outlier of model M,
H₁: the tth observation is an outlier of model M,
where "outlier" means an outlier with respect to model M: Y | X ∼ N_n(Xβ, σ²I_n):
• the expected value of the tth observation is different from that given by model M;
• the observed value of Y_t is unusual under model M.

When performing the test for outliers for all observations in the dataset, we are in fact facing a multiple testing problem. Hence an adjustment of the P-values resulting from the comparison of the studentized residuals with the quantiles of the Student t_{n−r−1} distribution is needed to keep the rate of falsely identified outliers under the requested level α. For example, the Bonferroni adjustment can be used.

Notes.
• Two or more outliers next to each other can hide each other.
• The notion of an outlier is always relative to the considered model (also in other areas of statistics).
An observation which is an outlier with respect to one model is not necessarily an outlier with respect to some other model.
• Especially in large datasets, a few outliers are not a problem provided they are not at the same time also influential for the statistical inference (see the next section).
• In the context of a normal linear model, the presence of outliers may indicate that the error distribution has heavier tails than the normal distribution.
• An outlier can also suggest that a particular observation is a data error.
• If some observation is indicated to be an outlier, it should always be explored:
  • Is it a data error? If yes, try to correct it; if this is impossible, it is no problem (under certain assumptions) to exclude it from the data.
  • Is the assumed model correct, and is it possible to find a physical/practical explanation for the occurrence of such an unusual observation? If an explanation is found, are we interested in capturing such artefacts by our model or not?
  • Do the outlier(s) show a serious deviation from the model that cannot be ignored (for the purposes of the particular modelling)?
  • …
• NEVER, NEVER, NEVER exclude "outliers" from the analysis in an automatic manner.
• Often, the identification of outliers with respect to some model is of primary interest:
  • Example: a model for the amount of credit card transactions over a certain period of time depending on some factors (age, gender, income, …).
  • The model is found to be correct for a "standard" population (of clients).
  • An outlier with respect to such a model ≡ potentially a fraudulent use of the credit card.
• If the closer analysis of the "outliers" suggests that the assumed model is not satisfactorily capturing the reality we want to capture (it is not useful), some other model (maybe not linear, maybe not normal) must be looked for.

14.3 Leverage points

By leverage points (Czech: vzdálená pozorování) of the model M, we shall understand observations with, in a certain sense, unusual regressor (x) values.
As will be shown, whether the regressor values of a certain observation are unusual is closely related to the diagonal elements h_{1,1}, …, h_{n,n} of the hat matrix
\[
H = X\bigl(X^\top X\bigr)^{-} X^\top
\]
of the model.

Terminology (Leverage). A diagonal element h_{t,t} (t = 1, …, n) of the hat matrix H is called the leverage of the tth observation.

Interpretation of the leverage
To show that the leverage expresses how unusual the regressor values of the tth observation are, let us consider a linear model with intercept, i.e., the realized model matrix is X = (1_n, x^1, …, x^{k−1}), where
\[
x^1 = \begin{pmatrix} x_{1,1} \\ \vdots \\ x_{n,1} \end{pmatrix}, \;\dots,\;
x^{k-1} = \begin{pmatrix} x_{1,k-1} \\ \vdots \\ x_{n,k-1} \end{pmatrix}.
\]
Let
\[
\bar{x}^1 = \frac{1}{n}\sum_{i=1}^n x_{i,1}, \;\dots,\; \bar{x}^{k-1} = \frac{1}{n}\sum_{i=1}^n x_{i,k-1}
\]
be the means of the non-intercept columns of the model matrix. That is, the vector x̄ = (x̄^1, …, x̄^{k−1})^⊤ provides the mean values of the non-intercept regressors included in the model matrix X and as such is a gravity centre of the rows of the model matrix X (with the intercept excluded).

Further, let X̃ be the non-intercept part of the model matrix X with all columns centered, i.e.,
\[
\tilde{X} = \bigl(x^1 - \bar{x}^1 1_n, \;\dots,\; x^{k-1} - \bar{x}^{k-1} 1_n\bigr)
= \begin{pmatrix} x_{1,1}-\bar{x}^1 & \dots & x_{1,k-1}-\bar{x}^{k-1} \\ \vdots & \ddots & \vdots \\ x_{n,1}-\bar{x}^1 & \dots & x_{n,k-1}-\bar{x}^{k-1} \end{pmatrix}.
\]
Clearly, M(X) = M(1_n, X̃). Hence the hat matrix H = X(X^⊤X)^− X^⊤ can also be calculated using the matrix (1_n, X̃), where we can additionally use the property 1_n^⊤X̃ = 0_{k−1}^⊤:
\[
H = (1_n, \tilde{X})\Bigl\{(1_n, \tilde{X})^\top (1_n, \tilde{X})\Bigr\}^{-} (1_n, \tilde{X})^\top,
\qquad
(1_n, \tilde{X})^\top (1_n, \tilde{X}) = \begin{pmatrix} n & 0_{k-1}^\top \\ 0_{k-1} & \tilde{X}^\top\tilde{X} \end{pmatrix},
\]
so that
\[
H = (1_n, \tilde{X})\begin{pmatrix} \frac{1}{n} & 0_{k-1}^\top \\ 0_{k-1} & \bigl(\tilde{X}^\top\tilde{X}\bigr)^{-} \end{pmatrix}(1_n, \tilde{X})^\top
= \frac{1}{n}\, 1_n 1_n^\top + \tilde{X}\bigl(\tilde{X}^\top\tilde{X}\bigr)^{-}\tilde{X}^\top.
\]
That is, the tth leverage equals
\[
h_{t,t} = \frac{1}{n} + \bigl(x_{t,1}-\bar{x}^1, \dots, x_{t,k-1}-\bar{x}^{k-1}\bigr)\bigl(\tilde{X}^\top\tilde{X}\bigr)^{-}\bigl(x_{t,1}-\bar{x}^1, \dots, x_{t,k-1}-\bar{x}^{k-1}\bigr)^\top.
\]
The second term is the square of a generalized distance between the non-intercept regressors x_{t,1}, …, x_{t,k−1} of the tth observation and the vector x̄ of mean regressors.
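This decomposition of the leverage can be illustrated numerically. The Python/NumPy sketch below (not part of the original notes; simulated full-rank model with intercept, illustrative data) checks that each leverage equals 1/n plus the squared generalized distance from the regressor mean, and that the leverages sum to r:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 12, 2                                   # p non-intercept regressors, r = p + 1
Z = rng.normal(size=(n, p))                    # non-intercept columns
X = np.column_stack([np.ones(n), Z])

H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix

Zc = Z - Z.mean(axis=0)                        # centered non-intercept regressors
G = np.linalg.inv(Zc.T @ Zc)
for t in range(n):
    d2 = Zc[t] @ G @ Zc[t]                     # squared generalized distance from the mean
    assert np.isclose(H[t, t], 1.0 / n + d2)

# sum of leverages equals tr(H) = r
assert np.isclose(np.trace(H), p + 1)
```

Observations whose regressors lie far from the gravity centre get the largest `d2`, hence the largest leverage.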
Hence observations with a high value of the leverage h_{t,t} are observations whose regressor values lie far from the mean regressor values, and in this sense they have unusual regressor (x) values.

High value of a leverage
To evaluate which values of the leverage are high enough to call a particular observation a leverage point, let us recall the expression of the hat matrix using an orthonormal basis Q of the regression space M(X), which is a vector space of dimension r = rank(X). We know that H = QQ^⊤ and hence
\[
\sum_{i=1}^n h_{i,i} = tr(H) = tr\bigl(QQ^\top\bigr) = tr\bigl(Q^\top Q\bigr) = tr(I_r) = r.
\]
That is,
\[
\bar{h} = \frac{1}{n}\sum_{i=1}^n h_{i,i} = \frac{r}{n}. \tag{14.2}
\]
Several rules of thumb can be found in the literature and in software implementations concerning a lower bound for the leverage above which a particular observation is called a leverage point. Owing to (14.2), a reasonable bound is a value suitably higher than r/n. For example, the R function influence.measures marks the tth observation as a leverage point if
\[
h_{t,t} > \frac{3r}{n}.
\]

Influence of leverage points
The fact that leverage points may constitute a problem for least-squares-based statistical inference in a linear model follows from the expression for the variance (conditional given the covariate values) of the residuals of a linear model:
\[
var(U_t) = \sigma^2 m_{t,t} = \sigma^2(1-h_{t,t}), \qquad t = 1,\dots,n.
\]
Recall that U_t = Y_t − Ŷ_t and hence also
\[
var\bigl(Y_t - \hat{Y}_t\bigr) = \sigma^2(1-h_{t,t}), \qquad t = 1,\dots,n.
\]
That is, var(U_t) = var(Y_t − Ŷ_t) is low for observations with a high leverage. In other words, the fitted values of high-leverage observations are forced to be closer to the observed response values than those of low-leverage observations. In this way, high-leverage observations have a higher impact on the fitted regression function than low-leverage observations.

End of Lecture #25 (07/01/2016)

14.4 Influential diagnostics

Both outliers and leverage points do not necessarily constitute a problem.
Start of Lecture #26 (14/01/2016)

They constitute a problem if they influence the statistical inference of primary interest too much. Also other observations (neither outliers nor leverage points) may harmfully influence the statistical inference. In this section, several methods of quantifying the influence of a particular, tth (t = 1, …, n) observation on the statistical inference will be introduced. In all cases, we will compare a quantity of primary interest based on the model at hand, i.e.,
\[
M:\; Y \mid X \sim \bigl(X\beta,\, \sigma^2 I_n\bigr), \qquad rank\bigl(X_{n\times k}\bigr) = r,
\]
with the same quantity based on the leave-one-out model
\[
M_{(-t)}:\; Y_{(-t)} \mid X_{(-t)} \sim \bigl(X_{(-t)}\beta,\, \sigma^2 I_{n-1}\bigr).
\]
It will be assumed throughout that m_{t,t} > 0, which implies (see Theorem 14.1) rank(X_{(−t)}) = rank(X) = r.

14.4.1 DFBETAS

Let r = k, i.e., both models M and M_{(−t)} are full-rank models. The LSEs of the vector of regression coefficients based on the two models are
\[
M:\; \hat{\beta} = \bigl(\hat{\beta}_0,\dots,\hat{\beta}_{k-1}\bigr)^\top = \bigl(X^\top X\bigr)^{-1} X^\top Y,
\qquad
M_{(-t)}:\; \hat{\beta}_{(-t)} = \bigl(\hat{\beta}_{(-t),0},\dots,\hat{\beta}_{(-t),k-1}\bigr)^\top = \bigl(X_{(-t)}^\top X_{(-t)}\bigr)^{-1} X_{(-t)}^\top Y_{(-t)}.
\]
Using (14.1),
\[
\hat{\beta} - \hat{\beta}_{(-t)} = \frac{U_t}{m_{t,t}}\bigl(X^\top X\bigr)^{-1} x_t, \tag{14.3}
\]
which quantifies the influence of the tth observation on the LSE of the regression coefficients.

In the following, let v_0 = (v_{0,0}, …, v_{0,k−1})^⊤, …, v_{k−1} = (v_{k−1,0}, …, v_{k−1,k−1})^⊤ be the rows of the matrix (X^⊤X)^{−1}, i.e.,
\[
\bigl(X^\top X\bigr)^{-1} = \begin{pmatrix} v_0^\top \\ \vdots \\ v_{k-1}^\top \end{pmatrix}
= \begin{pmatrix} v_{0,0} & \dots & v_{0,k-1} \\ \vdots & \ddots & \vdots \\ v_{k-1,0} & \dots & v_{k-1,k-1} \end{pmatrix}.
\]
Expression (14.3) written elementwise leads to quantities called DFBETA:
\[
DFBETA_{t,j} := \hat{\beta}_j - \hat{\beta}_{(-t),j} = \frac{U_t}{m_{t,t}}\, v_j^\top x_t, \qquad t = 1,\dots,n,\; j = 0,\dots,k-1.
\]
Note that DFBETA_{t,j} depends on the scale of the jth regressor. To get a dimensionless quantity, we can divide it by the standard error of either β̂_j or β̂_{(−t),j}. We have
\[
S.E.\bigl(\hat{\beta}_j\bigr) = \sqrt{MS_e\, v_{j,j}}, \qquad
S.E.\bigl(\hat{\beta}_{(-t),j}\bigr) = \sqrt{MS_{e,(-t)}\, v_{(-t),j,j}},
\]
where v_{(−t),j,j} is the jth diagonal element of the matrix (X_{(−t)}^⊤X_{(−t)})^{−1}. In practice, a combined quantity, namely √(MS_{e,(−t)} v_{j,j}), is used, leading to the so-called DFBETAS (the last "S" stands for
"scaled"):
\[
DFBETAS_{t,j} := \frac{\hat{\beta}_j - \hat{\beta}_{(-t),j}}{\sqrt{MS_{e,(-t)}\, v_{j,j}}}
= \frac{U_t}{m_{t,t}\sqrt{MS_{e,(-t)}\, v_{j,j}}}\; v_j^\top x_t, \qquad t = 1,\dots,n,\; j = 0,\dots,k-1.
\]
The reason for using √(MS_{e,(−t)} v_{j,j}) as the scale factor is that MS_{e,(−t)} is a safer estimator of the residual variance σ², not being based on the observation whose influence is examined, while at the same time it can still be calculated from quantities of the full model M (see Eq. 14.1). On the other hand, the value of v_{(−t),j,j} (which would fit with the leave-one-out residual mean square MS_{e,(−t)}) cannot, in general, be calculated from quantities of the full model M, and hence the (close) value v_{j,j} is used. Consequently, all values of DFBETAS can be calculated from quantities of the full model M and there is no need to fit n leave-one-out models.

Note (Rule of thumb used by R). The R function influence.measures marks the tth observation as being influential with respect to the LSE of the jth regression coefficient if
\[
\bigl| DFBETAS_{t,j} \bigr| > 1.
\]

14.4.2 DFFITS

We are assuming m_{t,t} > 0 and hence, by Theorem 14.1, the parameter μ_t := E(Y_t | X_t = x_t) = x_t^⊤β is estimable in both models M and M_{(−t)}. Let, as usual, b = (X^⊤X)^− X^⊤Y be any solution to the normal equations in model M (which is now not necessarily of full rank) and let b_{(−t)} = (X_{(−t)}^⊤X_{(−t)})^− X_{(−t)}^⊤Y_{(−t)} be any solution to the normal equations in the leave-one-out model M_{(−t)}. The LSEs of μ_t in the two models are
\[
M:\; \hat{Y}_t = x_t^\top b, \qquad M_{(-t)}:\; \hat{Y}_{[t]} = x_t^\top b_{(-t)}.
\]
Using (14.1),
\[
\hat{Y}_{[t]} = x_t^\top\Bigl\{b - \frac{U_t}{m_{t,t}}\bigl(X^\top X\bigr)^{-} x_t\Bigr\}
= \hat{Y}_t - \frac{U_t}{m_{t,t}}\, x_t^\top\bigl(X^\top X\bigr)^{-} x_t
= \hat{Y}_t - \frac{h_{t,t}}{m_{t,t}}\, U_t.
\]
The difference between Ŷ_t and Ŷ_{[t]} is called DFFIT and quantifies the influence of the tth observation on the LSE of its own expectation:
\[
DFFIT_t := \hat{Y}_t - \hat{Y}_{[t]} = \frac{h_{t,t}}{m_{t,t}}\, U_t, \qquad t = 1,\dots,n.
\]
Analogously to DFBETAS, DFFIT too is scaled by a quantity that resembles the standard error of either Ŷ_t or Ŷ_{[t]} (remember, S.E.
(Ŷ_t) = √(MS_e h_{t,t})), leading to a quantity called DFFITS:
\[
DFFITS_t := \frac{\hat{Y}_t - \hat{Y}_{[t]}}{\sqrt{MS_{e,(-t)}\, h_{t,t}}}
= \frac{h_{t,t}}{m_{t,t}}\,\frac{U_t}{\sqrt{MS_{e,(-t)}\, h_{t,t}}}
= \sqrt{\frac{h_{t,t}}{m_{t,t}}}\;\frac{U_t}{\sqrt{MS_{e,(-t)}\, m_{t,t}}}
= \sqrt{\frac{h_{t,t}}{m_{t,t}}}\; T_t, \qquad t = 1,\dots,n,
\]
where T_t is the tth studentized residual of the model M. Again, all values of DFFITS can be calculated from quantities of the full model M and there is no need to fit n leave-one-out models.

Note (Rule of thumb used by R). The R function influence.measures marks the tth observation as excessively influencing the LSE of its expectation if
\[
\bigl| DFFITS_t \bigr| > 3\sqrt{\frac{r}{n-r}}.
\]

14.4.3 Cook distance

In this section, we concentrate on evaluating the influence of the tth observation on the LSE of the vector parameter μ := E(Y | X) = Xβ. As in Section 14.4.2, let b = (X^⊤X)^− X^⊤Y be any solution to the normal equations in model M and let b_{(−t)} = (X_{(−t)}^⊤X_{(−t)})^− X_{(−t)}^⊤Y_{(−t)} be any solution to the normal equations in the leave-one-out model M_{(−t)}. The LSEs of μ in the two models are
\[
M:\; \hat{Y} = Xb = HY, \qquad M_{(-t)}:\; \hat{Y}_{(-t\bullet)} := X b_{(-t)}.
\]

Note. Remember that Ŷ_{(−t•)}, Ŷ_{[•]} and Ŷ_{(−t)} are three different quantities. Namely,
\[
\hat{Y}_{(-t\bullet)} = X b_{(-t)} = \begin{pmatrix} x_1^\top b_{(-t)} \\ \vdots \\ x_n^\top b_{(-t)} \end{pmatrix},
\qquad
\hat{Y}_{[\bullet]} = \begin{pmatrix} \hat{Y}_{[1]} \\ \vdots \\ \hat{Y}_{[n]} \end{pmatrix} = \begin{pmatrix} x_1^\top b_{(-1)} \\ \vdots \\ x_n^\top b_{(-n)} \end{pmatrix}.
\]
Finally, Ŷ_{(−t)} = X_{(−t)} b_{(−t)} is a subvector of length n − 1 of the vector Ŷ_{(−t•)} of length n.

A possible quantification of the influence of the tth observation on the LSE of the vector parameter μ is obtained by considering the quantity ‖Ŷ − Ŷ_{(−t•)}‖². Let us recall from Lemma 14.3:
\[
b - b_{(-t)} = \frac{U_t}{m_{t,t}}\bigl(X^\top X\bigr)^{-} x_t.
\]
Hence,
\[
\hat{Y} - \hat{Y}_{(-t\bullet)} = X\bigl(b - b_{(-t)}\bigr) = \frac{U_t}{m_{t,t}}\, X\bigl(X^\top X\bigr)^{-} x_t.
\]
Then
\[
\bigl\|\hat{Y} - \hat{Y}_{(-t\bullet)}\bigr\|^2
= \Bigl\|\frac{U_t}{m_{t,t}}\, X\bigl(X^\top X\bigr)^{-} x_t\Bigr\|^2
= \frac{U_t^2}{m_{t,t}^2}\; x_t^\top\bigl(X^\top X\bigr)^{-} X^\top X\bigl(X^\top X\bigr)^{-} x_t
= \frac{U_t^2}{m_{t,t}^2}\, h_{t,t}. \tag{14.4}
\]
The equality (14.4) follows from noting that
(a) x_t^⊤(X^⊤X)^− X^⊤X (X^⊤X)^− x_t is the tth diagonal element of the matrix X(X^⊤X)^− X^⊤X (X^⊤X)^− X^⊤;
(b) X(X^⊤X)^− X^⊤X (X^⊤X)^− X^⊤ = X(X^⊤X)^− X^⊤ = H by the five matrices rule (Theorem A.2).
The so-called Cook distance of the tth observation is (14.4) modified to get a unit-free quantity. Namely, the Cook distance is defined as
\[
D_t := \frac{1}{r\, MS_e}\,\bigl\|\hat{Y} - \hat{Y}_{(-t\bullet)}\bigr\|^2.
\]
Expression (14.4) shows that it is again not necessary to fit the leave-one-out model to calculate the Cook distance. Moreover, we can express it as follows:
\[
D_t = \frac{1}{r}\,\frac{h_{t,t}}{m_{t,t}}\,\frac{U_t^2}{MS_e\, m_{t,t}} = \frac{1}{r}\,\frac{h_{t,t}}{m_{t,t}}\,\bigl(U_t^{std}\bigr)^2.
\]

Notes.
• We are assuming m_{t,t} > 0. Hence h_{t,t} = 1 − m_{t,t} ∈ (0, 1) and the term h_{t,t}/m_{t,t} increases with the leverage h_{t,t} (with limit ∞ as h_{t,t} → 1). The "h_{t,t}/m_{t,t}" part of the Cook distance thus quantifies the extent to which the tth observation is a leverage point.
• The "(U_t^{std})²" part of the Cook distance increases with the distance between the observed and the fitted value, which is high for outliers.
• The Cook distance is thus a combined measure, being high for observations which are leverage points, outliers, or both.

Cook distance in a full-rank model
If r = k and both M and M_{(−t)} are of full rank, we have
\[
b = \hat{\beta} = \bigl(X^\top X\bigr)^{-1} X^\top Y, \qquad
b_{(-t)} = \hat{\beta}_{(-t)} = \bigl(X_{(-t)}^\top X_{(-t)}\bigr)^{-1} X_{(-t)}^\top Y_{(-t)}.
\]
Then, directly from the definition,
\[
\bigl\|\hat{Y} - \hat{Y}_{(-t\bullet)}\bigr\|^2 = \bigl\|X\hat{\beta} - X\hat{\beta}_{(-t)}\bigr\|^2
= \bigl(\hat{\beta}_{(-t)} - \hat{\beta}\bigr)^\top X^\top X \bigl(\hat{\beta}_{(-t)} - \hat{\beta}\bigr).
\]
The Cook distance is then
\[
D_t = \frac{\bigl(\hat{\beta}_{(-t)} - \hat{\beta}\bigr)^\top X^\top X \bigl(\hat{\beta}_{(-t)} - \hat{\beta}\bigr)}{k\, MS_e},
\]
which is a distance between β̂ and β̂_{(−t)} in a certain metric. Remember now that under normality, the confidence region for the parameter β with coverage 1 − α, derived while assuming model M, is
\[
C(\alpha) = \Bigl\{\beta:\; \bigl(\hat{\beta} - \beta\bigr)^\top X^\top X \bigl(\hat{\beta} - \beta\bigr) < k\, MS_e\, F_{k,n-k}(1-\alpha)\Bigr\}.
\]
That is,
\[
\hat{\beta}_{(-t)} \in C(\alpha) \quad\text{if and only if}\quad D_t < F_{k,n-k}(1-\alpha). \tag{14.5}
\]
This motivates the following rule of thumb.

Note (Rule of thumb used by R). The R function influence.measures marks the tth observation as excessively influencing the LSE of the full response expectation μ if
\[
D_t > F_{r,n-r}(0.50).
\]
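The shortcut formulas for DFBETA, DFFIT and the Cook distance can all be verified against a direct leave-one-out refit. A Python/NumPy sketch (not part of the original notes; simulated full-rank model, so r = k, all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
Y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T
beta = XtX_inv @ X.T @ Y
U = Y - X @ beta
MSe = (U @ U) / (n - k)

t = 7
m = 1.0 - H[t, t]

# Shortcut quantities based on the full model only
dfbeta = XtX_inv @ X[t] * U[t] / m             # beta_hat - beta_hat_(-t), Eq. (14.3)
dffit = U[t] * H[t, t] / m                     # DFFIT_t
cook = U[t] ** 2 * H[t, t] / (m ** 2 * k * MSe)  # Cook distance D_t via (14.4)

# Direct leave-one-out fit for comparison
keep = np.arange(n) != t
beta_loo, _, _, _ = np.linalg.lstsq(X[keep], Y[keep], rcond=None)
diff = X @ beta - X @ beta_loo                 # Yhat - Yhat_(-t•)

assert np.allclose(beta - beta_loo, dfbeta)
assert np.isclose(X[t] @ beta - X[t] @ beta_loo, dffit)
assert np.isclose(diff @ diff / (k * MSe), cook)
```

All three assertions pass, mirroring the statement that no leave-one-out refits are needed in practice.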
14.4.4 COVRATIO

In this section, we will again assume full-rank models (r = k) and explore the influence of the tth observation on the precision of the LSE of the vector of regression coefficients. The LSEs of the vector of regression coefficients based on the two models are
\[
M:\; \hat{\beta} = \bigl(X^\top X\bigr)^{-1} X^\top Y, \qquad
M_{(-t)}:\; \hat{\beta}_{(-t)} = \bigl(X_{(-t)}^\top X_{(-t)}\bigr)^{-1} X_{(-t)}^\top Y_{(-t)}.
\]
The estimated covariance matrices of β̂ and β̂_{(−t)}, respectively, are
\[
\widehat{var}\,\hat{\beta} = MS_e\bigl(X^\top X\bigr)^{-1}, \qquad
\widehat{var}\,\hat{\beta}_{(-t)} = MS_{e,(-t)}\bigl(X_{(-t)}^\top X_{(-t)}\bigr)^{-1}.
\]
The influence of the tth observation on the precision of the LSE of the vector of regression coefficients is quantified by the so-called COVRATIO, defined as
\[
COVRATIO_t = \frac{\det\bigl\{\widehat{var}\,\hat{\beta}_{(-t)}\bigr\}}{\det\bigl\{\widehat{var}\,\hat{\beta}\bigr\}}, \qquad t = 1,\dots,n.
\]
After some calculation (see below), it can be shown that
\[
COVRATIO_t = \frac{1}{m_{t,t}}\left(\frac{n-k-\bigl(U_t^{std}\bigr)^2}{n-k-1}\right)^{k}, \qquad t = 1,\dots,n.
\]
That is, it is again not necessary to fit n leave-one-out models to calculate the COVRATIO values for all observations in the dataset.

Note (Rule of thumb used by R). The R function influence.measures marks the tth observation as excessively influencing the precision of the estimation of the regression coefficients if
\[
\bigl| 1 - COVRATIO_t \bigr| > \frac{3k}{n-k}.
\]

Calculation towards COVRATIO
First, recall a matrix identity (e.g., Anděl, 2007, Theorem A.4): if A and D are square invertible matrices, then
\[
\begin{vmatrix} A & B \\ C & D \end{vmatrix} = |A|\cdot\bigl|D - CA^{-1}B\bigr| = |D|\cdot\bigl|A - BD^{-1}C\bigr|.
\]
Use the above identity twice:
\[
\begin{vmatrix} X^\top X & x_t \\ x_t^\top & 1 \end{vmatrix}
= \bigl|X^\top X\bigr|\cdot\Bigl(1 - \underbrace{x_t^\top\bigl(X^\top X\bigr)^{-1} x_t}_{h_{t,t}}\Bigr)
= \bigl|X^\top X\bigr|\, m_{t,t},
\]
\[
\begin{vmatrix} X^\top X & x_t \\ x_t^\top & 1 \end{vmatrix}
= |1|\cdot\bigl|X^\top X - x_t x_t^\top\bigr| = \bigl|X_{(-t)}^\top X_{(-t)}\bigr|.
\]
So that
\[
m_{t,t}\,\bigl|X^\top X\bigr| = \bigl|X_{(-t)}^\top X_{(-t)}\bigr|.
\]
Then
\[
\frac{\det\bigl\{\widehat{var}\,\hat{\beta}_{(-t)}\bigr\}}{\det\bigl\{\widehat{var}\,\hat{\beta}\bigr\}}
= \frac{\bigl|MS_{e,(-t)}\bigl(X_{(-t)}^\top X_{(-t)}\bigr)^{-1}\bigr|}{\bigl|MS_e\bigl(X^\top X\bigr)^{-1}\bigr|}
= \left(\frac{MS_{e,(-t)}}{MS_e}\right)^{k}\cdot\frac{\bigl|X^\top X\bigr|}{\bigl|X_{(-t)}^\top X_{(-t)}\bigr|}
= \left(\frac{MS_{e,(-t)}}{MS_e}\right)^{k}\cdot\frac{1}{m_{t,t}}.
\]
By expression (14.1),
\[
\frac{MS_{e,(-t)}}{MS_e} = \frac{n-k-\bigl(U_t^{std}\bigr)^2}{n-k-1}.
\]
Hence
\[
\frac{\det\bigl\{\widehat{var}\,\hat{\beta}_{(-t)}\bigr\}}{\det\bigl\{\widehat{var}\,\hat{\beta}\bigr\}}
= \frac{1}{m_{t,t}}\left(\frac{n-k-\bigl(U_t^{std}\bigr)^2}{n-k-1}\right)^{k}.
\]

14.4.5 Final remarks
• All presented influence measures should be used sensibly.
• Depending on the purpose of the modelling, different types of influence are differently harmful.
• There is certainly no need to panic if some observations are marked as "influential"!

End of Lecture #26 (14/01/2016)

Appendix A
Matrices

A.1 Pseudoinverse of a matrix

Definition A.1 Pseudoinverse of a matrix.
The pseudoinverse of a real matrix A_{n×k} is a matrix A^− of dimension k × n that satisfies
\[
A A^{-} A = A.
\]

Notes.
• The pseudoinverse always exists. Nevertheless, it is not necessarily unique.
• If A is invertible, then A^− = A^{−1} is the only pseudoinverse.

Definition A.2 Moore–Penrose pseudoinverse of a matrix.
The Moore–Penrose pseudoinverse of a real matrix A_{n×k} is a matrix A^+ of dimension k × n that satisfies the following conditions:
(i) A A^+ A = A;
(ii) A^+ A A^+ = A^+;
(iii) (A A^+)^⊤ = A A^+;
(iv) (A^+ A)^⊤ = A^+ A.

Notes.
• The Moore–Penrose pseudoinverse always exists and is unique.
• The Moore–Penrose pseudoinverse can be calculated from the singular value decomposition (SVD) of the matrix A.

Start of Lecture #2 (08/10/2015)

Theorem A.1 Pseudoinverse of a matrix and a solution of a linear system.
Let A_{n×k} be a real matrix and let c_{n×1} be a real vector. Let there exist a solution of the linear system Ax = c, i.e., let the linear system Ax = c be consistent. Let A^− be a pseudoinverse of A. Then the vector x_{k×1} = A^− c solves the linear system Ax = c.
Proof. See Anděl (2007, Appendix A.4). □

Theorem A.2 Five matrices rule.
For a real matrix A_{n×k}, it holds that
\[
A\bigl(A^\top A\bigr)^{-} A^\top A = A.
\]
That is, the matrix (A^⊤A)^− A^⊤ is a pseudoinverse of the matrix A.
Proof. See Anděl (2007, Theorem A.19). □

End of Lecture #2 (08/10/2015)

A.2 Kronecker product

Start of Lecture #7 (22/10/2015)

Definition A.3 Kronecker product.
Let A_{m×n} and C_{p×q} be real matrices. Their Kronecker product A ⊗ C is the matrix D_{mp×nq} such that
\[
D = A \otimes C = \begin{pmatrix} a_{1,1}C & \dots & a_{1,n}C \\ \vdots & \ddots & \vdots \\ a_{m,1}C & \dots & a_{m,n}C \end{pmatrix}
= \bigl(a_{i,j}C\bigr)_{i=1,\dots,m;\, j=1,\dots,n}.
\]
Note. For a ∈ R^m, b ∈ R^p, we can write a b^⊤ = a ⊗ b^⊤.

Theorem A.3 Properties of the Kronecker product.
The Kronecker product satisfies:
(i) 0 ⊗ A = 0, A ⊗ 0 = 0;
(ii) (A₁ + A₂) ⊗ C = (A₁ ⊗ C) + (A₂ ⊗ C);
(iii) A ⊗ (C₁ + C₂) = (A ⊗ C₁) + (A ⊗ C₂);
(iv) aA ⊗ cC = a c (A ⊗ C);
(v) A₁A₂ ⊗ C₁C₂ = (A₁ ⊗ C₁)(A₂ ⊗ C₂);
(vi) (A ⊗ C)^{−1} = A^{−1} ⊗ C^{−1}, if the inverses exist;
(vii) (A ⊗ C)^− = A^− ⊗ C^−, for arbitrary pseudoinverses;
(viii) (A ⊗ C)^⊤ = A^⊤ ⊗ C^⊤;
(ix) (A, C) ⊗ D = (A ⊗ D, C ⊗ D);
(x) upon a suitable reordering of the columns, the matrices (A ⊗ C, A ⊗ D) and A ⊗ (C, D) are the same;
(xi) rank(A ⊗ C) = rank(A) rank(C).
Proof. See Rao (1973, Section 1b.8). □

End of Lecture #7 (22/10/2015)

Start of Lecture #11 (05/11/2015)

Definition A.4 Elementwise product of two vectors.
Let a = (a₁, …, a_p)^⊤ ∈ R^p and c = (c₁, …, c_p)^⊤ ∈ R^p. Their elementwise product (Czech: součin po složkách) is the vector (a₁c₁, …, a_p c_p)^⊤, which will be denoted as a : c. That is,
\[
a : c = \begin{pmatrix} a_1 c_1 \\ \vdots \\ a_p c_p \end{pmatrix}.
\]

Definition A.5 Columnwise product of two matrices.
Let A_{n×p} = (a₁, …, a_p) and C_{n×q} = (c₁, …, c_q) be real matrices. Their columnwise product (Czech: součin po sloupcích) A : C is the matrix D_{n×pq} such that
\[
D = A : C = \bigl(a_1 : c_1,\; \dots,\; a_p : c_1,\; \dots,\; a_1 : c_q,\; \dots,\; a_p : c_q\bigr).
\]

Notes.
• If we write
\[
A = \begin{pmatrix} a_1^\top \\ \vdots \\ a_n^\top \end{pmatrix}, \qquad
C = \begin{pmatrix} c_1^\top \\ \vdots \\ c_n^\top \end{pmatrix}
\]
(with a_i^⊤ and c_i^⊤ now denoting rows), the columnwise product of two matrices can also be written as the matrix whose rows are obtained as Kronecker products of the rows of the two matrices:
\[
A : C = \begin{pmatrix} c_1^\top \otimes a_1^\top \\ \vdots \\ c_n^\top \otimes a_n^\top \end{pmatrix}. \tag{A.1}
\]
• It would perhaps look more logical to define the columnwise product of two matrices as
\[
A : C = \begin{pmatrix} a_1^\top \otimes c_1^\top \\ \vdots \\ a_n^\top \otimes c_n^\top \end{pmatrix}
= \bigl(a_1 : c_1, \dots, a_1 : c_q, \dots, a_p : c_1, \dots, a_p : c_q\bigr),
\]
which differs only by the ordering of the columns of the resulting matrix. Our definition (A.1) is motivated by the way in which the operator : acts in the R software.

End of Lecture #11 (05/11/2015)
A.3 Additional theorems on matrices

Theorem A.4 Inverse of a matrix divided into blocks.
Let
\[
M = \begin{pmatrix} A & B \\ B^\top & D \end{pmatrix}
\]
be a positive definite matrix divided into blocks A, B, D. Then the following holds:
(i) The matrix Q = A − BD^{−1}B^⊤ is positive definite.
(ii) The matrix P = D − B^⊤A^{−1}B is positive definite.
(iii) The inverse of M is
\[
M^{-1} = \begin{pmatrix} Q^{-1} & -Q^{-1}BD^{-1} \\ -D^{-1}B^\top Q^{-1} & D^{-1} + D^{-1}B^\top Q^{-1}BD^{-1} \end{pmatrix}
= \begin{pmatrix} A^{-1} + A^{-1}BP^{-1}B^\top A^{-1} & -A^{-1}BP^{-1} \\ -P^{-1}B^\top A^{-1} & P^{-1} \end{pmatrix}.
\]
Proof. See Anděl (2007, Theorem A.10 in Appendix A.2). □

Appendix B
Distributions

Start of Lecture #6 (22/10/2015)

B.1 Non-central univariate distributions

Definition B.1 Non-central Student t-distribution.
Let U ∼ N(0, 1), let V ∼ χ²_ν for some ν > 0, and let U and V be independent. Let λ ∈ R. Then we say that the random variable
\[
T = \frac{U + \lambda}{\sqrt{V/\nu}}
\]
follows a non-central Student t-distribution (Czech: necentrální Studentovo t-rozdělení) with ν degrees of freedom (stupně volnosti) and non-centrality parameter (parametr necentrality) λ. We shall write T ∼ t_ν(λ).

Notes.
• A non-central t-distribution is different from a merely shifted (central) t-distribution.
• Directly seen from the definition: t_ν(0) ≡ t_ν.
• Moments of a non-central Student t-distribution:
\[
E(T) = \begin{cases} \lambda\sqrt{\dfrac{\nu}{2}}\,\dfrac{\Gamma\bigl(\frac{\nu-1}{2}\bigr)}{\Gamma\bigl(\frac{\nu}{2}\bigr)}, & \nu > 1,\\[1ex] \text{does not exist}, & \nu \le 1; \end{cases}
\qquad
var(T) = \begin{cases} \dfrac{\nu(1+\lambda^2)}{\nu-2} - \dfrac{\nu\lambda^2}{2}\left(\dfrac{\Gamma\bigl(\frac{\nu-1}{2}\bigr)}{\Gamma\bigl(\frac{\nu}{2}\bigr)}\right)^{2}, & \nu > 2,\\[1ex] \text{does not exist}, & \nu \le 2. \end{cases}
\]

Definition B.2 Non-central χ² distribution.
Let U₁, …, U_k be independent random variables with U_i ∼ N(μ_i, 1), i = 1, …, k, for some μ₁, …, μ_k ∈ R. That is, U = (U₁, …, U_k)^⊤ ∼ N_k(μ, I_k), where μ = (μ₁, …, μ_k)^⊤. Then we say that the random variable
\[
X = \sum_{i=1}^{k} U_i^2 = \|U\|^2
\]
follows a non-central chi-squared distribution (Czech: necentrální chí-kvadrát rozdělení) with k degrees of freedom and non-centrality parameter
\[
\lambda = \sum_{i=1}^{k} \mu_i^2 = \|\mu\|^2.
\]
We shall write X ∼ χ²_k(λ).

Notes.
• It can easily be proved that the distribution of the random variable X from Definition B.2 indeed depends only on k and λ = Σ_{i=1}^{k} μ_i², and not on the particular values of μ₁, …, μ_k.
• As an exercise in the use of the convolution theorem, we can derive the density of the χ²_k(λ) distribution, which is
\[
f(x) = \begin{cases}
\dfrac{1}{\Gamma\bigl(\frac12\bigr)\, 2^{\frac{k}{2}}\, \Gamma\bigl(\frac{k-1}{2}\bigr)}\; e^{-\frac{x+\lambda}{2}}\, x^{\frac{k-2}{2}} \displaystyle\sum_{j=0}^{\infty} \frac{\lambda^{j} x^{j}}{(2j)!}\, B\Bigl(\frac{k-1}{2},\, \frac12 + j\Bigr), & x > 0,\\[1ex]
0, & x \le 0.
\end{cases}
\]
• The non-central χ² distribution with general degrees of freedom ν ∈ (0, ∞) is defined as the distribution with the density given by the above expression with k replaced by ν.
• χ²_ν(0) ≡ χ²_ν.
• Moments of a non-central χ² distribution:
\[
E(X) = \nu + \lambda, \qquad var(X) = 2(\nu + 2\lambda).
\]

Definition B.3 Non-central F-distribution.
Let X ∼ χ²_{ν₁}(λ), where ν₁, λ > 0. Let Y ∼ χ²_{ν₂}, where ν₂ > 0. Let further X and Y be independent. Then we say that the random variable
\[
Q = \frac{X/\nu_1}{Y/\nu_2}
\]
follows a non-central F-distribution (Czech: necentrální F-rozdělení) with ν₁ and ν₂ degrees of freedom and non-centrality parameter λ. We shall write Q ∼ F_{ν₁,ν₂}(λ).

Notes.
• Directly seen from the definition: F_{ν₁,ν₂}(0) ≡ F_{ν₁,ν₂}.
• Moments of a non-central F-distribution:
\[
E(Q) = \begin{cases} \dfrac{\nu_2(\nu_1+\lambda)}{\nu_1(\nu_2-2)}, & \nu_2 > 2,\\[1ex] \text{does not exist}, & \nu_2 \le 2; \end{cases}
\qquad
var(Q) = \begin{cases} 2\left(\dfrac{\nu_2}{\nu_1}\right)^{2}\dfrac{(\nu_1+\lambda)^2 + (\nu_1+2\lambda)(\nu_2-2)}{(\nu_2-2)^2(\nu_2-4)}, & \nu_2 > 4,\\[1ex] \text{does not exist}, & \nu_2 \le 4. \end{cases}
\]

End of Lecture #6 (22/10/2015)

B.2 Multivariate distributions

Start of Lecture #4 (15/10/2015)

Definition B.4 Multivariate Student t-distribution.
Let U ∼ N_n(0_n, Σ), where Σ_{n×n} is a positive semidefinite matrix. Let further V ∼ χ²_ν for some ν > 0 and let U and V be independent. Then we say that the random vector
\[
T = U\sqrt{\frac{\nu}{V}}
\]
follows an n-dimensional multivariate Student t-distribution (Czech: vícerozměrné Studentovo t-rozdělení) with ν degrees of freedom and scale matrix (měřítková matice) Σ. We shall write T ∼ mvt_{n,ν}(Σ).

Notes.
• Directly seen from the definition: mvt_{1,ν}(1) ≡ t_ν.
• If Σ is a regular (positive definite) matrix, then the density of the mvt_{n,ν}(Σ) distribution is
\[
f(t) = \frac{\Gamma\bigl(\frac{\nu+n}{2}\bigr)}{\Gamma\bigl(\frac{\nu}{2}\bigr)\,(\nu\pi)^{\frac{n}{2}}}\,\bigl|\Sigma\bigr|^{-\frac12}\left(1 + \frac{t^\top\Sigma^{-1}t}{\nu}\right)^{-\frac{\nu+n}{2}}, \qquad t \in \mathbb{R}^n.
\]
• The expectation and covariance matrix of T ∼ mvt_{n,ν}(Σ) are
\[
E(T) = \begin{cases} 0_n, & \nu > 1,\\ \text{does not exist}, & \nu \le 1; \end{cases}
\qquad
var(T) = \begin{cases} \dfrac{\nu}{\nu-2}\,\Sigma, & \nu > 2,\\ \text{does not exist}, & \nu \le 2. \end{cases}
\]

Lemma B.1 Marginals of the multivariate Student t-distribution.
Let T = (T₁, …, T_n)^⊤ ∼ mvt_{n,ν}(Σ), where the scale matrix Σ has positive diagonal elements σ₁² > 0, …, σ_n² > 0. Then
\[
\frac{T_j}{\sigma_j} \sim t_\nu, \qquad j = 1, \dots, n.
\]
Proof.
• From the definition of the multivariate t-distribution, T can be written as T = U√(ν/V), where U = (U₁, …, U_n)^⊤ ∼ N_n(0_n, Σ) and V ∼ χ²_ν are independent.
• Then for all j = 1, …, n,
\[
\frac{T_j}{\sigma_j} = \frac{U_j}{\sigma_j}\sqrt{\frac{\nu}{V}} = \frac{Z_j}{\sqrt{V/\nu}},
\]
where Z_j ∼ N(0, 1) is independent of V ∼ χ²_ν. □

End of Lecture #4 (15/10/2015)

Appendix C
Asymptotic Theorems

Start of Lecture #22 (17/12/2015)

Theorem C.1 Strong law of large numbers (SLLN) for i.n.n.i.d. random variables.
Let Z₁, Z₂, … be a sequence of independent, not necessarily identically distributed (i.n.n.i.d.) random variables. Let E(Z_i) = μ_i, var(Z_i) = σ_i², i = 1, 2, …. Let
\[
\sum_{i=1}^{\infty} \frac{\sigma_i^2}{i^2} < \infty.
\]
Then
\[
\frac{1}{n}\sum_{i=1}^{n}(Z_i - \mu_i) \xrightarrow{a.s.} 0 \quad\text{as } n \to \infty.
\]
Proof. See the Probability and Mathematical Statistics (NMSA202) lecture (2nd year of the Bc. study programme). □

Theorem C.2 Strong law of large numbers (SLLN) for i.i.d. random variables.
Let Z₁, Z₂, … be a sequence of independent identically distributed (i.i.d.) random variables. Then
\[
\frac{1}{n}\sum_{i=1}^{n} Z_i \xrightarrow{a.s.} \mu \quad\text{as } n \to \infty
\]
for some μ ∈ R if and only if E|Z₁| < ∞, in which case μ = E(Z₁).
Proof. See the Probability and Mathematical Statistics (NMSA202) lecture (2nd year of the Bc. study programme). □

Theorem C.3 Central limit theorem (CLT), Lyapunov.
Let Z₁, Z₂, … be a sequence of i.n.n.i.d.
random variables with
\[
E(Z_i) = \mu_i, \qquad \infty > var(Z_i) = \sigma_i^2 > 0, \qquad i = 1, 2, \dots
\]
Let for some δ > 0
\[
\frac{\sum_{i=1}^{n} E\,\bigl|Z_i - \mu_i\bigr|^{2+\delta}}{\Bigl(\sum_{i=1}^{n}\sigma_i^2\Bigr)^{\frac{2+\delta}{2}}} \longrightarrow 0 \quad\text{as } n \to \infty.
\]
Then
\[
\frac{\sum_{i=1}^{n}(Z_i - \mu_i)}{\sqrt{\sum_{i=1}^{n}\sigma_i^2}} \xrightarrow{D} N(0, 1) \quad\text{as } n \to \infty.
\]
Proof. See the Probability Theory 1 (NMSA333) lecture (3rd year of the Bc. study programme). □

Theorem C.4 Central limit theorem (CLT), i.i.d.
Let Z₁, Z₂, … be a sequence of i.i.d. random variables with
\[
E(Z_i) = \mu, \qquad \infty > var(Z_i) = \sigma^2 > 0, \qquad i = 1, 2, \dots
\]
Let Z̄_n = (1/n)Σ_{i=1}^{n} Z_i. Then
\[
\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^{n}(Z_i - \mu) \xrightarrow{D} N(0, 1), \qquad
\sqrt{n}\,\bigl(\bar{Z}_n - \mu\bigr) \xrightarrow{D} N(0, \sigma^2) \quad\text{as } n \to \infty.
\]
Proof. See the Probability Theory 1 (NMSA333) lecture (3rd year of the Bc. study programme). □

Theorem C.5 Central limit theorem (CLT), i.i.d. multivariate.
Let Z₁, Z₂, … be a sequence of i.i.d. p-dimensional random vectors with
\[
E(Z_i) = \mu, \qquad var(Z_i) = \Sigma, \qquad i = 1, 2, \dots,
\]
where Σ is a real positive semidefinite matrix. Let Z̄_n = (1/n)Σ_{i=1}^{n} Z_i. Then
\[
\sqrt{n}\,\bigl(\bar{Z}_n - \mu\bigr) \xrightarrow{D} N_p(0_p, \Sigma).
\]
If Σ is positive definite, then also
\[
\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\Sigma^{-1/2}\bigl(Z_i - \mu\bigr) \xrightarrow{D} N_p(0_p, I_p).
\]
Proof. See the Probability Theory 1 (NMSA333) lecture (3rd year of the Bc. study programme). □

Theorem C.6 Cramér–Wold.
Let Z₁, Z₂, … be a sequence of p-dimensional random vectors and let Z be a p-dimensional random vector. Then
\[
Z_n \xrightarrow{D} Z \quad\text{as } n \to \infty
\]
if and only if for all l ∈ R^p
\[
l^\top Z_n \xrightarrow{D} l^\top Z \quad\text{as } n \to \infty.
\]
Proof. See the Probability Theory 1 (NMSA333) lecture (3rd year of the Bc. study programme). □

Theorem C.7 Cramér–Slutsky.
Let Z₁, Z₂, … be a sequence of random vectors such that Z_n →_D Z as n → ∞, where Z is a random vector. Let S₁, S₂, … be a sequence of random variables such that S_n →_P S as n → ∞, where S ∈ R is a real constant. Then
(i) S_n Z_n →_D S Z as n → ∞;
(ii) (1/S_n) Z_n →_D (1/S) Z as n → ∞, if S ≠ 0.
Proof. See the Probability Theory 1 (NMSA333) lecture (3rd year of the Bc. study programme). See also Shao (2003, Theorem 1.11 in Section 1.5). □

End of Lecture #22 (17/12/2015)

Bibliography

Anděl, J. (2007).
Základy matematické statistiky. Matfyzpress, Praha. ISBN 80-7378-001-1.

Bartlett, M. S. (1937). Properties of sufficiency and statistical tests. Proceedings of the Royal Society of London, Series A, Mathematical and Physical Sciences, 160(901), 268–282. doi: 10.1098/rspa.1937.0109.

Breusch, T. S. and Pagan, A. R. (1979). A simple test for heteroscedasticity and random coefficient variation. Econometrica, 47(5), 1287–1294. doi: 10.2307/1911963.

Brown, M. B. and Forsythe, A. B. (1974). Robust tests for the equality of variances. Journal of the American Statistical Association, 69(346), 364–367. doi: 10.1080/01621459.1974.10482955.

Cipra, T. (2008). Finanční ekonometrie. Ekopress, Praha. ISBN 978-80-86929-43-9.

Cook, R. D. and Weisberg, S. (1983). Diagnostics for heteroscedasticity in regression. Biometrika, 70(1), 1–10. doi: 10.1093/biomet/70.1.1.

Cribari-Neto, F. (2004). Asymptotic inference under heteroskedasticity of unknown form. Computational Statistics and Data Analysis, 45(2), 215–233. doi: 10.1016/S0167-9473(02)00366-3.

de Boor, C. (1978). A Practical Guide to Splines. Springer, New York. ISBN 0-387-90356-9.

de Boor, C. (2001). A Practical Guide to Splines. Springer-Verlag, New York, Revised edition. ISBN 0-387-95366-3.

Dierckx, P. (1993). Curve and Surface Fitting with Splines. Clarendon, Oxford. ISBN 0-19-853440-X.

Draper, N. R. and Smith, H. (1998). Applied Regression Analysis. John Wiley & Sons, New York, Third edition. ISBN 0-471-17082-8.

Durbin, J. and Watson, G. S. (1950). Testing for serial correlation in least squares regression I. Biometrika, 37, 409–428.

Durbin, J. and Watson, G. S. (1951). Testing for serial correlation in least squares regression II. Biometrika, 38(1/2), 159–177. doi: 10.2307/2332325.

Durbin, J. and Watson, G. S. (1971). Testing for serial correlation in least squares regression III. Biometrika, 58(1), 1–19. doi: 10.2307/2334313.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife.
The Annals of Statistics, 7(1), 1–26. doi: 10.1214/aos/1176344552.

Eilers, P. H. C. and Marx, B. D. (1996). Flexible smoothing with B-splines and penalties (with Discussion). Statistical Science, 11(1), 89–121. doi: 10.1214/ss/1038425655.

Farebrother, R. W. (1980). Algorithm AS 153: Pan's procedure for the tail probabilities of the Durbin-Watson statistics. Applied Statistics, 29(2), 224–227.

Farebrother, R. W. (1984). Remark AS R53: A remark on algorithm AS 106, AS 153, AS 155: The distribution of a linear combination of χ² random variables. Applied Statistics, 33, 366–369.

Fligner, M. A. and Killeen, T. J. (1976). Distribution-free two-sample tests for scale. Journal of the American Statistical Association, 71(353), 210–213. doi: 10.2307/2285771.

Fox, J. and Monette, G. (1992). Generalized collinearity diagnostics. Journal of the American Statistical Association, 87(417), 178–183. doi: 10.1080/01621459.1992.10475190.

Genz, A. and Bretz, F. (2009). Computation of Multivariate Normal and t Probabilities. Springer-Verlag, New York. ISBN 978-3-642-01688-2.

Goldfeld, S. M. and Quandt, R. E. (1965). Some tests for homoscedasticity. Journal of the American Statistical Association, 60(310), 539–547. doi: 10.1080/01621459.1965.10480811.

Hayter, A. J. (1984). A proof of the conjecture that the Tukey-Kramer multiple comparisons procedure is conservative. The Annals of Statistics, 12(1), 61–75. doi: 10.1214/aos/1176346392.

Hothorn, T., Bretz, F., and Westfall, P. (2008). Simultaneous inference in general parametric models. Biometrical Journal, 50(3), 346–363. doi: 10.1002/bimj.200810425.

Hothorn, T., Bretz, F., and Westfall, P. (2011). Multiple Comparisons Using R. Chapman & Hall/CRC, Boca Raton. ISBN 978-1-5848-8574-0.

Khuri, A. I. (2010). Linear Model Methodology. Chapman & Hall/CRC, Boca Raton. ISBN 978-1-58488-481-1.

Koenker, R. (1981). A note on studentizing a test for heteroscedasticity. Journal of Econometrics, 17(1), 107–112.
doi: 10.1016/0304-4076(81)90062-2. Kramer, C. Y. (1956). Extension of multiple range tests to group means with unequal numbers of replications. Biometrics, 12(3), 307–310. doi: 10.2307/3001469. Levene, H. (1960). Robust tests for equality of variances. In Olkin, I., Ghurye, S. G., Hoeffding, W., Madow, W. G., and Mann, H. B., editors, Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, pages 278–292. Stanford University Press, Standord. Long, J. S. and Ervin, L. H. (2000). Using heteroscedasticity consistent standard errors in the linear regression model. The American Statistician, 54(3), 217–224. doi: 10.2307/2685594. MacKinnon, J. G. and White, H. (1985). Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties. Journal of Econometrics, 29(3), 305–325. doi: 10.1016/0304-4076(85)90158-7. R Core Team (2015). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/. Rao, C. R. (1973). Linear Statistical Inference and its Applications. John Wiley & Sons, New York, Second edition. ISBN 0-471-21875-8. BIBLIOGRAPHY 301 Searle, S. R. (1987). Linear Models for Unbalanced Data. John Wiley & Sons, New York. ISBN 0-471-84096-3. Seber, G. A. F. and Lee, A. J. (2003). Linear Regression Analysis. John Wiley & Sons, New York, Second edition. ISBN 978-0-47141-540-4. Shao, J. (2003). Mathematical Statistics. Springer Science+Business Media, New York, Second edition. ISBN 0-387-95382-5. Sun, J. (2003). Mathematical Statistics. Springer Science+Business Media, New York, Second edition. ISBN 0-387-95382-5. Tukey, J. W. (1949). Comparing individual means in the Analysis of variance. Biometrics, 5(2), 99–114. doi: 10.2307/3001913. Tukey, J. W. (1953). The problem of multiple comparisons (originally unpublished manuscript). In Braun, H. I., editor, The Collected Works of John W. Tukey, volume 8, 1994. Chapman & Hall, New York. 
Weisberg, S. (2005). Applied Linear Regression. John Wiley & Sons, Hoboken, Third edition. ISBN 0-471-66379-4. White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817–838. doi: 10.2307/1912934. Zeileis, A. (2004). Econometric computing with HC and HAC covariance matrix estimators. Journal of Statistical Software, 11(10), 1–17. URL http://www.jstatsoft.org/v11/i10/. Zvára, K. (1989). Regresní analýza. Academia, Praha. ISBN 80-200-0125-5. Zvára, K. (2008). Regrese. Matfyzpress, Praha. ISBN 978-80-7378-041-8.