Department of Probability and Mathematical Statistics

NMSA407 Linear Regression
Course Notes
2015–16

Arnošt Komárek
These course notes contain an overview of notation, definitions, theorems and comments
covered by the course “NMSA407 Linear Regression”, which is a part of the curriculum
of the Master’s programs “Probability, Mathematical Statistics and Econometrics” and
“Financial and Insurance Mathematics”.
This document undergoes continuing development.
This version is dated January 14, 2016.
Arnošt Komárek
[email protected]
On Řečička, in Karlín, in Zvůle from May 2015, partially based on lecture overheads
used in fall 2013 and 2014.
Contents

1 Linear Model
  1.1 Regression analysis
    1.1.1 Basic setup
    1.1.2 Probabilistic model for the data
  1.2 Linear model: Basics
    1.2.1 Definition of a linear model
    1.2.2 Definition of a linear model using the error terms
    1.2.3 Rank of the model
    1.2.4 Independent observations
    1.2.5 Linear model with i.i.d. errors
    1.2.6 Regression function
    1.2.7 Transformations of covariates
    1.2.8 Linear model with intercept
    1.2.9 Interpretation of regression coefficients
    1.2.10 Fixed or random covariates
    1.2.11 Limitations of a linear model

2 Least Squares Estimation
  2.1 Regression and residual space, projections
    2.1.1 Regression and residual space
    2.1.2 Projections
  2.2 Fitted values, residuals, Gauss–Markov theorem
  2.3 Normal equations
  2.4 Estimable parameters
  2.5 Parameterizations of a linear model
    2.5.1 Equivalent linear models
    2.5.2 Full-rank parameterization of a linear model
  2.6 Matrix algebra and a method of least squares
    2.6.1 QR decomposition
    2.6.2 SVD decomposition

3 Normal Linear Model
  3.1 Normal linear model
  3.2 Properties of the least squares estimators under the normality
    3.2.1 Statistical inference in a full-rank normal linear model
    3.2.2 Statistical inference in a general rank normal linear model
  3.3 Confidence interval for the model based mean, prediction interval
  3.4 Distribution of the linear hypotheses test statistics under the alternative

4 Basic Regression Diagnostics
  4.1 (Normal) linear model assumptions
  4.2 Standardized residuals
  4.3 Graphical tools of regression diagnostics
    4.3.1 (A1) Correctness of the regression function
    4.3.2 (A2) Homoscedasticity of the errors
    4.3.3 (A3) Uncorrelated errors
    4.3.4 (A4) Normality

5 Submodels
  5.1 Submodel
    5.1.1 Projection considerations
    5.1.2 Properties of submodel related quantities
    5.1.3 Series of submodels
    5.1.4 Statistical test to compare nested models
  5.2 Omitting some covariates
  5.3 Linear constraints
    5.3.1 F-statistic to verify a set of linear constraints
    5.3.2 t-statistic to verify a linear constraint
  5.4 Coefficient of determination
    5.4.1 Intercept only model
    5.4.2 Models with intercept
    5.4.3 Evaluation of a prediction quality of the model
    5.4.4 Coefficient of determination
    5.4.5 Overall F-test

6 General Linear Model

7 Parameterizations of Covariates
  7.1 Linearization of the dependence of the response on the covariates
  7.2 Parameterization of a single covariate
    7.2.1 Parameterization
    7.2.2 Covariate types
  7.3 Numeric covariate
    7.3.1 Simple transformation of the covariate
    7.3.2 Raw polynomials
    7.3.3 Orthonormal polynomials
    7.3.4 Regression splines
  7.4 Categorical covariate
    7.4.1 Link to a G-sample problem
    7.4.2 Linear model parameterization of one-way classified group means
    7.4.3 ANOVA parameterization of one-way classified group means
    7.4.4 Full-rank parameterization of one-way classified group means

8
  8.1 Additivity and partial effect of a covariate
    8.1.1 Additivity
    8.1.2 Partial effect of a covariate
    8.1.3 Additivity, partial covariate effect and conditional independence
  8.2 Additivity of the effect of a numeric covariate
    8.2.1 Partial effect of a numeric covariate
  8.3 Additivity of the effect of a categorical covariate
    8.3.1 Partial effects of a categorical covariate
    8.3.2 Interpretation of the regression coefficients
  8.4 Effect modification and interactions
    8.4.1 Effect modification
    8.4.2 Interactions
    8.4.3 Interactions with the regression spline
    8.4.4 Linear model with interactions
    8.4.5 Rank of the interaction model
  8.5 Interaction of two numeric covariates
    8.5.1 Mutual effect modification
    8.5.2 Mutual effect modification with regression splines
  8.6 Interaction of a categorical and a numeric covariate
    8.6.1 Categorical effect modification
    8.6.2 Categorical effect modification with regression splines
  8.7 Interaction of two categorical covariates
    8.7.1 Linear model parameterization of two-way classified group means
    8.7.2 ANOVA parameterization of two-way classified group means
    8.7.3 Full-rank parameterization of two-way classified group means
    8.7.4 Relationship between the full-rank and ANOVA parameterizations
    8.7.5 Additive model
    8.7.6 Interpretation of model parameters for selected choices of (pseudo)contrasts
  8.8 Hierarchically well-formulated models, ANOVA tables
    8.8.1 Model terms
    8.8.2 Model formula
    8.8.3 Hierarchically well formulated model
    8.8.4 ANOVA tables

9 Analysis of Variance
  9.1 One-way classification
    9.1.1 Parameters of interest
    9.1.2 One-way ANOVA model
    9.1.3 Least squares estimation
    9.1.4 Within and between groups sums of squares, ANOVA F-test
  9.2 Two-way classification
    9.2.1 Parameters of interest
    9.2.2 Two-way ANOVA models
    9.2.3 Least squares estimation
    9.2.4 Sums of squares and ANOVA tables with balanced data

10 Checking Model Assumptions
  10.1 Model with added regressors
  10.2 Correct regression function
    10.2.1 Partial residuals
    10.2.2 Test for linearity of the effect
  10.3 Homoscedasticity
    10.3.1 Tests of homoscedasticity
    10.3.2 Score tests of homoscedasticity
    10.3.3 Some other tests of homoscedasticity
  10.4 Normality
    10.4.1 Tests of normality
  10.5 Uncorrelated errors
    10.5.1 Durbin-Watson test
  10.6 Transformation of response
    10.6.1 Prediction based on a model with transformed response
    10.6.2 Log-normal model

11 Consequences of a Problematic Regression Space
  11.1 Multicollinearity
    11.1.1 Singular value decomposition of a model matrix
    11.1.2 Multicollinearity and its impact on precision of the LSE
    11.1.3 Variance inflation factor and tolerance
    11.1.4 Basic treatment of multicollinearity
  11.2 Misspecified regression space
    11.2.1 Omitted and irrelevant regressors
    11.2.2 Prediction quality of the fitted model
    11.2.3 Omitted regressors
    11.2.4 Irrelevant regressors
    11.2.5 Summary

12 Simultaneous Inference in a Linear Model
  12.1 Basic simultaneous inference
  12.2 Multiple comparison procedures
    12.2.1 Multiple testing
    12.2.2 Simultaneous confidence intervals
    12.2.3 Multiple comparison procedure, P-values adjusted for multiple comparison
    12.2.4 Bonferroni simultaneous inference in a normal linear model
  12.3 Tukey’s T-procedure
    12.3.1 Tukey’s pairwise comparisons theorem
    12.3.2 Tukey’s honest significance differences (HSD)
    12.3.3 Tukey’s HSD in a linear model
  12.4 Hothorn-Bretz-Westfall procedure
    12.4.1 Max-abs-t distribution
    12.4.2 General multiple comparison procedure for a linear model
  12.5 Confidence band for the regression function

13 Asymptotic Properties of the LSE and Sandwich Estimator
  13.1 Assumptions and setup
  13.2 Consistency of LSE
  13.3 Asymptotic normality of LSE under homoscedasticity
    13.3.1 Asymptotic validity of the classical inference under homoscedasticity but non-normality
  13.4 Asymptotic normality of LSE under heteroscedasticity
    13.4.1 Heteroscedasticity consistent asymptotic inference

14 Unusual Observations
  14.1 Leave-one-out and outlier model
  14.2 Outliers
  14.3 Leverage points
  14.4 Influential diagnostics
    14.4.1 DFBETAS
    14.4.2 DFFITS
    14.4.3 Cook distance
    14.4.4 COVRATIO
    14.4.5 Final remarks

A Matrices
  A.1 Pseudoinverse of a matrix
  A.2 Kronecker product
  A.3 Additional theorems on matrices

B Distributions
  B.1 Non-central univariate distributions
  B.2 Multivariate distributions

C Asymptotic Theorems

Bibliography
Preface
• R software (R Core Team, 2015).
• Basic literature: Khuri (2010); Zvára (2008).
• Supplementary literature: Seber and Lee (2003); Draper and Smith (1998); Sun (2003); Weisberg
(2005); Anděl (2007); Cipra (2008); Zvára (1989).
Notation and general conventions

• Vectors are understood as column vectors (matrices with one column). Nevertheless, to simplify notation, standalone vectors will also be written as rows, e.g.,
\[
\boldsymbol{z} = (z_1, \dots, z_p).
\]
The transposition symbol (⊤) will only be used if it is necessary to highlight the fact that the vector must be used as a row vector (a matrix with one row). This mainly points to situations when

(i) vectors are used within matrix multiplication, e.g.,
\[
\mu = \boldsymbol{z}^\top \boldsymbol{\gamma},
\qquad
\boldsymbol{z} = (z_1, \dots, z_p), \quad \boldsymbol{\gamma} = (\gamma_1, \dots, \gamma_p),
\]
means that μ is a product of a row vector z⊤ and a column vector γ (i.e., a scalar product of the two vectors);

(ii) vectors are used to indicate rows of a matrix, e.g.,
\[
\mathsf{Z} = \begin{pmatrix} \boldsymbol{Z}_1^\top \\ \vdots \\ \boldsymbol{Z}_n^\top \end{pmatrix},
\qquad
\boldsymbol{Z}_i = (Z_{i,1}, \dots, Z_{i,p}), \quad i = 1, \dots, n,
\]
means that the matrix Z has rows given by the (row) vectors Z₁⊤, …, Zₙ⊤, i.e.,
\[
\mathsf{Z} = \begin{pmatrix}
Z_{1,1} & \dots & Z_{1,p} \\
\vdots & \ddots & \vdots \\
Z_{n,1} & \dots & Z_{n,p}
\end{pmatrix}.
\]
• Statements concerning equalities between two random quantities are understood as equalities almost surely, even if “almost surely” is not explicitly stated.
Chapter 1
Linear Model
1.1 Regression analysis

(Start of Lecture #1: 05/10/2015)

Linear regression¹ is a basic method of so-called regression analysis², which covers a variety of methods to model how the distribution of one variable depends on one or more other variables. A principal tool of linear regression is then the so-called linear model³, which will be a main topic of this lecture.
1.1.1 Basic setup

Most methods of regression analysis assume that the data can be represented by n random vectors (Yᵢ, Xᵢ), i = 1, …, n, where Xᵢ = (X_{i,0}, …, X_{i,k−1}), i = 1, …, n, all have the same number k < n of components. This will also be a basic assumption used throughout the whole lecture.

Notation and terminology (Response, covariates, response vector, model matrix).
• Yᵢ is called the response⁴ or the dependent variable⁵.
• The components of Xᵢ are called covariates⁶, explanatory variables⁷, predictors⁸, or independent variables⁹.
• The sample space¹⁰ of the covariates will be denoted X. That is, X ⊆ Rᵏ and, among other things, P(Xᵢ ∈ X) = 1, i = 1, …, n.
Further, let
\[
\boldsymbol{Y} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix},
\qquad
\mathsf{X} = \begin{pmatrix} \boldsymbol{X}_1^\top \\ \vdots \\ \boldsymbol{X}_n^\top \end{pmatrix}
= \begin{pmatrix}
X_{1,0} & \dots & X_{1,k-1} \\
\vdots & \ddots & \vdots \\
X_{n,0} & \dots & X_{n,k-1}
\end{pmatrix}.
\]
• The vector Y is called the response vector¹¹.
• The n × k matrix X is called the model matrix¹² or the regression matrix¹³.
Footnotes (Czech terminology): ¹ lineární regrese, ² regresní analýza, ³ lineární model, ⁴ odezva, ⁵ závisle proměnná, ⁶ no Czech translation (do not use the expression „kovariáty“!), ⁷ vysvětlující proměnné, ⁸ prediktory, ⁹ nezávisle proměnné, ¹⁰ výběrový prostor, ¹¹ vektor odezvy, ¹² matice modelu, ¹³ regresní matice.
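Although the course itself works in R, the shapes involved in the basic setup can be sketched numerically. The following illustration (in Python with NumPy, purely for concreteness; all data values are made up) assembles a response vector Y and a model matrix X for n = 4 observations with k = 2 covariate components:

```python
import numpy as np

# n = 4 observations, k = 2 covariate components per observation
# (hypothetical values, chosen only to illustrate the shapes)
Y = np.array([1.0, 2.0, 3.0, 4.0])     # response vector of length n
X = np.array([[1.0, 0.5],              # model matrix, n x k:
              [1.0, 1.5],              # row i holds the covariate vector X_i^T
              [1.0, 2.5],
              [1.0, 3.5]])

n, k = X.shape
assert Y.shape == (n,)
assert k < n                           # the basic setup requires k < n
```

Each row of X corresponds to one observation, each column to one covariate component — this layout is assumed throughout the lecture.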
Notation. The letter Y (or y) will always denote a response-related quantity. The letters X (or x) and Z (or z) will always denote quantities related to the covariates.

This lecture:
• The response Y is continuous.
• The interest is in modelling the dependence of only the expected value (the mean) of Y on the covariates.
• The covariates can be of any type (numeric, categorical).
1.1.2 Probabilistic model for the data

Any statistical analysis is based on specifying a stochastic mechanism which is assumed to generate the data. In our situation, with data (Yᵢ, Xᵢ), i = 1, …, n, the data generating mechanism corresponds to a joint distribution of a “long” random vector (Y₁, …, Yₙ, X₁, …, Xₙ) ≡ (Y, X), which can be given by a joint density
\[
f_{\boldsymbol{Y},\mathsf{X}}(y_1, \dots, y_n, \boldsymbol{x}_1, \dots, \boldsymbol{x}_n) \equiv f_{\boldsymbol{Y},\mathsf{X}}(\boldsymbol{y}, \boldsymbol{x})
\tag{1.1}
\]
(with respect to some σ-finite product measure λ_Y × λ_X).

Note. For the purpose of this lecture, λ_Y will always be the Lebesgue measure on (Rⁿ, Bₙ).

Further, it is known from basic lectures on probability that any joint density can be decomposed into a product of a conditional and a marginal density as
\[
f_{\boldsymbol{Y},\mathsf{X}}(\boldsymbol{y}, \boldsymbol{x})
= f_{\boldsymbol{Y}|\mathsf{X}}\bigl(\boldsymbol{y} \,\big|\, \boldsymbol{x}\bigr)\, f_{\mathsf{X}}(\boldsymbol{x}).
\tag{1.2}
\]
In regression analysis, and in linear regression in particular, the interest lies in revealing certain features of the conditional distribution Y | X (given by the density f_{Y|X}), while the marginal distribution of the covariates X (given by the density f_X) is considered a nuisance. It will be shown during the lecture that valid statistical inference is possible for suitable characteristics of the conditional distribution of the response given the covariates while leaving the covariate distribution f_X practically unspecified. Moreover, to infer on certain characteristics of the conditional distribution Y | X, e.g., on the conditional mean E(Y | X), even the density f_{Y|X} might be left practically unspecified for many tasks.
1.2 Linear model: Basics

1.2.1 Definition of a linear model
Definition 1.1 (Linear model). The data (Yᵢ, Xᵢ), i = 1, …, n, satisfy a linear model if
\[
E\bigl(\boldsymbol{Y} \,\big|\, \mathsf{X}\bigr) = \mathsf{X}\boldsymbol{\beta},
\qquad
\operatorname{var}\bigl(\boldsymbol{Y} \,\big|\, \mathsf{X}\bigr) = \sigma^2 \mathsf{I}_n,
\]
where β ∈ Rᵏ and 0 < σ² < ∞ are unknown parameters.

Terminology (Regression coefficients, residual variance and standard deviation).
• β = (β₀, …, β_{k−1}) is called the vector of regression coefficients¹⁴ or regression parameters¹⁵.
• σ² is called the residual variance¹⁶.
• σ = √(σ²) is called the residual standard deviation¹⁷.
Note. The linear model as specified by Definition 1.1 deals with specifying only the first two moments of the conditional distribution Y | X. For the rest, both the density f_{Y|X} and the density f_X from (1.2) can be arbitrary.
1.2.2 Definition of a linear model using the error terms

The linear model can equivalently be defined as follows.

Alternative to Definition 1.1. The data (Yᵢ, Xᵢ), i = 1, …, n, satisfy a linear model if
\[
\boldsymbol{Y} = \mathsf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},
\]
where ε = (ε₁, …, εₙ) is a random vector such that ε and X are independent,
\[
E(\boldsymbol{\varepsilon}) = \boldsymbol{0}_n,
\qquad
\operatorname{var}(\boldsymbol{\varepsilon}) = \sigma^2 \mathsf{I}_n,
\]
and β ∈ Rᵏ and 0 < σ² < ∞ are unknown parameters.

Terminology (Error terms). The random variables ε₁, …, εₙ are called the error terms (disturbances)¹⁸.

Notation.
• To indicate that a random vector Y follows some distribution with mean μ and a covariance matrix Σ, we will write
\[
\boldsymbol{Y} \sim \bigl(\boldsymbol{\mu}, \boldsymbol{\Sigma}\bigr).
\]

Footnotes: ¹⁴ regresní koeficienty, ¹⁵ regresní parametry, ¹⁶ reziduální rozptyl, ¹⁷ reziduální směrodatná odchylka, ¹⁸ náhodné odchylky.
• The fact that the data (Yᵢ, Xᵢ), i = 1, …, n, follow a linear model can now be indicated by writing
\[
\boldsymbol{Y} \,\big|\, \mathsf{X} \sim \bigl(\mathsf{X}\boldsymbol{\beta},\; \sigma^2 \mathsf{I}_n\bigr).
\]
• When using the error term vector ε, the fact that the data (Yᵢ, Xᵢ), i = 1, …, n, follow a linear model can be written as
\[
\boldsymbol{Y} = \mathsf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},
\qquad
\boldsymbol{\varepsilon} \sim \bigl(\boldsymbol{0}_n,\; \sigma^2 \mathsf{I}_n\bigr).
\]
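Note that the error-term formulation constrains only the first two moments of ε, not its shape. As a quick numerical sketch (Python/NumPy; the distribution and values are chosen only for illustration), centred uniform errors satisfy the definition just as well as normal ones:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
sigma = 2.0

# A uniform variable on (-a, a) has mean 0 and variance a^2 / 3;
# choosing a = sigma * sqrt(3) gives E(eps) = 0 and var(eps) = sigma^2,
# so these errors fit the linear-model definition without being normal.
a = sigma * np.sqrt(3.0)
eps = rng.uniform(-a, a, size=n)

assert abs(eps.mean()) < 0.05                # E(eps) = 0, approximately
assert abs(eps.var() - sigma**2) < 0.05      # var(eps) = sigma^2, approximately
```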
1.2.3 Rank of the model

The k-dimensional covariate vectors X₁, …, Xₙ (the n × k model matrix X) are in general generated by some (n · k)-dimensional joint distribution with a density f_X(x₁, …, xₙ) = f_X(x). Next to assuming that n > k, we will additionally assume throughout the whole lecture that for a fixed r ≤ k,
\[
P\bigl(\operatorname{rank}(\mathsf{X}) = r\bigr) = 1.
\tag{1.3}
\]
That is, we will assume that the (column) rank of the model matrix is fixed rather than random. It should gradually become clear throughout the lecture that this assumption is not really restrictive for most practical applications of a linear model.

Convention. In the remainder of the lecture, we will only write rank(X) = r, which will mean that P(rank(X) = r) = 1 if randomness of the covariates should be taken into account.

Notation. The fact that the data (Yᵢ, Xᵢ), i = 1, …, n, follow a linear model where the k-dimensional covariates satisfy condition (1.3) will be denoted as
\[
\boldsymbol{Y} \,\big|\, \mathsf{X} \sim \bigl(\mathsf{X}\boldsymbol{\beta},\; \sigma^2 \mathsf{I}_n\bigr),
\qquad
\operatorname{rank}(\mathsf{X}_{n \times k}) = r,
\]
or as
\[
\boldsymbol{Y} = \mathsf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},
\qquad
\boldsymbol{\varepsilon} \sim \bigl(\boldsymbol{0}_n,\; \sigma^2 \mathsf{I}_n\bigr),
\qquad
\operatorname{rank}(\mathsf{X}_{n \times k}) = r.
\]
Definition 1.2 (Full-rank linear model). A full-rank linear model¹⁹ is a linear model with r = k.

Note. In a full-rank linear model, the columns of the model matrix X are linearly independent vectors in Rⁿ (almost surely).
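The distinction between full-rank and rank-deficient model matrices can be checked numerically. A small sketch (Python/NumPy, illustrative only): the second matrix below mimics an ANOVA-style parameterization in which an intercept column sits next to indicator columns of two groups; the indicators sum to the intercept, so r = 2 < k = 3.

```python
import numpy as np

# Full-rank case: intercept + one numeric covariate (k = 2)
X_full = np.array([[1.0, 0.0],
                   [1.0, 1.0],
                   [1.0, 2.0],
                   [1.0, 3.0]])
assert np.linalg.matrix_rank(X_full) == 2   # r = k: full-rank linear model

# Rank-deficient case: intercept + indicators of two groups (k = 3);
# the two indicator columns sum to the intercept column, so r = 2 < k
X_def = np.array([[1.0, 1.0, 0.0],
                  [1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [1.0, 0.0, 1.0]])
assert np.linalg.matrix_rank(X_def) == 2    # r < k: not of full rank
```

Rank-deficient parameterizations of this kind appear naturally with categorical covariates and are treated later in the course (Chapter 7).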
1.2.4 Independent observations

In many areas of regression modelling, the data (Yᵢ, Xᵢ), i = 1, …, n, correspond to independently behaving units (experimental units, individuals sampled randomly from a certain population, …). If it is so, the random vectors (Yᵢ, Xᵢ) are independent for i = 1, …, n, and the joint density (1.1) takes a product form
\[
f_{\boldsymbol{Y},\mathsf{X}}(\boldsymbol{y}, \boldsymbol{x})
= \prod_{i=1}^{n} f_{Y_i, \boldsymbol{X}_i}(y_i, \boldsymbol{x}_i),
\]
where f_{Yᵢ,Xᵢ} is the joint density of the random vector (Yᵢ, Xᵢ). This can again be decomposed into a product of a conditional and a marginal density as
\[
f_{Y_i, \boldsymbol{X}_i}(y_i, \boldsymbol{x}_i)
= f_{Y_i | \boldsymbol{X}_i}\bigl(y_i \,\big|\, \boldsymbol{x}_i\bigr)\, f_{\boldsymbol{X}_i}(\boldsymbol{x}_i),
\qquad i = 1, \dots, n,
\]
leading to the joint density of all data points of the form
\[
f_{\boldsymbol{Y},\mathsf{X}}(\boldsymbol{y}, \boldsymbol{x})
= \underbrace{\Bigl\{\prod_{i=1}^{n} f_{Y_i | \boldsymbol{X}_i}\bigl(y_i \,\big|\, \boldsymbol{x}_i\bigr)\Bigr\}}_{f_{\boldsymbol{Y}|\mathsf{X}}(\boldsymbol{y}\,|\,\boldsymbol{x})}\;
\underbrace{\Bigl\{\prod_{i=1}^{n} f_{\boldsymbol{X}_i}(\boldsymbol{x}_i)\Bigr\}}_{f_{\mathsf{X}}(\boldsymbol{x})}.
\tag{1.4}
\]
Due to the fact that the covariate distribution is in fact a nuisance, it is not really necessary to assume that f_X(x) = ∏ᵢ₌₁ⁿ f_{Xᵢ}(xᵢ) to derive most results that will be presented in this lecture. In other words, the form of f_X in expression (1.4) might be left in its general form, which leads to the joint density of the data points of the form
\[
f_{\boldsymbol{Y},\mathsf{X}}(\boldsymbol{y}, \boldsymbol{x})
= \underbrace{\Bigl\{\prod_{i=1}^{n} f_{Y_i | \boldsymbol{X}_i}\bigl(y_i \,\big|\, \boldsymbol{x}_i\bigr)\Bigr\}}_{f_{\boldsymbol{Y}|\mathsf{X}}(\boldsymbol{y}\,|\,\boldsymbol{x})}\;
f_{\mathsf{X}}(\boldsymbol{x}).
\]

Footnote: ¹⁹ lineární model o plné hodnosti.
Definition 1.3 (Independent observations in a regression context). In the regression context, we say that we deal with independent observations if the conditional density of the response vector Y given the covariates X takes the form
\[
f_{\boldsymbol{Y}|\mathsf{X}}\bigl(\boldsymbol{y} \,\big|\, \boldsymbol{x}\bigr)
= \prod_{i=1}^{n} f_{Y_i | \boldsymbol{X}_i}\bigl(y_i \,\big|\, \boldsymbol{x}_i\bigr).
\tag{1.5}
\]
Note. When dealing with independent observations in a regression context according to Definition 1.3, no independence assumptions are imposed on the joint distribution f_X of the covariates.
1.2.5 Linear model with i.i.d. errors

If we deal with independent observations in a regression context and form (1.5) of the conditional distribution of Y given X can be assumed, it is often useful to assume not only a certain form of the first two moments of each conditional distribution f_{Yᵢ|Xᵢ}, i = 1, …, n, but even a common functional form for all those conditional distributions.

Definition 1.4 (Linear model with i.i.d. errors). The data (Yᵢ, Xᵢ), i = 1, …, n, satisfy a linear model with i.i.d. errors²⁰ if the joint conditional density of Y given X takes the form
\[
f_{\boldsymbol{Y}|\mathsf{X}}\bigl(\boldsymbol{y} \,\big|\, \boldsymbol{x}\bigr)
= \prod_{i=1}^{n} f_{Y|X}\bigl(y_i \,\big|\, \boldsymbol{x}_i\bigr),
\]
where
\[
f_{Y|X}\bigl(y_i \,\big|\, \boldsymbol{x}_i\bigr)
= f_e\bigl(y_i - \boldsymbol{x}_i^\top \boldsymbol{\beta};\; \sigma^2\bigr),
\qquad i = 1, \dots, n,
\]
β ∈ Rᵏ and 0 < σ² < ∞ are unknown parameters and f_e(·; σ²) is some density of a continuous distribution with zero mean and variance σ².

Footnote: ²⁰ lineární model s nezávislými stejně rozdělenými chybami.

Note. In a linear model with i.i.d. errors, f_{Y|X}(yᵢ | xᵢ) = f_e(yᵢ − xᵢ⊤β; σ²), i = 1, …, n, and since f_e(·; σ²) is a density of a zero-mean distribution with variance σ², we still have
\[
E\bigl(Y_i \,\big|\, \boldsymbol{X}_i\bigr) = \boldsymbol{X}_i^\top \boldsymbol{\beta},
\qquad
\operatorname{var}\bigl(Y_i \,\big|\, \boldsymbol{X}_i\bigr) = \sigma^2,
\qquad i = 1, \dots, n,
\]
\[
E\bigl(\boldsymbol{Y} \,\big|\, \mathsf{X}\bigr) = \mathsf{X}\boldsymbol{\beta},
\qquad
\operatorname{var}\bigl(\boldsymbol{Y} \,\big|\, \mathsf{X}\bigr) = \sigma^2 \mathsf{I}_n.
\]
That is, the linear model with i.i.d. errors indeed implies a linear model according to the basic definition (Definition 1.1).
Analogously to Section 1.2.2, a linear model with i.i.d. errors can alternatively be defined using the (i.i.d.) error terms as follows.

Definition 1.4 using the error terms. The data (Yᵢ, Xᵢ), i = 1, …, n, satisfy a linear model with i.i.d. errors if
\[
Y_i = \boldsymbol{X}_i^\top \boldsymbol{\beta} + \varepsilon_i,
\qquad i = 1, \dots, n,
\]
where ε₁, …, εₙ are independent and identically distributed (i.i.d.) random variables with a continuous distribution with zero mean and variance σ², with εᵢ and X_j independent for each i, j = 1, …, n. Parameters β ∈ Rᵏ and 0 < σ² < ∞ are unknown.

Example 1.1 (Normal linear model). The so-called normal linear model, which we will work with extensively starting from Chapter 3, is obtained by taking f_e(·; σ²) to be the density of a normal distribution N(0, σ²). We then have
\[
Y_i \,\big|\, \boldsymbol{X}_i \sim \mathcal{N}\bigl(\boldsymbol{X}_i^\top \boldsymbol{\beta},\; \sigma^2\bigr), \quad i = 1, \dots, n,
\qquad
\boldsymbol{Y} \,\big|\, \mathsf{X} \sim \mathcal{N}_n\bigl(\mathsf{X}\boldsymbol{\beta},\; \sigma^2 \mathsf{I}_n\bigr),
\qquad
\varepsilon_1, \dots, \varepsilon_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2).
\]
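A normal linear model is also easy to simulate, which is useful for checking intuition about the definitions above. A minimal sketch (Python/NumPy; β, σ and the covariate distribution are illustrative choices, not part of the model definition): draw εᵢ i.i.d. N(0, σ²), set Y = Xβ + ε, and verify that the empirical mean and variance of Y − Xβ approximate 0 and σ².

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
beta = np.array([1.0, -2.0])        # illustrative true coefficients
sigma = 0.5                         # illustrative residual std. deviation

# Model matrix: intercept + one uniform covariate (hypothetical design)
X = np.column_stack([np.ones(n), rng.uniform(0.0, 1.0, size=n)])
eps = rng.normal(0.0, sigma, size=n)    # i.i.d. N(0, sigma^2) error terms
Y = X @ beta + eps                      # Y | X ~ N_n(X beta, sigma^2 I_n)

resid = Y - X @ beta
assert abs(resid.mean()) < 0.02                 # E(eps) = 0, approximately
assert abs(resid.var() - sigma**2) < 0.02       # var(eps) = sigma^2, approximately
```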
1.2.6 Regression function

Let (Y, X) denote a (generic) random vector that corresponds to a new data point generated by the same random mechanism as the data at hand. When using the linear model, we do not specify any characteristic of the random mechanism that generates the covariate value X. Nevertheless, as soon as a covariate value x ∈ X is given, and i.i.d. errors are assumed, the linear model specifies (models) the first two moments of the conditional distribution Y | X = x (with a density f_{Y|X}(· | x)). Namely, the linear model claims that
\[
E\bigl(Y \,\big|\, \boldsymbol{X} = \boldsymbol{x}\bigr) = \boldsymbol{x}^\top \boldsymbol{\beta},
\qquad
\operatorname{var}\bigl(Y \,\big|\, \boldsymbol{X} = \boldsymbol{x}\bigr) = \sigma^2,
\]
for some β ∈ Rᵏ and some σ².

Terminology (Regression function). The function m : Rᵏ → R given by
\[
m(\boldsymbol{x}) = \boldsymbol{x}^\top \boldsymbol{\beta} = \beta_0 x_0 + \cdots + \beta_{k-1} x_{k-1},
\qquad
\boldsymbol{x} = (x_0, \dots, x_{k-1}) \in \mathcal{X},
\]
is called the regression function²¹.

Note. The regression function is an assumed model for the evolution of the response expectation as the covariates change. That is,
\[
m(\boldsymbol{x}) = E\bigl(Y \,\big|\, \boldsymbol{X} = \boldsymbol{x}\bigr),
\qquad \boldsymbol{x} \in \mathcal{X}.
\]
1.2.7 Transformations of covariates

In most practical situations, we primarily observe data (Yᵢ, Zᵢ), i = 1, …, n, where Zᵢ = (Z_{i,1}, …, Z_{i,p}) ∈ Z ⊆ Rᵖ, i = 1, …, n. The main interest lies in modelling the conditional expectations E(Yᵢ | Zᵢ), or generically the conditional expectation E(Y | Z = z), z ∈ Z.

To allow for the use of a linear model, the original covariates Zᵢ must often be transformed into Xᵢ = (X_{i,0}, …, X_{i,k−1}) as
\[
X_{i,0} = t_0(\boldsymbol{Z}_i), \;\dots,\; X_{i,k-1} = t_{k-1}(\boldsymbol{Z}_i),
\]
where t_j : Rᵖ → R, j = 0, …, k − 1, are some functions. When applying classical linear model methodology, it is assumed that the functions t₀, …, t_{k−1} are known.

In the following, let for z ∈ Z:
\[
\boldsymbol{x} = (x_0, \dots, x_{k-1})
= \bigl(t_0(\boldsymbol{z}), \dots, t_{k-1}(\boldsymbol{z})\bigr)
= \boldsymbol{t}(\boldsymbol{z}).
\]
The linear model then specifies (i = 1, …, n):
\[
E\bigl(Y_i \,\big|\, \boldsymbol{Z}_i\bigr)
= \boldsymbol{t}^\top(\boldsymbol{Z}_i)\,\boldsymbol{\beta}
= \boldsymbol{X}_i^\top \boldsymbol{\beta}
= E\bigl(Y_i \,\big|\, \boldsymbol{X}_i\bigr),
\]
or generically
\[
E\bigl(Y \,\big|\, \boldsymbol{Z} = \boldsymbol{z}\bigr)
= \boldsymbol{t}^\top(\boldsymbol{z})\,\boldsymbol{\beta}
= \boldsymbol{x}^\top \boldsymbol{\beta}
= E\bigl(Y \,\big|\, \boldsymbol{X} = \boldsymbol{x}\bigr).
\]
The model matrix takes the form
 
 
 


X1,0 . . . X1,k−1
t0 (Z 1 ) . . . tk−1 (Z 1 )
t> (Z 1 )
X>
1
  . 
 .   .

..
.. 
..
..
 =  ...
 =  ..  ,
..  =  ..
X=
.
.
.
.

 
 
 

> (Z )
X>
X
.
.
.
X
t
(Z
)
.
.
.
t
(Z
)
t
n
n,0
0
n
n
k−1
n,k−1
n
and the corresponding regression function is
m(z) = E Y Z = z = t> (z)β = β0 t0 (z) + · · · + βk−1 tk−1 (z),
z = z1 , . . . , zp ∈ Z. (1.6)
Note. Even though some of the transformations t_0, …, t_{k−1} are often non-linear functions, a model with the regression function (1.6) still belongs to the area of linear regression. The reason is that the regression function is linear with respect to the unknown parameters, i.e., in the regression coefficients β.
^21 regresní funkce
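The covariate transformation above can be sketched numerically; a minimal Python/numpy illustration with a single covariate Z and a hypothetical cubic polynomial basis t_0(z) = 1, t_1(z) = z, t_2(z) = z², t_3(z) = z³ (all numbers are made up for illustration):

```python
import numpy as np

# Minimal sketch (illustrative values, not from the notes): one covariate Z
# transformed by known functions t_j into the model matrix X.
rng = np.random.default_rng(42)
n = 20
Z = rng.uniform(-1.0, 1.0, size=n)

# t_0(z) = 1, t_1(z) = z, t_2(z) = z^2, t_3(z) = z^3, so k = 4.
X = np.column_stack([Z**j for j in range(4)])

# m(z) = t(z)^T beta is non-linear in z but linear in beta,
# so the model is still a linear model.
beta = np.array([1.0, -2.0, 0.5, 3.0])
m = X @ beta            # regression function evaluated at the observed Z_i
print(X.shape)          # (20, 4)
```

The design choice here mirrors the text: the non-linearity lives entirely in the known functions t_j, while the unknown β enters linearly.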
1.2.8 Linear model with intercept

Often, X_{i,0} is taken to be constantly equal to one. That is, the first element of the (transformed) covariate vectors corresponds to a random variable which is almost surely equal to one. In this case, the model matrix and the regression function take the forms
\[
X =
\begin{pmatrix}
1 & X_{1,1} & \cdots & X_{1,k-1} \\
\vdots & \vdots & & \vdots \\
1 & X_{n,1} & \cdots & X_{n,k-1}
\end{pmatrix},
\qquad
m(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_{k-1} x_{k-1},
\quad x = (x_0, \ldots, x_{k-1})^\top \in \mathcal{X} \subseteq \mathbb{R}^k.
\]
Definition 1.5 Linear model with intercept.
A linear model in which the first column of the model matrix X is almost surely equal to a vector 1_n of ones is called the linear model with intercept^22.

Terminology (Intercept).
In the linear model with intercept, the first column of the model matrix, (X_{1,0}, …, X_{n,0})^⊤, which is almost surely equal to a vector 1_n of ones, is called the intercept column. The regression coefficient β_0 is called the intercept term^23 (or just the intercept) of the regression function.
1.2.9 Interpretation of regression coefficients

The regression parameters express the influence of the covariates on the response. For a chosen j ∈ {0, 1, …, k − 1}, let
    x = (x_0, …, x_j, …, x_{k−1})^⊤ ∈ X,    and    x_{j(+1)} := (x_0, …, x_j + 1, …, x_{k−1})^⊤ ∈ X.
We then have
    E(Y | X = x_{j(+1)}) − E(Y | X = x)
        = E(Y | X_0 = x_0, …, X_j = x_j + 1, …, X_{k−1} = x_{k−1}) − E(Y | X_0 = x_0, …, X_j = x_j, …, X_{k−1} = x_{k−1})
        = {β_0 x_0 + · · · + β_j (x_j + 1) + · · · + β_{k−1} x_{k−1}} − {β_0 x_0 + · · · + β_j x_j + · · · + β_{k−1} x_{k−1}}
        = β_j.
That is, the regression coefficient β_j expresses the change of the response expectation corresponding to a unit change of the jth regressor while keeping the remaining covariates unchanged. Further, for a fixed δ ∈ R, let
    x_{j(+δ)} := (x_0, …, x_j + δ, …, x_{k−1})^⊤ ∈ X;
^22 lineární model s absolutním členem
^23 absolutní člen
we then have
    E(Y | X = x_{j(+δ)}) − E(Y | X = x)
        = E(Y | X_0 = x_0, …, X_j = x_j + δ, …, X_{k−1} = x_{k−1}) − E(Y | X_0 = x_0, …, X_j = x_j, …, X_{k−1} = x_{k−1})
        = β_j δ.
The assumed linear model thus implies the following:
(i) The change of the response expectation corresponding to a constant change δ of the jth regressor does not depend on the value x_j of the regressor which is changed by δ.
(ii) The change of the response expectation corresponding to a constant change δ of the jth regressor does not depend on the values of the remaining regressors.
Terminology (Effect of the regressor).
The regression coefficient β_j is also called the effect of the jth regressor.
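The computation above can be checked numerically; a small sketch with illustrative (made-up) values for β, x, j and δ:

```python
import numpy as np

# For m(x) = x^T beta, increasing the j-th regressor by delta changes the
# response expectation by beta_j * delta (illustrative values below).
beta = np.array([2.0, -1.5, 0.7])
m = lambda x: x @ beta

x = np.array([1.0, 3.0, -2.0])
j, delta = 2, 5.0
x_shift = x.copy()
x_shift[j] += delta         # x_{j(+delta)}

change = m(x_shift) - m(x)
print(change)               # beta[2] * delta = 3.5
```

Note that the change does not involve the starting value x_j nor the other coordinates, exactly as points (i) and (ii) above state.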
Linear model with intercept
In a model with intercept, where X_{i,0} is almost surely equal to one, it does not make sense to consider a change of this covariate by any fixed value. The intercept β_0 then has the following interpretation. If
    (x_0, x_1, …, x_{k−1})^⊤ = (1, 0, …, 0)^⊤ ∈ X,
that is, if the non-intercept covariates may all attain zero values, we have
    β_0 = E(Y | X_1 = 0, …, X_{k−1} = 0).
1.2.10 Fixed or random covariates

In certain application areas (e.g., designed experiments), all or some of the covariates can be fixed rather than random variables. This means that the covariate values are determined/set by the analyst rather than being observed on (randomly selected) subjects. For the majority of the theory presented throughout this course, it does not really matter whether the covariates are considered as random or as fixed quantities. Many proofs proceed almost exactly the same way in both situations. Nevertheless, especially when dealing with asymptotic properties of the estimators used in the context of a linear model (see Chapter 13), care must be taken as to whether the covariates are considered as random or as fixed.
End of Lecture #1 (05/10/2015)
Start of Lecture #2 (08/10/2015)

Convention. For the majority of the lecture, most expectations that we will work with will be conditional expectations with respect to the conditional distribution of the response Y given the covariate values X_1, …, X_n (given the model matrix X). To simplify notation, we will use E(·) also for E(· | X) and var(·) also for var(· | X). That is, if g = (g_1, …, g_m)^⊤ : R^{n(1+k)} → R^m is a measurable function, then E{g(Y, X)} and var{g(Y, X)} mean
    E{g(Y, X)} := E{g(Y, X) | X} = ( ∫_{R^n} g_j(y, X) f_{Y|X}(y | X) dλ_Y(y) )_{j=1,…,m},
    var{g(Y, X)} := var{g(Y, X) | X}.
Further, for x ∈ R^{n·k}, we will use
    E{g(Y, x)} := E{g(Y, X) | X = x},    var{g(Y, x)} := var{g(Y, X) | X = x}.
(Unconditional) expectations and covariance matrices with respect to the joint distribution of Y and X will be indicated by a subscript Y, X, that is,
    E_{Y,X}{g(Y, X)} = ( ∫_{R^{n(1+k)}} g_j(y, x) f_{Y,X}(y, x) dλ_Y(y) dλ_X(x) )_{j=1,…,m}.
Analogously, expectations with respect to the marginal distribution of X will also be indicated by an appropriate subscript. That is, if h = (h_1, …, h_m)^⊤ : R^{n·k} → R^m is a measurable function, then
    E_X{h(X)} = ( ∫_{R^{n·k}} h_j(x) f_X(x) dλ_X(x) )_{j=1,…,m}.
Note. To calculate E_{Y,X}, the basic relationship
    E_{Y,X}{g(Y, X)} = E_X [ E{g(Y, X)} ]    (1.7)
known from probability courses can conveniently be used. Furthermore, it trivially follows from (1.7) that if E{g(Y, x)} = c for λ_X-almost all x ∈ R^{n·k}, then also E_{Y,X}{g(Y, X)} = c.
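Relationship (1.7) can be illustrated by a quick Monte Carlo sketch under an assumed toy model (X ~ Uniform(0,1), Y | X ~ N(2X, 1), g(y, x) = y, so that E_X{E(Y | X)} = E_X(2X) = 1):

```python
import numpy as np

# Monte Carlo sketch of (1.7): E_{Y,X} g(Y,X) = E_X [ E{ g(Y,X) | X } ].
# Toy model (an assumption for illustration): X ~ U(0,1), Y | X ~ N(2X, 1).
rng = np.random.default_rng(0)
B = 200_000
X = rng.uniform(0.0, 1.0, size=B)
Y = rng.normal(2.0 * X, 1.0)

lhs = Y.mean()             # approximates E_{Y,X}(Y)
rhs = (2.0 * X).mean()     # approximates E_X{ E(Y | X) }; exact value is 1
print(lhs, rhs)
```

Both averages agree (up to Monte Carlo error) with the exact value 1, as (1.7) predicts.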
1.2.11 Limitations of a linear model
“Essentially, all models are wrong, but some are useful. The practical question is how
wrong do they have to be to not be useful.”
George E. P. Box (1919 – 2013)
The linear model is indeed only one possibility (out of infinitely many) for modelling the dependence of the response on the covariates. The linear model as defined by Definition 1.1 is (possibly seriously) wrong if, for example,
• The expected value E(Y | X = x), x ∈ X, cannot be expressed as a linear function of x.
    ⇒ Incorrect regression function.
• The conditional variance var(Y | X = x), x ∈ X, is not constant. It may depend on x as well as on other factors.
    ⇒ Heteroscedasticity.
• The response random variables are not conditionally uncorrelated/independent (the error terms are not uncorrelated/independent). This is often the case if the response is measured repeatedly (e.g., over time) on the n subjects included in the study.
Additionally, the linear model deals with modelling of only the first two (conditional) moments of the response. In many application areas, other characteristics of the conditional distribution Y | X are of (primary) interest.
Chapter 2  Least Squares Estimation

We keep considering a set of n random vectors (Y_i, X_i), X_i = (X_{i,0}, …, X_{i,k−1})^⊤, i = 1, …, n, that satisfy a linear model. That is,
    Y | X ∼ (Xβ, σ² I_n),    rank(X_{n×k}) = r ≤ k < n,    (2.1)
where Y = (Y_1, …, Y_n)^⊤, X is a matrix with vectors X_1^⊤, …, X_n^⊤ in its rows, and β = (β_0, …, β_{k−1})^⊤ ∈ R^k and σ² > 0 are unknown parameters. In this chapter, we introduce the method of least squares^1 to estimate the unknown parameters of the linear model (2.1). All results in this chapter will be derived without imposing any (parametric) distributional assumptions concerning the conditional distribution of the response given the covariates and without assuming independent observations in the regression context or i.i.d. errors.
^1 metoda nejmenších čtverců
2.1 Regression and residual space, projections

2.1.1 Regression and residual space
Notation (Linear span of columns of the model matrix and its orthogonal complement).
For a given dataset and a linear model, the model matrix X is a real n × k matrix. Let x^0, …, x^{k−1} ∈ R^n denote its columns, i.e.,
    X = (x^0, …, x^{k−1}).
• The linear span^2 of the columns of X, i.e., the vector space generated by the vectors x^0, …, x^{k−1}, will be denoted as M(X), that is,
    M(X) = { v : v = Σ_{j=0}^{k−1} β_j x^j,  β = (β_0, …, β_{k−1})^⊤ ∈ R^k }.
• The orthogonal complement to M(X) will be denoted as M(X)^⊥, that is,
    M(X)^⊥ = { u : u ∈ R^n,  v^⊤u = 0 for all v ∈ M(X) }.
Note. We know from linear algebra lectures that the linear span of the columns of X, M(X), is a vector subspace of dimension r of the n-dimensional Euclidean space R^n. Similarly, M(X)^⊥ is a vector subspace of dimension n − r of the n-dimensional Euclidean space R^n. We have
    M(X) ⊕ M(X)^⊥ = R^n,    M(X) ∩ M(X)^⊥ = {0_n},
    v^⊤u = 0 for any v ∈ M(X), u ∈ M(X)^⊥.
Definition 2.1 Regression and residual space of a linear model.
Consider a linear model Y | X ∼ (Xβ, σ² I_n), rank(X) = r. The regression space^3 of the model is the vector space M(X). The residual space^4 of the model is the orthogonal complement of the regression space, i.e., the vector space M(X)^⊥.
Notation (Orthonormal vector bases of the regression and residual space).
Let q_1, …, q_r be (any) orthonormal vector basis of the regression space M(X) and let n_1, …, n_{n−r} be (any) orthonormal vector basis of the residual space M(X)^⊥. That is, q_1, …, q_r, n_1, …, n_{n−r} is an orthonormal vector basis of the n-dimensional Euclidean space R^n. We will denote
• Q_{n×r} = (q_1, …, q_r),
• N_{n×(n−r)} = (n_1, …, n_{n−r}),
^2 lineární obal
^3 regresní prostor
^4 reziduální prostor
• P_{n×n} = (q_1, …, q_r, n_1, …, n_{n−r}) = (Q, N).

Notes. It follows from the linear algebra lectures:
• Properties of the columns of the Q matrix:
    q_j^⊤q_j = 1,  j = 1, …, r;    q_j^⊤q_l = 0,  j, l = 1, …, r, j ≠ l.
• Properties of the columns of the N matrix:
    n_j^⊤n_j = 1,  j = 1, …, n − r;    n_j^⊤n_l = 0,  j, l = 1, …, n − r, j ≠ l.
• Mutual properties of the columns of the Q and N matrices:
    q_j^⊤n_l = n_l^⊤q_j = 0,  j = 1, …, r,  l = 1, …, n − r.
• The above properties written in matrix form:
    Q^⊤Q = I_r,    N^⊤N = I_{n−r},    Q^⊤N = 0_{r×(n−r)},    N^⊤Q = 0_{(n−r)×r},    P^⊤P = I_n.    (2.2)
• It follows from (2.2) that P^⊤ is the inverse of P and hence
\[
I_n = P P^\top = (Q,\, N) \begin{pmatrix} Q^\top \\ N^\top \end{pmatrix} = Q Q^\top + N N^\top.
\]
• It is also useful to recall that
    M(X) = M(Q),    M(X)^⊥ = M(N),    R^n = M(P).
Notation. In the following, let
    H = Q Q^⊤,    M = N N^⊤.

Note. The matrices H and M are symmetric and idempotent:
    H^⊤ = (Q Q^⊤)^⊤ = Q Q^⊤ = H,    H H = Q Q^⊤ Q Q^⊤ = Q I_r Q^⊤ = Q Q^⊤ = H,
    M^⊤ = (N N^⊤)^⊤ = N N^⊤ = M,    M M = N N^⊤ N N^⊤ = N I_{n−r} N^⊤ = N N^⊤ = M.

2.1.2 Projections
Let y ∈ R^n. We can then write (using the identity in Expression (2.2))
    y = I_n y = (Q Q^⊤ + N N^⊤) y = (H + M) y = H y + M y.
We have
• ŷ := H y = Q Q^⊤ y ∈ M(X).
• u := M y = N N^⊤ y ∈ M(X)^⊥.
• ŷ^⊤u = y^⊤ Q Q^⊤ N N^⊤ y = y^⊤ Q 0_{r×(n−r)} N^⊤ y = 0.
That is, we have a decomposition of any y ∈ R^n into
    y = ŷ + u,    ŷ ∈ M(X),  u ∈ M(X)^⊥,  ŷ ⊥ u.
In other words, ŷ and u are projections of y into M(X) and M(X)^⊥, respectively, and H and M are the corresponding projection matrices.
Notes. It follows from the linear algebra lectures:
• The decomposition y = ŷ + u is unique.
• The projection matrices H, M are unique. That is, H = Q Q^⊤ does not depend on the choice of the orthonormal vector basis of M(X) included in the Q matrix, and M = N N^⊤ does not depend on the choice of the orthonormal vector basis of M(X)^⊥ included in the N matrix.
• The vector ŷ = (ŷ_1, …, ŷ_n)^⊤ is the closest point (in the Euclidean metric) in the regression space M(X) to a given vector y = (y_1, …, y_n)^⊤, that is,
    ∀ ỹ = (ỹ_1, …, ỹ_n)^⊤ ∈ M(X):    ‖y − ŷ‖² = Σ_{i=1}^n (y_i − ŷ_i)² ≤ Σ_{i=1}^n (y_i − ỹ_i)² = ‖y − ỹ‖².
Definition 2.2 Hat matrix, residual projection matrix.
Consider a linear model Y | X ∼ (Xβ, σ² I_n), where Q and N are the orthonormal bases of the regression and the residual space, respectively.
1. The hat matrix^5 of the model is the matrix Q Q^⊤, which is denoted as H.
2. The residual projection matrix^6 of the model is the matrix N N^⊤, which is denoted as M.

Lemma 2.1 Expressions of the projection matrices using the model matrix.
The hat matrix H and the residual projection matrix M can be expressed as
    H = X (X^⊤X)^− X^⊤,    M = I_n − X (X^⊤X)^− X^⊤.

Proof.
• Five matrices rule (Theorem A.2):
    X (X^⊤X)^− X^⊤X = X,    { I_n − X (X^⊤X)^− X^⊤ } X = 0_{n×k}.
^5 regresní projekční matice; the term "hat matrix" may also be used
^6 reziduální projekční matice
• Let H̃ = X (X^⊤X)^− X^⊤ and M̃ = I_n − X (X^⊤X)^− X^⊤ = I_n − H̃.
• We have M̃X = 0_{n×k}, and both H̃ and M̃ are symmetric.
• We now have:
    y = I_n y = (H̃ + I_n − H̃) y = H̃y + M̃y.
• Clearly, H̃y = X { (X^⊤X)^− X^⊤y } ∈ M(X).
• For any z = Xb ∈ M(X):  z^⊤M̃y = b^⊤X^⊤M̃y = y^⊤M̃Xb = 0, since M̃X = 0_{n×k}. Hence M̃y ∈ M(X)^⊥.
⇒ By uniqueness of projections and projection matrices,
    H = H̃ = X (X^⊤X)^− X^⊤,    M = M̃ = I_n − X (X^⊤X)^− X^⊤.    □

Notes.
• The expression X (X^⊤X)^− X^⊤ does not depend on the choice of the pseudoinverse matrix (X^⊤X)^−.
• If r = rank(X_{n×k}) = k, then
    H = X (X^⊤X)^{−1} X^⊤,    M = I_n − X (X^⊤X)^{−1} X^⊤.
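Lemma 2.1 and the preceding notes can be verified numerically; a sketch in numpy, using the Moore–Penrose pseudoinverse as one admissible choice of (X^⊤X)^− and a QR factorization to obtain an orthonormal basis Q of M(X) (the specific matrix is made up):

```python
import numpy as np

# Sketch of Lemma 2.1: H = X (X^T X)^- X^T equals Q Q^T for any orthonormal
# basis Q of M(X); here (X^T X)^- is the Moore-Penrose pseudoinverse.
rng = np.random.default_rng(1)
n = 10
A = rng.normal(size=(n, 4))
X = np.column_stack([A, A[:, 0] + A[:, 1]])    # dependent column: r = 4 < k = 5

H = X @ np.linalg.pinv(X.T @ X) @ X.T
M = np.eye(n) - H

Q, _ = np.linalg.qr(A)                          # orthonormal basis of M(X)
assert np.allclose(H, Q @ Q.T)                  # H independent of the construction
assert np.allclose(H, H.T) and np.allclose(H @ H, H)   # symmetric, idempotent
assert np.allclose(M @ X, 0.0)                  # M X = 0_{n x k}
print(round(np.trace(M)))                       # dim M(X)^perp = n - r = 6
```

The trace of M recovering n − r anticipates the degrees-of-freedom result of Theorem 2.3 below.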
2.2 Fitted values, residuals, Gauss–Markov theorem

Before starting to deal with estimation of the principal parameters of the linear model, the regression coefficients β, we deal with estimation of the full (conditional) mean E(Y) = Xβ of the response vector Y and of its (conditional) covariance matrix var(Y) = σ² I_n, for which it is sufficient to estimate the residual variance σ².

Notation. We denote
    μ := Xβ = E(Y).
By saying that we are now interested in estimation of the full (conditional) expectation E(Y), we mean that we want to estimate the parameter vector μ on its own, without the necessity of knowing its decomposition into Xβ.
Definition 2.3 Fitted values, residuals, residual sum of squares.
Consider a linear model Y | X ∼ (Xβ, σ² I_n).
1. The fitted values^7, or the vector of fitted values, of the model is the vector HY, which will be denoted as Ŷ. That is,
    Ŷ = (Ŷ_1, …, Ŷ_n)^⊤ = HY.
2. The residuals^8, or the vector of residuals, of the model is the vector MY, which will be denoted as U. That is,
    U = (U_1, …, U_n)^⊤ = MY = Y − Ŷ.
3. The residual sum of squares^9 of the model is the quantity ‖U‖², which will be denoted as SS_e. That is,
    SS_e = ‖U‖² = U^⊤U = Σ_{i=1}^n U_i² = Σ_{i=1}^n (Y_i − Ŷ_i)² = (Y − Ŷ)^⊤(Y − Ŷ) = ‖Y − Ŷ‖².

Notes.
• The fitted values Ŷ and the residuals U are projections of the response vector Y into the regression space M(X) and the residual space M(X)^⊥, respectively.
• Using the quantities and expressions introduced in Section 2.1, we can write
    Ŷ = HY = Q Q^⊤Y = X (X^⊤X)^− X^⊤Y,
    U = MY = N N^⊤Y = { I_n − X (X^⊤X)^− X^⊤ } Y = Y − Ŷ.
^7 vyrovnané hodnoty
^8 rezidua
^9 reziduální součet čtverců
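A numerical sketch of Definition 2.3 with simulated (made-up) data; `np.linalg.lstsq` returns a least squares solution b, from which Ŷ, U and SS_e follow:

```python
import numpy as np

# Fitted values, residuals and the residual sum of squares (simulated data).
rng = np.random.default_rng(2)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ b               # projection of Y into M(X)
U = Y - Y_hat               # projection of Y into M(X)^perp
SSe = U @ U

assert np.allclose(X.T @ U, 0.0)   # U is orthogonal to every column of X
assert abs(Y_hat @ U) < 1e-8       # Y_hat and U are orthogonal
print(SSe >= 0.0)                  # True
```

The two assertions are exactly the projection properties: U lies in M(X)^⊥, so it is orthogonal both to the columns of X and to Ŷ.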
• It follows from the projection properties that the vector Ŷ = (Ŷ_1, …, Ŷ_n)^⊤ is the nearest point of the regression space M(X) to the response vector Y = (Y_1, …, Y_n)^⊤, that is,
    ∀ Ỹ = (Ỹ_1, …, Ỹ_n)^⊤ ∈ M(X):    ‖Y − Ŷ‖² = Σ_{i=1}^n (Y_i − Ŷ_i)² ≤ Σ_{i=1}^n (Y_i − Ỹ_i)² = ‖Y − Ỹ‖².    (2.3)
• The Gauss–Markov theorem introduced below shows that Ŷ is a suitable estimator of μ = Xβ. Owing to (2.3), it is also called the least squares estimator (LSE).^10 The method of estimation is then called the method of least squares^11 or the method of ordinary least squares (OLS).
Theorem 2.2 Gauss–Markov.
Assume a linear model Y | X ∼ (Xβ, σ² I_n). Then the vector of fitted values Ŷ is the best linear unbiased estimator (BLUE)^12 of the vector parameter μ = E(Y). Further,
    var(Ŷ) = σ² H = σ² X (X^⊤X)^− X^⊤.

Proof. First, recall our notational convention that E(·) = E(· | X) and var(·) = var(· | X).
Linearity means that Ŷ is a linear function of the response vector Y, which is clear from the expression Ŷ = HY.
Unbiasedness. Let us calculate E(Ŷ):
    E(Ŷ) = E(HY) = H E(Y) = H Xβ = Xβ = μ.
The penultimate equality holds because HX is the projection of each column of X into M(X), which is generated by those columns. That is, HX = X.
Optimality. Let Ỹ = a + BY be some other linear unbiased estimator of μ = Xβ.
• That is,
    ∀ β ∈ R^k:  E(Ỹ) = Xβ,
    ∀ β ∈ R^k:  a + B E(Y) = Xβ,
    ∀ β ∈ R^k:  a + BXβ = Xβ.
It follows, by using the above equality with β = 0_k, that a = 0_n.
• That is, from unbiasedness we have that ∀ β ∈ R^k:  BXβ = Xβ. Now take β = (0, …, 1, …, 0)^⊤ while changing the position of the one. It follows that BX = X.
• We now have:
    Ỹ = a + BY is an unbiased estimator of μ  ⟹  a = 0_n & BX = X.
Trivially (though we will not need it here), the opposite implication also holds (if Ỹ = BY with BX = X, then Ỹ is an unbiased estimator of μ = Xβ). In other words,
    Ỹ = a + BY is an unbiased estimator of μ  ⟺  a = 0_n & BX = X.
^11 metoda nejmenších čtverců (MNČ)
• Let us now explore what can be concluded from the equality BX = X. Multiplying from the right by (X^⊤X)^− X^⊤:
    BX = X,
    BX (X^⊤X)^− X^⊤ = X (X^⊤X)^− X^⊤,
    BH = H,    (2.4)
and by transposition,
    H^⊤B^⊤ = H^⊤,    i.e.,    HB^⊤ = H.    (2.5)
• Let us calculate var(Ŷ):
    var(Ŷ) = var(HY) = H var(Y) H^⊤ = H (σ² I_n) H^⊤ = σ² HH^⊤ = σ² H = σ² X (X^⊤X)^− X^⊤.
• Analogously, we calculate var(Ỹ) for Ỹ = BY, where BX = X:
    var(Ỹ) = var(BY) = B var(Y) B^⊤ = B (σ² I_n) B^⊤ = σ² BB^⊤ = σ² (H + B − H)(H + B − H)^⊤
           = σ² { HH^⊤ + (B − H)H^⊤ + H(B − H)^⊤ + (B − H)(B − H)^⊤ }
           = σ² H + σ² (B − H)(B − H)^⊤,
where HH^⊤ = H and H(B − H)^⊤ = (B − H)H^⊤ = 0_{n×n} follow from (2.4) and (2.5) and from the fact that H is symmetric and idempotent.
• Hence finally,
    var(Ỹ) − var(Ŷ) = σ² (B − H)(B − H)^⊤,
which is a positive semidefinite matrix. That is, the estimator Ŷ is not worse than the estimator Ỹ.    □

Note. It follows from the Gauss–Markov theorem that
    Ŷ | X ∼ (Xβ, σ² H).
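The variance formula var(Ŷ) = σ²H can be checked by simulation; a sketch with a fixed (made-up) design and repeated draws of the errors:

```python
import numpy as np

# Monte Carlo sketch of var(Y_hat) = sigma^2 H (fixed design, repeated errors).
rng = np.random.default_rng(3)
n, sigma = 8, 1.0
X = rng.normal(size=(n, 3))
beta = np.array([1.0, -1.0, 0.5])
H = X @ np.linalg.inv(X.T @ X) @ X.T

B = 50_000
eps = rng.normal(scale=sigma, size=(B, n))
fits = (X @ beta) + eps @ H.T          # H (X beta + eps) = X beta + H eps

emp = np.cov(fits, rowvar=False)       # empirical covariance of Y_hat
err = np.max(np.abs(emp - sigma**2 * H))
print(err)                             # small (Monte Carlo error only)
```

The identity H(Xβ + ε) = Xβ + Hε used in the vectorized line is exactly the unbiasedness step HX = X from the proof.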
Historical remarks
• The method of least squares was used in astronomy and geodesy already at the beginning of the
19th century.
• 1805: First documented publication of least squares.
Adrien-Marie Legendre. Appendix "Sur la méthode des moindres quarrés" ("On the method of least squares") in the book Nouvelles Méthodes pour la Détermination des Orbites des Comètes (New Methods for the Determination of the Orbits of Comets).
• 1809: Another (supposedly independent) publication of least squares.
Carl Friedrich Gauss. In Volume 2 of the book Theoria Motus Corporum Coelestium in Sectionibus Conicis
Solem Ambientium (The Theory of the Motion of Heavenly Bodies Moving Around the Sun in Conic Sections).
• C. F. Gauss claimed he had been using the method of least squares since 1795 (which is
probably true).
• The Gauss–Markov theorem was first proved by C. F. Gauss in 1821 – 1823.
• In 1912, A. A. Markov provided another version of the proof.
• In 1934, J. Neyman described Markov's proof as "elegant" and stated that Markov's contribution (written in Russian) had been overlooked in the West.
⇒ The name Gauss–Markov theorem.
Theorem 2.3 Basic properties of the residuals and the residual sum of squares.
Let Y = Xβ + ε, ε ∼ (0_n, σ² I_n), where rank(X_{n×k}) = r ≤ k < n. The following then holds:
(i) U = Mε.
(ii) SS_e = Y^⊤MY = ε^⊤Mε.
(iii) E(U) = 0_n, var(U) = σ² M.
(iv) E(SS_e) = E_{Y,X}(SS_e) = (n − r) σ².

Notes.
• Point (i) of Theorem 2.3 says that the residuals can be obtained not only by projecting the response vector Y into M(X)^⊥ but also by projecting the vector of the error terms of the linear model into M(X)^⊥.
• Point (iii) of Theorem 2.3 can also briefly be written as
    U | X ∼ (0_n, σ² M).

Proof.
(i) U = MY = M(Xβ + ε) = MXβ + Mε = Mε, since MX = 0_{n×k}.
(ii) SS_e = U^⊤U = (MY)^⊤MY = Y^⊤M^⊤MY = Y^⊤MY, and by (i) also SS_e = ε^⊤M^⊤Mε = ε^⊤Mε.
(iii) E(U) = E(MY) = MXβ = 0_n, since MX = 0_{n×k}. Further,
    var(U) = var(MY) = M var(Y) M^⊤ = M (σ² I_n) M^⊤ = σ² MM^⊤ = σ² M.
(iv) E(SS_e) = E(ε^⊤Mε) = E{tr(ε^⊤Mε)} = E{tr(Mεε^⊤)} = tr{M E(εε^⊤)} = tr{M var(ε)} = tr(M σ² I_n) = σ² tr(M)
            = σ² tr(NN^⊤) = σ² tr(N^⊤N) = σ² tr(I_{n−r}) = σ² (n − r),
where (recall) N denotes the n × (n − r) matrix whose columns form an orthonormal vector basis of the residual space M(X)^⊥.
Further,
    E_{Y,X}(SS_e) = ∫_{R^{n·k}} E(SS_e) f_X(x) dλ_X(x) = ∫_{R^{n·k}} σ² (n − r) f_X(x) dλ_X(x) = σ² (n − r).    □
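Point (iv), E(SS_e) = (n − r)σ², can also be checked by simulation (toy numbers, using the representation SS_e = ε^⊤Mε from point (ii)):

```python
import numpy as np

# Monte Carlo sketch of E(SS_e) = (n - r) sigma^2, using SS_e = eps^T M eps.
rng = np.random.default_rng(4)
n, sigma = 12, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # rank r = 3
r = np.linalg.matrix_rank(X)
M = np.eye(n) - X @ np.linalg.pinv(X.T @ X) @ X.T

B = 40_000
eps = rng.normal(scale=sigma, size=(B, n))
SSe = np.einsum('bi,ij,bj->b', eps, M, eps)    # one SS_e per replication

print(SSe.mean(), (n - r) * sigma**2)          # both close to 36
```

Dividing each replication by n − r would give MS_e, whose average correspondingly approaches σ² (Theorem 2.4 below states this unbiasedness formally).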
Definition 2.4 Residual mean square and residual degrees of freedom.
Consider a linear model Y | X ∼ (Xβ, σ² I_n), rank(X) = r.
1. The residual mean square^13 of the model is the quantity SS_e/(n − r) and will be denoted as MS_e. That is,
    MS_e = SS_e / (n − r).
2. The residual degrees of freedom^14 of the model is the dimension of the residual space and will be denoted as ν_e. That is,
    ν_e = n − r.

Theorem 2.4 Unbiased estimator of the residual variance.
The residual mean square MS_e is an unbiased estimator (both conditionally given X and also with respect to the joint distribution of Y and X) of the residual variance σ² in a linear model Y | X ∼ (Xβ, σ² I_n), rank(X_{n×k}) = r ≤ k < n.

Proof. Direct consequence of Theorem 2.3, point (iv).    □
^13 reziduální střední čtverec
^14 reziduální stupně volnosti

End of Lecture #2 (08/10/2015)
2.3 Normal equations

Start of Lecture #3 (08/10/2015)
The vector of fitted values Ŷ = HY is a projection of the response vector into M(X). Hence, it must be possible to write Ŷ as a linear combination of the columns of the model matrix X. That is, there exists b ∈ R^k such that
    Ŷ = Xb.    (2.6)

Notes.
• In a full-rank model (rank(X_{n×k}) = k), the linearly independent columns of X form a vector basis of M(X). Hence b ∈ R^k such that Ŷ = Xb is unique.
• If rank(X_{n×k}) = r < k, a vector b ∈ R^k such that Ŷ = Xb is not unique.
We already know from the Gauss–Markov theorem (Theorem 2.2) that E(Ŷ) = Xβ. Hence, if we manage to express Ŷ as Ŷ = Xb with b unique, we have a natural candidate for an estimator of the regression coefficients β. Nevertheless, before we proceed to estimation of β, we derive the conditions that b ∈ R^k must satisfy to fulfill (2.6).
Definition 2.5 Sum of squares.
Consider a linear model Y | X ∼ (Xβ, σ² I_n). The function SS : R^k → R given as
    SS(β) = ‖Y − Xβ‖² = (Y − Xβ)^⊤(Y − Xβ),    β ∈ R^k,
will be called the sum of squares^15 of the model.

Theorem 2.5 Least squares and normal equations.
Assume a linear model Y | X ∼ (Xβ, σ² I_n). The vector of fitted values Ŷ equals Xb, b ∈ R^k, if and only if b solves the linear system
    X^⊤Xb = X^⊤Y.    (2.7)

Proof.
    Ŷ = Xb is a projection of Y into M(X)
    ⇔ Ŷ = Xb is the closest point to Y in M(X)
    ⇔ Ŷ = Xb, where b minimizes SS(β) = ‖Y − Xβ‖² over β ∈ R^k.
Let us find conditions under which SS(β) attains its minimal value over β ∈ R^k. To this end, the vector of first derivatives (gradient) and the matrix of second derivatives (Hessian) of SS(β) are needed:
    SS(β) = ‖Y − Xβ‖² = (Y − Xβ)^⊤(Y − Xβ) = Y^⊤Y − 2Y^⊤Xβ + β^⊤X^⊤Xβ,
    ∂SS(β)/∂β = −2 X^⊤Y + 2 X^⊤Xβ,
    ∂²SS(β)/∂β∂β^⊤ = 2 X^⊤X.    (2.8)
For any β ∈ R^k, the Hessian (2.8) is a positive semidefinite matrix, and hence b minimizes SS(β) over β ∈ R^k if and only if
    ∂SS(b)/∂β = 0_k,
that is, if and only if
    X^⊤Xb = X^⊤Y.    □
^15 součet čtverců
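A numerical sketch of Theorem 2.5 in a full-rank setting: solving the normal equations (2.7) directly gives the same solution as numpy's least squares routine (data simulated):

```python
import numpy as np

# Solving the normal equations X^T X b = X^T Y (full-rank X, unique solution)
# and comparing with np.linalg.lstsq.
rng = np.random.default_rng(5)
n, k = 25, 4
X = rng.normal(size=(n, k))
Y = rng.normal(size=n)

b_normal = np.linalg.solve(X.T @ X, X.T @ Y)
b_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

assert np.allclose(b_normal, b_lstsq)
assert np.allclose(X.T @ (Y - X @ b_normal), 0.0)   # X^T (Y - Xb) = 0_k
print(b_normal.shape)                               # (4,)
```

The second assertion is the equivalent form of the normal equations introduced in Definition 2.6 below; in practice numerical libraries prefer factorization-based solvers (as lstsq does) over explicitly forming X^⊤X.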
Definition 2.6 Normal equations.
Consider a linear model Y | X ∼ (Xβ, σ² I_n). The system of normal equations^16, or concisely the normal equations^17, of the model is the linear system
    X^⊤Xb = X^⊤Y,
or equivalently, the linear system
    X^⊤(Y − Xb) = 0_k.

Note. In general, a linear system of the form (2.7) need not have a solution. Nevertheless, in our case, existence of a solution follows from the fact that it corresponds to the projection Ŷ of Y into the regression space M(X), and existence of the projection Ŷ is guaranteed by the projection properties known from the linear algebra lectures. On the other hand, we can also show quite easily that a solution to the normal equations exists by using the following lemma.
Lemma 2.6 Vector spaces generated by the rows of the model matrix.
Let X_{n×k} be a real matrix. Then
    M(X^⊤X) = M(X^⊤).

Note. The vector space M(X^⊤) is the vector space generated by the columns of the matrix X^⊤, that is, by the rows of the matrix X.

Proof. First note that M(X^⊤X) = M(X^⊤) is equivalent to M(X^⊤X)^⊥ = M(X^⊤)^⊥. We show this by showing that, for any a ∈ R^k, a ∈ M(X^⊤)^⊥ if and only if a ∈ M(X^⊤X)^⊥.
(i) a ∈ M(X^⊤)^⊥  ⇒  a^⊤X^⊤ = 0_n^⊤  ⇒  a^⊤X^⊤X = 0_k^⊤  ⇔  a ∈ M(X^⊤X)^⊥.
(ii) a ∈ M(X^⊤X)^⊥  ⇒  a^⊤X^⊤X = 0_k^⊤  ⇒  a^⊤X^⊤Xa = 0  ⇒  ‖Xa‖ = 0  ⇔  Xa = 0_n  ⇔  a^⊤X^⊤ = 0_n^⊤  ⇔  a ∈ M(X^⊤)^⊥.    □
^16 systém normálních rovnic
^17 normální rovnice
Notes.
• Existence of a solution to the normal equations (2.7) follows from the fact that their right-hand side X^⊤Y ∈ M(X^⊤), and M(X^⊤) is (by Lemma 2.6) the same space as the vector space generated by the columns of the matrix of the linear system, X^⊤X.
• By Theorem A.1, all solutions to the normal equations, i.e., the set of points that minimize the sum of squares SS(β), are given as b = (X^⊤X)^− X^⊤Y, where (X^⊤X)^− is any pseudoinverse of X^⊤X (if rank(X_{n×k}) = r < k, this pseudoinverse is not unique).
• We also have that for any b = (X^⊤X)^− X^⊤Y:  SS_e = SS(b).

Notation.
• In the following, the symbol b will be used exclusively to denote any solution to the normal equations, that is,
    b = (b_0, …, b_{k−1})^⊤ = (X^⊤X)^− X^⊤Y.
• For a full-rank linear model (rank(X_{n×k}) = k), the following holds:
    • The only pseudoinverse (X^⊤X)^− is (X^⊤X)^{−1}.
    • The only solution of the normal equations is b = (X^⊤X)^{−1} X^⊤Y, which is also the unique minimizer of the sum of squares SS(β).
In this case, we will denote the unique solution to the normal equations as β̂. That is,
    β̂ = (β̂_0, …, β̂_{k−1})^⊤ = (X^⊤X)^{−1} X^⊤Y.
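In the rank-deficient case, the non-uniqueness of b (and the uniqueness of Ŷ = Xb) can be sketched as follows; γ = (0, 2, −1)^⊤ is a null vector of the made-up model matrix below:

```python
import numpy as np

# Rank-deficient sketch: different solutions b of the normal equations give
# the same fitted values Y_hat = Xb.
rng = np.random.default_rng(6)
n = 10
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x, 2.0 * x])   # third column = 2 * second: r = 2 < k = 3
Y = rng.normal(size=n)

b1 = np.linalg.pinv(X.T @ X) @ X.T @ Y          # one pseudoinverse-based solution
gamma = np.array([0.0, 2.0, -1.0])              # X @ gamma = 0_n
b2 = b1 + gamma                                 # another solution of (2.7)

assert np.allclose(X.T @ X @ b2, X.T @ Y)       # b2 also solves the normal equations
assert np.allclose(X @ b1, X @ b2)              # identical fitted values
assert not np.allclose(b1, b2)                  # but different coefficient vectors
print(np.allclose(X @ b1, X @ b2))              # True
```

This ambiguity in b is exactly what motivates the notion of an estimable parameter introduced in the next section.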
2.4 Estimable parameters

We have seen in the previous section that the sum of squares SS(β) does not necessarily attain a unique minimum. This happens if the model matrix X_{n×k} has linearly dependent columns (its rank r < k), and hence there exist infinitely many ways to express the vector of fitted values Ŷ ∈ M(X) as a linear combination of the columns of the model matrix X. In other words, there exist infinitely many vectors b ∈ R^k such that Ŷ = Xb. This can also be interpreted as there being infinitely many estimators of the regression parameters β leading to the (unique) unbiased estimator of the response mean μ = E(Y) = Xβ. It then does not make much sense to talk about estimation of the regression parameters β. To avoid such situations, we now define the notion of an estimable parameter^18 of a linear model.
Definition 2.7 Estimable parameter.
Consider a linear model Y | X ∼ (Xβ, σ² I_n). Let l ∈ R^k. We say that a parameter
    θ = l^⊤β
is an estimable parameter of the model if for all μ ∈ M(X) the expression l^⊤β does not depend on the choice of a solution to the linear system Xβ = μ.

Notes.
• The definition of an estimable parameter is equivalent to the requirement
    ∀ β_1, β_2 ∈ R^k:  Xβ_1 = Xβ_2  ⇒  l^⊤β_1 = l^⊤β_2.
That is, an estimable parameter is a linear combination of the regression coefficients β which does not depend on the choice of the β leading to the same vector in the regression space M(X) (leading to the same vector of the response expectation μ).
• In a full-rank model (rank(X_{n×k}) = k), the columns of the model matrix X form a vector basis of the regression space M(X). It then follows from the properties of a vector basis that for any μ ∈ M(X) there exists a unique β such that Xβ = μ. Trivially, for any l ∈ R^k, the expression l^⊤β then does not depend on the choice of a solution to the linear system Xβ = μ, since there is only one such solution. In other words, in a full-rank model, any linear function of the regression coefficients β is estimable.
Definition 2.8 Estimable vector parameter.
Consider a linear model Y | X ∼ (Xβ, σ² I_n). Let l_1, …, l_m ∈ R^k and let L be an m × k matrix having vectors l_1^⊤, …, l_m^⊤ in its rows. We say that a vector parameter
    θ = Lβ
is an estimable vector parameter of the model if all parameters θ_j = l_j^⊤β, j = 1, …, m, are estimable.

Notes.
• The definition of an estimable vector parameter is equivalent to the requirement
    ∀ β_1, β_2 ∈ R^k:  Xβ_1 = Xβ_2  ⇒  Lβ_1 = Lβ_2.
• Trivially, the vector parameter μ = E(Y) = Xβ is always estimable. We also already know its BLUE, which is the vector of fitted values Ŷ.
• In a full-rank model (rank(X_{n×k}) = k), the regression coefficient vector β is an estimable vector parameter.
Example 2.1 (Overparameterized two-sample problem).
Consider a two-sample problem, where Y_1, …, Y_{n_1} are assumed to be identically distributed random variables with E(Y_1) = · · · = E(Y_{n_1}) = μ_1 and var(Y_1) = · · · = var(Y_{n_1}) = σ² (sample 1), Y_{n_1+1}, …, Y_{n_1+n_2} are also assumed to be identically distributed random variables with E(Y_{n_1+1}) = · · · = E(Y_{n_1+n_2}) = μ_2 and var(Y_{n_1+1}) = · · · = var(Y_{n_1+n_2}) = σ² (sample 2), and Y_1, …, Y_{n_1}, Y_{n_1+1}, …, Y_{n_1+n_2} are assumed to be independent. This situation can be described by a linear model Y ∼ (Xβ, σ² I_n), n = n_1 + n_2, where
\[
Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_{n_1} \\ Y_{n_1+1} \\ \vdots \\ Y_{n_1+n_2} \end{pmatrix},
\qquad
\mu = X\beta = \begin{pmatrix} \beta_0 + \beta_1 \\ \vdots \\ \beta_0 + \beta_1 \\ \beta_0 + \beta_2 \\ \vdots \\ \beta_0 + \beta_2 \end{pmatrix}
             = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_1 \\ \mu_2 \\ \vdots \\ \mu_2 \end{pmatrix},
\qquad
X = \begin{pmatrix} 1 & 1 & 0 \\ \vdots & \vdots & \vdots \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ \vdots & \vdots & \vdots \\ 1 & 0 & 1 \end{pmatrix},
\qquad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix}.
\]
(i) The parameters μ_1 = β_0 + β_1 and μ_2 = β_0 + β_2 are (trivially) estimable.
(ii) None of the elements of the vector β is estimable. For example, take β_1 = (0, 1, 0)^⊤ and β_2 = (1, 0, −1)^⊤. We have Xβ_1 = Xβ_2 = (1, …, 1, 0, …, 0)^⊤, but none of the elements of β_1 and β_2 is equal. This corresponds to the fact that the two means μ_1 and μ_2 can be expressed in infinitely many ways using three numbers β_0, β_1, β_2 as μ_1 = β_0 + β_1 and μ_2 = β_0 + β_2.
(iii) A non-trivial estimable parameter is, e.g.,
    θ = μ_2 − μ_1 = β_2 − β_1 = l^⊤β,    l = (0, −1, 1)^⊤.
We have for β_1 = (β_{1,0}, β_{1,1}, β_{1,2})^⊤ ∈ R³ and β_2 = (β_{2,0}, β_{2,1}, β_{2,2})^⊤ ∈ R³:
    Xβ_1 = Xβ_2  ⇔  { β_{1,0} + β_{1,1} = β_{2,0} + β_{2,1}  and  β_{1,0} + β_{1,2} = β_{2,0} + β_{2,2} }
                 ⇒  β_{1,2} − β_{1,1} = β_{2,2} − β_{2,1}  ⇔  l^⊤β_1 = l^⊤β_2.
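Example 2.1 can be reproduced numerically (made-up data): β itself is not identified, but the contrast θ = β_2 − β_1 = μ_2 − μ_1 takes the same value for every solution of the normal equations:

```python
import numpy as np

# Overparameterized two-sample model from Example 2.1 with toy data.
n1, n2 = 3, 4
X = np.vstack([np.tile([1.0, 1.0, 0.0], (n1, 1)),
               np.tile([1.0, 0.0, 1.0], (n2, 1))])
Y = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0])

b1 = np.linalg.pinv(X.T @ X) @ X.T @ Y
b2 = b1 + np.array([1.0, -1.0, -1.0])     # X @ (1,-1,-1)^T = 0, so b2 also solves (2.7)

l = np.array([0.0, -1.0, 1.0])            # theta = beta_2 - beta_1 = mu_2 - mu_1
assert np.allclose(X @ b1, X @ b2)        # same fitted values (the group means)
assert np.isclose(l @ b1, l @ b2)         # estimable: value independent of b
print(l @ b1)                             # 11.5 - 2.0 = 9.5
```

The estimated contrast is simply the difference of the two sample means, which is what one would expect from the two-sample formulation.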
Definition 2.9 Contrast.
Consider a linear model Y | X ∼ (Xβ, σ² I_n). An estimable parameter θ = c^⊤β, given by a real vector c = (c_0, …, c_{k−1})^⊤ which satisfies
    c^⊤1_k = 0,    i.e.,    Σ_{j=0}^{k−1} c_j = 0,
is called a contrast^19.

Definition 2.10 Orthogonal contrasts.
Consider a linear model Y | X ∼ (Xβ, σ² I_n). Contrasts θ = c^⊤β and η = d^⊤β given by orthogonal vectors c = (c_0, …, c_{k−1})^⊤ and d = (d_0, …, d_{k−1})^⊤, i.e., given by vectors c and d that satisfy c^⊤d = 0, are called (mutually) orthogonal contrasts.
Theorem 2.7 Estimable parameter, necessary and sufficient condition.
Assume a linear model Y | X ∼ (Xβ, σ² I_n).
(i) Let l ∈ R^k. The parameter θ = l^⊤β is an estimable parameter if and only if
    l ∈ M(X^⊤).
(ii) A vector θ = Lβ is an estimable vector parameter if and only if
    M(L^⊤) ⊂ M(X^⊤).

Proof.
(i) θ = l^⊤β is estimable
    ⇔ ∀ β_1, β_2 ∈ R^k:  Xβ_1 = Xβ_2  ⇒  l^⊤β_1 = l^⊤β_2
    ⇔ ∀ β_1, β_2 ∈ R^k:  X(β_1 − β_2) = 0_n  ⇒  l^⊤(β_1 − β_2) = 0
    ⇔ ∀ γ ∈ R^k:  Xγ = 0_n  ⇒  l^⊤γ = 0
    ⇔ ∀ γ ∈ R^k:  γ orthogonal to all rows of X  ⇒  l^⊤γ = 0
    ⇔ ∀ γ ∈ R^k:  γ ∈ M(X^⊤)^⊥  ⇒  l^⊤γ = 0
    ⇔ l ∈ M(X^⊤).
(ii) Direct consequence of point (i).    □

Note. In a full-rank model (rank(X_{n×k}) = k < n), M(X^⊤) = R^k. That is, any linear function of β is indeed estimable (a statement that we already concluded from the definition of an estimable parameter).
^19 kontrast
2.4. ESTIMABLE PARAMETERS
27
Theorem 2.8 Gauss–Markov for estimable parameters.
Let θ = l> β be an estimable parameter of a linear model Y X ∼ Xβ, σ 2 In . Let b be any
solution to the normal equations. The statistic
θb = l> b
then satisfies:
(i) θb does not depend on a choice of the solution b of the normal equations, i.e., it does not depend
−
on a choice of a pseudoinverse in b = X> X X> Y .
(ii) θb is the best linear unbiased estimator (BLUE) of the parameter θ.
−
(iii) var θb = σ 2 l> X> X l , that is,
− θb | X ∼ θ, σ 2 l> X> X l ,
where l> X> X
−
l does not depend on a choice of the pseudoinverse X> X
−
If additionally l 6= 0k then l> X> X l > 0.
−
.
>
Let further θ1 = l>
1 β and θ2 = l2 β be estimable parameters. Let
θb1 = l>
1 b,
Then
cov θb1 , θb2
>
where l>
1 X X
−
θb2 = l>
2 b.
>
= σ 2 l>
1 X X
−
l2 ,
l2 does not depend on a choice of the pseudoinverse X> X
−
.
Proof.
(i) Let $b_1$, $b_2$ be two solutions to the normal equations, that is,
$$X^\top Y = X^\top X\, b_1 = X^\top X\, b_2.$$
By Theorem 2.5 (Least squares and normal equations), this is equivalent to
$$\hat Y = X b_1 \;\;\&\;\; \hat Y = X b_2,$$
that is, $X b_1 = X b_2$. Estimability of $\theta$ then implies $l^\top b_1 = l^\top b_2$.
(ii) The parameter $\theta = l^\top\beta$ is estimable. By Theorem 2.7, this is equivalent to $l \in \mathcal{M}(X^\top)$, i.e., to $l = X^\top a$ for some $a \in \mathbb{R}^n$, which implies $\hat\theta = a^\top X b = a^\top \hat Y$. That is, $\hat\theta$ is a linear function of $\hat Y$, which is the BLUE of $\mu = X\beta$. It then follows that $\hat\theta$ is the BLUE of the parameter
$$a^\top\mu = a^\top X\beta = l^\top\beta = \theta.$$
(iii) Proof/calculations were available on the blackboard in K1.
The last part (“Let further $\theta_1 = l_1^\top\beta$ and $\theta_2 = l_2^\top\beta$ be…”): Proof/calculations were available on the blackboard in K1. ∎
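Point (i) of the theorem can be illustrated numerically: in a rank-deficient model the normal equations have infinitely many solutions, yet $l^\top b$ is the same for all of them whenever $\theta = l^\top\beta$ is estimable. A sketch with made-up data (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
# Overparameterized two-group design (rank 2, three columns).
X = np.kron(np.array([[1., 1., 0.], [1., 0., 1.]]), np.ones((3, 1)))
Y = rng.normal(size=6)

# One solution of the normal equations via the Moore-Penrose pseudoinverse.
b1 = np.linalg.pinv(X.T @ X) @ X.T @ Y
# Another solution: add any vector from the null space of X.
null = np.array([1., -1., -1.])       # X @ null = 0
b2 = b1 + 0.7 * null

l = np.array([1., 1., 0.])            # estimable: mu1 = beta0 + beta1
print(np.allclose(l @ b1, l @ b2))    # True: l'b does not depend on b
e = np.array([1., 0., 0.])            # beta0 alone, not estimable
print(np.isclose(e @ b1, e @ b2))     # False: depends on the solution
```

The vector `null` spans the null space of this particular $X$; adding any multiple of it to one solution of the normal equations yields another solution.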
Theorem 2.9 Gauss–Markov for estimable vector parameter.
Let $\theta = L\beta$ be an estimable vector parameter of a linear model $Y \mid X \sim (X\beta, \sigma^2 I_n)$. Let $b$ be any solution to the normal equations. The statistic
$$\hat\theta = Lb$$
then satisfies:
(i) $\hat\theta$ does not depend on the choice of the solution $b$ of the normal equations.
(ii) $\hat\theta$ is the best linear unbiased estimator (BLUE) of the vector parameter $\theta$.
(iii) $\operatorname{var}\bigl(\hat\theta \,\big|\, X\bigr) = \sigma^2\, L(X^\top X)^- L^\top$, that is,
$$\hat\theta \mid X \sim \bigl(\theta,\; \sigma^2\, L(X^\top X)^- L^\top\bigr),$$
where $L(X^\top X)^- L^\top$ does not depend on the choice of the pseudoinverse $(X^\top X)^-$.
If additionally $m \le r$ and the rows of the matrix $L$ are linearly independent, then $L(X^\top X)^- L^\top$ is a positive definite (invertible) matrix.
Proof.
Direct consequence of Theorem 2.8, except the positive definiteness of $L(X^\top X)^- L^\top$ in situations when $L$ has linearly independent rows.
Positive definiteness of $L(X^\top X)^- L^\top$ if $L_{m\times k}$ has linearly independent rows: Proof/calculations were available on the blackboard in K1. ∎

Consequence of Theorem 2.9.
Assume a full-rank linear model $Y \mid X \sim (X\beta, \sigma^2 I_n)$, $\operatorname{rank}(X_{n\times k}) = k < n$. The statistic
$$\hat\beta = (X^\top X)^{-1} X^\top Y$$
then satisfies:
(i) $\hat\beta$ is the best linear unbiased estimator (BLUE) of the regression coefficients $\beta$.
(ii) $\operatorname{var}\bigl(\hat\beta \,\big|\, X\bigr) = \sigma^2 (X^\top X)^{-1}$, that is,
$$\hat\beta \mid X \sim \bigl(\beta,\; \sigma^2 (X^\top X)^{-1}\bigr).$$

Proof. Use $L = I_k$ in Theorem 2.9. ∎
2.5 Parameterizations of a linear model

2.5.1 Equivalent linear models

Definition 2.11 Equivalent linear models.
Assume two linear models: $\mathsf{M}_1: Y \mid X_1 \sim (X_1\beta, \sigma^2 I_n)$, where $X_1$ is an $n \times k$ matrix with $\operatorname{rank}(X_1) = r$, and $\mathsf{M}_2: Y \mid X_2 \sim (X_2\gamma, \sigma^2 I_n)$, where $X_2$ is an $n \times l$ matrix with $\operatorname{rank}(X_2) = r$. We say that the models $\mathsf{M}_1$ and $\mathsf{M}_2$ are equivalent if their regression spaces are the same, that is, if $\mathcal{M}(X_1) = \mathcal{M}(X_2)$.
Notes.
• The two equivalent models
  – have the same hat matrix $H = X_1(X_1^\top X_1)^- X_1^\top = X_2(X_2^\top X_2)^- X_2^\top$ and the same vector of fitted values $\hat Y = HY$;
  – have the same residual projection matrix $M = I_n - H$ and the same vector of residuals $U = MY$;
  – have the same value of the residual sum of squares $SS_e = U^\top U$, residual degrees of freedom $\nu_e = n - r$ and residual mean square $MS_e = SS_e/(n-r)$.
• The two equivalent models provide two different parameterizations of one situation. Nevertheless, the practical interpretation of the regression coefficients $\beta \in \mathbb{R}^k$ and $\gamma \in \mathbb{R}^l$ in the two models may differ. In practice, both parameterizations might be useful, and this is also the reason why it often makes sense to deal with both parameterizations.
2.5.2 Full-rank parameterization of a linear model

Any linear model can be parameterized such that the model matrix has linearly independent columns, i.e., is of full rank. To see this, consider a linear model $Y \mid X \sim (X\beta, \sigma^2 I_n)$, where $\operatorname{rank}(X_{n\times k}) = r \le k < n$. If $Q_{n\times r}$ is a matrix with an orthonormal vector basis of $\mathcal{M}(X)$ in its columns (so that $\operatorname{rank}(Q) = r$), the linear model
$$Y \mid Q \sim (Q\gamma, \sigma^2 I_n) \qquad (2.9)$$
is equivalent to the original model with the model matrix $X$. Nevertheless, parameterization of a model using the orthonormal basis matrix $Q$ is only rarely used in practice, since the interpretation of the regression coefficients $\gamma$ in model (2.9) is usually quite awkward.
Parameterization of a linear model using the orthonormal basis matrix $Q$ is indeed not the only full-rank parameterization of a given linear model. There always exist infinitely many full-rank parameterizations, and in reasonable practical analyses it should always be possible to choose a full-rank parameterization (or several of them) that also provides practically interpretable regression coefficients.
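The construction above can be sketched numerically: an orthonormal basis $Q$ of $\mathcal{M}(X)$ (here taken from the SVD of a deliberately rank-deficient $X$) yields an equivalent full-rank model with the same hat matrix. A minimal illustration (numpy assumed; the design is made up):

```python
import numpy as np

# Rank-deficient design: first column is the sum of the other two.
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
r = np.linalg.matrix_rank(X)           # r = 2

# Orthonormal basis of M(X): the first r left singular vectors.
U_, s, Vt = np.linalg.svd(X, full_matrices=False)
Q = U_[:, :r]

# Both parameterizations yield the same hat matrix, hence the same fit.
H_X = X @ np.linalg.pinv(X.T @ X) @ X.T
H_Q = Q @ Q.T                          # Q'Q = I_r, so (Q'Q)^{-1} = I_r
print(np.allclose(H_X, H_Q))           # True: the models are equivalent
```

The equality of the hat matrices is exactly the equality of the regression spaces $\mathcal{M}(X) = \mathcal{M}(Q)$.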
Example 2.2 (Different parameterizations of a two-sample problem).
Let us again consider a two-sample problem (see also Example 2.1). That is,
$$Y = (Y_1, \dots, Y_{n_1}, Y_{n_1+1}, \dots, Y_{n_1+n_2})^\top,$$
where $Y_1, \dots, Y_{n_1}$ are identically distributed random variables with $E(Y_1) = \dots = E(Y_{n_1}) = \mu_1$ and $\operatorname{var}(Y_1) = \dots = \operatorname{var}(Y_{n_1}) = \sigma^2$ (sample 1), $Y_{n_1+1}, \dots, Y_{n_1+n_2}$ are also identically distributed random variables with $E(Y_{n_1+1}) = \dots = E(Y_{n_1+n_2}) = \mu_2$ and $\operatorname{var}(Y_{n_1+1}) = \dots = \operatorname{var}(Y_{n_1+n_2}) = \sigma^2$ (sample 2), and $Y_1, \dots, Y_{n_1}, Y_{n_1+1}, \dots, Y_{n_1+n_2}$ are assumed to be independent.
This situation can be described by differently parameterized linear models $Y \mid X \sim (X\beta, \sigma^2 I_n)$, $n = n_1 + n_2$, where the model matrix $X$ is always divided into two blocks as
$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix},$$
where $X_1$ is an $n_1 \times k$ matrix having $n_1$ identical rows $x_1^\top$ and $X_2$ is an $n_2 \times k$ matrix having $n_2$ identical rows $x_2^\top$. The response mean vector $\mu = E(Y)$ is then
$$\mu = X\beta = \begin{pmatrix} X_1\beta \\ X_2\beta \end{pmatrix} = \bigl(\underbrace{x_1^\top\beta, \dots, x_1^\top\beta}_{n_1\times},\; \underbrace{x_2^\top\beta, \dots, x_2^\top\beta}_{n_2\times}\bigr)^\top = \bigl(\underbrace{\mu_1, \dots, \mu_1}_{n_1\times},\; \underbrace{\mu_2, \dots, \mu_2}_{n_2\times}\bigr)^\top.$$
That is, a parameterization of the model is given by a choice of vectors $x_1 \neq x_2$, $x_1 \neq 0_k$, $x_2 \neq 0_k$, leading to expressions of the means of the two samples as
$$\mu_1 = x_1^\top\beta, \qquad \mu_2 = x_2^\top\beta.$$
The rank of the model is always $r = 2$.
Overparameterized model, $x_1 = (1, 1, 0)^\top$, $x_2 = (1, 0, 1)^\top$:
$$X = \begin{pmatrix} 1 & 1 & 0 \\ \vdots & \vdots & \vdots \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ \vdots & \vdots & \vdots \\ 1 & 0 & 1 \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix}, \qquad \begin{aligned} \mu_1 &= \beta_0 + \beta_1, \\ \mu_2 &= \beta_0 + \beta_2. \end{aligned}$$
Orthonormal basis, $x_1 = (1/\sqrt{n_1},\, 0)^\top$, $x_2 = (0,\, 1/\sqrt{n_2})^\top$:
$$X = Q = \begin{pmatrix} 1/\sqrt{n_1} & 0 \\ \vdots & \vdots \\ 1/\sqrt{n_1} & 0 \\ 0 & 1/\sqrt{n_2} \\ \vdots & \vdots \\ 0 & 1/\sqrt{n_2} \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, \qquad \begin{aligned} \mu_1 &= \frac{1}{\sqrt{n_1}}\,\beta_1, & \beta_1 &= \sqrt{n_1}\,\mu_1, \\ \mu_2 &= \frac{1}{\sqrt{n_2}}\,\beta_2, & \beta_2 &= \sqrt{n_2}\,\mu_2. \end{aligned}$$
Group means, $x_1 = (1, 0)^\top$, $x_2 = (0, 1)^\top$:
$$X = \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, \qquad \mu_1 = \beta_1, \quad \mu_2 = \beta_2.$$
This could also be viewed as the overparameterized model constrained by the condition $\beta_0 = 0$.
Group differences, $x_1 = (1, 1)^\top$, $x_2 = (1, 0)^\top$:
$$X = \begin{pmatrix} 1 & 1 \\ \vdots & \vdots \\ 1 & 1 \\ 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}, \qquad \begin{aligned} \mu_1 &= \beta_0 + \beta_1, \\ \mu_2 &= \beta_0, \end{aligned} \qquad \beta_1 = \mu_1 - \mu_2.$$
This could also be viewed as the overparameterized model constrained by the condition $\beta_2 = 0$.
Deviations from the mean of the means, $x_1 = (1, 1)^\top$, $x_2 = (1, -1)^\top$:
$$X = \begin{pmatrix} 1 & 1 \\ \vdots & \vdots \\ 1 & 1 \\ 1 & -1 \\ \vdots & \vdots \\ 1 & -1 \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}, \qquad \begin{aligned} \mu_1 &= \beta_0 + \beta_1, \\ \mu_2 &= \beta_0 - \beta_1, \end{aligned} \qquad \begin{aligned} \beta_0 &= \frac{\mu_1 + \mu_2}{2}, \\ \beta_1 &= \mu_1 - \frac{\mu_1 + \mu_2}{2} = \frac{\mu_1 + \mu_2}{2} - \mu_2. \end{aligned}$$
This could also be viewed as the overparameterized model constrained by the condition $\beta_1 + \beta_2 = 0$.
Except for the overparameterized model, all of the above parameterizations are based on a model matrix of full rank $r = 2$.
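The parameterizations of Example 2.2 can also be compared numerically: equivalent model matrices produce identical fitted values, while the estimated coefficients carry different interpretations. A sketch with made-up two-sample data (numpy assumed):

```python
import numpy as np

# Tiny two-sample data: n1 = n2 = 3 (illustrative numbers).
Y = np.array([1., 2., 3., 7., 8., 9.])
g = np.array([0, 0, 0, 1, 1, 1])        # 0 = sample 1, 1 = sample 2

def fit(X, Y):
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return b, X @ b

# Group means: mu1 = beta1, mu2 = beta2.
X_means = np.column_stack([1 - g, g]).astype(float)
# Group differences: mu1 = beta0 + beta1, mu2 = beta0.
X_diff = np.column_stack([np.ones(6), 1 - g]).astype(float)

b_means, fit1 = fit(X_means, Y)
b_diff, fit2 = fit(X_diff, Y)

print(b_means)                  # [2. 8.]  -> the two sample means
print(b_diff)                   # [8. -6.] -> beta0 = mu2, beta1 = mu1 - mu2
print(np.allclose(fit1, fit2))  # True: equivalent models, same fitted values
```

The “group means” fit estimates $\mu_1, \mu_2$ directly; the “group differences” fit estimates $\mu_2$ and $\mu_1 - \mu_2$, yet both span the same regression space.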
End of Lecture #3 (08/10/2015)
2.6 Matrix algebra and a method of least squares

Start of Lecture #4 (15/10/2015)

We have seen in Section 2.5 that any linear model $Y \mid X \sim (X\beta, \sigma^2 I_n)$ can be reparameterized such that the model matrix $X$ has linearly independent columns, that is, $\operatorname{rank}(X_{n\times k}) = k$. Recall now the expressions for some quantities that must be calculated when dealing with the least squares estimation of the parameters of a full-rank linear model:
$$\begin{aligned}
H &= X(X^\top X)^{-1} X^\top, & M &= I_n - H = I_n - X(X^\top X)^{-1} X^\top, \\
\hat Y &= HY = X(X^\top X)^{-1} X^\top Y, & \operatorname{var}\bigl(\hat Y \,\big|\, X\bigr) &= \sigma^2 H = \sigma^2 X(X^\top X)^{-1} X^\top, \\
U &= MY = Y - \hat Y, & \operatorname{var}\bigl(U \,\big|\, X\bigr) &= \sigma^2 M = \sigma^2\bigl\{I_n - X(X^\top X)^{-1} X^\top\bigr\}, \\
\hat\beta &= (X^\top X)^{-1} X^\top Y, & \operatorname{var}\bigl(\hat\beta \,\big|\, X\bigr) &= \sigma^2 (X^\top X)^{-1}.
\end{aligned}$$
The only non-trivial calculation involved in the above expressions is the inverse $(X^\top X)^{-1}$. Nevertheless, all of the above expressions (and many others needed in the context of least squares estimation) can be calculated without explicit evaluation of the matrix $X^\top X$. Some of the above expressions can even be evaluated without knowing explicitly the form of the matrix $(X^\top X)^{-1}$. To this end, methods of matrix algebra can be used (and are used by all reasonable software routines dealing with least squares estimation). Two methods, known from the course Fundamentals of Numerical Mathematics (NMNM201), that have direct use in the context of least squares are:
• the QR decomposition;
• the singular value decomposition (SVD),
applied to the model matrix $X$. Both of them can be used, among other things, to find an orthonormal vector basis of the regression space $\mathcal{M}(X)$ and to calculate the expressions mentioned above.
2.6.1 QR decomposition
The QR decomposition of the model matrix is used, for example, by the R software (R Core Team, 2015) to estimate a linear model by the method of least squares. If $X_{n\times k}$ is a real matrix with $\operatorname{rank}(X) = k < n$, then we know from the course Fundamentals of Numerical Mathematics (NMNM201) that it can be decomposed as
$$X = QR,$$
where $Q_{n\times k} = (q_1, \dots, q_k)$, $q_j \in \mathbb{R}^n$, $j = 1, \dots, k$, the columns $q_1, \dots, q_k$ form an orthonormal basis of $\mathcal{M}(X)$, and $R_{k\times k}$ is an upper triangular matrix. That is,
$$Q^\top Q = I_k, \qquad QQ^\top = H.$$
We then have
$$X^\top X = R^\top \underbrace{Q^\top Q}_{I_k} R = R^\top R. \qquad (2.10)$$
That is, $R^\top R$ is a Cholesky (square root) decomposition of the symmetric matrix $X^\top X$. Note that this is a special case of an LU decomposition for symmetric matrices. Decomposition (2.10) can now be used to easily obtain (i) the matrix $(X^\top X)^{-1}$, (ii) the value of its determinant or the value of the determinant of $X^\top X$, and (iii) the solution to the normal equations.
(i) Matrix $(X^\top X)^{-1}$.
$$(X^\top X)^{-1} = (R^\top R)^{-1} = R^{-1}\bigl(R^\top\bigr)^{-1} = R^{-1}\bigl(R^{-1}\bigr)^\top.$$
That is, to invert the matrix $X^\top X$, we only have to invert the upper triangular matrix $R$.
−1
(ii) Determinant of X> X and X> X .
Let r1 , . . . , rk denote diagonal elements of the matrix R. We then have
k
2
2 Y
det X> X = det R> R = det(R) =
rj ,
j=1
n
−1 o n
o−1
.
det X> X
= det X> X
(iii) Solution to the normal equations, $\hat\beta = (X^\top X)^{-1} X^\top Y$.
We can obtain $\hat\beta$ by solving
$$X^\top X\, b = X^\top Y \;\;\Longleftrightarrow\;\; R^\top R\, b = R^\top Q^\top Y \;\;\Longleftrightarrow\;\; R\, b = Q^\top Y. \qquad (2.11)$$
That is, to get $\hat\beta$, it is only necessary to solve a linear system with an upper triangular system matrix, which can easily be done by backward substitution.
Further, the right-hand side $c = (c_1, \dots, c_k)^\top := Q^\top Y$ of the linear system (2.11) additionally serves to calculate the vector of fitted values. We have
$$\hat Y = HY = QQ^\top Y = Qc = \sum_{j=1}^k c_j q_j.$$
That is, the vector $c$ provides the coefficients of the linear combination of the orthonormal vector basis of the regression space $\mathcal{M}(X)$ that gives the fitted values $\hat Y$.
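Steps (i)–(iii) and the fitted-values identity $\hat Y = Qc$ can be sketched with a QR decomposition (numpy assumed; the data are arbitrary illustrative numbers, and numpy's general solver stands in for a dedicated back-substitution routine):

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 8, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # full rank
Y = rng.normal(size=n)

Q, R = np.linalg.qr(X)                 # X = QR, R upper triangular

# (iii) Solve R b = Q'Y, Eq. (2.11), instead of forming (X'X)^{-1}.
c = Q.T @ Y
beta = np.linalg.solve(R, c)           # triangular system
assert np.allclose(beta, np.linalg.lstsq(X, Y, rcond=None)[0])

# (ii) det(X'X) = (product of the diagonal of R)^2.
assert np.isclose(np.linalg.det(X.T @ X), np.prod(np.diag(R)) ** 2)

# Fitted values as a combination of the orthonormal basis: Yhat = Qc.
assert np.allclose(Q @ c, X @ beta)
print("QR-based least squares checks passed")
```

No $X^\top X$ inversion is performed anywhere; everything follows from $Q$ and the triangular $R$.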
2.6.2 SVD decomposition

Use of the SVD for least squares will not be explained in detail in this course. It is covered by the Fundamentals of Numerical Mathematics (NMNM201) course.
Chapter 3
Normal Linear Model

Until now, none of the proved theorems posed any distributional assumptions on the random vectors $(Y_i, X_i)$, $X_i = (X_{i,0}, \dots, X_{i,k-1})^\top$, $i = 1, \dots, n$, that represent the data. We only assumed a certain form of the (conditional) expectation and the (conditional) covariance matrix of $Y = (Y_1, \dots, Y_n)^\top$ given $X_1, \dots, X_n$ (given the model matrix $X$). In this chapter, we will additionally assume that the response is conditionally normally distributed given the covariates, which will lead us to the normal linear model.
3.1 Normal linear model
Definition 3.1 Normal linear model.
The data $(Y_i, X_i)$, $i = 1, \dots, n$, satisfy a normal linear model¹ if they satisfy a linear model with i.i.d. errors with
$$Y_i \mid X_i \sim \mathcal{N}\bigl(X_i^\top\beta, \sigma^2\bigr),$$
where $\beta \in \mathbb{R}^k$ and $0 < \sigma^2 < \infty$ are unknown parameters.
Notation. It follows from Definition 3.1, from the definition of a linear model with i.i.d. errors (Definition 1.4) and from properties of the normal distribution that the joint conditional distribution of $Y$ given $X$ is multivariate normal with mean $X\beta$ and covariance matrix $\sigma^2 I_n$. Hence the fact that the data follow a normal linear model will be indicated by the notation
$$Y \mid X \sim \mathcal{N}_n(X\beta, \sigma^2 I_n).$$
Definition 3.1 using the error terms.
The data $(Y_i, X_i)$, $i = 1, \dots, n$, satisfy a normal linear model if
$$Y_i = X_i^\top\beta + \varepsilon_i, \qquad i = 1, \dots, n,$$
where $\varepsilon_1, \dots, \varepsilon_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$, $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^\top$ and $X$ are independent, and $\beta \in \mathbb{R}^k$ and $0 < \sigma^2 < \infty$ are unknown parameters.

Notation. To indicate that the data follow a normal linear model, we will also use the notation
$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}_n(0_n, \sigma^2 I_n).$$

¹ normální lineární model
3.2 Properties of the least squares estimators under normality
Theorem 3.1 Least squares estimators under normality.
Let $Y \mid X \sim \mathcal{N}_n(X\beta, \sigma^2 I_n)$, $\operatorname{rank}(X_{n\times k}) = r$. Let $L_{m\times k}$ be a real matrix with non-zero rows $l_1^\top, \dots, l_m^\top$ such that $\theta = (\theta_1, \dots, \theta_m)^\top = (l_1^\top\beta, \dots, l_m^\top\beta)^\top = L\beta$ is an estimable parameter. Let $\hat\theta = (\hat\theta_1, \dots, \hat\theta_m)^\top = (l_1^\top b, \dots, l_m^\top b)^\top = Lb$ be its least squares estimator. Further, let
$$V = L(X^\top X)^- L^\top = (v_{j,t})_{j,t=1,\dots,m}, \qquad D = \operatorname{diag}\Bigl(\frac{1}{\sqrt{v_{1,1}}}, \dots, \frac{1}{\sqrt{v_{m,m}}}\Bigr),$$
$$T_j = \frac{\hat\theta_j - \theta_j}{\sqrt{MS_e\, v_{j,j}}}, \quad j = 1, \dots, m, \qquad T = (T_1, \dots, T_m)^\top = \frac{1}{\sqrt{MS_e}}\, D\bigl(\hat\theta - \theta\bigr).$$
The following then holds.
(i) $\hat Y \mid X \sim \mathcal{N}_n(X\beta, \sigma^2 H)$.
(ii) $U \mid X \sim \mathcal{N}_n(0_n, \sigma^2 M)$.
(iii) $\hat\theta \mid X \sim \mathcal{N}_m(\theta, \sigma^2 V)$.
(iv) The statistics $\hat Y$ and $U$ are conditionally, given $X$, independent.
(v) The statistics $\hat\theta$ and $SS_e$ are conditionally, given $X$, independent.
(vi) $\dfrac{\bigl\|\hat Y - X\beta\bigr\|^2}{\sigma^2} \sim \chi^2_r$.
(vii) $\dfrac{SS_e}{\sigma^2} \sim \chi^2_{n-r}$.
(viii) For each $j = 1, \dots, m$, $\;T_j \sim t_{n-r}$.
(ix) $T \mid X \sim \text{mvt}_{m,n-r}(DVD)$.
(x) If additionally $\operatorname{rank}(L_{m\times k}) = m \le r$, then the matrix $V$ is invertible and
$$\frac{1}{m}\bigl(\hat\theta - \theta\bigr)^\top \bigl(MS_e\, V\bigr)^{-1} \bigl(\hat\theta - \theta\bigr) \sim \mathcal{F}_{m,n-r}.$$

Proof. Proof/calculations were available on the blackboard in K1. ∎
In a full-rank linear model we have $\hat\beta = (X^\top X)^{-1} X^\top Y$ and, under the normality assumption, Theorem 3.1 can be used to state additional properties of the LSE $\hat\beta$ of the regression coefficients $\beta$.
Consequence of Theorem 3.1: Least squares estimator of the regression coefficients in a full-rank normal linear model.
Let $Y \mid X \sim \mathcal{N}_n(X\beta, \sigma^2 I_n)$, $\operatorname{rank}(X_{n\times k}) = k$. Further, let
$$V = (X^\top X)^{-1} = (v_{j,t})_{j,t=0,\dots,k-1}, \qquad D = \operatorname{diag}\Bigl(\frac{1}{\sqrt{v_{0,0}}}, \dots, \frac{1}{\sqrt{v_{k-1,k-1}}}\Bigr).$$
The following then holds.
(i) $\hat\beta \mid X \sim \mathcal{N}_k(\beta, \sigma^2 V)$.
(ii) The statistics $\hat\beta$ and $SS_e$ are conditionally, given $X$, independent.
(iii) For each $j = 0, \dots, k-1$, $\;T_j := \dfrac{\hat\beta_j - \beta_j}{\sqrt{MS_e\, v_{j,j}}} \sim t_{n-k}$.
(iv) $T := (T_0, \dots, T_{k-1})^\top = \dfrac{1}{\sqrt{MS_e}}\, D\bigl(\hat\beta - \beta\bigr) \sim \text{mvt}_{k,n-k}(DVD)$.
(v) $\dfrac{1}{k}\bigl(\hat\beta - \beta\bigr)^\top MS_e^{-1}\, X^\top X\, \bigl(\hat\beta - \beta\bigr) \sim \mathcal{F}_{k,n-k}$.

Proof. Use $L = I_k$ in Theorem 3.1 and realize that the only pseudoinverse of the matrix $X^\top X$ in a full-rank model is the inverse $(X^\top X)^{-1}$. ∎
Theorem 3.1 and its consequence can now be used to perform the principal statistical inference, i.e., calculation of confidence intervals and regions and testing of statistical hypotheses, in a normal linear model.
3.2.1 Statistical inference in a full-rank normal linear model

Assume a full-rank normal linear model $Y \mid X \sim \mathcal{N}_n(X\beta, \sigma^2 I_n)$, $\operatorname{rank}(X_{n\times k}) = k$, and keep denoting $V = (X^\top X)^{-1} = (v_{j,t})_{j,t=0,\dots,k-1}$.
Inference on a chosen regression coefficient
First, take a chosen $j \in \{0, \dots, k-1\}$. We then have the following.
• Standard error of $\hat\beta_j$ and confidence interval for $\beta_j$
We have $\operatorname{var}\bigl(\hat\beta_j \,\big|\, X\bigr) = \sigma^2 v_{j,j}$ (Consequence of Theorem 2.9), which is unbiasedly estimated by $MS_e\, v_{j,j}$ (Theorem 2.4). The square root of this quantity, i.e., the estimated standard deviation of $\hat\beta_j$, is then called the standard error² of the estimator $\hat\beta_j$. That is,
$$\text{S.E.}\bigl(\hat\beta_j\bigr) = \sqrt{MS_e\, v_{j,j}}. \qquad (3.1)$$
The standard error (3.1) is also the denominator of the $t$-statistic $T_j$ from point (iii) of the Consequence of Theorem 3.1. Hence the lower and upper bounds of the Wald-type $(1-\alpha)\,100\%$ confidence interval for $\beta_j$ based on the statistic $T_j$ are
$$\hat\beta_j \pm \text{S.E.}\bigl(\hat\beta_j\bigr)\; t_{n-k}\Bigl(1 - \frac{\alpha}{2}\Bigr).$$
Analogously, one-sided confidence intervals can also be calculated.

² směrodatná, příp. standardní chyba
• Test on a value of $\beta_j$
Suppose that for a given $\beta_j^0 \in \mathbb{R}$, we aim at testing
$$H_0: \beta_j = \beta_j^0, \qquad H_1: \beta_j \neq \beta_j^0.$$
The Wald-type test based on point (iii) of the Consequence of Theorem 3.1 proceeds as follows:
Test statistic: $\;T_{j,0} = \dfrac{\hat\beta_j - \beta_j^0}{\text{S.E.}\bigl(\hat\beta_j\bigr)} = \dfrac{\hat\beta_j - \beta_j^0}{\sqrt{MS_e\, v_{j,j}}}$.
Reject $H_0$ if $\;|T_{j,0}| \ge t_{n-k}\bigl(1 - \tfrac{\alpha}{2}\bigr)$.
P-value when $T_{j,0} = t_{j,0}$: $\;p = 2\,\text{CDF}_{t,\,n-k}\bigl(-|t_{j,0}|\bigr)$.
Analogously, one-sided tests can also be conducted.
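The computations behind (3.1) and the statistic $T_{j,0}$ can be sketched as follows (numpy assumed; the straight-line data are made up for illustration). The $t_{n-k}$ quantile needed for the interval bounds or the critical value is not computed here; it would come from tables or, e.g., `scipy.stats.t`:

```python
import numpy as np

# Illustrative data: simple straight-line model (hypothetical numbers).
x = np.array([1., 2., 3., 4., 5., 6.])
Y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.9])
X = np.column_stack([np.ones_like(x), x])   # intercept + slope
n, k = X.shape

V = np.linalg.inv(X.T @ X)                  # V = (X'X)^{-1}
beta = V @ X.T @ Y
U = Y - X @ beta                            # residuals
MSe = (U @ U) / (n - k)                     # residual mean square

se = np.sqrt(MSe * np.diag(V))              # S.E.(beta_j), Eq. (3.1)
T0 = beta / se                              # T_{j,0} for H0: beta_j = 0
print(beta, se, T0)
```

For the interval, multiply `se` by $t_{n-k}(1-\alpha/2)$; compare $|T_0|$ with the same quantile for the test.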
End of Lecture #4 (15/10/2015)
Start of Lecture #6 (22/10/2015)

Simultaneous inference on a vector of regression coefficients

When the interest lies in inference for the full vector of the regression coefficients $\beta$, the following procedures can be used.
• Simultaneous confidence region³ for $\beta$
It follows from point (v) of the Consequence of Theorem 3.1 that the simultaneous $(1-\alpha)\,100\%$ confidence region for $\beta$ is the set
$$\Bigl\{\beta \in \mathbb{R}^k : \bigl(\beta - \hat\beta\bigr)^\top MS_e^{-1}\, X^\top X\, \bigl(\beta - \hat\beta\bigr) < k\, \mathcal{F}_{k,n-k}(1-\alpha)\Bigr\},$$
which is an ellipsoid with
$$\text{center: } \hat\beta, \qquad \text{shape matrix: } MS_e\, (X^\top X)^{-1} = \widehat{\operatorname{var}}\bigl(\hat\beta\bigr), \qquad \text{diameter: } \sqrt{k\, \mathcal{F}_{k,n-k}(1-\alpha)}.$$
Remember from the linear algebra and geometry lectures that the shape matrix determines the principal directions of the ellipsoid, as those are given by the eigenvectors of this matrix. In this case, the principal directions of the confidence ellipsoid are given by the eigenvectors of the estimated covariance matrix $\widehat{\operatorname{var}}\bigl(\hat\beta\bigr)$.
• Test on a value of $\beta$
Suppose that for a given $\beta^0 \in \mathbb{R}^k$, we aim at testing
$$H_0: \beta = \beta^0, \qquad H_1: \beta \neq \beta^0.$$
The Wald-type test based on point (v) of the Consequence of Theorem 3.1 proceeds as follows:
Test statistic: $\;Q_0 = \dfrac{1}{k}\bigl(\hat\beta - \beta^0\bigr)^\top MS_e^{-1}\, X^\top X\, \bigl(\hat\beta - \beta^0\bigr)$.
Reject $H_0$ if $\;Q_0 \ge \mathcal{F}_{k,n-k}(1-\alpha)$.
P-value when $Q_0 = q_0$: $\;p = 1 - \text{CDF}_{\mathcal{F},\,k,n-k}(q_0)$.

³ simultánní konfidenční oblast
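The Wald statistic $Q_0$ for a hypothesis on the full coefficient vector can be sketched as follows (numpy assumed; simulated toy data, $\mathcal{F}$ quantile again left to tables or `scipy.stats.f`):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1., 2., 0.])
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

XtX = X.T @ X
beta = np.linalg.solve(XtX, X.T @ Y)
MSe = np.sum((Y - X @ beta) ** 2) / (n - k)

def Q0(beta0):
    """Wald statistic for H0: beta = beta0; compare with F_{k,n-k}(1-alpha)."""
    d = beta - beta0
    return (d @ XtX @ d) / (k * MSe)

print(Q0(beta_true))      # moderate: H0 true up to sampling noise
print(Q0(np.zeros(k)))    # large: H0 grossly violated
```

The statistic vanishes at $\beta^0 = \hat\beta$ (the center of the confidence ellipsoid) and grows with the Mahalanobis-type distance from it.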
3.2.2 Statistical inference in a general-rank normal linear model

Let us now assume a general-rank normal linear model $Y \mid X \sim \mathcal{N}_n(X\beta, \sigma^2 I_n)$, $\operatorname{rank}(X_{n\times k}) = r \le k$.
Inference on an estimable parameter
Let $\theta = l^\top\beta$, $l \neq 0_k$, be an estimable parameter and let $\hat\theta = l^\top b$ be its least squares estimator.
• Standard error of $\hat\theta$ and confidence interval for $\theta$
We have $\operatorname{var}\bigl(\hat\theta \,\big|\, X\bigr) = \sigma^2\, l^\top (X^\top X)^- l$ (Theorem 2.8), which is unbiasedly estimated by $MS_e\, l^\top (X^\top X)^- l$ (Theorem 2.4). Hence the standard error of $\hat\theta$ is
$$\text{S.E.}\bigl(\hat\theta\bigr) = \sqrt{MS_e\, l^\top (X^\top X)^- l}. \qquad (3.2)$$
The standard error (3.2) is also the denominator of the appropriate $t$-statistic from point (viii) of Theorem 3.1. Hence the lower and upper bounds of the Wald-type $(1-\alpha)\,100\%$ confidence interval for $\theta$ based on this $t$-statistic are
$$\hat\theta \pm \text{S.E.}\bigl(\hat\theta\bigr)\; t_{n-r}\Bigl(1 - \frac{\alpha}{2}\Bigr).$$
Analogously, one-sided confidence intervals can also be calculated.
• Test on a value of $\theta$
Suppose that for a given $\theta_0 \in \mathbb{R}$, we aim at testing
$$H_0: \theta = \theta_0, \qquad H_1: \theta \neq \theta_0.$$
The Wald-type test based on point (viii) of Theorem 3.1 proceeds as follows:
Test statistic: $\;T_0 = \dfrac{\hat\theta - \theta_0}{\text{S.E.}\bigl(\hat\theta\bigr)} = \dfrac{\hat\theta - \theta_0}{\sqrt{MS_e\, l^\top (X^\top X)^- l}}$.
Reject $H_0$ if $\;|T_0| \ge t_{n-r}\bigl(1 - \tfrac{\alpha}{2}\bigr)$.
P-value when $T_0 = t_0$: $\;p = 2\,\text{CDF}_{t,\,n-r}\bigl(-|t_0|\bigr)$.
Analogously, one-sided tests can also be conducted.
Simultaneous inference on an estimable vector parameter
Finally, let $\theta = L\beta$ be an estimable vector parameter, where $L$ is an $m \times k$ matrix with $m \le r$ linearly independent rows. Let $\hat\theta = Lb$ be the least squares estimator of $\theta$.
• Simultaneous confidence region for $\theta$
It follows from point (x) of Theorem 3.1 that the simultaneous $(1-\alpha)\,100\%$ confidence region for $\theta$ is the set
$$\Bigl\{\theta \in \mathbb{R}^m : \bigl(\theta - \hat\theta\bigr)^\top \bigl\{MS_e\, L(X^\top X)^- L^\top\bigr\}^{-1} \bigl(\theta - \hat\theta\bigr) < m\, \mathcal{F}_{m,n-r}(1-\alpha)\Bigr\},$$
which is an ellipsoid with
$$\text{center: } \hat\theta, \qquad \text{shape matrix: } MS_e\, L(X^\top X)^- L^\top = \widehat{\operatorname{var}}\bigl(\hat\theta\bigr), \qquad \text{diameter: } \sqrt{m\, \mathcal{F}_{m,n-r}(1-\alpha)}.$$
• Test on a value of $\theta$
Suppose that for a given $\theta^0 \in \mathbb{R}^m$, we aim at testing
$$H_0: \theta = \theta^0, \qquad H_1: \theta \neq \theta^0.$$
The Wald-type test based on point (x) of Theorem 3.1 proceeds as follows:
Test statistic: $\;Q_0 = \dfrac{1}{m}\bigl(\hat\theta - \theta^0\bigr)^\top \bigl\{MS_e\, L(X^\top X)^- L^\top\bigr\}^{-1} \bigl(\hat\theta - \theta^0\bigr)$.
Reject $H_0$ if $\;Q_0 \ge \mathcal{F}_{m,n-r}(1-\alpha)$.
P-value when $Q_0 = q_0$: $\;p = 1 - \text{CDF}_{\mathcal{F},\,m,n-r}(q_0)$.
Note. Assume again a full-rank model ($r = k$) and take $L$ as a submatrix of the identity matrix $I_k$ obtained by selecting some of its rows. The above procedures can then be used to infer simultaneously on a subvector of the regression coefficients $\beta$.
Note. All tests, confidence intervals and confidence regions in this section were derived under the assumption of a normal linear model. Nevertheless, we show in Chapter 13 that, under certain conditions, all those methods of statistical inference remain asymptotically valid even if normality does not hold.
3.3 Confidence interval for the model based mean, prediction interval
We keep assuming that the data $(Y_i, X_i)$, $i = 1, \dots, n$, follow a normal linear model. That is,
$$Y_i = X_i^\top\beta + \varepsilon_i, \qquad \varepsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2).$$
In other words,
$$Y_i \mid X_i \sim \mathcal{N}\bigl(X_i^\top\beta, \sigma^2\bigr)$$
and $Y_1, \dots, Y_n$ are conditionally independent given $X_1, \dots, X_n$.
Remember that $\mathcal{X} \subseteq \mathbb{R}^k$ denotes the sample space of the covariate random vectors $X_1, \dots, X_n$. Let $x_{\text{new}} \in \mathcal{X}$ and let
$$Y_{\text{new}} = x_{\text{new}}^\top\beta + \varepsilon_{\text{new}},$$
where $\varepsilon_{\text{new}} \sim \mathcal{N}(0, \sigma^2)$ is independent of $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^\top$. A value of $Y_{\text{new}}$ is thus a value of a “new” observation sampled from the conditional distribution
$$Y_{\text{new}} \mid X_{\text{new}} = x_{\text{new}} \sim \mathcal{N}\bigl(x_{\text{new}}^\top\beta, \sigma^2\bigr)$$
independently of the “old” observations. We will now tackle two important problems:
(i) Interval estimation of $\mu_{\text{new}} := E\bigl(Y_{\text{new}} \,\big|\, X_{\text{new}} = x_{\text{new}}\bigr) = x_{\text{new}}^\top\beta$.
(ii) Interval estimation of the value of the random variable $Y_{\text{new}}$ itself, given the covariate vector $X_{\text{new}} = x_{\text{new}}$.
Solutions to the outlined problems are provided by the following theorem.
Theorem 3.2 Confidence interval for the model based mean, prediction interval.
Let $Y = X\beta + \varepsilon$, $\varepsilon \sim \mathcal{N}_n(0_n, \sigma^2 I_n)$, $\operatorname{rank}(X_{n\times k}) = r$. Let $x_{\text{new}} \in \mathcal{X} \cap \mathcal{M}(X^\top)$, $x_{\text{new}} \neq 0_k$. Let $\varepsilon_{\text{new}} \sim \mathcal{N}(0, \sigma^2)$ be independent of $\varepsilon$. Finally, let $Y_{\text{new}} = x_{\text{new}}^\top\beta + \varepsilon_{\text{new}}$. The following then holds:
(i) $\mu_{\text{new}} = x_{\text{new}}^\top\beta$ is estimable,
$$\hat\mu_{\text{new}} = x_{\text{new}}^\top b$$
is its best linear unbiased estimator (BLUE) with standard error
$$\text{S.E.}\bigl(\hat\mu_{\text{new}}\bigr) = \sqrt{MS_e\, x_{\text{new}}^\top (X^\top X)^- x_{\text{new}}},$$
and the lower and upper bounds of the $(1-\alpha)\,100\%$ confidence interval for $\mu_{\text{new}}$ are
$$\hat\mu_{\text{new}} \pm \text{S.E.}\bigl(\hat\mu_{\text{new}}\bigr)\; t_{n-r}\Bigl(1 - \frac{\alpha}{2}\Bigr). \qquad (3.3)$$
(ii) A (random) interval with the bounds
$$\hat\mu_{\text{new}} \pm \text{S.E.P.}\bigl(x_{\text{new}}\bigr)\; t_{n-r}\Bigl(1 - \frac{\alpha}{2}\Bigr), \qquad (3.4)$$
where
$$\text{S.E.P.}\bigl(x_{\text{new}}\bigr) = \sqrt{MS_e\, \bigl\{1 + x_{\text{new}}^\top (X^\top X)^- x_{\text{new}}\bigr\}}, \qquad (3.5)$$
covers the value of $Y_{\text{new}}$ with probability $(1-\alpha)$.
Proof. Proof/calculations were available on the blackboard in K1. ∎
Terminology (Confidence interval for the model based mean, prediction interval, standard error of prediction).
• The interval with the bounds (3.3) is called the confidence interval for the model based mean.
• The interval with the bounds (3.4) is called the prediction interval.
• The quantity (3.5) is called the standard error of prediction.
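The two standard errors, and the fact that the prediction interval is always wider than the confidence interval for the mean, can be sketched numerically (numpy assumed; simulated toy data in a full-rank model, so $r = k$ and the $t_{n-k}$ quantile is left out):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
x = np.linspace(0., 10., n)
Y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=n)
X = np.column_stack([np.ones(n), x])
k = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
MSe = np.sum((Y - X @ b) ** 2) / (n - k)

x_new = np.array([1., 5.0])                  # new covariate vector (1, z_new)
mu_hat = x_new @ b
q = x_new @ XtX_inv @ x_new
se_mean = np.sqrt(MSe * q)                   # S.E. for the model based mean
sep = np.sqrt(MSe * (1.0 + q))               # standard error of prediction (3.5)
print(mu_hat, se_mean, sep)
# Multiply by t_{n-k}(1 - alpha/2) for the interval half-widths (3.3)-(3.4).
```

Note that $\text{S.E.P.}^2 - \text{S.E.}^2 = MS_e$: the extra width of the prediction interval accounts for the noise $\varepsilon_{\text{new}}$ of the new observation itself.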
Terminology (Fitted regression function).
Suppose that the corresponding linear model is of full rank with the LSE $\hat\beta$ of the regression coefficients. The function
$$\hat m(x) = x^\top\hat\beta, \qquad x \in \mathcal{X},$$
which, by Theorem 3.2, provides BLUEs of the values of
$$\mu(x) := E\bigl(Y_{\text{new}} \,\big|\, X_{\text{new}} = x\bigr) = x^\top\beta$$
and also provides predictions for $Y_{\text{new}} = x^\top\beta + \varepsilon_{\text{new}}$, is called the fitted regression function.⁴
Terminology (Confidence band around the regression function, prediction band).
As was explained in Section 1.2.7, the covariates $X_i \in \mathcal{X} \subseteq \mathbb{R}^k$ used in the linear model are often obtained by transforming some original covariates $Z_i \in \mathcal{Z} \subseteq \mathbb{R}^p$. A common situation is that $\mathcal{Z} \subseteq \mathbb{R}$ is an interval and
$$X_i = (X_{i,0}, \dots, X_{i,k-1})^\top = \bigl(t_0(Z_i), \dots, t_{k-1}(Z_i)\bigr)^\top = t(Z_i), \qquad i = 1, \dots, n,$$
where $t: \mathbb{R} \longrightarrow \mathbb{R}^k$ is a suitable transformation such that
$$E\bigl(Y_i \,\big|\, Z_i\bigr) = t^\top(Z_i)\beta = X_i^\top\beta.$$
Suppose again that the corresponding linear model is of full rank with the LSE $\hat\beta$ of the regression coefficients. Confidence intervals for the model based mean or prediction intervals can then be calculated for an (equidistant) sequence of values $z_{\text{new},1}, \dots, z_{\text{new},N} \in \mathcal{Z}$ and then drawn over a scatterplot of the observed data $(Y_1, Z_1), \dots, (Y_n, Z_n)$. In this way, two different bands with the fitted regression function
$$\hat m(z) = t^\top(z)\hat\beta, \qquad z \in \mathcal{Z},$$
going through the middle of both bands, are obtained. In this context,
(i) the band based on the confidence intervals for the model based mean (Eq. 3.3) is called the confidence band around the regression function;⁵
(ii) the band based on the prediction intervals (Eq. 3.4) is called the prediction band.⁶

⁵ pás spolehlivosti okolo regresní funkce
⁶ predikční pás
3.4 Distribution of the linear hypotheses test statistics under the alternative

Beginning of skipped part

Section 3.2 provided classical tests of the linear hypotheses (hypotheses on the values of estimable parameters). To allow for power or sample size calculations, we additionally need the distribution of the test statistics under the alternatives.
Theorem 3.3 Distribution of the linear hypothesis test statistic under the alternative.
Let $Y \mid X \sim \mathcal{N}_n(X\beta, \sigma^2 I_n)$, $\operatorname{rank}(X_{n\times k}) = r \le k$. Let $l \neq 0_k$ be such that $\theta = l^\top\beta$ is estimable, and let $\hat\theta = l^\top b$ be its LSE. Let $\theta_0, \theta_1 \in \mathbb{R}$, $\theta_0 \neq \theta_1$, and let
$$T_0 = \frac{\hat\theta - \theta_0}{\sqrt{MS_e\, l^\top (X^\top X)^- l}}.$$
Then under the hypothesis $\theta = \theta_1$,
$$T_0 \mid X \sim t_{n-r}(\lambda), \qquad \lambda = \frac{\theta_1 - \theta_0}{\sqrt{\sigma^2\, l^\top (X^\top X)^- l}}.$$

Note. The statistic $T_0$ is the test statistic to test the null hypothesis $H_0: \theta = \theta_0$ using point (viii) of Theorem 3.1.

Proof. Proof/calculations were skipped and are not requested for the exam. ∎
Theorem 3.4 Distribution of the linear hypotheses test statistics under the alternative.
Let $Y \mid X \sim \mathcal{N}_n(X\beta, \sigma^2 I_n)$, $\operatorname{rank}(X_{n\times k}) = r \le k$. Let $L_{m\times k}$ be a real matrix with $m \le r$ linearly independent rows such that $\theta = L\beta$ is estimable. Let $\hat\theta = Lb$ be its LSE. Let $\theta^0, \theta^1 \in \mathbb{R}^m$, $\theta^0 \neq \theta^1$, and let
$$Q_0 = \frac{1}{m}\bigl(\hat\theta - \theta^0\bigr)^\top \bigl\{MS_e\, L(X^\top X)^- L^\top\bigr\}^{-1} \bigl(\hat\theta - \theta^0\bigr).$$
Then under the hypothesis $\theta = \theta^1$,
$$Q_0 \mid X \sim \mathcal{F}_{m,n-r}(\lambda), \qquad \lambda = \bigl(\theta^1 - \theta^0\bigr)^\top \bigl\{\sigma^2\, L(X^\top X)^- L^\top\bigr\}^{-1} \bigl(\theta^1 - \theta^0\bigr).$$

Note. The statistic $Q_0$ is the test statistic to test the null hypothesis $H_0: \theta = \theta^0$ using point (x) of Theorem 3.1.

Proof. Proof/calculations were skipped and are not requested for the exam. ∎
Note. We derived only the conditional (given the covariates) distribution of the test statistics at hand. This corresponds to the fact that power and sample size calculations for linear models are mainly used in the area of designed experiments⁷, where the covariate values, i.e., the model matrix $X$, are assumed to be fixed and not random. A problem of sample size calculation then involves not only calculation of the needed sample size $n$ but also determination of the form of the model matrix $X$. More can be learned in the course Experimental Design (NMST436).⁸

⁷ navržené experimenty
⁸ Návrhy experimentů (NMST436)

End of skipped part
Chapter 4
Basic Regression Diagnostics
We will now assume that the data are represented by $n$ random vectors $(Y_i, Z_i)$, $Z_i = (Z_{i,1}, \dots, Z_{i,p})^\top \in \mathcal{Z} \subseteq \mathbb{R}^p$, $i = 1, \dots, n$. We keep considering that the principal aim of the statistical analysis is to find a suitable model to express the (conditional) response expectation $E(Y) := E(Y \mid Z)$, where $Z$ is a matrix with the vectors $Z_1, \dots, Z_n$ in its rows. Suppose that $t: \mathcal{Z} \longrightarrow \mathcal{X} \subseteq \mathbb{R}^k$ is a transformation of covariates leading to the model matrix
$$X = \begin{pmatrix} X_1^\top \\ \vdots \\ X_n^\top \end{pmatrix} = \begin{pmatrix} t^\top(Z_1) \\ \vdots \\ t^\top(Z_n) \end{pmatrix} =: t(Z), \qquad \operatorname{rank}(X_{n\times k}) = r \le k.$$
4.1 (Normal) linear model assumptions
The basis for statistical inference shown so far was derived while assuming a linear model for the data, i.e., while assuming that $E(Y \mid Z) = X\beta$ for some $\beta \in \mathbb{R}^k$ and $\operatorname{var}(Y \mid Z) = \sigma^2 I_n$. For the data $(Y_i, X_i)$, $i = 1, \dots, n$, where we directly work with the transformed covariate vectors, this means the following assumptions ($i = 1, \dots, n$):

(A1) $E\bigl(Y_i \,\big|\, X_i = x\bigr) = x^\top\beta$ for some $\beta \in \mathbb{R}^k$ and any $x \in \mathcal{X}$.
≡ Correct regression function $m(z) = t^\top(z)\beta$, $z \in \mathcal{Z}$, i.e., a correct choice of the transformation $t$ of the original covariates leading to linearity of the (conditional) response expectation.

(A2) $\operatorname{var}\bigl(Y_i \,\big|\, X_i = x\bigr) = \sigma^2$ for some $\sigma^2$ irrespective of the value of $x \in \mathcal{X}$.
≡ The conditional response variance is constant (does not depend on the covariates or other factors) ≡ homoscedasticity¹ of the response.

(A3) $\operatorname{cov}\bigl(Y_i, Y_l \,\big|\, X\bigr) = 0$, $i \neq l$.
≡ The responses are conditionally uncorrelated.

Some of our results (especially those shown in Chapter 3) were derived while additionally assuming normality of the response, i.e., while assuming

(A4) $Y_i \mid X_i = x \sim \mathcal{N}\bigl(x^\top\beta, \sigma^2\bigr)$, $x \in \mathcal{X}$.
≡ Normality of the response.
If we use the definition of the linear model via the error terms, i.e., if we assume that
$$Y = X\beta + \varepsilon \quad \text{for some } \beta \in \mathbb{R}^k,$$
the linear model assumptions are all transferred into assumptions on the error terms $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^\top$. Namely ($i = 1, \dots, n$):

(A1) $E(\varepsilon_i) = 0$.
≡ This again means that the structural part of the model stating that $E(Y \mid X) = X\beta$ for some $\beta \in \mathbb{R}^k$ is correctly specified, or in other words, that the regression function of the model is correctly specified.

(A2) $\operatorname{var}(\varepsilon_i) = \sigma^2$ for some $\sigma^2$ which is constant irrespective of the value of $i$.
≡ The error variance is constant ≡ homoscedasticity of the errors.

(A3) $\operatorname{cov}(\varepsilon_i, \varepsilon_l) = 0$, $i \neq l$.
≡ The errors are uncorrelated.

The possible assumption of normality is transferred into the errors as

(A4) $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$.
≡ The errors are normally distributed and, owing to the previous assumptions, $\varepsilon_1, \dots, \varepsilon_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$.

Remember now that many important results, especially those already derived in Chapter 2, are valid even without assuming normality of the response/errors. Moreover, we shall show in Chapter 13 that also the majority of the inferential tools based on the results of Chapters 3 and 5 are, under certain conditions, asymptotically valid even if normality does not hold.

¹ homoskedasticita
In general, if inferential tools based on a statistical model with certain properties (assumptions) are to be used, we should verify, at least to some extent, the validity of those assumptions on a particular dataset. In the context of regression models, the tools to verify the model assumptions are usually referred to as regression diagnostic² tools. In this chapter, we provide only the most basic graphical methods. Additional, more advanced tools of regression diagnostics will be provided in Chapters 10 and 14.

As already mentioned above, the assumptions (A1)–(A4) are not equally important. Some of them are not needed to justify the use of a particular inferential tool (estimator, statistical test, . . . ); see the assumptions and proofs of the corresponding theorems. This should be taken into account when using regression diagnostics. It is indeed not necessary to verify those assumptions that are not needed for a specific task. It should finally be mentioned that, with respect to the importance of the assumptions (A1)–(A4), by far the most important is assumption (A1) concerning a correct specification of the regression function. Remember that practically all theorems in this lecture that are related to inference on the parameters of a linear model use in their proofs, in some sense, the assumption $E(Y \mid X) \in \mathcal{M}(X)$. Hence if this is not satisfied, the majority of the traditional statistical inference is not correct. In other words, special attention in any data analysis should be devoted to verifying the assumption (A1) related to a correct specification of the regression function.
As we shall show, the assumptions of the linear model are basically checked through exploration of the properties of the residuals $U$ of the model, where
$$U = MY, \qquad M = I_n - X(X^\top X)^- X^\top = (m_{i,l})_{i,l=1,\dots,n}.$$
When doing so, it is exploited that each of the assumptions (A1)–(A4) implies a certain property of the residuals stated earlier in Theorem 2.3 (Basic properties of the residuals and the residual sum of squares) or stated later in Theorem 3.1 (Properties of the LSE under the normality). The following follows from those theorems (or their proofs):
1. (A1) $\Longrightarrow$ $E(U \mid X) =: E(U) = 0_n$.
2. (A1) & (A2) & (A3) $\Longrightarrow$ $\operatorname{var}(U \mid X) =: \operatorname{var}(U) = \sigma^2 M$.
3. (A1) & (A2) & (A3) & (A4) $\Longrightarrow$ $U \mid X \sim \mathcal{N}_n(0_n, \sigma^2 M)$.
Usually, the right-hand side of the implication is verified and, if it is found not to be satisfied, we know that also the left-hand side of the implication (a particular assumption or a set of assumptions) is not fulfilled. Clearly, if we conclude that the right-hand side of the implication is fulfilled, we still do not know whether the left-hand side (a model assumption) is valid. Nevertheless, it is common to most statistical diagnostic tools that they are only able to reveal unsatisfied model assumptions but are never able to confirm their validity.
An uncomfortable property ofthe residuals of the linear model is the fact that even if the errors
(ε) are homoscedastic (var εi = σ 2 for all i = 1, . . . , n), the residuals U are, in general, heteroscedastic
(having unequal
variances). Indeed, even if the assumption (A2) if fulfilled, we have
var U = σ 2 M, var Ui = σ 2 mi,i (i = 1, . . . , n), where note that the residual projection matrix
M, in general, does not have a constant diagonal m1,1 , . . . , mn,n . Moreover, the matrix M is
even not a diagonal matrix. That is, even if the errors ε1 , . . . , εn are uncorrelated, the residuals
U1 , . . . , Un are, in general, correlated. This must be taken into account when the residuals U are
used to check validity of assumption (A2). The problem of heteroscedasticity of the residuals U is
then partly solved be defining so called standardized residuals.
2
regresní diagnostika
4.2. STANDARDIZED RESIDUALS
4.2
49
Standardized residuals
Consider a linear model Y X ∼ Xβ, σ 2 In , with the vector or residuals U = U1 , . . . , Un , the
residual mean square MSe , and the residual projection matrix M having a diagonal m1,1 , . . . , mn.n .
The following definition is motivated by the facts following the properties of residuals shown in
Theorem 2.3:
E U = 0n ,
var U = σ 2 M,
!
!
Ui
Ui
E p
= 0,
var p
= 1, if mi,i > 0, i = 1, . . . , n.
(4.1)
σ 2 mi,i
σ 2 mi,i
Definition 4.1 Standardized residuals.
The standardized residuals3 or the vector of standardized residuals of the model is a vector U std =
U1std , . . . , Unstd , where

Ui


, mi,i > 0,
 p
MSe mi,i
Uistd =
i = 1, . . . , n.


 undefined,
m = 0,
i,i
Notes.
• It will be shown
in Section 10.4 that if a normal linear model is assumed, i.e., if Y X ∼
Nn Xβ, σ 2 In and if for given i ∈ {1, . . . , n}, mi,i > 0 then, analogously to (4.1),
E Uistd = 0,
var Uistd = 1.
• Unfortunately, even in a normal linear model, the standardized residuals U1std , . . . , Unstd are, in
general,
• neither normally distributed;
• nor uncorrelated.
• In some literature (and some software packages), the standardized residuals are called studentized
residuals4 .
• In other literature including those course notes (and many software packages including R), the
term studentized residuals is reserved for a different quantity which we shall define in Chapter 14.
3
standardizovaná rezidua
4
studentizovaná rezidua
4.3. GRAPHICAL TOOLS OF REGRESSION DIAGNOSTICS
4.3
50
Graphical tools of regression diagnostics
In the whole section, the columns of the model matrix X (the regressors), will be denoted as
X 0 , . . . , X k−1 , i.e.,
X = X 0 , . . . , X k−1 .
Remember that usually X 0 = 1, . . . , 1 is an intercept column. Further, in many situations,
see Section 5.2 dealing with a submodel obtained by omitting some regressors, the current model
matrix X is the model matrix of just a candidate submodel (playing the role of the model matrix X0
in Section
regressors are available to model the response expectation
1
E Y Z . Let us denote them as V , . . . , V m . That is, in the notation of Section 5.2,
X1 = V 1 , . . . , V m .
The reminder of this section provides purely an overview of basic residual plots that are used as
basic diagnostic tools in the context of a linear regression. More explanation on use of those plots
will be/was provided during the lecture and the exercise classes.
4.3.1
(A1) Correctness of the regression function
To detect:
Overall inappropriateness of the regression function
⇒ scatterplot Yb , U of residuals versus fitted values.
Nonlinearity of the regression function with respect to a particular regressor X j
⇒ scatterplot X j , U of residuals versus that regressor.
Possibly omitted regressor V
⇒ scatterplot V , U of residuals versus that regressor.
For all proposed plots, a slightly better insight is obtained if standardized residuals U std are used
instead of the raw residuals U .
4.3.2
(A2) Homoscedasticity of the errors
To detect
Residual variance that depends on the response expectation
⇒ scatterplot Yb , U of residuals versus fitted values.
Residual variance that depends on a particular regressor X j
⇒ scatterplot X j , U of residuals versus that regressor.
Residual variance that depend on a regressor V not included in the model
⇒ scatterplot V , U of residuals versus that regressor.
4.3. GRAPHICAL TOOLS OF REGRESSION DIAGNOSTICS
51
For all proposed plots, a better insight is obtained if standardized residuals U std are used instead
of the raw residuals U . This due to the fact that even if homoscedasticity
of the errors is fulfilled,
2
the raw residuals U are not necessarily homoscedastic (var U Z = σ M), but the standardized
residuals are homoscedastic having all a unity variance if additionally normality of the response
holds.
So called scale-location plots are obtained, if on the above proposed plots, the vector of raw residuals
U is replaced by a vector
q
q
U std , . . . , Unstd .
1
4.3.3
(A3) Uncorrelated errors
End of
Lecture #6
(22/10/2015)
Start of
Assumption of uncorrelated errors is often justified by the used data gathering mechanism (e.g., Lecture #8
observations/measurements performed on clearly independently behaving units/individuals). In (29/10/2015)
that case, it does not make much sense to verify this assumption. Two typical situation when
uncorrelated errors cannot be taken for granted are
(i) repeated observations performed on N independently behaving units/subjects;
(ii) observations performed sequentially in time where the ith response value Yi is obtained in
time ti and the observational occasions t1 < · · · < tn form an equidistant sequence.
In the following, we will not discuss any further the case (i) of repeated observations. In that
case, a simple linear model is in most cases fully inappropriate for a statistical inference and more
advanced models and methods must be used, see the course Advanced Regression Models (NMST432).
In case (ii), the errors ε1 , . . . , εn can be considered as a time series5 . The assumptions (A1)–(A3) of
the linear model then states that this time series (the errors of the model) forms a white noise 6 .
Possible serial correlation (autocorrelation) between the error terms is then usually considered as
possible violation of the assumption (A3) of uncorrelated errors.
As stated above, even if the errors are uncorrelated and assumption (A3) is fulfilled, the residuals
U are in general correlated. Nevertheless, the correlation is usually rather low and the residuals
are typically used to check assumption (A3) and possibly to detect a form of the serial correlation
present in data at hand. See Stochastic Processes 2 (NMSA409) course for basic diagnostic methods
that include:
• Autocorrelation and partial autocorrelation plot based on residuals U .
• Plot of delayed residuals, that is a scatterplot based on points (U1 , U2 ), (U2 , U3 ), . . ., (Un−1 , Un ).
4.3.4
(A4) Normality
To detect possible non-normality of the errors, standard tools used to check normality of a random
sample known from the course Mathematical Statistics 1 (NMSA331) are used, now with the vector
of residuals U or standardized residuals U std in place of the random sample which normality is
to be checked. A basic graphical tool to check the normality of a sample is then
• the normal probability plot (the QQ plot).
Usage of both the raw residuals U and the standardized residuals U std to check the normality
assumption (A4) bears certain inconveniences. If all assumptions of the normal linear model are
fulfilled, then
5
6
bílý šum
4.3. GRAPHICAL TOOLS OF REGRESSION DIAGNOSTICS
52
The raw residuals U satisfy U | Z ∼ Nn 0m , σ 2 M . That is, they
maintain the normality,
nevertheless, they are, in general, not homoscedastic (var Ui = σ 2 mi,i , i = 1, . . . , n).
Hence seeming non-normality of a “sample” U1 , . . . , Un might be caused by the fact that
the residuals are imposed to different variability.
The standardized residuals U std satisfy E Uistd Z = 0, var Uistd Z = 1 for all i = 1, . . . , n.
That is, the standardized residuals are homoscedastic (with a known variance of one), nevertheless, they are not necessarily normally distributed. On the other hand, deviation of the
distributional shape of the standardized residuals from the distributional shape of the errors
ε is usually rather minor and hence the standardized residuals are usually useful in detecting
non-normality of the errors.
Chapter
5
Submodels
In this chapter, we will
data being represented by n random vectors Yi , Z i ,
again consider
Z i = Zi,1 , . . . , Zi,p ∈ Z ⊆ Rp , i = 1, . . . , n. The main aim
is still
to find a suitable model
to express the (conditional) response expectation E Y := E Y Z , where Z is a matrix with
vectors Z 1 , . . ., Z n in its rows. Suppose that t0 : Rp −→ Rk0 and t : Rp −→ Rk are two
transformations of covariates leading to model matrices




>
X 01
X 01 = t0 (Z 1 ),
X>
X 1 = t(Z 1 ),
1
 . 
 . 
.
..
0
. 
..
. 
X =
X=
(5.1)
.
 . ,
 . ,
>
X 0n = t0 (Z n ),
X>
X n = t(Z n ).
X 0n
n
Briefly, we will write
Let
X0 = t0 (Z),
X = t(Z).
rank(X0 ) = r0 ,
rank(X) = r,
(5.2)
where 0 < r0 ≤ k0 < n, 0 < r ≤ k < n. We will now deal with a situation when the matrices X0
and X determine two linear models:
Model M0 : Y | Z ∼ X0 β 0 , σ 2 In ,
Model M : Y | Z ∼ Xβ, σ 2 In ,
and the task is to decide on whether one of the two models fits “better” the data. In this chapter,
we limit ourselves to a situation when M0 is so called submodel of the model M.
53
5.1. SUBMODEL
5.1
54
Submodel
Definition 5.1 Submodel.
We say that the model M0 is the submodel1 (or the nested model2 ) of the model M if
M X0 ⊂ M X with r0 < r.
Notation. Situation that a model M0 is a submodel of a model M will be denoted as
M0 ⊂ M.
Notes.
• Submodel provides a more parsimonious expression of the response expectation E Y .
• The fact that the submodel M0 holds means E Y ∈ M X0 ⊂ M X . That is, if the
submodel M0 holds then also the larger model M holds. That is, there exist β 0 ∈ Rk0 and
β ∈ Rk such that
E Y = X0 β 0 = Xβ.
• The fact
submodel M0 does not hold but the model M holds means that E Y ∈
that the
M X \ M X0 . That is, there exist no β 0 ∈ Rk0 such that E Y = X0 β 0 .
5.1.1
Projection considerations
Decomposition of the n-dimensional Euclidean space
Since M X0 ⊂ M X ⊆ Rn , it is possible to construct an orthonormal vector basis
Pn×n = p1 , . . . , pn
of the n-dimensional Euclidean space as
P = Q0 , Q1 , N ,
where
• Q0n×r0 : orthonormal vector basis of a submodel regression space, i.e.,
M X0 = M Q0 .
• Q1n×(r−r0 ) : orthonormal vectors such that Q := Q0 , Q1 is an orthonormal vector basis of
a model regression space, i.e.,
M X = M Q = M Q0 , Q1 .
1
podmodel
2
vnořený model
5.1. SUBMODEL
55
• Nn×(n−r) : orthonormal vector basis of a model residual space, i.e.,
M X
⊥
=M N .
Further,
• N0n×(n−r0 ) := Q1 , N : orthonormal vector basis of a submodel residual space, i.e.,
M X0
⊥
= M N0 = M Q1 , N .
It follows from the orthonormality of columns of the matrix P:
>
>
In = P> P = P P> = Q0 Q0 + Q1 Q1 + N N>
= Q Q> + N N>
>
>
= Q0 Q0 + N0 N0 .
Notation. In the following, let
>
H0 = Q0 Q0 ,
>
>
M0 = N0 N0 = Q1 Q1 + N N> .
Notes.
• Matrices H0 and M0 which are symmetric and idempotent, are projection matrices into the
regression and residual space, respectively, of the submodel.
• The hat matrix and the residual projection matrix of the model can now also be written as
>
>
>
H = Q Q> = Q0 Q0 + Q1 Q1 = H0 + Q1 Q1 ,
>
M = N N> = M0 − Q1 Q1 .
Projections into subspaces of the n-dimensional Euclidean space
Let y ∈ Rn . We can then write
y = In y = Q0 Q0
>
+ Q1 Q1
>
>
>
>
>
+ NN> y
= Q0 Q0 y + Q1 Q1 y + NN> y
|
{z
} | {z }
u
b
y
= Q0 Q0 y + Q1 Q1 y + NN> y.
| {z } |
{z
}
0
0
u
b
y
We have
b = Q0 Q0
• y
>
+ Q1 Q1
>
y = Hy ∈ M X .
5.1. SUBMODEL
56
• u = N N> y = My ∈ M X
⊥
.
>
b 0 := Q0 Q0 y = H0 y ∈ M X0 .
• y
⊥
>
• u0 := Q1 Q1 + N N> y = M0 y ∈ M X0 .
>
b−y
b 0 = u0 − u.
• d := Q1 Q1 y = y
5.1.2
Properties of submodel related quantities
Notation (Quantities related to a submodel).
When dealing with a pair of a model and a submodel, quantities related to the submodel will be
denoted by a superscript (or by a subscript) 0. In particular:
0
>
• Yb = H0 Y = Q0 Q0 Y : fitted values in the submodel (projection of Y into the submodel
regression space.
0
>
• U 0 = Y − Yb = M0 Y = Q1 Q1 + NN> Y : residuals of the submodel.
2
• SS0e = U 0 : residual sum of squares of the submodel.
• νe0 = n − r0 : submodel residual degrees of freedom.
• MS0e =
SS0e
: submodel residual mean square.
νe0
Additionally, as D, we denote projection of the response vector Y into the space M Q1 , i.e.,
>
0
D = Q1 Q1 Y = Yb − Yb = U 0 − U .
(5.3)
Theorem 5.1 On a submodel.
Consider two linear models M : Y | Z ∼ Xβ, σ 2 In and M0 : Y | Z ∼ X0 β 0 , σ 2 In such that
M0 ⊂ M. Let the submodel M0 holds, i.e., let E Y ∈ M X0 . Then
0
(i) Yb is the best linear unbiased estimator (BLUE) of a vector parameter µ0 = X0 β 0 = E Y .
(ii) The submodel residual mean square MS0e is the unbiased estimator of the residual variance σ 2 .
0
(iii) Statistics Yb and U 0 are conditionally, given Z, uncorrelated.
0
(iv) A random vector D = Yb − Yb = U 0 − U satisfies
2
D = SS0e − SSe .
(v) If additionally, a normal linear model is assumed, i.e., if Y | Z ∼ Nn X0 β 0 , σ 2 In then the
0
statistics Yb and U 0 are conditionally, given Z, independent and
SS0e − SSe
SS0e − SSe
νe0 − νe
r − r0
F0 =
=
∼ Fr−r0 , n−r = Fνe0 −νe , νe .
SSe
SSe
n−r
νe
(5.4)
5.1. SUBMODEL
57
Proof. Proof/calculations were available on the blackboard in K1.
k
5.1.3
Series of submodels
When looking for a suitable model to express E Y , often a series of submodels is considered. Let
us now assume a series of models
Model M0 : Y | Z ∼ X0 β 0 , σ 2 In ,
Model M1 : Y | Z ∼ X1 β 1 , σ 2 In ,
Model M : Y | Z ∼ Xβ, σ 2 In ,
where, analogously to (5.1), an n × k1 matrix X1 is given as


>
X 11
X 11 = t1 (Z 1 ),
 . 
..
. 
X1 = 
.
 . ,
>
1
X n = t1 (Z n ),
X 1n
for some transformation t1 : Rp −→ Rk1 of the original covariates Z 1 , . . . , Z n , which we briefly
write as
X1 = t1 (Z).
Analogously to (5.2), we will assume that for some 0 < r1 ≤ k1 < n,
rank(X1 ) = r1 .
Finally, we will assume that the three considered models are mutually submodels. That is, we will
assume that
M X0 ⊂ M X1 ⊂ M X
with r0 < r1 < r,
which we denote as
M0 ⊂ M1 ⊂ M.
Notation. Quantities derived while assuming a particular model will be denoted by the corresponding superscript (or by no superscript in case of the model M). That is:
0
• Yb , U 0 , SS0e , νe0 , MS0e : quantities based on the (sub)model M0 : Y | Z ∼ X0 β 0 , σ 2 In ;
1
• Yb , U 1 , SS1e , νe1 , MS1e : quantities based on the (sub)model M1 : Y | Z ∼ X1 β 1 , σ 2 In ;
• Yb , U , SSe , νe , MSe : quantities based on the model M: Y | Z ∼ Xβ, σ 2 In .
Theorem 5.2 On submodels.
Consider three normal linear models M : Y | Z ∼ Nn Xβ, σ 2 In , M1 : Y | Z ∼ Nn X1 β 1 , σ 2 In ,
5.1. SUBMODEL
58
0
M0 : Y | Z∼ Nn X0 β
, σ 2 In such that M0 ⊂ M1 ⊂ M. Let the (smallest) submodel M0 hold,
i.e., let E Y ∈ M X0 . Then
F0,1
SS0e − SS1e
SS0e − SS1e
νe0 − νe1
r1 − r0
=
=
∼ Fr1 −r0 , n−r = Fνe0 −νe1 , νe .
SSe
SSe
n−r
νe
Proof. Proof/calculations were available on the blackboard in K1.
(5.5)
k
Note. Both F-statistics (5.4) and (5.5) contain
• In the numerator: a difference in the residual sums of squares of the two models where one of
them is a submodel of the other divided by the difference of the residual degrees of freedom of
those two models.
• In the denominator: a residual sum of squares of the model which is larger or equal to any of
the two models whose quantities appear in the numerator, divided by the corresponding degrees
of freedom.
• To obtain an F-distribution of the F-statistics (5.4) or (5.5), the smallest model whose quantities
appear in that F-statistic must hold which implies that any other larger model holds as well.
Notation (Differences when dealing with a submodel).
Let MA and MB are two models distinguished by symbols “A” and “B” such that MA ⊂ MB . Let
A
B
B
Yb and Yb , U A and U B , SSA
e and SSe denote the fitted values, the vectors of residuals and
the residual sums of squares based on models MA and MB , respectively. The following notation
will be used if it becomes necessary to indicate which are the two model related to the vector D
or to the difference in the sums of squares:
B
A
D MB MA = D B A := Yb − Yb = U A − U B .
B
SS MB MA = SS B A := SSA
e − SSe .
Notes.
• Both F-statistics (5.4) and (5.5) contain certain SS B A in their numerators.
• Point (iv) of Theorem 5.1 gives
2
SS B A = D B A .
5.1.4
Statistical test to compare nested models
End of
Theorems 5.1 and 5.2 provide a way to compare two nested models by the mean of a statistical Lecture #8
test.
(29/10/2015)
Start of
Lecture #10
(05/11/2015)
5.1. SUBMODEL
59
F-test on a submodel based on Theorem 5.1
Consider two normal linear models: Model M0 :
Y | Z ∼ Nn X0 β 0 , σ 2 In ,
Model M:
Y | Z ∼ Nn Xβ, σ 2 In ,
where M0 ⊂ M, and a set of statistical hypotheses: H0 : E Y ∈ M X0
H1 : E Y ∈ M X \ M X0 ,
that aim in answering the questions:
• Is model M significantly better than model M0 ?
• Does the (larger) regression space M X provide a significantly better expression for E Y over
the (smaller) regression space M X0 ?
The F-statistic (5.4) from Theorem 5.1 now provides a way to test the above hypotheses as follows:
SS M M0
SS0e − SSe
r − r0
r − r0
=
.
Test statistic:
F0 =
SSe
SSe
n−r
n−r
Reject H0 if
F0 ≥ Fr−r0 ,n−r (1 − α).
P-value when F0 = f0 :
p = 1 − CDFF , r−r0 ,n−r f0 .
F-test on a submodel based on Theorem 5.2
Consider three normal linear models: Model M0 :
Y | Z ∼ Nn X0 β 0 , σ 2 In ,
Model M1 : Y | Z ∼ Nn X1 β 1 , σ 2 In ,
Model M:
Y | Z ∼ Nn Xβ, σ 2 In ,
where M0 ⊂ M1 ⊂ M, and a set of statistical hypotheses: H0 : E Y ∈ M X0
H1 : E Y ∈ M X1 \ M X0 ,
that aim in answering the questions:
• Is model M1 significantly better than model M0 ?
• Does the (larger) regression space M X1 provide a significantly better expression for E Y
over the (smaller) regression space M X0 ?
The F-statistic (5.5) from Theorem 5.2 now provides a way to test the above hypotheses as follows:
SS M1 M0
SS0e − SS1e
r1 − r0
r1 − r0
=
.
Test statistic:
F0,1 =
SSe
SSe
n−r
n−r
Reject H0 if
F0,1 ≥ Fr1 −r0 ,n−r (1 − α).
P-value when F0,1 = f0,1 :
p = 1 − CDFF , r1 −r0 ,n−r f0,1 .
5.2. OMITTING SOME COVARIATES
5.2
60
Omitting some covariates
The most common couple (model – submodel) is Model M:
Submodel M0 :
Y | Z ∼ Xβ, σ 2 In ,
Y | Z ∼ X0 β 0 , σ 2 In ,
where the submodel matrix X0 is obtained by omitting selected columns from the model matrix
X. In other words, some covariates are omitted from the original covariate vectors X 1 , . . . , X n to
get the submodel and the matrix X0 . In the following, without the loss of generality, let
X = X0 , X1 ,
0 < rank X0 = r0 < r = rank X < n.
The corresponding submodel F-test then evaluates whether, given the knowledge of the covariates
included in the submodel matrix X0 , the covariates included in the matrix X1 has an impact on
the response expectation.
Theorem 5.3 Effect of omitting some covariates.
Consider a couple (model – submodel), where the submodel is obtained by omitting some covariates
from the model. Then
(i) D 6= 0n and SS0e − SSe > 0.
(ii) If M X1 ⊥ M X0 then
>
D = X1 X1 X1
−
>
1
X1 Y =: Yb ,
which are the fitted values from a linear model Y | Z ∼ X1 β 1 , σ 2 In .
Proof. Proof/calculations were available on the blackboard in K1.
k
Note. If we take the residual sum of squares as a measure of a quality of the model, point
(i) of Theorem 5.3 says that the model is always getting worse if some covariates are removed.
Nevertheless, in practice, it is always a question whether this worsening is statistically significant
(the submodel F-test answers this) or practically important (additional reasoning is needed).
5.3. LINEAR CONSTRAINTS
5.3
61
Linear constraints
Suppose that a linear model Y | Z ∼ Xβ, σ2 In , rank Xn×k = r is given and it is our aim to
verify whether the response expectation E Y lies in a constrained regression space
M X; Lβ = θ 0 := v : v = Xβ, β ∈ Rk , Lβ = θ 0 ,
(5.6)
where Lm×k is a given real matrix and θ 0 ∈ Rm is a given vector. In other words, verification of
whether the response expectation lies in the space M X; Lβ = θ 0 corresponds to verification of
whether the regression coefficients satisfy a linear constraint Lβ = θ 0 .
Lemma 5.4 Regression space given by linear constraints.
Consider a linear model Y | Z ∼ Xβ, σ 2 In , rank Xn×k = r ≤ k < n. Let Lm×k be a real
matrix with m ≤ r rows such that
(i) rank L = m (i.e., L is a matrix with linearly independent rows);
(ii) θ = Lβ is estimable parameter of the considered linear model.
The space
M X; Lβ = 0m is then a vector subspace of dimension r − m of the regression space
M X .
Proof. Proof/calculations were available on the blackboard in K1.
k
Notes.
0
• The space M X;
Lβ
=
θ
is a vector space only if θ 0 = 0m since otherwise, 0n ∈
/
0
M X; Lβ = θ . Nevertheless, for the purpose of the statistical analysis, it is possible (and in
practice also necessary) to work also with θ 0 6= 0m .
• With m = r, M X; Lβ = 0m = 0n .
Definition 5.2 Submodel given by linear constraints.
We say that the model
given by linear constraints3 Lβ = θ 0 of model M:
M0 is a submodel
2
Y | Z ∼ Xβ, σ In , rank Xn×k = r, if matrix L satisfies conditions of Lemma 5.4, m < r and
the response expectation E Y under the model M0 is assumed to lie in a space M X; Lβ = θ 0 .
Notation. A submodel given by linear constraints will be denoted as
M0 : Y | Z ∼ Xβ, σ 2 In , Lβ = θ 0 .
3
5.3. LINEAR CONSTRAINTS
62
Since with θ 0 6= 0m , the space M X; Lβ = 0m is not a vector space, we in general cannot
talk about projections in a sense of linear algebra when deriving the fitted values, the residuals
and other quantities related to the submodel given by linear constraints. Hence we introduce the
following definition.
Definition 5.3 Fitted values, residuals, residual sum of squares, rank of the model
and residual degrees of freedom in a submodel given by linear constraints.
2
Let b0 ∈ Rk minimize SS(β)
= Y − Xβ over β ∈ Rk subject to Lβ = θ 0 . For the submodel
M0 : Y | Z ∼ Xβ, σ 2 In , Lβ = θ 0 , the following quantities are defined as follows:
0
Fitted values: Yb := Xb0 .
0
Residuals: U 0 := Y − Yb .
2
Residual sum of squares: SS0e := U 0 .
Rank of the model: r0 = r − m.
Residual degrees of freedom: νe0 := n − r0 .
Note. The fitted values could also be defined as
0
Yb =
argmin
e ∈M X; Lβ=θ 0
Y
Y − Ye 2 .
That is, the fitted
values are (still) the closest point to Y in the constrained regression space
M X; Lβ = θ 0 .
Theorem 5.5 On a submodel given by linear constraints.
Let M0 : Y | Z ∼ Xβ, σ 2 In , Lβ = θ 0 be a submodel given by linear constraints of a model
M : Y | Z ∼ Xβ, σ 2 In . Then
0
(i) The fitted values Yb and consequently also the residuals U 0 and the residual sum of squares
SS0e are unique.
2
(ii) b0 minimizes SS(β) = Y − Xβ subject to Lβ = θ 0 if and only if
b0 = b − X> X
where b = X> X
−
−
n
− o−1
L> L X> X L>
Lb − θ 0 ,
X> Y is (any) solution to a system of normal equations X> Xb = X> Y .
0
(iii) The fitted values Yb can be expressed as
0
Yb = Yb − X X> X
−
n
− o−1
L> L X> X L>
Lb − θ 0 .
0
(iv) The vector D = Yb − Yb satisfies
n
o−1
2
D = SS0e − SSe = (Lb − θ 0 )> L X> X − L>
(Lb − θ 0 ).
(5.7)
5.3. LINEAR CONSTRAINTS
63
Proof. First mention that under our assumptions, the matrix L X> X
−
L> is
(i) invertible;
(ii) does not depend on a choice of the pseudoinverse X> X
−
.
This follows from Theorem 2.9 (Gauss–Markov for estimable vector parameter).
2
0
Second, try to look for Yb = Xb0 such that b0 minimizes SS(β) = Y − Xβ over β ∈ Rk
subject to Lβ = θ 0 by a method of Lagrange multipliers. Let
2
ϕ(β, λ) = Y − Xβ + 2λ> Lβ − θ 0
>
= Y − Xβ
Y − Xβ + 2λ> Lβ − θ 0 ,
where a factor of 2 in the second part of expression of the Lagrange function ϕ is only included to
simplify subsequent expressions.
The first derivatives of ϕ are as follows:
∂ϕ
(β, λ) = −2 X> Y − Xβ + 2 L> λ,
∂β
∂ϕ
(β, λ) = 2 Lβ − θ 0 .
∂λ
Realize now that
∂ϕ
(β, λ) = 0k if and only if
∂β
X> Xβ = X> Y − L> λ.
(5.8)
Note that the linear system (5.8) is consistent for any λ ∈ Rm and any Y ∈ Rn . This
follows from
>
>
the fact that due to estimability of a parameter Lβ, we have M L ⊂ M X (Theorem 2.7).
Hence the right-hand-side of the system (5.8) lies in M X> , for any λ ∈ Rm and any Y ∈ Rn .
The left-hand-side
of the system (5.8) lies in M X> X , for any β ∈ Rk . We already know that
>
>
M X = M X X (Lemma 2.6) which proves that there always exist a solution to the linear
system (5.8).
Let b0 (λ) be any solution to X> Xβ = X> Y − L> λ. That is,
−
−
b0 (λ) = X> X X> Y − X> X L> λ
−
= b − X> X L> λ,
which depends on a choice of X> X
Further,
−
.
∂ϕ
(β, λ) = 0m if and only if
∂λ
Lb0 (λ) = θ 0
−
Lb − L X> X L> λ = θ 0
−
L X> X L> λ = Lb − θ 0 .
|
{z
}
5.3. LINEAR CONSTRAINTS
That is,
64
n
o−1
−
λ = L X> X L>
Lb − θ 0 .
Finally,
0
>
b = b− X X
−
n
− > o−1
>
(Lb − θ 0 ),
L L X X L
0
Yb = Xb0 = Yb − X X> X
>
−
n
− o−1
(Lb − θ 0 ).
L> L X> X L>
Realize again that M L> ⊂ M X> . That is, there exist a matrix A such that
L> = X> A> ,
L = AX.
0
Under our assumptions, matrix A is even unique. The vector Yb can now be written as
n
− o−1
−
0
A>
L X> X L>
Lb −θ 0 .
Yb = |{z}
Yb − X X> X X> |{z}
|{z}
{z
}
|
{z
} unique
unique |
unique
unique
unique
(5.9)
0
To show point (iv), use (5.9) in expressing the vector D = Yb − Yb :
n
− o−1
−
(Lb − θ 0 ).
D = X X> X X> A> L X> X L>
That is,
o−1
n
2
−
−
D = (Lb − θ 0 )> L X> X − L>
A X X> X X> X X> X X> A>
|
{z
}
X by the five matrices rule
n
− o−1
L X> X L>
(Lb − θ 0 )
= (Lb − θ 0 )
>
n
n
− o−1
− o−1
−
(Lb − θ 0 )
L X> X L>
AX X> X X> A> L X> X L>
= (Lb − θ 0 )
>
n
n
− o−1
−
− o−1
(Lb − θ 0 )
L X> X L>
L X> X L> L X> X L>
= (Lb − θ 0 )
>
n
− o−1
L X> X L>
(Lb − θ 0 ).
2
It remains to be shown that D = SS0e − SSe . We have
n
o−1
0 2
0 2
b + X X> X − L> L X> X − L>
SS0e = Y − Yb = Y
−
Y
(Lb
−
θ
)
| {z }
{z
}
⊥ |
U ∈M X
D∈M X
2 2
2
= U + D = SSe + D .
k
5.3. LINEAR CONSTRAINTS
5.3.1
65
F-statistic to verify a set of linear constraints
Let us take the expression (5.7) for the difference between the residual sums of squares of the
model and the submodel given by linear constraints and derive the submodel F-statistic (5.4):
>
(Lb − θ 0 )
SS0e − SSe
r − r0
F0 =
=
SSe
n−r
n
− o−1
L X> X L>
(Lb − θ 0 )
m
SSe
n−r
=
n
− o−1
1
>
(Lb − θ 0 ) MSe L X> X L>
(Lb − θ 0 )
m
=
n
− o−1
>
1 b
b − θ 0 ),
(θ − θ 0 ) MSe L X> X L>
(θ
m
(5.10)
b = Lb is the LSE of the estimable vector parameter θ = Lβ in the linear model Y X ∼
where θ
Xβ, σ 2 In without constraints. Note now that (5.10) is exactly equal to the Wald-type statistic
0
Q0 (see page 41) that we used in Section 3.2.2 to test the
null hypothesis2 H0 : θ = θ on an
estimable vector parameter θ in a normal linear model Y Z ∼ Nn Xβ, σ In . If normality can
be assumed, point (x) of Theorem 3.1 then provided that under the null hypothesis H0 : θ = θ 0 ,
that is, under the validity of the submodel given by linear constraints Lβ = θ 0 , the statistic F0
follows the usual F-distribution Fm,n−r . This shows that the Wald-type test on the estimable vector
parameter in a normal linear model based on Theorem 3.1 is equivalent to the submodel F-test
based on Theorem 5.1.
5.3.2
t-statistic to verify a linear constraint
>
k
Consider L = l>
, l ∈ R , l 6= 02k such
that 0θ = l β is an estimable parameter of the normal
linear model Y Z ∼ Nn Xβ, σ In . Take θ ∈ R and consider the submodel given by m = 1
linear constraint l> β = θ0 . Let θb = l> b, where b is any solution to the normal equations in the
model without constraints. The statistic (5.10) then takes the form
!2
b − θ0
− o−1
n
1 b
θ
0
>
>
0
= T02 ,
MSe l X X l
F0 =
θ−θ
θb − θ = q
−
m
MSe l> X> X l
where
θb − θ0
T0 = q
−
MSe l> X> X l
is the Wald-type test statistic introduced in Section 3.2.2 (on page 40) to test the null hypothesis
H0 : θ = θ0 in a normal linear model Y Z ∼ Nn Xβ, σ 2 In . Point (viii) of Theorem 3.1 provided
that under the null hypothesis H0 : θ = θ0 , the statistic T0 follows the Student t-distribution tn−r
which is indeed in agreement with the fact that T02 = F0 follows the F-distribution F1,n−r .
5.4. COEFFICIENT OF DETERMINATION
5.4
5.4.1
66
Coefficient of determination
Intercept only model
Notation (Response sample mean).
The sample mean over the response vector Y = Y1 , . . . , Yn
>
will be denoted as Y . That is,
n
1X
1
Y =
Yi = Y > 1n .
n
n
i=1
Definition 5.4 Regression and total sums of squares in a linear model.
Consider a linear model Y X ∼ Xβ, σ 2 In , rank(Xn×k ) = r ≤ k. The following expressions
define the following quantities:
(i) Regression sum of squares4 and corresponding degrees of freedom:
n
2 X
2
SSR = Yb − Y 1n =
Ybi − Y ,
νR = r − 1,
i=1
(ii) Total sum of squares5 and corresponding degrees of freedom:
n
2 X
2
Yi − Y ,
SST = Y − Y 1n =
νT = n − 1.
i=1
Lemma 5.6 Model with intercept only.
Let Y ∼ 1n γ, ζ 2 In . Then
(i) Yb = Y 1n = Y , . . . , Y
>
.
(ii) SSe = SST .
Proof. This is a full-rank model with X = 1n . Further,
X> X
Hence γ
b=
4
1
n
Pn
i=1 Yi
regresní součet čtverců
5
−1
= 1>
n 1n
−1
=
1
,
n
X > Y = 1>
nY =
= Y and Yb = Xb
γ = 1n Y = Y 1n .
celkový součet čtverců
n
X
Yi .
i=1
k
5.4. COEFFICIENT OF DETERMINATION
5.4.2
67
Models with intercept
Lemma 5.7 Identity in a linear model with intercept.
Let Y X ∼ Xβ, σ 2 In where 1n ∈ M X . Then
1>
nY =
n
X
Yi =
n
X
i=1
b
Ybi = 1>
nY .
i=1
Proof.
• Follows directly from the normal equations if 1n is one of the columns of X matrix.
• General proof:
b
b>
1>
n Y = Y 1n = HY
>
1n = Y > H1n = Y > 1n ,
since H1n = 1n due to the fact that 1n ∈ M X .
k
Theorem 5.8 Breakdown of the total sum of squares in a linear model with intercept.
Let Y X ∼ Xβ, σ 2 In where 1n ∈ M X . Then
SST
n
X
i=1
Yi − Y
=
2
=
SSe
n
X
Yi − Ybi
+
2
+
i=1
SSR
n
X
Ybi − Y
2
.
i=1
Proof.
The identity SST = SSe + SSR follows trivially if r = rank X = 1 since then
M X = M 1n and hence (by Lemma 5.6) Yb = Y 1n . Then SST = SSe , SSR = 0.
0, σ2I
In the following,
let
r
=
rank
X
>
1.
Then,
model
Y
|
X
∼
1
β
is a submodel of the
n
n
model Y X ∼ Xβ, σ 2 In and by Lemma 5.6, SST = SS0e . Further, from definition of SSR ,
2
0
it equals to SSR = D , where D = Yb − Yb . By point (iv) of Theorem 5.1 (on a submodel),
2
D = SS0e − SSe . In other words,
SSR = SST − SSe .
k
5.4. COEFFICIENT OF DETERMINATION
68
The identity SST = SSe + SSR can also be shown directly while using a little algebra. We have
SST =
n
X
Yi − Y
2
=
i=1
=
n
X
n
X
Yi − Ybi + Ybi − Y
2
i=1
Yi − Ybi
2
+
i=1
n
X
Ybi − Y
2
+2
= SSe + SSR + 2
Yi − Ybi Ybi − Y
i=1
i=1
n
nX
n
X
Yi Ybi − Y
i=1
n
X
Yi + Y
i=1
n
X
i=1
Ybi −
n
X
Ybi2
i=1
{z
0
|
o
}
= SSe + SSR
P
P
since ni=1 Yi = ni=1 Ybi and additionally
n
X
Yi Ybi = Y > Yb = Y > HY ,
i=1
5.4.3
n
X
>
Ybi2 = Yb Yb = Y > HHY = Y > HY .
i=1
k
5.4.3 Evaluation of the prediction quality of the model

(End of Lecture #10, 05/11/2015. Start of Lecture #12, 12/11/2015.)

One of the usual aims of regression modelling is so-called prediction, in which case the model-based mean is used as the predicted response value. In such situations, it is assumed that the data (Yᵢ, Xᵢ), i = 1, …, n, are a random sample from some joint distribution of a generic random vector (Y, X) and that the conditional distribution Y | X can be described by a linear model Y | X ∼ (Xβ, σ²Iₙ), rank(X) = r, for the data. That is, the mean and the variance of the conditional distribution Y | X are given as
$$
\mathrm{E}\bigl(Y \mid \boldsymbol{X}\bigr) = \boldsymbol{X}^\top \boldsymbol{\beta}
\qquad\text{and}\qquad
\mathrm{var}\bigl(Y \mid \boldsymbol{X}\bigr) = \sigma^2,
$$
respectively.

In the following, let E_Y and var_Y denote the expectation and the variance, respectively, with respect to the marginal distribution of Y. The intercept-only model Y ∼ (1ₙγ, ζ²Iₙ) for the data then corresponds to the marginal distribution of the response Y with
$$
\mathrm{E}_Y(Y) = \gamma \qquad\text{and}\qquad \mathrm{var}_Y(Y) = \zeta^2.
$$

Suppose now that a not yet observed random vector (Y_new, X_new) is also distributed as the generic random vector (Y, X), and assume that all parameters of the considered models are known. The aim is to provide a prediction Ŷ_new of the value of Y_new. To this end, the quality of the prediction is, most classically, evaluated by the mean squared error of prediction⁶ (MSEP), defined as
$$
\mathrm{MSEP}\bigl(\widehat{Y}_{\mathrm{new}}\bigr)
= \mathrm{E}_{Y,X}\bigl(\widehat{Y}_{\mathrm{new}} - Y_{\mathrm{new}}\bigr)^2,
\tag{5.11}
$$
where the symbol E_{Y,X} denotes the expectation with respect to the joint distribution of the random vector (Y, X).

To predict the value of Y_new, we have basically two options, depending on whether a value of X_new (covariates for the new observation) is or is not available to construct the prediction.

6 střední čtvercová chyba predikce
(i) If the value of X_new is not available, the prediction can only be based on the marginal (intercept-only) model, where the model-based mean equals γ, which is also the prediction Ŷᴹ_new of Y_new. That is, Ŷᴹ_new = γ, and we get
$$
\mathrm{MSEP}\bigl(\widehat{Y}^{M}_{\mathrm{new}}\bigr)
= \mathrm{E}_{Y,X}\bigl(\gamma - Y_{\mathrm{new}}\bigr)^2
= \mathrm{E}_{Y}\bigl(\gamma - Y_{\mathrm{new}}\bigr)^2
= \mathrm{var}_Y\bigl(Y_{\mathrm{new}}\bigr) = \zeta^2.
$$
(ii) If the value of X_new is available, the conditional (regression) model can be used, leading to the prediction of Y_new of the form Ŷᶜ_new = X_newᵀβ.
Let E_X denote the expectation with respect to the marginal distribution of the covariates X. The MSEP then equals
$$
\mathrm{MSEP}\bigl(\widehat{Y}^{C}_{\mathrm{new}}\bigr)
= \mathrm{E}_{Y,X}\bigl(\boldsymbol{X}_{\mathrm{new}}^\top \boldsymbol{\beta} - Y_{\mathrm{new}}\bigr)^2
= \mathrm{E}_{X}\Bigl[\mathrm{E}\bigl\{\bigl(\boldsymbol{X}_{\mathrm{new}}^\top \boldsymbol{\beta} - Y_{\mathrm{new}}\bigr)^2 \,\big|\, \boldsymbol{X}_{\mathrm{new}}\bigr\}\Bigr]
= \mathrm{E}_{X}\bigl\{\mathrm{var}\bigl(Y_{\mathrm{new}} \mid \boldsymbol{X}_{\mathrm{new}}\bigr)\bigr\}
= \mathrm{E}_{X}\,\sigma^2 = \sigma^2.
$$
Finally, we get
$$
\frac{\mathrm{MSEP}\bigl(\widehat{Y}^{C}_{\mathrm{new}}\bigr)}{\mathrm{MSEP}\bigl(\widehat{Y}^{M}_{\mathrm{new}}\bigr)}
= \frac{\sigma^2}{\zeta^2}.
$$
That is, the ratio σ²/ζ² quantifies the advantage of using the prediction Ŷᶜ_new based on the regression model and the covariate values X_new, compared to using the prediction Ŷᴹ_new, which is equal to the marginal response expectation.
5.4.4 Coefficient of determination
To estimate the ratio σ²/ζ² between the conditional and the marginal response variances, we can straightforwardly consider the following. First,
$$
\frac{1}{\nu_T}\,\mathrm{SS}_T = \frac{1}{n-1} \sum_{i=1}^n \bigl(Y_i - \overline{Y}\bigr)^2
$$
is the standard sample variance based on the random sample Y₁, …, Yₙ. That is, it is an unbiased estimator of the marginal variance ζ². Note that it is also the residual mean square from the intercept-only model (1ₙγ, ζ²Iₙ). Further,
$$
\frac{1}{\nu_e}\,\mathrm{SS}_e = \frac{1}{n-r} \sum_{i=1}^n \bigl(Y_i - \widehat{Y}_i\bigr)^2,
$$
which is the residual mean square from the considered linear model (Xβ, σ²Iₙ), is an unbiased estimator of the conditional variance σ². That is, a suitable estimator of the ratio σ²/ζ² is
$$
\frac{\frac{1}{n-r}\,\mathrm{SS}_e}{\frac{1}{n-1}\,\mathrm{SS}_T}
= \frac{n-1}{n-r} \cdot \frac{\mathrm{SS}_e}{\mathrm{SS}_T}.
\tag{5.12}
$$
Alternatively, if Y ∼ N(γ, ζ²), that is, if Y₁, …, Yₙ is a random sample from N(γ, ζ²), it can be (it was) easily derived that the quantity
$$
\frac{1}{n}\,\mathrm{SS}_T = \frac{1}{n} \sum_{i=1}^n \bigl(Y_i - \overline{Y}\bigr)^2
$$
is the maximum-likelihood estimator⁷ (MLE) of the marginal variance ζ². Analogously, if Y | X ∼ N(Xᵀβ, σ²), it can be derived (see the exercise class) that the quantity
$$
\frac{1}{n}\,\mathrm{SS}_e = \frac{1}{n} \sum_{i=1}^n \bigl(Y_i - \widehat{Y}_i\bigr)^2
$$
is the MLE of the conditional variance σ². An alternative estimator of the ratio σ²/ζ² is then
$$
\frac{\frac{1}{n}\,\mathrm{SS}_e}{\frac{1}{n}\,\mathrm{SS}_T} = \frac{\mathrm{SS}_e}{\mathrm{SS}_T}.
\tag{5.13}
$$
Remember that in the model Y | X ∼ (Xβ, σ²Iₙ) with intercept (1ₙ ∈ M(X)), we have
$$
\underbrace{\sum_{i=1}^n \bigl(Y_i - \overline{Y}\bigr)^2}_{\mathrm{SS}_T}
= \underbrace{\sum_{i=1}^n \bigl(Y_i - \widehat{Y}_i\bigr)^2}_{\mathrm{SS}_e}
+ \underbrace{\sum_{i=1}^n \bigl(\widehat{Y}_i - \overline{Y}\bigr)^2}_{\mathrm{SS}_R},
$$
where the three sums of squares represent different sources of the response variability:

SS_T (total sum of squares): original (marginal) variability of the response;

SS_e (residual sum of squares): variability not explained by the regression model (residual variability, conditional variability);

SS_R (regression sum of squares): variability explained by the regression model.
Expressions (5.12) and (5.13) then motivate the following definition.

Definition 5.5 (Coefficients of determination).
Consider a linear model Y | X ∼ (Xβ, σ²Iₙ), rank(X) = r, where 1ₙ ∈ M(X). The value
$$
R^2 = 1 - \frac{\mathrm{SS}_e}{\mathrm{SS}_T}
$$
is called the coefficient of determination⁸ of the linear model. The value
$$
R^2_{\mathrm{adj}} = 1 - \frac{n-1}{n-r} \cdot \frac{\mathrm{SS}_e}{\mathrm{SS}_T}
$$
is called the adjusted coefficient of determination⁹ of the linear model.

8 koeficient determinace
9 upravený koeficient determinace
Notes.
• By Theorem 5.8, SS_T = SS_e + SS_R, and at the same time SS_R ≥ 0 and SS_e ≥ 0. Hence
$$
0 \le R^2 \le 1, \qquad R^2_{\mathrm{adj}} \le 1,
$$
and R² can also be expressed as
$$
R^2 = \frac{\mathrm{SS}_R}{\mathrm{SS}_T}.
$$
• R² and R²_adj are often reported as R² · 100% and R²_adj · 100%, which can be interpreted as a percentage of the response variability explained by the regression model.
• Both R² and R²_adj quantify the relative improvement in the quality of prediction if the regression model and the conditional distribution of the response given the covariates are used, compared to prediction based on the marginal distribution of the response.
• Both coefficients of determination only quantify the predictive ability of the model. They do not say much about the quality of the model with respect to the possibility of capturing correctly the conditional mean E(Y | X). Even a model with a low value of R² (R²_adj) might be useful with respect to modelling the conditional mean E(Y | X). The model is perhaps only useless for prediction purposes.
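A short numerical sketch of Definition 5.5 (simulated data; the model is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
z = rng.normal(size=n)
X = np.column_stack([np.ones(n), z])
r = np.linalg.matrix_rank(X)               # here r = 2
Y = 1.0 - 2.0 * z + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, Y, rcond=None)
SSe = np.sum((Y - X @ b) ** 2)
SST = np.sum((Y - Y.mean()) ** 2)

R2 = 1 - SSe / SST                          # coefficient of determination
R2_adj = 1 - (n - 1) / (n - r) * SSe / SST  # adjusted coefficient of determination
print(0 <= R2 <= 1, R2_adj <= R2)           # prints True True
```

Since (n − 1)/(n − r) ≥ 1, the adjusted coefficient never exceeds R² (and, unlike R², can even be negative for a very poor model).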
5.4.5 Overall F-test

Lemma 5.9 (Overall F-test).
Assume a normal linear model Y | X ∼ Nₙ(Xβ, σ²Iₙ), rank(X_{n×k}) = r > 1, where 1ₙ ∈ M(X). Let R² be its coefficient of determination. The submodel F-statistic to compare the model M: Y | X ∼ Nₙ(Xβ, σ²Iₙ) and the intercept-only model M₀: Y | X ∼ Nₙ(1ₙγ, σ²Iₙ) takes the form
$$
F_0 = \frac{R^2}{1 - R^2} \cdot \frac{n-r}{r-1}.
\tag{5.14}
$$
Proof.
• R² = 1 − SS_e/SS_T and, according to Lemma 5.6, SS_T = SS⁰_e.
• Hence
$$
R^2 = 1 - \frac{\mathrm{SS}_e}{\mathrm{SS}^0_e} = \frac{\mathrm{SS}^0_e - \mathrm{SS}_e}{\mathrm{SS}^0_e},
\qquad
1 - R^2 = \frac{\mathrm{SS}_e}{\mathrm{SS}^0_e}.
$$
• At the same time,
$$
F_0 = \frac{\dfrac{\mathrm{SS}^0_e - \mathrm{SS}_e}{r-1}}{\dfrac{\mathrm{SS}_e}{n-r}}
= \frac{n-r}{r-1} \cdot \frac{\mathrm{SS}^0_e - \mathrm{SS}_e}{\mathrm{SS}_e}
= \frac{n-r}{r-1} \cdot \frac{\dfrac{\mathrm{SS}^0_e - \mathrm{SS}_e}{\mathrm{SS}^0_e}}{\dfrac{\mathrm{SS}_e}{\mathrm{SS}^0_e}}
= \frac{n-r}{r-1} \cdot \frac{R^2}{1 - R^2}. \qquad □
$$
Note. The F-test with the test statistic (5.14) is sometimes (especially in some software packages) referred to as an overall goodness-of-fit test. Nevertheless, be cautious when interpreting the results of such a test. It says practically nothing about the quality of the model and the "goodness of fit"!
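Equation (5.14) can be sanity-checked by computing the submodel F-statistic both directly and from R²; a sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 50, 3
Z = rng.normal(size=(n, k - 1))
X = np.column_stack([np.ones(n), Z])       # full-rank design, so r = k
r = k
Y = 1.0 + Z @ np.array([0.5, -1.0]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, Y, rcond=None)
SSe = np.sum((Y - X @ b) ** 2)
SS0e = np.sum((Y - Y.mean()) ** 2)         # residual SS of the intercept-only model (= SS_T)
R2 = 1 - SSe / SS0e

# submodel F-statistic computed directly ...
F_direct = ((SS0e - SSe) / (r - 1)) / (SSe / (n - r))
# ... and via formula (5.14)
F_from_R2 = R2 / (1 - R2) * (n - r) / (r - 1)
print(np.isclose(F_direct, F_from_R2))     # prints True
```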
Chapter 6  General Linear Model
We still assume that the data are represented by a set of n random vectors (Yᵢ, Xᵢ), Xᵢ = (X_{i,0}, …, X_{i,k−1})ᵀ, i = 1, …, n, and use the symbol Y for the vector (Y₁, …, Yₙ)ᵀ and X for the n × k matrix with rows given by the vectors X₁, …, Xₙ. In this chapter, we mildly extend the linear model by allowing for a covariance matrix of a different form than the σ²Iₙ assumed up to now.

Definition 6.1 (General linear model).
The data (Yᵢ, Xᵢ), i = 1, …, n, satisfy a general linear model¹ if
$$
\mathrm{E}\bigl(\mathbf{Y} \mid \mathbf{X}\bigr) = \mathbf{X}\boldsymbol{\beta},
\qquad
\mathrm{var}\bigl(\mathbf{Y} \mid \mathbf{X}\bigr) = \sigma^2 \mathbf{W}^{-1},
$$
where β ∈ ℝᵏ and 0 < σ² < ∞ are unknown parameters and W is a known positive definite matrix.
Notes.
• The fact that the data follow a general linear model is denoted as Y | X ∼ (Xβ, σ²W⁻¹).
• The general linear model should not be confused with a generalized linear model², which is something different (see the Advanced Regression Models (NMST432) course). In the literature, the abbreviation "GLM" is (unfortunately) used for both the general and the generalized linear model. It must be clear from context which of the two is meant.
Example 6.1 (Regression based on sample means).
Suppose that the data are represented by random vectors
$$
\bigl(\widetilde{Y}_{1,1}, \ldots, \widetilde{Y}_{1,w_1}, \boldsymbol{X}_1\bigr),
\;\ldots,\;
\bigl(\widetilde{Y}_{n,1}, \ldots, \widetilde{Y}_{n,w_n}, \boldsymbol{X}_n\bigr)
$$
such that for each i = 1, …, n, the random variables Ỹ_{i,1}, …, Ỹ_{i,wᵢ} are uncorrelated with a common conditional (given Xᵢ) variance σ².

1 obecný lineární model
2 zobecněný lineární model
Suppose that we are only able to observe the sample means of the "Ỹ" variables, leading to the response variables Y₁, …, Yₙ, where
$$
Y_1 = \frac{1}{w_1} \sum_{j=1}^{w_1} \widetilde{Y}_{1,j},
\;\ldots,\;
Y_n = \frac{1}{w_n} \sum_{j=1}^{w_n} \widetilde{Y}_{n,j}.
$$
The covariance matrix (conditional, given X) of the random vector Y = (Y₁, …, Yₙ)ᵀ is then
$$
\mathrm{var}(\mathbf{Y}) := \mathrm{var}\bigl(\mathbf{Y} \mid \mathbf{X}\bigr)
= \sigma^2
\underbrace{\begin{pmatrix}
\frac{1}{w_1} & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & \frac{1}{w_n}
\end{pmatrix}}_{\mathbf{W}^{-1}}.
$$
Theorem 6.1 (Generalized least squares).
Assume a general linear model Y | X ∼ (Xβ, σ²W⁻¹), where rank(X_{n×k}) = r ≤ k < n. The following then holds:

(i) The vector
$$
\widehat{\mathbf{Y}}_G := \mathbf{X}\bigl(\mathbf{X}^\top \mathbf{W} \mathbf{X}\bigr)^{-} \mathbf{X}^\top \mathbf{W} \mathbf{Y}
$$
is the best linear unbiased estimator (BLUE) of the vector parameter μ := E(Y) = Xβ, and
$$
\mathrm{var}\bigl(\widehat{\mathbf{Y}}_G \mid \mathbf{X}\bigr)
= \sigma^2\,\mathbf{X}\bigl(\mathbf{X}^\top \mathbf{W} \mathbf{X}\bigr)^{-}\mathbf{X}^\top.
$$
Both Ŷ_G and var(Ŷ_G | X) do not depend on the choice of the pseudoinverse (XᵀWX)⁻.
If further Y | X ∼ Nₙ(Xβ, σ²W⁻¹), then
$$
\widehat{\mathbf{Y}}_G \mid \mathbf{X} \sim \mathcal{N}_n\bigl(\mathbf{X}\boldsymbol{\beta},\;
\sigma^2\,\mathbf{X}(\mathbf{X}^\top \mathbf{W} \mathbf{X})^{-}\mathbf{X}^\top\bigr).
$$

(ii) Let l ∈ ℝᵏ, l ≠ 0ₖ, be such that θ = lᵀβ is an estimable parameter of the model, and let
$$
\mathbf{b}_G := \bigl(\mathbf{X}^\top \mathbf{W} \mathbf{X}\bigr)^{-}\mathbf{X}^\top \mathbf{W} \mathbf{Y}.
$$
Then θ̂_G = lᵀb_G does not depend on the choice of the pseudoinverse used to calculate b_G, and θ̂_G is the best linear unbiased estimator (BLUE) of θ with
$$
\mathrm{var}\bigl(\widehat{\theta}_G \mid \mathbf{X}\bigr)
= \sigma^2\,\boldsymbol{l}^\top \bigl(\mathbf{X}^\top \mathbf{W} \mathbf{X}\bigr)^{-}\boldsymbol{l},
$$
which also does not depend on the choice of the pseudoinverse. If further Y | X ∼ Nₙ(Xβ, σ²W⁻¹), then
$$
\widehat{\theta}_G \mid \mathbf{X} \sim \mathcal{N}\bigl(\theta,\;
\sigma^2\,\boldsymbol{l}^\top (\mathbf{X}^\top \mathbf{W} \mathbf{X})^{-}\boldsymbol{l}\bigr).
$$

(iii) If further r = k (a full-rank general linear model), then
$$
\widehat{\boldsymbol{\beta}}_G := \bigl(\mathbf{X}^\top \mathbf{W} \mathbf{X}\bigr)^{-1}\mathbf{X}^\top \mathbf{W} \mathbf{Y}
$$
is the best linear unbiased estimator (BLUE) of β with
$$
\mathrm{var}\bigl(\widehat{\boldsymbol{\beta}}_G \mid \mathbf{X}\bigr)
= \sigma^2\bigl(\mathbf{X}^\top \mathbf{W} \mathbf{X}\bigr)^{-1}.
$$
If additionally Y | X ∼ Nₙ(Xβ, σ²W⁻¹), then
$$
\widehat{\boldsymbol{\beta}}_G \mid \mathbf{X} \sim \mathcal{N}_k\bigl(\boldsymbol{\beta},\;
\sigma^2(\mathbf{X}^\top \mathbf{W} \mathbf{X})^{-1}\bigr).
$$

(iv) The statistic
$$
\mathrm{MS}_{e,G} := \frac{\mathrm{SS}_{e,G}}{n-r},
\qquad\text{where}\qquad
\mathrm{SS}_{e,G} := \bigl\|\mathbf{W}^{\frac{1}{2}}\bigl(\mathbf{Y} - \widehat{\mathbf{Y}}_G\bigr)\bigr\|^2
= \bigl(\mathbf{Y} - \widehat{\mathbf{Y}}_G\bigr)^\top \mathbf{W}\bigl(\mathbf{Y} - \widehat{\mathbf{Y}}_G\bigr),
$$
is an unbiased estimator of the residual variance σ². If additionally Y | X ∼ Nₙ(Xβ, σ²W⁻¹), then
$$
\frac{\mathrm{SS}_{e,G}}{\sigma^2} \sim \chi^2_{n-r},
$$
and the statistics SS_{e,G} and Ŷ_G are conditionally, given X, independent.

Proof. Proof/calculations were available on the blackboard in K1. □
Note. Mention also that, as a consequence of the above theorem, all classical tests, confidence intervals, etc. work in the same way as in the OLS case.
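A minimal numerical sketch of the full-rank GLS estimator from point (iii). It also checks the standard equivalence with OLS applied to the model premultiplied by W^{1/2} (the data, weights, and seed are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
w = rng.uniform(0.5, 2.0, size=n)          # diagonal of the (known) matrix W
W = np.diag(w)
beta = np.array([1.0, -2.0])
Y = X @ beta + rng.normal(size=n) / np.sqrt(w)   # var(Y_i | X) = sigma^2 / w_i

# full-rank GLS estimator: beta_G = (X' W X)^{-1} X' W Y
beta_G = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)

# equivalently, OLS after premultiplying the model by W^{1/2}
beta_t, *_ = np.linalg.lstsq(np.sqrt(w)[:, None] * X, np.sqrt(w) * Y, rcond=None)
print(np.allclose(beta_G, beta_t))         # prints True
```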
Terminology (Generalized fitted values, residual sum of squares, mean square, least squares estimator).
• The statistic Ŷ_G = X(XᵀWX)⁻XᵀWY is called the vector of the generalized fitted values.³
• The statistic SS_{e,G} = ‖W^{1/2}(Y − Ŷ_G)‖² = (Y − Ŷ_G)ᵀW(Y − Ŷ_G) is called the generalized residual sum of squares.⁴
• The statistic MS_{e,G} = SS_{e,G}/(n − r) is called the generalized mean square.⁵
• The statistic β̂_G = (XᵀWX)⁻¹XᵀWY in a full-rank general linear model is called the generalized least squares (GLS) estimator⁶ of the regression coefficients.

3 zobecněné vyrovnané hodnoty
4 zobecněný reziduální součet čtverců
5 zobecněný střední čtverec
6 zobecněných nejmenších čtverců
Note. The most common use of generalized least squares is the situation described in Example 6.1, where
$$
\mathbf{W}^{-1} = \begin{pmatrix}
\frac{1}{w_1} & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & \frac{1}{w_n}
\end{pmatrix}.
$$
We then get
$$
\mathbf{X}^\top \mathbf{W} \mathbf{Y} = \sum_{i=1}^n w_i\, Y_i\, \boldsymbol{X}_i,
\qquad
\mathbf{X}^\top \mathbf{W} \mathbf{X} = \sum_{i=1}^n w_i\, \boldsymbol{X}_i \boldsymbol{X}_i^\top,
\qquad
\mathrm{SS}_{e,G} = \sum_{i=1}^n w_i\bigl(Y_i - \widehat{Y}_{G,i}\bigr)^2.
$$
The method of generalized least squares is then usually referred to as the method of weighted least squares (WLS).⁷

7 vážené nejmenší čtverce
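In the setting of Example 6.1, WLS on the observed group means with weights wᵢ reproduces exactly what OLS on the full raw data would give, since both lead to the same normal equations. A small simulation sketch (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 12
w = rng.integers(2, 6, size=n)             # group sizes w_i
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# raw observations tilde(Y)_{i,j}, j = 1, ..., w_i, of which only means are kept
Y_raw = [1.0 + 2.0 * x[i] + rng.normal(size=w[i]) for i in range(n)]
Y = np.array([g.mean() for g in Y_raw])    # observed sample means Y_i

# WLS on the means with weights w_i (since var(Y_i | X_i) = sigma^2 / w_i)
W = np.diag(w.astype(float))
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)

# the same estimate from OLS on the full raw data (row i expanded to w_i rows)
beta_ols, *_ = np.linalg.lstsq(np.repeat(X, w, axis=0), np.concatenate(Y_raw),
                               rcond=None)
print(np.allclose(beta_wls, beta_ols))     # prints True
```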
(End of Lecture #12, 12/11/2015.)
Chapter 7  Parameterizations of Covariates
7.1 Linearization of the dependence of the response on the covariates

(Start of Lecture #5, 15/10/2015.)

As is usual in this lecture, we represent the data by n random vectors (Yᵢ, Zᵢ), Zᵢ = (Z_{i,1}, …, Z_{i,p})ᵀ ∈ 𝒵 ⊆ ℝᵖ, i = 1, …, n. The principal problem we consider is to find a suitable model to express the (conditional) response expectation E(Y) := E(Y | Z), where Z is a matrix with the vectors Z₁, …, Zₙ in its rows. To this end, we consider a linear model, where E(Y) can be expressed as E(Y) = Xβ for some β ∈ ℝᵏ, where
$$
\mathbf{X} = \begin{pmatrix} \boldsymbol{X}_1^\top \\ \vdots \\ \boldsymbol{X}_n^\top \end{pmatrix},
\qquad
\boldsymbol{X}_i = \bigl(X_{i,0}, \ldots, X_{i,k-1}\bigr)^\top = \boldsymbol{t}(\boldsymbol{Z}_i),
\quad i = 1, \ldots, n,
$$
and t: 𝒵 → 𝒳 ⊆ ℝᵏ, t(z) = (t₀(z), …, t_{k−1}(z))ᵀ = (x₀, …, x_{k−1})ᵀ = x, is a suitable transformation of the original covariates that linearizes the relationship between the response expectation and the covariates. The corresponding regression function is then
$$
m(\boldsymbol{z}) = \beta_0\, t_0(\boldsymbol{z}) + \cdots + \beta_{k-1}\, t_{k-1}(\boldsymbol{z}),
\qquad \boldsymbol{z} \in \mathcal{Z}.
\tag{7.1}
$$
One of the main problems of a regression analysis is to find a reasonable form of the transformation function t to obtain a model that is perhaps wrong but at least useful to capture sufficiently the form of E(Y), and in general to express E(Y | Z = z), z ∈ 𝒵, for a generic response Y being generated, given the covariate value Z = z, by the same probabilistic mechanism as the original data.
7.2 Parameterization of a single covariate

In this and the two following sections, we first limit ourselves to the situation of a single covariate, i.e., p = 1, 𝒵 ⊆ ℝ, and show some classical choices of the transformations that are used in practical analyses when attempting to find a useful linear model.
7.2.1 Parameterization

Our aim is to propose transformations t: 𝒵 → ℝᵏ, t(z) = (t₀(z), …, t_{k−1}(z))ᵀ, such that the regression function (7.1) can possibly provide a useful model for the response expectation E(Y | Z = z). Furthermore, in most cases we limit ourselves to transformations that lead to a linear model with intercept. In such cases, the regression function will be
$$
m(z) = \beta_0 + \beta_1\, s_1(z) + \cdots + \beta_{k-1}\, s_{k-1}(z),
\qquad z \in \mathcal{Z},
\tag{7.2}
$$
where the non-intercept part of the transformation t will be denoted by s. That is, for z ∈ 𝒵, j = 1, …, k − 1,
$$
\boldsymbol{s}: \mathcal{Z} \longrightarrow \mathbb{R}^{k-1},
\qquad
\boldsymbol{s}(z) = \bigl(s_1(z), \ldots, s_{k-1}(z)\bigr)^\top
= \bigl(t_1(z), \ldots, t_{k-1}(z)\bigr)^\top,
\qquad
s_j(z) = t_j(z).
$$
Definition 7.1 (Parameterization of a covariate).
Let Z₁, …, Zₙ be values of a given univariate covariate Z ∈ 𝒵 ⊆ ℝ. By a parameterization of this covariate we mean
(i) a function s: 𝒵 → ℝ^{k−1}, s(z) = (s₁(z), …, s_{k−1}(z))ᵀ, z ∈ 𝒵, where all s₁, …, s_{k−1} are non-constant functions on 𝒵, and
(ii) an n × (k − 1) matrix S, where
$$
\mathbf{S} = \begin{pmatrix} \boldsymbol{s}^\top(Z_1) \\ \vdots \\ \boldsymbol{s}^\top(Z_n) \end{pmatrix}
= \begin{pmatrix}
s_1(Z_1) & \cdots & s_{k-1}(Z_1) \\
\vdots & \ddots & \vdots \\
s_1(Z_n) & \cdots & s_{k-1}(Z_n)
\end{pmatrix}.
$$
Terminology (Reparameterizing matrix, regressors).
The matrix S from Definition 7.1 is called the reparameterizing matrix¹ of a covariate. Its columns, i.e., the vectors
$$
\boldsymbol{X}^1 = \begin{pmatrix} s_1(Z_1) \\ \vdots \\ s_1(Z_n) \end{pmatrix},
\;\ldots,\;
\boldsymbol{X}^{k-1} = \begin{pmatrix} s_{k-1}(Z_1) \\ \vdots \\ s_{k-1}(Z_n) \end{pmatrix},
$$
are called regressors.²

1 reparametrizační matice
2 regresory
Notes.
• The model matrix X of the model with the regression function (7.2) is
$$
\mathbf{X} = \bigl(\mathbf{1}_n,\, \mathbf{S}\bigr)
= \begin{pmatrix} 1 & \boldsymbol{X}_1^\top \\ \vdots & \vdots \\ 1 & \boldsymbol{X}_n^\top \end{pmatrix}
= \begin{pmatrix}
1 & X_{1,1} & \cdots & X_{1,k-1} \\
\vdots & \vdots & \ddots & \vdots \\
1 & X_{n,1} & \cdots & X_{n,k-1}
\end{pmatrix}
= \bigl(\mathbf{1}_n,\, \boldsymbol{X}^1, \ldots, \boldsymbol{X}^{k-1}\bigr),
$$
where Xᵢ = s(Zᵢ) and X_{i,j} = s_j(Zᵢ), i = 1, …, n, j = 1, …, k − 1.
• Definition 7.1 is such that an intercept vector 1ₙ (or a vector c·1ₙ, c ∈ ℝ) is (with a positive probability) not included in the reparameterizing matrix S. Nevertheless, it will be useful in some situations to consider parameterizations that (almost surely) include an intercept term in the corresponding regression space. That is, for some parameterizations (see the regression splines in Section 7.3.4), we will have 1ₙ ∈ M(S).
7.2.2 Covariate types

The covariate space 𝒵 and the corresponding univariate covariates Z₁, …, Zₙ are usually of one of two types, and different parameterizations are useful depending on the covariate type. The types are the following.

Numeric covariates

Numeric³ covariates are covariates for which a ratio of two covariate values makes sense and a unit increase of the covariate value has an unambiguous meaning. A numeric covariate is usually of one of the two following subtypes:
(i) continuous, in which case 𝒵 is mostly an interval in ℝ. Such covariates usually have a physical interpretation and some units whose choice must be taken into account when interpreting the results of the statistical analysis. Continuous numeric covariates are mostly (but not necessarily) represented by continuous random variables.
(ii) discrete, in which case 𝒵 is a countably infinite or finite (but "large") subset of ℝ. The most common instance of a discrete numeric covariate is a count⁴ with 𝒵 ⊆ ℕ₀. Discrete numeric covariates are represented by discrete random variables.

Categorical covariates

Categorical⁵ covariates (in the R software referred to as factors) are covariates for which a ratio of two covariate values does not necessarily make sense and a unit increase of the covariate value does not necessarily have an unambiguous meaning. The sample space 𝒵 is a finite (and mostly "small") set, i.e.,
𝒵 = {ω₁, …, ω_G},
where the values ω₁ < ⋯ < ω_G are somewhat arbitrarily chosen labels of categories, used purely to obtain a mathematical representation of the covariate values. A categorical covariate is always represented by a discrete random variable. Even for categorical covariates, it is useful to distinguish two subtypes:

3 numerické, příp. kvantitativní
4 počet
5 kategoriální, příp. kvalitativní
(i) nominal⁶, where, from a practical point of view, the chosen values ω₁, …, ω_G are completely arbitrary. Consequently, practically interpretable results and conclusions of any sensible statistical analysis should be invariant to the choice of ω₁, …, ω_G. A nominal categorical covariate mostly represents membership in some group (a group label), e.g., region of residence.
(ii) ordinal⁷, where the ordering ω₁ < ⋯ < ω_G makes sense also from a practical point of view.

Notes.
• From the practical point of view, it is mainly important to distinguish numeric and categorical covariates.
• Often, an ordinal categorical covariate can also be viewed as a discrete numeric one. Whatever in this lecture applies to a discrete numeric covariate can also be applied to an ordinal categorical covariate if it makes sense to interpret, at least to some extent, its unit increase (and not only the ordering of the covariate values).

6 nominální
7 ordinální
7.3 Numeric covariate

It is now assumed that Zᵢ ∈ 𝒵 ⊆ ℝ, i = 1, …, n, are numeric covariates. Our aim now is to propose sensible parameterizations of them.
7.3.1 Simple transformation of the covariate

The regression function is
$$
m(z) = \beta_0 + \beta_1\, s(z), \qquad z \in \mathcal{Z},
\tag{7.3}
$$
where s: 𝒵 → ℝ is a suitable non-constant function. The corresponding reparameterizing matrix is
$$
\mathbf{S} = \begin{pmatrix} s(Z_1) \\ \vdots \\ s(Z_n) \end{pmatrix}.
$$
Due to interpretability issues, "simple" functions like the identity, logarithm, exponential, square root, reciprocal, etc., are considered in place of the transformation s.
Evaluation of the effect of the original covariate

An advantage of a model with the regression function (7.3) is the fact that a single regression coefficient β₁ (the slope in a model with the regression line in x = s(z)) quantifies the effect of the covariate on the response expectation, which can then be easily summarized by a single point estimate and a confidence interval. Evaluation of the statistical significance of the effect of the original covariate on the response expectation is achieved by testing the null hypothesis
H₀: β₁ = 0.
A possible test procedure was introduced in Section 3.2.
Interpretation of the regression coefficients

A disadvantage is the fact that the slope β₁ expresses the change of the response expectation that corresponds to a unit change of the transformed covariate X = s(Z), i.e., for z ∈ 𝒵:
$$
\beta_1 = \mathrm{E}\bigl(Y \mid X = s(z) + 1\bigr) - \mathrm{E}\bigl(Y \mid X = s(z)\bigr),
$$
which is not always easily interpretable. Moreover, unless the transformation s is a linear function, the change in the response expectation that corresponds to a unit change of the original covariate is a function of that covariate:
$$
\mathrm{E}\bigl(Y \mid Z = z + 1\bigr) - \mathrm{E}\bigl(Y \mid Z = z\bigr)
= \beta_1\bigl\{s(z+1) - s(z)\bigr\}, \qquad z \in \mathcal{Z}.
$$
In other words, a model with the regression function (7.3) and a non-linear transformation s expresses the fact that the original covariate has a different influence on the response expectation depending on the value of this covariate.

Note. It is easily seen that if n > k = 2, the transformation s is strictly monotone and the data contain at least two different values among Z₁, …, Zₙ (which has probability one if the covariates Zᵢ are sampled from a continuous distribution), then the model matrix X = (1ₙ, S) is of full rank r = k = 2.
7.3.2 Raw polynomials

The regression function is a polynomial of a chosen degree k − 1, i.e.,
$$
m(z) = \beta_0 + \beta_1 z + \cdots + \beta_{k-1} z^{k-1}, \qquad z \in \mathcal{Z}.
\tag{7.4}
$$
The parameterization is
$$
\boldsymbol{s}: \mathcal{Z} \longrightarrow \mathbb{R}^{k-1},
\qquad
\boldsymbol{s}(z) = \bigl(z, \ldots, z^{k-1}\bigr)^\top, \quad z \in \mathcal{Z},
$$
and the corresponding reparameterizing matrix is
$$
\mathbf{S} = \begin{pmatrix}
Z_1 & \cdots & Z_1^{k-1} \\
\vdots & \ddots & \vdots \\
Z_n & \cdots & Z_n^{k-1}
\end{pmatrix}.
$$
Evaluation of the effect of the original covariate

The effect of the original covariate on the response expectation is now quantified by a set of k − 1 regression coefficients β^Z := (β₁, …, β_{k−1})ᵀ. To evaluate the statistical significance of the effect of the original covariate on the response expectation, we have to test the null hypothesis
H₀: β^Z = 0_{k−1}.
An appropriate test procedure was introduced in Section 3.2.

Interpretation of the regression coefficients

With k > 2 (at least a quadratic regression function), the single regression coefficients β₁, …, β_{k−1} only occasionally have a direct reasonable interpretation. Analogously to a simple non-linear transformation of the covariate, the change in the response expectation that corresponds to a unit change of the original covariate is a function of that covariate:
$$
\mathrm{E}\bigl(Y \mid Z = z + 1\bigr) - \mathrm{E}\bigl(Y \mid Z = z\bigr)
= \beta_1 + \beta_2\bigl\{(z+1)^2 - z^2\bigr\} + \cdots + \beta_{k-1}\bigl\{(z+1)^{k-1} - z^{k-1}\bigr\},
\qquad z \in \mathcal{Z}.
$$

Note. It is again easily seen that if n > k and the data contain at least k different values among Z₁, …, Zₙ (which has probability one if the covariates Zᵢ are sampled from a continuous distribution), the model matrix (1ₙ, S) is of full rank r = k.

Degree of the polynomial

A test on a subset of the regression coefficients (Section 3.2) or a submodel test (Section 5.2) can be used to infer on the degree of the polynomial in the regression function (7.4). The null hypothesis expressing, for d < k, the belief that the regression function is a polynomial of degree d − 1 corresponds to the null hypothesis
H₀: β_d = 0 & … & β_{k−1} = 0.
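The points above can be sketched numerically: the raw-polynomial model matrix is a Vandermonde matrix of full rank whenever at least k distinct covariate values occur, and the effect of a unit increase of Z depends on z (the coefficient vector below is hypothetical, chosen only to illustrate the formula):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 25, 4                               # cubic polynomial, degree k - 1 = 3
Z = rng.uniform(-1.0, 1.0, size=n)         # at least k distinct values almost surely

# model matrix (1_n, S): columns 1, z, z^2, z^3
X = np.vander(Z, N=k, increasing=True)
print(np.linalg.matrix_rank(X) == k)       # full rank r = k: prints True

# z-dependent effect of a unit increase of Z under (7.4)
beta = np.array([0.0, 1.0, -0.5, 0.2])     # hypothetical coefficients
def unit_effect(z):
    return sum(beta[j] * ((z + 1.0) ** j - z ** j) for j in range(1, k))
print(np.isclose(unit_effect(0.0), unit_effect(0.5)))  # prints False
```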
7.3.3 Orthonormal polynomials

The regression function is again a polynomial of a chosen degree k − 1; nevertheless, a different basis of the regression space, i.e., a different parameterization of the polynomial, is used. Namely, the regression function is
$$
m(z) = \beta_0 + \beta_1\, P^1(z) + \cdots + \beta_{k-1}\, P^{k-1}(z), \qquad z \in \mathcal{Z},
\tag{7.5}
$$
where P^j is an orthonormal polynomial of degree j, j = 1, …, k − 1, built above the set of the covariate datapoints Z₁, …, Zₙ. That is,
$$
P^j(z) = a_{j,0} + a_{j,1}\, z + \cdots + a_{j,j}\, z^j, \qquad j = 1, \ldots, k-1,
\tag{7.6}
$$
and the polynomial coefficients a_{j,l}, j = 1, …, k − 1, l = 0, …, j, are such that the vectors
$$
\boldsymbol{P}^j = \begin{pmatrix} P^j(Z_1) \\ \vdots \\ P^j(Z_n) \end{pmatrix}, \qquad j = 1, \ldots, k-1,
$$
are all orthonormal and also orthogonal to the intercept vector P⁰ = (1, …, 1)ᵀ. The corresponding reparameterizing matrix is
$$
\mathbf{S} = \bigl(\boldsymbol{P}^1, \ldots, \boldsymbol{P}^{k-1}\bigr)
= \begin{pmatrix}
P^1(Z_1) & \cdots & P^{k-1}(Z_1) \\
\vdots & \ddots & \vdots \\
P^1(Z_n) & \cdots & P^{k-1}(Z_n)
\end{pmatrix},
\tag{7.7}
$$
which leads to the model matrix X = (1ₙ, S), whose columns are all mutually orthogonal, the non-intercept columns even having unit norm. For methods of calculating the coefficients of the polynomials (7.6), see lectures on linear algebra. It can only be mentioned here that as soon as the data contain at least k different values among Z₁, …, Zₙ, those polynomial coefficients exist and are unique.

Note. For a given dataset and a given polynomial degree k − 1, the model matrix X = (1ₙ, S) based on the orthonormal polynomials provides the same regression space as the model matrix based on the raw polynomials. Hence, the two model matrices determine two equivalent linear models.
Advantages of orthonormal polynomials compared to raw polynomials

• All non-intercept columns of the model matrix have the same (unit) norm. Consequently, all non-intercept regression coefficients β₁, …, β_{k−1} are on the same scale. This may be helpful when evaluating the practical (not statistical!) importance of higher-degree polynomial terms.
• The matrix XᵀX is the diagonal matrix diag(n, 1, …, 1). Consequently, the covariance matrix var(β̂) is also diagonal, i.e., the LSE of the regression coefficients are uncorrelated.
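One way to build such a basis numerically is a QR decomposition of the raw-polynomial matrix over the observed datapoints (a sketch of the idea, analogous to what R's poly() produces; it is not the only possible construction):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 50, 4
Z = rng.uniform(size=n)

V = np.vander(Z, N=k, increasing=True)     # raw basis evaluated at the data: 1, z, z^2, z^3
Q, _ = np.linalg.qr(V)                     # orthonormalize the columns over the data points
S = Q[:, 1:]                               # evaluated orthonormal polynomials P^1, ..., P^{k-1}

X = np.column_stack([np.ones(n), S])       # model matrix (1_n, S)
print(np.allclose(X.T @ X, np.diag([n, 1.0, 1.0, 1.0])))   # prints True
```

The first column of Q is proportional to 1ₙ, so the remaining columns are automatically orthogonal to the intercept and of unit norm, giving XᵀX = diag(n, 1, …, 1) as stated above.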
Evaluation of the effect of the original covariate

The effect of the original covariate on the response expectation is again quantified by a set of k − 1 regression coefficients β^Z := (β₁, …, β_{k−1})ᵀ. To evaluate the statistical significance of the effect of the original covariate on the response expectation, we have to test the null hypothesis
H₀: β^Z = 0_{k−1}.
See Section 3.2 for a possible test procedure.
Interpretation of the regression coefficients

The single regression coefficients β₁, …, β_{k−1} do not usually have a direct reasonable interpretation.

Degree of the polynomial

A test on a subset of the regression coefficients or a submodel test (introduced in Sections 3.2 and 5.2) can again be used to infer on the degree of the polynomial in the regression function (7.5), in the same way as with the raw polynomials. The null hypothesis expressing, for d < k, the belief that the regression function is a polynomial of degree d − 1 corresponds to the null hypothesis
H₀: β_d = 0 & … & β_{k−1} = 0.
7.3.4 Regression splines

Basis splines

The advantage of the polynomial regression functions introduced in Sections 7.3.2 and 7.3.3 is that they are smooth (have continuous derivatives of all orders) on the whole real line. Nevertheless, with least squares estimation, each data point affects the fitted regression function globally. This often leads to situations where the fitted regression function only poorly approximates the response expectation E(Y | Z = z) for values of z close to the boundaries of the covariate space 𝒵. This can be avoided with so-called regression splines.

Definition 7.2 (Basis spline with distinct knots).
Let d ∈ ℕ₀ and λ = (λ₁, …, λ_{d+2})ᵀ ∈ ℝ^{d+2}, where −∞ < λ₁ < ⋯ < λ_{d+2} < ∞. The basis spline of degree d with distinct knots⁸ λ is a function B^d(z; λ), z ∈ ℝ, such that
(i) B^d(z; λ) = 0 for z ≤ λ₁ and z ≥ λ_{d+2};
(ii) on each of the intervals (λⱼ, λ_{j+1}), j = 1, …, d + 1, B^d(·; λ) is a polynomial of degree d;
(iii) B^d(·; λ) has continuous derivatives up to order d − 1 on ℝ.

Notes.
• The basis spline with distinct knots is a piecewise⁹ polynomial of degree d on (λ₁, λ_{d+2}).
• The polynomial pieces are connected smoothly (of order d − 1) at the inner knots λ₂, …, λ_{d+1}.
• On the boundary (λ₁ and λ_{d+2}), the polynomial pieces are connected smoothly (of order d − 1) with the constant zero.

Definition 7.3 (Basis spline with coincident left boundary knots).
Let d ∈ ℕ₀, 1 < r < d + 2 and λ = (λ₁, …, λ_{d+2})ᵀ ∈ ℝ^{d+2}, where −∞ < λ₁ = ⋯ = λ_r < ⋯ < λ_{d+2} < ∞. The basis spline of degree d with r coincident left boundary knots¹⁰ λ is a function B^d(z; λ), z ∈ ℝ, such that

8 bazický spline [čti splajn] stupně d se vzájemně různými uzly
9 po částech
10 bazický spline stupně d s r překrývajícími se levými uzly
(i) B^d(z; λ) = 0 for z ≤ λ_r and z ≥ λ_{d+2};
(ii) on each of the intervals (λⱼ, λ_{j+1}), j = r, …, d + 1, B^d(·; λ) is a polynomial of degree d;
(iii) B^d(·; λ) has continuous derivatives up to order d − 1 on (λ_r, ∞);
(iv) B^d(·; λ) has continuous derivatives up to order d − r at λ_r.

Notes.
• The only qualitative difference between the basis spline with coincident left boundary knots and the basis spline with distinct knots is the fact that the former is smooth at the left boundary only of order d − r, compared to order d − 1 in the case of the basis spline with distinct knots.
• By mirroring Definition 7.3 to the right boundary, a basis spline with coincident right boundary knots is defined.
Basis B-splines

There are many ways to construct basis splines that satisfy the conditions of Definitions 7.2 and 7.3; see the Fundamentals of Numerical Mathematics (NMNM201) course. In statistics, so-called B-splines have proved to be extremely useful for regression purposes. It goes beyond the scope of this lecture to explain their construction in detail; it is fully covered by two landmark books, de Boor (1978, 2001) and Dierckx (1993), or in a compact way, e.g., by the paper of Eilers and Marx (1996). For the purpose of this lecture it is assumed that a routine is available to construct the basis B-splines of a given degree with given knots (e.g., the R function bs from the recommended package splines).

An important property of the basis B-splines is that they are positive inside their support interval (general basis splines can also attain negative values inside the support interval). That is, if λ = (λ₁, …, λ_{d+2})ᵀ is a set of knots (either distinct or coincident left or right) and B^d(·; λ) is a basis B-spline of degree d built above the knots λ, then
$$
B^d(z; \boldsymbol{\lambda}) > 0 \quad \text{for } \lambda_1 < z < \lambda_{d+2},
\qquad
B^d(z; \boldsymbol{\lambda}) = 0 \quad \text{for } z \le \lambda_1 \text{ or } z \ge \lambda_{d+2}.
$$
Spline basis

Definition 7.4 (Spline basis).
Let d ∈ ℕ₀, k ≥ d + 1 and λ = (λ₁, …, λ_{k−d+1})ᵀ ∈ ℝ^{k−d+1}, where −∞ < λ₁ < ⋯ < λ_{k−d+1} < ∞. The spline basis¹¹ of degree d with knots λ is the set of basis splines B₁, …, B_k, where for z ∈ ℝ,
$$
\begin{aligned}
B_1(z) &= B^d\bigl(z;\, \underbrace{\lambda_1, \ldots, \lambda_1}_{(d+1)\times},\, \lambda_2\bigr), \\
B_2(z) &= B^d\bigl(z;\, \underbrace{\lambda_1, \ldots, \lambda_1}_{d\times},\, \lambda_2,\, \lambda_3\bigr), \\
&\;\;\vdots \\
B_d(z) &= B^d\bigl(z;\, \underbrace{\lambda_1, \lambda_1}_{2\times},\, \lambda_2, \ldots, \lambda_{d+1}\bigr), \\
B_{d+1}(z) &= B^d\bigl(z;\, \lambda_1, \lambda_2, \ldots, \lambda_{d+2}\bigr), \\
B_{d+2}(z) &= B^d\bigl(z;\, \lambda_2, \ldots, \lambda_{d+3}\bigr), \\
&\;\;\vdots \\
B_{k-d}(z) &= B^d\bigl(z;\, \lambda_{k-2d}, \ldots, \lambda_{k-d+1}\bigr), \\
B_{k-d+1}(z) &= B^d\bigl(z;\, \lambda_{k-2d+1}, \ldots, \underbrace{\lambda_{k-d+1}, \lambda_{k-d+1}}_{2\times}\bigr), \\
&\;\;\vdots \\
B_{k-1}(z) &= B^d\bigl(z;\, \lambda_{k-d-1},\, \lambda_{k-d}, \ldots, \underbrace{\lambda_{k-d+1}, \ldots, \lambda_{k-d+1}}_{d\times}\bigr), \\
B_k(z) &= B^d\bigl(z;\, \lambda_{k-d}, \ldots, \underbrace{\lambda_{k-d+1}, \ldots, \lambda_{k-d+1}}_{(d+1)\times}\bigr).
\end{aligned}
$$

11 splinová báze
Properties of the B-spline basis

If k ≥ d + 1, a set of knots λ = (λ₁, …, λ_{k−d+1})ᵀ, −∞ < λ₁ < ⋯ < λ_{k−d+1} < ∞, is given and B₁, …, B_k is the spline basis of degree d with knots λ composed of basis B-splines, then
(a)
$$
\sum_{j=1}^k B_j(z) = 1 \qquad \text{for all } z \in \bigl[\lambda_1,\, \lambda_{k-d+1}\bigr];
\tag{7.8}
$$
(b) for each m ≤ d there exists a set of coefficients γ₁^m, …, γ_k^m such that
$$
\sum_{j=1}^k \gamma_j^m\, B_j(z) \;\text{ is, on } \bigl(\lambda_1,\, \lambda_{k-d+1}\bigr), \text{ a polynomial in } z \text{ of degree } m.
\tag{7.9}
$$
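The partition-of-unity property (7.8) can be checked numerically by evaluating the clamped B-spline basis directly via the standard Cox–de Boor recursion (a sketch; in practice one would use a library routine such as R's bs):

```python
import numpy as np

def bspline_basis(z, knots, d):
    """All k = len(knots) + d - 1 basis B-splines of degree d, as in Definition 7.4:
    boundary knots repeated (d + 1)-times, evaluated by the Cox-de Boor recursion."""
    t = np.concatenate([np.repeat(knots[0], d), knots, np.repeat(knots[-1], d)])
    z = np.atleast_1d(z).astype(float)
    # degree 0: indicator functions of the half-open intervals [t_j, t_{j+1})
    B = np.array([(t[j] <= z) & (z < t[j + 1]) for j in range(len(t) - 1)],
                 dtype=float).T
    for deg in range(1, d + 1):
        Bn = np.zeros((len(z), B.shape[1] - 1))
        for j in range(Bn.shape[1]):
            if t[j + deg] > t[j]:          # 0/0 convention: skip zero-width terms
                Bn[:, j] += (z - t[j]) / (t[j + deg] - t[j]) * B[:, j]
            if t[j + deg + 1] > t[j + 1]:
                Bn[:, j] += (t[j + deg + 1] - z) / (t[j + deg + 1] - t[j + 1]) * B[:, j + 1]
        B = Bn
    return B

knots = np.linspace(0.0, 1.0, 5)           # lambda_1, ..., lambda_{k-d+1}
d = 3                                      # cubic splines, hence k = 7
zs = np.linspace(0.0, 1.0, 200, endpoint=False)  # right endpoint needs a special case
B = bspline_basis(zs, knots, d)
print(B.shape[1] == 7, np.allclose(B.sum(axis=1), 1.0))   # prints True True
```

The basis values are also non-negative everywhere, in line with the positivity property of basis B-splines stated above.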
Regression spline

It will now be assumed that the covariate space is a bounded interval, i.e., 𝒵 = [z_min, z_max], −∞ < z_min < z_max < ∞. The regression function that exploits the regression splines is
$$
m(z) = \beta_1\, B_1(z) + \cdots + \beta_k\, B_k(z), \qquad z \in \mathcal{Z},
\tag{7.10}
$$
where B₁, …, B_k is the spline basis of a chosen degree d ∈ ℕ₀ composed of basis B-splines built above a set of chosen knots λ = (λ₁, …, λ_{k−d+1})ᵀ, z_min = λ₁ < ⋯ < λ_{k−d+1} = z_max. The corresponding reparameterizing matrix coincides with the model matrix and is
$$
\mathbf{X} = \mathbf{S} = \begin{pmatrix}
B_1(Z_1) & \cdots & B_k(Z_1) \\
\vdots & \ddots & \vdots \\
B_1(Z_n) & \cdots & B_k(Z_n)
\end{pmatrix} =: \mathbf{B}.
\tag{7.11}
$$
(End of Lecture #5, 15/10/2015. Start of Lecture #7, 22/10/2015.)
Notes.
• It follows from (7.8) that
1n ∈ M B .
This is also the reason why we do not explicitely include the intercept term in the regression
function since it is implicitely included in the regression space. Due to clarity of notation, the
regression coefficients
are now indexed from 1 to k. That is, the vector of regression coefficients
is β = β1 , . . . , βk .
• It also follows from (7.9) that for any m ≤ d, a linear model with the regression function based
on either raw or orthonormal polynomials of degree m is a submodel of the linear model with
the regression function given by a regression spline and the model matrix B.
• With d = 0, the regression spline (7.10) is simply a piecewise constant function.
• In practice, not much attention is paid to the choice of the degree d of the regression spline.
Usually d = 2 (quadratic spline) or d = 3 (cubic spline) is used which provides continuous first
or second derivatives, respectively, of the regression function inside the covariate domain Z.
• On the other hand, the placement of the knots (selection of the values λ_1, …, λ_{k−d+1}) is quite important for obtaining a regression function that approximates the response expectations E(Y | Z = z), z ∈ Z, sufficiently well. Unfortunately, only relatively ad-hoc methods of knot selection will be demonstrated during this lecture, as profound methods of knot selection go far beyond the scope of this course.
Advantages of the regression splines compared to raw/orthogonal polynomials

• Each data point influences the LSE of the regression coefficients, and hence the fitted regression function, only locally. Indeed, only the LSEs of those regression coefficients that correspond to the basis splines whose supports cover a specific data point are influenced by that data point.

• Regression splines of even a low degree d (2 or 3) are, with a suitable choice of knots, able to approximate sufficiently well, globally on the whole interval Z, even functions with a highly variable curvature.
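The partition-of-unity property (7.8) and the local support of the basis can be checked numerically. The following is an illustrative sketch (not part of the notes; the function name and knot layout are my own choices) that evaluates a B-spline basis of degree d via the standard Cox-de Boor recursion, with the boundary knots repeated d extra times:

```python
# Sketch only (hypothetical helper): B-spline basis values B_1(z), ..., B_k(z)
# for degree d and knots lambda_1 < ... < lambda_{k-d+1}, k = (#knots - 1) + d.

def bspline_basis(z, d, inner_knots):
    # Clamped knot vector: repeat each boundary knot d extra times.
    t = [inner_knots[0]] * d + list(inner_knots) + [inner_knots[-1]] * d
    k = len(inner_knots) - 1 + d
    # Degree-0 basis: indicators of the half-open knot intervals.
    B = [1.0 if t[j] <= z < t[j + 1] else 0.0 for j in range(len(t) - 1)]
    if z == t[-1]:  # make the basis right-closed at the last knot
        B[max(j for j in range(len(t) - 1) if t[j] < t[j + 1])] = 1.0
    for p in range(1, d + 1):  # Cox-de Boor recursion up to degree d
        B = [(0.0 if t[j + p] == t[j] else
              (z - t[j]) / (t[j + p] - t[j]) * B[j])
             + (0.0 if t[j + p + 1] == t[j + 1] else
                (t[j + p + 1] - z) / (t[j + p + 1] - t[j + 1]) * B[j + 1])
             for j in range(len(B) - 1)]
    return B[:k]

vals = bspline_basis(1.7, 3, [0.0, 1.0, 2.0, 3.0])  # cubic basis, k = 6
```

For a cubic basis (d = 3) with knots (0, 1, 2, 3) this yields k = 6 basis functions whose values at any z ∈ [0, 3] are non-negative and sum to one, in agreement with (7.8).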
Evaluation of the effect of the original covariate

To evaluate the statistical significance of the effect of the original covariate on the response expectation, we have to test the null hypothesis

H_0: β_1 = ⋯ = β_k.

Due to property (7.8), this null hypothesis corresponds to assuming that E(Y | Z) ∈ M(1_n) ⊂ M(B). Consequently, it is possible to use the test on a submodel that was introduced in Section 5.1 to test the above null hypothesis.
Interpretation of the regression coefficients

The individual regression coefficients β_1, …, β_k do not usually have a direct, reasonable interpretation.
7.4 Categorical covariate
In this section, it is assumed that Z_i ∈ Z, i = 1, …, n, are categorical covariates. That is, the covariate sample space Z is finite and its elements are understood only as labels. Without loss of generality we will use, unless stated otherwise, the simple sequence 1, …, G for those labels, i.e., Z = {1, …, G}. Unless explicitly stated otherwise (in Section 7.4.4), even the ordering of the labels 1 < ⋯ < G will not be used for any but notational purposes, and the methodology described below is then suitable for both nominal and ordinal categorical covariates.
The regression function m: Z → R is now a function defined on a finite set, aiming to parameterize just G (conditional) response expectations E(Y | Z = 1), …, E(Y | Z = G). For clarity of notation, we will also use the symbols m_1, …, m_G for those expectations, i.e.,

m(1) = E(Y | Z = 1) =: m_1,
⋮
m(G) = E(Y | Z = G) =: m_G.
Notation and terminology (One-way classified group means).
Since a categorical covariate often indicates pertinence to one of G groups, we will call m_1, …, m_G group means12 or one-way classified group means. The vector

m = (m_1, …, m_G)⊤

will be called a vector of group means,13 or a vector of one-way classified group means.
Note. The perhaps appealing simple regression function of the form

m(z) = β_0 + β_1 z,   z = 1, …, G,

is in most cases fully inappropriate. First, it ad hoc orders the group means into a monotone sequence (increasing if β_1 > 0, decreasing if β_1 < 0). Second, it ad hoc assumes a linear relationship between the group means. Both of these properties also depend on the ordering and even the values of the labels (1, …, G in our case) assigned to the G categories at hand. With a nominal categorical covariate none of this is justifiable; with an ordinal categorical covariate such assumptions should, at least, never be taken for granted and used without proper verification.
7.4.1

For the following considerations we will additionally assume (again without loss of generality) that the data (Y_i, Z_i), i = 1, …, n, are sorted according to the covariate values Z_1, …, Z_n. Furthermore, we will also interchangeably use a double subscript with the response, where the first subscript will
12 skupinové střední hodnoty
13 vektor skupinových středních hodnot
indicate the covariate value, i.e.,

Z = (1, …, 1 | 2, …, 2 | … | G, …, G)⊤,   where the value g is repeated n_g-times, g = 1, …, G,

Y = (Y_1, …, Y_{n_1} | … | Y_{n−n_G+1}, …, Y_n)⊤ = (Y_{1,1}, …, Y_{1,n_1} | … | Y_{G,1}, …, Y_{G,n_G})⊤.
Finally, let

Y_g = (Y_{g,1}, …, Y_{g,n_g})⊤,   g = 1, …, G,

denote the subvector of the response vector that corresponds to the observations with the covariate value equal to g. That is,

Y = (Y_1, …, Y_n)⊤ = (Y_1⊤, …, Y_G⊤)⊤.
A related regression model written using the above introduced notation and the error terms is then

Y_{g,j} = m_g + ε_{g,j},   ε := (ε_{1,1}, …, ε_{G,n_G})⊤,   E(ε) = 0_n,   var(ε) = σ² I_n.   (7.12)
Notes.

• If the covariates Z_1, …, Z_n are random, then n_1, …, n_G are random as well.

• If the covariates Z_1, …, Z_n are fixed and the errors ε_{1,1}, …, ε_{G,n_G} in (7.12) are assumed to be independent (possibly also identically distributed), then (7.12) is a "regression" parameterization of a classical G-sample problem, where n_1, …, n_G are the sample sizes of the samples. The observations of sample g, g = 1, …, G, are given by the vector Y_g, the expected value in sample g is given by m_g, and the variance in sample g is given by σ². Note that if model (7.12) is assumed then it is assumed that the G samples are homoscedastic, i.e., they all have the same variance.

• In the following, it is always assumed that n_1 > 0, …, n_G > 0.
7.4.2 Linear model parameterization of one-way classified group means
As usual, let µ be the (conditional) response expectation, i.e.,

E(Y | Z) = µ := (µ_{1,1}, …, µ_{1,n_1} | … | µ_{G,1}, …, µ_{G,n_G})⊤ = (m_1 1_{n_1}⊤ | … | m_G 1_{n_G}⊤)⊤ = (m_1, …, m_1 | … | m_G, …, m_G)⊤,   (7.13)

where the value m_g is repeated n_g-times, g = 1, …, G.
Notation and terminology (Regression space of a categorical covariate).
The vector space

{ (m_1 1_{n_1}⊤, …, m_G 1_{n_G}⊤)⊤ : m_1, …, m_G ∈ R } ⊆ R^n

will be called the regression space of a categorical covariate (factor) with level frequencies n_1, …, n_G and will be denoted MF(n_1, …, n_G).
Note. Obviously, with n_1 > 0, …, n_G > 0, the vector dimension of MF(n_1, …, n_G) is equal to G and a possible (orthogonal) vector basis is formed by the columns of the n × G matrix

Q = ( 1_{n_1} ⊗ (1, 0, …, 0)
               ⋮
      1_{n_G} ⊗ (0, 0, …, 1) ),   (7.14)

i.e., the first n_1 rows of Q are equal to (1, 0, …, 0), …, the last n_G rows are equal to (0, 0, …, 1).
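The basis Q from (7.14) is just the matrix of group indicators. A minimal pure-Python sketch (the helper name is hypothetical) that builds it from sorted covariate values:

```python
# Sketch only (hypothetical helper): the indicator basis Q of (7.14)
# built from covariate values Z_1, ..., Z_n with levels 1, ..., G.

def indicator_matrix(Z, G):
    """n x G matrix whose (i, g) entry is I(Z_i == g)."""
    return [[1.0 if z == g else 0.0 for g in range(1, G + 1)] for z in Z]

Z = [1, 1, 2, 3, 3, 3]          # n_1 = 2, n_2 = 1, n_3 = 3
Q = indicator_matrix(Z, 3)
# Columns of Q are mutually orthogonal and each row sums to one,
# so 1_n lies in M(Q) and the columns span MF(n_1, n_2, n_3).
```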
When using the linear model, we are trying to express the response expectation µ, i.e., a vector from MF(n_1, …, n_G), as a linear combination of the columns of a suitable n × k matrix X, i.e., as

µ = Xβ,   β ∈ R^k.

It is obvious that any model matrix that parameterizes the regression space MF(n_1, …, n_G) must have at least G columns, i.e., k ≥ G, and must be of the type

X = ( 1_{n_1} ⊗ x_1⊤
             ⋮
      1_{n_G} ⊗ x_G⊤ ),   (7.15)

i.e., the first n_1 rows of X are all equal to x_1⊤, …, the last n_G rows are all equal to x_G⊤, where x_1, …, x_G ∈ R^k are suitable vectors.
The problem of parameterizing a categorical covariate with G levels thus simplifies into selecting a G × k matrix X̃ such that

X̃ = ( x_1⊤
        ⋮
       x_G⊤ ).
Clearly,

rank(X) = rank(X̃).

Hence, to be able to parameterize the regression space MF(n_1, …, n_G), which has vector dimension G, the matrix X̃ must satisfy

rank(X̃) = G.
The group means then depend on the vector β = (β_0, …, β_{k−1})⊤ of regression coefficients as

m_g = x_g⊤ β,   g = 1, …, G,   i.e.,   m = X̃β.
A possible (full-rank) linear model parameterization of the regression space of a categorical covariate uses the matrix Q from (7.14) as the model matrix X. In that case, X̃ = I_G and we have

µ = Qβ,   m = β.   (7.16)

Even though parameterization (7.16) seems appealing, since the regression coefficients are directly equal to the group means, it is only rarely considered in practice, for reasons that will become clear later on. Still, it is useful for some theoretical derivations.
7.4.3 ANOVA parameterization of one-way classified group means

In practice, and especially in the area of designed experiments, the group means are parameterized as

m_g = α_0 + α_g,   g = 1, …, G,   m = (1_G, I_G) α = α_0 1_G + α^Z,   (7.17)

where α = (α_0, α_1, …, α_G)⊤ is the vector of regression coefficients and α^Z = (α_1, …, α_G)⊤ is its non-intercept subvector. That is, the matrix X̃ is

X̃ = (1_G, I_G).
The model matrix is then

X = ( 1_{n_1} ⊗ (1, 1, 0, …, 0)
                ⋮
      1_{n_G} ⊗ (1, 0, 0, …, 1) ),   (7.18)

i.e., the intercept column 1_n followed by the G indicator columns of the matrix Q from (7.14). It has G + 1 columns but its rank is G (as required). That is, the linear model Y | Z ∼ (Xα, σ² I_n) is of less than full rank. In other words, for a given µ ∈ M(X) = MF(n_1, …, n_G), a vector α ∈ R^{G+1} such that µ = Xα is not unique. Consequently, a solution to the related normal equations is not unique either. Nevertheless, a unique solution can be obtained if suitable identifying constraints14 are imposed on the vector of regression coefficients α.
Terminology (Effects of a categorical covariate).
The values α_1, …, α_G (the vector α^Z) are called the effects of a categorical covariate.

Note. The effects of a categorical covariate are not unique. Hence their interpretation depends on the chosen identifying constraints.
Identification in a less-than-full-rank linear model

In the following, a linear model Y | X ∼ (Xβ, σ² I_n) with rank(X_{n×k}) = r will be assumed (in our general notation), where r < k. We shall consider linear constraints on the vector of regression coefficients, i.e., constraints of the type

Aβ = 0_m,

where A is an m × k matrix.

Definition 7.5 (Identifying constraints).
We say that a constraint Aβ = 0_m identifies the vector β in a linear model Y | X ∼ (Xβ, σ² I_n) if and only if for each µ ∈ M(X) there exists only one vector β which satisfies at the same time

µ = Xβ   and   Aβ = 0_m.

14 identifikační omezení
Note. If a matrix A determines an identifying constraint then, due to Theorem 2.5 (least squares and normal equations), it also uniquely determines the solution to the normal equations. That is, there is a unique solution b = β̂ that jointly solves the linear systems

X⊤X b = X⊤Y,   A b = 0_m,

or, written differently, there is a unique solution to the linear system

( X⊤X )       ( X⊤Y )
(  A  ) b  =  ( 0_m  ).
The question is now what the conditions on a matrix A are for it to determine an identifying constraint. Remember (Theorem 2.7): if a matrix L_{m×k} satisfies M(L⊤) ⊂ M(X⊤), then the parameter vector θ = Lβ is estimable, which also means that for all real vectors β_1, β_2 the following holds:

Xβ_1 = Xβ_2   ⟹   Lβ_1 = Lβ_2.

That is, if two different solutions of the normal equations are taken and one of them satisfies the constraint, then so does the other. It was also shown in Section 5.3 that if, in addition, L has linearly independent rows, then the set of linear constraints Lβ = 0 determines a so-called submodel (Lemma 5.4). It follows from the above that we cannot use such a matrix L for identification.
Theorem 7.1 (Scheffé, on identification in a linear model).
The constraint Aβ = 0_m with a real m × k matrix A identifies the vector β in a linear model Y | X ∼ (Xβ, σ² I_n), rank(X_{n×k}) = r < k ≤ n, if and only if

M(A⊤) ∩ M(X⊤) = {0}   and   rank(X) + rank(A) = k.
Proof. We have to show that, for any µ ∈ M(X), the conditions stated in the theorem are equivalent to the existence of a unique solution to the linear system Xβ = µ that satisfies Aβ = 0_m. Write

D = ( X_{n×k} )
    ( A_{m×k} ),

an (n + m) × k matrix.

Existence of the solution
⇔ ∀ µ ∈ M(X) there exists a vector β ∈ R^k such that Xβ = µ & Aβ = 0_m.
⇔ ∀ µ ∈ M(X) there exists a vector β ∈ R^k such that Dβ = (µ⊤, 0_m⊤)⊤.
⇔ { (µ⊤, 0_m⊤)⊤ : µ ∈ M(X) } ⊆ M(D).
⇔ M(D)⊥ ⊆ { (µ⊤, 0_m⊤)⊤ : µ ∈ M(X) }⊥.
⇔ ∀ v_1 ∈ R^n, v_2 ∈ R^m: { (v_1⊤, v_2⊤) D = 0_k⊤ } ⟹ { ∀ µ ∈ M(X): (v_1⊤, v_2⊤)(µ⊤, 0_m⊤)⊤ = 0 }.
⇔ ∀ v_1 ∈ R^n, v_2 ∈ R^m: { (v_1⊤, v_2⊤) D = 0_k⊤ } ⟹ { ∀ β ∈ R^k: v_1⊤ Xβ = 0 }.
⇔ ∀ v_1 ∈ R^n, v_2 ∈ R^m: { v_1⊤ X = −v_2⊤ A } ⟹ { v_1⊤ X = 0_k⊤ }.
⇔ ∀ v_1 ∈ R^n, v_2 ∈ R^m: { X⊤ v_1 = −A⊤ v_2 =: u } ⟹ { X⊤ v_1 = 0_k }.
⇔ ∀ u ∈ R^k: { u ∈ M(X⊤) ∩ M(A⊤) } ⟹ { u = 0_k }.
⇔ M(A⊤) ∩ M(X⊤) = {0}.

Uniqueness of the solution
⇔ ∀ µ ∈ M(X) there exists a unique vector β ∈ R^k such that Xβ = µ & Aβ = 0_m.
⇔ ∀ µ ∈ M(X) there exists a unique vector β ∈ R^k such that Dβ = (µ⊤, 0_m⊤)⊤.
⇔ rank(D) = k.
⇔ A has rows such that dim M(A⊤) = k − r (since rank(X) = r) and all rows of A are linearly independent of the rows of X.
⇔ rank(A) = k − r (since we already have the condition M(X⊤) ∩ M(A⊤) = {0} needed for the existence of the solution).  ∎
Notes.

1. The matrix A_{m×k} used for identification must satisfy rank(A) = k − r. In practice, the number of identifying constraints (the number of rows of the matrix A) is usually the lowest possible, i.e., m = k − r.

2. Theorem 7.1 further states that the matrix A must be such that the vector parameter θ = Aβ is not estimable in the given model.

3. In practice, the vector µ for which we look for a unique β such that µ = Xβ, Aβ = 0_m, is equal to the vector of fitted values Ŷ. That is, we look for a unique β̂ ∈ R^k (which, being unique, can be considered the LSE of the regression coefficients) such that

X β̂ = Ŷ   &   A β̂ = 0_m.

By Theorem 2.5 (least squares and normal equations), Ŷ = X β̂ if and only if β̂ solves the normal equations X⊤X β̂ = X⊤Y.
Suppose now that rank(A) = m = k − r, i.e., the regression parameters are identified by a set of m = k − r linearly independent linear constraints. To get β̂, we have to solve the linear system

X⊤X β̂ = X⊤Y,   A β̂ = 0_m,

which can be solved by solving

X⊤X β̂ = X⊤Y,   A⊤A β̂ = 0_m,

or by using the linear system

(X⊤X + A⊤A) β̂ = X⊤Y,

which written differently is

D⊤D β̂ = X⊤Y,   with   D = ( X )
                            ( A ).

The matrix D⊤D is now an invertible k × k matrix and hence the unique solution is

β̂ = (D⊤D)^{−1} X⊤Y.
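The recipe β̂ = (D⊤D)^{−1} X⊤Y can be tried on a tiny example. The following sketch (all names hypothetical; a toy G = 2 ANOVA model with the sum constraint A = (0, 1, 1)) builds D⊤D and X⊤Y and solves the system with a small Gauss-Jordan routine:

```python
# Sketch only: LSE in a less-than-full-rank one-way ANOVA model with G = 2,
# identified by the sum constraint, via beta_hat = (D'D)^{-1} X'Y, D = (X; A).

def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(M, b)]   # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * x for a, x in zip(M[r], M[c])]
    return [M[r][n] / M[r][r] for r in range(n)]

# Two groups: Y_1 = (1, 3), Y_2 = (4, 6, 8); ANOVA rows (1,1,0) and (1,0,1).
X = [[1, 1, 0]] * 2 + [[1, 0, 1]] * 3
Y = [1, 3, 4, 6, 8]
A = [[0, 1, 1]]                      # sum constraint: alpha_1 + alpha_2 = 0
D = X + A                            # stacked (n + m) x k matrix
DtD = [[sum(row[r] * row[c] for row in D) for c in range(3)]
       for r in range(3)]
XtY = [sum(X[i][r] * Y[i] for i in range(len(X))) for r in range(3)]
beta = solve(DtD, XtY)               # (alpha_0, alpha_1, alpha_2)
```

With group means Ȳ_1 = 2 and Ȳ_2 = 6, the identified solution is α̂_0 = 4 (the mean of the group means), α̂_1 = −2, α̂_2 = 2, matching the sum-constraint interpretation α_g = m_g − m̄.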
Identification in a one-way ANOVA model

As an example of the use of Scheffé's Theorem 7.1, consider the model matrix X given by (7.18) that provides an ANOVA parameterization of a single categorical covariate, i.e., a linear model for the one-way classified group means parameterized as

m_g = α_0 + α_g,   g = 1, …, G.

We have rank(X_{n×(G+1)}) = G with the vector α = (α_0, α_1, …, α_G)⊤ of regression coefficients. The smallest matrix A_{m×(G+1)} that identifies α with respect to the regression space M(X) = MF(n_1, …, n_G) is hence a non-zero matrix with m = 1 row, i.e.,

A = a⊤ = (a_0, a_1, …, a_G) ≠ 0_{G+1}⊤

such that a ∉ M(X⊤), i.e., such that θ = a⊤α is not estimable in the linear model Y | X ∼ (Xα, σ² I_n).
It is seen from the structure of the matrix X given by (7.18) that

a ∈ M(X⊤)   ⟺   a = ( ∑_{g=1}^{G} c_g, c_1, …, c_G )⊤

for some c = (c_1, …, c_G)⊤ ∈ R^G, c ≠ 0_G. That is, for identification of α in the linear model Y | X ∼ (Xα, σ² I_n) with the model matrix (7.18), we can use any vector a = (a_0, a_1, …, a_G)⊤ ≠ 0_{G+1} that satisfies

a_0 ≠ ∑_{g=1}^{G} a_g.
Commonly used identifying constraints include:

Sum constraint:

A_1 = a_1⊤ = (0, 1, …, 1)   ⟺   ∑_{g=1}^{G} α_g = 0,

which implies the following interpretation of the model parameters:

α_0 = (1/G) ∑_{g=1}^{G} m_g =: m̄,
α_1 = m_1 − m̄,
⋮
α_G = m_G − m̄.
Weighted sum constraint:

A_2 = a_2⊤ = (0, n_1, …, n_G)   ⟺   ∑_{g=1}^{G} n_g α_g = 0,

which implies

α_0 = (1/n) ∑_{g=1}^{G} n_g m_g =: m̄_W,
α_1 = m_1 − m̄_W,
⋮
α_G = m_G − m̄_W.
Reference group constraint (l ∈ {1, …, G}):

A_3 = a_3⊤ = (0, 0, …, 1, …, 0)   (1 at the position of α_l)   ⟺   α_l = 0,

which corresponds to omitting one of the non-intercept columns of the model matrix X given by (7.18) and using the resulting full-rank parameterization. It implies

α_0 = m_l,
α_1 = m_1 − m_l,
⋮
α_G = m_G − m_l.
No intercept:

A_4 = a_4⊤ = (1, 0, …, 0)   ⟺   α_0 = 0,

which corresponds to omitting the intercept column of the model matrix X given by (7.18) and using the full-rank parameterization with the matrix Q given by (7.14). That is,

α_0 = 0,
α_1 = m_1,
⋮
α_G = m_G.
Note. The identifying constraints given by the vectors a_1, a_2, a_3 (sum, weighted sum and reference group constraint) each correspond to one of the commonly used full-rank parameterizations that will be introduced in Section 7.4.4, where we shall also discuss the interpretation of the effects α^Z = (α_1, …, α_G)⊤ under the different identifying constraints.
End of Lecture #7 (22/10/2015)

7.4.4 Full-rank parameterization of one-way classified group means

Start of Lecture #9 (29/10/2015)

In the following, we limit ourselves to full-rank parameterizations that involve an intercept column. That is, the model matrix will be the n × G matrix

X = ( 1_{n_1} ⊗ (1, c_1⊤)
              ⋮
      1_{n_G} ⊗ (1, c_G⊤) ),

where c_1, …, c_G ∈ R^{G−1} are suitable vectors. In the following, let C be the G × (G − 1) matrix with those vectors as rows, i.e.,

C = ( c_1⊤
       ⋮
      c_G⊤ ).
The matrix X̃ is thus the G × G matrix

X̃ = (1_G, C).

If β = (β_0, …, β_{G−1})⊤ ∈ R^G denotes, as usual, the vector of regression coefficients, the group means m are parameterized as

m_g = β_0 + c_g⊤ β^Z,   g = 1, …, G,   m = X̃β = (1_G, C) β = β_0 1_G + C β^Z,   (7.19)

where β^Z = (β_1, …, β_{G−1})⊤ is the non-intercept subvector of the regression coefficients. As we know,

rank(X) = rank(X̃) = rank((1_G, C)).

Hence, to get a model matrix X of full rank (rank(X) = G), the matrix C must satisfy rank(C) = G − 1 and 1_G ∉ M(C). That is, the columns of C must be

(i) G − 1 linearly independent vectors from R^G;
(ii) all linearly independent of the vector of ones 1_G.
Definition 7.6 (Full-rank parameterization of a categorical covariate).
A full-rank parameterization of a categorical covariate with G levels (G = card(Z)) is a choice of the G × (G − 1) matrix C that satisfies

rank(C) = G − 1,   1_G ∉ M(C).
Terminology ((Pseudo)contrast matrix).
The columns of the matrix C are often chosen to form a set of G − 1 linearly independent contrasts from R^G. In this case, we will call the matrix C a contrast matrix.15 In other cases, the matrix C will be called a pseudocontrast matrix.16
Note. The (pseudo)contrast matrix C also determines a parameterization of a categorical covariate according to Definition 7.1. The corresponding function s: Z → R^{G−1} is

s(z) = c_z,   z = 1, …, G,

and the reparameterizing matrix S is the n × (G − 1) matrix

S = ( 1_{n_1} ⊗ c_1⊤
              ⋮
      1_{n_G} ⊗ c_G⊤ ).

15 kontrastová matice
16 pseudokontrastová matice
Evaluation of the effect of the categorical covariate

With a given full-rank parameterization of a categorical covariate, the evaluation of the statistical significance of its effect on the response expectation corresponds to testing the null hypothesis

H_0: β_1 = 0 & ⋯ & β_{G−1} = 0,   or written concisely   H_0: β^Z = 0_{G−1}.   (7.20)

This null hypothesis indeed also corresponds to a submodel where only the intercept is included in the model matrix. Finally, it can be mentioned that the null hypothesis (7.20) is equivalent to the hypothesis of equality of the group means,

H_0: m_1 = ⋯ = m_G.   (7.21)

If normality of the response is assumed, an F-test on a submodel (Theorem 5.1) or, equivalently, a test on a value of a subvector of the regression coefficients (F-test if G ≥ 2, t-test if G = 2, see Theorem 3.1) can be used.
Notes. The following can be shown with only a little algebra:

• If G = 2, β = (β_0, β_1)⊤. The (usual) t-statistic to test the hypothesis H_0: β_1 = 0 using point (viii) of Theorem 3.1, i.e., the statistic based on the LSE of β, is the same as the statistic of a standard two-sample t-test.

• If G ≥ 2, the (usual) F-statistic to test the null hypothesis (7.20) using point (x) of Theorem 3.1, which is the same as the (usual) F-statistic on a submodel (the submodel being the intercept-only model), is the same as the F-statistic used classically in one-way analysis of variance (ANOVA) to test the null hypothesis (7.21).
Reference group pseudocontrasts

C = ( 0_{G−1}⊤ )
    ( I_{G−1}  ),   (7.22)

i.e., the first row of C is the zero row and the remaining rows form the identity matrix I_{G−1}.
The regression coefficients have the following interpretation:

m_1 = β_0,             β_0 = m_1,
m_2 = β_0 + β_1,       β_1 = m_2 − m_1,
⋮                      ⋮
m_G = β_0 + β_{G−1},   β_{G−1} = m_G − m_1.   (7.23)

That is, the intercept β_0 is equal to the mean of the first (reference) group, and the elements of β^Z = (β_1, …, β_{G−1})⊤ (the effects of Z) provide the differences between the means of the remaining groups and the reference one. The regression function can be written as

m(z) = β_0 + β_1 I(z = 2) + ⋯ + β_{G−1} I(z = G),   z = 1, …, G.

It is seen from (7.23) that the full-rank parameterization using the reference group pseudocontrasts is equivalent to the less-than-full-rank (ANOVA) parameterization m_g = α_0 + α_g, g = 1, …, G, where α = (α_0, α_1, …, α_G)⊤ is identified by the reference group constraint

α_1 = 0.
Notes.

• With the pseudocontrast matrix C given by (7.22), the group labeled Z = 1 is chosen as the reference, for which the intercept β_0 provides the group mean. In practice, any other group can be taken as the reference by moving the zero row of the C matrix.

• In the R software, the reference group pseudocontrasts with the C matrix of the form (7.22) are used by default to parameterize categorical covariates (factors). Explicitly, this choice is indicated by the contr.treatment function. Alternatively, the contr.SAS function provides a pseudocontrast matrix in which the last, Gth group serves as the reference, i.e., the C matrix has zeros in its last row.
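A minimal pure-Python sketch of this matrix and of the map (7.23) from coefficients to group means (function names are hypothetical; in R one would simply call contr.treatment):

```python
# Sketch only: reference group pseudocontrasts (7.22) for G levels,
# the pure-Python analogue of R's contr.treatment(G).
# Rows are c_1 = 0 and c_g = e_{g-1}; group means via m_g = beta_0 + c_g' beta_Z.

def contr_treatment(G):
    """G x (G-1) pseudocontrast matrix: zero first row, then I_{G-1}."""
    return [[1.0 if g - 1 == j else 0.0 for j in range(G - 1)]
            for g in range(G)]

def group_means(beta0, beta_Z, C):
    """Recover m_1, ..., m_G from a full-rank parameterization (7.19)."""
    return [beta0 + sum(c * b for c, b in zip(row, beta_Z)) for row in C]

C = contr_treatment(3)                  # rows: (0,0), (1,0), (0,1)
m = group_means(10.0, [2.0, 5.0], C)    # m_1 = beta_0, m_g = beta_0 + beta_{g-1}
```

Here β_0 = 10 is the mean of the reference group and the effects 2 and 5 are the differences of groups 2 and 3 from the reference, consistent with (7.23).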
Sum contrasts

C = (  I_{G−1}  )
    ( −1_{G−1}⊤ ),   (7.24)

i.e., the first G − 1 rows form the identity matrix I_{G−1} and the last row consists of −1's. Let

m̄ = (1/G) ∑_{g=1}^{G} m_g.
The regression coefficients have the following interpretation:

m_1 = β_0 + β_1,                     β_0 = m̄,
⋮                                    β_1 = m_1 − m̄,
m_{G−1} = β_0 + β_{G−1},             ⋮
m_G = β_0 − ∑_{g=1}^{G−1} β_g,       β_{G−1} = m_{G−1} − m̄.   (7.25)

The regression function can be written as

m(z) = β_0 + β_1 I(z = 1) + ⋯ + β_{G−1} I(z = G − 1) − ∑_{g=1}^{G−1} β_g I(z = G),   z = 1, …, G.
If we consider the less-than-full-rank ANOVA parameterization of the group means, m_g = α_0 + α_g, g = 1, …, G, it is seen from (7.25) that the full-rank parameterization using the contrast matrix (7.24) links the regression coefficients of the two models as

α_0 = β_0 = m̄,
α_1 = β_1 = m_1 − m̄,
⋮
α_{G−1} = β_{G−1} = m_{G−1} − m̄,
α_G = −∑_{g=1}^{G−1} β_g = m_G − m̄.   (7.26)
At the same time, the vector α satisfies

∑_{g=1}^{G} α_g = 0.   (7.27)

That is, the full-rank parameterization using the sum contrasts (7.24) is equivalent to the less-than-full-rank ANOVA parameterization where the regression coefficients are identified by the sum constraint (7.27). The intercept α_0 = β_0 equals the mean of the group means, and the elements of β^Z = (β_1, …, β_{G−1})⊤ = (α_1, …, α_{G−1})⊤ are equal to the differences between the corresponding group mean and the mean of the group means. The same quantity for the last, Gth group, α_G, is calculated from β^Z as α_G = −∑_{g=1}^{G−1} β_g.
Note. In the R software, the sum contrasts with the C matrix of the form (7.24) can be used by means of the function contr.sum.
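A minimal pure-Python analogue of contr.sum (hypothetical names), checking that the coefficients β_0 = m̄, β_g = m_g − m̄ reproduce the group means via (7.19):

```python
# Sketch only: sum contrasts (7.24) for G levels, the pure-Python analogue
# of R's contr.sum(G), and a check of the implied effect interpretation.

def contr_sum(G):
    """G x (G-1) contrast matrix: I_{G-1} on top, a row of -1's below."""
    C = [[1.0 if g == j else 0.0 for j in range(G - 1)] for g in range(G - 1)]
    C.append([-1.0] * (G - 1))
    return C

m = [2.0, 5.0, 11.0]                        # group means, G = 3
G = len(m)
mbar = sum(m) / G                           # mean of the group means
beta0 = mbar
betaZ = [mg - mbar for mg in m[:G - 1]]     # effects of groups 1..G-1
C = contr_sum(G)
# Recover the group means via m_g = beta0 + c_g' betaZ (7.19):
rec = [beta0 + sum(c * b for c, b in zip(row, betaZ)) for row in C]
```

The last group's effect is recovered as −∑ β_g without being stored explicitly, which is exactly the sum constraint (7.27) at work.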
Weighted sum contrasts

C = (    1      …      0
         ⋮      ⋱      ⋮
         0      …      1
      −n_1/n_G … −n_{G−1}/n_G ),   (7.28)

i.e., the first G − 1 rows form the identity matrix I_{G−1} and the last row is (−n_1/n_G, …, −n_{G−1}/n_G). Let

m̄_W = (1/n) ∑_{g=1}^{G} n_g m_g.
The regression coefficients have the following interpretation:

m_1 = β_0 + β_1,                              β_0 = m̄_W,
⋮                                             β_1 = m_1 − m̄_W,
m_{G−1} = β_0 + β_{G−1},                      ⋮
m_G = β_0 − ∑_{g=1}^{G−1} (n_g/n_G) β_g,      β_{G−1} = m_{G−1} − m̄_W.   (7.29)

The regression function can be written as

m(z) = β_0 + β_1 I(z = 1) + ⋯ + β_{G−1} I(z = G − 1) − ∑_{g=1}^{G−1} (n_g/n_G) β_g I(z = G),   z = 1, …, G.
If we consider the less-than-full-rank ANOVA parameterization of the group means, m_g = α_0 + α_g, g = 1, …, G, it is seen from (7.29) that the full-rank parameterization using the contrast matrix (7.28) links the regression coefficients of the two models as

α_0 = β_0 = m̄_W,
α_1 = β_1 = m_1 − m̄_W,
⋮
α_{G−1} = β_{G−1} = m_{G−1} − m̄_W,
α_G = −∑_{g=1}^{G−1} (n_g/n_G) β_g = m_G − m̄_W.
At the same time, the vector α satisfies

∑_{g=1}^{G} n_g α_g = 0.   (7.30)

That is, the full-rank parameterization using the weighted sum pseudocontrasts (7.28) is equivalent to the less-than-full-rank ANOVA parameterization where the regression coefficients are identified by the weighted sum constraint (7.30). The intercept α_0 = β_0 equals the weighted mean of the group means, and the elements of β^Z = (β_1, …, β_{G−1})⊤ = (α_1, …, α_{G−1})⊤ are equal to the differences between the corresponding group mean and the weighted mean of the group means. The same quantity for the last, Gth group, α_G, is calculated from β^Z as α_G = −∑_{g=1}^{G−1} (n_g/n_G) β_g.
Helmert contrasts

C = ( −1  −1  …   −1
       1  −1  …   −1
       0   2  …   −1
       ⋮   ⋮  ⋱    ⋮
       0   0  …  G−1 ),   (7.31)

i.e., the jth column of C has entries −1 in rows 1, …, j, the value j in row j + 1, and zeros below.
The group means are obtained from the regression coefficients as

m_1 = β_0 − ∑_{g=1}^{G−1} β_g,
m_2 = β_0 + β_1 − ∑_{g=2}^{G−1} β_g,
m_3 = β_0 + 2β_2 − ∑_{g=3}^{G−1} β_g,
⋮
m_{G−1} = β_0 + (G − 2) β_{G−2} − β_{G−1},
m_G = β_0 + (G − 1) β_{G−1}.
Inversely, the regression coefficients are linked to the group means as

β_0 = (1/G) ∑_{g=1}^{G} m_g =: m̄,
β_1 = (1/2) (m_2 − m_1),
β_2 = (1/3) { m_3 − (1/2)(m_1 + m_2) },
β_3 = (1/4) { m_4 − (1/3)(m_1 + m_2 + m_3) },
⋮
β_{G−1} = (1/G) { m_G − (1/(G − 1)) ∑_{g=1}^{G−1} m_g },

which provides their (slightly awkward) interpretation: β_g, g = 1, …, G − 1, is 1/(g + 1) times the difference between the mean of group g + 1 and the mean of the means of the previous groups 1, …, g.

Note. In the R software, the Helmert contrasts with the C matrix of the form (7.31) can be used by means of the function contr.helmert.
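A minimal pure-Python analogue of contr.helmert (hypothetical names), checking the interpretation of β_g stated above on a small example:

```python
# Sketch only: Helmert contrast matrix (7.31), the pure-Python analogue of
# R's contr.helmert(G), plus a check that the coefficients
# beta_g = {m_{g+1} - mean(m_1, ..., m_g)} / (g + 1) reproduce the group means.

def contr_helmert(G):
    """G x (G-1) matrix: column j has -1 in rows 1..j, j in row j+1, 0 below."""
    return [[-1.0 if g <= j else (j + 1.0 if g == j + 1 else 0.0)
             for j in range(G - 1)] for g in range(G)]

m = [2.0, 5.0, 11.0]                          # group means, G = 3
beta0 = sum(m) / 3                            # mean of the group means
beta = [(m[1] - m[0]) / 2,                    # beta_1 = (m_2 - m_1) / 2
        (m[2] - (m[0] + m[1]) / 2) / 3]       # beta_2 = {m_3 - (m_1+m_2)/2} / 3
C = contr_helmert(3)
rec = [beta0 + sum(c * b for c, b in zip(row, beta)) for row in C]
```

The recovered vector rec equals the original group means, confirming the stated inverse relations for G = 3.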
Orthonormal polynomial contrasts

C = ( P¹(ω_1)  P²(ω_1)  …  P^{G−1}(ω_1)
      P¹(ω_2)  P²(ω_2)  …  P^{G−1}(ω_2)
         ⋮        ⋮     ⋱       ⋮
      P¹(ω_G)  P²(ω_G)  …  P^{G−1}(ω_G) ),   (7.32)

where ω_1 < ⋯ < ω_G is an equidistant (arithmetic) sequence of the group labels and

P^j(z) = a_{j,0} + a_{j,1} z + ⋯ + a_{j,j} z^j,   j = 1, …, G − 1,

are orthonormal polynomials of degrees 1, …, G − 1 built above the sequence of the group labels.
Note. It can be shown that the polynomial coefficients a_{j,l}, j = 1, …, G − 1, l = 0, …, j, and hence the C matrix (7.32), are for a given G invariant to the choice of the group labels as long as they form an equidistant (arithmetic) sequence. For example, for G = 2, 3, 4 the C matrix is
G = 2:   C = ( −1/√2
                1/√2 );

G = 3:   C = ( −1/√2    1/√6
                 0     −2/√6
                1/√2    1/√6 );

G = 4:   C = ( −3/(2√5)    1/2   −1/(2√5)
               −1/(2√5)   −1/2    3/(2√5)
                1/(2√5)   −1/2   −3/(2√5)
                3/(2√5)    1/2    1/(2√5) ).
The group means are then obtained as

m_1 = m(ω_1) = β_0 + β_1 P¹(ω_1) + ⋯ + β_{G−1} P^{G−1}(ω_1),
m_2 = m(ω_2) = β_0 + β_1 P¹(ω_2) + ⋯ + β_{G−1} P^{G−1}(ω_2),
⋮
m_G = m(ω_G) = β_0 + β_1 P¹(ω_G) + ⋯ + β_{G−1} P^{G−1}(ω_G),

where

m(z) = β_0 + β_1 P¹(z) + ⋯ + β_{G−1} P^{G−1}(z),   z ∈ {ω_1, …, ω_G},

is the regression function. The regression coefficients β now do not have any direct interpretation. That is why, even though the parameterization with the contrast matrix (7.32) can be used with a nominal categorical covariate, this is only rarely done. Nevertheless, in the case of an ordinal categorical covariate, where the ordered group labels ω_1 < ⋯ < ω_G also have a practical interpretation, parameterization (7.32) can be used to reveal possible polynomial trends in the evolution of the group means m_1, …, m_G and to evaluate whether it may make sense to consider the covariate as numeric rather than categorical. Indeed, for d < G, the null hypothesis

H_0: β_d = 0 & … & β_{G−1} = 0

corresponds to the hypothesis that the covariate at hand can be considered numeric (with values ω_1, …, ω_G forming an equidistant sequence) and that the evolution of the group means can be described by a polynomial of degree d − 1.

Note. In the R software, the orthonormal polynomial contrasts with the C matrix of the form (7.32) can be used by means of the function contr.poly. It is also the default choice if the covariate is coded as categorical ordinal (ordered).
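The matrix (7.32) can be reproduced by Gram-Schmidt orthonormalization of the raw polynomial columns over the equidistant labels. A sketch (hypothetical names; R's contr.poly is based on the same idea):

```python
# Sketch only: orthonormal polynomial contrasts (7.32) via Gram-Schmidt on
# the powers of the equidistant labels omega_1, ..., omega_G, the pure-Python
# analogue of R's contr.poly(G). Columns are normalized to unit Euclidean norm.
import math

def contr_poly(G):
    omega = list(range(1, G + 1))           # equidistant group labels
    cols = [[1.0] * G]                      # start from the constant column
    for j in range(1, G):
        v = [w ** j for w in omega]         # raw polynomial of degree j
        for q in cols:                      # orthogonalize against lower degrees
            c = sum(a * b for a, b in zip(v, q)) / sum(b * b for b in q)
            v = [a - c * b for a, b in zip(v, q)]
        cols.append(v)
    out = []
    for v in cols[1:]:                      # drop the constant, normalize
        nrm = math.sqrt(sum(a * a for a in v))
        out.append([a / nrm for a in v])
    return [[out[j][g] for j in range(G - 1)] for g in range(G)]  # G x (G-1)

C = contr_poly(3)
```

For G = 3 the rows come out as (−1/√2, 1/√6), (0, −2/√6), (1/√2, 1/√6), matching the displayed matrix; by the invariance noted above, any equidistant labels give the same result.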
End of Lecture #9 (29/10/2015)

Chapter 8

8.1 Additivity and partial effect of a covariate
Start of Lecture #11 (05/11/2015)

Suppose now that the covariate vectors are

(Z_1, V_1), …, (Z_n, V_n) ∈ Z × V,   Z ⊆ R,   V ⊆ R^{p−1},   p ≥ 2.

As usual, let (Z, V) ∈ Z × V denote a generic covariate, and let

Z = (Z_1, …, Z_n)⊤,   V = ( V_1⊤
                             ⋮
                            V_n⊤ )

be the matrices covering the observed values of the two sets of covariates.
8.1.1

Definition 8.1 (Additivity of the covariate effect).
We say that a covariate Z ∈ Z acts additively in the regression model with covariates (Z, V) ∈ Z × V if the regression function is of the form

m(z, v) = m_Z(z) + m_V(v),   (z, v) ∈ Z × V,   (8.1)

where m_Z: Z → R and m_V: V → R are some functions.
8.1.2 Partial effect of a covariate

If the covariate Z ∈ Z acts additively in a regression model, we have for any fixed v ∈ V:

E(Y | Z = z + 1, V = v) − E(Y | Z = z, V = v) = m_Z(z + 1) − m_Z(z),   z ∈ Z.   (8.2)

That is, the influence (effect) of the covariate Z on the response expectation is the same for any value of V ∈ V.
Terminology (Partial effect of a covariate).
If a covariate Z ∈ Z acts additively in the regression model with covariates (Z, V) ∈ Z × V, quantity (8.2) expresses the so-called partial effect1 of the covariate Z on the response given the value of V.
8.1.3 Additivity, partial covariate effect and conditional independence

In the context of a linear model, both m_Z and m_V are chosen to be linear in unknown (regression) parameters and the corresponding model matrix is decomposed as

X = (X^Z, X^V),

where X^Z corresponds to the regression function m_Z and depends only on the covariate values Z, and X^V corresponds to the regression function m_V and depends only on the covariate values V. That is, the response expectation is assumed to be

E(Y | Z, V) = X^Z β + X^V γ

for some real vectors of regression coefficients β and γ.
Matrix XZ and the regression function mZ then correspond to parameterization of a single covariate for which any choice out of those introduced in Sections 7.3 and 7.4 (or others notdiscussed
here) can be used. Further, matrix XZ is/can be usually chosen such that 1n ∈ M XZ in which
case a hypothesis of no partial effect of the Z covariate corresponds to testing a submodel with
the model matrix
X 0 = 1n , X V
againts a model with the model matrix X = XZ , XV . Note that if it can be assumed that the covariates at hand influence only the (conditional) response expectation and not other characteristics
of the conditional distribution of the response given the covariates, then testing a submodel with
the model matrix X0 against a model with the model matrix X corresponds to testing a conditional
independence of the response and the Z covariate given the remaining covariates V .
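The submodel test just described can be sketched numerically. Below is a minimal illustration with simulated data (the sample size, coefficients and noise level are invented for the demo, not taken from the text): the F-statistic compares the residual sum of squares of the submodel X_0 = (1_n, X_V) with that of the full model X = (X_Z, X_V).

```python
import numpy as np

# Simulated data; all numbers are invented for the illustration.
rng = np.random.default_rng(407)
n = 120
z = rng.uniform(0.0, 2.0, n)              # numeric covariate Z
v = rng.normal(size=n)                    # remaining covariate V
y = 1.0 + 0.8 * z + 0.5 * v + rng.normal(scale=0.3, size=n)

ones = np.ones(n)
X_full = np.column_stack([ones, z, v])    # X = (1_n, S_Z, X_V) with s_Z(z) = z
X_sub = np.column_stack([ones, v])        # X_0 = (1_n, X_V)

def rss(X, y):
    """Residual sum of squares of the least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

rss_full, rss_sub = rss(X_full, y), rss(X_sub, y)
df_num = X_full.shape[1] - X_sub.shape[1]         # rank(X) - rank(X_0)
df_den = n - X_full.shape[1]
F = ((rss_sub - rss_full) / df_num) / (rss_full / df_den)
# A large F (compared to an F_{df_num, df_den} quantile) speaks against
# the hypothesis of no partial effect of Z.
```

Since the data were generated with a nonzero effect of Z, the resulting F-statistic is large here.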
^1 parciální efekt
8.2  Additivity of the effect of a numeric covariate
Assume that Z is a numeric covariate with Z ⊆ R. Limiting ourselves to the parameterizations
discussed in Section 7.3, the matrix X_Z can be

(i) X_Z = (1_n, S_Z), where S_Z is a reparameterizing matrix of a parameterization

        s_Z = (s_1, . . . , s_{k−1}): Z → R^{k−1}

    having the form of either

    (a) a simple transformation (Section 7.3.1);
    (b) raw polynomials (Section 7.3.2);
    (c) orthonormal polynomials (Section 7.3.3).

    If we denote the regression coefficients related to the model matrix X_Z as
    β = (β_0, β_1, . . . , β_{k−1}), the regression function is

        m(z, v) = β_0 + β_1 s_1(z) + · · · + β_{k−1} s_{k−1}(z) + m_V(v),    (z, v) ∈ Z × V,    (8.3)

    which can also be interpreted as

        m(z, v) = γ_0(v) + β_1 s_1(z) + · · · + β_{k−1} s_{k−1}(z),    (z, v) ∈ Z × V,

    where γ_0(v) = β_0 + m_V(v).
    In other words, if a certain covariate acts additively and its effect on the response is
    described by the parameterization s_Z, then the remaining covariates V only modify the
    intercept term in the relationship between the response and the covariate Z.

(ii) X_Z = B_Z, where B_Z is the model matrix (7.11) of the regression splines B_1, . . . , B_k.
    With the regression coefficients related to the model matrix B_Z denoted as
    β = (β_1, . . . , β_k), the regression function becomes

        m(z, v) = β_1 B_1(z) + · · · + β_k B_k(z) + m_V(v),    (z, v) ∈ Z × V,    (8.4)

    where the term m_V(v) can again be interpreted as an intercept γ_0(v) = m_V(v) in the
    relationship between the response and the covariate Z whose value depends on the remaining
    covariates V.
8.2.1  Partial effect of a numeric covariate

With the regression function (8.3), the partial effect of the Z covariate on the response is
determined by the set of non-intercept regression coefficients β^Z := (β_1, . . . , β_{k−1}).
The null hypothesis

    H_0: β^Z = 0_{k−1}

then expresses the hypothesis that the covariate Z has, conditionally given a fixed (even though
arbitrary) value of V, no effect on the response expectation. That is, it is a hypothesis of no
partial effect of the covariate Z on the response expectation.

With the spline-based regression function (8.4), the partial effect of the Z covariate is
expressed by (all) spline-related regression coefficients β_1, . . . , β_k. Nevertheless, due to
the B-splines property (7.8), the null hypothesis of no partial effect of the Z covariate is now

    H_0: β_1 = · · · = β_k.
8.3  Additivity of the effect of a categorical covariate
Assume that Z is a categorical covariate with Z = {1, . . . , G}, where the value Z = g,
g = 1, . . . , G, is repeated n_g-times in the data, which are assumed to be sorted according to
the values of this covariate. The group means used in Section 7.4 must now be understood as
conditional group means, given a value of the covariates V, and the regression function (8.1)
parameterizes those conditional group means, i.e., for v ∈ V:

    m(1, v) = E(Y | Z = 1, V = v) =: m_1(v),
       ⋮
    m(G, v) = E(Y | Z = G, V = v) =: m_G(v).    (8.5)

Let

    m(v) = (m_1(v), . . . , m_G(v))

be the vector of those conditional group means.
The matrix X_Z can be any of the model matrices discussed in Section 7.4. If we restrict ourselves
to the full-rank parameterizations introduced in Section 7.4.4, the matrix X_Z is

    X_Z = (1_n, S_Z),    S_Z = ( 1_{n_1} ⊗ c_1^T )
                               (        ⋮        )
                               ( 1_{n_G} ⊗ c_G^T ),

where c_1, . . . , c_G ∈ R^{G−1} are the rows of a chosen (pseudo)contrast matrix

    C = ( c_1^T )
        (   ⋮   ).
        ( c_G^T )
If β = (β_0, β_1, . . . , β_{G−1}) denotes the regression coefficients related to the model matrix
X_Z = (1_n, S_Z) and we further denote β^Z = (β_1, . . . , β_{G−1}), the conditional group means
are, for v ∈ V, given as

    m_1(v) = β_0 + c_1^T β^Z + m_V(v) = γ_0(v) + c_1^T β^Z,
       ⋮
    m_G(v) = β_0 + c_G^T β^Z + m_V(v) = γ_0(v) + c_G^T β^Z,    (8.6)

where γ_0(v) = β_0 + m_V(v), v ∈ V. In matrix notation, (8.6) becomes

    m(v) = γ_0(v) 1_G + C β^Z.    (8.7)

8.3.1  Partial effects of a categorical covariate
In agreement with the general expression (8.2), we have for arbitrary v ∈ V and arbitrary
g_1, g_2 ∈ Z:

    E(Y | Z = g_1, V = v) − E(Y | Z = g_2, V = v) = m_{g_1}(v) − m_{g_2}(v)
                                                  = (c_{g_1} − c_{g_2})^T β^Z,    (8.8)

which does not depend on the value of V = v. That is, the difference between two conditional group
means is the same for all values of the covariates in V.
Terminology (Partial effects of a categorical covariate).
If additivity of a categorical covariate Z and the covariates V can be assumed, the vector of
coefficients β^Z from the parameterization of the conditional group means (Eqs. 8.6, 8.7) will be
referred to as the partial effects of the categorical covariate.

Note. It should be clear from (8.8) that interpretation of the partial effects of a categorical
covariate depends on the chosen parameterization (the chosen (pseudo)contrast matrix C).

If the Z covariate acts additively with the V covariates, it makes sense to ask whether all G
conditional group means are, for a given v ∈ V, equal; that is, whether all partial effects of the
Z covariate are equal to zero. In general, this corresponds to the null hypothesis

    H_0: m_1(v) = · · · = m_G(v),    v ∈ V.    (8.9)

If the regression function is parameterized as (8.6), the null hypothesis (8.9) is expressed using
the partial effects as

    H_0: β^Z = 0_{G−1}.
8.3.2  Interpretation of the regression coefficients

Note that (8.6) and (8.7) are basically the same expressions as those in (7.19) in Section 7.4.4.
The only difference is the dependence of the group means and the intercept term on the value of
the covariates V. Hence the interpretation of the individual coefficients β_0 and
β^Z = (β_1, . . . , β_{G−1}) depends on the chosen pseudocontrast matrix C. Nevertheless, it is
basically the same as in the case of a single categorical covariate in Section 7.4.4, with the
only difference being that

(i) the non-intercept coefficients in β^Z have the same interpretation as in Section 7.4.4, but
    always conditionally, given a chosen (even though arbitrary) value v ∈ V;

(ii) the intercept β_0 has the interpretation given in Section 7.4.4 only for such v ∈ V for
    which m_V(v) = 0. This follows from the fact that, again, for a chosen v ∈ V, the expression
    (8.6) of the conditional group means is the same as in Section 7.4.4. Nevertheless, only for
    v such that m_V(v) = 0 do we have β_0 = γ_0(v).
Example 8.1 (Reference group pseudocontrasts).
If C is the reference group pseudocontrast matrix (7.22), we obtain, analogously to (7.23), but
now for a chosen v ∈ V, the following:

    β_0 + m_V(v) = γ_0(v) = m_1(v),
              β_1 = m_2(v) − m_1(v),
                ⋮
          β_{G−1} = m_G(v) − m_1(v).
Example 8.2 (Sum contrasts).
If C is the sum contrast matrix (7.24), we obtain, analogously to (7.25), but now for a chosen
v ∈ V, the following:

    β_0 + m_V(v) = γ_0(v) = m̄(v),
              β_1 = m_1(v) − m̄(v),
                ⋮
          β_{G−1} = m_{G−1}(v) − m̄(v),

where

    m̄(v) = (1/G) Σ_{g=1}^{G} m_g(v),    v ∈ V.

If we additionally define α_G = − Σ_{g=1}^{G−1} β_g, we get, in agreement with (7.26),

    α_G = − Σ_{g=1}^{G−1} β_g = m_G(v) − m̄(v).
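The two examples above can be checked numerically. The following sketch (group means chosen arbitrarily for the illustration, G = 3) solves the system m = β_0 1_G + C β^Z for both contrast choices and recovers exactly the interpretations of Examples 8.1 and 8.2.

```python
import numpy as np

m = np.array([1.0, 3.0, 6.0])           # m_1(v), m_2(v), m_3(v); invented numbers
G = len(m)

# Reference-group pseudocontrasts: rows c_g of C (c_1 = 0)
C_ref = np.array([[0.0, 0.0],
                  [1.0, 0.0],
                  [0.0, 1.0]])
# Sum contrasts: c_g = e_g for g < G, c_G = -(1, ..., 1)
C_sum = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [-1.0, -1.0]])

def solve(C):
    # m = beta0 * 1_G + C beta^Z  ->  solve the G x G system for (beta0, beta^Z)
    A = np.column_stack([np.ones(G), C])
    return np.linalg.solve(A, m)

b_ref = solve(C_ref)   # beta0 = m_1,     beta_g = m_{g+1} - m_1
b_sum = solve(C_sum)   # beta0 = mean(m), beta_g = m_g - mean(m)
```

Both parameterizations describe the same group means; only the meaning of the individual coefficients changes.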
8.4  Effect modification and interactions
8.4.1  Effect modification
Suppose now that the covariate vectors are

    (Z_1, W_1, V_1), . . . , (Z_n, W_n, V_n) ∈ Z × W × V,    Z ⊆ R, W ⊆ R, V ⊆ R^{p−2}, p ≥ 2.

As usual, let

    Z = (Z_1, . . . , Z_n)^T,    W = (W_1, . . . , W_n)^T,    V = ( V_1^T )
                                                                  (   ⋮   )
                                                                  ( V_n^T )

denote the matrices collecting the observed covariate values and, finally, let
(Z, W, V) ∈ Z × W × V denote a generic covariate. Suppose now that the regression function is
    m(z, w, v) = m_{ZW}(z, w) + m_V(v),    (z, w, v) ∈ Z × W × V,    (8.10)

where m_V : V → R is some function and m_{ZW} : Z × W → R is a function that cannot be factorized
as m_{ZW}(z, w) = m_Z(z) + m_W(w). We then have for any fixed v ∈ V:

    E(Y | Z = z + 1, W = w, V = v) − E(Y | Z = z, W = w, V = v)
        = m_{ZW}(z + 1, w) − m_{ZW}(z, w),    (8.11)

    E(Y | Z = z, W = w + 1, V = v) − E(Y | Z = z, W = w, V = v)
        = m_{ZW}(z, w + 1) − m_{ZW}(z, w),    (8.12)

    z ∈ Z, w ∈ W,

where (8.11), i.e., the effect of the covariate Z on the response expectation, possibly depends on
the value of W = w, and also (8.12), i.e., the effect of the covariate W on the response
expectation, possibly depends on the value of Z = z. We then say that the covariates Z and W are
mutual effect modifiers.^2
In the context of a linear model, both m_{ZW} and m_V are chosen to be linear in unknown
(regression) parameters and the corresponding model matrix is decomposed as

    X = (X_{ZW}, X_V),    (8.13)

where X_{ZW} corresponds to the regression function m_{ZW} and depends only on the covariate
values Z and W, and X_V corresponds to the regression function m_V and depends only on the
covariate values V. In the rest of this section and in Sections 8.5, 8.6 and 8.7, we show
classical choices for the matrix X_{ZW} based on so-called interactions derived from the covariate
parameterizations introduced in Sections 7.3 and 7.4.
End of
Lecture #11
(05/11/2015)
^2 modifikátory efektu

8.4.2  Interactions

Start of Lecture #13 (12/11/2015)

Suppose that the covariate Z is parameterized using a parameterization

    s_Z = (s_1^Z, . . . , s_{k−1}^Z): Z → R^{k−1},    (8.14)

and the covariate W is parameterized using a parameterization

    s_W = (s_1^W, . . . , s_{l−1}^W): W → R^{l−1},    (8.15)
and let S_Z and S_W be the corresponding reparameterizing matrices:

    S_Z = ( s_Z^T(Z_1) )
          (     ⋮      )  = (S_Z^1, . . . , S_Z^{k−1}),
          ( s_Z^T(Z_n) )

    S_W = ( s_W^T(W_1) )
          (     ⋮      )  = (S_W^1, . . . , S_W^{l−1}).
          ( s_W^T(W_n) )
Definition 8.2 (Interaction terms).
Let (Z_1, W_1), . . . , (Z_n, W_n) ∈ Z × W ⊆ R^2 be values of two covariates parameterized using
the reparameterizing matrices S_Z and S_W. By interaction terms^3 based on the reparameterizing
matrices S_Z and S_W we mean the columns of the matrix

    S_{ZW} := S_Z : S_W.
Note. See Definition A.5 for the definition of the columnwise product of two matrices. We have

    S_{ZW} = S_Z : S_W
           = (S_Z^1 : S_W^1, . . . , S_Z^{k−1} : S_W^1, . . . ,
              S_Z^1 : S_W^{l−1}, . . . , S_Z^{k−1} : S_W^{l−1})

           = ( s_W^T(W_1) ⊗ s_Z^T(Z_1) )
             (            ⋮            ).
             ( s_W^T(W_n) ⊗ s_Z^T(Z_n) )
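The columnwise product of Definition 8.2 is easy to implement directly; the following minimal numpy sketch (tiny invented matrices, function name `colwise_product` is my own) produces the interaction columns in the ordering used in the Note, and checks that each row equals the row-wise Kronecker product s_W^T(W_i) ⊗ s_Z^T(Z_i).

```python
import numpy as np

def colwise_product(SZ, SW):
    """Columnwise product S_Z : S_W: for each column of SW, multiply it
    elementwise with every column of SZ (ordering as in the text)."""
    n, k1 = SZ.shape
    _, l1 = SW.shape
    cols = [SZ[:, j] * SW[:, m] for m in range(l1) for j in range(k1)]
    return np.column_stack(cols)

# Tiny example: n = 2 observations, k-1 = 2 and l-1 = 2 parameterizing columns
SZ = np.array([[1.0, 2.0],
               [3.0, 4.0]])
SW = np.array([[5.0, 6.0],
               [7.0, 8.0]])
SZW = colwise_product(SZ, SW)
# Each row of SZW equals np.kron(SW[i], SZ[i]), i.e. s_W^T(W_i) ⊗ s_Z^T(Z_i).
```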
8.4.3  Interactions with the regression spline

Either the Z covariate or/and the W covariate can also be parameterized by regression splines. In
that case, the interaction terms are defined in the same way as in Definition 8.2. For example, if
the Z covariate is parameterized by the regression splines

    B_Z = (B_1^Z, . . . , B_k^Z)

with the related model matrix

    B_Z = ( B_Z^T(Z_1) )
          (     ⋮      )  = (B_Z^1, . . . , B_Z^k),
          ( B_Z^T(Z_n) )

and the W covariate by the parameterization (8.15) with the reparameterizing matrix S_W as above,
we mean by the interaction terms the columns of the matrix

    B_{ZW} = B_Z : S_W
           = (B_Z^1 : S_W^1, . . . , B_Z^k : S_W^1, . . . ,
              B_Z^1 : S_W^{l−1}, . . . , B_Z^k : S_W^{l−1})

           = ( s_W^T(W_1) ⊗ B_Z^T(Z_1) )
             (            ⋮            ).
             ( s_W^T(W_n) ⊗ B_Z^T(Z_n) )
^3 interakční členy
8.4.4  Linear model with interactions
Interaction terms are used in a linear model to express a certain form of effect modification. If
1_n ∉ M(S_Z) and 1_n ∉ M(S_W), the matrix X_{ZW} from (8.13) is usually chosen as

    X_{ZW} = (1_n, S_Z, S_W, S_{ZW}),    (8.16)

which, as will be shown, corresponds to a certain form of effect modification. Let the related
regression coefficients be denoted as

    β = (β_0, β_1^Z, . . . , β_{k−1}^Z, β_1^W, . . . , β_{l−1}^W,
         β_{1,1}^{ZW}, . . . , β_{k−1,1}^{ZW}, . . . , β_{1,l−1}^{ZW}, . . . , β_{k−1,l−1}^{ZW}),

where β^Z := (β_1^Z, . . . , β_{k−1}^Z), β^W := (β_1^W, . . . , β_{l−1}^W) and
β^{ZW} := (β_{1,1}^{ZW}, . . . , β_{k−1,l−1}^{ZW}).
Main and interaction effects
Terminology (Main and interaction effects).
Coefficients in β^Z and β^W are called the main effects^4 of the covariates Z and W, respectively.
Coefficients in β^{ZW} are called the interaction effects.^5
The related regression function (8.10) is

    m(z, w, v) = β_0 + s_Z^T(z) β^Z + s_W^T(w) β^W + s_{ZW}^T(z, w) β^{ZW} + m_V(v),
                 (z, w, v) ∈ Z × W × V,    (8.17)

where s_{ZW} : Z × W → R^{(k−1)(l−1)},

    s_{ZW}(z, w) = s_W(w) ⊗ s_Z(z)
                 = (s_1^Z(z) s_1^W(w), . . . , s_{k−1}^Z(z) s_1^W(w), . . . ,
                    s_1^Z(z) s_{l−1}^W(w), . . . , s_{k−1}^Z(z) s_{l−1}^W(w)),
                 (z, w) ∈ Z × W.

The effects of the covariates Z or W, given the remaining covariates, are then expressed as

    E(Y | Z = z + 1, W = w, V = v) − E(Y | Z = z, W = w, V = v)
        = (s_Z(z + 1) − s_Z(z))^T β^Z + (s_{ZW}(z + 1, w) − s_{ZW}(z, w))^T β^{ZW},    (8.18)

    E(Y | Z = z, W = w + 1, V = v) − E(Y | Z = z, W = w, V = v)
        = (s_W(w + 1) − s_W(w))^T β^W + (s_{ZW}(z, w + 1) − s_{ZW}(z, w))^T β^{ZW},    (8.19)

    z ∈ Z, w ∈ W.
That is, the effect (8.18) of the covariate Z is determined by the main effects β^Z of this
covariate as well as by the interaction effects β^{ZW}. Analogously, the effect (8.19) of the
covariate W is determined by its main effects β^W as well as by the interaction effects β^{ZW}.
^4 hlavní efekty
^5 interakční efekty
Hypothesis of no effect modification

If the factor m_{ZW}(z, w) of the regression function (8.10) is parameterized by the matrix
X_{ZW} given by (8.16), then the hypothesis of no effect modification is expressed by considering
a submodel in which the matrix X_{ZW} is replaced by the matrix

    X_{Z+W} = (1_n, S_Z, S_W).
Interaction model with regression splines

Suppose that the matrix S_Z = B_Z, where B_Z corresponds to the parameterization of the Z
covariate using the regression splines. We then have 1_n ∈ M(B_Z) and also M(S_W) ⊆ M(B_{ZW})
for B_{ZW} = B_Z : S_W (see Section 8.5.2). That is,

    M(1_n, B_Z, S_W, B_{ZW}) = M(B_Z, B_{ZW}).

It is thus sufficient (with respect to the obtained regression space) to choose the matrix X_{ZW}
as

    X_{ZW} = (B_Z, B_{ZW}).    (8.20)

The hypothesis of no effect modification then corresponds to a submodel in which the matrix
(8.20) is replaced by the matrix

    X_{Z+W} = (B_Z, S_W).
8.4.5  Rank of the interaction model

Lemma 8.1 (Rank of the interaction model).

(i) Let rank(S_Z, S_W) = k + l − 2, i.e., all columns of the matrices S_Z and S_W are linearly
    independent and the matrix (S_Z, S_W) is of full rank. Then the matrix S_{ZW} = S_Z : S_W is
    of full rank as well, i.e.,

        rank(S_{ZW}) = (k − 1)(l − 1).

(ii) Let additionally 1_n ∉ M(S_Z) and 1_n ∉ M(S_W). Then also the matrix
    X_{ZW} = (1_n, S_Z, S_W, S_{ZW}) is of full rank, i.e.,

        rank(X_{ZW}) = 1 + (k − 1) + (l − 1) + (k − 1)(l − 1) = k l.

Proof. Left as an exercise in linear algebra.
Proof/calculations were skipped and are not requested for the exam.
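The rank formulas of Lemma 8.1 can be illustrated numerically (this is a check on a random example, not a proof). The sketch below uses raw-polynomial parameterizations of two simulated numeric covariates; all sizes are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, l = 50, 3, 4
Z = rng.uniform(size=n)
W = rng.uniform(size=n)

# Raw polynomials without the intercept column: s_Z(z) = (z, ..., z^{k-1})
SZ = np.column_stack([Z**j for j in range(1, k)])      # n x (k-1)
SW = np.column_stack([W**j for j in range(1, l)])      # n x (l-1)

# Columnwise product S_Z : S_W
SZW = np.column_stack([SZ[:, j] * SW[:, m]
                       for m in range(l - 1) for j in range(k - 1)])
XZW = np.column_stack([np.ones(n), SZ, SW, SZW])

# With generic (random) covariate values the ranks reach the bounds of the lemma:
assert np.linalg.matrix_rank(SZW) == (k - 1) * (l - 1)
assert np.linalg.matrix_rank(XZW) == k * l
```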
8.4. EFFECT MODIFICATION AND INTERACTIONS
115
Note (Hypothesis of no effect modification).
Under the conditions of Lemma 8.1, we have for X_{ZW} = (1_n, S_Z, S_W, S_{ZW}) and
X_{Z+W} = (1_n, S_Z, S_W):

    rank(X_{ZW}) = k l,    rank(X_{Z+W}) = 1 + (k − 1) + (l − 1) = k + l − 1,
    M(X_{Z+W}) ⊂ M(X_{ZW}).

If the hypothesis of no effect modification is tested by a submodel F-test, then its numerator
degrees of freedom are

    k l − k − l + 1 = (k − 1) · (l − 1).

The corresponding null hypothesis can also be specified as a hypothesis on the zero value of an
estimable vector of all interaction effects:

    H_0: β^{ZW} = 0_{(k−1)·(l−1)}.
8.5  Interaction of two numeric covariates
Let us now consider a situation when both Z and W are numeric covariates with Z ⊆ R, W ⊆ R.
8.5.1  Mutual effect modification

Linear effect modification
Suppose first that S_Z is the reparameterizing matrix that corresponds to a simple identity
transformation of the covariate Z. For the second covariate, W, assume that the matrix S_W is an
n × (l − 1) reparameterizing matrix that corresponds to the general parameterization (8.15),
e.g., any of the reparameterizing matrices discussed in Sections 7.3.1, 7.3.2 and 7.3.3. That is,

    s_Z(z) = z,  z ∈ Z,        s_W(w) = (s_1^W(w), . . . , s_{l−1}^W(w)),  w ∈ W,

    S_Z = ( Z_1 )        S_W = ( s_W^T(W_1) )   ( s_1^W(W_1) . . . s_{l−1}^W(W_1) )
          (  ⋮  ),             (     ⋮      ) = (      ⋮                 ⋮        ).
          ( Z_n )              ( s_W^T(W_n) )   ( s_1^W(W_n) . . . s_{l−1}^W(W_n) )
The matrix X_{ZW}, Eq. (8.16), is then

    X_{ZW} = (1_n, S_Z, S_W, S_{ZW}),

with rows

    (1, Z_i, s_1^W(W_i), . . . , s_{l−1}^W(W_i), Z_i s_1^W(W_i), . . . , Z_i s_{l−1}^W(W_i)),
    i = 1, . . . , n,    (8.21)

and the related regression coefficients are

    β = (β_0, β^Z, β_1^W, . . . , β_{l−1}^W, β_1^{ZW}, . . . , β_{l−1}^{ZW}),

where β^W := (β_1^W, . . . , β_{l−1}^W) and β^{ZW} := (β_1^{ZW}, . . . , β_{l−1}^{ZW}).
The regression function (8.17) then becomes

    m(z, w, v) = β_0 + β^Z z + β_1^W s_1^W(w) + · · · + β_{l−1}^W s_{l−1}^W(w)
                 + β_1^{ZW} z s_1^W(w) + · · · + β_{l−1}^{ZW} z s_{l−1}^W(w) + m_V(v)    (8.22)

               = γ_0^Z(w, v) + γ_1^Z(w) z,
                 where γ_0^Z(w, v) := β_0 + s_W^T(w) β^W + m_V(v),
                       γ_1^Z(w) := β^Z + s_W^T(w) β^{ZW},    (8.23)

               = γ_0^W(z, v) + γ_1^W(z) s_1^W(w) + · · · + γ_{l−1}^W(z) s_{l−1}^W(w),
                 where γ_0^W(z, v) := β_0 + β^Z z + m_V(v),
                       γ_j^W(z) := β_j^W + β_j^{ZW} z,  j = 1, . . . , l − 1,    (8.24)

    (z, w, v) ∈ Z × W × V.
The regression function (8.22) can be interpreted in two ways.

(i) Expression (8.23) shows that for any fixed w ∈ W, the covariates Z and V act additively and
    the effect of Z on the response expectation is expressed by a line. Nevertheless, both the
    intercept γ_0^Z and the slope γ_1^Z of this line depend on w, and this dependence is
    described by the parameterization s_W. The intercept is further additively modified by the
    factor m_V(v).

    With respect to interpretation, this shows that the main effect β^Z has the interpretation of
    the slope of the line that, for a given V = v, describes the influence of Z on the response
    if W = w is such that s_W(w) = 0_{l−1}. This also shows that a test of the null hypothesis

        H_0: β^Z = 0

    does not evaluate the statistical significance of the influence of the covariate Z on the
    response expectation in general. It only evaluates it for values of W = w for which
    s_W(w) = 0_{l−1}.

(ii) Analogously, expression (8.24) shows that for any fixed z ∈ Z, the covariates W and V act
    additively and the effect of W on the response expectation is expressed by its
    parameterization s_W. Nevertheless, the related coefficients (γ_0^W, γ_1^W, . . . , γ_{l−1}^W)
    depend in a linear way on z. The intercept term γ_0^W is further additively modified by the
    factor m_V(v).

    With respect to interpretation, this shows that the main effects β^W have the interpretation
    of the coefficients of the influence of the W covariate on the response if Z = 0. This also
    shows that a test of the null hypothesis

        H_0: β^W = 0_{l−1}

    does not evaluate the statistical significance of the influence of the covariate W on the
    response expectation in general. It only evaluates it under the condition Z = 0.
More complex effect modification

More complex effect modifications can be obtained by choosing a more complex reparameterizing
matrix S_Z for the Z covariate. Interpretation of such a model is then a straightforward
generalization of the above situation.
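The simplest case of the linear effect modification above (l − 1 = 1, so s_W(w) = w) can be sketched with simulated data (all numbers invented for the demo): the model is m(z, w) = β_0 + β^Z z + β^W w + β^{ZW} z w, and the slope of z given W = w is γ_1^Z(w) = β^Z + β^{ZW} w — one number per value of w, not a single overall "effect of Z".

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
z = rng.uniform(size=n)
w = rng.uniform(size=n)
# True coefficients chosen arbitrarily; small noise so the fit is sharp.
y = 0.5 + 1.0 * z - 2.0 * w + 3.0 * z * w + rng.normal(scale=0.05, size=n)

X = np.column_stack([np.ones(n), z, w, z * w])   # (1_n, S_Z, S_W, S_ZW)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, bZ, bW, bZW = beta

def slope_of_z(w_value):
    # gamma_1^Z(w) = beta^Z + beta^{ZW} * w: effect of a unit increase in z
    return bZ + bZW * w_value

# The effect of z changes with w, which is exactly what "effect
# modification" means; beta^Z alone is the slope only at w = 0.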
8.5.2  Mutual effect modification with regression splines

Linear effect modification with regression splines

Let us again assume that the covariate Z is parameterized using a simple identity transformation
and the reparameterizing matrix S_Z is given as in (8.21). Nevertheless, for the covariate W, let
us assume its parameterization using the regression splines

    B_W = (B_1^W, . . . , B_l^W)
with the related model matrix

    B_W = ( B_W^T(W_1) )   ( B_1^W(W_1) . . . B_l^W(W_1) )
          (     ⋮      ) = (      ⋮               ⋮       ).
          ( B_W^T(W_n) )   ( B_1^W(W_n) . . . B_l^W(W_n) )

Analogously to the previous usage of the matrix S_{ZW}, let the matrix B_{ZW} be defined as

    B_{ZW} = S_Z : B_W = ( Z_1 B_1^W(W_1) . . . Z_1 B_l^W(W_1) )
                         (        ⋮                   ⋮        ).
                         ( Z_n B_1^W(W_n) . . . Z_n B_l^W(W_n) )

Remember that for any w ∈ W, Σ_{j=1}^{l} B_j^W(w) = 1, from which it follows that

(i) 1_n ∈ M(B_W);

(ii) M(S_Z) ⊆ M(B_{ZW}).

That is,

    M(1_n, S_Z, B_W, B_{ZW}) = M(B_W, B_{ZW}).    (8.25)
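The two consequences of the partition-of-unity property Σ_j B_j^W(w) = 1 can be checked numerically. To keep the sketch self-contained I use degree-1 B-splines (hat functions) on equidistant knots instead of the general splines of the text — an assumption made purely for the demo (the function `hat_basis` is my own helper, not from the course).

```python
import numpy as np

def hat_basis(w, knots):
    """n x l matrix of degree-1 B-splines (hat functions) over the knots."""
    w = np.asarray(w, dtype=float)
    l = len(knots)
    B = np.zeros((w.size, l))
    for j in range(l):
        t = knots[j]
        # rising part on [knots[j-1], t] (constant 1 left of t for j = 0)
        up = (w - knots[j - 1]) / (t - knots[j - 1]) if j > 0 else np.ones_like(w)
        # falling part on [t, knots[j+1]] (constant 1 right of t for j = l-1)
        down = (knots[j + 1] - w) / (knots[j + 1] - t) if j + 1 < l else np.ones_like(w)
        B[:, j] = np.clip(np.minimum(up, down), 0.0, 1.0)
    return B

rng = np.random.default_rng(3)
W = rng.uniform(0.0, 1.0, size=30)
Z = rng.normal(size=30)
BW = hat_basis(W, np.linspace(0.0, 1.0, 5))     # l = 5 basis functions
BZW = BW * Z[:, None]                           # S_Z : B_W with s_Z(z) = z

# Rows of BW sum to 1 (so 1_n = BW @ 1_l is in M(B_W)), and consequently
# BZW @ 1_l recovers Z exactly (so M(S_Z) is contained in M(B_ZW)).
```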
Hence, if a full-rank linear model is to be obtained in which an interaction between a covariate
parameterized using the regression splines and a covariate parameterized using the
reparameterizing matrix S_Z = (Z_1, . . . , Z_n)^T is included, the model matrix X_{ZW} must be of
the form

    X_{ZW} = (B_W, B_{ZW})
           = ( B_1^W(W_1) . . . B_l^W(W_1)  Z_1 B_1^W(W_1) . . . Z_1 B_l^W(W_1) )
             (      ⋮               ⋮              ⋮                   ⋮         ).
             ( B_1^W(W_n) . . . B_l^W(W_n)  Z_n B_1^W(W_n) . . . Z_n B_l^W(W_n) )

If we denote the related regression coefficients as

    β = (β_1^W, . . . , β_l^W, β_1^{ZW}, . . . , β_l^{ZW}),

where β^W := (β_1^W, . . . , β_l^W) and β^{ZW} := (β_1^{ZW}, . . . , β_l^{ZW}),
the regression function (8.17) becomes

    m(z, w, v) = β_1^W B_1^W(w) + · · · + β_l^W B_l^W(w)
                 + β_1^{ZW} z B_1^W(w) + · · · + β_l^{ZW} z B_l^W(w) + m_V(v)    (8.26)

               = γ_0^Z(w, v) + γ_1^Z(w) z,
                 where γ_0^Z(w, v) := B_W^T(w) β^W + m_V(v),
                       γ_1^Z(w) := B_W^T(w) β^{ZW},    (8.27)

               = γ_1^W(z) B_1^W(w) + · · · + γ_l^W(z) B_l^W(w) + m_V(v),
                 where γ_j^W(z) := β_j^W + β_j^{ZW} z,  j = 1, . . . , l,    (8.28)

    (z, w, v) ∈ Z × W × V.
The regression function (8.26) can again be interpreted in two ways.
(i) Expression (8.27) shows that for any fixed w ∈ W, the covariates Z and V act additively and
    the effect of Z on the response expectation is expressed by a line. Nevertheless, both the
    intercept γ_0^Z and the slope γ_1^Z of this line depend on w, and this dependence is
    described by the regression splines B_W. The intercept is further additively modified by the
    factor m_V(v).

(ii) Analogously, expression (8.28) shows that for any fixed z ∈ Z, the covariates W and V act
    additively and the effect of W on the response expectation is expressed by the regression
    splines B_W. Nevertheless, the related spline coefficients (γ_1^W, . . . , γ_l^W) depend in a
    linear way on z. With respect to interpretation, this shows that the main effects β^W have
    the interpretation of the coefficients of the influence of the W covariate on the response if
    Z = 0.
More complex effect modification with regression splines

Also with regression splines, a more complex reparameterizing matrix S_Z can be chosen. The
property (8.25) still holds and the matrix X_{ZW} can still be chosen as (B_W, B_{ZW}),
B_{ZW} = S_Z : B_W. Interpretation of the model is again a straightforward generalization of the
above situation.

With respect to the rank of the resulting model, statements analogous to those given in
Lemma 8.1 hold. Suppose that the matrix S_Z has k − 1 columns. If rank(S_Z, B_W) = k − 1 + l,
i.e., all columns of the matrices S_Z and B_W are linearly independent and the matrix (S_Z, B_W)
is of full rank, then both B_{ZW} = S_Z : B_W and X_{ZW} = (B_W, B_{ZW}) are of full rank, i.e.,

    rank(B_{ZW}) = (k − 1) l,    rank(X_{ZW}) = l + (k − 1) l = k l.
8.6  Interaction of a categorical and a numeric covariate
Consider now a situation when Z is a categorical covariate with Z = {1, . . . , G}, where the
value Z = g, g = 1, . . . , G, is repeated n_g-times in the data, and W is a numeric covariate
with W ⊆ R. We will assume (without loss of generality) that the data are sorted according to the
values of the categorical covariate Z. For clarity of notation, we will, analogously to
Section 7.4, also use a double subscript to number the individual observations, where the first
subscript indicates the value of the covariate Z. That is, we will use

    Z = (Z_1, . . . , Z_n)^T = (Z_{1,1}, . . . , Z_{1,n_1} | · · · | Z_{G,1}, . . . , Z_{G,n_G})^T
      = (1, . . . , 1 | · · · | G, . . . , G)^T,

    W = (W_1, . . . , W_n)^T = (W_{1,1}, . . . , W_{1,n_1} | · · · | W_{G,1}, . . . , W_{G,n_G})^T.
If the categorical covariate Z can be interpreted as a label that indicates pertinence to one of
G groups, the regression function (8.17), in which the value of z is fixed at z = g,
g = 1, . . . , G, can be viewed as a regression function that parameterizes the dependence of the
response expectation on the numeric covariate W and possibly other covariates V in group g. We
have, for w ∈ W, v ∈ V:

    m(1, w, v) = E(Y | Z = 1, W = w, V = v) =: m_1(w, v),
        ⋮
    m(G, w, v) = E(Y | Z = G, W = w, V = v) =: m_G(w, v).    (8.29)

Functions m_1, . . . , m_G are then conditional (given a value of Z) regression functions
describing the dependence of the response expectation on the covariates W and V.

Alternatively, for fixed w ∈ W and v ∈ V, the vector

    m(w, v) = (m_1(w, v), . . . , m_G(w, v))

can be interpreted as a vector of conditional (given W and V) group means.
In the following, assume that the categorical covariate Z is parameterized by means of a chosen
(pseudo)contrast matrix

    C = ( c_1^T )        c_1 = (c_{1,1}, . . . , c_{1,G−1}),
        (   ⋮   ),          ⋮
        ( c_G^T )        c_G = (c_{G,1}, . . . , c_{G,G−1}),

that is,

    S_Z = ( 1_{n_1} ⊗ c_1^T )
          (         ⋮       ).
          ( 1_{n_G} ⊗ c_G^T )
8.6.1  Categorical effect modification

First suppose that the numeric covariate W is parameterized using a parameterization

    s_W = (s_1^W, . . . , s_{l−1}^W): W → R^{l−1},
and S_W is the corresponding n × (l − 1) reparameterizing matrix. Let S_1^W, . . . , S_G^W be the
blocks of the reparameterizing matrix S_W that correspond to the datapoints with
Z = 1, . . . , Z = G. That is, for g = 1, . . . , G, the matrix S_g^W is an n_g × (l − 1) matrix,

    S_g^W = ( s_W^T(W_{g,1})   )   ( s_1^W(W_{g,1})   . . . s_{l−1}^W(W_{g,1})   )
            (        ⋮         ) = (        ⋮                       ⋮            )
            ( s_W^T(W_{g,n_g}) )   ( s_1^W(W_{g,n_g}) . . . s_{l−1}^W(W_{g,n_g}) )

and

    S_W = ( S_1^W )
          (   ⋮   ).
          ( S_G^W )
The matrix X_{ZW} that parameterizes the term m_{ZW}(z, w) in the regression function (8.10) is
again

    X_{ZW} = (1_n, S_Z, S_W, S_{ZW}),

where the interaction matrix S_{ZW} = S_Z : S_W is an n × (G − 1)(l − 1) matrix whose row for
observation (g, i), g = 1, . . . , G, i = 1, . . . , n_g, equals

    s_W^T(W_{g,i}) ⊗ c_g^T
      = (c_{g,1} s_1^W(W_{g,i}), . . . , c_{g,G−1} s_1^W(W_{g,i}), . . . ,
         c_{g,1} s_{l−1}^W(W_{g,i}), . . . , c_{g,G−1} s_{l−1}^W(W_{g,i}))
      = (s_1^W(W_{g,i}) c_g^T, . . . , s_{l−1}^W(W_{g,i}) c_g^T).
The regression coefficients related to the model matrix X_{ZW} are now

    β = (β_0, β_1^Z, . . . , β_{G−1}^Z, β_1^W, . . . , β_{l−1}^W,
         β_{1,1}^{ZW}, . . . , β_{G−1,1}^{ZW}, . . . , β_{1,l−1}^{ZW}, . . . , β_{G−1,l−1}^{ZW}),

where β^Z := (β_1^Z, . . . , β_{G−1}^Z), β^W := (β_1^W, . . . , β_{l−1}^W) and
β^{ZW} := (β_{1,1}^{ZW}, . . . , β_{G−1,l−1}^{ZW}).

For the following considerations, it will be useful to denote subvectors of the interaction
effects β^{ZW} as

    β^{ZW} = (β_{•1}^{ZW}, . . . , β_{•l−1}^{ZW}),
    β_{•j}^{ZW} := (β_{1,j}^{ZW}, . . . , β_{G−1,j}^{ZW}),  j = 1, . . . , l − 1.
The values of the regression function (8.17) for z = g, g = 1, . . . , G, w ∈ W and v ∈ V, i.e.,
the values of the conditional regression functions (8.29), can then be written as

    m_g(w, v) = β_0 + c_g^T β^Z + β_1^W s_1^W(w) + · · · + β_{l−1}^W s_{l−1}^W(w)
                + s_1^W(w) c_g^T β_{•1}^{ZW} + · · · + s_{l−1}^W(w) c_g^T β_{•l−1}^{ZW}
                + m_V(v).    (8.30)
A useful interpretation of the regression function (8.30) is obtained if we view m_g(w, v) as a
conditional (given Z = g) regression function that describes the dependence of the response
expectation on the covariates W and V in group g and write it as

    m_g(w, v) = β_0 + c_g^T β^Z + m_V(v)
                + (β_1^W + c_g^T β_{•1}^{ZW}) s_1^W(w) + · · ·
                + (β_{l−1}^W + c_g^T β_{•l−1}^{ZW}) s_{l−1}^W(w),    (8.31)

with the group-specific intercept γ_{g,0}^W(v) := β_0 + c_g^T β^Z + m_V(v) and "slopes"
γ_{g,j}^W := β_j^W + c_g^T β_{•j}^{ZW}, j = 1, . . . , l − 1.
Expression (8.31) shows that a linear model with an interaction between a numeric covariate
parameterized using a parameterization s_W and a categorical covariate can be interpreted such
that for any fixed Z = g, the covariates W and V act additively and the effect of W on the
response expectation is expressed by its parameterization s_W. Nevertheless, the related
coefficients depend on the value of the categorical covariate Z. The intercept term is further
additively modified by the factor m_V(v).

In other words, if the categorical covariate Z expresses pertinence of a subject/experimental
unit to one of G groups (internally labeled by numbers 1, . . . , G), the regression function
(8.31) of the interaction model parameterizes a situation when, given the remaining covariates V,
the dependence of the response expectation on the numeric covariate W can be expressed in each of
the G groups by the same linear model (parameterized by the parameterization s_W); nevertheless,
the regression coefficients of the G linear models may differ. It follows from (8.31) that, given
Z = g (and given V = v), the regression coefficients for the dependence of the response on the
numeric covariate W expressed by the parameterization s_W are
    γ_{g,0}^W(v) = β_0 + c_g^T β^Z + m_V(v),    (8.32)

    γ_{g,j}^W = β_j^W + c_g^T β_{•j}^{ZW},    j = 1, . . . , l − 1.    (8.33)
The chosen (pseudo)contrasts that parameterize the categorical covariate Z then determine the
interpretation of the intercept β_0, both sets of main effects β^Z and β^W, and also the
interaction effects β^{ZW}. This interpretation is now a straightforward generalization of the
derivations shown earlier in Sections 7.4.4 and 8.3.

• Interpretation of the intercept term β_0 and the main effects β^Z of the categorical covariate
  Z is obtained by noting the correspondence between the expression of the group-specific
  intercepts γ_{1,0}^W(v), . . . , γ_{G,0}^W(v) given by (8.32) and the conditional group means
  (8.6) in Section 8.3.

• Analogously, interpretation of the main effects β^W and the interaction effects β^{ZW} is
  obtained by noting that for each j = 1, . . . , l − 1, the group-specific "slopes"
  γ_{1,j}^W, . . . , γ_{G,j}^W given by (8.33) play the role of the group-specific means (7.19)
  in Section 7.4.4.
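Relations (8.32) and (8.33) can be illustrated with a small simulation: with reference-group pseudocontrasts, G = 3 groups and W entering linearly (l − 1 = 1), each group has its own intercept and slope, and the fitted β recovers them via γ_{g,0} = β_0 + c_g^T β^Z and γ_{g,1} = β^W + c_g^T β^{ZW}_{•1}. All numbers below are invented for the demo; the data are noiseless so the recovery is exact.

```python
import numpy as np

# Group-specific intercepts and slopes chosen arbitrarily for illustration.
gamma0 = np.array([1.0, 2.5, 0.5])
gamma1 = np.array([0.8, -0.2, 1.4])

rng = np.random.default_rng(11)
g = rng.integers(0, 3, size=300)                     # group labels (covariate Z)
w = rng.uniform(size=300)
y = gamma0[g] + gamma1[g] * w                        # noiseless response

C = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # reference pseudocontrasts
SZ = C[g]                                            # rows c_g
X = np.column_stack([np.ones(300), SZ, w, SZ * w[:, None]])  # (1, S_Z, S_W, S_ZW)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, bZ, bW, bZW = beta[0], beta[1:3], beta[3], beta[4:6]

# Recover the group-specific coefficients via (8.32) and (8.33):
rec0 = b0 + C @ bZ        # gamma_{g,0}
rec1 = bW + C @ bZW       # gamma_{g,1}
```

With these contrasts, β_0 and β^W are the intercept and slope of group 1, while β^Z and β^{ZW} hold the between-group differences, in line with Example 8.3 below.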
Example 8.3 (Reference group pseudocontrasts).
Suppose that C is the reference group pseudocontrast matrix (7.22). While viewing the
group-specific intercepts (8.32) as the conditional (given V = v) group means (8.6), we obtain,
analogously to Example 8.1, the following interpretation of the intercept term β_0 and the main
effects β^Z = (β_1^Z, . . . , β_{G−1}^Z) of the categorical covariate:

    β_0 + m_V(v) = γ_{1,0}^W(v),
           β_1^Z = γ_{2,0}^W(v) − γ_{1,0}^W(v),
             ⋮
       β_{G−1}^Z = γ_{G,0}^W(v) − γ_{1,0}^W(v).
If, for given j = 1, . . . , l − 1, the group-specific "slopes" γ_{1,j}^W, . . . , γ_{G,j}^W given
by (8.33) are viewed as the group-specific means (7.19) in Section 7.4.4, the interpretation of
the jth main effect β_j^W of the numeric covariate and the jth set of interaction effects
β_{•j}^{ZW} = (β_{1,j}^{ZW}, . . . , β_{G−1,j}^{ZW}) is analogous to expressions (7.23):

    β_j^W = γ_{1,j}^W,
    β_{1,j}^{ZW} = γ_{2,j}^W − γ_{1,j}^W,
      ⋮
    β_{G−1,j}^{ZW} = γ_{G,j}^W − γ_{1,j}^W.
Example 8.4 (Sum contrasts).
Suppose now that C is the sum contrast matrix (7.24). Again, while viewing the group specific intercepts
(8.32) as the conditional (given V = v) group means (8.6), we obtain, now analogously to Example 8.2,
the following interpretation of the intercept term β_0 and the main effects β^Z = (β^Z_1, . . . , β^Z_{G−1})^⊤ of
the categorical covariate:

    β_0 + m_V(v) = γ̄^W_0(v),
    β^Z_1        = γ^W_{1,0}(v) − γ̄^W_0(v),
      ⋮
    β^Z_{G−1}    = γ^W_{G−1,0}(v) − γ̄^W_0(v),

where

    γ̄^W_0(v) = (1/G) Σ_{g=1}^G γ^W_{g,0}(v),   v ∈ V.

If for given j = 1, . . . , l − 1, the group specific “slopes” γ^W_{1,j}, . . . , γ^W_{G,j} given by (8.33) are viewed
as the group specific means (7.19) in Section 7.4.4, interpretation of the jth main effect β^W_j of the
numeric covariate and the jth set of the interaction effects β^{ZW}_{•j} = (β^{ZW}_{1,j}, . . . , β^{ZW}_{G−1,j})^⊤ is analogous
to expression (7.26):

    β^W_j            = γ̄^W_j,
    β^{ZW}_{1,j}     = γ^W_{1,j} − γ̄^W_j,
      ⋮
    β^{ZW}_{G−1,j}   = γ^W_{G−1,j} − γ̄^W_j,

where

    γ̄^W_j = (1/G) Σ_{g=1}^G γ^W_{g,j}.
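The sum-contrast relations of Example 8.4 can be checked numerically. The following sketch (Python/numpy, with made-up slope values standing in for the γ^W_{g,j} of one fixed j) recovers β^W_j as the average of the group specific slopes and the interaction effects as deviations of groups 1, . . . , G − 1 from that average:

```python
import numpy as np

# Hypothetical group specific slopes gamma[g] for G = 4 groups
# (one coefficient j of the numeric covariate per group).
gamma = np.array([0.8, 1.1, 1.7, 2.0])
G = len(gamma)

# Sum contrast matrix (7.24 analogue): I_{G-1} stacked over -1^T.
C = np.vstack([np.eye(G - 1), -np.ones(G - 1)])

# With sum contrasts, beta_j^W is the average slope and the interaction
# effects are deviations of groups 1..G-1 from that average.
beta_W = gamma.mean()
beta_ZW = gamma[:G - 1] - beta_W

# Reconstruct the group specific slopes: gamma = beta_W * 1_G + C beta_ZW.
gamma_back = beta_W + C @ beta_ZW
print(np.allclose(gamma_back, gamma))  # True
```

Note that the slope of the last group G is recovered through the minus-one row of C, i.e., as the average minus the sum of the deviations.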
An alternative interpretation of the regression function (8.30) is obtained if for a fixed w ∈ W and
v ∈ V, the values of m_g(w, v), g = 1, . . . , G, are viewed as conditional (given W and V) group
means. Expression (8.30) can then be rewritten as

    m_g(w, v) = β_0 + s_W(w)^⊤ β^W + m_V(v)                                        [ =: γ^Z_0(w, v) ]
              + c^⊤_g { β^Z + s^W_1(w) β^{ZW}_{•1} + · · · + s^W_{l−1}(w) β^{ZW}_{•l−1} },   (8.34)

where the expression in the curly braces defines γ^{Z⋆}(w).
That is, the vector m(w, v) is parameterized as

    m(w, v) = γ^Z_0(w, v) 1_G + C γ^{Z⋆}(w),

and the related coefficients γ^Z_0(w, v), γ^{Z⋆}(w) depend on w through a linear model given by
the parameterization s_W; the intercept term is further additively modified by m_V(v). Expression
(8.34) perhaps provides a way to interpret the intercept term β_0 and the main effects β^Z.
Nevertheless, attempts to use (8.34) for interpretation of the main effects β^W and the interaction
effects β^{ZW} are usually quite awkward.
8.6.2
Categorical effect modification with regression splines
Suppose now that the numeric covariate W is parameterized using the regression splines

    B^W = (B^W_1, . . . , B^W_l)^⊤

with the related model matrix B^W that we again factorize into blocks B^W_1, . . . , B^W_G that correspond
to data points with Z = 1, . . . , Z = G. That is, for g = 1, . . . , G, matrix B^W_g is an n_g × l matrix,

    B^W_g = ( B^W(W_{g,1})^⊤ ; . . . ; B^W(W_{g,n_g})^⊤ )
          = ( B^W_1(W_{g,1})   · · ·  B^W_l(W_{g,1})
                  ⋮             ⋱         ⋮
              B^W_1(W_{g,n_g}) · · ·  B^W_l(W_{g,n_g}) ),
    and  B^W = ( B^W_1 ; ⋮ ; B^W_G ).
Let B^{ZW} = S^Z : B^W, which is an n × (G − 1)l matrix. Its row corresponding to the ith observation
in group g, i.e., to (Z, W) = (g, W_{g,i}), is

    B^W(W_{g,i})^⊤ ⊗ c^⊤_g
      = ( B^W_1(W_{g,i}) c^⊤_g, . . . , B^W_l(W_{g,i}) c^⊤_g )
      = ( c_{g,1} B^W_1(W_{g,i}), . . . , c_{g,G−1} B^W_1(W_{g,i}), . . . , c_{g,1} B^W_l(W_{g,i}), . . . , c_{g,G−1} B^W_l(W_{g,i}) ),

so that, in block form,

    B^{ZW} = ( B^W_1 ⊗ c^⊤_1
               − − − − − −
                    ⋮
               − − − − − −
               B^W_G ⊗ c^⊤_G ).
As in Section 8.5.2, remember that for any w ∈ W, Σ_{j=1}^l B^W_j(w) = 1, from which it follows that

(i) 1_n ∈ M(B^W);

(ii) M(S^Z) ⊆ M(B^{ZW}).
Hence

    M( 1_n, S^Z, B^W, B^{ZW} ) = M( B^W, B^{ZW} ),

and if a full-rank linear model is to be obtained that includes an interaction between a numeric
covariate parameterized using the regression splines and a categorical covariate parameterized by
the reparameterizing matrix S^Z derived from a (pseudo)contrast matrix C, the model matrix X^{ZW}
that parameterizes the term m^{ZW}(z, w) in the regression function (8.10) is

    X^{ZW} = ( B^W, B^{ZW} ) = ( B^W_1,  B^W_1 ⊗ c^⊤_1
                                    ⋮         ⋮
                                 B^W_G,  B^W_G ⊗ c^⊤_G ).

The regression coefficients related to the model matrix X^{ZW} are

    β = ( β^W_1, . . . , β^W_l, β^{ZW}_{1,1}, . . . , β^{ZW}_{G−1,1}, . . . , β^{ZW}_{1,l}, . . . , β^{ZW}_{G−1,l} )^⊤,

where β^W := (β^W_1, . . . , β^W_l)^⊤, β^{ZW}_{•j} := (β^{ZW}_{1,j}, . . . , β^{ZW}_{G−1,j})^⊤ for j = 1, . . . , l,
and β^{ZW} := (β^{ZW⊤}_{•1}, . . . , β^{ZW⊤}_{•l})^⊤.
The value of the regression function (8.17) for z = g, g = 1, . . . , G, w ∈ W, and v ∈ V, i.e., the
values of the conditional regression functions (8.29), can then be written as

    m_g(w, v) = β^W_1 B^W_1(w) + · · · + β^W_l B^W_l(w)
              + B^W_1(w) c^⊤_g β^{ZW}_{•1} + · · · + B^W_l(w) c^⊤_g β^{ZW}_{•l} + m_V(v).

Its useful interpretation is obtained if we write it as

    m_g(w, v) = ( β^W_1 + c^⊤_g β^{ZW}_{•1} ) B^W_1(w) + · · · + ( β^W_l + c^⊤_g β^{ZW}_{•l} ) B^W_l(w) + m_V(v),
                  └──── =: γ^W_{g,1} ────┘                └──── =: γ^W_{g,l} ────┘

which shows that the underlying linear model assumes that given Z = g, the covariates W and V
act additively and the effect of the numeric covariate W on the response expectation is described
by a regression spline whose coefficients γ^W_{g,1}, . . . , γ^W_{g,l}, however, depend on the value of the
categorical covariate Z. Analogously to Section 8.6.1, interpretation of the regression coefficients β^W
and β^{ZW} depends on the chosen (pseudo)contrasts used to parameterize the categorical covariate Z.
End of Lecture #13 (12/11/2015)

8.7 Interaction of two categorical covariates

Start of Lecture #14 (19/11/2015)
Finally, consider a situation when both Z and W are categorical covariates with

    Z = {1, . . . , G},   W = {1, . . . , H}.

Let a combination (Z, W) = (g, h) be repeated n_{g,h} times in the data, g = 1, . . . , G, h =
1, . . . , H, and assume that n_{g,h} > 0 for each g and h. For clarity of notation, we will now also use
a triple subscript to index the individual observations: the first subscript indicates the value of the
covariate Z, the second subscript indicates the value of the covariate W, and the third subscript
consecutively numbers the observations with the same (Z, W) combination. Finally, without loss
of generality, we will assume that the data are sorted primarily with respect to the value of the
covariate W and secondarily with respect to the value of the covariate Z. That is, we assume
and denote

    (Z_1, . . . , Z_n)^⊤ = ( Z_{1,1,1}, . . . , Z_{1,1,n_{1,1}} | . . . | Z_{G,1,1}, . . . , Z_{G,1,n_{G,1}} | . . . . . .
                             | Z_{1,H,1}, . . . , Z_{1,H,n_{1,H}} | . . . | Z_{G,H,1}, . . . , Z_{G,H,n_{G,H}} )^⊤
                         = ( 1, . . . , 1 | . . . | G, . . . , G | . . . . . . | 1, . . . , 1 | . . . | G, . . . , G )^⊤,

    (W_1, . . . , W_n)^⊤ = ( W_{1,1,1}, . . . , W_{1,1,n_{1,1}} | . . . | W_{G,1,1}, . . . , W_{G,1,n_{G,1}} | . . . . . .
                             | W_{1,H,1}, . . . , W_{1,H,n_{1,H}} | . . . | W_{G,H,1}, . . . , W_{G,H,n_{G,H}} )^⊤
                         = ( 1, . . . , 1 | . . . | 1, . . . , 1 | . . . . . . | H, . . . , H | . . . | H, . . . , H )^⊤,

    Y = (Y_1, . . . , Y_n)^⊤ = ( Y_{1,1,1}, . . . , Y_{1,1,n_{1,1}} | . . . | Y_{G,1,1}, . . . , Y_{G,1,n_{G,1}} | . . . . . .
                                 | Y_{1,H,1}, . . . , Y_{1,H,n_{1,H}} | . . . | Y_{G,H,1}, . . . , Y_{G,H,n_{G,H}} )^⊤.
In the same way, a triple subscript will also be used with the covariates V, i.e., V_1, . . . , V_n ≡
V_{1,1,1}, . . . , V_{G,H,n_{G,H}}. Further, let

    n_{g•} = Σ_{h=1}^H n_{g,h},   g = 1, . . . , G,

denote the number of data points with Z = g and similarly, let

    n_{•h} = Σ_{g=1}^G n_{g,h},   h = 1, . . . , H,

denote the number of data points with W = h.
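The two-way counts n_{g,h} and their margins n_{g•}, n_{•h} are just a contingency table of Z against W. A small numpy sketch with hypothetical covariate vectors (chosen so that every cell is non-empty, matching the assumption n_{g,h} > 0):

```python
import numpy as np

# Hypothetical covariates Z in {1..G}, W in {1..H}.
Z = np.array([1, 1, 2, 2, 2, 3, 3, 1, 2, 3])
W = np.array([1, 2, 1, 1, 2, 2, 1, 1, 2, 2])
G, H = Z.max(), W.max()

# Two-way sample sizes n_{g,h}.
n_gh = np.zeros((G, H), dtype=int)
for g, h in zip(Z, W):
    n_gh[g - 1, h - 1] += 1

n_g_dot = n_gh.sum(axis=1)   # n_{g.}: number of data points with Z = g
n_dot_h = n_gh.sum(axis=0)   # n_{.h}: number of data points with W = h
print(n_gh.sum() == len(Z))  # True
```

The margins sum to n from both directions, which is a convenient sanity check on the table.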
Analogously to Section 8.3, it will be useful to view, for a fixed v ∈ V, the values of the regression
function (8.17) for z ∈ {1, . . . , G} and w ∈ {1, . . . , H} as a set of G · H conditional (given V)
group means:

    m(g, h, v) = E( Y | Z = g, W = h, V = v ) =: m_{g,h}(v),   g = 1, . . . , G, h = 1, . . . , H. (8.35)

Let, for v ∈ V, m(v) be a vector with elements (8.35):

    m(v) = ( m_{1,1}(v), . . . , m_{G,1}(v), . . . . . . , m_{1,H}(v), . . . , m_{G,H}(v) )^⊤.

Note that under our assumptions concerning the factorization (8.10) of the regression function and
an additive effect of the covariates V, we have

    m_{g,h}(v) = m_{g,h} + m_V(v),   g = 1, . . . , G, h = 1, . . . , H,
    m(v) = m + m_V(v),                                                   (8.36)

where m_{g,h} := m^{ZW}(g, h) in (8.10) and

    m = ( m_{1,1}, . . . , m_{G,1}, . . . . . . , m_{1,H}, . . . , m_{G,H} )^⊤.

Since m = m(v) whenever v ∈ V is such that m_V(v) = 0, we will in this section call the vector m
the vector of baseline group means. Further, since they relate to a division of the data according to
the values of two covariates (factors), we will also call them two-way classified (baseline)
group means.
Notation. In the following, we will additionally use the notation:

    m̄ := (1 / (G · H)) Σ_{g=1}^G Σ_{h=1}^H m_{g,h},

    m̄_{g•} := (1/H) Σ_{h=1}^H m_{g,h},   g = 1, . . . , G,

    m̄_{•h} := (1/G) Σ_{g=1}^G m_{g,h},   h = 1, . . . , H.

8.7.1 Linear model parameterization of two-way classified group means
Let, as usual, µ denote the vector of the response expectations, where again a triple subscript will be
used, i.e.,

    µ = E( Y | Z, W, V ) = ( µ_{1,1,1}, . . . , µ_{G,H,n_{G,H}} )^⊤.

Due to (8.36), this response expectation is factorized as

    µ = ( m_{1,1}(V_{1,1,1}), . . . , m_{1,1}(V_{1,1,n_{1,1}}) | . . . | m_{G,1}(V_{G,1,1}), . . . , m_{G,1}(V_{G,1,n_{G,1}}) | . . . . . .
          | m_{1,H}(V_{1,H,1}), . . . , m_{1,H}(V_{1,H,n_{1,H}}) | . . . | m_{G,H}(V_{G,H,1}), . . . , m_{G,H}(V_{G,H,n_{G,H}}) )^⊤

      = ( m_{1,1}, . . . , m_{1,1} | . . . | m_{G,1}, . . . , m_{G,1} | . . . . . . | m_{1,H}, . . . , m_{1,H} | . . . | m_{G,H}, . . . , m_{G,H} )^⊤
        └──────────────────────────────── =: µ^{ZW} ────────────────────────────────┘

      + ( m_V(V_{1,1,1}), . . . , m_V(V_{1,1,n_{1,1}}) | . . . | m_V(V_{G,1,1}), . . . , m_V(V_{G,1,n_{G,1}}) | . . . . . .
          | m_V(V_{1,H,1}), . . . , m_V(V_{1,H,n_{1,H}}) | . . . | m_V(V_{G,H,1}), . . . , m_V(V_{G,H,n_{G,H}}) )^⊤.   (8.37)
        └──────────────────────────────── =: µ^V ────────────────────────────────┘
Note that the vector µ^{ZW} from (8.37) corresponds to the factor m^{ZW} in the regression function
(8.10). It is now our aim to parameterize µ^{ZW} for use in a linear model, i.e., as

    µ^{ZW} = X^{ZW} β,

where β is a vector of regression coefficients. Further note that only the baseline group means
appear in the expression of µ^{ZW}, and we have

    µ^{ZW} = ( m_{1,1} 1^⊤_{n_{1,1}}, . . . , m_{G,1} 1^⊤_{n_{G,1}}, . . . . . . , m_{1,H} 1^⊤_{n_{1,H}}, . . . , m_{G,H} 1^⊤_{n_{G,H}} )^⊤.
The situation is basically the same as in the case of a single categorical covariate in Section 7.4 if we
view each of the G · H combinations of the Z and W covariates as one of the values of a new
categorical covariate with G · H levels labeled by double indices (1, 1), . . . , (G, H). The following
facts then directly follow from Section 7.4 (given our assumption that n_{g,h} > 0 for all (g, h)).
• Matrix X^{ZW} must have rank G · H, i.e., at least k = G · H columns, and its choice simplifies
into selecting a (G · H) × k matrix X̃^{ZW} such that

    X̃^{ZW} = ( x^⊤_{1,1}                 ( 1_{n_{1,1}} ⊗ x^⊤_{1,1}
                  ⋮                           ⋮
                x^⊤_{G,1}                   1_{n_{G,1}} ⊗ x^⊤_{G,1}
                − − −      leading to       − − − − − −
                  ⋮         X^{ZW} =          ⋮
                − − −                       − − − − − −
                x^⊤_{1,H}                   1_{n_{1,H}} ⊗ x^⊤_{1,H}
                  ⋮                           ⋮
                x^⊤_{G,H} ),                1_{n_{G,H}} ⊗ x^⊤_{G,H} ).

• rank( X^{ZW} ) = rank( X̃^{ZW} ).

• Matrix X̃^{ZW} parameterizes the baseline group means as

    m_{g,h} = x^⊤_{g,h} β,   g = 1, . . . , G, h = 1, . . . , H,   i.e.,   m = X̃^{ZW} β.
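The expansion of X̃^{ZW} into X^{ZW} by repeating each group-mean row n_{g,h} times can be sketched directly (hypothetical rows x^⊤_{g,h} and cell counts n_{g,h}; the rank equality from the second bullet is checked at the end):

```python
import numpy as np

# Sketch: expand a (G*H) x k matrix of group-mean rows x_{g,h}^T into the
# model matrix XZW by repeating row (g,h) exactly n_{g,h} times, i.e., the
# block 1_{n_{g,h}} kron x_{g,h}^T.
G, H = 2, 3
Xt = np.arange(G * H * 2, dtype=float).reshape(G * H, 2)  # hypothetical x_{g,h}^T rows
n_gh = np.array([2, 1, 3, 2, 1, 2])                       # n_{g,h} in the row order of Xt

XZW = np.vstack([np.kron(np.ones((n, 1)), row) for n, row in zip(n_gh, Xt)])
print(XZW.shape == (n_gh.sum(), 2))                             # True
print(np.linalg.matrix_rank(XZW) == np.linalg.matrix_rank(Xt))  # True
```

Repeating rows never changes the column space, which is exactly why rank(X^{ZW}) = rank(X̃^{ZW}).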
If purely the parameterization of the vector m of the baseline group means is of interest, the matrix X̃^{ZW}
can be chosen using the methods discussed in Section 7.4 applied to a combined categorical covariate
with G · H levels. Nevertheless, it is often of interest to decompose, in a certain sense, the influence
of the original covariates Z and W on the response expectation, and this will be provided, as we
shall show, by the interaction model built using the common guidelines introduced in Section 8.4.
8.7.2 ANOVA parameterization of two-way classified group means

We first mention a parameterization of two-way classified group means which only
partly corresponds to the common guidelines of Section 8.4; nevertheless, it can be encountered in
practice and it directly generalizes the ANOVA parameterization of one-way classified group means
introduced in Section 7.4.3. The ANOVA parameterization of two-way classified group means is
given as

    m_{g,h} = α_0 + α^Z_g + α^W_h + α^{ZW}_{g,h},   g = 1, . . . , G, h = 1, . . . , H,   (8.38)

where the vector of regression coefficients α = ( α_0, α^Z, α^W, α^{ZW} )^⊤ is composed of

• the intercept term α_0;
• the main effects α^Z = ( α^Z_1, . . . , α^Z_G )^⊤ of the covariate Z;
• the main effects α^W = ( α^W_1, . . . , α^W_H )^⊤ of the covariate W;
• the interaction effects α^{ZW} = ( α^{ZW}_{1,1}, . . . , α^{ZW}_{G,1}, . . . . . . , α^{ZW}_{1,H}, . . . , α^{ZW}_{G,H} )^⊤.
Interaction effects and no effect modification

The interaction effects α^{ZW} allow for possible mutual effect modification of the two categorical
covariates Z and W. With parameterization (8.38) of the baseline group means, we have for any
g_1, g_2 ∈ {1, . . . , G}, h ∈ {1, . . . , H} and any v ∈ V:

    E( Y | Z = g_1, W = h, V = v ) − E( Y | Z = g_2, W = h, V = v )
      = m_{g_1,h}(v) − m_{g_2,h}(v) = m_{g_1,h} − m_{g_2,h} = α^Z_{g_1} − α^Z_{g_2} + α^{ZW}_{g_1,h} − α^{ZW}_{g_2,h}.

That is, if the value of the Z covariate is changed, the change of the response expectation possibly
depends, through the interaction terms, on the value of the W covariate. Similarly if we express
a change in the response expectation due to a change in the second categorical covariate W: for
any h_1, h_2 ∈ {1, . . . , H}, g ∈ {1, . . . , G} and any v ∈ V:

    E( Y | Z = g, W = h_1, V = v ) − E( Y | Z = g, W = h_2, V = v )
      = m_{g,h_1}(v) − m_{g,h_2}(v) = m_{g,h_1} − m_{g,h_2} = α^W_{h_1} − α^W_{h_2} + α^{ZW}_{g,h_1} − α^{ZW}_{g,h_2}.

The above expressions also show that the hypothesis of no effect modification is given by equality of all
interaction effects, i.e.,

    H_0: α^{ZW}_{1,1} = · · · = α^{ZW}_{G,H}.   (8.39)
Linear model parameterization

In matrix form, parameterization (8.38) is

    m = X̃^{ZW}_α α,   α = ( α_0, α^Z, α^W, α^{ZW} )^⊤,

where

    X̃^{ZW}_α = ( 1_{G·H} | 1_H ⊗ I_G | I_H ⊗ 1_G | I_H ⊗ I_G )   (8.40)
             = ( 1_H ⊗ 1_G | 1_H ⊗ C̃ | D̃ ⊗ 1_G | D̃ ⊗ C̃ ),   (8.41)

with 1_H ⊗ 1_G = 1_{G·H} and

    C̃ = I_G,   D̃ = I_H.

Matrix X̃^{ZW}_α is a (G · H) × (1 + G + H + G · H) matrix and its rank is hence at most
G · H (it is indeed precisely equal to G · H). That is, matrix X̃^{ZW}_α provides a less-than-full-rank
parameterization of the two-way classified group means. That is, we have

    m = α_0 1_{G·H} + ( 1_H ⊗ C̃ ) α^Z + ( D̃ ⊗ 1_G ) α^W + ( D̃ ⊗ C̃ ) α^{ZW}.   (8.42)
(8.42)
Lemma 8.2 Column rank of a matrix that parameterizes two-way classified group
means.
e ZW being divided into blocks as
Matrix X
e ZW = 1H ⊗ 1G 1H ⊗ C
e D
e ⊗ 1G D
e ⊗C
e
X
e and 1H , D
e . That is,
has the column rank given by a product of column ranks of matrices 1G , C
e ZW = col-rank 1G , C
e
e .
col-rank X
· col-rank 1H , D
Proof. Proof/calculations below are shown only for those who are interested.
8.7. INTERACTION OF TWO CATEGORICAL COVARIATES
132
e ZW is upon suitable reordering of columns (which does not
By point (x) of Theorem A.3, matrix X
have any influence on the rank of the matrix) equal to a matrix
ZW
e
e
e ⊗ 1G , C
e .
Xreord = 1H ⊗ 1G , C D
Further, by point (ix) of Theorem A.3:
e ZW = 1H , D
e ⊗ 1G , C
e .
X
reord
Finally, by point (xi) of Theorem A.3:
e
e .
e ZW = col-rank X
e ZW = col-rank 1G , C
·
col-rank
1
,
D
col-rank X
H
reord
k
e ZW given by (8.40) is indeed
Lemma 8.2 can now be used to get easily that the rank of the matrix X
α
G · H and hence it can be used to parameterize G · H two-way classified group means.
Sum constraints identification

The deficiency in the rank of the matrix X̃^{ZW}_α is 1 + G + H. By Scheffé's theorem on identification in
a linear model (Theorem 7.1), (1 + G + H) (or more) linear constraints on the regression coefficients
α are needed to identify the vector α in the related linear model. In practice, the following set of
(2 + G + H) constraints is often used:

    Σ_{g=1}^G α^Z_g = 0,                         Σ_{h=1}^H α^W_h = 0,
                                                                                (8.43)
    Σ_{h=1}^H α^{ZW}_{g,h} = 0, g = 1, . . . , G,   Σ_{g=1}^G α^{ZW}_{g,h} = 0, h = 1, . . . , H,

which in matrix notation is written as A α = 0_{2+G+H}, where

    A = ( 0    1^⊤_G   0^⊤_H    0^⊤_G  . . .  0^⊤_G
          0    0^⊤_G   1^⊤_H    0^⊤_G  . . .  0^⊤_G
          0_G  0_{G×G} 0_{G×H}  I_G    . . .  I_G
          0    0^⊤_G   0^⊤_H    1^⊤_G  . . .  0^⊤_G
          ⋮      ⋮       ⋮        ⋮     ⋱      ⋮
          0    0^⊤_G   0^⊤_H    0^⊤_G  . . .  1^⊤_G ).

We leave it as an exercise in linear algebra to verify that the (2 + G + H) × (1 + G + H + G · H)
matrix A satisfies the conditions of Scheffé's theorem, i.e.,

    rank( A ) = 1 + G + H,   M( A^⊤ ) ∩ M( X̃^{ZW⊤}_α ) = {0}.

The coefficients α identified by the set of constraints (8.43) have the following useful interpretation
(easy to see using simple algebra with expressions (8.38) while taking into account the constraints):

    α_0 = m̄,
    α^Z_g = m̄_{g•} − m̄,   g = 1, . . . , G,
    α^W_h = m̄_{•h} − m̄,   h = 1, . . . , H,
    α^{ZW}_{g,h} = m_{g,h} − m̄_{g•} − m̄_{•h} + m̄,   g = 1, . . . , G, h = 1, . . . , H,

from which it also follows that

    α^Z_{g_1} − α^Z_{g_2} = m̄_{g_1•} − m̄_{g_2•},   g_1, g_2 = 1, . . . , G,   (8.44)
    α^W_{h_1} − α^W_{h_2} = m̄_{•h_1} − m̄_{•h_2},   h_1, h_2 = 1, . . . , H.   (8.45)

That is, the difference between two main effects of the Z covariate, Eq. (8.44), provides the
mean effect of changing the Z covariate if the mean is taken over possible values of its effect
modifier W. Analogously, the difference between two main effects of the W covariate, Eq.
(8.45), provides the mean effect of changing the W covariate if the mean is taken over possible
values of its effect modifier Z. Both (8.44) and (8.45) are important quantities especially in the area
of designed (industrial, agricultural, . . . ) experiments aiming at evaluating the effect of two factors
on the response.
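The sum-constrained ANOVA coefficients are exactly the centred row, column and cell effects of the table of group means, and they reproduce the table. A numpy sketch with a hypothetical 2 × 3 table of means m_{g,h}:

```python
import numpy as np

# Hypothetical baseline group means m_{g,h}, G = 2, H = 3.
M = np.array([[1.0, 2.0, 4.0],
              [0.5, 3.0, 2.5]])
m_bar = M.mean()
alpha_Z = M.mean(axis=1) - m_bar          # row effects: m_{g.}bar - m_bar
alpha_W = M.mean(axis=0) - m_bar          # column effects: m_{.h}bar - m_bar
alpha_ZW = M - M.mean(axis=1, keepdims=True) - M.mean(axis=0, keepdims=True) + m_bar

# The sum constraints (8.43) hold ...
print(np.isclose(alpha_Z.sum(), 0), np.isclose(alpha_W.sum(), 0))
print(np.allclose(alpha_ZW.sum(axis=0), 0), np.allclose(alpha_ZW.sum(axis=1), 0))
# ... and decomposition (8.38) reproduces the group means.
print(np.allclose(m_bar + alpha_Z[:, None] + alpha_W[None, :] + alpha_ZW, M))
```

This is the familiar two-way ANOVA decomposition of a table into grand mean, margins, and residual cell effects.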
8.7.3 Full-rank parameterization of two-way classified group means

Suppose that the categorical covariate Z is parameterized by means of a chosen G × (G − 1)
(pseudo)contrast matrix

    C = ( c^⊤_1 ; ⋮ ; c^⊤_G ),   c_g = ( c_{g,1}, . . . , c_{g,G−1} )^⊤,

and the categorical covariate W is parameterized by means of a chosen H × (H − 1)
(pseudo)contrast matrix

    D = ( d^⊤_1 ; ⋮ ; d^⊤_H ),   d_h = ( d_{h,1}, . . . , d_{h,H−1} )^⊤.

Note that we do not require that matrices C and D be based on (pseudo)contrasts of the same
type. Let

    X̃^{ZW}_β = ( 1_H ⊗ 1_G | 1_H ⊗ C | D ⊗ 1_G | D ⊗ C ),   (8.46)

whose row corresponding to the group combination (g, h) is

    ( 1, c^⊤_g, d^⊤_h, d^⊤_h ⊗ c^⊤_g ),   g = 1, . . . , G, h = 1, . . . , H,   (8.47)

which is a matrix with G · H rows and 1 + (G − 1) + (H − 1) + (G − 1)(H − 1) = G · H columns,
and its structure is the same as the structure of the matrix (8.41), where we now take

    C̃ = C,   D̃ = D.

Using Lemma 8.2 and the properties of (pseudo)contrast matrices, we have

    col-rank( X̃^{ZW}_β ) = col-rank( 1_G, C ) · col-rank( 1_H, D ) = G · H.

That is, the matrix X̃^{ZW}_β is of full rank G · H and hence can be used to parameterize the two-way
classified group means as

    m = X̃^{ZW}_β β,   β = ( β_0, β^Z, β^W, β^{ZW} )^⊤,

where

    β^Z = ( β^Z_1, . . . , β^Z_{G−1} )^⊤,   β^W = ( β^W_1, . . . , β^W_{H−1} )^⊤,
    β^{ZW} = ( β^{ZW}_{1,1}, . . . , β^{ZW}_{G−1,1}, . . . . . . , β^{ZW}_{1,H−1}, . . . , β^{ZW}_{G−1,H−1} )^⊤.

We can also write

    m = β_0 1_{G·H} + ( 1_H ⊗ C ) β^Z + ( D ⊗ 1_G ) β^W + ( D ⊗ C ) β^{ZW},

    m_{g,h} = β_0 + c^⊤_g β^Z + d^⊤_h β^W + ( d^⊤_h ⊗ c^⊤_g ) β^{ZW},   g = 1, . . . , G, h = 1, . . . , H.   (8.48)

Different choices of the (pseudo)contrast matrices C and D lead to different interpretations of the
regression coefficients β.
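The construction (8.46) and the rank claim can be checked directly with numpy Kronecker products (sum contrasts are assumed for both factors in this sketch; any other (pseudo)contrast choice gives the same rank):

```python
import numpy as np

# Sketch of (8.46): X~ZW_beta = [1_{GH}, 1_H kron C, D kron 1_G, D kron C],
# whose rank should equal G*H (Lemma 8.2).
G, H = 3, 4
C = np.vstack([np.eye(G - 1), -np.ones(G - 1)])   # G x (G-1) sum contrast matrix
D = np.vstack([np.eye(H - 1), -np.ones(H - 1)])   # H x (H-1) sum contrast matrix

ones_G, ones_H = np.ones((G, 1)), np.ones((H, 1))
X = np.hstack([np.kron(ones_H, ones_G),
               np.kron(ones_H, C),
               np.kron(D, ones_G),
               np.kron(D, C)])
print(X.shape == (G * H, G * H))          # True
print(np.linalg.matrix_rank(X) == G * H)  # True
```

Since (1_G, C) and (1_H, D) are both square and non-singular for a valid (pseudo)contrast choice, Lemma 8.2 gives col-rank G · H, which the rank computation confirms.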
If we take expression (8.47) of the matrix X̃^{ZW}_β, it is directly seen that it can also be written as

    X̃^{ZW}_β = ( 1_{G·H}, S̃^Z, S̃^W, S̃^{ZW} ),

where

    S̃^Z = 1_H ⊗ C,   S̃^W = D ⊗ 1_G,   S̃^{ZW} = S̃^Z : S̃^W.

Similarly, a matrix X^{ZW}_β which parameterizes the vector µ^{ZW} (in which a value m_{g,h} is repeated
n_{g,h} times) is factorized as

    X^{ZW}_β = ( 1_n, S^Z, S^W, S^{ZW} ),   (8.49)

where S^Z and S^W are obtained from matrices S̃^Z and S̃^W, respectively, by appropriately repeating
their rows, and S^{ZW} = S^Z : S^W. That is, the model matrix (8.49) is precisely of the form given in
(8.16) that we used to parameterize a linear model with interactions. In the context of this chapter, the
(pseudo)contrast matrices C and D play the roles of the parameterizations s^Z in (8.14) and
s^W in (8.15), respectively.
8.7.4 Relationship between the full-rank and ANOVA parameterizations

With the full-rank parameterization of the two-way classified group means, expression (8.48) shows
that we can also write

    m_{g,h} = α_0 + α^Z_g + α^W_h + α^{ZW}_{g,h},   g = 1, . . . , G, h = 1, . . . , H,   (8.50)

where

    α_0 := β_0,
    α^Z_g := c^⊤_g β^Z,   g = 1, . . . , G,
    α^W_h := d^⊤_h β^W,   h = 1, . . . , H,                                       (8.51)
    α^{ZW}_{g,h} := ( d^⊤_h ⊗ c^⊤_g ) β^{ZW},   g = 1, . . . , G, h = 1, . . . , H.

That is, the chosen full-rank parameterization of the two-way classified group means corresponds to the
ANOVA parameterization (8.50) in which the (1 + G + H + G · H) regression coefficients α are uniquely
obtained from the G · H coefficients β of the full-rank parameterization using the relationships (8.51). In
other words, expressions (8.51) correspond to identifying constraints on α in the less-than-full-rank
parameterization. Note also that in matrix notation, (8.51) can be written as

    α_0 := β_0,   α^Z := C β^Z,   α^W := D β^W,   α^{ZW} := ( D ⊗ C ) β^{ZW}.   (8.52)
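Relations (8.52) are easy to sketch numerically: map an arbitrary β to α and check that the ANOVA form (8.50) and the full-rank form (8.48) give the same group means (reference group pseudocontrasts are assumed here; the β values are random placeholders):

```python
import numpy as np

# Sketch of (8.52): map full-rank coefficients beta to ANOVA coefficients
# alpha and compare (8.48) with (8.50) cell by cell.
G, H = 3, 2
C = np.vstack([np.zeros(G - 1), np.eye(G - 1)])   # reference group pseudocontrasts
D = np.vstack([np.zeros(H - 1), np.eye(H - 1)])

rng = np.random.default_rng(0)
b0 = rng.random()
bZ, bW = rng.random(G - 1), rng.random(H - 1)
bZW = rng.random((G - 1) * (H - 1))

aZ, aW = C @ bZ, D @ bW
aZW = np.kron(D, C) @ bZW       # ordered as in (8.41): g runs fastest within h

for h in range(H):
    for g in range(G):
        m_full = b0 + C[g] @ bZ + D[h] @ bW + np.kron(D[h], C[g]) @ bZW  # (8.48)
        m_anova = b0 + aZ[g] + aW[h] + aZW[h * G + g]                    # (8.50)
        assert np.isclose(m_full, m_anova)
print("OK")
```

The index `h * G + g` reflects the Kronecker ordering of D ⊗ C, which matches the ordering of the vector m used throughout this section.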
Interaction effects and no effect modification

The hypothesis of no effect modification for the covariates Z and W is given as H_0: α^{ZW}_{1,1} = · · · = α^{ZW}_{G,H},
see (8.39), which can also be written as

    H_0: α^{ZW} = a 1_{G·H}   for some a ∈ R.

Taking into account (8.52), α^{ZW} = a 1 for some a ∈ R if and only if ( D ⊗ C ) β^{ZW} = a 1. Due
to the fact that 1 ∉ M( D ⊗ C ) (if both C and D are (pseudo)contrast matrices), this is only
possible with a = 0 and β^{ZW} = 0. Hence with the full-rank parameterization of the two-way
classified group means, the hypothesis of no effect modification is

    H_0: β^{ZW} = 0_{(G−1)(H−1)}.
8.7.5 Additivity of two categorical covariates

Suppose that the categorical covariates Z and W act additively. That is, it can be assumed that
α^{ZW} = a 1 for some a ∈ R and hence the baseline group means can be parameterized as

    m_{g,h} = α_0 + α^Z_g + α^W_h,   g = 1, . . . , G, h = 1, . . . , H,   (8.53)

where, as before, α^Z = ( α^Z_1, . . . , α^Z_G )^⊤ is a vector of main effects of the covariate Z, α^W =
( α^W_1, . . . , α^W_H )^⊤ is a vector of main effects of the covariate W, and

    α = ( α_0, α^Z, α^W )^⊤

is the vector of regression coefficients. Even though this situation was in fact treated in a more general
way in Section 8.3 (covariate W in this section plays the role of the covariate vector V in Section 8.3), it
will be useful to revisit it in this specific case when both covariates at hand are categorical.

Parameterization (8.53) written in matrix notation is

    m = X̃^{Z+W}_α α,   X̃^{Z+W}_α = ( 1_H ⊗ 1_G | 1_H ⊗ I_G | I_H ⊗ 1_G ).

Matrix X̃^{Z+W}_α has 1 + G + H columns but its rank is only

    rank( X̃^{Z+W}_α ) = 1 + (G − 1) + (H − 1) = G + H − 1.

Analogously to the interaction model and also analogously to Section 8.3, a full-rank parameterization
of the two-way classified group means (8.53) under the additivity assumption is

    m_{g,h} = β_0 + c^⊤_g β^Z + d^⊤_h β^W,   g = 1, . . . , G, h = 1, . . . , H,   (8.54)

where c^⊤_1, . . . , c^⊤_G are rows of a G × (G − 1) (pseudo)contrast matrix C that parameterizes the
categorical covariate Z, d^⊤_1, . . . , d^⊤_H are rows of an H × (H − 1) (pseudo)contrast matrix D
that parameterizes the categorical covariate W, and

    β = ( β_0, β^Z_1, . . . , β^Z_{G−1}, β^W_1, . . . , β^W_{H−1} )^⊤,
                └── =: β^Z ──┘  └── =: β^W ──┘

are the regression coefficients. Parameterization (8.54) written in matrix form is

    m = X̃^{Z+W}_β β,   X̃^{Z+W}_β = ( 1_H ⊗ 1_G | 1_H ⊗ C | D ⊗ 1_G ).   (8.55)

Since X̃^{Z+W}_β is a matrix with 1 + (G − 1) + (H − 1) = G + H − 1 columns obtained from
the full-rank matrix X̃^{ZW}_β, Eq. (8.46), by omitting its last (G − 1)(H − 1) columns, matrix X̃^{Z+W}_β is
indeed of full rank G + H − 1.

Analogously to the interaction model, coefficients β from the full-rank parameterization correspond
to certain identifying constraints imposed on the coefficients α of the less-than-full-rank parameterization.
Their mutual relationship is basically the same as in Section 8.7.4 and is given by the
first three expressions from (8.51).
Partial effects

If the two-way classified baseline group means satisfy additivity, we have for any g_1, g_2 ∈ {1, . . . , G},
h ∈ {1, . . . , H} and any v ∈ V:

    E( Y | Z = g_1, W = h, V = v ) − E( Y | Z = g_2, W = h, V = v )
      = m_{g_1,h}(v) − m_{g_2,h}(v) = m_{g_1,h} − m_{g_2,h} = α^Z_{g_1} − α^Z_{g_2} = ( c_{g_1} − c_{g_2} )^⊤ β^Z.

In agreement with Section 8.3.1, the main effects α^Z or β^Z are referred to as partial effects of
the categorical covariate Z. Remember that interpretation of the particular values α^Z_1, . . . , α^Z_G depends on
the considered identifying constraints, while interpretation of the particular values β^Z_1, . . . , β^Z_{G−1} depends on
the choice of a (pseudo)contrast matrix C.

Due to additivity, we also have that, still for arbitrary h ∈ {1, . . . , H}:

    m_{g_1,h} − m_{g_2,h} = m̄_{g_1•} − m̄_{g_2•} = α^Z_{g_1} − α^Z_{g_2} = ( c_{g_1} − c_{g_2} )^⊤ β^Z.

Interpretation of the individual coefficients is then the same as already explained in Section 8.3.2.
Similarly, we have for any h_1, h_2 ∈ {1, . . . , H}, g ∈ {1, . . . , G} and any v ∈ V:

    E( Y | Z = g, W = h_1, V = v ) − E( Y | Z = g, W = h_2, V = v )
      = m_{g,h_1}(v) − m_{g,h_2}(v) = m_{g,h_1} − m_{g,h_2} = m̄_{•h_1} − m̄_{•h_2}
      = α^W_{h_1} − α^W_{h_2} = ( d_{h_1} − d_{h_2} )^⊤ β^W.

8.7.6 Interpretation of model parameters for selected choices of (pseudo)contrasts
The chosen (pseudo)contrast matrices C and D determine the interpretation of the coefficients β of the
full-rank parameterization (8.48) of the two-way classified group means and also of the coefficients
α given by (8.51), and determine an ANOVA parameterization with a certain identification. For
interpretation, it is useful to view the two-way classified group means as entries of the G × H
matrix

    M = ( m_{1,1}  . . .  m_{1,H}
            ⋮      ⋱       ⋮
          m_{G,1}  . . .  m_{G,H} ).

Corresponding sample sizes n_{g,h}, g = 1, . . . , G, h = 1, . . . , H, form a G × H contingency table
based on the values of the two categorical covariates Z and W. With the ANOVA parameterization
(8.50), the main effects α^Z = C β^Z and α^W = D β^W can be interpreted as the row and column
effects in matrix M, respectively. The interaction effects α^{ZW} = ( D ⊗ C ) β^{ZW} can be put into
a matrix

    A^{ZW} = ( α^{ZW}_{1,1}  . . .  α^{ZW}_{1,H}        ( ( d^⊤_1 ⊗ c^⊤_1 ) β^{ZW}  . . .  ( d^⊤_H ⊗ c^⊤_1 ) β^{ZW}
                  ⋮         ⋱       ⋮           =              ⋮                  ⋱            ⋮
               α^{ZW}_{G,1}  . . .  α^{ZW}_{G,H} )      ( d^⊤_1 ⊗ c^⊤_G ) β^{ZW}  . . .  ( d^⊤_H ⊗ c^⊤_G ) β^{ZW} ),

whose entries can be interpreted as cell effects in matrix M. In other words, each of the values in
M is obtained as a sum of the intercept term α_0 and the corresponding row, column and cell effects.
As was already mentioned, the two (pseudo)contrast matrices C and D can be of different
types, e.g., matrix C being the reference group pseudocontrast matrix and matrix D being the sum
contrast matrix. Nevertheless, in practice, both of them are mostly chosen to be of the same
type. In the remainder of this section, we shall discuss interpretation of the model parameters for the two
most common choices of the (pseudo)contrasts, which are (i) the reference group pseudocontrasts
and (ii) the sum contrasts.
Reference group pseudocontrasts

Suppose that both C and D are reference group pseudocontrast matrices, i.e.,

    C = ( c^⊤_1 ; ⋮ ; c^⊤_G ) = ( 0 . . . 0
                                  1 . . . 0
                                  ⋮  ⋱  ⋮
                                  0 . . . 1 ) = ( 0^⊤_{G−1} ; I_{G−1} ),

    D = ( d^⊤_1 ; ⋮ ; d^⊤_H ) = ( 0^⊤_{H−1} ; I_{H−1} ).

We have

    α^Z = C β^Z = ( 0, β^Z_1, . . . , β^Z_{G−1} )^⊤,   α^W = D β^W = ( 0, β^W_1, . . . , β^W_{H−1} )^⊤.   (8.56)

To get the link between the full-rank interaction terms β^{ZW} and their ANOVA counterparts α^{ZW},
we have to explore the form of the vectors d^⊤_h ⊗ c^⊤_g, g = 1, . . . , G, h = 1, . . . , H. With the reference
group pseudocontrasts, we easily see that

    d^⊤_h ⊗ c^⊤_1 = 0^⊤,   for all h = 1, . . . , H,
    d^⊤_1 ⊗ c^⊤_g = 0^⊤,   for all g = 1, . . . , G,
    d^⊤_h ⊗ c^⊤_g = ( 0, . . . , 1, . . . , 0 ),   if g ≠ 1 and h ≠ 1,
      with 1 at the place that in ( d^⊤_h ⊗ c^⊤_g ) β^{ZW} multiplies β^{ZW}_{g−1,h−1}.

Hence

    A^{ZW} = ( α^{ZW}_{1,1}  α^{ZW}_{1,2}  . . .  α^{ZW}_{1,H}      ( 0       0              . . .  0
                 ⋮             ⋮          ⋱       ⋮          =      0  β^{ZW}_{1,1}        . . .  β^{ZW}_{1,H−1}
               α^{ZW}_{G,1}  α^{ZW}_{G,2}  . . .  α^{ZW}_{G,H} )      ⋮       ⋮              ⋱       ⋮
                                                                    0  β^{ZW}_{G−1,1}      . . .  β^{ZW}_{G−1,H−1} ).

In summary, the ANOVA coefficients are identified by a set of 3 + (H − 1) + (G − 1) = G + H + 1
constraints

    α^Z_1 = 0,   α^W_1 = 0,   α^{ZW}_{1,1} = 0,
    α^{ZW}_{1,h} = 0,   h = 2, . . . , H,
    α^{ZW}_{g,1} = 0,   g = 2, . . . , G.

The first two constraints come from (8.56); the remaining ones correspond to the zeros in the matrix A^{ZW}.
While taking into account the parameterization m_{g,h} = α_0 + α^Z_g + α^W_h + α^{ZW}_{g,h}, g = 1, . . . , G, h =
1, . . . , H, of the two-way classified group means, where the parameters satisfy the above constraints,
we get their following interpretation:

    α_0 = β_0 = m_{1,1},
    α^Z_g = β^Z_{g−1} = m_{g,1} − m_{1,1},   g = 2, . . . , G,
    α^W_h = β^W_{h−1} = m_{1,h} − m_{1,1},   h = 2, . . . , H,
    α^{ZW}_{g,h} = β^{ZW}_{g−1,h−1} = m_{g,h} − m_{g,1} − m_{1,h} + m_{1,1},   g = 2, . . . , G, h = 2, . . . , H.

Note (Reference group pseudocontrasts in the additive model).
If the additive model is assumed, where m_{g,h} = α_0 + α^Z_g + α^W_h, g = 1, . . . , G, h = 1, . . . , H, and
the reference group pseudocontrasts are used in the full-rank parameterization (8.55), the ANOVA
coefficients α = ( α_0, α^Z, α^W )^⊤ are obtained from the full-rank coefficients β = ( β_0, β^Z, β^W )^⊤
again by (8.56); that is, they are identified by the two constraints

    α^Z_1 = 0,   α^W_1 = 0.

Their interpretation becomes

    α_0 = β_0 = m_{1,1},
    α^Z_g = β^Z_{g−1} = m_{g,h} − m_{1,h}   (arbitrary h ∈ {1, . . . , H})   = m̄_{g•} − m̄_{1•},   g = 2, . . . , G,
    α^W_h = β^W_{h−1} = m_{g,h} − m_{g,1}   (arbitrary g ∈ {1, . . . , G})   = m̄_{•h} − m̄_{•1},   h = 2, . . . , H.
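The reference-group interpretation just derived can be verified numerically: from an arbitrary table of group means, form β via the stated formulas and check that (8.48) reproduces the table. A numpy sketch (hypothetical 2 × 3 table of means):

```python
import numpy as np

# Hypothetical baseline group means m_{g,h}, G = 2, H = 3.
M = np.array([[1.0, 2.0, 4.0],
              [0.5, 3.0, 2.5]])
G, H = M.shape

# Reference-group-pseudocontrast coefficients read off the table:
b0 = M[0, 0]                                           # m_{1,1}
bZ = M[1:, 0] - M[0, 0]                                # m_{g,1} - m_{1,1}
bW = M[0, 1:] - M[0, 0]                                # m_{1,h} - m_{1,1}
bZW = M[1:, 1:] - M[1:, [0]] - M[[0], 1:] + M[0, 0]    # cell effects

# Rebuild the table from (8.48): zero row/column effects for the reference level.
M_back = b0 + np.concatenate([[0], bZ])[:, None] \
            + np.concatenate([[0], bW])[None, :] \
            + np.pad(bZW, ((1, 0), (1, 0)))
print(np.allclose(M_back, M))  # True
```

This mirrors the usual reading of treatment-contrast coefficients: differences against the reference cell (1, 1) plus the difference-in-differences cell effects.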
Sum contrasts
Suppose now that both C and D

1
 .
 ..
C=

 0
−1
are sum contrasts, i.e.,

 > 
...
0
c1
 . 
.. 
..
 . 
.
.
 =  .  =

 > 
...
1
cG−1 
. . . −1
c>
G

 > 
1 ...
0
d1
 . .

 . 
.
..
.. 
 ..
 . 
 =  .  =
D=


 > 
1
 0 ...
dH−1 
−1 . . . −1
d>
H
IG−1
− 1>
G−1
!
,

We have,
IH−1
− 1>
H−1



β1Z
α1Z


..
 . 


 .. 
.



 = αZ = Cβ Z = 
,
 Z 
Z
 βG−1 
αG−1 


PG−1 Z
Z
αG
− g=1 βg
!
.


α1W


β1W



..
 . 


 .. 
.



 = αW = Dβ W = 
,
 W 
W


βH−1
αH−1 


P
W
H−1
W
αH
− h=1 βh
(8.57)
The form of the vectors d_h^T ⊗ c_g^T, g = 1, ..., G, h = 1, ..., H, needed to calculate the interaction terms α_{g,h}^{ZW}, is the following:

d_h^T ⊗ c_g^T = (0, ..., 1, ..., 0),    g = 1, ..., G − 1, h = 1, ..., H − 1,
    with the 1 on the place that in d_h^T ⊗ c_g^T β^{ZW} multiplies β_{g,h}^{ZW};

d_h^T ⊗ c_G^T = (0_{G-1}^T, ..., −1_{G-1}^T, ..., 0_{G-1}^T),    h = 1, ..., H − 1,
    with the −1_{G-1}^T block on the places that in d_h^T ⊗ c_G^T β^{ZW} multiply β_{•,h}^{ZW};

d_H^T ⊗ c_g^T = (0, ..., −1, ..., 0, ......, 0, ..., −1, ..., 0),    g = 1, ..., G − 1,
    with −1's on the places that in d_H^T ⊗ c_g^T β^{ZW} multiply β_{g,•}^{ZW};

d_H^T ⊗ c_G^T = (1, ..., 1) = 1_{(G-1)(H-1)}^T.
The G × H matrix A^{ZW} = (α_{g,h}^{ZW}) of the interaction effects hence has the entries

α_{g,h}^{ZW} = β_{g,h}^{ZW},    g = 1, ..., G − 1, h = 1, ..., H − 1,
α_{g,H}^{ZW} = −∑_{h=1}^{H-1} β_{g,h}^{ZW},    g = 1, ..., G − 1,
α_{G,h}^{ZW} = −∑_{g=1}^{G-1} β_{g,h}^{ZW},    h = 1, ..., H − 1,
α_{G,H}^{ZW} = ∑_{g=1}^{G-1} ∑_{h=1}^{H-1} β_{g,h}^{ZW}.
Note that the entries in each row of the matrix A^{ZW}, and also in each of its columns, sum to zero. Similarly, (8.57) shows that the elements of the main effects α^Z, and also the elements of the main effects α^W, sum to zero. The identifying constraints for the ANOVA coefficients α that correspond to the considered sum-contrast full-rank parameterization are hence

∑_{g=1}^{G} α_g^Z = 0,    ∑_{h=1}^{H} α_h^W = 0,
∑_{h=1}^{H} α_{g,h}^{ZW} = 0    for each g = 1, ..., G,        (8.58)
∑_{g=1}^{G} α_{g,h}^{ZW} = 0    for each h = 1, ..., H.

Note that in the set of G + H constraints on the interaction terms α_{g,h}^{ZW}, one constraint is redundant, and the last two rows could also be replaced by a set of (G − 1) + (H − 1) + 1 = G + H − 1 constraints:

∑_{h=1}^{H} α_{g,h}^{ZW} = 0    for each g = 1, ..., G − 1,
∑_{g=1}^{G} α_{g,h}^{ZW} = 0    for each h = 1, ..., H − 1,
∑_{g=1}^{G-1} ∑_{h=1}^{H-1} α_{g,h}^{ZW} = α_{G,H}^{ZW}.
We see that the set of equations (8.58) exactly corresponds to identification by the sum constraints (8.43), see Section 8.7.2. Hence the interpretation of the regression coefficients is the same as derived there, namely,

α_0 = m,
α_g^Z = m_{g•} − m,    g = 1, ..., G,
α_h^W = m_{•h} − m,    h = 1, ..., H,
α_{g,h}^{ZW} = m_{g,h} − m_{g•} − m_{•h} + m,    g = 1, ..., G, h = 1, ..., H,

and consequently

α_{g_1}^Z − α_{g_2}^Z = m_{g_1 •} − m_{g_2 •},    g_1, g_2 = 1, ..., G,
α_{h_1}^W − α_{h_2}^W = m_{•h_1} − m_{•h_2},    h_1, h_2 = 1, ..., H.
Note (Sum contrasts in the additive model).
If the additive model is assumed, where m_{g,h} = α_0 + α_g^Z + α_h^W, g = 1, ..., G, h = 1, ..., H, and the sum contrasts are used in the full-rank parameterization (8.55), the ANOVA coefficients α = (α_0, α^Z, α^W) are obtained from the full-rank coefficients β = (β_0, β^Z, β^W) again by (8.57); that is, they are identified by the two constraints

∑_{g=1}^{G} α_g^Z = 0,    ∑_{h=1}^{H} α_h^W = 0.

Their interpretation becomes

α_0 = m,
α_g^Z = m_{g,h} − m_{•h} (arbitrary h ∈ {1, ..., H}) = m_{g•} − m,    g = 1, ..., G,
α_h^W = m_{g,h} − m_{g•} (arbitrary g ∈ {1, ..., G}) = m_{•h} − m,    h = 1, ..., H,

α_{g_1}^Z − α_{g_2}^Z = m_{g_1,h} − m_{g_2,h} (arbitrary h ∈ {1, ..., H}) = m_{g_1 •} − m_{g_2 •},    g_1, g_2 = 1, ..., G,
α_{h_1}^W − α_{h_2}^W = m_{g,h_1} − m_{g,h_2} (arbitrary g ∈ {1, ..., G}) = m_{•h_1} − m_{•h_2},    h_1, h_2 = 1, ..., H.
End of Lecture #14 (19/11/2015)

8.8 Hierarchically well-formulated models, ANOVA tables

Start of Lecture #15 (19/11/2015)

8.8.1 Model terms
In the majority of applications of a linear model, a particular covariate Z ∈ Z ⊆ R enters the regression function through one of the parameterizations described in Sections 7.3 and 7.4, or inside an interaction (see Definition 8.2), or inside a so-called higher order interaction (to be defined shortly). As a summary: depending on whether the covariate is numeric or categorical, Sections 7.3 and 7.4 introduced several parameterizations s that, with the covariate values Z_1, ..., Z_n in the data, lead to the matrix

S = ( s^T(Z_1); ...; s^T(Z_n) ) = ( X_1^T; ...; X_n^T ),

where X_1 = s(Z_1), ..., X_n = s(Z_n) are the regressors used in the linear model. The considered parameterizations were the following.
Numeric covariate

(i) Simple transformation: s = s : Z → R with

S = ( s(Z_1), ..., s(Z_n) )^T = S,    X_1 = X_1 = s(Z_1), ..., X_n = X_n = s(Z_n).    (8.59)
(ii) Polynomial: s = (s_1, ..., s_{k-1}) such that s_j(z) = P^j(z) is a polynomial in z of degree j, j = 1, ..., k − 1. This leads to

S = ( P^1(Z_1) ... P^{k-1}(Z_1); ⋮ ; P^1(Z_n) ... P^{k-1}(Z_n) ) = ( P^1, ..., P^{k-1} ),    (8.60)

X_1 = ( P^1(Z_1), ..., P^{k-1}(Z_1) ), ..., X_n = ( P^1(Z_n), ..., P^{k-1}(Z_n) ).

For a particular form of the basis polynomials P^1, ..., P^{k-1}, raw or orthonormal polynomials have been suggested in Sections 7.3.2 and 7.3.3. Other choices are possible as well.
(iii) Regression spline: s = (s_1, ..., s_k) such that s_j(z) = B_j(z), j = 1, ..., k, where B_1, ..., B_k is the spline basis of a chosen degree d ∈ N_0 composed of basis B-splines built above a set of chosen knots λ = (λ_1, ..., λ_{k-d+1}). This leads to

S = B = ( B_1(Z_1) ... B_k(Z_1); ⋮ ; B_1(Z_n) ... B_k(Z_n) ) = ( B^1, ..., B^k ),    (8.61)

X_1 = ( B_1(Z_1), ..., B_k(Z_1) ), ..., X_n = ( B_1(Z_n), ..., B_k(Z_n) ).
Categorical covariate with Z = {1, ..., G}: the parameterization s is s(z) = c_z, z ∈ Z, where c_1, ..., c_G ∈ R^{G-1} are the rows of a chosen (pseudo)contrast matrix C_{G×(G-1)}. This leads to

S = ( c_{Z_1}^T; ...; c_{Z_n}^T ) = ( C^1, ..., C^{G-1} ),    (8.62)

X_1 = c_{Z_1}, ..., X_n = c_{Z_n}.
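As a small illustration (with made-up data; only the raw-polynomial case (ii) and the categorical case with reference-group pseudocontrasts are shown), the reparameterizing matrix S can be assembled directly:

```python
import numpy as np

# Minimal sketch of the reparameterizing matrix S; names (Z, k, G)
# mirror the text, the data values are invented.
Z_num = np.array([0.5, 1.2, 2.0, 3.1])      # numeric covariate values
k = 4                                        # polynomial of degree k - 1 = 3

# (ii) Raw polynomial basis: column j of S is P^j(Z) = Z^j, j = 1,...,k-1.
S_poly = np.column_stack([Z_num ** j for j in range(1, k)])   # n x (k-1)

# Categorical covariate, G = 3, reference-group pseudocontrasts:
Z_cat = np.array([1, 3, 2, 3, 1])
G = 3
C = np.vstack([np.zeros(G - 1), np.eye(G - 1)])   # G x (G-1) contrast matrix
S_cat = C[Z_cat - 1]                               # row i is c_{Z_i}^T, as in (8.62)

print(S_poly.shape, S_cat.shape)
```

Orthonormal polynomials or B-spline bases would replace the column construction but yield a matrix S of the same shape.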
Main effect model terms

In the following, we restrict ourselves to situations where the considered covariates are parameterized in one of the ways mentioned above. The following definitions introduce sets of columns of a possible model matrix which will be called the model terms and which should always be considered “together” when proposing a linear model for a problem at hand.

Definition 8.3 The main effect model term.
Depending on the chosen parameterization, the main effect model term⁶ (of order one) of a given covariate Z is defined as a matrix T with columns:

Numeric covariate
(i) Simple transformation: (the only) column S of the reparameterizing matrix S given by (8.59), i.e., T = (S).
(ii) Polynomial: the first column P^1 of the reparameterizing matrix S (given by Eq. 8.60) that corresponds to the linear transformation of the covariate Z, i.e., T = (P^1).
(iii) Regression spline: (all) columns B^1, ..., B^k of the reparameterizing matrix S = B given by (8.61), i.e., T = (B^1, ..., B^k).

Categorical covariate: (all) columns C^1, ..., C^{G-1} of the reparameterizing matrix S given by (8.62), i.e., T = (C^1, ..., C^{G-1}).
Definition 8.4 The main effect model term of order j.
If a numeric covariate Z is parameterized using the polynomial of degree k − 1, then the main effect model term of order j, j = 2, ..., k − 1, means a matrix T^j whose only column is the jth column P^j of the reparameterizing matrix S (given by Eq. 8.60) that corresponds to the polynomial of degree j, i.e.,

T^j = (P^j).

Note. The terms T, ..., T^{j-1} are called the lower order terms included in the term T^j.

⁶ hlavní efekt
Two-way interaction model terms

In the following, consider two covariates Z and W and their main effect model terms T^Z and T^W.

Definition 8.5 The two-way interaction model term.
The two-way interaction⁷ model term means a matrix T^{ZW}, where T^{ZW} := T^Z : T^W.

Notes.
• The main effect model term T^Z and/or the main effect model term T^W that enter the two-way interaction may also be of an order j > 1.
• Both main effect model terms T^Z and T^W are called the lower order terms included in the two-way interaction term T^Z : T^W.

Higher order interaction model terms

In the following, consider three covariates Z, W and V and their main effect model terms T^Z, T^W, T^V.

Definition 8.6 The three-way interaction model term.
The three-way interaction⁸ model term means a matrix T^{ZWV}, where T^{ZWV} := T^Z : T^W : T^V.

Notes.
• Any of the main effect model terms T^Z, T^W, T^V that enter the three-way interaction may also be of an order j > 1.
• All main effect terms T^Z, T^W and T^V, and also all two-way interaction terms T^Z : T^W, T^Z : T^V and T^W : T^V, are called the lower order terms included in the three-way interaction term T^{ZWV}.
• By induction, we could also define four-way, five-way, ..., i.e., higher order interaction model terms, and the notion of the corresponding lower order nested terms.

⁷ dvojná interakce
⁸ trojná interakce
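Concretely, the interaction term T^Z : T^W is commonly built from all column-wise products of T^Z and T^W (this is, for instance, how R's formula machinery constructs interaction columns). A sketch with made-up matrices:

```python
import numpy as np

def interaction_term(TZ: np.ndarray, TW: np.ndarray) -> np.ndarray:
    """All column-wise products of TZ and TW -- a common construction of
    the two-way interaction term TZ : TW (illustrative, not the notes' code)."""
    n = TZ.shape[0]
    # Row-wise outer products: for each observation, the outer product of
    # its TZ-row and TW-row, flattened into one row of the result.
    return np.einsum('ni,nj->nij', TZ, TW).reshape(n, -1)

TZ = np.array([[1., 0.], [0., 1.], [1., 1.]])   # e.g. two contrast columns
TW = np.array([[2.], [3.], [4.]])               # e.g. one numeric column
TZW = interaction_term(TZ, TW)
print(TZW)
```

The result has one column per pair of columns of T^Z and T^W, matching the (G−1)(H−1) interaction columns in the two-factor case.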
8.8.2 Model formula

To write linear models based on several covariates concisely, a model formula will be used. The symbols in the model formula have the following meaning:

• 1: intercept term in the model if this is the only term in the model (i.e., the intercept-only model).
• Letter or abbreviation: main effect of order one of a particular covariate (identified by the letter or abbreviation). It is assumed that the chosen parameterization is either known from context or indicated in some way (e.g., by the used abbreviation). Letters or abbreviations will also be used to indicate a response variable.
• Power j, j > 1 (above a letter or abbreviation): main effect of order j of a particular covariate.
• Colon (:) between two or more letters or abbreviations: interaction term based on the particular covariates.
• Plus sign (+): a delimiter of the model terms.
• Tilde (∼): a delimiter between the response and the description of the regression function.

Further, when using a model formula, it is assumed that the intercept term is explicitly included in the regression function. If the explicit intercept should not be included, this will be indicated by writing −1 among the model terms.
8.8.3 Hierarchically well formulated model

Definition 8.7 Hierarchically well formulated model.
A hierarchically well formulated (HWF) model⁹ is a model that contains an intercept term (possibly implicitly) and, with each model term, also all lower order terms that are nested in this term.

Notes.
• Unless there is some well-defined specific reason, models used in practice should be hierarchically well formulated.
• The reason for using HWF models is the fact that the regression space of such models is invariant under linear (location-scale) transformations of the regressors, where invariance is meant in the sense of obtaining equivalent linear models.
Example 8.5.
Consider the regression function

m_x(x) = β_0 + β_1 x + β_2 x²

and perform a linear transformation of the regressor:

x = δ (t − φ),    t = φ + x/δ,    (8.63)

where δ ≠ 0 and φ ≠ 0 are pre-specified constants and t is a new regressor. The regression function in t is

m_t(t) = γ_0 + γ_1 t + γ_2 t²,    where γ_0 = β_0 − β_1 δφ + β_2 δ²φ²,  γ_1 = β_1 δ − 2β_2 δ²φ,  γ_2 = β_2 δ².

With at least three different x values in the data, both regression functions lead to two equivalent linear models of rank 3.

Suppose now that the initial regression function m_x did not include a linear term, i.e., it was

m_x(x) = β_0 + β_2 x²,

which leads to a linear model of rank 2 (with at least three, or even two, different covariate values in the data). Upon performing the linear transformation (8.63) of the regressor x, the regression function becomes

m_t(t) = γ_0 + γ_1 t + γ_2 t²    with γ_0 = β_0 + β_2 δ²φ²,  γ_1 = −2β_2 δ²φ,  γ_2 = β_2 δ².

With at least three different covariate values in the data, this leads to a linear model of rank 3.

⁹ hierarchicky dobře formulovaný model
To use a non-HWF model in practice, there should always be a (physical, ...) reason for it. For example,

• No intercept in the model ≡ it can be assumed that the response expectation is zero if all regressors in the chosen parameterization take zero values.
• No linear term in a model with a quadratic regression function m(x) = β_0 + β_2 x² ≡ it can be assumed that the regression function is a parabola with vertex at the point (0, β_0) with respect to the x parameterization.
• No main effect of one covariate in an interaction model with two numeric covariates and a regression function m(x, z) = β_0 + β_1 z + β_2 x z ≡ it can be assumed that for z = 0, the response expectation does not depend on the value of x, i.e., E(Y | X = x, Z = 0) = β_0 (a constant).
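Example 8.5 can also be verified numerically: under the linear reparameterization (8.63), the fitted values of the HWF quadratic model do not change, while those of the non-HWF model without the linear term do. A sketch with made-up data and constants:

```python
import numpy as np

# Illustrative data; the constants delta, phi are arbitrary choices.
rng = np.random.default_rng(2)
x = np.array([0., 1., 2., 3., 4.])
y = 1.0 + 0.5 * x - 0.3 * x**2 + rng.normal(scale=0.1, size=x.size)
delta, phi = 2.0, 1.5
t = phi + x / delta                      # transformation (8.63)

def fitted(X, y):
    """Fitted values of the LS fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

# HWF model 1 + x + x^2: the regression space, hence the fitted values,
# are invariant under the location-scale change of the regressor.
Fx = fitted(np.column_stack([np.ones_like(x), x, x**2]), y)
Ft = fitted(np.column_stack([np.ones_like(t), t, t**2]), y)
assert np.allclose(Fx, Ft)

# Non-HWF model 1 + x^2: the regression space changes with the shift.
Gx = fitted(np.column_stack([np.ones_like(x), x**2]), y)
Gt = fitted(np.column_stack([np.ones_like(t), t**2]), y)
assert not np.allclose(Gx, Gt)
print("HWF model invariant; non-HWF model not")
```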
8.8.4 ANOVA tables

For a particular linear model, so-called ANOVA tables are often produced to help the analyst decide which model terms are important with respect to their influence on the response expectation. Similarly to the well-known one-way ANOVA table (see any introductory statistical course), the ANOVA tables produced in the context of linear models provide on each row the input of a certain F-statistic, now based on Theorem 5.2. The last row of the table (often labeled Residual, Error or Within) provides

(i) the residual degrees of freedom ν_e of the considered model;
(ii) the residual sum of squares SS_e of the considered model;
(iii) the residual mean square MS_e = SS_e/ν_e of the considered model.

Each of the remaining rows of the ANOVA table provides input for the numerator of the F-statistic that corresponds to a comparison of certain two models M_1 ⊂ M_2, which are both submodels of the considered model (or M_2 is the considered model itself) and which have ν_1 and ν_2 residual degrees of freedom, respectively. The following quantities are provided on each of the remaining rows of the ANOVA table:

(i) degrees of freedom for the numerator of the F-statistic (effect degrees of freedom ν_E = ν_1 − ν_2);
(ii) the difference in the residual sums of squares of the two models (effect sum of squares SS_E = SS(M_2 | M_1));
(iii) the ratio of the above two values, which is the numerator of the F-statistic (effect mean square MS_E = SS_E/ν_E);
(iv) the value of the F-statistic F_E = MS_E/MS_e;
(v) a p-value based on the F-statistic F_E and the F_{ν_E, ν_e} distribution.
Several types of ANOVA tables are distinguished, which differ by the definition of the pair of models M_1 and M_2 that are being compared on a particular row. Consequently, the interpretation of the results provided by ANOVA tables of different types differs. Further, it is important to know that in all ANOVA tables, the lower order terms always appear on earlier rows of the table than the higher order terms that include them. Finally, for some ANOVA tables, a different interpretation of the results is obtained for different orderings of the rows with terms of the same hierarchical level, e.g., for different orderings of the main effect terms. We introduce ANOVA tables of three types, which are labeled by the R software (and by many others as well) as tables of type I, II or III (Arabic numerals can be used as well). Nevertheless, note that there exist software packages and literature that use a different typology. In the remainder of this section we assume that an intercept term is included in the considered model.

In the following, we illustrate each type of ANOVA table on a linear model based on two covariates whose main effect terms will be denoted as A and B. Next to the main effects, the model will also include an interaction term A : B. That is, the model formula of the considered model, denoted as M_{AB}, is ∼ A + B + A : B. In total, the following (sub)models of this model will appear in the ANOVA tables:

M_0:     ∼ 1,
M_A:     ∼ A,
M_B:     ∼ B,
M_{A+B}: ∼ A + B,
M_{AB}:  ∼ A + B + A : B.

The symbol SS(F_2 | F_1) will denote the difference in the residual sums of squares of the models with model formulas F_1 and F_2.
Type I (sequential) ANOVA table

Example 8.6 (Type I ANOVA table for model M_{AB}: ∼ A + B + A : B).
In the type I ANOVA table, the presented results depend on the ordering of the rows with the terms of the same hierarchical level. In this example, those are the rows that correspond to the main effect terms A and B.

Order A + B + A:B

Effect (Term)   Degrees of freedom   Effect sum of squares      Effect mean square   F-stat.   P-value
A               ?                    SS(A | 1)                  ?                    ?         ?
B               ?                    SS(A + B | A)              ?                    ?         ?
A:B             ?                    SS(A + B + A:B | A + B)    ?                    ?         ?
Residual        ν_e                  SS_e                       MS_e

Order B + A + A:B

Effect (Term)   Degrees of freedom   Effect sum of squares      Effect mean square   F-stat.   P-value
B               ?                    SS(B | 1)                  ?                    ?         ?
A               ?                    SS(A + B | B)              ?                    ?         ?
A:B             ?                    SS(A + B + A:B | A + B)    ?                    ?         ?
Residual        ν_e                  SS_e                       MS_e
The row of the effect (term) E in the type I ANOVA table has in general the following interpretation and properties.

• It compares two models M_1 ⊂ M_2, where
  • M_1 contains all terms included in the rows that precede the row of the term E;
  • M_2 contains the terms of model M_1 and additionally the term E.
• The sum of squares shows the increase in the explained variability of the response due to the term E on top of the terms shown on the preceding rows.
• The p-value quantifies the significance of the influence of the term E on the response while controlling (adjusting) for all terms shown on the preceding rows.
• The interpretation of the F-tests is different for the rows labeled A in the two tables of Example 8.6, and similarly for the rows labeled B.
• The sum of all sums of squares shown in the type I ANOVA table gives the total sum of squares SS_T of the considered model. This follows from the construction of the table, where the terms are added sequentially one by one, and from a sequential use of Theorem 5.8 (breakdown of the total sum of squares in a linear model with intercept).
Type II ANOVA table

Example 8.7 (Type II ANOVA table for model M_{AB}: ∼ A + B + A : B).
In the type II ANOVA table, the presented results do not depend on the ordering of the rows with the terms of the same hierarchical level, as should become clear from the subsequent explanation.

Effect (Term)   Degrees of freedom   Effect sum of squares      Effect mean square   F-stat.   P-value
A               ?                    SS(A + B | B)              ?                    ?         ?
B               ?                    SS(A + B | A)              ?                    ?         ?
A:B             ?                    SS(A + B + A:B | A + B)    ?                    ?         ?
Residual        ν_e                  SS_e                       MS_e
The row of the effect (term) E in the type II ANOVA table has in general the following interpretation and properties.

• It compares two models M_1 ⊂ M_2, where
  • M_1 is the considered (full) model without the term E and without all higher order terms that include E;
  • M_2 contains the terms of model M_1 and additionally the term E (the same as in the type I ANOVA table).
• The sum of squares shows the increase in the explained variability of the response due to the term E on top of all other terms that do not include the term E.
• The p-value quantifies the significance of the influence of the term E on the response while controlling (adjusting) for all other terms that do not include E.
• For practical purposes, this is probably the most useful ANOVA table.
Type III ANOVA table

Example 8.8 (Type III ANOVA table for model M_{AB}: ∼ A + B + A : B).
Also in the type III ANOVA table, the presented results do not depend on the ordering of the rows with the terms of the same hierarchical level, as should become clear from the subsequent explanation.

Effect (Term)   Degrees of freedom   Effect sum of squares          Effect mean square   F-stat.   P-value
A               ?                    SS(A + B + A:B | B + A:B)      ?                    ?         ?
B               ?                    SS(A + B + A:B | A + A:B)      ?                    ?         ?
A:B             ?                    SS(A + B + A:B | A + B)        ?                    ?         ?
Residual        ν_e                  SS_e                           MS_e
The row of the effect (term) E in the type III ANOVA table has in general the following interpretation and properties.

• It compares two models M_1 ⊂ M_2, where
  • M_1 is the considered (full) model without the term E;
  • M_2 contains the terms of model M_1 and additionally the term E (the same as in the type I and type II ANOVA tables). Due to the construction of M_1, the model M_2 is always equal to the considered (full) model.
• The submodel M_1 is not necessarily hierarchically well formulated. If M_1 is not HWF, the interpretation of its comparison to the model M_2 depends on the parameterization of the term E. Consequently, the interpretation of the F-test also depends on the used parameterization.
• For general practical purposes, most rows of the type III ANOVA table are often useless.
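For a concrete feel of the type I construction, the sequential sums of squares SS(A | 1), SS(A + B | A) and SS(A + B + A:B | A + B) can be computed as differences of residual sums of squares of the nested fits. A self-contained sketch with two made-up two-level factors (in practice one would rather call, e.g., R's anova()):

```python
import numpy as np

# Made-up balanced data for the model ~ A + B + A:B with two 2-level factors.
rng = np.random.default_rng(3)
A = np.repeat([0, 0, 1, 1], 5).astype(float)
Bf = np.tile([0, 1], 10).astype(float)
y = 1.0 + 0.8 * A - 0.5 * Bf + 0.6 * A * Bf + rng.normal(size=20)

def rss(X, y):
    """Residual sum of squares of the LS fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

one = np.ones_like(y)
X0 = np.column_stack([one])                       # ~ 1
XA = np.column_stack([one, A])                    # ~ A
XAB = np.column_stack([one, A, Bf])               # ~ A + B
XABi = np.column_stack([one, A, Bf, A * Bf])      # ~ A + B + A:B

SS_A = rss(X0, y) - rss(XA, y)        # SS(A | 1)
SS_B = rss(XA, y) - rss(XAB, y)       # SS(A + B | A)
SS_AB = rss(XAB, y) - rss(XABi, y)    # SS(A + B + A:B | A + B)
SSe, nu_e = rss(XABi, y), 20 - 4
F_AB = (SS_AB / 1) / (SSe / nu_e)     # F-statistic of the A:B row
print(SS_A, SS_B, SS_AB, F_AB)
```

The effect sums of squares and SS_e add up to the total sum of squares of the intercept-only fit, which is the sequential breakdown property noted for the type I table.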
Chapter 9

Analysis of Variance

In this chapter, we examine several specific issues of linear models where all covariates are categorical. That is, the covariate vector Z is Z = (Z_1, ..., Z_p), Z_j ∈ Z_j, j = 1, ..., p, and each Z_j is a finite set (usually of “low” cardinality). The corresponding linear models are traditionally used in the area of designed (industrial, agricultural, ...) experiments or controlled clinical studies. The elements of the covariate vector Z then correspond to p factors whose influence on the response Y is of interest. The values of those factors for the experimental units/subjects are typically within the control of the experimenter, in which case the covariates are fixed rather than random. Nevertheless, since the whole theory presented in this chapter is based on statements about the conditional distribution of the response given the covariate values, everything applies to both fixed and random covariates.
9.1 One-way classification

One-way classification corresponds to the situation of one categorical covariate Z ∈ Z = {1, ..., G}. A linear model is then used to parameterize the set of G (conditional) response expectations E(Y | Z = 1), ..., E(Y | Z = G) that we call the one-way classified group means:

m(g) = E(Y | Z = g) =: m_g,    g = 1, ..., G.
Without loss of generality, we can assume that the response random variables Y_1, ..., Y_n are sorted such that Z_1 = · · · = Z_{n_1} = 1, Z_{n_1+1} = · · · = Z_{n_1+n_2} = 2, ..., Z_{n_1+···+n_{G-1}+1} = · · · = Z_n = G. For notational clarity in theoretical derivations, it is useful to use a double subscript to index the individual observations and to merge responses with a common covariate value Z = g, g = 1, ..., G, into response subvectors Y_g:

Z = 1:    Y_1 = (Y_{1,1}, ..., Y_{1,n_1}),
⋮
Z = G:    Y_G = (Y_{G,1}, ..., Y_{G,n_G}).

The full response vector is Y = (Y_1^T, ..., Y_G^T)^T and its (conditional, given Z = (Z_1, ..., Z_n)^T) mean is

E(Y | Z) = μ = ( m_1 1_{n_1}^T, ..., m_G 1_{n_G}^T )^T.

A standard linear model then additionally assumes

var(Y | Z) = σ² I_n.

If, moreover, a linear model with i.i.d. errors is assumed, we get

Y_{g,j} = m_g + ε_{g,j},    g = 1, ..., G, j = 1, ..., n_g,

where ε_{g,j} ~ i.i.d. D(0, σ²). Finally, if moreover the covariates are fixed rather than random, the data form G independent random samples (with common variance):

Sample 1:    Y_1 = (Y_{1,1}, ..., Y_{1,n_1}),    Y_{1,j} ~ i.i.d. D(m_1, σ²), j = 1, ..., n_1,
⋮
Sample G:    Y_G = (Y_{G,1}, ..., Y_{G,n_G}),    Y_{G,j} ~ i.i.d. D(m_G, σ²), j = 1, ..., n_G.

A linear model and the related methodology can then be used for inference on the group means m_1, ..., m_G or on their linear combinations.
9.1.1 Parameters of interest

Differences between the group means

The principal inferential interest with one-way classification lies in the estimation of, and tests on, the parameters

θ_{g,h} = m_g − m_h,    g, h = 1, ..., G, g ≠ h,

which are the differences between the group means. Since each θ_{g,h} is a linear combination of the elements of the mean vector μ = E(Y | Z), it is trivially an estimable parameter of the underlying linear model, irrespective of its parameterization. The LSE of each θ_{g,h} is then a difference between the corresponding fitted values.

The principal null hypothesis tested in the context of one-way classification is the null hypothesis of equality of the group means, i.e., the null hypothesis

H_0: m_1 = · · · = m_G,

which, written in terms of the differences between the group means, is

H_0: θ_{g,h} = 0,    g, h = 1, ..., G, g ≠ h.
Factor effects

One-way classification often corresponds to a designed experiment which aims at evaluating the effect of a certain factor on the response. In that case, the following quantities, called factor effects, are usually of primary interest.

Definition 9.1 Factor effects in a one-way classification.
By factor effects in the case of a one-way classification we understand the quantities η_1, ..., η_G defined as

η_g = m_g − (1/G) ∑_{h=1}^{G} m_h,    g = 1, ..., G.

Notes.
• The factor effects are again linear combinations of the elements of the mean vector μ = E(Y | Z) and hence all are estimable parameters of the underlying linear model, with the LSE being equal to the appropriate linear combination of the fitted values.
• The null hypothesis

H_0: η_g = 0,    g = 1, ..., G,

is equivalent to the null hypothesis H_0: m_1 = · · · = m_G of equality of the group means.
9.1.2 One-way ANOVA model

As a reminder from Section 7.4.2, the regression space of the one-way classification is

{ ( m_1 1_{n_1}^T, ..., m_G 1_{n_G}^T )^T : m_1, ..., m_G ∈ R } ⊆ R^n.

Assuming n_g > 0, g = 1, ..., G, n > G, its vector dimension is G. In Sections 7.4.3 and 7.4.4, we introduced two classical classes of parameterizations of this regression space and of the response mean vector μ as μ = Xβ, β ∈ R^k.
ANOVA (less-than-full-rank) parameterization

m_g = α_0 + α_g,    g = 1, ..., G,

with k = G + 1, β =: α = (α_0, α^Z), α^Z = (α_1, ..., α_G).

Full-rank parameterization

m_g = β_0 + c_g^T β^Z,    g = 1, ..., G,

with k = G, β = (β_0, β^Z), β^Z = (β_1, ..., β_{G-1}), where

C = ( c_1^T; ...; c_G^T )

is a chosen G × (G − 1) (pseudo)contrast matrix.

Note.
If the parameters in the ANOVA parameterization are identified by the sum constraint ∑_{g=1}^{G} α_g = 0, we get

α_0 = (1/G) ∑_{g=1}^{G} m_g,    α_g = η_g = m_g − (1/G) ∑_{h=1}^{G} m_h,

that is, the parameters α_1, ..., α_G are then equal to the factor effects.
Terminology. The related linear model is referred to as the one-way ANOVA model¹.

Notes.
• Depending on the chosen parameterization (ANOVA or full-rank), the differences between the group means, the parameters θ_{g,h}, are expressed as

θ_{g,h} = α_g − α_h = (c_g − c_h)^T β^Z,    g ≠ h.

The null hypothesis H_0: m_1 = · · · = m_G of equality of the group means is then expressed as
(a) H_0: α_1 = · · · = α_G ;
(b) H_0: β_1 = 0 & ... & β_{G-1} = 0, i.e., H_0: β^Z = 0_{G-1}.
• If a normal linear model is assumed, a test on a value of an estimable vector parameter, or a submodel test which compares the one-way ANOVA model with the intercept-only model, can be used to test the above null hypotheses. The corresponding F-test is indeed the well-known one-way ANOVA F-test.

End of Lecture #15 (19/11/2015)
9.1.3 Least squares estimation

Start of Lecture #16 (26/11/2015)

In the case of a one-way ANOVA linear model, explicit formulas for the LSE-related quantities can easily be derived.

Theorem 9.1 Least squares estimation in the one-way ANOVA linear model.
The fitted values and the LSE of the group means in a one-way ANOVA linear model are equal to the group sample means:

m̂_g = Ŷ_{g,j} = (1/n_g) ∑_{l=1}^{n_g} Y_{g,l} =: Ȳ_{g•},    g = 1, ..., G, j = 1, ..., n_g.
That is,

m̂ := ( m̂_1, ..., m̂_G )^T = ( Ȳ_{1•}, ..., Ȳ_{G•} )^T,    Ŷ = ( Ȳ_{1•} 1_{n_1}^T, ..., Ȳ_{G•} 1_{n_G}^T )^T.

If additionally normality is assumed, then m̂ | Z ~ N_G(m, σ² V), where

V = diag( 1/n_1, ..., 1/n_G ).
Proof. Use the full-rank parameterization μ = Xβ with

X = blockdiag( 1_{n_1}, ..., 1_{n_G} )    (an n × G block-diagonal matrix),    β = ( m_1, ..., m_G )^T.

We have

X^T X = diag( n_1, ..., n_G ),    X^T Y = ( ∑_{j=1}^{n_1} Y_{1,j}, ..., ∑_{j=1}^{n_G} Y_{G,j} )^T,    (X^T X)^{-1} = diag( 1/n_1, ..., 1/n_G ),

β̂ = m̂ = ( m̂_1, ..., m̂_G )^T = (X^T X)^{-1} X^T Y = ( Ȳ_{1•}, ..., Ȳ_{G•} )^T.

Finally,

Ŷ = Xβ̂ = ( m̂_1 1_{n_1}^T, ..., m̂_G 1_{n_G}^T )^T = ( Ȳ_{1•} 1_{n_1}^T, ..., Ȳ_{G•} 1_{n_G}^T )^T.

Normality and the form of the covariance matrix of m̂ follow from the general LSE theory.

¹ model analýzy rozptylu jednoduchého třídění
LSE of regression coefficients and estimable parameters

With a full-rank parameterization, the vector m is linked to the regression coefficients β = (β_0, β^Z), β^Z = (β_1, ..., β_{G-1}), by the relationship

m = β_0 1_G + C β^Z.

Since Ŷ = Xβ̂, where X is the model matrix derived from the (pseudo)contrast matrix C, the LSE β̂ = (β̂_0, β̂^Z) of the regression coefficients in a full-rank parameterization satisfies

m̂ = β̂_0 1_G + C β̂^Z,

which is a regular linear system with the solution

( β̂_0, β̂^Z )^T = ( 1_G, C )^{-1} ( Ȳ_{1•}, ..., Ȳ_{G•} )^T.

That is, the LSE of the regression coefficients is always a linear combination of the group sample means. The same then holds for any estimable parameter. For example, the LSE of the differences between the group means θ_{g,h} = m_g − m_h, g, h = 1, ..., G, are

θ̂_{g,h} = Ȳ_{g•} − Ȳ_{h•},    g, h = 1, ..., G.

Analogously, the LSE of the factor effects η_g = m_g − (1/G) ∑_{h=1}^{G} m_h, g = 1, ..., G, are

η̂_g = Ȳ_{g•} − (1/G) ∑_{h=1}^{G} Ȳ_{h•},    g = 1, ..., G.
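The estimators above reduce to simple group-mean arithmetic, as the following sketch with made-up data shows (Ȳ_{g•} are the group sample means; θ̂_{1,2} and η̂_g are as defined above):

```python
import numpy as np

# Made-up one-way data: group labels 1..G and responses.
groups = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3])
Y = np.array([4.1, 3.9, 4.3, 5.0, 5.4, 2.9, 3.1, 3.0, 3.2])
G = 3

# LSE of the group means = group sample means Ybar_{g.} (Theorem 9.1).
ybar = np.array([Y[groups == g].mean() for g in range(1, G + 1)])
Yhat = ybar[groups - 1]                 # fitted values

theta_12 = ybar[0] - ybar[1]            # LSE of theta_{1,2} = m_1 - m_2
eta = ybar - ybar.mean()                # LSE of the factor effects eta_g
print(ybar, theta_12, eta)
```

Note that the estimated factor effects sum to zero by construction, matching the sum-constraint identification of the ANOVA parameterization.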
9.1.4 Within and between groups sums of squares, ANOVA F-test

Sums of squares

Let, as usual, Ȳ denote the sample mean based on the response vector Y, i.e.,

Ȳ = (1/n) ∑_{g=1}^{G} ∑_{j=1}^{n_g} Y_{g,j} = (1/n) ∑_{g=1}^{G} n_g Ȳ_{g•}.

In a one-way ANOVA linear model, the residual and the regression sums of squares and the corresponding degrees of freedom are

SS_e = ||Y − Ŷ||² = ∑_{g=1}^{G} ∑_{j=1}^{n_g} (Y_{g,j} − Ŷ_{g,j})² = ∑_{g=1}^{G} ∑_{j=1}^{n_g} (Y_{g,j} − Ȳ_{g•})²,    ν_e = n − G,

SS_R = ||Ŷ − Ȳ 1_n||² = ∑_{g=1}^{G} ∑_{j=1}^{n_g} (Ŷ_{g,j} − Ȳ)² = ∑_{g=1}^{G} n_g (Ȳ_{g•} − Ȳ)²,    ν_R = G − 1.

In this context, the residual sum of squares SS_e is also called the within groups sum of squares², and the regression sum of squares SS_R is called the between groups sum of squares³.

² vnitroskupinový součet čtverců
³ meziskupinový součet čtverců
One-way ANOVA F-test

Let us assume normality of the response and consider the submodel Y | Z ~ N_n(1_n β_0, σ² I_n) of the one-way ANOVA model. The residual sum of squares of the submodel is

SS_e⁰ = SS_T = ||Y − Ȳ 1_n||² = ∑_{g=1}^{G} ∑_{j=1}^{n_g} (Y_{g,j} − Ȳ)².

The breakdown of the total sum of squares (Theorem 5.8) gives SS_R = SS_T − SS_e = SS_e⁰ − SS_e, and hence the statistic of the F-test on a submodel is

F = [ SS_R/(G − 1) ] / [ SS_e/(n − G) ] = MS_R / MS_e,    (9.1)

where

MS_R = SS_R/(G − 1),    MS_e = SS_e/(n − G).

The F-statistic (9.1) is indeed the classical one-way ANOVA F-statistic which, under the null hypothesis of validity of the submodel, i.e., under the null hypothesis of equality of the group means, follows an F_{G−1, n−G} distribution. The above quantities, together with the p-value derived from the F_{G−1, n−G} distribution, are often recorded in the form of an ANOVA table:

Effect (Term)   Degrees of freedom   Effect sum of squares   Effect mean square   F-stat.   P-value
Factor          G − 1                SS_R                    MS_R                 F         p
Residual        n − G                SS_e                    MS_e

Consider the terminology introduced in Section 8.8 and denote as Z the main effect term that corresponds to the covariate Z. We have SS_R = SS(Z | 1), and the above ANOVA table is then the type I as well as the type II ANOVA table. If the intercept is explicitly included in the model matrix, then it is also the type III ANOVA table.
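The whole decomposition SS_T = SS_R + SS_e and the F-statistic (9.1) can be computed directly; a sketch on three small made-up samples:

```python
import numpy as np

# Three made-up independent samples (one-way classification, G = 3).
samples = [np.array([4.1, 3.9, 4.3]),
           np.array([5.0, 5.4]),
           np.array([2.9, 3.1, 3.0, 3.2])]
Y = np.concatenate(samples)
n, G = Y.size, len(samples)

SSe = sum(((s - s.mean()) ** 2).sum() for s in samples)   # within groups
SST = ((Y - Y.mean()) ** 2).sum()                          # total
SSR = SST - SSe                                            # between groups
F = (SSR / (G - 1)) / (SSe / (n - G))                      # statistic (9.1)
print(round(F, 3))
```

Under H_0 of equal group means, F would be compared with the F_{G−1, n−G} distribution to obtain the p-value of the ANOVA table.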
9.2. TWO-WAY CLASSIFICATION
9.2
159
Two-way classification
Two-way classification corresponds to situation of two categorical covariates Z ∈ Z = {1, . . . , G}
and W ∈ W = {1, . . . , H}, see also Section 8.7.
A linear model is then used to parameterize a set
of G · H (conditional) response expectations E Y Z = g, W = h , g = 1, . . . , G, h = 1, . . . , H
that we will call as two-way classified group means:
m(g, h) = E Y Z = g, W = h =: mg,h ,
g = 1, . . . , G, h = 1, . . . , H.
Analogously to Section 8.7 and without loss of generality, we can assume that the response variables
Y1 , . . . , Yn are sorted as indicated in that section. That is, the first n1,1 responses correspond to
(Z, W ) = (1, 1), the following n2,1 responses to (Z, W ) = (2, 1) etc. Analogously to one-way
classification, we will now use a triple subscript to index the individual observations and merge
responses with a common value of the two covariates into response subvectors Y g,h , g = 1, . . . , G,
h = 1, . . . , H as indicated in the following table:
            W
 Z     1                                                    ...    H
 1     Y_{1,1} = (Y_{1,1,1}, ..., Y_{1,1,n_{1,1}})^T        ...    Y_{1,H} = (Y_{1,H,1}, ..., Y_{1,H,n_{1,H}})^T
 ⋮                      ⋮                                    ⋱                       ⋮
 G     Y_{G,1} = (Y_{G,1,1}, ..., Y_{G,1,n_{G,1}})^T        ...    Y_{G,H} = (Y_{G,H,1}, ..., Y_{G,H,n_{G,H}})^T
The overall response vector Y is assumed to be taken from the columns of the above table, i.e.,
$$
Y = \bigl(Y_{1,1}^\top, \ldots, Y_{G,1}^\top, \ldots, Y_{1,H}^\top, \ldots, Y_{G,H}^\top\bigr)^\top.
$$
In the same way, we define a vector m composed of the two-way classified group means as
$$
m = \bigl(m_{1,1}, \ldots, m_{G,1}, \ldots, m_{1,H}, \ldots, m_{G,H}\bigr)^\top.
$$
Finally, we keep using the dot notation for collapsed sample sizes or for means of the group means. That is,
$$
n = \sum_{g=1}^{G} \sum_{h=1}^{H} n_{g,h}, \qquad
n_{g\bullet} = \sum_{h=1}^{H} n_{g,h}, \quad g = 1, \ldots, G, \qquad
n_{\bullet h} = \sum_{g=1}^{G} n_{g,h}, \quad h = 1, \ldots, H,
$$
$$
m_{g\bullet} = \frac{1}{H} \sum_{h=1}^{H} m_{g,h}, \quad g = 1, \ldots, G, \qquad
m_{\bullet h} = \frac{1}{G} \sum_{g=1}^{G} m_{g,h}, \quad h = 1, \ldots, H,
$$
$$
m = \frac{1}{G \cdot H} \sum_{g=1}^{G} \sum_{h=1}^{H} m_{g,h}
  = \frac{1}{G} \sum_{g=1}^{G} m_{g\bullet}
  = \frac{1}{H} \sum_{h=1}^{H} m_{\bullet h},
$$
which can be summarized in tabular form as
Group means:
        W
 Z      1        ...   H        •
 1      m_{1,1}  ...   m_{1,H}  m_{1•}
 ⋮      ⋮        ⋱     ⋮        ⋮
 G      m_{G,1}  ...   m_{G,H}  m_{G•}
 •      m_{•1}   ...   m_{•H}   m

Sample sizes:
        W
 Z      1        ...   H        •
 1      n_{1,1}  ...   n_{1,H}  n_{1•}
 ⋮      ⋮        ⋱     ⋮        ⋮
 G      n_{G,1}  ...   n_{G,H}  n_{G•}
 •      n_{•1}   ...   n_{•H}   n
Note. The quantities m_{g•}, m_{•h}, m defined above are means of the group means which are not weighted by the corresponding sample sizes (which are, moreover, random if the covariates are random). As such, all of the above means are always real constants and never random variables (irrespective of whether the covariates are considered fixed or random).
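To make the distinction concrete, here is a small numpy sketch (with hypothetical group means and deliberately unbalanced sample sizes) contrasting the unweighted mean of the group means m with a sample-size-weighted overall mean:

```python
import numpy as np

# Hypothetical 2x3 table of group means m_{g,h} and unbalanced sizes n_{g,h}
m = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])          # G = 2 rows (Z), H = 3 columns (W)
n = np.array([[10, 1, 1],
              [1, 1, 10]])

m_gdot = m.mean(axis=1)                  # m_{g.}: unweighted row means
m_doth = m.mean(axis=0)                  # m_{.h}: unweighted column means
m_bar = m.mean()                         # m: unweighted mean of all G*H means

# Weighting each group mean by n_{g,h}/n gives a different quantity
# with unbalanced data.
weighted = (n * m).sum() / n.sum()

print(m_gdot)    # [2. 4.]
print(m_doth)    # [1.5 3.  4.5]
print(m_bar)     # 3.0
print(weighted)  # 3.375
```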
The (conditional, given Z = (Z_1, ..., Z_n)^⊤ and W = (W_1, ..., W_n)^⊤) mean of the response vector Y is
$$
E(Y \mid Z, W) = \mu = \bigl(m_{1,1}\, 1_{n_{1,1}}^\top,\; \ldots,\; m_{G,H}\, 1_{n_{G,H}}^\top\bigr)^\top.
$$
A standard linear model then additionally assumes
$$
var(Y \mid Z, W) = \sigma^2 I_n.
$$
If, moreover, a linear model with i.i.d. errors is assumed, we get
$$
Y_{g,h,j} = m_{g,h} + \varepsilon_{g,h,j}, \qquad g = 1, \ldots, G,\; h = 1, \ldots, H,\; j = 1, \ldots, n_{g,h},
$$
where ε_{g,h,j} ∼ i.i.d. D(0, σ²). Finally, if additionally the covariates are fixed rather than random, the data form G · H independent random samples (with a common variance):
Sample (1,1):  Y_{1,1} = (Y_{1,1,1}, ..., Y_{1,1,n_{1,1}})^⊤,  Y_{1,1,j} ∼ i.i.d. D(m_{1,1}, σ²),  j = 1, ..., n_{1,1},
  ⋮
Sample (G,H):  Y_{G,H} = (Y_{G,H,1}, ..., Y_{G,H,n_{G,H}})^⊤,  Y_{G,H,j} ∼ i.i.d. D(m_{G,H}, σ²),  j = 1, ..., n_{G,H}.
A linear model and the related methodology can then be used for inference on the group means m_{1,1}, ..., m_{G,H} or on their linear combinations. On top of that, a linear model can reveal the structure of the relationship through which the two covariates (the two factors) influence the two-way classified group means.
9.2.1 Parameters of interest
Means of the means and their differences

Various quantities, all being linear combinations of the two-way classified group means, i.e., all being estimable in any parameterization of the two-way classification, are classically of interest. They include:
(i) The mean of the group means m.
(ii) The means of the means by the first or the second factor, i.e., the parameters
m_{1•}, ..., m_{G•}, and m_{•1}, ..., m_{•H}.
(iii) Differences between the means of the means by the first or the second factor, i.e., the parameters
$$
\theta_{g_1,g_2 \bullet} := m_{g_1 \bullet} - m_{g_2 \bullet}, \qquad g_1, g_2 = 1, \ldots, G,\; g_1 \neq g_2,
$$
$$
\theta_{\bullet h_1,h_2} := m_{\bullet h_1} - m_{\bullet h_2}, \qquad h_1, h_2 = 1, \ldots, H,\; h_1 \neq h_2.
$$
These, in a certain sense, quantify the mean effect of the first or the second factor on the response.
Main effects

Analogously to the one-way classification, the two-way classification also often corresponds to a designed experiment, in this case aiming at evaluating the effect of two factors represented by the covariates Z and W. The following quantities, called the main effects of the factors Z and W, respectively, are then usually of primary interest.
Definition 9.2 (Main effects in a two-way classification).
Consider a two-way classification based on factors Z and W. By the main effects of the factor Z, we understand the quantities η_1^Z, ..., η_G^Z defined as
$$
\eta^Z_g := m_{g\bullet} - m, \qquad g = 1, \ldots, G.
$$
By the main effects of the factor W, we understand the quantities η_1^W, ..., η_H^W defined as
$$
\eta^W_h := m_{\bullet h} - m, \qquad h = 1, \ldots, H.
$$
Note. The differences between the means of the means are also equal to the differences between the main effects:
$$
\theta_{g_1,g_2 \bullet} = m_{g_1 \bullet} - m_{g_2 \bullet} = \eta^Z_{g_1} - \eta^Z_{g_2}, \qquad g_1, g_2 = 1, \ldots, G,\; g_1 \neq g_2,
$$
$$
\theta_{\bullet h_1,h_2} = m_{\bullet h_1} - m_{\bullet h_2} = \eta^W_{h_1} - \eta^W_{h_2}, \qquad h_1, h_2 = 1, \ldots, H,\; h_1 \neq h_2.
$$
Interaction effects

Suppose now that the factors Z and W act additively on the response expectation. In this case, the differences
$$
m_{g_1,h} - m_{g_2,h}, \qquad g_1, g_2 = 1, \ldots, G,
$$
do not depend on the value of h = 1, ..., H. Consequently, for any h,
$$
m_{g_1,h} - m_{g_2,h} = m_{g_1 \bullet} - m_{g_2 \bullet}, \qquad g_1, g_2 = 1, \ldots, G.
$$
This implies
$$
m_{g_1,h} - m_{g_1 \bullet} = m_{g_2,h} - m_{g_2 \bullet}, \qquad g_1, g_2 = 1, \ldots, G,\; h = 1, \ldots, H.
$$
In other words, additivity implies that for any h = 1, ..., H, the differences
$$
\Delta(g, h) = m_{g,h} - m_{g\bullet}
$$
do not depend on the value of g = 1, ..., G. Then (for any g = 1, ..., G and h = 1, ..., H)
$$
m_{g,h} - m_{g\bullet} = \Delta(g, h)
= \frac{1}{G} \sum_{g^\star = 1}^{G} \Delta(g^\star, h)
= \frac{1}{G} \sum_{g^\star = 1}^{G} \bigl(m_{g^\star,h} - m_{g^\star \bullet}\bigr)
= m_{\bullet h} - m.
$$
Clearly, we would arrive at the same conclusion if we instead started from assuming that the differences
$$
m_{g,h_1} - m_{g,h_2}, \qquad h_1, h_2 = 1, \ldots, H,
$$
do not depend on the value of g = 1, ..., G. In either case, additivity implies
$$
m_{g,h} - m_{g\bullet} - m_{\bullet h} + m = 0, \qquad g = 1, \ldots, G,\; h = 1, \ldots, H.
$$
Easily, we find that this is also a sufficient condition for additivity.
Definition 9.3 (Interaction effects in a two-way classification).
Consider a two-way classification based on factors Z and W. By the interaction effects of the two factors, we understand the quantities η_{1,1}^{ZW}, ..., η_{G,H}^{ZW} defined as
$$
\eta^{ZW}_{g,h} := m_{g,h} - m_{g\bullet} - m_{\bullet h} + m, \qquad g = 1, \ldots, G,\; h = 1, \ldots, H.
$$
Notes.
• Taking into account the definitions of the main and interaction effects, the two-way classified group means are given as
$$
m_{g,h} = m + \eta^Z_g + \eta^W_h + \eta^{ZW}_{g,h}, \qquad g = 1, \ldots, G,\; h = 1, \ldots, H.
$$
• The hypothesis of additivity of the effect of the factors Z and W on the response expectation is given as
$$
H_0: \; \eta^{ZW}_{g,h} = 0, \qquad g = 1, \ldots, G,\; h = 1, \ldots, H.
$$
9.2.2 Two-way ANOVA models
The following linear models, referred to as two-way ANOVA models,4 are traditionally considered. Each of them corresponds to a different structure of the two-way classified group means.
4 Czech: modely analýzy rozptylu dvojného třídění (two-way ANOVA models).
Interaction model

No structure is imposed on the group means, which can in general be written as
$$
m_{g,h} = \alpha_0 + \alpha^Z_g + \alpha^W_h + \alpha^{ZW}_{g,h}, \qquad g = 1, \ldots, G,\; h = 1, \ldots, H, \tag{9.2}
$$
where
$$
\alpha_0, \quad
\alpha^Z = (\alpha^Z_1, \ldots, \alpha^Z_G)^\top, \quad
\alpha^W = (\alpha^W_1, \ldots, \alpha^W_H)^\top, \quad
\alpha^{ZW} = (\alpha^{ZW}_{1,1}, \ldots, \alpha^{ZW}_{G,1}, \ldots, \alpha^{ZW}_{1,H}, \ldots, \alpha^{ZW}_{G,H})^\top
$$
are the regression parameters. If n_{g,h} > 0 for all g, h, the rank of the related linear model is G · H, see Section 8.7. This explains why the interaction model is also called the saturated model5: its regression space has the maximal possible vector dimension, equal to the number of group means. Identification of the regression coefficients is possibly achieved by the sum constraints (see Section 8.7.2)
$$
\sum_{g=1}^{G} \alpha^Z_g = 0, \qquad
\sum_{h=1}^{H} \alpha^W_h = 0, \qquad
\sum_{h=1}^{H} \alpha^{ZW}_{g,h} = 0, \;\; g = 1, \ldots, G, \qquad
\sum_{g=1}^{G} \alpha^{ZW}_{g,h} = 0, \;\; h = 1, \ldots, H. \tag{9.3}
$$
Having imposed the sum constraints (9.3), the regression coefficients coincide with the mean of the means m and with the main and interaction effects, respectively (see Section 8.7.2 for the corresponding derivations). That is,
$$
\alpha_0 = m, \qquad
\alpha^Z_g = \eta^Z_g = m_{g\bullet} - m, \quad g = 1, \ldots, G,
$$
$$
\alpha^W_h = \eta^W_h = m_{\bullet h} - m, \quad h = 1, \ldots, H, \qquad
\alpha^{ZW}_{g,h} = \eta^{ZW}_{g,h} = m_{g,h} - m_{g\bullet} - m_{\bullet h} + m, \quad g = 1, \ldots, G,\; h = 1, \ldots, H.
$$
Section 8.7 also explains a possible full-rank parameterization of the underlying linear model, which parameterizes the two-way classified group means as
$$
m_{g,h} = \beta_0 + c_g^\top \beta^Z + d_h^\top \beta^W + (d_h \otimes c_g)^\top \beta^{ZW}, \qquad g = 1, \ldots, G,\; h = 1, \ldots, H, \tag{9.4}
$$
where
$$
C_{G \times (G-1)} = \begin{pmatrix} c_1^\top \\ \vdots \\ c_G^\top \end{pmatrix}, \qquad
D_{H \times (H-1)} = \begin{pmatrix} d_1^\top \\ \vdots \\ d_H^\top \end{pmatrix}
$$
are chosen (pseudo)contrast matrices, and
$$
\beta_0, \quad
\beta^Z = (\beta^Z_1, \ldots, \beta^Z_{G-1})^\top, \quad
\beta^W = (\beta^W_1, \ldots, \beta^W_{H-1})^\top, \quad
\beta^{ZW} = (\beta^{ZW}_{1,1}, \ldots, \beta^{ZW}_{G-1,1}, \ldots, \beta^{ZW}_{1,H-1}, \ldots, \beta^{ZW}_{G-1,H-1})^\top
$$
are the regression parameters.
5 Czech: saturovaný model (saturated model).
In the following, let symbols Z and W denote the terms in the model matrix that correspond to
the main effects β Z of the covariate Z, and β W of the covariate W , respectively. Let further Z : W
denote the terms corresponding to the interaction effects β ZW . The interaction model will then
symbolically be written as
MZW : ∼ Z + W + Z : W.
Additive model

It is obtained as a submodel of the interaction model (9.2) where it is requested that
$$
\alpha^{ZW}_{1,1} = \cdots = \alpha^{ZW}_{G,H},
$$
which in the full-rank parameterization (9.4) corresponds to requesting
$$
\beta^{ZW} = 0_{(G-1)\cdot(H-1)}.
$$
Hence the group means can be written as
$$
m_{g,h} = \alpha_0 + \alpha^Z_g + \alpha^W_h, \qquad g = 1, \ldots, G,\; h = 1, \ldots, H, \tag{9.5}
$$
$$
\phantom{m_{g,h}} = \beta_0 + c_g^\top \beta^Z + d_h^\top \beta^W. \tag{9.6}
$$
In Section 8.7.5, we showed that if n_{g,h} > 0 for all g, h, the rank of the linear model with the two-way classified group means satisfying (9.5) is G + H − 1. The additive model will symbolically be written as
$$
M_{Z+W}: \; \sim Z + W.
$$
Note. It can easily be shown that n_{g•} > 0 for all g = 1, ..., G and n_{•h} > 0 for all h = 1, ..., H suffice for the rank of the related linear model to remain G + H − 1. This guarantees, among other things, that all parameters that are estimable in the additive model with n_{g,h} > 0 for all g, h are still estimable under the weaker requirement n_{g•} > 0 for all g = 1, ..., G and n_{•h} > 0 for all h = 1, ..., H. That is, if the additive model can be assumed, it is not necessary to have observations for all possible combinations of the values of the two covariates (factors), and the same types of statistical inference are possible. This is often exploited in the area of designed experiments, where it might be impractical or even impossible to obtain observations under all possible covariate combinations.
See Section 8.7.5 for what the additive model implies for the two-way classified group means. Most importantly:
(i) For each g_1 ≠ g_2, g_1, g_2 ∈ {1, ..., G}, the difference m_{g_1,h} − m_{g_2,h} does not depend on the value of h ∈ {1, ..., H} and is equal to the difference between the corresponding means of the means by the first factor, i.e.,
$$
m_{g_1,h} - m_{g_2,h} = m_{g_1 \bullet} - m_{g_2 \bullet} = \theta_{g_1,g_2 \bullet},
$$
which is expressed using the parameterizations (9.5) and (9.6) as
$$
\theta_{g_1,g_2 \bullet} = \alpha^Z_{g_1} - \alpha^Z_{g_2} = (c_{g_1} - c_{g_2})^\top \beta^Z.
$$
(ii) For each h_1 ≠ h_2, h_1, h_2 ∈ {1, ..., H}, the difference m_{g,h_1} − m_{g,h_2} does not depend on the value of g ∈ {1, ..., G} and is equal to the difference between the corresponding means of the means by the second factor, i.e.,
$$
m_{g,h_1} - m_{g,h_2} = m_{\bullet h_1} - m_{\bullet h_2} = \theta_{\bullet h_1,h_2},
$$
which is expressed using the parameterizations (9.5) and (9.6) as
$$
\theta_{\bullet h_1,h_2} = \alpha^W_{h_1} - \alpha^W_{h_2} = (d_{h_1} - d_{h_2})^\top \beta^W.
$$
Model of effect of Z only

It is obtained as a submodel of the additive model (9.5) by requesting
$$
\alpha^W_1 = \cdots = \alpha^W_H,
$$
which in the full-rank parameterization (9.6) corresponds to requesting
$$
\beta^W = 0_{H-1}.
$$
Hence the group means can be written as
$$
m_{g,h} = \alpha_0 + \alpha^Z_g, \qquad g = 1, \ldots, G,\; h = 1, \ldots, H, \tag{9.7}
$$
$$
\phantom{m_{g,h}} = \beta_0 + c_g^\top \beta^Z.
$$
This is in fact a linear model for the group means one-way classified by the values of the covariate Z, whose rank is G as soon as n_{g•} > 0 for all g = 1, ..., G. The model of effect of Z only will symbolically be written as
$$
M_Z: \; \sim Z.
$$
The two-way classified group means then satisfy:
(i) For each g = 1, ..., G, m_{g,1} = ⋯ = m_{g,H} = m_{g•}.
(ii) m_{•1} = ⋯ = m_{•H}.
Model of effect of W only

It is the same as the model of effect of Z only with the meanings of Z and W exchanged. That is, the model of effect of W only is obtained as a submodel of the additive model (9.5) by requesting
$$
\alpha^Z_1 = \cdots = \alpha^Z_G,
$$
which in the full-rank parameterization (9.6) corresponds to requesting
$$
\beta^Z = 0_{G-1}.
$$
Hence the group means can be written as
$$
m_{g,h} = \alpha_0 + \alpha^W_h, \qquad g = 1, \ldots, G,\; h = 1, \ldots, H, \tag{9.8}
$$
$$
\phantom{m_{g,h}} = \beta_0 + d_h^\top \beta^W.
$$
The model of effect of W only will symbolically be written as
$$
M_W: \; \sim W.
$$
Intercept only model
This is a submodel of either the model (9.7) of effect of Z only, where it is requested that
$$
\alpha^Z_1 = \cdots = \alpha^Z_G,
$$
or the model (9.8) of effect of W only, where it is requested that
$$
\alpha^W_1 = \cdots = \alpha^W_H.
$$
Hence the group means can be written as
$$
m_{g,h} = \alpha_0, \qquad g = 1, \ldots, G,\; h = 1, \ldots, H.
$$
As usual, this model will symbolically be denoted as
$$
M_0: \; \sim 1.
$$
Summary

In summary, we consider the following models for the two-way classification:

Model                        Rank       Requirement for rank
M_{ZW}:  ∼ Z + W + Z:W       G·H        n_{g,h} > 0 for all g = 1, ..., G, h = 1, ..., H
M_{Z+W}: ∼ Z + W             G+H−1      n_{g•} > 0 for all g = 1, ..., G, and n_{•h} > 0 for all h = 1, ..., H
M_Z:     ∼ Z                 G          n_{g•} > 0 for all g = 1, ..., G
M_W:     ∼ W                 H          n_{•h} > 0 for all h = 1, ..., H
M_0:     ∼ 1                 1          n > 0
The considered models form two sequences of nested submodels:
(i) M0 ⊂ MZ ⊂ MZ+W ⊂ MZW ;
(ii) M0 ⊂ MW ⊂ MZ+W ⊂ MZW .
Related submodel testing then corresponds to evaluating whether the two-way classified group
means satisfy a particular structure invoked by the submodel at hand. If normality of the error
terms is assumed, the testing can be performed by the methodology of Chapter 5 (F-tests on
submodels).
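Testing, e.g., the additive model against the interaction model under normality is the F-test on a submodel from Chapter 5. The following numpy sketch (simulated balanced data with hypothetical sizes; treatment contrasts are just one possible full-rank parameterization) computes that F-statistic:

```python
import numpy as np

rng = np.random.default_rng(1)
G, H, J = 3, 2, 4                       # hypothetical balanced design
Z = np.repeat(np.arange(G), H * J)      # factor Z per observation
W = np.tile(np.repeat(np.arange(H), J), G)
m_true = 1.0 + 0.5 * Z + 1.5 * W        # additive true group means
Y = m_true + rng.normal(0, 1, size=G * H * J)

def rss(X, Y):
    """Residual sum of squares of the least squares fit of Y on X."""
    fit = X @ np.linalg.lstsq(X, Y, rcond=None)[0]
    return float(((Y - fit) ** 2).sum())

one = np.ones((G * H * J, 1))
Xz = (Z[:, None] == np.arange(1, G)).astype(float)   # contrasts for Z
Xw = (W[:, None] == np.arange(1, H)).astype(float)   # contrasts for W
Xzw = np.hstack([Xz[:, [g]] * Xw[:, [h]]             # interaction columns
                 for g in range(G - 1) for h in range(H - 1)])

X_add = np.hstack([one, Xz, Xw])                     # M_{Z+W}
X_int = np.hstack([one, Xz, Xw, Xzw])                # M_{ZW}

# F-test of the additive submodel against the interaction model
df1 = (G - 1) * (H - 1)
df2 = G * H * J - G * H
F = ((rss(X_add, Y) - rss(X_int, Y)) / df1) / (rss(X_int, Y) / df2)
print(F >= 0)   # True; compare with quantiles of F_{df1, df2}
```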
9.2.3 Least squares estimation
Also for the two-way classification, explicit formulas for some of the LSE-related quantities can be derived, and certain properties of the least squares based inference can then be drawn.
Notation (Sample means in two-way classification).
$$
\bar{Y}_{g,h\bullet} := \frac{1}{n_{g,h}} \sum_{j=1}^{n_{g,h}} Y_{g,h,j}, \qquad g = 1, \ldots, G,\; h = 1, \ldots, H,
$$
$$
\bar{Y}_{g\bullet} := \frac{1}{n_{g\bullet}} \sum_{h=1}^{H} \sum_{j=1}^{n_{g,h}} Y_{g,h,j}
 = \frac{1}{n_{g\bullet}} \sum_{h=1}^{H} n_{g,h}\, \bar{Y}_{g,h\bullet}, \qquad g = 1, \ldots, G,
$$
$$
\bar{Y}_{\bullet h} := \frac{1}{n_{\bullet h}} \sum_{g=1}^{G} \sum_{j=1}^{n_{g,h}} Y_{g,h,j}
 = \frac{1}{n_{\bullet h}} \sum_{g=1}^{G} n_{g,h}\, \bar{Y}_{g,h\bullet}, \qquad h = 1, \ldots, H,
$$
$$
\bar{Y} := \frac{1}{n} \sum_{g=1}^{G} \sum_{h=1}^{H} \sum_{j=1}^{n_{g,h}} Y_{g,h,j}
 = \frac{1}{n} \sum_{g=1}^{G} n_{g\bullet}\, \bar{Y}_{g\bullet}
 = \frac{1}{n} \sum_{h=1}^{H} n_{\bullet h}\, \bar{Y}_{\bullet h}.
$$
As usual, \hat{m}_{g,h}, g = 1, ..., G, h = 1, ..., H, denote the LSE of the two-way classified group means and \hat{m} = (\hat{m}_{1,1}, \ldots, \hat{m}_{G,H})^\top.
Theorem 9.2 (Least squares estimation in two-way ANOVA linear models).
The fitted values and the LSE of the group means in two-way ANOVA linear models are given as follows (always for g = 1, ..., G, h = 1, ..., H, j = 1, ..., n_{g,h}).
(i) Interaction model M_{ZW}: ∼ Z + W + Z:W:
$$
\hat{m}_{g,h} = \hat{Y}_{g,h,j} = \bar{Y}_{g,h\bullet}.
$$
(ii) Additive model M_{Z+W}: ∼ Z + W:
$$
\hat{m}_{g,h} = \hat{Y}_{g,h,j} = \bar{Y}_{g\bullet} + \bar{Y}_{\bullet h} - \bar{Y},
$$
but only in the case of balanced data6 (n_{g,h} = J for all g = 1, ..., G, h = 1, ..., H).
(iii) Model of effect of Z only M_Z: ∼ Z:
$$
\hat{m}_{g,h} = \hat{Y}_{g,h,j} = \bar{Y}_{g\bullet}.
$$
(iv) Model of effect of W only M_W: ∼ W:
$$
\hat{m}_{g,h} = \hat{Y}_{g,h,j} = \bar{Y}_{\bullet h}.
$$
(v) Intercept only model M_0: ∼ 1:
$$
\hat{m}_{g,h} = \hat{Y}_{g,h,j} = \bar{Y}.
$$
Note. There exists no simple expression for the fitted values in the additive model in the case of unbalanced data. See Searle (1987, Section 4.9) for more details.
Proof.
Only the fitted values in the additive model must be derived now.

6 Czech: vyvážená data (balanced data).
Models M_{ZW}, M_Z, M_W are, in fact, one-way ANOVA models, where we already know that the fitted values are equal to the corresponding group means. Model M_0 is also nothing new.
The fitted values in the additive model can be calculated by solving the normal equations corresponding to the parameterization
$$
m_{g,h} = \alpha_0 + \alpha^Z_g + \alpha^W_h, \qquad g = 1, \ldots, G,\; h = 1, \ldots, H,
$$
while imposing the identifying constraints
$$
\sum_{g=1}^{G} \alpha^Z_g = 0, \qquad \sum_{h=1}^{H} \alpha^W_h = 0.
$$
For the additive model with balanced data (n_{g,h} = J for all g = 1, ..., G, h = 1, ..., H):
• Sum of squares to be minimized:
$$
SS(\alpha) = \sum_{g} \sum_{h} \sum_{j} \bigl(Y_{g,h,j} - \alpha_0 - \alpha^Z_g - \alpha^W_h\bigr)^2.
$$
• Normal equations (derivatives of SS(α) divided by −2 and set to zero):
$$
\sum_{g} \sum_{h} \sum_{j} Y_{g,h,j} - GHJ\,\alpha_0 - HJ \sum_{g} \alpha^Z_g - GJ \sum_{h} \alpha^W_h = 0,
$$
$$
\sum_{h} \sum_{j} Y_{g,h,j} - HJ\,\alpha_0 - HJ\,\alpha^Z_g - J \sum_{h} \alpha^W_h = 0, \qquad g = 1, \ldots, G,
$$
$$
\sum_{g} \sum_{j} Y_{g,h,j} - GJ\,\alpha_0 - J \sum_{g} \alpha^Z_g - GJ\,\alpha^W_h = 0, \qquad h = 1, \ldots, H.
$$
• After exploiting the identifying constraints:
$$
\sum_{g} \sum_{h} \sum_{j} Y_{g,h,j} - GHJ\,\alpha_0 = 0,
$$
$$
\sum_{h} \sum_{j} Y_{g,h,j} - HJ\,\alpha_0 - HJ\,\alpha^Z_g = 0, \qquad g = 1, \ldots, G,
$$
$$
\sum_{g} \sum_{j} Y_{g,h,j} - GJ\,\alpha_0 - GJ\,\alpha^W_h = 0, \qquad h = 1, \ldots, H.
$$
• Hence
$$
\hat\alpha_0 = \bar{Y}, \qquad
\hat\alpha^Z_g = \bar{Y}_{g\bullet} - \bar{Y}, \quad g = 1, \ldots, G, \qquad
\hat\alpha^W_h = \bar{Y}_{\bullet h} - \bar{Y}, \quad h = 1, \ldots, H.
$$
• And then
$$
\hat{m}_{g,h} = \hat\alpha_0 + \hat\alpha^Z_g + \hat\alpha^W_h = \bar{Y}_{g\bullet} + \bar{Y}_{\bullet h} - \bar{Y}, \qquad g = 1, \ldots, G,\; h = 1, \ldots, H. \qquad \square
$$
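The closed-form fitted values of Theorem 9.2(ii) can be checked against a generic least squares fit. A numpy sketch on simulated balanced data (hypothetical sizes; treatment contrasts used for the design matrix):

```python
import numpy as np

rng = np.random.default_rng(7)
G, H, J = 3, 4, 2                                  # balanced: n_{g,h} = J
Z = np.repeat(np.arange(G), H * J)
W = np.tile(np.repeat(np.arange(H), J), G)
Y = rng.normal(size=G * H * J)

# Least squares fit of the additive model via a full-rank design matrix
X = np.hstack([np.ones((Y.size, 1)),
               (Z[:, None] == np.arange(1, G)).astype(float),
               (W[:, None] == np.arange(1, H)).astype(float)])
fitted = X @ np.linalg.lstsq(X, Y, rcond=None)[0]

# Closed form of Theorem 9.2(ii): Ybar_{g.} + Ybar_{.h} - Ybar
Ybar_g = np.array([Y[Z == g].mean() for g in range(G)])
Ybar_h = np.array([Y[W == h].mean() for h in range(H)])
closed = Ybar_g[Z] + Ybar_h[W] - Y.mean()

print(np.allclose(fitted, closed))   # True (requires balanced data)
```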
End of Lecture #16 (26/11/2015)
Start of Lecture #17 (26/11/2015)

Consequence of Theorem 9.2 (LSE of the means of the means in the interaction and the additive model with balanced data).
With balanced data (n_{g,h} = J for all g = 1, ..., G, h = 1, ..., H), the LSE of the means of the means by the first factor (parameters m_{1•}, ..., m_{G•}) or by the second factor (parameters m_{•1}, ..., m_{•H}) satisfy, in both the interaction and the additive two-way ANOVA linear models, the following:
$$
\hat{m}_{g\bullet} = \bar{Y}_{g\bullet}, \quad g = 1, \ldots, G, \qquad
\hat{m}_{\bullet h} = \bar{Y}_{\bullet h}, \quad h = 1, \ldots, H.
$$
If additionally normality is assumed, then \hat{m}^Z := (\hat{m}_{1\bullet}, \ldots, \hat{m}_{G\bullet})^\top and \hat{m}^W := (\hat{m}_{\bullet 1}, \ldots, \hat{m}_{\bullet H})^\top satisfy
$$
\hat{m}^Z \mid Z, W \sim N_G\bigl(m^Z, \sigma^2 V_Z\bigr), \qquad
\hat{m}^W \mid Z, W \sim N_H\bigl(m^W, \sigma^2 V_W\bigr),
$$
where
$$
m^Z = (m_{1\bullet}, \ldots, m_{G\bullet})^\top, \qquad
m^W = (m_{\bullet 1}, \ldots, m_{\bullet H})^\top, \qquad
V_Z = \frac{1}{JH}\, I_G, \qquad
V_W = \frac{1}{JG}\, I_H.
$$
Proof. All parameters m_{g•}, g = 1, ..., G, and m_{•h}, h = 1, ..., H, are linear combinations of the group means (of the response mean vector μ = E(Y | Z, W)) and hence are estimable, with the LSE being the appropriate linear combination of the LSE of the group means. With balanced data, we get for the considered models (calculation shown only for the LSE of m_{g•}, g = 1, ..., G):
(i) Interaction model:
$$
\hat{m}_{g\bullet} = \frac{1}{H} \sum_{h=1}^{H} \hat{m}_{g,h}
= \frac{1}{H} \sum_{h=1}^{H} \bar{Y}_{g,h\bullet}
= \frac{1}{HJ} \sum_{h=1}^{H} J\,\bar{Y}_{g,h\bullet}
= \frac{1}{n_{g\bullet}} \sum_{h=1}^{H} n_{g,h}\,\bar{Y}_{g,h\bullet}
= \bar{Y}_{g\bullet}.
$$
(ii) Additive model:
$$
\hat{m}_{g\bullet} = \frac{1}{H} \sum_{h=1}^{H} \hat{m}_{g,h}
= \frac{1}{H} \sum_{h=1}^{H} \bigl(\bar{Y}_{g\bullet} + \bar{Y}_{\bullet h} - \bar{Y}\bigr)
= \bar{Y}_{g\bullet} + \frac{1}{H} \sum_{h=1}^{H} \bar{Y}_{\bullet h} - \bar{Y}
= \bar{Y}_{g\bullet} + \frac{1}{H} \sum_{h=1}^{H} \frac{GJ\,\bar{Y}_{\bullet h}}{GJ} - \bar{Y}
$$
$$
= \bar{Y}_{g\bullet} + \underbrace{\frac{1}{n} \sum_{h=1}^{H} n_{\bullet h}\,\bar{Y}_{\bullet h}}_{\bar{Y}} - \bar{Y}
= \bar{Y}_{g\bullet}.
$$
Further, E(\bar{Y}_{g\bullet} \mid Z, W) = m_{g\bullet} follows from the unbiasedness of the LSE or from direct calculation. Next,
$$
var\bigl(\bar{Y}_{g\bullet} \mid Z, W\bigr)
= var\Bigl(\frac{1}{JH} \sum_{h=1}^{H} \sum_{j=1}^{J} Y_{g,h,j} \;\Big|\; Z, W\Bigr)
= \frac{\sigma^2}{JH}
$$
follows from the linear model assumption var(Y | Z, W) = σ² I_n. Finally, normality of \bar{Y}_{g\bullet} in the case of a normal linear model follows from the general LSE theory. □

9.2.4 Sums of squares and ANOVA tables with balanced data
Sums of squares
As already mentioned in Section 9.2.2, the considered models form two sequences of nested submodels:
(i) M0 ⊂ MZ ⊂ MZ+W ⊂ MZW ;
(ii) M0 ⊂ MW ⊂ MZ+W ⊂ MZW .
The corresponding differences in the residual sums of squares (that enter the numerator of the respective F-statistic) are given as squared Euclidean norms of the differences between the fitted values of the two models being compared (Section 5.1). In particular, in the case of balanced data (n_{g,h} = J, g = 1, ..., G, h = 1, ..., H), we get
$$
SS\bigl(Z + W + Z{:}W \mid Z + W\bigr) = \sum_{g=1}^{G} \sum_{h=1}^{H} J \bigl(\bar{Y}_{g,h\bullet} - \bar{Y}_{g\bullet} - \bar{Y}_{\bullet h} + \bar{Y}\bigr)^2,
$$
$$
SS\bigl(Z + W \mid W\bigr) = \sum_{g=1}^{G} \sum_{h=1}^{H} J \bigl(\bar{Y}_{g\bullet} + \bar{Y}_{\bullet h} - \bar{Y} - \bar{Y}_{\bullet h}\bigr)^2 = \sum_{g=1}^{G} \sum_{h=1}^{H} J \bigl(\bar{Y}_{g\bullet} - \bar{Y}\bigr)^2,
$$
$$
SS\bigl(Z + W \mid Z\bigr) = \sum_{g=1}^{G} \sum_{h=1}^{H} J \bigl(\bar{Y}_{g\bullet} + \bar{Y}_{\bullet h} - \bar{Y} - \bar{Y}_{g\bullet}\bigr)^2 = \sum_{g=1}^{G} \sum_{h=1}^{H} J \bigl(\bar{Y}_{\bullet h} - \bar{Y}\bigr)^2,
$$
$$
SS\bigl(Z \mid 1\bigr) = \sum_{g=1}^{G} \sum_{h=1}^{H} J \bigl(\bar{Y}_{g\bullet} - \bar{Y}\bigr)^2, \qquad
SS\bigl(W \mid 1\bigr) = \sum_{g=1}^{G} \sum_{h=1}^{H} J \bigl(\bar{Y}_{\bullet h} - \bar{Y}\bigr)^2.
$$
We see that
$$
SS\bigl(Z + W \mid W\bigr) = SS\bigl(Z \mid 1\bigr), \qquad
SS\bigl(Z + W \mid Z\bigr) = SS\bigl(W \mid 1\bigr).
$$
Nevertheless, note that this does not hold in the case of unbalanced data.
Notation (Sums of squares in two-way classification).
In the case of two-way classification and balanced data, we will use the following notation:
$$
SS_Z := \sum_{g=1}^{G} \sum_{h=1}^{H} J \bigl(\bar{Y}_{g\bullet} - \bar{Y}\bigr)^2, \qquad
SS_W := \sum_{g=1}^{G} \sum_{h=1}^{H} J \bigl(\bar{Y}_{\bullet h} - \bar{Y}\bigr)^2,
$$
$$
SS_{ZW} := \sum_{g=1}^{G} \sum_{h=1}^{H} J \bigl(\bar{Y}_{g,h\bullet} - \bar{Y}_{g\bullet} - \bar{Y}_{\bullet h} + \bar{Y}\bigr)^2,
$$
$$
SS_T := \sum_{g=1}^{G} \sum_{h=1}^{H} \sum_{j=1}^{J} \bigl(Y_{g,h,j} - \bar{Y}\bigr)^2, \qquad
SS^{ZW}_e := \sum_{g=1}^{G} \sum_{h=1}^{H} \sum_{j=1}^{J} \bigl(Y_{g,h,j} - \bar{Y}_{g,h\bullet}\bigr)^2.
$$
Notes.
• Quantities SS_Z, SS_W, SS_{ZW} are differences of the residual sums of squares of two models that differ by the terms Z, W or Z:W, respectively.
• Quantity SS_T is the classical total sum of squares.
• Quantity SS^{ZW}_e is the residual sum of squares from the interaction model.
Lemma 9.3 (Breakdown of the total sum of squares in a balanced two-way classification).
In the case of a balanced two-way classification, the following identity holds:
$$
SS_T = SS_Z + SS_W + SS_{ZW} + SS^{ZW}_e.
$$
Proof.
The decomposition in the lemma corresponds to the numerator sums of squares of the F-statistics when testing the series of submodels
$$
M_0 \subset M_Z \subset M_{Z+W} \subset M_{ZW}
$$
or the series of submodels
$$
M_0 \subset M_W \subset M_{Z+W} \subset M_{ZW}.
$$
Let \mathcal{M}_0, \mathcal{M}_Z, \mathcal{M}_W, \mathcal{M}_{Z+W}, \mathcal{M}_{ZW} be the regression spaces of the models M_0, M_Z, M_W, M_{Z+W}, M_{ZW}, respectively. Then SS_T = ||U_0||^2, where U_0 are the residuals of model M_0 and
$$
U_0 = D_1 + D_2 + D_3 + U_{ZW},
$$
where D_1, D_2, D_3, U_{ZW} are mutually orthogonal projections of Y into subspaces of R^n:
(i) D_1: projection into \mathcal{M}_Z \setminus \mathcal{M}_0, with ||D_1||^2 = SS_Z.
(ii) D_2: projection into \mathcal{M}_{Z+W} \setminus \mathcal{M}_Z, with ||D_2||^2 = SS(Z + W | Z) = SS_W.
(iii) D_3: projection into \mathcal{M}_{ZW} \setminus \mathcal{M}_{Z+W}, with ||D_3||^2 = SS_{ZW}.
(iv) U_{ZW}: projection into R^n \setminus \mathcal{M}_{ZW} (the residual space of M_{ZW}).
From orthogonality: SS_T = SS_Z + SS_W + SS_{ZW} + SS^{ZW}_e. □
ANOVA tables

As a consequence of the above considerations, the following holds for balanced data:
(i) Equally labeled rows in the type I ANOVA table are the same irrespective of whether the table is formed in the order Z + W + Z:W or in the order W + Z + Z:W.
(ii) The type I and type II ANOVA tables are the same.
Both the type I and the type II ANOVA table then take the form
Effect (Term)   Degrees of freedom   Effect sum of squares   Effect mean square   F-stat.   P-value
Z               G−1                  SS_Z                    ?                    ?         ?
W               H−1                  SS_W                    ?                    ?         ?
Z:W             GH−G−H+1             SS_{ZW}                 ?                    ?         ?
Residual        n−GH                 SS_e^{ZW}               ?
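The sums of squares, the decomposition of Lemma 9.3, and the missing (?) mean squares and F-statistics can all be computed directly. An illustrative numpy sketch on simulated balanced data (hypothetical sizes):

```python
import numpy as np

rng = np.random.default_rng(42)
G, H, J = 4, 3, 5                          # hypothetical balanced layout
Y = rng.normal(size=(G, H, J))             # Y[g, h, j]
n = G * H * J

Ybar = Y.mean()
Ybar_gh = Y.mean(axis=2)                   # Ybar_{g,h.}
Ybar_g = Y.mean(axis=(1, 2))               # Ybar_{g.}
Ybar_h = Y.mean(axis=(0, 2))               # Ybar_{.h}

SS_Z = H * J * ((Ybar_g - Ybar) ** 2).sum()
SS_W = G * J * ((Ybar_h - Ybar) ** 2).sum()
SS_ZW = J * ((Ybar_gh - Ybar_g[:, None] - Ybar_h[None, :] + Ybar) ** 2).sum()
SS_e = ((Y - Ybar_gh[:, :, None]) ** 2).sum()
SS_T = ((Y - Ybar) ** 2).sum()

# Lemma 9.3: the total sum of squares decomposes exactly
print(np.isclose(SS_T, SS_Z + SS_W + SS_ZW + SS_e))   # True

# Mean squares and F-statistics filling the '?' entries of the table
df = {"Z": G - 1, "W": H - 1, "Z:W": (G - 1) * (H - 1), "Residual": n - G * H}
MS_e = SS_e / df["Residual"]
for term, SS in [("Z", SS_Z), ("W", SS_W), ("Z:W", SS_ZW)]:
    print(term, df[term], round(SS, 3), round(SS / df[term] / MS_e, 3))
```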
Chapter 10

Checking Model Assumptions
In Chapter 4, we introduced some basic, mostly graphical methods to check the model assumptions. Now, we introduce some additional methods, mostly based on statistical tests. As in Chapter 4, we assume that the data are represented by n random vectors (Y_i, Z_i^⊤)^⊤, Z_i = (Z_{i,1}, ..., Z_{i,p})^⊤ ∈ 𝒵 ⊆ R^p, i = 1, ..., n. Possibly two sets of regressors are available:
(i) X_i, i = 1, ..., n, where X_i = t_X(Z_i) for some transformation t_X: R^p → R^k. They give rise to the model matrix
$$
X_{n \times k} = \begin{pmatrix} X_1^\top \\ \vdots \\ X_n^\top \end{pmatrix}
= \bigl(X^0, \ldots, X^{k-1}\bigr).
$$
For most practical problems, X^0 = (1, ..., 1)^⊤ (almost surely).
(ii) V_i, i = 1, ..., n, where V_i = t_V(Z_i) for some transformation t_V: R^p → R^l. They give rise to the model matrix
$$
V_{n \times l} = \begin{pmatrix} V_1^\top \\ \vdots \\ V_n^\top \end{pmatrix}
= \bigl(V^1, \ldots, V^l\bigr).
$$
Primarily, we will assume that the model matrix X is sufficient to be able to assume that E(Y | Z) = E(Y | X) = Xβ for some β = (β_0, ..., β_{k−1})^⊤ ∈ R^k. That is, we will start from assuming
$$
Y \mid Z \sim \bigl(X\beta, \sigma^2 I_n\bigr),
$$
or even from assuming normality, i.e.,
$$
Y \mid Z \sim N_n\bigl(X\beta, \sigma^2 I_n\bigr).
$$
The task is now to verify appropriateness of those assumptions that, in principle, consist of four
subassumptions outlined in Chapter 4:
(A1) Correct regression function (errors with a zero mean).
(A2) Homoscedasticity of errors.
(A3) Uncorrelated/independent errors.
(A4) Normal errors.
10.1
In this section, we technically derive some expressions that will be useful in later sections of this chapter and also in Chapter 14. We will deal with two models:
(i) Model M: Y | Z ∼ (Xβ, σ² I_n).
(ii) Model Mg: Y | Z ∼ (Xβ + Vγ, σ² I_n), where the model matrix is the n × (k + l) matrix G = (X, V).
Notation (Quantities derived under the two models).
(i) Quantities derived while assuming model M will be denoted as usual. In particular:
• (Any) solution to the normal equations: b = (X^⊤ X)^− X^⊤ Y. In the case of a full-rank model matrix X,
$$
\hat\beta = (X^\top X)^{-1} X^\top Y
$$
is the LSE of the vector β in model M;
• Hat matrix (projection matrix into the regression space \mathcal{M}(X)):
$$
H = X (X^\top X)^- X^\top = \bigl(h_{i,t}\bigr)_{i,t=1,\ldots,n};
$$
• Fitted values: \hat{Y} = HY = (\hat{Y}_1, \ldots, \hat{Y}_n)^\top;
• Projection matrix into the residual space \mathcal{M}(X)^\perp:
$$
M = I_n - H = \bigl(m_{i,t}\bigr)_{i,t=1,\ldots,n};
$$
• Residuals: U = Y − \hat{Y} = MY = (U_1, ..., U_n)^⊤;
• Residual sum of squares: SS_e = ||U||².
(ii) Analogous quantities derived while assuming model Mg will be indicated by a subscript g:
• (Any) solution to the normal equations: (b_g^⊤, c_g^⊤)^⊤ = (G^⊤ G)^− G^⊤ Y. In the case of a full-rank model matrix G,
$$
(\hat\beta_g^\top, \hat\gamma_g^\top)^\top = (G^\top G)^{-1} G^\top Y
$$
provides the LSE of the vectors β and γ in model Mg;
• Hat matrix (projection matrix into the regression space \mathcal{M}(G)):
$$
H_g = G (G^\top G)^- G^\top = \bigl(h_{g,i,t}\bigr)_{i,t=1,\ldots,n};
$$
• Fitted values: \hat{Y}_g = H_g Y = (\hat{Y}_{g,1}, \ldots, \hat{Y}_{g,n})^\top;
• Projection matrix into the residual space \mathcal{M}(G)^\perp: M_g = I_n − H_g = (m_{g,i,t})_{i,t=1,\ldots,n};
• Residuals: U_g = Y − \hat{Y}_g = M_g Y = (U_{g,1}, ..., U_{g,n})^⊤;
• Residual sum of squares: SS_{e,g} = ||U_g||².
Lemma 10.1 (Model with added regressors).
Quantities derived while assuming model M: Y | Z ∼ (Xβ, σ² I_n) and quantities derived while assuming model Mg: Y | Z ∼ (Xβ + Vγ, σ² I_n) are mutually in the following relationship:
$$
\hat{Y}_g = \hat{Y} + MV \bigl(V^\top M V\bigr)^- V^\top U = X b_g + V c_g, \qquad \text{for some } b_g \in R^k,\; c_g \in R^l.
$$
Vectors b_g and c_g such that \hat{Y}_g = X b_g + V c_g satisfy
$$
c_g = \bigl(V^\top M V\bigr)^- V^\top U, \qquad
b_g = b - \bigl(X^\top X\bigr)^- X^\top V c_g \quad \text{for some } b = \bigl(X^\top X\bigr)^- X^\top Y.
$$
Finally,
$$
SS_e - SS_{e,g} = ||M V c_g||^2.
$$
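The identities of the lemma can be verified numerically. A numpy sketch with simulated data and hypothetical dimensions (X is full rank almost surely here, so the g-inverse reduces to an ordinary inverse):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, l = 40, 3, 2                       # hypothetical sizes
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, k - 1))])
V = rng.normal(size=(n, l))
Y = X @ rng.normal(size=k) + rng.normal(size=n)

M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T   # residual projection of M
U = M @ Y                                          # residuals of model M

# c_g = (V' M V)^- V' U (ordinary inverse in the full-rank case)
c_g = np.linalg.solve(V.T @ M @ V, V.T @ U)

G_mat = np.hstack([X, V])                          # model matrix of Mg
U_g = Y - G_mat @ np.linalg.lstsq(G_mat, Y, rcond=None)[0]
drop = U @ U - U_g @ U_g                           # SS_e - SS_{e,g}

print(np.isclose(drop, (M @ V @ c_g) @ (M @ V @ c_g)))   # True
```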
Proof.
• \hat{Y}_g is a projection of Y into \mathcal{M}(X, V) = \mathcal{M}(X, MV).
• Use "H = X (X^⊤ X)^− X^⊤" with the model matrix (X, MV), noting that X^⊤ MV = 0:
$$
H_g = (X,\, MV)
\begin{pmatrix} (X^\top X)^- & 0 \\ 0 & (V^\top M V)^- \end{pmatrix}
\begin{pmatrix} X^\top \\ V^\top M \end{pmatrix}
= X \bigl(X^\top X\bigr)^- X^\top + MV \bigl(V^\top M V\bigr)^- V^\top M.
$$
• So that
$$
\hat{Y}_g = H_g Y
= \underbrace{X \bigl(X^\top X\bigr)^- X^\top Y}_{\hat{Y}} + MV \bigl(V^\top M V\bigr)^- V^\top \underbrace{MY}_{U}
= \hat{Y} + MV \bigl(V^\top M V\bigr)^- V^\top U. \qquad (\star)
$$
• Theorem 2.5: it must be possible to write \hat{Y}_g as
$$
\hat{Y}_g = X b_g + V c_g,
$$
where (b_g, c_g) solves the normal equations based on the model matrix (X, V).
• We rewrite (\star) to see what b_g and c_g could be. Remember that \hat{Y} = Xb for any b = (X^⊤ X)^− X^⊤ Y. Take (\star) and calculate further:
$$
\hat{Y}_g = \underbrace{Xb}_{\hat{Y}} + \underbrace{\bigl(I_n - X (X^\top X)^- X^\top\bigr)}_{M} V \bigl(V^\top M V\bigr)^- V^\top U
$$
$$
= Xb + V \bigl(V^\top M V\bigr)^- V^\top U - X \bigl(X^\top X\bigr)^- X^\top V \bigl(V^\top M V\bigr)^- V^\top U
$$
$$
= X \underbrace{\bigl(b - (X^\top X)^- X^\top V (V^\top M V)^- V^\top U\bigr)}_{b_g} + V \underbrace{\bigl(V^\top M V\bigr)^- V^\top U}_{c_g}.
$$
• That is, c_g = (V^⊤ M V)^− V^⊤ U, b_g = b − (X^⊤ X)^− X^⊤ V c_g.
• Finally,
$$
SS_e - SS_{e,g} = ||\hat{Y}_g - \hat{Y}||^2 = ||MV (V^\top M V)^- V^\top U||^2 = ||M V c_g||^2. \qquad \square
$$
End of Lecture #17 (26/11/2015)
10.2 Correct regression function
We are now assuming a linear model
$$
M: \; Y \mid Z \sim \bigl(X\beta, \sigma^2 I_n\bigr),
$$
which, written using the error terms, is
$$
M: \; Y = X\beta + \varepsilon, \qquad
E(\varepsilon \mid Z) = E(\varepsilon) = 0_n, \quad
var(\varepsilon \mid Z) = var(\varepsilon) = \sigma^2 I_n.
$$
The assumption (A1) of a correct regression function is, in particular,
$$
E(Y \mid Z) = X\beta \text{ for some } \beta \in R^k, \qquad
E(Y \mid Z) \in \mathcal{M}(X), \qquad
E(\varepsilon \mid Z) = E(\varepsilon) = 0_n.
$$
As (also) explained in Section 4.1, assumption (A1) implies
$$
E(U \mid Z) = 0_n,
$$
and this property is exploited by a basic diagnostic tool: a plot of the residuals against possible factors derived from the covariates Z that may influence the residual expectation. Such factors include:
(i) the fitted values \hat{Y};
(ii) regressors included in the model M (columns of the model matrix X);
(iii) regressors not included in the model M (columns of the model matrix V).
Assumptions.
For the rest of this section, we assume that model M is a model of general rank r with intercept, that is,
$$
rank(X) = r \le k < n, \qquad X = \bigl(X^0, \ldots, X^{k-1}\bigr), \quad X^0 = 1_n.
$$
In the following, we develop methods to examine whether, for a given j (j ∈ {1, ..., k−1}), the jth regressor, i.e., the column X^j, is correctly included in the model matrix X. In other words, we will aim at examining whether the jth regressor is possibly responsible for a violation of assumption (A1).
10.2.1 Partial residuals
Notation (Model with a removed regressor).
For j ∈ {1, ..., k−1}, let X^{(−j)} denote the model matrix X without the column X^j and let
$$
\beta^{(-j)} = \bigl(\beta_0, \ldots, \beta_{j-1}, \beta_{j+1}, \ldots, \beta_{k-1}\bigr)^\top
$$
denote the regression coefficients vector without the jth element. The model with a removed jth regressor will be the linear model
$$
M^{(-j)}: \; Y \mid Z \sim \bigl(X^{(-j)} \beta^{(-j)}, \sigma^2 I_n\bigr).
$$
Start of Lecture #18 (03/12/2015)
All quantities related to the model M^{(−j)} will be indicated by a superscript (−j). In particular,
$$
M^{(-j)} = I_n - X^{(-j)} \bigl(X^{(-j)\top} X^{(-j)}\bigr)^- X^{(-j)\top}
$$
is the projection matrix into the residual space \mathcal{M}(X^{(-j)})^\perp, and
$$
U^{(-j)} = M^{(-j)} Y
$$
is the vector of residuals of the model M^{(−j)}.
Assumptions.
We will assume rank(X^{(−j)}) = r − 1, which implies that
(i) X^j ∉ \mathcal{M}(X^{(-j)});
(ii) X^j ≠ 0_n;
(iii) X^j is not a multiple of the vector 1_n.
Derivations towards partial residuals

Model M is now a model with one regressor added to the model M^{(−j)}, and the two models form a model–submodel pair. Let
$$
b = \bigl(b_0, \ldots, b_{j-1}, b_j, b_{j+1}, \ldots, b_{k-1}\bigr)^\top
$$
be (any) solution to the normal equations in model M. Lemma 10.1 (model with added regressors) provides
$$
b_j = \bigl(X^{j\top} M^{(-j)} X^j\bigr)^- X^{j\top} U^{(-j)}. \tag{10.1}
$$
Further, since the matrix M^{(−j)} is idempotent, we have
$$
X^{j\top} M^{(-j)} X^j = ||M^{(-j)} X^j||^2.
$$
At the same time, M^{(−j)} X^j ≠ 0_n since X^j ∉ \mathcal{M}(X^{(-j)}), X^j ≠ 0_n. Hence X^{j\top} M^{(-j)} X^j > 0 and the pseudoinverse in (10.1) can be replaced by an inverse. That is,
$$
b_j = \hat\beta_j = \bigl(X^{j\top} M^{(-j)} X^j\bigr)^{-1} X^{j\top} U^{(-j)}
= \frac{X^{j\top} U^{(-j)}}{X^{j\top} M^{(-j)} X^j}
$$
is the LSE of the estimable parameter β_j of model M (which is its BLUE).
In summary, under the assumptions used to perform the derivations above, i.e., while assuming that X^0 = 1_n, and for the chosen j ∈ {1, ..., k−1}, the regression coefficient β_j is estimable. Consequently, we define the vector of jth partial residuals of model M as follows.
Definition 10.1 (Partial residuals).
The vector of jth partial residuals1 of model M is the vector
$$
U^{part,j} = U + \hat\beta_j X^j
= \begin{pmatrix} U_1 + \hat\beta_j X_{1,j} \\ \vdots \\ U_n + \hat\beta_j X_{n,j} \end{pmatrix}.
$$
Note. We have
$$
U^{part,j} = U + \hat\beta_j X^j = Y - \bigl(Xb - \hat\beta_j X^j\bigr) = Y - \bigl(\hat{Y} - \hat\beta_j X^j\bigr).
$$
That is, the jth partial residuals are calculated as the (classical) residuals where, however, the part of the fitted values that corresponds to the column X^j of the model matrix is not subtracted.
Theorem 10.2 (Property of partial residuals).
Let Y | Z ∼ (Xβ, σ² I_n), rank(X_{n×k}) = r ≤ k, X^0 = 1_n, β = (β_0, ..., β_{k−1})^⊤. Let j ∈ {1, ..., k−1} be such that rank(X^{(−j)}) = r − 1 and let \hat\beta_j be the LSE of β_j. Let us consider a linear model (a regression line with covariates X^j) with
• the jth partial residuals U^{part,j} as the response;
• the matrix (1_n, X^j) as the model matrix;
• regression coefficients γ_j = (γ_{j,0}, γ_{j,1})^⊤.
The least squares estimators of the parameters γ_{j,0} and γ_{j,1} are
$$
\hat\gamma_{j,0} = 0, \qquad \hat\gamma_{j,1} = \hat\beta_j.
$$
Proof.
• U^{part,j} = U + \hat\beta_j X^j.
• Hence
$$
||U^{part,j} - \gamma_{j,0} 1_n - \gamma_{j,1} X^j||^2
= ||U - \bigl\{\gamma_{j,0} 1_n + (\gamma_{j,1} - \hat\beta_j) X^j\bigr\}||^2 = (\star).
$$
• Since 1_n ∈ \mathcal{M}(X), X^j ∈ \mathcal{M}(X), U ∈ \mathcal{M}(X)^\perp, we have
$$
(\star) = ||U||^2 + ||\gamma_{j,0} 1_n + (\gamma_{j,1} - \hat\beta_j) X^j||^2 \ge ||U||^2,
$$
with equality if and only if γ_{j,0} = 0 and γ_{j,1} = \hat\beta_j. □

1 Czech: vektor jtých parciálních reziduí (vector of the jth partial residuals).
Shifted partial residuals

Notation (Response, regressor and partial residuals means).
Let
$$
\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i, \qquad
\bar{X}^j = \frac{1}{n} \sum_{i=1}^{n} X_{i,j}, \qquad
\bar{U}^{part,j} = \frac{1}{n} \sum_{i=1}^{n} U_i^{part,j}.
$$
If X^0 = 1_n (model with intercept), we have
$$
0 = \sum_{i=1}^{n} U_i = \sum_{i=1}^{n} \bigl(U_i^{part,j} - \hat\beta_j X_{i,j}\bigr), \qquad
\frac{1}{n} \sum_{i=1}^{n} U_i^{part,j} = \hat\beta_j\, \frac{1}{n} \sum_{i=1}^{n} X_{i,j}, \qquad
\bar{U}^{part,j} = \hat\beta_j \bar{X}^j.
$$
Especially for the purpose of visualization by plotting the partial residuals against the regressors, shifted partial residuals are sometimes used. Note that this only changes the estimated intercept of the regression line of the dependence of the partial residuals on the regressor.
Definition 10.2 (Shifted partial residuals).
The vector of jth response-mean partial residuals of model M is the vector
$$
U^{part,j,Y} = U^{part,j} + \bigl(\bar{Y} - \hat\beta_j \bar{X}^j\bigr) 1_n.
$$
The vector of jth zero-mean partial residuals of model M is the vector
$$
U^{part,j,0} = U^{part,j} - \hat\beta_j \bar{X}^j 1_n.
$$
Notes.
• The mean of the response-mean partial residuals is the response sample mean \bar{Y}, i.e.,
$$
\frac{1}{n} \sum_{i=1}^{n} U_i^{part,j,Y} = \bar{Y}.
$$
• The mean of the zero-mean partial residuals is zero, i.e.,
$$
\frac{1}{n} \sum_{i=1}^{n} U_i^{part,j,0} = 0.
$$
The zero-mean partial residuals are calculated by the R function residuals with its type argument set to "partial".
Notes (Use of partial residuals).
The vector of partial residuals can be interpreted as a response vector from which the possible effect of all remaining regressors has been removed. Hence, the dependence of U^{part,j} on X^j shows
• the net effect of the jth regressor on the response;
• the partial effect of the jth regressor on the response, adjusted for the effect of the remaining regressors.
The partial residuals are then mainly used in two ways:
Diagnostic tool. As a (graphical) diagnostic tool, the scatterplot (X^j, U^{part,j}) is used. In case the jth regressor is correctly included in the original regression model M, i.e., if no transformation of the regressor X^j is required to achieve E(Y | Z) ∈ \mathcal{M}(X), the points in the scatterplot (X^j, U^{part,j}) should lie along a line.
Visualization. The property that the estimated slope of the regression line in a model U^{part,j} ∼ X^j is the same as the jth estimated regression coefficient in the multiple regression model Y ∼ X is also used to visualize the dependence of the response on the jth regressor by showing the scatterplot (X^j, U^{part,j}) equipped with a line with zero intercept and slope equal to \hat\beta_j.
10.2.2 Test for linearity of the effect
To examine the appropriateness of the linearity of the effect of the jth regressor $\boldsymbol X^j$ on the response expectation $E(\boldsymbol Y \mid \mathbb Z)$ by a statistical test, we can use a test on a submodel (which per se requires the additional assumption of normality). Without loss of generality, assume that the jth regressor $\boldsymbol X^j$ is the last column of the model matrix $\mathbb X$ and denote the remaining non-intercept columns of the matrix $\mathbb X$ as $\mathbb X_0$. That is, assume that
\[
\mathbb X = \bigl(\mathbf 1_n,\; \mathbb X_0,\; \boldsymbol X^j\bigr).
\]
Two classical choices of a pair model–submodel being tested in this context are the following.
More general parameterization of the jth regressor

The submodel is the model M with the model matrix $\mathbb X$. The (larger) model $M_g$ is obtained by replacing the column $\boldsymbol X^j$ in the model matrix $\mathbb X$ by a matrix $\mathbb V$ such that
\[
\boldsymbol X^j \in \mathcal M(\mathbb V), \qquad \operatorname{rank}(\mathbb V) \geq 2.
\]
That is, the model matrices of the submodel and the (larger) model are
\[
\text{Submodel } M: \;\bigl(\mathbf 1_n,\, \mathbb X_0,\, \boldsymbol X^j\bigr) = \mathbb X; \qquad
\text{(Larger) model } M_g: \;\bigl(\mathbf 1_n,\, \mathbb X_0,\, \mathbb V\bigr).
\]
Classical choices of the matrix $\mathbb V$ are such that it corresponds to:
(i) a polynomial of degree $d \geq 2$ based on the regressor $\boldsymbol X^j$;
(ii) a regression spline of degree $d \geq 1$ based on the regressor $\boldsymbol X^j$. In this case, $\mathbf 1_n \in \mathcal M(\mathbb V)$ and hence, for practical calculations, the larger model $M_g$ is usually estimated using the model matrix $(\mathbb X_0,\, \mathbb V)$, which does not explicitly include the intercept term; the intercept is included implicitly.
Categorization of the jth regressor

Let $-\infty < x_j^{low} < x_j^{upp} < \infty$ be chosen such that the interval $\bigl(x_j^{low},\, x_j^{upp}\bigr)$ covers the values $X_{1,j}, \ldots, X_{n,j}$ of the jth regressor. That is,
\[
x_j^{low} < \min_i X_{i,j}, \qquad \max_i X_{i,j} < x_j^{upp}.
\]
Let $I_1, \ldots, I_H$ be $H > 1$ subintervals of $\bigl(x_j^{low},\, x_j^{upp}\bigr)$ based on a grid
\[
x_j^{low} < \lambda_1 < \cdots < \lambda_{H-1} < x_j^{upp}.
\]
Let $x_h \in I_h$, $h = 1, \ldots, H$, be chosen representative values for each of the subintervals $I_1, \ldots, I_H$ (e.g., their midpoints) and let
\[
\boldsymbol X^{j,cut} = \bigl(X_1^{j,cut}, \ldots, X_n^{j,cut}\bigr)^\top
\]
be obtained by categorization of the jth regressor using the division $I_1, \ldots, I_H$ and the representatives $x_1, \ldots, x_H$, i.e., for $i = 1, \ldots, n$:
\[
X_i^{j,cut} = x_h \quad \Longleftrightarrow \quad X_i^j \in I_h, \qquad h = 1, \ldots, H.
\]
In this way, we obtained a categorical ordinal regressor $\boldsymbol X^{j,cut}$ whose values $x_1, \ldots, x_H$ can be considered as collapsed values of the original regressor $\boldsymbol X^j$. Consequently, if linearity with respect to the original regressor $\boldsymbol X^j$ holds, then it also holds (approximately, depending on the chosen division $I_1, \ldots, I_H$ and the representatives $x_1, \ldots, x_H$) with respect to the ordinal categorical regressor $\boldsymbol X^{j,cut}$ if this is viewed as a numeric one.
Let $\mathbb V$ be an $n \times (H-1)$ model matrix corresponding to some (pseudo)contrast parameterization of the covariate $\boldsymbol X^{j,cut}$ viewed as categorical with $H$ levels. We have
\[
\boldsymbol X^{j,cut} \in \mathcal M(\mathbb V),
\]
and a test for linearity of the jth regressor is obtained by considering the following model matrices in the submodel and the (larger) model:
\[
\text{Submodel } M: \;\bigl(\mathbf 1_n,\, \mathbb X_0,\, \boldsymbol X^{j,cut}\bigr); \qquad
\text{(Larger) model } M_g: \;\bigl(\mathbf 1_n,\, \mathbb X_0,\, \mathbb V\bigr).
\]
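Both constructions reduce to an F-test on a submodel. A minimal sketch in Python of choice (i), a degree-2 polynomial, on simulated data (the notes use R; the data, seed, and helper `rss` are illustrative assumptions, not part of the course material):

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(2)
n = 80
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 0.8 * x + 0.6 * x**2 + rng.normal(scale=0.4, size=n)  # effect is not linear

def rss(M, y):
    """Residual sum of squares of the OLS fit with model matrix M."""
    b, *_ = np.linalg.lstsq(M, y, rcond=None)
    return np.sum((y - M @ b) ** 2)

X_sub = np.column_stack([np.ones(n), x])       # submodel M: linear effect of x
V = np.column_stack([x, x**2])                 # V replaces X^j: degree-2 polynomial
X_big = np.column_stack([np.ones(n), V])       # larger model Mg

rss0, rss1 = rss(X_sub, y), rss(X_big, y)
df1 = X_big.shape[1] - X_sub.shape[1]          # numerator degrees of freedom
df2 = n - X_big.shape[1]
F = ((rss0 - rss1) / df1) / (rss1 / df2)       # F-statistic of the submodel test
p = f_dist.sf(F, df1, df2)
print(p < 0.05)                                # linearity is rejected here
```

The categorization variant works identically; only $\mathbb V$ would instead hold a contrast parameterization of $\boldsymbol X^{j,cut}$.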
Drawback of tests for linearity of the effect

Recall that the hypothesis of linearity of the effect of the jth regressor always forms the null hypothesis of the proposed submodel tests. Hence we are only able to confirm non-linearity of the effect (if the submodel is rejected) but are never able to confirm linearity.
10.3 Homoscedasticity
We are again assuming a linear model
\[
M: \;\boldsymbol Y \mid \mathbb Z \sim \bigl(\mathbb X\boldsymbol\beta,\; \sigma^2 \mathbf I_n\bigr),
\]
which, written using the error terms, is
\[
M: \;\boldsymbol Y = \mathbb X\boldsymbol\beta + \boldsymbol\varepsilon, \qquad
E(\boldsymbol\varepsilon \mid \mathbb Z) = E(\boldsymbol\varepsilon) = \mathbf 0_n, \quad
\operatorname{var}(\boldsymbol\varepsilon \mid \mathbb Z) = \operatorname{var}(\boldsymbol\varepsilon) = \sigma^2 \mathbf I_n.
\]
The assumption (A2) of homoscedasticity is, in particular,
\[
\operatorname{var}(\boldsymbol Y \mid \mathbb Z) = \sigma^2 \mathbf I_n, \qquad
\operatorname{var}(\boldsymbol\varepsilon \mid \mathbb Z) = \operatorname{var}(\boldsymbol\varepsilon) = \sigma^2 \mathbf I_n,
\]
where $\sigma^2$ is unknown but, most importantly, constant.
10.3.1 Tests of homoscedasticity
Many tests of homoscedasticity can be found in the literature. They mostly consider the following null and alternative hypotheses:
\[
H_0: \sigma^2 = \text{const}, \qquad
H_1: \sigma^2 = \text{a certain function of some factor(s)}.
\]
A particular test is then sensitive (powerful) in detecting heteroscedasticity if the variance $\sigma^2$ is indeed the function of the factor(s) specified by the alternative hypothesis. The test may be weak in detecting heteroscedasticity (weak to reject the null hypothesis of homoscedasticity) if heteroscedasticity expresses itself in a way different from the considered alternative hypothesis.
10.3.2 Score tests of homoscedasticity
A wide range of tests of homoscedasticity can be derived by assuming a (full-rank) normal linear
model, basing the alternative hypothesis on a particular general linear model and then using
an (asymptotic) maximum-likelihood theory to derive a testing procedure.
Assumptions.
For the rest of this section, we assume that model M (the model under the null hypothesis) is normal and of full rank, i.e.,
\[
M: \;\boldsymbol Y \mid \mathbb Z \sim \mathcal N_n\bigl(\mathbb X\boldsymbol\beta,\; \sigma^2 \mathbf I_n\bigr), \qquad \operatorname{rank}(\mathbb X_{n\times k}) = k,
\]
and that the alternative model is a generalization of M into a general normal linear model
\[
M_{hetero}: \;\boldsymbol Y \mid \mathbb Z \sim \mathcal N_n\bigl(\mathbb X\boldsymbol\beta,\; \sigma^2 \mathbf W^{-1}\bigr),
\]
where
\[
\mathbf W = \operatorname{diag}(w_1, \ldots, w_n), \qquad
w_i^{-1} = \tau(\boldsymbol\lambda, \boldsymbol\beta, \boldsymbol Z_i), \quad i = 1, \ldots, n,
\]
and $\tau$ is a known function of $\boldsymbol\lambda \in \mathbb R^q$, $\boldsymbol\beta \in \mathbb R^k$ (regression coefficients) and $\boldsymbol z \in \mathbb R^p$ (covariates) such that
\[
\tau(\mathbf 0, \boldsymbol\beta, \boldsymbol z) = 1 \qquad \text{for all } \boldsymbol\beta \in \mathbb R^k,\; \boldsymbol z \in \mathbb R^p.
\]
Model $M_{hetero}$ is then a model with unknown parameters $\boldsymbol\beta$, $\boldsymbol\lambda$, $\sigma^2$ which, with $\boldsymbol\lambda = \mathbf 0$, simplifies into model M. In other words, model M is a nested² model of model $M_{hetero}$ and a test of homoscedasticity corresponds to testing
\[
H_0: \boldsymbol\lambda = \mathbf 0, \qquad H_1: \boldsymbol\lambda \neq \mathbf 0.
\tag{10.2}
\]
Having assumed normality, both models M and Mhetero are fully parametric models and a standard
(asymptotic) maximum-likelihood theory can now be used to derive a test of (10.2). A family of score
tests based on specific choices of the weight function τ is derived by Cook and Weisberg (1983).
Breusch-Pagan test

A particular score test of homoscedasticity was also derived by Breusch and Pagan (1979), who consider the following weight function ($\boldsymbol x = t_X(\boldsymbol z)$ is the transformation of the original covariates that determines the regressors of model M):
\[
\tau(\lambda, \boldsymbol\beta, \boldsymbol z) = \tau(\lambda, \boldsymbol\beta, \boldsymbol x) = \exp\bigl(\lambda\, \boldsymbol x^\top \boldsymbol\beta\bigr).
\]
That is, under the heteroscedastic model, for $i = 1, \ldots, n$,
\[
\operatorname{var}(Y_i \mid \boldsymbol Z_i) = \operatorname{var}(\varepsilon_i \mid \boldsymbol Z_i)
= \sigma^2 \exp\bigl(\lambda\, \boldsymbol X_i^\top \boldsymbol\beta\bigr)
= \sigma^2 \exp\bigl\{\lambda\, E(Y_i \mid \boldsymbol Z_i)\bigr\},
\tag{10.3}
\]
and the test of homoscedasticity is testing
\[
H_0: \lambda = 0, \qquad H_1: \lambda \neq 0.
\]
It is seen from the model (10.3) that the Breusch-Pagan test is sensitive (powerful to detect heteroscedasticity) if the residual variance is a monotone function of the response expectation.
Note (One-sided tests of homoscedasticity).
In practical situations, if it can be assumed that the residual variance is possibly a monotone function of the response expectation, then it can mostly also be assumed that it is an increasing function thereof. A more powerful test of homoscedasticity is then obtained by considering the one-sided alternative
\[
H_1: \lambda > 0.
\]
Analogously, a test that is sensitive towards the alternative of a residual variance which decreases with the response expectation is obtained by considering the alternative $H_1: \lambda < 0$.
Note (Koenker's studentized Breusch-Pagan test).
The original Breusch-Pagan test is derived using standard maximum-likelihood theory while departing from the assumption of a normal linear model. It has been shown in the literature that the test is not robust towards non-normality. For this reason, Koenker (1981) derived a slightly modified version of the Breusch-Pagan test which is robust towards non-normality. It is usually referred to as (Koenker's) studentized Breusch-Pagan test and its use is preferred to the original test.

² vnořený (Czech: nested)
Linear dependence on the regressors

Let $t_W: \mathbb R^p \longrightarrow \mathbb R^q$ be a given transformation, $\boldsymbol w := t_W(\boldsymbol z)$, $\boldsymbol W_i = t_W(\boldsymbol Z_i)$, $i = 1, \ldots, n$. The following choice of the weight function can be considered:
\[
\tau(\boldsymbol\lambda, \boldsymbol\beta, \boldsymbol z) = \tau(\boldsymbol\lambda, \boldsymbol w) = \exp\bigl(\boldsymbol\lambda^\top \boldsymbol w\bigr).
\]
That is, under the heteroscedastic model, for $i = 1, \ldots, n$,
\[
\operatorname{var}(Y_i \mid \boldsymbol Z_i) = \operatorname{var}(\varepsilon_i \mid \boldsymbol Z_i) = \sigma^2 \exp\bigl(\boldsymbol\lambda^\top \boldsymbol W_i\bigr).
\]
On a log-scale:
\[
\log\bigl\{\operatorname{var}(Y_i \mid \boldsymbol Z_i)\bigr\} = \underbrace{\log(\sigma^2)}_{\lambda_0} + \boldsymbol\lambda^\top \boldsymbol W_i.
\]
In other words, on a log-scale the residual variance follows a linear model with regressors given by the vectors $\boldsymbol W_i$.
If $t_W$ is a univariate transformation leading to $w = t_W(\boldsymbol z)$, one-sided alternatives are again possible, reflecting the assumption that under heteroscedasticity the residual variance increases/decreases with the value of $W = t_W(\boldsymbol Z)$. The most common use is then such that $t_W(\boldsymbol z)$ and the related values $W_1 = t_W(\boldsymbol Z_1), \ldots, W_n = t_W(\boldsymbol Z_n)$ correspond to one of the (non-intercept) regressors from either the model matrix $\mathbb X$ (regressors included in the model) or from a matrix $\mathbb V$ that contains regressors currently not included in the model. The corresponding score test of homoscedasticity then examines whether the residual variance changes/increases/decreases (depending on the chosen alternative) with that regressor.
Note (Score tests of homoscedasticity in the R software).
In the R software, the score tests of homoscedasticity are provided by functions:
(i) ncvTest (abbreviation for a “non-constant variance test”) from package car;
(ii) bptest from package lmtest.
Koenker's studentized variant of the test is available only via the bptest function.
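A minimal sketch of the studentized (Koenker) variant in Python, using the common auxiliary-regression form $LM = n R^2$ with the squared residuals regressed on the fitted values (illustrative only; in practice one would call bptest in R, and the simulated data are an assumption of this sketch):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(1, 5, size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.3 * x)   # sd grows with E(Y | x)

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
U = y - X @ b

# Koenker's studentized statistic: regress U_i^2 on the auxiliary regressors
# (here the fitted values) and take LM = n * R^2, asymptotically chi2(q)
u2 = U ** 2
A = np.column_stack([np.ones(n), X @ b])        # auxiliary model matrix, q = 1
g, *_ = np.linalg.lstsq(A, u2, rcond=None)
r2 = 1.0 - np.sum((u2 - A @ g) ** 2) / np.sum((u2 - u2.mean()) ** 2)
LM = n * r2
p = chi2.sf(LM, df=1)
print(p < 0.05)                                 # heteroscedasticity detected
```

Using the fitted values as the auxiliary regressor makes the test sensitive exactly to the alternative (10.3), i.e., variance monotone in the response expectation.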
10.3.3 Some other tests of homoscedasticity
Some other tests of homoscedasticity that can be encountered in practice include the following:
Goldfeld-Quandt test is an adaptation of a classical F-test of equality of the variances of the
two independent samples into a regression context proposed by Goldfeld and Quandt (1965).
It is applicable in linear models with both numeric and categorical covariates and under
the alternative, heteroscedasticity is expressed by a monotone dependence of the residual
variance on a prespecified ordering of the observations.
G-sample tests of homoscedasticity are tests applicable for linear models with only categorical
covariates (ANOVA models). They require repeated observations for each combination of
values of the covariates and basically test equality of variances of G independent random
samples. The most common tests of this type include:
Bartlett test by Bartlett (1937) which, however, is quite sensitive towards non-normality and
hence its use is not recommended. It is implemented in the R function bartlett.test;
Levene test by Levene (1960), implemented in the R function leveneTest from package
car or in the R function levene.test from package lawstat;
Brown-Forsythe test by Brown and Forsythe (1974) which is a robustified version of the
Levene test and is implemented in the R function levene.test from package lawstat;
Fligner-Killeen test by Fligner and Killeen (1976) which is implemented in the R function
fligner.test.
End of Lecture #18 (03/12/2015)
10.4 Normality

Start of Lecture #19 (03/12/2015)
In this section, we are assuming a normal linear model
\[
M: \;\boldsymbol Y \mid \mathbb Z \sim \mathcal N_n\bigl(\mathbb X\boldsymbol\beta,\; \sigma^2 \mathbf I_n\bigr), \qquad \operatorname{rank}(\mathbb X) = r,
\]
which, written using the error terms, is
\[
M: \;Y_i = \boldsymbol X_i^\top \boldsymbol\beta + \varepsilon_i, \qquad
\varepsilon_i \overset{i.i.d.}{\sim} \mathcal N(0, \sigma^2), \quad i = 1, \ldots, n.
\tag{10.4}
\]
Our interest now lies in verifying assumption (A4) of normality of the error terms εi , i = 1, . . . , n.
Let us recall the standard notation needed in this section:
(i) Hat matrix (the projection matrix onto the regression space $\mathcal M(\mathbb X)$):
\[
\mathbf H = \mathbb X\bigl(\mathbb X^\top \mathbb X\bigr)^{-} \mathbb X^\top = \bigl(h_{i,t}\bigr)_{i,t=1,\ldots,n};
\]
(ii) the projection matrix onto the residual space $\mathcal M(\mathbb X)^\perp$:
\[
\mathbf M = \mathbf I_n - \mathbf H = \bigl(m_{i,t}\bigr)_{i,t=1,\ldots,n};
\]
(iii) residuals: $\boldsymbol U = \boldsymbol Y - \widehat{\boldsymbol Y} = \mathbf M\boldsymbol Y = \bigl(U_1, \ldots, U_n\bigr)^\top$;
(iv) residual sum of squares: $SS_e = \lVert \boldsymbol U \rVert^2$;
(v) residual mean square: $MS_e = \frac{1}{n-r}\, SS_e$;
(vi) standardized residuals: $\boldsymbol U^{std} = \bigl(U_1^{std}, \ldots, U_n^{std}\bigr)^\top$, where
\[
U_i^{std} = \frac{U_i}{\sqrt{MS_e\, m_{i,i}}}, \qquad i = 1, \ldots, n \quad (\text{if } m_{i,i} > 0).
\]
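The notation above can be sketched numerically; a Python illustration on simulated data (the course itself uses R; data and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # full-rank model matrix
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix (here r = rank(X) = k)
M = np.eye(n) - H                           # projection onto the residual space
U = M @ y                                   # residuals
MSe = (U @ U) / (n - k)                     # residual mean square
U_std = U / np.sqrt(MSe * np.diag(M))       # standardized residuals

print(np.isclose(np.trace(H), k))           # tr(H) = rank(X)
print(np.allclose(M @ X, 0))                # residuals are orthogonal to M(X)
```

The two printed checks restate the projection properties of $\mathbf H$ and $\mathbf M$ used throughout this section.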
Lemma 10.3 Property of normal distribution.
Let $\boldsymbol Z \sim \mathcal N_n(\mathbf 0, \sigma^2 \mathbf I_n)$ and let $T: \mathbb R^n \longrightarrow \mathbb R$ be a measurable function satisfying $T(c\boldsymbol z) = T(\boldsymbol z)$ for all $c > 0$ and $\boldsymbol z \in \mathbb R^n$. The random variables $T(\boldsymbol Z)$ and $\lVert \boldsymbol Z \rVert$ are then independent.

Proof. The proof/calculations were skipped and are not requested for the exam; they are shown below only for those who are interested.
• Consider spherical coordinates:
\[
\begin{aligned}
Z_1 &= R \cos(\varphi_1), \\
Z_2 &= R \sin(\varphi_1)\cos(\varphi_2), \\
Z_3 &= R \sin(\varphi_1)\sin(\varphi_2)\cos(\varphi_3), \\
&\;\;\vdots \\
Z_{n-1} &= R \sin(\varphi_1)\cdots\sin(\varphi_{n-2})\cos(\varphi_{n-1}), \\
Z_n &= R \sin(\varphi_1)\cdots\sin(\varphi_{n-2})\sin(\varphi_{n-1}).
\end{aligned}
\]
• Distance from the origin: $R = \lVert \boldsymbol Z \rVert$.
• Direction: $\boldsymbol\varphi = (\varphi_1, \ldots, \varphi_{n-1})^\top$.
• Exercise for 3rd-year bachelor students: if $\boldsymbol Z \sim \mathcal N_n(\mathbf 0, \sigma^2 \mathbf I_n)$, then the distance $R$ from the origin and the direction $\boldsymbol\varphi$ are independent.
• $R = \lVert \boldsymbol Z \rVert$ is the distance from the origin itself, while $T(\boldsymbol Z)$ depends on the direction only (since $T(\boldsymbol Z) = T(c\boldsymbol Z)$ for all $c > 0$); hence $\lVert \boldsymbol Z \rVert$ and $T(\boldsymbol Z)$ are independent. □
Theorem 10.4 Moments of standardized residuals under normality.
Let $\boldsymbol Y \mid \mathbb X \sim \mathcal N_n\bigl(\mathbb X\boldsymbol\beta, \sigma^2 \mathbf I_n\bigr)$ and let, for a chosen $i \in \{1, \ldots, n\}$, $m_{i,i} > 0$. Then
\[
E\bigl(U_i^{std} \mid \mathbb X\bigr) = 0, \qquad \operatorname{var}\bigl(U_i^{std} \mid \mathbb X\bigr) = 1.
\]
Proof. The proof/calculations were available on the blackboard in K1. □

Notes.
If the normal linear model (10.4) holds, then Theorems 3.1 and 10.4 provide the following:
(i) For the (raw) residuals:
\[
\boldsymbol U \mid \mathbb Z \sim \mathcal N_n\bigl(\mathbf 0_n,\; \sigma^2 \mathbf M\bigr).
\]
That is, the (raw) residuals also follow a normal distribution; nevertheless, the variances of the individual residuals $U_1, \ldots, U_n$ differ (the diagonal of the projection matrix $\mathbf M$ is not necessarily constant). On top of that, the residuals are not necessarily independent (the projection matrix $\mathbf M$ is not necessarily a diagonal matrix).
(ii) For the standardized residuals (if $m_{i,i} > 0$ for all $i = 1, \ldots, n$, which is always the case in a full-rank model):
\[
E\bigl(U_i^{std} \mid \mathbb Z\bigr) = 0, \qquad \operatorname{var}\bigl(U_i^{std} \mid \mathbb Z\bigr) = 1, \qquad i = 1, \ldots, n.
\]
That is, the standardized residuals have the same mean and the same variance but are neither necessarily normally distributed nor necessarily independent.

In summary, in a normal linear model, neither the raw residuals nor the standardized residuals form a random sample (a set of i.i.d. random variables) from a normal distribution.
10.4.1 Tests of normality
There exist formal tests of the null hypothesis of normality of the error terms,
\[
H_0: \text{the distribution of } \varepsilon_1, \ldots, \varepsilon_n \text{ is normal},
\tag{10.5}
\]
where the distribution of the test statistic is exactly known under the null hypothesis of normality. Nevertheless, those tests have quite low power and hence are only rarely used in practice.

In practice, approximate approaches are used that apply standard tests of normality to either the raw residuals $\boldsymbol U$ or the standardized residuals $\boldsymbol U^{std}$ (even though, under the null hypothesis (10.5), neither of them forms a random sample from a normal distribution). Several empirical studies have shown that such approaches maintain the significance level of the test quite well at the requested value. At the same time, they mostly recommend using the raw residuals $\boldsymbol U$ rather than the standardized residuals $\boldsymbol U^{std}$.
Classical tests of normality include the following:
Shapiro-Wilk test implemented in the R function shapiro.test.
Lilliefors test implemented in the R function lillie.test from package nortest.
Anderson-Darling test implemented in the R function ad.test from package nortest.
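An illustration of the approximate approach in Python using scipy.stats.shapiro (the R analogue is shapiro.test applied to residuals(fit)); the simulated data, seed, and helper function are illustrative assumptions:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(5)
n = 100
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

def raw_residuals(y):
    """Raw residuals of the OLS fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ b

y_normal = X @ np.array([1.0, 2.0]) + rng.normal(size=n)             # normal errors
y_skewed = X @ np.array([1.0, 2.0]) + rng.exponential(size=n) - 1.0  # skewed errors

# Shapiro-Wilk applied to the raw residuals, as recommended above
p_normal = shapiro(raw_residuals(y_normal)).pvalue
p_skewed = shapiro(raw_residuals(y_skewed)).pvalue
print(p_skewed < 0.01)   # non-normality of the skewed errors is clearly detected
```

Remember the caveat above: the residuals are not an i.i.d. sample even under (10.5), so the resulting p-value is approximate.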
10.5 Uncorrelated errors
In this section, we are assuming a linear model
\[
M: \;Y_i = \boldsymbol X_i^\top \boldsymbol\beta + \varepsilon_i, \qquad
E(\varepsilon_i \mid \boldsymbol X_i) = E(\varepsilon_i) = 0, \quad
\operatorname{var}(\varepsilon_i \mid \boldsymbol X_i) = \operatorname{var}(\varepsilon_i) = \sigma^2, \quad i = 1, \ldots, n,
\tag{10.6}
\]
\[
\operatorname{cor}(\varepsilon_i, \varepsilon_l) = 0, \qquad i \neq l.
\]
Our interest now lies in verifying assumption (A3), i.e., whether the error terms $\varepsilon_i$, $i = 1, \ldots, n$, are uncorrelated.

The fact that the errors are uncorrelated often follows from the design of the study/data collection (measurements on independently behaving units, ...), and then there is no need to check this assumption. A situation in which uncorrelated errors cannot be taken for granted arises when the observations are obtained sequentially. Typical examples are
(i) time series (time does not have to be a covariate of the model), which may lead to so-called serial dependence among the error terms of the linear model;
(ii) repeated measurements performed using one measurement unit or on one subject.

In the following, we introduce a classical procedure that is used to test the null hypothesis of uncorrelated errors against the alternative of serial dependence expressed by a first-order autoregressive process.
10.5.1 Durbin-Watson test
Assumptions.
It is assumed that the ordering of the observations, expressed by their indices $1, \ldots, n$, has a practical meaning and may induce dependence between the error terms $\varepsilon_1, \ldots, \varepsilon_n$ of the model.

One of the simplest stochastic processes that captures a certain form of serial dependence is the first-order autoregressive process AR(1). Assuming this for the error terms $\varepsilon_1, \ldots, \varepsilon_n$ of the linear model (10.6) leads to the more general model
\[
M_{AR}: \;Y_i = \boldsymbol X_i^\top \boldsymbol\beta + \varepsilon_i, \quad i = 1, \ldots, n, \qquad
\varepsilon_1 = \eta_1, \quad \varepsilon_i = \varrho\, \varepsilon_{i-1} + \eta_i, \quad i = 2, \ldots, n,
\]
\[
E(\eta_i \mid \boldsymbol X_i) = E(\eta_i) = 0, \quad
\operatorname{var}(\eta_i \mid \boldsymbol X_i) = \operatorname{var}(\eta_i) = \sigma^2, \quad i = 1, \ldots, n, \qquad
\operatorname{cor}(\eta_i, \eta_l) = 0, \quad i \neq l,
\]
where $-1 < \varrho < 1$ is an additional unknown parameter of the model.

It has been shown in the course Stochastic Processes 2 (NMSA409) that
• $\varepsilon_1, \ldots, \varepsilon_n$ is a stationary process if and only if $-1 < \varrho < 1$.

Notes.
• For each $m \geq 0$: $\operatorname{cor}(\varepsilon_i, \varepsilon_{i-m}) = \varrho^m$, $i = m+1, \ldots, n$. In particular,
\[
\varrho = \operatorname{cor}(\varepsilon_i, \varepsilon_{i-1}), \qquad i = 2, \ldots, n.
\tag{10.7}
\]
A test of uncorrelated errors in model M can now be based on testing
\[
H_0: \varrho = 0, \qquad H_1: \varrho \neq 0
\]
in model $M_{AR}$. Since positive autocorrelation ($\varrho > 0$) is more common in practice, one-sided tests (with $H_1: \varrho > 0$) are frequently used as well.
Let $\boldsymbol U = (U_1, \ldots, U_n)^\top$ be the residuals from model M, which corresponds to the null hypothesis. The test statistic proposed by Durbin and Watson (1950, 1951, 1971) takes the form
\[
DW = \frac{\sum_{i=2}^n (U_i - U_{i-1})^2}{\sum_{i=1}^n U_i^2}.
\]
The testing procedure is based on the observation that the statistic DW is approximately equal to $2(1 - \widehat\varrho\,)$, where $\widehat\varrho$ is an estimator of the autoregression parameter $\varrho$ from model $M_{AR}$.
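A quick numerical check of the approximation $DW \approx 2(1 - \widehat\varrho\,)$ on simulated AR(1) errors (a Python sketch; in R one would call dwtest, and the data and seed here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
n, rho = 300, 0.6
eta = rng.normal(size=n)
eps = np.empty(n)               # AR(1) errors: eps_i = rho * eps_{i-1} + eta_i
eps[0] = eta[0]
for i in range(1, n):
    eps[i] = rho * eps[i - 1] + eta[i]

x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + eps
b, *_ = np.linalg.lstsq(X, y, rcond=None)
U = y - X @ b

DW = np.sum(np.diff(U) ** 2) / np.sum(U ** 2)        # Durbin-Watson statistic
rho_hat = np.sum(U[1:] * U[:-1]) / np.sum(U ** 2)    # lag-1 autocorrelation estimate
print(abs(DW - 2.0 * (1.0 - rho_hat)) < 0.1)         # DW is close to 2 (1 - rho_hat)
print(DW < 2.0)                                      # positive autocorrelation pulls DW below 2
```

The small discrepancy between $DW$ and $2(1 - \widehat\varrho\,)$ comes only from the boundary terms $U_1^2$ and $U_n^2$, as the derivation below makes explicit.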
Calculations.
First remember that
\[
E(U_i) := E(U_i \mid \mathbb X) = 0, \qquad i = 1, \ldots, n,
\]
and this property is maintained even if the error terms of the model are not uncorrelated (see the course of the proof of Theorem 2.3).

As the residuals can be considered as predictions of the error terms $\varepsilon_1, \ldots, \varepsilon_n$, a suitable estimator of their lag-1 covariance is
\[
\widehat\sigma_{1,2} = \widehat{\operatorname{cov}}\bigl(\varepsilon_l, \varepsilon_{l-1}\bigr)
= \frac{1}{n-1} \sum_{i=2}^n U_i U_{i-1}.
\]
Similarly, three possible estimators of the variance $\sigma^2$ of the error terms $\varepsilon_1, \ldots, \varepsilon_n$ are
\[
\widehat\sigma^2 = \widehat{\operatorname{var}}\bigl(\varepsilon_l\bigr)
= \frac{1}{n-1} \sum_{i=1}^{n-1} U_i^2
\quad\text{or}\quad
\frac{1}{n-1} \sum_{i=2}^{n} U_i^2
\quad\text{or}\quad
\frac{1}{n} \sum_{i=1}^{n} U_i^2.
\]
Then,
\[
DW = \frac{\sum_{i=2}^n (U_i - U_{i-1})^2}{\sum_{i=1}^n U_i^2}
= \frac{\sum_{i=2}^n U_i^2 + \sum_{i=2}^n U_{i-1}^2 - 2 \sum_{i=2}^n U_i U_{i-1}}{\sum_{i=1}^n U_i^2}
\approx \frac{\widehat\sigma^2 + \widehat\sigma^2 - 2\, \widehat\sigma_{1,2}}{\widehat\sigma^2}
= 2 \Bigl( 1 - \frac{\widehat\sigma_{1,2}}{\widehat\sigma^2} \Bigr)
= 2 \bigl( 1 - \widehat\varrho\, \bigr).
\]
Use of the test statistic DW for tests of $H_0: \varrho = 0$ is complicated by the fact that the distribution of DW under the null hypothesis depends on the model matrix $\mathbb X$. It is hence not possible to derive (and tabulate) critical values in full generality. In practice, two approaches are used to calculate approximate critical values and p-values:
(i) the numerical algorithm of Farebrother (1980, 1984), which is implemented in the R function dwtest from package lmtest;
(ii) the general simulation method bootstrap (introduced by Efron, 1979), whose use for the Durbin-Watson test is implemented in the R function durbinWatsonTest from package car.
10.6 Transformation of response
Especially in situations when homoscedasticity and/or normality do not hold, it is often possible to achieve a linear model in which both of those assumptions are fulfilled by a suitable (non-linear) transformation $t: \mathbb R \longrightarrow \mathbb R$ of the response. That is, one works with a normal linear model
\[
t(Y_i) = \boldsymbol X_i^\top \boldsymbol\beta + \varepsilon_i, \qquad
\varepsilon_i \overset{i.i.d.}{\sim} \mathcal N(0, \sigma^2), \quad i = 1, \ldots, n,
\tag{10.8}
\]
in which it is assumed that both homoscedasticity and normality hold. A disadvantage of a model with transformed response is that the corresponding regression function $m(\boldsymbol x) = \boldsymbol x^\top \boldsymbol\beta$ provides a model for the expectation of the transformed response and not of the original response, i.e., for $\boldsymbol x \in \mathcal X$ (the sample space of the regressors):
\[
m(\boldsymbol x) = E\bigl\{ t(Y) \mid \boldsymbol X = \boldsymbol x \bigr\} \neq t\bigl\{ E(Y \mid \boldsymbol X = \boldsymbol x) \bigr\},
\]
unless the transformation $t$ is a linear function. Similarly, the regression coefficients now have the interpretation of an expected change of the transformed response $t(Y)$ related to a unit increase of the regressor.
10.6.1 Prediction based on a model with transformed response
Nevertheless, the above-mentioned interpretational issue is not a problem when prediction of a new value of the response $Y_{new}$, given $\boldsymbol X_{new} = \boldsymbol x_{new}$, is of interest. If this is the case, we can base the prediction on the model (10.8) for the transformed response. In the following, we assume that $t$ is strictly increasing; nevertheless, the procedure can be adjusted for a decreasing or even non-monotone $t$ as well:

• Construct a prediction $\widehat Y_{new}^{trans}$ and a $(1-\alpha)\,100\%$ prediction interval $\bigl(\widehat Y_{new}^{trans,L},\; \widehat Y_{new}^{trans,U}\bigr)$ for $Y_{new}^{trans} = t(Y_{new})$ based on the model (10.8).

• Trivially, the interval
\[
\bigl(\widehat Y_{new}^{L},\; \widehat Y_{new}^{U}\bigr)
= \Bigl( t^{-1}\bigl(\widehat Y_{new}^{trans,L}\bigr),\; t^{-1}\bigl(\widehat Y_{new}^{trans,U}\bigr) \Bigr)
\tag{10.9}
\]
covers the value of $Y_{new}$ with probability $1 - \alpha$.

• The value $\widehat Y_{new} = t^{-1}\bigl(\widehat Y_{new}^{trans}\bigr)$ lies inside the prediction interval (10.9) and can be considered as a point prediction of $Y_{new}$. Only note that the prediction interval $\bigl(\widehat Y_{new}^{L},\; \widehat Y_{new}^{U}\bigr)$ is not necessarily centered around the value $\widehat Y_{new}$.
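A sketch of this back-transformation procedure in Python for $t = \log$ (the data are simulated and illustrative; the prediction standard error uses the standard normal-theory prediction-interval formula for a full-rank model):

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(7)
n = 200
x = rng.uniform(0, 2, size=n)
y = np.exp(0.5 + 0.8 * x + rng.normal(scale=0.4, size=n))   # log-normal response

# normal linear model for the transformed response t(Y) = log(Y)
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
U = np.log(y) - X @ beta
MSe = (U @ U) / (n - 2)

# 95% prediction interval for log(Y_new) at x_new = 1, then back-transformed
x_new = np.array([1.0, 1.0])
fit = x_new @ beta
se_pred = np.sqrt(MSe * (1 + x_new @ np.linalg.inv(X.T @ X) @ x_new))
q = t_dist.ppf(0.975, n - 2)
lo, hi = np.exp(fit - q * se_pred), np.exp(fit + q * se_pred)  # interval (10.9)
y_pred = np.exp(fit)                                           # point prediction
print(lo < y_pred < hi)
print(hi - y_pred > y_pred - lo)    # interval is not centered around y_pred
```

The asymmetry in the last check is exactly the point made above: $\exp$ is convex, so the upper half of the back-transformed interval is wider than the lower half.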
10.6.2 Log-normal model
A suitably interpretable model is obtained if the response is logarithmically transformed. Suppose that the following model (a normal linear model for the log-transformed response) holds:
\[
\log(Y_i) = \boldsymbol X_i^\top \boldsymbol\beta + \varepsilon_i, \qquad
\varepsilon_i \overset{i.i.d.}{\sim} \mathcal N(0, \sigma^2), \quad i = 1, \ldots, n.
\]
We then have
\[
Y_i = \exp\bigl(\boldsymbol X_i^\top \boldsymbol\beta\bigr)\, \eta_i, \qquad
\eta_i = \exp(\varepsilon_i) \overset{i.i.d.}{\sim} \mathcal{LN}(0, \sigma^2), \quad i = 1, \ldots, n,
\tag{10.10}
\]
where $\mathcal{LN}(0, \sigma^2)$ denotes a log-normal distribution with location parameter 0 and scale parameter $\sigma$. That is, under validity of the model (10.10) for the log-transformed response, the errors in the model for the original response are combined multiplicatively with the regression function.
We can easily calculate the first two moments of the log-normal distribution, which provides (for $i = 1, \ldots, n$)
\[
M := E(\eta_i) = \exp\Bigl(\frac{\sigma^2}{2}\Bigr) > 1 \;\;(\text{with } \sigma^2 > 0), \qquad
V := \operatorname{var}(\eta_i) = \bigl\{ \exp(\sigma^2) - 1 \bigr\} \exp(\sigma^2).
\]
Hence, for $\boldsymbol x \in \mathcal X$:
\[
E(Y_i \mid \boldsymbol X_i = \boldsymbol x) = M \exp\bigl(\boldsymbol x^\top \boldsymbol\beta\bigr), \qquad
\operatorname{var}(Y_i \mid \boldsymbol X_i = \boldsymbol x)
= V \exp\bigl(2\, \boldsymbol x^\top \boldsymbol\beta\bigr)
= V \cdot \Bigl\{ \frac{E(Y_i \mid \boldsymbol X_i = \boldsymbol x)}{M} \Bigr\}^2.
\tag{10.11}
\]
The log-normal model (10.10) is thus suitable in two typical situations that cause non-normality and/or heteroscedasticity of a linear model for the original response $Y$:
(i) the conditional distribution of $Y$ given $\boldsymbol X = \boldsymbol x$ is skewed. If this is the case, the log-normal distribution, which is skewed as well, may provide a satisfactory model for this distribution;
(ii) the conditional variance $\operatorname{var}(Y \mid \boldsymbol X = \boldsymbol x)$ increases with the conditional expectation $E(Y \mid \boldsymbol X = \boldsymbol x)$. This feature is captured by the log-normal model, as shown by (10.11). Indeed, under the log-normal model, $\operatorname{var}(Y \mid \boldsymbol X = \boldsymbol x)$ increases with $E(Y \mid \boldsymbol X = \boldsymbol x)$. It is then said that the logarithmic transformation stabilizes the variance.
Interpretation of regression coefficients

In the log-normal model (10.10), the (non-intercept) regression coefficients have the following interpretation. Let, for $j \in \{1, \ldots, k-1\}$,
\[
\boldsymbol x = \bigl(x_0, \ldots, x_j, \ldots, x_{k-1}\bigr)^\top \in \mathcal X, \qquad
\boldsymbol x^{j(+1)} := \bigl(x_0, \ldots, x_j + 1, \ldots, x_{k-1}\bigr)^\top \in \mathcal X,
\]
and suppose that $\boldsymbol\beta = \bigl(\beta_0, \ldots, \beta_{k-1}\bigr)^\top$. We then have
\[
\frac{E\bigl(Y \mid \boldsymbol X = \boldsymbol x^{j(+1)}\bigr)}{E\bigl(Y \mid \boldsymbol X = \boldsymbol x\bigr)}
= \frac{M \exp\bigl\{ (\boldsymbol x^{j(+1)})^\top \boldsymbol\beta \bigr\}}{M \exp\bigl(\boldsymbol x^\top \boldsymbol\beta\bigr)}
= \exp(\beta_j).
\]
Notes.
• If an ANOVA linear model with log-transformed response is fitted, the estimated differences between the group means of the log-response are equal to the estimated log-ratios between the group means of the original response.
• If a linear model with logarithmically transformed response is fitted, estimated regression coefficients, estimates of estimable parameters, etc., and the corresponding confidence intervals are often reported back-transformed (exponentiated) due to the above interpretation.
Evaluation of the impact of the regressors on the response

Evaluation of the impact of the regressors on the response requires performing statistical tests on regression coefficients or estimable parameters of a linear model. Homoscedasticity, and for small samples also normality, are needed to be able to use the standard t- or F-tests. Both homoscedasticity and normality can be achieved by a log transformation of the response. The statistical tests performed afterwards still have a reasonable practical interpretation as tests on ratios of two expectations of the (original) response.
Chapter 11

Consequences of a Problematic Regression Space
As in Chapter 10, we assume that the data are represented by $n$ random vectors $(Y_i, \boldsymbol Z_i)$, $\boldsymbol Z_i = (Z_{i,1}, \ldots, Z_{i,p})^\top \in \mathcal Z \subseteq \mathbb R^p$, $i = 1, \ldots, n$. As usual, let $\boldsymbol Y = (Y_1, \ldots, Y_n)^\top$ and let $\mathbb Z_{n\times p}$ denote the matrix with the covariate vectors $\boldsymbol Z_1, \ldots, \boldsymbol Z_n$ in its rows. Finally, let $\boldsymbol X_i$, $i = 1, \ldots, n$, where $\boldsymbol X_i = t_X(\boldsymbol Z_i)$ for some transformation $t_X: \mathbb R^p \longrightarrow \mathbb R^k$, be the regressors that give rise to the model matrix
\[
\mathbb X_{n\times k}
= \begin{pmatrix} \boldsymbol X_1^\top \\ \vdots \\ \boldsymbol X_n^\top \end{pmatrix}
= \bigl(\boldsymbol X^0, \ldots, \boldsymbol X^{k-1}\bigr).
\]
It will be assumed that $\boldsymbol X^0 = (1, \ldots, 1)^\top$ (almost surely), leading to the model matrix
\[
\mathbb X_{n\times k} = \bigl(\mathbf 1_n, \boldsymbol X^1, \ldots, \boldsymbol X^{k-1}\bigr),
\]
with an explicitly included intercept term.

Primarily, we will assume that the model matrix $\mathbb X$ is sufficient to be able to assume that $E(\boldsymbol Y \mid \mathbb Z) = E(\boldsymbol Y \mid \mathbb X) = \mathbb X\boldsymbol\beta$ for some $\boldsymbol\beta = (\beta_0, \ldots, \beta_{k-1})^\top \in \mathbb R^k$. That is, we will depart from assuming
\[
\boldsymbol Y \mid \mathbb Z \sim \bigl(\mathbb X\boldsymbol\beta,\; \sigma^2 \mathbf I_n\bigr).
\]
Finally, it will be assumed in the whole chapter that the model matrix $\mathbb X$ is of full rank, i.e.,
\[
\operatorname{rank}(\mathbb X) = k < n.
\]
11.1 Multicollinearity
A principal assumption of any regression model is the correct specification of the regression function. While assuming a linear model $\boldsymbol Y \mid \mathbb Z \sim (\mathbb X\boldsymbol\beta,\, \sigma^2 \mathbf I_n)$, this means that $E(\boldsymbol Y \mid \mathbb Z) \in \mathcal M(\mathbb X)$. To guarantee this, it seems optimal to choose the regression space $\mathcal M(\mathbb X)$ as rich as possible. In other words, if many covariates are available, it seems optimal to include a high number $k$ of columns in the model matrix $\mathbb X$. Nevertheless, as we show in this section, this approach bears certain complications.
11.1.1 Singular value decomposition of a model matrix
We are assuming $\operatorname{rank}(\mathbb X_{n\times k}) = k < n$. As was shown in the course Fundamentals of Numerical Mathematics (NMNM201), the matrix $\mathbb X$ can be decomposed as
\[
\mathbb X = \mathbb U \mathbb D \mathbb V^\top = \sum_{j=0}^{k-1} d_j\, \boldsymbol u_j \boldsymbol v_j^\top, \qquad
\mathbb D = \operatorname{diag}(d_0, \ldots, d_{k-1}),
\]
where
• $\mathbb U_{n\times k} = (\boldsymbol u_0, \ldots, \boldsymbol u_{k-1})$ holds the first $k$ orthonormal eigenvectors of the $n \times n$ matrix $\mathbb X \mathbb X^\top$;
• $\mathbb V_{k\times k} = (\boldsymbol v_0, \ldots, \boldsymbol v_{k-1})$ holds (all) orthonormal eigenvectors of the $k \times k$ (invertible) matrix $\mathbb X^\top \mathbb X$;
• $d_j = \sqrt{\lambda_j}$, $j = 0, \ldots, k-1$, where $\lambda_0 \geq \cdots \geq \lambda_{k-1} > 0$ are
  ◦ the first $k$ eigenvalues of the matrix $\mathbb X \mathbb X^\top$;
  ◦ (all) eigenvalues of the matrix $\mathbb X^\top \mathbb X$, i.e.,
\[
\mathbb X^\top \mathbb X
= \sum_{j=0}^{k-1} \lambda_j\, \boldsymbol v_j \boldsymbol v_j^\top = \mathbb V \boldsymbol\Lambda \mathbb V^\top, \qquad
\boldsymbol\Lambda = \operatorname{diag}(\lambda_0, \ldots, \lambda_{k-1}),
\]
\[
\phantom{\mathbb X^\top \mathbb X}
= \sum_{j=0}^{k-1} d_j^2\, \boldsymbol v_j \boldsymbol v_j^\top = \mathbb V \mathbb D^2 \mathbb V^\top.
\]
The numbers $d_0 \geq \cdots \geq d_{k-1} > 0$ are called the singular values¹ of the matrix $\mathbb X$. We then have
\[
\bigl(\mathbb X^\top \mathbb X\bigr)^{-1}
= \sum_{j=0}^{k-1} \frac{1}{d_j^2}\, \boldsymbol v_j \boldsymbol v_j^\top
= \mathbb V \mathbb D^{-2} \mathbb V^\top, \qquad
\operatorname{tr}\Bigl\{ \bigl(\mathbb X^\top \mathbb X\bigr)^{-1} \Bigr\} = \sum_{j=0}^{k-1} \frac{1}{d_j^2}.
\tag{11.1}
\]
Note (Moore-Penrose pseudoinverse of the matrix $\mathbb X^\top \mathbb X$).
The singular value decomposition of the model matrix $\mathbb X$ also provides a way to calculate the Moore-Penrose pseudoinverse of the matrix $\mathbb X^\top \mathbb X$ if $\mathbb X$ is of less-than-full rank. If $\operatorname{rank}(\mathbb X_{n\times k}) = r < k$, then $d_0 \geq \cdots \geq d_{r-1} > d_r = \cdots = d_{k-1} = 0$. The Moore-Penrose pseudoinverse of $\mathbb X^\top \mathbb X$ is obtained as
\[
\bigl(\mathbb X^\top \mathbb X\bigr)^{+} = \sum_{j=0}^{r-1} \frac{1}{d_j^2}\, \boldsymbol v_j \boldsymbol v_j^\top.
\]

¹ singulární hodnoty (Czech: singular values)
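A numerical sketch of this construction with NumPy's SVD, on a deliberately rank-deficient matrix (the data are illustrative; the rank threshold 1e-10 is an assumption of the sketch):

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 20, 4
B = rng.normal(size=(n, k - 1))
X = np.column_stack([B, B[:, 0] + B[:, 1]])        # last column is a linear combination

U_, d, Vt = np.linalg.svd(X, full_matrices=False)  # X = U D V^T, d sorted decreasingly
r = int(np.sum(d > 1e-10))                         # numerical rank: r = k - 1 here
V = Vt.T

# (X'X)^+ = sum_{j < r} d_j^{-2} v_j v_j^T, built from the nonzero singular values only
XtX_pinv = (V[:, :r] / d[:r] ** 2) @ V[:, :r].T

# Penrose conditions confirm the pseudoinverse property
G = X.T @ X
print(np.allclose(G @ XtX_pinv @ G, G))
print(np.allclose(XtX_pinv @ G @ XtX_pinv, XtX_pinv))
```

Dropping the zero singular values is exactly what distinguishes the pseudoinverse from the (nonexistent) inverse in the less-than-full-rank case.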
11.1.2 Multicollinearity and its impact on precision of the LSE
It is seen from (11.1) that, as $d_{k-1} \longrightarrow 0$:
(i) the matrix $\mathbb X^\top \mathbb X$ tends to a singular matrix, i.e., the columns of the model matrix $\mathbb X$ tend to being linearly dependent;
(ii) $\operatorname{tr}\bigl\{ (\mathbb X^\top \mathbb X)^{-1} \bigr\} \longrightarrow \infty$.

The situation when the columns of the (full-rank) model matrix $\mathbb X$ are close to being linearly dependent is referred to as multicollinearity.
If a linear model $\boldsymbol Y \mid \mathbb Z \sim (\mathbb X\boldsymbol\beta,\, \sigma^2 \mathbf I_n)$, $\operatorname{rank}(\mathbb X_{n\times k}) = k$, is assumed, then we know from the Gauss-Markov theorem that
(i) the vector of fitted values $\widehat{\boldsymbol Y} = (\widehat Y_1, \ldots, \widehat Y_n)^\top = \mathbf H \boldsymbol Y$, where $\mathbf H = \mathbb X(\mathbb X^\top \mathbb X)^{-1} \mathbb X^\top$, is the best linear unbiased estimator (BLUE) of the vector parameter $\boldsymbol\mu = \mathbb X\boldsymbol\beta = E(\boldsymbol Y \mid \mathbb Z)$ with
\[
\operatorname{var}\bigl(\widehat{\boldsymbol Y} \mid \mathbb Z\bigr) = \sigma^2 \mathbf H;
\]
(ii) the least squares estimator $\widehat{\boldsymbol\beta} = (\widehat\beta_0, \ldots, \widehat\beta_{k-1})^\top = (\mathbb X^\top \mathbb X)^{-1} \mathbb X^\top \boldsymbol Y$ is the BLUE of the vector of regression coefficients $\boldsymbol\beta$ with
\[
\operatorname{var}\bigl(\widehat{\boldsymbol\beta} \mid \mathbb Z\bigr) = \sigma^2 \bigl(\mathbb X^\top \mathbb X\bigr)^{-1}.
\]
It then follows that
\[
\sum_{i=1}^n \operatorname{var}\bigl(\widehat Y_i \mid \mathbb Z\bigr)
= \operatorname{tr}\bigl\{ \operatorname{var}(\widehat{\boldsymbol Y} \mid \mathbb Z) \bigr\}
= \operatorname{tr}\bigl( \sigma^2 \mathbf H \bigr) = \sigma^2 \operatorname{tr}(\mathbf H) = \sigma^2 k,
\]
\[
\sum_{j=0}^{k-1} \operatorname{var}\bigl(\widehat\beta_j \mid \mathbb Z\bigr)
= \operatorname{tr}\bigl\{ \operatorname{var}(\widehat{\boldsymbol\beta} \mid \mathbb Z) \bigr\}
= \operatorname{tr}\Bigl\{ \sigma^2 \bigl(\mathbb X^\top \mathbb X\bigr)^{-1} \Bigr\}
= \sigma^2 \operatorname{tr}\Bigl\{ \bigl(\mathbb X^\top \mathbb X\bigr)^{-1} \Bigr\}.
\]
This shows that multicollinearity
(i) does not have any impact on the precision of the LSE of the response expectation $\boldsymbol\mu = \mathbb X\boldsymbol\beta$;
(ii) may have a serious impact on the precision of the LSE of the regression coefficients $\boldsymbol\beta$. At the same time, since the LSE is BLUE, there exists no better linear unbiased estimator of $\boldsymbol\beta$. If additionally normality is assumed, there exists no better unbiased estimator at all.
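The contrast between (i) and (ii) can be seen numerically; a Python sketch with a nearly collinear column (simulated data; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 50, 3
x1 = rng.normal(size=n)

def traces(delta):
    """Return (tr(H), tr((X'X)^{-1})) when the third column is x1 + delta * noise."""
    x2 = x1 + delta * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    XtX_inv = np.linalg.inv(X.T @ X)
    H = X @ XtX_inv @ X.T
    return np.trace(H), np.trace(XtX_inv)

trH_far, tr_far = traces(1.0)      # columns well separated
trH_near, tr_near = traces(0.01)   # columns nearly collinear

# sum of var(Yhat_i) is sigma^2 * tr(H) = sigma^2 * k in both cases ...
print(np.isclose(trH_far, k) and np.isclose(trH_near, k))
# ... while sum of var(betahat_j) = sigma^2 * tr((X'X)^{-1}) explodes
print(tr_near > 100 * tr_far)
```

The fitted values stay exactly as precise as before, while the coefficient variances blow up by orders of magnitude, which is the statement above in miniature.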
The impact of multicollinearity can also be expressed by considering the problem of estimating the squared Euclidean norms of $\boldsymbol\mu = \mathbb X\boldsymbol\beta$ and $\boldsymbol\beta$, respectively. Natural estimators of those squared norms are the squared norms of the corresponding LSEs, i.e., $\lVert \widehat{\boldsymbol Y} \rVert^2$ and $\lVert \widehat{\boldsymbol\beta} \rVert^2$, respectively. As we show, those estimators are biased; nevertheless, the amount of bias does not depend on the degree of multicollinearity in the case of $\lVert \widehat{\boldsymbol Y} \rVert^2$ but does depend on it in the case of $\lVert \widehat{\boldsymbol\beta} \rVert^2$.
Lemma 11.1 Bias in estimation of the squared norms.
Let $\boldsymbol Y \mid \mathbb Z \sim (\mathbb X\boldsymbol\beta,\, \sigma^2 \mathbf I_n)$, $\operatorname{rank}(\mathbb X_{n\times k}) = k$. The following then holds:
\[
E\bigl( \lVert \widehat{\boldsymbol Y} \rVert^2 \,\big|\, \mathbb Z \bigr) - \lVert \mathbb X\boldsymbol\beta \rVert^2 = \sigma^2 k, \qquad
E\bigl( \lVert \widehat{\boldsymbol\beta} \rVert^2 \,\big|\, \mathbb Z \bigr) - \lVert \boldsymbol\beta \rVert^2
= \sigma^2 \operatorname{tr}\Bigl\{ \bigl(\mathbb X^\top \mathbb X\bigr)^{-1} \Bigr\}.
\]
Proof. In accordance with our convention, conditioning on $\mathbb Z$ will be omitted from the notation of all expectations and variances. Nevertheless, all of them are still understood as conditional expectations and variances given the covariate values $\mathbb Z$.

$E\bigl(\lVert \widehat{\boldsymbol Y} - \mathbb X\boldsymbol\beta \rVert^2\bigr)$
• Let us calculate:
\[
E\bigl\lVert \widehat{\boldsymbol Y} - \mathbb X\boldsymbol\beta \bigr\rVert^2
= E\Bigl\{ \sum_{i=1}^n \bigl(\widehat Y_i - \boldsymbol X_i^\top \boldsymbol\beta\bigr)^2 \Bigr\}
= \sum_{i=1}^n \operatorname{var}\bigl(\widehat Y_i\bigr)
= \operatorname{tr}\bigl\{ \operatorname{var}(\widehat{\boldsymbol Y}) \bigr\}
= \operatorname{tr}\bigl( \sigma^2 \mathbf H \bigr)
= \sigma^2 \operatorname{tr}(\mathbf H) = \sigma^2 k.
\]
• At the same time:
\[
E\bigl\lVert \widehat{\boldsymbol Y} - \mathbb X\boldsymbol\beta \bigr\rVert^2
= E\bigl(\widehat{\boldsymbol Y} - \mathbb X\boldsymbol\beta\bigr)^\top \bigl(\widehat{\boldsymbol Y} - \mathbb X\boldsymbol\beta\bigr)
= E\lVert \widehat{\boldsymbol Y} \rVert^2 + \lVert \mathbb X\boldsymbol\beta \rVert^2
- 2\, \boldsymbol\beta^\top \mathbb X^\top \underbrace{E\widehat{\boldsymbol Y}}_{\mathbb X\boldsymbol\beta}
= E\lVert \widehat{\boldsymbol Y} \rVert^2 - \lVert \mathbb X\boldsymbol\beta \rVert^2.
\]
• So that
\[
E\lVert \widehat{\boldsymbol Y} \rVert^2 - \lVert \mathbb X\boldsymbol\beta \rVert^2 = \sigma^2 k,
\qquad\text{i.e.,}\qquad
E\lVert \widehat{\boldsymbol Y} \rVert^2 = \lVert \mathbb X\boldsymbol\beta \rVert^2 + \sigma^2 k.
\]

$E\bigl(\lVert \widehat{\boldsymbol\beta} - \boldsymbol\beta \rVert^2\bigr)$
• Let us start in a similar way:
\[
E\bigl\lVert \widehat{\boldsymbol\beta} - \boldsymbol\beta \bigr\rVert^2
= E\Bigl\{ \sum_{j=0}^{k-1} \bigl(\widehat\beta_j - \beta_j\bigr)^2 \Bigr\}
= \sum_{j=0}^{k-1} \operatorname{var}\bigl(\widehat\beta_j\bigr)
= \operatorname{tr}\bigl\{ \operatorname{var}(\widehat{\boldsymbol\beta}) \bigr\}
= \operatorname{tr}\Bigl\{ \sigma^2 \bigl(\mathbb X^\top \mathbb X\bigr)^{-1} \Bigr\}
= \sigma^2 \operatorname{tr}\Bigl\{ \bigl(\mathbb X^\top \mathbb X\bigr)^{-1} \Bigr\}.
\]
• At the same time:
\[
E\bigl\lVert \widehat{\boldsymbol\beta} - \boldsymbol\beta \bigr\rVert^2
= E\bigl(\widehat{\boldsymbol\beta} - \boldsymbol\beta\bigr)^\top \bigl(\widehat{\boldsymbol\beta} - \boldsymbol\beta\bigr)
= E\lVert \widehat{\boldsymbol\beta} \rVert^2 + \lVert \boldsymbol\beta \rVert^2
- 2\, \boldsymbol\beta^\top \underbrace{E\widehat{\boldsymbol\beta}}_{\boldsymbol\beta}
= E\lVert \widehat{\boldsymbol\beta} \rVert^2 - \lVert \boldsymbol\beta \rVert^2.
\]
• So that
\[
E\lVert \widehat{\boldsymbol\beta} \rVert^2 - \lVert \boldsymbol\beta \rVert^2
= \sigma^2 \operatorname{tr}\Bigl\{ \bigl(\mathbb X^\top \mathbb X\bigr)^{-1} \Bigr\},
\qquad\text{i.e.,}\qquad
E\lVert \widehat{\boldsymbol\beta} \rVert^2
= \lVert \boldsymbol\beta \rVert^2
+ \underbrace{\sigma^2 \operatorname{tr}\bigl\{ (\mathbb X^\top \mathbb X)^{-1} \bigr\}}_{\sum_{j=0}^{k-1} \operatorname{var}(\widehat\beta_j)}. \qquad\square
\]

11.1.3 Variance inflation factor and tolerance
Notation. For a given linear model $\boldsymbol Y \mid \mathbb Z \sim (\mathbb X\boldsymbol\beta,\, \sigma^2 \mathbf I_n)$, $\operatorname{rank}(\mathbb X_{n\times k}) = k$, where
\[
\boldsymbol Y = (Y_1, \ldots, Y_n)^\top, \qquad
\mathbb X = (\mathbf 1_n, \boldsymbol X^1, \ldots, \boldsymbol X^{k-1}), \qquad
\boldsymbol X^j = (X_{1,j}, \ldots, X_{n,j})^\top, \quad j = 1, \ldots, k-1,
\]
the following (partly standard) notation will be used:
\[
\text{Response sample mean:} \quad \overline{Y} = \frac1n \sum_{i=1}^n Y_i;
\]
\[
\text{Square root of the total sum of squares:} \quad
T_Y = \sqrt{\textstyle\sum_{i=1}^n \bigl(Y_i - \overline{Y}\bigr)^2}
= \bigl\lVert \boldsymbol Y - \overline{Y}\, \mathbf 1_n \bigr\rVert;
\]
\[
\text{Fitted values:} \quad \widehat{\boldsymbol Y} = \bigl(\widehat Y_1, \ldots, \widehat Y_n\bigr)^\top;
\]
\[
\text{Coefficient of determination:} \quad
R^2 = 1 - \frac{\lVert \boldsymbol Y - \widehat{\boldsymbol Y} \rVert^2}{\lVert \boldsymbol Y - \overline{Y}\, \mathbf 1_n \rVert^2}
= 1 - \frac{\lVert \boldsymbol Y - \widehat{\boldsymbol Y} \rVert^2}{T_Y^2};
\]
\[
\text{Residual mean square:} \quad
MS_e = \frac{1}{n-k}\, \lVert \boldsymbol Y - \widehat{\boldsymbol Y} \rVert^2.
\]
Further, for each $j = 1, \ldots, k-1$, consider a linear model $M_j$ in which the vector $\boldsymbol X^j$ acts as the response and the model matrix is
\[
\mathbb X_{(-j)} = \bigl(\mathbf 1_n, \boldsymbol X^1, \ldots, \boldsymbol X^{j-1}, \boldsymbol X^{j+1}, \ldots, \boldsymbol X^{k-1}\bigr).
\]
The following notation will be used:

End of Lecture #19 (03/12/2015)
Start of Lecture #20 (10/12/2015)
\[
\text{Column sample mean:} \quad \overline{X}^j = \frac1n \sum_{i=1}^n X_{i,j};
\]
\[
\text{Square root of the total sum of squares from model } M_j: \quad
T_j = \sqrt{\textstyle\sum_{i=1}^n \bigl(X_{i,j} - \overline{X}^j\bigr)^2}
= \bigl\lVert \boldsymbol X^j - \overline{X}^j\, \mathbf 1_n \bigr\rVert;
\]
\[
\text{Fitted values from model } M_j: \quad
\widehat{\boldsymbol X}{}^j = \bigl(\widehat X_{1,j}, \ldots, \widehat X_{n,j}\bigr)^\top;
\]
\[
\text{Coefficient of determination from model } M_j: \quad
R_j^2 = 1 - \frac{\lVert \boldsymbol X^j - \widehat{\boldsymbol X}{}^j \rVert^2}{\lVert \boldsymbol X^j - \overline{X}^j\, \mathbf 1_n \rVert^2}
= 1 - \frac{\lVert \boldsymbol X^j - \widehat{\boldsymbol X}{}^j \rVert^2}{T_j^2}.
\]
Notes.
(i) If the data (response random variables and non-intercept covariates) $(Y_i, X_{i,1}, \ldots, X_{i,k-1})$, $i = 1, \ldots, n$, are a random sample from the distribution of a generic random vector $(Y, X_1, \ldots, X_{k-1})$, then
• the coefficient of determination $R^2$ is also the squared value of the sample coefficient of multiple correlation between $Y$ and $\boldsymbol X := (X_1, \ldots, X_{k-1})$;
• for each $j = 1, \ldots, k-1$, the coefficient of determination $R_j^2$ is also the squared value of the sample coefficient of multiple correlation between $X_j$ and $\boldsymbol X_{(-j)} := (X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_{k-1})$.
(ii) For a given $j = 1, \ldots, k-1$:
• A value of $R_j^2$ close to 1 means that the $j$th column $\boldsymbol X^j$ is almost equal to some linear combination of the columns of the matrix $\mathbb X_{(-j)}$ (the remaining columns of the model matrix). We then say that $\boldsymbol X^j$ is collinear with the remaining columns of the model matrix.
• A value of $R_j^2 = 0$ means that
  ◦ the column $\boldsymbol X^j$ is orthogonal to all remaining non-intercept regressors (the non-intercept columns of the matrix $\mathbb X_{(-j)}$);
  ◦ the $j$th regressor, represented by the random variable $X_j$, is multiply uncorrelated with the remaining regressors represented by the random vector $\boldsymbol X_{(-j)}$.
For a given linear model \(Y \,|\, Z \sim (X\beta,\, \sigma^2 I_n)\), \(\mathrm{rank}(X_{n\times k}) = k\),
\[
\widehat{\mathrm{var}}\bigl(\widehat{\beta} \,\big|\, Z\bigr) = \mathrm{MS}_e\,(X^\top X)^{-1}.
\]
The following theorem shows that the diagonal elements of the matrix \(\mathrm{MS}_e\,(X^\top X)^{-1}\), i.e., the values \(\widehat{\mathrm{var}}(\widehat{\beta}_j \,|\, Z)\), can also be calculated, for \(j = 1,\dots,k-1\), using the above defined quantities \(T_Y\), \(T_j\), \(R^2\), \(R_j^2\).

Theorem 11.2 Estimated variances of the LSE of the regression coefficients.
For a given linear model \(Y \,|\, Z \sim (X\beta,\, \sigma^2 I_n)\), \(\mathrm{rank}(X_{n\times k}) = k\), the diagonal elements of the matrix \(\mathrm{MS}_e\,(X^\top X)^{-1}\) satisfy, for \(j = 1,\dots,k-1\),
\[
\widehat{\mathrm{var}}\bigl(\widehat{\beta}_j \,\big|\, Z\bigr)
 = \Bigl(\frac{T_Y}{T_j}\Bigr)^2 \cdot \frac{1-R^2}{n-k} \cdot \frac{1}{1-R_j^2}.
\]
Proof. See Zvára (2008, Chapter 11). The proof/calculations were skipped and are not requested for the exam. □
Definition 11.1 Variance inflation factor and tolerance.
For a given \(j = 1,\dots,k-1\), the variance inflation factor² and the tolerance³ of the \(j\)th regressor of the linear model \(Y \,|\, Z \sim (X\beta,\, \sigma^2 I_n)\), \(\mathrm{rank}(X_{n\times k}) = k\), are the values \(\mathrm{VIF}_j\) and \(\mathrm{Toler}_j\), respectively, defined as
\[
\mathrm{VIF}_j = \frac{1}{1-R_j^2},
\qquad
\mathrm{Toler}_j = 1-R_j^2 = \frac{1}{\mathrm{VIF}_j}.
\]
Notes.
• With \(R_j^2 = 0\) (the \(j\)th regressor orthogonal to all remaining regressors, the \(j\)th regressor multiply uncorrelated with the remaining ones), \(\mathrm{VIF}_j = 1\).
• With \(R_j^2 \longrightarrow 1\) (the \(j\)th regressor collinear with the remaining regressors, the \(j\)th regressor almost perfectly multiply correlated with the remaining ones), \(\mathrm{VIF}_j \longrightarrow \infty\).
Interpretation and use of VIF
• If we take into account the statement of Theorem 11.2, the VIF of the \(j\)th regressor (\(j = 1,\dots,k-1\)) can be interpreted as the factor by which the (estimated) variance of \(\widehat{\beta}_j\) is multiplied (inflated) compared to an optimal situation when the \(j\)th regressor is orthogonal to (multiply uncorrelated with) the remaining regressors included in the model. Hence the term variance inflation factor.
• Under the assumption of normality, the confidence interval for \(\beta_j\) with a coverage of \(1-\alpha\) has the lower and upper bounds given as
\[
\widehat{\beta}_j \pm t_{n-k}\Bigl(1-\frac{\alpha}{2}\Bigr)\,\sqrt{\widehat{\mathrm{var}}\,\widehat{\beta}_j}.
\]
Using the statement of Theorem 11.2, the lower and upper bounds of the confidence interval for \(\beta_j\) can also be written as
\[
\widehat{\beta}_j \pm t_{n-k}\Bigl(1-\frac{\alpha}{2}\Bigr)\,\frac{T_Y}{T_j}\,\sqrt{\frac{1-R^2}{n-k}}\,\sqrt{\mathrm{VIF}_j}.
\]
That is, the (square root of the) VIF also provides a factor by which the half-length (radius) of the confidence interval is inflated compared to an optimal situation when the \(j\)th regressor is orthogonal to (multiply uncorrelated with) the remaining regressors included in the model, namely,
\[
\mathrm{VIF}_j = \Bigl(\frac{\mathrm{Vol}_j}{\mathrm{Vol}_{0,j}}\Bigr)^2,
\tag{11.2}
\]
where
\(\mathrm{Vol}_j\) = the length (volume) of the confidence interval for \(\beta_j\);
\(\mathrm{Vol}_{0,j}\) = the length (volume) of the confidence interval for \(\beta_j\) if it were \(R_j^2 = 0\).

²varianční inflační faktor
³tolerance
• Regressors with a high VIF are possibly responsible for multicollinearity. Nevertheless, the VIF
does not reveal which regressors are mutually collinear.
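The identity of Theorem 11.2, and hence the VIF interpretation above, can be verified numerically: the estimated variance \(\mathrm{MS}_e\,[(X^\top X)^{-1}]_{jj}\) must coincide with \((T_Y/T_j)^2\,(1-R^2)/(n-k)\cdot\mathrm{VIF}_j\). A minimal sketch on simulated data (the collinearity setup and all names are illustrative, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 60, 4                                       # intercept + 3 regressors
Z = rng.normal(size=(n, k - 1))
Z[:, 2] = 0.9 * Z[:, 0] + 0.4 * rng.normal(size=n)  # induce collinearity
X = np.column_stack([np.ones(n), Z])
Y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

def r2(y, M):
    """Coefficient of determination of OLS of y on model matrix M."""
    fit = M @ np.linalg.lstsq(M, y, rcond=None)[0]
    return 1.0 - np.sum((y - fit) ** 2) / np.sum((y - y.mean()) ** 2)

b = np.linalg.lstsq(X, Y, rcond=None)[0]
MSe = np.sum((Y - X @ b) ** 2) / (n - k)
var_hat = MSe * np.diag(np.linalg.inv(X.T @ X))    # MSe (X'X)^{-1}

R2, TY2 = r2(Y, X), np.sum((Y - Y.mean()) ** 2)
var_thm = []
for j in range(1, k):
    R2j = r2(X[:, j], np.delete(X, j, axis=1))     # auxiliary regression M_j
    Tj2 = np.sum((X[:, j] - X[:, j].mean()) ** 2)
    vif_j = 1.0 / (1.0 - R2j)
    # Theorem 11.2: hat-var(beta_j | Z) = (TY/Tj)^2 (1-R^2)/(n-k) * VIF_j
    var_thm.append(TY2 / Tj2 * (1.0 - R2) / (n - k) * vif_j)
```

The two routes agree exactly (up to floating-point error) for every non-intercept coefficient.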
Generalized variance inflation factor
(Beginning of a skipped part.)
A generalized variance inflation factor was derived by Fox and Monette (1992) to evaluate the degree of collinearity between a specified group of regressors and the remaining regressors. Let
• \(\mathcal{J} \subset \{1,\dots,k-1\}\), \(|\mathcal{J}| = m\);
• \(\beta_{[\mathcal{J}]}\) be the subvector of \(\beta\) having the elements indexed by \(j \in \mathcal{J}\).
Under normality, a confidence ellipsoid for \(\beta_{[\mathcal{J}]}\) with a coverage \(1-\alpha\) is
\[
\Bigl\{\beta_{[\mathcal{J}]} \in \mathbb{R}^m:\;
 \bigl(\widehat{\beta}_{[\mathcal{J}]} - \beta_{[\mathcal{J}]}\bigr)^\top
 \bigl(\mathrm{MS}_e\,V_{[\mathcal{J}]}\bigr)^{-1}
 \bigl(\widehat{\beta}_{[\mathcal{J}]} - \beta_{[\mathcal{J}]}\bigr)
 < m\,F_{m,n-k}(1-\alpha)\Bigr\},
\tag{11.3}
\]
where \(V_{[\mathcal{J}]}\) is the \((\mathcal{J} \times \mathcal{J})\) block of the matrix \((X^\top X)^{-1}\). Let
\(\mathrm{Vol}_{\mathcal{J}}\): the volume of the confidence ellipsoid (11.3);
\(\mathrm{Vol}_{0,\mathcal{J}}\): the volume of the confidence ellipsoid (11.3) if all columns of \(X\) corresponding to \(\beta_{[\mathcal{J}]}\) were orthogonal to the remaining columns of \(X\).
The definition of the generalized variance inflation factor gVIF is motivated by (11.2): it is given as
\[
\mathrm{gVIF}_{\mathcal{J}} = \Bigl(\frac{\mathrm{Vol}_{\mathcal{J}}}{\mathrm{Vol}_{0,\mathcal{J}}}\Bigr)^2.
\]
It is seen that with \(\mathcal{J} = \{j\}\) for some \(j = 1,\dots,k-1\), the generalized VIF simplifies into the standard VIF, i.e., \(\mathrm{gVIF}_j = \mathrm{VIF}_j\).
Notes.
• The generalized VIF is especially useful if \(\mathcal{J}\) relates to the regression coefficients corresponding to the reparameterizing (pseudo)contrasts of one categorical covariate. It can then be shown that \(\mathrm{gVIF}_{\mathcal{J}}\) does not depend on the choice of the (pseudo)contrasts. \(\mathrm{gVIF}_{\mathcal{J}}\) then evaluates the magnitude of the linear dependence between a categorical covariate and the remaining regressors.
• When comparing the gVIF for index sets \(\mathcal{J}\) of different cardinality \(m\), the quantities
\[
\mathrm{gVIF}_{\mathcal{J}}^{\frac{1}{2m}}
 = \Bigl(\frac{\mathrm{Vol}_{\mathcal{J}}}{\mathrm{Vol}_{0,\mathcal{J}}}\Bigr)^{\frac{1}{m}}
\tag{11.4}
\]
should be compared, which all relate to volume units in one dimension.
• Generalized VIFs (and standard VIFs if \(m = 1\)) together with (11.4) are calculated by the R function vif from the package car.
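In R this is car::vif. A Python sketch of the determinant formula that Fox and Monette (1992) show to be equivalent to the volume-ratio definition — \(\mathrm{gVIF}_{\mathcal{J}} = \det(R_{11})\det(R_{22})/\det(R)\), with \(R\) the correlation matrix of all non-intercept regressors, \(R_{11}\) its \(\mathcal{J}\)-block and \(R_{22}\) the block of the remaining regressors — follows; the data and names are illustrative, assuming numpy:

```python
import numpy as np

def gvif(W, group):
    """Generalized VIF of the columns `group` of the non-intercept
    regressor matrix W, via the Fox-Monette determinant formula
    gVIF = det(R11) * det(R22) / det(R)."""
    R = np.corrcoef(W, rowvar=False)
    rest = [j for j in range(W.shape[1]) if j not in group]
    return (np.linalg.det(R[np.ix_(group, group)])
            * np.linalg.det(R[np.ix_(rest, rest)])
            / np.linalg.det(R))

rng = np.random.default_rng(2)
n = 80
W = rng.normal(size=(n, 3))
W[:, 1] += 0.8 * W[:, 0]          # a correlated pair of regressors

# with a single-column group, the formula reduces to the ordinary VIF
M = np.column_stack([np.ones(n), W[:, 1:]])
fit = M @ np.linalg.lstsq(M, W[:, 0], rcond=None)[0]
R2_0 = 1 - np.sum((W[:, 0] - fit) ** 2) / np.sum((W[:, 0] - W[:, 0].mean()) ** 2)
vif_0 = 1.0 / (1.0 - R2_0)
```

For \(m = 1\) the determinant identity \(\det(R) = \det(R_{22})\,(1 - R_j^2)\) makes the two routes coincide exactly.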
(End of the skipped part.)

11.1.4  Basic treatment of multicollinearity
Especially in situations when inference on the regression coefficients is of interest, i.e., when the primary purpose of the regression modelling is to evaluate which variables significantly influence the response expectation and which do not, multicollinearity is a serious problem. Basic treatment of multicollinearity consists of a preliminary exploration of the mutual relationships between all covariates and then choosing only suitable representatives of each group of mutually multiply correlated covariates. A very basic decision can be based on pairwise correlation coefficients. In some (especially “cook-book”) literature, rules of thumb are applied like “Covariates with a correlation (in absolute value) higher than 0.80 should not be included together in one model.” Nevertheless, such rules should never be applied in an automatic manner (why just 0.80 and not 0.79, . . . ?). The decision on which covariates cause multicollinearity can additionally be based on (generalized) variance inflation factors. Nevertheless, those too should be used with judgment. In general, if a large set of covariates is available to relate to the response expectation, a deep (and often time-consuming) analysis of the mutual relationships and their understanding must precede any regression modelling that is to follow.
11.2  Misspecified regression space
We are often in a situation when a large (potentially enormous) number p of candidate regressors
is available. The question is then which of them should be included in a linear model. As shown
in Section 11.1, inclusion of all possible regressors in the model is not necessarily optimal and may
even have a seriously negative impact on the statistical inference we would like to draw using the
linear model. In this section, we explore some (additional) properties of the least squares estimators
and of the related prediction in two situations:
(i) Omitted important regressors.
(ii) Irrelevant regressors included in a model.
11.2.1  Omitted and irrelevant regressors
We will assume that possibly two sets of regressors are available:
(i) \(X_i\), \(i = 1,\dots,n\), where \(X_i = t_X(Z_i)\) for some transformation \(t_X: \mathbb{R}^p \longrightarrow \mathbb{R}^k\). They give rise to the model matrix
\[
X_{n\times k} = \begin{pmatrix} X_1^\top \\ \vdots \\ X_n^\top \end{pmatrix}
 = \bigl(X^0, \dots, X^{k-1}\bigr).
\]
It will still be assumed that \(X^0 = (1,\dots,1)^\top\) (almost surely), leading to the model matrix
\[
X_{n\times k} = \bigl(1_n, X^1, \dots, X^{k-1}\bigr),
\]
with an explicitly included intercept term.
(ii) \(V_i\), \(i = 1,\dots,n\), where \(V_i = t_V(Z_i)\) for some transformation \(t_V: \mathbb{R}^p \longrightarrow \mathbb{R}^l\). They give rise to the model matrix
\[
V_{n\times l} = \begin{pmatrix} V_1^\top \\ \vdots \\ V_n^\top \end{pmatrix}
 = \bigl(V^1, \dots, V^l\bigr).
\]
We will assume that both matrices \(X\) and \(V\) are of full column rank and that their columns are mutually linearly independent, i.e., we assume
\[
\mathrm{rank}\bigl(X_{n\times k}\bigr) = k,\qquad
\mathrm{rank}\bigl(V_{n\times l}\bigr) = l,\qquad
\text{and for } G_{n\times(k+l)} := (X,\, V),\quad \mathrm{rank}(G) = k + l < n.
\]
The matrices \(X\) and \(G\) give rise to two nested linear models:
Model \(\mathcal{M}_X\): \(Y \,|\, Z \sim (X\beta,\, \sigma^2 I_n)\);
Model \(\mathcal{M}_{XV}\): \(Y \,|\, Z \sim (X\beta + V\gamma,\, \sigma^2 I_n)\).
Depending on which of the two models is a correct one and which model is used for inference, we
face two situations:
Omitted important regressors mean that the larger model \(\mathcal{M}_{XV}\) is correct (with \(\gamma \neq 0_l\)) but we base inference on model \(\mathcal{M}_X\). In particular,
• \(\beta\) is estimated using model \(\mathcal{M}_X\);
• \(\sigma^2\) is estimated using model \(\mathcal{M}_X\);
• prediction is based on the fitted model \(\mathcal{M}_X\).
Irrelevant regressors included in a model mean that the smaller model \(\mathcal{M}_X\) is correct but we base inference on model \(\mathcal{M}_{XV}\). In particular,
• \(\beta\) is estimated (together with \(\gamma\)) using model \(\mathcal{M}_{XV}\);
• \(\sigma^2\) is estimated using model \(\mathcal{M}_{XV}\);
• prediction is based on the fitted model \(\mathcal{M}_{XV}\).
Note that if \(\mathcal{M}_X\) is correct then \(\mathcal{M}_{XV}\) is correct as well. Nevertheless, the latter includes the redundant parameters \(\gamma\), which are known to be equal to zero.
Notation (Quantities derived under the two models).
Quantities derived while assuming model \(\mathcal{M}_X\) will be indicated by a subscript \(X\); quantities derived while assuming model \(\mathcal{M}_{XV}\) will be indicated by a subscript \(XV\). Namely,
(i) Quantities derived while assuming model \(\mathcal{M}_X\):
• Least squares estimator of \(\beta\):
\[
\widehat{\beta}_X = (X^\top X)^{-1} X^\top Y
 = \bigl(\widehat{\beta}_{X,0},\dots,\widehat{\beta}_{X,k-1}\bigr)^\top;
\]
• Projection matrices into the regression space \(\mathcal{M}(X)\) and into the residual space \(\mathcal{M}(X)^\perp\):
\[
H_X = X\,(X^\top X)^{-1} X^\top, \qquad M_X = I_n - H_X;
\]
• Fitted values (LSE of the vector \(X\beta\)):
\[
\widehat{Y}_X = H_X Y = X\widehat{\beta}_X
 = \bigl(\widehat{Y}_{X,1},\dots,\widehat{Y}_{X,n}\bigr)^\top;
\]
• Residuals:
\[
U_X = M_X Y = Y - \widehat{Y}_X = \bigl(U_{X,1},\dots,U_{X,n}\bigr)^\top;
\]
• Residual sum of squares and residual mean square:
\[
\mathrm{SS}_{e,X} = \bigl\|U_X\bigr\|^2, \qquad \mathrm{MS}_{e,X} = \frac{\mathrm{SS}_{e,X}}{n-k}.
\]
(ii) Quantities derived while assuming model \(\mathcal{M}_{XV}\):
• Least squares estimator of \((\beta, \gamma)\):
\[
\bigl(\widehat{\beta}_{XV}^\top,\, \widehat{\gamma}_{XV}^\top\bigr)^\top = (G^\top G)^{-1} G^\top Y,
\qquad
\widehat{\beta}_{XV} = \bigl(\widehat{\beta}_{XV,0},\dots,\widehat{\beta}_{XV,k-1}\bigr)^\top,
\quad
\widehat{\gamma}_{XV} = \bigl(\widehat{\gamma}_{XV,1},\dots,\widehat{\gamma}_{XV,l}\bigr)^\top;
\]
• Projection matrices into the regression space \(\mathcal{M}(G)\) and into the residual space \(\mathcal{M}(G)^\perp\):
\[
H_{XV} = G\,(G^\top G)^{-1} G^\top, \qquad M_{XV} = I_n - H_{XV};
\]
• Fitted values (LSE of the vector \(X\beta + V\gamma\)):
\[
\widehat{Y}_{XV} = H_{XV} Y = X\widehat{\beta}_{XV} + V\widehat{\gamma}_{XV}
 = \bigl(\widehat{Y}_{XV,1},\dots,\widehat{Y}_{XV,n}\bigr)^\top;
\]
• Residuals:
\[
U_{XV} = M_{XV} Y = Y - \widehat{Y}_{XV} = \bigl(U_{XV,1},\dots,U_{XV,n}\bigr)^\top;
\]
• Residual sum of squares and residual mean square:
\[
\mathrm{SS}_{e,XV} = \bigl\|U_{XV}\bigr\|^2, \qquad \mathrm{MS}_{e,XV} = \frac{\mathrm{SS}_{e,XV}}{n-k-l}.
\]
Consequence of Theorem 10.1: Relationship between the quantities derived while assuming the two models.
Quantities derived while assuming models \(\mathcal{M}_X\) and \(\mathcal{M}_{XV}\) are mutually in the following relationships:
\[
\begin{aligned}
\widehat{Y}_{XV} - \widehat{Y}_X
 &= M_X V \bigl(V^\top M_X V\bigr)^{-1} V^\top U_X
  = X\bigl(\widehat{\beta}_{XV} - \widehat{\beta}_X\bigr) + V\widehat{\gamma}_{XV}, \\
\widehat{\gamma}_{XV}
 &= \bigl(V^\top M_X V\bigr)^{-1} V^\top U_X, \\
\widehat{\beta}_{XV} - \widehat{\beta}_X
 &= -\bigl(X^\top X\bigr)^{-1} X^\top V \widehat{\gamma}_{XV}, \\
\mathrm{SS}_{e,X} - \mathrm{SS}_{e,XV}
 &= \bigl\|M_X V \widehat{\gamma}_{XV}\bigr\|^2, \\
H_{XV}
 &= H_X + M_X V \bigl(V^\top M_X V\bigr)^{-1} V^\top M_X.
\end{aligned}
\]
Proof. Direct use of Lemma 10.1 while taking into account the fact that now all involved model matrices are of full rank.
The relationship \(H_{XV} = H_X + M_X V (V^\top M_X V)^{-1} V^\top M_X\) was shown inside the proof of Lemma 10.1. It easily follows from the general expression of the hat matrix if we realize that
\[
\mathcal{M}(X,\, V) = \mathcal{M}(X,\, M_X V),
\]
and that \(X^\top M_X V = 0_{k\times l}\). □
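All five relationships are finite-dimensional identities and can be verified numerically on random matrices. A small sketch (all data simulated, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k, l = 40, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
V = rng.normal(size=(n, l))
Y = rng.normal(size=n)
G = np.column_stack([X, V])

H_X = X @ np.linalg.inv(X.T @ X) @ X.T
M_X = np.eye(n) - H_X
H_XV = G @ np.linalg.inv(G.T @ G) @ G.T

# decomposition of the hat matrix of the larger model
H_dec = H_X + M_X @ V @ np.linalg.inv(V.T @ M_X @ V) @ V.T @ M_X

# gamma_hat from the residuals of the smaller model vs. the joint LSE
U_X = M_X @ Y
gamma_from_UX = np.linalg.inv(V.T @ M_X @ V) @ V.T @ U_X
gamma_joint = (np.linalg.inv(G.T @ G) @ G.T @ Y)[k:]

# difference of the residual sums of squares
SSe_X = U_X @ U_X
U_XV = Y - H_XV @ Y
SSe_XV = U_XV @ U_XV
diff_thm = np.sum((M_X @ V @ gamma_joint) ** 2)
```

Each computed quantity matches its closed-form counterpart up to floating-point error.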
Theorem 11.3 Variance of the LSE in the two models.
Irrespective of whether \(\mathcal{M}_X\) or \(\mathcal{M}_{XV}\) holds, the covariance matrices of the fitted values and of the LSE of the regression coefficients satisfy the following (with \(\geq 0\) understood as positive semidefiniteness):
\[
\mathrm{var}\bigl(\widehat{Y}_{XV} \,\big|\, Z\bigr) - \mathrm{var}\bigl(\widehat{Y}_X \,\big|\, Z\bigr) \geq 0,
\qquad
\mathrm{var}\bigl(\widehat{\beta}_{XV} \,\big|\, Z\bigr) - \mathrm{var}\bigl(\widehat{\beta}_X \,\big|\, Z\bigr) \geq 0.
\]
Notes.
• The estimator of the response mean vector \(\mu = \mathrm{E}(Y \,|\, Z)\) based on the (smaller) model \(\mathcal{M}_X\) is always (no matter which model is correct) less than or equally variable compared to the estimator based on the (richer) model \(\mathcal{M}_{XV}\).
• The estimators of the regression coefficients \(\beta\) based on the (smaller) model \(\mathcal{M}_X\) always have lower (or equal, if \(X^\top V = 0_{k\times l}\)) standard errors than the estimator based on the (richer) model \(\mathcal{M}_{XV}\).
Proof. In accordance with our convention, the condition will be omitted from the notation of all expectations and variances. Nevertheless, all are still understood as conditional expectations and variances given the covariate values \(Z\).

\(\mathrm{var}\,\widehat{Y}_{XV} - \mathrm{var}\,\widehat{Y}_X \geq 0\): We have
\[
\mathrm{var}\,\widehat{Y}_X = \mathrm{var}\bigl(H_X Y\bigr) = H_X\,(\sigma^2 I_n)\,H_X = \sigma^2 H_X
\quad\text{(even if \(\mathcal{M}_X\) is not correct)},
\]
\[
\mathrm{var}\,\widehat{Y}_{XV} = \mathrm{var}\bigl(H_{XV} Y\bigr) = \sigma^2 H_{XV}
 = \sigma^2 \bigl\{H_X + M_X V (V^\top M_X V)^{-1} V^\top M_X\bigr\}
 = \mathrm{var}\,\widehat{Y}_X
   + \sigma^2 \underbrace{M_X V (V^\top M_X V)^{-1} V^\top M_X}_{\text{positive semidefinite matrix}}.
\]

\(\mathrm{var}\,\widehat{\beta}_{XV} - \mathrm{var}\,\widehat{\beta}_X \geq 0\): The proof/calculations for this part were skipped and are not requested for the exam; they are shown only for those who are interested.
First, use the formula for the inverse of a matrix divided into blocks (Theorem A.4):
\[
\mathrm{var}\begin{pmatrix} \widehat{\beta}_{XV} \\ \widehat{\gamma}_{XV} \end{pmatrix}
 = \sigma^2 \begin{pmatrix} X^\top X & X^\top V \\ V^\top X & V^\top V \end{pmatrix}^{-1}
 = \sigma^2 \begin{pmatrix}
     \bigl\{X^\top X - X^\top V (V^\top V)^{-1} V^\top X\bigr\}^{-1} & \cdots \\
     \cdots & \cdots
   \end{pmatrix}.
\]
Further,
\[
\mathrm{var}\,\widehat{\beta}_X
 = \mathrm{var}\bigl\{(X^\top X)^{-1} X^\top Y\bigr\}
 = (X^\top X)^{-1} X^\top (\sigma^2 I_n)\, X (X^\top X)^{-1}
 = \sigma^2 (X^\top X)^{-1}
\quad\text{(even if \(\mathcal{M}_X\) is not correct)},
\]
\[
\mathrm{var}\,\widehat{\beta}_{XV}
 = \sigma^2 \bigl\{X^\top X - X^\top V (V^\top V)^{-1} V^\top X\bigr\}^{-1}.
\]
The property of positive definite matrices (“\(A - B \geq 0 \iff B^{-1} - A^{-1} \geq 0\)”) finalizes the proof. □
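The variance comparison of Theorem 11.3 can be checked numerically from the closed forms derived in the proof. A small sketch (simulated design matrices, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, l = 50, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
# make V correlated with X so that the variance inflation is visible
V = X[:, 1:] @ rng.normal(size=(k - 1, l)) + 0.5 * rng.normal(size=(n, l))
sigma2 = 1.0

var_X = sigma2 * np.linalg.inv(X.T @ X)
S = X.T @ X - X.T @ V @ np.linalg.inv(V.T @ V) @ V.T @ X
var_XV = sigma2 * np.linalg.inv(S)    # beta-block of var under the larger model

# var_XV - var_X should be positive semidefinite
eigmin = np.linalg.eigvalsh(var_XV - var_X).min()
```

In particular, each diagonal entry of `var_XV` dominates the corresponding entry of `var_X`, which is the “larger standard errors in the richer model” statement of the notes above.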
11.2.2  Prediction quality of the fitted model
To evaluate the prediction quality of the fitted model, we will assume that the data \((Y_i, Z_i)\), \(Z_i = (Z_{i,1},\dots,Z_{i,p})^\top \in \mathcal{Z} \subseteq \mathbb{R}^p\), \(i = 1,\dots,n\), are a random sample from the distribution of a generic random vector \((Y, Z)\), \(Z = (Z_1,\dots,Z_p)^\top\). Let the conditional distribution \(Y \,|\, Z\) of \(Y\) given the covariates \(Z\) satisfy
\[
\mathrm{E}(Y \,|\, Z) = m(Z), \qquad \mathrm{var}(Y \,|\, Z) = \sigma^2,
\tag{11.5}
\]
for some (regression) function \(m\) and some \(\sigma^2 > 0\).
Replicated response
Let \(z_1,\dots,z_n\) be the values of the covariate vectors \(Z_1,\dots,Z_n\) in the original data that are available to estimate the parameters of the model (11.5). Further, let \((Y_{n+i}, Z_{n+i})\), \(i = 1,\dots,n\), be independent random vectors (new or future data) being distributed as the generic random vector \((Y, Z)\) and being independent of the original data \((Y_i, Z_i)\), \(i = 1,\dots,n\). Suppose that our aim is to predict the values of \(Y_{n+i}\), \(i = 1,\dots,n\), under the condition that the new covariate values are equal to the old ones. That is, we want to predict, for \(i = 1,\dots,n\), the value of \(Y_{n+i}\) given \(Z_{n+i} = z_i\).

Terminology (Replicated response).
A random vector
\[
Y^{new} = \bigl(Y_{n+1},\dots,Y_{n+n}\bigr)^\top,
\]
where \(Y_{n+i}\) is supposed to come from the conditional distribution \(Y \,|\, Z = z_i\), \(i = 1,\dots,n\), is called the replicated data.
Notes.
• The original (old) response vector \(Y\) and the replicated response vector \(Y^{new}\) are assumed to be independent.
• Both \(Y\) and \(Y^{new}\) are assumed to be generated by the same conditional distribution (given \(Z\)), where
\[
\mathrm{E}\bigl(Y \,\big|\, Z_1 = z_1,\dots,Z_n = z_n\bigr)
 = \mu
 = \mathrm{E}\bigl(Y^{new} \,\big|\, Z_{n+1} = z_1,\dots,Z_{n+n} = z_n\bigr),
\]
\[
\mathrm{var}\bigl(Y \,\big|\, Z_1 = z_1,\dots,Z_n = z_n\bigr)
 = \sigma^2 I_n
 = \mathrm{var}\bigl(Y^{new} \,\big|\, Z_{n+1} = z_1,\dots,Z_{n+n} = z_n\bigr),
\]
for some \(\sigma^2 > 0\), and
\[
\mu = \bigl(m(z_1),\dots,m(z_n)\bigr)^\top = \bigl(\mu_1,\dots,\mu_n\bigr)^\top.
\]
Prediction of replicated response
Let
\[
\widehat{Y}^{new} = \bigl(\widehat{Y}_{n+1},\dots,\widehat{Y}_{n+n}\bigr)^\top
\]
be the prediction of the vector \(Y^{new}\) based on the regression model (11.5) estimated using the original data \(Y\). Analogously to Section 5.4.3, we shall evaluate the quality of the prediction by the mean squared error of prediction (MSEP). Nevertheless, in contrast to Section 5.4.3, the following issues will be different:
(i) The value of a random vector rather than the value of a random variable (as in Section 5.4.3) is predicted now. The MSEP will hence be given as a sum of the MSEPs of the elements of the random vector being predicted.
(ii) Since we are now interested in the prediction of new response values given the covariate values being equal to the covariate values in the original data, the MSEP will now be based on the conditional distribution of the responses given \(Z\) (given \(Z_i = Z_{n+i} = z_i\), \(i = 1,\dots,n\)). In contrast, the variability of the covariates was taken into account in Section 5.4.3.
(iii) The variability of the prediction induced by the estimation of the model parameters (estimation of the regression function) using the original data \(Y\) will also be taken into account now. In contrast, the model parameters were assumed to be known when deriving the MSEP in Section 5.4.3.
Definition 11.2 Quantification of a prediction quality of the fitted regression model.
The prediction quality of the fitted regression model will be evaluated by the mean squared error of prediction (MSEP)⁴ defined as
\[
\mathrm{MSEP}\bigl(\widehat{Y}^{new}\bigr)
 = \sum_{i=1}^n \mathrm{E}\Bigl[\bigl(\widehat{Y}_{n+i} - Y_{n+i}\bigr)^2 \,\Big|\, Z\Bigr],
\tag{11.6}
\]
where the expectation is with respect to the \((n+n)\)-dimensional conditional distribution of \((Y, Y^{new})\) given
\[
Z = \begin{pmatrix} Z_1^\top \\ \vdots \\ Z_n^\top \end{pmatrix}
  = \begin{pmatrix} Z_{n+1}^\top \\ \vdots \\ Z_{n+n}^\top \end{pmatrix}.
\]
Additionally, we define the averaged mean squared error of prediction (AMSEP)⁵ as
\[
\mathrm{AMSEP}\bigl(\widehat{Y}^{new}\bigr) = \frac{1}{n}\,\mathrm{MSEP}\bigl(\widehat{Y}^{new}\bigr).
\]
Prediction of replicated response in a linear model
(End of Lecture #20, 10/12/2015. Start of Lecture #21, 10/12/2015.)
With a linear model, it is assumed that \(m(z) = x^\top\beta\) for some (known) transformation \(x = t_X(z)\) and a vector of (unknown) parameters \(\beta\). Hence, it is assumed that
\[
\mu = X\beta = \bigl(x_1^\top\beta,\dots,x_n^\top\beta\bigr)^\top = \bigl(\mu_1,\dots,\mu_n\bigr)^\top,
\]
for a model matrix \(X\) based on the (transformed) covariate values \(x_i = t_X(z_i)\), \(i = 1,\dots,n\).
If we restrict our attention to unbiased and linear predictions of \(Y^{new}\), i.e., to predictions of the form \(\widehat{Y}^{new} = A\,Y\) for some matrix \(A\), a variant of the Gauss–Markov theorem would show that (11.6) is minimized for
\[
\widehat{Y}^{new} = \widehat{Y}, \qquad \widehat{Y} = X\,(X^\top X)^{-} X^\top Y,
\qquad \widehat{Y}_{n+i} = \widehat{Y}_i, \quad i = 1,\dots,n.
\]

⁴střední čtvercová chyba predikce
⁵průměrná střední čtvercová chyba predikce
That is, \(\widehat{Y}^{new}\) is equal to the fitted values of the model estimated using the original data. Note also that
\[
\widehat{Y}^{new} = \widehat{Y} =: \widehat{\mu},
\]
where \(\widehat{\mu}\) is the LSE of the vector \(\mu = \mathrm{E}(Y \,|\, Z_1 = z_1,\dots,Z_n = z_n) = \mathrm{E}(Y^{new} \,|\, Z_{n+1} = z_1,\dots,Z_{n+n} = z_n)\).
Lemma 11.4 Mean squared error of prediction in a linear model.
In a linear model, the mean squared error of prediction can be expressed as
\[
\mathrm{MSEP}\bigl(\widehat{Y}^{new}\bigr) = n\,\sigma^2 + \sum_{i=1}^n \mathrm{MSE}\bigl(\widehat{Y}_i\bigr),
\]
where
\[
\mathrm{MSE}\bigl(\widehat{Y}_i\bigr) = \mathrm{E}\Bigl[\bigl(\widehat{Y}_i - \mu_i\bigr)^2 \,\Big|\, Z\Bigr],
\quad i = 1,\dots,n,
\]
is the mean squared error⁶ of \(\widehat{Y}_i\) if this is viewed as an estimator of \(\mu_i\), \(i = 1,\dots,n\).

Proof. In accordance with our convention, the condition will be omitted from the notation of all expectations and variances. Nevertheless, all are still understood as conditional expectations and variances given the covariate values \(Z\).
We have for \(i = 1,\dots,n\) (remember, \(\widehat{Y}_{n+i} = \widehat{Y}_i\), \(i = 1,\dots,n\)),
\[
\begin{aligned}
\mathrm{E}\bigl(\widehat{Y}_{n+i} - Y_{n+i}\bigr)^2
 &= \mathrm{E}\bigl(\widehat{Y}_i - Y_{n+i}\bigr)^2
  = \mathrm{E}\bigl\{\widehat{Y}_i - \mu_i - (Y_{n+i} - \mu_i)\bigr\}^2 \\
 &= \mathrm{E}\bigl(\widehat{Y}_i - \mu_i\bigr)^2 + \mathrm{E}\bigl(Y_{n+i} - \mu_i\bigr)^2
    - 2\,\underbrace{\mathrm{E}\bigl(\widehat{Y}_i - \mu_i\bigr)\bigl(Y_{n+i} - \mu_i\bigr)}_{\mathrm{E}(\widehat{Y}_i - \mu_i)\,\mathrm{E}(Y_{n+i} - \mu_i)\, =\, \mathrm{E}(\widehat{Y}_i - \mu_i)\cdot 0} \\
 &= \mathrm{E}\bigl(\widehat{Y}_i - \mu_i\bigr)^2 + \mathrm{E}\bigl(Y_{n+i} - \mu_i\bigr)^2
  = \mathrm{MSE}\bigl(\widehat{Y}_i\bigr) + \sigma^2,
\end{aligned}
\]
where the cross term factorizes and vanishes by the independence of \(Y\) and \(Y^{new}\). So that
\[
\mathrm{MSEP}\bigl(\widehat{Y}^{new}\bigr)
 = \sum_{i=1}^n \mathrm{E}\bigl(\widehat{Y}_{n+i} - Y_{n+i}\bigr)^2
 = n\,\sigma^2 + \sum_{i=1}^n \mathrm{MSE}\bigl(\widehat{Y}_i\bigr). \qquad\Box
\]

Notes.
• We can also write
\[
\sum_{i=1}^n \mathrm{MSE}\bigl(\widehat{Y}_i\bigr)
 = \mathrm{E}\Bigl[\bigl\|\widehat{Y} - \mu\bigr\|^2 \,\Big|\, Z\Bigr].
\]
Hence,
\[
\mathrm{MSEP}\bigl(\widehat{Y}^{new}\bigr)
 = n\,\sigma^2 + \mathrm{E}\Bigl[\bigl\|\widehat{Y} - \mu\bigr\|^2 \,\Big|\, Z\Bigr].
\]

⁶střední čtvercová chyba
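Lemma 11.4 can be checked by Monte Carlo; under a correctly specified full-rank model the lemma combines with \(\sum_i \mathrm{MSE}(\widehat{Y}_i) = \mathrm{tr}(\sigma^2 H) = k\,\sigma^2\) to give \(\mathrm{MSEP} = (n+k)\,\sigma^2\). A minimal sketch (simulated model, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(11)
n, k, sigma = 30, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, -1.0, 2.0])
mu = X @ beta
H = X @ np.linalg.inv(X.T @ X) @ X.T

reps = 40000
# fitted values from the original data, one replication per row
Y_hat = (mu + sigma * rng.normal(size=(reps, n))) @ H
# replicated responses, independent of the original data
Y_new = mu + sigma * rng.normal(size=(reps, n))

msep_mc = np.mean(np.sum((Y_hat - Y_new) ** 2, axis=1))
# Lemma 11.4 with a correct model: MSEP = n sigma^2 + k sigma^2
msep_theory = (n + k) * sigma**2
```

The empirical `msep_mc` should match `msep_theory` up to Monte Carlo error.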
• If the assumed linear model is a correct model for the data at hand, the Gauss–Markov theorem states that \(\widehat{Y}\) is the BLUE of the vector \(\mu\), in which case
\[
\mathrm{MSE}\bigl(\widehat{Y}_i\bigr)
 = \mathrm{E}\Bigl[\bigl(\widehat{Y}_i - \mu_i\bigr)^2 \,\Big|\, Z\Bigr]
 = \mathrm{var}\bigl(\widehat{Y}_i \,\big|\, Z\bigr),
\quad i = 1,\dots,n.
\]
• Nevertheless, if the assumed linear model is not a correct model for the data at hand, the estimator \(\widehat{Y}\) might be a biased estimator of the vector \(\mu\), in which case
\[
\mathrm{MSE}\bigl(\widehat{Y}_i\bigr)
 = \mathrm{E}\Bigl[\bigl(\widehat{Y}_i - \mu_i\bigr)^2 \,\Big|\, Z\Bigr]
 = \mathrm{var}\bigl(\widehat{Y}_i \,\big|\, Z\bigr) + \bigl\{\mathrm{bias}\bigl(\widehat{Y}_i\bigr)\bigr\}^2
 = \mathrm{var}\bigl(\widehat{Y}_i \,\big|\, Z\bigr)
   + \bigl\{\mathrm{E}\bigl(\widehat{Y}_i - \mu_i \,\big|\, Z\bigr)\bigr\}^2,
\quad i = 1,\dots,n.
\]
• The expression of the mean squared error of prediction is
\[
\mathrm{MSEP}\bigl(\widehat{Y}^{new}\bigr)
 = n\,\sigma^2 + \sum_{i=1}^n \mathrm{MSE}\bigl(\widehat{Y}_i\bigr)
 = n\,\sigma^2 + \mathrm{E}\Bigl[\bigl\|\widehat{Y} - \mu\bigr\|^2 \,\Big|\, Z\Bigr].
\]
By specification of a model for the conditional response expectation, i.e., by specification of a model for \(\mu\), we can influence only the second term \(\mathrm{E}\bigl[\|\widehat{Y} - \mu\|^2 \,\big|\, Z\bigr]\). The first term (\(n\,\sigma^2\)) reflects the true (conditional) variability of the response, which does not depend on the specification of the model for the expectation. Hence, when evaluating the prediction quality of a linear model with respect to its ability to predict replicated data, the only term that matters is
\[
\sum_{i=1}^n \mathrm{MSE}\bigl(\widehat{Y}_i\bigr)
 = \mathrm{E}\Bigl[\bigl\|\widehat{Y} - \mu\bigr\|^2 \,\Big|\, Z\Bigr],
\]
which relates to the error of the fitted values considered as an estimator of the vector \(\mu\).
11.2.3  Omitted regressors
In this section, we will assume that the correct model is
\[
\mathcal{M}_{XV}: \quad Y \,|\, Z \sim \bigl(X\beta + V\gamma,\, \sigma^2 I_n\bigr),
\]
with \(\gamma \neq 0_l\). Hence all estimators derived under model \(\mathcal{M}_{XV}\) are derived under the correct model and thus have the usual properties of the LSE, namely
\[
\begin{aligned}
\mathrm{E}\bigl(\widehat{\beta}_{XV} \,\big|\, Z\bigr) &= \beta, \\
\mathrm{E}\bigl(\widehat{Y}_{XV} \,\big|\, Z\bigr) &= X\beta + V\gamma =: \mu, \\
\sum_{i=1}^n \mathrm{MSE}\bigl(\widehat{Y}_{XV,i}\bigr)
 &= \sum_{i=1}^n \mathrm{var}\bigl(\widehat{Y}_{XV,i} \,\big|\, Z\bigr)
  = \mathrm{tr}\bigl\{\mathrm{var}\bigl(\widehat{Y}_{XV} \,\big|\, Z\bigr)\bigr\}
  = \mathrm{tr}\bigl(\sigma^2 H_{XV}\bigr) = \sigma^2\,(k + l), \\
\mathrm{E}\bigl(\mathrm{MS}_{e,XV} \,\big|\, Z\bigr) &= \sigma^2.
\end{aligned}
\tag{11.7}
\]
Nevertheless, all estimators derived under model \(\mathcal{M}_X: Y \,|\, Z \sim (X\beta,\, \sigma^2 I_n)\) are calculated while assuming a misspecified model with omitted important regressors, and their properties do not coincide with the properties of the LSE calculated under the correct model.

Theorem 11.5 Properties of the LSE in a model with omitted regressors.
Let \(\mathcal{M}_{XV}: Y \,|\, Z \sim (X\beta + V\gamma,\, \sigma^2 I_n)\) hold, i.e., \(\mu := \mathrm{E}(Y \,|\, Z)\) satisfies
\[
\mu = X\beta + V\gamma \quad\text{for some } \beta \in \mathbb{R}^k,\ \gamma \in \mathbb{R}^l.
\]
Then the least squares estimators derived while assuming model \(\mathcal{M}_X: Y \,|\, Z \sim (X\beta,\, \sigma^2 I_n)\) attain the following properties:
\[
\begin{aligned}
\mathrm{E}\bigl(\widehat{\beta}_X \,\big|\, Z\bigr) &= \beta + (X^\top X)^{-1} X^\top V\gamma, \\
\mathrm{E}\bigl(\widehat{Y}_X \,\big|\, Z\bigr) &= \mu - M_X V\gamma, \\
\sum_{i=1}^n \mathrm{MSE}\bigl(\widehat{Y}_{X,i}\bigr) &= k\,\sigma^2 + \bigl\|M_X V\gamma\bigr\|^2, \\
\mathrm{E}\bigl(\mathrm{MS}_{e,X} \,\big|\, Z\bigr) &= \sigma^2 + \frac{\bigl\|M_X V\gamma\bigr\|^2}{n-k}.
\end{aligned}
\]
Proof. In accordance with our convention, the condition will be omitted from the notation of all expectations and variances. Nevertheless, all are still understood as conditional expectations and variances given the covariate values \(Z\).

\(\mathrm{E}(\widehat{\beta}_X \,|\, Z)\): By Theorem 10.1, \(\widehat{\beta}_{XV} - \widehat{\beta}_X = -(X^\top X)^{-1} X^\top V\,\widehat{\gamma}_{XV}\). Hence,
\[
\mathrm{E}\,\widehat{\beta}_X
 = \mathrm{E}\bigl\{\widehat{\beta}_{XV} + (X^\top X)^{-1} X^\top V\,\widehat{\gamma}_{XV}\bigr\}
 = \beta + (X^\top X)^{-1} X^\top V\gamma,
\qquad
\mathrm{bias}\bigl(\widehat{\beta}_X\bigr) = (X^\top X)^{-1} X^\top V\gamma.
\]

\(\mathrm{E}(\widehat{Y}_X \,|\, Z)\): By Theorem 10.1, \(\widehat{Y}_{XV} - \widehat{Y}_X = X\bigl(\widehat{\beta}_{XV} - \widehat{\beta}_X\bigr) + V\,\widehat{\gamma}_{XV}\). Hence,
\[
\begin{aligned}
\mathrm{E}\,\widehat{Y}_X
 &= \mathrm{E}\bigl\{\widehat{Y}_{XV} - X\widehat{\beta}_{XV} + X\widehat{\beta}_X - V\,\widehat{\gamma}_{XV}\bigr\} \\
 &= \mu - X\beta + X\beta + X\,(X^\top X)^{-1} X^\top V\gamma - V\gamma \\
 &= \mu + \bigl\{X\,(X^\top X)^{-1} X^\top - I_n\bigr\}\,V\gamma
  = \mu - M_X V\gamma,
\end{aligned}
\qquad
\mathrm{bias}\bigl(\widehat{Y}_X\bigr) = -\,M_X V\gamma.
\]
\(\sum_{i=1}^n \mathrm{MSE}(\widehat{Y}_{X,i})\): Let us first calculate \(\mathrm{MSE}(\widehat{Y}_X) = \mathrm{E}\bigl\{(\widehat{Y}_X - \mu)(\widehat{Y}_X - \mu)^\top\bigr\}\):
\[
\mathrm{MSE}\bigl(\widehat{Y}_X\bigr)
 = \mathrm{var}\bigl(\widehat{Y}_X\bigr) + \mathrm{bias}\bigl(\widehat{Y}_X\bigr)\,\mathrm{bias}^\top\bigl(\widehat{Y}_X\bigr)
 = \sigma^2 H_X + M_X V\gamma\gamma^\top V^\top M_X.
\]
Hence,
\[
\sum_{i=1}^n \mathrm{MSE}\bigl(\widehat{Y}_{X,i}\bigr)
 = \mathrm{tr}\bigl\{\mathrm{MSE}\bigl(\widehat{Y}_X\bigr)\bigr\}
 = \mathrm{tr}\bigl(\sigma^2 H_X\bigr) + \mathrm{tr}\bigl(M_X V\gamma\gamma^\top V^\top M_X\bigr)
 = \sigma^2 k + \mathrm{tr}\bigl(\gamma^\top V^\top M_X M_X V\gamma\bigr)
 = \sigma^2 k + \bigl\|M_X V\gamma\bigr\|^2.
\]
\(\mathrm{E}(\mathrm{MS}_{e,X} \,|\, Z)\): The proof/calculations for this part were skipped and are not requested for the exam; they are shown only for those who are interested.
Let us first calculate \(\mathrm{E}\,\mathrm{SS}_{e,X} := \mathrm{E}(\mathrm{SS}_{e,X} \,|\, Z)\). To do that, write the linear model \(\mathcal{M}_{XV}\) using the error terms as
\[
Y = X\beta + V\gamma + \varepsilon,
\qquad
\mathrm{E}\,\varepsilon = \mathrm{E}(\varepsilon \,|\, Z) = 0_n,
\quad
\mathrm{var}\,\varepsilon = \mathrm{var}(\varepsilon \,|\, Z) = \sigma^2 I_n.
\]
\[
\begin{aligned}
\mathrm{E}\,\mathrm{SS}_{e,X}
 &= \mathrm{E}\,\bigl\|M_X Y\bigr\|^2
  = \mathrm{E}\,\bigl\|M_X (X\beta + V\gamma + \varepsilon)\bigr\|^2
  = \mathrm{E}\,\bigl\|M_X V\gamma + M_X \varepsilon\bigr\|^2 \\
 &= \bigl\|M_X V\gamma\bigr\|^2 + \mathrm{E}\,\bigl\|M_X \varepsilon\bigr\|^2
    + 2\,\underbrace{\mathrm{E}\bigl(\gamma^\top V^\top M_X M_X \varepsilon\bigr)}_{\gamma^\top V^\top M_X\,\mathrm{E}\varepsilon\, =\, 0} \\
 &= \bigl\|M_X V\gamma\bigr\|^2
    + \underbrace{\mathrm{E}\bigl(\varepsilon^\top M_X \varepsilon\bigr)}_{\mathrm{E}\,\mathrm{tr}(\varepsilon^\top M_X \varepsilon)\, =\, \mathrm{tr}\,\mathrm{E}(M_X \varepsilon\varepsilon^\top)\, =\, \mathrm{tr}(\sigma^2 M_X)\, =\, \sigma^2 (n-k)} \\
 &= \bigl\|M_X V\gamma\bigr\|^2 + \sigma^2\,(n - k).
\end{aligned}
\]
Hence,
\[
\mathrm{E}\,\mathrm{MS}_{e,X}
 = \mathrm{E}\,\frac{\mathrm{SS}_{e,X}}{n-k}
 = \sigma^2 + \frac{\bigl\|M_X V\gamma\bigr\|^2}{n-k},
\qquad
\mathrm{bias}\bigl(\mathrm{MS}_{e,X}\bigr) = \frac{\bigl\|M_X V\gamma\bigr\|^2}{n-k}. \qquad\Box
\]
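The first two identities of Theorem 11.5 are deterministic statements about \(\mu = X\beta + V\gamma\) and can be checked directly, without simulating \(Y\): \(\mathrm{E}(\widehat{\beta}_X \,|\, Z) = (X^\top X)^{-1} X^\top \mu\) and \(\mathrm{E}(\widehat{Y}_X \,|\, Z) = H_X \mu\). A small numeric sketch (all matrices and coefficients below are illustrative, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, l = 40, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
V = 0.7 * X[:, 1:] + 0.5 * rng.normal(size=(n, l))   # V correlated with X
beta = np.array([1.0, 2.0, -1.0])
gamma = np.array([1.5, -0.5])
mu = X @ beta + V @ gamma                            # truth under M_XV

XtX_inv = np.linalg.inv(X.T @ X)
H_X = X @ XtX_inv @ X.T
M_X = np.eye(n) - H_X

# E(beta_hat_X | Z) = (X'X)^{-1} X' mu  vs.  beta + (X'X)^{-1} X' V gamma
E_beta_X = XtX_inv @ X.T @ mu
thm_beta = beta + XtX_inv @ X.T @ V @ gamma

# E(Yhat_X | Z) = H_X mu  vs.  mu - M_X V gamma
E_fit = H_X @ mu
thm_fit = mu - M_X @ V @ gamma

# bias of MS_{e,X}: ||M_X V gamma||^2 / (n - k), strictly positive here
mse_bias = np.sum((M_X @ V @ gamma) ** 2) / (n - k)
```

With \(V\) correlated with \(X\), both the coefficient bias and the residual-variance bias are visibly nonzero.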
Least squares estimators
Theorem 11.5 shows that \(\mathrm{bias}(\widehat{\beta}_X) = (X^\top X)^{-1} X^\top V\gamma\); nevertheless, the estimator \(\widehat{\beta}_X\) is not necessarily biased. Let us consider two situations.
(i) \(X^\top V = 0_{k\times l}\), which means that each column of \(X\) is orthogonal to each column of \(V\). In other words, the regressors included in the matrix \(X\) are uncorrelated with the regressors included in the matrix \(V\). Then
• \(\widehat{\beta}_X = \widehat{\beta}_{XV}\) and \(\mathrm{bias}(\widehat{\beta}_X) = 0_k\).
• Hence \(\beta\) can be estimated using the smaller model \(\mathcal{M}_X\) without any impact on the quality of the estimator.
(ii) \(X^\top V \neq 0_{k\times l}\):
• \(\widehat{\beta}_X\) is a biased estimator of \(\beta\).
Further, for the fitted values \(\widehat{Y}_X\), if those are considered as an estimator of the response vector expectation \(\mu = X\beta + V\gamma\), we have
\[
\mathrm{bias}\bigl(\widehat{Y}_X\bigr) = -\,M_X V\gamma.
\]
In this case, all elements of the bias vector would be equal to zero only if \(M_X V = 0_{n\times l}\). Nevertheless, this would mean that \(\mathcal{M}(V) \subseteq \mathcal{M}(X)\), which is in contradiction with our assumption \(\mathrm{rank}(X, V) = k + l\). That is, if the omitted covariates (included in the matrix \(V\)) are linearly independent of (are not perfectly multiply correlated with) the covariates included in the model matrix \(X\), the fitted values \(\widehat{Y}_X\) always provide a biased estimator of the response expectation.
Prediction
Let us compare the predictions \(\widehat{Y}^{new,X} = \widehat{Y}_X\) based on the (misspecified) model \(\mathcal{M}_X\) and the predictions \(\widehat{Y}^{new,XV} = \widehat{Y}_{XV}\) based on the (correct) model \(\mathcal{M}_{XV}\). The properties of the fitted values in a correct model (Expressions (11.7)), together with the results of Lemma 11.4 and Theorem 11.5, give
\[
\mathrm{MSEP}\bigl(\widehat{Y}^{new,XV}\bigr) = n\,\sigma^2 + k\,\sigma^2 + l\,\sigma^2,
\qquad
\mathrm{MSEP}\bigl(\widehat{Y}^{new,X}\bigr) = n\,\sigma^2 + k\,\sigma^2 + \bigl\|M_X V\gamma\bigr\|^2.
\]
That is, the averaged mean squared errors of prediction are
\[
\mathrm{AMSEP}\bigl(\widehat{Y}^{new,XV}\bigr) = \sigma^2 + \frac{k}{n}\,\sigma^2 + \frac{l}{n}\,\sigma^2,
\qquad
\mathrm{AMSEP}\bigl(\widehat{Y}^{new,X}\bigr) = \sigma^2 + \frac{k}{n}\,\sigma^2 + \frac{1}{n}\,\bigl\|M_X V\gamma\bigr\|^2.
\]
We can now conclude the following.
• The term \(\|M_X V\gamma\|^2\) might be huge compared to \(l\,\sigma^2\), in which case the prediction using the model with omitted important covariates is (much) worse than the prediction using the (correct) model.
• \(\frac{l}{n}\,\sigma^2 \to 0\) with \(n \to \infty\) (while increasing the number of predictions).
• On the other hand, \(\frac{1}{n}\,\|M_X V\gamma\|^2\) does not necessarily tend to zero with \(n \to \infty\).
Estimator of the residual variance
Theorem 11.5 shows that the residual mean square \(\mathrm{MS}_{e,X}\) in the misspecified model \(\mathcal{M}_X\) is a biased estimator of the residual variance \(\sigma^2\), with the bias amounting to
\[
\mathrm{bias}\bigl(\mathrm{MS}_{e,X}\bigr) = \frac{\bigl\|M_X V\gamma\bigr\|^2}{n-k}.
\]
Also in this case, the bias does not necessarily tend to zero with \(n \to \infty\).
11.2.4  Irrelevant regressors
In this section, we will assume that the correct model is
\[
\mathcal{M}_X: \quad Y \,|\, Z \sim \bigl(X\beta,\, \sigma^2 I_n\bigr).
\]
This means that the model
\[
\mathcal{M}_{XV}: \quad Y \,|\, Z \sim \bigl(X\beta + V\gamma,\, \sigma^2 I_n\bigr)
\]
also holds; nevertheless, \(\gamma = 0_l\), and hence the regressors from the matrix \(V\) are irrelevant.
Since both models \(\mathcal{M}_X\) and \(\mathcal{M}_{XV}\) hold, estimators derived under both models have the usual properties of the LSE, namely
\[
\begin{aligned}
\mathrm{E}\bigl(\widehat{\beta}_X \,\big|\, Z\bigr) &= \mathrm{E}\bigl(\widehat{\beta}_{XV} \,\big|\, Z\bigr) = \beta, \\
\mathrm{E}\bigl(\widehat{Y}_X \,\big|\, Z\bigr) &= \mathrm{E}\bigl(\widehat{Y}_{XV} \,\big|\, Z\bigr) = X\beta =: \mu, \\
\sum_{i=1}^n \mathrm{MSE}\bigl(\widehat{Y}_{X,i}\bigr)
 &= \sum_{i=1}^n \mathrm{var}\bigl(\widehat{Y}_{X,i} \,\big|\, Z\bigr)
  = \mathrm{tr}\bigl\{\mathrm{var}\bigl(\widehat{Y}_X \,\big|\, Z\bigr)\bigr\}
  = \mathrm{tr}\bigl(\sigma^2 H_X\bigr) = \sigma^2 k, \\
\sum_{i=1}^n \mathrm{MSE}\bigl(\widehat{Y}_{XV,i}\bigr)
 &= \sum_{i=1}^n \mathrm{var}\bigl(\widehat{Y}_{XV,i} \,\big|\, Z\bigr)
  = \mathrm{tr}\bigl\{\mathrm{var}\bigl(\widehat{Y}_{XV} \,\big|\, Z\bigr)\bigr\}
  = \mathrm{tr}\bigl(\sigma^2 H_{XV}\bigr) = \sigma^2\,(k + l), \\
\mathrm{E}\bigl(\mathrm{MS}_{e,X} \,\big|\, Z\bigr) &= \mathrm{E}\bigl(\mathrm{MS}_{e,XV} \,\big|\, Z\bigr) = \sigma^2.
\end{aligned}
\]
Least squares estimators
Both estimators \(\widehat{\beta}_X\) and \(\widehat{\beta}_{XV}\) are unbiased estimators of the vector \(\beta\). Nevertheless, as stated in Theorem 11.3, their quality expressed by the mean squared error, which in this case coincides with the covariance matrix, may differ, since
\[
\mathrm{MSE}\bigl(\widehat{\beta}_{XV}\bigr) - \mathrm{MSE}\bigl(\widehat{\beta}_X\bigr)
 = \mathrm{E}\Bigl[\bigl(\widehat{\beta}_{XV} - \beta\bigr)\bigl(\widehat{\beta}_{XV} - \beta\bigr)^\top \,\Big|\, Z\Bigr]
  - \mathrm{E}\Bigl[\bigl(\widehat{\beta}_X - \beta\bigr)\bigl(\widehat{\beta}_X - \beta\bigr)^\top \,\Big|\, Z\Bigr]
 = \mathrm{var}\bigl(\widehat{\beta}_{XV} \,\big|\, Z\bigr) - \mathrm{var}\bigl(\widehat{\beta}_X \,\big|\, Z\bigr) \geq 0.
\]
In particular, we derived during the proof of Theorem 11.3 that
\[
\mathrm{var}\bigl(\widehat{\beta}_{XV} \,\big|\, Z\bigr) - \mathrm{var}\bigl(\widehat{\beta}_X \,\big|\, Z\bigr)
 = \sigma^2 \Bigl[\bigl\{X^\top X - X^\top V (V^\top V)^{-1} V^\top X\bigr\}^{-1} - \bigl(X^\top X\bigr)^{-1}\Bigr].
\]
Let us again consider two situations.
(i) \(X^\top V = 0_{k\times l}\), which means that each column of \(X\) is orthogonal to each column of \(V\). In other words, the regressors included in the matrix \(X\) are uncorrelated with the regressors included in the matrix \(V\). Then
• \(\widehat{\beta}_X = \widehat{\beta}_{XV}\) and \(\mathrm{var}(\widehat{\beta}_X \,|\, Z) = \mathrm{var}(\widehat{\beta}_{XV} \,|\, Z)\).
• Hence \(\beta\) can be estimated using the model \(\mathcal{M}_{XV}\) with irrelevant covariates included without any impact on the quality of the estimator.
(ii) \(X^\top V \neq 0_{k\times l}\):
• The estimator \(\widehat{\beta}_{XV}\) is worse than the estimator \(\widehat{\beta}_X\) in terms of its variability.
• If we take into account the fact that by including more regressors in the model we are increasing the danger of multicollinearity, the difference between the variability of \(\widehat{\beta}_{XV}\) and that of \(\widehat{\beta}_X\) may become huge.
Prediction
Let us now compare the predictions \(\widehat{Y}^{new,X} = \widehat{Y}_X\) based on the correct model \(\mathcal{M}_X\) and the predictions \(\widehat{Y}^{new,XV} = \widehat{Y}_{XV}\) based on the also correct model \(\mathcal{M}_{XV}\), in which, however, irrelevant covariates were included. The properties of the fitted values in a correct model together with the results of Lemma 11.4 give
\[
\mathrm{MSEP}\bigl(\widehat{Y}^{new,XV}\bigr) = n\,\sigma^2 + (k + l)\,\sigma^2,
\qquad
\mathrm{MSEP}\bigl(\widehat{Y}^{new,X}\bigr) = n\,\sigma^2 + k\,\sigma^2.
\]
That is, the averaged mean squared errors of prediction are
\[
\mathrm{AMSEP}\bigl(\widehat{Y}^{new,XV}\bigr) = \sigma^2 + \frac{k+l}{n}\,\sigma^2,
\qquad
\mathrm{AMSEP}\bigl(\widehat{Y}^{new,X}\bigr) = \sigma^2 + \frac{k}{n}\,\sigma^2.
\]
The following can now be concluded.
• If \(n \to \infty\), both \(\mathrm{AMSEP}(\widehat{Y}^{new,XV})\) and \(\mathrm{AMSEP}(\widehat{Y}^{new,X})\) tend to \(\sigma^2\). Hence on average, if a sufficiently large number of predictions is needed, both models provide predictions of practically the same quality.
• On the other hand, by using the richer model \(\mathcal{M}_{XV}\) (which for a finite \(n\) provides worse predictions than the smaller model \(\mathcal{M}_X\)), we are eliminating the possible problem of omitted important covariates, which would lead to biased predictions with possibly even worse MSEP and AMSEP than those of model \(\mathcal{M}_{XV}\).
11.2.5  Summary
Interest in estimation of the regression coefficients and inference on them
If interest lies in estimation of and inference on the regression coefficients β related to the regressors
included in the model matrix X, the following was derived in Sections 11.2.3 and 11.2.4.
(i) If we omit important regressors which are (multiply) correlated with the regressors of main
interest included in the matrix X, the LSE of the regression coefficients is biased.
(ii) If we include irrelevant regressors which are (multiply) correlated with the regressors of main interest in the matrix \(X\), we are facing a danger of multicollinearity and the related inflation of the standard errors of the LSE of the regression coefficients.
(iii) Regressors which are (multiply) uncorrelated with the regressors of main interest influence neither the bias nor the variability of \(\widehat{\beta}\), irrespective of whether they are omitted or irrelevantly included.
Consequently, if a primary task of the analysis is to evaluate whether and how much the primary regressors included in the model matrix \(X\) influence the response expectation, a detailed exploration and understanding of the mutual relationships among all potential regressors, and also between the regressors and the response, is needed. In particular, regressors which are (multiply) correlated with the regressors from the model matrix \(X\) and at the same time do not have any influence on the response expectation should not be included in the model. On the other hand, regressors which are (multiply) uncorrelated with the regressors of primary interest can, without any harm, be included in the model. In general, it is necessary to find a trade-off between a too poor and a too rich model.
Interest in prediction
If prediction is the primary purpose of the regression analysis, the results derived in Sections 11.2.3 and 11.2.4 suggest the strategy of including all available covariates in the model. The reasons are the following.
(i) If we omit important regressors, the predictions become biased and the averaged mean squared error of prediction possibly does not tend to the optimal value of \(\sigma^2\) with \(n \to \infty\).
(ii) If we include irrelevant regressors in the model, this has, especially with \(n \to \infty\), a negligible effect on the quality of the prediction. The averaged mean squared error of prediction still tends to the optimal value of \(\sigma^2\).
Chapter 12

Simultaneous Inference in a Linear Model

In this chapter, we will assume that the data are represented by a set of \(n\) random vectors \((Y_i, X_i)\), \(X_i = (X_{i,0},\dots,X_{i,k-1})^\top\), \(i = 1,\dots,n\), that satisfy a normal linear model. That is,
\[
Y \,|\, X \sim \mathcal{N}_n\bigl(X\beta,\, \sigma^2 I_n\bigr),
\qquad
\mathrm{rank}\bigl(X_{n\times k}\bigr) = r \leq k < n,
\]
where \(Y = (Y_1,\dots,Y_n)^\top\), \(X\) is the matrix with the vectors \(X_1^\top,\dots,X_n^\top\) in its rows, and \(\beta = (\beta_0,\dots,\beta_{k-1})^\top \in \mathbb{R}^k\) and \(\sigma^2 > 0\) are unknown parameters. Further, we will assume that a matrix \(L_{m\times k}\) (\(m > 1\)) with rows \(l_1^\top,\dots,l_m^\top\) is given such that
\[
\theta = L\beta = \bigl(l_1^\top\beta,\dots,l_m^\top\beta\bigr)^\top = \bigl(\theta_1,\dots,\theta_m\bigr)^\top
\]
is an estimable vector parameter of the linear model. Our interest will lie in simultaneous inference on the elements of the parameter \(\theta\). This means we will be interested in
(i) deriving confidence regions for the vector parameter \(\theta\);
(ii) testing the null hypothesis \(\mathrm{H}_0: \theta = \theta^0\) for a given \(\theta^0 \in \mathbb{R}^m\).
12.1. BASIC SIMULTANEOUS INFERENCE
12.1
220
Basic simultaneous inference
If the matrix L_{m×k} is such that

(i) m ≤ r;

(ii) its rows, i.e., the vectors l_1, …, l_m ∈ R^k, are linearly independent,

then we already have a tool for a simultaneous inference on θ = Lβ. It is based on point (x) of Theorem 3.1 (Least squares estimators under the normality), which provides a confidence region for θ with a coverage of 1 − α:

    { θ ∈ R^m : (θ̂ − θ)^⊤ { MS_e L (X^⊤X)^− L^⊤ }^{−1} (θ̂ − θ) < m F_{m,n−r}(1 − α) },    (12.1)

where θ̂ = Lb is the LSE of θ. The null hypothesis H_0: θ = θ^0 is tested using the statistic

    Q_0 = (1/m) (θ̂ − θ^0)^⊤ { MS_e L (X^⊤X)^− L^⊤ }^{−1} (θ̂ − θ^0),    (12.2)

which under the null hypothesis follows the F_{m,n−r} distribution, and the critical region of a test on the level of α is

    C(α) = ( F_{m,n−r}(1 − α), ∞ ).    (12.3)

The P-value if Q_0 = q_0 is then given as p = 1 − CDF_{F,m,n−r}(q_0). Note that the confidence region (12.1) and the test based on the statistic Q_0 and the critical region (12.3) are mutually dual. That is, the null hypothesis is rejected on a level of α if and only if θ^0 is not covered by the confidence region (12.1) with a coverage of 1 − α.
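As a numerical illustration of the statistic (12.2), the following pure-Python sketch computes Q_0 for a simple straight-line model with L = I_2 (so θ = β) and θ^0 = 0_2. The function name and the toy data are illustrative only; the final comparison of Q_0 with an F_{m,n−r} quantile, or the P-value p = 1 − CDF_{F,m,n−r}(q_0), would additionally require an F-distribution routine, which is assumed to be available elsewhere and is not shown.

```python
def global_f_statistic(x, y):
    """Q0 of (12.2) for the model Y = beta0 + beta1*x + eps with
    L = I_2 (so theta = beta) and theta^0 = (0, 0); here m = 2, r = 2."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(u * v for u, v in zip(x, y))
    # LSE b = (X^T X)^{-1} X^T Y, written out for the two-parameter case
    b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b0 = (sy - b1 * sx) / n
    # residual mean square MS_e with n - r = n - 2 degrees of freedom
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)
    # Q0 = (1/m) b^T {MS_e (X^T X)^{-1}}^{-1} b = b^T (X^T X) b / (m MS_e)
    quad = b0 * (n * b0 + sx * b1) + b1 * (sx * b0 + sxx * b1)
    return b0, b1, mse, quad / (2 * mse)
```

For example, global_f_statistic([0, 1, 2, 3, 4], [0, 1, 2, 3, 5]) gives b = (−0.2, 1.2)^⊤, MS_e = 0.4/3 and Q_0 = 144.75, to be compared with quantiles of the F_{2,3} distribution.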
12.2 Multiple comparison procedures

12.2.1 Multiple testing
The null hypothesis H_0: θ = θ^0 (θ^0 = (θ_1^0, …, θ_m^0)^⊤) on a vector parameter θ = (θ_1, …, θ_m)^⊤ can also be written as H_0: θ_1 = θ_1^0 & ⋯ & θ_m = θ_m^0.
Definition 12.1 Multiple testing problem, elementary null hypotheses, global null hypothesis.
A testing problem with the null hypothesis

    H_0: θ_1 = θ_1^0  &  …  &  θ_m = θ_m^0    (12.4)

is called the multiple testing problem¹ with the m elementary hypotheses²

    H_1: θ_1 = θ_1^0, …, H_m: θ_m = θ_m^0.

In this context, the hypothesis H_0 is also called the global null hypothesis.
Note. The above definition of the multiple testing problem is a simplified version of the general multiple testing problem, in which the elementary null hypotheses are not necessarily simple hypotheses. Further, general multiple testing procedures also consider problems where the null hypothesis H_0 is not necessarily given as a conjunction of the elementary hypotheses. Nevertheless, Definition 12.1 will suffice for our purposes in the context of this lecture. The subsequent theory of multiple comparison procedures will likewise be presented in a simplified way, to the extent needed for its use in the context of the multiple testing problem according to Definition 12.1 and in the context of a linear model.
Notation.

• When dealing with a multiple testing problem, we will also write

    H_0 ≡ H_1 & … & H_m,    or    H_0 ≡ H_1, …, H_m,    or    H_0 = ⋂_{j=1}^m H_j.
• In the context of a multiple testing, the subscript 1 at H_1 will never indicate an alternative hypothesis. A superscript ∁ will rather be used to indicate an alternative hypothesis.
• The alternative hypothesis of a multiple testing problem with the null hypothesis (12.4) will always be given by the complement of the parameter space under the global null hypothesis, i.e.,

    H_0^∁: θ_1 ≠ θ_1^0  OR  …  OR  θ_m ≠ θ_m^0,    ≡    H_1^∁ OR … OR H_m^∁,

where H_j^∁: θ_j ≠ θ_j^0, j = 1, …, m. We will also write

    H_0^∁ = ⋃_{j=1}^m H_j^∁.

• Different ways of indexing the elementary null hypotheses will also be used (e.g., a double subscript), depending on the problem at hand.

¹ problém vícenásobného testování (in Czech)
² elementární hypotézy (in Czech)
Example 12.1 (Multiple testing problem for one-way classified group means).
Suppose that a normal linear model Y | X ∼ N_n(Xβ, σ² I_n) is used to model the dependence of the response Y on a single categorical covariate Z with a sample space Z = {1, …, G}, where the regression space M(X) of a vector dimension G parameterizes the one-way classified group means

    m_1 := E(Y | Z = 1), …, m_G := E(Y | Z = G).

If we restrict ourselves to full-rank parameterizations (see Section 7.4.4), the regression coefficients vector is β = (β_0, β_Z^⊤)^⊤, β_Z = (β_1, …, β_{G−1})^⊤, and the group means are parameterized as

    m_g = β_0 + c_g^⊤ β_Z,    g = 1, …, G,

where C is a chosen G × (G − 1) (pseudo)contrast matrix with rows c_1^⊤, …, c_G^⊤.

The null hypothesis H_0: m_1 = ⋯ = m_G on equality of the G group means can be specified as a multiple testing problem with m = (G choose 2) = G(G − 1)/2 elementary hypotheses (a double subscript will be used to index them):

    H_{1,2}: m_1 = m_2, …, H_{G−1,G}: m_{G−1} = m_G.

The elementary null hypotheses can now be written in terms of a vector estimable parameter

    θ = (θ_{1,2}, …, θ_{G−1,G})^⊤,    θ_{g,h} = m_g − m_h = (c_g − c_h)^⊤ β_Z,    g = 1, …, G − 1, h = g + 1, …, G,

as

    H_{1,2}: θ_{1,2} = 0, …, H_{G−1,G}: θ_{G−1,G} = 0,

or, written directly in terms of the group means, as

    H_{1,2}: m_1 − m_2 = 0, …, H_{G−1,G}: m_{G−1} − m_G = 0.

The global null hypothesis is H_0: θ = 0, where θ = Lβ. Here, L is a (G choose 2) × G matrix whose rows are the vectors (0, (c_g − c_h)^⊤), g < h, i.e.,

    L = ( 0   (c_1 − c_2)^⊤
          ⋮         ⋮
          0   (c_{G−1} − c_G)^⊤ ).

Since rank(C) = G − 1, we have rank(L) = G − 1. We then have
• For G ≥ 4, m = (G choose 2) > G. That is, in this case, the number of elementary null hypotheses is higher than the rank of the underlying linear model.
• For G ≥ 3, the matrix L has linearly dependent rows.
That is, for G ≥ 3, we can
(i) neither calculate a simultaneous confidence region (12.1) for θ;
(ii) nor use the test statistic (12.2) to test H0 : θ = 0.
In this chapter,

(i) we develop procedures that allow us to test the null hypothesis H_0: Lβ = θ^0 and provide a simultaneous confidence region for θ = Lβ even if the rows of the matrix L are linearly dependent or their number m is higher than the rank of the underlying linear model;

(ii) the test procedure will also decide which of the elementary hypotheses is/are responsible (in a certain sense) for the rejection of the global null hypothesis;

(iii) the developed confidence regions will have the more appealing form of a product of intervals.
12.2.2 Simultaneous confidence intervals

Suppose that a distribution of the random vector D depends on a (vector) parameter θ = (θ_1, …, θ_m)^⊤ ∈ Θ_1 × ⋯ × Θ_m = Θ ⊆ R^m.
Definition 12.2 Simultaneous confidence intervals.
(Random) intervals (θ_j^L, θ_j^U), j = 1, …, m, where θ_j^L = θ_j^L(D) and θ_j^U = θ_j^U(D), j = 1, …, m, are called simultaneous confidence intervals³ for the parameter θ with a coverage of 1 − α if for any θ^0 = (θ_1^0, …, θ_m^0)^⊤ ∈ Θ,

    P( (θ_1^L, θ_1^U) × ⋯ × (θ_m^L, θ_m^U) ∋ θ^0;  θ = θ^0 ) ≥ 1 − α.

Notes.

• The condition in the definition can also be written as

    P( ∀ j = 1, …, m: (θ_j^L, θ_j^U) ∋ θ_j^0;  θ = θ^0 ) ≥ 1 − α.

• The product of the simultaneous confidence intervals indeed forms a confidence region in the classical sense.
Example 12.2 (Bonferroni simultaneous confidence intervals).
Let, for each j = 1, …, m, (θ_j^L, θ_j^U) be a classical confidence interval for θ_j with a coverage of 1 − α/m. That is,

    ∀ j = 1, …, m, ∀ θ_j^0 ∈ Θ_j:  P( (θ_j^L, θ_j^U) ∋ θ_j^0;  θ_j = θ_j^0 ) ≥ 1 − α/m.

We then have

    ∀ j = 1, …, m, ∀ θ_j^0 ∈ Θ_j:  P( (θ_j^L, θ_j^U) ∌ θ_j^0;  θ_j = θ_j^0 ) ≤ α/m.

Further, using an elementary property of probability (for any θ^0 ∈ Θ),

    P( ∃ j = 1, …, m: (θ_j^L, θ_j^U) ∌ θ_j^0;  θ = θ^0 ) ≤ Σ_{j=1}^m P( (θ_j^L, θ_j^U) ∌ θ_j^0;  θ = θ^0 ) ≤ Σ_{j=1}^m α/m = α.

Hence,

    P( ∀ j = 1, …, m: (θ_j^L, θ_j^U) ∋ θ_j^0;  θ = θ^0 ) ≥ 1 − α.

That is, the intervals (θ_j^L, θ_j^U), j = 1, …, m, are simultaneous confidence intervals for the parameter θ with a coverage of 1 − α. Simultaneous confidence intervals constructed in this way from univariate confidence intervals are called Bonferroni simultaneous confidence intervals. Their disadvantage is that they are often seriously conservative, i.e., having a coverage (much) higher than the requested 1 − α.

³ simultánní intervaly spolehlivosti (in Czech)
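The validity (and mild conservativeness) of the Bonferroni construction can be checked by simulation. The sketch below, using only the Python standard library, builds the intervals for m independent N(μ_j, 1) means from one observation each with known σ = 1, so the empirical simultaneous coverage can be compared with 1 − α; all names are illustrative. In this independent case the exact simultaneous coverage is (1 − α/m)^m, e.g. ≈ 0.951 for m = 5 and α = 0.05, slightly above the requested 0.95.

```python
import random
from statistics import NormalDist

def bonferroni_coverage(m=5, alpha=0.05, n_sim=20_000, seed=42):
    """Empirical simultaneous coverage of Bonferroni intervals for m
    independent N(mu_j, 1) means, one observation X_j per mean and
    known sigma = 1, so each interval is X_j +/- z_{1 - alpha/(2m)}."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / (2 * m))  # quantile 1 - alpha/(2m)
    covered = 0
    for _ in range(n_sim):
        # true means taken as 0; coverage does not depend on their values
        if all(abs(rng.gauss(0.0, 1.0)) < z for _ in range(m)):
            covered += 1
    return covered / n_sim
```

With the defaults above, the returned proportion should land close to 0.951, i.e., at or above the nominal 0.95.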
12.2.3 Multiple comparison procedure, P-values adjusted for multiple comparison

Suppose again that a distribution of the random vector D depends on a (vector) parameter θ = (θ_1, …, θ_m)^⊤ ∈ Θ_1 × ⋯ × Θ_m = Θ ⊆ R^m. Let for each 0 < α < 1 a procedure be given to construct the simultaneous confidence intervals (θ_j^L(α), θ_j^U(α)), j = 1, …, m, for the parameter θ with a coverage of 1 − α. Let for each j = 1, …, m the procedure create intervals satisfying the monotonicity condition

    1 − α_1 < 1 − α_2    ⟹    (θ_j^L(α_1), θ_j^U(α_1)) ⊆ (θ_j^L(α_2), θ_j^U(α_2)).
Definition 12.3 Multiple comparison procedure.
A multiple comparison procedure (MCP)⁴ for a multiple testing problem with the elementary null hypotheses H_j: θ_j = θ_j^0, j = 1, …, m, based on a given procedure for the construction of simultaneous confidence intervals for the parameter θ, is the testing procedure that, for given 0 < α < 1,

(i) rejects the global null hypothesis H_0: θ = θ^0 if and only if

    (θ_1^L(α), θ_1^U(α)) × ⋯ × (θ_m^L(α), θ_m^U(α)) ∌ θ^0;

(ii) for j = 1, …, m, rejects the jth elementary hypothesis H_j: θ_j = θ_j^0 if and only if

    (θ_j^L(α), θ_j^U(α)) ∌ θ_j^0.

Note. Since (θ_1^L(α), θ_1^U(α)) × ⋯ × (θ_m^L(α), θ_m^U(α)) ∌ θ^0 if and only if there exists j = 1, …, m such that (θ_j^L(α), θ_j^U(α)) ∌ θ_j^0, the MCP rejects, for given 0 < α < 1, the global null hypothesis H_0: θ = θ^0 if and only if it rejects at least one out of the m elementary null hypotheses.

⁴ procedura vícenásobného srovnávání (in Czech)
Note (Control of the type-I error rate).
The classical duality between confidence regions and testing procedures provides that for any 0 < α < 1, the multiple comparison procedure defines a statistical test which

(i) controls the type-I error rate with respect to the global null hypothesis H_0: θ = θ^0, i.e.,

    P( H_0 rejected;  θ = θ^0 ) ≤ α;

(ii) at the same time, for each j = 1, …, m, controls the type-I error rate with respect to the elementary hypothesis H_j: θ_j = θ_j^0, i.e.,

    P( H_j rejected;  θ_j = θ_j^0 ) ≤ α.
[End of Lecture #21 (10/12/2015)]
[Start of Lecture #22 (17/12/2015)]

Definition 12.4 P-values adjusted for multiple comparison.
P-values adjusted for multiple comparison for a multiple testing problem with the elementary null hypotheses H_j: θ_j = θ_j^0, j = 1, …, m, based on a given procedure for the construction of simultaneous confidence intervals for the parameter θ, are the values p_1^adj, …, p_m^adj defined as

    p_j^adj = inf{ α: (θ_j^L(α), θ_j^U(α)) ∌ θ_j^0 },    j = 1, …, m.
Notes.
The following is clear from the construction:

• The multiple comparison procedure rejects for given 0 < α < 1 the jth elementary hypothesis H_j: θ_j = θ_j^0 (j = 1, …, m) if and only if p_j^adj ≤ α.

• Since the global null hypothesis H_0: θ = θ^0 is rejected by the MCP if and only if at least one elementary hypothesis is rejected, we have that the global null hypothesis is for given α rejected if and only if

    min( p_1^adj, …, p_m^adj ) ≤ α.

That is, min( p_1^adj, …, p_m^adj ) is the P-value of a test of the global null hypothesis based on the considered MCP.
Example 12.3 (Bonferroni multiple comparison procedure, Bonferroni adjusted P-values).
Let, for 0 < α < 1, (θ_j^L(α), θ_j^U(α)), j = 1, …, m, be the confidence intervals for the parameters θ_1, …, θ_m, each with a (univariate) coverage of 1 − α/m. That is,

    ∀ j = 1, …, m, ∀ θ_j^0 ∈ Θ_j:  P( (θ_j^L(α), θ_j^U(α)) ∋ θ_j^0;  θ_j = θ_j^0 ) ≥ 1 − α/m.

As shown in Example 12.2, (θ_j^L(α), θ_j^U(α)), j = 1, …, m, are the Bonferroni simultaneous confidence intervals for the parameter θ = (θ_1, …, θ_m)^⊤ with a coverage of 1 − α.

Let, for j = 1, …, m, p_j^uni be the P-value related to the (single) test of the (jth elementary) hypothesis H_j: θ_j = θ_j^0 being dual to the confidence interval (θ_j^L(α), θ_j^U(α)). That is,

    p_j^uni = inf{ α/m: (θ_j^L(α), θ_j^U(α)) ∌ θ_j^0 }.

Hence,

    min( m p_j^uni, 1 ) = inf{ α: (θ_j^L(α), θ_j^U(α)) ∌ θ_j^0 }.

That is, the P-values adjusted for multiple comparison based on the Bonferroni simultaneous confidence intervals are

    p_j^B = min( m p_j^uni, 1 ),    j = 1, …, m.

The related multiple comparison procedure is called the Bonferroni MCP.

The conservativeness of the Bonferroni MCP is seen, for instance, from the fact that the global null hypothesis H_0: θ = θ^0 is rejected for given 0 < α < 1 if and only if at least one of the elementary hypotheses is rejected by its single test on a significance level of α/m, which approaches zero as m, the number of elementary hypotheses, increases.
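In terms of computation, the Bonferroni adjustment of Example 12.3 is a one-liner; a minimal sketch (illustrative function name):

```python
def bonferroni_adjust(p_uni):
    """Bonferroni adjusted P-values: p_j^B = min(m * p_j^uni, 1)."""
    m = len(p_uni)
    return [min(m * p, 1.0) for p in p_uni]
```

The global null hypothesis is then rejected at level α exactly when min_j p_j^B ≤ α, i.e., when min_j p_j^uni ≤ α/m.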
12.2.4 Bonferroni simultaneous inference in a normal linear model

Consider a normal linear model

    Y | X ∼ N_n(Xβ, σ² I_n),    rank(X_{n×k}) = r ≤ k < n.

Let

    θ = Lβ = (l_1^⊤ β, …, l_m^⊤ β)^⊤ = (θ_1, …, θ_m)^⊤

be an estimable vector parameter of the linear model. At this point, we shall only require that l_j ≠ 0_k for each j = 1, …, m. Nevertheless, we allow for m > r and also for possibly linearly dependent vectors l_1, …, l_m.

As usual, let θ̂ = Lb = (l_1^⊤ b, …, l_m^⊤ b)^⊤ = (θ̂_1, …, θ̂_m)^⊤ be the LSE of the vector θ and let MS_e be the residual mean square of the model.
It follows from the properties of the LSE under normality that for given α, the (1 − α/m)·100% confidence intervals for the parameters θ_1, …, θ_m have the lower and the upper bounds given as

    θ_j^L(α) = l_j^⊤ b − √( MS_e l_j^⊤ (X^⊤X)^− l_j ) t_{n−r}(1 − α/(2m)),
    θ_j^U(α) = l_j^⊤ b + √( MS_e l_j^⊤ (X^⊤X)^− l_j ) t_{n−r}(1 − α/(2m)),    j = 1, …, m.    (12.5)

By the Bonferroni principle, the intervals (θ_j^L(α), θ_j^U(α)), j = 1, …, m, are simultaneous confidence intervals for the parameter θ with a coverage of 1 − α.
For each j = 1, …, m, the confidence interval (12.5) is dual to the (single) test of the (jth elementary) hypothesis H_j: θ_j = θ_j^0 based on the statistic

    T_j(θ_j^0) = ( l_j^⊤ b − θ_j^0 ) / √( MS_e l_j^⊤ (X^⊤X)^− l_j ),

which under the hypothesis H_j follows the Student t_{n−r} distribution. The univariate P-values are then calculated as

    p_j^uni = 2 CDF_{t,n−r}( −|t_{j,0}| ),

where t_{j,0} is the value of the statistic T_j(θ_j^0) attained with the given data. Hence the Bonferroni adjusted P-values for a multiple testing problem with the elementary null hypotheses H_j: θ_j = θ_j^0, j = 1, …, m, are

    p_j^B = min{ 2 m CDF_{t,n−r}( −|t_{j,0}| ), 1 },    j = 1, …, m.
12.3 Tukey's T-procedure

The method presented in this section is due to John Wilder Tukey (1915–2000), who published the initial version of the method in 1949 (Tukey, 1949).
12.3.1 Tukey's pairwise comparisons theorem

Lemma 12.1 Studentized range.
Let T_1, …, T_m be a random sample from N(μ, σ²), σ² > 0. Let

    R = max_{j=1,…,m} T_j − min_{j=1,…,m} T_j

be the range of the sample. Let S² be an estimator of σ² such that S² and T = (T_1, …, T_m)^⊤ are independent and

    ν S²/σ² ∼ χ²_ν    for some ν > 0.

Let

    Q = R/S.

The distribution of the random variable Q then depends on neither μ, nor σ.
Proof.

• We can write

    R/S = { max_j (T_j − μ)/σ − min_j (T_j − μ)/σ } / (S/σ).

• The distribution of the numerator depends on neither μ, nor σ, since for all j = 1, …, m,

    (T_j − μ)/σ ∼ N(0, 1).

• The distribution of S/σ is a transformation of the χ²_ν distribution.    ∎

Note. The distribution of the random variable Q = R/S from Lemma 12.1 still depends on m (the sample size of T) and ν (the degrees of freedom of the χ² distribution related to the variance estimator S²).
Definition 12.5 Studentized range.
The random variable Q = R/S from Lemma 12.1 will be called the studentized range⁵ of a sample of size m with ν degrees of freedom and its distribution will be denoted as q_{m,ν}.

Notation.

• For 0 < p < 1, the p·100% quantile of the random variable Q with distribution q_{m,ν} will be denoted as q_{m,ν}(p).

• The distribution function of the random variable Q with distribution q_{m,ν} will be denoted CDF_{q,m,ν}(·).
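Quantiles q_{m,ν}(p) are tabulated and implemented in statistical software (in R, see ptukey/qtukey). As a self-contained illustration of Definition 12.5, they can also be approximated by Monte Carlo: simulate R from an N(0, 1) sample of size m and, independently, S² with ν S² ∼ χ²_ν. The sketch below uses only the Python standard library; the function name and simulation size are illustrative.

```python
import random

def studentized_range_quantile(m, nu, p, n_sim=100_000, seed=1):
    """Monte Carlo approximation of the quantile q_{m,nu}(p) of the
    studentized range distribution from Definition 12.5."""
    rng = random.Random(seed)
    qs = []
    for _ in range(n_sim):
        t = [rng.gauss(0.0, 1.0) for _ in range(m)]
        r = max(t) - min(t)  # range R of an N(0, 1) sample of size m
        # S^2 with nu * S^2 / sigma^2 ~ chi^2_nu (here sigma = 1)
        s2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu)) / nu
        qs.append(r / s2 ** 0.5)
    qs.sort()
    return qs[int(p * n_sim)]
```

For instance, studentized_range_quantile(3, 10, 0.95) should come out close to the tabulated value q_{3,10}(0.95) ≈ 3.88.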
Theorem 12.2 Tukey's pairwise comparisons theorem, balanced version.
Let T_1, …, T_m be independent random variables and let T_j ∼ N(μ_j, v² σ²), j = 1, …, m, where v² > 0 is a known constant. Let S² be an estimator of σ² such that S² and T = (T_1, …, T_m)^⊤ are independent and

    ν S²/σ² ∼ χ²_ν    for some ν > 0.

Then

    P( for all j ≠ l:  |T_j − T_l − (μ_j − μ_l)| < q_{m,ν}(1 − α) √(v² S²) ) = 1 − α.
Proof.

• It follows from the assumptions that the random variables (T_j − μ_j)/v, j = 1, …, m, are i.i.d. with the distribution N(0, σ²).

• Let R = max_j (T_j − μ_j)/v − min_j (T_j − μ_j)/v. Then R/S ∼ q_{m,ν}.

• Hence for any 0 < α < 1 (q_{m,ν} is a continuous distribution):

    1 − α = P( { max_j (T_j − μ_j)/v − min_j (T_j − μ_j)/v } / S < q_{m,ν}(1 − α) )
          = P( { max_j (T_j − μ_j) − min_j (T_j − μ_j) } / (v S) < q_{m,ν}(1 − α) )
          = P( max_j (T_j − μ_j) − min_j (T_j − μ_j) < v S q_{m,ν}(1 − α) )
          = P( for all j ≠ l:  |(T_j − μ_j) − (T_l − μ_l)| < v S q_{m,ν}(1 − α) )
          = P( for all j ≠ l:  |T_j − T_l − (μ_j − μ_l)| < q_{m,ν}(1 − α) √(v² S²) ).    ∎

⁵ studentizované rozpětí (in Czech)
Theorem 12.3 Tukey's pairwise comparisons theorem, general version.
Let T_1, …, T_m be independent random variables and let T_j ∼ N(μ_j, v_j² σ²), j = 1, …, m, where v_j² > 0, j = 1, …, m, are known constants. Let S² be an estimator of σ² such that S² and T = (T_1, …, T_m)^⊤ are independent and

    ν S²/σ² ∼ χ²_ν    for some ν > 0.

Then

    P( for all j ≠ l:  |T_j − T_l − (μ_j − μ_l)| < q_{m,ν}(1 − α) √( (v_j² + v_l²)/2 · S² ) ) ≥ 1 − α.

Proof. The proof/calculations were skipped and are not requested for the exam. See Hayter (1984).    ∎
Notes.

• Tukey suggested already in 1953 (in an unpublished manuscript, Tukey, 1953) that the statement of Theorem 12.3 holds, without proving it. Independently, it was also suggested by Kramer (1956). Consequently, the statement of Theorem 12.3 was called the Tukey–Kramer conjecture.

• The proof is not an easy adaptation of the proof of the balanced version.
12.3.2 Tukey's honest significance differences (HSD)

The method of multiple comparison that will now be developed appears under several different names in the literature: Tukey's method, Tukey–Kramer method, Tukey's range test, Tukey's honest significance differences (HSD) test.

Assumptions.
In the following, we assume that

    T = (T_1, …, T_m)^⊤ ∼ N_m(μ, σ² V),

where

• μ = (μ_1, …, μ_m)^⊤ ∈ R^m and σ² > 0 are unknown parameters;

• V is a known diagonal matrix with v_1², …, v_m² on the diagonal.

That is, T_1, …, T_m are independent and T_j ∼ N(μ_j, σ² v_j²), j = 1, …, m. Further, we will assume that an estimator S² of σ² is available which is independent of T and which satisfies ν S²/σ² ∼ χ²_ν for some ν > 0.
Multiple comparison problem.
The multiple comparison procedure that will be developed aims at testing m* = (m choose 2) elementary hypotheses on all pairwise differences between the means μ_1, …, μ_m. Let

    θ_{j,l} = μ_j − μ_l,    j = 1, …, m − 1, l = j + 1, …, m,
    θ = (θ_{1,2}, θ_{1,3}, …, θ_{m−1,m})^⊤.

The elementary hypotheses of the multiple testing problem that we shall consider are

    H_{j,l}: θ_{j,l} (= μ_j − μ_l) = θ_{j,l}^0,    j = 1, …, m − 1, l = j + 1, …, m,

for some θ^0 = (θ_{1,2}^0, θ_{1,3}^0, …, θ_{m−1,m}^0)^⊤ ∈ R^{m*}. The global null hypothesis is, as usual, H_0: θ = θ^0.

Note. The most common multiple testing problem in this context is with θ^0 = 0_{m*}, which corresponds to all pairwise comparisons of the means μ_1, …, μ_m. The global null hypothesis then states that all the means are equal.
Some derivations

Using either of the Tukey's pairwise comparison theorems (Theorems 12.2 and 12.3), we have (for chosen 0 < α < 1):

    P( for all j ≠ l:  |T_j − T_l − (μ_j − μ_l)| < q_{m,ν}(1 − α) √( (v_j² + v_l²)/2 · S² ) ) ≥ 1 − α,

with equality of the above probability to 1 − α in the balanced case of v_1² = ⋯ = v_m². That is, we have

    P( for all j ≠ l:  |T_j − T_l − (μ_j − μ_l)| / √( (v_j² + v_l²)/2 · S² ) < q_{m,ν}(1 − α) ) ≥ 1 − α.

Let, for j ≠ l and for θ_{j,l}^0 ∈ R,

    T_{j,l}(θ_{j,l}^0) := ( T_j − T_l − θ_{j,l}^0 ) / √( (v_j² + v_l²)/2 · S² ).

That is,

    1 − α ≤ P( for all j ≠ l:  |T_{j,l}(θ_{j,l}^0)| < q_{m,ν}(1 − α);  θ = θ^0 )
          = P( for all j ≠ l:  |T_j − T_l − θ_{j,l}^0| < q_{m,ν}(1 − α) √( (v_j² + v_l²)/2 · S² );  θ = θ^0 )
          = P( for all j ≠ l:  (θ_{j,l}^{TL}(α), θ_{j,l}^{TU}(α)) ∋ θ_{j,l}^0;  θ = θ^0 ),    (12.6)

where, for j < l,

    θ_{j,l}^{TL}(α) = T_j − T_l − q_{m,ν}(1 − α) √( (v_j² + v_l²)/2 · S² ),    (12.7)
    θ_{j,l}^{TU}(α) = T_j − T_l + q_{m,ν}(1 − α) √( (v_j² + v_l²)/2 · S² ).    (12.8)
Theorem 12.4 Tukey's honest significance differences.
The random intervals given by (12.7) and (12.8) are simultaneous confidence intervals for the parameters θ_{j,l} = μ_j − μ_l, j = 1, …, m − 1, l = j + 1, …, m, with a coverage of 1 − α.

In the balanced case of v_1² = ⋯ = v_m², the coverage is exactly equal to 1 − α, i.e., for any θ^0 ∈ R^{m*},

    P( for all j ≠ l:  (θ_{j,l}^{TL}(α), θ_{j,l}^{TU}(α)) ∋ θ_{j,l}^0;  θ = θ^0 ) = 1 − α.

The related P-values for a multiple testing problem with the elementary hypotheses H_{j,l}: θ_{j,l} = θ_{j,l}^0, θ_{j,l}^0 ∈ R, j < l, adjusted for multiple comparison, are given by

    p_{j,l}^T = 1 − CDF_{q,m,ν}( |t_{j,l}^0| ),    j < l,

where t_{j,l}^0 is the value of

    T_{j,l}(θ_{j,l}^0) = ( T_j − T_l − θ_{j,l}^0 ) / √( (v_j² + v_l²)/2 · S² )

attained with the given data.
Proof.
The fact that (θ_{j,l}^{TL}(α), θ_{j,l}^{TU}(α)), j < l, are simultaneous confidence intervals for the parameters θ_{j,l} = μ_j − μ_l with a coverage of 1 − α follows from (12.6).

The fact that the coverage of the simultaneous confidence intervals is exactly equal to 1 − α in the balanced case follows from the fact that the inequality in (12.6) is an equality in the balanced case.

The calculation of the P-values adjusted for multiple comparison related to the multiple testing problem with the elementary hypotheses H_{j,l}: θ_{j,l} = θ_{j,l}^0, j < l, follows from noting the following (for each j < l):

    (θ_{j,l}^{TL}(α), θ_{j,l}^{TU}(α)) ∌ θ_{j,l}^0    ⟺    |T_{j,l}(θ_{j,l}^0)| ≥ q_{m,ν}(1 − α).

It now follows from the monotonicity of the quantiles of the continuous studentized range distribution that

    p_{j,l}^T = inf{ α: (θ_{j,l}^{TL}(α), θ_{j,l}^{TU}(α)) ∌ θ_{j,l}^0 } = inf{ α: |T_{j,l}(θ_{j,l}^0)| ≥ q_{m,ν}(1 − α) }

is attained for p_{j,l}^T satisfying

    |T_{j,l}(θ_{j,l}^0)| = q_{m,ν}( 1 − p_{j,l}^T ).

That is, if t_{j,l}^0 is the value of the statistic T_{j,l}(θ_{j,l}^0) attained with the given data, we have

    p_{j,l}^T = 1 − CDF_{q,m,ν}( |t_{j,l}^0| ).    ∎
12.3.3 Tukey's HSD in a linear model

In the context of a normal linear model Y | X ∼ N_n(Xβ, σ² I_n), rank(X_{n×k}) = r ≤ k < n, the Tukey's honest significance differences are applicable in the following situation.

• L_{m×k} is a matrix with non-zero rows l_1^⊤, …, l_m^⊤ such that the parameter

    η = Lβ = (l_1^⊤ β, …, l_m^⊤ β)^⊤ = (η_1, …, η_m)^⊤

is estimable.

• The matrix L is such that

    V := L (X^⊤X)^− L^⊤ = (v_{j,l})_{j,l=1,…,m}

is a diagonal matrix with v_j² := v_{j,j}, j = 1, …, m.

With b = (X^⊤X)^− X^⊤ Y and the residual mean square MS_e of the fitted linear model, we have (conditionally, given the model matrix X):

    T := η̂ = (l_1^⊤ b, …, l_m^⊤ b)^⊤ = Lb ∼ N_m(η, σ² V),    (n − r) MS_e / σ² ∼ χ²_{n−r},

with η̂ and MS_e independent.

Hence the Tukey's T-procedure can be used for a multiple comparison problem on the (also estimable) parameters

    θ_{j,l} = η_j − η_l = (l_j − l_l)^⊤ β,    j < l.

The Tukey's simultaneous confidence intervals for the parameters θ_{j,l}, j < l, with a coverage of 1 − α then have the lower and the upper bounds given as

    θ_{j,l}^{TL}(α) = η̂_j − η̂_l − q_{m,n−r}(1 − α) √( (v_j² + v_l²)/2 · MS_e ),
    θ_{j,l}^{TU}(α) = η̂_j − η̂_l + q_{m,n−r}(1 − α) √( (v_j² + v_l²)/2 · MS_e ),    j < l.

The calculation of the P-values adjusted for multiple comparison related to the multiple testing problem with the elementary hypotheses

    H_{j,l}: θ_{j,l} = θ_{j,l}^0,    j < l,

for chosen θ_{j,l}^0 ∈ R, is based on the statistics

    T_{j,l}(θ_{j,l}^0) = ( η̂_j − η̂_l − θ_{j,l}^0 ) / √( (v_j² + v_l²)/2 · MS_e ),    j < l.
The above procedure is in particular applicable if all involved covariates are categorical and the model corresponds to a one-way, two-way or higher-way classification. If normal and homoscedastic errors in the underlying linear model are assumed, the Tukey's HSD method can then be used to develop a multiple comparison procedure for differences between the group means or between the means of the group means.
One-way classification

Let Y = (Y_{1,1}, …, Y_{G,n_G})^⊤, n = Σ_{g=1}^G n_g, and

    Y_{g,j} = m_g + ε_{g,j},    g = 1, …, G, j = 1, …, n_g,    ε_{g,j} i.i.d. ∼ N(0, σ²).
We then have (see Theorem 9.1, with random covariates conditionally given the covariate values)

    T := (Ȳ_1, …, Ȳ_G)^⊤ ∼ N_G( (m_1, …, m_G)^⊤, σ² diag(1/n_1, …, 1/n_G) ).
Moreover, the residual mean square MS_e of the underlying one-way ANOVA linear model satisfies, with ν_e = n − G,

    ν_e MS_e / σ² ∼ χ²_{ν_e},    MS_e and T independent

(due to the fact that T is the LSE of the vector of group means m = (m_1, …, m_G)^⊤). Hence the Tukey's simultaneous confidence intervals for θ_{g,h} = m_g − m_h, g = 1, …, G − 1, h = g + 1, …, G, with a coverage of 1 − α, have the lower and upper bounds given as

    Ȳ_g − Ȳ_h ± q_{G,n−G}(1 − α) √( (1/2) (1/n_g + 1/n_h) MS_e ),    g < h.

In the case of balanced data (n_1 = ⋯ = n_G), the coverage of those intervals is even exactly equal to 1 − α; otherwise, the intervals are conservative (having a coverage greater than 1 − α).
The calculation of the P-values adjusted for multiple comparison related to the multiple testing problem with the elementary hypotheses

    H_{g,h}: θ_{g,h} = θ_{g,h}^0,    g < h,

for chosen θ_{g,h}^0 ∈ R, is based on the statistics

    T_{g,h}(θ_{g,h}^0) = ( Ȳ_g − Ȳ_h − θ_{g,h}^0 ) / √( (1/2) (1/n_g + 1/n_h) MS_e ),    g < h.
Note. The R function TukeyHSD, applied to objects obtained using the function aov (which performs LSE-based inference for linear models involving only categorical covariates), provides a software implementation of the Tukey's T multiple comparison procedure described here.
Two-way classification

Let Y = (Y_{1,1,1}, …, Y_{G,H,n_{G,H}})^⊤, n = Σ_{g=1}^G Σ_{h=1}^H n_{g,h}, and

    Y_{g,h,j} = m_{g,h} + ε_{g,h,j},    g = 1, …, G, h = 1, …, H, j = 1, …, n_{g,h},    ε_{g,h,j} i.i.d. ∼ N(0, σ²).
Let, as usual,

    n_{g•} = Σ_{h=1}^H n_{g,h},    Ȳ_{g•} = (1/n_{g•}) Σ_{h=1}^H Σ_{j=1}^{n_{g,h}} Y_{g,h,j},

    m_{g•} = (1/H) Σ_{h=1}^H m_{g,h},    m_{g•}^{wt} = (1/n_{g•}) Σ_{h=1}^H n_{g,h} m_{g,h},    g = 1, …, G.
Balanced data

In the case of balanced data (n_{g,h} = J for all g, h), we have n_{g•} = J H and m_{g•}^{wt} = m_{g•}. Further,

    T := (Ȳ_{1•}, …, Ȳ_{G•})^⊤ ∼ N_G( (m_{1•}, …, m_{G•})^⊤, σ² diag(1/(JH), …, 1/(JH)) ),

see the Consequence of Theorem 9.2. Further, let MS_e^{ZW} and MS_e^{Z+W} be the residual mean squares from the interaction model and the additive model, with ν_e^{ZW} = n − G H and ν_e^{Z+W} = n − G − H + 1 degrees of freedom, respectively. We have shown in the proof of the Consequence of Theorem 9.2 that for both the interaction model and the additive model, the sample means Ȳ_{1•}, …, Ȳ_{G•} are the LSE's of the estimable parameters m_{1•}, …, m_{G•} and hence, for both models, the vector T is independent of the corresponding residual mean square. Further, depending on whether the interaction model or the additive model is assumed, we have

    ν_e* MS_e* / σ² ∼ χ²_{ν_e*},

where MS_e* is the residual mean square of the model that is assumed (MS_e^{ZW} or MS_e^{Z+W}) and ν_e* the corresponding degrees of freedom (ν_e^{ZW} or ν_e^{Z+W}). Hence the Tukey's simultaneous confidence intervals for θ_{g_1,g_2} = m_{g_1•} − m_{g_2•}, g_1 = 1, …, G − 1, g_2 = g_1 + 1, …, G, have the lower and upper bounds given as

    Ȳ_{g_1•} − Ȳ_{g_2•} ± q_{G,ν_e*}(1 − α) √( MS_e* / (J H) ),

and the coverage of those intervals is even exactly equal to 1 − α.
The calculation of the P-values adjusted for multiple comparison related to the multiple testing problem with the elementary hypotheses

    H_{g_1,g_2}: θ_{g_1,g_2} = θ_{g_1,g_2}^0,    g_1 < g_2,

for chosen θ_{g_1,g_2}^0 ∈ R, is based on the statistics

    T_{g_1,g_2}(θ_{g_1,g_2}^0) = ( Ȳ_{g_1•} − Ȳ_{g_2•} − θ_{g_1,g_2}^0 ) / √( MS_e* / (J H) ),    g_1 < g_2.

[End of Lecture #22 (17/12/2015)]
Unbalanced data

With unbalanced data, direct calculation shows that

    T := (Ȳ_{1•}, …, Ȳ_{G•})^⊤ ∼ N_G( (m_{1•}^{wt}, …, m_{G•}^{wt})^⊤, σ² diag(1/n_{1•}, …, 1/n_{G•}) ).

Further, the sample means Ȳ_{1•}, …, Ȳ_{G•} are the LSE's of the estimable parameters m_{1•}^{wt}, …, m_{G•}^{wt} in both the interaction and the additive model. This is obvious for the interaction model since there we know the fitted values (≡ the LSE's of the group means m_{g,h}). Those are Ŷ_{g,h,j} = Ȳ_{g,h•}, g = 1, …, G, h = 1, …, H, j = 1, …, n_{g,h} (Theorem 9.2). Hence the sample means Ȳ_{1•}, …, Ȳ_{G•}, which are their linear combinations, are the LSE's of the corresponding linear combinations of the group means m_{g,h}. Those are the weighted means of the means m_{1•}^{wt}, …, m_{G•}^{wt}. That the sample means Ȳ_{1•}, …, Ȳ_{G•} are the LSE's of the estimable parameters m_{1•}^{wt}, …, m_{G•}^{wt} also in the additive model requires a separate calculation.
For the rest, we can proceed in the same way as in the balanced case. That is, let MS_e* and ν_e* denote the residual mean square and the residual degrees of freedom of the model that is assumed (interaction or additive). Owing to the fact that T is a vector of the LSE's of estimable parameters for both models, it is independent of MS_e*. The Tukey's T multiple comparison procedure is now applicable for inference on the parameters

    θ_{g_1,g_2}^{wt} = m_{g_1•}^{wt} − m_{g_2•}^{wt},    g_1 = 1, …, G − 1, g_2 = g_1 + 1, …, G.

The Tukey's simultaneous confidence intervals for θ_{g_1,g_2}^{wt} = m_{g_1•}^{wt} − m_{g_2•}^{wt}, g_1 = 1, …, G − 1, g_2 = g_1 + 1, …, G, with a coverage of 1 − α, have the lower and upper bounds given as

    Ȳ_{g_1•} − Ȳ_{g_2•} ± q_{G,ν_e*}(1 − α) √( (1/2) (1/n_{g_1•} + 1/n_{g_2•}) MS_e* ).

The calculation of the P-values adjusted for multiple comparison related to the multiple testing problem with the elementary hypotheses

    H_{g_1,g_2}: θ_{g_1,g_2}^{wt} = θ_{g_1,g_2}^{wt,0},    g_1 < g_2,

for chosen θ_{g_1,g_2}^{wt,0} ∈ R, is based on the statistics

    T_{g_1,g_2}(θ_{g_1,g_2}^{wt,0}) = ( Ȳ_{g_1•} − Ȳ_{g_2•} − θ_{g_1,g_2}^{wt,0} ) / √( (1/2) (1/n_{g_1•} + 1/n_{g_2•}) MS_e* ),    g_1 < g_2.
Notes.

• An analogous procedure applies for the inference on the means of the means

    m_{•h} = (1/G) Σ_{g=1}^G m_{g,h},    m_{•h}^{wt} = (1/n_{•h}) Σ_{g=1}^G n_{g,h} m_{g,h},    h = 1, …, H,

by the second factor of the two-way classification.

• The weighted means of the means m_{g•}^{wt} or m_{•h}^{wt} have a reasonable interpretation only in certain special situations. If this is not the case, the Tukey's multiple comparison with unbalanced data does not make much sense.
[Beginning of skipped part]
• Even with unbalanced data, we can, of course, calculate the LSE's of the (unweighted) means of the means m_{g•} or m_{•h}. Nevertheless, with unbalanced data those LSE's are correlated and hence we cannot apply the Tukey's procedure.
Note (Tukey’s HSD in the R software).
The R function TukeyHSD provides the Tukey’s T-procedure also for the two-way classification
(for both the additive and the interaction model). For balanced data, it performs a simultaneous
inference on parameters θg1 ,g2 = mg1 • − mg2 • (and analogous parameters with respect to the
second factor) in a way described here. For unbalanced data, it performs a simultaneous inference
wt
on parameters θgwt
= mwt
g1 • − mg2 • as described here, nevertheless, only for the first factor
1 ,g2
mentioned in the model formula. Inference on different parameters is provided with respect to the
second factor in the model formula. That is, with unbalanced data, output from the R function
TukeyHSD and interpretation of the results depend on the order of the factors in the model formula.
TukeyHSD with two-way classification for the second factor uses “new” observations that adjust for
? , given as
the effect of the first factor. That is, it is worked with “new” observations Yg,h,j
?
Yg,h,j
= Yg,h,j − Y g• + Y ,
g = 1, . . . , G, h = 1, . . . , H, j = 1, . . . , ng,h .
The Tukey’s T procedure is then applied to the sample means
?
Y •h = Y •h −
G
1 X
ng,h Y g• + Y ,
n•h
h = 1, . . . , H,
g=1
whose expectations are
mwt
•h
G
G
H
1 X
1XX
wt
ng,h2 mg,h2 ,
−
ng,h mg• +
n•h
n
g=1
h = 1, . . . , H,
g=1 h2 =1
which, with unbalanced data, are not equal to mwt
•h .
[End of skipped part]
12.4 Hothorn–Bretz–Westfall procedure

[Start of Lecture #23 (17/12/2015)]

The multiple comparison procedure presented in this section is applicable to any parametric model where the estimators of the parameters follow either exactly (as in the case of a normal linear model) or at least asymptotically a (multivariate) normal or t-distribution. In full generality, it was published only rather recently (Hothorn et al., 2008, 2011); nevertheless, the principal ideas behind the method are older.
12.4.1 Max-abs-t distribution

Definition 12.6 Max-abs-t-distribution.
Let T = (T_1, …, T_m)^⊤ ∼ mvt_{m,ν}(Σ), where Σ is a positive semidefinite matrix. The distribution of the random variable

    H = max_{j=1,…,m} |T_j|

will be called the max-abs-t-distribution of dimension m with ν degrees of freedom and a scale matrix Σ and will be denoted as h_{m,ν}(Σ).

Notation.

• For 0 < p < 1, the p·100% quantile of the distribution h_{m,ν}(Σ) will be denoted as h_{m,ν}(p; Σ). That is, h_{m,ν}(p; Σ) is the number satisfying

    P( max_{j=1,…,m} |T_j| ≤ h_{m,ν}(p; Σ) ) = p.

• The distribution function of the random variable with distribution h_{m,ν}(Σ) will be denoted CDF_{h,m,ν}(·; Σ).
Notes.
• If the scale matrix $\Sigma$ is positive definite (invertible), the random vector $T \sim \mathrm{mvt}_{m,\nu}(\Sigma)$ has a density w.r.t. Lebesgue measure
$$
f_T(t) = \frac{\Gamma\bigl(\frac{\nu+m}{2}\bigr)}{\Gamma\bigl(\frac{\nu}{2}\bigr)\,\nu^{\frac{m}{2}}\,\pi^{\frac{m}{2}}\,\bigl|\Sigma\bigr|^{\frac{1}{2}}}\, \Bigl(1 + \frac{t^\top \Sigma^{-1} t}{\nu}\Bigr)^{-\frac{\nu+m}{2}}, \qquad t \in \mathbb{R}^m.
$$
• The distribution function $\mathrm{CDF}_{\mathrm{h},m,\nu}(\cdot;\Sigma)$ of a random variable $H = \max_{j=1,\dots,m} |T_j|$ is then (for $h > 0$):
$$
\mathrm{CDF}_{\mathrm{h},m,\nu}(h;\Sigma) = \mathrm{P}\Bigl(\max_{j=1,\dots,m} |T_j| \le h\Bigr) = \mathrm{P}\bigl(\forall\, j = 1,\dots,m:\ |T_j| \le h\bigr) = \int_{-h}^{h} \cdots \int_{-h}^{h} f_T(t_1,\dots,t_m)\,\mathrm{d}t_1 \cdots \mathrm{d}t_m.
$$
• That is, when calculating the CDF of a random variable $H$ having the max-abs-t distribution, it is necessary to calculate integrals of the density of a multivariate t-distribution.
• Computationally efficient methods for this were not available until the 1990s.
• Nowadays, see, e.g., Genz and Bretz (2009) and the R packages mvtnorm or mnormt.
• Calculation of $\mathrm{CDF}_{\mathrm{h},m,\nu}(\cdot;\Sigma)$ is also possible with a singular scale matrix $\Sigma$.
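Besides the refined integration methods of Genz and Bretz, the quantile $\mathrm{h}_{m,\nu}(p;\Sigma)$ can also be approximated by plain Monte Carlo simulation, using the representation $T = Z/\sqrt{S/\nu}$ with $Z \sim \mathcal{N}_m(0,\Sigma)$ and $S \sim \chi^2_\nu$ independent of $Z$. A minimal Python sketch, assuming numpy is available (the function name and simulation size are illustrative, not part of the course material):

```python
import numpy as np

def maxabs_t_quantile(p, nu, Sigma, n_sim=200_000, seed=42):
    """Empirical p-quantile of H = max_j |T_j| for T ~ mvt_{m,nu}(Sigma)."""
    rng = np.random.default_rng(seed)
    m = Sigma.shape[0]
    # Cholesky with a tiny ridge, so a merely semidefinite Sigma is also allowed
    L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(m))
    Z = rng.standard_normal((n_sim, m)) @ L.T      # N_m(0, Sigma) draws
    S = rng.chisquare(nu, size=n_sim)              # independent chi^2_nu scaling
    T = Z / np.sqrt(S / nu)[:, None]               # multivariate t draws
    return np.quantile(np.abs(T).max(axis=1), p)
```

For $m = 1$ the value reduces to the usual two-sided critical value $\mathrm{t}_\nu\bigl(1 - \tfrac{1-p}{2}\bigr)$; for $m > 1$ it grows with the number of comparisons but, with positively correlated components, stays below the Bonferroni cut-off.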
12.4.2 General multiple comparison procedure for a linear model
Assumptions.
In the following, we consider a normal linear model
$$
Y \,\big|\, X \sim \mathcal{N}_n\bigl(X\beta, \sigma^2 I_n\bigr), \qquad \mathrm{rank}\bigl(X_{n\times k}\bigr) = r \le k.
$$
Further, let
$$
L_{m\times k} = \begin{pmatrix} l_1^\top \\ \vdots \\ l_m^\top \end{pmatrix}
$$
be a matrix such that
$$
\theta = L\beta = \bigl(l_1^\top\beta, \dots, l_m^\top\beta\bigr)^\top = \bigl(\theta_1,\dots,\theta_m\bigr)^\top
$$
is an estimable vector parameter with $l_1 \ne 0_k, \dots, l_m \ne 0_k$.

Notes.
• The number $m$ of the estimable parameters of interest may be arbitrary, i.e., even greater than $r$ or $k$.
• The rows of the matrix $L$ may be linearly dependent vectors.

Multiple comparison problem.
The multiple comparison procedure that will be developed aims at providing simultaneous inference on the $m$ estimable parameters $\theta_1,\dots,\theta_m$, with the multiple testing problem composed of the $m$ elementary hypotheses
$$
\mathrm{H}_j: \theta_j = \theta_j^0, \qquad j = 1,\dots,m,
$$
for some $\theta^0 = \bigl(\theta_1^0,\dots,\theta_m^0\bigr)^\top \in \mathbb{R}^m$. The global null hypothesis is, as usual, $\mathrm{H}_0: \theta = \theta^0$.
Notation. In the following, the (standard) notation will be used:
• $b = \bigl(X^\top X\bigr)^- X^\top Y$ (any solution to the normal equations $X^\top X b = X^\top Y$);
• $\widehat\theta = Lb = \bigl(l_1^\top b, \dots, l_m^\top b\bigr)^\top = \bigl(\widehat\theta_1,\dots,\widehat\theta_m\bigr)^\top$: the LSE of $\theta$;
• $V = L\bigl(X^\top X\bigr)^- L^\top = \bigl(v_{j,l}\bigr)_{j,l=1,\dots,m}$ (which does not depend on the choice of the pseudoinverse $\bigl(X^\top X\bigr)^-$);
• $D = \mathrm{diag}\Bigl(\frac{1}{\sqrt{v_{1,1}}}, \dots, \frac{1}{\sqrt{v_{m,m}}}\Bigr)$;
• $\mathrm{MS}_e$: the residual mean square of the model, with $\nu_e = n - r$ degrees of freedom.
Reminders from Chapter 3
• For $j = 1,\dots,m$ (both conditionally given $X$ and unconditionally as well):
$$
Z_j := \frac{\widehat\theta_j - \theta_j}{\sqrt{\sigma^2\, v_{j,j}}} \sim \mathcal{N}(0,1), \qquad T_j := \frac{\widehat\theta_j - \theta_j}{\sqrt{\mathrm{MS}_e\, v_{j,j}}} \sim \mathrm{t}_{n-r}.
$$
• Further (conditionally given $X$):
$$
Z = \bigl(Z_1,\dots,Z_m\bigr)^\top = \frac{1}{\sqrt{\sigma^2}}\, D\bigl(\widehat\theta - \theta\bigr) \sim \mathcal{N}_m\bigl(0_m,\, DVD\bigr),
$$
$$
T = \bigl(T_1,\dots,T_m\bigr)^\top = \frac{1}{\sqrt{\mathrm{MS}_e}}\, D\bigl(\widehat\theta - \theta\bigr) \sim \mathrm{mvt}_{m,\,n-r}\bigl(DVD\bigr).
$$
Notes.
• Matrices $V$ and $DVD$ are not necessarily invertible.
• If $\mathrm{rank}(L) = m \le r$, then both matrices $V$ and $DVD$ are invertible and Theorem 3.1 further provides (both conditionally given $X$ and unconditionally as well) that under $\mathrm{H}_0: \theta = \theta^0$:
$$
Q_0 = \frac{1}{m}\,\bigl(\widehat\theta - \theta^0\bigr)^\top \bigl(\mathrm{MS}_e\, V\bigr)^{-1} \bigl(\widehat\theta - \theta^0\bigr) = \frac{1}{m}\, T^\top \bigl(DVD\bigr)^{-1} T \sim \mathrm{F}_{m,\,n-r}.
$$
This was used to test the global null hypothesis $\mathrm{H}_0: \theta = \theta^0$ and to derive the elliptical confidence sets for $\theta$.
• It can also be shown that if $m_0 = \mathrm{rank}(L)$, then under $\mathrm{H}_0: \theta = \theta^0$:
$$
Q_0 = \frac{1}{m_0}\,\bigl(\widehat\theta - \theta^0\bigr)^\top \bigl(\mathrm{MS}_e\, V\bigr)^{+} \bigl(\widehat\theta - \theta^0\bigr) = \frac{1}{m_0}\, T^\top \bigl(DVD\bigr)^{+} T \sim \mathrm{F}_{m_0,\,n-r}
$$
(both conditionally given $X$ and unconditionally), where the symbol $+$ denotes the Moore-Penrose pseudoinverse.
Some derivations
Let, for $\theta_j^0 \in \mathbb{R}$, $j = 1,\dots,m$,
$$
T_j(\theta_j^0) = \frac{\widehat\theta_j - \theta_j^0}{\sqrt{\mathrm{MS}_e\, v_{j,j}}}, \qquad j = 1,\dots,m.
$$
Then, under $\mathrm{H}_0: \theta = \theta^0$:
$$
T(\theta^0) := \bigl(T_1(\theta_1^0), \dots, T_m(\theta_m^0)\bigr)^\top \sim \mathrm{mvt}_{m,\,n-r}(DVD).
$$
We then have, for $0 < \alpha < 1$:
$$
1 - \alpha = \mathrm{P}\Bigl(\max_{j=1,\dots,m} \bigl|T_j(\theta_j^0)\bigr| < \mathrm{h}_{m,\,n-r}(1-\alpha;\, DVD);\ \theta = \theta^0\Bigr)
$$
$$
= \mathrm{P}\Bigl(\text{for all } j = 1,\dots,m:\ \bigl|T_j(\theta_j^0)\bigr| < \mathrm{h}_{m,\,n-r}(1-\alpha;\, DVD);\ \theta = \theta^0\Bigr)
$$
$$
= \mathrm{P}\biggl(\text{for all } j = 1,\dots,m:\ \frac{\bigl|\widehat\theta_j - \theta_j^0\bigr|}{\sqrt{\mathrm{MS}_e\, v_{j,j}}} < \mathrm{h}_{m,\,n-r}(1-\alpha;\, DVD);\ \theta = \theta^0\biggr)
$$
$$
= \mathrm{P}\Bigl(\text{for all } j = 1,\dots,m:\ \bigl(\theta_j^{HL}(\alpha),\, \theta_j^{HU}(\alpha)\bigr) \ni \theta_j^0;\ \theta = \theta^0\Bigr), \tag{12.9}
$$
where
$$
\theta_j^{HL}(\alpha) = \widehat\theta_j - \mathrm{h}_{m,\,n-r}(1-\alpha;\, DVD)\,\sqrt{\mathrm{MS}_e\, v_{j,j}}, \qquad \theta_j^{HU}(\alpha) = \widehat\theta_j + \mathrm{h}_{m,\,n-r}(1-\alpha;\, DVD)\,\sqrt{\mathrm{MS}_e\, v_{j,j}}, \qquad j = 1,\dots,m. \tag{12.10}
$$
Theorem 12.5 Hothorn-Bretz-Westfall MCP for linear hypotheses in a normal linear model.
Random intervals given by (12.10) are simultaneous confidence intervals for the parameters $\theta_j = l_j^\top\beta$, $j = 1,\dots,m$, with an exact coverage of $1-\alpha$, i.e., for any $\theta^0 = \bigl(\theta_1^0,\dots,\theta_m^0\bigr)^\top \in \mathbb{R}^m$,
$$
\mathrm{P}\Bigl(\text{for all } j = 1,\dots,m:\ \bigl(\theta_j^{HL}(\alpha),\, \theta_j^{HU}(\alpha)\bigr) \ni \theta_j^0;\ \theta = \theta^0\Bigr) = 1 - \alpha.
$$
Related P-values for a multiple testing problem with elementary hypotheses $\mathrm{H}_j: \theta_j = \theta_j^0$, $\theta_j^0 \in \mathbb{R}$, $j = 1,\dots,m$, adjusted for multiple comparison, are given by
$$
p_j^H = 1 - \mathrm{CDF}_{\mathrm{h},m,n-r}\bigl(\bigl|t_j^0\bigr|;\, DVD\bigr), \qquad j = 1,\dots,m,
$$
where $t_j^0$ is the value of $T_j(\theta_j^0) = \frac{\widehat\theta_j - \theta_j^0}{\sqrt{\mathrm{MS}_e\, v_{j,j}}}$ attained with given data.
Proof.
The fact that $\bigl(\theta_j^{HL}(\alpha),\, \theta_j^{HU}(\alpha)\bigr)$, $j = 1,\dots,m$, are simultaneous confidence intervals for the parameters $\theta_j = l_j^\top\beta$ with an exact coverage of $1-\alpha$ follows from (12.9).
Calculation of the P-values adjusted for multiple comparison, related to the multiple testing problem with the elementary hypotheses $\mathrm{H}_j: \theta_j = \theta_j^0$, $j = 1,\dots,m$, follows from noting the following (for each $j = 1,\dots,m$):
$$
\bigl(\theta_j^{HL}(\alpha),\, \theta_j^{HU}(\alpha)\bigr) \not\ni \theta_j^0 \;\Longleftrightarrow\; \bigl|T_j(\theta_j^0)\bigr| \ge \mathrm{h}_{m,\,n-r}(1-\alpha;\, DVD).
$$
It now follows from monotonicity of the quantiles of a continuous max-abs-t-distribution that
$$
p_j^H = \inf\bigl\{\alpha:\ \bigl(\theta_j^{HL}(\alpha),\, \theta_j^{HU}(\alpha)\bigr) \not\ni \theta_j^0\bigr\} = \inf\bigl\{\alpha:\ \bigl|T_j(\theta_j^0)\bigr| \ge \mathrm{h}_{m,\,n-r}(1-\alpha;\, DVD)\bigr\}
$$
is attained for $p_j^H$ satisfying
$$
\bigl|T_j(\theta_j^0)\bigr| = \mathrm{h}_{m,\,n-r}\bigl(1 - p_j^H;\, DVD\bigr).
$$
That is, if $t_j^0$ is the value of the statistic $T_j(\theta_j^0)$ attained with given data, we have
$$
p_j^H = 1 - \mathrm{CDF}_{\mathrm{h},m,n-r}\bigl(\bigl|t_j^0\bigr|;\, DVD\bigr). \qquad \square
$$
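The whole construction of (12.10) can be sketched numerically for a full-rank model: fit by least squares, form $V = L(X^\top X)^{-1}L^\top$ and $D$, approximate the critical value $\mathrm{h}_{m,\,n-k}(1-\alpha;\, DVD)$ by simulating the max-abs-t distribution, and build the intervals. A hedged Python sketch with numpy (the function name and simulation size are illustrative choices, not part of the course material):

```python
import numpy as np

def hbw_intervals(X, y, L, alpha=0.05, n_sim=100_000, seed=7):
    """Simultaneous CIs (12.10) for theta = L beta, full-rank model, simulated critical value."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    nu = n - k
    mse = np.sum((y - X @ b) ** 2) / nu            # MS_{e,n}
    V = L @ XtX_inv @ L.T
    se = np.sqrt(mse * np.diag(V))
    D = np.diag(1.0 / np.sqrt(np.diag(V)))
    R = D @ V @ D                                   # scale matrix DVD
    rng = np.random.default_rng(seed)
    C = np.linalg.cholesky(R + 1e-10 * np.eye(len(R)))
    Z = rng.standard_normal((n_sim, len(R))) @ C.T  # N(0, DVD) draws
    S = rng.chisquare(nu, size=n_sim)
    H = np.abs(Z / np.sqrt(S / nu)[:, None]).max(axis=1)
    h = np.quantile(H, 1 - alpha)                   # h_{m, n-k}(1-alpha; DVD)
    theta = L @ b
    return theta - h * se, theta + h * se, h
```

The resulting intervals are wider than the unadjusted pointwise t intervals exactly by the factor $h / \mathrm{t}_{n-k}(1-\alpha/2) \ge 1$.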
Note (Hothorn-Bretz-Westfall MCP in the R software).
In the R software, the Hothorn-Bretz-Westfall MCP for linear hypotheses on parameters of (generalized) linear models is implemented in the package multcomp. After fitting a model (by the function lm), it is necessary to call sequentially the following functions:
(i) glht. One of its arguments specifies the linear hypothesis of interest (specification of the $L$ matrix). Note that for some common hypotheses, certain keywords can be used. For example, pairwise comparison of all group means in the context of the ANOVA models is achieved by specifying the keyword "Tukey". Nevertheless, note that the invoked MCP is still that of Hothorn-Bretz-Westfall and it is not based on the Tukey's procedure. The "Tukey" keyword only specifies what should be compared and not how it should be compared.
(ii) summary (applied on an object of class glht) provides P-values adjusted for multiple comparison.
(iii) confint (applied on an object of class glht) provides simultaneous confidence intervals, which, among other things, requires calculation of the critical value $\mathrm{h}_{m,\,n-r}(1-\alpha)$; this value is also available in the output.
Note that both the calculation of the P-values adjusted for multiple comparison and the calculation of the critical value $\mathrm{h}_{m,\,n-r}(1-\alpha)$ needed for the simultaneous confidence intervals require evaluation of a multivariate t integral. This is calculated by Monte Carlo integration (i.e., based on a certain stochastic simulation) and hence the results differ slightly when repeatedly calculated on different occasions. Setting a seed of the random number generator (set.seed()) is hence recommended for full reproducibility of the results.
12.5 Confidence band for the regression function
In this section, we shall assume that the data are represented by i.i.d. random vectors $\bigl(Y_i, Z_i^\top\bigr)^\top$, $i = 1,\dots,n$, sampled from the distribution of a generic random vector $\bigl(Y, Z^\top\bigr)^\top \in \mathbb{R}^{1+p}$. It is further assumed that for some known transformation $t: \mathbb{R}^p \longrightarrow \mathbb{R}^k$, a normal linear model with regressors $X_i = t(Z_i)$, $i = 1,\dots,n$, holds. That is, it is assumed that
$$
Y_i = X_i^\top\beta + \varepsilon_i, \qquad \varepsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2),
$$
for some $\beta \in \mathbb{R}^k$, $\sigma^2 > 0$. The corresponding regression function is
$$
\mathrm{E}\bigl(Y \,\big|\, X = t(z)\bigr) = \mathrm{E}\bigl(Y \,\big|\, Z = z\bigr) = m(z) = t^\top(z)\beta, \qquad z \in \mathbb{R}^p.
$$
It will further be assumed that the corresponding model matrix
$$
X = \begin{pmatrix} X_1^\top \\ \vdots \\ X_n^\top \end{pmatrix} = \begin{pmatrix} t^\top(Z_1) \\ \vdots \\ t^\top(Z_n) \end{pmatrix}
$$
is of full rank (almost surely), i.e., $\mathrm{rank}\bigl(X_{n\times k}\bigr) = k$. As usual, $\widehat\beta$ will be the LSE of $\beta$ and $\mathrm{MS}_e$ the residual mean square.
Reminder from Section 3.3
Let $z \in \mathbb{R}^p$ be given. Theorem 3.2 then states that a random interval with the lower and upper bounds given as
$$
t^\top(z)\widehat\beta \pm \mathrm{t}_{n-k}\Bigl(1 - \frac{\alpha}{2}\Bigr) \sqrt{\mathrm{MS}_e\; t^\top(z)\bigl(X^\top X\bigr)^{-1} t(z)}
$$
is the confidence interval for $m(z) = t^\top(z)\beta$ with a coverage of $1-\alpha$. That is, for given $z \in \mathbb{R}^p$ and any $\beta^0 \in \mathbb{R}^k$,
$$
\mathrm{P}\biggl(\Bigl(t^\top(z)\widehat\beta \pm \mathrm{t}_{n-k}\Bigl(1 - \frac{\alpha}{2}\Bigr) \sqrt{\mathrm{MS}_e\; t^\top(z)\bigl(X^\top X\bigr)^{-1} t(z)}\Bigr) \ni t^\top(z)\beta^0;\ \beta = \beta^0\biggr) = 1 - \alpha.
$$
Theorem 12.6 Confidence band for the regression function.
Let $\bigl(Y_i, Z_i^\top\bigr)^\top$, $i = 1,\dots,n$, $Z_i \in \mathbb{R}^p$, be such that
$$
Y_i = X_i^\top\beta + \varepsilon_i, \qquad \varepsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2),
$$
where $X_i = t(Z_i)$, $i = 1,\dots,n$, for a known transformation $t: \mathbb{R}^p \longrightarrow \mathbb{R}^k$, and $\beta \in \mathbb{R}^k$ and $\sigma^2 > 0$ are unknown parameters. Let $t(z) \ne 0_k$ for all $z \in \mathbb{R}^p$ and let $\mathrm{rank}\bigl(X_{n\times k}\bigr) = k$, where $X$ is the matrix with vectors $X_1^\top, \dots, X_n^\top$ in rows. Then for any $\beta^0 \in \mathbb{R}^k$,
$$
\mathrm{P}\biggl(\text{for all } z \in \mathbb{R}^p:\ \Bigl(t^\top(z)\widehat\beta \pm \sqrt{k\, \mathrm{F}_{k,\,n-k}(1-\alpha)\; \mathrm{MS}_e\; t^\top(z)\bigl(X^\top X\bigr)^{-1} t(z)}\Bigr) \ni t^\top(z)\beta^0;\ \beta = \beta^0\biggr) = 1 - \alpha.
$$
Note. The requirement $t(z) \ne 0_k$ for all $z \in \mathbb{R}^p$ is not too restrictive from a practical point of view, as it is satisfied, e.g., for all linear models with intercept.
Proof. Let (for $0 < \alpha < 1$)
$$
K = \Bigl\{\beta \in \mathbb{R}^k:\ \bigl(\beta - \widehat\beta\bigr)^\top X^\top X \bigl(\beta - \widehat\beta\bigr) \le k\, \mathrm{MS}_e\, \mathrm{F}_{k,n-k}(1-\alpha)\Bigr\}.
$$
By Section 3.2, $K$ is a confidence ellipsoid for $\beta$ with a coverage of $1-\alpha$, that is, for any $\beta^0 \in \mathbb{R}^k$,
$$
\mathrm{P}\bigl(K \ni \beta^0;\ \beta = \beta^0\bigr) = 1 - \alpha.
$$
$K$ is an ellipsoid in $\mathbb{R}^k$, that is, a bounded, convex and (with our definition) also closed subset of $\mathbb{R}^k$. Let, for $z \in \mathbb{R}^p$:
$$
L(z) = \inf_{\beta \in K} t^\top(z)\beta, \qquad U(z) = \sup_{\beta \in K} t^\top(z)\beta.
$$
From the construction:
$$
\beta \in K \;\Longrightarrow\; \forall z \in \mathbb{R}^p:\ L(z) \le t^\top(z)\beta \le U(z).
$$
Due to the fact that $K$ is bounded, convex and closed, we also have
$$
\bigl(\forall z \in \mathbb{R}^p:\ L(z) \le t^\top(z)\beta \le U(z)\bigr) \;\Longrightarrow\; \beta \in K.
$$
That is,
$$
\beta \in K \;\Longleftrightarrow\; \forall z \in \mathbb{R}^p:\ L(z) \le t^\top(z)\beta \le U(z),
$$
and hence, for any $\beta^0 \in \mathbb{R}^k$,
$$
1 - \alpha = \mathrm{P}\bigl(K \ni \beta^0;\ \beta = \beta^0\bigr) = \mathrm{P}\bigl(\text{for all } z \in \mathbb{R}^p:\ L(z) \le t^\top(z)\beta^0 \le U(z);\ \beta = \beta^0\bigr). \tag{12.11}
$$
Further, since $t^\top(z)\beta$ is a linear function (in $\beta$) and $K$ is bounded, convex and closed, we have
$$
L(z) = \inf_{\beta \in K} t^\top(z)\beta = \min_{\beta \in K} t^\top(z)\beta, \qquad U(z) = \sup_{\beta \in K} t^\top(z)\beta = \max_{\beta \in K} t^\top(z)\beta,
$$
and both extremes must lie on the boundary of $K$, that is, both extremes are attained for $\beta$ satisfying
$$
\bigl(\beta - \widehat\beta\bigr)^\top X^\top X \bigl(\beta - \widehat\beta\bigr) = k\, \mathrm{MS}_e\, \mathrm{F}_{k,n-k}(1-\alpha).
$$
Method of Lagrange multipliers:
$$
\varphi(\beta, \lambda) = t^\top(z)\beta + \frac{\lambda}{2} \Bigl\{\bigl(\beta - \widehat\beta\bigr)^\top X^\top X \bigl(\beta - \widehat\beta\bigr) - k\, \mathrm{MS}_e\, \mathrm{F}_{k,n-k}(1-\alpha)\Bigr\}
$$
(the factor $\frac{1}{2}$ is only included to simplify subsequent expressions).
Derivatives of $\varphi$:
$$
\frac{\partial\varphi}{\partial\beta}(\beta, \lambda) = t(z) + \lambda\, X^\top X \bigl(\beta - \widehat\beta\bigr),
$$
$$
\frac{\partial\varphi}{\partial\lambda}(\beta, \lambda) = \frac{1}{2}\Bigl\{\bigl(\beta - \widehat\beta\bigr)^\top X^\top X \bigl(\beta - \widehat\beta\bigr) - k\, \mathrm{MS}_e\, \mathrm{F}_{k,n-k}(1-\alpha)\Bigr\}.
$$
With given $\lambda$, the first set of equations is solved (with respect to $\beta$) by
$$
\beta(\lambda) = \widehat\beta - \frac{1}{\lambda}\bigl(X^\top X\bigr)^{-1} t(z).
$$
Using $\beta(\lambda)$ in the second equation:
$$
\frac{1}{\lambda^2}\, t^\top(z)\bigl(X^\top X\bigr)^{-1} X^\top X \bigl(X^\top X\bigr)^{-1} t(z) = k\, \mathrm{MS}_e\, \mathrm{F}_{k,n-k}(1-\alpha),
$$
$$
\lambda = \pm\sqrt{\frac{t^\top(z)\bigl(X^\top X\bigr)^{-1} t(z)}{k\, \mathrm{MS}_e\, \mathrm{F}_{k,n-k}(1-\alpha)}}.
$$
Hence the $\beta$ which minimizes/maximizes $t^\top(z)\beta$ subject to
$$
\bigl(\beta - \widehat\beta\bigr)^\top X^\top X \bigl(\beta - \widehat\beta\bigr) = k\, \mathrm{MS}_e\, \mathrm{F}_{k,n-k}(1-\alpha)
$$
is given as
$$
\beta_{\min} = \widehat\beta - \sqrt{\frac{k\, \mathrm{MS}_e\, \mathrm{F}_{k,n-k}(1-\alpha)}{t^\top(z)\bigl(X^\top X\bigr)^{-1} t(z)}}\; \bigl(X^\top X\bigr)^{-1} t(z), \qquad \beta_{\max} = \widehat\beta + \sqrt{\frac{k\, \mathrm{MS}_e\, \mathrm{F}_{k,n-k}(1-\alpha)}{t^\top(z)\bigl(X^\top X\bigr)^{-1} t(z)}}\; \bigl(X^\top X\bigr)^{-1} t(z).
$$
Note that with our assumption $t(z) \ne 0_k$, we never divide by zero since $\bigl(X^\top X\bigr)^{-1}$ is a positive definite matrix.
That is,
$$
L(z) = t^\top(z)\beta_{\min} = t^\top(z)\widehat\beta - \sqrt{\mathrm{MS}_e\; t^\top(z)\bigl(X^\top X\bigr)^{-1} t(z)}\, \sqrt{k\, \mathrm{F}_{k,n-k}(1-\alpha)},
$$
$$
U(z) = t^\top(z)\beta_{\max} = t^\top(z)\widehat\beta + \sqrt{\mathrm{MS}_e\; t^\top(z)\bigl(X^\top X\bigr)^{-1} t(z)}\, \sqrt{k\, \mathrm{F}_{k,n-k}(1-\alpha)}.
$$
The proof is finalized by looking back at expression (12.11) and realizing that, due to continuity,
$$
1 - \alpha = \mathrm{P}\bigl(\text{for all } z \in \mathbb{R}^p:\ L(z) \le t^\top(z)\beta^0 \le U(z);\ \beta = \beta^0\bigr)
$$
$$
= \mathrm{P}\bigl(\text{for all } z \in \mathbb{R}^p:\ L(z) < t^\top(z)\beta^0 < U(z);\ \beta = \beta^0\bigr)
$$
$$
= \mathrm{P}\biggl(\text{for all } z \in \mathbb{R}^p:\ \Bigl(t^\top(z)\widehat\beta \pm \sqrt{k\, \mathrm{F}_{k,\,n-k}(1-\alpha)\; \mathrm{MS}_e\; t^\top(z)\bigl(X^\top X\bigr)^{-1} t(z)}\Bigr) \ni t^\top(z)\beta^0;\ \beta = \beta^0\biggr). \qquad \square
$$
Terminology (Confidence band for the regression function).
If the covariates $Z_1,\dots,Z_n \in \mathbb{R}$, confidence intervals according to Theorem 12.6 are often calculated for an (equidistant) sequence of values $z_1,\dots,z_N \in \mathbb{R}$ and then plotted together with the fitted regression function $\widehat m(z) = t^\top(z)\widehat\beta$, $z \in \mathbb{R}$. A band obtained in this way is called the confidence band for the regression function$^6$, as it covers jointly all true values of the regression function with a given probability of $1-\alpha$.
Note (Confidence band for and around the regression function).
For given $z \in \mathbb{R}$:
The half-width of the confidence band FOR the regression function (overall coverage) is
$$
\sqrt{k\, \mathrm{F}_{k,n-k}(1-\alpha)\; \mathrm{MS}_e\; t^\top(z)\bigl(X^\top X\bigr)^{-1} t(z)}.
$$
The half-width of the confidence band AROUND the regression function (pointwise coverage) is
$$
\mathrm{t}_{n-k}\Bigl(1 - \frac{\alpha}{2}\Bigr) \sqrt{\mathrm{MS}_e\; t^\top(z)\bigl(X^\top X\bigr)^{-1} t(z)} = \sqrt{\mathrm{F}_{1,n-k}(1-\alpha)\; \mathrm{MS}_e\; t^\top(z)\bigl(X^\top X\bigr)^{-1} t(z)},
$$
since for any $\nu > 0$, $\mathrm{t}_\nu^2\bigl(1 - \frac{\alpha}{2}\bigr) = \mathrm{F}_{1,\nu}(1-\alpha)$.
For $k \ge 2$ and any $\nu > 0$,
$$
k\, \mathrm{F}_{k,\nu}(1-\alpha) > \mathrm{F}_{1,\nu}(1-\alpha),
$$
and hence the confidence band for the regression function is indeed wider than the confidence band around the regression function. Their widths are the same only if $k = 1$.
$^6$ pás spolehlivosti pro regresní funkci
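Both the identity $\mathrm{t}_\nu^2(1-\alpha/2) = \mathrm{F}_{1,\nu}(1-\alpha)$ and the inequality $k\,\mathrm{F}_{k,\nu}(1-\alpha) > \mathrm{F}_{1,\nu}(1-\alpha)$ for $k \ge 2$ are easy to verify numerically. A small sketch, assuming scipy is available:

```python
from scipy import stats

alpha, nu = 0.05, 30
t_q = stats.t.ppf(1 - alpha / 2, nu)          # t_nu(1 - alpha/2)
f1_q = stats.f.ppf(1 - alpha, 1, nu)          # F_{1,nu}(1 - alpha)
print(t_q ** 2 - f1_q)                        # the identity: difference is numerically zero
for k in range(2, 6):
    kf = k * stats.f.ppf(1 - alpha, k, nu)    # scaling of the band FOR the function
    print(k, kf > f1_q)                       # the band FOR is wider for every k >= 2
```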
Chapter 13
Asymptotic Properties of the LSE and Sandwich Estimator

13.1 Assumptions and setup
Assumption (A0).
(i) Let $\bigl(Y_1, X_1^\top\bigr)^\top, \bigl(Y_2, X_2^\top\bigr)^\top, \dots$ be a sequence of $(1+k)$-dimensional independent and identically distributed (i.i.d.) random vectors distributed as a generic random vector $\bigl(Y, X^\top\bigr)^\top$, where $X = \bigl(X_0, X_1, \dots, X_{k-1}\bigr)^\top$ and $X_i = \bigl(X_{i,0}, X_{i,1}, \dots, X_{i,k-1}\bigr)^\top$, $i = 1, 2, \dots$;
(ii) Let $\beta = \bigl(\beta_0, \dots, \beta_{k-1}\bigr)^\top$ be an unknown $k$-dimensional real parameter;
(iii) Let $\mathrm{E}\bigl(Y \,\big|\, X\bigr) = X^\top\beta$.

Notation (Error terms).
We denote $\varepsilon = Y - X^\top\beta$ and $\varepsilon_i = Y_i - X_i^\top\beta$, $i = 1, 2, \dots$.
Convention. In this chapter, all unconditional expectations are understood as expectations with
respect to the joint distribution of a random vector (Y, X) (which depends on the vector β).
Note.
From assumption (A0), the error terms $\varepsilon_1, \varepsilon_2, \dots$ are i.i.d. with the distribution of a generic error term $\varepsilon$. The following can be concluded for their first two (conditional) moments:
$$
\mathrm{E}\bigl(\varepsilon \,\big|\, X\bigr) = \mathrm{E}\bigl(Y - X^\top\beta \,\big|\, X\bigr) = 0, \qquad \mathrm{var}\bigl(\varepsilon \,\big|\, X\bigr) = \mathrm{var}\bigl(Y - X^\top\beta \,\big|\, X\bigr) = \mathrm{var}\bigl(Y \,\big|\, X\bigr) =: \sigma^2(X),
$$
$$
\mathrm{E}\,\varepsilon = \mathrm{E}\bigl\{\mathrm{E}\bigl(\varepsilon \,\big|\, X\bigr)\bigr\} = \mathrm{E}\,0 = 0, \qquad \mathrm{var}\,\varepsilon = \mathrm{var}\bigl\{\mathrm{E}\bigl(\varepsilon \,\big|\, X\bigr)\bigr\} + \mathrm{E}\bigl\{\mathrm{var}\bigl(\varepsilon \,\big|\, X\bigr)\bigr\} = \mathrm{var}\,0 + \mathrm{E}\,\sigma^2(X) = \mathrm{E}\,\sigma^2(X).
$$
Assumption (A1).
Let the covariate random vector $X = \bigl(X_0, \dots, X_{k-1}\bigr)^\top$ satisfy
(i) $\mathrm{E}\bigl|X_j X_l\bigr| < \infty$, $j, l = 0, \dots, k-1$;
(ii) $\mathrm{E}\bigl(XX^\top\bigr) = W$, where $W$ is a positive definite matrix.

Notation (Covariates second and first mixed moments).
Let $W = \bigl(w_{j,l}\bigr)_{j,l=0,\dots,k-1}$. We have
$$
w_j^2 := w_{j,j} = \mathrm{E}\,X_j^2, \quad j = 0, \dots, k-1, \qquad w_{j,l} = \mathrm{E}\,X_j X_l, \quad j \ne l.
$$
Let
$$
V := W^{-1} = \bigl(v_{j,l}\bigr)_{j,l=0,\dots,k-1}.
$$
Notation (Data of size n).
For $n \ge 1$:
$$
Y_n := \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \qquad X_n := \begin{pmatrix} X_1^\top \\ \vdots \\ X_n^\top \end{pmatrix}, \qquad W_n := X_n^\top X_n = \sum_{i=1}^n X_i X_i^\top, \qquad V_n := \bigl(X_n^\top X_n\bigr)^{-1} \ \text{(if it exists)}.
$$
Lemma 13.1 Consistent estimator of the second and first mixed moments of the covariates.
Let assumptions (A0) and (A1) hold. Then
$$
\frac{1}{n}\, W_n \xrightarrow{\text{a.s.}} W \quad \text{as } n \to \infty, \qquad n\, V_n \xrightarrow{\text{a.s.}} V \quad \text{as } n \to \infty.
$$
Proof. The statement of the Lemma follows from applying, for each $j = 0, \dots, k-1$ and $l = 0, \dots, k-1$, the strong law of large numbers for i.i.d. random variables (Theorem C.2) to the sequence
$$
Z_{i,j,l} = X_{i,j}\, X_{i,l}, \qquad i = 1, 2, \dots. \qquad \square
$$

End of Lecture #23 (17/12/2015)
Start of Lecture #24 (07/01/2016)

LSE based on data of size n
Since $\frac{1}{n}\, X_n^\top X_n \xrightarrow{\text{a.s.}} W > 0$, we have
$$
\mathrm{P}\bigl(\text{there exists } n_0 > k \text{ such that for all } n \ge n_0:\ \mathrm{rank}(X_n) = k\bigr) = 1,
$$
and we define (for $n \ge n_0$)
$$
\widehat\beta_n = \bigl(X_n^\top X_n\bigr)^{-1} X_n^\top Y_n = \Bigl(\sum_{i=1}^n X_i X_i^\top\Bigr)^{-1} \sum_{i=1}^n X_i Y_i,
$$
$$
\mathrm{MS}_{e,n} = \frac{1}{n-k}\, \bigl\|Y_n - X_n\widehat\beta_n\bigr\|^2 = \frac{1}{n-k} \sum_{i=1}^n \bigl(Y_i - X_i^\top\widehat\beta_n\bigr)^2,
$$
which are the LSE of $\beta$ and the residual mean square based on the assumed linear model for data of size $n$,
$$
\mathsf{M}_n: \quad Y_n \,\big|\, X_n \sim \bigl(X_n\beta,\, \sigma^2 I_n\bigr).
$$
Further, for $n \ge n_0$ any non-trivial linear combination of the regression coefficients is an estimable parameter of model $\mathsf{M}_n$.
• For a given real vector $l = \bigl(l_0, l_1, \dots, l_{k-1}\bigr)^\top \ne 0_k$ we denote
$$
\theta = l^\top\beta, \qquad \widehat\theta_n = l^\top\widehat\beta_n.
$$
• For a given $m \times k$ matrix $L$ with rows $l_1^\top \ne 0_k^\top, \dots, l_m^\top \ne 0_k^\top$ we denote
$$
\xi = L\beta, \qquad \widehat\xi_n = L\widehat\beta_n.
$$
It will be assumed that $m \le k$ and that the rows of $L$ are linearly independent.

Interest is in the asymptotic (as $n \to \infty$) behavior of
(i) $\widehat\beta_n$;
(ii) $\mathrm{MS}_{e,n}$;
(iii) $\widehat\theta_n = l^\top\widehat\beta_n$ for given $l \ne 0_k$;
(iv) $\widehat\xi_n = L\widehat\beta_n$ for a given $m \times k$ matrix $L$ with linearly independent rows;
under two different scenarios (two different truths):
(i) homoscedastic errors (i.e., model $\mathsf{M}_n: Y_n \,\big|\, X_n \sim \bigl(X_n\beta,\, \sigma^2 I_n\bigr)$ is correct);
(ii) heteroscedastic errors, where $\mathrm{var}\bigl(\varepsilon \,\big|\, X\bigr)$ is not necessarily constant and perhaps depends on the covariate values $X$ (i.e., model $\mathsf{M}_n$ is not necessarily fully correct).
Normality of the errors will not be assumed.
Assumption (A2 homoscedastic).
Let the conditional variance of the response satisfy
$$
\sigma^2(X) := \mathrm{var}\bigl(Y \,\big|\, X\bigr) = \sigma^2,
$$
where $\infty > \sigma^2 > 0$ is an unknown parameter.

Assumption (A2 heteroscedastic).
Let $\sigma^2(X) := \mathrm{var}\bigl(Y \,\big|\, X\bigr)$ satisfy, for each $j, l = 0, \dots, k-1$, the condition
$$
\mathrm{E}\bigl\{\sigma^2(X)\, \bigl|X_j X_l\bigr|\bigr\} < \infty.
$$
Notes.
• Condition (A2 heteroscedastic) states that the matrix
$$
W^F := \mathrm{E}\bigl\{\sigma^2(X)\, XX^\top\bigr\}
$$
is a real matrix (with all elements being finite).
• If (A0) and (A1) are assumed, then
$$
\text{(A2 homoscedastic)} \;\Longrightarrow\; \text{(A2 heteroscedastic)}.
$$
Hence everything that will be proved under (A2 heteroscedastic) holds also under (A2 homoscedastic).
• Under assumptions (A0) and (A2 homoscedastic), we have
$$
\mathrm{E}\bigl(Y_i \,\big|\, X_i\bigr) = X_i^\top\beta, \qquad \mathrm{var}\bigl(Y_i \,\big|\, X_i\bigr) = \mathrm{var}\bigl(\varepsilon_i \,\big|\, X_i\bigr) = \sigma^2, \qquad i = 1, 2, \dots,
$$
and for each $n > 1$, $Y_1, \dots, Y_n$ are, given $X_n$, independent and satisfy a linear model
$$
Y_n \,\big|\, X_n \sim \bigl(X_n\beta,\, \sigma^2 I_n\bigr).
$$
• Under assumptions (A0) and (A2 heteroscedastic), we have
$$
\mathrm{E}\bigl(Y_i \,\big|\, X_i\bigr) = X_i^\top\beta, \qquad \mathrm{var}\bigl(Y_i \,\big|\, X_i\bigr) = \mathrm{var}\bigl(\varepsilon_i \,\big|\, X_i\bigr) = \sigma^2(X_i), \qquad i = 1, 2, \dots,
$$
and for each $n > 1$, $Y_1, \dots, Y_n$ are, given $X_n$, independent with
$$
\mathrm{E}\bigl(Y_n \,\big|\, X_n\bigr) = X_n\beta, \qquad \mathrm{var}\bigl(Y_n \,\big|\, X_n\bigr) = \begin{pmatrix} \sigma^2(X_1) & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \sigma^2(X_n) \end{pmatrix}.
$$
13.2 Consistency of LSE
We shall show in this section:
(i) Strong consistency of $\widehat\beta_n$, $\widehat\theta_n$, $\widehat\xi_n$ (LSEs of the regression coefficients or their linear combinations).
• No need of normality;
• No need of homoscedasticity.
(ii) Strong consistency of $\mathrm{MS}_{e,n}$ (unbiased estimator of the residual variance).
• No need of normality.
Theorem 13.2 Strong consistency of LSE.
Let assumptions (A0), (A1) and (A2 heteroscedastic) hold. Then
$$
\widehat\beta_n \xrightarrow{\text{a.s.}} \beta \quad \text{as } n \to \infty,
$$
$$
\widehat\theta_n = l^\top\widehat\beta_n \xrightarrow{\text{a.s.}} \theta = l^\top\beta \quad \text{as } n \to \infty,
$$
$$
\widehat\xi_n = L\widehat\beta_n \xrightarrow{\text{a.s.}} \xi = L\beta \quad \text{as } n \to \infty.
$$
Proof.
It is sufficient to show that $\widehat\beta_n \xrightarrow{\text{a.s.}} \beta$. The remaining two statements follow from properties of almost sure convergence.
We have
$$
\widehat\beta_n = \bigl(X_n^\top X_n\bigr)^{-1} X_n^\top Y_n = \underbrace{\Bigl(\frac{1}{n}\, X_n^\top X_n\Bigr)^{-1}}_{A_n}\; \underbrace{\Bigl(\frac{1}{n}\, X_n^\top Y_n\Bigr)}_{B_n},
$$
where $A_n = \bigl(\frac{1}{n}\, X_n^\top X_n\bigr)^{-1} \xrightarrow{\text{a.s.}} W^{-1}$ by Lemma 13.1. Further,
$$
B_n = \frac{1}{n}\, X_n^\top Y_n = \frac{1}{n} \sum_{i=1}^n X_i\bigl(Y_i - X_i^\top\beta + X_i^\top\beta\bigr) = \underbrace{\frac{1}{n} \sum_{i=1}^n X_i\varepsilon_i}_{C_n} + \underbrace{\frac{1}{n} \sum_{i=1}^n X_i X_i^\top\beta}_{D_n}.
$$
(a) $C_n = \frac{1}{n} \sum_{i=1}^n X_i\varepsilon_i \xrightarrow{\text{a.s.}} 0_k$ due to the SLLN (i.i.d., Theorem C.2). This is justified as follows.
• The $j$th ($j = 0, \dots, k-1$) element of the vector $\frac{1}{n} \sum_{i=1}^n X_i\varepsilon_i$ is $\frac{1}{n} \sum_{i=1}^n X_{i,j}\varepsilon_i$.
• The random variables $X_{i,j}\varepsilon_i$, $i = 1, 2, \dots$, are i.i.d. by (A0).
• $\mathrm{E}\bigl(X_{i,j}\varepsilon_i\bigr) = \mathrm{E}\bigl\{\mathrm{E}\bigl(X_{i,j}\varepsilon_i \,\big|\, X_i\bigr)\bigr\} = \mathrm{E}\bigl\{X_{i,j}\, \mathrm{E}\bigl(\varepsilon_i \,\big|\, X_i\bigr)\bigr\} = \mathrm{E}\bigl(X_{i,j} \cdot 0\bigr) = 0$.
• $\mathrm{var}\bigl(X_{i,j}\varepsilon_i\bigr) = \mathrm{E}\bigl\{\mathrm{var}\bigl(X_{i,j}\varepsilon_i \,\big|\, X_i\bigr)\bigr\} + \mathrm{var}\bigl\{\mathrm{E}\bigl(X_{i,j}\varepsilon_i \,\big|\, X_i\bigr)\bigr\} = \mathrm{E}\bigl\{X_{i,j}^2\, \mathrm{var}\bigl(\varepsilon_i \,\big|\, X_i\bigr)\bigr\} + \mathrm{var}\bigl(X_{i,j} \cdot 0\bigr) = \mathrm{E}\bigl\{X_{i,j}^2\, \sigma^2(X_i)\bigr\} < \infty$ by (A2 heteroscedastic), which implies $\mathrm{E}\bigl|X_{i,j}\varepsilon_i\bigr| < \infty$.
(b) $D_n = \frac{1}{n} \sum_{i=1}^n X_i X_i^\top\beta = \frac{1}{n}\, W_n\beta \xrightarrow{\text{a.s.}} W\beta$ by Lemma 13.1.
In summary, $\widehat\beta_n = A_n\bigl(C_n + D_n\bigr)$, where $A_n \xrightarrow{\text{a.s.}} W^{-1}$, $C_n \xrightarrow{\text{a.s.}} 0_k$, $D_n \xrightarrow{\text{a.s.}} W\beta$. Hence
$$
\widehat\beta_n \xrightarrow{\text{a.s.}} W^{-1} W \beta = \beta. \qquad \square
$$
Theorem 13.3 Strong consistency of the mean squared error.
Let assumptions (A0), (A1), (A2 homoscedastic) hold. Then
$$
\mathrm{MS}_{e,n} \xrightarrow{\text{a.s.}} \sigma^2 \quad \text{as } n \to \infty.
$$
Proof.
We have
$$
\mathrm{MS}_{e,n} = \frac{1}{n-k}\, \mathrm{SS}_{e,n} = \frac{n}{n-k}\; \frac{1}{n} \sum_{i=1}^n \bigl(Y_i - X_i^\top\widehat\beta_n\bigr)^2.
$$
Since $\lim_{n\to\infty} \frac{n}{n-k} = 1$, it is sufficient to show that
$$
\frac{1}{n} \sum_{i=1}^n \bigl(Y_i - X_i^\top\widehat\beta_n\bigr)^2 \xrightarrow{\text{a.s.}} \sigma^2 \quad \text{as } n \to \infty.
$$
We have
$$
\frac{1}{n} \sum_{i=1}^n \bigl(Y_i - X_i^\top\widehat\beta_n\bigr)^2 = \frac{1}{n} \sum_{i=1}^n \bigl(Y_i - X_i^\top\beta + X_i^\top\beta - X_i^\top\widehat\beta_n\bigr)^2
$$
$$
= \underbrace{\frac{1}{n} \sum_{i=1}^n \bigl(Y_i - X_i^\top\beta\bigr)^2}_{A_n} + \underbrace{\frac{1}{n} \sum_{i=1}^n \bigl\{X_i^\top\bigl(\beta - \widehat\beta_n\bigr)\bigr\}^2}_{B_n} + \underbrace{\frac{2}{n} \sum_{i=1}^n \bigl(Y_i - X_i^\top\beta\bigr)\, X_i^\top\bigl(\beta - \widehat\beta_n\bigr)}_{C_n}.
$$
(a) $A_n = \frac{1}{n} \sum_{i=1}^n \bigl(Y_i - X_i^\top\beta\bigr)^2 = \frac{1}{n} \sum_{i=1}^n \varepsilon_i^2 \xrightarrow{\text{a.s.}} \sigma^2$ due to the SLLN (i.i.d., Theorem C.2). This is justified by noting the following.
• The random variables $\varepsilon_i^2$, $i = 1, 2, \dots$, are i.i.d. by (A0).
• $\mathrm{E}\,\varepsilon_i = 0 \;\Longrightarrow\; \mathrm{E}\,\varepsilon_i^2 = \mathrm{var}\,\varepsilon_i = \mathrm{E}\,\sigma^2(X_i) = \mathrm{E}\,\sigma^2 = \sigma^2$ by assumption (A2 homoscedastic).
• $\mathrm{E}\bigl|\varepsilon_i^2\bigr| = \mathrm{E}\,\varepsilon_i^2 = \sigma^2 < \infty$ by assumption (A2 homoscedastic).
(b) $B_n = \frac{1}{n} \sum_{i=1}^n \bigl\{X_i^\top\bigl(\beta - \widehat\beta_n\bigr)\bigr\}^2 \xrightarrow{\text{a.s.}} 0$, which is seen as follows.
$$
B_n = \frac{1}{n} \sum_{i=1}^n \bigl(\beta - \widehat\beta_n\bigr)^\top X_i X_i^\top \bigl(\beta - \widehat\beta_n\bigr) = \bigl(\beta - \widehat\beta_n\bigr)^\top \Bigl(\frac{1}{n} \sum_{i=1}^n X_i X_i^\top\Bigr) \bigl(\beta - \widehat\beta_n\bigr) = \bigl(\beta - \widehat\beta_n\bigr)^\top\; \frac{1}{n}\, X_n^\top X_n\; \bigl(\beta - \widehat\beta_n\bigr).
$$
Now
$$
\beta - \widehat\beta_n \xrightarrow{\text{a.s.}} 0_k \ \text{due to Theorem 13.2}, \qquad \frac{1}{n}\, X_n^\top X_n \xrightarrow{\text{a.s.}} W \ \text{due to Lemma 13.1}.
$$
Hence $B_n \xrightarrow{\text{a.s.}} 0_k^\top\, W\, 0_k = 0$.
(c) $C_n = \frac{2}{n} \sum_{i=1}^n \bigl(Y_i - X_i^\top\beta\bigr)\, X_i^\top\bigl(\beta - \widehat\beta_n\bigr) \xrightarrow{\text{a.s.}} 0$, which is justified by the following.
$$
C_n = \frac{2}{n} \sum_{i=1}^n \bigl(Y_i - X_i^\top\beta\bigr)\, X_i^\top\bigl(\beta - \widehat\beta_n\bigr) = 2\, \Bigl(\frac{1}{n} \sum_{i=1}^n \varepsilon_i X_i^\top\Bigr) \bigl(\beta - \widehat\beta_n\bigr).
$$
Now
$$
\frac{1}{n} \sum_{i=1}^n \varepsilon_i X_i^\top \xrightarrow{\text{a.s.}} 0_k^\top \ \text{as was shown in the proof of Theorem 13.2}, \qquad \beta - \widehat\beta_n \xrightarrow{\text{a.s.}} 0_k \ \text{due to Theorem 13.2}.
$$
Hence $C_n \xrightarrow{\text{a.s.}} 0_k^\top\, 0_k = 0$.
In summary, $\mathrm{MS}_{e,n} = \frac{n}{n-k}\bigl(A_n + B_n + C_n\bigr)$, where $\frac{n}{n-k} \to 1$, $A_n \xrightarrow{\text{a.s.}} \sigma^2$, $B_n \xrightarrow{\text{a.s.}} 0$, $C_n \xrightarrow{\text{a.s.}} 0$. Hence
$$
\mathrm{MS}_{e,n} \xrightarrow{\text{a.s.}} 1 \cdot \bigl(\sigma^2 + 0 + 0\bigr) = \sigma^2. \qquad \square
$$
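Theorems 13.2 and 13.3 can be illustrated by a small simulation: with markedly non-normal (here shifted-exponential) homoscedastic errors, the LSE still approaches $\beta$ and $\mathrm{MS}_{e,n}$ approaches $\sigma^2$ as $n$ grows. A seeded Python sketch with numpy; the data-generating choices are purely illustrative:

```python
import numpy as np

def lse(X, y):
    """LSE beta_hat_n and residual mean square MS_{e,n} of the linear model."""
    n, k = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    mse = np.sum((y - X @ beta) ** 2) / (n - k)
    return beta, mse

rng = np.random.default_rng(3)
beta_true, sigma = np.array([1.0, -2.0]), 0.5
for n in (100, 10_000, 1_000_000):
    Z = rng.standard_normal(n)
    X = np.column_stack([np.ones(n), Z])
    # shifted exponential errors: mean 0, variance sigma^2, clearly non-normal
    eps = sigma * (rng.exponential(1.0, n) - 1.0)
    y = X @ beta_true + eps
    beta, mse = lse(X, y)
    print(n, np.max(np.abs(beta - beta_true)), mse)  # errors shrink; mse -> sigma^2 = 0.25
```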
13.3 Asymptotic normality of LSE under homoscedasticity
We shall show in this section: asymptotic normality of $\widehat\beta_n$, $\widehat\theta_n$, $\widehat\xi_n$ (LSEs of the regression coefficients or their linear combinations) when homoscedasticity of the errors is assumed but not their normality.

Reminder. $V = \bigl\{\mathrm{E}\bigl(XX^\top\bigr)\bigr\}^{-1}$.
Theorem 13.4 Asymptotic normality of LSE in homoscedastic case.
Let assumptions (A0), (A1), (A2 homoscedastic) hold. Then
$$
\sqrt{n}\bigl(\widehat\beta_n - \beta\bigr) \xrightarrow{\mathcal{D}} \mathcal{N}_k\bigl(0_k,\, \sigma^2 V\bigr) \quad \text{as } n \to \infty,
$$
$$
\sqrt{n}\bigl(\widehat\theta_n - \theta\bigr) \xrightarrow{\mathcal{D}} \mathcal{N}_1\bigl(0,\, \sigma^2\, l^\top V l\bigr) \quad \text{as } n \to \infty,
$$
$$
\sqrt{n}\bigl(\widehat\xi_n - \xi\bigr) \xrightarrow{\mathcal{D}} \mathcal{N}_m\bigl(0_m,\, \sigma^2\, L V L^\top\bigr) \quad \text{as } n \to \infty.
$$
Proof. Will be done jointly with Theorem 13.5. $\square$
13.3.1 Asymptotic validity of the classical inference under homoscedasticity but non-normality
For given $n \ge n_0 > k$, the following statistics are used to infer on estimable parameters of the linear model $\mathsf{M}_n$ based on the response vector $Y_n$ and the model matrix $X_n$ (see Chapter 3):
$$
T_n := \frac{\widehat\theta_n - \theta}{\sqrt{\mathrm{MS}_{e,n}\; l^\top\bigl(X_n^\top X_n\bigr)^{-1} l}}, \tag{13.1}
$$
$$
Q_n := \frac{1}{m}\, \bigl(\widehat\xi_n - \xi\bigr)^\top \Bigl\{\mathrm{MS}_{e,n}\; L\bigl(X_n^\top X_n\bigr)^{-1} L^\top\Bigr\}^{-1} \bigl(\widehat\xi_n - \xi\bigr). \tag{13.2}
$$
Reminder.
• $V_n = \bigl(X_n^\top X_n\bigr)^{-1}$.
• Under assumptions (A0) and (A1): $n\, V_n \xrightarrow{\text{a.s.}} V$ as $n \to \infty$.
Consequence of Theorem 13.4: Asymptotic distribution of t- and F-statistics.
Under the assumptions of Theorem 13.4:
$$
T_n \xrightarrow{\mathcal{D}} \mathcal{N}_1(0, 1) \quad \text{as } n \to \infty, \qquad m\, Q_n \xrightarrow{\mathcal{D}} \chi^2_m \quad \text{as } n \to \infty.
$$
Proof. It follows directly from Lemma 13.1, Theorem 13.4 and the Cramér-Slutsky theorem (Theorem C.7) as follows.
$$
T_n = \frac{l^\top\widehat\beta_n - l^\top\beta}{\sqrt{\mathrm{MS}_{e,n}\; l^\top\bigl(X_n^\top X_n\bigr)^{-1} l}} = \underbrace{\frac{\sqrt{n}\,\bigl(l^\top\widehat\beta_n - l^\top\beta\bigr)}{\sqrt{\sigma^2\, l^\top V l}}}_{\xrightarrow{\mathcal{D}}\ \mathcal{N}(0,1)}\; \cdot\; \underbrace{\sqrt{\frac{\sigma^2\, l^\top V l}{\mathrm{MS}_{e,n}\; l^\top\bigl\{n\bigl(X_n^\top X_n\bigr)^{-1}\bigr\} l}}}_{\xrightarrow{\mathrm{P}}\ 1},
$$
$$
m\, Q_n = \bigl(L\widehat\beta_n - L\beta\bigr)^\top \Bigl\{\mathrm{MS}_{e,n}\; L\bigl(X_n^\top X_n\bigr)^{-1} L^\top\Bigr\}^{-1} \bigl(L\widehat\beta_n - L\beta\bigr)
$$
$$
= \underbrace{\bigl\{\sqrt{n}\,\bigl(L\widehat\beta_n - L\beta\bigr)\bigr\}^\top}_{\xrightarrow{\mathcal{D}}\ \mathcal{N}_m(0_m,\, \sigma^2 LVL^\top)}\; \underbrace{\Bigl\{\mathrm{MS}_{e,n}\; L\bigl\{n\bigl(X_n^\top X_n\bigr)^{-1}\bigr\} L^\top\Bigr\}^{-1}}_{\xrightarrow{\mathrm{P}}\ (\sigma^2 LVL^\top)^{-1}}\; \underbrace{\sqrt{n}\,\bigl(L\widehat\beta_n - L\beta\bigr)}_{\xrightarrow{\mathcal{D}}\ \mathcal{N}_m(0_m,\, \sigma^2 LVL^\top)}.
$$
Convergence to $\chi^2_m$ in distribution follows from a property of the (multivariate) normal distribution concerning the distribution of a quadratic form. $\square$
If additionally normality is assumed, i.e., if it is assumed that $Y_n \,\big|\, X_n \sim \mathcal{N}_n\bigl(X_n\beta,\, \sigma^2 I_n\bigr)$, then Theorem 3.1 (LSE under normality) provides
$$
T_n \sim \mathrm{t}_{n-k}, \qquad Q_n \sim \mathrm{F}_{m,\,n-k}.
$$
This is then used for inference (derivation of confidence intervals and regions, construction of tests) on the estimable parameters of a linear model under the assumption of normality.
The following holds in general: if
$$
T_\nu \sim \mathrm{t}_\nu, \ \text{then}\ T_\nu \xrightarrow{\mathcal{D}} \mathcal{N}(0, 1) \ \text{as } \nu \to \infty; \qquad Q_\nu \sim \mathrm{F}_{m,\,\nu}, \ \text{then}\ m\, Q_\nu \xrightarrow{\mathcal{D}} \chi^2_m \ \text{as } \nu \to \infty. \tag{13.3}
$$
This, together with the Consequence of Theorem 13.4, justifies the asymptotic validity of the classical inference based on the statistics $T_n$ (Eq. 13.1) and $Q_n$ (Eq. 13.2), respectively, and the Student t- and F-distributions, respectively, even if normality of the error terms of the linear model does not hold. The only requirements are the assumptions of Theorem 13.4.
That is, for example, both intervals
$$
\text{(i)}\quad I_n^N := \Bigl(\widehat\theta_n - u(1-\alpha/2)\sqrt{\mathrm{MS}_{e,n}\; l^\top\bigl(X_n^\top X_n\bigr)^{-1} l},\;\; \widehat\theta_n + u(1-\alpha/2)\sqrt{\mathrm{MS}_{e,n}\; l^\top\bigl(X_n^\top X_n\bigr)^{-1} l}\Bigr);
$$
$$
\text{(ii)}\quad I_n^t := \Bigl(\widehat\theta_n - \mathrm{t}_{n-k}(1-\alpha/2)\sqrt{\mathrm{MS}_{e,n}\; l^\top\bigl(X_n^\top X_n\bigr)^{-1} l},\;\; \widehat\theta_n + \mathrm{t}_{n-k}(1-\alpha/2)\sqrt{\mathrm{MS}_{e,n}\; l^\top\bigl(X_n^\top X_n\bigr)^{-1} l}\Bigr),
$$
satisfy, for any $\theta^0 \in \mathbb{R}$ (even without normality of the error terms),
$$
\mathrm{P}\bigl(I_n^N \ni \theta^0;\ \theta = \theta^0\bigr) \longrightarrow 1 - \alpha \quad \text{as } n \to \infty, \qquad \mathrm{P}\bigl(I_n^t \ni \theta^0;\ \theta = \theta^0\bigr) \longrightarrow 1 - \alpha \quad \text{as } n \to \infty.
$$
Analogously, due to a general asymptotic property of the F-distribution (Eq. 13.3), asymptotically valid inference on the estimable vector parameter $\xi = L\beta$ of a linear model can be based either on the statistic $m\, Q_n$ and the $\chi^2_m$ distribution or on the statistic $Q_n$ and the $\mathrm{F}_{m,\,n-k}$ distribution. For example, for both ellipsoids
$$
\text{(i)}\quad K_n^\chi := \Bigl\{\xi \in \mathbb{R}^m:\ \bigl(\xi - \widehat\xi\bigr)^\top \Bigl\{\mathrm{MS}_{e,n}\; L\bigl(X_n^\top X_n\bigr)^{-1} L^\top\Bigr\}^{-1} \bigl(\xi - \widehat\xi\bigr) < \chi^2_m(1-\alpha)\Bigr\};
$$
$$
\text{(ii)}\quad K_n^F := \Bigl\{\xi \in \mathbb{R}^m:\ \bigl(\xi - \widehat\xi\bigr)^\top \Bigl\{\mathrm{MS}_{e,n}\; L\bigl(X_n^\top X_n\bigr)^{-1} L^\top\Bigr\}^{-1} \bigl(\xi - \widehat\xi\bigr) < m\, \mathrm{F}_{m,n-k}(1-\alpha)\Bigr\},
$$
we have for any $\xi^0 \in \mathbb{R}^m$ (under the assumptions of Theorem 13.4):
$$
\mathrm{P}\bigl(K_n^\chi \ni \xi^0;\ \xi = \xi^0\bigr) \longrightarrow 1 - \alpha \quad \text{as } n \to \infty, \qquad \mathrm{P}\bigl(K_n^F \ni \xi^0;\ \xi = \xi^0\bigr) \longrightarrow 1 - \alpha \quad \text{as } n \to \infty.
$$
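The asymptotic validity of the interval $I_n^N$ can be checked by simulating its empirical coverage under markedly non-normal errors. A seeded Python sketch with numpy ($1.96 \approx u(0.975)$; all design choices are illustrative):

```python
import numpy as np

def coverage(n, n_rep=2000, seed=11):
    """Empirical coverage of the normal-quantile interval I_n^N for the slope."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_rep):
        Zc = rng.standard_normal(n)
        X = np.column_stack([np.ones(n), Zc])
        # skewed, non-normal errors with mean zero
        y = X @ np.array([0.0, 1.0]) + (rng.exponential(1.0, n) - 1.0)
        XtX_inv = np.linalg.inv(X.T @ X)
        b = XtX_inv @ X.T @ y
        mse = np.sum((y - X @ b) ** 2) / (n - 2)
        se = np.sqrt(mse * XtX_inv[1, 1])
        hits += abs(b[1] - 1.0) < 1.96 * se   # u(0.975) ~ 1.96
    return hits / n_rep

print(coverage(200))  # close to the nominal 0.95 despite non-normal errors
```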
13.4 Asymptotic normality of LSE under heteroscedasticity
We shall show in this section: asymptotic normality of $\widehat\beta_n$, $\widehat\theta_n$, $\widehat\xi_n$ (LSEs of the regression coefficients or their linear combinations) when not even homoscedasticity of the errors is assumed.

Reminder.
• $V = \bigl\{\mathrm{E}\bigl(XX^\top\bigr)\bigr\}^{-1}$.
• $W^F = \mathrm{E}\bigl\{\sigma^2(X)\, XX^\top\bigr\}$.
Theorem 13.5 Asymptotic normality of LSE in heteroscedastic case.
Let assumptions (A0), (A1), (A2 heteroscedastic) hold. Then
$$
\sqrt{n}\bigl(\widehat\beta_n - \beta\bigr) \xrightarrow{\mathcal{D}} \mathcal{N}_k\bigl(0_k,\, V W^F V\bigr) \quad \text{as } n \to \infty,
$$
$$
\sqrt{n}\bigl(\widehat\theta_n - \theta\bigr) \xrightarrow{\mathcal{D}} \mathcal{N}_1\bigl(0,\, l^\top V W^F V l\bigr) \quad \text{as } n \to \infty,
$$
$$
\sqrt{n}\bigl(\widehat\xi_n - \xi\bigr) \xrightarrow{\mathcal{D}} \mathcal{N}_m\bigl(0_m,\, L V W^F V L^\top\bigr) \quad \text{as } n \to \infty.
$$
Proof. We will jointly prove also Theorem 13.4.
We have
$$
\widehat\beta_n = \underbrace{\bigl(X_n^\top X_n\bigr)^{-1}}_{V_n} X_n^\top Y_n = V_n \sum_{i=1}^n X_i Y_i = V_n \sum_{i=1}^n X_i \bigl(X_i^\top\beta + \varepsilon_i\bigr) = V_n \underbrace{\Bigl(\sum_{i=1}^n X_i X_i^\top\Bigr)}_{V_n^{-1}} \beta + V_n \sum_{i=1}^n X_i\varepsilon_i = \beta + V_n \sum_{i=1}^n X_i\varepsilon_i.
$$
That is,
$$
\widehat\beta_n - \beta = V_n \sum_{i=1}^n X_i\varepsilon_i = n\, V_n\; \frac{1}{n} \sum_{i=1}^n X_i\varepsilon_i. \tag{13.4}
$$
By Lemma 13.1, $n\, V_n \xrightarrow{\text{a.s.}} V$, which implies
$$
n\, V_n \xrightarrow{\mathrm{P}} V \quad \text{as } n \to \infty. \tag{13.5}
$$
In the following, let us explore the asymptotic behavior of the term $\frac{1}{n} \sum_{i=1}^n X_i\varepsilon_i$. From assumption (A0), this term is a sample mean of the i.i.d. random vectors $X_i\varepsilon_i$, $i = 1, \dots, n$. The mean and the covariance matrix of the distribution of those random vectors are
$$
\mathrm{E}\bigl(X\varepsilon\bigr) = 0_k \quad \text{(was shown in the proof of Theorem 13.2)},
$$
$$
\mathrm{var}\bigl(X\varepsilon\bigr) = \mathrm{E}\bigl\{\mathrm{var}\bigl(X\varepsilon \,\big|\, X\bigr)\bigr\} + \mathrm{var}\bigl\{\mathrm{E}\bigl(X\varepsilon \,\big|\, X\bigr)\bigr\} = \mathrm{E}\Bigl\{X \underbrace{\mathrm{var}\bigl(\varepsilon \,\big|\, X\bigr)}_{\sigma^2(X)} X^\top\Bigr\} + \mathrm{var}\Bigl\{X \underbrace{\mathrm{E}\bigl(\varepsilon \,\big|\, X\bigr)}_{0}\Bigr\} = \mathrm{E}\bigl\{\sigma^2(X)\, XX^\top\bigr\}.
$$
Depending on whether (A2 homoscedastic) or (A2 heteroscedastic) is assumed, we have
$$
\mathrm{var}\bigl(X\varepsilon\bigr) = \mathrm{E}\bigl\{\sigma^2(X)\, XX^\top\bigr\} = \begin{cases} \sigma^2\, \mathrm{E}\bigl(XX^\top\bigr) = \sigma^2 W, & \text{(A2 homoscedastic)}, \\ W^F, & \text{(A2 heteroscedastic)}. \end{cases} \tag{13.6}
$$
Under both (A2 homoscedastic) and (A2 heteroscedastic), all elements of the covariance matrix $\mathrm{var}\bigl(X\varepsilon\bigr)$ are finite. Hence by Theorem C.5 (multivariate CLT for i.i.d. random vectors):
$$
\sqrt{n}\; \frac{1}{n} \sum_{i=1}^n X_i\varepsilon_i = \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i\varepsilon_i \xrightarrow{\mathcal{D}} \mathcal{N}_k\bigl(0_k,\, \mathrm{E}\bigl\{\sigma^2(X)\, XX^\top\bigr\}\bigr) \quad \text{as } n \to \infty.
$$
From (13.4) and (13.5), we now have
$$
\sqrt{n}\bigl(\widehat\beta_n - \beta\bigr) = \underbrace{n\, V_n}_{\xrightarrow{\mathrm{P}}\ V}\; \underbrace{\frac{1}{\sqrt{n}} \sum_{i=1}^n X_i\varepsilon_i}_{\xrightarrow{\mathcal{D}}\ \mathcal{N}_k(0_k,\, \mathrm{E}\{\sigma^2(X)\, XX^\top\})}.
$$
Finally, by applying Theorem C.7 (Cramér-Slutsky):
$$
\sqrt{n}\bigl(\widehat\beta_n - \beta\bigr) \xrightarrow{\mathcal{D}} \mathcal{N}_k\bigl(0_k,\, V\, \mathrm{E}\bigl\{\sigma^2(X)\, XX^\top\bigr\}\, V^\top\bigr) \quad \text{as } n \to \infty.
$$
By using (13.6) and realizing that $V^\top = V$, we get:
Under (A2 homoscedastic),
$$
V\, \mathrm{E}\bigl\{\sigma^2(X)\, XX^\top\bigr\}\, V^\top = V\, \sigma^2 W\, V = \sigma^2\, V V^{-1} V = \sigma^2 V,
$$
and hence
$$
\sqrt{n}\bigl(\widehat\beta_n - \beta\bigr) \xrightarrow{\mathcal{D}} \mathcal{N}_k\bigl(0_k,\, \sigma^2 V\bigr) \quad \text{as } n \to \infty.
$$
Under (A2 heteroscedastic),
$$
V\, \mathrm{E}\bigl\{\sigma^2(X)\, XX^\top\bigr\}\, V^\top = V W^F V,
$$
and hence
$$
\sqrt{n}\bigl(\widehat\beta_n - \beta\bigr) \xrightarrow{\mathcal{D}} \mathcal{N}_k\bigl(0_k,\, V W^F V\bigr) \quad \text{as } n \to \infty.
$$
Asymptotic normality of $\widehat\theta_n = l^\top\widehat\beta_n$ and of $\widehat\xi_n = L\widehat\beta_n$ follows now from Theorem C.6 (Cramér-Wold). $\square$
Notation (Residuals and related quantities based on a model for data of size n).
For $n \ge n_0 > k$, the following notation will be used for quantities based on the model $\mathsf{M}_n: Y_n \,\big|\, X_n \sim \bigl(X_n\beta,\, \sigma^2 I_n\bigr)$.
• Hat matrix: $H_n = X_n\bigl(X_n^\top X_n\bigr)^{-1} X_n^\top$;
• Residual projection matrix: $M_n = I_n - H_n$;
• Diagonal elements of matrix $H_n$: $h_{n,1}, \dots, h_{n,n}$;
• Diagonal elements of matrix $M_n$: $m_{n,1} = 1 - h_{n,1}, \dots, m_{n,n} = 1 - h_{n,n}$;
• Residuals: $U_n = M_n Y_n = \bigl(U_{n,1}, \dots, U_{n,n}\bigr)^\top$.

Reminder.
• $V_n = \bigl(\sum_{i=1}^n X_i X_i^\top\bigr)^{-1} = \bigl(X_n^\top X_n\bigr)^{-1}$.
• Under assumptions (A0) and (A1): $n\, V_n \xrightarrow{\text{a.s.}} V$ as $n \to \infty$.
Theorem 13.6 Sandwich estimator of the covariance matrix.
Let assumptions (A0), (A1), (A2 heteroscedastic) hold. Let additionally, for each $s, t, j, l = 0, \dots, k-1$,
$$
\mathrm{E}\bigl|\varepsilon^2 X_j X_l\bigr| < \infty, \qquad \mathrm{E}\bigl|\varepsilon\, X_s X_j X_l\bigr| < \infty, \qquad \mathrm{E}\bigl|X_s X_t X_j X_l\bigr| < \infty.
$$
Then
$$
n\, V_n W_n^F V_n \xrightarrow{\text{a.s.}} V W^F V \quad \text{as } n \to \infty,
$$
where for $n = 1, 2, \dots$,
$$
W_n^F = \sum_{i=1}^n U_{n,i}^2\, X_i X_i^\top = X_n^\top \Omega_n X_n, \qquad \Omega_n = \mathrm{diag}\bigl(\omega_{n,1}, \dots, \omega_{n,n}\bigr), \qquad \omega_{n,i} = U_{n,i}^2, \quad i = 1, \dots, n.
$$

End of Lecture #24 (07/01/2016)
Start of Lecture #26 (14/01/2016)
Proof.
First, recall that
$$
V W^F V = \bigl\{\mathrm{E}\bigl(XX^\top\bigr)\bigr\}^{-1}\; \mathrm{E}\bigl\{\sigma^2(X)\, XX^\top\bigr\}\; \bigl\{\mathrm{E}\bigl(XX^\top\bigr)\bigr\}^{-1},
$$
and we know from Lemma 13.1 that
$$
n\, V_n = n\bigl(X_n^\top X_n\bigr)^{-1} \xrightarrow{\text{a.s.}} \bigl\{\mathrm{E}\bigl(XX^\top\bigr)\bigr\}^{-1} = V \quad \text{as } n \to \infty.
$$
Hence, if we show that
$$
\frac{1}{n}\, W_n^F = \frac{1}{n} \sum_{i=1}^n U_{n,i}^2\, X_i X_i^\top \xrightarrow{\text{a.s.}} \mathrm{E}\bigl\{\sigma^2(X)\, XX^\top\bigr\} = W^F \quad \text{as } n \to \infty,
$$
the statement of the Theorem will be proven.
Remember,
$$
\sigma^2(X) = \mathrm{var}\bigl(\varepsilon \,\big|\, X\bigr) = \mathrm{E}\bigl(\varepsilon^2 \,\big|\, X\bigr).
$$
From here, for each $j, l = 0, \dots, k-1$,
$$
\mathrm{E}\bigl(\varepsilon^2 X_j X_l\bigr) = \mathrm{E}\bigl\{\mathrm{E}\bigl(\varepsilon^2 X_j X_l \,\big|\, X\bigr)\bigr\} = \mathrm{E}\bigl\{X_j X_l\, \mathrm{E}\bigl(\varepsilon^2 \,\big|\, X\bigr)\bigr\} = \mathrm{E}\bigl\{\sigma^2(X)\, X_j X_l\bigr\}.
$$
For each $j, l = 0, \dots, k-1$, $\mathrm{E}\bigl|\varepsilon^2 X_j X_l\bigr| < \infty$ by the assumptions of the Theorem. By assumption (A0), $\varepsilon_i^2 X_{i,j} X_{i,l}$, $i = 1, 2, \dots$, is a sequence of i.i.d. random variables. Hence by Theorem C.2 (SLLN, i.i.d.),
$$
\frac{1}{n} \sum_{i=1}^n \varepsilon_i^2 X_{i,j} X_{i,l} \xrightarrow{\text{a.s.}} \mathrm{E}\bigl\{\sigma^2(X)\, X_j X_l\bigr\} \quad \text{as } n \to \infty.
$$
That is, in a matrix form,
$$
\frac{1}{n} \sum_{i=1}^n \varepsilon_i^2\, X_i X_i^\top \xrightarrow{\text{a.s.}} \mathrm{E}\bigl\{\sigma^2(X)\, XX^\top\bigr\} = W^F \quad \text{as } n \to \infty. \tag{13.7}
$$
In the following, we show that the (unobservable) squared error terms $\varepsilon_i^2$ in (13.7) can be replaced by the squared residuals $U_{n,i}^2 = \bigl(Y_i - X_i^\top\widehat\beta_n\bigr)^2$ while keeping the same limiting matrix $W^F$ as in (13.7). We have
$$
\frac{1}{n}\, W_n^F = \frac{1}{n} \sum_{i=1}^n U_{n,i}^2\, X_i X_i^\top = \frac{1}{n} \sum_{i=1}^n \bigl(Y_i - X_i^\top\widehat\beta_n\bigr)^2\, X_i X_i^\top
$$
n
1 X
>
>b 2
Xi X>
=
Yi − X >
i
i β + X i β − X i βn
|
{z
}
n
i=1
εi
=
n
n
1 X 2
1 X
b > X iX > β − β
b X iX >
+
εi X i X >
β−β
n
n
i
i
i
n
n
i=1
i=1
{z
}
{z
}
|
|
An
Bn
+
(a) An =
n
2 X
b > X i εi X i X > .
β−β
n
i
n
i=1
|
{z
}
Cn
n
1 X 2
a.s.
2
>
εi X i X >
= WF due to (13.7).
i −→ E σ (X) XX
n
i=1
n
1 X
b > X iX > β − β
b X i X > , we can realize that β −
β−β
n
n
i
i
n
i=1
b is a scalar quantity. Hence
β−β
n
(b) To work with Bn =
b
β
n
>
Xi = X>
i
Bn
n
1 X
b > X i X iX > X > β − β
b
β−β
=
n
n
i
i
n
i=1
and the (j, l)th element of matrix Bn (j, l = 0, . . . , k − 1) is
Bn (j, l) =
n
1 X
b > X i (Xi,j Xi,l ) X > β − β
b
β−β
n
n
i
n
i=1
=
b
β−β
n
>
n
1 X
>
b .
(Xi,j Xi,l ) X i X i
β−β
n
n
i=1
a.s.
b
• From Theorem 13.2: β − β
n −→ 0k as n →
∞.
• Due to assumption (A0) and assumption EXs Xt Xj Xl < ∞ for any s, t, j, l =
0, . . . , k − 1, by Theorem C.2 (SLLN, i.i.d.), for any j, l = 0, . . . , k − 1:
n
1 X
a.s.
>
(Xi,j Xi,l ) X i X >
.
i −→ E Xj Xl XX
n
i=1
a.s.
>
• Hence, for any j, l = 0, . . . , k − 1, Bn (j, l) −→ 0>
0k = 0 and finally,
k E Xj Xl XX
a.s.
as n → ∞.
Bn −→ 0k×k
(c) $C_n = \dfrac{2}{n}\displaystyle\sum_{i=1}^{n}\bigl(\beta-\hat\beta_n\bigr)^\top X_i\,\varepsilon_i\, X_i X_i^\top$ and the $(j,l)$th element of the matrix $C_n$ ($j,l = 0,\ldots,k-1$) is
\[
C_n(j,l) = \frac{2}{n}\sum_{i=1}^{n}\bigl(\beta-\hat\beta_n\bigr)^\top X_i\,\varepsilon_i\, X_{i,j}X_{i,l}
= 2\,\bigl(\beta-\hat\beta_n\bigr)^\top\,\frac{1}{n}\sum_{i=1}^{n} X_i\,\varepsilon_i\, X_{i,j}X_{i,l}.
\]
• From Theorem 13.2: $\beta - \hat\beta_n \xrightarrow{a.s.} 0_k$ as $n\to\infty$.
• Due to assumption (A0) and the assumption $E\bigl|\varepsilon\, X_s X_j X_l\bigr| < \infty$ for any $s,j,l = 0,\ldots,k-1$, by Theorem C.2 (SLLN, i.i.d.), for any $j,l = 0,\ldots,k-1$:
\[
\frac{1}{n}\sum_{i=1}^{n} X_i\,\varepsilon_i\, X_{i,j}X_{i,l} \xrightarrow{a.s.} E\bigl\{X\varepsilon\, X_j X_l\bigr\}.
\]
• Hence, for any $j,l = 0,\ldots,k-1$, $C_n(j,l) \xrightarrow{a.s.} 2\cdot 0_k^\top\, E\bigl\{X\varepsilon\, X_j X_l\bigr\} = 0$ and finally,
\[
C_n \xrightarrow{a.s.} 0_{k\times k} \quad\text{as } n\to\infty.
\]
In summary:
\[
n\,V_n\; W^F_n\; n\,V_n = n\,V_n\,\bigl(A_n + B_n + C_n\bigr)\, n\,V_n,
\]
where
\[
n\,V_n \xrightarrow{a.s.} V, \qquad A_n \xrightarrow{a.s.} W^F, \qquad B_n \xrightarrow{a.s.} 0_{k\times k}, \qquad C_n \xrightarrow{a.s.} 0_{k\times k}.
\]
Hence
\[
n\,V_n\; W^F_n\; n\,V_n \xrightarrow{a.s.} V\bigl(W^F + 0_{k\times k} + 0_{k\times k}\bigr)V = V\,W^F\,V \quad\text{as } n\to\infty. \qquad\Box
\]
Terminology (Heteroscedasticity consistent (sandwich) estimator of the covariance matrix).
The matrix
\[
V_n\, W^F_n\, V_n = \bigl(X_n^\top X_n\bigr)^{-1} X_n^\top \Omega_n X_n \bigl(X_n^\top X_n\bigr)^{-1}
\tag{13.8}
\]
is called the heteroscedasticity consistent (HC) estimator of the covariance matrix of the LSE $\hat\beta_n$ of the regression coefficients. Due to its form, the matrix (13.8) is also called the sandwich estimator with a bread $\bigl(X_n^\top X_n\bigr)^{-1} X_n^\top$ and a meat $\Omega_n$.
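The sandwich (13.8) with the HC0 meat $\Omega_n = \mathrm{diag}\bigl(U_{n,1}^2,\ldots,U_{n,n}^2\bigr)$ can be computed directly from the fitted model. A minimal numpy sketch on an invented toy dataset (the data and variable names are illustrative, not from the course):

```python
import numpy as np

# invented toy data: intercept + one covariate
x = np.array([0., 1., 2., 3., 4., 10.])
X = np.column_stack([np.ones_like(x), x])
Y = np.array([1.0, 2.1, 2.9, 4.2, 4.8, 11.5])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y          # LSE of the regression coefficients
U = Y - X @ beta_hat                  # residuals
Omega = np.diag(U**2)                 # HC0 meat: squared residuals on the diagonal

# sandwich (13.8): bread * meat * bread
V_HC = XtX_inv @ X.T @ Omega @ X @ XtX_inv
```

The resulting matrix is symmetric and positive semidefinite whenever $X_n$ has full column rank.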
Notes (Alternative sorts of meat for the sandwich).
• It is directly seen that the meat matrix $\Omega_n$ can, for a chosen sequence $\nu_n$ such that $\frac{n}{\nu_n}\to 1$ as $n\to\infty$, be replaced by the matrix $\frac{n}{\nu_n}\,\Omega_n$, and the statement of Theorem 13.6 remains valid. The value $\nu_n$ is then called the degrees of freedom of the sandwich.
• It can also be shown (see references below) that the meat matrix $\Omega_n$ can, for a chosen sequence $\nu_n$ such that $\frac{n}{\nu_n}\to 1$ as $n\to\infty$ and a suitable sequence $\delta_n = \bigl(\delta_{n,1},\ldots,\delta_{n,n}\bigr)$, $n = 1,2,\ldots$, be replaced by the matrix
\[
\Omega^{HC}_n := \mathrm{diag}\bigl(\omega_{n,1},\ldots,\omega_{n,n}\bigr), \qquad
\omega_{n,i} = \frac{n\, U_{n,i}^2}{\nu_n\, m_{n,i}^{\delta_{n,i}}}, \qquad i = 1,\ldots,n.
\]
• The following choices of the sequences $\nu_n$ and $\delta_n$ have appeared in the literature ($n = 1,2,\ldots$, $i = 1,\ldots,n$):

HC0: $\nu_n = n$, $\delta_{n,i} = 0$, that is, $\omega_{n,i} = U_{n,i}^2$. This is the choice due to White (1980), who was the first to propose the sandwich estimator of the covariance matrix. This choice was also used in Theorem 13.6.

HC1: $\nu_n = n-k$, $\delta_{n,i} = 0$, that is, $\omega_{n,i} = \dfrac{n}{n-k}\, U_{n,i}^2$. This choice was suggested by MacKinnon and White (1985).

HC2: $\nu_n = n$, $\delta_{n,i} = 1$, that is, $\omega_{n,i} = \dfrac{U_{n,i}^2}{m_{n,i}}$. This is the second proposal of MacKinnon and White (1985).

HC3: $\nu_n = n$, $\delta_{n,i} = 2$, that is, $\omega_{n,i} = \dfrac{U_{n,i}^2}{m_{n,i}^2}$. This is the third proposal of MacKinnon and White (1985).

HC4: $\nu_n = n$, $\delta_{n,i} = \min\bigl\{4,\; n\,h_{n,i}/k\bigr\}$, that is, $\omega_{n,i} = \dfrac{U_{n,i}^2}{m_{n,i}^{\delta_{n,i}}}$. This was proposed relatively recently by Cribari-Neto (2004). Note that $k = \sum_{i=1}^{n} h_{n,i}$, hence
\[
\delta_{n,i} = \min\Bigl\{4,\;\frac{h_{n,i}}{\bar h_n}\Bigr\}, \qquad \bar h_n = \frac{1}{n}\sum_{i=1}^{n} h_{n,i}.
\]
• An extensive study of the small sample behavior of the different sandwich estimators was carried out by Long and Ervin (2000), who recommended use of the HC3 estimator. Even better small sample behavior, especially in the presence of influential observations, was later reported by Cribari-Neto (2004) for the HC4 estimator.
• The labels HC0, HC1, HC2, HC3, HC4 for the above sandwich estimators are used by the R package sandwich (Zeileis, 2004), which enables their easy calculation from a fitted linear model.
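The five meat choices differ only in the diagonal weights $\omega_{n,i}$, all of which are functions of the residuals and the diagonal of $M$. A minimal numpy sketch (toy data invented for illustration) computing the HC0 through HC4 weights:

```python
import numpy as np

x = np.array([0., 1., 2., 3., 4., 10.])
X = np.column_stack([np.ones_like(x), x])
Y = np.array([1.0, 2.1, 2.9, 4.2, 4.8, 11.5])
n, k = X.shape

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                          # leverages h_{n,i}
m = 1.0 - h                             # m_{n,i}
U = Y - H @ Y                           # residuals

w_hc0 = U**2                            # White (1980)
w_hc1 = n / (n - k) * U**2              # MacKinnon & White (1985)
w_hc2 = U**2 / m
w_hc3 = U**2 / m**2
delta = np.minimum(4.0, h / h.mean())   # n h_{n,i} / k = h_{n,i} / hbar since k = sum(h)
w_hc4 = U**2 / m**delta                 # Cribari-Neto (2004)
```

Since $0 < m_{n,i} \le 1$, the HC3 weights always dominate the HC2 weights, which is why HC3 inflates the variance of high-leverage observations more aggressively.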
13.4.1 Heteroscedasticity consistent asymptotic inference
Let, for given sequences $\nu_n$ and $\delta_n$, $n = 1,2,\ldots$, $\Omega^{HC}_n$ be a sequence of meat matrices that lead to the heteroscedasticity consistent estimator of the covariance matrix of the LSE $\hat\beta_n$. Let, for given $n \ge n_0 > k$,
\[
V^{HC}_n := \bigl(X_n^\top X_n\bigr)^{-1} X_n^\top \Omega^{HC}_n X_n \bigl(X_n^\top X_n\bigr)^{-1}.
\]
Finally, let the statistics $T^{HC}_n$ and $Q^{HC}_n$ be defined as
\[
T^{HC}_n := \frac{\hat\theta_n - \theta}{\sqrt{l^\top V^{HC}_n l}}, \qquad
Q^{HC}_n := \frac{1}{m}\bigl(\hat\xi_n - \xi\bigr)^\top\bigl(L\, V^{HC}_n\, L^\top\bigr)^{-1}\bigl(\hat\xi_n - \xi\bigr).
\]
Note that the statistics $T^{HC}_n$ and $Q^{HC}_n$ are the usual statistics $T_n$ (Eq. 13.1) and $Q_n$ (Eq. 13.2), respectively, in which the term $\mathrm{MS}_{e,n}\bigl(X_n^\top X_n\bigr)^{-1}$ is replaced by the sandwich estimator $V^{HC}_n$.

Consequence of Theorems 13.5 and 13.6: Heteroscedasticity consistent asymptotic inference.
Under the assumptions of Theorems 13.5 and 13.6:
\[
T^{HC}_n \xrightarrow{D} \mathcal{N}_1(0,1) \quad\text{as } n\to\infty, \qquad
m\, Q^{HC}_n \xrightarrow{D} \chi^2_m \quad\text{as } n\to\infty.
\]
Proof. Proof/calculations were available on the blackboard in K1. □
Due to a general asymptotic property of the Student t-distribution (Eq. 13.3), asymptotically valid inference on an estimable parameter $\theta = l^\top\beta$ of a linear model in which neither normality nor homoscedasticity is necessarily satisfied can be based on the statistic $T^{HC}_n$ and either a Student $t_{n-k}$ or a standard normal distribution. Under the assumptions of Theorems 13.5 and 13.6, both intervals
\[
\text{(i)}\quad I^{N}_n := \Bigl(\hat\theta_n - u(1-\alpha/2)\sqrt{l^\top V^{HC}_n l}\;;\;\; \hat\theta_n + u(1-\alpha/2)\sqrt{l^\top V^{HC}_n l}\Bigr),
\]
\[
\text{(ii)}\quad I^{t}_n := \Bigl(\hat\theta_n - t_{n-k}(1-\alpha/2)\sqrt{l^\top V^{HC}_n l}\;;\;\; \hat\theta_n + t_{n-k}(1-\alpha/2)\sqrt{l^\top V^{HC}_n l}\Bigr),
\]
satisfy, for any $\theta_0 \in \mathbb{R}$:
\[
P\bigl(I^{N}_n \ni \theta_0;\;\theta = \theta_0\bigr) \longrightarrow 1-\alpha, \qquad
P\bigl(I^{t}_n \ni \theta_0;\;\theta = \theta_0\bigr) \longrightarrow 1-\alpha \qquad\text{as } n\to\infty.
\]
Analogously, due to a general asymptotic property of the F-distribution (Eq. 13.3), asymptotically valid inference on an estimable vector parameter $\xi = L\beta$ of a linear model can be based either on the statistic $m\,Q^{HC}_n$ and the $\chi^2_m$ distribution, or on the statistic $Q^{HC}_n$ and the $F_{m,n-k}$ distribution. For example, both ellipsoids
\[
\text{(i)}\quad K^{\chi}_n := \Bigl\{\xi \in \mathbb{R}^m:\;\bigl(\hat\xi_n - \xi\bigr)^\top\bigl(L\, V^{HC}_n\, L^\top\bigr)^{-1}\bigl(\hat\xi_n - \xi\bigr) < \chi^2_m(1-\alpha)\Bigr\},
\]
\[
\text{(ii)}\quad K^{F}_n := \Bigl\{\xi \in \mathbb{R}^m:\;\bigl(\hat\xi_n - \xi\bigr)^\top\bigl(L\, V^{HC}_n\, L^\top\bigr)^{-1}\bigl(\hat\xi_n - \xi\bigr) < m\, F_{m,n-k}(1-\alpha)\Bigr\},
\]
satisfy, for any $\xi_0 \in \mathbb{R}^m$ (under the assumptions of Theorems 13.5 and 13.6):
\[
P\bigl(K^{\chi}_n \ni \xi_0;\;\xi = \xi_0\bigr) \longrightarrow 1-\alpha, \qquad
P\bigl(K^{F}_n \ni \xi_0;\;\xi = \xi_0\bigr) \longrightarrow 1-\alpha \qquad\text{as } n\to\infty.
\]
End of Lecture #26 (14/01/2016)
Chapter 14
Unusual Observations
Start of Lecture #25 (07/01/2016)
In the whole chapter, we assume a linear model
\[
\mathsf{M}:\; Y \mid X \sim \bigl(X\beta,\;\sigma^2 I_n\bigr), \qquad \mathrm{rank}\bigl(X_{n\times k}\bigr) = r \le k,
\]
where standard notation is considered. That is,
• $b = \bigl(X^\top X\bigr)^{-} X^\top Y = \bigl(b_0,\ldots,b_{k-1}\bigr)^\top$: any solution to the normal equations;
• $H = X\bigl(X^\top X\bigr)^{-} X^\top = \bigl(h_{i,t}\bigr)_{i,t=1,\ldots,n}$: the hat matrix;
• $M = I_n - H = \bigl(m_{i,t}\bigr)_{i,t=1,\ldots,n}$: the residual projection matrix;
• $\hat Y = HY = Xb = \bigl(\hat Y_1,\ldots,\hat Y_n\bigr)^\top$: the vector of fitted values;
• $U = MY = Y - \hat Y = \bigl(U_1,\ldots,U_n\bigr)^\top$: the residuals;
• $\mathrm{SS}_e = \lVert U\rVert^2$: the residual sum of squares;
• $\mathrm{MS}_e = \dfrac{1}{n-r}\,\mathrm{SS}_e$: the residual mean square;
• $U^{std} = \bigl(U^{std}_1,\ldots,U^{std}_n\bigr)^\top$: the vector of standardized residuals,
\[
U^{std}_i = \frac{U_i}{\sqrt{\mathrm{MS}_e\, m_{i,i}}}, \qquad i = 1,\ldots,n.
\]
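The quantities listed above are all direct functions of the model matrix and the response. A minimal numpy sketch on an invented toy dataset (full-rank case, so the pseudoinverse reduces to the ordinary inverse):

```python
import numpy as np

x = np.array([0., 1., 2., 3., 4., 10.])
X = np.column_stack([np.ones_like(x), x])     # n x k model matrix, full rank
Y = np.array([1.0, 2.1, 2.9, 4.2, 4.8, 11.5])
n = len(Y)

H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
M = np.eye(n) - H                             # residual projection matrix
Y_hat = H @ Y                                 # fitted values
U = M @ Y                                     # residuals
SSe = U @ U                                   # residual sum of squares
r = int(round(np.trace(H)))                   # rank of X = trace of H
MSe = SSe / (n - r)                           # residual mean square
U_std = U / np.sqrt(MSe * np.diag(M))         # standardized residuals
```

Since the model contains an intercept, the residuals sum to zero; $H$ is idempotent because it is an orthogonal projection onto $\mathcal{M}(X)$.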
The whole chapter deals with the identification of “unusual” observations in a particular dataset. Any probabilistic statements will hence be conditioned on the realized covariate values $X_1 = x_1, \ldots, X_n = x_n$. The same symbol $X$ will be used for the (in general random) model matrix and its realized counterpart, i.e.,
\[
X = \begin{pmatrix} X_1^\top \\ \vdots \\ X_n^\top \end{pmatrix}
= \begin{pmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{pmatrix}.
\]
14.1 Leave-one-out and outlier model
Notation. For a chosen $t \in \{1,\ldots,n\}$, we will use the following notation.
• $Y_{(-t)}$: the vector $Y$ without the $t$th element;
• $x_t$: the $t$th row (understood as a column vector) of the matrix $X$;
• $X_{(-t)}$: the matrix $X$ without the $t$th row;
• $j_t$: the vector $\bigl(0,\ldots,0,1,0,\ldots,0\bigr)^\top$ of length $n$ with $1$ in the $t$th place.
Definition 14.1 Leave-one-out model.
The $t$th leave-one-out model¹ is a linear model
\[
\mathsf{M}_{(-t)}:\; Y_{(-t)} \mid X_{(-t)} \sim \bigl(X_{(-t)}\beta,\;\sigma^2 I_{n-1}\bigr).
\]
Definition 14.2 Outlier model.
The $t$th outlier model² is a linear model
\[
\mathsf{M}^{out}_t:\; Y \mid X \sim \bigl(X\beta + j_t\gamma^{out}_t,\;\sigma^2 I_n\bigr).
\]
Notation (Quantities related to the leave-one-out and outlier models).
• Quantities related to model $\mathsf{M}_{(-t)}$ will be recognized by the subscript $(-t)$, i.e., $b_{(-t)}$, $\hat Y_{(-t)}$, $\mathrm{SS}_{e,(-t)}$, $\mathrm{MS}_{e,(-t)}$, ...
• Quantities related to model $\mathsf{M}^{out}_t$ will be recognized by the subscript $t$ and the superscript $out$, i.e., $b^{out}_t$, $\hat Y^{out}_t$, $\mathrm{SS}^{out}_{e,t}$, $\mathrm{MS}^{out}_{e,t}$, ...
• Solutions to the normal equations in model $\mathsf{M}^{out}_t$ will be denoted as $\bigl(b^{out}_t,\, c^{out}_t\bigr)$.
• If $\gamma^{out}_t$ is an estimable parameter of model $\mathsf{M}^{out}_t$, then its LSE will be denoted as $\hat\gamma^{out}_t$.
Theorem 14.1 Four equivalent statements.
The following four statements are equivalent:
(i) $\mathrm{rank}(X) = \mathrm{rank}\bigl(X_{(-t)}\bigr)$, i.e., $x_t \in \mathcal{M}\bigl(X_{(-t)}^\top\bigr)$;
(ii) $m_{t,t} > 0$;
(iii) $\gamma^{out}_t$ is an estimable parameter of model $\mathsf{M}^{out}_t$;
(iv) $\mu_t := E\bigl(Y_t \mid X_t = x_t\bigr) = x_t^\top\beta$ is an estimable parameter of model $\mathsf{M}_{(-t)}$.
¹ model vynechaného $t$-tého pozorování
² model $t$-tého odlehlého pozorování
Proof.
(ii) ⇔ (i)
• We will show this by showing non(i) ⇔ non(ii).
• non(i) means that $x_t \notin \mathcal{M}\bigl(X_{(-t)}^\top\bigr) \subset \mathcal{M}\bigl(X^\top\bigr)$, i.e., $\mathcal{M}\bigl(X_{(-t)}^\top\bigr) \subset \mathcal{M}\bigl(X^\top\bigr)$ and $\mathcal{M}\bigl(X_{(-t)}^\top\bigr) \ne \mathcal{M}\bigl(X^\top\bigr)$.
⇔ $\mathcal{M}\bigl(X^\top\bigr)^\perp \subset \mathcal{M}\bigl(X_{(-t)}^\top\bigr)^\perp$ and $\mathcal{M}\bigl(X^\top\bigr)^\perp \ne \mathcal{M}\bigl(X_{(-t)}^\top\bigr)^\perp$.
• That is, ⇔ $\exists\, a \in \mathcal{M}\bigl(X_{(-t)}^\top\bigr)^\perp$ such that $a \notin \mathcal{M}\bigl(X^\top\bigr)^\perp$.
⇔ $\exists\, a \in \mathbb{R}^k$ such that $a^\top X_{(-t)}^\top = 0^\top$ and $a^\top X^\top \ne 0^\top$.
⇔ $\exists\, a \in \mathbb{R}^k$ such that $X_{(-t)}a = 0$ and $Xa \ne 0$. It must be
\[
Xa = \bigl(0,\ldots,0,\,c,\,0,\ldots,0\bigr)^\top = c\, j_t \quad\text{for some } c \ne 0.
\]
⇔ $\exists\, a \in \mathbb{R}^k$ such that $Xa = c\, j_t$, $c \ne 0$.
⇔ $j_t \in \mathcal{M}(X)$.
⇔ $M j_t = 0$ ($M j_t$ being the $t$th column of $M$).
⇔ $m_t = 0$.
⇔ $\lVert m_t\rVert^2 = m_{t,t} = 0$.
⇔ non(ii).
Here $m_t$ denotes the $t$th row of $M$ (and also its $t$th column, since $M$ is symmetric).
(iii) ⇔ (i)
• $\gamma^{out}_t = \underbrace{\bigl(0_k^\top,\; 1\bigr)}_{l^\top}\begin{pmatrix}\beta\\ \gamma^{out}_t\end{pmatrix}$.
• $\gamma^{out}_t$ is an estimable parameter of $\mathsf{M}^{out}_t$ ⇔ $l \in \mathcal{M}\bigl(\,(X,\, j_t)^\top\bigr)$.
⇔ $\exists\, a \in \mathbb{R}^n$ such that $\bigl(0^\top,\, 1\bigr) = a^\top\bigl(X,\, j_t\bigr)$.
⇔ $\exists\, a \in \mathbb{R}^n$ such that $0^\top = a^\top X$ and $1 = a^\top j_t$.
⇔ $\exists\, a \in \mathbb{R}^n$ such that $0^\top = a^\top X$ and $a_t = 1$.
⇔ $\exists\, a \in \mathbb{R}^n$ such that $x_t^\top = -\, a_{(-t)}^\top X_{(-t)}$.
⇔ $x_t \in \mathcal{M}\bigl(X_{(-t)}^\top\bigr)$.
⇔ (i).
(iv) ⇔ (i)
• Follows directly from Theorem 2.7. □
k
Theorem 14.2 Equivalence of the outlier model and the leave-one-out model.
1. The residual sums of squares in models $\mathsf{M}_{(-t)}$ and $\mathsf{M}^{out}_t$ are the same, i.e.,
\[
\mathrm{SS}_{e,(-t)} = \mathrm{SS}^{out}_{e,t}.
\]
2. A vector $b_{(-t)}$ solves the normal equations of model $\mathsf{M}_{(-t)}$ if and only if the vector $\bigl(b^{out}_t,\, c^{out}_t\bigr)$ solves the normal equations of model $\mathsf{M}^{out}_t$, where
\[
b^{out}_t = b_{(-t)}, \qquad c^{out}_t = Y_t - x_t^\top b_{(-t)}.
\]
Proof.
A solution to the normal equations minimizes the corresponding sum of squares. The sum of squares to be minimized w.r.t. $\beta$ and $\gamma^{out}_t$ in the outlier model $\mathsf{M}^{out}_t$ is
\[
\mathrm{SS}^{out}_t\bigl(\beta,\,\gamma^{out}_t\bigr) = \bigl\lVert Y - X\beta - j_t\gamma^{out}_t\bigr\rVert^2
\]
(separate the $t$th element of the sum)
\[
= \bigl\lVert Y_{(-t)} - X_{(-t)}\beta\bigr\rVert^2 + \bigl(Y_t - x_t^\top\beta - \gamma^{out}_t\bigr)^2
= \mathrm{SS}_{(-t)}(\beta) + \bigl(Y_t - x_t^\top\beta - \gamma^{out}_t\bigr)^2,
\]
where $\mathrm{SS}_{(-t)}(\beta)$ is the sum of squares to be minimized w.r.t. $\beta$ in the leave-one-out model $\mathsf{M}_{(-t)}$. The term $\bigl(Y_t - x_t^\top\beta - \gamma^{out}_t\bigr)^2$ can, for any $\beta \in \mathbb{R}^k$, be made equal to zero if we, for given $\beta \in \mathbb{R}^k$, take
\[
\gamma^{out}_t = Y_t - x_t^\top\beta.
\]
That is,
(i) $\min_{\beta,\,\gamma^{out}_t} \mathrm{SS}^{out}_t\bigl(\beta,\,\gamma^{out}_t\bigr) = \min_{\beta} \mathrm{SS}_{(-t)}(\beta)$, i.e., $\mathrm{SS}^{out}_{e,t} = \mathrm{SS}_{e,(-t)}$;
(ii) a vector $b_{(-t)} \in \mathbb{R}^k$ minimizes $\mathrm{SS}_{(-t)}(\beta)$ if and only if the vector
\[
\bigl(\underbrace{b_{(-t)}}_{b^{out}_t},\;\underbrace{Y_t - x_t^\top b_{(-t)}}_{c^{out}_t}\bigr) \in \mathbb{R}^{k+1}
\]
minimizes $\mathrm{SS}^{out}_t\bigl(\beta,\,\gamma^{out}_t\bigr)$. □
Notation (Leave-one-out least squares estimators of the response expectations).
If $m_{t,t} > 0$ for all $t = 1,\ldots,n$, we will use the following notation:
\[
\hat Y_{[t]} := x_t^\top b_{(-t)}, \qquad t = 1,\ldots,n,
\]
which is the LSE of the parameter $\mu_t = E\bigl(Y_t \mid X_t = x_t\bigr) = x_t^\top\beta$ based on the leave-one-out model $\mathsf{M}_{(-t)}$;
\[
\hat Y_{[\bullet]} := \bigl(\hat Y_{[1]},\ldots,\hat Y_{[n]}\bigr)^\top,
\]
which is an estimator of the parameter $\mu = \bigl(\mu_1,\ldots,\mu_n\bigr)^\top = E\bigl(Y \mid X\bigr)$, where each element is estimated using the linear model based on the data with the corresponding observation left out.
Calculation of quantities of the outlier and the leave-one-out models
Model $\mathsf{M}^{out}_t$ is a model with an added regressor for model $\mathsf{M}$. Suppose that $m_{t,t} > 0$ for given $t \in \{1,\ldots,n\}$. By applying Lemma 10.1, we can express the LSE of the parameter $\gamma^{out}_t$ as
\[
\hat\gamma^{out}_t = \bigl(j_t^\top M j_t\bigr)^{-}\, j_t^\top U = \bigl(m_{t,t}\bigr)^{-} U_t = \frac{U_t}{m_{t,t}}.
\]
Analogously, other quantities of the outlier model can be expressed using the quantities of model $\mathsf{M}$. Namely,
\[
b^{out}_t = b - \frac{U_t}{m_{t,t}}\bigl(X^\top X\bigr)^{-} x_t, \qquad
\hat Y^{out}_t = \hat Y + \frac{U_t}{m_{t,t}}\, m_t, \qquad
\mathrm{SS}_e - \mathrm{SS}^{out}_{e,t} = \frac{U_t^2}{m_{t,t}} = \mathrm{MS}_e\bigl(U^{std}_t\bigr)^2,
\]
where $m_t$ denotes the $t$th column (and row as well) of the residual projection matrix $M$.
Lemma 14.3 Quantities of the outlier and leave-one-out model expressed using quantities of the original model.
Suppose that for given $t \in \{1,\ldots,n\}$, $m_{t,t} > 0$. The following quantities of the outlier model $\mathsf{M}^{out}_t$ and the leave-one-out model $\mathsf{M}_{(-t)}$ are expressible using the quantities of the original model $\mathsf{M}$ as follows:
\[
\hat\gamma^{out}_t = Y_t - x_t^\top b_{(-t)} = Y_t - \hat Y_{[t]} = \frac{U_t}{m_{t,t}},
\]
\[
b_{(-t)} = b^{out}_t = b - \frac{U_t}{m_{t,t}}\bigl(X^\top X\bigr)^{-} x_t,
\]
\[
\mathrm{SS}_{e,(-t)} = \mathrm{SS}^{out}_{e,t} = \mathrm{SS}_e - \frac{U_t^2}{m_{t,t}} = \mathrm{SS}_e - \mathrm{MS}_e\bigl(U^{std}_t\bigr)^2,
\]
\[
\frac{\mathrm{MS}^{out}_{e,t}}{\mathrm{MS}_e} = \frac{\mathrm{MS}_{e,(-t)}}{\mathrm{MS}_e} = \frac{n - r - \bigl(U^{std}_t\bigr)^2}{n - r - 1}.
\tag{14.1}
\]
Proof. The equality between the quantities of the outlier and the leave-one-out model follows from Theorem 14.2. The remaining expressions follow from the previously conducted calculations.
To see the last equality in (14.1), remember that the residual degrees of freedom of both the outlier and the leave-one-out models are equal to $n - r - 1$. That is, whereas in model $\mathsf{M}$,
\[
\mathrm{MS}_e = \frac{\mathrm{SS}_e}{n - r},
\]
in the outlier and the leave-one-out models,
\[
\mathrm{MS}_{e,(-t)} = \frac{\mathrm{SS}_{e,(-t)}}{n - r - 1} = \frac{\mathrm{SS}^{out}_{e,t}}{n - r - 1} = \mathrm{MS}^{out}_{e,t}. \qquad\Box
\]
Notes.
• The expressions in Lemma 14.3 quantify the influence of the $t$th observation on
(i) the LSE of the vector $\beta$ of the regression coefficients (in case its components are estimable);
(ii) the estimate of the residual variance.
• Lemma 14.3 also shows that it is not necessary to fit $n$ leave-one-out (or outlier) models to calculate their LSE-related quantities. All important quantities can be calculated directly from the LSE-related quantities of the original model $\mathsf{M}$.
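The identities of Lemma 14.3 can be verified numerically: an explicit refit without the $t$th observation must agree with the shortcut formulas based only on the full-model fit. A numpy sketch on invented toy data (full-rank case, so $(X^\top X)^{-} = (X^\top X)^{-1}$):

```python
import numpy as np

x = np.array([0., 1., 2., 3., 4., 10.])
X = np.column_stack([np.ones_like(x), x])
Y = np.array([1.0, 2.1, 2.9, 4.2, 4.8, 11.5])
n, k = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
U = Y - X @ b
m = 1.0 - np.diag(X @ XtX_inv @ X.T)          # m_{t,t} = 1 - h_{t,t}
SSe = U @ U

b_loo = np.empty((n, k))                      # explicit leave-one-out refits
SSe_loo = np.empty(n)
for t in range(n):
    Xm, Ym = np.delete(X, t, axis=0), np.delete(Y, t)
    b_loo[t] = np.linalg.lstsq(Xm, Ym, rcond=None)[0]
    SSe_loo[t] = np.sum((Ym - Xm @ b_loo[t])**2)

# shortcut formulas of Lemma 14.3, no refitting needed
b_short = b - (U / m)[:, None] * (X @ XtX_inv)   # row t: b - (U_t/m_tt)(X'X)^{-1} x_t
SSe_short = SSe - U**2 / m
```

Both pairs of quantities agree to floating point accuracy.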
Definition 14.3 Deleted residual.
If $m_{t,t} > 0$, then the quantity
\[
\hat\gamma^{out}_t = Y_t - \hat Y_{[t]} = \frac{U_t}{m_{t,t}}
\]
is called the $t$th deleted residual of the model $\mathsf{M}$.
14.2 Outliers
By the outliers³ of the model $\mathsf{M}$, we shall understand observations for which the response expectation does not follow the assumed model, i.e., the $t$th observation ($t \in \{1,\ldots,n\}$) is an outlier if
\[
E\bigl(Y_t \mid X_t = x_t\bigr) \ne x_t^\top\beta,
\]
in which case we can write
\[
E\bigl(Y_t \mid X_t = x_t\bigr) = x_t^\top\beta + \gamma^{out}_t.
\]
As such, an outlier can be characterized as an observation with an unusual response ($y$) value.
If $m_{t,t} > 0$, $\gamma^{out}_t$ is an estimable parameter of the $t$th outlier model $\mathsf{M}^{out}_t$ (for which the model $\mathsf{M}$ is a submodel) and the decision on whether the $t$th observation is an outlier can be transferred into a problem of testing
\[
H_0:\; \gamma^{out}_t = 0
\]
in the $t$th outlier model $\mathsf{M}^{out}_t$. Note that the above null hypothesis also expresses the fact that the submodel $\mathsf{M}$ of the model $\mathsf{M}^{out}_t$ holds.
If normality is assumed, this null hypothesis can be tested using a classical t-test on a value of an estimable parameter. The corresponding t-statistic has the standard form
\[
T_t = \frac{\hat\gamma^{out}_t}{\sqrt{\widehat{\mathrm{var}}\bigl(\hat\gamma^{out}_t\bigr)}}
\]
and under the null hypothesis follows the Student t-distribution with $n - r - 1$ degrees of freedom (the residual degrees of freedom of the outlier model). From Section 14.1, we have
\[
\hat\gamma^{out}_t = \frac{U_t}{m_{t,t}} = Y_t - \hat Y_{[t]}.
\]
Hence (the variance is conditional given the covariate values),
\[
\mathrm{var}\bigl(\hat\gamma^{out}_t\bigr)
= \mathrm{var}\Bigl(\frac{U_t}{m_{t,t}}\Bigr)
= \frac{1}{m_{t,t}^2}\,\mathrm{var}\bigl(U_t\bigr)
\overset{(\star)}{=} \frac{1}{m_{t,t}^2}\,\sigma^2\, m_{t,t}
= \frac{\sigma^2}{m_{t,t}}.
\]
The equality $\overset{(\star)}{=}$ holds irrespective of whether $\gamma^{out}_t = 0$ (and model $\mathsf{M}$ holds) or $\gamma^{out}_t \ne 0$ (and model $\mathsf{M}^{out}_t$ holds).
The estimator $\hat\gamma^{out}_t$ is the LSE of a parameter of the outlier model and hence
\[
\widehat{\mathrm{var}}\bigl(\hat\gamma^{out}_t\bigr) = \frac{\mathrm{MS}^{out}_{e,t}}{m_{t,t}},
\]
and finally,
\[
T_t = \frac{\hat\gamma^{out}_t}{\sqrt{\dfrac{\mathrm{MS}^{out}_{e,t}}{m_{t,t}}}}.
\]
Two useful expressions of the statistic $T_t$ are obtained by remembering from Section 14.1 that (a) $\mathrm{MS}^{out}_{e,t} = \mathrm{MS}_{e,(-t)}$ and (b) the two expressions $\hat\gamma^{out}_t = Y_t - \hat Y_{[t]} = \frac{U_t}{m_{t,t}}$, leading to
\[
T_t = \frac{Y_t - \hat Y_{[t]}}{\sqrt{\mathrm{MS}_{e,(-t)}}}\,\sqrt{m_{t,t}}
= \frac{U_t}{\sqrt{\mathrm{MS}_{e,(-t)}\, m_{t,t}}}.
\]
³ odlehlá pozorování
Definition 14.4 Studentized residual.
If $m_{t,t} > 0$, then the quantity
\[
T_t = \frac{Y_t - \hat Y_{[t]}}{\sqrt{\mathrm{MS}_{e,(-t)}}}\,\sqrt{m_{t,t}}
= \frac{U_t}{\sqrt{\mathrm{MS}_{e,(-t)}\, m_{t,t}}}
\]
is called the $t$th studentized residual⁴ of the model $\mathsf{M}$.
Notes.
• Using the last equality in (14.1), we can derive one more expression of the studentized residual using the standardized residual
\[
U^{std}_t = \frac{U_t}{\sqrt{\mathrm{MS}_e\, m_{t,t}}}.
\]
Namely,
\[
T_t = \sqrt{\frac{n - r - 1}{\,n - r - \bigl(U^{std}_t\bigr)^2\,}}\;\; U^{std}_t.
\]
This directly shows that it is not necessary to fit the leave-one-out or the outlier model to calculate the studentized residuals of the initial model $\mathsf{M}$.
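A quick numerical check of the two expressions, on an invented toy dataset (full-rank case):

```python
import numpy as np

x = np.array([0., 1., 2., 3., 4., 10.])
X = np.column_stack([np.ones_like(x), x])
Y = np.array([1.0, 2.1, 2.9, 4.2, 4.8, 11.5])
n, r = X.shape                                  # full rank: r = k = 2

H = X @ np.linalg.inv(X.T @ X) @ X.T
m = 1.0 - np.diag(H)
U = Y - H @ Y
SSe = U @ U
MSe = SSe / (n - r)
U_std = U / np.sqrt(MSe * m)                    # standardized residuals

MSe_loo = (SSe - U**2 / m) / (n - r - 1)        # MS_{e,(-t)} via (14.1)
T_direct = U / np.sqrt(MSe_loo * m)             # definition of the studentized residual
T_via_std = np.sqrt((n - r - 1) / (n - r - U_std**2)) * U_std
```

Both vectors coincide, so the studentized residuals come for free from the full-model fit.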
Theorem 14.4 On studentized residuals.
Let $Y \mid X \sim \mathcal{N}_n\bigl(X\beta,\,\sigma^2 I_n\bigr)$, where $\mathrm{rank}\bigl(X_{n\times k}\bigr) = r \le k < n$. Let further $n > r + 1$ and let, for given $t \in \{1,\ldots,n\}$, $m_{t,t} > 0$. Then
1. The $t$th studentized residual $T_t$ follows the Student t-distribution with $n - r - 1$ degrees of freedom.
2. If additionally $n > r + 2$, then $E\, T_t = 0$.
3. If additionally $n > r + 3$, then $\mathrm{var}\, T_t = \dfrac{n - r - 1}{n - r - 3}$.
Proof. Point 1 follows from the preceding derivations; points 2 and 3 follow from the properties of the Student t-distribution. □
⁴ studentizované reziduum
Test for outliers
The studentized residual $T_t$ of the model $\mathsf{M}$ is the test statistic (with the $t_{n-r-1}$ distribution under the null hypothesis) of the test
\[
H_0:\; \gamma^{out}_t = 0, \qquad H_1:\; \gamma^{out}_t \ne 0
\]
in the $t$th outlier model $\mathsf{M}^{out}_t:\; Y \mid X \sim \mathcal{N}_n\bigl(X\beta + j_t\gamma^{out}_t,\;\sigma^2 I_n\bigr)$.
The above testing problem can also be interpreted as a test of
\[
H_0:\; \text{the $t$th observation is not an outlier of model } \mathsf{M}, \qquad
H_1:\; \text{the $t$th observation is an outlier of model } \mathsf{M},
\]
where “outlier” means an outlier with respect to the model $\mathsf{M}:\; Y \mid X \sim \mathcal{N}_n\bigl(X\beta,\,\sigma^2 I_n\bigr)$:
• The expected value of the $t$th observation is different from that given by model $\mathsf{M}$;
• The observed value of $Y_t$ is unusual under model $\mathsf{M}$.
When performing the test for outliers for all observations in the dataset, we are in fact facing a multiple testing problem, and hence an adjustment of the P-values resulting from the comparison of the studentized residuals with the quantiles of the Student $t_{n-r-1}$ distribution is needed to keep the rate of falsely identified outliers under the requested level $\alpha$. For example, the Bonferroni correction can be used.
Notes.
• Two or more outliers next to each other can hide each other.
• The notion of an outlier is always relative to the considered model (as in other areas of statistics). An observation which is an outlier with respect to one model is not necessarily an outlier with respect to some other model.
• Especially in large datasets, a few outliers are not a problem provided they are not at the same time also influential for the statistical inference (see the next section).
• In the context of a normal linear model, the presence of outliers may indicate that the error distribution has heavier tails than the normal distribution.
• An outlier can also suggest that a particular observation is a data error.
• If some observation is indicated to be an outlier, it should always be explored:
  • Is it a data error? If yes, try to correct it; if this is impossible, it is no problem (under certain assumptions) to exclude it from the data.
  • Is the assumed model correct, and is it possible to find a physical/practical explanation for the occurrence of such an unusual observation?
  • If an explanation is found, are we interested in capturing such artefacts by our model or not?
  • Do the outlier(s) show a serious deviation from the model that cannot be ignored (for the purposes of the particular modelling)?
  • ...
• NEVER, NEVER, NEVER exclude “outliers” from the analysis in an automatic manner.
• Often, identification of outliers with respect to some model is of primary interest:
  • Example: a model for the amount of credit card transactions over a certain period of time depending on some factors (age, gender, income, ...).
  • The model is found to be correct for a “standard” population (of clients).
  • An outlier with respect to such a model ≡ potentially a fraudulent use of the credit card.
• If a closer analysis of the “outliers” suggests that the assumed model does not satisfactorily capture the reality we want to capture (it is not useful), some other model (maybe not linear, maybe not normal) must be looked for.
14.3 Leverage points
By the leverage points⁵ of the model $\mathsf{M}$, we shall understand observations with, in a certain sense, unusual regressor ($x$) values. As will be shown, whether the regressor values of a certain observation are unusual is closely related to the diagonal elements $h_{1,1},\ldots,h_{n,n}$ of the hat matrix $H = X\bigl(X^\top X\bigr)^{-} X^\top$ of the model.
Terminology (Leverage).
A diagonal element ht,t (t = 1, . . . , n) of the hat matrix H is called the leverage of the tth
observation.
Interpretation of the leverage
To show that the leverage expresses how unusual the regressor values of the $t$th observation are, let us consider a linear model with intercept, i.e., the realized model matrix is
\[
X = \bigl(1_n,\; x^{1},\ldots,x^{k-1}\bigr), \qquad\text{where}\quad
x^{1} = \begin{pmatrix} x_{1,1}\\ \vdots\\ x_{n,1}\end{pmatrix}, \;\ldots,\;
x^{k-1} = \begin{pmatrix} x_{1,k-1}\\ \vdots\\ x_{n,k-1}\end{pmatrix}.
\]
Let
\[
\bar x^{1} = \frac{1}{n}\sum_{i=1}^{n} x_{i,1}, \;\ldots,\; \bar x^{k-1} = \frac{1}{n}\sum_{i=1}^{n} x_{i,k-1}
\]
be the means of the non-intercept columns of the model matrix. That is, the vector $\bar x = \bigl(\bar x^{1},\ldots,\bar x^{k-1}\bigr)^\top$ provides the mean values of the non-intercept regressors included in the model matrix $X$ and as such is the gravity centre of the rows of the model matrix $X$ (with the intercept excluded).
Further, let $\widetilde X$ be the non-intercept part of the model matrix $X$ with all columns centered, i.e.,
\[
\widetilde X = \bigl(x^{1} - \bar x^{1} 1_n,\;\ldots,\; x^{k-1} - \bar x^{k-1} 1_n\bigr)
= \begin{pmatrix}
x_{1,1} - \bar x^{1} & \ldots & x_{1,k-1} - \bar x^{k-1}\\
\vdots & \ddots & \vdots\\
x_{n,1} - \bar x^{1} & \ldots & x_{n,k-1} - \bar x^{k-1}
\end{pmatrix}.
\]
Clearly, $\mathcal{M}(X) = \mathcal{M}\bigl(1_n,\,\widetilde X\bigr)$. Hence the hat matrix $H = X\bigl(X^\top X\bigr)^{-} X^\top$ can also be calculated using the matrix $\bigl(1_n,\,\widetilde X\bigr)$, where we can use the additional property $1_n^\top\widetilde X = 0_{k-1}^\top$:
\[
H = \bigl(1_n,\,\widetilde X\bigr)\Bigl\{\bigl(1_n,\,\widetilde X\bigr)^\top\bigl(1_n,\,\widetilde X\bigr)\Bigr\}^{-}\bigl(1_n,\,\widetilde X\bigr)^\top
= \bigl(1_n,\,\widetilde X\bigr)
\begin{pmatrix} n & 0_{k-1}^\top\\ 0_{k-1} & \widetilde X^\top\widetilde X\end{pmatrix}^{-}
\begin{pmatrix} 1_n^\top\\ \widetilde X^\top\end{pmatrix}
\]
\[
= \bigl(1_n,\,\widetilde X\bigr)
\begin{pmatrix} \frac{1}{n} & 0_{k-1}^\top\\ 0_{k-1} & \bigl(\widetilde X^\top\widetilde X\bigr)^{-}\end{pmatrix}
\begin{pmatrix} 1_n^\top\\ \widetilde X^\top\end{pmatrix}
= \frac{1}{n}\, 1_n 1_n^\top + \widetilde X\bigl(\widetilde X^\top\widetilde X\bigr)^{-}\widetilde X^\top.
\]
That is, the $t$th leverage equals
\[
h_{t,t} = \frac{1}{n}
+ \bigl(x_{t,1} - \bar x^{1},\ldots,x_{t,k-1} - \bar x^{k-1}\bigr)\,\bigl(\widetilde X^\top\widetilde X\bigr)^{-}\,\bigl(x_{t,1} - \bar x^{1},\ldots,x_{t,k-1} - \bar x^{k-1}\bigr)^\top.
\]
The second term is a square of a generalized distance between the non-intercept regressors $\bigl(x_{t,1},\ldots,x_{t,k-1}\bigr)$ of the $t$th observation and the vector of mean regressors $\bar x$. Hence the observations with a high value of the leverage $h_{t,t}$ are observations whose regressor values are far from the mean regressor values, and in this sense they have unusual regressor ($x$) values.
⁵ vzdálená pozorování
High value of a leverage
To evaluate which values of the leverage are high enough to call a particular observation a leverage point, let us recall the expression of the hat matrix using an orthonormal basis $Q$ of the regression space $\mathcal{M}(X)$, which is a vector space of dimension $r = \mathrm{rank}(X)$. We know that $H = QQ^\top$ and hence
\[
\sum_{i=1}^{n} h_{i,i} = \mathrm{tr}(H) = \mathrm{tr}\bigl(QQ^\top\bigr) = \mathrm{tr}\bigl(Q^\top Q\bigr) = \mathrm{tr}\bigl(I_r\bigr) = r.
\]
That is,
\[
\bar h = \frac{1}{n}\sum_{i=1}^{n} h_{i,i} = \frac{r}{n}.
\tag{14.2}
\]
Several rules of thumb can be found in the literature and software implementations concerning a lower bound on the leverage for calling a particular observation a leverage point. Owing to (14.2), a reasonable bound is a value suitably higher than $\frac{r}{n}$. For example, the R function influence.measures marks the $t$th observation as a leverage point if
\[
h_{t,t} > \frac{3r}{n}.
\]
Influence of leverage points
The fact that leverage points may constitute a problem for the least squares based statistical inference in a linear model comes from remembering the expression for the variance (conditional given the covariate values) of the residuals of a linear model:
\[
\mathrm{var}\bigl(U_t\bigr) = \sigma^2 m_{t,t} = \sigma^2\bigl(1 - h_{t,t}\bigr), \qquad t = 1,\ldots,n.
\]
Recall that $U_t = Y_t - \hat Y_t$ and hence also
\[
\mathrm{var}\bigl(Y_t - \hat Y_t\bigr) = \sigma^2\bigl(1 - h_{t,t}\bigr), \qquad t = 1,\ldots,n.
\]
That is, $\mathrm{var}(U_t) = \mathrm{var}\bigl(Y_t - \hat Y_t\bigr)$ is low for observations with a high leverage. In other words, the fitted values of high leverage observations are forced to be closer to the observed response values than those of low leverage observations. In this way, the high leverage observations have a higher impact on the fitted regression function than the low leverage observations.
End of Lecture #25 (07/01/2016)
14.4 Influential diagnostics
Start of Lecture #26 (14/01/2016)
Neither outliers nor leverage points necessarily constitute a problem. A problem occurs if they influence the statistical inference of primary interest too much. Also other observations (neither outliers nor leverage points) may harmfully influence the statistical inference. In this section, several methods of quantifying the influence of a particular, $t$th ($t = 1,\ldots,n$) observation on the statistical inference will be introduced. In all cases, we will compare a quantity of primary interest based on the model at hand, i.e.,
\[
\mathsf{M}:\; Y \mid X \sim \bigl(X\beta,\;\sigma^2 I_n\bigr), \qquad \mathrm{rank}\bigl(X_{n\times k}\bigr) = r,
\]
with the quantity based on the leave-one-out model
\[
\mathsf{M}_{(-t)}:\; Y_{(-t)} \mid X_{(-t)} \sim \bigl(X_{(-t)}\beta,\;\sigma^2 I_{n-1}\bigr).
\]
Throughout, it will be assumed that $m_{t,t} > 0$, which implies (see Theorem 14.1) $\mathrm{rank}\bigl(X_{(-t)}\bigr) = \mathrm{rank}(X) = r$.
14.4.1 DFBETAS
Let $r = k$, i.e., both models $\mathsf{M}$ and $\mathsf{M}_{(-t)}$ are full-rank models. The LSEs of the vector of regression coefficients based on the two models are
\[
\mathsf{M}:\quad \hat\beta = \bigl(\hat\beta_0,\ldots,\hat\beta_{k-1}\bigr)^\top = \bigl(X^\top X\bigr)^{-1} X^\top Y,
\]
\[
\mathsf{M}_{(-t)}:\quad \hat\beta_{(-t)} = \bigl(\hat\beta_{(-t),0},\ldots,\hat\beta_{(-t),k-1}\bigr)^\top = \bigl(X_{(-t)}^\top X_{(-t)}\bigr)^{-1} X_{(-t)}^\top Y_{(-t)}.
\]
Using (14.1):
\[
\hat\beta - \hat\beta_{(-t)} = \frac{U_t}{m_{t,t}}\bigl(X^\top X\bigr)^{-1} x_t,
\tag{14.3}
\]
which quantifies the influence of the $t$th observation on the LSE of the regression coefficients. In the following, let $v_0 = \bigl(v_{0,0},\ldots,v_{0,k-1}\bigr)^\top$, ..., $v_{k-1} = \bigl(v_{k-1,0},\ldots,v_{k-1,k-1}\bigr)^\top$ be the rows of the matrix $\bigl(X^\top X\bigr)^{-1}$, i.e.,
\[
\bigl(X^\top X\bigr)^{-1}
= \begin{pmatrix} v_0^\top\\ \vdots\\ v_{k-1}^\top\end{pmatrix}
= \begin{pmatrix} v_{0,0} & \ldots & v_{0,k-1}\\ \vdots & \ddots & \vdots\\ v_{k-1,0} & \ldots & v_{k-1,k-1}\end{pmatrix}.
\]
Expression (14.3) written elementwise leads to quantities called DFBETA:
\[
\mathrm{DFBETA}_{t,j} := \hat\beta_j - \hat\beta_{(-t),j} = \frac{U_t}{m_{t,t}}\; v_j^\top x_t, \qquad t = 1,\ldots,n,\; j = 0,\ldots,k-1.
\]
Note that $\mathrm{DFBETA}_{t,j}$ has the scale of the $j$th regressor. To get a dimensionless quantity, we can divide it by the standard error of either $\hat\beta_j$ or $\hat\beta_{(-t),j}$. We have
\[
\mathrm{S.E.}\bigl(\hat\beta_j\bigr) = \sqrt{\mathrm{MS}_e\, v_{j,j}}, \qquad
\mathrm{S.E.}\bigl(\hat\beta_{(-t),j}\bigr) = \sqrt{\mathrm{MS}_{e,(-t)}\, v_{(-t),j,j}},
\]
where $v_{(-t),j,j}$ is the $j$th diagonal element of the matrix $\bigl(X_{(-t)}^\top X_{(-t)}\bigr)^{-1}$. In practice, a combined quantity, namely $\sqrt{\mathrm{MS}_{e,(-t)}\, v_{j,j}}$, is used, leading to the so called DFBETAS (the last “S” stands for “scaled”):
\[
\mathrm{DFBETAS}_{t,j} := \frac{\hat\beta_j - \hat\beta_{(-t),j}}{\sqrt{\mathrm{MS}_{e,(-t)}\, v_{j,j}}}
= \frac{U_t}{m_{t,t}\,\sqrt{\mathrm{MS}_{e,(-t)}\, v_{j,j}}}\; v_j^\top x_t, \qquad t = 1,\ldots,n,\; j = 0,\ldots,k-1.
\]
The reason for using $\sqrt{\mathrm{MS}_{e,(-t)}\, v_{j,j}}$ as the scale factor is that $\mathrm{MS}_{e,(-t)}$ is a safer estimator of the residual variance $\sigma^2$, not being based on the observation whose influence is examined, but at the same time it can still be calculated from quantities of the full model $\mathsf{M}$ (see Eq. 14.1). On the other hand, a value of $v_{(-t),j,j}$ (which would fit with the leave-one-out residual mean square $\mathrm{MS}_{e,(-t)}$) cannot, in general, be calculated from quantities of the full model $\mathsf{M}$, and hence the (close) value $v_{j,j}$ is used. Consequently, all values of DFBETAS can be calculated from quantities of the full model $\mathsf{M}$ and there is no need to fit $n$ leave-one-out models.
Note (Rule-of-thumb used by R).
The R function influence.measures marks the $t$th observation as being influential with respect to the LSE of the $j$th regression coefficient if
\[
\bigl|\mathrm{DFBETAS}_{t,j}\bigr| > 1.
\]
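A numpy sketch on invented toy data, computing the whole DFBETA/DFBETAS matrix from the full-model fit and cross-checking one row against an explicit refit:

```python
import numpy as np

x = np.array([0., 1., 2., 3., 4., 10.])
X = np.column_stack([np.ones_like(x), x])
Y = np.array([1.0, 2.1, 2.9, 4.2, 4.8, 11.5])
n, k = X.shape

V = np.linalg.inv(X.T @ X)                     # rows v_0, ..., v_{k-1}
beta_hat = V @ X.T @ Y
U = Y - X @ beta_hat
m = 1.0 - np.diag(X @ V @ X.T)
MSe_loo = ((U @ U) - U**2 / m) / (n - k - 1)   # MS_{e,(-t)} via (14.1)

dfbeta = (U / m)[:, None] * (X @ V)            # row t: (U_t/m_tt) (X'X)^{-1} x_t
dfbetas = dfbeta / np.sqrt(MSe_loo[:, None] * np.diag(V)[None, :])

# cross-check the last observation against an explicit leave-one-out refit
t = n - 1
b_loo = np.linalg.lstsq(np.delete(X, t, 0), np.delete(Y, t), rcond=None)[0]
```

The row `dfbeta[t]` reproduces $\hat\beta - \hat\beta_{(-t)}$ without refitting.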
14.4.2 DFFITS
We are assuming $m_{t,t} > 0$ and hence, by Theorem 14.1, the parameter $\mu_t := E\bigl(Y_t \mid X_t = x_t\bigr) = x_t^\top\beta$ is estimable in both models $\mathsf{M}$ and $\mathsf{M}_{(-t)}$. Let, as usual, $b = \bigl(X^\top X\bigr)^{-} X^\top Y$ be any solution to the normal equations in model $\mathsf{M}$ (which is now not necessarily of full rank) and let $b_{(-t)} = \bigl(X_{(-t)}^\top X_{(-t)}\bigr)^{-} X_{(-t)}^\top Y_{(-t)}$ be any solution to the normal equations in the leave-one-out model $\mathsf{M}_{(-t)}$. The LSEs of $\mu_t$ in the two models are
\[
\mathsf{M}:\quad \hat Y_t = x_t^\top b, \qquad
\mathsf{M}_{(-t)}:\quad \hat Y_{[t]} = x_t^\top b_{(-t)}.
\]
Using (14.1):
\[
\hat Y_{[t]} = x_t^\top\Bigl\{b - \frac{U_t}{m_{t,t}}\bigl(X^\top X\bigr)^{-} x_t\Bigr\}
= \hat Y_t - \frac{U_t}{m_{t,t}}\; x_t^\top\bigl(X^\top X\bigr)^{-} x_t
= \hat Y_t - U_t\,\frac{h_{t,t}}{m_{t,t}}.
\]
The difference between $\hat Y_t$ and $\hat Y_{[t]}$ is called DFFIT and quantifies the influence of the $t$th observation on the LSE of its own expectation:
\[
\mathrm{DFFIT}_t := \hat Y_t - \hat Y_{[t]} = U_t\,\frac{h_{t,t}}{m_{t,t}}, \qquad t = 1,\ldots,n.
\]
Analogously to DFBETAS, DFFIT is also scaled by a quantity that resembles the standard error of either $\hat Y_t$ or $\hat Y_{[t]}$ (remember, $\mathrm{S.E.}\bigl(\hat Y_t\bigr) = \sqrt{\mathrm{MS}_e\, h_{t,t}}$), leading to a quantity called DFFITS:
\[
\mathrm{DFFITS}_t := \frac{\hat Y_t - \hat Y_{[t]}}{\sqrt{\mathrm{MS}_{e,(-t)}\, h_{t,t}}}
= \frac{U_t}{\sqrt{\mathrm{MS}_{e,(-t)}\, h_{t,t}}}\cdot\frac{h_{t,t}}{m_{t,t}}
= \frac{U_t}{\sqrt{\mathrm{MS}_{e,(-t)}\, m_{t,t}}}\,\sqrt{\frac{h_{t,t}}{m_{t,t}}}
= \sqrt{\frac{h_{t,t}}{m_{t,t}}}\; T_t, \qquad t = 1,\ldots,n,
\]
where $T_t$ is the $t$th studentized residual of the model $\mathsf{M}$. Again, all values of DFFITS can be calculated from quantities of the full model $\mathsf{M}$ and there is no need to fit $n$ leave-one-out models.
Note (Rule-of-thumb used by R).
The R function influence.measures marks the $t$th observation as excessively influencing the LSE of its own expectation if
\[
\bigl|\mathrm{DFFITS}_t\bigr| > 3\,\sqrt{\frac{r}{n-r}}.
\]
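A numpy sketch on invented toy data, verifying that the scaled difference of fitted values equals $\sqrt{h_{t,t}/m_{t,t}}\, T_t$:

```python
import numpy as np

x = np.array([0., 1., 2., 3., 4., 10.])
X = np.column_stack([np.ones_like(x), x])
Y = np.array([1.0, 2.1, 2.9, 4.2, 4.8, 11.5])
n, r = X.shape

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
m = 1.0 - h
U = Y - H @ Y
SSe = U @ U
MSe_loo = (SSe - U**2 / m) / (n - r - 1)
T = U / np.sqrt(MSe_loo * m)                   # studentized residuals

dffit = U * h / m                              # Yhat_t - Yhat_[t]
dffits = dffit / np.sqrt(MSe_loo * h)
dffits_via_T = np.sqrt(h / m) * T
```

Both vectors coincide, so DFFITS too is obtained directly from the full-model fit.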
14.4.3 Cook distance
In this section, we concentrate on evaluating the influence of the $t$th observation on the LSE of the vector parameter $\mu := E\bigl(Y \mid X\bigr) = X\beta$. As in Section 14.4.2, let $b = \bigl(X^\top X\bigr)^{-} X^\top Y$ be any solution to the normal equations in model $\mathsf{M}$ and let $b_{(-t)} = \bigl(X_{(-t)}^\top X_{(-t)}\bigr)^{-} X_{(-t)}^\top Y_{(-t)}$ be any solution to the normal equations in the leave-one-out model $\mathsf{M}_{(-t)}$. The LSEs of $\mu$ in the two models are
\[
\mathsf{M}:\quad \hat Y = Xb = HY, \qquad
\mathsf{M}_{(-t)}:\quad \hat Y_{(-t\bullet)} := Xb_{(-t)}.
\]
Note. Remember that $\hat Y_{(-t\bullet)}$, $\hat Y_{[\bullet]}$ and $\hat Y_{(-t)}$ are three different quantities. Namely,
\[
\hat Y_{(-t\bullet)} = Xb_{(-t)} = \begin{pmatrix} x_1^\top b_{(-t)}\\ \vdots\\ x_n^\top b_{(-t)}\end{pmatrix}, \qquad
\hat Y_{[\bullet]} = \begin{pmatrix}\hat Y_{[1]}\\ \vdots\\ \hat Y_{[n]}\end{pmatrix}
= \begin{pmatrix} x_1^\top b_{(-1)}\\ \vdots\\ x_n^\top b_{(-n)}\end{pmatrix}.
\]
Finally, $\hat Y_{(-t)} = X_{(-t)} b_{(-t)}$ is a subvector of length $n-1$ of the vector $\hat Y_{(-t\bullet)}$ of length $n$.
A possible quantification of the influence of the $t$th observation on the LSE of the vector parameter $\mu$ is obtained by considering the quantity $\bigl\lVert\hat Y - \hat Y_{(-t\bullet)}\bigr\rVert^2$. Let us recall from Lemma 14.3:
\[
b - b_{(-t)} = \frac{U_t}{m_{t,t}}\bigl(X^\top X\bigr)^{-} x_t.
\]
Hence,
\[
\hat Y - \hat Y_{(-t\bullet)} = X\bigl(b - b_{(-t)}\bigr) = \frac{U_t}{m_{t,t}}\; X\bigl(X^\top X\bigr)^{-} x_t.
\]
Then
\[
\bigl\lVert\hat Y - \hat Y_{(-t\bullet)}\bigr\rVert^2
= \frac{U_t^2}{m_{t,t}^2}\; x_t^\top\bigl(X^\top X\bigr)^{-} X^\top X\bigl(X^\top X\bigr)^{-} x_t
= \frac{U_t^2}{m_{t,t}^2}\; h_{t,t}.
\tag{14.4}
\]
The equality (14.4) follows from noting that
(a) $x_t^\top\bigl(X^\top X\bigr)^{-} X^\top X\bigl(X^\top X\bigr)^{-} x_t$ is the $t$th diagonal element of the matrix $X\bigl(X^\top X\bigr)^{-} X^\top X\bigl(X^\top X\bigr)^{-} X^\top$;
(b) $X\bigl(X^\top X\bigr)^{-} X^\top X\bigl(X^\top X\bigr)^{-} X^\top = X\bigl(X^\top X\bigr)^{-} X^\top = H$ by the five matrices rule (Theorem A.2).
The so called Cook distance of the $t$th observation is (14.4) modified to get a unit-free quantity. Namely, the Cook distance is defined as
\[
D_t := \frac{1}{r\,\mathrm{MS}_e}\,\bigl\lVert\hat Y - \hat Y_{(-t\bullet)}\bigr\rVert^2.
\]
Expression (14.4) shows that it is again not necessary to fit the leave-one-out model to calculate the Cook distance. Moreover, we can express it as follows:
\[
D_t = \frac{1}{r}\,\frac{h_{t,t}}{m_{t,t}}\,\frac{U_t^2}{\mathrm{MS}_e\, m_{t,t}}
= \frac{1}{r}\,\frac{h_{t,t}}{m_{t,t}}\,\bigl(U^{std}_t\bigr)^2.
\]
r mt,t
Notes.
• We are assuming $m_{t,t} > 0$. Hence $h_{t,t} = 1 - m_{t,t} \in (0,1)$ and the term $h_{t,t}/m_{t,t}$ increases with the leverage $h_{t,t}$ (having a limit of $\infty$ as $h_{t,t}\to 1$). The “$h_{t,t}/m_{t,t}$” part of the Cook distance thus quantifies how much the $t$th observation is a leverage point.
• The “$\bigl(U^{std}_t\bigr)^2$” part of the Cook distance increases with the distance between the observed and the fitted value, which is high for outliers.
• The Cook distance is thus a combined measure, being high for observations which are either leverage points or outliers or both.
Cook distance in a full-rank model
If $r = k$ and both $\mathsf{M}$ and $\mathsf{M}_{(-t)}$ are of full rank, we have
\[
b = \hat\beta = \bigl(X^\top X\bigr)^{-1} X^\top Y, \qquad
b_{(-t)} = \hat\beta_{(-t)} = \bigl(X_{(-t)}^\top X_{(-t)}\bigr)^{-1} X_{(-t)}^\top Y_{(-t)}.
\]
Then, directly from the definition,
\[
\bigl\lVert\hat Y - \hat Y_{(-t\bullet)}\bigr\rVert^2
= \bigl\lVert X\hat\beta - X\hat\beta_{(-t)}\bigr\rVert^2
= \bigl(\hat\beta_{(-t)} - \hat\beta\bigr)^\top X^\top X\bigl(\hat\beta_{(-t)} - \hat\beta\bigr).
\]
The Cook distance is then
\[
D_t = \frac{\bigl(\hat\beta_{(-t)} - \hat\beta\bigr)^\top X^\top X\bigl(\hat\beta_{(-t)} - \hat\beta\bigr)}{k\,\mathrm{MS}_e},
\]
which is a distance between $\hat\beta$ and $\hat\beta_{(-t)}$ in a certain metric.
Remember now that, under normality, the confidence region for the parameter $\beta$ with coverage $1-\alpha$, derived while assuming model $\mathsf{M}$, is
\[
\mathcal{C}(\alpha) = \Bigl\{\beta:\;\bigl(\beta - \hat\beta\bigr)^\top X^\top X\bigl(\beta - \hat\beta\bigr) < k\,\mathrm{MS}_e\, F_{k,n-k}(1-\alpha)\Bigr\}.
\]
That is,
\[
\hat\beta_{(-t)} \in \mathcal{C}(\alpha) \quad\text{if and only if}\quad D_t < F_{k,n-k}(1-\alpha).
\tag{14.5}
\]
This motivates the following rule-of-thumb.
Note (Rule-of-thumb used by R).
The R function influence.measures marks the $t$th observation as excessively influencing the LSE of the full response expectation $\mu$ if
\[
D_t > F_{r,n-r}(0.50).
\]
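A numpy sketch on invented toy data, computing the Cook distances from the full-model fit and cross-checking one observation against the definition with an explicit refit:

```python
import numpy as np

x = np.array([0., 1., 2., 3., 4., 10.])
X = np.column_stack([np.ones_like(x), x])
Y = np.array([1.0, 2.1, 2.9, 4.2, 4.8, 11.5])
n, r = X.shape

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
m = 1.0 - h
U = Y - H @ Y
MSe = (U @ U) / (n - r)
U_std = U / np.sqrt(MSe * m)

D = (h / m) * U_std**2 / r                     # Cook distance from full-model quantities

# cross-check the last observation against ||Yhat - Yhat_(-t.)||^2 / (r MSe)
t = n - 1
b_loo = np.linalg.lstsq(np.delete(X, t, 0), np.delete(Y, t), rcond=None)[0]
D_direct = np.sum((H @ Y - X @ b_loo)**2) / (r * MSe)
```

Both computations of $D_t$ agree, illustrating (14.4).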
14.4.4 COVRATIO
In this section, we will again assume full-rank models ($r = k$) and explore the influence of the $t$th observation on the precision of the LSE of the vector of regression coefficients. The LSEs of the vector of regression coefficients based on the two models are
\[
\mathsf{M}:\quad \hat\beta = \bigl(X^\top X\bigr)^{-1} X^\top Y, \qquad
\mathsf{M}_{(-t)}:\quad \hat\beta_{(-t)} = \bigl(X_{(-t)}^\top X_{(-t)}\bigr)^{-1} X_{(-t)}^\top Y_{(-t)}.
\]
The estimated covariance matrices of $\hat\beta$ and $\hat\beta_{(-t)}$, respectively, are
\[
\widehat{\mathrm{var}}\bigl(\hat\beta\bigr) = \mathrm{MS}_e\bigl(X^\top X\bigr)^{-1}, \qquad
\widehat{\mathrm{var}}\bigl(\hat\beta_{(-t)}\bigr) = \mathrm{MS}_{e,(-t)}\bigl(X_{(-t)}^\top X_{(-t)}\bigr)^{-1}.
\]
The influence of the $t$th observation on the precision of the LSE of the vector of regression coefficients is quantified by the so called COVRATIO, defined as
\[
\mathrm{COVRATIO}_t = \frac{\det\Bigl\{\widehat{\mathrm{var}}\bigl(\hat\beta_{(-t)}\bigr)\Bigr\}}{\det\Bigl\{\widehat{\mathrm{var}}\bigl(\hat\beta\bigr)\Bigr\}}, \qquad t = 1,\ldots,n.
\]
After some calculation (see below), it can be shown that
\[
\mathrm{COVRATIO}_t = \frac{1}{m_{t,t}}\Biggl(\frac{n - k - \bigl(U^{std}_t\bigr)^2}{n - k - 1}\Biggr)^{k}, \qquad t = 1,\ldots,n.
\]
That is, it is again not necessary to fit $n$ leave-one-out models to calculate the COVRATIO values for all observations in the dataset.
Note (Rule-of-thumb used by R).
The R function influence.measures marks the $t$th observation as excessively influencing the precision of the estimation of the regression coefficients if
\[
\bigl|1 - \mathrm{COVRATIO}_t\bigr| > \frac{3k}{n-k}.
\]
Calculation towards COVRATIO

First, recall a matrix identity (e.g., Anděl, 2007, Theorem A.4): if $A$ and $D$ are square invertible matrices, then

$$\begin{vmatrix} A & B \\ C & D \end{vmatrix} = |A| \cdot \bigl|D - C A^{-1} B\bigr| = |D| \cdot \bigl|A - B D^{-1} C\bigr|.$$

Use the above identity twice:

$$\begin{vmatrix} X^\top X & x_t \\ x_t^\top & 1 \end{vmatrix} = \bigl|X^\top X\bigr| \cdot \underbrace{\bigl(1 - x_t^\top (X^\top X)^{-1} x_t\bigr)}_{1 - h_{t,t} \,=\, m_{t,t}} = \bigl|X^\top X\bigr|\, m_{t,t}$$

$$\phantom{\begin{vmatrix} X^\top X & x_t \\ x_t^\top & 1 \end{vmatrix}} = |1| \cdot \bigl|X^\top X - x_t x_t^\top\bigr| = \bigl|X_{(-t)}^\top X_{(-t)}\bigr|.$$

So that,

$$m_{t,t}\, \bigl|X^\top X\bigr| = \bigl|X_{(-t)}^\top X_{(-t)}\bigr|.$$

Then,

$$\frac{\det\bigl\{\widehat{\mathrm{var}}\bigl(\widehat{\beta}_{(-t)}\bigr)\bigr\}}{\det\bigl\{\widehat{\mathrm{var}}\bigl(\widehat{\beta}\bigr)\bigr\}} = \frac{\det\bigl\{\mathsf{MS}_{e,(-t)} \bigl(X_{(-t)}^\top X_{(-t)}\bigr)^{-1}\bigr\}}{\det\bigl\{\mathsf{MS}_e \bigl(X^\top X\bigr)^{-1}\bigr\}} = \left(\frac{\mathsf{MS}_{e,(-t)}}{\mathsf{MS}_e}\right)^{k} \cdot \frac{\bigl|\bigl(X_{(-t)}^\top X_{(-t)}\bigr)^{-1}\bigr|}{\bigl|\bigl(X^\top X\bigr)^{-1}\bigr|} = \left(\frac{\mathsf{MS}_{e,(-t)}}{\mathsf{MS}_e}\right)^{k} \cdot \frac{1}{m_{t,t}}.$$

Expression (14.1):

$$\frac{\mathsf{MS}_{e,(-t)}}{\mathsf{MS}_e} = \frac{n - k - \bigl(U_t^{std}\bigr)^2}{n - k - 1}.$$

Hence,

$$\frac{\det\bigl\{\widehat{\mathrm{var}}\bigl(\widehat{\beta}_{(-t)}\bigr)\bigr\}}{\det\bigl\{\widehat{\mathrm{var}}\bigl(\widehat{\beta}\bigr)\bigr\}} = \frac{1}{m_{t,t}} \left(\frac{n - k - \bigl(U_t^{std}\bigr)^2}{n - k - 1}\right)^{k}.$$
14.4.5 Final remarks

• All presented influence measures should be used sensibly.
• Depending on the purpose of the modelling, different types of influence are differently harmful.
• There is certainly no need to panic if some observations are marked as “influential”!
End of Lecture #26 (14/01/2016)
Appendix A
Matrices

A.1 Pseudoinverse of a matrix
Definition A.1 Pseudoinverse of a matrix.
The pseudoinverse of a real matrix $A_{n \times k}$ is a matrix $A^-$ of dimension $k \times n$ that satisfies
$$A A^- A = A.$$
Notes.
• The pseudoinverse always exists. Nevertheless, it is not necessarily unique.
• If A is invertible then A− = A−1 is the only pseudoinverse.
Definition A.2 Moore-Penrose pseudoinverse of a matrix.
The Moore-Penrose pseudoinverse of a real matrix An×k is such a matrix A+ of dimension k × n
that satisfies the following conditions:
(i) $A A^+ A = A$;
(ii) $A^+ A A^+ = A^+$;
(iii) $\bigl(A A^+\bigr)^\top = A A^+$;
(iv) $\bigl(A^+ A\bigr)^\top = A^+ A$.
Notes.
• The Moore-Penrose pseudoinverse always exists and it is unique.
• The Moore-Penrose pseudoinverse can be calculated from the singular value decomposition (SVD)
of the matrix A.
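The four conditions are easy to check numerically. The following Python/NumPy sketch (an illustration, not part of the original notes) builds a rank-deficient matrix, obtains its Moore-Penrose pseudoinverse from the SVD via numpy.linalg.pinv, and verifies conditions (i)–(iv).

```python
import numpy as np

rng = np.random.default_rng(1)
# A rank-deficient 5x3 matrix (rank 2), so a pseudoinverse is genuinely needed
A = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 3))

Aplus = np.linalg.pinv(A)   # Moore-Penrose pseudoinverse, computed from the SVD

# The four Moore-Penrose conditions (i)-(iv)
assert np.allclose(A @ Aplus @ A, A)                 # (i)
assert np.allclose(Aplus @ A @ Aplus, Aplus)         # (ii)
assert np.allclose((A @ Aplus).T, A @ Aplus)         # (iii)
assert np.allclose((Aplus @ A).T, Aplus @ A)         # (iv)
```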
Start of Lecture #2 (08/10/2015)
Theorem A.1 Pseudoinverse of a matrix and a solution of a linear system.
Let $A_{n \times k}$ be a real matrix and let $c_{n \times 1}$ be a real vector. Let there exist a solution of the linear system $A x = c$, i.e., let the linear system $A x = c$ be consistent. Let $A^-$ be a pseudoinverse of $A$.
Then the vector
$$x = A^- c$$
solves the linear system $A x = c$.
Proof. See Anděl (2007, Appendix A.4). □
Theorem A.2 Five matrices rule.
For a real matrix $A_{n \times k}$, it holds that
$$A \bigl(A^\top A\bigr)^- A^\top A = A.$$
That is, the matrix $\bigl(A^\top A\bigr)^- A^\top$ is a pseudoinverse of the matrix $A$.

Proof. See Anděl (2007, Theorem A.19). □
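Both Theorem A.1 and the five matrices rule can be illustrated numerically. In this Python/NumPy sketch (synthetic data, illustrative names), a pseudoinverse of $A^\top A$ is taken to be its Moore-Penrose pseudoinverse, and the rule $A (A^\top A)^- A^\top A = A$ is checked for a rank-deficient $A$, together with the fact that $x = G c$ solves a consistent system for the resulting pseudoinverse $G = (A^\top A)^- A^\top$.

```python
import numpy as np

rng = np.random.default_rng(2)
# Rank-deficient A: A'A is singular, so a genuine pseudoinverse is needed
A = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 4))

AtA_pinv = np.linalg.pinv(A.T @ A)   # one particular pseudoinverse of A'A
G = AtA_pinv @ A.T                   # candidate pseudoinverse of A

# Five matrices rule: A (A'A)^- A'A = A, hence A G A = A
assert np.allclose(A @ AtA_pinv @ (A.T @ A), A)
assert np.allclose(A @ G @ A, A)

# Theorem A.1: for a consistent system A x = c, x = G c is a solution
c = A @ rng.normal(size=4)           # c lies in the column space of A
x = G @ c
assert np.allclose(A @ x, c)
```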
End of Lecture #2 (08/10/2015)

A.2 Kronecker product

Start of Lecture #7 (22/10/2015)
Definition A.3 Kronecker product.
Let $A_{m \times n}$ and $C_{p \times q}$ be real matrices. Their Kronecker product $A \otimes C$ is a matrix $D_{mp \times nq}$ such that
$$D = A \otimes C = \begin{pmatrix} a_{1,1} C & \cdots & a_{1,n} C \\ \vdots & & \vdots \\ a_{m,1} C & \cdots & a_{m,n} C \end{pmatrix} = \bigl(a_{i,j}\, C\bigr)_{i = 1, \dots, m,\; j = 1, \dots, n}.$$
Note. For $a \in \mathbb{R}^m$, $b \in \mathbb{R}^p$, we can write $a\, b^\top = a \otimes b^\top$.
Theorem A.3 Properties of a Kronecker product.
It holds for the Kronecker product:
(i) $0 \otimes A = 0$, $A \otimes 0 = 0$.
(ii) $(A_1 + A_2) \otimes C = (A_1 \otimes C) + (A_2 \otimes C)$.
(iii) $A \otimes (C_1 + C_2) = (A \otimes C_1) + (A \otimes C_2)$.
(iv) $a A \otimes c C = a\, c\, (A \otimes C)$.
(v) $A_1 A_2 \otimes C_1 C_2 = (A_1 \otimes C_1)(A_2 \otimes C_2)$.
(vi) $\bigl(A \otimes C\bigr)^{-1} = A^{-1} \otimes C^{-1}$, if the inverses exist.
(vii) $\bigl(A \otimes C\bigr)^- = A^- \otimes C^-$, for arbitrary pseudoinverses.
(viii) $\bigl(A \otimes C\bigr)^\top = A^\top \otimes C^\top$.
(ix) $\bigl(A,\, C\bigr) \otimes D = \bigl(A \otimes D,\, C \otimes D\bigr)$.
(x) Upon a suitable reordering of the columns, the matrices $\bigl(A \otimes C,\, A \otimes D\bigr)$ and $A \otimes \bigl(C,\, D\bigr)$ are the same.
(xi) $\mathrm{rank}\bigl(A \otimes C\bigr) = \mathrm{rank}\bigl(A\bigr)\, \mathrm{rank}\bigl(C\bigr)$.

Proof. See Rao (1973, Section 1b.8). □
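Several of these properties are convenient to verify numerically with numpy.kron. The following sketch (an illustration on random matrices, not part of the original notes) checks the mixed-product property (v), the transpose rule (viii), and the rank rule (xi).

```python
import numpy as np

rng = np.random.default_rng(3)
A1, A2 = rng.normal(size=(2, 3)), rng.normal(size=(3, 2))
C1, C2 = rng.normal(size=(4, 2)), rng.normal(size=(2, 4))

# (v) mixed-product property: (A1 A2) kron (C1 C2) = (A1 kron C1)(A2 kron C2)
lhs = np.kron(A1 @ A2, C1 @ C2)
rhs = np.kron(A1, C1) @ np.kron(A2, C2)
assert np.allclose(lhs, rhs)

# (viii) transpose and (xi) rank
A, C = rng.normal(size=(3, 5)), rng.normal(size=(2, 2))
assert np.allclose(np.kron(A, C).T, np.kron(A.T, C.T))
assert np.linalg.matrix_rank(np.kron(A, C)) == \
       np.linalg.matrix_rank(A) * np.linalg.matrix_rank(C)
```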
End of Lecture #7 (22/10/2015)

Start of Lecture #11 (05/11/2015)
Definition A.4 Elementwise product of two vectors.
Let $a = \bigl(a_1, \dots, a_p\bigr)^\top \in \mathbb{R}^p$, $c = \bigl(c_1, \dots, c_p\bigr)^\top \in \mathbb{R}^p$. Their elementwise product¹ is the vector $\bigl(a_1 c_1, \dots, a_p c_p\bigr)^\top$, which will be denoted as $a : c$. That is,
$$a : c = \begin{pmatrix} a_1 c_1 \\ \vdots \\ a_p c_p \end{pmatrix}.$$
Definition A.5 Columnwise product of two matrices.
Let
$$A_{n \times p} = \bigl(a_1, \dots, a_p\bigr) \quad\text{and}\quad C_{n \times q} = \bigl(c_1, \dots, c_q\bigr)$$
be real matrices. Their columnwise product² $A : C$ is a matrix $D_{n \times pq}$ such that
$$D = A : C = \bigl(a_1 : c_1,\; \dots,\; a_p : c_1,\; \dots,\; a_1 : c_q,\; \dots,\; a_p : c_q\bigr).$$
Notes.
• If we write
$$A = \begin{pmatrix} a_1^\top \\ \vdots \\ a_n^\top \end{pmatrix}, \qquad C = \begin{pmatrix} c_1^\top \\ \vdots \\ c_n^\top \end{pmatrix},$$
the columnwise product of the two matrices can also be written as a matrix whose rows are obtained as Kronecker products of the rows of the two matrices:
$$A : C = \begin{pmatrix} c_1^\top \otimes a_1^\top \\ \vdots \\ c_n^\top \otimes a_n^\top \end{pmatrix}. \tag{A.1}$$
• It perhaps looks more logical to define the columnwise product of two matrices as
$$A : C = \begin{pmatrix} a_1^\top \otimes c_1^\top \\ \vdots \\ a_n^\top \otimes c_n^\top \end{pmatrix} = \bigl(a_1 : c_1,\; \dots,\; a_1 : c_q,\; \dots,\; a_p : c_1,\; \dots,\; a_p : c_q\bigr),$$
which differs only by the ordering of the columns of the resulting matrix. Our definition (A.1) is motivated by the way in which the operator : acts in the R software.
¹ In Czech: součin po složkách
² In Czech: součin po sloupcích
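As a small illustration (not part of the original notes), the columnwise product in the ordering of definition (A.1) can be implemented in Python by taking row-wise Kronecker products; the helper name colwise_product is ours.

```python
import numpy as np

def colwise_product(A, C):
    """Columnwise product A : C as in (A.1): row i is kron(c_i', a_i')."""
    n = A.shape[0]
    return np.vstack([np.kron(C[i], A[i]) for i in range(n)])

A = np.array([[1., 2.],
              [3., 4.]])          # n = 2, p = 2, columns a_1, a_2
C = np.array([[10., 100.],
              [10., 100.]])       # n = 2, q = 2, columns c_1, c_2

D = colwise_product(A, C)
# Columns come out as (a_1:c_1, a_2:c_1, a_1:c_2, a_2:c_2)
expected = np.array([[10., 20., 100., 200.],
                     [30., 40., 300., 400.]])
assert np.allclose(D, expected)
```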
End of Lecture #11 (05/11/2015)

A.3
Theorem A.4 Inverse of a matrix divided into blocks.
Let
$$M = \begin{pmatrix} A & B \\ B^\top & D \end{pmatrix}$$
be a positive definite matrix divided into blocks $A$, $B$, $D$.
Then the following holds:
(i) The matrix $Q = A - B D^{-1} B^\top$ is positive definite.
(ii) The matrix $P = D - B^\top A^{-1} B$ is positive definite.
(iii) The inverse of $M$ is
$$M^{-1} = \begin{pmatrix} Q^{-1} & -\,Q^{-1} B D^{-1} \\ -\,D^{-1} B^\top Q^{-1} & D^{-1} + D^{-1} B^\top Q^{-1} B D^{-1} \end{pmatrix} = \begin{pmatrix} A^{-1} + A^{-1} B P^{-1} B^\top A^{-1} & -\,A^{-1} B P^{-1} \\ -\,P^{-1} B^\top A^{-1} & P^{-1} \end{pmatrix}.$$

Proof. See Anděl (2007, Theorem A.10 in Appendix A.2). □
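The two Schur-complement forms of the block inverse are easy to check numerically. The following Python/NumPy sketch (random positive definite matrix, illustrative names) verifies claims (i)–(iii).

```python
import numpy as np

rng = np.random.default_rng(4)
L = rng.normal(size=(5, 5))
M = L @ L.T + 5 * np.eye(5)          # a positive definite 5x5 matrix
A, B, D = M[:3, :3], M[:3, 3:], M[3:, 3:]

Q = A - B @ np.linalg.inv(D) @ B.T   # Schur complement of D
P = D - B.T @ np.linalg.inv(A) @ B   # Schur complement of A
assert np.all(np.linalg.eigvalsh(Q) > 0) and np.all(np.linalg.eigvalsh(P) > 0)

Qi, Di, Ai, Pi = map(np.linalg.inv, (Q, D, A, P))
Minv = np.block([[Qi,               -Qi @ B @ Di],
                 [-Di @ B.T @ Qi,   Di + Di @ B.T @ Qi @ B @ Di]])
assert np.allclose(Minv, np.linalg.inv(M))

Minv2 = np.block([[Ai + Ai @ B @ Pi @ B.T @ Ai, -Ai @ B @ Pi],
                  [-Pi @ B.T @ Ai,               Pi]])
assert np.allclose(Minv2, np.linalg.inv(M))
```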
Appendix B
Distributions

B.1 Non-central univariate distributions
Definition B.1 Non-central Student t-distribution.
Let $U \sim \mathcal{N}(0, 1)$, let $V \sim \chi^2_\nu$ for some $\nu > 0$ and let $U$ and $V$ be independent. Let $\lambda \in \mathbb{R}$. Then we say that the random variable
$$T = \frac{U + \lambda}{\sqrt{\dfrac{V}{\nu}}}$$
follows a non-central Student $t$-distribution¹ with $\nu$ degrees of freedom² and a non-centrality parameter³ $\lambda$. We shall write $T \sim t_\nu(\lambda)$.
Notes.
• The non-central $t$-distribution is different from a simply shifted (central) $t$-distribution.
• Directly seen from the definition: $t_\nu(0) \equiv t_\nu$.
• Moments of the non-central Student $t$-distribution:
$$\mathrm{E}(T) = \begin{cases} \lambda \sqrt{\dfrac{\nu}{2}}\; \dfrac{\Gamma\bigl(\frac{\nu - 1}{2}\bigr)}{\Gamma\bigl(\frac{\nu}{2}\bigr)}, & \text{if } \nu > 1, \\[1ex] \text{does not exist}, & \text{if } \nu \le 1, \end{cases}$$
$$\mathrm{var}(T) = \begin{cases} \dfrac{\nu\,(1 + \lambda^2)}{\nu - 2} - \dfrac{\nu \lambda^2}{2} \left(\dfrac{\Gamma\bigl(\frac{\nu - 1}{2}\bigr)}{\Gamma\bigl(\frac{\nu}{2}\bigr)}\right)^{2}, & \text{if } \nu > 2, \\[1ex] \text{does not exist}, & \text{if } \nu \le 2. \end{cases}$$

¹ In Czech: necentrální Studentovo t-rozdělení
² In Czech: stupně volnosti
³ In Czech: parametr necentrality
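The definition and the mean formula can be checked by simulation. The following Python/NumPy sketch (an illustration, not part of the original notes) builds $T = (U + \lambda)/\sqrt{V/\nu}$ directly from Definition B.1 and compares the Monte Carlo mean with the closed form for $\nu > 1$.

```python
import numpy as np
from math import gamma, sqrt

rng = np.random.default_rng(5)
nu, lam, nsim = 10, 1.5, 400_000

U = rng.normal(size=nsim)
V = rng.chisquare(nu, size=nsim)
T = (U + lam) / np.sqrt(V / nu)      # T ~ t_nu(lam) by Definition B.1

# E(T) = lam * sqrt(nu/2) * Gamma((nu-1)/2) / Gamma(nu/2), for nu > 1
ET = lam * sqrt(nu / 2) * gamma((nu - 1) / 2) / gamma(nu / 2)
assert abs(T.mean() - ET) < 0.02
```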
Start of Lecture #6 (22/10/2015)
Definition B.2 Non-central χ2 distribution.
Let $U_1, \dots, U_k$ be independent random variables, $U_i \sim \mathcal{N}(\mu_i, 1)$, $i = 1, \dots, k$, for some $\mu_1, \dots, \mu_k \in \mathbb{R}$. That is, $U = \bigl(U_1, \dots, U_k\bigr)^\top \sim \mathcal{N}_k\bigl(\mu, I_k\bigr)$, where $\mu = \bigl(\mu_1, \dots, \mu_k\bigr)^\top$. Then we say that the random variable
$$X = \sum_{i=1}^{k} U_i^2 = \|U\|^2$$
follows a non-central chi-squared distribution⁴ with $k$ degrees of freedom and a non-centrality parameter
$$\lambda = \sum_{i=1}^{k} \mu_i^2 = \|\mu\|^2.$$
We shall write $X \sim \chi^2_k(\lambda)$.
Notes.
• It can easily be proved that the distribution of the random variable $X$ from Definition B.2 indeed depends only on $k$ and $\lambda = \sum_{i=1}^{k} \mu_i^2$ and not on the particular values of $\mu_1, \dots, \mu_k$.
• As an exercise in the use of the convolution theorem, we can derive the density of the $\chi^2_k(\lambda)$ distribution, which is
$$f(x) = \begin{cases} \dfrac{1}{\Gamma\bigl(\frac{1}{2}\bigr)\, 2^{\frac{k}{2}}\, \Gamma\bigl(\frac{k-1}{2}\bigr)}\; e^{-\frac{x + \lambda}{2}}\, x^{\frac{k-2}{2}} \displaystyle\sum_{j=0}^{\infty} \frac{\lambda^j x^j}{(2j)!}\; B\Bigl(\frac{k-1}{2},\, \frac{1}{2} + j\Bigr), & x > 0, \\[1ex] 0, & x \le 0. \end{cases}$$
• The non-central $\chi^2$ distribution with general degrees of freedom $\nu \in (0, \infty)$ is defined as the distribution with the density given by the above expression with $k$ replaced by $\nu$.
• $\chi^2_\nu(0) \equiv \chi^2_\nu$.
• Moments of the non-central $\chi^2$ distribution:
$$\mathrm{E}(X) = \nu + \lambda, \qquad \mathrm{var}(X) = 2\,(\nu + 2\lambda).$$
⁴ In Czech: necentrální chí-kvadrát rozdělení
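The moment formulas can again be checked by simulation. This Python/NumPy sketch (an illustration, not part of the original notes) builds $X = \sum_i U_i^2$ with $U_i \sim \mathcal{N}(\mu_i, 1)$ as in Definition B.2 and compares the sample mean and variance with $\nu + \lambda$ and $2(\nu + 2\lambda)$.

```python
import numpy as np

rng = np.random.default_rng(6)
k, nsim = 5, 400_000
mu = np.array([1.0, -1.0, 0.5, 0.0, 0.5])
lam = np.sum(mu**2)                     # non-centrality parameter, here 2.5

U = rng.normal(size=(nsim, k)) + mu     # U_i ~ N(mu_i, 1), independent
X = np.sum(U**2, axis=1)                # X ~ chi^2_k(lam)

assert abs(X.mean() - (k + lam)) < 0.05       # E(X) = nu + lam
assert abs(X.var() - 2 * (k + 2 * lam)) < 0.3  # var(X) = 2(nu + 2 lam)
```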
Definition B.3 Non-central F-distribution.
Let $X \sim \chi^2_{\nu_1}(\lambda)$, where $\nu_1, \lambda > 0$. Let $Y \sim \chi^2_{\nu_2}$, where $\nu_2 > 0$. Let further $X$ and $Y$ be independent. Then we say that the random variable
$$Q = \frac{X / \nu_1}{Y / \nu_2}$$
follows a non-central F-distribution⁵ with $\nu_1$ and $\nu_2$ degrees of freedom and a non-centrality parameter $\lambda$. We shall write $Q \sim F_{\nu_1, \nu_2}(\lambda)$.
Notes.
• Directly seen from the definition: $F_{\nu_1, \nu_2}(0) \equiv F_{\nu_1, \nu_2}$.
• Moments of the non-central F-distribution:
$$\mathrm{E}(Q) = \begin{cases} \dfrac{\nu_2\,(\nu_1 + \lambda)}{\nu_1\,(\nu_2 - 2)}, & \text{if } \nu_2 > 2, \\[1ex] \text{does not exist}, & \text{if } \nu_2 \le 2, \end{cases}$$
$$\mathrm{var}(Q) = \begin{cases} 2\; \dfrac{\nu_2^2}{\nu_1^2}\; \dfrac{(\nu_1 + \lambda)^2 + (\nu_1 + 2\lambda)(\nu_2 - 2)}{(\nu_2 - 2)^2\,(\nu_2 - 4)}, & \text{if } \nu_2 > 4, \\[1ex] \text{does not exist}, & \text{if } \nu_2 \le 4. \end{cases}$$
⁵ In Czech: necentrální F-rozdělení

End of Lecture #6 (22/10/2015)
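A simulation check of the mean formula (not part of the original notes), again in Python/NumPy: $Q$ is constructed from independent non-central and central $\chi^2$ variables exactly as in Definition B.3.

```python
import numpy as np

rng = np.random.default_rng(8)
nu1, nu2, lam, nsim = 4, 12, 3.0, 400_000

X = rng.noncentral_chisquare(nu1, lam, size=nsim)   # chi^2_{nu1}(lam)
Y = rng.chisquare(nu2, size=nsim)                   # central chi^2_{nu2}
Q = (X / nu1) / (Y / nu2)                           # Q ~ F_{nu1,nu2}(lam)

# E(Q) = nu2 (nu1 + lam) / (nu1 (nu2 - 2)), for nu2 > 2; here 2.1
EQ = nu2 * (nu1 + lam) / (nu1 * (nu2 - 2))
assert abs(Q.mean() - EQ) < 0.05
```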
B.2 Multivariate distributions

Start of Lecture #4 (15/10/2015)
Definition B.4 Multivariate Student t-distribution.
Let $U \sim \mathcal{N}_n(0_n, \Sigma)$, where $\Sigma_{n \times n}$ is a positive semidefinite matrix. Let further $V \sim \chi^2_\nu$ for some $\nu > 0$ and let $U$ and $V$ be independent. Then we say that the random vector
$$T = U \sqrt{\frac{\nu}{V}}$$
follows an $n$-dimensional multivariate Student $t$-distribution⁶ with $\nu$ degrees of freedom and a scale matrix⁷ $\Sigma$. We shall write $T \sim \mathrm{mvt}_{n, \nu}(\Sigma)$.
Notes.
• Directly seen from the definition: $\mathrm{mvt}_{1, \nu}(1) \equiv t_\nu$.
• If $\Sigma$ is a regular (positive definite) matrix, then the density of the $\mathrm{mvt}_{n, \nu}(\Sigma)$ distribution is
$$f(t) = \frac{\Gamma\bigl(\frac{\nu + n}{2}\bigr)}{\Gamma\bigl(\frac{\nu}{2}\bigr)\, \nu^{\frac{n}{2}}\, \pi^{\frac{n}{2}}}\; \bigl|\Sigma\bigr|^{-\frac{1}{2}} \left(1 + \frac{t^\top \Sigma^{-1} t}{\nu}\right)^{-\frac{\nu + n}{2}}, \qquad t \in \mathbb{R}^n.$$
• The expectation and the covariance matrix of $T \sim \mathrm{mvt}_{n, \nu}(\Sigma)$ are
$$\mathrm{E}(T) = \begin{cases} 0_n, & \text{if } \nu > 1, \\ \text{does not exist}, & \text{if } \nu \le 1, \end{cases} \qquad \mathrm{var}(T) = \begin{cases} \dfrac{\nu}{\nu - 2}\, \Sigma, & \text{if } \nu > 2, \\ \text{does not exist}, & \text{if } \nu \le 2. \end{cases}$$
Lemma B.1 Marginals of the multivariate Student t-distribution.
>
Let T = T1 , . . . , Tn
∼ mvtn,ν (Σ), where the scale matrix Σ has positive diagonal elements
σ12 > 0, . . . , σn2 > 0. Then
Tj
∼ tν ,
j = 1, . . . , n.
σj
Proof.
• From the definition of the multivariate $t$-distribution, $T$ can be written as $T = U \sqrt{\dfrac{\nu}{V}}$, where $U = \bigl(U_1, \dots, U_n\bigr)^\top \sim \mathcal{N}_n(0_n, \Sigma)$ and $V \sim \chi^2_\nu$ are independent.
• Then for all $j = 1, \dots, n$:
$$\frac{T_j}{\sigma_j} = \frac{U_j}{\sigma_j} \sqrt{\frac{\nu}{V}} = \frac{Z_j}{\sqrt{\dfrac{V}{\nu}}},$$
where $Z_j \sim \mathcal{N}(0, 1)$ is independent of $V \sim \chi^2_\nu$. □

⁶ In Czech: vícerozměrné Studentovo t-rozdělení
⁷ In Czech: měřítková matice
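The lemma can be illustrated by simulation. This Python/NumPy sketch (an illustration, not part of the original notes) generates $T \sim \mathrm{mvt}_{2,\nu}(\Sigma)$ from its definition and checks that each scaled marginal $T_j/\sigma_j$ has the mean and variance of a $t_\nu$ distribution, i.e. $0$ and $\nu/(\nu-2)$.

```python
import numpy as np

rng = np.random.default_rng(9)
nu, nsim = 8, 300_000
Sigma = np.array([[4.0, 1.0],
                  [1.0, 1.0]])       # scale matrix: sigma_1 = 2, sigma_2 = 1

U = rng.multivariate_normal(np.zeros(2), Sigma, size=nsim)
V = rng.chisquare(nu, size=nsim)
T = U * np.sqrt(nu / V)[:, None]     # T ~ mvt_{2,nu}(Sigma) by Definition B.4

# Marginals: T_j / sigma_j ~ t_nu, hence mean 0 and variance nu/(nu-2) = 4/3
for j, sig in enumerate([2.0, 1.0]):
    Z = T[:, j] / sig
    assert abs(Z.mean()) < 0.02
    assert abs(Z.var() - nu / (nu - 2)) < 0.05
```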
End of Lecture #4 (15/10/2015)

Appendix C
Asymptotic Theorems

Start of Lecture #22 (17/12/2015)

Theorem C.1 Strong law of large numbers (SLLN) for i.n.n.i.d. random variables.
Let $Z_1, Z_2, \dots$ be a sequence of independent, not necessarily identically distributed (i.n.n.i.d.) random variables. Let $\mathrm{E}(Z_i) = \mu_i$, $\mathrm{var}(Z_i) = \sigma_i^2$, $i = 1, 2, \dots$. Let
$$\sum_{i=1}^{\infty} \frac{\sigma_i^2}{i^2} < \infty.$$
Then
$$\frac{1}{n} \sum_{i=1}^{n} \bigl(Z_i - \mu_i\bigr) \xrightarrow{\text{a.s.}} 0 \quad \text{as } n \to \infty.$$
Proof. See the Probability and Mathematical Statistics (NMSA202) lecture (2nd year of the Bc. study programme). □
Theorem C.2 Strong law of large numbers (SLLN) for i.i.d. random variables.
Let $Z_1, Z_2, \dots$ be a sequence of independent identically distributed (i.i.d.) random variables. Then
$$\frac{1}{n} \sum_{i=1}^{n} Z_i \xrightarrow{\text{a.s.}} \mu \quad \text{as } n \to \infty$$
for some $\mu \in \mathbb{R}$ if and only if
$$\mathrm{E}\,|Z_1| < \infty,$$
in which case $\mu = \mathrm{E}(Z_1)$.
Proof. See the Probability and Mathematical Statistics (NMSA202) lecture (2nd year of the Bc. study programme). □
Theorem C.3 Central limit theorem (CLT), Lyapunov.
Let $Z_1, Z_2, \dots$ be a sequence of i.n.n.i.d. random variables with
$$\mathrm{E}(Z_i) = \mu_i, \qquad \infty > \mathrm{var}(Z_i) = \sigma_i^2 > 0, \qquad i = 1, 2, \dots$$
Let for some $\delta > 0$
$$\frac{\displaystyle\sum_{i=1}^{n} \mathrm{E}\,\bigl|Z_i - \mu_i\bigr|^{2 + \delta}}{\Bigl(\displaystyle\sum_{i=1}^{n} \sigma_i^2\Bigr)^{\frac{2 + \delta}{2}}} \longrightarrow 0 \quad \text{as } n \to \infty.$$
Then
$$\frac{\displaystyle\sum_{i=1}^{n} \bigl(Z_i - \mu_i\bigr)}{\sqrt{\displaystyle\sum_{i=1}^{n} \sigma_i^2}} \xrightarrow{\mathcal{D}} \mathcal{N}(0, 1) \quad \text{as } n \to \infty.$$

Proof. See the Probability Theory 1 (NMSA333) lecture (3rd year of the Bc. study programme). □
Theorem C.4 Central limit theorem (CLT), i.i.d..
Let $Z_1, Z_2, \dots$ be a sequence of i.i.d. random variables with
$$\mathrm{E}(Z_i) = \mu, \qquad \infty > \mathrm{var}(Z_i) = \sigma^2 > 0, \qquad i = 1, 2, \dots$$
Let $\overline{Z}_n = \frac{1}{n} \sum_{i=1}^{n} Z_i$. Then
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{Z_i - \mu}{\sigma} \xrightarrow{\mathcal{D}} \mathcal{N}(0, 1) \quad \text{as } n \to \infty,$$
$$\sqrt{n}\, \bigl(\overline{Z}_n - \mu\bigr) \xrightarrow{\mathcal{D}} \mathcal{N}(0, \sigma^2) \quad \text{as } n \to \infty.$$

Proof. See the Probability Theory 1 (NMSA333) lecture (3rd year of the Bc. study programme). □
Theorem C.5 Central limit theorem (CLT), i.i.d. multivariate.
Let $Z_1, Z_2, \dots$ be a sequence of i.i.d. $p$-dimensional random vectors with
$$\mathrm{E}(Z_i) = \mu, \qquad \mathrm{var}(Z_i) = \Sigma, \qquad i = 1, 2, \dots,$$
where $\Sigma$ is a real positive semidefinite matrix. Let $\overline{Z}_n = \frac{1}{n} \sum_{i=1}^{n} Z_i$. Then
$$\sqrt{n}\, \bigl(\overline{Z}_n - \mu\bigr) \xrightarrow{\mathcal{D}} \mathcal{N}_p(0_p, \Sigma).$$
If $\Sigma$ is positive definite, then also
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \Sigma^{-1/2} \bigl(Z_i - \mu\bigr) \xrightarrow{\mathcal{D}} \mathcal{N}_p(0_p, I_p).$$

Proof. See the Probability Theory 1 (NMSA333) lecture (3rd year of the Bc. study programme). □
Theorem C.6 Cramér-Wold.
Let $Z_1, Z_2, \dots$ be a sequence of $p$-dimensional random vectors and let $Z$ be a $p$-dimensional random vector. Then
$$Z_n \xrightarrow{\mathcal{D}} Z \quad \text{as } n \to \infty$$
if and only if for all $l \in \mathbb{R}^p$
$$l^\top Z_n \xrightarrow{\mathcal{D}} l^\top Z \quad \text{as } n \to \infty.$$

Proof. See the Probability Theory 1 (NMSA333) lecture (3rd year of the Bc. study programme). □
Theorem C.7 Cramér-Slutsky.
Let $Z_1, Z_2, \dots$ be a sequence of random vectors such that
$$Z_n \xrightarrow{\mathcal{D}} Z \quad \text{as } n \to \infty,$$
where $Z$ is a random vector. Let $S_1, S_2, \dots$ be a sequence of random variables such that
$$S_n \xrightarrow{\mathsf{P}} S \quad \text{as } n \to \infty,$$
where $S \in \mathbb{R}$ is a real constant. Then
(i) $S_n Z_n \xrightarrow{\mathcal{D}} S\, Z$ as $n \to \infty$;
(ii) $\dfrac{1}{S_n} Z_n \xrightarrow{\mathcal{D}} \dfrac{1}{S} Z$ as $n \to \infty$, if $S \neq 0$.

Proof. See the Probability Theory 1 (NMSA333) lecture (3rd year of the Bc. study programme). □
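A classical joint application of the CLT (Theorem C.4) and Cramér-Slutsky (Theorem C.7) is the studentized mean: since $S_n \to \sigma$ in probability, $\sqrt{n}(\overline{Z}_n - \mu)/S_n \to \mathcal{N}(0,1)$ in distribution. The following Python/NumPy sketch (an illustration, not part of the original notes) checks this via the Monte Carlo coverage of the standard normal interval $(-1.96,\, 1.96)$.

```python
import numpy as np

rng = np.random.default_rng(10)
n, nsim = 200, 10_000
mu = 1.0                               # mean of Uniform(0, 2)

Z = rng.uniform(0.0, 2.0, size=(nsim, n))
Zbar = Z.mean(axis=1)
S = Z.std(axis=1, ddof=1)              # S_n -> sigma in probability
T = np.sqrt(n) * (Zbar - mu) / S       # CLT + Cramer-Slutsky: T -> N(0, 1)

# Coverage of the asymptotic 95% interval should be close to 0.95
assert abs(np.mean(np.abs(T) < 1.96) - 0.95) < 0.01
```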
End of Lecture #22 (17/12/2015)
Bibliography
Anděl, J. (2007). Základy matematické statistiky. Matfyzpress, Praha. ISBN 80-7378-001-1.
Bartlett, M. S. (1937). Properties of sufficiency and statistical tests. Proceedings of the Royal Society
of London, Series A, Mathematical and Physical Sciences, 160(901), 268–282. doi: 10.1098/rspa.1937.
0109.
Breusch, T. S. and Pagan, A. R. (1979). A simple test for heteroscedasticity and random coefficient
variation. Econometrica, 47(5), 1287–1294. doi: 10.2307/1911963.
Brown, M. B. and Forsythe, A. B. (1974). Robust tests for the equality of variances. Journal of the
American Statistical Association, 69(346), 364–367. doi: 10.1080/01621459.1974.10482955.
Cipra, T. (2008). Finanční ekonometrie. Ekopress, Praha. ISBN 978-80-86929-43-9.
Cook, R. D. and Weisberg, S. (1983). Diagnostics for heteroscedasticity in regression. Biometrika, 70
(1), 1–10. doi: 10.1093/biomet/70.1.1.
Cribari-Neto, F. (2004). Asymptotic inference under heteroskedasticity of unknown form. Computational Statistics and Data Analysis, 45(2), 215–233. doi: 10.1016/S0167-9473(02)00366-3.
de Boor, C. (1978). A Practical Guide to Splines. Springer, New York. ISBN 0-387-90356-9.
de Boor, C. (2001). A Practical Guide to Splines. Springer-Verlag, New York, Revised edition. ISBN
0-387-95366-3.
Dierckx, P. (1993). Curve and Surface Fitting with Splines. Clarendon, Oxford. ISBN 0-19-853440-X.
Draper, N. R. and Smith, H. (1998). Applied Regression Analysis. John Wiley & Sons, New York, Third
edition. ISBN 0-471-17082-8.
Durbin, J. and Watson, G. S. (1950). Testing for serial correlation in least squares regression I.
Biometrika, 37, 409–428.
Durbin, J. and Watson, G. S. (1951). Testing for serial correlation in least squares regression II.
Biometrika, 38(1/2), 159–177. doi: 10.2307/2332325.
Durbin, J. and Watson, G. S. (1971). Testing for serial correlation in least squares regression III.
Biometrika, 58(1), 1–19. doi: 10.2307/2334313.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1),
1–26. doi: 10.1214/aos/1176344552.
Eilers, P. H. C. and Marx, B. D. (1996). Flexible smoothing with B-splines and penalties (with
Discussion). Statistical Science, 11(1), 89–121. doi: 10.1214/ss/1038425655.
Farebrother, R. W. (1980). Algorithm AS 153: Pan’s procedure for the tail probabilities of the
Durbin-Watson statistics. Applied Statistics, 29(2), 224–227.
Farebrother, R. W. (1984). Remark AS R53: A remark on algorithm AS 106, AS 153, AS 155: The
distribution of a linear combination of χ2 random variables. Applied Statistics, 33, 366–369.
Fligner, M. A. and Killeen, T. J. (1976). Distribution-free two-sample tests for scale. Journal of the
American Statistical Association, 71(353), 210–213. doi: 10.2307/2285771.
Fox, J. and Monette, G. (1992). Generalized collinearity diagnostics. Journal of the American Statistical
Association, 87(417), 178–183. doi: 10.1080/01621459.1992.10475190.
Genz, A. and Bretz, F. (2009). Computation of Multivariate Normal and t Probabilities. Springer-Verlag,
New York. ISBN 978-3-642-01688-2.
Goldfeld, S. M. and Quandt, R. E. (1965). Some tests for homoscedasticity. Journal of the American
Statistical Association, 60(310), 539–547. doi: 10.1080/01621459.1965.10480811.
Hayter, A. J. (1984). A proof of the conjecture that the Tukey-Kramer multiple comparisons procedure
is conservative. The Annals of Statistics, 12(1), 61–75. doi: 10.1214/aos/1176346392.
Hothorn, T., Bretz, F., and Westfall, P. (2008). Simultaneous inference in general parametric models.
Biometrical Journal, 50(3), 346–363. doi: 10.1002/bimj.200810425.
Hothorn, T., Bretz, F., and Westfall, P. (2011). Multiple Comparisons Using R. Chapman & Hall/CRC,
Boca Raton. ISBN 978-1-5848-8574-0.
Khuri, A. I. (2010). Linear Model Methodology. Chapman & Hall/CRC, Boca Raton. ISBN 978-1-58488-481-1.
Koenker, R. (1981). A note on studentizing a test for heteroscedasticity. Journal of Econometrics, 17
(1), 107–112. doi: 10.1016/0304-4076(81)90062-2.
Kramer, C. Y. (1956). Extension of multiple range tests to group means with unequal numbers of
replications. Biometrics, 12(3), 307–310. doi: 10.2307/3001469.
Levene, H. (1960). Robust tests for equality of variances. In Olkin, I., Ghurye, S. G., Hoeffding, W.,
Madow, W. G., and Mann, H. B., editors, Contributions to Probability and Statistics: Essays in Honor
of Harold Hotelling, pages 278–292. Stanford University Press, Standord.
Long, J. S. and Ervin, L. H. (2000). Using heteroscedasticity consistent standard errors in the linear
regression model. The American Statistician, 54(3), 217–224. doi: 10.2307/2685594.
MacKinnon, J. G. and White, H. (1985). Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties. Journal of Econometrics, 29(3), 305–325. doi: 10.1016/0304-4076(85)90158-7.
R Core Team (2015). R: A Language and Environment for Statistical Computing. R Foundation for
Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
Rao, C. R. (1973). Linear Statistical Inference and its Applications. John Wiley & Sons, New York,
Second edition. ISBN 0-471-21875-8.
Searle, S. R. (1987). Linear Models for Unbalanced Data. John Wiley & Sons, New York. ISBN
0-471-84096-3.
Seber, G. A. F. and Lee, A. J. (2003). Linear Regression Analysis. John Wiley & Sons, New York, Second
edition. ISBN 978-0-47141-540-4.
Shao, J. (2003). Mathematical Statistics. Springer Science+Business Media, New York, Second edition.
ISBN 0-387-95382-5.
Sun, J. (2003). Mathematical Statistics. Springer Science+Business Media, New York, Second edition.
ISBN 0-387-95382-5.
Tukey, J. W. (1949). Comparing individual means in the Analysis of variance. Biometrics, 5(2), 99–114.
doi: 10.2307/3001913.
Tukey, J. W. (1953). The problem of multiple comparisons (originally unpublished manuscript). In
Braun, H. I., editor, The Collected Works of John W. Tukey, volume 8, 1994. Chapman & Hall, New
York.
Weisberg, S. (2005). Applied Linear Regression. John Wiley & Sons, Hoboken, Third edition. ISBN
0-471-66379-4.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for
heteroskedasticity. Econometrica, 48(4), 817–838. doi: 10.2307/1912934.
Zeileis, A. (2004). Econometric computing with HC and HAC covariance matrix estimators. Journal
of Statistical Software, 11(10), 1–17. URL http://www.jstatsoft.org/v11/i10/.
Zvára, K. (1989). Regresní analýza. Academia, Praha. ISBN 80-200-0125-5.
Zvára, K. (2008). Regrese. Matfyzpress, Praha. ISBN 978-80-7378-041-8.
```