The REG Procedure SAS/STAT User’s Guide (Book Excerpt)

SAS/STAT 9.2 User’s Guide
The REG Procedure
(Book Excerpt)
SAS Documentation
This document is an individual chapter from SAS/STAT® 9.2 User’s Guide.
The correct bibliographic citation for the complete manual is as follows: SAS Institute Inc. 2008. SAS/STAT® 9.2
User’s Guide. Cary, NC: SAS Institute Inc.
Copyright © 2008, SAS Institute Inc., Cary, NC, USA
All rights reserved. Produced in the United States of America.
For a Web download or e-book: Your use of this publication shall be governed by the terms established by the vendor
at the time you acquire this publication.
U.S. Government Restricted Rights Notice: Use, duplication, or disclosure of this software and related documentation
by the U.S. government is subject to the Agreement with SAS Institute and the restrictions set forth in FAR 52.227-19,
Commercial Computer Software-Restricted Rights (June 1987).
SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.
1st electronic book, March 2008
2nd electronic book, February 2009
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute
Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.
Chapter 73
The REG Procedure
Contents
Overview: REG Procedure  5428
Getting Started: REG Procedure  5430
   Simple Linear Regression  5430
   Polynomial Regression  5434
   Using PROC REG Interactively  5444
Syntax: REG Procedure  5445
   PROC REG Statement  5447
   ADD Statement  5461
   BY Statement  5461
   DELETE Statement  5462
   FREQ Statement  5462
   ID Statement  5462
   MODEL Statement  5463
   MTEST Statement  5474
   OUTPUT Statement  5476
   PAINT Statement  5478
   PLOT Statement  5481
   PRINT Statement  5493
   REFIT Statement  5494
   RESTRICT Statement  5494
   REWEIGHT Statement  5496
   TEST Statement  5500
   VAR Statement  5501
   WEIGHT Statement  5501
Details: REG Procedure  5501
   Missing Values  5501
   Input Data Sets  5502
   Output Data Sets  5506
   Interactive Analysis  5513
   Model-Selection Methods  5517
   Criteria Used in Model-Selection Methods  5520
   Limitations in Model-Selection Methods  5521
   Parameter Estimates and Associated Statistics  5521
   Predicted and Residual Values  5525
   Line Printer Scatter Plot Features  5528
   Traditional Graphics  5541
   Models of Less Than Full Rank  5547
   Collinearity Diagnostics  5549
   Model Fit and Diagnostic Statistics  5551
   Influence Statistics  5553
   Reweighting Observations in an Analysis  5563
   Testing for Heteroscedasticity  5569
   Testing for Lack of Fit  5570
   Multivariate Tests  5571
   Autocorrelation in Time Series Data  5575
   Computations for Ridge Regression and IPC Analysis  5577
   Construction of Q-Q and P-P Plots  5577
   Computational Methods  5578
   Computer Resources in Regression Analysis  5578
   Displayed Output  5578
   ODS Table Names  5581
   ODS Graphics  5583
Examples: REG Procedure  5586
   Example 73.1: Modeling Salaries of Major League Baseball Players  5586
   Example 73.2: Aerobic Fitness Prediction  5602
   Example 73.3: Predicting Weight by Height and Age  5621
   Example 73.4: Regression with Quantitative and Qualitative Variables  5627
   Example 73.5: Ridge Regression for Acetylene Data  5632
   Example 73.6: Chemical Reaction Response  5636
References  5638

Overview: REG Procedure
Overview: REG Procedure
The REG procedure is one of many regression procedures in the SAS System. It is a general-purpose procedure for regression, while other SAS regression procedures provide more specialized
applications.
Other SAS/STAT procedures that perform at least one type of regression analysis are the CATMOD,
GENMOD, GLM, LOGISTIC, MIXED, NLIN, ORTHOREG, PROBIT, RSREG, and TRANSREG
procedures. SAS/ETS procedures are specialized for applications in time series or simultaneous
systems. These other SAS/STAT regression procedures are summarized in Chapter 4, “Introduction
to Regression Procedures,” which also contains an overview of regression techniques and defines
many of the statistics computed by PROC REG and other regression procedures.
PROC REG provides the following capabilities:
multiple MODEL statements
nine model-selection methods
interactive changes both in the model and the data used to fit the model
linear equality restrictions on parameters
tests of linear hypotheses and multivariate hypotheses
collinearity diagnostics
predicted values, residuals, studentized residuals, confidence limits, and influence statistics
correlation or crossproduct input
requested statistics available for output through output data sets
ODS Graphics is now available. For more information, see the section “ODS Graphics” on
page 5583. These plots are available in addition to the line printer and the traditional graphics
currently available in PROC REG.
Nine model-selection methods are available in PROC REG. In the simplest method, PROC REG
fits the complete model that you specify. The other eight methods involve various ways of including
or excluding variables from the model. You specify these methods with the SELECTION= option
in the MODEL statement.
The methods are identified in the following list and are explained in detail in the section “Model-Selection Methods” on page 5517.
NONE
   no model selection. This is the default. The complete model specified in the
   MODEL statement is fit to the data.
FORWARD
   forward selection. This method starts with no variables in the model and adds
   variables.
BACKWARD
   backward elimination. This method starts with all variables in the model and
   deletes variables.
STEPWISE
   stepwise regression. This is similar to the FORWARD method except that variables
   already in the model do not necessarily stay there.
MAXR
   forward selection to fit the best one-variable model, the best two-variable model,
   and so on. Variables are switched so that $R^2$ is maximized.
MINR
   similar to the MAXR method, except that variables are switched so that the increase
   in $R^2$ from adding a variable to the model is minimized.
RSQUARE
   finds a specified number of models with the highest $R^2$ in a range of model sizes.
ADJRSQ
   finds a specified number of models with the highest adjusted $R^2$ in a range of
   model sizes.
CP
   finds a specified number of models with the lowest $C_p$ in a range of model sizes.
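As a minimal sketch of how the SELECTION= option might be used, the following statements request stepwise selection and then an all-subsets summary. The sketch assumes a data set named Class with the variables Weight, Height, and Age (such as the one created in the section “Simple Linear Regression”); the choice of regressors and the values given for SLENTRY=, SLSTAY=, and BEST= are illustrative assumptions, not recommendations.

```sas
/* Assumption: the Class data set (Name, Height, Weight, Age) already exists. */
proc reg data=Class;
   /* Stepwise selection; SLENTRY= and SLSTAY= set the significance */
   /* levels for a variable to enter and to stay in the model.      */
   model Weight = Height Age / selection=stepwise slentry=0.15 slstay=0.15;
run;

   /* All-subsets summary: up to two models of each size ranked by  */
   /* R-square, with adjusted R-square and Cp also displayed.       */
   model Weight = Height Age / selection=rsquare best=2 adjrsq cp;
run;
quit;
```

Because PROC REG is interactive, the second MODEL statement can be submitted after the first RUN statement without reinvoking the procedure.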
Getting Started: REG Procedure
Simple Linear Regression
Suppose that a response variable $Y$ can be predicted by a linear function of a regressor variable $X$.
You can estimate $\beta_0$, the intercept, and $\beta_1$, the slope, in
$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$
for the observations $i = 1, 2, \ldots, n$. Fitting this model with the REG procedure requires only the
following MODEL statement, where y is the outcome variable and x is the regressor variable.
proc reg;
model y=x;
run;
For example, you might use regression analysis to find out how well you can predict a child’s weight
if you know that child’s height. The following data are from a study of nineteen children. Height
and weight are measured for each child.
title 'Simple Linear Regression';
data Class;
   input Name $ Height Weight Age @@;
   datalines;
Alfred  69.0 112.5 14  Alice  56.5  84.0 13  Barbara 65.3  98.0 13
Carol   62.8 102.5 14  Henry  63.5 102.5 14  James   57.3  83.0 12
Jane    59.8  84.5 12  Janet  62.5 112.5 15  Jeffrey 62.5  84.0 13
John    59.0  99.5 12  Joyce  51.3  50.5 11  Judy    64.3  90.0 14
Louise  56.3  77.0 12  Mary   66.5 112.0 15  Philip  72.0 150.0 16
Robert  64.8 128.0 12  Ronald 67.0 133.0 15  Thomas  57.5  85.0 11
William 66.5 112.0 15
;
The equation of interest is
$$\text{Weight} = \beta_0 + \beta_1 \, \text{Height} + \epsilon$$
The variable Weight is the response or dependent variable in this equation, and $\beta_0$ and $\beta_1$ are the
unknown parameters to be estimated. The variable Height is the regressor or independent variable,
and $\epsilon$ is the unknown error. The following commands invoke the REG procedure and fit this model
to the data.
ods graphics on;
proc reg;
model Weight = Height;
run;
ods graphics off;
Figure 73.1 includes some information concerning model fit.
The F statistic for the overall model is highly significant ($F = 57.076$, $p < 0.0001$), indicating that the
model explains a significant portion of the variation in the data.
The degrees of freedom can be used in checking accuracy of the data and model. The model degrees
of freedom are one less than the number of parameters to be estimated. This model estimates two
parameters, $\beta_0$ and $\beta_1$; thus, the degrees of freedom should be $2 - 1 = 1$. The corrected total
degrees of freedom are always one less than the total number of observations in the data set, in this
case $19 - 1 = 18$.
Several simple statistics follow the ANOVA table. The Root MSE is an estimate of the standard
deviation of the error term. The coefficient of variation, or Coeff Var, is a unitless expression of the
variation in the data. The R-square and Adj R-square are two statistics used in assessing the fit of
the model; values close to 1 indicate a better fit. The R-square of 0.77 indicates that Height accounts
for 77% of the variation in Weight.
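As a quick check on how these quantities relate, the summary statistics can be recomputed from the sums of squares, mean squares, and dependent mean reported in Figure 73.1. The formulas below are the usual textbook definitions, with $n = 19$ observations and $p = 2$ estimated parameters; results are rounded.

```latex
\begin{align*}
F &= \frac{\mathrm{MS}_{\mathrm{Model}}}{\mathrm{MS}_{\mathrm{Error}}}
   = \frac{7193.24912}{126.02869} \approx 57.08 \\
\text{Root MSE} &= \sqrt{\mathrm{MS}_{\mathrm{Error}}}
   = \sqrt{126.02869} \approx 11.226 \\
\text{Coeff Var} &= 100 \times \frac{\text{Root MSE}}{\text{Dependent Mean}}
   = 100 \times \frac{11.22625}{100.02632} \approx 11.223 \\
R^2 &= \frac{\mathrm{SS}_{\mathrm{Model}}}{\mathrm{SS}_{\mathrm{Total}}}
   = \frac{7193.24912}{9335.73684} \approx 0.7705 \\
\text{Adj } R^2 &= 1 - (1 - R^2)\,\frac{n - 1}{n - p}
   = 1 - 0.2295 \times \frac{18}{17} \approx 0.7570
\end{align*}
```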
Figure 73.1 ANOVA Table

Simple Linear Regression
The REG Procedure
Model: MODEL1
Dependent Variable: Weight

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1       7193.24912    7193.24912     57.08   <.0001
Error             17       2142.48772     126.02869
Corrected Total   18       9335.73684

Root MSE          11.22625   R-Square   0.7705
Dependent Mean   100.02632   Adj R-Sq   0.7570
Coeff Var         11.22330
The “Parameter Estimates” table in Figure 73.2 contains the estimates of $\beta_0$ and $\beta_1$. The table
also contains the t statistics and the corresponding p-values for testing whether each parameter is
significantly different from zero. The p-values ($t = -4.43$, $p = 0.0004$ and $t = 7.55$, $p < 0.0001$)
indicate that the intercept and Height parameter estimates, respectively, are highly significant.
From the parameter estimates, the fitted model is
$$\text{Weight} = -143.0 + 3.9 \, \text{Height}$$
Figure 73.2 Parameter Estimates

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1           -143.02692         32.27459     -4.43     0.0004
Height       1              3.89903          0.51609      7.55     <.0001
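To see the fitted model applied to each child, one option is to save predicted values and residuals with an OUTPUT statement. The following is a minimal sketch that assumes the Class data set created earlier; the output data set name ClassFit and the variable names PredWeight and ResidWeight are hypothetical names chosen for illustration.

```sas
/* Assumption: the Class data set created above is available. */
proc reg data=Class;
   model Weight = Height;
   /* P= names the variable for predicted values, R= the residuals. */
   output out=ClassFit p=PredWeight r=ResidWeight;
run;
quit;

/* List observed and predicted weights side by side. */
proc print data=ClassFit;
   var Name Height Weight PredWeight ResidWeight;
run;
```

Each value of PredWeight should equal the estimated intercept plus the estimated slope times that child's Height.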
If you enable ODS Graphics with an ODS GRAPHICS statement, then PROC REG produces a
variety of useful plots. Figure 73.3 shows a plot of the residuals versus the regressor and Figure 73.4
shows a panel of diagnostic plots.
Figure 73.3 Residuals vs. Regressor
Figure 73.4 Fit Diagnostics
A trend in the residuals would indicate nonconstant variance in the data. The plot of residuals by
predicted values in the upper-left corner of the diagnostics panel in Figure 73.4 might indicate a
slight trend in the residuals; they appear to increase slightly as the predicted values increase. A fan-shaped trend might indicate the need for a variance-stabilizing transformation. A curved trend (such
as a semicircle) might indicate the need for a quadratic term in the model. Since these residuals have
no pronounced trend, the analysis is considered to be acceptable.
Polynomial Regression
Consider a response variable $Y$ that can be predicted by a polynomial function of a regressor variable $X$. You can estimate $\beta_0$, the intercept; $\beta_1$, the slope due to $X$; and $\beta_2$, the slope due to $X^2$,
in
$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \epsilon_i$$
for the observations $i = 1, 2, \ldots, n$.
Consider the following example on population growth trends. The population of the United States
from 1790 to 2000 is fit to linear and quadratic functions of time. Note that the quadratic term,
YearSq, is created in the DATA step; this is done since polynomial effects such as Year*Year cannot
be specified in the MODEL statement in PROC REG. The data are as follows:
data USPopulation;
   input Population @@;
   retain Year 1780;
   Year       = Year+10;
   YearSq     = Year*Year;
   Population = Population/1000;
   datalines;
3929 5308 7239 9638 12866 17069 23191 31443 39818 50155
62947 75994 91972 105710 122775 131669 151325 179323 203211
226542 248710 281422
;
ods graphics on;
proc reg data=USPopulation plots=ResidualByPredicted;
var YearSq;
model Population=Year / r clm cli;
run;
The DATA= option ensures that the procedure uses the intended data set. Any variable that you
might add to the model but that is not included in the first MODEL statement must appear in the
VAR statement.
The “Analysis of Variance” and “Parameter Estimates” tables are displayed in Figure 73.5.
Figure 73.5 ANOVA Table and Parameter Estimates

The REG Procedure
Model: MODEL1
Dependent Variable: Population

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1           146869        146869    228.92   <.0001
Error             20            12832     641.58160
Corrected Total   21           159700

Root MSE         25.32946   R-Square   0.9197
Dependent Mean   94.64800   Adj R-Sq   0.9156
Coeff Var        26.76175

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1          -2345.85498        161.39279    -14.54     <.0001
Year         1              1.28786          0.08512     15.13     <.0001
The Model F statistic is significant ($F = 228.92$, $p < 0.0001$), indicating that the model accounts for
a significant portion of variation in the data. The R-square indicates that the model accounts for
92% of the variation in population growth. The fitted equation for this model is
$$\text{Population} = -2345.85 + 1.29 \, \text{Year}$$
In the MODEL statement, three options are specified: R requests a residual analysis to be performed, CLI requests 95% confidence limits for an individual value, and CLM requests these limits
for the expected value of the dependent variable. You can request specific $100(1-\alpha)\%$ limits with
the ALPHA= option in the PROC REG or MODEL statement.
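For instance, a sketch of requesting 90% rather than 95% limits might look like the following. It assumes the USPopulation data set created above; the value 0.10 is an arbitrary illustration.

```sas
/* Assumption: the USPopulation data set created above is available. */
proc reg data=USPopulation;
   /* ALPHA=0.10 requests 100(1 - 0.10)% = 90% limits for CLM and CLI. */
   model Population = Year / clm cli alpha=0.10;
run;
quit;
```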
Figure 73.6 shows the “Output Statistics” table. The residual, its standard error, and the studentized
residuals are displayed for each observation. The studentized residual is the residual divided by its
standard error. The magnitude of each studentized residual is shown in a print plot. Studentized
residuals follow a t distribution and can be used to identify outlying or extreme observations. Asterisks (*) extending beyond the dashed lines indicate that the residual is more than three standard
errors from zero. Many observations having absolute studentized residuals greater than two might
indicate an inadequate model. Cook’s D is a measure of the change in the predicted values upon
deletion of that observation from the data set; hence, it measures the influence of the observation on
the estimated regression coefficients.
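To make these definitions concrete, the values for observation 1 (the year 1790) in Figure 73.6 can be reproduced by hand. The expression used here for Cook's D, written in terms of the studentized residual, the standard error of the mean predicted value, and the standard error of the residual, is the standard form of the statistic for a model with $p = 2$ parameters; the inputs are the rounded values displayed in the table.

```latex
\begin{align*}
\text{Student Residual}
  &= \frac{\text{Residual}}{\text{Std Error Residual}}
   = \frac{44.5068}{23.077} \approx 1.929 \\
\text{Cook's } D
  &= \frac{1}{p}\,(\text{Student Residual})^2\,
     \frac{(\text{Std Error Mean Predict})^2}{(\text{Std Error Residual})^2} \\
  &= \frac{1}{2}\,(1.929)^2 \left(\frac{10.4424}{23.077}\right)^2 \approx 0.381
\end{align*}
```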
Figure 73.6 Output Statistics

The REG Procedure
Model: MODEL1
Dependent Variable: Population

Output Statistics
       Dependent   Predicted      Std Error
Obs     Variable       Value   Mean Predict        95% CL Mean          95% CL Predict
  1       3.9290    -40.5778        10.4424    -62.3602   -18.7953    -97.7280    16.5725
  2       5.3080    -27.6991         9.7238    -47.9826    -7.4156    -84.2950    28.8968
  3       7.2390    -14.8205         9.0283    -33.6533     4.0123    -70.9128    41.2719
  4       9.6380     -1.9418         8.3617    -19.3841    15.5004    -57.5827    53.6991
  5      12.8660     10.9368         7.7314     -5.1906    27.0643    -44.3060    66.1797
  6      17.0690     23.8155         7.1470      8.9070    38.7239    -31.0839    78.7148
  7      23.1910     36.6941         6.6208     22.8834    50.5048    -17.9174    91.3056
  8      31.4430     49.5727         6.1675     36.7075    62.4380     -4.8073   103.9528
  9      39.8180     62.4514         5.8044     50.3436    74.5592      8.2455   116.6573
 10      50.1550     75.3300         5.5491     63.7547    86.9053     21.2406   129.4195
 11      62.9470     88.2087         5.4170     76.9090    99.5084     34.1776   142.2398
 12      75.9940    101.0873         5.4170     89.7876   112.3870     47.0562   155.1184
 13      91.9720    113.9660         5.5491    102.3907   125.5413     59.8765   168.0554
 14     105.7100    126.8446         5.8044    114.7368   138.9524     72.6387   181.0505
 15     122.7750    139.7233         6.1675    126.8580   152.5885     85.3432   194.1033
 16     131.6690    152.6019         6.6208    138.7912   166.4126     97.9904   207.2134
 17     151.3250    165.4805         7.1470    150.5721   180.3890    110.5812   220.3799
 18     179.3230    178.3592         7.7314    162.2317   194.4866    123.1163   233.6020
 19     203.2110    191.2378         8.3617    173.7956   208.6801    135.5969   246.8787
 20     226.5420    204.1165         9.0283    185.2837   222.9493    148.0241   260.2088
 21     248.7100    216.9951         9.7238    196.7116   237.2786    160.3992   273.5910
 22     281.4220    229.8738        10.4424    208.0913   251.6562    172.7235   287.0240

Output Statistics
                   Std Error    Student                      Cook's
Obs    Residual     Residual   Residual     -2-1 0 1 2            D
  1     44.5068       23.077      1.929   |      |***   |     0.381
  2     33.0071       23.389      1.411   |      |**    |     0.172
  3     22.0595       23.666      0.932   |      |*     |     0.063
  4     11.5798       23.909      0.484   |      |      |     0.014
  5      1.9292       24.121     0.0800   |      |      |     0.000
  6     -6.7465       24.300     -0.278   |      |      |     0.003
  7    -13.5031       24.449     -0.552   |     *|      |     0.011
  8    -18.1297       24.567     -0.738   |     *|      |     0.017
  9    -22.6334       24.655     -0.918   |     *|      |     0.023
 10    -25.1750       24.714     -1.019   |    **|      |     0.026
 11    -25.2617       24.743     -1.021   |    **|      |     0.025
 12    -25.0933       24.743     -1.014   |    **|      |     0.025
 13    -21.9940       24.714     -0.890   |     *|      |     0.020
 14    -21.1346       24.655     -0.857   |     *|      |     0.020
 15    -16.9483       24.567     -0.690   |     *|      |     0.015
 16    -20.9329       24.449     -0.856   |     *|      |     0.027
 17    -14.1555       24.300     -0.583   |     *|      |     0.015
 18      0.9638       24.121     0.0400   |      |      |     0.000
 19     11.9732       23.909      0.501   |      |*     |     0.015
 20     22.4255       23.666      0.948   |      |*     |     0.065
 21     31.7149       23.389      1.356   |      |**    |     0.159
 22     51.5482       23.077      2.234   |      |****  |     0.511
Figure 73.7 shows the residual statistics table. A fairly close agreement between the PRESS statistic
(see Table 73.8) and the Sum of Squared Residuals indicates that the MSE is a reasonable measure
of the predictive accuracy of the fitted model (Neter, Wasserman, and Kutner 1990).
Figure 73.7 Residual Statistics

Sum of Residuals                      0
Sum of Squared Residuals          12832
Predicted Residual SS (PRESS)     16662
Graphical representations are very helpful in interpreting the information in the “Output Statistics”
table. When you enable ODS Graphics, the REG procedure produces a default set of diagnostic
plots that are appropriate for the requested analysis.
Figure 73.8 displays a panel of diagnostic plots. These diagnostics indicate an inadequate model:
   - The plots of residual and studentized residual versus predicted value show a clear quadratic
     pattern.
   - The plot of studentized residual versus leverage seems to indicate that there are two outlying
     data points. However, the plot of Cook’s D versus observation number reveals that
     these two points are just the data points for the endpoint years 1790 and 2000. These points
     show up as apparent outliers because the departure of the linear model from the underlying
     quadratic behavior in the data shows up most strongly at these endpoints.
   - The normal quantile plot of the residuals and the residual histogram are not consistent with
     the assumption of Gaussian errors. This occurs because the residuals themselves still contain the
     quadratic behavior that is not captured by the linear model.
   - The plot of the dependent variable versus the predicted value exhibits a quadratic form around
     the 45-degree line that represents a perfect fit.
   - The “Residual-Fit” (or RF) plot consisting of side-by-side quantile plots of the centered fit
     and the residuals shows that the spread in the residuals is no greater than the spread in the
     centered fit. For inappropriate models, the spread of the residuals in such a plot is often
     greater than the spread of the centered fit. In this case, the RF plot shows that the linear
     model does indeed capture the increasing trend in the data, and hence accounts for much of
     the variation in the response.
Figure 73.8 Diagnostics Panel
Figure 73.9 shows a plot of residuals versus Year. Again you can see the quadratic pattern that
strongly indicates that a quadratic term should be added to the model.
Figure 73.9 Residual Plot
Figure 73.10 shows the “FitPlot” consisting of a scatter plot of the data overlaid with the regression
line, and 95% confidence and prediction limits. Note that this plot also indicates that the model fails
to capture the quadratic nature of the data. This plot is produced for models containing a single
regressor. You can use the ALPHA= option in the MODEL statement to change the significance level
of the confidence band and prediction limits.
Figure 73.10 Fit Plot
These default plots provide strong evidence that the YearSq term needs to be added to the model. You can
use the interactive feature of PROC REG to do this by specifying the following statements:
add YearSq;
print;
run;
The ADD statement requests that YearSq be added to the model, and the PRINT statement causes
the model to be refit and displays the ANOVA and parameter estimates for the new model. The
PRINT statement also produces updated ODS graphical displays.
Figure 73.11 displays the ANOVA table and parameter estimates for the new model.
Figure 73.11 ANOVA Table and Parameter Estimates

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2           159529         79765   8864.19   <.0001
Error             19        170.97193       8.99852
Corrected Total   21           159700

Root MSE          2.99975   R-Square   0.9989
Dependent Mean   94.64800   Adj R-Sq   0.9988
Coeff Var         3.16938

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1                21631        639.50181     33.82     <.0001
Year         1            -24.04581          0.67547    -35.60     <.0001
YearSq       1              0.00668       0.00017820     37.51     <.0001
The overall F statistic is still significant (F =8864.19, p<0.0001). The R-square has increased from
0.9197 to 0.9989, indicating that the model now accounts for 99.9% of the variation in Population.
All effects are significant with p<0.0001 for each effect in the model.
The fitted equation is now
$$\text{Population} = 21631 - 24.046 \, \text{Year} + 0.0067 \, \text{YearSq}$$
Figure 73.12 shows the panel of diagnostics for this quadratic polynomial model. These diagnostics
indicate that this model is considerably more successful than the corresponding linear model:
   - The plots of residuals and studentized residuals versus predicted values exhibit no obvious
     patterns.
   - The points on the plot of the dependent variable versus the predicted values lie along a 45-degree
     line, indicating that the model successfully predicts the behavior of the dependent
     variable.
   - The plot of studentized residual versus leverage shows that the years 1790 and 2000 are
     leverage points with 2000 showing up as an outlier. This is confirmed by the plot of Cook’s
     D versus observation number. This suggests that while the quadratic model fits the
     current data well, the model might not be quite so successful over a wider range of data.
     You might want to investigate whether the population trend over the last couple of decades is
     growing slightly faster than quadratically.
Figure 73.12 Diagnostics Panel
When a model contains more than one regressor, PROC REG does not produce a fit plot. However, when all the regressors in the model are functions of a single variable, it is appropriate to
plot predictions and residuals as a function of that variable. You request such plots by using the
PLOTS=PREDICTIONS option in the PROC REG statement, as the following code illustrates:
proc reg data=USPopulation plots=predictions(X=Year);
model Population=Year Yearsq;
quit;
ods graphics off;
Figure 73.13 shows the data, predictions, and residuals by Year. These plots confirm that the
quadratic polynomial model successfully models the growth in U.S. population between the years
1790 and 2000.
Figure 73.13 Predictions and Residuals by Year
To complete an analysis of these data, you might want to examine influence statistics and, since the
data are essentially time series data, examine the Durbin-Watson statistic.
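One way to request both of these follow-up analyses is sketched below, assuming the USPopulation data set created earlier. The INFLUENCE option in the MODEL statement requests influence statistics, and the DW option requests the Durbin-Watson statistic.

```sas
/* Assumption: the USPopulation data set created above is available. */
proc reg data=USPopulation;
   /* INFLUENCE: influence statistics for each observation.      */
   /* DW: Durbin-Watson statistic for first-order autocorrelation. */
   model Population = Year YearSq / influence dw;
run;
quit;
```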
Using PROC REG Interactively
The REG procedure can be used interactively. After you specify a model with a MODEL statement and run PROC REG with a RUN statement, a variety of statements can be executed without
reinvoking PROC REG.
The section “Interactive Analysis” on page 5513 describes which statements can be used interactively. These interactive statements can be executed singly or in groups by following the single
statement or group of statements with a RUN statement. Note that the MODEL statement can be repeated. This is an important difference from the GLM procedure, which supports only one MODEL
statement.
If you use PROC REG interactively, you can end the REG procedure with a DATA step, another
PROC step, an ENDSAS statement, or a QUIT statement. The syntax of the QUIT statement is
quit;
When you are using PROC REG interactively, additional RUN statements do not end PROC REG
but tell the procedure to execute additional statements.
When a BY statement is used with PROC REG, interactive processing is not possible; that is, once
the first RUN statement is encountered, processing proceeds for each BY group in the data set, and
no further statements are accepted by the procedure.
When you use PROC REG interactively, you can fit a model, perform diagnostics, and then refit the
model and perform diagnostics on the refitted model. Most of the interactive statements implicitly
refit the model; for example, if you use the ADD statement to add a variable to the model, the
regression equation is automatically recomputed. The two exceptions to this automatic recomputing
are the PAINT and REWEIGHT statements. These two statements do not cause the model to be
refitted. To refit the model, you can follow these statements either with a REFIT statement, which
causes the model to be explicitly recomputed, or with another interactive statement that causes the
model to be implicitly recomputed.
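The following sketch illustrates this interactive pattern, assuming the USPopulation data set from the section “Polynomial Regression”. Excluding the year 2000 is an arbitrary choice made only to show the mechanics of REWEIGHT, REFIT, and UNDO; it is not a recommended analysis step.

```sas
/* Assumption: the USPopulation data set created earlier is available. */
proc reg data=USPopulation;
   model Population = Year YearSq;
run;

   /* Set the weight of the year 2000 observation to 0 (the default  */
   /* REWEIGHT action). This statement alone does not refit the model. */
   reweight Year = 2000;
   refit;           /* explicitly recompute the fit                    */
   print;           /* display results for the refitted model          */
run;

   reweight undo;   /* undo the most recent REWEIGHT change            */
   print;           /* PRINT implicitly refits with all observations   */
run;
quit;
```

Each group of statements is executed when its RUN statement is submitted, and the QUIT statement ends the procedure.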
Syntax: REG Procedure
The following statements are available in PROC REG:
PROC REG < options > ;
   < label: > MODEL dependents=< regressors > < / options > ;
   BY variables ;
   FREQ variable ;
   ID variables ;
   VAR variables ;
   WEIGHT variable ;
   ADD variables ;
   DELETE variables ;
   < label: > MTEST < equation, . . . ,equation > < / options > ;
   OUTPUT < OUT=SAS-data-set > < keyword=names > < . . . keyword=names > ;
   PAINT < condition | ALLOBS > < / options > | < STATUS | UNDO > ;
   PLOT < yvariable*xvariable > < =symbol > < . . . yvariable*xvariable > < =symbol > < / options > ;
   PRINT < options > < ANOVA > < MODELDATA > ;
   REFIT ;
   RESTRICT equation, . . . ,equation ;
   REWEIGHT < condition | ALLOBS > < / options > | < STATUS | UNDO > ;
   < label: > TEST equation < , . . . ,equation > < / option > ;
Although there are numerous statements and options available in PROC REG, many analyses use
only a few of them. Often you can find the features you need by looking at an example or by
scanning this section.
In the preceding list, brackets denote optional specifications, and vertical bars denote a choice of
one of the specifications separated by the vertical bars. In all cases, label is optional.
The PROC REG statement is required. To fit a model to the data, you must specify the MODEL
statement. If you want to use only the options available in the PROC REG statement, you do not
need a MODEL statement, but you must use a VAR statement. (See the example in the section
“OUTSSCP= Data Sets” on page 5512.) Several MODEL statements can be used. In addition,
several MTEST, OUTPUT, PAINT, PLOT, PRINT, RESTRICT, and TEST statements can follow
each MODEL statement.
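As an illustrative sketch of several labeled MODEL statements in one PROC step, consider the following. It assumes the Class data set from the section “Simple Linear Regression”; the labels HeightOnly, HeightAge, and AgeEffect are hypothetical names chosen for this example.

```sas
/* Assumption: the Class data set (Height, Weight, Age) is available. */
proc reg data=Class;
   HeightOnly: model Weight = Height;
   HeightAge:  model Weight = Height Age;
   /* A TEST statement applies to the MODEL statement that precedes it; */
   /* here it tests whether the Age coefficient is zero in HeightAge.   */
   AgeEffect:  test Age = 0;
run;
quit;
```

The labels appear in the displayed output in place of the default names such as MODEL1, which makes results from multiple models easier to tell apart.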
The ADD, DELETE, and REWEIGHT statements are used interactively to change the regression
model and the data used in fitting the model. The ADD, DELETE, MTEST, OUTPUT, PLOT,
PRINT, RESTRICT, and TEST statements implicitly refit the model; changes made to the model
are reflected in the results from these statements. The REFIT statement is used to refit the model
explicitly and is most helpful when it follows PAINT and REWEIGHT statements, which do not
refit the model.
The BY, FREQ, ID, VAR, and WEIGHT statements are optionally specified once for the entire
PROC step, and they must appear before the first RUN statement.
When a TYPE=CORR, TYPE=COV, or TYPE=SSCP data set is used as an input data set to PROC
REG, statements and options that require the original data are not available. Specifically, the
OUTPUT, PAINT, PLOT, and REWEIGHT statements and the MODEL and PRINT statement
options P, R, CLM, CLI, DW, DWPROB, INFLUENCE, PARTIAL, and PARTIALDATA are disabled.
You can specify the following statements with the REG procedure in addition to the PROC REG
statement:
ADD
adds independent variables to the regression model.
BY
specifies variables to define subgroups for the analysis.
DELETE
deletes independent variables from the regression model.
FREQ
specifies a frequency variable.
ID
names a variable to identify observations in the tables.
MODEL
specifies the dependent and independent variables in the regression model, requests a model selection method, displays predicted values, and provides details
on the estimates (according to which options are selected).
MTEST
performs multivariate tests across multiple dependent variables.
OUTPUT
creates an output data set and names the variables to contain predicted values,
residuals, and other diagnostic statistics.
PAINT
paints points in scatter plots.
PLOT
generates scatter plots.
PRINT
displays information about the model and can reset options.
REFIT
refits the model.
RESTRICT
places linear equality restrictions on the parameter estimates.
REWEIGHT
excludes specific observations from analysis or changes the weights of observations used.
TEST
performs an F test on linear functions of the parameters.
VAR
lists variables for which crossproducts are to be computed, variables that can be
interactively added to the model, or variables to be used in scatter plots.
WEIGHT
declares a variable to weight observations.
PROC REG Statement
PROC REG < options > ;
The PROC REG statement is required. If you want to fit a model to the data, you must also use a
MODEL statement. If you want to use only the PROC REG options, you do not need a MODEL
statement, but you must use a VAR statement. If you do not use a MODEL statement, then the
COVOUT and OUTEST= options are not available.
Table 73.1 lists the options you can use with the PROC REG statement. Note that any option
specified in the PROC REG statement applies to all MODEL statements.
Table 73.1  PROC REG Statement Options

Option         Description

Data Set Options
DATA=          names a data set to use for the regression
OUTEST=        outputs a data set that contains parameter estimates and other
               model fit summary statistics
OUTSSCP=       outputs a data set that contains sums of squares and crossproducts
COVOUT         outputs the covariance matrix for parameter estimates to the
               OUTEST= data set
EDF            outputs the number of regressors, the error degrees of freedom,
               and the model R-square to the OUTEST= data set
OUTSEB         outputs standard errors of the parameter estimates to the
               OUTEST= data set
OUTSTB         outputs standardized parameter estimates to the OUTEST= data
               set. Use only with the RIDGE= or PCOMIT= option.
OUTVIF         outputs the variance inflation factors to the OUTEST= data set.
               Use only with the RIDGE= or PCOMIT= option.
PCOMIT=        performs incomplete principal component analysis and outputs
               estimates to the OUTEST= data set
PRESS          outputs the PRESS statistic to the OUTEST= data set
RIDGE=         performs ridge regression analysis and outputs estimates to the
               OUTEST= data set
RSQUARE        same effect as the EDF option
TABLEOUT       outputs standard errors, confidence limits, and associated test
               statistics of the parameter estimates to the OUTEST= data set

ODS Graphics Options
PLOTS=         produces ODS graphical displays

Traditional Graphics Options
ANNOTATE=      specifies an annotation data set
GOUT=          specifies the graphics catalog in which graphics output is saved

Display Options
CORR           displays correlation matrix for variables listed in MODEL and
               VAR statements
SIMPLE         displays simple statistics for each variable listed in MODEL and
               VAR statements
USSCP          displays uncorrected sums of squares and crossproducts matrix
ALL            displays all statistics (CORR, SIMPLE, and USSCP)
NOPRINT        suppresses output
LINEPRINTER    creates plots requested as line printer plots

Other Options
ALPHA=         sets significance value for confidence and prediction intervals and
               tests
SINGULAR=      sets criterion for checking for singularity
Following are explanations of the options that you can specify in the PROC REG statement (in
alphabetical order).
Note that any option specified in the PROC REG statement applies to all MODEL statements.
ALL
requests the display of many tables. Using the ALL option in the PROC REG statement is
equivalent to specifying ALL in every MODEL statement. The ALL option also implies the
CORR, SIMPLE, and USSCP options.
ALPHA=number
sets the significance level used for the construction of confidence intervals. The value must
be between 0 and 1; the default value of 0.05 results in 95% intervals. This option affects the
PROC REG option TABLEOUT; the MODEL options CLB, CLI, and CLM; the OUTPUT
statement keywords LCL, LCLM, UCL, and UCLM; the PLOT statement keywords LCL.,
LCLM., UCL., and UCLM.; and the PLOT statement options CONF and PRED.
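For illustration, the sketch below sets ALPHA=0.10 so that the limits requested by the CLB, CLM, and CLI options and by the OUTPUT statement keywords would be 90% limits. It assumes the SASHELP.CLASS sample data set; the output data set and variable names are hypothetical.

```sas
proc reg data=sashelp.class alpha=0.10;
   /* CLB, CLM, and CLI use the ALPHA= value from the PROC REG statement */
   model weight = height / clb clm cli;
   /* hypothetical output names; limits follow the same ALPHA= value */
   output out=work.limits p=pred
          lclm=lo_mean uclm=hi_mean
          lcl=lo_pred  ucl=hi_pred;
run;
quit;
```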
ANNOTATE=SAS-data-set
ANNO=SAS-data-set
specifies an input data set containing annotate variables, as described in SAS/GRAPH Software: Reference. You can use this data set to add features to the traditional graphics that you
request with the PLOT statement. Features provided in this data set are applied to all plots
produced in the current run of PROC REG. To add features to individual plots, use the ANNOTATE= option in the PLOT statement. This option cannot be used if the LINEPRINTER
option is specified.
CORR
displays the correlation matrix for all variables listed in the MODEL or VAR statement.
COVOUT
outputs the covariance matrices for the parameter estimates to the OUTEST= data set. This
option is valid only if the OUTEST= option is also specified. See the section “OUTEST=
Data Set” on page 5506.
DATA=SAS-data-set
names the SAS data set to be used by PROC REG. The data set can be an ordinary SAS data
set or a TYPE=CORR, TYPE=COV, or TYPE=SSCP data set. If one of these special TYPE=
data sets is used, the OUTPUT, PAINT, PLOT, and REWEIGHT statements, ODS Graphics,
and some options in the MODEL and PRINT statements are not available. See Appendix A,
“Special SAS Data Sets,” for more information about TYPE= data sets. If the DATA= option
is not specified, PROC REG uses the most recently created SAS data set.
EDF
outputs the number of regressors in the model excluding and including the intercept, the error
degrees of freedom, and the model R-square to the OUTEST= data set.
GOUT=graphics-catalog
specifies the graphics catalog in which traditional graphics output is saved. The default
graphics-catalog is WORK.GSEG. The GOUT= option cannot be used if the LINEPRINTER
option is specified.
LINEPRINTER | LP
creates plots requested as line printer plots. If you do not specify this option, requested plots
are created on a high-resolution graphics device. This option is required if plots are requested
and you do not have SAS/GRAPH software.
NOPRINT
suppresses the normal display of results. Note that this option temporarily disables the Output Delivery System (ODS); see Chapter 20, “Using the Output Delivery System,” for more
information.
OUTEST=SAS-data-set
requests that parameter estimates and optional model fit summary statistics be output to this
data set. See the section “OUTEST= Data Set” on page 5506 for details. If you want to
create a permanent SAS data set, you must specify a two-level name (refer to the section
“SAS Files” in SAS Language Reference: Concepts for more information about permanent
SAS data sets).
OUTSEB
outputs the standard errors of the parameter estimates to the OUTEST= data set. The value
SEB for the variable _TYPE_ identifies the standard errors. If the RIDGE= or PCOMIT= option is specified, additional observations are included and identified by the values RIDGESEB
and IPCSEB, respectively, for the variable _TYPE_. The standard errors for ridge regression
estimates and IPC estimates are limited in their usefulness because these estimates are biased. This option is available for all model selection methods except RSQUARE, ADJRSQ,
and CP.
OUTSSCP=SAS-data-set
requests that the sums of squares and crossproducts matrix be output to this TYPE=SSCP data
set. See the section “OUTSSCP= Data Sets” on page 5512 for details. If you want to create a
permanent SAS data set, you must specify a two-level name (refer to the section “SAS Files”
in SAS Language Reference: Concepts for more information about permanent SAS data sets).
OUTSTB
outputs the standardized parameter estimates as well as the usual estimates to the OUTEST=
data set when the RIDGE= or PCOMIT= option is specified. The values RIDGESTB and
IPCSTB for the variable _TYPE_ identify ridge regression estimates and IPC estimates, respectively.
OUTVIF
outputs the variance inflation factors (VIF) to the OUTEST= data set when the RIDGE=
or PCOMIT= option is specified. The factors are the diagonal elements of the inverse of
the correlation matrix of regressors as adjusted by ridge regression or IPC analysis. These
observations are identified in the output data set by the values RIDGEVIF and IPCVIF for
the variable _TYPE_.
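A minimal sketch of requesting these additional OUTEST= observations, assuming the SASHELP.CLASS sample data set; the ridge values and the data set name WORK.RIDGE_EST are chosen only for illustration.

```sas
/* OUTSTB and OUTVIF take effect because RIDGE= is specified */
proc reg data=sashelp.class outest=work.ridge_est outstb outvif
         ridge=0 to 0.1 by 0.02 noprint;
   model weight = height age;
run;
quit;

/* Rows are distinguished by _TYPE_ (for example RIDGESTB and RIDGEVIF) */
proc print data=work.ridge_est;
run;
```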
PCOMIT=list
requests an incomplete principal component (IPC) analysis for each value m in the list. The
procedure computes parameter estimates by using all but the last m principal components.
Each value of m produces a set of IPC estimates, which are output to the OUTEST= data set.
The values of m are saved by the variable _PCOMIT_, and the value of the variable _TYPE_
is set to IPC to identify the estimates. Only nonnegative integers can be specified with the
PCOMIT= option.
If you specify the PCOMIT= option, RESTRICT statements are ignored.
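As a hedged example, the following step requests IPC estimates that omit the last principal component. It assumes the SASHELP.CLASS sample data set; the data set name WORK.IPC_EST is hypothetical.

```sas
/* PCOMIT=1 computes estimates by using all but the last principal component */
proc reg data=sashelp.class outest=work.ipc_est pcomit=1 noprint;
   model weight = height age;
run;
quit;

/* IPC rows carry _TYPE_='IPC' and the value of m in _PCOMIT_ */
proc print data=work.ipc_est;
run;
```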
PLOTS < (global-plot-options) > < = plot-request < (options) > >
PLOTS < (global-plot-options) > < = (plot-request < (options) > < ... plot-request < (options) > >) >
controls the plots produced through ODS Graphics. When you specify only one plot request,
you can omit the parentheses around the plot request. Here are some examples:
plots = none
plots = diagnostics(unpack)
plots = (all fit(stats=none))
plots(label) = (rstudentbyleverage cooksd)
plots(only) = (diagnostics(stats=all) fit(nocli stats=(aic sbc)))
You must enable ODS Graphics before requesting plots, as shown in the following example.
For general information about ODS Graphics, see Chapter 21, “Statistical Graphics Using
ODS.”
ods graphics on;
proc reg;
model y = x1-x10;
run;
proc reg plots=diagnostics(stats=(default aic sbc));
model y = x1-x10;
run;
ods graphics off;
If you have enabled ODS Graphics but do not specify the PLOTS= option, then PROC REG
produces a default set of plots. Table 73.2 lists the default set of plots produced.
Table 73.2  Default ODS Graphics Produced

Plot               Conditional On
DiagnosticsPanel   Unconditional
ResidualPlot       Unconditional
FitPlot            Model with one regressor (excluding intercept)
PartialPlot        PARTIAL option specified in MODEL statement
RidgePanel         RIDGE= option specified in PROC REG or MODEL statement
For models with multiple dependent variables, separate plots are produced for each dependent
variable. For jobs with more than one MODEL statement, plots are produced for each model
statement.
The global-plot-options apply to all plots generated by the REG procedure, unless they are altered by
a specific plot option. The following global plot options are available:
LABEL
specifies that the LABEL option be applied to each plot that supports a LABEL option.
See the descriptions of the specific plots for details.
MAXPOINTS=NONE | number
specifies that plots with elements that require processing more than number points be
suppressed. The default is MAXPOINTS=5000. This cutoff is ignored if you specify
MAXPOINTS=NONE.
MODELLABEL
requests that the model label be displayed in the upper-left corner of all plots. This
option is useful when you use more than one MODEL statement.
ONLY
suppresses the default plots. Only plots specifically requested are displayed.
STATS=ALL | DEFAULT | NONE | (plot-statistics)
requests statistics that are included on the fit plot and diagnostics panel. Table 73.3
lists the statistics that you can request. STATS=ALL requests all these statistics;
STATS=NONE suppresses them.
Table 73.3  Statistics Available on Plots

Keyword     Default   Description
ADJRSQ      x         adjusted R-square
AIC                   Akaike’s information criterion
BIC                   Sawa’s Bayesian information criterion
CP                    Mallows’ Cp statistic
COEFFVAR              coefficient of variation
DEPMEAN               mean of dependent
DEFAULT               all default statistics
EDF         x         error degrees of freedom
GMSEP                 estimated MSE of prediction, assuming multivariate normality
JP                    final prediction error
MSE         x         mean squared error
NOBS        x         number of observations used
NPARM       x         number of parameters in the model (including the intercept)
PC                    Amemiya’s prediction criterion
RSQUARE     x         R-square
SBC                   SBC statistic
SP                    SP statistic
SSE                   error sum of squares
You request statistics in addition to the default set by including the keyword DEFAULT
in the plot-statistics list.
UNPACK
suppresses paneling.
USEALL
specifies that predicted values at data points with missing dependent variable(s) be
included on appropriate plots. By default, only points used in constructing the SSCP
matrix appear on plots.
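The global plot options can be combined inside the parentheses that follow PLOTS. The sketch below is illustrative only; it assumes the SASHELP.CLASS sample data set and uses its Name variable as the ID variable for labeling.

```sas
ods graphics on;

/* ONLY suppresses the default plots, LABEL applies to plots that support it,
   and STATS= adds AIC and SBC to the default statistics */
proc reg data=sashelp.class
         plots(only label stats=(default aic sbc))=(fit cooksd);
   id name;
   model weight = height;
run;
quit;

ods graphics off;
```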
The following specific plots are available:
ADJRSQ < (adjrsq-options) >
displays the adjusted R-square values for the models examined when you request variable selection with the SELECTION= option in the MODEL statement.
The following adjrsq-options are available for models where you request the
RSQUARE, ADJRSQ, or CP selection method:
LABEL
requests that the model number corresponding to the one displayed in the “Subset
Selection Summary” table be used to label the model with the largest adjusted R-square statistic at each value of the number of parameters.
LABELVARS
requests that the list (excluding the intercept) of the regressors in the relevant
model be used to label the model with the largest adjusted R-square statistic at
each value of the number of parameters.
AIC < (aic-options) >
displays Akaike’s information criterion (AIC) for the models examined when you request variable selection with the SELECTION= option in the MODEL statement.
The following aic-options are available for models where you request the RSQUARE,
ADJRSQ, or CP selection method:
LABEL
requests that the model number corresponding to the one displayed in the “Subset Selection Summary” table be used to label the model with the smallest AIC
statistic at each value of the number of parameters.
LABELVARS
requests that the list (excluding the intercept) of the regressors in the relevant
model be used to label the model with the smallest AIC statistic at each value of
the number of parameters.
ALL
produces all appropriate plots.
BIC < (bic-options) >
displays Sawa’s Bayesian information criterion (BIC) for the models examined when
you request variable selection with the SELECTION= option in the MODEL statement.
The following bic-options are available for models where you request the RSQUARE,
ADJRSQ, or CP selection method:
LABEL
requests that the model number corresponding to the one displayed in the “Subset Selection Summary” table be used to label the model with the smallest BIC
statistic at each value of the number of parameters.
LABELVARS
requests that the list (excluding the intercept) of the regressors in the relevant
model be used to label the model with the smallest BIC statistic at each value of
the number of parameters.
COOKSD <(LABEL)>
plots Cook’s D statistic by observation number. Observations whose Cook’s D statistic
lies above the horizontal reference line at value $4/n$, where n is the number of observations used, are deemed to be influential (Rawlings 1998). If you specify the LABEL
option, then points deemed as influential are labeled. If you do not specify an ID variable, the observation number within the current BY group is used as the label. If you
specify one or more ID variables in one or more ID statements, then the first ID variable
you specify is used for the labeling.
CP < (cp-options) >
displays Mallows’ Cp statistic for the models examined when you request variable
selection with the SELECTION= option in the MODEL statement. For models where
you request the RSQUARE, ADJRSQ, or CP selection, reference lines corresponding
to the equations $C_p = p$ and $C_p = 2p - p_{full}$, where $p_{full}$ is the number of
parameters in the full model (excluding the intercept) and p is the number of parameters
in the subset model (including the intercept), are displayed on the plot of $C_p$ versus p.
For the purpose of parameter estimation, Hocking (1976) suggests selecting a model
where $C_p \le 2p - p_{full}$. For the purpose of prediction, Hocking suggests the criterion
$C_p \le p$. Mallows (1973) suggests that all subset models with $C_p$ small and near p be
considered for further study.
The following cp-options are available for models where you request the RSQUARE,
ADJRSQ, or CP selection method:
LABEL
requests that the model number corresponding to the one displayed in the “Subset Selection Summary” table be used to label the model with the smallest Cp
statistic at each value of the number of parameters.
LABELVARS
requests that the list (excluding the intercept) of the regressors in the relevant
model be used to label the model with the smallest Cp statistic at each value of
the number of parameters.
CRITERIA | CRITERIONPANEL < (criteria-options) >
produces a panel of fit criteria for the models examined when you request variable
selection with the SELECTION= option in the MODEL statement. The fit criteria
displayed are R-square, adjusted R-square, Mallows’ Cp, Akaike’s information criterion (AIC), Sawa’s Bayesian information criterion (BIC), and Schwarz’s Bayesian
information criterion (SBC). For SELECTION=RSQUARE, SELECTION=ADJRSQ,
or SELECTION=CP, scatter plots of these statistics versus the number of parameters
(including the intercept) are displayed. For other selection methods, line plots of these
statistics as a function of the selection step number are displayed.
The following criteria-options are available:
LABEL
requests that the model number corresponding to the one displayed in the “Subset
Selection Summary” table be used to label the best model at each value of the
number of parameters. This option applies only to the RSQUARE, ADJRSQ,
and CP selection methods.
LABELVARS
requests that the list (excluding the intercept) of the regressors in the relevant
model be used to label the best model at each value of the number of parameters.
Since these labels are typically long, LABELVARS is supported only when the
panel is unpacked. This option applies only to the RSQUARE, ADJRSQ, and CP
selection methods.
UNPACK
suppresses paneling. Separate plots are produced for each of the six fit statistics.
For models where you request the RSQUARE, ADJRSQ, or CP selection, two
reference lines corresponding to the equations $C_p = p$ and $C_p = 2p - p_{full}$,
where $p_{full}$ is the number of parameters in the full model (excluding the intercept) and p is the number of parameters in the subset model (including the intercept), are displayed on the plot of $C_p$ versus p. For the purpose of parameter
estimation, Hocking (1976) suggests selecting a model where $C_p \le 2p - p_{full}$.
For the purpose of prediction, Hocking suggests the criterion $C_p \le p$. Mallows
(1973) suggests that all subset models with $C_p$ small and near p be considered
for further study.
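A short sketch of requesting the Cp plot and the criterion panel for an all-subsets selection. It assumes the SASHELP.CLASS sample data set, which has only two candidate regressors, so the plots would be sparse; a data set with more regressors would be more informative.

```sas
ods graphics on;

/* SELECTION=CP makes the LABEL suboption applicable to both plot requests */
proc reg data=sashelp.class plots(only)=(cp(label) criteria(label));
   model weight = height age / selection=cp;
run;
quit;

ods graphics off;
```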
DFBETAS < (DFBETAS-options) >
produces panels of DFBETAS by observation number for the regressors in the model.
Note that each panel contains at most six plots, and multiple panels are used in the
case where there are more than six regressors (including the intercept) in the model.
Observations whose DFBETAS statistics for a regressor are greater in magnitude than
$2/\sqrt{n}$, where n is the number of observations used, are deemed to be influential for
that regressor (Rawlings 1998).
The following DFBETAS-options are available:
COMMONAXES
specifies that the same DFBETAS axis be used in all panels when multiple panels
are needed. By default, the DFBETAS axis is chosen independently for each
panel. If you also specify the UNPACK option, then the same DFBETAS axis is
used for each regressor.
LABEL
specifies that observations whose DFBETAS statistics are greater in magnitude than $2/\sqrt{n}$ be labeled.
If you do not specify an ID variable, the observation number within the current
BY group is used as the label. If you specify one or more ID variables on one or
more ID statements, then the first ID variable you specify is used for the labeling.
UNPACK
suppresses paneling. The DFBETAS statistics for each regressor are displayed
on separate plots.
DFFITS < (LABEL) >
plots the DFFITS statistic by observation number. Observations whose DFFITS statistic
is greater in magnitude than $2\sqrt{p/n}$, where n is the number of observations used
and p is the number of regressors, are deemed to be influential (Rawlings 1998). If you
specify the LABEL option, then these influential observations are labeled. If you do
not specify an ID variable, the observation number within the current BY group is used
as the label. If you specify one or more ID variables in one or more ID statements, then
the first ID variable you specify is used for the labeling.
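The DFBETAS and DFFITS plot requests might be combined as in the sketch below, which assumes the SASHELP.CLASS sample data set and uses Name as the ID variable so that labeled points show a name instead of an observation number.

```sas
ods graphics on;

proc reg data=sashelp.class
         plots(only)=(dfbetas(label commonaxes) dffits(label));
   id name;                       /* first ID variable is used for labels */
   model weight = height age;
run;
quit;

ods graphics off;
```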
DIAGNOSTICS < (diagnostics-options) >
produces a summary panel of fit diagnostics:
residuals versus the predicted values
studentized residuals versus the predicted values
studentized residuals versus the leverage
normal quantile plot of the residuals
dependent variable values versus the predicted values
Cook’s D versus observation number
histogram of the residuals
“Residual-Fit” (or RF) plot consisting of side-by-side quantile plots of the centered
fit and the residuals
box plot of the residuals if you specify the STATS=NONE suboption
You can specify the following diagnostics-options:
STATS=stats-options
determines which model fit statistics are included in the panel. See the global
STATS= suboption for details. The STATS= suboption of the DIAGNOSTICS option overrides the global STATS= suboption.
UNPACK
produces the eight plots in the panel as individual plots. Note that you can also
request individual plots in the panel by name without having to unpack the panel.
FITPLOT | FIT < (fit-options) >
produces a scatter plot of the data overlaid with the regression line, confidence band,
and prediction band for models that depend on at most one regressor excluding the
intercept.
You can specify the following fit-options:
NOCLI
suppresses the prediction limits.
NOCLM
suppresses the confidence limits.
NOLIMITS
suppresses the confidence and prediction limits.
STATS=stats-options
determines which model fit statistics are included on the plot. See the global
STATS= suboption for details. The STATS= suboption of the FITPLOT option
overrides the global STATS= suboption.
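As an illustration of the DIAGNOSTICS and FITPLOT suboptions, the following sketch (assuming the SASHELP.CLASS sample data set) drops the statistics from the diagnostics panel, which per the list above makes room for the residual box plot, and tailors the fit plot.

```sas
ods graphics on;

proc reg data=sashelp.class
         plots(only)=(diagnostics(stats=none)
                      fit(nocli stats=(default aic sbc)));
   /* a single regressor, so the fit plot is applicable */
   model weight = height;
run;
quit;

ods graphics off;
```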
OBSERVEDBYPREDICTED < (LABEL) >
plots dependent variable values by the predicted values. If you specify the LABEL option, then points deemed as outliers or influential (see the RSTUDENTBYLEVERAGE
option for details) are labeled.
NONE
suppresses all plots.
PARTIAL < (UNPACK) >
produces panels of partial regression plots for each regressor with at most six regressors
per panel. If you specify the UNPACK option, then all partial plot panels are unpacked.
PREDICTIONS (X=numeric-variable < prediction-options >)
produces a panel of two plots whose horizontal axis is the variable you specify in the
required X= suboption. The upper plot in the panel is a scatter plot of the residuals.
The lower plot shows the data overlaid with the regression line, confidence band, and
prediction band. This plot is appropriate for models where all regressors are known to
be functions of the single variable that you specify in the X= suboption.
You can specify the following prediction-options:
NOCLI
suppresses the prediction limits.
NOCLM
suppresses the confidence limits.
NOLIMITS
suppresses the confidence and prediction limits.
SMOOTH
requests a nonparametric smooth of the residuals as a function of the variable you
specify in the X= suboption. This nonparametric fit is a loess fit that uses local
linear polynomials, linear interpolation, and a smoothing parameter selected that
yields a local minimum of the corrected Akaike information criterion (AICC).
See Chapter 50, “The LOESS Procedure,” for details. The SMOOTH option is
not supported when a FREQ statement is used.
UNPACK
suppresses paneling.
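The PREDICTIONS request is suited to models such as a polynomial in one variable. The sketch below assumes the SASHELP.CLASS sample data set; the data set WORK.POLY and the squared term HEIGHT2 are hypothetical names created for the example.

```sas
/* Build a quadratic term so that all regressors are functions of Height */
data work.poly;
   set sashelp.class;
   height2 = height * height;
run;

ods graphics on;

proc reg data=work.poly plots(only)=predictions(x=height smooth nocli);
   model weight = height height2;
run;
quit;

ods graphics off;
```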
QQPLOT | QQ
produces a normal quantile plot of the residuals.
RESIDUALBOXPLOT | BOXPLOT < (LABEL) >
produces a box plot consisting of the residuals. If you specify the LABEL option, points
deemed far-outliers are labeled. If you do not specify an ID variable, the observation
number within the current BY group is used as the label. If you specify one or more
ID variables in one or more ID statements, then the first ID variable you specify is used
for the labeling.
RESIDUALBYPREDICTED < (LABEL) >
plots residuals by predicted values. If you specify the LABEL option, then points
deemed as outliers or influential (see the RSTUDENTBYLEVERAGE option for details) are labeled.
RESIDUALS < (residual-options) >
produces panels of the residuals versus the regressors in the model. Note that each
panel contains at most six plots, and multiple panels are used in the case where there
are more than six regressors (including the intercept) in the model.
The following residual-options are available:
SMOOTH
requests a nonparametric smooth of the residuals for each regressor. Each nonparametric fit is a loess fit that uses local linear polynomials, linear interpolation,
and a smoothing parameter selected that yields a local minimum of the corrected
Akaike information criterion (AICC). See Chapter 50, “The LOESS Procedure,”
for details. The SMOOTH option is not supported when a FREQ statement is
used.
UNPACK
suppresses paneling.
RESIDUALHISTOGRAM
produces a histogram of the residuals.
RFPLOT | RF
produces a “Residual-Fit” (or RF) plot consisting of side-by-side quantile plots of the
centered fit and the residuals. This plot “shows how much variation in the data is
explained by the fit and how much remains in the residuals” (Cleveland 1993).
RIDGE | RIDGEPANEL | RIDGEPLOT < (ridge-options) >
creates panels of VIF values and standardized ridge estimates by ridge values for each
coefficient. The VIF values for each coefficient are connected by lines and are displayed in the upper plot in each panel. The points corresponding to the standardized
estimates of each coefficient are connected by lines and are displayed in the lower plot
in each panel. By default, at most 10 coefficients are represented in a panel and multiple
panels are produced for models with more than 10 regressors. For ridge estimates to be
computed and plotted, the OUTEST= option must be specified in the PROC REG statement, and the RIDGE= list must be specified in either the PROC REG or the MODEL
statement. (See Example 73.5.)
The following ridge-options are available:
COMMONAXES
specifies that the same VIF axis and the same standardized estimate axis are used
in all panels when multiple panels are needed. By default, these axes are chosen
independently for the regressors shown in each panel.
RIDGEAXIS=LINEAR | LOG
specifies the axis type used to display the ridge parameters. The default is
RIDGEAXIS=LINEAR. Note that the point with the ridge parameter equal to
zero is not displayed if you specify RIDGEAXIS=LOG.
UNPACK
suppresses paneling. The traces of the VIF statistics and standardized estimates
are shown in separate plots.
VARSPERPLOT=ALL
VARSPERPLOT=number
specifies the maximum number of regressors displayed in each panel or in each
plot if you additionally specify the UNPACK option. If you specify VARSPERPLOT=ALL, then the VIF values and ridge traces for all regressors are displayed
in a single panel.
VIFAXIS=LINEAR | LOG
specifies the axis type used to display the VIF statistics. The default is VIFAXIS=LINEAR.
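A hedged sketch of a ridge plot request follows, assuming the SASHELP.CLASS sample data set; the ridge values and the data set name WORK.RIDGE_EST are illustrative. As stated above, the OUTEST= option is specified so that ridge estimates are computed and plotted.

```sas
ods graphics on;

proc reg data=sashelp.class outest=work.ridge_est outvif
         ridge=0 to 0.1 by 0.02
         plots(only)=ridge(unpack vifaxis=log);
   model weight = height age;
run;
quit;

ods graphics off;
```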
RSQUARE < (rsquare-options) >
displays the R-square values for the models examined when you request variable selection with the SELECTION= option in the MODEL statement.
The following rsquare-options are available for models where you request the
RSQUARE, ADJRSQ, or CP selection method:
LABEL
requests that the model number corresponding to the one displayed in the “Subset
Selection Summary” table be used to label the model with the largest R-square
statistic at each value of the number of parameters.
LABELVARS
requests that the list (excluding the intercept) of the regressors in the relevant
model be used to label the model with the largest R-square statistic at each value
of the number of parameters.
RSTUDENTBYLEVERAGE < (LABEL) >
plots studentized residuals by leverage. Observations whose studentized residuals
lie outside the band between the reference lines RSTUDENT = $\pm 2$ are deemed
outliers. Observations whose leverage values are greater than the vertical reference
line LEVERAGE = $2p/n$, where p is the number of parameters excluding the intercept
and n is the number of observations used, are deemed influential (Rawlings 1998). If
you specify the LABEL option, then points deemed as outliers or influential are labeled. If you do not specify an ID variable, the observation number within the current
BY group is used as the label. If you specify one or more ID variables in one or more
ID statements, then the first ID variable you specify is used for the labeling.
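The same cutoffs can be applied outside the plot. The sketch below assumes the SASHELP.CLASS sample data set; the values of n and p, and all output data set and variable names, are hypothetical and should be replaced to match your own model and the definitions given above.

```sas
ods graphics on;

proc reg data=sashelp.class plots(only)=rstudentbyleverage(label);
   id name;
   model weight = height age;
   /* save studentized residuals and leverage under hypothetical names */
   output out=work.infl rstudent=rstud h=lev;
run;
quit;

ods graphics off;

data work.flagged;
   set work.infl;
   n = 19;   /* assumed number of observations used */
   p = 2;    /* assumed parameter count, as defined in the text */
   outlier       = (abs(rstud) > 2);      /* outside RSTUDENT = +/-2 */
   high_leverage = (lev > 2 * p / n);     /* beyond LEVERAGE = 2p/n  */
run;

proc print data=work.flagged;
   where outlier or high_leverage;
run;
```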
RSTUDENTBYPREDICTED < (LABEL) >
plots studentized residuals by predicted values. If you specify the LABEL option, then
points deemed as outliers or influential (see the RSTUDENTBYLEVERAGE option
for details) are labeled.
SBC < (sbc-options) >
displays Schwarz’s Bayesian information criterion (SBC) for the models examined
when you request variable selection with the SELECTION= option in the MODEL
statement.
The following sbc-options are available for models where you request the RSQUARE,
ADJRSQ, or CP selection method:
LABEL
requests that the model number corresponding to the one displayed in the “Subset
Selection Summary” table be used to label the model with the smallest SBC
statistic at each value of the number of parameters.
LABELVARS
requests that the list (excluding the intercept) of the regressors in the relevant
model be used to label the model with the smallest SBC statistic at each value of
the number of parameters.
PRESS
outputs the PRESS statistic to the OUTEST= data set. The values of this statistic are saved
in the variable _PRESS_. This option is available for all model selection methods except
RSQUARE, ADJRSQ, and CP.
RIDGE=list
requests a ridge regression analysis and specifies the values of the ridge constant k (see the
section “Computations for Ridge Regression and IPC Analysis” on page 5577). Each value
of k produces a set of ridge regression estimates that are placed in the OUTEST= data set.
The values of k are saved by the variable _RIDGE_, and the value of the variable _TYPE_ is
set to RIDGE to identify the estimates.
Only nonnegative numbers can be specified with the RIDGE= option. Example 73.5 illustrates this option.
If ODS Graphics is in effect (see the section “ODS Graphics” on page 5583), then ridge
regression plots are automatically produced. These plots consist of panels containing ridge
traces for the regressors, with at most eight ridge traces per panel.
If you specify the RIDGE= option, RESTRICT statements are ignored.
RSQUARE
has the same effect as the EDF option.
SIMPLE
displays the sum, mean, variance, standard deviation, and uncorrected sum of squares for
each variable used in PROC REG.
SINGULAR=n
tunes the mechanism used to check for singularities. The default value is machine dependent
but is approximately 1E-7 on most machines. This option is rarely needed.
Singularity checking is described in the section “Computational Methods” on page 5578.
TABLEOUT
outputs the standard errors and $100(1-\alpha)\%$ confidence limits for the parameter estimates, the
t statistics for testing if the estimates are zero, and the associated p-values to the OUTEST=
data set. The _TYPE_ variable values STDERR, LnB, UnB, T, and PVALUE, where $n =
100(1-\alpha)$, identify these rows in the OUTEST= data set. The $\alpha$ level can be set with the
ALPHA= option in the PROC REG or MODEL statement. The OUTEST= option must be
specified in the PROC REG statement for this option to take effect.
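The following sketch shows one way the TABLEOUT option might be combined with the OUTEST= and ALPHA= options. It assumes that the SASHELP.CLASS sample data set is available; the output data set name WORK.EST is hypothetical.

```sas
/* TABLEOUT has an effect only because OUTEST= is also specified */
proc reg data=sashelp.class outest=work.est tableout alpha=0.10;
   model weight = height age;
run;
quit;

/* Inspect the extra rows identified by the _TYPE_ variable */
proc print data=work.est;
run;
```

With ALPHA=0.10, the rule $n = 100(1-\alpha)$ stated above suggests that the confidence limit rows should be identified by the _TYPE_ values L90B and U90B.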
USSCP
displays the uncorrected sums-of-squares and crossproducts matrix for all variables used in
the procedure.
ADD Statement
ADD variables ;
The ADD statement adds independent variables to the regression model. Only variables used in the
VAR statement or used in MODEL statements before the first RUN statement can be added to the
model. You can use the ADD statement interactively to add variables to the model or to include a
variable that was previously deleted with a DELETE statement. Each use of the ADD statement
modifies the MODEL label.
See the section “Interactive Analysis” on page 5513 for an example.
BY Statement
BY variables ;
You can specify a BY statement with PROC REG to obtain separate analyses on observations in
groups defined by the BY variables. When a BY statement appears, the procedure expects the input
data set to be sorted in the order of the BY variables. The variables are one or more variables in the
input data set.
If your input data set is not sorted in ascending order, use one of the following alternatives.
• Sort the data by using the SORT procedure with a similar BY statement.
• Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for
  the REG procedure. The NOTSORTED option does not mean that the data are unsorted but
  rather that the data are arranged in groups (according to values of the BY variables) and that
  these groups are not necessarily in alphabetical or increasing numeric order.
• Create an index on the BY variables by using the DATASETS procedure (in Base SAS software).
When a BY statement is used with PROC REG, interactive processing is not possible; that is, once
the first RUN statement is encountered, processing proceeds for each BY group in the data set, and
no further statements are accepted by the procedure. A BY statement that appears after the first
RUN statement is ignored.
For more information about the BY statement, see SAS Language Reference: Concepts. For more
information about the DATASETS procedure, see the Base SAS Procedures Guide.
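As an illustration of the first alternative, the following sketch sorts the input data before running a BY-group analysis. It assumes that the SASHELP.CLASS sample data set is available; the name WORK.CLASS_SORTED is hypothetical.

```sas
/* Sort by the BY variable so that PROC REG sees the groups in order */
proc sort data=sashelp.class out=work.class_sorted;
   by sex;
run;

/* A separate regression is fit for each value of SEX */
proc reg data=work.class_sorted;
   by sex;
   model weight = height;
run;
quit;
```

Because a BY statement is present, all statements for the analysis are given before the first RUN statement.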
DELETE Statement
DELETE variables ;
The DELETE statement deletes independent variables from the regression model. The DELETE
statement performs the opposite function of the ADD statement and is used in a similar manner.
Each use of the DELETE statement modifies the MODEL label.
For an example of how the ADD statement is used (and how the DELETE statement can be used),
see the section “Interactive Analysis” on page 5513.
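A short sketch of interactive use of the ADD and DELETE statements follows. It assumes that the SASHELP.CLASS sample data set is available; the choice of variables is for illustration only.

```sas
proc reg data=sashelp.class;
   var age;              /* AGE is not in the first MODEL, so declare it here */
   model weight = height;
run;

   add age;              /* model becomes weight = height age */
   print;
run;

   delete height;        /* model becomes weight = age */
   print;
run;
quit;
```

The VAR statement is what makes AGE eligible for the later ADD statement, because AGE does not appear in a MODEL statement before the first RUN statement.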
FREQ Statement
FREQ variable ;
When a FREQ statement appears, each observation in the input data set is assumed to represent n
observations, where n is the value of the FREQ variable. The analysis produced when you use a
FREQ statement is the same as an analysis produced by using a data set that contains n observations in place of each observation in the input data set. When the procedure determines degrees of
freedom for significance tests, the total number of observations is considered to be equal to the sum
of the values of the FREQ variable.
If the value of the FREQ variable is missing or is less than 1, the observation is not used in the
analysis. If the value is not an integer, only the integer portion is used.
The FREQ statement must appear before the first RUN statement, or it is ignored.
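The equivalence described above can be checked with a small sketch such as the following, which fits the same model once with a FREQ statement and once on an expanded copy of the data. The data values and data set names are hypothetical.

```sas
/* Hypothetical grouped data: N is the number of times each row occurs */
data work.grouped;
   input x y n;
   datalines;
1 2.1 3
2 3.9 1
3 6.2 2
4 7.8 4
;
run;

/* Expand each row into N identical observations */
data work.expanded;
   set work.grouped;
   do copy = 1 to n;
      output;
   end;
   drop copy n;
run;

/* Analysis that uses the FREQ statement */
proc reg data=work.grouped;
   freq n;
   model y = x;
run;
quit;

/* Analysis of the expanded data; results should match the FREQ analysis */
proc reg data=work.expanded;
   model y = x;
run;
quit;
```

Both analyses should report 10 observations used, the sum of the values of N, and therefore the same degrees of freedom.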
ID Statement
ID variables ;
When one of the MODEL statement options CLI, CLM, P, R, and INFLUENCE is requested, the
variables listed in the ID statement are displayed beside each observation. These variables can be
used to identify each observation. If the ID statement is omitted, the observation number is used to
identify the observations.
Although there are no restrictions on the length of ID variables, PROC REG might truncate ID
values to 16 characters for display purposes.
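For example, the following sketch labels each row of the predicted and residual value displays with a name rather than an observation number. It assumes that the SASHELP.CLASS sample data set is available.

```sas
proc reg data=sashelp.class;
   id name;                        /* NAME is shown beside each observation */
   model weight = height / p r;    /* P and R request observation-level displays */
run;
quit;
```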
MODEL Statement
< label: > MODEL dependents=< regressors > < / options > ;
After the keyword MODEL, the dependent (response) variables are specified, followed by an equal
sign and the regressor variables. Variables specified in the MODEL statement must be numeric
variables in the data set being analyzed. For example, if you want to specify a quadratic term
for variable X1 in the model, you cannot use X1*X1 in the MODEL statement but must create a
new variable (for example, X1SQUARE=X1*X1) in a DATA step and use this new variable in the
MODEL statement. The label in the MODEL statement is optional.
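The quadratic-term case described above might look like the following sketch. It assumes that the SASHELP.CLASS sample data set is available; the data set name WORK.POLY, the variable HEIGHTSQ, and the label QUAD are hypothetical.

```sas
/* Create the squared term in a DATA step */
data work.poly;
   set sashelp.class;
   heightsq = height*height;
run;

/* Use the new variable as an ordinary regressor; QUAD is an optional label */
proc reg data=work.poly;
   quad: model weight = height heightsq;
run;
quit;
```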
Table 73.4 lists the options available in the MODEL statement. Equations for the statistics available
are given in the section “Model Fit and Diagnostic Statistics” on page 5551.
Table 73.4   MODEL Statement Options

Option          Description

Model Selection and Details of Selection
SELECTION=      specifies model selection method
BEST=           specifies maximum number of subset models displayed or output
                to the OUTEST= data set
DETAILS         produces summary statistics at each step
DETAILS=        specifies the display details for FORWARD, BACKWARD, and
                STEPWISE methods
GROUPNAMES=     provides names for groups of variables
INCLUDE=        includes first n variables in the model
MAXSTEP=        specifies maximum number of steps that might be performed
NOINT           fits a model without the intercept term
PCOMIT=         performs incomplete principal component analysis and outputs
                estimates to the OUTEST= data set
RIDGE=          performs ridge regression analysis and outputs estimates to the
                OUTEST= data set
SLE=            sets criterion for entry into model
SLS=            sets criterion for staying in model
START=          specifies number of variables in model to begin the comparing
                and switching process
STOP=           stops selection criterion

Statistics
ADJRSQ          computes adjusted $R^2$
AIC             computes Akaike’s information criterion
B               computes parameter estimates for each model
BIC             computes Sawa’s Bayesian information criterion
CP              computes Mallows’ $C_p$ statistic
GMSEP           computes estimated MSE of prediction assuming multivariate
                normality
JP              computes $J_p$, the final prediction error
MSE             computes MSE for each model
PC              computes Amemiya’s prediction criterion
RMSE            displays root MSE for each model
SBC             computes the SBC statistic
SP              computes $S_p$ statistic for each model
SSE             computes error sum of squares for each model

Data Set Options
EDF             outputs the number of regressors, the error degrees of freedom,
                and the model $R^2$ to the OUTEST= data set
OUTSEB          outputs standard errors of the parameter estimates to the
                OUTEST= data set
OUTSTB          outputs standardized parameter estimates to the OUTEST=
                data set. Use only with the RIDGE= or PCOMIT= option.
OUTVIF          outputs the variance inflation factors to the OUTEST= data set.
                Use only with the RIDGE= or PCOMIT= option.
PRESS           outputs the PRESS statistic to the OUTEST= data set
RSQUARE         has same effect as the EDF option

Regression Calculations
I               displays inverse of sums of squares and crossproducts
XPX             displays sums-of-squares and crossproducts matrix

Details on Estimates
ACOV            displays heteroscedasticity-consistent covariance matrix of
                estimates and heteroscedasticity-consistent standard errors
ACOVMETHOD=     specifies method for computing the asymptotic
                heteroscedasticity-consistent covariance matrix
COLLIN          produces collinearity analysis
COLLINOINT      produces collinearity analysis with intercept adjusted out
CORRB           displays correlation matrix of estimates
COVB            displays covariance matrix of estimates
HCC             displays heteroscedasticity-consistent standard errors
HCCMETHOD=      specifies method for computing the asymptotic
                heteroscedasticity-consistent covariance matrix
LACKFIT         performs lack-of-fit test
PARTIALR2       displays squared semipartial correlation coefficients computed
                using Type I sums of squares
PCORR1          displays squared partial correlation coefficients computed
                using Type I sums of squares
PCORR2          displays squared partial correlation coefficients computed
                using Type II sums of squares
SCORR1          displays squared semipartial correlation coefficients computed
                using Type I sums of squares
SCORR2          displays squared semipartial correlation coefficients computed
                using Type II sums of squares
SEQB            displays a sequence of parameter estimates during selection
                process
SPEC            tests that first and second moments of model are correctly
                specified
SS1             displays the sequential sums of squares
SS2             displays the partial sums of squares
STB             displays standardized parameter estimates
TOL             displays tolerance values for parameter estimates
WHITE           displays heteroscedasticity-consistent standard errors
VIF             computes variance-inflation factors

Predicted and Residual Values
CLB             computes $100(1-\alpha)\%$ confidence limits for the parameter
                estimates
CLI             computes $100(1-\alpha)\%$ confidence limits for an individual
                predicted value
CLM             computes $100(1-\alpha)\%$ confidence limits for the expected
                value of the dependent variable
DW              computes a Durbin-Watson statistic
DWPROB          computes a Durbin-Watson statistic and p-value
INFLUENCE       computes influence statistics
P               computes predicted values
PARTIAL         displays partial regression plots for each regressor
PARTIALDATA     displays partial regression data
R               produces analysis of residuals

Display Options and Other Options
ALL             requests the following options:
                ACOV, CLB, CLI, CLM, CORRB, COVB, HCC, I, P,
                PCORR1, PCORR2, R, SCORR1, SCORR2, SEQB, SPEC,
                SS1, SS2, STB, TOL, VIF, XPX
ALPHA=          sets significance value for confidence and prediction intervals
                and tests
NOPRINT         suppresses display of results
SIGMA=          specifies the true standard deviation of error term for
                computing CP and BIC
SINGULAR=       sets criterion for checking for singularity
You can specify the following options in the MODEL statement after a slash (/).
ACOV
displays the estimated asymptotic covariance matrix of the estimates under the hypothesis of heteroscedasticity, along with heteroscedasticity-consistent standard errors of the parameter estimates. See the HCCMETHOD= and HCC options and the section “Testing for
Heteroscedasticity” on page 5569 for more information.
ACOVMETHOD=0,1,2, or 3
See the HCCMETHOD= option.
ADJRSQ
computes $R^2$ adjusted for degrees of freedom for each model selected (Darlington 1968;
Judge et al. 1980).
AIC
outputs Akaike’s information criterion for each model selected (Akaike 1969; Judge et al.
1980) to the OUTEST= data set. If SELECTION=ADJRSQ, SELECTION=RSQUARE, or
SELECTION=CP is specified, then the AIC statistic is also added to the SubsetSelSummary
table.
ALL
requests all these options: ACOV, CLB, CLI, CLM, CORRB, COVB, HCC, I, P, PCORR1,
PCORR2, R, SCORR1, SCORR2, SEQB, SPEC, SS1, SS2, STB, TOL, VIF, and XPX.
ALPHA=number
sets the significance level used for the construction of confidence intervals for the current
MODEL statement. The value must be between 0 and 1; the default value of 0.05 results in
95% intervals. This option affects the MODEL options CLB, CLI, and CLM; the OUTPUT
statement keywords LCL, LCLM, UCL, and UCLM; the PLOT statement keywords LCL.,
LCLM., UCL., and UCLM.; and the PLOT statement options CONF and PRED. If you specify this option in the MODEL statement, it takes precedence over the ALPHA= option in the
PROC REG statement.
B
is used with the RSQUARE, ADJRSQ, and CP model-selection methods to compute estimated regression coefficients for each model selected.
BEST=n
is used with the RSQUARE, ADJRSQ, and CP model-selection methods. If SELECTION=CP or SELECTION=ADJRSQ is specified, the BEST= option specifies the maximum
number of subset models to be displayed or output to the OUTEST= data set. For SELECTION=RSQUARE, the BEST= option requests the maximum number of subset models for
each size.
If the BEST= option is used without the B option (displaying estimated regression coefficients), the variables in each MODEL are listed in order of inclusion instead of the order in
which they appear in the MODEL statement.
If the BEST= option is omitted and the number of regressors is less than 11, all possible
subsets are evaluated. If the BEST= option is omitted and the number of regressors is greater
than 10, the number of subsets selected is, at most, equal to the number of regressors. A small
value of the BEST= option greatly reduces the CPU time required for large problems.
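As a hedged illustration of the BEST= and B options, the following sketch asks for the two best subsets of each size under the RSQUARE method and displays their coefficients. It assumes that the SASHELP.CARS sample data set is available; the choice of response and regressors is for illustration only.

```sas
proc reg data=sashelp.cars;
   /* BEST=2 limits the display to two subset models for each subset size; */
   /* B adds the estimated regression coefficients for each model          */
   model mpg_city = weight horsepower enginesize wheelbase length
         / selection=rsquare best=2 b;
run;
quit;
```

With five regressors, omitting BEST= would evaluate all possible subsets, as described above.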
BIC
outputs Sawa’s Bayesian information criterion for each model selected (Sawa 1978; Judge et
al. 1980) to the OUTEST= data set. If SELECTION=ADJRSQ, SELECTION=RSQUARE,
or SELECTION=CP is specified, then the BIC statistic is also added to the SubsetSelSummary table.
CLB
requests the $100(1-\alpha)\%$ upper and lower confidence limits for the parameter estimates. By
default, the 95% limits are computed; the ALPHA= option in the PROC REG or MODEL
statement can be used to change the $\alpha$ level. If any of the MODEL statement options
ACOV, HCC, or WHITE are in effect, then the CLB option also produces heteroscedasticity-consistent $100(1-\alpha)\%$ upper and lower confidence limits for the parameter estimates.
CLI
requests the $100(1-\alpha)\%$ upper and lower confidence limits for an individual predicted value.
By default, the 95% limits are computed; the ALPHA= option in the PROC REG or MODEL
statement can be used to change the $\alpha$ level. The confidence limits reflect variation in the
error, as well as variation in the parameter estimates. See the section “Predicted and Residual
Values” on page 5525 and Chapter 4, “Introduction to Regression Procedures,” for more
information.
CLM
displays the $100(1-\alpha)\%$ upper and lower confidence limits for the expected value of the
dependent variable (mean) for each observation. By default, the 95% limits are computed;
the ALPHA= option in the PROC REG or MODEL statement can be used to change the $\alpha$ level. This
is not a prediction interval (see the CLI option) because it takes into account only the variation
in the parameter estimates, not the variation in the error term. See the section “Predicted and
Residual Values” on page 5525 and Chapter 4, “Introduction to Regression Procedures,” for
more information.
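The following sketch requests all three kinds of limits in one MODEL statement and shows the ALPHA= precedence rule in use. It assumes that the SASHELP.CLASS sample data set is available.

```sas
proc reg data=sashelp.class alpha=0.05;
   id name;
   /* ALPHA=0.10 in the MODEL statement overrides ALPHA=0.05 in the   */
   /* PROC REG statement, so 90% limits are produced for this model   */
   model weight = height / clb cli clm alpha=0.10;
run;
quit;
```

In the resulting display, the CLI limits for an observation should be wider than the CLM limits, because the CLI limits also reflect the variation in the error term.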
COLLIN
requests a detailed analysis of collinearity among the regressors. This includes eigenvalues,
condition indices, and decomposition of the variances of the estimates with respect to each
eigenvalue. See the section “Collinearity Diagnostics” on page 5549.
COLLINOINT
requests the same analysis as the COLLIN option with the intercept variable adjusted out
rather than included in the diagnostics. See the section “Collinearity Diagnostics” on
page 5549.
CORRB
displays the correlation matrix of the estimates. This is the $(X'X)^{-1}$ matrix scaled to unit
diagonals.
COVB
displays the estimated covariance matrix of the estimates. This matrix is $(X'X)^{-1}s^2$, where
$s^2$ is the estimated mean squared error.
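The relationship between the COVB and CORRB matrices can be written out as follows. This is a sketch in standard least squares notation; the symbols $b$, $n$, $p$, and $C$ are introduced here for illustration, with $n$ the number of observations used and $p$ the number of parameters including the intercept.

```latex
\widehat{\operatorname{Cov}}(b) = s^2 (X'X)^{-1},
\qquad
s^2 = \frac{\mathrm{SSE}}{n - p}

% Scaling to unit diagonals: with C = (X'X)^{-1},
\widehat{\operatorname{Corr}}(b_i, b_j)
  = \frac{s^2 C_{ij}}{\sqrt{s^2 C_{ii}}\,\sqrt{s^2 C_{jj}}}
  = \frac{C_{ij}}{\sqrt{C_{ii}\, C_{jj}}}
```

The factor $s^2$ cancels in the second expression, which is why the correlation matrix depends only on $(X'X)^{-1}$.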
CP
outputs Mallows’ $C_p$ statistic for each model selected (Mallows 1973; Hocking 1976)
to the OUTEST= data set. See the section “Criteria Used in Model-Selection Methods”
on page 5520 for a discussion of the use of $C_p$. If SELECTION=ADJRSQ, SELECTION=RSQUARE, or SELECTION=CP is specified, then the $C_p$ statistic is also added to
the SubsetSelSummary table.
DETAILS
DETAILS=name
specifies the level of detail produced when the BACKWARD, FORWARD, or STEPWISE
method is used, where name can be ALL, STEPS, or SUMMARY. The DETAILS or DETAILS=ALL option produces entry and removal statistics for each variable in the model
building process, ANOVA and parameter estimates at each step, and a selection summary
table. The option DETAILS=STEPS provides the step information and summary table. The
option DETAILS=SUMMARY produces only the summary table. The default if the DETAILS option is omitted is DETAILS=STEPS.
DW
calculates a Durbin-Watson statistic to test whether or not the errors have first-order autocorrelation. (This test is appropriate only for time series data.) Note that your data should be
sorted by the date/time ID variable before you use this option. The sample autocorrelation
of the residuals is also produced. See the section “Autocorrelation in Time Series Data” on
page 5575.
DWPROB
calculates a Durbin-Watson statistic and a p-value to test whether or not the errors have first-order autocorrelation. Note that it is not necessary to specify the DW option if the DWPROB
option is specified. (This test is appropriate only for time series data.) Note that your data
should be sorted by the date/time ID variable before you use this option. The sample autocorrelation of the residuals is also produced. See the section “Autocorrelation in Time Series
Data” on page 5575.
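A minimal sketch of the sorting requirement and the DWPROB option follows. It assumes that the SASHELP.AIR sample data set (with variables DATE and AIR) is available; the data set WORK.TS and the trend variable T are hypothetical.

```sas
/* Put the series in time order before requesting the test */
proc sort data=sashelp.air out=work.ts;
   by date;
run;

/* Add a simple linear time index to use as the regressor */
data work.ts;
   set work.ts;
   t = _n_;
run;

/* DWPROB gives the Durbin-Watson statistic and its p-value; DW is not needed */
proc reg data=work.ts;
   model air = t / dwprob;
run;
quit;
```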
EDF
outputs the number of regressors in the model excluding and including the intercept, the error
degrees of freedom, and the model R2 to the OUTEST= data set.
GMSEP
outputs the estimated mean square error of prediction assuming that both independent and
dependent variables are multivariate normal (Stein 1960; Darlington 1968) to the OUTEST=
data set. (Note that Hocking’s formula (1976, eq. 4.20) contains a misprint: “$n-1$” should
read “$n-2$.”) If SELECTION=ADJRSQ, SELECTION=RSQUARE, or SELECTION=CP
is specified, then the GMSEP statistic is also added to the SubsetSelSummary table.
GROUPNAMES='name1' 'name2' . . .
provides names for variable groups. This option is available only in the BACKWARD, FORWARD, and STEPWISE methods. The group name can be up to 32 characters. Subsets of
independent variables listed in the MODEL statement can be designated as variable groups.
This is done by enclosing the appropriate variables in braces. Variables in the same group
are entered into or removed from the regression model at the same time. However, if the
tolerance of any variable (see the TOL option on page 5474) in a group is less than the setting
of the SINGULAR= option, then the variable is not entered into the model with the rest of
its group. If the GROUPNAMES= option is not used, then the names GROUP1, GROUP2,
. . . , GROUPn are assigned to groups encountered in the MODEL statement. Variables not
enclosed by braces are used as groups of a single variable.
For example:
   model y={x1 x2} x3 / selection=stepwise
         groupnames='x1 x2' 'x3';
Another example:
   model y={ht wgt age} bodyfat / selection=forward
         groupnames='htwgtage' 'bodyfat';
HCC
requests heteroscedasticity-consistent standard errors of the parameter estimates. You can use
the HCCMETHOD= option to specify the method used to compute the heteroscedasticity-consistent covariance matrix.
HCCMETHOD=0,1,2, or 3
specifies the method used to obtain a heteroscedasticity-consistent covariance matrix for use
with the ACOV, HCC, or WHITE option in the MODEL statement and for heteroscedasticity-consistent tests with the TEST statement. The default is HCCMETHOD=0. See the section
“Testing for Heteroscedasticity” on page 5569 for details.
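For instance, the following sketch requests heteroscedasticity-consistent standard errors computed with a nondefault method, together with the corresponding confidence limits. It assumes that the SASHELP.CLASS sample data set is available; the choice of HCCMETHOD=3 is for illustration only.

```sas
proc reg data=sashelp.class;
   /* HCC requests the robust standard errors, HCCMETHOD= selects how the */
   /* covariance matrix is computed, and CLB then also reports            */
   /* heteroscedasticity-consistent confidence limits                     */
   model weight = height age / hcc hccmethod=3 clb;
run;
quit;
```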
I
displays the $(X'X)^{-1}$ matrix. The inverse of the crossproducts matrix is bordered by the
parameter estimates and SSE matrices.
INCLUDE=n
forces the first n independent variables listed in the MODEL statement to be included in all
models. The selection methods are performed on the other variables in the MODEL statement. The INCLUDE= option is not available with SELECTION=NONE.
INFLUENCE
requests a detailed analysis of the influence of each observation on the estimates and the
predicted values. See the section “Influence Statistics” on page 5553 for details.
JP
outputs $J_p$, the estimated mean square error of prediction for each model selected, to the OUTEST= data set. The estimate assumes that the values of the regressors are fixed and that the model is correct. The $J_p$ statistic is also called the final prediction error (FPE) by Akaike (Nicholson
1948; Lord 1950; Mallows 1967; Darlington 1968; Rothman 1968; Akaike 1969; Hocking
1976; Judge et al. 1980). If SELECTION=ADJRSQ, SELECTION=RSQUARE, or SELECTION=CP is specified, then the $J_p$ statistic is also added to the SubsetSelSummary table.
LACKFIT
performs a lack-of-fit test. See the section “Testing for Lack of Fit” on page 5570 for more
information. Refer to Draper and Smith (1981) for a discussion of lack-of-fit tests.
MSE
computes the mean square error for each model selected (Darlington 1968).
MAXSTEP=n
specifies the maximum number of steps that are done when SELECTION=FORWARD, SELECTION=BACKWARD, or SELECTION=STEPWISE is used. The default value is the
number of independent variables in the model for the FORWARD and BACKWARD methods and three times this number for the stepwise method.
NOINT
suppresses the intercept term that is otherwise included in the model.
NOPRINT
suppresses the normal display of regression results. Note that this option temporarily disables
the Output Delivery System (ODS); see Chapter 20, “Using the Output Delivery System,” for
more information.
OUTSEB
outputs the standard errors of the parameter estimates to the OUTEST= data set. The value
SEB for the variable _TYPE_ identifies the standard errors. If the RIDGE= or PCOMIT= option is specified, additional observations are included and identified by the values RIDGESEB
and IPCSEB, respectively, for the variable _TYPE_. The standard errors for ridge regression
estimates and incomplete principal components (IPC) estimates are limited in their usefulness
because these estimates are biased. This option is available for all model-selection methods
except RSQUARE, ADJRSQ, and CP.
OUTSTB
outputs the standardized parameter estimates as well as the usual estimates to the OUTEST=
data set when the RIDGE= or PCOMIT= option is specified. The values RIDGESTB and
IPCSTB for the variable _TYPE_ identify ridge regression estimates and IPC estimates, respectively.
OUTVIF
outputs the variance inflation factors (VIF) to the OUTEST= data set when the RIDGE=
or PCOMIT= option is specified. The factors are the diagonal elements of the inverse of
the correlation matrix of regressors as adjusted by ridge regression or IPC analysis. These
observations are identified in the output data set by the values RIDGEVIF and IPCVIF for
the variable _TYPE_.
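The following sketch shows how the OUTSEB, OUTSTB, and OUTVIF options might be combined with a ridge analysis requested in the MODEL statement. It assumes that the SASHELP.CLASS sample data set is available; the data set name WORK.RIDGE_OUT and the ridge constants are hypothetical.

```sas
proc reg data=sashelp.class outest=work.ridge_out;
   /* Each extra option adds rows to the OUTEST= data set for every k */
   model weight = height age
         / outseb outstb outvif ridge=0 to 0.2 by 0.1;
run;
quit;

/* The _TYPE_ variable distinguishes the kinds of rows, for example */
/* RIDGE, RIDGESEB, RIDGESTB, and RIDGEVIF                          */
proc print data=work.ridge_out;
run;
```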
P
calculates predicted values from the input data and the estimated model. The display includes
the observation number, the ID variable (if one is specified), the actual and predicted values,
and the residual. If the CLI, CLM, or R option is specified, the P option is unnecessary. See
the section “Predicted and Residual Values” on page 5525 for more information.
PARTIAL
requests partial regression leverage plots for each regressor. You can use the PARTIALDATA
option to obtain a tabular display of the partial regression leverage data. If ODS Graphics is
in effect (see the section “ODS Graphics” on page 5583), then these partial plots are produced
in panels with up to six plots per panel. See the section “Influence Statistics” on page 5553
for more information.
PARTIALDATA
requests partial regression leverage data for each regressor. You can request partial regression
leverage plots of these data with the PARTIAL option. See the section “Influence Statistics”
on page 5553 for more information.
PARTIALR2 < ( < TESTS > < SEQTESTS > ) >
See the SCORR1 option.
PC
outputs Amemiya’s prediction criterion for each model selected (Amemiya 1976; Judge et al.
1980) to the OUTEST= data set. If SELECTION=ADJRSQ, SELECTION=RSQUARE, or
SELECTION=CP is specified, then the PC statistic is also added to the SubsetSelSummary
table.
PCOMIT=list
requests an IPC analysis for each value m in the list. The procedure computes parameter
estimates by using all but the last m principal components. Each value of m produces a set
of IPC estimates, which is output to the OUTEST= data set. The values of m are saved by
the variable _PCOMIT_, and the value of the variable _TYPE_ is set to IPC to identify the
estimates. Only nonnegative integers can be specified with the PCOMIT= option.
If you specify the PCOMIT= option, RESTRICT statements are ignored. The PCOMIT=
option is ignored if you use the SELECTION= option in the MODEL statement.
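A minimal sketch of an IPC analysis follows. It assumes that the SASHELP.CARS sample data set is available; the data set name WORK.IPC_OUT, the regressors, and the value PCOMIT=1 are chosen only for illustration.

```sas
proc reg data=sashelp.cars outest=work.ipc_out;
   /* PCOMIT=1 computes estimates from all but the last principal component */
   model mpg_city = weight horsepower enginesize / pcomit=1 outvif;
run;
quit;

/* Rows with _TYPE_='IPC' hold the IPC estimates; _PCOMIT_ holds the value of m */
proc print data=work.ipc_out;
run;
```

No SELECTION= option is specified here, because the PCOMIT= option is ignored when a selection method is used.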
PCORR1
displays the squared partial correlation coefficients computed using Type I sum of squares
(SS). This is calculated as SS/(SS+SSE), where SSE is the error sum of squares.
PCORR2
displays the squared partial correlation coefficients computed using Type II sums of squares.
These are calculated the same way as with the PCORR1 option, except that Type II SS are
used instead of Type I SS.
PRESS
outputs the PRESS statistic to the OUTEST= data set. The values of this statistic are saved
in the variable _PRESS_. This option is available for all model-selection methods except
RSQUARE, ADJRSQ, and CP.
R
requests an analysis of the residuals. The results include everything requested by the P option
plus the standard errors of the mean predicted and residual values, the studentized residual,
and Cook’s D statistic to measure the influence of each observation on the parameter estimates. See the section “Predicted and Residual Values” on page 5525 for more information.
RIDGE=list
requests a ridge regression analysis and specifies the values of the ridge constant k (see the
section “Computations for Ridge Regression and IPC Analysis” on page 5577). Each value
of k produces a set of ridge regression estimates that are placed in the OUTEST= data set.
The values of k are saved by the variable _RIDGE_, and the value of the variable _TYPE_ is
set to RIDGE to identify the estimates.
Only nonnegative numbers can be specified with the RIDGE= option. Example 73.5 illustrates this option.
If you specify the RIDGE= option, RESTRICT statements are ignored. The RIDGE= option
is ignored if you use the SELECTION= option in the MODEL statement.
RMSE
displays the root mean square error for each model selected.
RSQUARE
has the same effect as the EDF option.
SBC
outputs the SBC statistic for each model selected (Schwarz 1978; Judge et al. 1980) to
the OUTEST= data set. If SELECTION=ADJRSQ, SELECTION=RSQUARE, or SELECTION=CP is specified, then the SBC statistic is also added to the SubsetSelSummary table.
SCORR1 < ( < TESTS > < SEQTESTS > ) >
displays the squared semipartial correlation coefficients computed using Type I sums of
squares. This is calculated as SS/SST, where SST is the corrected total SS. If the NOINT
option is used, the uncorrected total SS is used in the denominator. The optional arguments
TESTS and SEQTESTS request F tests and p-values as variables are sequentially added to a model. The F-test values are
computed as the Type I sum of squares for the variable in question divided by a mean square
error. If you specify the TESTS option, the denominator MSE is the residual mean square for
the full model specified in the MODEL statement. If you specify the SEQTESTS option, the
denominator MSE is the residual mean square for the model containing all the independent
variables that have been added to the model up to and including the variable in question. The
TESTS and SEQTESTS options are not supported if you specify model selection methods
or the RIDGE or PCOMIT options. Note that the PARTIALR2 option is a synonym for the
SCORR1 option.
SCORR2 < ( TESTS ) >
displays the squared semipartial correlation coefficients computed using Type II sums of
squares. These are calculated the same way as with the SCORR1 option, except that Type II
SS are used instead of Type I SS. The optional TESTS argument requests F tests and p-values
as variables are sequentially added to a model. The F-test values are computed as the Type
II sum of squares for the variable in question divided by the residual mean square for the full
model specified in the MODEL statement. The TESTS option is not supported if you specify
model selection methods or the RIDGE or PCOMIT options.
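The partial and semipartial correlation options might be requested together as in the following sketch, which follows the syntax shown above for the optional TESTS and SEQTESTS arguments. It assumes that the SASHELP.CLASS sample data set is available.

```sas
proc reg data=sashelp.class;
   /* SS1 and SS2 display the sums of squares that the ratios are built from */
   model weight = height age
         / ss1 ss2 pcorr1 pcorr2
           scorr1(tests seqtests) scorr2(tests);
run;
quit;
```

No SELECTION=, RIDGE=, or PCOMIT= option is used, because the TESTS and SEQTESTS arguments are not supported with them.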
SELECTION=name
specifies the method used to select the model, where name can be FORWARD (or F), BACKWARD (or B), STEPWISE, MAXR, MINR, RSQUARE, ADJRSQ, CP, or NONE (use the
full model). The default method is NONE. See the section “Model-Selection Methods” on
page 5517 for a description of each method.
SEQB
produces a sequence of parameter estimates as each variable is entered into the model. This
is displayed as a matrix where each row is a set of parameter estimates.
SIGMA=n
specifies the true standard deviation of the error term to be used in computing the CP and BIC
statistics. If the SIGMA= option is not specified, an estimate from the full model is used.
This option is available in the RSQUARE, ADJRSQ, and CP model-selection methods only.
SINGULAR=n
tunes the mechanism used to check for singularities. If you specify this option in the MODEL
statement, it takes precedence over the SINGULAR= option in the PROC REG statement.
The default value is machine dependent but is approximately 1E-7 on most machines. This
option is rarely needed. Singularity checking is described in the section “Computational
Methods” on page 5578.
SLENTRY=value
SLE=value
specifies the significance level for entry into the model used in the FORWARD and STEPWISE methods. The defaults are 0.50 for FORWARD and 0.15 for STEPWISE.
SLSTAY=value
SLS=value
specifies the significance level for staying in the model for the BACKWARD and STEPWISE
methods. The defaults are 0.10 for BACKWARD and 0.15 for STEPWISE.
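As a hedged example, the following sketch overrides both default significance levels for the STEPWISE method and limits the display to the summary table. It assumes that the SASHELP.CARS sample data set is available; the significance levels shown are arbitrary illustrative choices, not recommendations.

```sas
proc reg data=sashelp.cars;
   /* A variable must be significant at 0.05 to enter and at 0.10 to stay */
   model mpg_city = weight horsepower enginesize wheelbase length
         / selection=stepwise slentry=0.05 slstay=0.10 details=summary;
run;
quit;
```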
SP
outputs the Sp statistic for each model selected (Hocking 1976) to the OUTEST= data set. If
SELECTION=ADJRSQ, SELECTION=RSQUARE, or SELECTION=CP is specified, then
the SP statistic is also added to the SubsetSelSummary table.
SPEC
performs a test that the first and second moments of the model are correctly specified. See
the section “Testing for Heteroscedasticity” on page 5569 for more information.
SS1
displays the sequential sums of squares (Type I SS) along with the parameter estimates for
each term in the model. See Chapter 15, “The Four Types of Estimable Functions,” for more
information about the different types of sums of squares.
SS2
displays the partial sums of squares (Type II SS) along with the parameter estimates for each
term in the model. See the SS1 option also.
SSE
computes the error sum of squares for each model selected.
START=s
is used to begin the comparing-and-switching process in the MAXR, MINR, and STEPWISE
methods for a model containing the first s independent variables in the MODEL statement,
where s is the START value. For these methods, the default is START=0.
For the RSQUARE, ADJRSQ, and CP methods, START=s specifies the smallest number of
regressors to be reported in a subset model. For these methods, the default is START=1.
The START= option cannot be used with model-selection methods other than the six described here.
STB
produces standardized regression coefficients. A standardized regression coefficient is computed by dividing a parameter estimate by the ratio of the sample standard deviation of the
dependent variable to the sample standard deviation of the regressor.
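The computation described above can be written as a formula. The symbols $b_j$, $s_{x_j}$, and $s_y$ are notation introduced here for illustration, and the numbers in the worked case are hypothetical.

```latex
% Standardized coefficient for regressor x_j
b_j^{*} = \frac{b_j}{\,s_y / s_{x_j}\,} = b_j \, \frac{s_{x_j}}{s_y}

% Hypothetical worked case: b_j = 2, s_{x_j} = 3, s_y = 12
b_j^{*} = 2 \times \frac{3}{12} = 0.5
```

In this hypothetical case, an increase of one standard deviation in the regressor corresponds to an estimated increase of half a standard deviation in the dependent variable.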
STOP=s
causes PROC REG to stop when it has found the “best” s-variable model, where s is the
STOP value. For the RSQUARE, ADJRSQ, and CP methods, STOP=s specifies the largest
number of regressors to be reported in a subset model. For the MAXR and MINR methods,
STOP=s specifies the largest number of regressors to be included in the model.
The default setting for the STOP= option is the number of variables in the MODEL statement. This option can be used only with the MAXR, MINR, RSQUARE, ADJRSQ, and CP
methods.
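For the subset methods, the START= and STOP= options bound the sizes of the reported models, as in the following sketch. It assumes that the SASHELP.CARS sample data set is available; the bounds and the BEST= value are illustrative.

```sas
proc reg data=sashelp.cars;
   /* Report only subset models that contain two, three, or four regressors */
   model mpg_city = weight horsepower enginesize wheelbase length
         / selection=cp start=2 stop=4 best=5;
run;
quit;
```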
TOL
produces tolerance values for the estimates. Tolerance for a variable is defined as $1-R^2$,
where $R^2$ is obtained from the regression of the variable on all other regressors in the model.
See the section “Collinearity Diagnostics” on page 5549 for more details.
VIF
produces variance inflation factors with the parameter estimates. Variance inflation is the
reciprocal of tolerance. See the section “Collinearity Diagnostics” on page 5549 for more
detail.
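Because the variance inflation factor is the reciprocal of tolerance, a regressor whose $R^2$ on the other regressors is 0.90 has a tolerance of 0.10 and a variance inflation factor of 10. The following sketch requests both statistics along with the collinearity diagnostics. It assumes that the SASHELP.CARS sample data set is available.

```sas
proc reg data=sashelp.cars;
   /* TOL and VIF add columns to the parameter estimates table; */
   /* COLLIN adds the eigenvalue-based collinearity diagnostics */
   model mpg_city = weight horsepower enginesize / tol vif collin;
run;
quit;
```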
WHITE
See the HCC option.
XPX
displays the $X'X$ crossproducts matrix for the model. The crossproducts matrix is bordered
by the $X'Y$ and $Y'Y$ matrices.
MTEST Statement
< label: > MTEST < equation < , . . . , equation > > < / options > ;
where each equation is a linear function composed of coefficients and variable names. The label is
optional.
The MTEST statement is used to test hypotheses in multivariate regression models where there are
several dependent variables fit to the same regressors. If no equations or options are specified, the
MTEST statement tests the hypothesis that all estimated parameters except the intercept are zero.
The hypotheses that can be tested with the MTEST statement are of the form
   $(L\beta - cj)M = 0$
where $L$ is a linear function on the regressor side, $\beta$ is a matrix of parameters, $c$ is a column vector
of constants, $j$ is a row vector of ones, and $M$ is a linear function on the dependent side. The special
case where the constants are zero is
   $L\beta M = 0$
See the section “Multivariate Tests” on page 5571 for more details.
Each linear function extends across either the regressor variables or the dependent variables. If the
equation is across the dependent variables, then the constant term, if specified, must be zero. The
equations for the regressor variables form the L matrix and c vector in the preceding formula; the
equations for dependent variables form the M matrix. If no equations for the dependent variables are
given, PROC REG uses an identity matrix for M, testing the same hypothesis across all dependent
variables. If no equations for the regressor variables are given, PROC REG forms a linear function
corresponding to a test that all the nonintercept parameters are zero.
As an example, consider the following statements:
   model y1 y2 y3=x1 x2 x3;
   mtest x1,x2;
   mtest y1-y2, y2 -y3, x1;
   mtest y1-y2;
The first MTEST statement tests the hypothesis that the X1 and X2 parameters are zero for Y1, Y2,
and Y3. In addition, the second MTEST statement tests the hypothesis that the X1 parameter is the
same for all three dependent variables. For the same model, the third MTEST statement tests the
hypothesis that all parameters except the intercept are the same for dependent variables Y1 and Y2.
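To make the mapping onto the general form concrete, the second MTEST statement can be written out as matrices. This is a sketch under an assumed ordering: the rows of $\beta$ are taken as intercept, X1, X2, X3 and the columns as Y1, Y2, Y3, so that $\beta_{1k}$ denotes the X1 coefficient for the $k$th dependent variable. The constants are zero, so the special case $L\beta M = 0$ applies.

```latex
L = \begin{pmatrix} 0 & 1 & 0 & 0 \end{pmatrix}, \qquad
M = \begin{pmatrix} 1 & 0 \\ -1 & 1 \\ 0 & -1 \end{pmatrix}, \qquad
L \beta M = \begin{pmatrix} \beta_{11} - \beta_{12} & \beta_{12} - \beta_{13} \end{pmatrix} = 0
```

The single row of $L$ picks out the X1 parameters, and the two columns of $M$ encode the equations y1-y2 and y2-y3, so the hypothesis holds exactly when the X1 parameter is equal across Y1, Y2, and Y3.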
You can specify the following options in the MTEST statement:
CANPRINT
displays the canonical correlations for the hypothesis combinations and the dependent variable combinations. If you specify
mtest / canprint;
the canonical correlations between the regressors and the dependent variables are displayed.
DETAILS
displays the M matrix and various intermediate calculations.
MSTAT=FAPPROX
MSTAT=EXACT
specifies the method of evaluating the multivariate test statistics. The default is
MSTAT=FAPPROX, which specifies that the multivariate tests are evaluated by using the
usual approximations based on the F distribution, as discussed in the “Multivariate Tests”
section in Chapter 4, “Introduction to Regression Procedures.” Alternatively, you can specify
MSTAT=EXACT to compute exact p-values for three of the four tests (Wilks’ lambda, the
Hotelling-Lawley trace, and Roy’s greatest root) and an improved F approximation for the
fourth (Pillai’s trace). While MSTAT=EXACT provides better control of the significance
probability for the tests, especially for Roy’s greatest root, computations for the exact p-values
can be appreciably more demanding, and are in fact infeasible for large problems
(many dependent variables). Thus, although MSTAT=EXACT is more accurate for most
data, it is not the default method.
PRINT
displays the H and E matrices.
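The following statements sketch how the MTEST options might be used together with labels. The data set A and the labels AllZero and SameX1 are hypothetical names introduced only for this illustration.

```sas
/* Hypothetical data set A with responses Y1-Y3 and regressors X1-X3 */
proc reg data=a;
   model y1 y2 y3 = x1 x2 x3;
   /* no equations: test that all nonintercept parameters are zero */
   AllZero: mtest / print canprint;
   /* X1 parameter equal across Y1, Y2, Y3, with exact p-values where available */
   SameX1:  mtest y1-y2, y2-y3, x1 / details mstat=exact;
run;
quit;
```

The labels identify each test in the output. MSTAT=EXACT is requested only for the second, smaller hypothesis, since the exact computations can be demanding when there are many dependent variables.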
OUTPUT Statement
OUTPUT < OUT=SAS-data-set >< keyword=names >
< . . . keyword=names > ;
The OUTPUT statement creates a new SAS data set that saves diagnostic measures calculated after
fitting the model. The OUTPUT statement refers to the most recent MODEL statement. At least
one keyword=names specification is required.
All the variables in the original data set are included in the new data set, along with variables
created in the OUTPUT statement. These new variables contain the values of a variety of statistics
and diagnostic measures that are calculated for each observation in the data set. If you want to create
a permanent SAS data set, you must specify a two-level name (for example, libref.data-set-name).
For more information about permanent SAS data sets, refer to the section “SAS Files” in SAS
Language Reference: Concepts.
The OUTPUT statement cannot be used when a TYPE=CORR, TYPE=COV, or TYPE=SSCP data
set is used as the input data set for PROC REG. See the section “Input Data Sets” on page 5502 for
more details.
The statistics created in the OUTPUT statement are described in this section. More details are given
in the section “Predicted and Residual Values” on page 5525 and the section “Influence Statistics”
on page 5553. Also see Chapter 4, “Introduction to Regression Procedures,” for definitions of the
statistics available from the REG procedure.
You can specify the following options in the OUTPUT statement:
OUT=SAS-data-set
gives the name of the new data set. By default, the procedure uses the DATAn convention to
name the new data set.
keyword=names
specifies the statistics to include in the output data set and names the new variables that
contain the statistics. Specify a keyword for each desired statistic (see the following list of
keywords), an equal sign, and the variable or variables to contain the statistic.
In the output data set, the first variable listed after a keyword in the OUTPUT statement contains that statistic for the first dependent variable listed in the MODEL statement; the second
variable contains the statistic for the second dependent variable in the MODEL statement, and
so on. The list of variables following the equal sign can be shorter than the list of dependent
variables in the MODEL statement. In this case, the procedure creates the new names in order
of the dependent variables in the MODEL statement.
For example, the following SAS statements create an output data set named b:
proc reg data=a;
model y z=x1 x2;
output out=b
p=yhat zhat
r=yresid zresid;
run;
In addition to the variables in the input data set, b contains the following variables:
yhat, with values that are predicted values of the dependent variable y
zhat, with values that are predicted values of the dependent variable z
yresid, with values that are the residual values of y
zresid, with values that are the residual values of z
You can specify the following keywords in the OUTPUT statement. See the section “Model
Fit and Diagnostic Statistics” on page 5551 for computational formulas.
Table 73.5   Keywords for OUTPUT Statement

Keyword               Description
COOKD=names           Cook’s D influence statistic
COVRATIO=names        standard influence of observation on covariance of betas, as
                      discussed in the section “Influence Statistics” on page 5553
DFFITS=names          standard influence of observation on predicted value
H=names               leverage, $x_i (X'X)^{-1} x_i'$
LCL=names             lower bound of a $100(1-\alpha)\%$ confidence interval for an
                      individual prediction. This includes the variance of the
                      error, as well as the variance of the parameter estimates.
LCLM=names            lower bound of a $100(1-\alpha)\%$ confidence interval for the
                      expected value (mean) of the dependent variable
PREDICTED | P=names   predicted values
PRESS=names           $i$th residual divided by $(1 - h)$, where $h$ is the leverage,
                      and where the model has been refit without the $i$th
                      observation
RESIDUAL | R=names    residuals, calculated as ACTUAL minus PREDICTED
RSTUDENT=names        a studentized residual with the current observation deleted
STDI=names            standard error of the individual predicted value
STDP=names            standard error of the mean predicted value
STDR=names            standard error of the residual
STUDENT=names         studentized residuals, which are the residuals divided by their
                      standard errors
UCL=names             upper bound of a $100(1-\alpha)\%$ confidence interval for an
                      individual prediction
UCLM=names            upper bound of a $100(1-\alpha)\%$ confidence interval for the
                      expected value (mean) of the dependent variable
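As an illustration of how several of the keywords in Table 73.5 might be requested at once, the following sketch saves fit and influence diagnostics and then lists the observations that look unusual. The data set names A and DIAG, the new variable names, and the cutoff of 2 are all assumptions made for this example.

```sas
/* Hypothetical data set A with response Y and regressors X1 and X2 */
proc reg data=a;
   model y = x1 x2;
   output out=diag
          p=yhat        r=resid       /* predicted values and residuals      */
          student=stud  rstudent=rstud /* studentized residuals               */
          h=lev         cookd=cd      /* leverage and Cook's D               */
          lcl=lower     ucl=upper;    /* limits for an individual prediction */
run;
quit;

/* List observations whose deleted studentized residual exceeds 2 in   */
/* absolute value (the cutoff is an illustrative choice, not a rule)   */
proc print data=diag;
   where abs(rstud) > 2;
   var y yhat resid rstud lev cd lower upper;
run;
```

Because the OUT= data set also contains every variable from the input data set, the flagged rows can be inspected alongside the original regressor values.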
PAINT Statement
PAINT < condition | ALLOBS > < / options > ;
PAINT < STATUS | UNDO > ;
The PAINT statement selects observations to be painted or highlighted in a scatter plot on line
printer output; the PAINT statement is ignored if the LINEPRINTER option is not specified in the
PROC REG statement.
All observations that satisfy condition are painted using some specific symbol. The PAINT statement does not generate a scatter plot and must be followed by a PLOT statement, which does
generate a scatter plot. Several PAINT statements can be used before a PLOT statement, and all
prior PAINT statement requests are applied to all later PLOT statements.
The PAINT statement lists the observation numbers of the observations selected, the total number
of observations selected, and the plotting symbol used to paint the points.
On a plot, paint symbols take precedence over all other symbols. If any position contains more than
one painted point, the paint symbol for the observation plotted last is used.
The PAINT statement cannot be used when a TYPE=CORR, TYPE=COV, or TYPE=SSCP data set
is used as the input data set for PROC REG. Also, the PAINT statement cannot be used for models
with more than one dependent variable. Note that the syntax for the PAINT statement is the same
as the syntax for the REWEIGHT statement.
For detailed examples of painting scatter plots, see the section “Painting Scatter Plots” on page 5537.
Specifying Condition
Condition is used to select observations to be painted. The syntax of condition is
   variable compare value
or
   variable compare value logical variable compare value
where
variable
is one of the following:
a variable name in the input data set
OBS., which is the observation number
keyword., where keyword is a keyword for a statistic requested in the OUTPUT
statement
compare
is an operator that compares variable to value. Compare can be any one of the following: <, <=, >, >=, =, ^=. The operators LT, LE, GT, GE, EQ, and NE, respectively, can
be used instead of the preceding symbols. Refer to the “Expressions” section in SAS
Language Reference: Concepts for more information about comparison operators.
value
gives an unformatted value of variable. Observations are selected to be painted if they
satisfy the condition created by variable compare value. Value can be a number or a
character string. If value is a character string, it must be eight characters or less and
must be enclosed in quotes. In addition, value is case-sensitive. In other words, the
statements
   paint name='henry';
and
   paint name='Henry';
are not the same.
logical
is one of two logical operators. Either AND or OR can be used. To specify AND, use
AND or the symbol &. To specify OR, use OR or the symbol |.
Here are some examples of the variable compare value form:
   paint name='Henry';
   paint residual.>=20;
   paint obs.=99;
Here are some examples of the variable compare value logical variable compare value form:
   paint name='Henry'|name='Mary';
   paint residual.>=20 or residual.<=10;
   paint obs.>=11 and residual.<=20;
Using ALLOBS
Instead of specifying condition, the ALLOBS option can be used to select all observations. This is
most useful when you want to unpaint all observations. For example,
paint allobs / reset;
resets the symbols for all observations.
Options in the PAINT Statement
The following options can be used when either a condition is specified, the ALLOBS option is
specified, or nothing is specified before the slash. If only an option is listed, the option applies
to the observations selected in the previous PAINT statement, not to the observations selected by
reapplying the condition from the previous PAINT statement. For example, in the statements
   paint r.>0 / symbol='a';
   reweight r.>0;
   refit;
   paint / symbol='b';
the second PAINT statement paints only those observations selected in the first PAINT statement.
No additional observations are painted even if, after refitting the model, there are new observations
that meet the condition in the first PAINT statement.
NOTE: Options are not available when either the UNDO or STATUS option is used.
You can specify the following options after a slash (/).
NOLIST
suppresses the display of the selected observation numbers. If the NOLIST option is not
specified, a list of observations selected is written to the log. The list includes the observation
numbers and painting symbol used to paint the points. The total number of observations
selected to be painted is also shown.
RESET
changes the painting symbol to the current default symbol, effectively unpainting the observations selected. If you set the default symbol by using the SYMBOL= option in the PLOT
statement, the RESET option in the PAINT statement changes the painting symbol to the
symbol you specified. Otherwise, the default symbol of ’1’ is used.
SYMBOL=’character’
specifies a painting symbol. If the SYMBOL= option is omitted, the painting symbol is
either the one used in the most recent PAINT statement or, if there are no previous PAINT
statements, the symbol ’@’. For example,
   paint / symbol='#';
changes the painting symbol for the observations selected by the most recent PAINT statement to ’#’. As another example,
   paint temp lt 22 / symbol='c';
changes the painting symbol to ’c’ for all observations with TEMP<22. In general, the numbers 1, 2, . . . , 9 and the asterisk are not recommended as painting symbols. These symbols are
used as default symbols in the PLOT statement, where they represent the number of replicates
at a point. If SYMBOL='' is used, no painting is done in the current plot. If SYMBOL=' ' is
used, observations are painted with a blank and are no longer seen on the plot.
STATUS and UNDO
Instead of specifying condition or the ALLOBS option, you can use the STATUS or UNDO option
as follows:
STATUS
lists (in the log) the observation number and plotting symbol of all currently painted observations.
UNDO
undoes changes made by the most recent PAINT statement. Observations might be, but are
not necessarily, unpainted. For example:
   paint obs.<=10 / symbol='a';
   /* ...other interactive statements */
   paint obs.=1 / symbol='b';
   /* ...other interactive statements */
   paint undo;
The last PAINT statement changes the plotting symbol used for observation 1 back to ’a’. If
the statement
paint / reset;
is used instead, observation 1 is unpainted.
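The pieces of the PAINT statement described above can be combined in one interactive session. The following is a sketch only: the data set A, the variables Y and X1, and the Cook's D threshold of 1 are assumptions introduced for illustration.

```sas
/* Hypothetical data set A; PAINT is honored only with LINEPRINTER */
proc reg data=a lineprinter;
   model y = x1;
   paint obs.<=10 / symbol='a';           /* paint the first ten observations   */
   paint cookd.>=1 / symbol='b' nolist;   /* paint by a statistic, no log list  */
   plot residual.*predicted.;             /* PAINT itself draws nothing         */
run;
   paint undo;                            /* undo the most recent PAINT request */
   paint status;                          /* list what is still painted         */
   plot;                                  /* repeat the most recent PLOT        */
run;
quit;
```

The PLOT statement follows the PAINT statements because painting requests are applied only to later plots. The second RUN group shows that the painting state persists across RUN boundaries within the same PROC REG session.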
PLOT Statement
PLOT < yvariable*xvariable > < =symbol >
< . . . yvariable*xvariable > < =symbol > < / options > ;
The PLOT statement in PROC REG displays scatter plots with yvariable on the vertical axis and
xvariable on the horizontal axis. Line printer plots are generated if the LINEPRINTER option is
specified in the PROC REG statement; otherwise, the traditional graphics are created. Points in line
printer plots can be marked with symbols, while global graphics statements such as GOPTIONS and
SYMBOL are used to enhance the traditional graphics. Note that the plots you request by using the
PLOT statement are independent of the ODS graphical displays (see the section “ODS Graphics”
on page 5583) that are now available in PROC REG.
As with most other interactive statements, the PLOT statement implicitly refits the model. For
example, if a PLOT statement is preceded by a REWEIGHT statement, the model is recomputed,
and the plot reflects the new model.
If there are multiple MODEL statements preceding a PLOT statement, then the PLOT statement
refers to the latest MODEL statement.
The PLOT statement cannot be used when a TYPE=CORR, TYPE=COV, or TYPE=SSCP data set
is used as input to PROC REG.
You can specify several PLOT statements for each MODEL statement, and you can specify more
than one plot in each PLOT statement.
For detailed examples of using the PLOT statement and its options, see the section “Producing
Scatter Plots” on page 5529.
Specifying Yvariables, Xvariables, and Symbol
More than one yvariable*xvariable pair can be specified to request multiple plots. The yvariables
and xvariables can be as follows:
any variables specified in the VAR or MODEL statement before the first RUN statement
keyword., where keyword is a regression diagnostic statistic available in the OUTPUT statement (see Table 73.6). For example,
plot predicted.*residual.;
generates one plot of the predicted values by the residuals for each dependent variable in the
MODEL statement. These statistics can also be plotted against any of the variables in the
VAR or MODEL statements.
the keyword OBS. (the observation number), which can be plotted against any of the preceding variables
the keyword NPP. or NQQ., which can be used with any of the preceding variables to construct normal P-P or Q-Q plots, respectively (see the section “Construction of Q-Q and P-P
Plots” on page 5577 and “Traditional Normal Quantile and Normal Probability Plots” on
page 5545 for more information)
keywords for model fit summary statistics available in the OUTEST= data set with
_TYPE_=PARMS (see Table 73.6). A SELECTION= method (other than NONE) must be requested in the MODEL statement for these variables to be plotted. If one member of a
yvariable*xvariable pair is from the OUTEST= data set, the other member must also be
from the OUTEST= data set.
The OUTPUT statement and the OUTEST= option are not required when their keywords are specified in the PLOT statement.
The yvariable and xvariable specifications can be replaced by a set of variables and statistics enclosed in parentheses. When this occurs, all possible combinations of yvariable and xvariable are
generated. For example, the following two statements are equivalent:
plot (y1 y2)*(x1 x2);
plot y1*x1 y1*x2 y2*x1 y2*x2;
The statement
plot;
is equivalent to respecifying the most recent PLOT statement without any options. However, the
line printer options COLLECT, HPLOTS=, SYMBOL=, and VPLOTS=, described in the section
“Line Printer Plots” on page 5491, apply across PLOT statements and remain in effect if they have
been previously specified.
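The following sketch shows the yvariable*xvariable forms described above in one step. The data set A and the variables Y, X1, and X2 are placeholders; no OUTPUT statement is needed for the keyword statistics.

```sas
/* Hypothetical data set A with response Y and regressors X1 and X2 */
proc reg data=a;
   model y = x1 x2;
   /* parenthesized lists expand to all four combinations:           */
   /* residual.*x1  residual.*x2  student.*x1  student.*x2           */
   plot (residual. student.)*(x1 x2);
   /* residuals against normal quantiles and against observation number */
   plot residual.*nqq. residual.*obs.;
run;
quit;
```

Each PLOT statement here refers to the single MODEL statement that precedes it; with several MODEL statements, the plots would refer to the latest one.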
Options used for the traditional graphics are described in the following section; see “Line Printer
Plots” on page 5491 for more information.
Traditional Graphics
The display of traditional graphics is described in the following paragraphs. The options are summarized in Table 73.7 and described in the section “Dictionary of PLOT Statement Options” on
page 5487. The section “Traditional Graphics” on page 5541 contains several examples of the
graphics output.
Several line printer statements and options are not supported for the traditional graphics. In particular the PAINT statement is disabled, as are the PLOT statement options CLEAR, COLLECT,
HPLOTS=, NOCOLLECT, SYMBOL=, and VPLOTS=. To display more than one plot per page
or to collect plots from multiple PLOT statements, use the PROC GREPLAY statement (refer to
SAS/GRAPH Software: Reference). Also note that traditional graphics options are not recognized
for line printer plots.
The fitted model equation and a label are displayed in the top margin of the plot; this display can
be suppressed with the NOMODEL option. If the label is requested but cannot fit on one line, it
is not displayed. The equation and label are displayed on one line when possible; if more lines
are required, the label is displayed in the first line with the model equation in successive lines. If
displaying the entire equation causes the plot to be unacceptably small, the equation is truncated.
Table 73.7 lists options to control the display of the equation. The section “Traditional Graphics for
Simple Linear Regression” on page 5541 illustrates the display of the model equation.
Four statistics are displayed by default in the right margin: the number of observations, $R^2$, the
adjusted $R^2$, and the root mean square error. (See Figure 73.43.) The display of these statistics can
be suppressed with the NOSTAT option. You can specify other options to request the display of
various statistics in the right margin; see Table 73.7.
A default reference line at zero is displayed if residuals are plotted. If the dependent variable is
plotted against the independent variable in a simple linear regression model, the fitted regression
line is displayed by default. (See Figure 73.42.) Default reference lines can be suppressed with the
NOLINE option; the lines are not displayed if the OVERLAY option is specified.
Specialized plots are requested with special options. For each coefficient, the RIDGEPLOT option
plots the ridge estimates against the ridge values k; see the description of the RIDGEPLOT option
in the section “Dictionary of PLOT Statement Options” on page 5487 for more details. The CONF
option plots $100(1-\alpha)\%$ confidence intervals for the mean while the PRED option plots
$100(1-\alpha)\%$ prediction intervals; see the description of these options in the section “Dictionary of PLOT
Statement Options” on page 5487 for more details.
If a SELECTION= method is requested, the fitted model equation and the statistics displayed in the
margin correspond to the selected model. For the ADJRSQ and CP methods, the selected model
is treated as a submodel of the full model. If a CP.*NP. plot is requested, the CHOCKING= and
CMALLOWS= options display model selection reference lines; see the descriptions of these options
in the section “Dictionary of PLOT Statement Options” on page 5487 and “Traditional Graphics for
Variable Selection” on page 5544 for more details.
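As a sketch of the specialized plots described above, the following two steps request interval bands for a simple linear regression and model-selection reference lines for a CP.*NP. plot. The data set A, the variable names, and the ALPHA= value are assumptions made for illustration.

```sas
/* Hypothetical data set A with response Y and regressors X1-X5 */

/* Simple linear regression: 90% confidence and prediction bands */
proc reg data=a;
   model y = x1 / alpha=0.1;
   plot y*x1 / conf pred;
run;
quit;

/* Variable selection: Hocking and Mallows reference lines */
proc reg data=a;
   model y = x1 x2 x3 x4 x5 / selection=cp;
   plot cp.*np. / chocking=red cmallows=blue;
run;
quit;
```

The two requests are kept in separate steps because CONF and PRED apply to simple regression models only, whereas the CP. and NP. keywords require a SELECTION= method.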
PLOT Statement variable Keywords
The following table lists the keywords available as PLOT statement xvariables and yvariables. All
keywords have a trailing dot; for example, “COOKD.” requests Cook’s D statistic. Neither the
OUTPUT statement nor the OUTEST= option needs to be specified.
Table 73.6   Keywords for PLOT Statement xvariables

Keyword                    Description

Diagnostic Statistics
COOKD.                     Cook’s D influence statistics
COVRATIO.                  standard influence of observation on covariance of betas
DFFITS.                    standard influence of observation on predicted value
H.                         leverage
LCL.                       lower bound of $100(1-\alpha)\%$ confidence interval for individual
                           prediction
LCLM.                      lower bound of $100(1-\alpha)\%$ confidence interval for the mean of
                           the dependent variable
PREDICTED. | PRED. | P.    predicted values
PRESS.                     residuals from refitting the model with current observation deleted
RESIDUAL. | R.             residuals
RSTUDENT.                  studentized residuals with the current observation deleted
STDI.                      standard error of the individual predicted value
STDP.                      standard error of the mean predicted value
STDR.                      standard error of the residual
STUDENT.                   residuals divided by their standard errors
UCL.                       upper bound of $100(1-\alpha)\%$ confidence interval for individual
                           prediction
UCLM.                      upper bound of $100(1-\alpha)\%$ confidence interval for the mean of
                           the dependent variables

Other Keywords Used with Diagnostic Statistics
NPP.                       normal probability-probability plot
NQQ.                       normal quantile-quantile plot
OBS.                       observation number (cannot plot against OUTEST= statistics)

Model Fit Summary Statistics
ADJRSQ.                    adjusted R-square
AIC.                       Akaike’s information criterion
BIC.                       Sawa’s Bayesian information criterion
CP.                        Mallows’ $C_p$ statistic
EDF.                       error degrees of freedom
GMSEP.                     estimated MSE of prediction, assuming multivariate normality
IN.                        number of regressors in the model not including the intercept
JP.                        final prediction error
MSE.                       mean squared error
NP.                        number of parameters in the model (including the intercept)
PC.                        Amemiya’s prediction criterion
RMSE.                      root MSE
RSQ.                       R-square
SBC.                       SBC statistic
SP.                        SP statistic
SSE.                       error sum of squares
Summary of PLOT Statement Graphics Options
The following table lists the PLOT statement options by function. These options are available unless
the LINEPRINTER option is specified in the PROC REG statement. For complete descriptions, see
the section “Dictionary of PLOT Statement Options” on page 5487.
Table 73.7   Traditional Graphics Options

Option                   Description

General Graphics Options
ANNOTATE=SAS-data-set    specifies the annotate data set
CHOCKING=color           requests a reference line for $C_p$ model selection criteria
CMALLOWS=color           requests a reference line for the $C_p$ model selection criterion
CONF                     requests plots of $100(1-\alpha)\%$ confidence intervals for the mean
DESCRIPTION='string'     specifies a description for graphics catalog member
NAME='string'            names the plot in the graphics catalog
OVERLAY                  overlays plots from the same model
PRED                     requests plots of $100(1-\alpha)\%$ prediction intervals for individual
                         responses
RIDGEPLOT                requests the ridge trace for ridge regression

Axis and Legend Options
LEGEND=LEGENDn           specifies LEGEND statement to be used
HAXIS=values             specifies tick mark values for horizontal axis
VAXIS=values             specifies tick mark values for vertical axis

Reference Line Options
HREF=values              specifies reference lines perpendicular to horizontal axis
LHREF=linetype           specifies line style for HREF= lines
LLINE=linetype           specifies line style for lines displayed by default
LVREF=linetype           specifies line style for VREF= lines
NOLINE                   suppresses display of any default reference line
VREF=values              specifies reference lines perpendicular to vertical axis

Color Options
CAXIS=color              specifies color for axis line and tick marks
CFRAME=color             specifies color for frame
CHREF=color              specifies color for HREF= lines
CLINE=color              specifies color for lines displayed by default
CTEXT=color              specifies color for text
CVREF=color              specifies color for VREF= lines

Options for Displaying the Fitted Model Equation
MODELFONT=font           specifies font of model equation and model label
MODELHT=value            specifies text height of model equation and model label
MODELLAB='label'         specifies model label
NOMODEL                  suppresses display of the fitted model and the label

Options for Displaying Statistics in the Plot Margin
AIC                      displays Akaike’s information criterion
BIC                      displays Sawa’s Bayesian information criterion
CP                       displays Mallows’ $C_p$ statistic
EDF                      displays the error degrees of freedom
GMSEP                    displays the estimated MSE of prediction assuming
                         multivariate normality
IN                       displays the number of regressors in the model not including
                         the intercept
JP                       displays the $J_p$ statistic
MSE                      displays the mean squared error
NOSTAT                   suppresses display of the default statistics: the number of
                         observations, R-square, adjusted R-square, and
                         root mean square error
NP                       displays the number of parameters in the model including the
                         intercept, if any
PC                       displays the PC statistic
SBC                      displays the SBC statistic
SP                       displays the $S_p$ statistic
SSE                      displays the error sum of squares
STATFONT=font            specifies font of text displayed in the margin
STATHT=value             specifies height of text displayed in the margin
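The following sketch combines a few options from each group in Table 73.7. The data set A, the variables, the reference value of 1, and the label text are hypothetical choices made for this illustration, and the statements assume that the LINEPRINTER option is not in effect.

```sas
/* Hypothetical data set A with response Y and regressor X1 */
proc reg data=a;
   model y = x1;
   /* Cook's D by observation number, with a solid red reference  */
   /* line at 1 and no equation or statistics in the margins      */
   plot cookd.*obs. / vref=1 cvref=red lvref=1
                      nomodel nostat;
   /* fitted line with a model label and two extra margin statistics */
   plot y*x1 / modellab='Hypothetical fit:' aic sbc ctext=black;
run;
quit;
```

LVREF=1 is specified because the default line type for VREF= lines is 2, which is not solid.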
Dictionary of PLOT Statement Options
The following entries describe the PLOT statement options in detail. Note that these options are
available unless you specify the LINEPRINTER option in the PROC REG statement.
AIC
displays Akaike’s information criterion in the plot margin.
ANNOTATE=SAS-data-set
ANNO=SAS-data-set
specifies an input data set that contains appropriate variables for annotation. This applies
only to displays created with the current PLOT statement. Refer to SAS/GRAPH Software:
Reference for more information.
BIC
displays Sawa’s Bayesian information criterion in the plot margin.
CAXIS=color
CAXES=color
CA=color
specifies the color for the axes, frame, and tick marks.
CFRAME=color
CFR=color
specifies the color for filling the area enclosed by the axes and the frame.
CHOCKING=color
requests reference lines corresponding to the equations $C_p = p$ and $C_p = 2p - p_{full}$,
where $p_{full}$ is the number of parameters in the full model (excluding the intercept) and $p$
is the number of parameters in the subset model (including the intercept). The color must be
specified; the $C_p = p$ line is solid and the $C_p = 2p - p_{full}$ line is dashed. Only PLOT
statements of the form PLOT CP.*NP. produce these lines.
For the purpose of parameter estimation, Hocking (1976) suggests selecting a model where
$C_p \le 2p - p_{full}$. For the purpose of prediction, Hocking suggests the criterion $C_p \le p$. You
can request the single reference line $C_p = p$ with the CMALLOWS= option. If, for example,
you specify both CHOCKING=RED and CMALLOWS=BLUE, then the $C_p = 2p - p_{full}$
line is red and the $C_p = p$ line is blue (see Figure 73.45).
CHREF=color
CH=color
specifies the color for lines requested with the HREF= option.
CLINE=color
CL=color
specifies the color for lines displayed by default. See the NOLINE option for details.
CMALLOWS=color
requests a $C_p = p$ reference line, where $p$ is the number of parameters (including the intercept) in the subset model. The color must be specified; the line is solid. Only PLOT
statements of the form PLOT CP.*NP. produce this line.
Mallows (1973) suggests that all subset models with $C_p$ small and near $p$ be considered for
further study. See the CHOCKING= option for related model-selection criteria.
CONF
is a keyword used as a shorthand option to request plots that include $100(1-\alpha)\%$ confidence
intervals for the mean response (see Figure 73.44). The ALPHA= option in the PROC REG
or MODEL statement selects the significance level $\alpha$, which is 0.05 by default. The CONF
option is valid for simple regression models only, and is ignored for plots where confidence
intervals are inappropriate. The CONF option replaces the CONF95 option; however, the
CONF95 option is still supported when the ALPHA= option is not specified. The OVERLAY
option is ignored when the CONF option is specified.
CP
displays Mallows’ Cp statistic in the plot margin.
CTEXT=color
CT=color
specifies the color for text including tick mark labels, axis labels, the fitted model label and
equation, the statistics displayed in the margin, and legends.
CVREF=color
CV=color
specifies the color for lines requested with the VREF= option.
DESCRIPTION=’string ’
DESC=’string ’
specifies a descriptive string, up to 40 characters, that appears in the description field of the
PROC GREPLAY master menu.
EDF
displays the error degrees of freedom in the plot margin.
GMSEP
displays the estimated mean square error of prediction in the plot margin. Note that the
estimate is calculated under the assumption that both independent and dependent variables
have a multivariate normal distribution.
HAXIS=values
HA=values
specifies tick mark values for the horizontal axis.
HREF=values
specifies where reference lines perpendicular to the horizontal axis are to appear.
IN
displays the number of regressors in the model (not including the intercept) in the plot margin.
JP
displays the Jp statistic in the plot margin.
LEGEND=LEGENDn
specifies the LEGENDn statement to be used. The LEGENDn statement is a global graphics
statement; refer to SAS/GRAPH Software: Reference for more information.
LHREF=linetype
LH=linetype
specifies the line style for lines requested with the HREF= option. The default linetype is 2.
Note that LHREF=1 requests a solid line. Refer to SAS/GRAPH Software: Reference for a
table of available line types.
LLINE=linetype
LL=linetype
specifies the line style for reference lines displayed by default; see the NOLINE option for
details. The default linetype is 2. Note that LLINE=1 requests a solid line.
LVREF=linetype
LV=linetype
specifies the line style for lines requested with the VREF= option. The default linetype is 2.
Note that LVREF=1 requests a solid line.
MODELFONT=font
specifies the font used for displaying the fitted model label and the fitted model equation.
Refer to SAS/GRAPH Software: Reference for tables of software fonts.
MODELHT=height
specifies the text height for the fitted model label and the fitted model equation.
MODELLAB=’label ’
specifies the label to be displayed with the fitted model equation. By default, no label is
displayed. If the label does not fit on one line, it is not displayed. See the section “Traditional
Graphics” on page 5483 for more information.
MSE
displays the mean squared error in the plot margin.
NAME=’string ’
specifies a descriptive string, up to eight characters, that appears in the name field of the
PROC GREPLAY master menu. The default string is REG.
NOLINE
suppresses the display of default reference lines. A default reference line at zero is displayed
if residuals are plotted. If the dependent variable is plotted against the independent variable
in a simple regression model, then the fitted regression line is displayed by default. Default
reference lines are not displayed if the OVERLAY option is specified.
NOMODEL
suppresses the display of the fitted model equation.
NOSTAT
suppresses the display of statistics in the plot margin. By default, the number of observations,
R-square, adjusted R-square, and root MSE are displayed.
NP
displays the number of regressors in the model including the intercept, if any, in the plot
margin.
OVERLAY
overlays all plots specified in the PLOT statement from the same model on one set of axes.
The variables for the first plot label the axes. The procedure automatically scales the axes
to fit all of the variables unless the HAXIS= or VAXIS= option is used. Default reference
lines are not displayed. A default legend is produced; the LEGEND= option can be used to
customize the legend.
PC
displays the PC statistic in the plot margin.
PRED
is a keyword used as a shorthand option to request plots that include $100(1-\alpha)\%$ prediction
intervals for individual responses (see Figure 73.44). The ALPHA= option in the PROC REG
or MODEL statement selects the significance level $\alpha$, which is 0.05 by default. The PRED
option is valid for simple regression models only, and is ignored for plots where prediction
intervals are inappropriate. The PRED option replaces the PRED95 option; however, the
PRED95 option is still supported when the ALPHA= option is not specified. The OVERLAY
option is ignored when the PRED option is specified.
RIDGEPLOT
creates overlaid plots of ridge estimates against ridge values for each coefficient. The points
corresponding to the estimates of each coefficient in the plot are connected by lines. For ridge
estimates to be computed and plotted, the OUTEST= option must be specified in the PROC
REG statement, and the RIDGE=list must be specified in either the PROC REG or MODEL
statement.
SBC
displays the SBC statistic in the plot margin.
SP
displays the Sp statistic in the plot margin.
SSE
displays the error sum of squares in the plot margin.
STATFONT=font
specifies the font used for displaying the statistics that appear in the plot margin. Refer to
SAS/GRAPH Software: Reference for tables of software fonts.
STATHT=height
specifies the text height of the statistics that appear in the plot margin.
USEALL
specifies that predicted values at data points with missing dependent variable(s) be included
on appropriate plots. By default, only points used in constructing the SSCP matrix appear on
plots.
VAXIS=values
VA=values
specifies tick mark values for the vertical axis.
VREF=values
specifies where reference lines perpendicular to the vertical axis are to appear.
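As a minimal sketch of how several of these traditional graphics options combine, the following statements assume that the fitness data set from Example 73.2 is available and that traditional (SAS/GRAPH) graphics are in use. The choice of options and tick values is illustrative only.

```sas
/* Sketch only: assumes the fitness data set from Example 73.2 exists */
proc reg data=fitness;
   model Oxygen=RunTime;
   /* prediction limits, with EDF and MSE requested for the plot margin */
   plot Oxygen*RunTime / pred edf mse;
   /* residual plot with explicit horizontal tick marks (illustrative values) */
   plot residual.*predicted. / haxis=35 to 60 by 5;
run;
quit;
```

Because the model is a simple regression, the PRED option applies to the first plot; it is ignored for the residual plot, where prediction intervals are inappropriate.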
Line Printer Plots
Line printer plots are requested with the LINEPRINTER option in the PROC REG statement. Points
in line printer plots can be marked with symbols, which can be specified as a single character
enclosed in quotes or the name of any variable in the input data set.
If a character variable is used for the symbol, the first (leftmost) nonblank character in the formatted
value of the variable is used as the plotting symbol. If a character in quotes is specified, that
character becomes the plotting symbol. If a character is used as the plotting symbol, and if there are
different plotting symbols needed at the same point, the symbol ’?’ is used at that point.
If an unformatted numeric variable is used for the symbol, the symbols ’1’, ’2’, . . . , ’9’ are used for
variable values 1, 2, . . . , 9. For noninteger values, only the integer portion is used as the plotting
symbol. For values of 10 or greater, the symbol ’*’ is used. For negative values, a ’?’ is used. If
a numeric variable is used, and if there is more than one plotting symbol needed at the same point,
the sum of the variable values is used at that point. If the sum exceeds 9, the symbol ’*’ is used.
If a symbol is not specified, the number of replicates at the point is displayed. The symbol ’*’ is
used if there are 10 or more replicates.
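The symbol rules above can be tried with a short sketch such as the following, which assumes the fitness data set from Example 73.2. The first plot uses a quoted character as the plotting symbol; the second specifies no symbol, so replicate counts are displayed.

```sas
/* Sketch only: assumes the fitness data set from Example 73.2 exists */
proc reg data=fitness lineprinter;
   model Oxygen=RunTime;
   plot Oxygen*RunTime='+';      /* quoted character used as the symbol   */
   plot residual.*predicted.;    /* no symbol: number of replicates shown */
run;
quit;
```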
If the LINEPRINTER option is used, you can specify the following options in the PLOT statement
after a slash (/):
CLEAR
clears any collected scatter plots before plotting begins but does not turn off the COLLECT
option. Use this option when you want to begin a new collection with the plots in the current PLOT statement. For more information about collecting plots, see the COLLECT and
NOCOLLECT options in this section.
COLLECT
specifies that plots begin to be collected from one PLOT statement to the next and that subsequent plots show an overlay of all collected plots. This option enables you to overlay plots
before and after changes to the model or to the data used to fit the model. Plots collected
before changes are unaffected by the changes and can be overlaid on later plots. You can
request more than one plot with this option, and you do not need to request the same number
of plots in subsequent PLOT statements. If you specify an unequal number of plots, plots in
corresponding positions are overlaid. For example, the statements
plot residual.*predicted. y*x / collect;
run;
produce two plots. If these statements are then followed by
plot residual.*x;
run;
two plots are again produced. The first plot shows residual against X values overlaid on
residual against predicted values. The second plot is the same as that produced by the first
PLOT statement.
Axes are scaled for the first plot or plots collected. The axes are not rescaled as more plots
are collected.
Once specified, the COLLECT option remains in effect until the NOCOLLECT option is
specified.
HPLOTS=number
sets the number of scatter plots that can be displayed across the page. The procedure begins
with one plot per page. The value of the HPLOTS= option remains in effect until you change
it in a later PLOT statement. See the VPLOTS= option for an example.
NOCOLLECT
specifies that the collection of scatter plots ends after adding the plots in the current PLOT
statement. PROC REG starts with the NOCOLLECT option in effect. After you specify the
NOCOLLECT option, any following PLOT statement produces a new plot that contains only
the plots requested by that PLOT statement.
For more information, see the COLLECT option.
OVERLAY
enables requested scatter plots to be superimposed. The axes are scaled so that points on all
plots are shown. If the HPLOTS= or VPLOTS= option is set to more than one, the overlaid
plot occupies the first position on the page. The OVERLAY option is similar to the COLLECT
option in that both options produce superimposed plots. However, OVERLAY superimposes
only the plots in the associated PLOT statement; COLLECT superimposes plots across PLOT
statements. The OVERLAY option can be used when the COLLECT option is in effect.
SYMBOL=’character ’
changes the default plotting symbol used for all scatter plots produced in the current and in
subsequent PLOT statements. Both SYMBOL='' and SYMBOL=' ' are allowed.
If the SYMBOL= option has not been specified, the default symbol is ’1’ for positions with
one observation, ’2’ for positions with two observations, and so on. For positions with more
than 9 observations, ’*’ is used. The SYMBOL= option (or a plotting symbol) is needed
to avoid any confusion caused by this default convention. Specifying a particular symbol is
especially important when either the OVERLAY or COLLECT option is being used.
If you specify the SYMBOL= option and use a number for character, that number is used for
all points in the plot. For example, the statement
plot y*x / symbol=’1’;
produces a plot with the symbol ’1’ used for all points.
If you specify a plotting symbol and the SYMBOL= option, the plotting symbol overrides the
SYMBOL= option. For example, in the statements
plot y*x y*v=’.’ / symbol=’*’;
the symbol used for the plot of Y against X is ’*’, and a ’.’ is used for the plot of Y against V.
If a paint symbol is defined with a PAINT statement, the paint symbol takes precedence over
both the SYMBOL= option and the default plotting symbol for the PLOT statement.
VPLOTS=number
sets the number of scatter plots that can be displayed down the page. The procedure begins
with one plot per page. The value of the VPLOTS= option remains in effect until you change
it in a later PLOT statement.
For example, to specify a total of six plots per page, with two rows of three plots, use the
HPLOTS= and VPLOTS= options as follows:
plot y1*x1 y1*x2 y1*x3 y2*x1 y2*x2 y2*x3 /
hplots=3 vplots=2;
run;
PRINT Statement
PRINT < options > < ANOVA > < MODELDATA > ;
The PRINT statement enables you to interactively display the results of MODEL statement options,
produce an ANOVA table, display the data for variables used in the current model, or redisplay
the options specified in a MODEL or a previous PRINT statement. In addition, like most other
interactive statements in PROC REG, the PRINT statement implicitly refits the model; thus, effects
of REWEIGHT statements are seen in the resulting tables. If ODS Graphics is in effect (see the
section “ODS Graphics” on page 5583), the PRINT statement also requests the use of the ODS
graphical displays associated with the current model.
The following specifications can appear in the PRINT statement:
options
interactively displays the results of MODEL statement options, where options is one or more
of the following: ACOV, ALL, CLI, CLM, COLLIN, COLLINOINT, CORRB, COVB, DW,
I, INFLUENCE, P, PARTIAL, PCORR1, PCORR2, R, SCORR1, SCORR2, SEQB, SPEC,
SS1, SS2, STB, TOL, VIF, or XPX. See the section “MODEL Statement” on page 5463 for a
description of these options.
ANOVA
produces the ANOVA table associated with the current model. This is either the model specified in the last MODEL statement or the model that incorporates changes made by ADD,
DELETE, or REWEIGHT statements after the last MODEL statement.
MODELDATA
displays the data for variables used in the current model.
Use the statement
print;
to reprint options in the most recently specified PRINT or MODEL statement.
Options that require original data values, such as R or INFLUENCE, cannot be used when a
TYPE=CORR, TYPE=COV, or TYPE=SSCP data set is used as the input data set to PROC REG.
See the section “Input Data Sets” on page 5502 for more detail.
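The following interactive session is a minimal sketch of the PRINT statement, assuming the fitness data set from Example 73.2. The particular options requested (STB and VIF) are an arbitrary illustration.

```sas
/* Sketch only: assumes the fitness data set from Example 73.2 exists */
proc reg data=fitness;
   model Oxygen=RunTime Age Weight;
run;
   print anova;      /* ANOVA table for the current model               */
run;
   print stb vif;    /* MODEL statement options requested interactively */
run;
   print;            /* redisplay the options from the last PRINT       */
run;
quit;
```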
REFIT Statement
REFIT ;
The REFIT statement causes the current model and corresponding statistics to be recomputed immediately. No output is generated by this statement. The REFIT statement is needed after one or more
REWEIGHT statements to cause them to take effect before subsequent PAINT or REWEIGHT
statements. This is sometimes necessary when you are using statistical conditions in REWEIGHT
statements. For example, consider the following statements:
paint student.>2;
plot student.*p.;
reweight student.>2;
refit;
paint student.>2;
plot student.*p.;
The second PAINT statement paints any additional observations that meet the condition after deleting observations and refitting the model. The REFIT statement is used because the REWEIGHT
statement does not cause the model to be recomputed. In this particular example, the same effect
could be achieved by replacing the REFIT statement with a PLOT statement.
Most interactive statements can be used to implicitly refit the model; any plots or statistics produced
by these statements reflect changes made to the model and changes made to the data used to compute
the model. The two exceptions are the PAINT and REWEIGHT statements, which do not cause the
model to be recomputed.
RESTRICT Statement
RESTRICT equation < , . . . , equation > ;
A RESTRICT statement is used to place restrictions on the parameter estimates in the MODEL
preceding it. More than one RESTRICT statement can follow each MODEL statement. Each
RESTRICT statement replaces any previous RESTRICT statement. To lift all restrictions on a model,
submit a new MODEL statement. If there are several restrictions, separate them with commas. The
statement
restrict equation1=equation2=equation3;
is equivalent to imposing the two restrictions
restrict equation1=equation2;
restrict equation2=equation3;
Each restriction is written as a linear equation and can be written as
equation
or
equation = equation
The form of each equation is
$c_1 \times \mathit{variable}_1 \pm c_2 \times \mathit{variable}_2 \pm \cdots \pm c_n \times \mathit{variable}_n$
where the $c_j$'s are constants and the $\mathit{variable}_j$'s are any regressor variables.
When no equal sign appears, the linear combination is set equal to zero. Each variable name mentioned must be a variable in the MODEL statement to which the RESTRICT statement refers. The
keyword INTERCEPT can also be used as a variable name, and it refers to the intercept parameter
in the regression model.
Note that the parameters associated with the variables are restricted, not the variables themselves.
Restrictions should be consistent and not redundant.
Examples of valid RESTRICT statements include the following:
restrict x1;
restrict a+b=l;
restrict a=b=c;
restrict a=b, b=c;
restrict 2*f=g+h, intercept+f=0;
restrict f=g=h=intercept;
The third and fourth statements in this list produce identical restrictions. You cannot specify
restrict f-g=0,
f-intercept=0,
g-intercept=1;
because the three restrictions are not consistent. If these restrictions are included in a RESTRICT
statement, one of the restrict parameters is set to zero and has zero degrees of freedom, indicating
that PROC REG is unable to apply a restriction.
The restrictions usually operate even if the model is not of full rank. Check to ensure that DF = -1
for each restriction. In addition, the model DF should decrease by 1 for each restriction.
The parameter estimates are those that minimize the quadratic criterion (SSE) subject to the restrictions. If a restriction cannot be applied, its parameter value and degrees of freedom are listed as
zero.
The method used for restricting the parameter estimates is to introduce a Lagrangian parameter for
each restriction (Pringle and Rayner 1971). The estimates of these parameters are displayed with
test statistics. Note that the t statistic reported for the Lagrangian parameters does not follow a
Student’s t distribution, but its square follows a beta distribution (LaMotte 1994). The p-value for
these parameters is computed using the beta distribution.
The Lagrangian parameter $\lambda$ measures the sensitivity of the SSE to the restriction constant. If the
restriction constant is changed by a small amount $\epsilon$, the SSE is changed by $2\lambda\epsilon$. The t ratio tests the
significance of the restrictions. If $\lambda$ is zero, the restricted estimates are the same as the unrestricted
estimates, and a change in the restriction constant in either direction increases the SSE.
RESTRICT statements are ignored if the PCOMIT= or RIDGE= option is specified in the PROC
REG statement.
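A single restriction can be sketched as follows, assuming the fitness data set from Example 73.2. The restriction itself (equal coefficients for Age and Weight) is chosen only for illustration and has no substantive motivation.

```sas
/* Sketch only: assumes the fitness data set from Example 73.2 exists */
proc reg data=fitness;
   model Oxygen=RunTime Age Weight;
   restrict Age=Weight;    /* coefficients of Age and Weight constrained to be equal */
run;
quit;
```

In the resulting parameter estimates, the row for the restriction should show DF = -1, and the model DF should be one less than in the unrestricted fit.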
REWEIGHT Statement
REWEIGHT < condition | ALLOBS > < / options > ;
REWEIGHT < STATUS | UNDO > ;
The REWEIGHT statement interactively changes the weights of observations that are used in computing the regression equation. The REWEIGHT statement can change observation weights, or
set them to zero, which causes selected observations to be excluded from the analysis. When a
REWEIGHT statement sets observation weights to zero, the observations are not deleted from the
data set. More than one REWEIGHT statement can be used. The requests from all REWEIGHT
statements are applied to the subsequent statements. Each use of the REWEIGHT statement modifies the MODEL label.
The model and corresponding statistics are not recomputed after a REWEIGHT statement. For
example, consider the following statements:
reweight r.>0;
reweight r.>0;
The second REWEIGHT statement does not exclude any additional observations since the model
is not recomputed after the first REWEIGHT statement. Either use a REFIT statement to explicitly
refit the model, or implicitly refit the model by following the REWEIGHT statement with any other
interactive statement except a PAINT statement or another REWEIGHT statement.
The REWEIGHT statement cannot be used if a TYPE=CORR, TYPE=COV, or TYPE=SSCP data
set is used as an input data set to PROC REG. Note that the syntax used in the REWEIGHT statement is the same as the syntax in the PAINT statement.
The syntax of the REWEIGHT statement is described in the following sections.
For detailed examples of using this statement, see the section “Reweighting Observations in an
Analysis” on page 5563.
Specifying Condition
Condition is used to find observations to be reweighted. The syntax of condition is
variable compare value
or
variable compare value logical variable compare value
where
variable
is one of the following:
a variable name in the input data set
OBS., which is the observation number
keyword., where keyword is a keyword for a statistic requested in the OUTPUT
statement. The keyword specification is applied to all dependent variables in
the model.
compare
is an operator that compares variable to value. Compare can be any one of the
following: <, <=, >, >=, =, ^=. The operators LT, LE, GT, GE, EQ, and NE, respectively, can be used instead of the preceding symbols. Refer to the “Expressions”
chapter in SAS Language Reference: Concepts for more information about comparison operators.
value
gives an unformatted value of variable. Observations are selected to be reweighted if
they satisfy the condition created by variable compare value. Value can be a number
or a character string. If value is a character string, it must be eight characters or less
and must be enclosed in quotes. In addition, value is case-sensitive. In other words,
the following two statements are not the same:
reweight name=’steve’;
reweight name=’Steve’;
logical
is one of two logical operators. Either AND or OR can be used. To specify AND,
use AND or the symbol &. To specify OR, use OR or the symbol |.
Here are some examples of the variable compare value form:
reweight obs. le 10;
reweight temp=55;
reweight type=’new’;
Here are some examples of the variable compare value logical variable compare value form:
reweight obs.<=10 and residual.<2;
reweight student.<-2 or student.>2;
reweight name=’Mary’ | name=’Susan’;
Using ALLOBS
Instead of specifying condition, you can use the ALLOBS option to select all observations. This is
most useful when you want to restore the original weights of all observations. For example,
reweight allobs / reset;
resets weights for all observations and uses all observations in the subsequent analysis. Note that
reweight allobs;
specifies that all observations be excluded from analysis. Consequently, using ALLOBS is useful
only if you also use one of the options discussed in the following section.
Options in the REWEIGHT Statement
The following options can be used when either a condition, ALLOBS, or nothing is specified before the slash. If only an option is listed, the option applies to the observations selected in the
previous REWEIGHT statement, not to the observations selected by reapplying the condition from
the previous REWEIGHT statement. For example, consider the following statements:
reweight r.>0 / weight=0.1;
refit;
reweight;
The second REWEIGHT statement excludes from the analysis only those observations selected in
the first REWEIGHT statement. No additional observations are excluded even if there are new
observations that meet the condition in the first REWEIGHT statement.
NOTE: Options are not available when either the UNDO or STATUS option is used.
NOLIST
suppresses the display of the selected observation numbers. If you omit the NOLIST option,
a list of observations selected is written to the log.
RESET
resets the observation weights to their original values as defined by the WEIGHT statement
or to WEIGHT=1 if no WEIGHT statement is specified. For example,
reweight / reset;
resets observation weights to the original weights in the data set. If previous REWEIGHT
statements have been submitted, this REWEIGHT statement applies only to the observations
selected by the previous REWEIGHT statement. Note that, although the RESET option does
reset observation weights to their original values, it does not cause the model and corresponding statistics to be recomputed.
WEIGHT=value
changes observation weights to the specified nonnegative real number. If you omit the
WEIGHT= option, the observation weights are set to zero, and observations are excluded
from the analysis. For example:
reweight name=’Alan’;
/* ...other interactive statements */
reweight / weight=0.5;
The first REWEIGHT statement changes weights to zero for all observations with
name=’Alan’, effectively deleting these observations. The subsequent analysis does not
include these observations. The second REWEIGHT statement applies only to those observations selected by the previous REWEIGHT statement, and it changes the weights to 0.5
for all the observations with NAME=’Alan’. Thus, the next analysis includes all original
observations; however, those observations with NAME=’Alan’ have their weights set to 0.5.
STATUS and UNDO
If you omit condition and the ALLOBS option, you can specify one of the following options.
STATUS
writes to the log the observation’s number and the weight of all reweighted observations. If an
observation’s weight has been set to zero, it is reported as deleted. However, the observation
is not deleted from the data set, only from the analysis.
UNDO
undoes the changes made by the most recent REWEIGHT statement. Weights might be, but
are not necessarily, reset. For example, consider the following statements:
reweight student.>2 / weight=0.1;
reweight;
reweight undo;
The first REWEIGHT statement sets the weights of observations that satisfy the condition to
0.1. The second REWEIGHT statement sets the weights of the same observations to zero.
The third REWEIGHT statement undoes the second, changing the weights back to 0.1.
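The pieces of the REWEIGHT statement described above can be combined in one interactive session. The following is a sketch under the assumption that the fitness data set from Example 73.2 is available; the cutoff of 2 and the weight of 0.5 are illustrative values.

```sas
/* Sketch only: assumes the fitness data set from Example 73.2 exists */
proc reg data=fitness;
   model Oxygen=RunTime Age Weight;
run;
   /* downweight observations with large studentized residuals */
   reweight student.<-2 or student.>2 / weight=0.5;
   refit;               /* make the new weights take effect           */
   reweight status;     /* list reweighted observations in the log    */
   print anova;
run;
   reweight undo;       /* undo the most recent weight change         */
   print anova;         /* PRINT implicitly refits the model          */
run;
quit;
```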
TEST Statement
< label: > TEST equation < , . . . , equation > < / option > ;
The TEST statement tests hypotheses about the parameters estimated in the preceding MODEL
statement. It has the same syntax as the RESTRICT statement except that it supports an option.
Each equation specifies a linear hypothesis to be tested. The rows of the hypothesis are separated
by commas.
Variable names must correspond to regressors, and each variable name represents the coefficient
of the corresponding variable in the model. An optional label is useful to identify each test with a
name. The keyword INTERCEPT can be used instead of a variable name to refer to the model’s
intercept.
The REG procedure performs an F test for the joint hypotheses specified in a single TEST statement. More than one TEST statement can accompany a MODEL statement. The numerator is the
usual quadratic form of the estimates; the denominator is the mean squared error. If hypotheses can
be represented by
$$L\beta = c$$
then the numerator of the F test is
$$Q = (Lb - c)'\,(L(X'X)^{-}L')^{-1}\,(Lb - c)$$
divided by degrees of freedom, where $b$ is the estimate of $\beta$. For example:
model y=a1 a2 b1 b2;
aplus: test a1+a2=1;
b1:    test b1=0, b2=0;
b2:    test b1, b2;
The last two statements are equivalent; since no constant is specified, zero is assumed.
Note that, when the ACOV, HCC, or WHITE option is specified in the MODEL statement,
tests are recomputed using the heteroscedasticity-consistent covariance matrix specified with the
HCCMETHOD= option in the MODEL statement (see the section “Testing for Heteroscedasticity”
on page 5569).
One option can be specified in the TEST statement after a slash (/):
PRINT
displays intermediate calculations. This includes $L(X'X)^{-}L'$ bordered by $Lb - c$, and
$(L(X'X)^{-}L')^{-1}$ bordered by $(L(X'X)^{-}L')^{-1}(Lb - c)$.
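As a minimal sketch of labeled tests and the PRINT option, the following statements assume the fitness data set from Example 73.2. The labels AgeWeight and EqualSlopes are hypothetical names chosen for this illustration.

```sas
/* Sketch only: assumes the fitness data set from Example 73.2 exists */
proc reg data=fitness;
   model Oxygen=RunTime Age Weight;
   /* joint hypothesis: both coefficients are zero (one F test) */
   AgeWeight:   test Age=0, Weight=0;
   /* single hypothesis, with intermediate calculations displayed */
   EqualSlopes: test Age=Weight / print;
run;
quit;
```

Each TEST statement produces its own F test; the two rows in the first statement are tested jointly rather than separately.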
VAR Statement
VAR variables ;
The VAR statement is used to include numeric variables in the crossproducts matrix that are not
specified in the first MODEL statement.
Variables not listed in MODEL statements before the first RUN statement must be listed in the VAR
statement if you want the ability to add them interactively to the model with an ADD statement, to
include them in a new MODEL statement, or to plot them in a scatter plot with the PLOT statement.
In addition, if you want to use options in the PROC REG statement and do not want to fit a model
to the data (with a MODEL statement), you must use a VAR statement.
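For instance, the following sketch (assuming the fitness data set from Example 73.2) lists two variables in a VAR statement so that one of them can be added to the model interactively after the first RUN statement.

```sas
/* Sketch only: assumes the fitness data set from Example 73.2 exists */
proc reg data=fitness;
   var RunPulse MaxPulse;          /* kept in the crossproducts matrix */
   model Oxygen=RunTime Age;
run;
   add RunPulse;                   /* possible only because of the VAR statement */
   print;
run;
quit;
```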
WEIGHT Statement
WEIGHT variable ;
A WEIGHT statement names a variable in the input data set with values that are relative weights
for a weighted least squares fit. If the weight value is proportional to the reciprocal of the variance
for each observation, then the weighted estimates are the best linear unbiased estimates (BLUE).
Values of the weight variable must be nonnegative. If an observation’s weight is zero, the observation is deleted from the analysis. If a weight is negative or missing, it is set to zero, and the
observation is excluded from the analysis. A more complete description of the WEIGHT statement
can be found in Chapter 39, “The GLM Procedure.”
Observation weights can be changed interactively with the REWEIGHT statement.
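A weighted fit might be set up as in the following sketch. The data set work.mydata and the variables y, x, and sigma2 (an assumed known error variance for each observation) are hypothetical and appear here only for illustration.

```sas
/* Sketch only: work.mydata, y, x, and sigma2 are hypothetical names */
data work.wls;
   set work.mydata;
   w = 1 / sigma2;      /* weight proportional to the reciprocal of the variance */
run;

proc reg data=work.wls;
   model y=x;
   weight w;
run;
quit;
```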
Details: REG Procedure
Missing Values
PROC REG constructs only one crossproducts matrix for the variables in all regressions. If any
variable needed for any regression is missing, the observation is excluded from all estimates. If
you include variables with missing values in the VAR statement, the corresponding observations
are excluded from all analyses, even if you never include the variables in a model. PROC REG
assumes that you might want to include these variables after the first RUN statement and deletes
observations with missing values.
Input Data Sets
PROC REG does not compute new regressors. For example, if you want a quadratic term in your
model, you should create a new variable when you prepare the input data. For example, the statement
model y=x1 x1*x1;
is not valid. Note that this MODEL statement is valid in the GLM procedure.
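One way to supply the quadratic term is to compute it in a DATA step before calling PROC REG. In this sketch the data set work.mydata and the new variable name x1sq are hypothetical.

```sas
/* Sketch only: work.mydata and x1sq are hypothetical names */
data work.quad;
   set work.mydata;
   x1sq = x1*x1;        /* quadratic term created before the regression */
run;

proc reg data=work.quad;
   model y=x1 x1sq;
run;
quit;
```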
The input data set for most applications of PROC REG contains standard rectangular data, but special TYPE=CORR, TYPE=COV, and TYPE=SSCP data sets can also be used. TYPE=CORR and
TYPE=COV data sets created by the CORR procedure contain means and standard deviations. In
addition, TYPE=CORR data sets contain correlations and TYPE=COV data sets contain covariances. TYPE=SSCP data sets created in previous runs of PROC REG that used the OUTSSCP=
option contain the sums of squares and crossproducts of the variables.
See Appendix A, “Special SAS Data Sets,” and the “SAS Files” section in SAS Language Reference:
Concepts for more information about special SAS data sets.
These summary files save CPU time. It takes $nk^2$ operations (where n=number of observations and
k=number of variables) to calculate crossproducts; the regressions are of the order $k^3$. When n is in
the thousands and k is less than 10, you can save 99% of the CPU time by reusing the SSCP matrix
rather than recomputing it.
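As a rough worked illustration of this claim, take the assumed values n = 5000 and k = 8 and treat the operation counts as exact, ignoring constant factors:

```latex
\begin{align*}
\text{crossproducts: } nk^2 &= 5000 \times 8^2 = 320{,}000 \\
\text{regression: } k^3 &= 8^3 = 512 \\
\text{fraction saved} &\approx \frac{nk^2}{nk^2 + k^3}
   = \frac{320{,}000}{320{,}512} \approx 0.998
\end{align*}
```

Under these assumptions, nearly all of the work lies in forming the crossproducts, which is consistent with the 99% figure.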
When you want to use a special SAS data set as input, PROC REG must determine the TYPE for
the data set. PROC CORR and PROC REG automatically set the type for their output data sets.
However, if you create the data set by some other means (such as a DATA step), you must specify
its type with the TYPE= data set option. If the TYPE for the data set is not specified when the data
set is created, you can specify TYPE= as a data set option in the DATA= option in the PROC REG
statement. For example:
proc reg data=a(type=corr);
When a TYPE=CORR, TYPE=COV, or TYPE=SSCP data set is used with PROC REG, statements
and options that require the original data values have no effect. The OUTPUT, PAINT, PLOT, and
REWEIGHT statements and the MODEL and PRINT statement options P, R, CLM, CLI, DW, INFLUENCE, and PARTIAL are disabled since the original observations needed to calculate predicted
and residual values are not present.
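When a correlation matrix is entered by hand, the TYPE= data set option can be set in the DATA step that creates the data set. The following is a sketch only: the data set name corrmat, the variable names, and all numeric values are made up for illustration.

```sas
/* Sketch only: all names and values here are invented for illustration */
data corrmat(type=corr);
   input _type_ $ _name_ $ y x1 x2;
   datalines;
MEAN .   0    0    0
STD  .   1    1    1
N    .   50   50   50
CORR y   1.0  0.6  0.4
CORR x1  0.6  1.0  0.3
CORR x2  0.4  0.3  1.0
;

proc reg data=corrmat;
   model y=x1 x2;
run;
quit;
```

A single period read with list input leaves the character variable _NAME_ blank in the MEAN, STD, and N rows.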
Example Using TYPE=CORR Data Set
The following statements use PROC CORR to produce an input data set for PROC REG. The fitness
data for this analysis can be found in Example 73.2.
proc corr data=fitness outp=r noprint;
   var Oxygen RunTime Age Weight RunPulse MaxPulse RestPulse;
proc print data=r;
proc reg data=r;
   model Oxygen=RunTime Age Weight;
run;
Since the OUTP= data set from PROC CORR is automatically set to TYPE=CORR, the TYPE=
data set option is not required in this example. The data set containing the correlation matrix is
displayed by the PRINT procedure as shown in Figure 73.14. Figure 73.15 shows results from the
regression that uses the TYPE=CORR data as an input data set.
Figure 73.14 TYPE=CORR Data Set Created by PROC CORR

Obs  _TYPE_  _NAME_      Oxygen   RunTime      Age   Weight  RunPulse  MaxPulse  RestPulse
  1  MEAN               47.3758   10.5861  47.6774  77.4445   169.645   173.774    53.4516
  2  STD                 5.3272    1.3874   5.2114   8.3286    10.252     9.164     7.6194
  3  N                  31.0000   31.0000  31.0000  31.0000    31.000    31.000    31.0000
  4  CORR    Oxygen      1.0000   -0.8622  -0.3046  -0.1628    -0.398    -0.237    -0.3994
  5  CORR    RunTime    -0.8622    1.0000   0.1887   0.1435     0.314     0.226     0.4504
  6  CORR    Age        -0.3046    0.1887   1.0000  -0.2335    -0.338    -0.433    -0.1641
  7  CORR    Weight     -0.1628    0.1435  -0.2335   1.0000     0.182     0.249     0.0440
  8  CORR    RunPulse   -0.3980    0.3136  -0.3379   0.1815     1.000     0.930     0.3525
  9  CORR    MaxPulse   -0.2367    0.2261  -0.4329   0.2494     0.930     1.000     0.3051
 10  CORR    RestPulse  -0.3994    0.4504  -0.1641   0.0440     0.352     0.305     1.0000
Figure 73.15 Regression on TYPE=CORR Data Set

The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

Analysis of Variance

                             Sum of        Mean
Source             DF       Squares      Square   F Value   Pr > F
Model               3     656.27095   218.75698     30.27   <.0001
Error              27     195.11060     7.22632
Corrected Total    30     851.38154

Root MSE          2.68818   R-Square   0.7708
Dependent Mean   47.37581   Adj R-Sq   0.7454
Coeff Var         5.67416

Parameter Estimates

                  Parameter    Standard
Variable    DF     Estimate       Error   t Value   Pr > |t|
Intercept    1     93.12615     7.55916     12.32     <.0001
RunTime      1     -3.14039     0.36738     -8.55     <.0001
Age          1     -0.17388     0.09955     -1.75     0.0921
Weight       1     -0.05444     0.06181     -0.88     0.3862
The following example uses the saved crossproducts matrix:
proc reg data=fitness outsscp=sscp noprint;
model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse;
proc print data=sscp;
proc reg data=sscp;
model Oxygen=RunTime Age Weight;
run;
First, all variables are used to fit the data and create the SSCP data set. Figure 73.16 shows the
PROC PRINT display of the SSCP data set. The SSCP data set is then used as the input data set for
PROC REG, and a reduced model is fit to the data.
Figure 73.16 TYPE=SSCP Data Set Created by PROC REG

Obs  _TYPE_  _NAME_     Intercept    RunTime        Age     Weight
  1  SSCP    Intercept      31.00     328.17    1478.00    2400.78
  2  SSCP    RunTime       328.17    3531.80   15687.24   25464.71
  3  SSCP    Age          1478.00   15687.24   71282.00  114158.90
  4  SSCP    Weight       2400.78   25464.71  114158.90  188008.20
  5  SSCP    RunPulse     5259.00   55806.29  250194.00  407745.67
  6  SSCP    MaxPulse     5387.00   57113.72  256218.00  417764.62
  7  SSCP    RestPulse    1657.00   17684.05   78806.00  128409.28
  8  SSCP    Oxygen       1468.65   15356.14   69767.75  113522.26
  9  N                      31.00      31.00      31.00      31.00

Obs   RunPulse   MaxPulse  RestPulse     Oxygen
  1    5259.00    5387.00    1657.00    1468.65
  2   55806.29   57113.72   17684.05   15356.14
  3  250194.00  256218.00   78806.00   69767.75
  4  407745.67  417764.62  128409.28  113522.26
  5  895317.00  916499.00  281928.00  248497.31
  6  916499.00  938641.00  288583.00  254866.75
  7  281928.00  288583.00   90311.00   78015.41
  8  248497.31  254866.75   78015.41   70429.86
  9      31.00      31.00      31.00      31.00
Figure 73.17 also shows the PROC REG results for the reduced model. (For the PROC REG results
for the full model, see Figure 73.29.)
Figure 73.17 Regression on TYPE=SSCP Data Set

The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

Analysis of Variance

                             Sum of          Mean
Source             DF       Squares        Square    F Value    Pr > F
Model               3     656.27095     218.75698      30.27    <.0001
Error              27     195.11060       7.22632
Corrected Total    30     851.38154

Root MSE            2.68818    R-Square    0.7708
Dependent Mean     47.37581    Adj R-Sq    0.7454
Coeff Var           5.67416

Parameter Estimates

                  Parameter      Standard
Variable    DF     Estimate         Error    t Value    Pr > |t|
Intercept    1     93.12615       7.55916      12.32      <.0001
RunTime      1     -3.14039       0.36738      -8.55      <.0001
Age          1     -0.17388       0.09955      -1.75      0.0921
Weight       1     -0.05444       0.06181      -0.88      0.3862
In the preceding example, the TYPE= data set option is not required since PROC REG sets the
OUTSSCP= data set to TYPE=SSCP.
Output Data Sets
OUTEST= Data Set
The OUTEST= specification produces a TYPE=EST output SAS data set containing estimates and
optional statistics from the regression models. For each BY group on each dependent variable
occurring in each MODEL statement, PROC REG outputs an observation to the OUTEST= data
set. The variables output to the data set are as follows:
the BY variables, if any
_MODEL_, a character variable containing the label of the corresponding MODEL statement,
or MODELn if no label is specified, where n is 1 for the first MODEL statement, 2 for the
second model statement, and so on
_TYPE_, a character variable with the value ’PARMS’ for every observation
_DEPVAR_, the name of the dependent variable
_RMSE_, the root mean squared error or the estimate of the standard deviation of the error
term
Intercept, the estimated intercept, unless the NOINT option is specified
all the variables listed in any MODEL or VAR statement. Values of these variables are the
estimated regression coefficients for the model. A variable that does not appear in the model
corresponding to a given observation has a missing value in that observation. The dependent
variable in each model is given a value of -1.
If you specify the COVOUT option, the covariance matrix of the estimates is output after the estimates; the _TYPE_ variable is set to the value ’COV’ and the names of the rows are identified by the
character variable, _NAME_.
If you specify the TABLEOUT option, the following statistics listed by _TYPE_ are added after the
estimates:
STDERR, the standard error of the estimate
T, the t statistic for testing if the estimate is zero
PVALUE, the associated p-value
LnB, the $100(1-\alpha)$ lower confidence limit for the estimate, where n is the nearest integer to
$100(1-\alpha)$ and $\alpha$ defaults to 0.05 or is set by using the ALPHA= option in the PROC REG
or MODEL statement
UnB, the $100(1-\alpha)$ upper confidence limit for the estimate
Specifying any of the options ADJRSQ, AIC, BIC, CP, EDF, GMSEP, JP, MSE, PC, RSQUARE, SBC,
SP, or SSE in the PROC REG or MODEL statement automatically outputs the requested statistics and
the model $R^2$ for each model selected, regardless of the model selection method. Additional
variables, in order of occurrence, are as follows:
_IN_, the number of regressors in the model not including the intercept
_P_, the number of parameters in the model including the intercept, if any
_EDF_, the error degrees of freedom
_SSE_, the error sum of squares, if the SSE option is specified
_MSE_, the mean squared error, if the MSE option is specified
_RSQ_, the $R^2$ statistic
_ADJRSQ_, the adjusted $R^2$, if the ADJRSQ option is specified
_CP_, the $C_p$ statistic, if the CP option is specified
_SP_, the $S_p$ statistic, if the SP option is specified
_JP_, the $J_p$ statistic, if the JP option is specified
_PC_, the PC statistic, if the PC option is specified
_GMSEP_, the GMSEP statistic, if the GMSEP option is specified
_AIC_, the AIC statistic, if the AIC option is specified
_BIC_, the BIC statistic, if the BIC option is specified
_SBC_, the SBC statistic, if the SBC option is specified
The following statements produce and display the OUTEST= data set. This example uses the population data given in the section “Polynomial Regression” on page 5434. Figure 73.18 through
Figure 73.20 show the regression equations and the resulting OUTEST= data set.
proc reg data=USPopulation outest=est;
m1: model Population=Year;
m2: model Population=Year YearSq;
proc print data=est;
run;
Figure 73.18 Regression Output for Model M1

The REG Procedure
Model: m1
Dependent Variable: Population

Analysis of Variance

                             Sum of          Mean
Source             DF       Squares        Square    F Value    Pr > F
Model               1        146869        146869     228.92    <.0001
Error              20         12832     641.58160
Corrected Total    21        159700

Root MSE           25.32946    R-Square    0.9197
Dependent Mean     94.64800    Adj R-Sq    0.9156
Coeff Var          26.76175

Parameter Estimates

                    Parameter      Standard
Variable    DF       Estimate         Error    t Value    Pr > |t|
Intercept    1    -2345.85498     161.39279     -14.54      <.0001
Year         1        1.28786       0.08512      15.13      <.0001
Figure 73.19 Regression Output for Model M2

Analysis of Variance

                             Sum of          Mean
Source             DF       Squares        Square    F Value    Pr > F
Model               2        159529         79765    8864.19    <.0001
Error              19     170.97193       8.99852
Corrected Total    21        159700

Root MSE            2.99975    R-Square    0.9989
Dependent Mean     94.64800    Adj R-Sq    0.9988
Coeff Var           3.16938

Parameter Estimates

                    Parameter       Standard
Variable    DF       Estimate          Error    t Value    Pr > |t|
Intercept    1          21631      639.50181      33.82      <.0001
Year         1      -24.04581        0.67547     -35.60      <.0001
YearSq       1        0.00668     0.00017820      37.51      <.0001
Figure 73.20 OUTEST= Data Set

Obs  _MODEL_  _TYPE_  _DEPVAR_     _RMSE_   Intercept       Year   Population       YearSq
  1  m1       PARMS   Population  25.3295    -2345.85     1.2879           -1            .
  2  m2       PARMS   Population   2.9998    21630.89   -24.0458           -1   .006684346
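Because the OUTEST= data set stores the coefficients as a TYPE=EST data set, it can be passed to other procedures. The following is a minimal sketch, not part of the original example, that scores the input data with the m2 coefficients by using PROC SCORE. It assumes that the USPopulation and est data sets from the preceding statements exist; the output data set name pred is hypothetical, and the name of the new score variable (m2) is assumed to be taken from the _MODEL_ variable.

```sas
/* Sketch: compute predicted values from the coefficients saved in EST.  */
/* The WHERE= option keeps only model m2 so that no coefficient is       */
/* missing. PRED is a hypothetical output data set name.                 */
proc score data=USPopulation
           score=est(where=(_MODEL_='m2'))
           out=pred
           type=parms;
   var Year YearSq;   /* regressors only, so the scores are predicted values */
run;

/* The score variable is assumed to be named after the model label (m2). */
proc print data=pred;
   var Year Population m2;
run;
```

Listing only the regressors in the VAR statement yields predicted values. The value of -1 stored for the dependent variable matters only if the dependent variable is also listed in the VAR statement.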
The following modification of the previous example uses the TABLEOUT and ALPHA= options to
obtain additional information in the OUTEST= data set:
proc reg data=USPopulation outest=est tableout alpha=0.1;
m1: model Population=Year/noprint;
m2: model Population=Year YearSq/noprint;
proc print data=est;
run;
Notice that the TABLEOUT option causes standard errors, t statistics, p-values, and confidence
limits for the estimates to be added to the OUTEST= data set. Also note that the ALPHA= option
is used to set the confidence level at 90%. The OUTEST= data set is shown in Figure 73.21.
Figure 73.21 The OUTEST= Data Set When TABLEOUT Is Specified

Obs  _MODEL_  _TYPE_  _DEPVAR_     _RMSE_   Intercept       Year   Population    YearSq
  1  m1       PARMS   Population  25.3295    -2345.85     1.2879           -1         .
  2  m1       STDERR  Population  25.3295      161.39     0.0851            .         .
  3  m1       T       Population  25.3295      -14.54    15.1300            .         .
  4  m1       PVALUE  Population  25.3295        0.00     0.0000            .         .
  5  m1       L90B    Population  25.3295    -2624.21     1.1411            .         .
  6  m1       U90B    Population  25.3295    -2067.50     1.4347            .         .
  7  m2       PARMS   Population   2.9998    21630.89   -24.0458           -1    0.0067
  8  m2       STDERR  Population   2.9998      639.50     0.6755            .    0.0002
  9  m2       T       Population   2.9998       33.82   -35.5988            .   37.5096
 10  m2       PVALUE  Population   2.9998        0.00     0.0000            .    0.0000
 11  m2       L90B    Population   2.9998    20525.11   -25.2138            .    0.0064
 12  m2       U90B    Population   2.9998    22736.68   -22.8778            .    0.0070
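The L90B and U90B rows can be checked by hand. The following DATA step is an illustrative sketch that assumes the limits are the usual t-based limits, estimate plus or minus the t quantile times the standard error. It uses the Year estimate, standard error, and error degrees of freedom for model m1 from Figure 73.18; the variable names are hypothetical.

```sas
/* Sketch: reproduce the 90% limits for the Year coefficient of model m1. */
data _null_;
   b     = 1.28786;                 /* parameter estimate (Figure 73.18)  */
   se    = 0.08512;                 /* standard error (Figure 73.18)      */
   df    = 20;                      /* error degrees of freedom           */
   alpha = 0.10;                    /* matches ALPHA=0.1                  */
   t     = tinv(1 - alpha/2, df);   /* two-sided t quantile               */
   l90b  = b - t*se;
   u90b  = b + t*se;
   put l90b=8.4 u90b=8.4;           /* written to the SAS log             */
run;
```

The values written to the log should agree, up to rounding of the inputs, with the Year entries in observations 5 and 6 of Figure 73.21.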
A slightly different OUTEST= data set is created when you use the RSQUARE selection method.
The following statements request only the “best” model for each subset size but ask for a variety of
model selection statistics, as well as the estimated regression coefficients. An OUTEST= data set is
created and displayed. See Figure 73.22 and Figure 73.23 for the results.
proc reg data=fitness outest=est;
model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
/ selection=rsquare mse jp gmsep cp aic bic sbc b best=1;
proc print data=est;
run;
Figure 73.22 PROC REG Output for Physical Fitness Data: Best Models

The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen
R-Square Selection Method

Number in                                               Estimated MSE
  Model     R-Square       C(p)        AIC        BIC   of Prediction       J(p)        MSE
      1       0.7434    13.6988    64.5341    65.4673          8.0546     8.0199    7.53384
      2       0.7642    12.3894    63.9050    64.8212          7.9478     7.8621    7.16842
      3       0.8111     6.9596    59.0373    61.3127          6.8583     6.7253    5.95669
      4       0.8368     4.8800    56.4995    60.3996          6.3984     6.2053    5.34346
      5       0.8480     5.1063    56.2986    61.5667          6.4565     6.1782    5.17634
      6       0.8487     7.0000    58.1616    64.0748          6.9870     6.5804    5.36825

Number in                           ---------------Parameter Estimates---------------
  Model     R-Square        SBC     Intercept         Age      Weight     RunTime
      1       0.7434   67.40210      82.42177           .           .    -3.31056
      2       0.7642   68.20695      88.46229    -0.15037           .    -3.20395
      3       0.8111   64.77326     111.71806    -0.25640           .    -2.82538
      4       0.8368   63.66941      98.14789    -0.19773           .    -2.76758
      5       0.8480   64.90250     102.20428    -0.21962    -0.07230    -2.68252
      6       0.8487   68.19952     102.93448    -0.22697    -0.07418    -2.62865

Number in               --------Parameter Estimates--------
  Model     R-Square    RunPulse    RestPulse    MaxPulse
      1       0.7434           .            .           .
      2       0.7642           .            .           .
      3       0.8111    -0.13091            .           .
      4       0.8368    -0.34811            .     0.27051
      5       0.8480    -0.37340            .     0.30491
      6       0.8487    -0.36963     -0.02153     0.30322
Figure 73.23 PROC PRINT Output for Physical Fitness Data: OUTEST= Data Set

Obs  _MODEL_  _TYPE_  _DEPVAR_    _RMSE_   Intercept        Age      Weight
  1  MODEL1   PARMS   Oxygen     2.74478      82.422          .           .
  2  MODEL1   PARMS   Oxygen     2.67739      88.462   -0.15037           .
  3  MODEL1   PARMS   Oxygen     2.44063     111.718   -0.25640           .
  4  MODEL1   PARMS   Oxygen     2.31159      98.148   -0.19773           .
  5  MODEL1   PARMS   Oxygen     2.27516     102.204   -0.21962   -0.072302
  6  MODEL1   PARMS   Oxygen     2.31695     102.934   -0.22697   -0.074177

Obs    RunTime   RunPulse   RestPulse   MaxPulse   Oxygen   _IN_   _P_   _EDF_     _MSE_
  1   -3.31056          .           .          .       -1      1     2      29   7.53384
  2   -3.20395          .           .          .       -1      2     3      28   7.16842
  3   -2.82538   -0.13091           .          .       -1      3     4      27   5.95669
  4   -2.76758   -0.34811           .    0.27051       -1      4     5      26   5.34346
  5   -2.68252   -0.37340           .    0.30491       -1      5     6      25   5.17634
  6   -2.62865   -0.36963   -0.021534    0.30322       -1      6     7      24   5.36825

Obs    _RSQ_      _CP_      _JP_   _GMSEP_     _AIC_     _BIC_     _SBC_
  1  0.74338   13.6988   8.01990   8.05462   64.5341   65.4673   67.4021
  2  0.76425   12.3894   7.86214   7.94778   63.9050   64.8212   68.2069
  3  0.81109    6.9596   6.72530   6.85833   59.0373   61.3127   64.7733
  4  0.83682    4.8800   6.20531   6.39837   56.4995   60.3996   63.6694
  5  0.84800    5.1063   6.17821   6.45651   56.2986   61.5667   64.9025
  6  0.84867    7.0000   6.58043   6.98700   58.1616   64.0748   68.1995
OUTSSCP= Data Sets
The OUTSSCP= option produces a TYPE=SSCP output SAS data set containing sums of squares
and crossproducts. A special row (observation) and column (variable) of the matrix called Intercept
contain the number of observations and sums. Observations are identified by the character variable
_NAME_. The data set contains all variables used in MODEL statements. You can specify additional
variables that you want included in the crossproducts matrix with a VAR statement.
The SSCP data set is used when a large number of observations are explored in many different runs.
The SSCP data set can be saved and used for subsequent runs, which are much less expensive since
PROC REG never reads the original data again. If you run PROC REG once to create only an SSCP
data set, you should list all the variables that you might need in a VAR statement or include all the
variables that you might need in a MODEL statement.
The following statements use the fitness data from Example 73.2 to produce an output data set with
the OUTSSCP= option. The resulting output is shown in Figure 73.24.
proc reg data=fitness outsscp=sscp;
var Oxygen RunTime Age Weight RestPulse RunPulse MaxPulse;
proc print data=sscp;
run;
Since a model is not fit to the data and since the only request is to create the SSCP data set, a
MODEL statement is not required in this example. However, since the MODEL statement is not
used, the VAR statement is required.
Figure 73.24 SSCP Data Set Created with OUTSSCP= Option: REG Procedure

Obs  _TYPE_  _NAME_      Intercept      Oxygen     RunTime         Age
  1  SSCP    Intercept       31.00     1468.65      328.17     1478.00
  2  SSCP    Oxygen        1468.65    70429.86    15356.14    69767.75
  3  SSCP    RunTime        328.17    15356.14     3531.80    15687.24
  4  SSCP    Age           1478.00    69767.75    15687.24    71282.00
  5  SSCP    Weight        2400.78   113522.26    25464.71   114158.90
  6  SSCP    RestPulse     1657.00    78015.41    17684.05    78806.00
  7  SSCP    RunPulse      5259.00   248497.31    55806.29   250194.00
  8  SSCP    MaxPulse      5387.00   254866.75    57113.72   256218.00
  9  N                       31.00       31.00       31.00       31.00

Obs      Weight   RestPulse    RunPulse    MaxPulse
  1     2400.78     1657.00     5259.00     5387.00
  2   113522.26    78015.41   248497.31   254866.75
  3    25464.71    17684.05    55806.29    57113.72
  4   114158.90    78806.00   250194.00   256218.00
  5   188008.20   128409.28   407745.67   417764.62
  6   128409.28    90311.00   281928.00   288583.00
  7   407745.67   281928.00   895317.00   916499.00
  8   417764.62   288583.00   916499.00   938641.00
  9       31.00       31.00       31.00       31.00
Interactive Analysis
PROC REG enables you to change interactively both the model and the data used to compute the
model, and to produce and highlight scatter plots. See the section “Using PROC REG Interactively”
on page 5444 for an overview of interactive analysis that uses PROC REG. The following statements
can be used interactively (without reinvoking PROC REG): ADD, DELETE, MODEL, MTEST,
OUTPUT, PAINT, PLOT, PRINT, REFIT, RESTRICT, REWEIGHT, and TEST. All interactive
features are disabled if there is a BY statement.
The ADD, DELETE, and REWEIGHT statements can be used to modify the current MODEL.
Every use of an ADD, DELETE, or REWEIGHT statement causes the model label to be modified
by attaching an additional number to it. This number is the cumulative total of the number of ADD,
DELETE, or REWEIGHT statements following the current MODEL statement.
A more detailed explanation of changing the data used to compute the model is given in the section
“Reweighting Observations in an Analysis” on page 5563. Extra features for line printer scatter
plots are discussed in the section “Line Printer Scatter Plot Features” on page 5528.
The following statements illustrate the usefulness of the interactive features. First, the full regression model is fit to the Class data (see the section “Getting Started: REG Procedure” on page 5430),
and Figure 73.25 is produced.
ods graphics on;
proc reg data=Class plots(modelLabel only)=ResidualByPredicted;
model Weight=Age Height;
run;
Figure 73.25 Interactive Analysis: Full Model

The REG Procedure
Model: MODEL1
Dependent Variable: Weight

Analysis of Variance

                             Sum of          Mean
Source             DF       Squares        Square    F Value    Pr > F
Model               2    7215.63710    3607.81855      27.23    <.0001
Error              16    2120.09974     132.50623
Corrected Total    18    9335.73684

Root MSE           11.51114    R-Square    0.7729
Dependent Mean    100.02632    Adj R-Sq    0.7445
Coeff Var          11.50811

Parameter Estimates

                   Parameter      Standard
Variable    DF      Estimate         Error    t Value    Pr > |t|
Intercept    1    -141.22376      33.38309      -4.23      0.0006
Age          1       1.27839       3.11010       0.41      0.6865
Height       1       3.59703       0.90546       3.97      0.0011
Next, the regression model is reduced by the following statements, and Figure 73.26 is produced.
delete age;
print;
run;
Figure 73.26 Interactive Analysis: Reduced Model

Analysis of Variance

                             Sum of          Mean
Source             DF       Squares        Square    F Value    Pr > F
Model               1    7193.24912    7193.24912      57.08    <.0001
Error              17    2142.48772     126.02869
Corrected Total    18    9335.73684

Root MSE           11.22625    R-Square    0.7705
Dependent Mean    100.02632    Adj R-Sq    0.7570
Coeff Var          11.22330

Parameter Estimates

                   Parameter      Standard
Variable    DF      Estimate         Error    t Value    Pr > |t|
Intercept    1    -143.02692      32.27459      -4.43      0.0004
Height       1       3.89903       0.51609       7.55      <.0001
Note that the MODEL label has been changed from MODEL1 to MODEL1.1, since the original
MODEL has been changed by the DELETE statement.
When ODS Graphics is enabled, updated plots are produced whenever a PRINT statement is used.
The option
plots(modelLabel only)=ResidualByPredicted
in the PROC REG statement specifies that the only plot produced is a scatter plot of residuals by
predicted values. The MODELLABEL option specifies that the current model label is added to the
plot.
The following statements generate a scatter plot of the residuals against the predicted values from
the full model. Figure 73.27 is produced, and the scatter plot shows a possible outlier.
add age;
print;
run;
Figure 73.27 Interactive Analysis: Scatter Plot
The following statements delete the observation with the largest residual, refit the regression model,
and produce a scatter plot of residuals against predicted values for the refitted model. Figure 73.28
shows the new scatter plot.
reweight r.>20;
print;
run;
ods graphics off;
Figure 73.28 Interactive Analysis: Scatter Plot
Model-Selection Methods
The nine methods of model selection implemented in PROC REG are specified with the SELECTION= option in the MODEL statement. Each method is discussed in this section.
Full Model Fitted (NONE)
This method is the default and provides no model selection capability. The complete model specified
in the MODEL statement is used to fit the model. For many regression analyses, this might be the
only method you need.
Forward Selection (FORWARD)
The forward-selection technique begins with no variables in the model. For each of the independent
variables, the FORWARD method calculates F statistics that reflect the variable’s contribution to
the model if it is included. The p-values for these F statistics are compared to the SLENTRY=
value that is specified in the MODEL statement (or to 0.50 if the SLENTRY= option is omitted). If
no F statistic is significant at the SLENTRY= level (that is, if every p-value exceeds the SLENTRY=
value), the FORWARD selection stops. Otherwise, the FORWARD method adds the variable that has
the largest F statistic to the model. The FORWARD method then calculates F statistics again for the
variables still remaining outside the model, and the evaluation process is repeated. Thus, variables
are added one by one to the model until no remaining variable produces a significant F statistic.
Once a variable is in the model, it stays.
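As an illustration of the description above, the following sketch requests forward selection for the fitness data used elsewhere in this chapter. The data set is assumed to exist already, and the SLENTRY= value of 0.10 is an arbitrary choice for the example rather than a recommendation.

```sas
/* Sketch: forward selection with an explicit entry significance level. */
/* Assumes the fitness data set has already been created.               */
proc reg data=fitness;
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=forward slentry=0.10;
run;
```

Omitting SLENTRY= would leave the entry level at the 0.50 default mentioned above, which typically admits more variables.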
Backward Elimination (BACKWARD)
The backward elimination technique begins by calculating F statistics for a model, including all of
the independent variables. Then the variables are deleted from the model one by one until all the
variables remaining in the model produce F statistics significant at the SLSTAY= level specified in
the MODEL statement (or at the 0.10 level if the SLSTAY= option is omitted). At each step, the
variable showing the smallest contribution to the model is deleted.
Stepwise (STEPWISE)
The stepwise method is a modification of the forward-selection technique and differs in that variables already in the model do not necessarily stay there. As in the forward-selection method, variables are added one by one to the model, and the F statistic for a variable to be added must be
significant at the SLENTRY= level. After a variable is added, however, the stepwise method looks
at all the variables already included in the model and deletes any variable that does not produce
an F statistic significant at the SLSTAY= level. Only after this check is made and the necessary
deletions are accomplished can another variable be added to the model. The stepwise process ends
when none of the variables outside the model has an F statistic significant at the SLENTRY= level
and every variable in the model is significant at the SLSTAY= level, or when the variable to be
added to the model is the one just deleted from it.
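The stepwise procedure just described can be requested as in the following sketch. The fitness data set is assumed to exist, and the SLENTRY= and SLSTAY= values are illustrative choices only.

```sas
/* Sketch: stepwise selection with explicit entry and stay levels.   */
/* A variable enters if it is significant at the SLENTRY= level and  */
/* is removed later if it is not significant at the SLSTAY= level.   */
proc reg data=fitness;
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=stepwise slentry=0.15 slstay=0.15;
run;
```

Replacing SELECTION=STEPWISE with SELECTION=BACKWARD (and keeping only SLSTAY=) gives the corresponding backward elimination run.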
Maximum $R^2$ Improvement (MAXR)
The maximum $R^2$ improvement technique does not settle on a single model. Instead, it tries to
find the “best” one-variable model, the “best” two-variable model, and so forth, although it is not
guaranteed to find the model with the largest $R^2$ for each size.
The MAXR method begins by finding the one-variable model producing the highest $R^2$. Then
another variable, the one that yields the greatest increase in $R^2$, is added. Once the two-variable
model is obtained, each of the variables in the model is compared to each variable not in the model.
For each comparison, the MAXR method determines if removing one variable and replacing it with
the other variable increases $R^2$. After comparing all possible switches, the MAXR method makes
the switch that produces the largest increase in $R^2$. Comparisons begin again, and the process
continues until the MAXR method finds that no switch could increase $R^2$. Thus, the two-variable
model achieved is considered the “best” two-variable model the technique can find. Another variable
is then added to the model, and the comparing-and-switching process is repeated to find the “best”
three-variable model, and so forth.
The difference between the STEPWISE method and the MAXR method is that all switches are
evaluated before any switch is made in the MAXR method. In the STEPWISE method, the “worst”
variable might be removed without considering what adding the “best” remaining variable might
accomplish. The MAXR method might require much more computer time than the STEPWISE
method.
Minimum $R^2$ Improvement (MINR)
The MINR method closely resembles the MAXR method, but the switch chosen is the one that
produces the smallest increase in $R^2$. For a given number of variables in the model, the MAXR
and MINR methods usually produce the same “best” model, but the MINR method considers more
models of each size.
$R^2$ Selection (RSQUARE)
The RSQUARE method finds subsets of independent variables that best predict a dependent variable
by linear regression in the given sample. You can specify the largest and smallest number of
independent variables to appear in a subset and the number of subsets of each size to be selected.
The RSQUARE method can efficiently perform all possible subset regressions and display the models
in decreasing order of $R^2$ magnitude within each subset size. Other statistics are available for
comparing subsets of different sizes. These statistics, as well as estimated regression coefficients,
can be displayed or output to a SAS data set.
The subset models selected by the RSQUARE method are optimal in terms of $R^2$ for the given
sample, but they are not necessarily optimal for the population from which the sample is drawn or
for any other sample for which you might want to make predictions. If a subset model is selected
on the basis of a large $R^2$ value or any other criterion commonly used for model selection, then all
regression statistics computed for that model under the assumption that the model is given a priori,
including all statistics computed by PROC REG, are biased.
While the RSQUARE method is a useful tool for exploratory model building, no statistical method
can be relied on to identify the “true” model. Effective model building requires substantive theory
to suggest relevant predictors and plausible functional forms for the model.
The RSQUARE method differs from the other selection methods in that RSQUARE always identifies
the model with the largest $R^2$ for each number of variables considered. The other selection
methods are not guaranteed to find the model with the largest $R^2$. The RSQUARE method requires
much more computer time than the other selection methods, so a different selection method such as
the STEPWISE method is a good choice when there are many independent variables to consider.
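The options for limiting subset sizes and the number of subsets reported can be combined as in the following sketch. The fitness data set is assumed to exist, and the particular values of START=, STOP=, and BEST= are illustrative only.

```sas
/* Sketch: all-subsets search restricted to models with 2 to 4 regressors, */
/* reporting the 3 subsets with the largest R-square for each size,        */
/* together with C(p) and adjusted R-square for comparison across sizes.   */
proc reg data=fitness;
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=rsquare start=2 stop=4 best=3 cp adjrsq;
run;
```

Restricting the range of subset sizes in this way is one means of reducing the computing time mentioned above when many candidate regressors are available.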
Adjusted $R^2$ Selection (ADJRSQ)
This method is similar to the RSQUARE method, except that the adjusted $R^2$ statistic is used as the
criterion for selecting models, and the method finds the models with the highest adjusted $R^2$ within
the range of sizes.
Mallows’ $C_p$ Selection (CP)
This method is similar to the ADJRSQ method, except that Mallows’ $C_p$ statistic is used as the
criterion for model selection. Models are listed in ascending order of $C_p$.
Additional Information about Model-Selection Methods
If the RSQUARE or STEPWISE procedure (as documented in SAS User’s Guide: Statistics, Version
5 Edition) is requested, PROC REG with the appropriate model-selection method is actually used.
Reviews of model-selection methods by Hocking (1976) and Judge et al. (1980) describe these and
other variable-selection methods.
Criteria Used in Model-Selection Methods
When many significance tests are performed, each at a level of, for example, 5%, the overall probability of rejecting at least one true null hypothesis is much larger than 5%. If you want to guard
against including any variables that do not contribute to the predictive power of the model in the
population, you should specify a very small SLE= significance level for the FORWARD and STEPWISE methods and a very small SLS= significance level for the BACKWARD and STEPWISE
methods.
In most applications, many of the variables considered have some predictive power, however small.
If you want to choose the model that provides the best prediction computed using the sample estimates, you need only to guard against estimating more parameters than can be reliably estimated
with the given sample size, so you should use a moderate significance level, perhaps in the range of
10% to 25%.
In addition to $R^2$, the $C_p$ statistic is displayed for each model generated in the model-selection
methods. The $C_p$ statistic is proposed by Mallows (1973) as a criterion for selecting a model. It is
a measure of total squared error defined as
$$C_p = \frac{\mathrm{SSE}_p}{s^2} - (N - 2p)$$
where $s^2$ is the MSE for the full model, $\mathrm{SSE}_p$ is the sum-of-squares error for a model with
$p$ parameters including the intercept, if any, and $N$ is the number of observations. If $C_p$ is plotted
against $p$, Mallows recommends the model where $C_p$ first approaches $p$. When the right model is
chosen, the parameter estimates are unbiased, and this is reflected in $C_p$ near $p$. For further
discussion, refer to Daniel and Wood (1980).
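As a worked check of the definition, the following DATA step sketch recomputes $C_p$ for the best four-variable fitness model by using values from Figure 73.22 and Figure 73.23. The variable names are hypothetical, and the error sum of squares is recovered from the rounded MSE, so the result agrees with the displayed value only up to rounding.

```sas
/* Sketch: Mallows' C(p) for the best four-regressor fitness model. */
data _null_;
   n     = 31;               /* number of observations                    */
   s2    = 5.36825;          /* MSE of the full six-regressor model       */
   p     = 5;                /* four regressors plus the intercept        */
   sse_p = 5.34346 * (n-p);  /* SSE = MSE times error DF (26)             */
   cp    = sse_p/s2 - (n - 2*p);
   put cp=8.4;               /* written to the SAS log                    */
run;
```

The result is approximately 4.88, which matches the C(p) value reported for the four-variable model and is close to $p = 5$.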
The adjusted $R^2$ statistic is an alternative to $R^2$ that is adjusted for the number of parameters in the
model. The adjusted $R^2$ statistic is calculated as
$$\mathrm{ADJRSQ} = 1 - \frac{(n - i)(1 - R^2)}{n - p}$$
where n is the number of observations used in fitting the model, and i is an indicator variable that
is 1 if the model includes an intercept, and 0 otherwise.
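The following DATA step sketch applies this formula to the full fitness model, using the $R^2$ value from Figure 73.23 (0.84867) with $n = 31$ observations and $p = 7$ parameters. The variable names are hypothetical.

```sas
/* Sketch: adjusted R-square for the full six-regressor fitness model. */
data _null_;
   n   = 31;        /* observations used in the fit         */
   p   = 7;         /* six regressors plus the intercept    */
   i   = 1;         /* the model includes an intercept      */
   rsq = 0.84867;   /* R-square from Figure 73.23           */
   adjrsq = 1 - (n - i)*(1 - rsq)/(n - p);
   put adjrsq=8.4;  /* written to the SAS log               */
run;
```

The result is approximately 0.8108, in agreement with the Adj R-Sq value that PROC REG displays for this model.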
Limitations in Model-Selection Methods
The use of model-selection methods can be time-consuming in some cases because there is no
built-in limit on the number of independent variables, and the calculations for a large number of
independent variables can be lengthy. The recommended limit on the number of independent variables for the MINR method is $20 + i$, where $i$ is the value of the INCLUDE= option.
For the RSQUARE, ADJRSQ, or CP method, with a large value of the BEST= option, adding one
more variable to the list from which regressors are selected might significantly increase the CPU
time. Also, the time required for the analysis is highly dependent on the data and on the values of
the BEST=, START=, and STOP= options.
Parameter Estimates and Associated Statistics
The following example uses the fitness data from Example 73.2. Figure 73.30 shows the parameter
estimates and the tables from the SS1, SS2, STB, CLB, COVB, and CORRB options:
proc reg data=fitness;
model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse
/ ss1 ss2 stb clb covb corrb;
run;
The procedure first displays an analysis of variance table (Figure 73.29). The F statistic for the
overall model is significant, indicating that the model explains a significant portion of the variation
in the data.
Figure 73.29 ANOVA Table

The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

Analysis of Variance

                             Sum of          Mean
Source             DF       Squares        Square    F Value    Pr > F
Model               6     722.54361     120.42393      22.43    <.0001
Error              24     128.83794       5.36825
Corrected Total    30     851.38154

Root MSE            2.31695    R-Square    0.8487
Dependent Mean     47.37581    Adj R-Sq    0.8108
Coeff Var           4.89057
The procedure next displays parameter estimates and some associated statistics (Figure 73.30).
First, the estimates are shown, followed by their standard errors. The next two columns of the
table contain the t statistics and the corresponding probabilities for testing the null hypothesis that
the parameter is zero. These probabilities are usually referred to as p-values. For example, the
Intercept term in the model is estimated to be 102.9 and is significantly different from zero. The
next two columns of the table are the result of requesting the SS1 and SS2 options, and they show
sequential and partial sums of squares (SS) associated with each variable.
The standardized estimates (produced by the STB option) are the parameter estimates that result
when all variables are standardized to a mean of 0 and a variance of 1. These estimates are computed
by multiplying the original estimates by the standard deviation of the regressor (independent)
variable and then dividing by the standard deviation of the dependent variable. The CLB option
adds the upper and lower 95% confidence limits for the parameter estimates; the $\alpha$ level can be
changed by specifying the ALPHA= option in the PROC REG or MODEL statement.
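The standardized estimate can be reproduced from quantities already shown in this chapter. The following DATA step is an illustrative sketch that derives the sample standard deviations of RunTime and Oxygen from the sums and sums of squares in the SSCP data set of Figure 73.24 and applies them to the RunTime estimate of -2.62865. The variable names are hypothetical.

```sas
/* Sketch: standardized estimate for RunTime = b * sd(RunTime) / sd(Oxygen). */
data _null_;
   n     = 31;
   css_x = 3531.80  - 328.17**2  / n;   /* corrected SS of RunTime */
   css_y = 70429.86 - 1468.65**2 / n;   /* corrected SS of Oxygen  */
   sd_x  = sqrt(css_x / (n - 1));
   sd_y  = sqrt(css_y / (n - 1));
   b     = -2.62865;                    /* RunTime estimate        */
   stb   = b * sd_x / sd_y;
   put stb=8.4;                         /* written to the SAS log  */
run;
```

The result is roughly -0.6846, which agrees with the standardized estimate for RunTime in the parameter estimates table up to rounding of the inputs.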
Figure 73.30 SS1, SS2, STB, CLB, COVB, and CORRB Options: Parameter Estimates

Parameter Estimates

                   Parameter     Standard
Variable    DF      Estimate        Error    t Value    Pr > |t|     Type I SS
Intercept    1     102.93448     12.40326       8.30      <.0001         69578
RunTime      1      -2.62865      0.38456      -6.84      <.0001     632.90010
Age          1      -0.22697      0.09984      -2.27      0.0322      17.76563
Weight       1      -0.07418      0.05459      -1.36      0.1869       5.60522
RunPulse     1      -0.36963      0.11985      -3.08      0.0051      38.87574
MaxPulse     1       0.30322      0.13650       2.22      0.0360      26.82640
RestPulse    1      -0.02153      0.06605      -0.33      0.7473       0.57051

Parameter Estimates

                                Standardized
Variable    DF    Type II SS        Estimate      95% Confidence Limits
Intercept    1     369.72831               0      77.33541    128.53355
RunTime      1     250.82210        -0.68460      -3.42235     -1.83496
Age          1      27.74577        -0.22204      -0.43303     -0.02092
Weight       1       9.91059        -0.11597      -0.18685      0.03850
RunPulse     1      51.05806        -0.71133      -0.61699     -0.12226
MaxPulse     1      26.49142         0.52161       0.02150      0.58493
RestPulse    1       0.57051        -0.03080      -0.15786      0.11480
The final two tables are produced as a result of requesting the COVB and CORRB options
(Figure 73.31). These tables show the estimated covariance matrix of the parameter estimates,
and the estimated correlation matrix of the estimates.
Figure 73.31 SS1, SS2, STB, CLB, COVB, and CORRB Options: Covariances and Correlations

Covariance of Estimates

Variable        Intercept         RunTime             Age          Weight
Intercept    153.84081152    0.7678373769    -0.902049478    -0.178237818
RunTime      0.7678373769    0.1478880839    -0.014191688    -0.004417672
Age          -0.902049478    -0.014191688     0.009967521    0.0010219105
Weight       -0.178237818    -0.004417672    0.0010219105    0.0029804131
RunPulse      0.280796516    -0.009047784    -0.001203914    0.0009644683
MaxPulse     -0.832761667    0.0046249498    0.0035823843    -0.001372241
RestPulse    -0.147954715    -0.010915224    0.0014897532    0.0003799295

Covariance of Estimates

Variable         RunPulse        MaxPulse       RestPulse
Intercept     0.280796516    -0.832761667    -0.147954715
RunTime      -0.009047784    0.0046249498    -0.010915224
Age          -0.001203914    0.0035823843    0.0014897532
Weight       0.0009644683    -0.001372241    0.0003799295
RunPulse     0.0143647273    -0.014952457    -0.000764507
MaxPulse     -0.014952457    0.0186309364    0.0003425724
RestPulse    -0.000764507    0.0003425724    0.0043631674

Correlation of Estimates

Variable     Intercept    RunTime        Age     Weight
Intercept       1.0000     0.1610    -0.7285    -0.2632
RunTime         0.1610     1.0000    -0.3696    -0.2104
Age            -0.7285    -0.3696     1.0000     0.1875
Weight         -0.2632    -0.2104     0.1875     1.0000
RunPulse        0.1889    -0.1963    -0.1006     0.1474
MaxPulse       -0.4919     0.0881     0.2629    -0.1842
RestPulse      -0.1806    -0.4297     0.2259     0.1054

Correlation of Estimates

Variable     RunPulse    MaxPulse    RestPulse
Intercept      0.1889     -0.4919      -0.1806
RunTime       -0.1963      0.0881      -0.4297
Age           -0.1006      0.2629       0.2259
Weight         0.1474     -0.1842       0.1054
RunPulse       1.0000     -0.9140      -0.0966
MaxPulse      -0.9140      1.0000       0.0380
RestPulse     -0.0966      0.0380       1.0000
For further discussion of the parameters and statistics, see the section “Displayed Output” on
page 5578, and Chapter 4, “Introduction to Regression Procedures.”
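The two matrices in Figure 73.31 are linked by the usual rescaling: each standard error is the square root of a diagonal covariance entry, and each correlation is a covariance divided by the product of the two standard errors. As a worked check using the Intercept and RunTime entries (the notation SE and Corr is introduced here for illustration only):

```latex
\begin{align*}
\mathrm{SE}(b_{\mathrm{RunTime}}) &= \sqrt{0.1478880839} \approx 0.38456 \\
\mathrm{SE}(b_{\mathrm{Intercept}}) &= \sqrt{153.84081152} \approx 12.40326 \\
\mathrm{Corr}(b_{\mathrm{Intercept}}, b_{\mathrm{RunTime}})
  &= \frac{0.7678373769}{12.40326 \times 0.38456}
   \approx \frac{0.76784}{4.76980} \approx 0.1610
\end{align*}
```

These values agree with the Standard Error column in Figure 73.30 and with the corresponding entry of the correlation matrix.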
Predicted and Residual Values
The display of the predicted values and residuals is controlled by the P, R, CLM, and CLI options
in the MODEL statement. The P option causes PROC REG to display the observation number, the
ID value (if an ID statement is used), the actual value, the predicted value, and the residual. The R,
CLI, and CLM options also produce the items under the P option. Thus, P is unnecessary if you use
one of the other options.
The R option requests more detail, especially about the residuals. The standard errors of the mean
predicted value and the residual are displayed. The studentized residual, which is the residual divided by its standard error, is both displayed and plotted. A measure of influence, Cook’s D, is
displayed. Cook’s D measures the change to the estimates that results from deleting each observation (Cook 1977, 1979). This statistic is very similar to DFFITS.
The CLM option requests that PROC REG display the 100(1 - α)% lower and upper confidence
limits for the mean predicted values. This accounts for the variation due to estimating the parameters
only. If you want a 100(1 - α)% confidence interval for observed values, then you can use the
CLI option, which adds in the variability of the error term. The α level can be specified with the
ALPHA= option in the PROC REG or MODEL statement.
You can use these statistics in PLOT and PAINT statements. This is useful in performing a variety
of regression diagnostics. For definitions of the statistics produced by these options, see Chapter 4,
“Introduction to Regression Procedures.”
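As a minimal sketch of how these options combine, the following statements request 90% rather than 95% limits by setting ALPHA=0.10 in the PROC REG statement. The sketch assumes that the USPopulation data set from the section "Polynomial Regression" is available with the variables Year, YearSq, and Population; the value 0.10 is chosen only for illustration.

```sas
proc reg data=USPopulation alpha=0.10;
   id Year;
   /* CLM: limits for the mean predicted value          */
   /* CLI: limits for an individual observed value      */
   model Population=Year YearSq / clm cli;
run;
quit;
```

Because CLM and CLI also display the items produced by the P option, P is not specified here.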
The following statements use the U.S. population data found in the section “Polynomial Regression”
on page 5434. The results are shown in Figure 73.32 and Figure 73.33.
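data USPop2;
   /* future years to predict; Population is left missing */
   input Year @@;
   YearSq=Year*Year;
   datalines;
2010 2020 2030
;
run;

data USPop2;
   /* append the future years to the original data */
   set USPopulation USPop2;
run;

proc reg data=USPop2;
   id Year;
   model Population=Year YearSq / r cli clm;
run;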
Figure 73.32 Regression Using the R, CLI, and CLM Options

The REG Procedure
Model: MODEL1
Dependent Variable: Population

Analysis of Variance

                              Sum of         Mean
Source              DF       Squares       Square    F Value    Pr > F
Model                2        159529        79765    8864.19    <.0001
Error               19     170.97193      8.99852
Corrected Total     21        159700

Root MSE            2.99975    R-Square    0.9989
Dependent Mean     94.64800    Adj R-Sq    0.9988
Coeff Var           3.16938

Parameter Estimates

                   Parameter      Standard
Variable     DF     Estimate         Error    t Value    Pr > |t|
Intercept     1        21631     639.50181      33.82      <.0001
Year          1    -24.04581       0.67547     -35.60      <.0001
YearSq        1      0.00668    0.00017820      37.51      <.0001
Figure 73.33 Regression Using the R, CLI, and CLM Options

The REG Procedure
Model: MODEL1
Dependent Variable: Population

Output Statistics

             Dependent   Predicted      Std Error
Obs   Year    Variable       Value   Mean Predict        95% CL Mean          95% CL Predict
  1   1790      3.9290      6.2127         1.7565     2.5362     9.8892    -1.0631    13.4884
  2   1800      5.3080      5.7226         1.4560     2.6751     8.7701    -1.2565    12.7017
  3   1810      7.2390      6.5694         1.2118     4.0331     9.1057    -0.2021    13.3409
  4   1820      9.6380      8.7531         1.0305     6.5963    10.9100     2.1144    15.3918
  5   1830     12.8660     12.2737         0.9163    10.3558    14.1916     5.7087    18.8386
  6   1840     17.0690     17.1311         0.8650    15.3207    18.9415    10.5968    23.6655
  7   1850     23.1910     23.3254         0.8613    21.5227    25.1281    16.7932    29.8576
  8   1860     31.4430     30.8566         0.8846    29.0051    32.7080    24.3107    37.4024
  9   1870     39.8180     39.7246         0.9163    37.8067    41.6425    33.1597    46.2896
 10   1880     50.1550     49.9295         0.9436    47.9545    51.9046    43.3476    56.5114
 11   1890     62.9470     61.4713         0.9590    59.4641    63.4785    54.8797    68.0629
 12   1900     75.9940     74.3499         0.9590    72.3427    76.3571    67.7583    80.9415
 13   1910     91.9720     88.5655         0.9436    86.5904    90.5405    81.9836    95.1473
 14   1920    105.7100    104.1178         0.9163   102.2000   106.0357    97.5529   110.6828
 15   1930    122.7750    121.0071         0.8846   119.1556   122.8585   114.4612   127.5529
 16   1940    131.6690    139.2332         0.8613   137.4305   141.0359   132.7010   145.7654
 17   1950    151.3250    158.7962         0.8650   156.9858   160.6066   152.2618   165.3306
 18   1960    179.3230    179.6961         0.9163   177.7782   181.6139   173.1311   186.2610
 19   1970    203.2110    201.9328         1.0305   199.7759   204.0896   195.2941   208.5715
 20   1980    226.5420    225.5064         1.2118   222.9701   228.0427   218.7349   232.2779
 21   1990    248.7100    250.4168         1.4560   247.3693   253.4644   243.4378   257.3959
 22   2000    281.4220    276.6642         1.7565   272.9877   280.3407   269.3884   283.9400
 23   2010           .    304.2484         2.1073   299.8377   308.6591   296.5754   311.9214
 24   2020           .    333.1695         2.5040   327.9285   338.4104   324.9910   341.3479
 25   2030           .    363.4274         2.9435   357.2665   369.5883   354.6310   372.2238

Output Statistics

                         Std Error     Student                        Cook's
Obs   Year   Residual     Residual    Residual     -2-1 0 1 2              D
  1   1790    -2.2837        2.432      -0.939   |     *|      |       0.153
  2   1800    -0.4146        2.623      -0.158   |      |      |       0.003
  3   1810     0.6696        2.744       0.244   |      |      |       0.004
  4   1820     0.8849        2.817       0.314   |      |      |       0.004
  5   1830     0.5923        2.856       0.207   |      |      |       0.001
  6   1840    -0.0621        2.872     -0.0216   |      |      |       0.000
  7   1850    -0.1344        2.873     -0.0468   |      |      |       0.000
  8   1860     0.5864        2.866       0.205   |      |      |       0.001
  9   1870     0.0934        2.856      0.0327   |      |      |       0.000
 10   1880     0.2255        2.847      0.0792   |      |      |       0.000
 11   1890     1.4757        2.842       0.519   |      |*     |       0.010
 12   1900     1.6441        2.842       0.578   |      |*     |       0.013
 13   1910     3.4065        2.847       1.196   |      |**    |       0.052
 14   1920     1.5922        2.856       0.557   |      |*     |       0.011
 15   1930     1.7679        2.866       0.617   |      |*     |       0.012
 16   1940    -7.5642        2.873      -2.632   | *****|      |       0.208
 17   1950    -7.4712        2.872      -2.601   | *****|      |       0.205
 18   1960    -0.3731        2.856      -0.131   |      |      |       0.001
 19   1970     1.2782        2.817       0.454   |      |      |       0.009
 20   1980     1.0356        2.744       0.377   |      |      |       0.009
 21   1990    -1.7068        2.623      -0.651   |     *|      |       0.044
 22   2000     4.7578        2.432       1.957   |      |***   |       0.666
 23   2010          .            .           .                             .
 24   2020          .            .           .                             .
 25   2030          .            .           .                             .
After producing the usual analysis of variance and parameter estimates tables (Figure 73.32),
the procedure displays the results of requesting the options for predicted and residual values
(Figure 73.33). For each observation, the requested information is shown. Note that the ID variable is used to identify each observation. Also note that, for observations with missing dependent
variables, the predicted value, standard error of the predicted value, and confidence intervals for the
predicted value are still available.
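As a worked check of how the two kinds of limits differ, the first observation (Year 1790) can be reproduced by hand. This sketch assumes the usual unweighted formulas, with the mean-square error 8.99852 from Figure 73.32 and the 0.975 quantile of the t distribution with 19 degrees of freedom, approximately 2.093; the symbols are introduced here for illustration.

```latex
\begin{align*}
\text{CLM:}\quad & 6.2127 \pm 2.093 \times 1.7565
   \approx 6.2127 \pm 3.676 \;\Rightarrow\; (2.536,\ 9.889) \\
\text{CLI:}\quad & 6.2127 \pm 2.093 \times \sqrt{8.99852 + 1.7565^{2}}
   \approx 6.2127 \pm 7.276 \;\Rightarrow\; (-1.063,\ 13.488)
\end{align*}
```

Up to rounding, these match the 95% CL Mean and 95% CL Predict columns in Figure 73.33. The prediction limits are wider because they add the error variance to the variance of the estimated mean.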
The columnar print plot of studentized residuals and Cook’s D statistics are displayed as a result of
requesting the R option. In the plot of studentized residuals, the large number of observations with
absolute values greater than two indicates an inadequate model. You can use ODS Graphics to obtain high-resolution plots of studentized residuals by predicted values or leverage; see Example 73.1
for a similar example.
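A minimal sketch of such a request follows, assuming the USPop2 data set created earlier in this section. The PLOTS= keywords shown select plots of the externally studentized residuals (RSTUDENT), which are related to but not identical to the STUDENT residuals in the print plot.

```sas
ods graphics on;

proc reg data=USPop2
         plots(only)=(rstudentbypredicted rstudentbyleverage cooksd);
   id Year;
   model Population=Year YearSq;
run;
quit;

ods graphics off;
```

The ONLY global plot option suppresses the default diagnostics so that only the listed plots are produced.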
Line Printer Scatter Plot Features
This section discusses the special options available with line printer scatter plots. Detailed examples
of traditional graphics and options are given in the section “Traditional Graphics” on page 5541.
Producing Scatter Plots
The interactive PLOT statement available in PROC REG enables you to look at scatter plots of data
and diagnostic statistics. These plots can help you to evaluate the model and detect outliers in your
data. Several options enable you to place multiple plots on a single page, superimpose plots, and
collect plots to be overlaid by later plots. The PAINT statement can be used to highlight points on
a plot. See the section “Painting Scatter Plots” on page 5537 for more information about painting.
The Class data set introduced in the section “Simple Linear Regression” on page 5430 is used in the
following examples.
You can superimpose several plots with the OVERLAY option. With the following statements, a
plot of Weight against Height is overlaid with plots of the predicted values and the 95% prediction
intervals. The model on which the statistics are based is the full model including Height and Age.
These statements produce the plot in Figure 73.34:
proc reg data=Class lineprinter;
   model Weight=Height Age / noprint;
   plot (ucl. lcl. p.)*Height='-' Weight*Height
        / overlay symbol='o';
run;
Figure 73.34 Scatter Plot Showing Data, Predicted Values, and Confidence Limits
[Line printer plot. Vertical axis: Upper Bound of 95% C.I., 0 to 200. Horizontal axis: Height, 50 to 72. Plotting symbols: 'o', '-', and '?'.]
In this plot, the data values are marked with the symbol ’o’ and the predicted values and prediction
interval limits are labeled with the symbol ’-’. The plot is scaled to accommodate the points from
all plots. This is an important difference from the COLLECT option, which does not rescale plots
after the first plot or plots are collected. You could separate the overlaid plots by using the following
statements:
plot;
run;
This places each of the four plots on a separate page, while the statements
plot / overlay;
run;
repeat the previous overlaid plot. In general, the statement
plot;
is equivalent to respecifying the most recent PLOT statement without any options. However, the
COLLECT, HPLOTS=, SYMBOL=, and VPLOTS= options apply across PLOT statements and
remain in effect.
The next example shows how you can overlay plots of statistics before and after a change in the
model. For the full model involving Height and Age, the ordinary residuals and the studentized
residuals are plotted against the predicted values. The COLLECT option causes these plots to be
collected or retained for redisplay later. The option HPLOTS=2 enables the two plots to appear side
by side on one page. The symbol ’f’ is used on these plots to identify them as resulting from the
full model. These statements produce Figure 73.35:
plot r.*p. student.*p. / collect hplots=2 symbol='f';
run;
Figure 73.35 Collecting Residual Plots for the Full Model
[Two line printer plots side by side. Left panel: RESIDUAL (-20 to 40) against PRED (40 to 140). Right panel: STUDENT (-2 to 3) against PRED (40 to 140). Plotting symbol: 'f'.]
Note that these plots are not overlaid. The COLLECT option does not overlay the plots in one PLOT
statement but retains them so that they can be overlaid by later plots. When the COLLECT option
appears in a PLOT statement, the plots in that statement become the first plots in the collection.
Next, the model is reduced by deleting the Age variable. The PLOT statement requests the same
plots as before but labels the points with the symbol ’r’ denoting the reduced model. The following
statements produce Figure 73.36:
delete Age;
plot r.*p. student.*p. / symbol='r';
run;
Figure 73.36 Overlaid Residual Plots for Full and Reduced Models
[Two line printer plots side by side. Left panel: RESIDUAL (-20 to 40) against PRED (40 to 140). Right panel: STUDENT (-2 to 3) against PRED (40 to 140). Plotting symbols: 'f', 'r', and '?'.]
Notice that the COLLECT option causes the corresponding plots to be overlaid. Also notice that
the DELETE statement causes the model label to be changed from MODEL1 to MODEL1.1. The
points labeled ’f’ are from the full model, and the points labeled ’r’ are from the reduced model.
Positions labeled ’?’ contain at least one point from each model. In this example, the OVERLAY
option cannot be used because all of the plots to be overlaid cannot be specified in one PLOT
statement. With the COLLECT option, any changes to the model or the data used to fit the model
do not affect plots collected before the changes. Collected plots are always reproduced exactly as
they first appear. (Similarly, a PAINT statement does not affect plots collected before the PAINT
statement is issued.)
The previous example overlays the residual plots for two different models. You might prefer to see
them side by side on the same page. This can also be done with the COLLECT option by using
a blank plot. Continuing from the last example, the COLLECT, HPLOTS=2, and SYMBOL=’r’
options are still in effect. In the following PLOT statement, the CLEAR option deletes the collected
plots and enables the specified plot to begin a new collection. The plot created is the residual plot
for the reduced model. These statements produce Figure 73.37:
plot r.*p. / clear;
run;
Figure 73.37 Residual Plot for Reduced Model Only
[Line printer plot: RESIDUAL (-20 to 20) against PRED (40 to 140). Plotting symbol: 'r'.]
The next statements add the variable Age to the model and place the residual plot for the full
model next to the plot for the reduced model. Notice that a blank plot is created in the first plot
request by placing nothing between the quotes. Since the COLLECT option is in effect, this plot
is superimposed on the residual plot for the reduced model. The residual plot for the full model
is created by the second request. The result is the desired side-by-side plots. The NOCOLLECT
option turns off the collection process after the specified plots are added and displayed. Any PLOT
statements that follow show only the newly specified plots. These statements produce Figure 73.38:
add Age;
plot r.*p.='' r.*p.='f' / nocollect;
run;
Figure 73.38 Side-by-Side Residual Plots for the Full and Reduced Models
[Two line printer plots of RESIDUAL against PRED side by side: the reduced model (symbol 'r') and the full model (symbol 'f').]
Frequently, when the COLLECT option is in effect, you want the current and following PLOT
statements to show only the specified plots. To do this, use both the CLEAR and NOCOLLECT
options in the current PLOT statement.
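A minimal sketch of that combination, assuming the interactive session from the preceding examples is still active; the residual plot is chosen only for illustration:

```sas
/* CLEAR discards the collected plots; NOCOLLECT stops further collection */
plot r.*p. / clear nocollect;
run;
```

After these statements, later PLOT statements display only the plots that they specify.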
Painting Scatter Plots
Painting scatter plots is a useful interactive tool that enables you to mark points of interest in scatter
plots. Painting can be used to identify extreme points in scatter plots or to reveal the relationship between two scatter plots. The Class data (from the section “Simple Linear Regression” on
page 5430) is used to illustrate some of these applications.
The following statements produce the scatter plot of the studentized residuals against the predicted
values in Figure 73.39.
proc reg data=Class lineprinter;
model Weight=Age Height / noprint;
plot student.*p.;
run;
Figure 73.39 Plotting Studentized Residuals against Predicted Values
[Line printer plot. Vertical axis: STUDENT (Studentized Residual), -2 to 3. Horizontal axis: PRED (Predicted Value of Weight), 50 to 140. Each plotted digit gives the number of observations at that position.]
Then, the following statements identify the observation ’Henry’ in the scatter plot and produce the
plot in Figure 73.40:
paint Name='Henry' / symbol = 'H';
plot;
run;
Figure 73.40 Painting One Observation
[Line printer plot. Vertical axis: STUDENT (Studentized Residual), -2 to 3. Horizontal axis: PRED (Predicted Value of Weight), 50 to 140. The painted observation is marked 'H'.]
Next, the following statements identify observations with large absolute residuals:
paint student.>=2 or student.<=-2 / symbol='s';
plot;
run;
The log shows the observation numbers found with these conditions and gives the painting symbol
and the number of observations found. Note that the previous PAINT statement is also used in the
PLOT statement. Figure 73.41 shows the scatter plot produced by the preceding statements.
Figure 73.41 Painting Several Observations
[Line printer plot. Vertical axis: STUDENT (Studentized Residual), -2 to 3. Horizontal axis: PRED (Predicted Value of Weight), 50 to 140. Painted observations are marked 'H' and 's'.]
The following statements relate two different scatter plots. These statements produce the plot in
Figure 73.42.
paint student.>=1 / symbol='p';
paint student.<1 and student.>-1 / symbol='s';
paint student.<=-1 / symbol='n';
plot student. * p. cookd. * h. / hplots=2;
run;
Figure 73.42 Painting Observations on More Than One Plot
[Two line printer plots side by side. Left panel: STUDENT (-2 to 3) against PRED (40 to 140). Right panel: COOKD (0.0 to 0.8) against H (0.05 to 0.35). Painted symbols: 'p', 's', and 'n'.]
Traditional Graphics
This section provides examples of using options available with the traditional graphics that you
request with the PLOT statement. An alternative is to use ODS Graphics to obtain plots relevant
to the analysis. See Example 73.1, Example 73.2, and Example 73.5 for examples of obtaining
graphical displays with ODS Graphics. Examples in this section use the Fitness data set that is
described in Example 73.2.
Traditional Graphics for Simple Linear Regression
The following statements introduce the basic PLOT statement graphics syntax. A simple linear
regression of Oxygen on RunTime is performed, and a plot of Oxygen*RunTime is requested. The
fitted model, the regression line, and the four default statistics are also displayed in Figure 73.43.
proc reg data=fitness;
model Oxygen=RunTime;
plot Oxygen*RunTime / cframe=ligr;
run;
Figure 73.43 Simple Linear Regression
You can use shorthand commands to plot the dependent variable, the predicted value, and the 95%
confidence or prediction intervals against a regressor. The following statements use the CONF and
PRED options to create a plot with confidence and prediction intervals. Results are displayed in
Figure 73.44. Note that the statistics displayed by default in the margin are suppressed while three
other statistics are exhibited. Furthermore, global graphics LEGEND and SYMBOL statements and
PLOT statement options are used to control the appearance of the plot. For more information about
the global graphics statements, see SAS/GRAPH Software: Reference.
legend1 position=(bottom left inside)
across=1 cborder=red offset=(0,0)
shape=symbol(3,1) label=none
value=(height=.8);
title 'Confidence and Prediction Intervals';
symbol1 c=yellow v=- h=1;
symbol2 c=red;
symbol3 c=blue;
symbol4 c=blue;
proc reg data=fitness;
model Oxygen=RunTime / noprint;
plot Oxygen*RunTime / pred nostat mse aic bic
caxis=red ctext=blue cframe=ligr
legend=legend1 modellab=' ';
run;
Figure 73.44 Simple Linear Regression
Traditional Graphics for Variable Selection
When you use the SELECTION= option in the MODEL statement, you can produce plots showing
model fit summary statistics for the models examined. The following statements produce the plot
shown in Figure 73.45 of the Cp statistic for model selection plotted against the number of parameters in the model; the CHOCKING= and CMALLOWS= options draw useful reference lines.
goptions ctitle=black
htitle=3.5pct ftitle=swiss
ctext =magenta htext =3.0pct ftext =swiss
cback =ligr
border;
symbol1 v=circle c=red h=1 w=2;
title 'Cp Plot with Reference Lines';
symbol1 c=green;
proc reg data=fitness;
model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
/ selection=rsquare noprint;
plot cp.*np.
/ chocking=red cmallows=blue
vaxis=0 to 15 by 5;
run;
In the GOPTIONS statement,
BORDER     frames the entire display.
CBACK=     specifies the background color.
CTEXT=     selects the default color for the border and all text, including titles, footnotes, and notes.
CTITLE=    specifies the title, footnote, note, and border color.
HTEXT=     specifies the height for all text in the display.
HTITLE=    specifies the height for the first title line.
FTEXT=     selects the default font for all text, including titles, footnotes, notes, the model label and equation, the statistics, the axis labels, the tick values, and the legend.
FTITLE=    specifies the first title font.
For more information about the GOPTIONS statement and other global graphics statements, see
SAS/GRAPH Software: Reference.
Figure 73.45 Cp Plot with Reference Lines
Traditional Normal Quantile and Normal Probability Plots
The following statements create probability-probability plots and quantile-quantile plots of the
residuals (Figure 73.46 and Figure 73.47, respectively). An annotation data set is created to produce the reference line from (0,0) to (1,1) for the P-P plot. Note that the NOSTAT option for the P-P plot
suppresses the statistics that would be displayed in the margin.
data annote1;
   length function color $8;
   retain ysys xsys '2' color 'black';
   function='move';
   x=0;
   y=0;
   output;
   function='draw';
   x=1;
   y=1;
   output;
run;

symbol1 c=blue;
proc reg data=fitness;
   title 'PP Plot';
   model Oxygen=RunTime / noprint;
   plot npp.*r.
        / annotate=annote1 nostat cframe=ligr
          modellab="'Best' Two-Parameter Model:";
run;
   title 'QQ Plot';
   plot r.*nqq.
        / noline mse cframe=ligr
          modellab="'Best' Two-Parameter Model:";
run;
Figure 73.46 Normal Probability-Probability Plot for the Residuals
Figure 73.47 Normal Quantile-Quantile Plot for the Residuals
Models of Less Than Full Rank
If the model is not full rank, there are an infinite number of least squares solutions for the estimates.
PROC REG chooses a nonzero solution for all variables that are linearly independent of previous
variables and a zero solution for other variables. This solution corresponds to using a generalized
inverse in the normal equations, and the expected values of the estimates are the Hermite normal
form of X multiplied by the true parameters:
$$E(\mathbf{b}) = (\mathbf{X}'\mathbf{X})^{-}(\mathbf{X}'\mathbf{X})\boldsymbol{\beta}$$
Degrees of freedom for the zeroed estimates are reported as zero. The hypotheses that are not
testable have t tests reported as missing. The message that the model is not full rank includes a
display of the relations that exist in the matrix.
The following statements use the fitness data from Example 73.2. The variable
Dif=RunPulse-RestPulse is created. When this variable is included in the model along with RunPulse and RestPulse, there is a linear dependency (or exact collinearity) between the independent
variables. Figure 73.48 shows how this problem is diagnosed.
data fit2;
set fitness; Dif=RunPulse-RestPulse;
proc reg data=fit2;
model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse Dif;
run;
Figure 73.48 Model That Is Not Full Rank: REG Procedure

The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

Analysis of Variance

                              Sum of         Mean
Source              DF       Squares       Square    F Value    Pr > F
Model                6     722.54361    120.42393      22.43    <.0001
Error               24     128.83794      5.36825
Corrected Total     30     851.38154

Root MSE            2.31695    R-Square    0.8487
Dependent Mean     47.37581    Adj R-Sq    0.8108
Coeff Var           4.89057

Parameter Estimates

                   Parameter     Standard
Variable     DF     Estimate        Error    t Value    Pr > |t|
Intercept     1    102.93448     12.40326       8.30      <.0001
RunTime       1     -2.62865      0.38456      -6.84      <.0001
Age           1     -0.22697      0.09984      -2.27      0.0322
Weight        1     -0.07418      0.05459      -1.36      0.1869
RunPulse      B     -0.36963      0.11985      -3.08      0.0051
MaxPulse      1      0.30322      0.13650       2.22      0.0360
RestPulse     B     -0.02153      0.06605      -0.33      0.7473
Dif           0            0            .          .           .
PROC REG produces a message informing you that the model is less than full rank. Parameters
with DF=0 are not estimated, and parameters with DF=B are biased. In addition, the form of the
linear dependency among the regressors is displayed.
Collinearity Diagnostics
When a regressor is nearly a linear combination of other regressors in the model, the affected estimates are unstable and have high standard errors. This problem is called collinearity or multicollinearity. It is a good idea to find out which variables are nearly collinear with which other
variables. The approach in PROC REG follows that of Belsley, Kuh, and Welsch (1980). PROC
REG provides several methods for detecting collinearity with the COLLIN, COLLINOINT, TOL,
and VIF options.
The COLLIN option in the MODEL statement requests that a collinearity analysis be performed.
First, X'X is scaled to have 1s on the diagonal. If you specify the COLLINOINT option, the
intercept variable is adjusted out first. Then the eigenvalues and eigenvectors are extracted. The
analysis in PROC REG is reported with eigenvalues of X'X rather than singular values of X. The
eigenvalues of X'X are the squares of the singular values of X.
The condition indices are the square roots of the ratio of the largest eigenvalue to each individual
eigenvalue. The largest condition index is the condition number of the scaled X matrix. Belsley,
Kuh, and Welsch (1980) suggest that, when this number is around 10, weak dependencies might be
starting to affect the regression estimates. When this number is larger than 100, the estimates might
have a fair amount of numerical error (although the statistical standard error almost always is much
greater than the numerical error).
For each variable, PROC REG produces the proportion of the variance of the estimate accounted
for by each principal component. A collinearity problem occurs when a component associated
with a high condition index contributes strongly (variance proportion greater than about 0.5) to the
variance of two or more variables.
The VIF option in the MODEL statement provides the variance inflation factors (VIF). These factors
measure the inflation in the variances of the parameter estimates due to collinearities that exist
among the regressor (independent) variables. There are no formal criteria for deciding if a VIF is
large enough to affect the predicted values.
The TOL option requests the tolerance values for the parameter estimates. The tolerance is defined
as 1/VIF.
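As a worked check of these definitions, the following arithmetic uses values reported for the fitness data in Figure 73.49: the largest and smallest eigenvalues (6.94991 and 0.00017947) and the variance inflation factor for RunPulse (8.43727). The labels on the left are informal and introduced only for this illustration.

```latex
\begin{align*}
\text{largest condition index} &= \sqrt{\frac{6.94991}{0.00017947}}
   \approx \sqrt{38724.6} \approx 196.79 \\
\text{tolerance (RunPulse)} &= \frac{1}{\mathrm{VIF}} = \frac{1}{8.43727} \approx 0.11852
\end{align*}
```

Both results agree with the displayed values (196.78560 and 0.11852) up to rounding of the printed eigenvalues.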
For a complete discussion of the preceding methods, refer to Belsley, Kuh, and Welsch (1980).
For a more detailed explanation of using the methods with PROC REG, refer to Freund and Littell
(1986).
This example uses the COLLIN option on the fitness data found in Example 73.2. The following
statements produce Figure 73.49.
proc reg data=fitness;
model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse
/ tol vif collin;
run;
Figure 73.49 Regression Using the TOL, VIF, and COLLIN Options

The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

Analysis of Variance

                              Sum of         Mean
Source              DF       Squares       Square    F Value    Pr > F
Model                6     722.54361    120.42393      22.43    <.0001
Error               24     128.83794      5.36825
Corrected Total     30     851.38154

Root MSE            2.31695    R-Square    0.8487
Dependent Mean     47.37581    Adj R-Sq    0.8108
Coeff Var           4.89057

Parameter Estimates

                   Parameter     Standard                                          Variance
Variable     DF     Estimate        Error    t Value    Pr > |t|    Tolerance     Inflation
Intercept     1    102.93448     12.40326       8.30      <.0001            .             0
RunTime       1     -2.62865      0.38456      -6.84      <.0001      0.62859       1.59087
Age           1     -0.22697      0.09984      -2.27      0.0322      0.66101       1.51284
Weight        1     -0.07418      0.05459      -1.36      0.1869      0.86555       1.15533
RunPulse      1     -0.36963      0.11985      -3.08      0.0051      0.11852       8.43727
MaxPulse      1      0.30322      0.13650       2.22      0.0360      0.11437       8.74385
RestPulse     1     -0.02153      0.06605      -0.33      0.7473      0.70642       1.41559

Collinearity Diagnostics

                         Condition    ---------Proportion of Variation---------
Number    Eigenvalue         Index     Intercept        RunTime            Age
     1       6.94991       1.00000    0.00002326     0.00021086     0.00015451
     2       0.01868      19.29087       0.00218        0.02522        0.14632
     3       0.01503      21.50072    0.00061541        0.12858        0.15013
     4       0.00911      27.62115       0.00638        0.60897        0.03186
     5       0.00607      33.82918       0.00133        0.12501        0.11284
     6       0.00102      82.63757       0.79966        0.09746        0.49660
     7    0.00017947     196.78560       0.18981        0.01455        0.06210

Collinearity Diagnostics

          ------------------Proportion of Variation------------------
Number        Weight       RunPulse       MaxPulse      RestPulse
     1    0.00019651     0.00000862     0.00000634     0.00027850
     2       0.01042     0.00000244     0.00000743        0.39064
     3       0.23571        0.00119        0.00125        0.02809
     4       0.18313        0.00149        0.00123        0.19030
     5       0.44442        0.01506        0.00833        0.36475
     6       0.10330        0.06948        0.00561        0.02026
     7       0.02283        0.91277        0.98357        0.00568
Model Fit and Diagnostic Statistics
This section gathers the formulas for the statistics available in the MODEL, PLOT, and OUTPUT
statements. The model to be fit is $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$, and the parameter estimate is denoted by
$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{Y}$. The subscript $i$ denotes values for the $i$th observation, the parenthetical
subscript $(i)$ means that the statistic is computed by using all observations except the $i$th
observation, and the subscript $jj$ indicates the $j$th diagonal matrix entry. The ALPHA= option in
the PROC REG or MODEL statement is used to set the $\alpha$ value for the $t$ statistics.
Table 73.8 contains the summary statistics for assessing the fit of the model.
Table 73.8 Formulas and Definitions for Model Fit Summary Statistics

MODEL Option
or Statistic        Definition or Formula
$n$                 the number of observations
$p$                 the number of parameters including the intercept
$i$                 1 if there is an intercept, 0 otherwise
$\hat{\sigma}^2$    the estimate of pure error variance from the SIGMA= option or from
                    fitting the full model
$SST_0$             the uncorrected total sum of squares for the dependent variable
$SST_1$             the total sum of squares corrected for the mean for the dependent
                    variable
$SSE$               the error sum of squares
$MSE$               $\dfrac{SSE}{n-p}$
$R^2$               $1 - \dfrac{SSE}{SST_i}$
ADJRSQ              $1 - \dfrac{(n-i)(1-R^2)}{n-p}$
AIC                 $n \ln\!\left(\dfrac{SSE}{n}\right) + 2p$
BIC                 $n \ln\!\left(\dfrac{SSE}{n}\right) + 2(p+2)q - 2q^2$ where $q = \dfrac{n\hat{\sigma}^2}{SSE}$
CP ($C_p$)          $\dfrac{SSE}{\hat{\sigma}^2} + 2p - n$
GMSEP               $\dfrac{MSE\,(n+1)(n-2)}{n(n-p-1)} = \dfrac{1}{n} S_p (n+1)(n-2)$
JP ($J_p$)          $\dfrac{n+p}{n}\, MSE$
PC                  $\dfrac{n+p}{n-p}\,(1-R^2) = J_p \dfrac{n}{SST_i}$
PRESS               the sum of squares of $\mathrm{predr}_i$ (see Table 73.9)
RMSE                $\sqrt{MSE}$
SBC                 $n \ln\!\left(\dfrac{SSE}{n}\right) + p \ln(n)$
SP ($S_p$)          $\dfrac{MSE}{n-p-1}$
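As a worked illustration of several of these formulas, the following DATA step sketch recomputes some fit statistics from the analysis of variance values reported in Figure 73.49 (SSE = 128.83794, corrected total sum of squares = 851.38154, 31 observations, 7 parameters including the intercept). The variable names are arbitrary choices for this sketch.

```sas
/* Sketch: model fit statistics recomputed from the ANOVA values in  */
/* Figure 73.49. Variable names are illustrative only.               */
data _null_;
   n=31;                 /* corrected total DF (30) plus 1           */
   p=7;                  /* six regressors plus the intercept        */
   i=1;                  /* the model contains an intercept          */
   sse=128.83794;        /* error sum of squares                     */
   sst1=851.38154;       /* corrected total sum of squares           */

   mse=sse/(n-p);
   rmse=sqrt(mse);
   rsq=1-sse/sst1;
   adjrsq=1-(n-i)*(1-rsq)/(n-p);
   aic=n*log(sse/n)+2*p;         /* LOG is the natural logarithm     */
   sbc=n*log(sse/n)+p*log(n);
   jp=(n+p)/n*mse;
   sp=mse/(n-p-1);

   put mse= rmse= rsq= adjrsq= aic= sbc= jp= sp=;
run;
```

The values written to the log for MSE, RMSE, R-square, and adjusted R-square should agree, up to rounding, with the Mean Square for Error, Root MSE, R-Square, and Adj R-Sq entries in Figure 73.49.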
Table 73.9 contains the diagnostic statistics and their formulas; these formulas and further information can be found in Chapter 4, “Introduction to Regression Procedures,” and in the section
“Influence Statistics” on page 5553. Each statistic is computed for each observation.
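Table 73.9 Formulas and Definitions for Diagnostic Statistics

MODEL Option
or Statistic                 Formula
PRED ($\hat{Y}_i$)           $\mathbf{X}_i \mathbf{b}$
RES ($r_i$)                  $Y_i - \hat{Y}_i$
H ($h_i$)                    $\mathbf{x}_i (\mathbf{X}'\mathbf{X})^{-} \mathbf{x}_i'$
STDP                         $\sqrt{h_i \hat{\sigma}^2}$
STDI                         $\sqrt{(1 + h_i)\hat{\sigma}^2}$
STDR                         $\sqrt{(1 - h_i)\hat{\sigma}^2}$
LCL                          $\hat{Y}_i - t_{\alpha/2}\,\mathrm{STDI}$
LCLM                         $\hat{Y}_i - t_{\alpha/2}\,\mathrm{STDP}$
UCL                          $\hat{Y}_i + t_{\alpha/2}\,\mathrm{STDI}$
UCLM                         $\hat{Y}_i + t_{\alpha/2}\,\mathrm{STDP}$
STUDENT                      $\dfrac{r_i}{\mathrm{STDR}_i}$
RSTUDENT                     $\dfrac{r_i}{\hat{\sigma}_{(i)}\sqrt{1 - h_i}}$
COOKD                        $\dfrac{1}{p}\,\mathrm{STUDENT}^2\,\dfrac{\mathrm{STDP}^2}{\mathrm{STDR}^2}$
COVRATIO                     $\dfrac{\det\!\left(\hat{\sigma}_{(i)}^2 (\mathbf{X}_{(i)}'\mathbf{X}_{(i)})^{-1}\right)}{\det\!\left(\hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}\right)}$
DFFITS                       $\dfrac{\hat{Y}_i - \hat{Y}_{(i)}}{\hat{\sigma}_{(i)}\sqrt{h_i}}$
DFBETAS$_j$                  $\dfrac{b_j - b_{(i)j}}{\hat{\sigma}_{(i)}\sqrt{(\mathbf{X}'\mathbf{X})_{jj}}}$
PRESS ($\mathrm{predr}_i$)   $\dfrac{r_i}{1 - h_i}$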
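Most of the statistics in Table 73.9 can be saved to a data set with the OUTPUT statement, using the keywords from the first column. The following is a minimal sketch, assuming the fitness data set from Example 73.2; the output data set name Diag and the new variable names are hypothetical.

```sas
/* Sketch: saving per-observation diagnostics from Table 73.9.       */
/* Assumes the fitness data of Example 73.2. The data set Diag and   */
/* the variable names to the right of each "=" are illustrative.     */
proc reg data=fitness;
   model Oxygen=RunTime Age Weight;
   output out=Diag
          p=Pred r=Resid h=Leverage
          stdp=StdP stdi=StdI stdr=StdR
          lcl=LowerInd ucl=UpperInd lclm=LowerMean uclm=UpperMean
          student=Student rstudent=RStudent cookd=CooksD
          covratio=CovRatio dffits=DFFits press=PressResid;
run;
quit;

/* Inspect the first few observations of the saved statistics */
proc print data=Diag(obs=5);
run;
```

Each keyword produces one new variable per dependent variable, so with a single dependent variable one name per keyword is sufficient.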
Influence Statistics
This section discusses the INFLUENCE option, which produces several influence statistics, and the
PARTIAL option, which produces partial regression leverage plots.
The INFLUENCE Option
The INFLUENCE option (in the MODEL statement) requests the statistics proposed by Belsley,
Kuh, and Welsch (1980) to measure the influence of each observation on the estimates. Influential
observations are those that, according to various criteria, appear to have a large influence on the
parameter estimates.
Let $\mathbf{b}_{(i)}$ be the parameter estimates after deleting the $i$th observation; let $s_{(i)}^2$ be the variance
estimate after deleting the $i$th observation; let $\mathbf{X}_{(i)}$ be the $\mathbf{X}$ matrix without the $i$th observation;
let $\hat{y}_{(i)}$ be the $i$th value predicted without using the $i$th observation; let $r_i = y_i - \hat{y}_i$ be the $i$th
residual; and let $h_i$ be the $i$th diagonal of the projection matrix for the predictor space, also called
the hat matrix:

$$h_i = \mathbf{x}_i (\mathbf{X}'\mathbf{X})^{-1} \mathbf{x}_i'$$

Belsley, Kuh, and Welsch (1980) propose a cutoff of $2p/n$, where $n$ is the number of observations
used to fit the model and $p$ is the number of parameters in the model. Observations with $h_i$ values
above this cutoff should be investigated.
For each observation, PROC REG first displays the residual, the studentized residual (RSTUDENT),
and the $h_i$. The studentized residual RSTUDENT differs slightly from STUDENT since the error
variance is estimated by $s_{(i)}^2$ without the $i$th observation, not by $s^2$. For example,

$$\mathrm{RSTUDENT} = \frac{r_i}{s_{(i)}\sqrt{1 - h_i}}$$

Observations with RSTUDENT larger than 2 in absolute value might need some attention.
The COVRATIO statistic measures the change in the determinant of the covariance matrix of the
estimates by deleting the $i$th observation:

$$\mathrm{COVRATIO} = \frac{\det\!\left(s_{(i)}^2 (\mathbf{X}_{(i)}'\mathbf{X}_{(i)})^{-1}\right)}{\det\!\left(s^2 (\mathbf{X}'\mathbf{X})^{-1}\right)}$$

Belsley, Kuh, and Welsch (1980) suggest that observations with

$$|\mathrm{COVRATIO} - 1| \ge \frac{3p}{n}$$

where $p$ is the number of parameters in the model and $n$ is the number of observations used to fit
the model, are worth investigation.
The DFFITS statistic is a scaled measure of the change in the predicted value for the $i$th observation
and is calculated by deleting the $i$th observation. A large value indicates that the observation is very
influential in its neighborhood of the X space.

$$\mathrm{DFFITS} = \frac{\hat{y}_i - \hat{y}_{(i)}}{s_{(i)}\sqrt{h_i}}$$

Large values of DFFITS indicate influential observations. A general cutoff to consider is 2; a
size-adjusted cutoff recommended by Belsley, Kuh, and Welsch (1980) is $2\sqrt{p/n}$, where $n$ and $p$ are as
defined previously.
The DFFITS statistic is very similar to Cook’s D, defined in the section “Predicted and Residual
Values” on page 5525.
The DFBETAS statistics are the scaled measures of the change in each parameter estimate and are
calculated by deleting the $i$th observation:

$$\mathrm{DFBETAS}_j = \frac{b_j - b_{(i)j}}{s_{(i)}\sqrt{(\mathbf{X}'\mathbf{X})_{jj}}}$$

where $(\mathbf{X}'\mathbf{X})_{jj}$ is the $(j,j)$th element of $(\mathbf{X}'\mathbf{X})^{-1}$.

In general, large values of DFBETAS indicate observations that are influential in estimating a given
parameter. Belsley, Kuh, and Welsch (1980) recommend 2 as a general cutoff value to indicate
influential observations and $2/\sqrt{n}$ as a size-adjusted cutoff.
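The cutoffs described in this section can also be applied programmatically. The following sketch saves several influence statistics with the OUTPUT statement and flags observations that exceed the corresponding cutoffs. It assumes the USPopulation data set from the section "Polynomial Regression" with 22 observations and a three-parameter model; the values of n and p are hard-coded under that assumption, and the names Infl, Flagged, and the Flag variables are hypothetical.

```sas
/* Sketch: flag observations that exceed the influence cutoffs.       */
/* Assumes USPopulation (22 observations) and a model with three      */
/* parameters; adjust n and p for other data. Names are illustrative. */
proc reg data=USPopulation noprint;
   model Population=Year YearSq;
   output out=Infl h=Lev rstudent=RStud covratio=CovR dffits=DFF;
run;
quit;

data Flagged;
   set Infl;
   n=22;                                    /* assumed number of observations */
   p=3;                                     /* intercept, Year, YearSq        */
   FlagLev  =(Lev > 2*p/n);                 /* hat diagonal cutoff            */
   FlagRStud=(abs(RStud) > 2);              /* RSTUDENT cutoff                */
   FlagCovR =(abs(CovR-1) >= 3*p/n);        /* COVRATIO cutoff                */
   FlagDFF  =(abs(DFF) > 2*sqrt(p/n));      /* size-adjusted DFFITS cutoff    */
   AnyFlag  =max(FlagLev, FlagRStud, FlagCovR, FlagDFF);
run;

proc print data=Flagged;
   where AnyFlag=1;
   var Year Lev RStud CovR DFF FlagLev FlagRStud FlagCovR FlagDFF;
run;
```

The DFBETAS statistics are not included in this sketch; they are displayed by the INFLUENCE option in the MODEL statement.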
The following statements use the population example in the section “Polynomial Regression” on
page 5434. See Figure 73.32 for the fitted regression equation. The INFLUENCE option produces
the tables shown in Figure 73.50 and Figure 73.51.
proc reg data=USPopulation;
model Population=Year YearSq / influence;
run;
Figure 73.50 Regression Using the INFLUENCE Option

The REG Procedure
Model: MODEL1
Dependent Variable: Population

Output Statistics

                               Hat Diag        Cov
Obs    Residual    RStudent           H      Ratio      DFFITS
  1     -2.2837     -0.9361      0.3429     1.5519     -0.6762
  2     -0.4146     -0.1540      0.2356     1.5325     -0.0855
  3      0.6696      0.2379      0.1632     1.3923      0.1050
  4      0.8849      0.3065      0.1180     1.3128      0.1121
  5      0.5923      0.2021      0.0933     1.2883      0.0648
  6     -0.0621     -0.0210      0.0831     1.2827     -0.0063
  7     -0.1344     -0.0455      0.0824     1.2813     -0.0136
  8      0.5864      0.1994      0.0870     1.2796      0.0615
  9      0.0934      0.0318      0.0933     1.2969      0.0102
 10      0.2255      0.0771      0.0990     1.3040      0.0255
 11      1.4757      0.5090      0.1022     1.2550      0.1717
 12      1.6441      0.5680      0.1022     1.2420      0.1916
 13      3.4065      1.2109      0.0990     1.0320      0.4013
 14      1.5922      0.5470      0.0933     1.2345      0.1755
 15      1.7679      0.6064      0.0870     1.2123      0.1871
 16     -7.5642     -3.2147      0.0824     0.3286     -0.9636
 17     -7.4712     -3.1550      0.0831     0.3425     -0.9501
 18     -0.3731     -0.1272      0.0933     1.2936     -0.0408
 19      1.2782      0.4440      0.1180     1.2906      0.1624
 20      1.0356      0.3687      0.1632     1.3741      0.1628
 21     -1.7068     -0.6406      0.2356     1.4380     -0.3557
 22      4.7578      2.1312      0.3429     0.9113      1.5395

Output Statistics

       -------------DFBETAS-------------
Obs    Intercept        Year      YearSq
  1      -0.4924      0.4862     -0.4802
  2      -0.0540      0.0531     -0.0523
  3       0.0517     -0.0505      0.0494
  4       0.0335     -0.0322      0.0310
  5       0.0040     -0.0032      0.0025
  6       0.0012     -0.0012      0.0013
  7       0.0054     -0.0055      0.0056
  8      -0.0339      0.0343     -0.0347
  9      -0.0067      0.0067     -0.0068
 10      -0.0182      0.0183     -0.0183
 11      -0.1272      0.1275     -0.1276
 12      -0.1426      0.1426     -0.1424
 13      -0.2895      0.2889     -0.2880
 14      -0.1173      0.1167     -0.1160
 15      -0.1076      0.1067     -0.1056
 16       0.4130     -0.4063      0.3987
 17       0.2131     -0.2048      0.1957
 18      -0.0007      0.0012     -0.0016
 19       0.0415     -0.0432      0.0449
 20       0.0732     -0.0749      0.0766
 21      -0.2107      0.2141     -0.2176
 22       1.0656     -1.0793      1.0933
Figure 73.51 Residual Statistics

Sum of Residuals                  -4.7569E-11
Sum of Squared Residuals            170.97193
Predicted Residual SS (PRESS)       237.71229
In Figure 73.50, observations 16, 17, and 22 exceed the cutoff value of 2 for RSTUDENT. None of
the observations exceeds the general cutoff of 2 for DFFITS or the DFBETAS, but observations 16,
17, and 22 exceed at least one of the size-adjusted cutoffs for these statistics. Observations 1 and
22 exceed the cutoff for the hat diagonals, and observations 1, 2, 16, 17, and 21 exceed the cutoffs
for COVRATIO. Taken together, these statistics indicate that you should look first at observations
16, 17, and 22 and then perhaps investigate the other observations that exceeded a cutoff.
When you enable ODS Graphics, you can request influence diagnostic plots by using the PLOTS=
option in the PROC REG statement as shown in the following statements:
ods graphics on;
proc reg data=USPopulation
plots(label)=(CooksD RStudentByLeverage DFFITS DFBETAS);
id Year;
model Population=Year YearSq;
run;
ods graphics off;
The LABEL suboption specified in the PLOTS(LABEL)= option requests that observations that
exceed the relevant cutoffs for the statistics being plotted are labeled. Since Year has been named
in an ID statement, the value of Year is used for the labels. The requested plots are shown in
Figure 73.52.
Figure 73.52 Influence Diagnostics
The PARTIAL and PARTIALDATA Options
The PARTIAL option in the MODEL statement produces partial regression leverage plots. If ODS
Graphics is not in effect, this option requires the use of the LINEPRINTER option in the PROC
REG statement. One plot is created for each regressor in the current full model. For example,
plots are produced for regressors included by using ADD statements; plots are not produced for
interim models in the various model-selection methods but only for the full model. If you use a
model-selection method and the final model contains only a subset of the original regressors, the
PARTIAL option still produces plots for all regressors in the full model. If ODS Graphics is in
effect, these plots are produced as high-resolution graphics, in panels with a maximum of six partial
regression leverage plots per panel. Multiple panels are displayed for models with more than six
regressors.
For a given regressor, the partial regression leverage plot is the plot of the dependent variable and
the regressor after they have been made orthogonal to the other regressors in the model. These can
be obtained by plotting the residuals for the dependent variable against the residuals for the selected
regressor, where the residuals for the dependent variable are calculated with the selected regressor
omitted, and the residuals for the selected regressor are calculated from a model where the selected
regressor is regressed on the remaining regressors. A line fit to the points has a slope equal to the
parameter estimate in the full model.
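The construction described above can be reproduced by hand. The following sketch builds the partial regression data for RunTime in the model Oxygen=RunTime Weight Age: both Oxygen and RunTime are regressed on the remaining regressors, and the two sets of residuals are then regressed on each other. It assumes the fitness data set from Example 73.2; the data set name Partial and the residual variable names are hypothetical.

```sas
/* Sketch: partial regression data for RunTime, built by hand.       */
/* Assumes the fitness data of Example 73.2. Partial, ResOxygen, and */
/* ResRunTime are illustrative names.                                */

/* Step 1: remove the effect of the other regressors from both the   */
/* dependent variable and the selected regressor.                    */
proc reg data=fitness noprint;
   model Oxygen RunTime=Weight Age;
   output out=Partial r=ResOxygen ResRunTime;
run;
quit;

/* Step 2: the slope of this fit should equal the RunTime estimate   */
/* in the full model Oxygen=RunTime Weight Age.                      */
proc reg data=Partial;
   model ResOxygen=ResRunTime;
run;
quit;
```

Only the slope is expected to match the full model; the standard error reported in the second step uses different error degrees of freedom and therefore generally differs.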
When ODS Graphics is not in effect, points in the plot are marked by the number of replicates
appearing at one position. The symbol ’*’ is used if there are 10 or more replicates. If an ID
statement is specified, the leftmost nonblank character in the value of the ID variable is used as the
plotting symbol.
The PARTIALDATA option in the MODEL statement produces a table that contains the partial
regression data that are displayed in the partial regression leverage plots. You can request partial
regression data even if you do not request plots with the PARTIAL option.
The following statements use the fitness data in Example 73.2 with the PARTIAL option and the
ODS GRAPHICS statement to produce the partial regression leverage plots. The plots are shown
in Figure 73.53.
ods graphics on;
proc reg data=fitness;
model Oxygen=RunTime Weight Age / partial;
run;
ods graphics off;
Figure 73.53 Partial Regression Leverage Plots
Reweighting Observations in an Analysis
Reweighting observations is an interactive feature of PROC REG that enables you to change the
weights of observations used in computing the regression equation. Observations can also be deleted
from the analysis (not from the data set) by changing their weights to zero. In the following statements, the Class data (in the section “Getting Started: REG Procedure” on page 5430) are used to
illustrate some of the features of the REWEIGHT statement. First, the full model is fit, and the
residuals are displayed in Figure 73.54.
proc reg data=Class;
model Weight=Age Height / p;
id Name;
run;
Figure 73.54 Full Model for Class Data, Residuals Shown

The REG Procedure
Model: MODEL1
Dependent Variable: Weight

Output Statistics

                    Dependent    Predicted
Obs    Name          Variable        Value     Residual
  1    Alfred        112.5000     124.8686     -12.3686
  2    Alice          84.0000      78.6273       5.3727
  3    Barbara        98.0000     110.2812     -12.2812
  4    Carol         102.5000     102.5670      -0.0670
  5    Henry         102.5000     105.0849      -2.5849
  6    James          83.0000      80.2266       2.7734
  7    Jane           84.5000      89.2191      -4.7191
  8    Janet         112.5000     102.7663       9.7337
  9    Jeffrey        84.0000     100.2095     -16.2095
 10    John           99.5000      86.3415      13.1585
 11    Joyce          50.5000      57.3660      -6.8660
 12    Judy           90.0000     107.9625     -17.9625
 13    Louise         77.0000      76.6295       0.3705
 14    Mary          112.0000     117.1544      -5.1544
 15    Philip        150.0000     138.2164      11.7836
 16    Robert        128.0000     107.2043      20.7957
 17    Ronald        133.0000     118.9529      14.0471
 18    Thomas         85.0000      79.6676       5.3324
 19    William       112.0000     117.1544      -5.1544

Sum of Residuals                            0
Sum of Squared Residuals           2120.09974
Predicted Residual SS (PRESS)      3272.72186
Upon examining the data and residuals, you realize that observation 17 (Ronald) was mistakenly
included in the analysis. Also, you would like to examine the effect of reweighting to 0.5 those
observations with residuals that have absolute values greater than or equal to 17. The following
statements show how you request this reweighting:
reweight obs.=17;
reweight r. le -17 or r. ge 17 / weight=0.5;
print p;
run;
At this point, a message appears (in the log) that tells you which observations have been reweighted
and what the new weights are. Figure 73.55 is produced.
Figure 73.55 Model with Reweighted Observations

The REG Procedure
Model: MODEL1.2
Dependent Variable: Weight

Output Statistics

                       Weight    Dependent    Predicted
Obs    Name          Variable     Variable        Value     Residual
  1    Alfred          1.0000     112.5000     121.6250      -9.1250
  2    Alice           1.0000      84.0000      79.9296       4.0704
  3    Barbara         1.0000      98.0000     107.5484      -9.5484
  4    Carol           1.0000     102.5000     102.1663       0.3337
  5    Henry           1.0000     102.5000     104.3632      -1.8632
  6    James           1.0000      83.0000      79.9762       3.0238
  7    Jane            1.0000      84.5000      87.8225      -3.3225
  8    Janet           1.0000     112.5000     103.6889       8.8111
  9    Jeffrey         1.0000      84.0000      98.7606     -14.7606
 10    John            1.0000      99.5000      85.3117      14.1883
 11    Joyce           1.0000      50.5000      58.6811      -8.1811
 12    Judy            0.5000      90.0000     106.8740     -16.8740
 13    Louise          1.0000      77.0000      76.8377       0.1623
 14    Mary            1.0000     112.0000     116.2429      -4.2429
 15    Philip          1.0000     150.0000     135.9688      14.0312
 16    Robert          0.5000     128.0000     103.5150      24.4850
 17    Ronald               0     133.0000     117.8121      15.1879
 18    Thomas          1.0000      85.0000      78.1398       6.8602
 19    William         1.0000     112.0000     116.2429      -4.2429

Sum of Residuals                            0
Sum of Squared Residuals           1500.61194
Predicted Residual SS (PRESS)      2287.57621
The first REWEIGHT statement excludes observation 17, and the second REWEIGHT statement
reweights observations 12 and 16 to 0.5. An important feature to note from this example is that the
model is not refit until after the PRINT statement. REWEIGHT statements do not cause the model
to be refit. This is so that multiple REWEIGHT statements can be applied to a subsequent model.
In this example, since the intent is to reweight observations with large residuals, the observation that
was mistakenly included in the analysis should be deleted; then the model should be fit for those
remaining observations, and the observations with large residuals should be reweighted. To accomplish this, use the REFIT statement. Note that the model label has been changed from MODEL1 to
MODEL1.2 since two REWEIGHT statements have been used. The following statements produce
Figure 73.56:
reweight allobs / weight=1.0;
reweight obs.=17;
refit;
reweight r. le -17 or r. ge 17 / weight=.5;
print;
run;
Figure 73.56 Observations Excluded from Analysis, Model Refitted, and Observations Reweighted

The REG Procedure
Model: MODEL1.5
Dependent Variable: Weight

Output Statistics

                       Weight    Dependent    Predicted
Obs    Name          Variable     Variable        Value     Residual
  1    Alfred          1.0000     112.5000     120.9716      -8.4716
  2    Alice           1.0000      84.0000      79.5342       4.4658
  3    Barbara         1.0000      98.0000     107.0746      -9.0746
  4    Carol           1.0000     102.5000     101.5681       0.9319
  5    Henry           1.0000     102.5000     103.7588      -1.2588
  6    James           1.0000      83.0000      79.7204       3.2796
  7    Jane            1.0000      84.5000      87.5443      -3.0443
  8    Janet           1.0000     112.5000     102.9467       9.5533
  9    Jeffrey         1.0000      84.0000      98.3117     -14.3117
 10    John            1.0000      99.5000      85.0407      14.4593
 11    Joyce           1.0000      50.5000      58.6253      -8.1253
 12    Judy            1.0000      90.0000     106.2625     -16.2625
 13    Louise          1.0000      77.0000      76.5908       0.4092
 14    Mary            1.0000     112.0000     115.4651      -3.4651
 15    Philip          1.0000     150.0000     134.9953      15.0047
 16    Robert          0.5000     128.0000     103.1923      24.8077
 17    Ronald               0     133.0000     117.0299      15.9701
 18    Thomas          1.0000      85.0000      78.0288       6.9712
 19    William         1.0000     112.0000     115.4651      -3.4651

Sum of Residuals                            0
Sum of Squared Residuals           1637.81879
Predicted Residual SS (PRESS)      2473.87984
Notice that this results in a slightly different model than the previous set of statements: only observation 16 is reweighted to 0.5. Also note that the model label is now MODEL1.5 since five
REWEIGHT statements have been used for this model.
Another important feature of the REWEIGHT statement is the ability to nullify the effect of a
previous or all REWEIGHT statements. First, assume that you have several REWEIGHT statements in effect and you want to restore the original weights of all the observations. The following
REWEIGHT statement accomplishes this and produces Figure 73.57:
reweight allobs / reset;
print;
run;
Figure 73.57 Restoring Weights of All Observations

The REG Procedure
Model: MODEL1.6
Dependent Variable: Weight

Output Statistics

                    Dependent    Predicted
Obs    Name          Variable        Value     Residual
  1    Alfred        112.5000     124.8686     -12.3686
  2    Alice          84.0000      78.6273       5.3727
  3    Barbara        98.0000     110.2812     -12.2812
  4    Carol         102.5000     102.5670      -0.0670
  5    Henry         102.5000     105.0849      -2.5849
  6    James          83.0000      80.2266       2.7734
  7    Jane           84.5000      89.2191      -4.7191
  8    Janet         112.5000     102.7663       9.7337
  9    Jeffrey        84.0000     100.2095     -16.2095
 10    John           99.5000      86.3415      13.1585
 11    Joyce          50.5000      57.3660      -6.8660
 12    Judy           90.0000     107.9625     -17.9625
 13    Louise         77.0000      76.6295       0.3705
 14    Mary          112.0000     117.1544      -5.1544
 15    Philip        150.0000     138.2164      11.7836
 16    Robert        128.0000     107.2043      20.7957
 17    Ronald        133.0000     118.9529      14.0471
 18    Thomas         85.0000      79.6676       5.3324
 19    William       112.0000     117.1544      -5.1544

Sum of Residuals                            0
Sum of Squared Residuals           2120.09974
Predicted Residual SS (PRESS)      3272.72186
The resulting model is identical to the original model specified at the beginning of this section.
Notice that the model label is now MODEL1.6. Note that the Weight column does not appear, since
all observations have been reweighted to have weight=1.
Now suppose you want only to undo the changes made by the most recent REWEIGHT statement.
Use REWEIGHT UNDO for this. The following statements produce Figure 73.58:
reweight r. le -12 or r. ge 12 / weight=.75;
reweight r. le -17 or r. ge 17 / weight=.5;
reweight undo;
print;
run;
Figure 73.58 Example of UNDO in REWEIGHT Statement

The REG Procedure
Model: MODEL1.9
Dependent Variable: Weight

Output Statistics

                       Weight    Dependent    Predicted
Obs    Name          Variable     Variable        Value     Residual
  1    Alfred          0.7500     112.5000     125.1152     -12.6152
  2    Alice           1.0000      84.0000      78.7691       5.2309
  3    Barbara         0.7500      98.0000     110.3236     -12.3236
  4    Carol           1.0000     102.5000     102.8836      -0.3836
  5    Henry           1.0000     102.5000     105.3936      -2.8936
  6    James           1.0000      83.0000      80.1133       2.8867
  7    Jane            1.0000      84.5000      89.0776      -4.5776
  8    Janet           1.0000     112.5000     103.3322       9.1678
  9    Jeffrey         0.7500      84.0000     100.2835     -16.2835
 10    John            0.7500      99.5000      86.2090      13.2910
 11    Joyce           1.0000      50.5000      57.0745      -6.5745
 12    Judy            0.7500      90.0000     108.2622     -18.2622
 13    Louise          1.0000      77.0000      76.5275       0.4725
 14    Mary            1.0000     112.0000     117.6752      -5.6752
 15    Philip          1.0000     150.0000     138.9211      11.0789
 16    Robert          0.7500     128.0000     107.0063      20.9937
 17    Ronald          0.7500     133.0000     119.4681      13.5319
 18    Thomas          1.0000      85.0000      79.3061       5.6939
 19    William         1.0000     112.0000     117.6752      -5.6752

Sum of Residuals                            0
Sum of Squared Residuals           1694.87114
Predicted Residual SS (PRESS)      2547.22751
The resulting model reflects changes made only by the first REWEIGHT statement since the third
REWEIGHT statement negates the effect of the second REWEIGHT statement. Observations 1, 3,
9, 10, 12, 16, and 17 have their weights changed to 0.75. Note that the label MODEL1.9 reflects
the use of nine REWEIGHT statements for the current model.
Now suppose you want to reset the observations selected by the most recent REWEIGHT statement
to their original weights. Use the REWEIGHT statement with the RESET option to do this. The
following statements produce Figure 73.59:
reweight r. le -12 or r. ge 12 / weight=.75;
reweight r. le -17 or r. ge 17 / weight=.5;
reweight / reset;
print;
run;
Figure 73.59 REWEIGHT Statement with RESET option

The REG Procedure
Model: MODEL1.12
Dependent Variable: Weight

Output Statistics

                       Weight    Dependent    Predicted
Obs    Name          Variable     Variable        Value     Residual
  1    Alfred          0.7500     112.5000     126.0076     -13.5076
  2    Alice           1.0000      84.0000      77.8727       6.1273
  3    Barbara         0.7500      98.0000     111.2805     -13.2805
  4    Carol           1.0000     102.5000     102.4703       0.0297
  5    Henry           1.0000     102.5000     105.1278      -2.6278
  6    James           1.0000      83.0000      80.2290       2.7710
  7    Jane            1.0000      84.5000      89.7199      -5.2199
  8    Janet           1.0000     112.5000     102.0122      10.4878
  9    Jeffrey         0.7500      84.0000     100.6507     -16.6507
 10    John            0.7500      99.5000      86.6828      12.8172
 11    Joyce           1.0000      50.5000      56.7703      -6.2703
 12    Judy            1.0000      90.0000     108.1649     -18.1649
 13    Louise          1.0000      77.0000      76.4327       0.5673
 14    Mary            1.0000     112.0000     117.1975      -5.1975
 15    Philip          1.0000     150.0000     138.7581      11.2419
 16    Robert          1.0000     128.0000     108.7016      19.2984
 17    Ronald          0.7500     133.0000     119.0957      13.9043
 18    Thomas          1.0000      85.0000      80.3076       4.6924
 19    William         1.0000     112.0000     117.1975      -5.1975

Sum of Residuals                            0
Sum of Squared Residuals           1879.08980
Predicted Residual SS (PRESS)      2959.57279
Note that observations that meet the condition of the second REWEIGHT statement (residuals with
an absolute value greater than or equal to 17) now have weights reset to their original value of 1.
Observations 1, 3, 9, 10, and 17 have weights of 0.75, but observations 12 and 16 (which meet the
condition of the second REWEIGHT statement) have their weights reset to 1.
Notice how the last three examples show three ways to change weights back to a previous value.
In the first example, ALLOBS and the RESET option are used to change weights for all observations back to their original values. In the second example, the UNDO option is used to negate the
effect of a previous REWEIGHT statement, thus changing weights for observations selected in the
previous REWEIGHT statement to the weights specified in still another REWEIGHT statement. In
the third example, the RESET option is used to change weights for observations selected in a previous
REWEIGHT statement back to their original values. Finally, note that the label MODEL1.12
indicates that 12 REWEIGHT statements have been applied to the original model.
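For comparison, a fixed set of weights can also be supplied noninteractively with the WEIGHT statement. The following sketch assigns the weights shown in Figure 73.55 (0.5 for observations 12 and 16, 0 for observation 17, and 1 otherwise) in a DATA step and fits the same model. It assumes that the Class data set is in the observation order shown in Figure 73.54; the data set name ClassW and the variable w are hypothetical.

```sas
/* Sketch: supplying the weights of Figure 73.55 through a WEIGHT    */
/* statement instead of interactive REWEIGHT statements.             */
/* Assumes Class is ordered as in Figure 73.54. ClassW and w are     */
/* illustrative names.                                               */
data ClassW;
   set Class;
   w=1;                                 /* default weight            */
   if _n_ in (12, 16) then w=0.5;       /* downweight Judy, Robert   */
   if _n_=17 then w=0;                  /* exclude Ronald from fit   */
run;

proc reg data=ClassW;
   model Weight=Age Height / p;
   weight w;
   id Name;
run;
quit;
```

Under these assumptions the fitted equation should correspond to the one behind Figure 73.55, although the layout of the displayed output can differ from the interactive run. Unlike REWEIGHT, this approach fixes the weights in advance and cannot select observations by their residuals.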
Testing for Heteroscedasticity
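The regression model is specified as $y_i = \mathbf{x}_i \boldsymbol{\beta} + \epsilon_i$, where the $\epsilon_i$'s are identically and independently
distributed: $E(\boldsymbol{\epsilon}) = 0$ and $E(\boldsymbol{\epsilon}\boldsymbol{\epsilon}') = \sigma^2 \mathbf{I}$. If the $\epsilon_i$'s are not independent or their variances are
not constant, the parameter estimates are unbiased, but the estimate of the covariance matrix is
inconsistent.

In the case of heteroscedasticity, if the regression data are from a simple random sample, then White
(1980) showed that the matrix

$$\mathrm{HC}_0 = (\mathbf{X}'\mathbf{X})^{-1} \left(\mathbf{X}'\,\mathrm{diag}(e_i^2)\,\mathbf{X}\right) (\mathbf{X}'\mathbf{X})^{-1}$$

where

$$e_i = y_i - \mathbf{x}_i \mathbf{b}$$

is an asymptotically consistent estimate of the covariance matrix. MacKinnon and White (1985)
introduced three alternative heteroscedasticity-consistent covariance matrix estimators that are all
asymptotically equivalent to the estimator $\mathrm{HC}_0$ but that typically have better small sample behavior.
These estimators, labeled $\mathrm{HC}_1$, $\mathrm{HC}_2$, and $\mathrm{HC}_3$, are defined as follows:

$$\mathrm{HC}_1 = \frac{n}{n-p}\,\mathrm{HC}_0$$

where $n$ is the number of observations and $p$ is the number of regressors including the intercept.

$$\mathrm{HC}_2 = (\mathbf{X}'\mathbf{X})^{-1}\,\mathbf{X}'\,\mathrm{diag}\!\left(\frac{e_i^2}{1 - h_{ii}}\right)\mathbf{X}\,(\mathbf{X}'\mathbf{X})^{-1}$$

where

$$h_{ii} = \mathbf{x}_i (\mathbf{X}'\mathbf{X})^{-1} \mathbf{x}_i'$$

is the leverage of the $i$th observation.

$$\mathrm{HC}_3 = (\mathbf{X}'\mathbf{X})^{-1}\,\mathbf{X}'\,\mathrm{diag}\!\left(\frac{e_i^2}{(1 - h_{ii})^2}\right)\mathbf{X}\,(\mathbf{X}'\mathbf{X})^{-1}$$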
Long and Ervin (2000) studied the performance of these estimators and recommend using the HC3
estimator if the sample size is less than 250.
You can use the HCCMETHOD= option in the MODEL statement, with a value of 0, 1, 2, or 3, to select a
heteroscedasticity-consistent covariance matrix estimator, with $\mathrm{HC}_0$ being the default. The ACOV option in the
MODEL statement displays the heteroscedasticity-consistent covariance matrix estimator in effect
and adds heteroscedasticity-consistent standard errors, also known as White standard errors, to the
parameter estimates table. If you specify the HCC or WHITE option in the MODEL statement,
but do not also specify the ACOV option, then the heteroscedasticity-consistent standard errors are
added to the parameter estimates table but the heteroscedasticity-consistent covariance matrix is
not displayed.
The SPEC option performs a model specification test. The null hypothesis for this test maintains
that the errors are homoscedastic and independent of the regressors and that several technical assumptions about the model specification are valid. For details, see theorem 2 and assumptions 1–7
of White (1980). When the model is correctly specified and the errors are independent of the regressors, the rejection of this null hypothesis is evidence of heteroscedasticity. In implementing this
test, an estimator of the average covariance matrix (White 1980, p. 822) is constructed and inverted.
The nonsingularity of this matrix is one of the assumptions in the null hypothesis about the model
specification. When PROC REG determines this matrix to be numerically singular, a generalized
inverse is used and a note to this effect is written to the log. In such cases, care should be taken in
interpreting the results of this test.
When you specify the SPEC, ACOV, HCC, or WHITE option in the MODEL statement, tests
listed in the TEST statement are performed with both the usual covariance matrix and the
heteroscedasticity-consistent covariance matrix requested with the HCCMETHOD= option. Tests
performed with the consistent covariance matrix are asymptotic. For more information, refer to
White (1980).
Both the ACOV and SPEC options can be specified in a MODEL or PRINT statement.
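The options described in this section can be combined in a single MODEL statement. The following sketch requests the $\mathrm{HC}_3$ estimator, displays the corresponding covariance matrix, runs the specification test, and adds one TEST statement so that the test is reported with both covariance matrices. It assumes the fitness data set from Example 73.2; the choice of regressors and the label RunTimeZero are illustrative only.

```sas
/* Sketch: heteroscedasticity-consistent inference with HC3.         */
/* Assumes the fitness data of Example 73.2. The regressors and the  */
/* TEST label RunTimeZero are illustrative choices.                  */
proc reg data=fitness;
   model Oxygen=RunTime Age Weight
         / hcc hccmethod=3   /* HC3 standard errors in the estimates table */
           acov              /* display the HC covariance matrix           */
           spec;             /* specification test                         */
   RunTimeZero: test RunTime=0;
run;
quit;
```

With only 31 observations in the fitness data, the choice of HCCMETHOD=3 follows the small-sample recommendation of Long and Ervin (2000) cited above.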
Testing for Lack of Fit
The test for lack of fit compares the variation around the model with "pure" variation within replicated
observations. This measures the adequacy of the specified model. In particular, if there are $n_i$
replicated observations $Y_{i1}, \ldots, Y_{in_i}$ of the response all at the same values $\mathbf{x}_i$ of the regressors, then
you can predict the true response at $\mathbf{x}_i$ either by using the predicted value $\hat{Y}_i$ based on the model or
by using the mean $\bar{Y}_i$ of the replicated values. The test for lack of fit decomposes the residual error
into a component due to the variation of the replications around their mean value (the "pure" error)
and a component due to the variation of the mean values around the model prediction (the "bias"
error):

$$\sum_i \sum_{j=1}^{n_i} \left(Y_{ij} - \hat{Y}_i\right)^2 = \sum_i \sum_{j=1}^{n_i} \left(Y_{ij} - \bar{Y}_i\right)^2 + \sum_i n_i \left(\bar{Y}_i - \hat{Y}_i\right)^2$$
If the model is adequate, then both components estimate the nominal level of error; however, if the
bias component of error is much larger than the pure error, then this constitutes evidence that there
is significant lack of fit.
If some observations in your design are replicated, you can test for lack of fit by specifying the
LACKFIT option in the MODEL statement (see Example 73.6). Note that, since all other tests use
total error rather than pure error, you might want to hand-calculate the tests with respect to pure
error if the lack of fit is significant. On the other hand, significant lack of fit indicates that the
specified model is inadequate, so if this is a problem you can also try to refine the model.
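The following sketch shows the LACKFIT option on a small data set in which every value of the regressor is observed twice. The data set Replicated and its values are invented purely for illustration and do not come from this chapter.

```sas
/* Sketch: lack-of-fit test with replicated observations.            */
/* The data set Replicated and its values are hypothetical; each x   */
/* value appears twice so that pure error can be estimated.          */
data Replicated;
   input x y @@;
   datalines;
1 2.1  1 1.9  2 3.8  2 4.3  3 6.2  3 5.7  4 9.1  4 8.6  5 12.9  5 13.4
;

proc reg data=Replicated;
   model y=x / lackfit;   /* splits error into lack of fit and pure error */
run;
quit;
```

If the lack-of-fit test were significant for data like these, one natural refinement to try would be adding a quadratic term in x to the model.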
Multivariate Tests
The MTEST statement described in the section "MTEST Statement" on page 5474 can test hypotheses
involving several dependent variables in the form

$$(\mathbf{L}\boldsymbol{\beta} - \mathbf{c}\mathbf{j})\mathbf{M} = 0$$

where $\mathbf{L}$ is a linear function on the regressor side, $\boldsymbol{\beta}$ is a matrix of parameters, $\mathbf{c}$ is a column vector
of constants, $\mathbf{j}$ is a row vector of ones, and $\mathbf{M}$ is a linear function on the dependent side. The special
case where the constants are zero is

$$\mathbf{L}\boldsymbol{\beta}\mathbf{M} = 0$$

To test this hypothesis, PROC REG constructs two matrices called $\mathbf{H}$ and $\mathbf{E}$ that correspond to the
numerator and denominator of a univariate $F$ test:

$$\mathbf{H} = \mathbf{M}'(\mathbf{L}\mathbf{B} - \mathbf{c}\mathbf{j})'\left(\mathbf{L}(\mathbf{X}'\mathbf{X})^{-}\mathbf{L}'\right)^{-1}(\mathbf{L}\mathbf{B} - \mathbf{c}\mathbf{j})\mathbf{M}$$

$$\mathbf{E} = \mathbf{M}'\left(\mathbf{Y}'\mathbf{Y} - \mathbf{B}'(\mathbf{X}'\mathbf{X})\mathbf{B}\right)\mathbf{M}$$

These matrices are displayed for each MTEST statement if the PRINT option is specified.

Four test statistics based on the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$ or $(\mathbf{E} + \mathbf{H})^{-1}\mathbf{H}$ are formed. These are Wilks'
lambda, Pillai's trace, the Hotelling-Lawley trace, and Roy's greatest root. These test statistics are
discussed in Chapter 4, "Introduction to Regression Procedures."
The following code creates MANOVA data from Morrison (1976):
* Manova Data from Morrison (1976, 190);
data a;
   input sex $ drug $ @;
   do rep=1 to 4;
      input y1 y2 @;
      sexcode=(sex='m')-(sex='f');
      drug1=(drug='a')-(drug='c');
      drug2=(drug='b')-(drug='c');
      sexdrug1=sexcode*drug1;
      sexdrug2=sexcode*drug2;
      output;
   end;
   datalines;
m a 5 6 5 4 9 9 7 6
m b 7 6 7 7 9 12 6 8
m c 21 15 14 11 17 12 12 10
f a 7 10 6 6 9 7 8 10
f b 10 13 8 7 7 6 6 9
f c 16 12 14 9 14 8 10 5
;
The following statements perform a multivariate analysis of variance and produce Figure 73.60
through Figure 73.63:
proc reg;
model y1 y2=sexcode drug1 drug2 sexdrug1 sexdrug2;
y1y2drug: mtest y1=y2, drug1,drug2;
drugshow: mtest drug1, drug2 / print canprint;
run;
Figure 73.60 Multivariate Analysis of Variance: REG Procedure

The REG Procedure
Model: MODEL1
Dependent Variable: y1

Analysis of Variance

                            Sum of         Mean
Source             DF      Squares       Square   F Value   Pr > F
Model               5    316.00000     63.20000     12.04   <.0001
Error              18     94.50000      5.25000
Corrected Total    23    410.50000

Root MSE            2.29129    R-Square    0.7698
Dependent Mean      9.75000    Adj R-Sq    0.7058
Coeff Var          23.50039

Parameter Estimates

                 Parameter    Standard
Variable    DF    Estimate       Error   t Value   Pr > |t|
Intercept    1     9.75000     0.46771     20.85     <.0001
sexcode      1     0.16667     0.46771      0.36     0.7257
drug1        1    -2.75000     0.66144     -4.16     0.0006
drug2        1    -2.25000     0.66144     -3.40     0.0032
sexdrug1     1    -0.66667     0.66144     -1.01     0.3269
sexdrug2     1    -0.41667     0.66144     -0.63     0.5366
Figure 73.61 Multivariate Analysis of Variance: REG Procedure

Analysis of Variance

                            Sum of         Mean
Source             DF      Squares       Square   F Value   Pr > F
Model               5     69.33333     13.86667      2.19   0.1008
Error              18    114.00000      6.33333
Corrected Total    23    183.33333

Root MSE            2.51661    R-Square    0.3782
Dependent Mean      8.66667    Adj R-Sq    0.2055
Coeff Var          29.03782

Parameter Estimates

                 Parameter    Standard
Variable    DF    Estimate       Error   t Value   Pr > |t|
Intercept    1     8.66667     0.51370     16.87     <.0001
sexcode      1     0.16667     0.51370      0.32     0.7493
drug1        1    -1.41667     0.72648     -1.95     0.0669
drug2        1    -0.16667     0.72648     -0.23     0.8211
sexdrug1     1    -1.16667     0.72648     -1.61     0.1257
sexdrug2     1    -0.41667     0.72648     -0.57     0.5734
5574 F Chapter 73: The REG Procedure
Figure 73.62 Multivariate Analysis of Variance: First Test

The REG Procedure
Model: MODEL1
Multivariate Test: y1y2drug

Multivariate Statistics and Exact F Statistics

S=1    M=0    N=8

Statistic                      Value   F Value   Num DF   Den DF   Pr > F
Wilks' Lambda             0.28053917     23.08        2       18   <.0001
Pillai's Trace            0.71946083     23.08        2       18   <.0001
Hotelling-Lawley Trace    2.56456456     23.08        2       18   <.0001
Roy's Greatest Root       2.56456456     23.08        2       18   <.0001
The four multivariate test statistics are all highly significant, giving strong evidence that the coefficients of drug1 and drug2 are not the same across dependent variables y1 and y2.
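Because S=1 for this test, there is only one nonzero eigenvalue, and the four statistics in Figure 73.62 are simple functions of one another. The following check is a sketch based on the standard relationships among these statistics; the symbol $\lambda$ for the single eigenvalue is notation introduced here, not part of the displayed output.

```latex
\begin{align*}
\lambda &= 2.56456456
  \quad\text{(Hotelling-Lawley trace = Roy's greatest root)} \\
\Lambda_{\mathrm{Wilks}} &= \frac{1}{1+\lambda}
  = \frac{1}{3.56456456} \approx 0.2805 \\
V_{\mathrm{Pillai}} &= \frac{\lambda}{1+\lambda}
  = 1 - \Lambda_{\mathrm{Wilks}} \approx 0.7195 \\
F &= \lambda \times \frac{\text{Den DF}}{\text{Num DF}}
  = 2.56456456 \times \frac{18}{2} \approx 23.08
\end{align*}
```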
Figure 73.63 Multivariate Analysis of Variance: Second Test

The REG Procedure
Model: MODEL1
Multivariate Test: drugshow

Error Matrix (E)

        94.5            76.5
        76.5             114

Hypothesis Matrix (H)

         301            97.5
        97.5    36.333333333

                      Adjusted    Approximate        Squared
       Canonical     Canonical       Standard      Canonical
     Correlation   Correlation          Error    Correlation
1       0.905903      0.899927       0.040101       0.820661
2       0.244371       .             0.210254       0.059717

Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)

     Eigenvalue   Difference   Proportion   Cumulative
1        4.5760       4.5125       0.9863       0.9863
2        0.0635                    0.0137       1.0000

Test of H0: The canonical correlations in the current row and all that follow are zero

     Likelihood   Approximate
          Ratio       F Value   Num DF   Den DF   Pr > F
1    0.16862952         12.20        4       34   <.0001
2    0.94028273          1.14        1       18   0.2991

Multivariate Statistics and F Approximations

S=2    M=-0.5    N=7.5

Statistic                      Value   F Value   Num DF   Den DF   Pr > F
Wilks' Lambda             0.16862952     12.20        4       34   <.0001
Pillai's Trace            0.88037810      7.08        4       36   0.0003
Hotelling-Lawley Trace    4.63953666     19.40        4   19.407   <.0001
Roy's Greatest Root       4.57602675     41.18        2       18   <.0001
NOTE: F Statistic for Roy’s Greatest Root is an upper bound.
NOTE: F Statistic for Wilks’ Lambda is exact.
The four multivariate test statistics are all highly significant, giving strong evidence that the coefficients of drug1 and drug2 are not zero for both dependent variables.
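The CANPRINT output in Figure 73.63 contains enough information to reproduce the four statistics by hand. The following check is a sketch that uses the standard definitions of these statistics in terms of the eigenvalues $\lambda_i$ of $E^{-1}H$ and the squared canonical correlations $\rho_i^2 = \lambda_i/(1+\lambda_i)$; these symbols are introduced here for illustration.

```latex
\begin{align*}
\Lambda_{\mathrm{Wilks}} &= \prod_{i=1}^{2} (1-\rho_i^2)
  = (1-0.820661)(1-0.059717) \approx 0.1686 \\
V_{\mathrm{Pillai}} &= \sum_{i=1}^{2} \rho_i^2
  = 0.820661 + 0.059717 \approx 0.8804 \\
U_{\mathrm{Hotelling\text{-}Lawley}} &= \sum_{i=1}^{2} \lambda_i
  = 4.5760 + 0.0635 \approx 4.6395 \\
\Theta_{\mathrm{Roy}} &= \lambda_1 \approx 4.5760
\end{align*}
```

The values agree with the "Multivariate Statistics and F Approximations" table to the precision displayed.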
Autocorrelation in Time Series Data
When regression is performed on time series data, the errors might not be independent. Often errors
are autocorrelated; that is, each error is correlated with the error immediately before it. Autocorrelation is also a symptom of systematic lack of fit. The DW option provides the Durbin-Watson d
statistic to test that the autocorrelation is zero:
$$d = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$$
The value of d is close to 2 if the errors are uncorrelated. The distribution of d is reported by Durbin
and Watson (1951). Tables of the distribution are found in most econometrics textbooks, such as
Johnston (1972) and Pindyck and Rubinfeld (1981).
The sample autocorrelation estimate is displayed after the Durbin-Watson statistic. The sample
autocorrelation is computed as
$$r = \frac{\sum_{i=2}^{n} e_i e_{i-1}}{\sum_{i=1}^{n} e_i^2}$$
This autocorrelation of the residuals might not be a very good estimate of the autocorrelation of
the true errors, especially if there are few observations and the independent variables have certain
patterns. If there are missing observations in the regression, these measures are computed as though
the missing observations did not exist.
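The two statistics are closely related, which explains why d is near 2 when the residuals are uncorrelated. The following derivation is a short sketch that expands the numerator of d by using only the two definitions above.

```latex
\begin{align*}
\sum_{i=2}^{n} (e_i - e_{i-1})^2
  &= \sum_{i=2}^{n} e_i^2 + \sum_{i=2}^{n} e_{i-1}^2
     - 2 \sum_{i=2}^{n} e_i e_{i-1} \\
  &= 2 \sum_{i=1}^{n} e_i^2 - e_1^2 - e_n^2
     - 2 \sum_{i=2}^{n} e_i e_{i-1} \\[4pt]
d &= 2(1 - r) - \frac{e_1^2 + e_n^2}{\sum_{i=1}^{n} e_i^2}
  \;\approx\; 2(1 - r)
\end{align*}
```

The end-point term is negligible for long series but can matter for short ones. For example, with the rounded values d = 1.191 and r = 0.323 reported in Figure 73.64, 2(1 − r) = 1.354, so the end-point term is roughly 0.16 for that 22-observation series.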
Positive autocorrelation of the errors generally tends to make the estimate of the error variance too
small, so confidence intervals are too narrow and true null hypotheses are rejected with a higher
probability than the stated significance level. Negative autocorrelation of the errors generally tends
to make the estimate of the error variance too large, so confidence intervals are too wide and the
power of significance tests is reduced. With either positive or negative autocorrelation, least squares
parameter estimates are usually not as efficient as generalized least squares parameter estimates. For
more details, refer to Judge et al. (1985, Chapter 8) and the SAS/ETS User’s Guide.
The following SAS statements request the DWPROB option for the U.S. population data (see
Figure 73.64). If you use the DW option instead of the DWPROB option, then p-values are not
produced.
proc reg data=USPopulation;
   model Population=Year YearSq / dwProb;
run;
Figure 73.64 Regression Using DW Option

The REG Procedure
Model: MODEL1
Dependent Variable: Population

Durbin-Watson D                1.191
Pr < DW                       0.0050
Pr > DW                       0.9950
Number of Observations            22
1st Order Autocorrelation      0.323
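To see how the two statistics in Figure 73.64 follow from the formulas above, you can compute them directly from the residuals. The following statements are a minimal sketch, not the procedure's internal method: the data set name resid and the variable names e, ePrev, num, den, and prod are hypothetical, and the sketch assumes that the observations are in time order with no missing values.

```sas
/* Save the residuals from the same model (resid and e are illustrative names) */
proc reg data=USPopulation noprint;
   model Population=Year YearSq;
   output out=resid r=e;
run;

/* Accumulate the sums that appear in the formulas for d and r */
data _null_;
   set resid end=last;
   ePrev = lag(e);                    /* residual from the previous observation */
   if _n_ > 1 then do;
      num  + (e - ePrev)**2;          /* numerator of d */
      prod + e*ePrev;                 /* numerator of r */
   end;
   den + e**2;                        /* common denominator */
   if last then do;
      d = num/den;
      r = prod/den;
      put "Durbin-Watson d = " d 6.3;
      put "1st order autocorrelation = " r 6.3;
   end;
run;
```

If the assumptions hold, the values written to the log should match the Durbin-Watson D and 1st Order Autocorrelation entries in Figure 73.64.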
Computations for Ridge Regression and IPC Analysis
In ridge regression analysis, the crossproduct matrix for the independent variables is centered (the
NOINT option is ignored if it is specified) and scaled to one on the diagonal elements. The ridge
constant k (specified with the RIDGE= option) is then added to each diagonal element of the
crossproduct matrix. The ridge regression estimates are the least squares estimates obtained by
using the new crossproduct matrix.
Let $X$ be an $n \times p$ matrix of the independent variables after centering the data, and let $Y$ be an $n \times 1$
vector corresponding to the dependent variable. Let $D$ be a $p \times p$ diagonal matrix with diagonal
elements as in $X'X$. The ridge regression estimate corresponding to the ridge constant $k$ can be
computed as
$$D^{-\frac{1}{2}} \left( Z'Z + kI_p \right)^{-1} Z'Y$$
where $Z = XD^{-\frac{1}{2}}$ and $I_p$ is a $p \times p$ identity matrix.
For IPC analysis, the smallest $m$ eigenvalues of $Z'Z$ (where $m$ is specified with the PCOMIT=
option) are omitted to form the estimates.
For information about ridge regression and IPC standardized parameter estimates, parameter estimate standard errors, and variance inflation factors, refer to Rawlings (1988), Neter, Wasserman,
and Kutner (1990), and Marquardt and Snee (1975). Unlike Rawlings (1988), the REG procedure uses the mean squared errors of the submodels instead of the full model MSE to compute the
standard errors of the parameter estimates.
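The following statements sketch how these computations are typically requested. The data set mydata and the variables y, x1, x2, and x3 are hypothetical; the RIDGE=, PCOMIT=, OUTEST=, OUTVIF, and OUTSEB options are PROC REG statement options, and the estimates are written to the OUTEST= data set rather than displayed.

```sas
/* Ridge estimates for k = 0, 0.02, ..., 0.10 (mydata, y, x1-x3 are illustrative) */
proc reg data=mydata outest=ridgeEst outvif outseb
         ridge=0 to 0.1 by 0.02;
   model y = x1 x2 x3;
run;
quit;

/* IPC estimates that omit the smallest eigenvalue of Z'Z */
proc reg data=mydata outest=ipcEst outvif outseb pcomit=1;
   model y = x1 x2 x3;
run;
quit;

proc print data=ridgeEst;
run;
```

In the OUTEST= data set, the _TYPE_ variable distinguishes the ordinary, ridge, and IPC rows, along with their variance inflation factors and standard errors.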
Construction of Q-Q and P-P Plots
If a normal probability-probability or quantile-quantile plot for the variable x is requested, the n
nonmissing values of x are first ordered from smallest to largest:
$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$$
If a Q-Q plot is requested (with a PLOT statement of the form PLOT yvariable*NQQ.), the $i$th
ordered value $x_{(i)}$ is represented by a point with y-coordinate $x_{(i)}$ and x-coordinate
$\Phi^{-1}\left(\frac{i-0.375}{n+0.25}\right)$, where $\Phi(\cdot)$ is the standard normal distribution.
If a P-P plot is requested (with a PLOT statement of the form PLOT yvariable*NPP.), the $i$th
ordered value $x_{(i)}$ is represented by a point with y-coordinate $\frac{i}{n}$ and x-coordinate
$\Phi\left(\frac{x_{(i)}-\mu}{\sigma}\right)$, where $\mu$ is the mean of the nonmissing x-values and $\sigma$ is the standard
deviation. If an x-value has multiplicity $k$ (that is, $x_{(i)} = \cdots = x_{(i+k-1)}$), then only the point
$\left(\Phi\left(\frac{x_{(i)}-\mu}{\sigma}\right), \frac{i+k-1}{n}\right)$ is displayed.
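The Q-Q plot coordinates can be reproduced outside the procedure. The following statements are a minimal sketch under stated assumptions: mydata and x are hypothetical names, missing values are removed before sorting so that the observation count equals n, and tied values are not treated specially.

```sas
/* Order the nonmissing values of x (mydata and x are illustrative names) */
proc sort data=mydata(where=(x is not missing)) out=sorted;
   by x;
run;

/* _n_ is the rank i; n is the number of nonmissing values */
data qq;
   set sorted nobs=n;
   normalQuantile = probit((_n_ - 0.375)/(n + 0.25));   /* x-coordinate */
run;
```

Plotting x against normalQuantile then gives the same points as the Q-Q construction described above.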
Computational Methods
The REG procedure first composes a crossproducts matrix. The matrix can be calculated from input
data, reformed from an input correlation matrix, or read in from an SSCP data set. For each model,
the procedure selects the appropriate crossproducts from the main matrix. The normal equations
formed from the crossproducts are solved by using a sweep algorithm (Goodnight 1979). The
method is accurate for data that are reasonably scaled and not too collinear.
The mechanism that PROC REG uses to check for singularity involves the diagonal (pivot) elements of $X'X$ as it is being swept. If a pivot is less than SINGULAR*CSS, then a singularity is
declared and the pivot is not swept (where CSS is the corrected sum of squares for the regressor
and SINGULAR is machine dependent but is approximately 1E-7 on most machines or reset in the
PROC REG statement).
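For example, the following sketch resets the singularity criterion with the SINGULAR= option in the PROC REG statement. The data set and variable names are hypothetical, and the value 1E-10 is only an illustration, not a recommendation; a smaller criterion lets the sweep proceed on more nearly collinear regressors at the cost of numerical accuracy.

```sas
/* Tighten the singularity criterion (mydata, y, x1-x3 are illustrative names) */
proc reg data=mydata singular=1e-10;
   model y = x1 x2 x3;
run;
quit;
```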
The sweep algorithm is also used in many places in the model-selection methods. The RSQUARE
method uses the leaps-and-bounds algorithm by Furnival and Wilson (1974).
Computer Resources in Regression Analysis
The REG procedure is efficient for ordinary regression; however, requests for optional features can
greatly increase the amount of time required.
The major computational expense in the regression analysis is the collection of the crossproducts
matrix. For $p$ variables and $n$ observations, the time required is proportional to $np^2$. For each
model run, PROC REG needs time roughly proportional to $k^3$, where $k$ is the number of regressors
in the model. Include an additional $nk^2$ for the R, CLM, or CLI option and another $nk^2$ for the
INFLUENCE option.
Most of the memory that PROC REG needs to solve large problems is used for crossproducts matrices. PROC REG requires $4p^2$ bytes for the main crossproducts matrix plus $4k^2$ bytes for the largest
model. If several output data sets are requested, memory is also needed for buffers.
See the section “Input Data Sets” on page 5502 for information about how to use TYPE=SSCP data
sets to reduce computing time.
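Because collecting the crossproducts dominates the cost, one way to avoid repeating it is to save the matrix once and fit later models from it. The following statements are a sketch with hypothetical data set and variable names; analyses that need the raw observations (for example, residuals) are not available when the input is a TYPE=SSCP data set.

```sas
/* Pass through the raw data once and save the crossproducts matrix */
proc reg data=mydata outsscp=sscp noprint;
   model y = x1 x2 x3 x4;
run;
quit;

/* Fit a submodel from the saved crossproducts without rereading the raw data */
proc reg data=sscp;
   model y = x1 x2;
run;
quit;
```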
Displayed Output
Many of the more specialized tables are described in detail in previous sections. Most of the formulas for the statistics are in Chapter 4, “Introduction to Regression Procedures,” while other formulas
can be found in the section “Model Fit and Diagnostic Statistics” on page 5551 and the section
“Influence Statistics” on page 5553.
The analysis-of-variance table includes the following:
• the Source of the variation, Model for the fitted regression, Error for the residual error, and C Total for the total variation after correcting for the mean. The Uncorrected Total Variation is produced when the NOINT option is used.
• the degrees of freedom (DF) associated with the source
• the Sum of Squares for the term
• the Mean Square, the sum of squares divided by the degrees of freedom
• the F Value for testing the hypothesis that all parameters are zero except for the intercept. This is formed by dividing the mean square for Model by the mean square for Error.
• the Prob>F, the probability of getting a greater F statistic than that observed if the hypothesis is true. This is the significance probability.
Other statistics displayed include the following:
• Root MSE is an estimate of the standard deviation of the error term. It is calculated as the square root of the mean square error.
• Dep Mean is the sample mean of the dependent variable.
• C.V. is the coefficient of variation, computed as 100 times Root MSE divided by Dep Mean. This expresses the variation in unitless values.
• R-square is a measure between 0 and 1 that indicates the portion of the (corrected) total variation that is attributed to the fit rather than left to residual error. It is calculated as SS(Model) divided by SS(Total). It is also called the coefficient of determination. It is the square of the multiple correlation; in other words, the square of the correlation between the dependent variable and the predicted values.
• Adj R-square, the adjusted $R^2$, is a version of $R^2$ that has been adjusted for degrees of freedom. It is calculated as
$$\bar{R}^2 = 1 - \frac{(n-i)(1-R^2)}{n-p}$$
where $i$ is equal to 1 if there is an intercept and 0 otherwise, $n$ is the number of observations used to fit the model, and $p$ is the number of parameters in the model.
The parameter estimates and associated statistics are then displayed, and they include the following:
• the Variable used as the regressor, including the name Intercept to represent the estimate of the intercept parameter
• the degrees of freedom (DF) for the variable. There is one degree of freedom unless the model is not full rank.
• the Parameter Estimate
• the Standard Error, the estimate of the standard deviation of the parameter estimate
• T for H0: Parameter=0, the t test that the parameter is zero. This is computed as the Parameter Estimate divided by the Standard Error.
• the Prob > |T|, the probability that a t statistic would obtain a greater absolute value than that observed given that the true parameter is zero. This is the two-tailed significance probability.
If model-selection methods other than NONE, RSQUARE, ADJRSQ, and CP are used, the analysis-of-variance table and the parameter estimates with associated statistics are produced at each step.
Also displayed are the following:
• C(p), which is Mallows’ $C_p$ statistic
• bounds on the condition number of the correlation matrix for the variables in the model (Berk 1977)
After statistics for the final model are produced, the following is displayed when the method chosen
is FORWARD, BACKWARD, or STEPWISE:
• a Summary table listing Step number, Variable Entered or Removed, Partial and Model R-square, and C(p) and F statistics
The RSQUARE method displays its results beginning with the model containing the fewest independent variables and producing the largest $R^2$. Results for other models with the same number
of variables are then shown in order of decreasing $R^2$, and so on, for models with larger numbers
of variables. The ADJRSQ and CP methods group models of all sizes together and display results
beginning with the model having the optimal value of adjusted $R^2$ and $C_p$, respectively.
For each model considered, the RSQUARE, ADJRSQ, and CP methods display the following:
• Number in Model or IN, the number of independent variables used in each model
• R-square or RSQ, the squared multiple correlation coefficient
If the B option is specified, the RSQUARE, ADJRSQ, and CP methods produce the following:
• Parameter Estimates, the estimated regression coefficients
If the B option is not specified, the RSQUARE, ADJRSQ, and CP methods display the following:
• Variables in Model, the names of the independent variables included in the model
ODS Table Names
PROC REG assigns a name to each table it creates. You can use these names to reference the table
when using the Output Delivery System (ODS) to select tables and create output data sets. These
names are listed in the following table. For more information about ODS, see Chapter 20, “Using
the Output Delivery System.”
Table 73.10 ODS Tables Produced by PROC REG

ODS Table Name       Description                                                    Statement  Option
ACovEst              Consistent covariance of estimates matrix                      MODEL      ALL, ACOV
ACovTestANOVA        Test ANOVA using ACOV estimates                                TEST       ACOV (MODEL statement)
ANOVA                Model ANOVA table                                              MODEL      default
CanCorr              Canonical correlations for hypothesis combinations             MTEST      CANPRINT
CollinDiag           Collinearity Diagnostics table                                 MODEL      COLLIN
CollinDiagNoInt      Collinearity Diagnostics for no intercept model                MODEL      COLLINOINT
ConditionBounds      Bounds on condition number                                     MODEL      (SELECTION=BACKWARD | FORWARD | STEPWISE | MAXR | MINR) and DETAILS
Corr                 Correlation matrix for analysis variables                      PROC       ALL, CORR
CorrB                Correlation of estimates                                       MODEL      CORRB
CovB                 Covariance of estimates                                        MODEL      COVB
CrossProducts        Bordered model X'X matrix                                      MODEL      ALL, XPX
DWStatistic          Durbin-Watson statistic                                        MODEL      ALL, DW
DependenceEquations  Linear dependence equations                                    MODEL      default if needed
Eigenvalues          MTest eigenvalues                                              MTEST      CANPRINT
Eigenvectors         MTest eigenvectors                                             MTEST      CANPRINT
EntryStatistics      Entry statistics for selection methods                         MODEL      (SELECTION=BACKWARD | FORWARD | STEPWISE | MAXR | MINR) and DETAILS
ErrorPlusHypothesis  MTest error plus hypothesis matrix H+E                         MTEST      PRINT
ErrorSSCP            MTest error matrix E                                           MTEST      PRINT
FitStatistics        Model fit statistics                                           MODEL      default
HypothesisSSCP       MTest hypothesis matrix                                        MTEST      PRINT
InvMTestCov          Inv(L Ginv(X'X) L') and Inv(Lb-c)                              MTEST      DETAILS
InvTestCov           Inv(L Ginv(X'X) L') and Inv(Lb-c)                              TEST       PRINT
InvXPX               Bordered X'X inverse matrix                                    MODEL      I
MTestCov             L Ginv(X'X) L' and Lb-c                                        MTEST      DETAILS
MTransform           MTest matrix M, across dependents                              MTEST      DETAILS
MultStat             Multivariate test statistics                                   MTEST      default
NObs                 Number of observations                                                    default
OutputStatistics     Output statistics table                                        MODEL      ALL, CLI, CLM, INFLUENCE, P, R
PartialData          Partial regression leverage data                               MODEL      PARTIALDATA
ParameterEstimates   Model parameter estimates                                      MODEL      default
RemovalStatistics    Removal statistics for selection methods                       MODEL      (SELECTION=BACKWARD | STEPWISE | MAXR | MINR) and DETAILS
ResidualStatistics   Residual statistics and PRESS statistic                        MODEL      ALL, CLI, CLM, INFLUENCE, P, R
SelParmEst           Parameter estimates for selection methods                      MODEL      SELECTION=BACKWARD | FORWARD | STEPWISE | MAXR | MINR
SelectionSummary     Selection summary for FORWARD, BACKWARD, and STEPWISE methods  MODEL      SELECTION=BACKWARD | FORWARD | STEPWISE
SeqParmEst           Sequential parameter estimates                                 MODEL      SEQB
SimpleStatistics     Simple statistics for analysis variables                       PROC       ALL, SIMPLE
SpecTest             White's heteroscedasticity test                                MODEL      ALL, SPEC
SubsetSelSummary     Selection summary for R-square, Adj-RSq, and Cp methods        MODEL      SELECTION=RSQUARE | ADJRSQ | CP
TestANOVA            Test ANOVA table                                               TEST       default
TestCov              L Ginv(X'X) L' and Lb-c                                        TEST       PRINT
USSCP                Uncorrected SSCP matrix for analysis variables                 PROC       ALL, USSCP
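As an illustration of how these names are used, the following sketch captures two of the default tables as data sets with the ODS OUTPUT statement. The input data set mydata, the variables y, x1, x2, and x3, and the output data set names parmEst and fitStats are hypothetical.

```sas
/* Save the ParameterEstimates and FitStatistics tables as data sets */
ods output ParameterEstimates=parmEst FitStatistics=fitStats;

proc reg data=mydata;
   model y = x1 x2 x3;
run;
quit;      /* PROC REG is interactive; QUIT ends it so the output data sets are closed */

proc print data=parmEst;
run;
```

A table that is not a default table (for example, DWStatistic) is created only if the corresponding option from the table is also specified.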
ODS Graphics
This section describes the use of ODS for creating statistical graphs with the REG procedure. To
request these graphs you must specify the ODS GRAPHICS statement. For more information about
the ODS GRAPHICS statement, see Chapter 21, “Statistical Graphics Using ODS.” The following
sections describe the ODS graphical displays produced by PROC REG.
Diagnostics Panel
The “Diagnostics Panel” provides a display that you can use to get an overall assessment of your
model. See Figure 73.8 for an example.
The panel contains the following plots:
• residuals versus the predicted values
• externally studentized residuals (RSTUDENT) versus the predicted values
• externally studentized residuals versus the leverage
• normal quantile-quantile plot (Q-Q plot) of the residuals
• dependent variable values versus the predicted values
• Cook’s D versus observation number
• histogram of the residuals
• “Residual-Fit” (or RF) plot consisting of side-by-side quantile plots of the centered fit and the residuals
• box plot of the residuals if you specify the STATS=NONE suboption
Patterns in the plots of residuals or studentized residuals versus the predicted values, or spread of
the residuals being greater than the spread of the centered fit in the RF plot, are indications of
an inadequate model. Patterns in the spread about the 45-degree reference line in the plot of the
dependent variable values versus the predicted values are also indications of an inadequate model.
The Q-Q plot, residual histogram, and box plot of the residuals are useful for diagnosing violations
of the normality and homoscedasticity assumptions. If the data in a Q-Q plot come from a normal
distribution, the points will cluster tightly around the reference line. A normal density is overlaid
on the residual histogram to help in detecting departures from normality.
Following Rawlings (1998), reference lines are shown on the relevant plots to identify observations
deemed outliers or influential. Observations whose externally studentized residual magnitudes exceed 2 are deemed outliers. Observations whose leverage value exceeds $2p/n$ or whose Cook’s D
value exceeds $4/n$ are deemed influential ($p$ is the number of regressors including the intercept,
and $n$ is the number of observations used in the analysis). If you specify the LABEL suboption of
the PLOTS=DIAGNOSTICS option, then the points deemed outliers or influential are labeled on
the appropriate plots.
Fit statistics are shown in the lower right of the plot and can be customized or suppressed by using
the STATS= suboption of the PLOTS=DIAGNOSTICS option.
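The same cutoffs can be applied to the diagnostic statistics in an output data set. The following statements are a sketch under stated assumptions: the data set and variable names are hypothetical, the model has three regressors plus an intercept (so p is set to 4 by hand), and no observations are dropped for missing values (so the observation count equals n).

```sas
/* Save the statistics that the reference lines are based on */
proc reg data=mydata noprint;
   model y = x1 x2 x3;
   output out=diag rstudent=rstud h=lev cookd=cd;
run;
quit;

/* Flag observations that exceed the cutoffs described above */
data flagged;
   set diag nobs=n;
   p = 4;                              /* regressors including the intercept */
   outlier      = (abs(rstud) > 2);    /* |RSTUDENT| > 2 */
   highLeverage = (lev > 2*p/n);       /* leverage > 2p/n */
   highCooksD   = (cd > 4/n);          /* Cook's D > 4/n */
run;

proc print data=flagged;
   where outlier or highLeverage or highCooksD;
run;
```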
Residuals by Regressor Plots
Panels of plots of the residuals versus each of the regressors in the model are produced by default.
Patterns in these plots are indications of an inadequate model. To help in detecting patterns, you can
use the SMOOTH= suboption of the PLOTS=RESIDUALS option to add loess fits to these residual
plots. See Output 73.1.6 for an example.
Fit and Prediction Plots
A fit plot consisting of a scatter plot of the data overlaid with the regression line, as well as confidence and prediction limits, is produced for models depending on a single regressor. Fit statistics are
shown to the right of the plot and can be customized or suppressed by using the STATS= suboption
of the PLOTS=FIT option.
When a model contains more than one regressor, a fit plot is not appropriate. However, if all the
regressors in the model are transformations of a single variable in the input data set, then you can
request a scatter plot of the dependent variable overlaid with a fit line and confidence and prediction
limits versus this variable. You can also plot residuals versus this variable. You request these plots,
shown in a panel, with the PLOTS=PREDICTION option. See Figure 73.13 for an example.
Influence Plots
In addition to the “Cook’s D Plot” and the “RStudent By Leverage Plot,” you can request plots of
the DFBETAS and DFFITS statistics versus observation number by using the PLOTS=DFBETAS
and PLOTS=DFFITS options. You can also obtain partial regression leverage plots by using the
PLOTS=PARTIAL option. See the section “Influence Statistics” on page 5553 for examples of
these plots and details about their interpretation.
Ridge and VIF Plots
When you use ridge regression, you can request plots of the variance inflation factor (VIF) values and standardized ridge estimates by ridge values for each coefficient with the PLOTS=RIDGE
option. See Example 73.5 for examples.
Variable Selection Plots
When you request variable selection by using the SELECTION= option in the MODEL statement,
you can request plots of fit criteria for the models examined by using the PLOTS=CRITERIA
option. The fit criteria are displayed versus the step number for the FORWARD, BACKWARD,
and STEPWISE selection methods, and the step at which the optimal value of each criterion is
obtained is indicated by a “Star” marker. For the all-subset-based selection methods (SELECTION=RSQUARE|ADJRSQ|CP), the fit criteria are displayed versus the number of regressors in
the model.
The criteria are shown in a panel, but you can use the UNPACK suboption of the
PLOTS=CRITERIA option to obtain separate plots for each criterion. You can also use the
LABEL suboption of the PLOTS=CRITERIA option to request that optimal models be labeled on
the plots. Example 73.2 provides several examples.
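A request for these plots might look like the following sketch, in which the data set mydata and the variables y and x1 through x5 are hypothetical. The ONLY global option suppresses the default plots, and the UNPACK suboption produces one plot per criterion instead of a panel.

```sas
ods graphics on;

/* Stepwise selection with one criterion plot per fit statistic */
proc reg data=mydata plots(only)=criteria(unpack);
   model y = x1-x5 / selection=stepwise;
run;
quit;

ods graphics off;
```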
ODS Graph Names
PROC REG assigns a name to each graph it creates using ODS. You can use these names to reference the graphs when using ODS. The names are listed in Table 73.11.
To request these graphs you must specify the ODS GRAPHICS statement. For more information
about the ODS GRAPHICS statement, see Chapter 21, “Statistical Graphics Using ODS.”
Table 73.11 ODS Graphical Displays Produced by PROC REG

ODS Graph Name           Plot Description                                                                            PLOTS Option
AdjrsqPlot               Adjusted R-square statistic for models examined during variable selection                   ADJRSQ
AICPlot                  AIC statistic for models examined during variable selection                                 AIC
BICPlot                  BIC statistic for models examined during variable selection                                 BIC
CooksDPlot               Cook's D statistic versus observation number                                                COOKSD
CPPlot                   Cp statistic for models examined during variable selection                                  CP
DFFITSPlot               DFFITS statistics versus observation number                                                 DFFITS
DFBETASPanel             Panel of DFBETAS statistics versus observation number                                       DFBETAS
DFBETASPlot              DFBETAS statistics versus observation number                                                DFBETAS(UNPACK)
DiagnosticsPanel         Panel of fit diagnostics                                                                    DIAGNOSTICS
FitPlot                  Regression line, confidence limits, and prediction limits overlaid on scatter plot of data  FIT
ObservedByPredicted      Dependent variable versus predicted values                                                  OBSERVEDBYPREDICTED
PartialPlot              Partial regression plot                                                                     PARTIAL
PredictionPanel          Panel of residuals and fit versus specified variable                                        PREDICTIONS
PredictionPlot           Regression line, confidence limits, and prediction limits versus specified variable         PREDICTIONS(UNPACK)
PredictionResidualPlot   Residuals versus specified variable                                                         PREDICTIONS(UNPACK)
QQPlot                   Normal quantile plot of residuals                                                           QQ
ResidualBoxPlot          Box plot of residuals                                                                       BOXPLOT
ResidualByPredicted      Residuals versus predicted values                                                           RESIDUALBYPREDICTED
ResidualHistogram        Histogram of fit residuals                                                                  RESIDUALHISTOGRAM
ResidualPlot             Plot of residuals versus regressor                                                          RESIDUALS
RFPlot                   Side-by-side plots of quantiles of centered fit and residuals                               RF
RidgePanel               Plot of VIF and ridge traces                                                                RIDGE
RidgePlot                Plot of ridge traces                                                                        RIDGE(UNPACK)
RSquarePlot              R-square statistic for models examined during variable selection                            RSQUARE
RStudentByLeverage       Studentized residuals versus leverage                                                       RSTUDENTBYLEVERAGE
RStudentByPredicted      Studentized residuals versus predicted values                                               RSTUDENTBYPREDICTED
SBCPlot                  SBC statistic for models examined during variable selection                                 SBC
SelectionCriterionPanel  Panel of fit statistics for models examined during variable selection                       CRITERIA
VIFPlot                  Plot of VIF traces                                                                          RIDGE(UNPACK)
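The graph names are used in the same way as table names. As a sketch (with a hypothetical data set and variables), the following statements use the ODS SELECT statement so that only the fit diagnostics panel is displayed; all other tables and graphs from this step are suppressed.

```sas
ods graphics on;

/* Display only the graph named DiagnosticsPanel */
ods select DiagnosticsPanel;
proc reg data=mydata;
   model y = x1 x2 x3;
run;
quit;

ods graphics off;
```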
Examples: REG Procedure
Example 73.1: Modeling Salaries of Major League Baseball Players
This example features the use of ODS Graphics in the process of building models by using the REG
procedure and highlights the use of fit and influence diagnostics.
The following data set contains salary and performance information for Major League Baseball
players who played at least one game in both the 1986 and 1987 seasons, excluding pitchers. The
salaries (Sports Illustrated, April 20, 1987) are for the 1987 season and the performance measures
are from 1986 (Collier Books, The 1987 Baseball Encyclopedia Update).
data baseball;
   length name $ 18;
   length team $ 12;
   input name $ 1-18 no_atbat no_hits no_home no_runs no_rbi no_bb yr_major
         cr_atbat cr_hits cr_home cr_runs cr_rbi cr_bb league $
         division $ team $ position $ no_outs no_assts no_error salary;
   logSalary = log10(salary);
   label name="Player's Name"
         no_hits="Hits in 1986"
         no_runs="Runs in 1986"
         no_rbi="RBIs in 1986"
         no_bb="Walks in 1986"
         yr_major="Years in MLB"
         cr_hits="Career Hits"
         salary="1987 Salary in $ Thousands"
         logSalary = "log10(Salary)";
   datalines;
Allanson, Andy     293 66 1 30 29 14 1 293 66 1 30 29 14
American East Cleveland C 446 33 20 .
Ashby, Alan        315 81 7 24 38 39 14 3449 835 69 321 414 375
National West Houston C 632 43 10 475
Davis, Alan        479 130 18 66 72 76 3 1624 457 63 224 266 263
American West Seattle 1B 880 82 14 480
Dawson, Andre      496 141 20 65 78 37 11 5628 1575 225 828 838 354
National East Montreal RF 200 11 3 500
Galarraga, Andres  321 87 10 39 42 30 2 396 101 12 48 46 33
National East Montreal 1B 805 40 4 91.5
Griffin, Alfredo   594 169 4 74 51 35 11 4408 1133 19 501 336 194
... more lines ...
Wilson, Willie     631 170 9 77 44 31 11 4908 1457 30 775 357 249
American West KansasCity CF 408 4 3 1000
;
Suppose you want to investigate whether you can model the players’ salaries for the 1987 season
based on batting statistics for the previous season and lifetime batting performance. Since the
variation in salaries is much greater for higher salaries, it is appropriate to apply a log transformation
for this analysis. The following statements begin the analysis:
ods graphics on;
proc reg data=baseball;
   id name team league;
   model logSalary = no_hits no_runs no_rbi no_bb yr_major cr_hits;
run;
Output 73.1.1 shows the default output produced by PROC REG. The number of observations table
shows that 59 observations are excluded because they have missing values for at least one of the
variables used in the analysis. The analysis of variance and parameter estimates tables provide
details about the fitted model.
Output 73.1.1 Default Output from PROC REG

The REG Procedure
Model: MODEL1
Dependent Variable: logSalary log10(Salary)

Number of Observations Read                    322
Number of Observations Used                    263
Number of Observations with Missing Values      59

Analysis of Variance

                             Sum of        Mean
Source             DF       Squares      Square   F Value   Pr > F
Model               6      22.92208     3.82035     60.56   <.0001
Error             256      16.14954     0.06308
Corrected Total   262      39.07162

Root MSE             0.25117    R-Square    0.5867
Dependent Mean       2.57416    Adj R-Sq    0.5770
Coeff Var            9.75719

Parameter Estimates

                                   Parameter      Standard
Variable   Label            DF      Estimate         Error   t Value   Pr > |t|
Intercept  Intercept         1       1.80065       0.05912     30.46     <.0001
no_hits    Hits in 1986      1       0.00288    0.00091244      3.15     0.0018
no_runs    Runs in 1986      1    0.00008638       0.00173      0.05     0.9602
no_rbi     RBIs in 1986      1    0.00054382       0.00102      0.53     0.5947
no_bb      Walks in 1986     1       0.00292       0.00104      2.81     0.0054
yr_major   Years in MLB      1       0.03087       0.00836      3.69     0.0003
cr_hits    Career Hits       1    0.00010384    0.00006328      1.64     0.1020
Before you accept a regression model, it is important to examine influence and fit diagnostics to see
whether the model might be unduly influenced by a few observations and whether the data support
the assumptions that underlie the linear regression. To facilitate such investigations, you can obtain
diagnostic plots by enabling ODS Graphics.
Output 73.1.2 Fit Diagnostics
Output 73.1.2 shows a panel of diagnostic plots. The plot of externally studentized residuals (RStudent) by leverage values reveals that there is one observation with very high leverage that might be
overly influencing the fit produced. The plot of Cook’s D by observation also indicates two highly
influential observations. To investigate further, you can use the PLOTS= option in the PROC REG
statement as follows to produce labeled versions of these plots:
proc reg data=baseball
      plots(only label)=(RStudentByLeverage CooksD);
   id name team league;
   model logSalary = no_hits no_runs no_rbi no_bb yr_major cr_hits;
run;
Output 73.1.3 and Output 73.1.4 reveal that Pete Rose is the highly influential observation. You
might obtain a better fit to the remaining data if you omit his statistics when building the model.
Output 73.1.3 Outlier and Leverage Diagnostics
Output 73.1.4 Cook’s D
The following statements use a WHERE statement to omit Pete Rose’s statistics when building the
model. An alternative way to do this within PROC REG is to use a REWEIGHT statement. See
“Reweighting Observations in an Analysis” on page 5563 for details about reweighting.
proc reg data=baseball
      plots=(RStudentByLeverage(label) residuals(smooth));
   where name^="Rose, Pete";
   id name team league;
   model logSalary = no_hits no_runs no_rbi no_bb yr_major cr_hits;
run;
Output 73.1.5 shows the new fit diagnostics panel. You can see that there are still several influential and outlying observations. One possible reason for observing outliers is that the linear model
specified is not appropriate to capture the variation in this data. You can often see evidence of an
inappropriate model by observing patterns in plots of residuals.
Output 73.1.5 Fit Diagnostics
Output 73.1.6 shows plots of the residuals by the regressors in the model. When you specify the
RESIDUALS(SMOOTH) suboption of the PLOTS option in the PROC REG statement, a loess fit is
overlaid on each of these plots. You can see the same clear pattern in the residual plots for yr_major
and cr_hits. Players near the start of their careers and players near the end of their careers get paid
less than the model predicts.
Output 73.1.6 Residuals by Regressors
You can address this lack of fit by using polynomials of degree 2 for these two variables as shown
in the following statements:
data baseball;
   set baseball(where=(name^="Rose, Pete"));
   yr_major2 = yr_major*yr_major;
   cr_hits2 = cr_hits*cr_hits;
run;
proc reg data=baseball
      plots=(diagnostics(stats=none) RStudentByLeverage(label)
             CooksD(label) Residuals(smooth)
             DFFITS(label) DFBETAS ObservedByPredicted(label));
   id name team league;
   model logSalary = no_hits no_runs no_rbi no_bb yr_major cr_hits
                     yr_major2 cr_hits2;
run;
ods graphics off;
Output 73.1.7 shows the analysis of variance and parameter estimates for this model. Note that the
R-square value of 0.787 for this model is considerably larger than the R-square value of 0.587 for
the initial model shown in Output 73.1.1.
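As a quick check, the R-square values can be recomputed from the sums of squares reported in Output 73.1.7. The following worked equations are a sketch that uses the standard definitions of R-square and adjusted R-square for a model with an intercept; n = 262 observations and p = 9 parameters are inferred from the degrees of freedom in the output:

```latex
R^2 = \frac{SS_{\text{Model}}}{SS_{\text{Total}}}
    = \frac{30.69367}{38.98073} \approx 0.7874

\bar{R}^2 = 1 - \frac{(n-1)\,(1-R^2)}{n-p}
          = 1 - \frac{261 \times 0.2126}{253} \approx 0.7807
```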
Output 73.1.7 Output from PROC REG
The REG Procedure
Model: MODEL1
Dependent Variable: logSalary log10(Salary)

                        Analysis of Variance

                               Sum of         Mean
Source             DF         Squares       Square    F Value    Pr > F
Model               8        30.69367      3.83671     117.13    <.0001
Error             253         8.28706      0.03276
Corrected Total   261        38.98073

Root MSE           0.18098    R-Square    0.7874
Dependent Mean     2.57301    Adj R-Sq    0.7807
Coeff Var          7.03393

                              Parameter Estimates

                                     Parameter      Standard
Variable    Label            DF       Estimate         Error    t Value    Pr > |t|
Intercept   Intercept         1        1.64564       0.05030      32.72      <.0001
no_hits     Hits in 1986      1    -0.00005539    0.00069200      -0.08      0.9363
no_runs     Runs in 1986      1     0.00093586       0.00125       0.75      0.4549
no_rbi      RBIs in 1986      1        0.00187    0.00074649       2.51      0.0127
no_bb       Walks in 1986     1        0.00218    0.00075057       2.90      0.0040
yr_major    Years in MLB      1        0.10383       0.01495       6.94      <.0001
cr_hits     Career Hits       1     0.00073955    0.00011970       6.18      <.0001
yr_major2                     1       -0.00625    0.00071687      -8.73      <.0001
cr_hits2                      1    -1.44072E-7   4.348471E-8      -3.31      0.0011
The plots of residuals by regressors in Output 73.1.8 and Output 73.1.9 show that the strong pattern
in the plots for yr_major and cr_hits has been reduced, although there is still some indication of a
pattern remaining in these residuals. This suggests that a quadratic function might be insufficient to
capture the dependence of salary on these regressors.
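To see what the quadratic terms imply, the turning point of each fitted parabola can be worked out from the rounded estimates in Output 73.1.7. This is only an illustrative calculation, holding the other regressors fixed; the symbols b_1 (linear coefficient) and b_2 (squared-term coefficient) are notation introduced here:

```latex
x^{*} = -\frac{b_1}{2\,b_2}

x^{*}_{\text{yr\_major}} = \frac{0.10383}{2 \times 0.00625} \approx 8.3 \text{ years}

x^{*}_{\text{cr\_hits}} = \frac{0.00073955}{2 \times 1.44072 \times 10^{-7}} \approx 2567 \text{ hits}
```

Under this fit, predicted log salary rises with years in the major leagues up to roughly year 8 and declines afterward, which is in line with the earlier observation that players near the start and end of their careers are paid less.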
Output 73.1.8 Residuals by Regressors
Output 73.1.9 Residuals by Regressors
Output 73.1.10 shows the diagnostics plots; three of the plots, with points of interest labeled, are
shown individually in Output 73.1.11, Output 73.1.12, and Output 73.1.13. The STATS=NONE
suboption specified in the PLOTS=DIAGNOSTICS option replaces the inset of statistics with a box
plot of the residuals in the fit diagnostics panel. The observed by predicted value plot reveals a
reasonably successful model for explaining the variation in salary for most of the players. However,
the model tends to overpredict the salaries of several players near the lower end of the salary range.
This bias can also be seen in the distribution of the residuals that you can see in the histogram, Q-Q
plot, and box plot in Output 73.1.10.
Output 73.1.10 Fit Diagnostics
Output 73.1.11 Outlier and Leverage Diagnostics
Output 73.1.12 Observed by Predicted Values
Output 73.1.13 Cook’s D
The RStudent by leverage plot in Output 73.1.11 and the Cook’s D plot in Output 73.1.13 show
that there are still a number of influential observations. By specifying the DFFITS and DFBETAS
suboptions of the PLOTS= option, you obtain the additional influence diagnostics plots shown in
Output 73.1.14 and Output 73.1.15. See “Influence Statistics” on page 5553 for details about the
interpretation of the DFFITS and DFBETAS statistics.
Output 73.1.14 DFFITS
Output 73.1.15 DFBETAS
You can continue this analysis by investigating how the influential observations identified in the
various influence plots affect the fit. You can also use PROC ROBUSTREG to obtain a fit that is
resistant to the presence of high leverage points and outliers.
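As a starting point for that follow-up, the statements below are a minimal sketch of a robust fit for the same quadratic model. The choice of METHOD=MM and the DIAGNOSTICS and LEVERAGE options are illustrative assumptions, not settings taken from this example; the sketch also assumes the baseball data set already contains yr_major2 and cr_hits2 as created above.

```sas
/* Sketch only: robust fit of the quadratic salary model.          */
/* METHOD=MM is one of several available estimation methods and is */
/* chosen here purely for illustration.                             */
proc robustreg data=baseball method=mm;
   model logSalary = no_hits no_runs no_rbi no_bb yr_major cr_hits
                     yr_major2 cr_hits2
         / diagnostics leverage;   /* flag outliers and leverage points */
   id name;                        /* identify players in the diagnostics */
run;
```

Comparing these estimates with those in Output 73.1.7 gives a rough indication of how strongly the influential observations affect the least squares fit.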
Example 73.2: Aerobic Fitness Prediction
Aerobic fitness (measured by the ability to consume oxygen) is fit to some simple exercise tests. The
goal is to develop an equation to predict fitness based on the exercise tests rather than on expensive
and cumbersome oxygen consumption measurements. Three model-selection methods are used:
forward selection, backward selection, and MAXR selection. Here are the data:
*-------------------Data on Physical Fitness-------------------*
| These measurements were made on men involved in a physical   |
| fitness course at N.C.State Univ. The variables are Age      |
| (years), Weight (kg), Oxygen intake rate (ml per kg body     |
| weight per minute), time to run 1.5 miles (minutes), heart   |
| rate while resting, heart rate while running (same time      |
| Oxygen rate measured), and maximum heart rate recorded while |
| running.                                                     |
| ***Certain values of MaxPulse were changed for this analysis.|
*--------------------------------------------------------------*;
data fitness;
input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse @@;
datalines;
44 89.47 44.609 11.37 62 178 182
40 75.07 45.313 10.07 62 185 185
44 85.84 54.297 8.65 45 156 168
42 68.15 59.571 8.17 40 166 172
38 89.02 49.874 9.22 55 178 180
47 77.45 44.811 11.63 58 176 176
40 75.98 45.681 11.95 70 176 180
43 81.19 49.091 10.85 64 162 170
44 81.42 39.442 13.08 63 174 176
38 81.87 60.055 8.63 48 170 186
44 73.03 50.541 10.13 45 168 168
45 87.66 37.388 14.03 56 186 192
45 66.45 44.754 11.12 51 176 176
47 79.15 47.273 10.60 47 162 164
54 83.12 51.855 10.33 50 166 170
49 81.42 49.156 8.95 44 180 185
51 69.63 40.836 10.95 57 168 172
51 77.91 46.672 10.00 48 162 168
48 91.63 46.774 10.25 48 162 164
49 73.37 50.388 10.08 67 168 168
57 73.37 39.407 12.63 58 174 176
54 79.38 46.080 11.17 62 156 165
52 76.32 45.441 9.63 48 164 166
50 70.87 54.625 8.92 48 146 155
51 67.25 45.118 11.08 48 172 172
54 91.63 39.203 12.88 44 168 172
51 73.71 45.790 10.47 59 186 188
57 59.08 50.545 9.93 49 148 155
49 76.32 48.673 9.40 56 186 188
48 61.24 47.920 11.50 52 170 176
52 82.78 47.467 10.50 53 170 172
;
The following statements demonstrate the FORWARD, BACKWARD, and MAXR model selection
methods:
proc reg data=fitness;
model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
/ selection=forward;
model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
/ selection=backward;
model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
/ selection=maxr;
run;
Output 73.2.1 shows the sequence of models produced by the FORWARD model-selection method.
Output 73.2.1 Forward Selection Method: PROC REG
The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

Forward Selection: Step 1

Variable RunTime Entered: R-Square = 0.7434 and C(p) = 13.6988

                     Analysis of Variance
                            Sum of        Mean
Source            DF       Squares      Square   F Value   Pr > F
Model              1     632.90010   632.90010     84.01   <.0001
Error             29     218.48144     7.53384
Corrected Total   30     851.38154

            Parameter    Standard
Variable     Estimate       Error    Type II SS   F Value   Pr > F
Intercept    82.42177     3.85530    3443.36654    457.05   <.0001
RunTime      -3.31056     0.36119     632.90010     84.01   <.0001

Bounds on condition number: 1, 1
--------------------------------------------------------------------------------
Forward Selection: Step 2

Variable Age Entered: R-Square = 0.7642 and C(p) = 12.3894

                     Analysis of Variance
                            Sum of        Mean
Source            DF       Squares      Square   F Value   Pr > F
Model              2     650.66573   325.33287     45.38   <.0001
Error             28     200.71581     7.16842
Corrected Total   30     851.38154

            Parameter    Standard
Variable     Estimate       Error    Type II SS   F Value   Pr > F
Intercept    88.46229     5.37264    1943.41071    271.11   <.0001
Age          -0.15037     0.09551      17.76563      2.48   0.1267
RunTime      -3.20395     0.35877     571.67751     79.75   <.0001

Bounds on condition number: 1.0369, 4.1478
--------------------------------------------------------------------------------
Forward Selection: Step 3

Variable RunPulse Entered: R-Square = 0.8111 and C(p) = 6.9596

                     Analysis of Variance
                            Sum of        Mean
Source            DF       Squares      Square   F Value   Pr > F
Model              3     690.55086   230.18362     38.64   <.0001
Error             27     160.83069     5.95669
Corrected Total   30     851.38154

            Parameter    Standard
Variable     Estimate       Error    Type II SS   F Value   Pr > F
Intercept   111.71806    10.23509     709.69014    119.14   <.0001
Age          -0.25640     0.09623      42.28867      7.10   0.0129
RunTime      -2.82538     0.35828     370.43529     62.19   <.0001
RunPulse     -0.13091     0.05059      39.88512      6.70   0.0154

Bounds on condition number: 1.3548, 11.597
--------------------------------------------------------------------------------
Forward Selection: Step 4

Variable MaxPulse Entered: R-Square = 0.8368 and C(p) = 4.8800

                     Analysis of Variance
                            Sum of        Mean
Source            DF       Squares      Square   F Value   Pr > F
Model              4     712.45153   178.11288     33.33   <.0001
Error             26     138.93002     5.34346
Corrected Total   30     851.38154

            Parameter    Standard
Variable     Estimate       Error    Type II SS   F Value   Pr > F
Intercept    98.14789    11.78569     370.57373     69.35   <.0001
Age          -0.19773     0.09564      22.84231      4.27   0.0488
RunTime      -2.76758     0.34054     352.93570     66.05   <.0001
RunPulse     -0.34811     0.11750      46.90089      8.78   0.0064
MaxPulse      0.27051     0.13362      21.90067      4.10   0.0533

Bounds on condition number: 8.4182, 76.851
--------------------------------------------------------------------------------
Forward Selection: Step 5

Variable Weight Entered: R-Square = 0.8480 and C(p) = 5.1063

                     Analysis of Variance
                            Sum of        Mean
Source            DF       Squares      Square   F Value   Pr > F
Model              5     721.97309   144.39462     27.90   <.0001
Error             25     129.40845     5.17634
Corrected Total   30     851.38154

            Parameter    Standard
Variable     Estimate       Error    Type II SS   F Value   Pr > F
Intercept   102.20428    11.97929     376.78935     72.79   <.0001
Age          -0.21962     0.09550      27.37429      5.29   0.0301
Weight       -0.07230     0.05331       9.52157      1.84   0.1871
RunTime      -2.68252     0.34099     320.35968     61.89   <.0001
RunPulse     -0.37340     0.11714      52.59624     10.16   0.0038
MaxPulse      0.30491     0.13394      26.82640      5.18   0.0316

Bounds on condition number: 8.7312, 104.83
The final variable available to add to the model, RestPulse, is not added because it does not meet
the significance-level criterion for entry into the model. The default value of the SLE= option is
0.5 for FORWARD selection, and the p-value for RestPulse in the full model is 0.7473 (see
Output 73.2.2).
The BACKWARD model-selection method begins with the full model. Output 73.2.2 shows the
steps of the BACKWARD method. RestPulse is the first variable deleted, followed by Weight. No
other variables are deleted from the model because the variables remaining (Age, RunTime, RunPulse,
and MaxPulse) are all significant at the 10% significance level. The default value of the SLS=
option is 0.1 for the BACKWARD elimination method.
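The entry and stay thresholds can also be set explicitly with the SLENTRY= (SLE=) and SLSTAY= (SLS=) options in the MODEL statement. The following statements are a minimal sketch; the values 0.10 and 0.05 are arbitrary illustrative choices, not recommendations drawn from this example.

```sas
proc reg data=fitness;
   /* forward selection with a stricter entry threshold than the 0.5 default */
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=forward slentry=0.10;
   /* backward elimination with a stricter stay threshold than the 0.1 default */
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=backward slstay=0.05;
run;
quit;
```

Judging from the p-values in Output 73.2.1 and Output 73.2.2, these stricter thresholds would be expected to stop forward selection after RunTime enters (Age has p = 0.1267 at step 2) and to let backward elimination also remove MaxPulse (p = 0.0533 after step 2).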
Output 73.2.2 Backward Selection Method: PROC REG
Backward Elimination: Step 0

All Variables Entered: R-Square = 0.8487 and C(p) = 7.0000

                     Analysis of Variance
                            Sum of        Mean
Source            DF       Squares      Square   F Value   Pr > F
Model              6     722.54361   120.42393     22.43   <.0001
Error             24     128.83794     5.36825
Corrected Total   30     851.38154

            Parameter    Standard
Variable     Estimate       Error    Type II SS   F Value   Pr > F
Intercept   102.93448    12.40326     369.72831     68.87   <.0001
Age          -0.22697     0.09984      27.74577      5.17   0.0322
Weight       -0.07418     0.05459       9.91059      1.85   0.1869
RunTime      -2.62865     0.38456     250.82210     46.72   <.0001
RunPulse     -0.36963     0.11985      51.05806      9.51   0.0051
RestPulse    -0.02153     0.06605       0.57051      0.11   0.7473
MaxPulse      0.30322     0.13650      26.49142      4.93   0.0360

Bounds on condition number: 8.7438, 137.13
--------------------------------------------------------------------------------
Backward Elimination: Step 1

Variable RestPulse Removed: R-Square = 0.8480 and C(p) = 5.1063

                     Analysis of Variance
                            Sum of        Mean
Source            DF       Squares      Square   F Value   Pr > F
Model              5     721.97309   144.39462     27.90   <.0001
Error             25     129.40845     5.17634
Corrected Total   30     851.38154

            Parameter    Standard
Variable     Estimate       Error    Type II SS   F Value   Pr > F
Intercept   102.20428    11.97929     376.78935     72.79   <.0001
Age          -0.21962     0.09550      27.37429      5.29   0.0301
Weight       -0.07230     0.05331       9.52157      1.84   0.1871
RunTime      -2.68252     0.34099     320.35968     61.89   <.0001
RunPulse     -0.37340     0.11714      52.59624     10.16   0.0038
MaxPulse      0.30491     0.13394      26.82640      5.18   0.0316

Bounds on condition number: 8.7312, 104.83
--------------------------------------------------------------------------------
Backward Elimination: Step 2

Variable Weight Removed: R-Square = 0.8368 and C(p) = 4.8800

                     Analysis of Variance
                            Sum of        Mean
Source            DF       Squares      Square   F Value   Pr > F
Model              4     712.45153   178.11288     33.33   <.0001
Error             26     138.93002     5.34346
Corrected Total   30     851.38154

            Parameter    Standard
Variable     Estimate       Error    Type II SS   F Value   Pr > F
Intercept    98.14789    11.78569     370.57373     69.35   <.0001
Age          -0.19773     0.09564      22.84231      4.27   0.0488
RunTime      -2.76758     0.34054     352.93570     66.05   <.0001
RunPulse     -0.34811     0.11750      46.90089      8.78   0.0064
MaxPulse      0.27051     0.13362      21.90067      4.10   0.0533

Bounds on condition number: 8.4182, 76.851
The MAXR method tries to find the “best” one-variable model, the “best” two-variable model, and
so on. Output 73.2.3 shows that the one-variable model contains RunTime; the two-variable model
contains RunTime and Age; the three-variable model contains RunTime, Age, and RunPulse; the
four-variable model contains Age, RunTime, RunPulse, and MaxPulse; the five-variable model contains
Age, Weight, RunTime, RunPulse, and MaxPulse; and finally, the six-variable model contains all the
variables in the MODEL statement.
Output 73.2.3 Maximum R-Square Improvement Selection Method: PROC REG
Maximum R-Square Improvement: Step 1

Variable RunTime Entered: R-Square = 0.7434 and C(p) = 13.6988

                     Analysis of Variance
                            Sum of        Mean
Source            DF       Squares      Square   F Value   Pr > F
Model              1     632.90010   632.90010     84.01   <.0001
Error             29     218.48144     7.53384
Corrected Total   30     851.38154

            Parameter    Standard
Variable     Estimate       Error    Type II SS   F Value   Pr > F
Intercept    82.42177     3.85530    3443.36654    457.05   <.0001
RunTime      -3.31056     0.36119     632.90010     84.01   <.0001

Bounds on condition number: 1, 1
--------------------------------------------------------------------------------
The above model is the best 1-variable model found.

Maximum R-Square Improvement: Step 2

Variable Age Entered: R-Square = 0.7642 and C(p) = 12.3894

                     Analysis of Variance
                            Sum of        Mean
Source            DF       Squares      Square   F Value   Pr > F
Model              2     650.66573   325.33287     45.38   <.0001
Error             28     200.71581     7.16842
Corrected Total   30     851.38154

            Parameter    Standard
Variable     Estimate       Error    Type II SS   F Value   Pr > F
Intercept    88.46229     5.37264    1943.41071    271.11   <.0001
Age          -0.15037     0.09551      17.76563      2.48   0.1267
RunTime      -3.20395     0.35877     571.67751     79.75   <.0001

Bounds on condition number: 1.0369, 4.1478
--------------------------------------------------------------------------------
The above model is the best 2-variable model found.

Maximum R-Square Improvement: Step 3

Variable RunPulse Entered: R-Square = 0.8111 and C(p) = 6.9596

                     Analysis of Variance
                            Sum of        Mean
Source            DF       Squares      Square   F Value   Pr > F
Model              3     690.55086   230.18362     38.64   <.0001
Error             27     160.83069     5.95669
Corrected Total   30     851.38154

            Parameter    Standard
Variable     Estimate       Error    Type II SS   F Value   Pr > F
Intercept   111.71806    10.23509     709.69014    119.14   <.0001
Age          -0.25640     0.09623      42.28867      7.10   0.0129
RunTime      -2.82538     0.35828     370.43529     62.19   <.0001
RunPulse     -0.13091     0.05059      39.88512      6.70   0.0154

Bounds on condition number: 1.3548, 11.597
--------------------------------------------------------------------------------
The above model is the best 3-variable model found.

Maximum R-Square Improvement: Step 4

Variable MaxPulse Entered: R-Square = 0.8368 and C(p) = 4.8800

                     Analysis of Variance
                            Sum of        Mean
Source            DF       Squares      Square   F Value   Pr > F
Model              4     712.45153   178.11288     33.33   <.0001
Error             26     138.93002     5.34346
Corrected Total   30     851.38154

            Parameter    Standard
Variable     Estimate       Error    Type II SS   F Value   Pr > F
Intercept    98.14789    11.78569     370.57373     69.35   <.0001
Age          -0.19773     0.09564      22.84231      4.27   0.0488
RunTime      -2.76758     0.34054     352.93570     66.05   <.0001
RunPulse     -0.34811     0.11750      46.90089      8.78   0.0064
MaxPulse      0.27051     0.13362      21.90067      4.10   0.0533

Bounds on condition number: 8.4182, 76.851
--------------------------------------------------------------------------------
The above model is the best 4-variable model found.

Maximum R-Square Improvement: Step 5

Variable Weight Entered: R-Square = 0.8480 and C(p) = 5.1063

                     Analysis of Variance
                            Sum of        Mean
Source            DF       Squares      Square   F Value   Pr > F
Model              5     721.97309   144.39462     27.90   <.0001
Error             25     129.40845     5.17634
Corrected Total   30     851.38154

            Parameter    Standard
Variable     Estimate       Error    Type II SS   F Value   Pr > F
Intercept   102.20428    11.97929     376.78935     72.79   <.0001
Age          -0.21962     0.09550      27.37429      5.29   0.0301
Weight       -0.07230     0.05331       9.52157      1.84   0.1871
RunTime      -2.68252     0.34099     320.35968     61.89   <.0001
RunPulse     -0.37340     0.11714      52.59624     10.16   0.0038
MaxPulse      0.30491     0.13394      26.82640      5.18   0.0316

Bounds on condition number: 8.7312, 104.83
--------------------------------------------------------------------------------
The above model is the best 5-variable model found.

Maximum R-Square Improvement: Step 6

Variable RestPulse Entered: R-Square = 0.8487 and C(p) = 7.0000

                     Analysis of Variance
                            Sum of        Mean
Source            DF       Squares      Square   F Value   Pr > F
Model              6     722.54361   120.42393     22.43   <.0001
Error             24     128.83794     5.36825
Corrected Total   30     851.38154

            Parameter    Standard
Variable     Estimate       Error    Type II SS   F Value   Pr > F
Intercept   102.93448    12.40326     369.72831     68.87   <.0001
Age          -0.22697     0.09984      27.74577      5.17   0.0322
Weight       -0.07418     0.05459       9.91059      1.85   0.1869
RunTime      -2.62865     0.38456     250.82210     46.72   <.0001
RunPulse     -0.36963     0.11985      51.05806      9.51   0.0051
RestPulse    -0.02153     0.06605       0.57051      0.11   0.7473
MaxPulse      0.30322     0.13650      26.49142      4.93   0.0360

Bounds on condition number: 8.7438, 137.13
Note that for all three of these methods, RestPulse contributes least to the model. In the case of
forward selection, it is not added to the model. In the case of backward selection, it is the first
variable to be removed from the model. In the case of MAXR selection, RestPulse is included only
for the full model.
For the STEPWISE, BACKWARD, and FORWARD selection methods, you can control the amount
of detail displayed by using the DETAILS option, and you can use ODS Graphics to produce
plots that show how selection criteria progress as the selection proceeds. For example, the following statements display only the selection summary table for the FORWARD selection method
(Output 73.2.4) and produce the plots shown in Output 73.2.5 and Output 73.2.6.
ods graphics on;
proc reg data=fitness plots=(criteria sbc);
model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
/ selection=forward details=summary;
run;
Output 73.2.4 Forward Selection Summary
The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

                        Summary of Forward Selection

        Variable    Number     Partial       Model
Step    Entered     Vars In   R-Square    R-Square       C(p)   F Value   Pr > F
   1    RunTime        1        0.7434      0.7434    13.6988     84.01   <.0001
   2    Age            2        0.0209      0.7642    12.3894      2.48   0.1267
   3    RunPulse       3        0.0468      0.8111     6.9596      6.70   0.0154
   4    MaxPulse       4        0.0257      0.8368     4.8800      4.10   0.0533
   5    Weight         5        0.0112      0.8480     5.1063      1.84   0.1871
Output 73.2.5 shows how six fit criteria progress as the forward selection proceeds. The step at
which each criterion achieves its best value is indicated. For example, the BIC criterion achieves
its minimum value for the model at step 4. Note that this does not mean that the model at step 4
achieves the smallest BIC criterion among all possible models that use a subset of the regressors;
the model at step 4 yields the smallest BIC statistic only among the models at each step of the
forward selection. Output 73.2.6 shows the progression of the SBC statistic in its own plot. If you
want to see the six selection criteria in individual plots, you can specify the UNPACK suboption of
the PLOTS=CRITERIA option in the PROC REG statement.
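The UNPACK suboption can be requested as in the following sketch. It reuses the forward-selection model from this example; the use of PLOTS(ONLY)= to suppress the other default plots is an illustrative choice rather than part of the original analysis.

```sas
ods graphics on;

/* Sketch: one plot per selection criterion instead of a single panel */
proc reg data=fitness plots(only)=criteria(unpack);
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=forward details=summary;
run;
quit;
```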
Output 73.2.5 Fit Criteria
Output 73.2.6 SBC Criterion
Next, the RSQUARE model-selection method is used to request R-square and Cp statistics for all possible
combinations of the six independent variables. The following statements produce Output 73.2.7:
proc reg data=fitness plots=(criteria(label) cp);
   model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
         / selection=rsquare cp;
   title 'Physical fitness data: all models';
run;
Output 73.2.7 All Models by the RSQUARE Method: PROC REG
Physical fitness data: all models

The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

R-Square Selection Method

Model   Number in
Index     Model     R-Square       C(p)   Variables in Model
    1       1         0.7434    13.6988   RunTime
    2       1         0.1595   106.3021   RestPulse
    3       1         0.1584   106.4769   RunPulse
    4       1         0.0928   116.8818   Age
    5       1         0.0560   122.7072   MaxPulse
    6       1         0.0265   127.3948   Weight
--------------------------------------------------------------------------------
    7       2         0.7642    12.3894   Age RunTime
    8       2         0.7614    12.8372   RunTime RunPulse
    9       2         0.7452    15.4069   RunTime MaxPulse
   10       2         0.7449    15.4523   Weight RunTime
   11       2         0.7435    15.6746   RunTime RestPulse
   12       2         0.3760    73.9645   Age RunPulse
   13       2         0.3003    85.9742   Age RestPulse
   14       2         0.2894    87.6951   RunPulse MaxPulse
   15       2         0.2600    92.3638   Age MaxPulse
   16       2         0.2350    96.3209   RunPulse RestPulse
   17       2         0.1806   104.9523   Weight RestPulse
   18       2         0.1740   105.9939   RestPulse MaxPulse
   19       2         0.1669   107.1332   Weight RunPulse
   20       2         0.1506   109.7057   Age Weight
   21       2         0.0675   122.8881   Weight MaxPulse
--------------------------------------------------------------------------------
   22       3         0.8111     6.9596   Age RunTime RunPulse
   23       3         0.8100     7.1350   RunTime RunPulse MaxPulse
   24       3         0.7817    11.6167   Age RunTime MaxPulse
   25       3         0.7708    13.3453   Age Weight RunTime
   26       3         0.7673    13.8974   Age RunTime RestPulse
   27       3         0.7619    14.7619   RunTime RunPulse RestPulse
   28       3         0.7618    14.7729   Weight RunTime RunPulse
   29       3         0.7462    17.2588   Weight RunTime MaxPulse
   30       3         0.7452    17.4060   RunTime RestPulse MaxPulse
   31       3         0.7451    17.4243   Weight RunTime RestPulse
   32       3         0.4666    61.5873   Age RunPulse RestPulse
   33       3         0.4223    68.6250   Age RunPulse MaxPulse
   34       3         0.4091    70.7102   Age Weight RunPulse
   35       3         0.3900    73.7424   Age RestPulse MaxPulse
   36       3         0.3568    79.0013   Age Weight RestPulse
   37       3         0.3538    79.4891   RunPulse RestPulse MaxPulse
   38       3         0.3208    84.7216   Weight RunPulse MaxPulse
   39       3         0.2902    89.5693   Age Weight MaxPulse
   40       3         0.2447    96.7952   Weight RunPulse RestPulse
   41       3         0.1882   105.7430   Weight RestPulse MaxPulse
--------------------------------------------------------------------------------
   42       4         0.8368     4.8800   Age RunTime RunPulse MaxPulse
   43       4         0.8165     8.1035   Age Weight RunTime RunPulse
   44       4         0.8158     8.2056   Weight RunTime RunPulse MaxPulse
   45       4         0.8117     8.8683   Age RunTime RunPulse RestPulse
   46       4         0.8104     9.0697   RunTime RunPulse RestPulse MaxPulse
   47       4         0.7862    12.9039   Age Weight RunTime MaxPulse
   48       4         0.7834    13.3468   Age RunTime RestPulse MaxPulse
   49       4         0.7750    14.6788   Age Weight RunTime RestPulse
   50       4         0.7623    16.7058   Weight RunTime RunPulse RestPulse
   51       4         0.7462    19.2550   Weight RunTime RestPulse MaxPulse
   52       4         0.5034    57.7590   Age Weight RunPulse RestPulse
   53       4         0.5025    57.9092   Age RunPulse RestPulse MaxPulse
   54       4         0.4717    62.7830   Age Weight RunPulse MaxPulse
   55       4         0.4256    70.0963   Age Weight RestPulse MaxPulse
   56       4         0.3858    76.4100   Weight RunPulse RestPulse MaxPulse
--------------------------------------------------------------------------------
   57       5         0.8480     5.1063   Age Weight RunTime RunPulse MaxPulse
   58       5         0.8370     6.8461   Age RunTime RunPulse RestPulse MaxPulse
   59       5         0.8176     9.9348   Age Weight RunTime RunPulse RestPulse
   60       5         0.8161    10.1685   Weight RunTime RunPulse RestPulse MaxPulse
   61       5         0.7887    14.5111   Age Weight RunTime RestPulse MaxPulse
   62       5         0.5541    51.7233   Age Weight RunPulse RestPulse MaxPulse
--------------------------------------------------------------------------------
   63       6         0.8487     7.0000   Age Weight RunTime RunPulse RestPulse MaxPulse
The models in Output 73.2.7 are arranged first by the number of variables in the model and then by
the magnitude of R-square for the model.
Output 73.2.8 shows the panel of fit criteria for the RSQUARE selection method. The best models
(based on the R-square statistic) for each subset size are indicated on the plots. The LABEL suboption specifies that these models are labeled by the model number that appears in the summary table
shown in Output 73.2.7.
Output 73.2.8 Fit Criteria
Output 73.2.9 shows the plot of the Cp criterion by number of regressors in the model. Useful
reference lines suggested by Mallows (1973) and Hocking (1976) are included on the plot. However,
because all possible subset models are included on this plot, the better models are all compressed
near the bottom of the plot.
Output 73.2.9 Cp Criterion
The following statements use the BEST=20 option in the MODEL statement together with SELECTION=CP to
restrict attention to the models that yield the 20 smallest values of the Cp statistic:
proc reg data=fitness plots(only)=cp(label);
model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse
/ selection=cp best=20;
run;
ods graphics off;
Output 73.2.10 shows the summary table listing the regressors in the 20 models that yield the
smallest Cp values, and Output 73.2.11 presents the results graphically. Reference lines
$C_p = 2p - p_{full}$ and $C_p = p$ are shown on this plot. See the PLOTS=CP option on page 5453 for
interpretations of these lines. For the Fitness data, these lines indicate that a six-variable model
is a reasonable choice for doing parameter estimation, while a five-variable model might be suitable
for doing prediction.
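The Cp values in these tables can be reproduced by hand. The following worked equation is a sketch that uses the usual definition of Mallows' Cp, where SSE_p is the error sum of squares of the subset model, s^2 is the error mean square of the full model, n is the number of observations, and p is the number of parameters including the intercept. The numbers are taken from step 4 of Output 73.2.1 and step 0 of Output 73.2.2 (the model with Age, RunTime, RunPulse, and MaxPulse):

```latex
C_p = \frac{SSE_p}{s^2} - (n - 2p)

C_p = \frac{138.93002}{5.36825} - (31 - 2 \times 5)
    \approx 25.8800 - 21 = 4.8800
```

For the full model, SSE_p / s^2 equals the error degrees of freedom n - p, so Cp reduces to p, which is why the six-variable model shows C(p) = 7.0000.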
Output 73.2.10 Cp Selection Summary: PROC REG
The REG Procedure
Model: MODEL1
Dependent Variable: Oxygen

C(p) Selection Method

Model   Number in
Index     Model        C(p)   R-Square   Variables in Model
    1       4        4.8800     0.8368   Age RunTime RunPulse MaxPulse
    2       5        5.1063     0.8480   Age Weight RunTime RunPulse MaxPulse
    3       5        6.8461     0.8370   Age RunTime RunPulse RestPulse MaxPulse
    4       3        6.9596     0.8111   Age RunTime RunPulse
    5       6        7.0000     0.8487   Age Weight RunTime RunPulse RestPulse MaxPulse
    6       3        7.1350     0.8100   RunTime RunPulse MaxPulse
    7       4        8.1035     0.8165   Age Weight RunTime RunPulse
    8       4        8.2056     0.8158   Weight RunTime RunPulse MaxPulse
    9       4        8.8683     0.8117   Age RunTime RunPulse RestPulse
   10       4        9.0697     0.8104   RunTime RunPulse RestPulse MaxPulse
   11       5        9.9348     0.8176   Age Weight RunTime RunPulse RestPulse
   12       5       10.1685     0.8161   Weight RunTime RunPulse RestPulse MaxPulse
   13       3       11.6167     0.7817   Age RunTime MaxPulse
   14       2       12.3894     0.7642   Age RunTime
   15       2       12.8372     0.7614   RunTime RunPulse
   16       4       12.9039     0.7862   Age Weight RunTime MaxPulse
   17       3       13.3453     0.7708   Age Weight RunTime
   18       4       13.3468     0.7834   Age RunTime RestPulse MaxPulse
   19       1       13.6988     0.7434   RunTime
   20       3       13.8974     0.7673   Age RunTime RestPulse
Output 73.2.11 Cp Criterion
Before making a final decision about which model to use, you would want to perform collinearity
diagnostics. Note that, since many different models have been fit and the choice of a final model is
based on R-square, the statistics are biased and the p-values for the parameter estimates are not valid.
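Collinearity diagnostics can be requested with options in the MODEL statement. The following statements are a minimal sketch; the four-variable model is used only as an illustration and is not presented here as the final choice.

```sas
/* Sketch: variance inflation factors, tolerances, and the          */
/* eigenvalue / condition-index analysis for one candidate model.   */
proc reg data=fitness;
   model Oxygen=Age RunTime RunPulse MaxPulse
         / vif tol collin;
run;
quit;
```

The jump in the bounds on the condition number when MaxPulse joins RunPulse (from 1.3548 at step 3 to 8.4182 at step 4 of Output 73.2.1) suggests that this pair of regressors is a natural place to look first.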
Example 73.3: Predicting Weight by Height and Age
In this example, the weights of schoolchildren are modeled as a function of their heights and ages.
The example shows the use of a BY statement with PROC REG, multiple MODEL statements, and
the OUTEST= and OUTSSCP= options, which create data sets. Here are the data:
*------------Data on Age, Weight, and Height of Children-------*
| Age (months), height (inches), and weight (pounds) were      |
| recorded for a group of school children.                     |
| From Lewis and Taylor (1967).                                |
*--------------------------------------------------------------*;
data htwt;
   input sex $ age :3.1 height weight @@;
   datalines;
f 143 56.3  85.0 f 155 62.3 105.0 f 153 63.3 108.0 f 161 59.0  92.0
f 191 62.5 112.5 f 171 62.5 112.0 f 185 59.0 104.0 f 142 56.5  69.0
f 160 62.0  94.5 f 140 53.8  68.5 f 139 61.5 104.0 f 178 61.5 103.5
f 157 64.5 123.5 f 149 58.3  93.0 f 143 51.3  50.5 f 145 58.8  89.0
... more lines ...
m 164 66.5 112.0 m 189 65.0 114.0 m 164 61.5 140.0 m 167 62.0 107.5
m 151 59.3  87.0
;
Modeling is performed separately for boys and girls. Since the BY statement is used, interactive
processing is not possible in this example; no statements can appear after the first RUN statement.
The following statements produce Output 73.3.1 through Output 73.3.4:
proc reg outest=est1 outsscp=sscp1 rsquare;
   by sex;
   eq1: model weight=height;
   eq2: model weight=height age;
proc print data=sscp1;
   title2 'SSCP type data set';
proc print data=est1;
   title2 'EST type data set';
run;
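BY-group processing requires the input data set to be sorted (or indexed) by the BY variable. The htwt data appear to be entered with all girls first, so the statements above work as written, but a more defensive sketch sorts explicitly and names the input data set. The DATA=htwt option in the PROC REG statement is an addition here; the original statements rely on the most recently created data set.

```sas
/* Sketch: make the sort order and the input data set explicit */
proc sort data=htwt;
   by sex;
run;

proc reg data=htwt outest=est1 outsscp=sscp1 rsquare;
   by sex;
   eq1: model weight=height;
   eq2: model weight=height age;
run;
quit;
```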
Output 73.3.1 Height and Weight Data: Submodel for Female Children
------------------------------------ sex=f ------------------------------------

The REG Procedure
Model: eq1
Dependent Variable: weight

                     Analysis of Variance
                            Sum of        Mean
Source            DF       Squares      Square   F Value   Pr > F
Model              1         21507       21507    141.09   <.0001
Error            109         16615   152.42739
Corrected Total  110         38121

Root MSE          12.34615    R-Square    0.5642
Dependent Mean    98.87838    Adj R-Sq    0.5602
Coeff Var         12.48620

                     Parameter Estimates
                  Parameter     Standard
Variable    DF     Estimate        Error   t Value   Pr > |t|
Intercept    1   -153.12891     21.24814     -7.21     <.0001
height       1      4.16361      0.35052     11.88     <.0001
Output 73.3.2 Height and Weight Data: Full Model for Female Children
------------------------------------ sex=f ------------------------------------

The REG Procedure
Model: eq2
Dependent Variable: weight

                     Analysis of Variance
                            Sum of        Mean
Source            DF       Squares      Square   F Value   Pr > F
Model              2         22432       11216     77.21   <.0001
Error            108         15689   145.26700
Corrected Total  110         38121

Root MSE          12.05268    R-Square    0.5884
Dependent Mean    98.87838    Adj R-Sq    0.5808
Coeff Var         12.18939

                     Parameter Estimates
                  Parameter     Standard
Variable    DF     Estimate        Error   t Value   Pr > |t|
Intercept    1   -150.59698     20.76730     -7.25     <.0001
height       1      3.60378      0.40777      8.84     <.0001
age          1      1.90703      0.75543      2.52     0.0130
Output 73.3.3 Height and Weight Data: Submodel for Male Children
------------------------------------ sex=m ------------------------------------

The REG Procedure
Model: eq1
Dependent Variable: weight

                     Analysis of Variance
                            Sum of        Mean
Source            DF       Squares      Square   F Value   Pr > F
Model              1         31126       31126    206.24   <.0001
Error            124         18714   150.92222
Corrected Total  125         49840

Root MSE          12.28504    R-Square    0.6245
Dependent Mean   103.44841    Adj R-Sq    0.6215
Coeff Var         11.87552

                     Parameter Estimates
                  Parameter     Standard
Variable    DF     Estimate        Error   t Value   Pr > |t|
Intercept    1   -125.69807     15.99362     -7.86     <.0001
height       1      3.68977      0.25693     14.36     <.0001
Output 73.3.4 Height and Weight Data: Full Model for Male Children
------------------------------------ sex=m ------------------------------------

The REG Procedure
Model: eq2
Dependent Variable: weight

                     Analysis of Variance
                            Sum of        Mean
Source            DF       Squares      Square   F Value   Pr > F
Model              2         32975       16487    120.24   <.0001
Error            123         16866   137.11922
Corrected Total  125         49840

Root MSE          11.70979    R-Square    0.6616
Dependent Mean   103.44841    Adj R-Sq    0.6561
Coeff Var         11.31945

                     Parameter Estimates
                  Parameter     Standard
Variable    DF     Estimate        Error   t Value   Pr > |t|
Intercept    1   -113.71346     15.59021     -7.29     <.0001
height       1      2.68075      0.36809      7.28     <.0001
age          1      3.08167      0.83927      3.67     0.0004
For both female and male children, the overall F statistics for both models are significant, indicating
that each model explains a significant portion of the variation in the data. For females, the full model
is

$$\text{weight} = -150.60 + 3.60\,\text{height} + 1.91\,\text{age}$$

and for males, the full model is

$$\text{weight} = -113.71 + 2.68\,\text{height} + 3.08\,\text{age}$$
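As an illustration of how the fitted equations are used, the following worked example predicts the weight of a hypothetical girl who is 60 inches tall and whose age is recorded as 14.3. Both input values are assumptions chosen for the example; the age scale assumes that the :3.1 informat in the DATA step reads a raw value such as 143 as 14.3. The coefficients are the unrounded estimates from Output 73.3.2:

```latex
\widehat{\text{weight}} = -150.59698 + 3.60378 \times 60 + 1.90703 \times 14.3
                        = -150.59698 + 216.22680 + 27.27053
                        \approx 92.90 \text{ pounds}
```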
The OUTSSCP= data set is shown in Output 73.3.5. Note how the BY groups are separated. Observations with _TYPE_=‘N’ contain the number of observations in the associated BY group. Observations with _TYPE_=‘SSCP’ contain the rows of the uncorrected sums of squares and crossproducts
matrix. The observations with _NAME_=‘Intercept’ contain crossproducts for the intercept.
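The SSCP rows contain everything needed to reproduce the simple regression of weight on height. The following worked equations are a sketch for the sex=f group, using the values shown in Output 73.3.5; the symbols S_hw and S_hh (corrected cross product and corrected sum of squares) are notation introduced here:

```latex
n = 111, \quad \sum h = 6718.4, \quad \sum w = 10975.5, \quad
\sum h^2 = 407879.32, \quad \sum h w = 669469.85

\bar{w} = \frac{10975.5}{111} \approx 98.878

S_{hw} = 669469.85 - \frac{6718.4 \times 10975.5}{111} \approx 5165.35

S_{hh} = 407879.32 - \frac{6718.4^2}{111} \approx 1240.59

b_{\text{height}} = \frac{S_{hw}}{S_{hh}} \approx 4.1636
```

These values agree with the dependent mean (98.87838) and the height estimate (4.16361) reported for model eq1 in Output 73.3.1.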
Output 73.3.5 SSCP Matrix
Obs  sex  _TYPE_  _NAME_     Intercept      height       weight        age
  1   f   SSCP    Intercept      111.0     6718.40     10975.50    1824.90
  2   f   SSCP    height        6718.4   407879.32    669469.85  110818.32
  3   f   SSCP    weight       10975.5   669469.85   1123360.75  182444.95
  4   f   SSCP    age           1824.9   110818.32    182444.95   30363.81
  5   f   N                      111.0      111.00       111.00     111.00
  6   m   SSCP    Intercept      126.0     7825.00     13034.50    2072.10
  7   m   SSCP    height        7825.0   488243.60    817919.60  129432.57
  8   m   SSCP    weight       13034.5   817919.60   1398238.75  217717.45
  9   m   SSCP    age           2072.1   129432.57    217717.45   34515.95
 10   m   N                      126.0      126.00       126.00     126.00
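The uncorrected sums of squares and crossproducts are enough to reproduce the least squares estimates. As a hedged illustration (not a description of how PROC REG computes them internally), the eq1 estimates for the male children can be recovered from the sex=m rows of the SSCP matrix; here $h$ denotes height, $w$ denotes weight, and $n = 126$, all notation introduced only for this sketch.

$$
b_{\text{height}} = \frac{\sum h w - (\sum h)(\sum w)/n}{\sum h^2 - (\sum h)^2/n}
 = \frac{817919.60 - (7825.00)(13034.50)/126}{488243.60 - (7825.00)^2/126}
 \approx \frac{8435.77}{2286.26} \approx 3.690
$$

$$
b_0 = \bar{w} - b_{\text{height}}\,\bar{h}
    = \frac{13034.50}{126} - 3.690 \cdot \frac{7825.00}{126} \approx -125.7
$$

These values match the parameter estimates for model eq1 in Output 73.3.3 to the precision carried in the arithmetic.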
The OUTEST= data set is displayed in Output 73.3.6; again, the BY groups are separated. The
_MODEL_ column contains the labels for models from the MODEL statements. If no labels were
specified, the default labels MODEL1 and MODEL2 would appear as values for _MODEL_. Note that
_TYPE_=‘PARMS’ for all observations, indicating that all observations contain parameter estimates.
The _DEPVAR_ column displays the dependent variable, and the _RMSE_ column gives the root
mean square error for the associated model. The Intercept column gives the estimate for the intercept
for the associated model, and variables with the same name as variables in the original data set
(height, age) give parameter estimates for those variables. The dependent variable, weight, is shown
with a value of -1. The _IN_ column contains the number of regressors in the model not including
the intercept; _P_ contains the number of parameters in the model; _EDF_ contains the error degrees
of freedom; and _RSQ_ contains the $R^2$ statistic. Finally, note that the _IN_, _P_, _EDF_, and _RSQ_
columns appear in the OUTEST= data set since the RSQUARE option is specified in the PROC
REG statement.
Output 73.3.6 OUTEST Data Set
Obs  sex  _MODEL_  _TYPE_  _DEPVAR_   _RMSE_  Intercept   height  weight      age  _IN_  _P_  _EDF_    _RSQ_
  1   f   eq1      PARMS   weight    12.3461   -153.129  4.16361      -1   .          1    2    109  0.56416
  2   f   eq2      PARMS   weight    12.0527   -150.597  3.60378      -1  1.90703     2    3    108  0.58845
  3   m   eq1      PARMS   weight    12.2850   -125.698  3.68977      -1   .          1    2    124  0.62451
  4   m   eq2      PARMS   weight    11.7098   -113.713  2.68075      -1  3.08167     2    3    123  0.66161
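The summary columns of the OUTEST= data set are tied to the ANOVA tables by standard identities. As a sketch for the male eq2 model (observation 4), using the rounded sums of squares from Output 73.3.4, with $n = 126$ observations and $p = 3$ parameters (so that $n - p$ equals _EDF_):

$$
R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{16866}{49840} \approx 0.6616,
\qquad
\text{RMSE} = \sqrt{\frac{SSE}{n - p}} = \sqrt{\frac{16866}{123}} \approx 11.71
$$

$$
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p} = 1 - 0.3384 \cdot \frac{125}{123} \approx 0.6561
$$

These agree with the _RSQ_ and _RMSE_ values above and with the Adj R-Sq value in Output 73.3.4; the symbols $SSE$ and $SST$ are shorthand introduced here for the error and corrected total sums of squares.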
Example 73.4: Regression with Quantitative and Qualitative Variables
At times it is desirable to have independent variables in the model that are qualitative rather than
quantitative. This is easily handled in a regression framework. Regression uses qualitative variables
to distinguish between populations. There are two main advantages of fitting both populations in
one model. You gain the ability to test for different slopes or intercepts in the populations, and more
degrees of freedom are available for the analysis.
Regression with qualitative variables is different from analysis of variance and analysis of covariance. Analysis of variance uses qualitative independent variables only. Analysis of covariance uses
quantitative variables in addition to the qualitative variables in order to account for correlation in
the data and reduce MSE; however, the quantitative variables are not of primary interest and merely
improve the precision of the analysis.
Consider the case where $Y_i$ is the dependent variable, $X_{1i}$ is a quantitative variable, $X_{2i}$ is a qualitative variable taking on values 0 or 1, and $X_{1i} X_{2i}$ is the interaction. The variable $X_{2i}$ is called a
dummy, binary, or indicator variable. With values 0 or 1, it distinguishes between two populations.
The model is of the form

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \epsilon_i$

for the observations $i = 1, 2, \ldots, n$. The parameters to be estimated are $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$. The
number of dummy variables used is one less than the number of qualitative levels. This yields a
nonsingular $X'X$ matrix. See Chapter 10 of Neter, Wasserman, and Kutner (1990) for more details.
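The "one less than the number of levels" rule can be sketched with a DATA step. The data set firms3, the variable kind, and its three levels are hypothetical and are used only to illustrate the coding; they are not part of the example that follows.

```sas
/* Hypothetical data: kind has three levels (mutual, stock, other). */
data firms3;
   input kind $ size @@;
   /* Two indicators for three levels; 'other' is the reference level */
   /* and is identified by d_mutual = 0 and d_stock = 0.              */
   d_mutual = (kind = 'mutual');
   d_stock  = (kind = 'stock');
   datalines;
mutual 151  stock 164  other 120  stock 272  mutual 92  other 210
;
run;

proc print data=firms3;
run;
```

Adding a third indicator for 'other' would make the three indicators sum to the intercept column, which is the singularity that the rule avoids.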
An example from Neter, Wasserman, and Kutner (1990) follows. An economist is investigating
the relationship between the size of an insurance firm and the speed at which it implements new
insurance innovations. He believes that the type of firm might affect this relationship and suspects
that there might be some interaction between the size and type of firm. The dummy variable in the
model enables the two firm types to have different intercepts. The interaction term enables the firm
types to have different slopes as well.
In this study, $Y_i$ is the number of months from the time the first firm implemented the innovation to
the time it was implemented by the $i$th firm. The variable $X_{1i}$ is the size of the firm, measured in
total assets of the firm. The variable $X_{2i}$ denotes the firm type; it is 0 if the firm is a mutual
company and 1 if the firm is a stock company. Together, the dummy variable and the interaction term
enable each firm type to have a different intercept and slope.
The previous model can be broken down into a model for each firm type by plugging in the values
for $X_{2i}$. If $X_{2i} = 0$, the model is

$Y_i = \beta_0 + \beta_1 X_{1i} + \epsilon_i$

This is the model for a mutual company. If $X_{2i} = 1$, the model for a stock firm is

$Y_i = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X_{1i} + \epsilon_i$

This model has intercept $\beta_0 + \beta_2$ and slope $\beta_1 + \beta_3$.
The data follow (see footnote 1). Note that the interaction term is created in the DATA step since polynomial effects
such as size*type are not allowed in the MODEL statement in the REG procedure.
title 'Regression With Quantitative and Qualitative Variables';
data insurance;
input time size type @@;
sizetype=size*type;
datalines;
17 151 0
26 92 0
21 175 0
30 31 0
22 104 0
0 277 0
12 210 0
19 120 0
4 290 0
16 238 0
28 164 1
15 272 1
11 295 1
38 68 1
31 85 1
21 224 1
20 166 1
13 305 1
30 124 1
14 246 1
;
run;
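If you prefer not to build the product variable by hand, a procedure that accepts crossed effects can fit the same model. The following is a minimal sketch that assumes PROC GLM is available and uses the insurance data set created above; it is an alternative offered for comparison, not part of the original analysis.

```sas
/* Sketch: PROC GLM constructs the indicator and interaction columns */
/* from the CLASS variable, so sizetype is not needed here.          */
proc glm data=insurance;
   class type;
   model time = size type size*type / solution;
run;
quit;
```

Because PROC GLM uses a different parameterization for CLASS variables (by default the last level, here type=1, serves as the reference), the individual estimates printed by the SOLUTION option are not expected to match the PROC REG estimates term by term, although the fitted values and the test of the interaction should agree.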
The following statements begin the analysis:
proc reg data=insurance;
model time = size type sizetype;
run;
The ANOVA table and parameter estimates are displayed in Output 73.4.1.
Output 73.4.1 ANOVA Table and Parameter Estimates
Regression With Quantitative and Qualitative Variables

The REG Procedure
Model: MODEL1
Dependent Variable: time

Analysis of Variance

                             Sum of          Mean
Source             DF       Squares        Square    F Value    Pr > F
Model               3    1504.41904     501.47301      45.49    <.0001
Error              16     176.38096      11.02381
Corrected Total    19    1680.80000

Root MSE            3.32021    R-Square    0.8951
Dependent Mean     19.40000    Adj R-Sq    0.8754
Coeff Var          17.11450

Parameter Estimates

                    Parameter      Standard
Variable    DF       Estimate         Error    t Value    Pr > |t|
Intercept    1       33.83837       2.44065      13.86      <.0001
size         1       -0.10153       0.01305      -7.78      <.0001
type         1        8.13125       3.65405       2.23      0.0408
sizetype     1    -0.00041714       0.01833      -0.02      0.9821

1 From Neter, J., et al., Applied Linear Statistical Models, Third Edition, Copyright (c) 1990, Richard D. Irwin.
Reprinted with permission of The McGraw-Hill Companies.
The overall F statistic is significant (F = 45.490, p < 0.0001). The interaction term is not significant
(t = -0.023, p = 0.9821). Hence, this term should be removed and the model refitted, as shown in the
following statements:
delete sizetype;
print;
run;
The DELETE statement removes the interaction term (sizetype) from the model. The new ANOVA
and parameter estimates tables are shown in Output 73.4.2.
Output 73.4.2 ANOVA Table and Parameter Estimates
Analysis of Variance

                             Sum of          Mean
Source             DF       Squares        Square    F Value    Pr > F
Model               2    1504.41333     752.20667      72.50    <.0001
Error              17     176.38667      10.37569
Corrected Total    19    1680.80000

Root MSE            3.22113    R-Square    0.8951
Dependent Mean     19.40000    Adj R-Sq    0.8827
Coeff Var          16.60377

Parameter Estimates

                   Parameter      Standard
Variable    DF      Estimate         Error    t Value    Pr > |t|
Intercept    1      33.87407       1.81386      18.68      <.0001
size         1      -0.10174       0.00889     -11.44      <.0001
type         1       8.05547       1.45911       5.52      <.0001
The overall F statistic is still significant (F = 72.497, p < 0.0001). The intercept and the coefficients
associated with size and type are significantly different from zero (t = 18.675, p < 0.0001; t = -11.443,
p < 0.0001; t = 5.521, p < 0.0001, respectively). Notice that, to the four decimal places shown, the $R^2$
did not change with the omission of the interaction term.
The fitted model is

$\text{time} = 33.87 - 0.102\,\text{size} + 8.055\,\text{type}$

The fitted model for a mutual company ($X_{2i} = 0$) is

$\text{time} = 33.87 - 0.102\,\text{size}$

and the fitted model for a stock company ($X_{2i} = 1$) is

$\text{time} = (33.87 + 8.055) - 0.102\,\text{size}$

So the two models have different intercepts but the same slope.
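As a small worked illustration of the two fitted lines, consider a firm with size 200. The value 200 is chosen here only for illustration, and the arithmetic uses the rounded coefficients from the refitted model.

$$
\text{mutual: } 33.87 - 0.102(200) = 13.47,
\qquad
\text{stock: } (33.87 + 8.055) - 0.102(200) \approx 21.53
$$

At any given size, the predicted time for a stock company exceeds that for a mutual company by the type coefficient, about 8.06 months.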
The following statements first use an OUTPUT statement to save the residuals and predicted values
from the new model in the OUT= data set. Next PROC SGPLOT is used to produce Output 73.4.3,
which plots residuals versus predicted values. The firm type is used as the plot symbol; this can be
useful in determining if the firm types have different residual patterns.
output out=out r=r p=p;
run;
proc sgplot data=out;
scatter x=p y=r / markerchar=type group=type;
run;
Output 73.4.3 Plot of Residual vs. Predicted Values
The residuals show no major trend. Neither firm type by itself shows a trend either. This indicates
that the model is satisfactory.
The following statements produce the plot of the predicted values versus size that appears in
Output 73.4.4, where the firm type is again used as the plotting symbol:
proc sgplot data=out;
scatter x=size y=p / markerchar=type group=type;
run;
Output 73.4.4 Plot of Predicted vs. Size
The different intercepts are very evident in this plot.
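The decision made earlier in this example to drop the interaction term can also be framed as a hypothesis test on the full model. The following sketch, run as a separate PROC REG step, uses the TEST statement; the labels EqualSlopes and SameLine are names chosen here for illustration.

```sas
/* Sketch: refit the full model and test the interaction explicitly. */
proc reg data=insurance;
   model time = size type sizetype;
   /* H0: both firm types share one slope */
   EqualSlopes: test sizetype = 0;
   /* H0: both firm types share one line (same intercept and slope) */
   SameLine:    test type = 0, sizetype = 0;
run;
quit;
```

The single-parameter test is equivalent to the t test for sizetype reported in Output 73.4.1, so it should lead to the same conclusion; the joint test asks the broader question of whether firm type matters at all.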
Example 73.5: Ridge Regression for Acetylene Data
This example uses the acetylene data in Marquardt and Snee (1975) to illustrate the RIDGEPLOT
and OUTVIF options. Here are the data:
data acetyl;
   input x1-x4 @@;
   x1x2 = x1 * x2;
   x1x1 = x1 * x1;
   label x1  = 'reactor temperature(celsius)'
         x2  = 'h2 to n-heptone ratio'
         x3  = 'contact time(sec)'
         x4  = 'conversion percentage'
         x1x2= 'temperature-ratio interaction'
         x1x1= 'squared temperature';
   datalines;
1300  7.5 .012  49     1300  9   .012   50.2   1300 11 .0115 50.5
1300 13.5 .013  48.5   1300 17   .0135  47.5   1300 23 .012  44.5
1200  5.3 .04   28     1200  7.5 .038   31.5   1200 11 .032  34.5
1200 13.5 .026  35     1200 17   .034   38     1200 23 .041  38.5
1100  5.3 .084  15     1100  7.5 .098   17     1100 11 .092  20.5
1100 17   .086  29.5
;
ods graphics on;
proc reg data=acetyl outvif
outest=b ridge=0 to 0.02 by .002;
model x4=x1 x2 x3 x1x2 x1x1;
run;
proc print data=b;
run;
When you enable ODS Graphics and you request ridge regression by using the RIDGE= option in
the PROC REG statement, PROC REG produces a panel showing variance inflation factors (VIF) in
the upper plot in the panel and ridge traces in the lower plot. This panel is shown in Output 73.5.1.
Output 73.5.1 Ridge Regression and VIF Traces
The OUTVIF option outputs the variance inflation factors to the OUTEST= data set that is shown
in Output 73.5.2.
Output 73.5.2 OUTEST Data Set Showing VIF Values
Obs  _MODEL_  _TYPE_    _DEPVAR_  _RIDGE_  _PCOMIT_   _RMSE_  Intercept
  1  MODEL1   PARMS     x4          .         .      1.15596    390.538
  2  MODEL1   RIDGEVIF  x4         0.000      .       .           .
  3  MODEL1   RIDGE     x4         0.000      .      1.15596    390.538
  4  MODEL1   RIDGEVIF  x4         0.002      .       .           .
  5  MODEL1   RIDGE     x4         0.002      .      2.69721   -103.388
  6  MODEL1   RIDGEVIF  x4         0.004      .       .           .
  7  MODEL1   RIDGE     x4         0.004      .      3.22340    -93.797
  8  MODEL1   RIDGEVIF  x4         0.006      .       .           .
  9  MODEL1   RIDGE     x4         0.006      .      3.47752    -87.687
 10  MODEL1   RIDGEVIF  x4         0.008      .       .           .
 11  MODEL1   RIDGE     x4         0.008      .      3.62677    -83.593
 12  MODEL1   RIDGEVIF  x4         0.010      .       .           .
 13  MODEL1   RIDGE     x4         0.010      .      3.72505    -80.603
 14  MODEL1   RIDGEVIF  x4         0.012      .       .           .
 15  MODEL1   RIDGE     x4         0.012      .      3.79477    -78.276
 16  MODEL1   RIDGEVIF  x4         0.014      .       .           .
 17  MODEL1   RIDGE     x4         0.014      .      3.84693    -76.381
 18  MODEL1   RIDGEVIF  x4         0.016      .       .           .
 19  MODEL1   RIDGE     x4         0.016      .      3.88750    -74.785
 20  MODEL1   RIDGEVIF  x4         0.018      .       .           .
 21  MODEL1   RIDGE     x4         0.018      .      3.92004    -73.407
 22  MODEL1   RIDGEVIF  x4         0.020      .       .           .
 23  MODEL1   RIDGE     x4         0.020      .      3.94679    -72.193

Obs        x1       x2        x3     x1x2     x1x1   x4
  1     -0.78   10.174  -121.626   -0.008     0.00   -1
  2   7682.37  320.022    53.525  344.545  6643.32   -1
  3     -0.78   10.174  -121.626   -0.008     0.00   -1
  4     11.18   58.731    10.744   63.208    11.22   -1
  5      0.05    4.404    -9.065   -0.003     0.00   -1
  6      4.36   23.939     9.996   25.744     5.15   -1
  7      0.06    2.839   -21.338   -0.002     0.00   -1
  8      2.93   13.011     9.383   13.976     3.81   -1
  9      0.06    2.110   -28.447   -0.001     0.00   -1
 10      2.36    8.224     8.838    8.821     3.23   -1
 11      0.06    1.689   -33.377   -0.001     0.00   -1
 12      2.04    5.709     8.343    6.112     2.89   -1
 13      0.06    1.414   -37.177   -0.001     0.00   -1
 14      1.84    4.226     7.891    4.514     2.65   -1
 15      0.06    1.221   -40.297   -0.001     0.00   -1
 16      1.69    3.279     7.476    3.493     2.46   -1
 17      0.06    1.078   -42.965   -0.001     0.00   -1
 18      1.57    2.637     7.094    2.801     2.31   -1
 19      0.06    0.968   -45.309   -0.001     0.00   -1
 20      1.47    2.182     6.741    2.310     2.18   -1
 21      0.06    0.880   -47.407   -0.000     0.00   -1
 22      1.39    1.847     6.415    1.949     2.06   -1
 23      0.06    0.809   -49.310   -0.000     0.00   -1
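Because the OUTEST= data set interleaves coefficient rows and VIF rows, it can be easier to read them separately. The following sketch assumes the data set b created by the preceding PROC REG step and simply filters on the _TYPE_ values shown in Output 73.5.2.

```sas
/* Ridge coefficient rows only: one row per ridge constant. */
proc print data=b noobs;
   where _TYPE_ = 'RIDGE';
   var _RIDGE_ _RMSE_ Intercept x1 x2 x3 x1x2 x1x1;
run;

/* Matching variance inflation factors for the same ridge constants. */
proc print data=b noobs;
   where _TYPE_ = 'RIDGEVIF';
   var _RIDGE_ x1 x2 x3 x1x2 x1x1;
run;
```

Reading down the second listing shows how quickly the VIF values for x1 and x1x1 fall from several thousand at a ridge constant of 0 to about 11 at 0.002.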
If you want to obtain separate plots containing the ridge traces and VIF traces, you can specify the
UNPACK suboption in the PLOTS=RIDGE option. You can also request that one or both of the
VIF axis and ridge parameter axis be displayed on a logarithmic scale. You can see in Output 73.5.1
that the VIF traces for several of the parameters are nearly indistinguishable when displayed on a
linear scale. The following code illustrates how you obtain separate VIF and ridge traces with the
VIF values displayed on a logarithmic scale. Note that you can obtain plots of VIF values even
though you do not specify the OUTVIF option in the PROC REG statement.
proc reg data=acetyl plots(only)=ridge(unpack VIFaxis=log)
outest=b ridge=0 to 0.02 by .002;
model x4=x1 x2 x3 x1x2 x1x1;
run;
ods graphics off;
The requested plots are shown in Output 73.5.3 and Output 73.5.4.
Output 73.5.3 VIF Traces
Output 73.5.4 Ridge Traces
Example 73.6: Chemical Reaction Response
This example shows how you can use lack-of-fit tests with the REG procedure. See the section
“Testing for Lack of Fit” for details about lack-of-fit tests.
In a study of the percentage of raw material that responds in a reaction, researchers identified the
following five factors:
the feed rate of the chemicals (FeedRate), ranging from 10 to 15 liters per minute
the percentage of the catalyst (Catalyst), ranging from 1% to 2%
the agitation rate of the reactor (AgitRate), ranging from 100 to 120 revolutions per minute
the temperature (Temperature), ranging from 140 to 180 degrees Celsius
the concentration (Concentration), ranging from 3% to 6%
The following data set contains the results of an experiment designed to estimate main effects for
all factors:
data reaction;
   input FeedRate Catalyst AgitRate Temperature
         Concentration ReactionPercentage;
   datalines;
10.0  1.0  100  140  6.0  37.5
10.0  1.0  120  180  3.0  28.5
10.0  2.0  100  180  3.0  40.4
10.0  2.0  120  140  6.0  48.2
15.0  1.0  100  180  6.0  50.7
15.0  1.0  120  140  3.0  28.9
15.0  2.0  100  140  3.0  43.5
15.0  2.0  120  180  6.0  64.5
12.5  1.5  110  160  4.5  39.0
12.5  1.5  110  160  4.5  40.3
12.5  1.5  110  160  4.5  38.7
12.5  1.5  110  160  4.5  39.7
;
The first eight runs of this experiment enable orthogonal estimation of the main effects for all
factors. The last four comprise four replicates of the centerpoint.
The following statements fit a linear model. Because this experiment includes replications, you can
test for lack of fit by using the LACKFIT option in the MODEL statement.
proc reg data=reaction;
model ReactionPercentage=FeedRate Catalyst AgitRate
Temperature Concentration / lackfit;
run;
Output 73.6.1 shows that the lack of fit for the linear model is significant (F = 22.07, p = 0.0151),
indicating that a more complex model is required. Models that include interactions should be investigated.
In this case, doing so requires additional experimentation to obtain appropriate data for estimating the effects.
Output 73.6.1 Analysis of Variance
The REG Procedure
Model: MODEL1
Dependent Variable: ReactionPercentage

Analysis of Variance

                             Sum of          Mean
Source             DF       Squares        Square    F Value    Pr > F
Model               5     990.27000     198.05400      33.29    0.0003
Error               6      35.69917       5.94986
  Lack of Fit       3      34.15167      11.38389      22.07    0.0151
  Pure Error        3       1.54750       0.51583
Corrected Total    11    1025.96917

Root MSE            2.43923    R-Square    0.9652
Dependent Mean     41.65833    Adj R-Sq    0.9362
Coeff Var           5.85533
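The lack-of-fit F statistic in Output 73.6.1 is the ratio of the lack-of-fit mean square to the pure-error mean square. The pure-error degrees of freedom come from the four centerpoint replicates (4 - 1 = 3), and the remaining 6 - 3 = 3 error degrees of freedom are attributed to lack of fit. The following DATA step is a sketch that redoes this arithmetic from the printed sums of squares; the variable names are chosen here for illustration.

```sas
/* Sketch: recompute the lack-of-fit test from the ANOVA table values. */
data _null_;
   ms_lof  = 34.15167 / 3;         /* lack-of-fit mean square        */
   ms_pure = 1.54750 / 3;          /* pure-error mean square         */
   f = ms_lof / ms_pure;           /* F statistic on (3, 3) df       */
   p = 1 - probf(f, 3, 3);         /* upper-tail probability         */
   put f= p=;                      /* write both values to the log   */
run;
```

The values written to the log should agree, up to rounding, with the F Value and Pr > F entries on the Lack of Fit row.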
References
Akaike, H. (1969), “Fitting Autoregressive Models for Prediction,” Annals of the Institute of Statistical Mathematics, 21, 243–247.
Allen, D. M. (1971), “Mean Square Error of Prediction as a Criterion for Selecting Variables,”
Technometrics, 13, 469–475.
Allen, D. M. and Cady, F. B. (1982), Analyzing Experimental Data by Regression, Belmont, CA:
Lifetime Learning Publications.
Amemiya, T. (1976), “Selection of Regressors,” Technical Report No. 225, Stanford, CA: Stanford
University.
Belsley, D. A., Kuh, E., and Welsch, R. E. (1980), Regression Diagnostics, New York: John Wiley
& Sons.
Berk, K. N. (1977), “Tolerance and Condition in Regression Computations,” Journal of the American Statistical Association, 72, 863–866.
Bock, R. D. (1975), Multivariate Statistical Methods in Behavioral Research, New York: McGraw-Hill.
Box, G. E. P. (1966), “The Use and Abuse of Regression,” Technometrics, 8, 625–629.
Cleveland, W. S. (1993), Visualizing Data, Summit, NJ: Hobart Press.
Collier Books (1987), The 1987 Baseball Encyclopedia Update, New York: Macmillan.
Cook, R. D. (1977), “Detection of Influential Observations in Linear Regression,” Technometrics,
19, 15–18.
Cook, R. D. (1979), “Influential Observations in Linear Regression,” Journal of the American Statistical Association, 74, 169–174.
Daniel, C. and Wood, F. (1980), Fitting Equations to Data, Revised Edition, New York: John Wiley
& Sons.
Darlington, R. B. (1968), “Multiple Regression in Psychological Research and Practice,” Psychological Bulletin, 69, 161–182.
Draper, N. and Smith, H. (1981), Applied Regression Analysis, Second Edition, New York: John
Wiley & Sons.
Durbin, J. and Watson, G. S. (1951), “Testing for Serial Correlation in Least Squares Regression,”
Biometrika, 37, 409–428.
Freund, R. J. and Littell, R. C. (1986), SAS System for Regression, 1986 Edition, Cary, NC: SAS
Institute Inc.
Furnival, G. M. and Wilson, R. W. (1974), “Regression by Leaps and Bounds,” Technometrics, 16,
499–511.
Goodnight, J. H. (1979), “A Tutorial on the SWEEP Operator,” The American Statistician, 33,
149–158. (Also available as The Sweep Operator: Its Importance in Statistical Computing, SAS
Technical Report R-106.)
Hocking, R. R. (1976), “The Analysis and Selection of Variables in Linear Regression,” Biometrics,
32, 1–50.
Johnston, J. (1972), Econometric Methods, New York: McGraw-Hill.
Judge, G. G., Griffiths, W. E., Hill, R. C., and Lee, T. (1980), The Theory and Practice of Econometrics, New York: John Wiley & Sons.
Judge, G. G., Griffiths, W. E., Hill, R. C., Lutkepohl, H., and Lee, T. C. (1985), The Theory and
Practice of Econometrics, Second Edition, New York: John Wiley & Sons.
Kennedy, W. J. and Gentle, J. E. (1980), Statistical Computing, New York: Marcel Dekker.
LaMotte, L. R. (1994), “A Note on the Role of Independence in t Statistics Constructed from Linear
Statistics in Regression Models,” The American Statistician, 48, 238–240.
Lewis, T. and Taylor, L. R. (1967), Introduction to Experimental Ecology, New York: Academic
Press.
Long, J. S. and Ervin, L. H. (2000), “Correcting for Heteroscedasticity with Heteroscedasticity Consistent Standard Errors in the Linear Regression Model: Small Sample Considerations,” The American Statistician, 54, 217–224.
Lord, F. M. (1950), “Efficiency of Prediction When a Progression Equation from One Sample Is
Used in a New Sample,” Research Bulletin No. 50-40, Princeton, NJ: Educational Testing Service.
MacKinnon, J. G. and White, H. (1985), “Some Heteroskedasticity Consistent Covariance Matrix
Estimators with Improved Finite Sample Properties,” Journal of Econometrics, 29, 53–57.
Mallows, C. L. (1967), “Choosing a Subset Regression,” unpublished report, Bell Telephone Laboratories.
Mallows, C. L. (1973), “Some Comments on Cp ,” Technometrics, 15, 661–675.
Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979), Multivariate Analysis, London: Academic Press.
Marquardt, D. W. and Snee, R. D. (1975), “Ridge Regression in Practice,” American Statistician,
29 (1), 3–20.
Morrison, D. F. (1976), Multivariate Statistical Methods, Second Edition, New York: McGraw-Hill.
Mosteller, F. and Tukey, J. W. (1977), Data Analysis and Regression, Reading, MA: Addison-Wesley.
Neter, J., Wasserman, W., and Kutner, M. H. (1990), Applied Linear Statistical Models, Third
Edition, Homewood, IL: Irwin.
Nicholson, G. E., Jr. (1948), “The Application of a Regression Equation to a New Sample,” unpublished Ph.D. dissertation, University of North Carolina at Chapel Hill.
Pillai, K. C. S. (1960), Statistical Table for Tests of Multivariate Hypotheses, Manila: The Statistical
Center, University of the Philippines.
Pindyck, R. S. and Rubinfeld, D. L. (1981), Econometric Models and Econometric Forecasts, Second Edition, New York: McGraw-Hill.
Pringle, R. M. and Rayner, A. A. (1971), Generalized Inverse Matrices with Applications to Statistics, New York: Hafner Publishing.
Rao, C. R. (1973), Linear Statistical Inference and Its Applications, Second Edition, New York:
John Wiley & Sons.
Rawlings, J. O. (1988), Applied Regression Analysis: A Research Tool, Belmont, CA: Wadsworth.
Rothman, D. (1968), letter to the editor, Technometrics, 10, 432.
Sall, J. P. (1981), SAS Regression Applications, Revised Edition, SAS Technical Report A-102,
Cary, NC: SAS Institute Inc.
Sawa, T. (1978), “Information Criteria for Discriminating among Alternative Regression Models,”
Econometrica, 46, 1273–1282.
Schwarz, G. (1978), “Estimating the Dimension of a Model,” Annals of Statistics, 6, 461–464.
Sports Illustrated, April 20, 1987.
Stein, C. (1960), “Multiple Regression,” in Contributions to Probability and Statistics, eds. I. Olkin
et al., Stanford, CA: Stanford University Press.
Timm, N. H. (1975), Multivariate Analysis with Applications in Education and Psychology, Monterey, CA: Brooks-Cole.
Weisberg, S. (1980), Applied Linear Regression, New York: John Wiley & Sons.
White, H. (1980), “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test
for Heteroskedasticity,” Econometrica, 48, 817–838.
AIC option, 5487
ANNOTATE= option, 5487
BIC option, 5487
CAXIS= option, 5487
CFRAME= option, 5487
CHOCKING= option, 5487
CHREF= option, 5487
CLEAR option, 5491
CLINE= option, 5487
CMALLOWS= option, 5488
COLLECT option, 5491
CONF option, 5488
CP option, 5488
CTEXT= option, 5488
CVREF= option, 5488
DESCRIPTION= option, 5488
EDF option, 5488
GMSEP option, 5488
HAXIS= option, 5488
HPLOTS= option, 5492
HREF= option, 5488
IN option, 5489
JP option, 5489
LEGEND= option, 5489
LHREF= option, 5489
LLINE= option, 5489
LVREF= option, 5489
MODELFONT option, 5489
MODELHT option, 5489
MODELLAB option, 5489
MSE option, 5489
NAME= option, 5489
NOCOLLECT option, 5492
NOLINE option, 5489
NOMODEL option, 5490
NOSTAT option, 5490
NP option, 5490
OVERLAY option, 5490, 5492
PC option, 5490
PRED option, 5490
RIDGEPLOT option, 5490
SBC option, 5490
SP option, 5490
SSE option, 5490
STATFONT option, 5490
STATHT option, 5491
summary of options, 5484, 5485
SYMBOL= option, 5492
USEALL option, 5491
VAXIS= option, 5491
VPLOTS= option, 5493
VREF= option, 5491
REG procedure, PRINT statement, 5493
REG procedure, PROC REG statement, 5447
ALL option, 5448
ALPHA= option, 5448
ANNOTATE= option, 5448
CORR option, 5448
COVOUT option, 5448
DATA= option, 5449
EDF option, 5449
GOUT= option, 5449
LINEPRINTER option, 5449
NOPRINT option, 5449
OUTEST= option, 5449
OUTSEB option, 5449
OUTSSCP= option, 5449
OUTSTB option, 5450
OUTVIF option, 5450
PCOMIT= option, 5450
PLOT option, 5450
PLOTS option, 5450
PRESS option, 5460
RIDGE= option, 5460
RSQUARE option, 5460
SIMPLE option, 5460
SINGULAR= option, 5460
TABLEOUT option, 5461
USSCP option, 5461
REG procedure, REFIT statement, 5494
REG procedure, RESTRICT statement, 5494
REG procedure, REWEIGHT statement, 5496
ALLOBS option, 5498
NOLIST option, 5498
RESET option, 5498
STATUS option, 5499
UNDO option, 5499
WEIGHT= option, 5499
REG procedure, TEST statement, 5500
PRINT option, 5500
REG procedure, VAR statement, 5501
REG procedure, WEIGHT statement, 5501
RESET option
PAINT statement (REG), 5480
REWEIGHT statement (REG), 5498
RESTRICT statement
REG procedure, 5494
REWEIGHT statement, REG procedure, 5496
RIDGE= option
MODEL statement (REG), 5471
PROC REG statement, 5460
RIDGEPLOT option
PLOT statement (REG), 5490
RMSE option
MODEL statement (REG), 5472
RSQUARE option
MODEL statement (REG), 5472
PROC REG statement, 5460
SBC option
MODEL statement (REG), 5472
PLOT statement (REG), 5490
SCORR1 option
MODEL statement (REG), 5472
SCORR2 option
MODEL statement (REG), 5472
SELECTION= option
MODEL statement (REG), 5472
REG procedure, MODEL statement, 5429
SEQB option
MODEL statement (REG), 5473
SIGMA= option
MODEL statement (REG), 5473
SIMPLE option
PROC REG statement, 5460
SINGULAR= option
MODEL statement (REG), 5473
PROC REG statement, 5460
SLENTRY= option
MODEL statement (REG), 5473
SLSTAY= option
MODEL statement (REG), 5473
SP option
MODEL statement (REG), 5473
PLOT statement (REG), 5490
SPEC option
MODEL statement (REG), 5473
SS1 option
MODEL statement (REG), 5473
SS2 option
MODEL statement (REG), 5473
SSE option
MODEL statement (REG), 5473
PLOT statement (REG), 5490
START= option
MODEL statement (REG), 5474
STATFONT option
PLOT statement (REG), 5490
STATHT option
PLOT statement (REG), 5491
STATUS option
PAINT statement (REG), 5481
REWEIGHT statement (REG), 5499
STB option
MODEL statement (REG), 5474
STOP= option
MODEL statement (REG), 5474
SYMBOL= option
PAINT statement (REG), 5480
PLOT statement (REG), 5492
TABLEOUT option
PROC REG statement, 5461
TEST statement
REG procedure, 5500
TOL option
MODEL statement (REG), 5474
UNDO option
PAINT statement (REG), 5481
REWEIGHT statement (REG), 5499
USEALL option
PLOT statement (REG), 5491
USSCP option
PROC REG statement, 5461
VAR statement
REG procedure, 5501
VAXIS= option
PLOT statement (REG), 5491
VIF option
MODEL statement (REG), 5474
VPLOTS= option
PLOT statement (REG), 5493
VREF= option
PLOT statement (REG), 5491
WEIGHT statement
REG procedure, 5501
WEIGHT= option
REWEIGHT statement (REG), 5499
WHITE option
MODEL statement (REG), 5474
XPX option
MODEL statement (REG), 5474
Your Turn
We welcome your feedback.
If you have comments about this book, please send them to yourturn@sas.com. Include the full title and page numbers (if applicable).
If you have comments about the software, please send them to suggest@sas.com.