SAS/STAT® 13.1 User’s Guide
The GENMOD Procedure
This document is an individual chapter from SAS/STAT® 13.1 User’s Guide.
The correct bibliographic citation for the complete manual is as follows: SAS Institute Inc. 2013. SAS/STAT® 13.1 User’s Guide.
Cary, NC: SAS Institute Inc.
Copyright © 2013, SAS Institute Inc., Cary, NC, USA
All rights reserved. Produced in the United States of America.
For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by
any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute
Inc.
For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time
you acquire this publication.
The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is
illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic
piracy of copyrighted materials. Your support of others’ rights is appreciated.
U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software
developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication or
disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as
applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a) and DFAR 227.7202-4 and, to the extent required under U.S.
federal law, the minimum restricted rights as set out in FAR 52.227-19 (DEC 2007). If FAR 52.227-19 is applicable, this provision
serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or documentation. The
Government’s rights in Software and documentation shall be only those set forth in this Agreement.
SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513-2414.
December 2013
SAS provides a complete selection of books and electronic products to help customers use SAS® software to its fullest potential. For
more information about our offerings, visit support.sas.com/bookstore or call 1-800-727-3228.
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the
USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
Chapter 42
The GENMOD Procedure
Contents
Overview: GENMOD Procedure
    What Is a Generalized Linear Model?
    Examples of Generalized Linear Models
    The GENMOD Procedure
Getting Started: GENMOD Procedure
    Poisson Regression
    Bayesian Analysis of a Linear Regression Model
    Generalized Estimating Equations
Syntax: GENMOD Procedure
    PROC GENMOD Statement
    ASSESS Statement
    BAYES Statement
    BY Statement
    CLASS Statement
    CODE Statement
    CONTRAST Statement
    DEVIANCE Statement
    EFFECTPLOT Statement
    ESTIMATE Statement
    EXACT Statement
    EXACTOPTIONS Statement
    FREQ Statement
    FWDLINK Statement
    INVLINK Statement
    LSMEANS Statement
    LSMESTIMATE Statement
    MODEL Statement
    OUTPUT Statement
    Programming Statements
    REPEATED Statement
    SLICE Statement
    STORE Statement
    STRATA Statement
    VARIANCE Statement
    WEIGHT Statement
    ZEROMODEL Statement
Details: GENMOD Procedure
    Generalized Linear Models Theory
    Specification of Effects
    Parameterization Used in PROC GENMOD
    Type 1 Analysis
    Type 3 Analysis
    Confidence Intervals for Parameters
    F Statistics
    Lagrange Multiplier Statistics
    Predicted Values of the Mean
    Residuals
    Multinomial Models
    Zero-Inflated Models
    Tweedie Distribution For Generalized Linear Models
    Generalized Estimating Equations
    Assessment of Models Based on Aggregates of Residuals
    Case Deletion Diagnostic Statistics
    Bayesian Analysis
    Exact Logistic and Exact Poisson Regression
    Missing Values
    Displayed Output for Classical Analysis
    Displayed Output for Bayesian Analysis
    Displayed Output for Exact Analysis
    ODS Table Names
    ODS Graphics
Examples: GENMOD Procedure
    Example 42.1: Logistic Regression
    Example 42.2: Normal Regression, Log Link
    Example 42.3: Gamma Distribution Applied to Life Data
    Example 42.4: Ordinal Model for Multinomial Data
    Example 42.5: GEE for Binary Data with Logit Link Function
    Example 42.6: Log Odds Ratios and the ALR Algorithm
    Example 42.7: Log-Linear Model for Count Data
    Example 42.8: Model Assessment of Multiple Regression Using Aggregates of Residuals
    Example 42.9: Assessment of a Marginal Model for Dependent Data
    Example 42.10: Bayesian Analysis of a Poisson Regression Model
    Example 42.11: Exact Poisson Regression
    Example 42.12: Tweedie Regression
References
Overview: GENMOD Procedure
The GENMOD procedure fits generalized linear models, as defined by Nelder and Wedderburn (1972).
The class of generalized linear models is an extension of traditional linear models that allows the mean
of a population to depend on a linear predictor through a nonlinear link function and allows the response
probability distribution to be any member of an exponential family of distributions. Many widely used
statistical models are generalized linear models. These include classical linear models with normal errors,
logistic and probit models for binary data, and log-linear models for multinomial data. Many other useful
statistical models can be formulated as generalized linear models by the selection of an appropriate link
function and response probability distribution.
See McCullagh and Nelder (1989) for a discussion of statistical modeling using generalized linear models.
The books by Aitkin et al. (1989) and Dobson (1990) are also excellent references with many examples of
applications of generalized linear models. Firth (1991) provides an overview of generalized linear models.
Myers, Montgomery, and Vining (2002) provide applications of generalized linear models in the engineering
and physical sciences. Collett (2003) and Hilbe (2009) provide comprehensive accounts of generalized linear
models when the responses are binary.
The analysis of correlated data arising from repeated measurements when the measurements are assumed to
be multivariate normal has been studied extensively. However, the normality assumption might not always be
reasonable; for example, different methodology must be used in the data analysis when the responses are
discrete and correlated. Generalized estimating equations (GEEs) provide a practical method with reasonable
statistical efficiency to analyze such data.
Liang and Zeger (1986) introduced GEEs as a method of dealing with correlated data when, except for the
correlation among responses, the data can be modeled as a generalized linear model. For example, correlated
binary and count data in many cases can be modeled in this way.
The GENMOD procedure can fit models to correlated responses by the GEE method. You can use PROC
GENMOD to fit models with most of the correlation structures from Liang and Zeger (1986) by using GEEs.
For more details on GEEs, see Hardin and Hilbe (2003); Diggle, Liang, and Zeger (1994); Lipsitz et al.
(1994).
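As a minimal sketch of how a GEE fit might be requested, the following statements assume a hypothetical data set visits that contains a binary response y, a treatment factor trt, a continuous covariate time, and a subject identifier id; none of these names come from this chapter. The REPEATED statement identifies the clusters, and TYPE=EXCH requests an exchangeable working correlation structure.

```sas
/* Sketch only: the data set visits and the variables id, trt, time, */
/* and y are hypothetical names chosen for illustration.            */
proc genmod data=visits descending;   /* DESCENDING models Pr(y=1) for a 0/1 response */
   class id trt;                      /* the SUBJECT= variable must be a CLASS variable */
   model y = trt time / dist=binomial link=logit;
   repeated subject=id / type=exch corrw;   /* CORRW displays the working correlation matrix */
run;
```

Other working correlation structures can be requested by changing the TYPE= value; the exchangeable structure is used here only because it is a common starting point.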
Bayesian analysis of generalized linear models can be requested by using the BAYES statement in the
GENMOD procedure. In Bayesian analysis, the model parameters are treated as random variables, and
inference about parameters is based on the posterior distribution of the parameters, given the data. The
posterior distribution is obtained using Bayes’ theorem as the likelihood function of the data weighted
with a prior distribution. The prior distribution enables you to incorporate knowledge or experience of
the likely range of values of the parameters of interest into the analysis. If you have no prior knowledge
of the parameter values, you can use a noninformative prior distribution, and the results of the Bayesian
analysis will be very similar to a classical analysis based on maximum likelihood. A closed form of the
posterior distribution is often not feasible, and a Markov chain Monte Carlo method by Gibbs sampling is
used to simulate samples from the posterior distribution. See Chapter 7, “Introduction to Bayesian Analysis
Procedures,” for an introduction to the basic concepts of Bayesian statistics. Also see the section “Bayesian
Analysis: Advantages and Disadvantages” on page 134 in Chapter 7, “Introduction to Bayesian Analysis
Procedures,” for a discussion of the advantages and disadvantages of Bayesian analysis. See Ibrahim, Chen,
and Sinha (2001) for a detailed description of Bayesian analysis.
In a Bayesian analysis, a Gibbs chain of samples from the posterior distribution is generated for the
model parameters. Summary statistics (mean, standard deviation, quartiles, HPD and credible intervals,
correlation matrix) and convergence diagnostics (autocorrelations; Gelman-Rubin, Geweke, Raftery-Lewis,
and Heidelberger and Welch tests; the effective sample size; and Monte Carlo standard errors) are computed
for each parameter, as well as the correlation matrix and the covariance matrix of the posterior sample. Trace
plots, posterior density plots, and autocorrelation function plots that are created using ODS Graphics are also
provided for each parameter.
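The following statements are a hedged sketch of how a Bayesian analysis might be requested. The data set mydata and the variables count, x1, and x2 are hypothetical, and the option values shown (seed, burn-in, and chain length) are illustrative choices rather than recommendations.

```sas
/* Sketch only: mydata, count, x1, and x2 are hypothetical names. */
ods graphics on;                       /* needed for trace, density, and autocorrelation plots */
proc genmod data=mydata;
   model count = x1 x2 / dist=poisson link=log;
   bayes seed=1                        /* fixes the random number stream for reproducibility */
         nbi=2000                      /* number of burn-in iterations */
         nmc=10000                     /* number of iterations kept after burn-in */
         outpost=post;                 /* saves the posterior samples to the data set post */
run;
ods graphics off;
```

Fixing SEED= makes the simulated chain repeatable, and the OUTPOST= data set lets you compute any additional posterior summaries yourself.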
The GENMOD procedure enables you to perform exact logistic regression, also called exact conditional
binary logistic regression, and exact Poisson regression, also called exact conditional Poisson regression, by
specifying one or more EXACT statements. You can test individual parameters or conduct a joint test for
several parameters. The procedure computes two exact tests: the exact conditional score test and the exact
conditional probability test. You can request exact estimation of specific parameters and corresponding odds
ratios where appropriate. Point estimates, standard errors, and confidence intervals are provided.
The GENMOD procedure uses ODS Graphics to create graphs as part of its output. For general information
about ODS Graphics, see Chapter 21, “Statistical Graphics Using ODS.”
What Is a Generalized Linear Model?
A traditional linear model is of the form
$$y_i = x_i'\beta + \varepsilon_i$$
where $y_i$ is the response variable for the $i$th observation. The quantity $x_i$ is a column vector of covariates, or
explanatory variables, for observation $i$ that is known from the experimental setting and is considered to be
fixed, or nonrandom. The vector of unknown coefficients $\beta$ is estimated by a least squares fit to the data $y$.
The $\varepsilon_i$ are assumed to be independent, normal random variables with zero mean and constant variance. The
expected value of $y_i$, denoted by $\mu_i$, is
$$\mu_i = x_i'\beta$$
While traditional linear models are used extensively in statistical data analysis, there are types of problems
such as the following for which they are not appropriate.
• It might not be reasonable to assume that data are normally distributed. For example, the normal
distribution (which is continuous) might not be adequate for modeling counts or measured proportions
that are considered to be discrete.
• If the mean of the data is naturally restricted to a range of values, the traditional linear model might
not be appropriate, since the linear predictor $x_i'\beta$ can take on any value. For example, the mean of a
measured proportion is between 0 and 1, but the linear predictor of the mean in a traditional linear
model is not restricted to this range.
• It might not be realistic to assume that the variance of the data is constant for all observations. For
example, it is not unusual to observe data where the variance increases with the mean of the data.
A generalized linear model extends the traditional linear model and is therefore applicable to a wider range
of data analysis problems. A generalized linear model consists of the following components:
• The linear component is defined just as it is for traditional linear models:
$$\eta_i = x_i'\beta$$
• A monotonic differentiable link function $g$ describes how the expected value of $y_i$ is related to the
linear predictor $\eta_i$:
$$g(\mu_i) = x_i'\beta$$
• The response variables $y_i$ are independent for $i = 1, 2, \ldots$ and have a probability distribution from an
exponential family. This implies that the variance of the response depends on the mean through a
variance function $V$:
$$\mathrm{Var}(y_i) = \frac{\phi V(\mu_i)}{w_i}$$
where $\phi$ is a constant and $w_i$ is a known weight for each observation. The dispersion parameter $\phi$ is
either known (for example, for the binomial or Poisson distribution, $\phi = 1$) or must be estimated.
See the section “Response Probability Distributions” on page 2956 for the form of a probability distribution
from the exponential family of distributions.
As in the case of traditional linear models, fitted generalized linear models can be summarized through
statistics such as parameter estimates, their standard errors, and goodness-of-fit statistics. You can also
make statistical inference about the parameters by using confidence intervals and hypothesis tests. However,
specific inference procedures are usually based on asymptotic considerations, since exact distribution theory
is not available or is not practical for all generalized linear models.
Examples of Generalized Linear Models
You construct a generalized linear model by deciding on response and explanatory variables for your data and
choosing an appropriate link function and response probability distribution. Some examples of generalized
linear models follow. Explanatory variables can be any combination of continuous variables, classification
variables, and interactions.
Traditional Linear Model
• response variable: a continuous variable
• distribution: normal
• link function: identity, $g(\mu) = \mu$
Logistic Regression
• response variable: a proportion
• distribution: binomial
• link function: logit, $g(\mu) = \log\left(\frac{\mu}{1-\mu}\right)$
Poisson Regression in Log-Linear Model
• response variable: a count
• distribution: Poisson
• link function: log, $g(\mu) = \log(\mu)$
Gamma Model with Log Link
• response variable: a positive, continuous variable
• distribution: gamma
• link function: log, $g(\mu) = \log(\mu)$
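Each of the four examples above corresponds to a choice of the DIST= and LINK= options in the MODEL statement. The following sketch shows one way these choices might be written; the data set mydata and the variables y, r, n, count, t, x1, and x2 are hypothetical names used only for illustration.

```sas
/* Sketch only: all data set and variable names are hypothetical. */

/* Traditional linear model: normal distribution, identity link */
proc genmod data=mydata;
   model y = x1 x2 / dist=normal link=identity;
run;

/* Logistic regression: r events out of n trials, logit link */
proc genmod data=mydata;
   model r/n = x1 x2 / dist=binomial link=logit;
run;

/* Poisson regression (log-linear model): count response, log link */
proc genmod data=mydata;
   model count = x1 x2 / dist=poisson link=log;
run;

/* Gamma model with log link: positive continuous response */
proc genmod data=mydata;
   model t = x1 x2 / dist=gamma link=log;
run;
```

In each case only the response, the distribution, and the link change; the specification of the explanatory effects stays the same.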
The GENMOD Procedure
The GENMOD procedure fits a generalized linear model to the data by maximum likelihood estimation of the
parameter vector $\beta$. There is, in general, no closed form solution for the maximum likelihood estimates of the
parameters. The GENMOD procedure estimates the parameters of the model numerically through an iterative
fitting process. The dispersion parameter $\phi$ is also estimated by maximum likelihood or, optionally, by the
residual deviance or by Pearson’s chi-square divided by the degrees of freedom. Covariances, standard errors,
and p-values are computed for the estimated parameters based on the asymptotic normality of maximum
likelihood estimators. A number of popular link functions and probability distributions are available in the
GENMOD procedure. The built-in link functions are as follows:
• identity: $g(\mu) = \mu$
• logit: $g(\mu) = \log(\mu/(1-\mu))$
• probit: $g(\mu) = \Phi^{-1}(\mu)$, where $\Phi$ is the standard normal cumulative distribution function
• power: $g(\mu) = \begin{cases} \mu^{\lambda} & \text{if } \lambda \neq 0 \\ \log(\mu) & \text{if } \lambda = 0 \end{cases}$
• log: $g(\mu) = \log(\mu)$
• complementary log-log: $g(\mu) = \log(-\log(1-\mu))$
The available distributions and associated variance functions are as follows:
• normal: $V(\mu) = 1$
• binomial (proportion): $V(\mu) = \mu(1-\mu)$
• Poisson: $V(\mu) = \mu$
• gamma: $V(\mu) = \mu^2$
• inverse Gaussian: $V(\mu) = \mu^3$
• negative binomial: $V(\mu) = \mu + k\mu^2$
• geometric: $V(\mu) = \mu + \mu^2$
• multinomial
• zero-inflated Poisson
• zero-inflated negative binomial
The negative binomial and zero-inflated negative binomial are distributions with an additional parameter k in
the variance function. PROC GENMOD estimates k by maximum likelihood, or you can optionally set it to
a constant value. For discussions of the negative binomial distribution, see McCullagh and Nelder (1989);
Hilbe (1994, 2007); Long (1997); Cameron and Trivedi (1998); Lawless (1987).
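The two approaches to extra-Poisson variation mentioned above can be sketched as follows. The data set mydata and the variables count, x1, and x2 are hypothetical; the first step keeps the Poisson variance function and estimates the dispersion from Pearson's chi-square divided by its degrees of freedom, and the second step switches to the negative binomial distribution so that k is estimated by maximum likelihood.

```sas
/* Sketch only: mydata, count, x1, and x2 are hypothetical names. */

/* Poisson model with the dispersion estimated by Pearson chi-square / DF */
proc genmod data=mydata;
   model count = x1 x2 / dist=poisson link=log scale=pearson;
run;

/* Negative binomial model; the parameter k is estimated by maximum likelihood */
proc genmod data=mydata;
   model count = x1 x2 / dist=negbin link=log;
run;
```

Specifying SCALE=DEVIANCE instead of SCALE=PEARSON would base the dispersion estimate on the residual deviance.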
The multinomial distribution is sometimes used to model a response that can take values from a number
of categories. The binomial is a special case of the multinomial with two categories. See the section
“Multinomial Models” on page 2975 and McCullagh and Nelder (1989, Chapter 5) for a description of the
multinomial distribution.
The zero-inflated Poisson and zero-inflated negative binomial are included in PROC GENMOD even though
they are not generalized linear models. They are useful extensions of generalized linear models. See the
section “Zero-Inflated Models” on page 2975 for information about the zero-inflated distributions. Models
for data with correlated responses fit by the GEE method are not available for zero-inflated distributions.
In addition, you can easily define your own link functions or distributions through DATA step programming
statements used within the procedure.
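As a hedged illustration of a user-defined link, the following sketch reproduces the built-in log link by using the FWDLINK and INVLINK statements. The data set and model variables are hypothetical, and the variable names link and ilink are arbitrary; the automatic variables _MEAN_ and _XBETA_ hold the current mean and linear predictor.

```sas
/* Sketch only: mydata, count, x1, and x2 are hypothetical names.  */
/* This defines g(mu) = log(mu) by hand; it is equivalent in intent */
/* to specifying LINK=LOG and is shown only to illustrate the idea. */
proc genmod data=mydata;
   fwdlink link  = log(_MEAN_);     /* forward link: linear predictor as a function of the mean */
   invlink ilink = exp(_XBETA_);    /* inverse link: mean as a function of the linear predictor */
   model count = x1 x2 / dist=poisson;
run;
```

The two statements must describe mutually inverse functions; supplying only one of them, or a pair that is not inverse, would not define a valid link.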
An important aspect of generalized linear modeling is the selection of explanatory variables in the model.
Changes in goodness-of-fit statistics are often used to evaluate the contribution of subsets of explanatory
variables to a particular model. The deviance, defined to be twice the difference between the maximum
attainable log likelihood and the log likelihood of the model under consideration, is often used as a measure
of goodness of fit. The maximum attainable log likelihood is achieved with a model that has a parameter for
every observation. See the section “Goodness of Fit” on page 2963 for formulas for the deviance.
One strategy for variable selection is to fit a sequence of models, beginning with a simple model with only an
intercept term, and then to include one additional explanatory variable in each successive model. You can
measure the importance of the additional explanatory variable by the difference in deviances or fitted log
likelihoods between successive models. Asymptotic tests computed by the GENMOD procedure enable you
to assess the statistical significance of the additional term.
The GENMOD procedure enables you to fit a sequence of models, up through a maximum number of terms
specified in a MODEL statement. A table summarizes twice the difference in log likelihoods between each
successive pair of models. This is called a Type 1 analysis in the GENMOD procedure, because it is analogous
to Type I (sequential) sums of squares in the GLM procedure. As with the PROC GLM Type I sums of
squares, the results from this process depend on the order in which the model terms are fit.
The GENMOD procedure also generates a Type 3 analysis analogous to Type III sums of squares in the GLM
procedure. A Type 3 analysis does not depend on the order in which the terms for the model are specified. A
GENMOD procedure Type 3 analysis consists of specifying a model and computing likelihood ratio statistics
for Type III contrasts for each term in the model. The contrasts are defined in the same way as they are in the
GLM procedure. The GENMOD procedure optionally computes Wald statistics for Type III contrasts. This
is computationally less expensive than likelihood ratio statistics, but it is thought to be less accurate because
the specified significance level of hypothesis tests based on the Wald statistic might not be as close to the
actual significance level as it is for likelihood ratio tests.
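Both analyses can be requested through options in the MODEL statement. The following sketch assumes the insure data set that the "Getting Started" section of this chapter creates; the choice to request both analyses in one step is only for illustration.

```sas
/* Type 1 (sequential) and Type 3 likelihood ratio analyses */
proc genmod data=insure;
   class car age;
   model c = car age / dist=poisson link=log offset=ln
                       type1 type3;
run;

/* Type 3 analysis based on Wald statistics instead of likelihood ratios */
proc genmod data=insure;
   class car age;
   model c = car age / dist=poisson link=log offset=ln
                       type3 wald;
run;
```

Because the Type 1 table is sequential, reordering the effects as age car in the MODEL statement can change its entries, whereas the Type 3 results do not depend on that order.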
A Type 3 analysis generalizes the use of Type III estimable functions in linear models. Briefly, a Type III
estimable function (contrast) for an effect is a linear function of the model parameters that involves the
parameters of the effect and any interactions with that effect. A test of the hypothesis that the Type III
contrast for a main effect is equal to 0 is intended to test the significance of the main effect in the presence
of interactions. See Chapter 44, “The GLM Procedure,” and Chapter 15, “The Four Types of Estimable
Functions,” for more information about Type III estimable functions. Also see Littell, Freund, and Spector
(1991).
Additional features of the GENMOD procedure include the following:
• likelihood ratio statistics for user-defined contrasts (that is, linear functions of the parameters), and
p-values based on their asymptotic chi-square distributions
• estimated values, standard errors, and confidence limits for user-defined contrasts and least squares
means
• ability to create a SAS data set corresponding to most tables displayed by the procedure (see Table 42.12
and Table 42.13)
• confidence intervals for model parameters based on either the profile likelihood function or asymptotic
normality
• syntax similar to that of PROC GLM for the specification of the response and model effects, including
interaction terms and automatic coding of classification variables
• ability to fit GEE models for clustered response data
• ability to perform Bayesian analysis by Gibbs sampling
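Several of these features can be combined in a single step. The following sketch assumes the insure data set that the "Getting Started" section creates; the contrast labels and the output data set name pe are hypothetical choices. The coefficients 1 0 -1 follow the default alphabetical ordering of the car levels (large, medium, small).

```sas
/* Sketch only: labels and the data set name pe are illustrative. */
ods output ParameterEstimates=pe;          /* save the parameter estimates table as a data set */
proc genmod data=insure;
   class car age;
   model c = car age / dist=poisson link=log offset=ln
                       lrci;               /* profile likelihood confidence intervals */
   contrast 'large vs small' car 1 0 -1;   /* likelihood ratio test of the contrast */
   estimate 'large vs small' car 1 0 -1 / exp;   /* estimate and its exponentiated value */
run;
```

The EXP option in the ESTIMATE statement is convenient with a log link because the exponentiated contrast is a ratio of means.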
Getting Started: GENMOD Procedure
Poisson Regression
You can use the GENMOD procedure to fit a variety of statistical models. A typical use of PROC GENMOD
is to perform Poisson regression.
You can use the Poisson distribution to model the distribution of cell counts in a multiway contingency
table. Aitkin et al. (1989) have used this method to model insurance claims data. Suppose the following
hypothetical insurance claims data are classified by two factors: age group (with two levels) and car type
(with three levels).
data insure;
   input n c car$ age;
   ln = log(n);
   datalines;
 500   42  small   1
1200   37  medium  1
 100    1  large   1
 400  101  small   2
 500   73  medium  2
 300   14  large   2
;
In the preceding data set, the variable n represents the number of insurance policyholders and the variable c
represents the number of insurance claims. The variable car is the type of car involved (classified into three
groups) and the variable age is the age group of a policyholder (classified into two groups).
You can use PROC GENMOD to perform a Poisson regression analysis of these data with a log link function.
This type of model is sometimes called a log-linear model.
Assume that the number of claims c has a Poisson probability distribution and that its mean, $\mu_i$, is related to
the factors car and age for observation $i$ by
$$\begin{aligned}
\log(\mu_i) &= \log(n_i) + x_i'\beta \\
&= \log(n_i) + \beta_0 + \mathrm{car}_i(1)\beta_1 + \mathrm{car}_i(2)\beta_2 + \mathrm{car}_i(3)\beta_3 + \mathrm{age}_i(1)\beta_4 + \mathrm{age}_i(2)\beta_5
\end{aligned}$$
The indicator variables $\mathrm{car}_i(j)$ and $\mathrm{age}_i(j)$ are associated with the $j$th level of the variables car and age for
observation $i$:
$$\mathrm{car}_i(j) = \begin{cases} 1 & \text{if car} = j \\ 0 & \text{if car} \neq j \end{cases}$$
The $\beta$s are unknown parameters to be estimated by the procedure. The logarithm of the variable n is used as
an offset—that is, a regression variable with a constant coefficient of 1 for each observation. A log-linear
relationship between the mean and the factors car and age is specified by the log link function. The log link
function ensures that the mean number of insurance claims for each car and age group predicted from the
fitted model is positive.
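The role of the offset can be made explicit by exponentiating the model equation. The short derivation below uses only the symbols already defined in this section; it shows that the linear predictor models the claim rate per policyholder rather than the raw count.

```latex
\begin{aligned}
\log(\mu_i) &= \log(n_i) + x_i'\beta \\
\mu_i       &= n_i \exp(x_i'\beta) \\
\frac{\mu_i}{n_i} &= \exp(x_i'\beta)
\end{aligned}
```

Because the exponential function is strictly positive and each $n_i$ is positive, the fitted mean $\mu_i$ is positive for any value of $\beta$.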
The following statements invoke the GENMOD procedure to perform this analysis:
proc genmod data=insure;
   class car age;
   model c = car age / dist   = poisson
                       link   = log
                       offset = ln;
run;
The variables car and age are specified as CLASS variables so that PROC GENMOD automatically generates
the indicator variables associated with car and age.
The MODEL statement specifies c as the response variable and car and age as explanatory variables. An
intercept term is included by default. Thus, the model matrix X (the matrix that has as its ith row the transpose
of the covariate vector for the ith observation) consists of a column of 1s representing the intercept term and
columns of 0s and 1s derived from indicator variables representing the levels of the car and age variables.
That is, the model matrix is
$$X = \begin{bmatrix}
1 & 1 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 1 & 0 \\
1 & 1 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 & 0 & 1
\end{bmatrix}$$
where the first column corresponds to the intercept, the next three columns correspond to the variable car,
and the last two columns correspond to the variable age.
The response distribution is specified as Poisson, and the link function is chosen to be log. That is, the
Poisson mean parameter $\mu$ is related to the linear predictor by
$$\log(\mu) = x_i'\beta$$
The logarithm of n is specified as an offset variable, as is common in this type of analysis. In this case, the
offset variable serves to normalize the fitted cell means to a per-policyholder basis, since the total number of
claims, not individual policyholder claims, is observed. PROC GENMOD produces the following default
output from the preceding statements.
Figure 42.1 Model Information
The GENMOD Procedure
Model Information
Data Set              WORK.INSURE
Distribution          Poisson
Link Function         Log
Dependent Variable    c
Offset Variable       ln
The “Model Information” table displayed in Figure 42.1 provides information about the specified model and
the input data set.
Figure 42.2 Class Level Information
Class Level Information
Class    Levels    Values
car      3         large medium small
age      2         1 2
Figure 42.2 displays the “Class Level Information” table, which identifies the levels of the classification
variables that are used in the model. Note that car is a character variable, and the values are sorted in
alphabetical order. This is the default sort order, but you can select different sort orders with the ORDER=
option in the PROC GENMOD statement.
Figure 42.3 Goodness of Fit
Criteria For Assessing Goodness Of Fit
Criterion                   DF       Value    Value/DF
Deviance                     2      2.8207      1.4103
Scaled Deviance              2      2.8207      1.4103
Pearson Chi-Square           2      2.8416      1.4208
Scaled Pearson X2            2      2.8416      1.4208
Log Likelihood                    837.4533
Full Log Likelihood               -16.4638
AIC (smaller is better)            40.9276
AICC (smaller is better)           80.9276
BIC (smaller is better)            40.0946
The “Criteria For Assessing Goodness Of Fit” table displayed in Figure 42.3 contains statistics that summarize
the fit of the specified model. These statistics are helpful in judging the adequacy of a model and in comparing
it with other models under consideration. If you compare the deviance of 2.8207 with its asymptotic chi-square distribution with 2 degrees of freedom, you find that the p-value is 0.24. This indicates that the
specified model fits the data reasonably well.
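The p-value quoted above can be checked with a short DATA step. This is only a sketch of the arithmetic: it takes the deviance and degrees of freedom from Figure 42.3 and evaluates the upper tail of the chi-square distribution, which should give a value of approximately 0.244.

```sas
/* Upper-tail chi-square probability for the deviance in Figure 42.3 */
data _null_;
   deviance = 2.8207;                 /* deviance from the goodness-of-fit table */
   df       = 2;                      /* its degrees of freedom */
   p = 1 - probchi(deviance, df);     /* Pr(chi-square with df degrees of freedom > deviance) */
   put p= 6.4;                        /* write the p-value to the SAS log */
run;
```

A large p-value here means that the deviance is not unusually big for a chi-square variable with 2 degrees of freedom, which is why the fit is described as reasonable.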
Figure 42.4 Analysis of Parameter Estimates
Analysis Of Maximum Likelihood Parameter Estimates
                                  Standard     Wald 95% Confidence         Wald
Parameter          DF   Estimate     Error            Limits         Chi-Square   Pr > ChiSq
Intercept           1    -1.3168    0.0903    -1.4937    -1.1398         212.73       <.0001
car       large     1    -1.7643    0.2724    -2.2981    -1.2304          41.96       <.0001
car       medium    1    -0.6928    0.1282    -0.9441    -0.4414          29.18       <.0001
car       small     0     0.0000    0.0000     0.0000     0.0000            .          .
age       1         1    -1.3199    0.1359    -1.5863    -1.0536          94.34       <.0001
age       2         0     0.0000    0.0000     0.0000     0.0000            .          .
Scale               0     1.0000    0.0000     1.0000     1.0000
NOTE: The scale parameter was held fixed.
Figure 42.4 displays the “Analysis Of Parameter Estimates” table, which summarizes the results of the
iterative parameter estimation process. For each parameter in the model, PROC GENMOD displays columns
with the parameter name, the degrees of freedom associated with the parameter, the estimated parameter value,
the standard error of the parameter estimate, the confidence intervals, and the Wald chi-square statistic and
associated p-value for testing the significance of the parameter to the model. If a column of the model matrix
corresponding to a parameter is found to be linearly dependent, or aliased, with columns corresponding to
parameters preceding it in the model, PROC GENMOD assigns it zero degrees of freedom and displays a
value of zero for both the parameter estimate and its standard error.
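To make the relationship among these columns concrete, the following DATA step is an illustrative sketch (not taken from the original text) that recomputes the Wald limits and the Wald chi-square for the car=large row from its reported estimate and standard error. It assumes the usual Wald construction, estimate plus or minus a normal quantile times the standard error, and a chi-square equal to the squared ratio of estimate to standard error. Because the inputs are rounded to four decimals, the results can differ from Figure 42.4 in the last digit.

```sas
data _null_;
   estimate = -1.7643;               /* car=large estimate from Figure 42.4 */
   se       =  0.2724;               /* its standard error                  */
   z        = probit(0.975);         /* normal quantile for 95% limits      */
   lower    = estimate - z*se;
   upper    = estimate + z*se;
   chisq    = (estimate/se)**2;      /* Wald chi-square, 1 df               */
   p        = 1 - probchi(chisq, 1);
   put lower= 8.4 upper= 8.4 chisq= 8.2 p= pvalue6.4;
run;
```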
This table includes a row for a scale parameter, even though there is no free scale parameter in the Poisson
distribution. See the section “Response Probability Distributions” on page 2956 for the form of the Poisson
probability distribution. PROC GENMOD allows the specification of a scale parameter to fit overdispersed
Poisson and binomial distributions. In such cases, the SCALE row indicates the value of the overdispersion
scale parameter used in adjusting output statistics. See the section “Overdispersion” on page 2966 for more
about overdispersion and the meaning of the SCALE parameter output by the GENMOD procedure. PROC
GENMOD displays a note indicating that the scale parameter is fixed—that is, not estimated by the iterative
fitting process.
It is usually of interest to assess the importance of the main effects in the model. Type 1 and Type 3 analyses
generate statistical tests for the significance of these effects.
You can request these analyses with the TYPE1 and TYPE3 options in the MODEL statement, as follows:
proc genmod data=insure;
   class car age;
   model c = car age / dist   = poisson
                       link   = log
                       offset = ln
                       type1
                       type3;
run;
The results of these analyses are summarized in the figures that follow.
Figure 42.5 Type 1 Analysis
The GENMOD Procedure
LR Statistics For Type 1 Analysis
Source        Deviance    DF    ChiSquare    Pr > ChiSq
Intercept     175.1536
car           107.4620     2        67.69        <.0001
age             2.8207     1       104.64        <.0001
In the table for Type 1 analysis displayed in Figure 42.5, each entry in the deviance column represents the
deviance for the model containing the effect for that row and all effects preceding it in the table. For example,
the deviance corresponding to car in the table is the deviance of the model containing an intercept and car.
As more terms are included in the model, the deviance decreases.
Entries in the chi-square column are likelihood ratio statistics for testing the significance of the effect added
to the model containing all the preceding effects. The chi-square value of 67.69 for car represents twice the
difference in log likelihoods between fitting a model with only an intercept term and a model with an intercept
and car. Since the scale parameter is set to 1 in this analysis, this is equal to the difference in deviances.
Since two additional parameters are involved, this statistic can be compared with a chi-square distribution
with two degrees of freedom. The resulting p-value (labeled Pr > ChiSq) of less than 0.0001 indicates that this
variable is highly significant. Similarly, the chi-square value of 104.64 for age represents twice the difference in
log likelihoods between the model with the intercept and car and the model with the intercept, car, and age.
This effect is also highly significant, as indicated by the small p-value.
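Because the scale parameter is fixed at 1, each Type 1 likelihood ratio statistic equals the drop in deviance between successive models. The following DATA step is a small sketch of that arithmetic; it uses only the deviances reported in Figure 42.5, and the variable names are illustrative:

```sas
data _null_;
   /* Deviances copied from the Type 1 table (Figure 42.5) */
   dev_intercept = 175.1536;
   dev_car       = 107.4620;
   dev_car_age   =   2.8207;
   chisq_car = dev_intercept - dev_car;     /* adds 2 parameters */
   chisq_age = dev_car - dev_car_age;       /* adds 1 parameter  */
   p_car = 1 - probchi(chisq_car, 2);
   p_age = 1 - probchi(chisq_age, 1);
   put chisq_car= 8.2 p_car= pvalue6.4;
   put chisq_age= 8.2 p_age= pvalue6.4;
run;
```

The two differences round to the 67.69 and 104.64 shown in the ChiSquare column.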
Figure 42.6 Type 3 Analysis
LR Statistics For Type 3 Analysis
Source    DF    ChiSquare    Pr > ChiSq
car        2        72.82        <.0001
age        1       104.64        <.0001
The Type 3 analysis results in the same conclusions as the Type 1 analysis. The Type 3 chi-square value
for the car variable, for example, is twice the difference between the log likelihood for the model with the
variables Intercept, car, and age included and the log likelihood for the model with the car variable excluded.
The hypothesis tested in this case is the significance of the variable car given that the variable age is in the
model. In other words, it tests the additional contribution of car in the model.
The values of the Type 3 likelihood ratio statistics for the car and age variables indicate that both of these
factors are highly significant in determining the claims performance of the insurance policyholders.
Bayesian Analysis of a Linear Regression Model
Neter et al. (1996) describe a study of 54 patients undergoing a certain kind of liver operation in a surgical
unit. The data set Surg contains survival time and certain covariates for each patient. Observations for the
first 20 patients in the data set Surg are shown in Figure 42.7.
Figure 42.7 Surgical Unit Data
Obs     x1    x2     x3      x4      y      logy      Logx1
  1    6.7    62     81    2.59    200    2.3010    1.90211
  2    5.1    59     66    1.70    101    2.0043    1.62924
  3    7.4    57     83    2.16    204    2.3096    2.00148
  4    6.5    73     41    2.01    101    2.0043    1.87180
  5    7.8    65    115    4.30    509    2.7067    2.05412
  6    5.8    38     72    1.42     80    1.9031    1.75786
  7    5.7    46     63    1.91     80    1.9031    1.74047
  8    3.7    68     81    2.57    127    2.1038    1.30833
  9    6.0    67     93    2.50    202    2.3054    1.79176
 10    3.7    76     94    2.40    203    2.3075    1.30833
 11    6.3    84     83    4.13    329    2.5172    1.84055
 12    6.7    51     43    1.86     65    1.8129    1.90211
 13    5.8    96    114    3.95    830    2.9191    1.75786
 14    5.8    83     88    3.95    330    2.5185    1.75786
 15    7.7    62     67    3.40    168    2.2253    2.04122
 16    7.4    74     68    2.40    217    2.3365    2.00148
 17    6.0    85     28    2.98     87    1.9395    1.79176
 18    3.7    51     41    1.55     34    1.5315    1.30833
 19    7.3    68     74    3.56    215    2.3324    1.98787
 20    5.6    57     87    3.02    172    2.2355    1.72277
Consider the model
$$Y = \beta_0 + \beta_1\,\mathrm{LogX1} + \beta_2\,\mathrm{X2} + \beta_3\,\mathrm{X3} + \beta_4\,\mathrm{X4} + \epsilon$$
where Y is the survival time, LogX1 is log(blood-clotting score), X2 is a prognostic index, X3 is an enzyme
function test score, X4 is a liver function test score, and $\epsilon$ is an $N(0, \sigma^2)$ error term.
A question of scientific interest is whether blood clotting score has a positive effect on survival time. Using
PROC GENMOD, you can obtain a maximum likelihood estimate of the coefficient and construct a point null
hypothesis to test whether $\beta_1$ is equal to 0. However, if you are interested in finding the probability that
the coefficient is positive, Bayesian analysis offers a convenient alternative. You can use Bayesian analysis to
directly estimate the conditional probability, $\Pr(\beta_1 > 0 \mid Y)$, using the posterior distribution samples, which
are produced as part of the output by PROC GENMOD.
The example that follows shows how to use PROC GENMOD to carry out a Bayesian analysis of the linear
model with a normal error term. The SEED= option is specified to maintain reproducibility; no other options
are specified in the BAYES statement. By default, a uniform prior distribution is assumed on the regression
coefficients. The uniform prior is a flat prior on the real line with a distribution that reflects ignorance of the
location of the parameter, placing equal likelihood on all possible values the regression coefficient can take.
Using the uniform prior in the following example, you would expect the Bayesian estimates to resemble
the classical results of maximizing the likelihood. If you can elicit an informative prior distribution for the
regression coefficients, you should use the COEFFPRIOR= option to specify it. A default noninformative
gamma prior is used for the scale parameter $\sigma$.
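As a hedged illustration of the COEFFPRIOR= option mentioned above, the following sketch requests a normal prior for the regression coefficients instead of the default uniform prior. It assumes the same Surg data set; the output data set name PostSurgNormal is hypothetical, and the prior's mean and covariance are left at whatever defaults the BAYES statement documentation specifies rather than set here.

```sas
proc genmod data=Surg;
   model y = Logx1 X2 X3 X4 / dist=normal;
   /* COEFFPRIOR=NORMAL replaces the default uniform prior on the   */
   /* regression coefficients; PostSurgNormal is a hypothetical name */
   bayes seed=1 coeffprior=normal outpost=PostSurgNormal;
run;
```

To encode genuinely informative prior beliefs, you would also supply the prior mean and covariance through the suboptions of COEFFPRIOR=NORMAL that are described with the BAYES statement.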
You should make sure that the posterior distribution samples have achieved convergence before using them
for Bayesian inference. PROC GENMOD produces three convergence diagnostics by default. If ODS
Graphics is enabled, diagnostic plots are also displayed. See
the section “Assessing Markov Chain Convergence” on page 141 in Chapter 7, “Introduction to Bayesian
Analysis Procedures,” for more information about convergence diagnostics and their interpretation.
Summary statistics of the posterior distribution samples are produced by default. However, these statistics
might not be sufficient for carrying out your Bayesian inference, and further processing of the posterior samples might be necessary. The following SAS statements request the Bayesian analysis, and the OUTPOST=
option saves the samples in the SAS data set PostSurg for further processing:
proc genmod data=Surg;
model y = Logx1 X2 X3 X4 / dist=normal;
bayes seed=1 OutPost=PostSurg;
run;
The results of this analysis are shown in the following figures.
The “Model Information” table in Figure 42.8 summarizes information about the model you fit and the size
of the simulation.
Figure 42.8 Model Information
The GENMOD Procedure
Bayesian Analysis
Model Information
Data Set              WORK.SURG
Burn-In Size          2000
MC Sample Size        10000
Thinning              1
Sampling Algorithm    Conjugate
Distribution          Normal
Link Function         Identity
Dependent Variable    y            Survival Time
The “Analysis of Maximum Likelihood Parameter Estimates” table in Figure 42.9 summarizes maximum
likelihood estimates of the model parameters.
Figure 42.9 Maximum Likelihood Parameter Estimates
Analysis Of Maximum Likelihood Parameter Estimates
                                Standard     Wald 95% Confidence
Parameter    DF    Estimate        Error            Limits
Intercept     1    -730.559      85.4333    -898.005    -563.112
Logx1         1    171.8758      38.2250     96.9561    246.7954
x2            1      4.3019       0.5566      3.2109      5.3929
x3            1      4.0309       0.4996      3.0517      5.0100
x4            1     18.1377      12.0721     -5.5232     41.7986
Scale         1     59.8591       5.7599     49.5705     72.2832
NOTE: The scale parameter was estimated by maximum likelihood.
Since no prior distributions for the regression coefficients were specified, the default noninformative uniform
distributions shown in the “Uniform Prior for Regression Coefficients” table in Figure 42.10 are used.
Noninformative priors are appropriate if you have no prior knowledge of the likely range of values of
the parameters, and if you want to make probability statements about the parameters or functions of the
parameters. See, for example, Ibrahim, Chen, and Sinha (2001) for more information about choosing prior
distributions.
Figure 42.10 Regression Coefficient Priors
The GENMOD Procedure
Bayesian Analysis
Uniform Prior for Regression Coefficients
Parameter    Prior
Intercept    Constant
Logx1        Constant
x2           Constant
x3           Constant
x4           Constant
The default noninformative improper prior distribution for the normal dispersion parameter is shown in the
“Independent Prior Distributions for Model Parameters” table in Figure 42.11.
Figure 42.11 Scale Parameter Prior
Independent Prior Distributions for Model Parameters
Parameter     Prior Distribution
Dispersion    Improper
By default, the maximum likelihood estimates of the regression parameters are used as the starting values for
the simulation when noninformative prior distributions are used. These are listed in the “Initial Values and
Seeds” table in Figure 42.12.
Figure 42.12 MCMC Initial Values and Seeds
Initial Values of the Chain
Chain    Seed    Intercept       Logx1          x2          x3         x4    Dispersion
    1       1     -730.559    171.8758    4.301896    4.030878    18.1377      3449.176
Summary statistics for the posterior sample are displayed in the “Fit Statistics,” “Descriptive Statistics for the
Posterior Sample,” “Interval Statistics for the Posterior Sample,” and “Posterior Correlation Matrix” tables in
Figure 42.13, Figure 42.14, Figure 42.15, and Figure 42.16, respectively.
Figure 42.13 Fit Statistics
Fit Statistics
DIC (smaller is better)                 607.796
pD (effective number of parameters)       6.062
Figure 42.14 Descriptive Statistics
The GENMOD Procedure
Bayesian Analysis
Posterior Summaries
                                     Standard               Percentiles
Parameter         N        Mean     Deviation        25%         50%         75%
Intercept     10000      -730.0       91.2102     -789.6      -729.6      -670.5
Logx1         10000       171.7       40.6455      144.2       171.6       198.6
x2            10000      4.2988        0.5952     3.9029      4.2919      4.6903
x3            10000      4.0308        0.5359     3.6641      4.0267      4.3921
x4            10000     18.0858       12.9123     9.4471     18.1230     26.8141
Dispersion    10000      4113.1         867.7     3497.2      3995.9      4606.4
Figure 42.15 Interval Statistics
Posterior Intervals
Parameter     Alpha     Equal-Tail Interval        HPD Interval
Intercept     0.050     -908.6      -549.8      -906.9      -549.1
Logx1         0.050    91.9723       252.5     94.1279       254.0
x2            0.050     3.1091      5.4778      3.1705      5.5167
x3            0.050     2.9803      5.1031      2.9227      5.0343
x4            0.050    -7.3043     43.6387     -8.8440     41.8229
Dispersion    0.050     2741.5      6096.6      2540.1      5810.0
Figure 42.16 Posterior Sample Correlation Matrix
Posterior Correlation Matrix
Parameter     Intercept     Logx1        x2        x3        x4    Dispersion
Intercept         1.000    -0.857    -0.579    -0.712     0.582         0.000
Logx1            -0.857     1.000     0.286     0.491    -0.640         0.007
x2               -0.579     0.286     1.000     0.302    -0.489        -0.009
x3               -0.712     0.491     0.302     1.000    -0.618        -0.006
x4                0.582    -0.640    -0.489    -0.618     1.000         0.003
Dispersion        0.000     0.007    -0.009    -0.006     0.003         1.000
Since noninformative prior distributions were used, the posterior sample means, standard deviations, and
interval statistics shown in Figure 42.14 and Figure 42.15 are consistent with the maximum likelihood
estimates shown in Figure 42.9.
By default, PROC GENMOD computes three convergence diagnostics: the lag1, lag5, lag10, and lag50
autocorrelations (Figure 42.17); Geweke diagnostic statistics (Figure 42.18); and effective sample sizes
(Figure 42.19). There is no indication that the Markov chain has not converged. See the section
“Assessing Markov Chain Convergence” on page 141 in Chapter 7, “Introduction to Bayesian Analysis Procedures,” for more information about convergence diagnostics and their interpretation.
Figure 42.17 Posterior Sample Autocorrelations
The GENMOD Procedure
Bayesian Analysis
Posterior Autocorrelations
Parameter       Lag 1      Lag 5     Lag 10     Lag 50
Intercept     -0.0059    -0.0037    -0.0152     0.0010
Logx1         -0.0002    -0.0064    -0.0066    -0.0054
x2            -0.0120    -0.0026    -0.0267    -0.0168
x3             0.0036     0.0033    -0.0035     0.0004
x4             0.0034    -0.0064     0.0083    -0.0124
Dispersion    -0.0011     0.0091    -0.0279     0.0037
Figure 42.18 Geweke Diagnostic Statistics
Geweke Diagnostics
Parameter           z    Pr > |z|
Intercept     -1.0815      0.2795
Logx1          1.6667      0.0956
x2             0.0977      0.9222
x3             0.2506      0.8021
x4            -1.1082      0.2678
Dispersion     0.2451      0.8064
Figure 42.19 Effective Sample Sizes
Effective Sample Sizes
                          Autocorrelation
Parameter         ESS                Time    Efficiency
Intercept     10000.0              1.0000        1.0000
Logx1         10000.0              1.0000        1.0000
x2            10245.2              0.9761        1.0245
x3            10000.0              1.0000        1.0000
x4            10000.0              1.0000        1.0000
Dispersion    10000.0              1.0000        1.0000
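The three columns of the effective sample size table are linked by simple arithmetic. The sketch below assumes the standard definitions, in which the autocorrelation time is the MC sample size divided by the ESS and the efficiency is the ESS divided by the MC sample size. It recomputes the x2 row from the values reported in Figure 42.8 and Figure 42.19:

```sas
data _null_;
   n   = 10000;                  /* MC sample size (Figure 42.8)         */
   ess = 10245.2;                /* effective sample size reported for x2 */
   autocorr_time = n / ess;      /* autocorrelation time                 */
   efficiency    = ess / n;      /* sampling efficiency                  */
   put autocorr_time= 8.4 efficiency= 8.4;
run;
```

The results round to the 0.9761 and 1.0245 shown for x2. An efficiency near 1 indicates that the posterior draws behave almost like independent samples.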
Trace, autocorrelation, and density plots for the six model parameters, shown in Figure 42.20
through Figure 42.25, are useful in diagnosing whether the Markov chain of posterior samples has
converged. These plots show no evidence that the chain has not converged. See the section
“Visual Analysis via Trace Plots” on page 141 in Chapter 7, “Introduction to Bayesian Analysis Procedures,”
for help with interpreting these diagnostic plots.
Figure 42.20 Diagnostic Plots for Intercept
Figure 42.21 Diagnostic Plots for logX1
Figure 42.22 Diagnostic Plots for X2
Figure 42.23 Diagnostic Plots for X3
Figure 42.24 Diagnostic Plots for X4
Figure 42.25 Diagnostic Plots for Dispersion
Suppose, for illustration, a question of scientific interest is whether blood clotting score has a positive effect
on survival time. Since the model parameters are regarded as random quantities in a Bayesian analysis,
you can answer this question by estimating the conditional probability of $\beta_1$ being positive, given the data,
$\Pr(\beta_1 > 0 \mid Y)$, from the posterior distribution samples. The following SAS statements compute the estimate
of the probability of $\beta_1$ being positive:
data Prob;
set PostSurg;
Indicator = (logX1 > 0);
label Indicator= 'log(Blood Clotting Score) > 0';
run;
proc Means data = Prob(keep=Indicator) n mean;
run;
As shown in Figure 42.26, the estimated posterior probability of a positive relationship between the logarithm
of a blood clotting score and survival time, adjusted for the other covariates, is 0.9999, or essentially 1.
Figure 42.26 Probability That $\beta_1 > 0$
The MEANS Procedure
Analysis Variable : Indicator log(Blood Clotting Score) > 0
    N         Mean
------------------
10000    0.9999000
------------------
Generalized Estimating Equations
This section illustrates the use of the REPEATED statement to fit a GEE model, using repeated measures data
from the “Six Cities” study of the health effects of air pollution (Ware et al. 1984). The data analyzed are the
16 selected cases in Lipsitz et al. (1994). The binary response is the wheezing status of 16 children at ages 9,
10, 11, and 12 years. The mean response is modeled as a logistic regression model by using the explanatory
variables city of residence, age, and maternal smoking status at the particular age. The binary responses for
individual children are assumed to be equally correlated, implying an exchangeable correlation structure.
The data set and SAS statements that fit the model by the GEE method are as follows:
data six;
   input case city$ @@;
   do i=1 to 4;
      input age smoke wheeze @@;
      output;
   end;
   datalines;
 1 portage   9 0 1  10 0 1  11 0 1  12 0 0
 2 kingston  9 1 1  10 2 1  11 2 0  12 2 0
 3 kingston  9 0 1  10 0 0  11 1 0  12 1 0
 4 portage   9 0 0  10 0 1  11 0 1  12 1 0
 5 kingston  9 0 0  10 1 0  11 1 0  12 1 0
 6 portage   9 0 0  10 1 0  11 1 0  12 1 0
 7 kingston  9 1 0  10 1 0  11 0 0  12 0 0
 8 portage   9 1 0  10 1 0  11 1 0  12 2 0
 9 portage   9 2 1  10 2 0  11 1 0  12 1 0
10 kingston  9 0 0  10 0 0  11 0 0  12 1 0
11 kingston  9 1 1  10 0 0  11 0 1  12 0 1
12 portage   9 1 0  10 0 0  11 0 0  12 0 0
13 kingston  9 1 0  10 0 1  11 1 1  12 1 1
14 portage   9 1 0  10 2 0  11 1 0  12 2 1
15 kingston  9 1 0  10 1 0  11 1 0  12 2 1
16 portage   9 1 1  10 1 1  11 2 0  12 1 0
;
proc genmod data=six;
class case city;
model wheeze = city age smoke / dist=bin;
repeated subject=case / type=exch covb corrw;
run;
The CLASS statement and the MODEL statement specify the model for the mean of the wheeze variable
response as a logistic regression with city, age, and smoke as independent variables, just as for an ordinary
logistic regression.
The REPEATED statement invokes the GEE method, specifies the correlation structure, and controls the
displayed output from the GEE model. The option SUBJECT=CASE specifies that individual subjects be
identified in the input data set by the variable case. The SUBJECT= variable case must be listed in the
CLASS statement. Measurements on individual subjects at ages 9, 10, 11, and 12 are in the proper order
in the data set, so the WITHINSUBJECT= option is not required. The TYPE=EXCH option specifies an
exchangeable working correlation structure, the COVB option specifies that the parameter estimate covariance
matrix be displayed, and the CORRW option specifies that the final working correlation be displayed.
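If the measurements were not already in the proper order within each subject, the WITHINSUBJECT= option could define that order. The following is a sketch under stated assumptions, not part of the original example: it assumes that the within-subject effect must be a CLASS variable, so it copies age into a hypothetical classification variable named visit (in a hypothetical data set six2) so that age itself can remain a continuous covariate in the model.

```sas
data six2;                       /* six2 and visit are hypothetical names */
   set six;
   visit = age;                  /* copy of age used only to order visits */
run;

proc genmod data=six2;
   class case city visit;
   model wheeze = city age smoke / dist=bin;
   repeated subject=case / withinsubject=visit type=exch covb corrw;
run;
```

With an exchangeable working correlation the order of measurements does not change the assumed correlation pattern, so this device matters mainly for order-dependent structures or for data with missing visits.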
Initial parameter estimates for iterative fitting of the GEE model are computed as in an ordinary generalized
linear model, as described previously. Results of the initial model fit displayed as part of the generated output
are not shown here. Statistics for the initial model fit such as parameter estimates, standard errors, deviances,
and Pearson chi-squares do not apply to the GEE model and are valid only for the initial model fit. The
following figures display information that applies to the GEE model fit.
Figure 42.27 displays general information about the GEE model fit.
Figure 42.27 GEE Model Information
The GENMOD Procedure
GEE Model Information
Correlation Structure           Exchangeable
Subject Effect                  case (16 levels)
Number of Clusters              16
Correlation Matrix Dimension    4
Maximum Cluster Size            4
Minimum Cluster Size            4
Figure 42.28 displays the parameter estimate covariance matrices specified by the COVB option. Both
model-based and empirical covariances are produced.
Figure 42.28 GEE Parameter Estimate Covariance Matrices
Covariance Matrix (Model-Based)
             Prm1         Prm2         Prm4         Prm5
Prm1      5.74947     -0.22257     -0.53472      0.01655
Prm2     -0.22257      0.45478    -0.002410      0.01876
Prm4     -0.53472    -0.002410      0.05300     -0.01658
Prm5      0.01655      0.01876     -0.01658      0.19104
Covariance Matrix (Empirical)
             Prm1         Prm2         Prm4         Prm5
Prm1      9.33994     -0.85104     -0.83253     -0.16534
Prm2     -0.85104      0.47368      0.05736      0.04023
Prm4     -0.83253      0.05736      0.07778    -0.002364
Prm5     -0.16534      0.04023    -0.002364      0.13051
The exchangeable working correlation matrix specified by the CORRW option is displayed in Figure 42.29.
Figure 42.29 GEE Working Correlation Matrix
Working Correlation Matrix
          Col1      Col2      Col3      Col4
Row1    1.0000    0.1648    0.1648    0.1648
Row2    0.1648    1.0000    0.1648    0.1648
Row3    0.1648    0.1648    1.0000    0.1648
Row4    0.1648    0.1648    0.1648    1.0000
The parameter estimates table, displayed in Figure 42.30, contains parameter estimates, standard errors,
confidence intervals, Z scores, and p-values for the parameter estimates. Empirical standard error estimates are
used in this table. A table that displays model-based standard errors can be created by using the REPEATED
statement option MODELSE.
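For example, the following sketch adds the MODELSE option to the REPEATED statement of the preceding analysis so that a table of model-based standard errors is displayed in addition to the empirical one. It assumes the data set six created earlier and otherwise leaves the model unchanged.

```sas
proc genmod data=six;
   class case city;
   model wheeze = city age smoke / dist=bin;
   /* MODELSE requests the model-based standard error table */
   repeated subject=case / type=exch modelse;
run;
```

Comparing the two tables gives a rough sense of how sensitive the standard errors are to the assumed exchangeable working correlation.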
Figure 42.30 GEE Parameter Estimates Table
Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates
                                   Standard      95% Confidence
Parameter              Estimate       Error          Limits               Z    Pr > |Z|
Intercept               -1.2751      3.0561    -7.2650     4.7148     -0.42      0.6765
city      kingston      -0.1223      0.6882    -1.4713     1.2266     -0.18      0.8589
city      portage        0.0000      0.0000     0.0000     0.0000       .         .
age                      0.2036      0.2789    -0.3431     0.7502      0.73      0.4655
smoke                    0.0935      0.3613    -0.6145     0.8016      0.26      0.7957
Syntax: GENMOD Procedure
The following statements are available in the GENMOD procedure. Items within the < > are optional.
PROC GENMOD < options > ;
ASSESS | ASSESSMENT VAR=(effect)| LINK < / options > ;
BAYES < options > ;
BY variables ;
CLASS variable < (options) > . . . < variable < (options) > > < / options > ;
CODE < options > ;
CONTRAST 'label' contrast-specification < / options > ;
DEVIANCE variable = expression ;
EFFECTPLOT < plot-type < (plot-definition-options) > > < / options > ;
ESTIMATE 'label' effect values < , . . . effect values > < / options > ;
EXACT < 'label' > < INTERCEPT > < effects > < / options > ;
EXACTOPTIONS options ;
FREQ | FREQUENCY variable ;
FWDLINK variable = expression ;
INVLINK variable = expression ;
LSMEANS < model-effects > < / options > ;
LSMESTIMATE model-effect < 'label' > values < divisor =n > < , . . . < 'label' > values < divisor =n > >
< / options > ;
MODEL response = < effects > < / options > ;
OUTPUT < OUT=SAS-data-set > < keyword=name . . . keyword=name > ;
Programming statements ;
REPEATED SUBJECT=subject-effect < / options > ;
SLICE model-effect < / options > ;
STORE < OUT= >item-store-name < / LABEL='label' > ;
STRATA variable < (option) > . . . < variable < (option) > > < / options > ;
WEIGHT | SCWGT variable ;
VARIANCE variable = expression ;
ZEROMODEL < effects > < / options > ;
The ASSESS, BAYES, BY, CLASS, CODE, CONTRAST, DEVIANCE, ESTIMATE, FREQUENCY,
FWDLINK, INVLINK, MODEL, OUTPUT, programming statements, REPEATED, VARIANCE, WEIGHT,
and ZEROMODEL statements are described in full after the PROC GENMOD statement in alphabetical
order. The EFFECTPLOT, LSMEANS, LSMESTIMATE, SLICE, and STORE statements are common to
many procedures. Summary descriptions of functionality and syntax for these statements are also given after
the PROC GENMOD statement in alphabetical order, and full documentation about them is available in
Chapter 19, “Shared Concepts and Topics.”
The PROC GENMOD statement invokes the GENMOD procedure. All statements other than the MODEL
statement are optional. The CLASS statement, if present, must precede the MODEL statement, and the
CONTRAST and EXACT statements must come after the MODEL statement.
PROC GENMOD Statement
PROC GENMOD < options > ;
The PROC GENMOD statement invokes the GENMOD procedure. Table 42.1 summarizes the options
available in the PROC GENMOD statement.
Table 42.1 PROC GENMOD Statement Options
Option        Description
DATA=         Specifies the input data set
DESCENDING    Sorts response variable in the reverse of the default order
EXACTONLY     Requests only the exact analyses
NAMELEN=      Specifies the length of effect names
ORDER=        Specifies the sort order of CLASS variable
PLOTS         Controls the plots produced through ODS Graphics
RORDER=       Specifies the sort order for the levels of the response variable
You can specify the following options.
DATA=SAS-data-set
specifies the SAS data set containing the data to be analyzed. If you omit the DATA= option, the
procedure uses the most recently created SAS data set.
DESCENDING
DESCEND
DESC
specifies that the levels of the response variable for the ordinal multinomial model and the binomial
model with single variable response syntax be sorted in the reverse of the default order. For example, if
RORDER=FORMATTED (the default), the DESCENDING option causes the levels to be sorted from
highest to lowest instead of from lowest to highest. If RORDER=FREQ, the DESCENDING option
causes the levels to be sorted from lowest frequency count to highest instead of from highest to lowest.
EXACTONLY
requests only the exact analyses. The asymptotic analysis that PROC GENMOD usually performs is
suppressed.
NAMELEN=n
specifies the length of effect names in tables and output data sets to be n characters long, where n is a
value between 20 and 200 characters. The default length is 20 characters.
ORDER=DATA | FORMATTED | FREQ | INTERNAL
specifies the sort order for the levels of the classification variables (which are specified in the CLASS
statement). The ORDER= option can be useful when you use the CONTRAST or ESTIMATE
statement because it determines which parameters in the model correspond to each level in the data.
This option applies to the levels for all classification variables, except when you use the (default)
ORDER=FORMATTED option with numeric classification variables that have no explicit format. In
that case, the levels of such variables are ordered by their internal value.
The ORDER= option can take the following values:
Value of ORDER=    Levels Sorted By
DATA               Order of appearance in the input data set
FORMATTED          External formatted value, except for numeric variables with
                   no explicit format, which are sorted by their unformatted
                   (internal) value
FREQ               Descending frequency count; levels with the most observations
                   come first in the order
INTERNAL           Unformatted value
By default, ORDER=FORMATTED. For ORDER=FORMATTED and ORDER=INTERNAL, the
sort order is machine-dependent. For more information about sort order, see the chapter on the SORT
procedure in the Base SAS Procedures Guide and the discussion of BY-group processing in SAS
Language Reference: Concepts.
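As an illustration, the following sketch refits the Poisson regression from the "Poisson Regression" section with ORDER=DATA, so that the levels of car and age are ordered by their first appearance in the Insure data set rather than alphabetically. It assumes that the Insure data set from that example is available. In the earlier fit (Figure 42.4), the last-ordered level of each classification variable was the one reported with zero degrees of freedom, so changing the sort order can change which level plays that role.

```sas
proc genmod data=insure order=data;
   class car age;
   model c = car age / dist=poisson link=log offset=ln;
run;
```

The "Class Level Information" table in the output shows the resulting level order, which is worth checking before you write CONTRAST or ESTIMATE coefficients.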
PLOTS < (global-plot-options) > < =plot-request < (options) > >
PLOTS < (global-plot-options) > < =(plot-request < (options) > < ... plot-request < (options) > >) >
specifies plots to be created using ODS Graphics. Many of the observational statistics in the output
data set can be plotted using this option. You are not required to create an output data set in order to
produce a plot. When you specify only one plot request, you can omit the parentheses around the plot
request. Here are some examples:
plots=all
plots=predicted
plots=(predicted reschi)
plots(unpack)=dfbeta
ODS Graphics must be enabled before plots can be requested. For example:
proc genmod plots=all;
model y = x;
run;
For more information about enabling and disabling ODS Graphics, see the section “Enabling and
Disabling ODS Graphics” on page 606 in Chapter 21, “Statistical Graphics Using ODS.”
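The following sketch shows one way to combine these pieces: it explicitly enables ODS Graphics, requests two of the plot-requests described in this section with options, and then disables ODS Graphics again. The Insure data set and model are borrowed from the "Poisson Regression" section for illustration only.

```sas
ods graphics on;                 /* enable ODS Graphics for this step */
proc genmod data=insure plots=(predicted(clm) reschi(xbeta));
   class car age;
   model c = car age / dist=poisson link=log offset=ln;
run;
ods graphics off;
```

Here PREDICTED(CLM) adds confidence limits to the predicted value plot, and RESCHI(XBETA) plots Pearson residuals against the linear predictor instead of the observation number.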
Any specified global-plot-options apply to all plots that are specified with plot-requests. The following
global-plot-options are available.
CLUSTERLABEL
displays formatted levels of the SUBJECT= effect instead of plot symbols. This option applies
only to diagnostic statistics for models fit by GEEs that are plotted against cluster number, and
provides a way to identify cluster level names with corresponding ordered cluster numbers.
UNPACK
displays multiple plots individually. The default is to display related multiple plots in a panel.
See the section “OUTPUT Statement” on page 2944 for definitions of the statistics specified with the
plot-requests. The plot-requests include the following:
ALL
produces all available plots.
COOKSD
DOBS
plots the Cook’s distance statistic as a function of observation number.
DFBETA
plots the $\beta$ deletion statistic as a function of observation number for each regression parameter in
the model.
DFBETAS
plots the standardized $\beta$ deletion statistic as a function of observation number for each regression
parameter in the model.
LEVERAGE
plots the leverage as a function of observation number.
PREDICTED< (option) >
plots predicted values with confidence limits as a function of observation number. The PREDICTED plot request has the following option:
CLM
includes confidence limits in the predicted value plot.
PZERO
plots the zero inflation probability for zero-inflated Poisson and negative binomial models as a
function of observation number.
RESCHI< (options) >
plots Pearson residuals. The RESCHI plot request has the following options:
INDEX
plots as a function of observation number.
XBETA
plots as a function of linear predictor.
If you do not specify an option, Pearson residuals are plotted as a function of observation number.
RESDEV< (options) >
plots deviance residuals. The RESDEV plot request has the following options:
INDEX
plots as a function of observation number.
XBETA
plots as a function of linear predictor.
If you do not specify an option, deviance residuals are plotted as a function of observation number.
RESLIK< (options) >
plots likelihood residuals. The RESLIK plot request has the following options:
INDEX
plots as a function of observation number.
XBETA
plots as a function of linear predictor.
If you do not specify an option, likelihood residuals are plotted as a function of observation
number.
RESRAW< (options) >
plots raw residuals. The RESRAW plot request has the following options:
INDEX
plots as a function of observation number.
XBETA
plots as a function of linear predictor.
If you do not specify an option, raw residuals are plotted as a function of observation number.
STDRESCHI< (options) >
plots standardized Pearson residuals. The STDRESCHI plot request has the following options:
INDEX
plots as a function of observation number.
XBETA
plots as a function of linear predictor.
If you do not specify an option, standardized Pearson residuals are plotted as a function of
observation number.
STDRESDEV< (options) >
plots standardized deviance residuals. The STDRESDEV plot request has the following options:
INDEX
plots as a function of observation number.
XBETA
plots as a function of linear predictor.
If you do not specify an option, standardized deviance residuals are plotted as a function of
observation number.
If you fit a model by using generalized estimating equations (GEEs), the following additional plot-requests
are available:
CLEVERAGE
plots the cluster leverage as a function of ordered cluster.
CLUSTERCOOKSD
DCLS
plots the cluster Cook’s distance statistic as a function of ordered cluster.
CLUSTERDFIT
MCLS
plots the studentized cluster Cook’s distance statistic as a function of ordered cluster.
DFBETAC
plots the cluster deletion statistic as a function of ordered cluster for each regression parameter in
the model.
DFBETACS
plots the standardized cluster deletion statistic as a function of ordered cluster for each regression
parameter in the model.
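As a minimal sketch of how these plot requests might be combined, the following statements fit a Poisson model and request deviance residuals against the linear predictor together with standardized Pearson residuals against observation number. The data set Counts and the variables x and y are hypothetical and stand in for your own data.

```sas
/* Hypothetical count data; the names Counts, x, and y are illustrative only */
data Counts;
   input x y @@;
   datalines;
1 2  2 3  3 6  4 7  5 8  6 9  7 10  8 12  9 15  10 20
;

ods graphics on;
proc genmod data=Counts plots=(resdev(xbeta) stdreschi(index));
   model y = x / dist=poisson link=log;   /* log-linear Poisson regression */
run;
ods graphics off;
```

Omitting the parenthesized option (for example, PLOTS=RESDEV) plots the residuals against observation number, as described above.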
RORDER=keyword
specifies the sort order for the levels of the response variable. This order determines which intercept
parameter in the model corresponds to each level in the data. If RORDER=FORMATTED is in effect for numeric
variables for which you have supplied no explicit format, the levels are ordered by their internal values.
The following table displays the valid keywords and describes how PROC GENMOD interprets them.
RORDER=keyword   Levels Sorted by
DATA             Order of appearance in the input data set
FORMATTED        External formatted value, except for numeric
                 variables with no explicit format, which are
                 sorted by their unformatted (internal) value
FREQ             Descending frequency count; levels with the
                 most observations come first in the order
INTERNAL         Unformatted value
By default, RORDER=FORMATTED. For RORDER=FORMATTED and RORDER=INTERNAL,
the sort order is machine dependent. The DESCENDING option in the PROC GENMOD statement
causes the response variable to be sorted in the reverse of the order displayed in the previous table. For
more information about sort order, see the chapter on the SORT procedure in the Base SAS Procedures
Guide.
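The effect of RORDER= can be sketched with an ordinal response whose character levels do not sort alphabetically in their natural order. In this illustration (the data set Ratings and its variables are hypothetical), RORDER=DATA keeps the levels in the order low, mid, high in which they first appear, whereas the default RORDER=FORMATTED would sort them as high, low, mid.

```sas
/* Hypothetical grouped ordinal data; names are illustrative only */
data Ratings;
   input brand $ rating $ count @@;
   datalines;
A low 10  A mid 15  A high 5
B low 4   B mid 12  B high 14
;

proc genmod data=Ratings rorder=data;   /* response levels ordered as they appear */
   freq count;
   class brand;
   model rating = brand / dist=multinomial link=cumlogit;
run;
```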
The NOPRINT option, which suppresses displayed output in other SAS procedures, is not available
in the PROC GENMOD statement. However, you can use the Output Delivery System (ODS) to
suppress all displayed output, store all output on disk for further analysis, or create SAS data sets from
selected output. You can suppress all displayed output with the statement ODS SELECT NONE; and
turn displayed output back on with the statement ODS SELECT ALL;. See Table 42.12 and Table 42.13
for the names of output tables available from PROC GENMOD. For more information about ODS, see
Chapter 20, “Using the Output Delivery System.”
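The following sketch shows one way to combine these ODS statements: displayed output is suppressed while the parameter estimates table is still captured in a data set. The data set Counts, its variables, and the output data set name Est are hypothetical.

```sas
/* Hypothetical count data; names are illustrative only */
data Counts;
   input x y @@;
   datalines;
1 2  2 3  3 6  4 7  5 8  6 9  7 10  8 12  9 15  10 20
;

ods select none;                      /* suppress all displayed output      */
ods output ParameterEstimates=Est;    /* still save this table as a data set */
proc genmod data=Counts;
   model y = x / dist=poisson;
run;
ods select all;                       /* turn displayed output back on      */

proc print data=Est;
run;
```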
ASSESS Statement
ASSESS VAR=(effect)| LINK < / options > ;
ASSESSMENT VAR=(effect)| LINK < / options > ;
The ASSESS statement computes and plots, using ODS Graphics, model-checking statistics based on
aggregates of residuals. See the section “Assessment of Models Based on Aggregates of Residuals” on
page 2987 for details about the model assessment methods available in GENMOD.
The types of aggregates available are cumulative residuals, moving sums of residuals, and loess smoothed
residuals. If you do not specify which aggregate to use, the assessments are based on cumulative sums. PROC
GENMOD uses ODS Graphics for graphical displays. For specific information about the graphics available
in PROC GENMOD, see the section “ODS Graphics” on page 3016.
You must specify either LINK or VAR= in order to create an analysis.
LINK requests the assessment of the link function by performing the analysis with respect to the linear
predictor.
VAR=(effect) specifies that the functional form of a covariate be checked by performing the analysis with
respect to the variable identified by the effect. The effect must be specified in the MODEL statement and
must contain only continuous variables (variables not listed in a CLASS statement).
You can specify the following options after the slash (/).
CRPANEL
requests a plot with four panels, each showing just a few of the paths from the default aggregate plot, to
make it easier to compare simulated and observed paths. The plot in each panel contains aggregates of
the observed residuals and two simulated curves (fewer if NPATHS= is less than 8).
LOESS< (number ) >
LOWESS< (number ) >
requests model assessment based on loess smoothed residuals, where the optional number is the fraction of
data used; number must be between zero and one. If number is not specified, the default value one-third is
used.
NPATHS=number
NPATH=number
PATHS=number
PATH=number
specifies the number of simulated paths to plot in the default aggregate residuals plot. The default
value of number is twenty.
RESAMPLE< =number >
RESAMPLES< =number >
specifies that a p-value be computed based on 1,000 simulated paths, or number paths, if number is
specified.
SEED=number
specifies a seed for the normal random number generator used in creating simulated realizations of
aggregates of residuals for plots and estimating p-values. Specifying a seed enables you to produce
identical graphs and p-values from one run of the procedure to the next run. If a seed is not specified,
or if number is negative or zero, a random number seed is derived from the time of day.
WINDOW< (number ) >
requests assessment based on a moving sum window of width number . If number is not specified, a
value of one-half of the range of the x-coordinate is used.
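A minimal sketch of an ASSESS request follows. It checks the functional form of a single continuous covariate by using cumulative residuals, computes a p-value from simulated paths, and fixes the seed so that the run is reproducible. The simulated data set Fit and the variable names are assumptions made only for illustration.

```sas
/* Simulated illustrative data; names and parameter values are hypothetical */
data Fit;
   call streaminit(1);
   do i = 1 to 100;
      x = 10 * rand('uniform');
      y = rand('poisson', exp(0.1 + 0.2*x));
      output;
   end;
run;

ods graphics on;
proc genmod data=Fit;
   model y = x / dist=poisson;
   /* cumulative-residual check of x, p-value from 1,000 simulated paths */
   assess var=(x) / resample=1000 seed=12345 npaths=20 crpanel;
run;
ods graphics off;
```

Replacing VAR=(x) with LINK would instead assess the link function with respect to the linear predictor.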
BAYES Statement
BAYES < options > ;
The BAYES statement requests a Bayesian analysis of the regression model by using Gibbs sampling. The
Bayesian posterior samples (also known as the chain) for the regression parameters are not tabulated,
but they can be output to a SAS data set.
Table 42.2 summarizes the options available in the BAYES statement.
Table 42.2 BAYES Statement Options

Option             Description

Monte Carlo Options
INITIAL=           Specifies the initial values of the chain
INITIALMLE         Specifies that maximum likelihood estimates be used as
                   initial values of the chain
METROPOLIS=        Specifies the use of a Metropolis step in the ARMS algorithm
NBI=               Specifies the number of burn-in iterations
NMC=               Specifies the number of iterations after burn-in
SAMPLING=          Specifies the algorithm used to sample the posterior distribution
SEED=              Specifies the random number generator seed
THINNING=          Controls the thinning of the Markov chain

Model and Prior Options
COEFFPRIOR=        Specifies the prior of the regression coefficients
DISPERSIONPRIOR=   Specifies the prior of the dispersion parameter
PRECISIONPRIOR=    Specifies the prior of the precision parameter
SCALEPRIOR=        Specifies the prior of the scale parameter

Summary Statistics and Convergence Diagnostics
DIAGNOSTICS=       Displays convergence diagnostics
PLOTS=             Displays diagnostic plots
STATISTICS=        Displays summary statistics of the posterior samples

Posterior Samples
OUTPOST=           Names a SAS data set for the posterior samples
The following list describes these options and their suboptions.
COEFFPRIOR=JEFFREYS< (option) > | NORMAL< (options) > | UNIFORM
COEFF=JEFFREYS< (options) > | NORMAL< (options) > | UNIFORM
CPRIOR=JEFFREYS< (options) > | NORMAL< (options) > | UNIFORM
specifies the prior distribution for the regression coefficients. The default is COEFFPRIOR=UNIFORM,
which specifies the noninformative and improper prior of a constant.
Jeffreys’ prior is specified by COEFFPRIOR=JEFFREYS, which can be followed by the following
option in parentheses. Jeffreys’ prior is proportional to $|I(\beta)|^{1/2}$, where $I(\beta)$ is the Fisher information
matrix. See the section “Jeffreys’ Prior” on page 2997 and Ibrahim and Laud (1991) for more details.
CONDITIONAL
specifies that the Jeffreys’ prior, conditional on the current Markov chain value of the generalized
linear model precision parameter $\tau$, is proportional to $|\tau I(\beta)|^{1/2}$.
The normal prior is specified by COEFFPRIOR=NORMAL, which can be followed by one of the
following options enclosed in parentheses. However, if you do not specify an option, the normal prior
$N(0, 10^6 I)$, where $I$ is the identity matrix, is used. See the section “Normal Prior” on page 2997 for
more details.
CONDITIONAL
specifies that the normal prior, conditional on the current Markov chain value of the generalized
linear model precision parameter $\tau$, is $N(\mu, \tau^{-1}\Sigma)$, where $\mu$ and $\Sigma$ are the mean and covariance
of the normal prior specified by other normal options.
INPUT=SAS-data-set
specifies a SAS data set containing the mean and covariance information of the normal prior. The
data set must have a _TYPE_ variable to represent the type of each observation and a variable for
each regression coefficient. If the data set also contains a _NAME_ variable, the values of this
variable are used to identify the covariances for the _TYPE_=’COV’ observations; otherwise, the
_TYPE_=’COV’ observations are assumed to be in the same order as the explanatory variables
in the MODEL statement. PROC GENMOD reads the mean vector from the observation with
_TYPE_=’MEAN’ and reads the covariance matrix from observations with _TYPE_=’COV’. For
an independent normal prior, the variances can be specified with _TYPE_=’VAR’; alternatively,
the precisions (inverse of the variances) can be specified with _TYPE_=’PRECISION’.
RELVAR< =c >
specifies the normal prior $N(0, cJ)$, where $J$ is a diagonal matrix with diagonal elements equal to
the variances of the corresponding ML estimator. By default, $c = 10^6$.
VAR< =c >
specifies the normal prior $N(0, cI)$, where $I$ is the identity matrix.
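To illustrate the INPUT= data set layout described above, the following sketch specifies an independent normal prior by giving a mean row and a variance row. The data set names Counts and Prior, the prior values, and the assumption that the coefficient variables are named Intercept and x (matching the model) are all illustrative.

```sas
/* Hypothetical count data; names are illustrative only */
data Counts;
   input x y @@;
   datalines;
1 2  2 3  3 6  4 7  5 8  6 9  7 10  8 12  9 15  10 20
;

/* Prior information: one variable per regression coefficient.  */
/* _TYPE_='MEAN' gives the prior means, _TYPE_='VAR' the prior  */
/* variances of an independent normal prior (values are made up) */
data Prior;
   input _TYPE_ $ Intercept x;
   datalines;
MEAN 0   0
VAR  100 10
;

proc genmod data=Counts;
   model y = x / dist=poisson;
   bayes coeffprior=normal(input=Prior) seed=1;
run;
```

A simpler alternative under the same assumptions is COEFFPRIOR=NORMAL(VAR=100), which applies the same variance to every coefficient with a zero mean.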
DIAGNOSTICS=ALL | NONE | (keyword-list)
DIAG=ALL | NONE | (keyword-list)
controls the number of diagnostics produced. You can request all the following diagnostics by
specifying DIAGNOSTICS=ALL. If you do not want any of these diagnostics, specify DIAGNOSTICS=NONE. If you want some but not all of the diagnostics, or if you want to change certain
settings of these diagnostics, specify a subset of the following keywords. The default is DIAGNOSTICS=(AUTOCORR ESS GEWEKE).
AUTOCORR < (LAGS= numeric-list) >
computes the autocorrelations of lags given by LAGS= list for each parameter. Elements in
the list are truncated to integers and repeated values are removed. If the LAGS= option is not
specified, autocorrelations of lags 1, 5, 10, and 50 are computed for each variable. See the section
“Autocorrelations” on page 154 in Chapter 7, “Introduction to Bayesian Analysis Procedures,” for
details.
ESS
computes Carlin’s estimate of the effective sample size, the correlation time, and the efficiency of
the chain for each parameter. See the section “Effective Sample Size” on page 154 in Chapter 7,
“Introduction to Bayesian Analysis Procedures,” for details.
GELMAN < (gelman-options) >
computes the Gelman and Rubin convergence diagnostics. You can specify one or more of the
following gelman-options:
NCHAIN | N=number
specifies the number of parallel chains used to compute the diagnostic, and must be 2 or
larger. The default is NCHAIN=3. If an INITIAL= data set is used, NCHAIN defaults to the
number of rows in the INITIAL= data set. If any number other than this is specified with the
NCHAIN= option, the NCHAIN= value is ignored.
ALPHA=value
specifies the significance level for the upper bound. The default is ALPHA=0.05, resulting
in a 97.5% bound.
See the section “Gelman and Rubin Diagnostics” on page 146 in Chapter 7, “Introduction to
Bayesian Analysis Procedures,” for details.
GEWEKE < (geweke-options) >
computes the Geweke spectral density diagnostics, which are essentially a two-sample t test
between the first $f_1$ portion and the last $f_2$ portion of the chain. The default is $f_1 = 0.1$ and
$f_2 = 0.5$, but you can choose other fractions by using the following geweke-options:
FRAC1=value
specifies the fraction f1 for the first window.
FRAC2=value
specifies the fraction f2 for the second window.
See the section “Geweke Diagnostics” on page 148 in Chapter 7, “Introduction to Bayesian
Analysis Procedures,” for details.
HEIDELBERGER < (heidel-options) >
computes the Heidelberger and Welch diagnostic for each variable, which consists of a stationarity
test of the null hypothesis that the sample values form a stationary process. If the stationarity test
is not rejected, a halfwidth test is then carried out. Optionally, you can specify one or more of the
following heidel-options:
SALPHA=value
specifies the $\alpha$ level $(0 < \alpha < 1)$ for the stationarity test.
HALPHA=value
specifies the $\alpha$ level $(0 < \alpha < 1)$ for the halfwidth test.
EPS=value
specifies a positive number $\epsilon$ such that if the halfwidth is less than $\epsilon$ times the sample mean
of the retained iterates, the halfwidth test is passed.
See the section “Heidelberger and Welch Diagnostics” on page 150 in Chapter 7, “Introduction
to Bayesian Analysis Procedures,” for details.
MCSE
MCERROR
computes the Monte Carlo standard error for each parameter. The Monte Carlo standard error,
which measures the simulation accuracy, is the standard error of the posterior mean estimate
and is calculated as the posterior standard deviation divided by the square root of the effective
sample size. See the section “Standard Error of the Mean Estimate” on page 155 in Chapter 7,
“Introduction to Bayesian Analysis Procedures,” for details.
RAFTERY< (raftery-options) >
computes the Raftery and Lewis diagnostics that evaluate the accuracy of the estimated quantile
($\hat{\theta}_Q$ for a given $Q \in (0, 1)$) of a chain. $\hat{\theta}_Q$ can achieve any degree of accuracy when the
chain is allowed to run for a long time. A stopping criterion is when the estimated probability
$\hat{P}_Q = \Pr(\theta \le \hat{\theta}_Q)$ reaches within $\pm R$ of the value $Q$ with probability $S$; that is,
$\Pr(Q - R \le \hat{P}_Q \le Q + R) = S$. The following raftery-options enable you to specify $Q$, $R$, $S$, and a
precision level $\epsilon$ for the test:
QUANTILE | Q=value
specifies the order (a value between 0 and 1) of the quantile of interest. The default is 0.025.
ACCURACY | R=value
specifies a small positive number as the margin of error for measuring the accuracy of
estimation of the quantile. The default is 0.005.
PROBABILITY | S=value
specifies the probability of attaining the accuracy of the estimation of the quantile. The
default is 0.95.
EPSILON | EPS=value
specifies the tolerance level (a small positive number) for the stationary test. The default is
0.001.
See the section “Raftery and Lewis Diagnostics” on page 151 in Chapter 7, “Introduction to
Bayesian Analysis Procedures,” for details.
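As a sketch of how a keyword list with suboptions might be written, the following statements request autocorrelations at selected lags, the effective sample size, Geweke diagnostics with a wider first window, and Monte Carlo standard errors. The data set Counts and its variables are hypothetical.

```sas
/* Hypothetical count data; names are illustrative only */
data Counts;
   input x y @@;
   datalines;
1 2  2 3  3 6  4 7  5 8  6 9  7 10  8 12  9 15  10 20
;

proc genmod data=Counts;
   model y = x / dist=poisson;
   bayes seed=1
         diagnostics=(autocorr(lags=1 5 10)         /* lags to report        */
                      ess                           /* effective sample size */
                      geweke(frac1=0.2 frac2=0.5)   /* window fractions      */
                      mcse);                        /* Monte Carlo std error */
run;
```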
DISPERSIONPRIOR=GAMMA< (options) > | IGAMMA< (options) > | IMPROPER
DPRIOR=GAMMA< (options) > | IGAMMA< (options) > | IMPROPER
specifies that Gibbs sampling be performed on the generalized linear model dispersion parameter and
the prior distribution for the dispersion parameter, if there is a dispersion parameter in the model. For
models that do not have a dispersion parameter (the Poisson and binomial), this option is ignored.
Note that you can specify Gibbs sampling on either the dispersion parameter $\phi$, the scale parameter
$\sigma = \phi^{1/2}$, or the precision parameter $\tau = \phi^{-1}$, with the DPRIOR=, SPRIOR=, and PPRIOR= options,
respectively. These three parameters are transformations of one another, and you should specify Gibbs
sampling for only one of them.
A gamma prior $G(a, b)$ with density $f(t) = \frac{b(bt)^{a-1}e^{-bt}}{\Gamma(a)}$
is specified by DISPERSIONPRIOR=GAMMA, which can be followed by one of the following gamma-options enclosed in
parentheses. The hyperparameters a and b are the shape and inverse-scale parameters of the gamma
distribution, respectively. See the section “Gamma Prior” on page 2996 for details. The default is
$G(10^{-4}, 10^{-4})$.
RELSHAPE< =c >
specifies independent $G(c\hat{\phi}, c)$ distribution, where $\hat{\phi}$ is the MLE of the dispersion parameter.
With this choice of hyperparameters, the mean of the prior distribution is $\hat{\phi}$ and the variance is $\hat{\phi}/c$.
By default, $c = 10^{-4}$.
SHAPE=a
ISCALE=b
when both specified, results in a $G(a, b)$ prior.
SHAPE=c
when specified alone, results in a $G(c, c)$ prior.
ISCALE=c
when specified alone, results in a $G(c, c)$ prior.
An inverse gamma prior $IG(a, b)$ with density $f(t) = \frac{b^a}{\Gamma(a)} t^{-(a+1)} e^{-b/t}$
is specified by DISPERSIONPRIOR=IGAMMA, which can be followed by one of the following inverse gamma-options
enclosed in parentheses. The hyperparameters a and b are the shape and scale parameters of the inverse
gamma distribution, respectively. See the section “Inverse Gamma Prior” on page 2997 for details.
The default is $IG(2.001, 0.001)$.
RELSHAPE< =c >
specifies independent $IG\left(\frac{c + \hat{\phi}}{\hat{\phi}}, c\right)$ distribution, where $\hat{\phi}$ is the MLE of the dispersion parameter.
With this choice of hyperparameters, the mean of the prior distribution is $\hat{\phi}$. By default, $c = 10^{-4}$.
SHAPE=a
SCALE=b
when both specified, results in an $IG(a, b)$ prior.
SHAPE=c
when specified alone, results in an $IG(c, c)$ prior.
SCALE=c
when specified alone, results in an $IG(c, c)$ prior.
An improper prior with density $f(t)$ proportional to $t^{-1}$ is specified with DISPERSIONPRIOR=IMPROPER.
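The prior means stated for the RELSHAPE options can be checked from the standard moments of the gamma and inverse gamma distributions. The following is a short worked sketch, with $\hat{\phi}$ denoting the MLE of the dispersion parameter and $a$, $b$ the hyperparameters as defined above:

```latex
\begin{align*}
G(a, b):\quad & E[t] = \frac{a}{b}, \qquad \operatorname{Var}[t] = \frac{a}{b^2} \\
a = c\hat{\phi},\ b = c:\quad & E[t] = \frac{c\hat{\phi}}{c} = \hat{\phi},
   \qquad \operatorname{Var}[t] = \frac{c\hat{\phi}}{c^2} = \frac{\hat{\phi}}{c} \\[4pt]
IG(a, b):\quad & E[t] = \frac{b}{a - 1} \quad (a > 1) \\
a = \frac{c + \hat{\phi}}{\hat{\phi}},\ b = c:\quad & E[t] = \frac{c}{c/\hat{\phi}} = \hat{\phi}
\end{align*}
```

With the small default value of c, the gamma variance $\hat{\phi}/c$ is large, so the prior is centered at the MLE but is very diffuse.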
INITIAL=SAS-data-set
specifies the SAS data set that contains the initial values of the Markov chains. The INITIAL= data set
must contain all the variables of the model. You can specify multiple rows as the initial values of the
parallel chains for the Gelman-Rubin statistics, but posterior summaries, diagnostics, and plots are
computed only for the first chain. If the data set also contains the variable _SEED_, the value of the
_SEED_ variable is used as the seed of the random number generator for the corresponding chain.
INITIALMLE
specifies that maximum likelihood estimates of the model parameters be used as initial values of
the Markov chain. If this option is not specified, estimates of the mode of the posterior distribution
obtained by optimization are used as initial values.
METROPOLIS=YES | NO
specifies the use of a Metropolis step to generate Gibbs samples for posterior distributions that are not
log concave. The default value is METROPOLIS=YES.
NBI=number
specifies the number of burn-in iterations before the chains are saved. The default is 2000.
NMC=number
specifies the number of iterations after the burn-in. The default is 10000.
OUTPOST=SAS-data-set
OUT=SAS-data-set
names the SAS data set that contains the posterior samples. See the sections “OUTPOST= Output
Data Set” on page 2999 and “Posterior Samples Output Data Set” on page 2996 for more information.
Alternatively, you can create the output data set by specifying an ODS OUTPUT statement as follows:
ODS OUTPUT POSTERIORSAMPLE=SAS-data-set
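A minimal sketch of saving and then summarizing the chain follows. The burn-in and chain lengths are arbitrary illustrative choices, and the data set names Counts and Post are hypothetical.

```sas
/* Hypothetical count data; names are illustrative only */
data Counts;
   input x y @@;
   datalines;
1 2  2 3  3 6  4 7  5 8  6 9  7 10  8 12  9 15  10 20
;

proc genmod data=Counts;
   model y = x / dist=poisson;
   /* discard 1,000 burn-in iterations, keep 5,000, save the chain in Post */
   bayes seed=27513 nbi=1000 nmc=5000 outpost=Post;
run;

/* Summarize every numeric variable in the saved posterior sample */
proc means data=Post;
run;
```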
PRECISIONPRIOR=GAMMA< (options) > | IMPROPER
PPRIOR=GAMMA< (options) > | IMPROPER
specifies that Gibbs sampling be performed on the generalized linear model precision parameter and
the prior distribution for the precision parameter, if there is a precision parameter in the model. For
models that do not have a precision parameter (the Poisson and binomial), this option is ignored.
Note that you can specify Gibbs sampling on either the dispersion parameter $\phi$, the scale parameter
$\sigma = \phi^{1/2}$, or the precision parameter $\tau = \phi^{-1}$, with the DPRIOR=, SPRIOR=, and PPRIOR= options,
respectively. These three parameters are transformations of one another, and you should specify Gibbs
sampling for only one of them.
A gamma prior $G(a, b)$ with density $f(t) = \frac{b(bt)^{a-1}e^{-bt}}{\Gamma(a)}$
is specified by PRECISIONPRIOR=GAMMA, which can be followed by one of the following gamma-options enclosed
in parentheses. The hyperparameters a and b are the shape and inverse-scale parameters of the gamma
distribution, respectively. See the section “Gamma Prior” on page 2996 for details. The default is
$G(10^{-4}, 10^{-4})$.
RELSHAPE< =c >
specifies independent $G(c\hat{\phi}, c)$ distribution, where $\hat{\phi}$ is the MLE of the dispersion parameter.
With this choice of hyperparameters, the mean of the prior distribution is $\hat{\phi}$ and the variance is $\hat{\phi}/c$.
By default, $c = 10^{-4}$.
SHAPE=a
ISCALE=b
when both specified, results in a $G(a, b)$ prior.
SHAPE=c
when specified alone, results in a $G(c, c)$ prior.
ISCALE=c
when specified alone, results in a $G(c, c)$ prior.
An improper prior with density $f(t)$ proportional to $t^{-1}$ is specified with PRECISIONPRIOR=IMPROPER.
PLOTS< (global-plot-options) >=plot-request
PLOTS< (global-plot-options) >=(plot-request < . . . plot-request >)
controls the display of diagnostic plots. Three types of plots can be requested: trace plots, autocorrelation function plots, and kernel density plots. By default, the plots are displayed in panels unless the
global-plot-option UNPACK is specified. Also, when you specify more than one type of plot,
the plots are displayed by parameters unless the global-plot-option GROUPBY is specified. When you
specify only one plot-request , you can omit the parentheses around the plot-request . For example:
plots=none
plots(unpack)=trace
plots=(trace autocorr)
ODS Graphics must be enabled before requesting plots. For example, the following SAS statements
enable ODS Graphics:
ods graphics on;
proc genmod;
model y=x;
bayes plots=trace;
run;
ods graphics off;
The global-plot-options are as follows:
FRINGE
creates a fringe plot on the X axis of the density plot.
GROUPBY=PARAMETER
GROUPBY=TYPE
specifies how the plots are grouped when there is more than one type of plot.
GROUPBY=TYPE
specifies that the plots be grouped by type.
GROUPBY=PARAMETER
specifies that the plots be grouped by parameter.
GROUPBY=PARAMETER is the default.
LAGS=n
specifies that autocorrelations be plotted up to lag n. If this option is not specified, autocorrelations
are plotted up to lag 50.
SMOOTH
displays a fitted penalized B-spline curve for each trace plot.
UNPACKPANEL
UNPACK
specifies that all paneled plots be unpacked, meaning that each plot in a panel is displayed
separately.
The plot-requests include the following:
ALL
specifies all types of plots. PLOTS=ALL is equivalent to specifying PLOTS=(TRACE AUTOCORR DENSITY).
AUTOCORR
displays the autocorrelation function plots for the parameters.
DENSITY
displays the kernel density plots for the parameters.
NONE
suppresses all diagnostic plots.
TRACE
displays the trace plots for the parameters. See the section “Visual Analysis via Trace Plots” on
page 141 in Chapter 7, “Introduction to Bayesian Analysis Procedures,” for details.
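The following sketch combines several global-plot-options with a list of plot-requests: trace and autocorrelation plots are displayed individually rather than in panels, and autocorrelations are plotted up to lag 25. The data set Counts and its variables are hypothetical.

```sas
/* Hypothetical count data; names are illustrative only */
data Counts;
   input x y @@;
   datalines;
1 2  2 3  3 6  4 7  5 8  6 9  7 10  8 12  9 15  10 20
;

ods graphics on;
proc genmod data=Counts;
   model y = x / dist=poisson;
   /* global-plot-options in parentheses, plot-requests after the equal sign */
   bayes seed=1 plots(unpack lags=25)=(trace autocorr);
run;
ods graphics off;
```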
SAMPLING=option
specifies an algorithm used to sample the posterior distribution. The following options are available:
ARMS
GIBBS
use the ARMS algorithm.
GAMERMAN
GAM
use the Gamerman algorithm. This is the default method except for the normal distribution with a
conjugate prior. In this case a closed form for the posterior distribution is available, and samples
are obtained directly from the posterior distribution.
IM
use the independent Metropolis algorithm.
SCALEPRIOR=GAMMA< (options) > | IMPROPER
SPRIOR=GAMMA< (options) > | IMPROPER
specifies that Gibbs sampling be performed on the generalized linear model scale parameter and the
prior distribution for the scale parameter, if there is a scale parameter in the model. For models that do
not have a scale parameter (the Poisson and binomial), this option is ignored. Note that you can specify
Gibbs sampling on either the dispersion parameter $\phi$, the scale parameter $\sigma = \phi^{1/2}$, or the precision
parameter $\tau = \phi^{-1}$, with the DPRIOR=, SPRIOR=, and PPRIOR= options, respectively. These three
parameters are transformations of one another, and you should specify Gibbs sampling for only one of
them.
A gamma prior $G(a, b)$ with density $f(t) = \frac{b(bt)^{a-1}e^{-bt}}{\Gamma(a)}$
is specified by SCALEPRIOR=GAMMA,
which can be followed by one of the following gamma-options enclosed in parentheses. The hyperparameters a and b are the shape and inverse-scale parameters of the gamma distribution, respectively.
See the section “Gamma Prior” on page 2996 for details. The default is $G(10^{-4}, 10^{-4})$.
RELSHAPE< =c >
specifies independent $G(c\hat{\phi}, c)$ distribution, where $\hat{\phi}$ is the MLE of the dispersion parameter.
With this choice of hyperparameters, the mean of the prior distribution is $\hat{\phi}$ and the variance is $\hat{\phi}/c$.
By default, $c = 10^{-4}$.
SHAPE=a
ISCALE=b
when both specified, results in a $G(a, b)$ prior.
SHAPE=c
when specified alone, results in a $G(c, c)$ prior.
ISCALE=c
when specified alone, results in a $G(c, c)$ prior.
An improper prior with density $f(t)$ proportional to $t^{-1}$ is specified with SCALEPRIOR=IMPROPER.
SEED=number
specifies an integer seed in the range 1 to $2^{31} - 1$ for the random number generator in the simulation.
Specifying a seed enables you to reproduce identical Markov chains for the same specification. If the
SEED= option is not specified, or if you specify a nonpositive seed, a random seed is derived from the
time of day.
STATISTICS < (global-options) > = ALL | NONE | keyword | (keyword-list)
STATS < (global-options) > = ALL | NONE | keyword | (keyword-list)
controls the number of posterior statistics produced. Specifying STATISTICS=ALL is equivalent to
specifying STATISTICS= (SUMMARY INTERVAL COV CORR). If you do not want any posterior
statistics, you specify STATISTICS=NONE. The default is STATISTICS=(SUMMARY INTERVAL).
See the section “Summary Statistics” on page 155 in Chapter 7, “Introduction to Bayesian Analysis
Procedures,” for details. The global-options include the following:
ALPHA=numeric-list
controls the probabilities of the credible intervals. The ALPHA= values must be between 0 and 1.
Each ALPHA= value produces a pair of 100(1-ALPHA)% equal-tail and HPD intervals for each
parameter. The default is the value of the ALPHA= option in the MODEL statement, or 0.05 if
that option is not specified (yielding the 95% credible intervals for each parameter).
PERCENT=numeric-list
requests the percentile points of the posterior samples. The PERCENT= values must be between
0 and 100. The default is PERCENT=25, 50, 75, which yields the 25th, 50th, and 75th percentile
points, respectively, for each parameter.
The list of keywords includes the following:
CORR
produces the posterior correlation matrix.
COV
produces the posterior covariance matrix.
SUMMARY
produces the means, standard deviations, and percentile points for the posterior samples. The
default is to produce the 25th, 50th, and 75th percentile points, but you can use the global
PERCENT= option to request specific percentile points.
INTERVAL
produces equal-tail credible intervals and HPD intervals. The default is to produce the 95%
equal-tail credible intervals and 95% HPD intervals, but you can use the global ALPHA= option
to request intervals of any probabilities.
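As an illustrative sketch, the following statements request 90% intervals and the 5th, 50th, and 95th percentile points in place of the defaults. The data set Counts is hypothetical, and writing the PERCENT= list with blanks as separators is an assumption about list syntax rather than something stated above.

```sas
/* Hypothetical count data; names are illustrative only */
data Counts;
   input x y @@;
   datalines;
1 2  2 3  3 6  4 7  5 8  6 9  7 10  8 12  9 15  10 20
;

proc genmod data=Counts;
   model y = x / dist=poisson;
   /* global-options in parentheses, keywords after the equal sign */
   bayes seed=1 statistics(alpha=0.1 percent=5 50 95)=(summary interval);
run;
```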
THINNING=number
THIN=number
controls the thinning of the Markov chain. Only one in every k samples is used when THINNING=k,
and if NBI=$n_0$ and NMC=$n$, the number of samples kept is
$$\left[\frac{n_0 + n}{k}\right] - \left[\frac{n_0}{k}\right]$$
where [a] represents the integer part of the number a. The default is THINNING=1.
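As a worked example of this count, take the default values NBI=2000 and NMC=10000 together with THINNING=3 (the thinning value is chosen only for illustration):

```latex
\[
\left[\frac{n_0 + n}{k}\right] - \left[\frac{n_0}{k}\right]
  = \left[\frac{2000 + 10000}{3}\right] - \left[\frac{2000}{3}\right]
  = 4000 - 666
  = 3334
\]
```

The integer part matters here: $2000/3 \approx 666.67$ is truncated to 666, so 3334 samples are kept rather than the 3333 that $10000/3$ alone would suggest.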
BY Statement
BY variables ;
You can specify a BY statement with PROC GENMOD to obtain separate analyses of observations in groups
that are defined by the BY variables. When a BY statement appears, the procedure expects the input data
set to be sorted in order of the BY variables. If you specify more than one BY statement, only the last one
specified is used.
If your input data set is not sorted in ascending order, use one of the following alternatives:
• Sort the data by using the SORT procedure with a similar BY statement.
• Specify the NOTSORTED or DESCENDING option in the BY statement for the GENMOD procedure.
The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged
in groups (according to values of the BY variables) and that these groups are not necessarily in
alphabetical or increasing numeric order.
• Create an index on the BY variables by using the DATASETS procedure (in Base SAS software).
For more information about BY-group processing, see the discussion in SAS Language Reference: Concepts.
For more information about the DATASETS procedure, see the discussion in the Base SAS Procedures Guide.
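The first alternative can be sketched as follows: the data are sorted by the BY variable and the same BY statement is then used in PROC GENMOD, which fits a separate model for each group. The data set Trials and the variables site, x, and y are hypothetical.

```sas
/* Hypothetical data that are not sorted by site; names are illustrative only */
data Trials;
   input site $ x y @@;
   datalines;
B 1 3  B 2 4  B 3 7  B 4 9
A 1 2  A 2 2  A 3 5  A 4 6
;

proc sort data=Trials;
   by site;                 /* sort in ascending order of the BY variable */
run;

proc genmod data=Trials;
   by site;                 /* one analysis per site */
   model y = x / dist=poisson;
run;
```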
CLASS Statement
CLASS variable < (options) > . . . < variable < (options) > > < / global-options > ;
The CLASS statement names the classification variables to be used as explanatory variables in the analysis.
Response variables do not need to be specified in the CLASS statement.
The CLASS statement must precede the MODEL statement. Most options can be specified either as individual
variable options or as global-options. You can specify options for each variable by enclosing the options in
parentheses after the variable name. You can also specify global-options for the CLASS statement by placing
them after a slash (/). Global-options are applied to all the variables specified in the CLASS statement. If you
specify more than one CLASS statement, the global-options specified in any one CLASS statement apply to
all CLASS statements. However, individual CLASS variable options override the global-options. You can
specify the following values for either an option or a global-option:
CPREFIX=n
specifies that, at most, the first n characters of a CLASS variable name be used in creating names for
the corresponding design variables. The default is $32 - \min(32, \max(2, f))$, where f is the formatted
length of the CLASS variable.
DESCENDING
DESC
reverses the sort order of the classification variable. If both the DESCENDING and ORDER= options
are specified, PROC GENMOD orders the categories according to the ORDER= option and then
reverses that order.
LPREFIX=n
specifies that, at most, the first n characters of a CLASS variable label be used in creating labels for the
corresponding design variables. The default is $256 - \min(256, \max(2, f))$, where f is the formatted
length of the CLASS variable.
MISSING
treats missing values (., ._, .A, . . . , .Z for numeric variables and blanks for character variables) as valid
values for the CLASS variable.
ORDER=DATA | FORMATTED | FREQ | INTERNAL
specifies the sort order for the levels of classification variables. This ordering determines which
parameters in the model correspond to each level in the data, so the ORDER= option can be useful when
you use the CONTRAST statement. By default, ORDER=FORMATTED. For ORDER=FORMATTED
and ORDER=INTERNAL, the sort order is machine-dependent. When ORDER=FORMATTED is in
effect for numeric variables for which you have supplied no explicit format, the levels are ordered by
their internal values.
The following table shows how PROC GENMOD interprets values of the ORDER= option.
Value of ORDER=   Levels Sorted By
DATA              Order of appearance in the input data set
FORMATTED         External formatted values, except for numeric
                  variables with no explicit format, which are sorted
                  by their unformatted (internal) values
FREQ              Descending frequency count; levels with more
                  observations come earlier in the order
INTERNAL          Unformatted value
For more information about sort order, see the chapter on the SORT procedure in the Base SAS
Procedures Guide and the discussion of BY-group processing in SAS Language Reference: Concepts.
PARAM=keyword
specifies the parameterization method for the classification variable or variables. You can specify
any of the keywords shown in the following table. Design matrix columns are created from CLASS
variables according to the corresponding coding schemes:
Value of PARAM=   Coding
EFFECT            Effect coding
GLM               Less-than-full-rank reference cell coding (this
                  keyword can be used only in a global option)
ORDINAL           Cumulative parameterization for an ordinal
THERMOMETER       CLASS variable
POLYNOMIAL        Polynomial coding
POLY
REFERENCE         Reference cell coding
REF
ORTHEFFECT        Orthogonalizes PARAM=EFFECT coding
ORTHORDINAL       Orthogonalizes PARAM=ORDINAL coding
ORTHOTHERM
ORTHPOLY          Orthogonalizes PARAM=POLYNOMIAL coding
ORTHREF           Orthogonalizes PARAM=REFERENCE coding
All parameterizations are full rank, except for the GLM parameterization. The REF= option in the
CLASS statement determines the reference level for EFFECT and REFERENCE coding and for their
orthogonal parameterizations. It also indirectly determines the reference level for a singular GLM
parameterization through the order of levels.
If PARAM=ORTHPOLY or PARAM=POLY and the classification variable is numeric, then the
ORDER= option in the CLASS statement is ignored, and the internal unformatted values are used. See
the section “Other Parameterizations” on page 391 in Chapter 19, “Shared Concepts and Topics,” for
further details.
REF=’level’ | keyword
specifies the reference level for PARAM=EFFECT, PARAM=REFERENCE, and their orthogonalizations. For PARAM=GLM, the REF= option specifies a level of the classification variable to be put at
the end of the list of levels. This level thus corresponds to the reference level in the usual interpretation
of the linear estimates with a singular parameterization.
For an individual variable REF= option (but not for a global REF= option), you can specify the level
of the variable to use as the reference level. Specify the formatted value of the variable if a format is
assigned. For a global or individual variable REF= option, you can use one of the following keywords.
The default is REF=LAST.
FIRST
designates the first ordered level as reference.
LAST
designates the last ordered level as reference.
TRUNCATE< =n >
specifies the length n of CLASS variable values to use in determining CLASS variable levels. The
default is to use the full formatted length of the CLASS variable. If you specify TRUNCATE without
the length n, the first 16 characters of the formatted values are used. When formatted values are longer
than 16 characters, you can use this option to revert to the levels as determined in releases before SAS
9. The TRUNCATE option is available only as a global option.
Class Variable Default Parameterization
If you do not specify the PARAM= option, the default PARAM=GLM parameterization is used.
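As an illustration of the CLASS options described above, the following statements sketch how an individual PARAM= and REF= specification might be written. The data set trial and the variables dose and y are hypothetical and are used only for this example.

```sas
/* Hypothetical data: a three-level dose group and a binary response */
data trial;
   input dose $ y @@;
   datalines;
low 0  low 0  low 1  mid 0  mid 1  mid 1  high 1  high 1  high 0
;

proc genmod data=trial;
   /* Reference cell coding with the formatted value 'low' as the reference level */
   class dose (param=ref ref='low');
   model y = dose / dist=bin link=logit;
run;
```

Writing the options after a slash instead (for example, class dose / param=ref ref=first;) applies them globally to every CLASS variable, in which case REF= accepts only the FIRST and LAST keywords.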
Class Variable Naming Convention
Parameter names for a CLASS predictor variable are constructed by concatenating the CLASS variable name
with the CLASS levels. However, for the POLYNOMIAL and orthogonal parameterizations, parameter
names are formed by concatenating the CLASS variable name and keywords that reflect the parameterization.
See the section “Other Parameterizations” on page 391 in Chapter 19, “Shared Concepts and Topics,” for
examples and further details.
Class Variable Parameterization with Unbalanced Designs
PROC GENMOD initially parameterizes the CLASS variables by looking at the levels of the variables across
the complete data set. If you have an unbalanced replication of levels across variables or BY groups, then
the design matrix and the parameter interpretation might be different from what you expect. For instance,
suppose you have a model with one CLASS variable A with three levels (1, 2, and 3), and another CLASS
variable B with two levels (1 and 2). If the third level of A occurs only with the first level of B, if you use the
EFFECT parameterization, and if your model contains the effect A(B) and an intercept, then the design for A
within the second level of B is not a differential effect. In particular, the design looks like the following:
                Design Matrix
           A(B=1)        A(B=2)
B    A    A1    A2      A1    A2
1    1     1     0       0     0
1    2     0     1       0     0
1    3    –1    –1       0     0
2    1     0     0       1     0
2    2     0     0       0     1
PROC GENMOD detects linear dependency among the last two design variables and sets the parameter for
A2(B=2) to zero, resulting in an interpretation of these parameters as if they were reference- or dummy-coded.
The REFERENCE or GLM parameterization might be more appropriate for such problems.
CODE Statement
CODE < options > ;
The CODE statement writes SAS DATA step code for computing predicted values of the fitted model either
to a file or to a catalog entry. This code can then be included in a DATA step to score new data.
Table 42.3 summarizes the options available in the CODE statement.
Table 42.3 CODE Statement Options
Option        Description
CATALOG=      Names the catalog entry where the generated code is saved
DUMMIES       Retains the dummy variables in the data set
ERROR         Computes the error function
FILE=         Names the file where the generated code is saved
FORMAT=       Specifies the numeric format for the regression coefficients
GROUP=        Specifies the group identifier for array names and statement labels
IMPUTE        Imputes predicted values for observations with missing or invalid covariates
LINESIZE=     Specifies the line size of the generated code
LOOKUP=       Specifies the algorithm for looking up CLASS levels
RESIDUAL      Computes residuals
For details about the syntax of the CODE statement, see the section “CODE Statement” on page 395 in
Chapter 19, “Shared Concepts and Topics.”
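The following statements sketch one way the CODE statement might be used to score new data. The data sets train, newobs, and scored, the variables x and y, and the file name genmod_score.sas are assumptions made for this illustration; the file is written to and read from the current working directory.

```sas
/* Hypothetical training data */
data train;
   input x y @@;
   datalines;
1 2  2 3  3 6  4 7  5 12  6 15
;

proc genmod data=train;
   model y = x / dist=poisson link=log;
   /* Write DATA step scoring code to an external file (file name is an assumption) */
   code file='genmod_score.sas';
run;

/* Hypothetical new observations to be scored */
data newobs;
   input x @@;
   datalines;
7 8
;

data scored;
   set newobs;
   /* Include the generated code to compute predicted values */
   %include 'genmod_score.sas';
run;
```

The names of the variables that the generated code creates are determined by the procedure; inspect the generated file before relying on a particular variable name.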
CONTRAST Statement
CONTRAST 'label' contrast-specification < / options > ;
The CONTRAST statement provides a means of obtaining a test of a specified hypothesis concerning the
model parameters. This is accomplished by specifying a matrix L for testing the hypothesis $L'\beta = 0$. You
must be familiar with the details of the model parameterization that PROC GENMOD uses. For more
information, see the section “Parameterization Used in PROC GENMOD” on page 2968 and the section
“CLASS Statement” on page 2916. Computed statistics are based on the asymptotic chi-square distribution
of the likelihood ratio statistic, or the generalized score statistic for GEE models, with degrees of freedom
determined by the number of linearly independent rows in the $L'$ matrix. You can request Wald chi-square
statistics with the WALD option in the CONTRAST statement.
There is no limit to the number of CONTRAST statements that you can specify, but they must appear after
the MODEL statement and after the ZEROMODEL statement for zero-inflated models. Statistics for multiple
CONTRAST statements are displayed in a single table.
The elements of the CONTRAST statement are as follows:
label
identifies the contrast on the output. A label is required for every contrast specified. Labels can be
up to 20 characters and must be enclosed in single quotes.
contrast-specification
identifies the effects and their coefficients from which the L matrix is formed. The
contrast-specification can be specified in two different ways. The first method applies to all
models except the zero-inflated (ZI) distributions (zero-inflated Poisson and zero-inflated negative
binomial), and the syntax is:
effect values < , ... effect values >
The second method of specifying a contrast applies only to ZI models, and the syntax is:
effect values < , ... effect values > @ZERO effect values < , ... effect values >
where
effect
identifies an effect that appears in the MODEL statement. The value INTERCEPT or
intercept can be used as an effect when an intercept is included in the model. You do
not need to include all effects that are included in the MODEL statement.
values
are constants that are elements of the L vector associated with the effect.
options
specifies CONTRAST statement options.
Specification of sets of effect values before the @ZERO separator results in a row of the $L'$ matrix with
coefficients for effects in the regression part of the model set to values and with the coefficients for the
zero-inflation part of the model set to zero. Specification of sets of effect values after the @ZERO separator
results in a row of the $L'$ matrix with the coefficients for the regression part of the model set to zero and with
the coefficients of effects in the zero-inflation part of the model set to values.
For example, the statements
class a;
model y=a;
contrast 'Label1' A 1 -1;
specify an $L'$ matrix with one row with coefficients 1 for the first level of A and –1 for the second level of A.
The statements
class a b;
model y=a / dist=zip;
zeromodel b;
contrast 'Label2' A 1 -1 @zero B 1 -1;
specify an $L'$ matrix with two rows: the first row has coefficients 1 for the first level of A, –1 for the second
level of A, and zeros for all levels of B; the second row has coefficients 0 for all levels of A, 1 for the first
level of B, and –1 for the second level of B.
The rows of $L'$ are specified in order and are separated by commas.
If you use the default less-than-full-rank PROC GLM CLASS variable parameterization, each row of the
$L'$ matrix is checked for estimability. If PROC GENMOD finds a contrast to be nonestimable, it displays
missing values in corresponding rows in the results. See Searle (1971) for a discussion of estimable functions.
If the elements of $L'$ are not specified for an effect that contains a specified effect, then the elements of the
specified effect are distributed over the levels of the higher-order effect just as the GLM procedure does for
its CONTRAST and ESTIMATE statements. For example, suppose that the model contains effects A and B
and their interaction A*B. If you specify a CONTRAST statement involving A alone, the $L'$ matrix contains
nonzero terms for both A and A*B, since A*B contains A.
When you use any of the full-rank PARAM= CLASS variable options, all parameters are directly estimable,
and rows of $L'$ are not checked for estimability.
If an effect is not specified in the CONTRAST statement, all of its coefficients in the $L'$ matrix are set to 0. If
too many values are specified for an effect, the extra ones are ignored. If too few values are specified, the
remaining ones are set to 0.
PROC GENMOD handles missing level combinations of classification variables in the same manner as the
GLM and MIXED procedures. Parameters corresponding to missing level combinations are not included
in the model. This convention can affect the way in which you specify the L matrix in your CONTRAST
statement.
If you specify the WALD option, the test of hypothesis is based on a Wald chi-square statistic. If you omit
the WALD option, the test statistic computed depends on whether an ordinary generalized linear model or a
GEE-type model is specified.
For an ordinary generalized linear model, the CONTRAST statement computes the likelihood ratio statistic.
This is defined to be twice the difference between the log likelihood of the model unconstrained by the
contrast and the log likelihood with the model fitted under the constraint that the linear function of the
parameters defined by the contrast is equal to 0. A p-value is computed based on the asymptotic chi-square
distribution of the likelihood ratio statistic.
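The verbal definition above can be written compactly as follows. The symbols $\ell$, $\hat{\beta}$, and $\tilde{\beta}$ are notation introduced here for illustration and do not appear elsewhere in this section: $\ell$ denotes the log likelihood, $\hat{\beta}$ the unconstrained maximum likelihood estimate, and $\tilde{\beta}$ the estimate obtained under the constraint $L'\beta = 0$.

```latex
\mathrm{LR} \;=\; 2\,\bigl[\,\ell(\hat{\beta}) - \ell(\tilde{\beta})\,\bigr],
\qquad
\tilde{\beta} \;=\; \arg\max_{\beta \,:\, L'\beta = 0} \ell(\beta)
```

Under the null hypothesis, LR is referred to a chi-square distribution whose degrees of freedom equal the number of linearly independent rows of $L'$.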
If you specify a GEE model with the REPEATED statement, the test is based on a score statistic. The GEE
model is fit under the constraint that the linear function of the parameters defined by the contrast is equal
to 0. The score chi-square statistic is computed based on the generalized score function. See the section
“Generalized Score Statistics” on page 2987 for more information.
The degrees of freedom is the number of linearly independent constraints implied by the CONTRAST
statement—that is, the rank of L.
You can specify the following options after a slash (/).
E
requests that the L matrix be displayed.
SINGULAR=number
EPSILON=number
tunes the estimability checking. If v is a vector, define ABS(v) to be the absolute value of the element
of v with the largest absolute value. Let $K'$ be any row in the contrast matrix L. Define C to be equal
to ABS($K'$) if ABS($K'$) is greater than 0; otherwise, C equals 1. If ABS($K' - K'T$) is greater than
C*number, then K is declared nonestimable. T is the Hermite form matrix $(X'X)^{-}(X'X)$, and $(X'X)^{-}$
represents a generalized inverse of the matrix $X'X$. The value for number must be between 0 and 1;
the default value is 1E–4. The SINGULAR= option in the MODEL statement affects the computation
of the generalized inverse of the matrix $X'X$. It might also be necessary to adjust this value for some
data.
WALD
requests that a Wald chi-square statistic be computed for the contrast rather than the default likelihood
ratio or score statistic. The Wald statistic for testing $L'\beta = 0$ is defined by
$$S = (L'\hat{\beta})'\,(L'\Sigma L)^{-}\,(L'\hat{\beta})$$
where $\hat{\beta}$ is the maximum likelihood estimate and $\Sigma$ is its estimated covariance matrix. The asymptotic
distribution of S is $\chi^2_r$, where r is the rank of L. Computed p-values are based on this distribution.
If you specify a GEE model with the REPEATED statement, $\Sigma$ is the empirical covariance matrix
estimate.
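The following statements sketch how the CONTRAST options might be combined in practice. The data set assay and the variables a and y are hypothetical. The first CONTRAST statement uses the default likelihood ratio test, the second requests the Wald test and displays the L matrix, and the third specifies a two-row contrast.

```sas
/* Hypothetical count data for a three-level factor */
data assay;
   input a $ y @@;
   datalines;
a1 3  a1 5  a1 4  a2 7  a2 9  a2 6  a3 12  a3 10  a3 14
;

proc genmod data=assay;
   class a;
   model y = a / dist=poisson link=log;
   /* Default likelihood ratio test that levels a1 and a2 have equal effects */
   contrast 'a1 vs a2' a 1 -1 0;
   /* Same hypothesis tested with a Wald statistic; E displays the L matrix */
   contrast 'a1 vs a2 (Wald)' a 1 -1 0 / wald e;
   /* Two rows separated by a comma: joint test that all three levels are equal */
   contrast 'all equal' a 1 -1 0, a 0 1 -1;
run;
```

Because the default GLM parameterization is used here, each row is checked for estimability before the test is computed.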
DEVIANCE Statement
DEVIANCE variable=expression ;
You can specify a probability distribution other than those available in PROC GENMOD by using the
DEVIANCE and VARIANCE statements. You do not need to specify the DEVIANCE or VARIANCE
statement if you use the DIST= MODEL statement option to specify a probability distribution. The variable
identifies the deviance contribution from a single observation to the procedure, and it must be a valid SAS
variable name that does not appear in the input data set. The expression can be any arithmetic expression
supported by the DATA step language, and it is used to define the functional dependence of the deviance on
the mean and the response. You use the automatic variables _MEAN_ and _RESP_ to represent the mean and
response in the expression.
Alternatively, the deviance function can be defined using programming statements (see the section “Programming Statements” on page 2947) and assigned to a variable, which is then listed as the expression. This form
is convenient for using complex statements such as IF-THEN/ELSE clauses.
The DEVIANCE statement is ignored unless the VARIANCE statement is also specified.
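As a minimal sketch of the programming-statement form, the following statements define a Poisson-type variance function and deviance by hand. The data set counts and the variable names mu, yy, d, var, and dev are assumptions chosen for this illustration; none of them appear in the input data set. The IF-THEN/ELSE clause avoids evaluating the logarithm when the response is zero, in which case the deviance contribution reduces to twice the mean.

```sas
/* Hypothetical count data */
data counts;
   input x y @@;
   datalines;
1 0  2 1  3 3  4 4  5 8  6 11
;

proc genmod data=counts;
   /* Programming statements: copy the automatic variables */
   mu = _MEAN_;
   yy = _RESP_;
   /* Deviance contribution of one observation, guarding against log(0) */
   if yy > 0 then d = 2 * (yy * log(yy / mu) - (yy - mu));
   else d = 2 * mu;
   variance var = mu;
   deviance dev = d;
   model y = x / link=log;
run;
```

Both the VARIANCE and DEVIANCE statements are present, because the DEVIANCE statement is ignored otherwise, and no DIST= option is specified in the MODEL statement.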
EFFECTPLOT Statement
EFFECTPLOT < plot-type < (plot-definition-options) > > < / options > ;
The EFFECTPLOT statement produces a display of the fitted model and provides options for changing and
enhancing the displays. Table 42.4 describes the available plot-types and their plot-definition-options.
Table 42.4 Plot-Types and Plot-Definition-Options
Each plot-type is listed with its description, followed by its plot-definition-options.
BOX
    Displays a box plot of continuous response data at each level of a CLASS effect, with predicted values
    superimposed and connected by a line. This is an alternative to the INTERACTION plot-type.
    Plot-definition-options:
        PLOTBY= variable or CLASS effect
        X= CLASS variable or effect
CONTOUR
    Displays a contour plot of predicted values against two continuous covariates.
    Plot-definition-options:
        PLOTBY= variable or CLASS effect
        X= continuous variable
        Y= continuous variable
FIT
    Displays a curve of predicted values versus a continuous variable.
    Plot-definition-options:
        PLOTBY= variable or CLASS effect
        X= continuous variable
INTERACTION
    Displays a plot of predicted values (possibly with error bars) versus the levels of a CLASS effect. The
    predicted values are connected with lines and can be grouped by the levels of another CLASS effect.
    Plot-definition-options:
        PLOTBY= variable or CLASS effect
        SLICEBY= variable or CLASS effect
        X= CLASS variable or effect
MOSAIC
    Displays a mosaic plot of predicted values by using up to three CLASS effects.
    Plot-definition-options:
        PLOTBY= variable or CLASS effect
        X= CLASS effects
SLICEFIT
    Displays a curve of predicted values versus a continuous variable, grouped by the levels of a
    CLASS effect.
    Plot-definition-options:
        PLOTBY= variable or CLASS effect
        SLICEBY= variable or CLASS effect
        X= continuous variable
For full details about the syntax and options of the EFFECTPLOT statement, see the section “EFFECTPLOT
Statement” on page 416 in Chapter 19, “Shared Concepts and Topics.”
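The following statements sketch a SLICEFIT request. The data set doseresp and the variables grp, x, and y are hypothetical. ODS Graphics must be enabled for the plot to be produced.

```sas
ods graphics on;

/* Hypothetical data: two groups measured at four values of x */
data doseresp;
   input grp $ x y @@;
   datalines;
a 1 2  a 2 3  a 3 5  a 4 8  b 1 1  b 2 2  b 3 2  b 4 4
;

proc genmod data=doseresp;
   class grp;
   model y = grp x / dist=poisson link=log;
   /* One fitted curve per level of grp, plotted against the continuous variable x */
   effectplot slicefit(x=x sliceby=grp);
run;
```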
ESTIMATE Statement
ESTIMATE 'label' contrast-specification < / options > ;
The ESTIMATE statement is similar to a CONTRAST statement, except only one-row $L'$ matrices are
permitted.
The elements of the ESTIMATE statement are as follows:
label
identifies the contrast on the output. A label is required for every contrast specified. Labels can be
up to 20 characters and must be enclosed in single quotes.
contrast-specification
identifies the effects and their coefficients from which the L matrix is formed. The
contrast-specification can be specified in two different ways. The first method applies to all
models except the zero-inflated (ZI) distributions (zero-inflated Poisson and zero-inflated negative
binomial), and the syntax is:
effect values < ... effect values >
The second method of specifying a contrast applies only to ZI models, and the syntax is:
effect values < ... effect values > @ZERO effect values < ... effect values >
where
effect
identifies an effect that appears in the MODEL statement. The value INTERCEPT or
intercept can be used as an effect when an intercept is included in the model. You do
not need to include all effects that are included in the MODEL statement.
values
are constants that are elements of the L vector associated with the effect.
options
specifies options for the ESTIMATE statement.
For ZI models, sets of effect values before the @ZERO separator correspond to the regression part of
the model with regression parameters $\beta$, and effect values after the @ZERO separator correspond to the
zero-inflation part of the model with regression parameters $\gamma$. In the case of ZI models, a one-row $L'$ matrix
is created for the regression part of the model, another one-row $L'$ matrix is created for the zero-inflation part
of the model, and separate estimates for the two L matrices are computed and displayed.
If you use the default less-than-full-rank GLM CLASS variable parameterization, each row is checked
for estimability. If PROC GENMOD finds a contrast to be nonestimable, it displays missing values in
corresponding rows in the results. See Searle (1971) for a discussion of estimable functions.
The actual estimate, $L'\beta$ (and $L'\gamma$ for ZI models), its approximate standard error, and confidence limits are
displayed. Additionally, the corresponding estimate on the mean scale (defined as the inverse link function
applied to $L'\beta$), and confidence limits are displayed. Wald chi-square tests that $L'\beta = 0$ and $L'\gamma = 0$ are also
displayed.
The approximate standard error of the estimate is computed as the square root of $L'\hat{\Sigma}L$, where $\hat{\Sigma}$ is the
estimated covariance matrix of the parameter estimates. If you specify a GEE model in the REPEATED
statement, $\hat{\Sigma}$ is the empirical covariance matrix estimate.
If you specify the EXP option, then $\exp(L'\beta)$, its standard error, and its confidence limits are also displayed.
The construction of the L vector and the checking for estimability for an ESTIMATE statement follow the
same rules as listed under the CONTRAST statement.
You can specify the following options in the ESTIMATE statement after a slash (/).
ALPHA=number
requests that a confidence interval be constructed with confidence level 1 – number . The value of
number must be between 0 and 1; the default value is 0.05.
DIVISOR=number
specifies a value by which to divide all coefficients so that fractional coefficients can be entered as
integer numerators. For example, you can use
estimate '1/3(A1+A2) - 2/3A3' a 1 1 -2 / divisor=3;
instead of
estimate '1/3(A1+A2) - 2/3A3' a 0.33333 0.33333 -0.66667;
E
requests that the L matrix coefficients be displayed.
EXP
requests that $\exp(L'\beta)$, its standard error, and its confidence limits be computed. If you specify the
EXP option, standard errors are computed using the delta method. Confidence limits are computed by
exponentiating the confidence limits for $L'\beta$.
SINGULAR=number
EPSILON=number
tunes the estimability checking as described for the CONTRAST statement.
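For reference, the first-order delta method mentioned under the EXP option leads to the following approximation. This is a sketch of the standard result rather than a statement of the procedure's internal computation; $\hat{\beta}$ and $\hat{\Sigma}$ denote the parameter estimate and its estimated covariance matrix, and $z_{1-\alpha/2}$ is notation introduced here for the standard normal quantile, on the assumption that the limits for $L'\beta$ are Wald limits.

```latex
\widehat{\mathrm{SE}}\bigl\{\exp(L'\hat{\beta})\bigr\}
  \;\approx\; \exp(L'\hat{\beta})\,\sqrt{L'\hat{\Sigma}L},
\qquad
\Bigl(\exp\bigl(L'\hat{\beta} - z_{1-\alpha/2}\sqrt{L'\hat{\Sigma}L}\bigr),\;
      \exp\bigl(L'\hat{\beta} + z_{1-\alpha/2}\sqrt{L'\hat{\Sigma}L}\bigr)\Bigr)
```

Because the limits are obtained by exponentiating the limits for $L'\beta$, they are not symmetric about $\exp(L'\hat{\beta})$.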
EXACT Statement
EXACT < 'label' > < INTERCEPT > < effects > < / options > ;
The EXACT statement performs exact tests of the parameters for the specified effects and optionally estimates
the parameters and outputs the exact conditional distributions. You can specify the keyword INTERCEPT
and any effects in the MODEL statement. Inference on the parameters of the specified effects is performed
by conditioning on the sufficient statistics of all the other model parameters (possibly including the intercept).
You can specify several EXACT statements, but they must follow the MODEL statement. Each statement can
optionally include an identifying label . If several EXACT statements are specified, any statement without
a label is assigned a label of the form “Exactn,” where n indicates the nth EXACT statement. The label is
included in the headers of the displayed exact analysis tables.
If a STRATA statement is also specified, then a stratified exact logistic regression or a stratified exact Poisson
regression is performed. The model contains a different intercept for each stratum, and these intercepts are
conditioned out of the model along with any other nuisance parameters (parameters for effects specified in
the MODEL statement that are not in the EXACT statement).
The ASSESSMENT, BAYES, CONTRAST, EFFECTPLOT, ESTIMATE, LSMEANS, LSMESTIMATE,
OUTPUT, SLICE, and STORE statements are not available with an exact analysis. Exact analyses are not
performed when you specify a WEIGHT statement, or a model other than LINK=LOGIT with DIST=BIN or
LINK=LOG with DIST=POISSON. An OFFSET= variable is not available with exact logistic regression.
Exact estimation is not available for ordinal response models.
For classification variables, use of the reference parameterization is recommended.
The following options can be specified in each EXACT statement after a slash (/):
ALPHA=number
specifies the level of significance $\alpha$ for $100(1-\alpha)\%$ confidence limits for the parameters or odds
ratios. The value of number must be between 0 and 1. By default, number is equal to the value of the
ALPHA= option in the MODEL statement, or 0.05 if that option is not specified.
CLTYPE=EXACT | MIDP
requests either the exact or mid-p confidence intervals for the parameter estimates. By default, the
exact intervals are produced. The confidence coefficient can be specified with the ALPHA= option.
The mid-p interval can be modified with the MIDPFACTOR= option. See the section “Exact Logistic
and Exact Poisson Regression” on page 2999 for details.
ESTIMATE < =keyword >
estimates the individual parameters (conditioned on all other parameters) for the effects specified in the
EXACT statement. For each parameter, a point estimate, a standard error, a confidence interval, and a
p-value for a two-sided test that the parameter is zero are displayed. Note that the two-sided p-value is
twice the one-sided p-value. You can optionally specify one of the following keywords:
PARM
specifies that the parameters be estimated. This is the default.
ODDS
specifies that the odds ratios be estimated. If you have classification variables, then you
must also specify the PARAM=REF option in the CLASS statement.
BOTH
specifies that both the parameters and odds ratios be estimated.
JOINT
performs the joint test that all of the parameters are simultaneously equal to zero, performs individual
hypothesis tests for the parameter of each continuous variable, and performs joint tests for the parameters of each classification variable. The joint test is indicated in the “Conditional Exact Tests” table by
the label “Joint.”
JOINTONLY
performs only the joint test of the parameters. The test is indicated in the “Conditional Exact Tests”
table by the label “Joint.” When this option is specified, individual tests for the parameters of each
continuous variable and joint tests for the parameters of the classification variables are not performed.
MIDPFACTOR=$\delta_1$ | ($\delta_1$, $\delta_2$)
sets the tie factors used to produce the mid-p hypothesis statistics and the mid-p confidence intervals.
$\delta_1$ modifies both the hypothesis tests and confidence intervals, while $\delta_2$ affects only the hypothesis tests.
By default, $\delta_1 = 0.5$ and $\delta_2 = 1.0$. See the section “Exact Logistic and Exact Poisson Regression” on
page 2999 for details.
ONESIDED
requests one-sided confidence intervals and p-values for the individual parameter estimates and odds
ratios. The one-sided p-value is the smaller of the left- and right-tail probabilities for the observed
sufficient statistic of the parameter under the null hypothesis that the parameter is zero. The two-sided
p-values (default) are twice the one-sided p-values. See the section “Exact Logistic and Exact Poisson
Regression” on page 2999 for more details.
OUTDIST=SAS-data-set
names the SAS data set that contains the exact conditional distributions. This data set contains all of
the exact conditional distributions that are required to process the corresponding EXACT statement.
This data set contains the possible sufficient statistics for the parameters of the effects specified
in the EXACT statement, the counts, and, when hypothesis tests are performed on the parameters,
the probability of occurrence and the score value for each sufficient statistic. When you request an
OUTDIST= data set, the observed sufficient statistics are displayed in the “Sufficient Statistics” table.
See the section “OUTDIST= Output Data Set” on page 3000 for more information.
EXACT Statement Examples
In the following example, two exact tests are computed: one for x1 and the other for x2. The test for x1 is
based on the exact conditional distribution of the sufficient statistic for the x1 parameter given the observed
values of the sufficient statistics for the intercept, x2, and x3 parameters; likewise, the test for x2 is conditional
on the observed sufficient statistics for the intercept, x1, and x3.
proc genmod;
model y= x1 x2 x3/d=b;
exact x1 x2;
run;
PROC GENMOD determines, from all the specified EXACT statements, the distinct conditional distributions
that need to be evaluated. For example, there is only one exact conditional distribution for the following two
EXACT statements:
exact 'One' x1 / estimate=parm;
exact 'Two' x1 / estimate=parm onesided;
For each EXACT statement, individual tests for the parameters of the specified effects are computed unless
the JOINTONLY option is specified. Consider the following EXACT statements:
exact 'E12' x1 x2 / estimate;
exact 'E1'  x1    / estimate;
exact 'E2'  x2    / estimate;
exact 'J12' x1 x2 / joint;
In the E12 statement, the parameters for x1 and x2 are estimated and tested separately. Specifying the E12
statement is equivalent to specifying both the E1 and E2 statements. In the J12 statement, the joint test for
the parameters of x1 and x2 is computed in addition to the individual tests for x1 and x2.
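For a self-contained illustration, the following statements sketch an exact logistic analysis on a small hypothetical data set. The data set small, the variables x1, x2, and y, and the label are assumptions made for this example.

```sas
/* Hypothetical data: two binary covariates and a binary response */
data small;
   input x1 x2 y @@;
   datalines;
0 0 0  0 1 0  1 0 0  1 1 1  0 0 0  1 1 1  1 0 1  0 1 0  1 1 1  0 0 1
;

proc genmod data=small;
   model y = x1 x2 / dist=bin link=logit;
   /* Exact test and estimates for x1, conditioning on the intercept and x2; */
   /* ESTIMATE=BOTH requests parameters and odds ratios, with mid-p limits   */
   exact 'x1 given x2' x1 / estimate=both cltype=midp;
run;
```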
EXACTOPTIONS Statement
EXACTOPTIONS options ;
The EXACTOPTIONS statement specifies options that apply to every EXACT statement in the program.
The following options are available:
ABSFCONV=value
specifies the absolute function convergence criterion. Convergence requires a small change in the
log-likelihood function in subsequent iterations,
$$|l_i - l_{i-1}| < \text{value}$$
where $l_i$ is the value of the log-likelihood function at iteration i.
By default, ABSFCONV=1E–12. You can also specify the FCONV= and XCONV= criteria; optimizations are terminated as soon as one criterion is satisfied.
ADDTOBS
adds the observed sufficient statistic to the sampled exact distribution if the statistic was not sampled.
This option has no effect unless the METHOD=NETWORKMC option is specified and the ESTIMATE
option is specified in the EXACT statement. If the observed statistic has not been sampled, then the
parameter estimate does not exist; by specifying this option, you can produce (biased) estimates.
BUILDSUBSETS
builds every distribution for sampling. By default, some exact distributions are created by taking a
subset of a previously generated exact distribution. When the METHOD=NETWORKMC option is
invoked, this subsetting behavior has the effect of using fewer than the desired n samples; see the N=
option for more details. Use the BUILDSUBSETS option to suppress this subsetting.
EPSILON=value
controls how the partial sums $\sum_{i=1}^{j} y_i x_i$ are compared. value must be between 0 and 1; by default,
value=1E–8.
FCONV=value
specifies the relative function convergence criterion. Convergence requires a small relative change in
the log-likelihood function in subsequent iterations,
$$\frac{|l_i - l_{i-1}|}{|l_{i-1}| + \text{1E–6}} < \text{value}$$
where $l_i$ is the value of the log likelihood at iteration i.
By default, FCONV=1E–8. You can also specify the ABSFCONV= and XCONV= criteria; if more
than one criterion is specified, then optimizations are terminated as soon as one criterion is satisfied.
MAXTIME=seconds
specifies the maximum clock time (in seconds) that PROC GENMOD can use to calculate the exact
distributions. If the limit is exceeded, the procedure halts all computations and prints a note to the
LOG. The default maximum clock time is seven days.
METHOD=keyword
specifies which exact conditional algorithm to use for every EXACT statement specified. You can
specify one of the following keywords:
DIRECT
invokes the multivariate shift algorithm of Hirji, Mehta, and Patel (1987). This method
directly builds the exact distribution, but it can require an excessive amount of memory in its
intermediate stages. METHOD=DIRECT is invoked by default when you are conditioning
out at most the intercept.
NETWORK
invokes an algorithm described in Mehta, Patel, and Senchaudhuri (1992). This method
builds a network for each parameter that you are conditioning out, combines the networks,
then uses the multivariate shift algorithm to create the exact distribution. The NETWORK
method can be faster and require less memory than the DIRECT method. The NETWORK
method is invoked by default for most analyses.
NETWORKMC
invokes the hybrid network and Monte Carlo algorithm of Mehta, Patel, and Senchaudhuri (1992).
This method creates a network, then samples from that network; this
method does not reject any of the samples at the cost of using a large amount of memory
to create the network. METHOD=NETWORKMC is most useful for producing parameter
estimates for problems that are too large for the DIRECT and NETWORK methods to
handle and for which asymptotic methods are invalid (for example, for sparse data on a
large grid).
N=n
specifies the number of Monte Carlo samples to take when the METHOD=NETWORKMC option is
specified. By default, n = 10,000. If the procedure cannot obtain n samples due to a lack of memory,
then a note is printed in the SAS log (the number of valid samples is also reported in the listing) and
the analysis continues.
The number of samples used to produce any particular statistic might be smaller than n. For example,
let $X_1$ and $X_2$ be continuous variables, denote their joint distribution by $f(X_1, X_2)$, and let
$f(X_1 \mid X_2 = x_2)$ denote the marginal distribution of $X_1$ conditioned on the observed value of $X_2$. If you request
the JOINT test of $X_1$ and $X_2$, then n samples are used to generate the estimate $\hat{f}(X_1, X_2)$ of $f(X_1, X_2)$,
from which the test is computed. However, the parameter estimate for $X_1$ is computed from the subset
of $\hat{f}(X_1, X_2)$ that has $X_2 = x_2$, and this subset need not contain n samples. Similarly, the distribution
for each level of a classification variable is created by extracting the appropriate subset from the joint
distribution for the CLASS variable.
In some cases, the marginal sample size can be too small to admit accurate estimation of a particular
statistic; a note is printed in the SAS log when a marginal sample size is less than 100. Increasing n
increases the number of samples used in a marginal distribution; however, if you want to control the
sample size exactly, you can either specify the BUILDSUBSETS option or do both of the following:
• Remove the JOINT option from the EXACT statement.
• Create dummy variables in a DATA step to represent the levels of a CLASS variable, and specify
them as independent variables in the MODEL statement.
NOLOGSCALE
specifies that computations for the exact conditional models be computed by using normal scaling.
Log scaling can handle numerically larger problems than normal scaling; however, computations in the
log scale are slower than computations in normal scale.
ONDISK
uses disk space instead of random access memory to build the exact conditional distribution. Use this
option to handle larger problems at the cost of slower processing.
SEED=seed
specifies the initial seed for the random number generator used to take the Monte Carlo samples when
the METHOD=NETWORKMC option is specified. The value of the SEED= option must be an integer.
If you do not specify a seed, or if you specify a value less than or equal to zero, then PROC GENMOD
uses the time of day from the computer’s clock to generate an initial seed.
STATUSN=number
prints a status line in the SAS log after every number of Monte Carlo samples when the
METHOD=NETWORKMC option is specified. The number of samples taken and the current exact
p-value for testing the significance of the model are displayed. You can use this status line to track the
progress of the computation of the exact conditional distributions.
STATUSTIME=seconds
specifies the time interval (in seconds) for printing a status line in the SAS log. You can use this status line
to track the progress of the computation of the exact conditional distributions. The time interval you
specify is approximate; the actual time interval varies. By default, no status reports are produced.
XCONV=value
specifies the relative parameter convergence criterion. Convergence requires a small relative parameter
change in subsequent iterations,
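$$\max_j \left|\delta_j^{(i)}\right| < \text{value}$$
where
$$\delta_j^{(i)} = \begin{cases} \beta_j^{(i)} - \beta_j^{(i-1)} & \text{if } \left|\beta_j^{(i-1)}\right| < 0.01 \\[4pt] \dfrac{\beta_j^{(i)} - \beta_j^{(i-1)}}{\beta_j^{(i-1)}} & \text{otherwise} \end{cases}$$
and $\beta_j^{(i)}$ is the estimate of the jth parameter at iteration i.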
By default, XCONV=1E–4. You can also specify the ABSFCONV= and FCONV= criteria; if more
than one criterion is specified, then optimizations are terminated as soon as one criterion is satisfied.
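The relative parameter convergence criterion can be checked by hand. The following DATA step is a minimal sketch and is not part of PROC GENMOD: the arrays prev and curr hold made-up estimates from two successive iterations, and the names prev, curr, maxdelta, and converged are hypothetical.

```sas
/* Sketch: evaluate the XCONV= criterion for two successive iterations.   */
/* The estimates below are invented values, not PROC GENMOD output.       */
data _null_;
   array prev[3] _temporary_ (0.005   1.2000 2.5000);   /* iteration i-1 */
   array curr[3] _temporary_ (0.00505 1.2001 2.5002);   /* iteration i   */
   value = 1e-4;                        /* the default XCONV= value       */
   maxdelta = 0;
   do j = 1 to dim(prev);
      /* absolute change when the previous estimate is near zero, */
      /* relative change otherwise                                 */
      if abs(prev[j]) < 0.01 then delta = curr[j] - prev[j];
      else delta = (curr[j] - prev[j]) / prev[j];
      maxdelta = max(maxdelta, abs(delta));
   end;
   converged = (maxdelta < value);      /* 1 when the criterion is met    */
   put maxdelta= converged=;
run;
```

The first parameter is compared on the absolute scale because its previous value is below 0.01 in magnitude; the other two are compared on the relative scale.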
FREQ Statement
FREQ variable ;
FREQUENCY variable ;
The variable in the FREQ statement identifies a variable in the input data set containing the frequency of
occurrence of each observation. PROC GENMOD treats each observation as if it appears n times, where n is
the value of the FREQ variable for the observation. If it is not an integer, the frequency value is truncated to
an integer. If it is less than 1 or missing, the observation is not used. In the case of models fit with generalized
estimating equations (GEEs), the frequencies apply to the subject/cluster and therefore must be the same for
all observations within each subject.
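As an illustration, the following sketch fits a logistic regression to grouped binary data in which each row stands for several identical observations. The data set grouped and the variables dose, y, and count are hypothetical.

```sas
/* Hypothetical grouped data: each row represents COUNT identical observations. */
data grouped;
   input dose y count;
   datalines;
1 0 12
1 1 3
2 0 8
2 1 7
3 0 4
3 1 11
;

proc genmod data=grouped;
   freq count;       /* each row is treated as if it appears COUNT times */
   model y = dose / dist=binomial link=logit;
run;
```

A row whose count is missing or less than 1 would be excluded from the fit, and a noninteger count would be truncated, as described above.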
FWDLINK Statement
FWDLINK variable=expression ;
You can define a link function other than a built-in link function by using the FWDLINK statement. If you
use the MODEL statement option LINK= to specify a link function, you do not need to use the FWDLINK
statement. The variable identifies the link function to the procedure. The expression can be any arithmetic
expression supported by the DATA step language, and it is used to define the functional dependence on the
mean.
Alternatively, the link function can be defined by using programming statements (see the section “Programming Statements” on page 2947) and assigned to a variable, which is then listed as the expression. The
second form is convenient for using complex statements such as IF-THEN/ELSE clauses. The GENMOD
procedure automatically computes derivatives of the link function required for iterative fitting. You must
specify the inverse of the link function in the INVLINK statement when you specify the FWDLINK statement
to define the link function. You use the automatic variable _MEAN_ to represent the mean in the preceding
expression.
INVLINK Statement
INVLINK variable=expression ;
If you define a link function in the FWDLINK statement, then you must define the inverse link function by
using the INVLINK statement. If you use the MODEL statement option LINK= to specify a link function,
you do not need to use the INVLINK statement. The variable identifies the inverse link function to the
procedure. The expression can be any arithmetic expression supported by the DATA step language, and it is
used to define the functional dependence on the linear predictor.
Alternatively, the inverse link function can be defined using programming statements (see the section
“Programming Statements” on page 2947) and assigned to a variable, which is then listed as the expression.
The second form is convenient for using complex statements such as IF-THEN/ELSE clauses. The automatic
variable _XBETA_ represents the linear predictor in the preceding expression.
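The following sketch shows the two statements used together. It re-creates the log link as a user-defined link, so the fit should be equivalent to one that specifies LINK=LOG. The data set counts and the variable names link and ilink are hypothetical.

```sas
/* Hypothetical count data. */
data counts;
   input x y;
   datalines;
1 2
2 3
3 6
4 7
5 12
6 15
;

proc genmod data=counts;
   model y = x / dist=poisson;       /* no LINK= option, so the user-defined link is used */
   fwdlink link  = log(_MEAN_);      /* g(mu) = log(mu)               */
   invlink ilink = exp(_XBETA_);     /* inverse link: mu = exp(xbeta) */
run;
```

The expression in the INVLINK statement must be the mathematical inverse of the expression in the FWDLINK statement; the procedure does not derive one from the other.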
LSMEANS Statement
LSMEANS < model-effects > < / options > ;
The LSMEANS statement computes and compares least squares means (LS-means) of fixed effects. LS-means
are predicted population margins—that is, they estimate the marginal means over a balanced population. In a
sense, LS-means are to unbalanced designs as class and subclass arithmetic means are to balanced designs.
Table 42.5 summarizes the options available in the LSMEANS statement. If you specify the BAYES
statement, the ADJUST=, STEPDOWN, and LINES options are ignored. The PLOTS= option is not available
for a maximum likelihood analysis; it is available only for a Bayesian analysis.
If you specify a zero-inflated model (that is, a model for either the zero-inflated Poisson or the zero-inflated
negative binomial distribution), then the least squares means are computed only for effects in the model for
the distribution mean, and not for effects in the zero-inflation probability part of the model.
Table 42.5 LSMEANS Statement Options

Option       Description

Construction and Computation of LS-Means
AT           Modifies the covariate value in computing LS-means
BYLEVEL      Computes separate margins
DIFF         Requests differences of LS-means
OM=          Specifies the weighting scheme for LS-means computation as determined by the input data set
SINGULAR=    Tunes estimability checking

Degrees of Freedom and p-values
ADJUST=      Determines the method for multiple-comparison adjustment of LS-means differences
ALPHA=α      Determines the confidence level (1 − α)
STEPDOWN     Adjusts multiple-comparison p-values further in a step-down fashion

Statistical Output
CL           Constructs confidence limits for means and mean differences
CORR         Displays the correlation matrix of LS-means
COV          Displays the covariance matrix of LS-means
E            Prints the L matrix
LINES        Produces a “Lines” display for pairwise LS-means differences
MEANS        Prints the LS-means
PLOTS=       Requests graphs of means and mean comparisons
SEED=        Specifies the seed for computations that depend on random numbers

Generalized Linear Modeling
EXP          Exponentiates and displays estimates of LS-means or LS-means differences
ILINK        Computes and displays estimates and standard errors of LS-means (but not differences) on the inverse linked scale
ODDSRATIO    Reports (simple) differences of least squares means in terms of odds ratios if permitted by the link function
For details about the syntax of the LSMEANS statement, see the section “LSMEANS Statement” on page 460
in Chapter 19, “Shared Concepts and Topics.”
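A minimal sketch of the LSMEANS statement in a Poisson regression follows. The data set trial and the variables trt, x, and y are hypothetical; the options used (DIFF, CL, and ILINK) are taken from Table 42.5.

```sas
/* Hypothetical data: a three-level treatment and one covariate. */
data trial;
   input trt $ x y;
   datalines;
A 1 3
A 2 5
A 3 4
B 1 6
B 2 9
B 3 8
C 1 2
C 2 3
C 3 5
;

proc genmod data=trial;
   class trt;
   model y = trt x / dist=poisson link=log;
   /* LS-means of TRT with pairwise differences, confidence limits, */
   /* and estimates transformed back to the mean (count) scale      */
   lsmeans trt / diff cl ilink;
run;
```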
LSMESTIMATE Statement
LSMESTIMATE model-effect < 'label' > values < divisor=n >
< , . . . < 'label' > values < divisor=n > >
< / options > ;
The LSMESTIMATE statement provides a mechanism for obtaining custom hypothesis tests among least
squares means.
Table 42.6 summarizes the options available in the LSMESTIMATE statement.
Table 42.6 LSMESTIMATE Statement Options

Option       Description

Construction and Computation of LS-Means
AT           Modifies covariate values in computing LS-means
BYLEVEL      Computes separate margins
DIVISOR=     Specifies a list of values to divide the coefficients
OM=          Specifies the weighting scheme for LS-means computation as determined by a data set
SINGULAR=    Tunes estimability checking

Degrees of Freedom and p-values
ADJUST=      Determines the method for multiple-comparison adjustment of LS-means differences
ALPHA=α      Determines the confidence level (1 − α)
LOWER        Performs one-sided, lower-tailed inference
STEPDOWN     Adjusts multiple-comparison p-values further in a step-down fashion
TESTVALUE=   Specifies values under the null hypothesis for tests
UPPER        Performs one-sided, upper-tailed inference

Statistical Output
CL           Constructs confidence limits for means and mean differences
CORR         Displays the correlation matrix of LS-means
COV          Displays the covariance matrix of LS-means
E            Prints the L matrix
ELSM         Prints the K matrix
JOINT        Produces a joint F or chi-square test for the LS-means and LS-means differences
PLOTS=       Requests graphs of means and mean comparisons
SEED=        Specifies the seed for computations that depend on random numbers

Generalized Linear Modeling
CATEGORY=    Specifies how to construct estimable functions with multinomial data
EXP          Exponentiates and displays LS-means estimates
ILINK        Computes and displays estimates and standard errors of LS-means (but not differences) on the inverse linked scale
For details about the syntax of the LSMESTIMATE statement, see the section “LSMESTIMATE Statement”
on page 476 in Chapter 19, “Shared Concepts and Topics.”
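The following sketch requests two custom comparisons among the LS-means of a three-level effect. The data set trial, its variables, and the labels are hypothetical; the second row uses divisor=2 so that its coefficients become 1, −0.5, and −0.5.

```sas
/* Hypothetical data: a three-level treatment and one covariate. */
data trial;
   input trt $ x y;
   datalines;
A 1 3
A 2 5
A 3 4
B 1 6
B 2 9
B 3 8
C 1 2
C 2 3
C 3 5
;

proc genmod data=trial;
   class trt;
   model y = trt x / dist=poisson link=log;
   /* Row 1: A minus B.  Row 2: A minus the average of B and C. */
   lsmestimate trt 'A vs B'        1 -1  0,
                   'A vs avg(B,C)' 2 -1 -1 divisor=2 / cl;
run;
```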
MODEL Statement
MODEL response = < effects > < / options > ;
MODEL events/trials = < effects > < / options > ;
The MODEL statement specifies the response, or dependent variable, and the effects, or explanatory variables.
If you omit the explanatory variables, the procedure fits an intercept-only model. An intercept term is
included in the model by default. The intercept can be removed with the NOINT option.
You can specify the response in the form of a single variable or in the form of a ratio of two variables denoted
events/trials. The first form is applicable to all responses. The second form is applicable only to summarized
binomial response data. When each observation in the input data set contains the number of events (for
example, successes) and the number of trials from a set of binomial trials, use the events/trials syntax.
In the events/trials model syntax, you specify two variables that contain the event and trial counts. These two
variables are separated by a slash (/). The values of both events and (trials–events) must be nonnegative, and
the value of the trials variable must be greater than 0 for an observation to be valid. The variable events or
trials can take noninteger values.
When each observation in the input data set contains a single trial from a binomial or multinomial experiment,
use the first form of the preceding MODEL statements. The response variable can be numeric or character.
The ordering of response levels is critical in these models. You can use the RORDER= option in the PROC
GENMOD statement to specify the response level ordering.
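The two response forms can be sketched as follows. Both data sets and all variable names are hypothetical. The first step uses the events/trials syntax for summarized binomial data; the second uses a single character response with one trial per observation and RORDER=DATA in the PROC GENMOD statement to order the response levels as they appear in the data.

```sas
/* Summarized binomial data: one row per dose group. */
data summarized;
   input dose events trials;
   datalines;
1 2 20
2 5 20
3 9 20
4 14 20
;

proc genmod data=summarized;
   model events/trials = dose / dist=binomial link=logit;
run;

/* Single-trial data: one row per subject, character response. */
data single;
   input dose response $;
   datalines;
1 no
1 no
1 yes
2 no
2 yes
2 no
3 yes
3 no
3 yes
4 yes
4 yes
4 no
;

proc genmod data=single rorder=data;
   model response = dose / dist=binomial link=logit;
run;
```

Because the ordering of response levels determines which level is modeled, it is worth confirming in the displayed response profile that the intended level is the one being modeled.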
Responses for the Poisson distribution must be all nonnegative, but they can be noninteger values.
The effects in the MODEL statement consist of an explanatory variable or combination of variables. Explanatory variables can be continuous or classification variables. Classification variables can be character or
numeric. Explanatory variables representing nominal, or classification, data must be declared in a CLASS
statement. Interactions between variables can also be included as effects. Columns of the design matrix are
automatically generated for classification variables and interactions. The syntax for specification of effects
is the same as for the GLM procedure. See the section “Specification of Effects” on page 2967 for more
information. Also see Chapter 44, “The GLM Procedure.”
Table 42.7 summarizes the options available in the MODEL statement.
Table 42.7 MODEL Statement Options

Option        Description

AGGREGATE=    Specifies the subpopulations
ALPHA=        Sets the confidence coefficient
CICONV=       Sets the convergence criterion for profile likelihood confidence intervals
CL            Displays confidence limits for predicted values
CODING=       Uses effect coding for all classification variables
CONVERGE=     Sets the convergence criterion
CONVH=        Sets the relative Hessian convergence criterion
CORRB         Displays the parameter estimate correlation matrix
COVB          Displays the parameter estimate covariance matrix
DIAGNOSTICS   Displays case deletion diagnostic statistics
DIST=         Specifies the built-in probability distribution
EXACTMAX      Names a variable used for performing an exact Poisson regression
EXPECTED      Computes covariances and associated statistics using the expected Fisher information matrix
ID=           Displays the values of variable in the input data set in the OBSTATS table
INITIAL=      Sets initial values for parameter estimates
INTERCEPT=    Initializes the intercept term
ITPRINT       Displays the iteration history for all iterative processes
LINK=         Specifies the link function
LOGNB         Computes the maximum likelihood estimate and confidence limits of k based on log(k)
LRCI          Computes two-sided confidence intervals based on the profile (partially maximized) likelihood function
MAXITER=      Sets the maximum allowable number of iterations for all iterative computation processes
NOINT         Requests that no intercept term be included in the model
NOLOGNB       Computes the maximum likelihood estimate and confidence limits of k based on k
NOSCALE       Holds the scale parameter fixed
OBSTATS       Displays an additional table of statistics
OFFSET=       Specifies a variable in the input data set to be used as an offset
PREDICTED     Displays predicted values and associated statistics
RESIDUALS     Displays residuals and standardized residuals
SCALE=        Sets the value used for the scale parameter
SCORING=      Computes the Hessian matrix using the Fisher scoring method
SINGULAR=     Sets the tolerance for testing singularity
TYPE1         Performs a Type 1 analysis
TYPE3         Computes statistics for Type 3 contrasts
WALD          Requests Wald statistics for Type 3 contrasts
WALDCI        Computes two-sided Wald confidence intervals
XVARS         Includes the regression variables in the OBSTATS table
You can specify the following options in the MODEL statement after a slash (/).
AGGREGATE= (variable-list) | variable
AGGREGATE
specifies the subpopulations on which the Pearson chi-square and the deviance are calculated. This
option applies only to the multinomial distribution or the binomial distribution with binary (single
trial syntax) response. It is ignored if specified for other cases. Observations with common values
in the given list of variables are regarded as coming from the same subpopulation. This affects
the computation of the deviance and Pearson chi-square statistics. Variables in the list can be any
variables in the input data set. Specifying the AGGREGATE option is equivalent to specifying the
AGGREGATE= option with a variable list that includes all explanatory variables in the MODEL
statement. Pearson chi-square and deviance statistics are not computed for multinomial models unless
this option is specified.
ALPHA=number
ALPH=number
A=number
sets the confidence coefficient for parameter confidence intervals to 1–number . The value of number
must be between 0 and 1. The default value of number is 0.05.
CICONV=number
sets the convergence criterion for profile likelihood confidence intervals. See the section “Confidence
Intervals for Parameters” on page 2970 for the definition of convergence. The value of number must
be between 0 and 1. By default, CICONV=1E–4.
CL
requests that confidence limits for predicted values be displayed (see the OBSTATS option).
CODING=EFFECT | FULLRANK
specifies that effect coding be used for all classification variables in the model. This is the same as
specifying PARAM=EFFECT as a CLASS statement option.
CONVERGE=number
sets the convergence criterion. The value of number must be between 0 and 1. The iterations are
considered to have converged when the maximum change in the parameter estimates between iteration
steps is less than the value specified. The change is a relative change if the parameter is greater than
0.01 in absolute value; otherwise, it is an absolute change. By default, CONVERGE=1E–4. This
convergence criterion is used in parameter estimation for a single model fit, Type 1 statistics, and
likelihood ratio statistics for Type 3 analyses and CONTRAST statements.
CONVH=number
sets the relative Hessian convergence criterion. The value of number must be between 0 and 1. After
convergence is determined with the change in parameter criterion specified with the CONVERGE=
option, the quantity $tc = \dfrac{g' H^{-1} g}{|f|}$ is computed and compared to number, where g is the gradient
vector, H is the Hessian matrix for the model parameters, and f is the log-likelihood function. If tc
is greater than number , a warning that the relative Hessian convergence criterion has been exceeded
is printed. This criterion detects the occasional case where the change in parameter convergence
criterion is satisfied, but a maximum in the log-likelihood function has not been attained. By default,
CONVH=1E–4.
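As a sketch, the following step tightens both convergence criteria and prints the iteration history so that the effect can be inspected. The data set counts is hypothetical, and the value 1E-6 is an arbitrary choice for illustration, not a recommendation.

```sas
/* Hypothetical count data. */
data counts;
   input x y;
   datalines;
1 2
2 3
3 6
4 7
5 12
6 15
;

proc genmod data=counts;
   /* Tighter parameter-change and relative Hessian criteria than the 1E-4 defaults */
   model y = x / dist=poisson converge=1e-6 convh=1e-6 itprint;
run;
```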
CORRB
requests that the parameter estimate correlation matrix be displayed.
COVB
requests that the parameter estimate covariance matrix be displayed.
DIAGNOSTICS
INFLUENCE
requests that case deletion diagnostic statistics be displayed (see the OBSTATS option).
DIST=keyword
D=keyword
ERROR=keyword
ERR=keyword
specifies the built-in probability distribution to use in the model. If you specify the DIST= option and
you omit a user-defined link function, a default link function is chosen as displayed in the following
table. If you specify no distribution and no link function, then the GENMOD procedure defaults to the
normal distribution with the identity link function. Models for data with correlated responses fit by the
GEE method are not available for the zero-inflated distributions.
DIST=                          Distribution                      Default Link Function
BINOMIAL | BIN | B             Binomial                          Logit
GAMMA | GAM | G                Gamma                             Inverse ( power(–1) )
GEOMETRIC | GEOM               Geometric                         Log
IGAUSSIAN | IG                 Inverse Gaussian                  Inverse squared ( power(–2) )
MULTINOMIAL | MULT             Multinomial                       Cumulative logit
NEGBIN | NB                    Negative binomial                 Log
NORMAL | NOR | N               Normal                            Identity
POISSON | POI | P              Poisson                           Log
TWEEDIE< (Tweedie-options) >   Tweedie                           Log
ZIP                            Zero-inflated Poisson             Log/logit
ZINB                           Zero-inflated negative binomial   Log/logit
You can specify the following Tweedie-options when you specify DIST=TWEEDIE.
INITIALP=starting-value
specifies a starting value for iterative estimation of the Tweedie power parameter.
P=power-parameter
specifies a fixed Tweedie power parameter.
EPSILON=tolerance
specifies the tolerance for series approximation of the Tweedie density function.
OFFSET=constant-value
specifies a constant value to be added to the response variable for evaluating the extended
quasi-likelihood. By default, OFFSET=0.5.
NTHREADS=number
specifies the number of threads to be used in computation.
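The following sketch shows three ways of using the DIST= option. The data set claims and its variables are hypothetical, and the Tweedie power value 1.5 is an arbitrary illustration. The first step relies on the default link for the gamma distribution, the second overrides it with the LINK= option, and the third follows the TWEEDIE(Tweedie-options) syntax with a fixed power parameter.

```sas
/* Hypothetical positive, right-skewed response. */
data claims;
   input age cost;
   datalines;
25 120.5
31 98.0
38 143.2
44 160.7
52 201.3
60 188.9
67 240.1
;

proc genmod data=claims;
   model cost = age / dist=gamma;              /* default link: inverse, power(-1) */
run;

proc genmod data=claims;
   model cost = age / dist=gamma link=log;     /* LINK= overrides the default      */
run;

proc genmod data=claims;
   model cost = age / dist=tweedie(p=1.5);     /* fixed power parameter, log link  */
run;
```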
EXACTMAX< =variable >
names a variable to be used for performing an exact Poisson regression. For each observation, the
integer part of the EXACTMAX value should be nonnegative and at least as large as the response
value. If the EXACTMAX option is specified without a variable, then default values are computed.
See the section “Exact Logistic and Exact Poisson Regression” on page 2999 for information about
using this option.
EXPECTED
requests that the expected Fisher information matrix be used to compute parameter estimate covariances
and the associated statistics. The default action is to use the observed Fisher information matrix. This
option does not affect the model fitting, only the way in which the covariance matrix is computed (see
the SCORING= option).
ID=variable
causes the values of variable in the input data set to be displayed in the OBSTATS table. If an explicit
format for variable has been defined, the formatted values are displayed. If the OBSTATS option is not
specified, this option has no effect.
INITIAL=numbers
sets initial values for parameter estimates in the model. The default initial parameter values are
weighted least squares estimates based on using the response data as the initial mean estimate. This
option can be useful in case of convergence difficulty. The intercept parameter is initialized with the
INTERCEPT= option and is not included here. The values are assigned to the variables in the MODEL
statement in the same order in which they appear in the MODEL statement. The order of levels
for CLASS variables is determined by the ORDER= option. Note that some levels of classification
variables can be aliased; that is, they correspond to linearly dependent parameters that are not estimated
by the procedure. Initial values must be assigned to all levels of classification variables, regardless
of whether they are aliased or not. The procedure ignores initial values corresponding to parameters
not being estimated. If you specify a BY statement, all classification variables must take on the same
number of levels in each BY group. Otherwise, classification variables in some of the BY groups are
assigned incorrect initial values. Types of INITIAL= specifications are illustrated in the following
table.
Type of List                Specification
List separated by blanks    INITIAL = 3 4 5
List separated by commas    INITIAL = 3, 4, 5
x to y                      INITIAL = 3 to 5
x to y by z                 INITIAL = 3 to 5 by 1
Combination of list types   INITIAL = 1, 3 to 5, 9
INTERCEPT=number | number-list
initializes the intercept term to number for parameter estimation. If you specify both the INTERCEPT=
and the NOINT options, the intercept term is not estimated, but an intercept term of number is included
in the model. If you specify a multinomial model for ordinal data, you can specify a number-list for
the multiple intercepts in the model.
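A minimal sketch of supplying starting values follows. The data set counts2 and the starting values are hypothetical. The INITIAL= list gives one value per regression parameter in MODEL statement order (x1, then x2), and the intercept is initialized separately with the INTERCEPT= option.

```sas
/* Hypothetical count data with two continuous regressors. */
data counts2;
   input x1 x2 y;
   datalines;
1 0.5 2
2 1.5 4
3 0.7 5
4 2.1 9
5 1.2 10
6 2.8 17
7 1.9 18
;

proc genmod data=counts2;
   /* Starting values: x1 = 0.1, x2 = 0.2, intercept = 0.5 */
   model y = x1 x2 / dist=poisson initial=0.1 0.2 intercept=0.5;
run;
```

If the model contained CLASS variables, the list would need one value for every level, including aliased levels, as described above.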
ITPRINT
displays the iteration history for all iterative processes: parameter estimation, fitting constrained models
for contrasts and Type 3 analyses, and profile likelihood confidence intervals. The last evaluation of
the gradient and the negative of the Hessian (second derivative) matrix are also displayed for parameter
estimation. If you perform a Bayesian analysis by specifying the BAYES statement, the iteration
history for computing the mode of the posterior distribution is also displayed.
This option might result in a large amount of displayed output, especially if some of the optional
iterative processes are selected.
LINK=keyword
specifies the link function to use in the model. The keywords and their associated built-in link functions
are as follows.
LINK=                             Link Function
CUMCLL | CCLL                     Cumulative complementary log-log
CUMLOGIT | CLOGIT                 Cumulative logit
CUMPROBIT | CPROBIT               Cumulative probit
CLOGLOG | CLL                     Complementary log-log
IDENTITY | ID                     Identity
LOG                               Log
LOGIT                             Logit
PROBIT                            Probit
POWER(number ) | POW(number )     Power with λ = number
If no LINK= option is supplied and there is a user-defined link function, the user-defined link function
is used. If you specify neither the LINK= option nor a user-defined link function, then the default
canonical link function is used if you specify the DIST= option. Otherwise, if you omit the DIST=
option, the identity link function is used.
The cumulative link functions are appropriate only for the multinomial distribution.
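As an illustration of a cumulative link, the following sketch fits a cumulative logit model to an ordinal response. The data set ratings and its variables are hypothetical; the cell counts enter through a FREQ statement.

```sas
/* Hypothetical ordinal response (1 < 2 < 3) summarized as cell counts. */
data ratings;
   input dose rating count;
   datalines;
1 1 10
1 2 6
1 3 4
2 1 6
2 2 8
2 3 6
3 1 3
3 2 7
3 3 10
;

proc genmod data=ratings;
   freq count;
   /* A cumulative link is paired with the multinomial distribution */
   model rating = dose / dist=multinomial link=cumlogit;
run;
```

Replacing cumlogit with cumprobit or cumcll would select one of the other cumulative links listed in the table.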
LOGNB
specifies that the maximum likelihood estimate and confidence limits of the negative binomial dispersion parameter k be computed based on log(k). This is the default method used for the negative binomial
dispersion parameter, so specifying no option has the same effect as specifying the LOGNB option.
The GENMOD procedure computes the maximum likelihood estimate of log(k) and computes
confidence limits based on the asymptotic normality of log(k) rather than of k. The results are always
reported in terms of k rather than of log(k). This method ensures that the estimate and confidence
limits for k are positive. See Meeker and Escobar (1998, p. 163) for details about this method of
computing confidence limits.
LRCI
requests that two-sided confidence intervals for all model parameters be computed based on the profile
likelihood function. This is sometimes called the partially maximized likelihood function. See the
section “Confidence Intervals for Parameters” on page 2970 for more information about the profile
likelihood function. This computation is iterative and can consume a relatively large amount of CPU
time. The confidence coefficient can be selected with the ALPHA=number option. The resulting
confidence coefficient is 1–number . The default confidence coefficient is 0.95.
MAXITER=number
MAXIT=number
sets the maximum allowable number of iterations for all iterative computation processes in PROC
GENMOD. By default, MAXITER=50.
NOINT
requests that no intercept term be included in the model. An intercept is included unless this option is
specified.
NOLOGNB
specifies that the maximum likelihood estimate and confidence limits of the negative binomial dispersion parameter k be computed based on k rather than log(k). If this option is not specified, then the
GENMOD procedure computes the maximum likelihood estimate of log(k) and computes confidence
limits based on the asymptotic normality of log(k) rather than of k. The results are always reported
in terms of k rather than of log(k). The default log(k) method ensures that the estimate and confidence limits for
k are positive. See Meeker and Escobar (1998, p. 163) for details about this method of computing
confidence limits.
NOSCALE
holds the scale parameter fixed. Otherwise, for the normal, inverse Gaussian, and gamma distributions,
the scale parameter is estimated by maximum likelihood. If you omit the SCALE= option, the scale
parameter is fixed at the value 1.
OBSTATS
specifies that an additional table of statistics be displayed. Formulas for the statistics are given in
the section “Predicted Values of the Mean” on page 2973, the section “Residuals” on page 2973, and
the section “Case Deletion Diagnostic Statistics” on page 2991. Residuals and fit diagnostics are not
computed for multinomial models.
For each observation, the following items are displayed:
• the value of the response variable (variables if the data are binomial), frequency, and weight
variables
• the values of the regression variables
• predicted mean, $\hat{\mu} = g^{-1}(\eta)$, where $\eta = x_i'\hat{\beta}$ is the linear predictor and g is the link function. If
there is an offset, it is included in $x_i'\hat{\beta}$.
• estimate of the linear predictor $x_i'\hat{\beta}$. If there is an offset, it is included in $x_i'\hat{\beta}$.
• standard error of the linear predictor $x_i'\hat{\beta}$
• the value of the Hessian weight at the final iteration
• lower confidence limit of the predicted value of the mean. The confidence coefficient is specified
with the ALPHA= option. See the section “Confidence Intervals on Predicted Values” on
page 2973 for the computational method.
• upper confidence limit of the predicted value of the mean
• raw residual, defined as $Y - \mu$
• Pearson, or chi residual, defined as the square root of the contribution for the observation to the
Pearson chi-square; that is,
$$\frac{Y - \mu}{\sqrt{V(\mu)/w}}$$
where Y is the response, $\mu$ is the predicted mean, w is the value of the prior weight variable
specified in a WEIGHT statement, and $V(\mu)$ is the variance function evaluated at $\mu$.
• the standardized Pearson residual
• deviance residual, defined as the square root of the deviance contribution for the observation,
with sign equal to the sign of the raw residual
• the standardized deviance residual
• the likelihood residual
• a Cook distance type statistic for assessing the influence of individual observations on overall
model fit
• observation leverage
• DFBETA, defined as an approximation to $\hat{\beta} - \hat{\beta}_{[i]}$ for each parameter estimate $\hat{\beta}$, where $\hat{\beta}_{[i]}$ is
the parameter estimate with the ith observation deleted
• standardized DFBETA, defined as DFBETA, normalized by its standard deviation
• zero inflation probability for zero-inflated models
• the mean of a zero-inflated response
The following additional cluster deletion diagnostic statistics are created and displayed for each cluster
if a REPEATED statement is specified:
• a Cook distance type statistic for assessing the influence of entire clusters on overall model fit
• a studentized Cook distance for assessing influence of clusters
• cluster leverage
• cluster DFBETA for assessing the influence of entire clusters on individual parameter estimates
• cluster DFBETA normalized by its standard deviation
If you specify the multinomial distribution, only regression variable values, response values, predicted
values, confidence limits for the predicted values, and the linear predictor are displayed in the table.
Residuals and other diagnostic statistics are not available for the multinomial distribution.
The RESIDUALS, DIAGNOSTICS | INFLUENCE, PREDICTED, XVARS, and CL options cause
only subgroups of the observation statistics to be displayed. You can specify more than one of these
options to include different subgroups of statistics.
The ID=variable option causes the values of variable in the input data set to be displayed in the table.
If an explicit format for variable has been defined, the formatted values are displayed.
If a REPEATED statement is present, a table is displayed for the GEE model specified in the REPEATED statement. Regression variables, response values, predicted values, confidence limits for the
predicted values, the linear predictor, raw residuals, and Pearson residuals for each observation in the input
data set are available. Case deletion diagnostic statistics are available for each observation and for each
cluster.
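The following sketch requests the observation statistics table in two ways. The data set counts and the identifier variable subj are hypothetical. The first step displays the full table labeled by subj; the second displays only the predicted-value and residual subgroups.

```sas
/* Hypothetical count data with an identifier for each observation. */
data counts;
   input subj $ x y;
   datalines;
s1 1 2
s2 2 3
s3 3 6
s4 4 7
s5 5 12
s6 6 15
;

proc genmod data=counts;
   model y = x / dist=poisson obstats id=subj;         /* full table, labeled by SUBJ */
run;

proc genmod data=counts;
   model y = x / dist=poisson predicted residuals;     /* selected subgroups only     */
run;
```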
OFFSET=variable
specifies a variable in the input data set to be used as an offset variable. This variable cannot be a
CLASS variable, and it cannot be the response variable or one of the explanatory variables.
When you perform an exact Poisson regression with an OFFSET= variable but without the EXACTMAX=
option, then if $o_i$ is the offset for the ith observation, floor(exp($o_i$)) should be greater
than or equal to the response value. See the section “Exact Logistic and Exact Poisson Regression” on
page 2999 for information about the use of the offset in the exact Poisson model.
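A common use of the OFFSET= option is a Poisson rate model in which the log of an exposure measure enters the linear predictor with a fixed coefficient of 1. The following sketch assumes a data set insure with illustrative values, where n is the number of policyholders, c is the number of claims, and ln = log(n) is computed in the DATA step.

```sas
/* Illustrative data: claims C among N policyholders, by car size and age group. */
data insure;
   input n c car $ age;
   ln = log(n);                 /* log exposure, used as the offset */
   datalines;
500 42 small 1
1200 37 medium 1
100 1 large 1
400 101 small 2
500 73 medium 2
300 14 large 2
;

proc genmod data=insure;
   class car age;
   /* log(E[c]) = log(n) + x'beta, so exp(x'beta) is a claim rate per policyholder */
   model c = car age / dist=poisson link=log offset=ln;
run;
```

The offset variable ln appears only in the OFFSET= option and not among the effects, consistent with the restriction stated above.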
PREDICTED
PRED
P
requests that predicted values, the linear predictor, its standard error, and the Hessian weight be
displayed (see the OBSTATS option).
RESIDUALS
R
requests that residuals and standardized residuals be displayed. Residuals and other diagnostic statistics
are not available for the multinomial distribution (see the OBSTATS option).
SCALE=number
SCALE=PEARSON | P
PSCALE
SCALE=DEVIANCE | D
DSCALE
sets the value used for the scale parameter when the NOSCALE option is used. For the binomial and
Poisson distributions, which have no free scale parameter, this can be used to specify an overdispersed
model. In this case, the parameter covariance matrix and the likelihood function are adjusted by the
scale parameter. See the section “Dispersion Parameter” on page 2965 and the section “Overdispersion”
on page 2966 for more information. If the NOSCALE option is not specified, then number is used as
an initial estimate of the scale parameter.
Specifying SCALE=PEARSON or SCALE=P is the same as specifying the PSCALE option. This
fixes the scale parameter at the value 1 in the estimation procedure. After the parameter estimates
are determined, the exponential family dispersion parameter is assumed to be given by Pearson’s
chi-square statistic divided by the degrees of freedom, and all statistics such as standard errors and
likelihood ratio statistics are adjusted appropriately.
Specifying SCALE=DEVIANCE or SCALE=D is the same as specifying the DSCALE option. This
fixes the scale parameter at a value of 1 in the estimation procedure.
After the parameter estimates are determined, the exponential family dispersion parameter is assumed
to be given by the deviance divided by the degrees of freedom. All statistics such as standard errors
and likelihood ratio statistics are adjusted appropriately.
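As a sketch of an overdispersed Poisson fit, the following step estimates the regression parameters with the scale fixed at 1 and then adjusts the standard errors by using Pearson's chi-square divided by its degrees of freedom. The data set visits and its variables are hypothetical.

```sas
/* Hypothetical counts that vary more than a Poisson model would suggest. */
data visits;
   input x y;
   datalines;
1 0
1 7
2 1
2 12
3 2
3 19
4 4
4 30
;

proc genmod data=visits;
   model y = x / dist=poisson link=log scale=pearson;   /* same as the PSCALE option */
run;
```

Replacing scale=pearson with scale=deviance (or DSCALE) would base the adjustment on the deviance instead.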
SCORING=number
requests that on iterations up to number , the Hessian matrix be computed using the Fisher scoring
method. For further iterations, the full Hessian matrix is computed. The default value is 1. A value of
0 causes all iterations to use the full Hessian matrix, and a value greater than or equal to the value of
the MAXITER option causes all iterations to use Fisher scoring. The value of the SCORING= option
must be 0 or a positive integer.
SINGULAR=number
sets the tolerance for testing singularity of the information matrix and the crossproducts matrix.
Roughly, the test requires that a pivot be at least this number times the original diagonal value. By
default, number is $10^7$ times the machine epsilon. The default number is approximately $10^{-9}$ on
most machines. This value also controls the check on estimability for ESTIMATE and CONTRAST
statements.
TYPE1
requests that a Type 1, or sequential, analysis be performed. This consists of sequentially fitting models,
beginning with the null (intercept term only) model and continuing up to the model specified in the
MODEL statement. The likelihood ratio statistic between each successive pair of models is computed
and displayed in a table.
A Type 1 analysis is not available for GEE models, since there is no associated likelihood.
TYPE3
requests that statistics for Type 3 contrasts be computed for each effect specified in the MODEL
statement. The default analysis is to compute likelihood ratio statistics for the contrasts or score
statistics for GEEs. Wald statistics are computed if the WALD option is also specified.
WALD
requests Wald statistics for Type 3 contrasts. You must also specify the TYPE3 option in order to
compute Type 3 Wald statistics.
WALDCI
requests that two-sided Wald confidence intervals for all model parameters be computed based on
the asymptotic normality of the parameter estimators. This computation is not as time-consuming
as the LRCI method, since it does not involve an iterative procedure. However, it is thought to be
less accurate, especially for small sample sizes. The confidence coefficient can be selected with the
ALPHA= option in the same way as for the LRCI option.
XVARS
requests that the regression variables be included in the OBSTATS table.
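As a minimal sketch of how several of the MODEL statement options described above can be combined, the following statements fit a Poisson regression, estimate the dispersion from Pearson's chi-square, and request Type 1 and Type 3 analyses together with Wald confidence intervals. The data set Counts and its variables (trt, dose, y) are hypothetical and are included only to make the example self-contained.

```sas
/* Hypothetical count data: two treatments observed at four doses */
data counts;
   input trt $ dose y @@;
   datalines;
A 1 2  A 2 3  A 3 6  A 4 7
B 1 1  B 2 4  B 3 5  B 4 10
;

proc genmod data=counts;
   class trt;
   /* SCALE=PEARSON: the scale is held at 1 during estimation, then     */
   /* standard errors and LR statistics are adjusted by Pearson's       */
   /* chi-square divided by its degrees of freedom                      */
   model y = trt dose / dist=poisson link=log
                        scale=pearson
                        type1 type3   /* sequential and Type 3 analyses */
                        waldci;       /* Wald confidence intervals      */
run;
```

Adding the WALD option alongside TYPE3 would replace the default likelihood ratio statistics in the Type 3 table with Wald statistics.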
OUTPUT Statement
OUTPUT < OUT=SAS-data-set > < keyword=name . . . keyword=name > ;
The OUTPUT statement creates a new SAS data set that contains all the variables in the input data set and,
optionally, the estimated linear predictors (XBETA) and their standard error estimates, the weights for the
Hessian matrix, predicted values of the mean, confidence limits for predicted values, residuals, and case
deletion diagnostics. Residuals and diagnostic statistics are not computed for multinomial models.
You can also request these statistics with the OBSTATS, PREDICTED, RESIDUALS, DIAGNOSTICS | INFLUENCE, CL, or XVARS option in the MODEL statement. You can then create a SAS data set containing
them with ODS OUTPUT commands.
You might prefer to specify the OUTPUT statement for requesting these statistics since the following are true:
• The OUTPUT statement produces no tabular output.
• The OUTPUT statement creates a SAS data set more efficiently than ODS. This can be an advantage
for large data sets.
• You can specify the individual statistics to be included in the SAS data set.
If you use the multinomial distribution with one of the cumulative link functions for ordinal data, the data
set also contains variables named _ORDER_ and _LEVEL_ that indicate the levels of the ordinal response
variable and the values of the variable in the input data set corresponding to the sorted levels. These variables
indicate that the predicted value for a given observation is the probability that the response variable is as
large as the value of the _LEVEL_ variable. Residuals and other diagnostic statistics are not available for the
multinomial distribution.
The estimated linear predictor, its standard error estimate, and the predicted values and their confidence
intervals are computed for all observations in which the explanatory variables are all nonmissing, even if
the response is missing. By adding observations with missing response values to the input data set, you can
compute these statistics for new observations or for settings of the explanatory variables not present in the
data without affecting the model fit.
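The following sketch illustrates this scoring technique under stated assumptions: the data set Dosecounts and its variables are hypothetical, and the last observation has a missing response so that it contributes nothing to the fit but still receives a linear predictor, a predicted mean, and confidence limits in the output data set.

```sas
/* Hypothetical data; the observation with dose=6 has a missing response */
data dosecounts;
   input dose y @@;
   datalines;
1 2  2 3  3 6  4 7  5 12  6 .
;

proc genmod data=dosecounts;
   model y = dose / dist=poisson link=log;
   /* Each keyword=name pair adds one new variable to the OUT= data set */
   output out=scored
          xbeta=eta stdxbeta=se_eta   /* linear predictor and its standard error  */
          pred=mu                     /* predicted mean                           */
          lower=mu_lo upper=mu_hi;    /* confidence limits for the predicted mean */
run;

proc print data=scored;
run;
```

The variable names after the equal signs (eta, se_eta, mu, mu_lo, mu_hi) are arbitrary choices made for this illustration.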
The following list explains specifications in the OUTPUT statement.
OUT=SAS-data-set
specifies the output data set. If you omit the OUT= option, the output data set is created and given a
default name that uses the DATAn convention.
keyword=name
specifies the statistics to be included in the output data set and names the new variables that contain
the statistics. Specify a keyword for each desired statistic (see the following list of keywords), an
equal sign, and the name of the new variable or variables to contain the statistic. You can list only
one variable after the equal sign for all the statistics, except for the case deletion diagnostics for
individual parameter estimates, DFBETA, DFBETAS, DFBETAC, and DFBETACS. You can list
variables enclosed in parentheses to correspond to the variables in the model, or you can specify the
keyword _all_, without parentheses, to include deletion diagnostics for all of the parameters in the
model.
Although you can use the OUTPUT statement without any keyword=name specifications, the output
data set then contains only the original variables and, possibly, the variables Level and Value (if you
use the multinomial model with ordinal data). Note that the residuals and deletion diagnostics are not
available for the multinomial model with ordinal data. Some of the case deletion diagnostic statistics
apply only to models for correlated data specified with a REPEATED statement. If you request these
statistics for ordinary generalized linear models, the values of the corresponding variables are set to
missing in the output data set. Formulas for the statistics are given in the section “Predicted Values
of the Mean” on page 2973, the section “Residuals” on page 2973, and the section “Case Deletion
Diagnostic Statistics” on page 2991.
The keywords allowed and the statistics they represent are as follows:
DFBETA | DBETA
represents the effect of deleting an observation on parameter estimates. If
you specify the keyword _all_ after the equal sign, variables named DFBETA_ParameterName will be included in the output data set to contain the values
of the diagnostic statistic to measure the influence of deleting a single observation
on the individual parameter estimates. ParameterName is the name of the regression model parameter formed from the input variable names concatenated with the
appropriate levels, if classification variables are involved.
DFBETAS | DBETAS
represents the effect of deleting an observation on standardized parameter
estimates. If you specify the keyword _all_ after the equal sign, variables named
DFBETAS_ParameterName will be included in the output data set to contain
the values of the diagnostic statistic to measure the influence of deleting a single
observation on the individual parameter estimates. ParameterName is the name of
the regression model parameter formed from the input variable names concatenated
with the appropriate levels, if classification variables are involved.
DOBS | COOKD | COOKSD
represents the Cook distance type statistic to measure the influence of
deleting a single observation on the overall model fit.
HESSWGT
represents the diagonal element of the weight matrix used in computing the Hessian
matrix.
H | LEVERAGE
represents the leverage of a single observation.
LOWER | L
represents the lower confidence limit for the predicted value of the mean, or the
lower confidence limit for the probability that the response is less than or equal
to the value of Level or Value. The confidence coefficient is determined by the
ALPHA=number option in the MODEL statement as $(1 - \text{number}) \times 100\%$. The
default confidence coefficient is 95%.
PREDICTED | PRED | PROB | P
represents the predicted value of the mean of the response or the
predicted probability that the response variable is less than or equal to the value
of _LEVEL_ if the multinomial model for ordinal data is used (in other words,
$\Pr(Y \le \text{\_LEVEL\_})$, where Y is the response variable).
PZERO
represents the zero-inflation probability for zero-inflated models.
RESCHI
represents the Pearson (chi) residual for identifying observations that are poorly
accounted for by the model.
RESDEV
represents the deviance residual for identifying poorly fitted observations.
RESLIK
represents the likelihood residual for identifying poorly fitted observations.
RESRAW
represents the raw residual for identifying poorly fitted observations.
STDRESCHI
represents the standardized Pearson (chi) residual for identifying observations that
are poorly accounted for by the model.
STDRESDEV
represents the standardized deviance residual for identifying poorly fitted observations.
STDXBETA
represents the standard error estimate of XBETA (see the XBETA keyword).
UPPER | U
represents the upper confidence limit for the predicted value of the mean, or the
upper confidence limit for the probability that the response is less than or equal
to the value of Level or Value. The confidence coefficient is determined by the
ALPHA=number option in the MODEL statement as $(1 - \text{number}) \times 100\%$. The
default confidence coefficient is 95%.
XBETA
represents the estimate of the linear predictor $x_i'\beta$ for observation i, or $\alpha_j + x_i'\beta$,
where j is the corresponding ordered value of the response variable for the
multinomial model with ordinal data. If there is an offset, it is included in $x_i'\beta$.
The keywords in the following list apply only to models specified with a REPEATED statement, fit by
generalized estimating equations (GEEs).
CH | CLUSTERH | CLEVERAGE
represents the leverage of a cluster.
CLUSTER
represents the numerical cluster index, in order of sorted clusters.
DCLS | CLUSTERCOOKD | CLUSTERCOOKSD
represents the Cook distance type statistic to measure the influence of deleting an entire cluster on the overall model fit.
DFBETAC | DBETAC
represents the effect of deleting an entire cluster on parameter estimates.
If you specify the keyword _all_ after the equal sign, variables named DFBETAC_ParameterName will be included in the output data set to contain the values
of the diagnostic statistic to measure the influence of deleting the cluster on the individual parameter estimates. ParameterName is the name of the regression model
parameter formed from the input variable names concatenated with the appropriate
levels, if classification variables are involved.
DFBETACS | DBETACS
represents the effect of deleting an entire cluster on normalized parameter
estimates. If you specify the keyword _all_ after the equal sign, variables named
DFBETACS_ParameterName will be included in the output data set to contain the
values of the diagnostic statistic to measure the influence of deleting the cluster on
the individual parameter estimates, normalized by their standard errors. ParameterName is the name of the regression model parameter formed from the input
variable names concatenated with the appropriate levels, if classification variables
are involved.
MCLS | CLUSTERDFIT
represents the studentized Cook distance type statistic to measure the influence of deleting an entire cluster on the overall model fit.
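As an illustrative sketch of the keyword=name syntax for an ordinary generalized linear model, the following statements write residuals, leverage, a Cook distance type statistic, and the DFBETA statistics for every parameter to an output data set. The data set Assay and its variables are hypothetical; the variable names to the right of the equal signs are arbitrary.

```sas
/* Hypothetical count data: two treatments observed at five doses */
data assay;
   input trt $ dose y @@;
   datalines;
A 1 2  A 2 3  A 3 6  A 4 7   A 5 11
B 1 1  B 2 4  B 3 5  B 4 10  B 5 14
;

proc genmod data=assay;
   class trt;
   model y = trt dose / dist=poisson link=log;
   output out=diag
          pred=mu                     /* predicted mean                            */
          reschi=r_chi resdev=r_dev   /* Pearson and deviance residuals            */
          stdreschi=sr_chi            /* standardized Pearson residual             */
          leverage=h cookd=cd         /* leverage and Cook distance type statistic */
          dfbeta=_all_;               /* one DFBETA_ParameterName variable per parameter */
run;

proc print data=diag;
run;
```

Because this model has no REPEATED statement, the cluster-level keywords (for example, CLUSTERCOOKD) are not requested here; as noted above, they would be set to missing.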
Programming Statements
Although the most commonly used link and probability distributions are available as built-in functions, the
GENMOD procedure enables you to define your own link functions and response probability distributions by
using the FWDLINK, INVLINK, VARIANCE, and DEVIANCE statements. The variables assigned in these
statements can have values computed in programming statements.
These programming statements can occur anywhere between the PROC GENMOD statement and the RUN
statement. Variable names used in programming statements must be unique. Variables from the input data set
can be referenced in programming statements. The mean, linear predictor, and response are represented by
the automatic variables _MEAN_, _XBETA_, and _RESP_, respectively, which can be referenced in your
programming statements. Programming statements are used to define the functional dependencies of the
link function, the inverse link function, the variance function, and the deviance function on the mean, linear
predictor, and response variable.
The following statements illustrate the use of programming statements. Even though you usually request
the Poisson distribution by specifying DIST=POISSON as a MODEL statement option, you can define the
variance and deviance functions for the Poisson distribution by using the VARIANCE and DEVIANCE
statements. For example, the following statements perform the same analysis as the Poisson regression
example in the section “Getting Started: GENMOD Procedure” on page 2876.
The statements must be in logical order for computation, just as in a DATA step.
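proc genmod;
   class car age;
   a = _MEAN_;                               /* fitted mean                      */
   y = _RESP_;                               /* observed response                */
   d = 2 * ( y * log( y / a ) - ( y - a ) ); /* Poisson deviance contribution    */
   variance var = a;                         /* Poisson variance function, V = a */
   deviance dev = d;
   model c = car age / link = log offset = ln;
run;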
The variables var and dev are dummy variables used internally by the procedure to identify the variance and
deviance functions. Any valid SAS variable names can be used.
Similarly, the log link function and its inverse could be defined with the FWDLINK and INVLINK statements,
as follows:
fwdlink link = log(_MEAN_);
invlink ilink = exp(_XBETA_);
These statements are for illustration, and they work well for most Poisson regression problems. If, however,
in the iterative fitting process, the mean parameter becomes too close to 0, or a 0 response value occurs, an
error condition occurs when the procedure attempts to evaluate the log function. You can circumvent this
kind of problem by using IF-THEN/ELSE clauses or other conditional statements to check for possible error
conditions and appropriately define the functions for these cases.
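One way to guard the deviance against a zero response, sketched here under stated assumptions, is to replace the term y*log(y/a) by its limiting value of 0 when y is 0, so that the deviance contribution reduces to 2*a. The data set Claims is hypothetical (it is patterned on an insurance claims layout and deliberately includes a zero count), and this sketch does not guard against a fitted mean that approaches 0; a similar IF-THEN/ELSE check could be added for that case.

```sas
/* Hypothetical claims data; the third observation has a zero count */
data claims;
   input n c car $ age;
   ln = log(n);                 /* offset: log of the number of policyholders */
   datalines;
500   42  small   1
1200  37  medium  1
100    0  large   1
400  101  small   2
500   73  medium  2
300   14  large   2
;

proc genmod data=claims;
   class car age;
   a = _MEAN_;
   y = _RESP_;
   /* y*log(y/a) tends to 0 as y tends to 0, so log is never evaluated at 0 */
   if y > 0 then d = 2 * ( y * log( y / a ) - ( y - a ) );
   else d = 2 * a;
   variance var = a;
   deviance dev = d;
   model c = car age / link = log offset = ln;
run;
```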
Data set variables can be referenced in user definitions of the link function and response distributions by
using programming statements and the FWDLINK, INVLINK, DEVIANCE, and VARIANCE statements.
See the DEVIANCE, VARIANCE, FWDLINK, and INVLINK statements for more information.
The syntax of programming statements used in PROC GENMOD is identical to that used in the NLMIXED
procedure and the GLIMMIX procedure (see Chapter 68, “The NLMIXED Procedure,” and Chapter 43,
“The GLIMMIX Procedure”) and the MODEL procedure (see the SAS/ETS User’s Guide). Most of the
programming statements that can be used in the DATA step can also be used in the GENMOD procedure.
See SAS Statements: Reference for a description of SAS programming statements. The following are some
commonly used programming statements.
ABORT;
ARRAY arrayname < [ dimensions ] > < $ > < variables-and-constants >;
CALL name < (expression < , expression . . . >) >;
DELETE;
DO < variable = expression < TO expression > < BY expression > >
< , expression < TO expression > < BY expression > > . . .
< WHILE expression > < UNTIL expression >;
END;
GOTO statement-label;
IF expression;
IF expression THEN program-statement;
ELSE program-statement;
variable = expression;
variable + expression;
LINK statement-label;
PUT < variable > < = > . . . ;
RETURN;
SELECT < (expression) >;
STOP;
SUBSTR(variable, index, length)= expression;
WHEN (expression) program-statement;
OTHERWISE program-statement;
REPEATED Statement
REPEATED SUBJECT=subject-effect < / options > ;
The REPEATED statement specifies the covariance structure of multivariate responses for GEE model fitting
in the GENMOD procedure. In addition, the REPEATED statement controls the iterative fitting algorithm
used in GEEs and specifies optional output. Other GENMOD procedure statements, such as the MODEL and
CLASS statements, are used in the same way as they are for ordinary generalized linear models to specify
the regression model for the mean of the responses.
Table 42.8 summarizes the options available in the REPEATED statement.
Table 42.8 REPEATED Statement Options

Option         Description
ALPHAINIT=     Specifies initial values for log odds ratio regression parameters
CONVERGE=      Specifies the convergence criterion for GEE parameter estimation
CORRB          Displays the estimated correlation matrix
CORRW          Displays the estimated working correlation matrix
COVB           Displays the estimated covariance matrix
ECORRB         Displays the estimated empirical correlation matrix
ECOVB          Displays the estimated empirical covariance matrix
INITIAL=       Specifies initial values of the regression parameters estimation
INTERCEPT=     Specifies either an initial or a fixed value of the intercept
LOGOR=         Specifies the regression structure of the log odds ratio
MAXITER=       Specifies the maximum number of iterations
MCORRB         Displays the estimated model-based correlation matrix
MCOVB          Displays the estimated model-based covariance matrix
MODELSE        Displays an analysis of parameter estimates table
PRINTMLE       Displays an analysis of maximum likelihood parameter estimates table
RUPDATE=       Specifies the number of iterations between updates of the working correlation matrix
SORTED         Groups by subject and sorts within subject
SUBCLUSTER=    Specifies a variable defining subclusters
SUBJECT=       Identifies a different subject, or cluster
TYPE=          Specifies the working correlation matrix structure
V6CORR         Uses the SAS ‘Version 6’ method of computing normalized Pearson chi-square
WITHIN=        Specifies the order of measurements within subjects
YPAIR=         Specifies the pairs of responses
ZDATA=         Specifies the full z matrix
ZROW=          Specifies the rows of the z matrix
SUBJECT=subject-effect
identifies subjects in the input data set. The subject-effect can be a single variable, an interaction effect,
a nested effect, or a combination. Each distinct value, or level, of the effect identifies a different subject,
or cluster. Responses from different subjects are assumed to be statistically independent, and responses
within subjects are assumed to be correlated. A subject-effect must be specified, and variables used in
defining the subject-effect must be listed in the CLASS statement. The input data set does not need to
be sorted by subject (see the SORTED option).
The options control how the model is fit and what output is produced. You can specify the following
options after a slash (/).
ALPHAINIT=numbers
specifies initial values for log odds ratio regression parameters if the LOGOR= option is specified for
binary data. If this option is not specified, an initial value of 0.01 is used for all the parameters.
CONVERGE=number
specifies the convergence criterion for GEE parameter estimation. If the maximum absolute difference
between regression parameter estimates is less than the value of number on two successive iterations,
convergence is declared. If the absolute value of a regression parameter estimate is greater than
0.08, then the absolute difference normalized by the regression parameter value is used instead of the
absolute difference. The default value of number is 0.0001.
CORRW
displays the estimated working correlation matrix. If you specify an exchangeable working correlation
structure with the CORR=EXCH option, the CORRW option is not needed to view the estimated
correlation, since a table is printed by default that contains the single estimated correlation.
CORRB
displays the estimated regression parameter correlation matrix. Both model-based and empirical
correlations are displayed.
COVB
displays the estimated regression parameter covariance matrix. Both model-based and empirical
covariances are displayed.
ECORRB
displays the estimated regression parameter empirical correlation matrix.
ECOVB
displays the estimated regression parameter empirical covariance matrix.
INTERCEPT=number
specifies either an initial or a fixed value of the intercept regression parameter in the GEE model. If
you specify the NOINT option in the MODEL statement, then the intercept is fixed at the value of
number .
INITIAL=numbers
specifies initial values of the regression parameters, other than the intercept parameter,
for GEE estimation. If this option is not specified, the estimated regression parameters assuming
independence for all responses are used for the initial values.
LOGOR=log-odds-ratio-structure-keyword
specifies the regression structure of the log odds ratio used to model the association of the responses
from subjects for binary data. The response syntax must be of the single variable type, the distribution
must be binomial, and the data must be binary. Table 42.9 displays the log odds ratio structure
keywords and the corresponding log odds ratio regression structures. See the section “Alternating
Logistic Regressions” on page 2983 for definitions of the log odds ratio types and examples of
specifying log odds ratio models. You should specify either the LOGOR= or the TYPE= option, but
not both.
Table 42.9 Log Odds Ratio Regression Structures

Keyword              Log Odds Ratio Regression Structure
EXCH                 Exchangeable
FULLCLUST            Fully parameterized clusters
LOGORVAR(variable)   Indicator variable for specifying block effects
NESTK                k-nested
NEST1                1-nested
ZFULL                Fully specified z matrix specified in ZDATA= data set
ZREP                 Single cluster specification for replicated z matrix specified in ZDATA= data set
ZREP(matrix)         Single cluster specification for replicated z matrix
MAXITER=number
MAXIT=number
specifies the maximum number of iterations allowed in the iterative GEE estimation process. The
default number is 50.
MCORRB
displays the estimated regression parameter model-based correlation matrix.
MCOVB
displays the estimated regression parameter model-based covariance matrix.
MODELSE
displays an analysis of parameter estimates table that uses model-based standard errors for inference.
By default, an “Analysis of Parameter Estimates” table based on empirical standard errors is displayed.
PRINTMLE
displays an analysis of maximum likelihood parameter estimates table. The maximum likelihood
estimates are not displayed unless this option is specified.
RUPDATE=number
specifies the number of iterations between updates of the working correlation matrix. For example,
RUPDATE=5 specifies that the working correlation is updated once for every five regression parameter
updates. The default value of number is 1; that is, the working correlation is updated every time the
regression parameters are updated.
SORTED
specifies that the input data are grouped by subject and sorted within subject. If this option is not
specified, then the procedure internally sorts by subject-effect and within subject-effect , if a within
subject-effect is specified.
SUBCLUSTER=variable
SUBCLUST=variable
specifies a variable defining subclusters for the 1-nested or k-nested log odds ratio association modeling
structures. This variable must be listed in the CLASS statement.
TYPE=correlation-structure keyword
CORR=correlation-structure keyword
specifies the structure of the working correlation matrix used to model the correlation of the responses
from subjects. Table 42.10 displays the correlation structure keywords and the corresponding correlation structures. The default working correlation type is independent (CORR=IND). See the
section “Details: GENMOD Procedure” on page 2956 for definitions of the correlation matrix types.
You should specify LOGOR= or TYPE= but not both.
Table 42.10 Correlation Structure Types

Keyword                Correlation Matrix Type
AR | AR(1)             Autoregressive(1)
EXCH | CS              Exchangeable
IND                    Independent
MDEP(number)           m-dependent with m=number
UNSTR | UN             Unstructured
USER | FIXED(matrix)   Fixed, user-specified correlation matrix
For example, you can specify a fixed 4 × 4 correlation matrix with the following option:
type=user( 1.0 0.9 0.8 0.6
           0.9 1.0 0.9 0.8
           0.8 0.9 1.0 0.9
           0.6 0.8 0.9 1.0 )
V6CORR
specifies that the SAS ‘Version 6’ method of computing the normalized Pearson chi-square be used for
working correlation estimation and for model-based covariance matrix scale factor.
WITHINSUBJECT | WITHIN=within subject-effect
defines an effect specifying the order of measurements within subjects. Each distinct level of the within
subject-effect defines a different response from the same subject. If the data are in proper order within
each subject, you do not need to specify this option.
If some measurements do not appear in the data for some subjects, this option properly orders the existing measurements and treats the omitted measurements as missing values. If the WITHINSUBJECT=
option is not used in this situation, measurements might be improperly ordered and missing values
assumed for the last measurements in a cluster.
Variables used in defining the within subject-effect must be listed in the CLASS statement.
YPAIR=variable-list
specifies the variables in the ZDATA= data set corresponding to pairs of responses for log odds ratio
association modeling.
ZDATA=SAS-data-set
specifies a SAS data set containing either the full z matrix for log odds ratio association modeling or
the z matrix for a single complete cluster to be replicated for all clusters.
ZROW=variable-list
specifies the variables in the ZDATA= data set corresponding to rows of the z matrix for log odds ratio
association modeling.
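The following statements are a minimal sketch of a GEE fit that uses several of the REPEATED statement options described above. The data set Visits and its variables (id, trt, visit, y) are hypothetical; with so few subjects the fit is meant only to show the syntax, not to produce meaningful estimates.

```sas
/* Hypothetical repeated counts: six subjects, three visits each */
data visits;
   input id trt $ visit y @@;
   datalines;
1 A 1 3  1 A 2 5  1 A 3 4
2 A 1 2  2 A 2 2  2 A 3 3
3 A 1 6  3 A 2 7  3 A 3 9
4 B 1 1  4 B 2 0  4 B 3 2
5 B 1 4  5 B 2 3  5 B 3 3
6 B 1 2  6 B 2 1  6 B 3 1
;

proc genmod data=visits;
   /* Variables in the SUBJECT= and WITHIN= effects must be CLASS variables */
   class id trt visit;
   model y = trt / dist=poisson link=log;
   repeated subject=id / within=visit   /* order of measurements within subject  */
                         type=exch      /* exchangeable working correlation      */
                         corrw          /* display working correlation matrix    */
                         modelse;       /* also show model-based standard errors */
run;
```

Replacing TYPE=EXCH with another keyword from Table 42.10, such as TYPE=AR(1) or TYPE=UN, changes only the working correlation structure; the mean model is still specified in the MODEL statement.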
SLICE Statement
SLICE model-effect < / options > ;
The SLICE statement provides a general mechanism for performing a partitioned analysis of the LS-means
for an interaction. This analysis is also known as an analysis of simple effects.
The SLICE statement uses the same options as the LSMEANS statement, which are summarized in Table 19.21. For details about the syntax of the SLICE statement, see the section “SLICE Statement” on
page 505 in Chapter 19, “Shared Concepts and Topics.”
STORE Statement
STORE < OUT= >item-store-name < / LABEL='label' > ;
The STORE statement requests that the procedure save the context and results of the statistical analysis. The
resulting item store has a binary file format that cannot be modified. The contents of the item store can be
processed with the PLM procedure.
For details about the syntax of the STORE statement, see the section “STORE Statement” on page 508 in
Chapter 19, “Shared Concepts and Topics.”
STRATA Statement
STRATA variable < (option) > . . . < variable < (option) > > < / options > ;
The STRATA statement names the variables that define strata or matched sets to use in stratified exact
logistic regression of binary response data, or a stratified exact Poisson regression of count data. An EXACT
statement must also be specified.
Observations that have the same variable values are in the same matched set. For a stratified logistic model,
you can analyze 1:1, 1:n, m:n, and general $m_i$:$n_i$ matched sets where the number of cases and controls
varies across strata. For a stratified Poisson model, you can have any number of observations in each stratum.
At least one variable must be specified to invoke the stratified analysis, and the usual unconditional asymptotic
analysis is not performed. The stratified logistic model has the form
$$\operatorname{logit}(\pi_{hi}) = \alpha_h + x_{hi}'\beta$$
where $\pi_{hi}$ is the event probability for the ith observation in stratum h with covariates $x_{hi}$ and where the
stratum-specific intercepts $\alpha_h$ are the nuisance parameters that are to be conditioned out.
STRATA variables can also be specified in the MODEL statement as classification or continuous covariates;
however, the effects are nondegenerate only when crossed with a nonstratification variable. Specifying several
STRATA statements is the same as specifying one STRATA statement that contains all the strata variables.
The STRATA variables can be either character or numeric, and the formatted values of the STRATA variables
determine the levels. Thus, you can also use formats to group values into levels; see the discussion of the
FORMAT procedure in the Base SAS Procedures Guide.
The “Strata Summary” table is displayed by default. For an exact logistic regression, it displays the number
of strata that have a specific number of events and non-events. For example, if you are analyzing a 1:5
matched study, this table enables you to verify that every stratum in the analysis has exactly one event and
five non-events. Strata that contain only events or only non-events are reported in this table, but such strata
are uninformative and are not used in the analysis. For an exact Poisson regression, the “Strata Summary”
table displays the number of strata that contain a specific number of observations, which enables you to check
whether every stratum in the analysis has the same number of observations.
The ASSESSMENT, BAYES, CONTRAST, EFFECTPLOT, ESTIMATE, LSMEANS, LSMESTIMATE,
OUTPUT, REPEATED, SLICE, and STORE statements are not available with a STRATA statement. Exact
analyses are not performed when you specify a WEIGHT statement, or a model other than LINK=LOGIT
with DIST=BIN or LINK=LOG with DIST=POISSON. An OFFSET= variable is not available with exact
logistic regression.
The following option can be specified for a stratification variable by enclosing the option in parentheses after
the variable name, or it can be specified globally for all STRATA variables after a slash (/).
MISSING
treats missing values (‘.’, ‘._’, ‘.A’, . . . , ‘.Z’ for numeric variables and blanks for character variables)
as valid STRATA variable values.
The following strata options are also available after the slash:
CHECKDEPENDENCY | CHECK=keyword
specifies which variables are to be tested for dependency before the analysis is performed. The available
keywords are as follows:
NONE
performs no dependence checking. Typically, a message about a singular information matrix
is displayed if you have dependent variables. Dependent variables can be identified after the
analysis by noting any missing parameter estimates.
COVARIATES
checks dependence between covariates and an added intercept. Dependent covariates
are removed from the analysis. However, covariates that are linear functions of the strata
variable might not be removed, which results in a singular information matrix message
being displayed in the SAS log. This is the default.
ALL
checks dependence between all the strata and covariates. This option can adversely affect
performance if you have a large number of strata.
NOSUMMARY
suppresses the display of the “Strata Summary” table.
INFO
displays the “Strata Information” table, which includes the stratum number, levels of the STRATA
variables that define the stratum, and the total frequency for each stratum. Since the number of strata
can be very large, this table is displayed only by request.
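As a hedged sketch of a stratified exact logistic analysis, the following statements analyze a hypothetical 1:1 matched data set in which Pair identifies the matched set. The data values are invented for illustration, the DESCENDING option is assumed here so that y=1 is treated as the event, and the ESTIMATE=BOTH option of the EXACT statement should be checked against the EXACT statement documentation for your release.

```sas
/* Hypothetical 1:1 matched data: one event (y=1) and one non-event per pair */
data matched;
   input pair y x @@;
   datalines;
1 1 1  1 0 0
2 1 1  2 0 0
3 1 1  3 0 1
4 1 0  4 0 1
5 1 1  5 0 0
6 1 0  6 0 0
7 1 1  7 0 0
8 1 1  8 0 1
;

proc genmod data=matched descending;
   strata pair / info;                     /* INFO adds the "Strata Information" table */
   model y = x / dist=binomial link=logit; /* exact analysis needs LOGIT with BIN      */
   exact x / estimate=both;                /* an EXACT statement must accompany STRATA */
run;
```

The "Strata Summary" table that is displayed by default can be used to confirm that every pair contains exactly one event and one non-event.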
VARIANCE Statement
VARIANCE variable = expression ;
You can specify a probability distribution other than the built-in distributions by using the VARIANCE and
DEVIANCE statements. The variable name variable identifies the variance function to the procedure. The
expression is used to define the functional dependence on the mean, and it can be any arithmetic expression
supported by the DATA step language. You use the automatic variable _MEAN_ to represent the mean in the
expression.
Alternatively, you can define the variance function with programming statements, as detailed in the section
“Programming Statements” on page 2947. This form is convenient for using complex statements such as
IF-THEN/ELSE clauses. Derivatives of the variance function for use during optimization are computed
automatically. The DEVIANCE statement must also appear when the VARIANCE statement is used to define
the variance function.
WEIGHT Statement
WEIGHT | SCWGT variable ;
The WEIGHT statement identifies a variable in the input data set to be used as the exponential family
dispersion parameter weight for each observation. The exponential family dispersion parameter is divided
by the WEIGHT variable value for each observation. This is true regardless of whether the parameter is
estimated by the procedure or specified in the MODEL statement with the SCALE= option. It is also true for
distributions such as the Poisson and binomial that are not usually defined to have a dispersion parameter. For
these distributions, a WEIGHT variable weights the overdispersion parameter, which has the default value of
1.
The WEIGHT variable does not have to be an integer; if it is less than or equal to 0 or if it is missing, the
corresponding observation is not used.
ZEROMODEL Statement
ZEROMODEL effects < / options > ;
The ZEROMODEL statement enables you to perform zero-inflated Poisson regression or zero-inflated
negative binomial regression when those respective distributions are specified by the DIST= option in
the MODEL statement. The effects in the ZEROMODEL statement consist of explanatory variables or
combinations of variables for the zero-inflation probability regression model in a zero-inflated model. The
same effects can be used in both the ZEROMODEL statement and the MODEL statement, or effects can
be used in one statement or the other separately. Explanatory variables can be continuous or classification
variables. Classification variables can be character or numeric. Explanatory variables representing nominal,
or classification, data must be declared in a CLASS statement. Interactions between variables can also be
included as effects. Columns of the design matrix are automatically generated for classification variables and
interactions. The syntax for specification of effects is the same as for the GLM procedure. See the section
“Specification of Effects” on page 2967 for more information. Also see Chapter 44, “The GLM Procedure.”
You can specify the following option in the ZEROMODEL statement after a slash (/).
LINK=keyword
specifies the link function to use in the model. The keywords and their associated link functions are as
follows.
LINK=            Link Function
CLOGLOG, CLL     Complementary log-log
LOGIT            Logit
PROBIT           Probit
If no LINK= option is supplied, the LOGIT link is used. User-defined link functions are not allowed.
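The following is a minimal sketch of a zero-inflated Poisson fit, under the assumption of a simulated data set. The data set name work.zip_demo, the variables x and y, the seed, and the simulation settings are all hypothetical and are used only to make the example self-contained.

```sas
/* Simulate counts with excess zeros (all values are illustrative) */
data work.zip_demo;
   call streaminit(2013);
   do id = 1 to 200;
      x = rand('uniform');
      /* structural zero with probability 0.3, otherwise a Poisson count */
      if rand('bernoulli', 0.3) then y = 0;
      else y = rand('poisson', exp(0.5 + x));
      output;
   end;
run;

proc genmod data=work.zip_demo;
   model y = x / dist=zip;        /* model for the Poisson mean parameter */
   zeromodel x / link=logit;      /* model for the zero-inflation probability */
run;
```

Here the same effect x appears in both statements. The LINK=LOGIT option is written out only for clarity, since it is the default.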
Details: GENMOD Procedure
Generalized Linear Models Theory
This is a brief introduction to the theory of generalized linear models.
Response Probability Distributions
In generalized linear models, the response is assumed to possess a probability distribution of the exponential
form. That is, the probability density of the response Y for continuous response variables, or the probability
function for discrete responses, can be expressed as
\[
f(y) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\}
\]
for some functions a, b, and c that determine the specific distribution. For fixed $\phi$, this is a one-parameter
exponential family of distributions. The functions a and c are such that $a(\phi) = \phi/w$ and $c = c(y, \phi/w)$,
where w is a known weight for each observation. A variable representing w in the input data set can be
specified in the WEIGHT statement. If no WEIGHT statement is specified, $w_i = 1$ for all observations.
Standard theory for this type of distribution gives expressions for the mean and variance of Y:
\[
E(Y) = b'(\theta)
\]
\[
\mathrm{Var}(Y) = \frac{b''(\theta)\,\phi}{w}
\]
where the primes denote derivatives with respect to $\theta$. If $\mu$ represents the mean of Y, then the variance
expressed as a function of the mean is
\[
\mathrm{Var}(Y) = \frac{V(\mu)\,\phi}{w}
\]
where V is the variance function.
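As a worked illustration of these expressions, consider the Poisson distribution with unit weight. The identification of $\theta$, $b$, $a$, and $c$ below is a standard textbook derivation and is offered only as a sketch.

```latex
\begin{align*}
f(y) &= \frac{\mu^{y} e^{-\mu}}{y!}
      = \exp\left\{ y \log\mu - \mu - \log(y!) \right\} \\
\theta &= \log\mu, \qquad b(\theta) = e^{\theta}, \qquad
a(\phi) = \phi = 1, \qquad c(y, \phi) = -\log(y!) \\
E(Y) &= b'(\theta) = e^{\theta} = \mu \\
\mathrm{Var}(Y) &= \frac{b''(\theta)\,\phi}{w} = e^{\theta} = \mu
\quad\Longrightarrow\quad V(\mu) = \mu
\end{align*}
```

The resulting variance function $V(\mu) = \mu$ agrees with the Poisson entry in the list of distributions.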
Probability distributions of the response Y in generalized linear models are usually parameterized in terms of
the mean $\mu$ and dispersion parameter $\phi$ instead of the natural parameter $\theta$. The probability distributions
that are available in the GENMOD procedure are shown in the following list. The zero-inflated Poisson and
zero-inflated negative binomial distributions are not generalized linear models. However, the zero-inflated
distributions are included in PROC GENMOD since they are useful extensions of generalized linear models.
See Long (1997) for a discussion of the zero-inflated Poisson and zero-inflated negative binomial distributions.
The PROC GENMOD scale parameter and the variance of Y are also shown.
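• Normal:
\[
f(y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{y - \mu}{\sigma} \right)^{2} \right]
\quad \text{for } -\infty < y < \infty
\]
\[
\phi = \sigma^{2}, \qquad \text{scale} = \sigma, \qquad \mathrm{Var}(Y) = \sigma^{2}
\]
• Inverse Gaussian:
\[
f(y) = \frac{1}{\sqrt{2\pi y^{3}}\,\sigma} \exp\left[ -\frac{1}{2y} \left( \frac{y - \mu}{\mu\sigma} \right)^{2} \right]
\quad \text{for } 0 < y < \infty
\]
\[
\phi = \sigma^{2}, \qquad \text{scale} = \sigma, \qquad \mathrm{Var}(Y) = \sigma^{2}\mu^{3}
\]
• Gamma:
\[
f(y) = \frac{1}{\Gamma(\nu)\,y} \left( \frac{y\nu}{\mu} \right)^{\nu} \exp\left( -\frac{y\nu}{\mu} \right)
\quad \text{for } 0 < y < \infty
\]
\[
\phi = \nu^{-1}, \qquad \text{scale} = \nu, \qquad \mathrm{Var}(Y) = \frac{\mu^{2}}{\nu}
\]
• Geometric: This is a special case of the negative binomial with k = 1.
\[
f(y) = \frac{\mu^{y}}{(1 + \mu)^{y+1}} \quad \text{for } y = 0, 1, 2, \ldots
\]
\[
\phi = 1, \qquad \mathrm{Var}(Y) = \mu(1 + \mu)
\]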
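• Negative binomial:
\[
f(y) = \frac{\Gamma(y + 1/k)}{\Gamma(y + 1)\,\Gamma(1/k)} \, \frac{(k\mu)^{y}}{(1 + k\mu)^{y + 1/k}}
\quad \text{for } y = 0, 1, 2, \ldots
\]
\[
\phi = 1, \qquad \text{dispersion} = k, \qquad \mathrm{Var}(Y) = \mu + k\mu^{2}
\]
• Poisson:
\[
f(y) = \frac{\mu^{y} e^{-\mu}}{y!} \quad \text{for } y = 0, 1, 2, \ldots
\]
\[
\phi = 1, \qquad \mathrm{Var}(Y) = \mu
\]
• Binomial:
\[
f(y) = \binom{n}{r} \mu^{r} (1 - \mu)^{n - r}
\quad \text{for } y = \frac{r}{n}, \; r = 0, 1, 2, \ldots, n
\]
\[
\phi = 1, \qquad \mathrm{Var}(Y) = \frac{\mu(1 - \mu)}{n}
\]
• Multinomial:
\[
f(y_1, y_2, \ldots, y_k) = \frac{m!}{y_1!\, y_2! \cdots y_k!} \, p_1^{y_1} p_2^{y_2} \cdots p_k^{y_k}
\]
\[
\phi = 1
\]
• Zero-inflated Poisson:
\[
f(y) =
\begin{cases}
\omega + (1 - \omega) e^{-\lambda} & \text{for } y = 0 \\[4pt]
(1 - \omega) \dfrac{\lambda^{y} e^{-\lambda}}{y!} & \text{for } y = 1, 2, \ldots
\end{cases}
\]
\[
\phi = 1
\]
\[
\mu = E(Y) = (1 - \omega)\lambda
\]
\[
\mathrm{Var}(Y) = (1 - \omega)\lambda(1 + \omega\lambda) = \mu + \frac{\omega}{1 - \omega}\,\mu^{2}
\]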
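• Zero-inflated negative binomial:
\[
f(y) =
\begin{cases}
\omega + (1 - \omega)(1 + k\lambda)^{-1/k} & \text{for } y = 0 \\[4pt]
(1 - \omega) \dfrac{\Gamma(y + 1/k)}{\Gamma(y + 1)\,\Gamma(1/k)} \, \dfrac{(k\lambda)^{y}}{(1 + k\lambda)^{y + 1/k}} & \text{for } y = 1, 2, \ldots
\end{cases}
\]
\[
\phi = 1, \qquad \text{dispersion} = k
\]
\[
\mu = E(Y) = (1 - \omega)\lambda
\]
\[
\mathrm{Var}(Y) = (1 - \omega)\lambda(1 + \omega\lambda + k\lambda)
= \mu + \left( \frac{\omega}{1 - \omega} + \frac{k}{1 - \omega} \right) \mu^{2}
\]
• Tweedie (1 < p < 2):
\[
f(y) =
\begin{cases}
e^{-\lambda} & \text{for } y = 0 \\[4pt]
\displaystyle\sum_{n=1}^{\infty} \frac{y^{n\alpha - 1} e^{-y/\gamma}}{\gamma^{n\alpha}\,\Gamma(n\alpha)} \, \frac{\lambda^{n} e^{-\lambda}}{n!} & \text{for } y > 0
\end{cases}
\]
\[
\phi = \frac{\lambda^{1 - p} (\alpha\gamma)^{2 - p}}{2 - p}
\]
\[
\mu = E(Y) = \lambda\alpha\gamma
\]
\[
\mathrm{Var}(Y) = \lambda\alpha\gamma^{2} + \lambda\alpha^{2}\gamma^{2}
\]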
The negative binomial and the zero-inflated negative binomial distributions contain a parameter k, called the
negative binomial dispersion parameter. This is not the same as the generalized linear model dispersion $\phi$,
but it is an additional distribution parameter that must be estimated or set to a fixed value.
For the binomial distribution, the response is the binomial proportion Y = events/trials. The variance
function is $V(\mu) = \mu(1 - \mu)$, and the binomial trials parameter n is regarded as a weight w.
The density function for the Tweedie distribution when 1 < p < 2 is expressed in terms of the parameters of
the compound Poisson distribution. For more information about this representation, see the section “Tweedie
Distribution For Generalized Linear Models” on page 2977. For p > 2, the Tweedie random variable has
positive support and its density function f .y/ can be expressed in terms of stable distributions as defined in
Hougaard (1986).
If a weight variable is present, $\phi$ is replaced with $\phi/w$, where w is the weight variable.
PROC GENMOD works with a scale parameter that is related to the exponential family dispersion parameter
$\phi$ instead of working with $\phi$ itself. The scale parameters are related to the dispersion parameter as shown
previously with the probability distribution definitions. Thus, the scale parameter output in the “Analysis of
Parameter Estimates” table is related to the exponential family dispersion parameter. If you specify a constant
scale parameter with the SCALE= option in the MODEL statement, it is also related to the exponential family
dispersion parameter in the same way.
Link Function
For distributions other than the zero-inflated Poisson or zero-inflated negative binomial, the mean $\mu_i$ of the
response in the ith observation is related to a linear predictor through a monotonic differentiable link function
g.
\[
g(\mu_i) = x_i'\beta
\]
Here, $x_i$ is a fixed known vector of explanatory variables, and $\beta$ is a vector of unknown parameters.
There are two link functions and linear predictors associated with zero-inflated distributions: one for the zero
inflation probability $\omega$, and another for the mean parameter $\lambda$. See the section “Zero-Inflated Models” on
page 2975 for more details about zero-inflated distributions.
Log-Likelihood Functions
Log-likelihood functions for the distributions that are available in the procedure are parameterized in terms
of the means $\mu_i$ and the dispersion parameter $\phi$. Zero-inflated log likelihoods are parameterized in terms of two
parameters, $\lambda$ and $\omega$. The parameter $\omega$ is the zero-inflation probability, and $\lambda$ is a function of the distribution
mean. The relationship between the mean of the zero-inflated Poisson and zero-inflated negative binomial
distributions and the parameter $\lambda$ is defined in the section “Response Probability Distributions” on page 2956.
The term $y_i$ represents the response for the ith observation, and $w_i$ represents the known dispersion weight.
The log-likelihood functions are of the form
\[
L(y, \mu, \phi) = \sum_i \log\left( f(y_i, \mu_i, \phi) \right)
\]
where the sum is over the observations. The forms of the individual contributions
\[
l_i = \log\left( f(y_i, \mu_i, \phi) \right)
\]
are shown in the following list; the parameterizations are expressed in terms of the mean and dispersion
parameters.
For the discrete distributions (binomial, multinomial, negative binomial, and Poisson), the functions computed
as the sum of the $l_i$ terms are not proper log-likelihood functions, since terms involving binomial coefficients
or factorials of the observed counts are dropped from the computation of the log likelihood, and a dispersion
parameter $\phi$ is included in the computation. Deletion of factorial terms and inclusion of a dispersion
parameter do not affect parameter estimates or their estimated covariances for these distributions, and this
is the function used in maximum likelihood estimation. The value of $\phi$ used in computing the reported
log-likelihood function is either the final estimated value, or the fixed value, if the dispersion parameter is
fixed. Even though it is not a proper log-likelihood function in all cases, the function computed as the sum
of the $l_i$ terms is reported in the output as the log likelihood. The proper log-likelihood function is also
computed as the sum of the $ll_i$ terms in the following list, and it is reported as the full log likelihood in the
output.
• Normal:
\[
ll_i = l_i = -\frac{1}{2} \left[ \frac{w_i (y_i - \mu_i)^{2}}{\phi} + \log\left( \frac{\phi}{w_i} \right) + \log(2\pi) \right]
\]
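• Inverse Gaussian:
\[
ll_i = l_i = -\frac{1}{2} \left[ \frac{w_i (y_i - \mu_i)^{2}}{y_i \mu_i^{2} \phi} + \log\left( \frac{\phi y_i^{3}}{w_i} \right) + \log(2\pi) \right]
\]
• Gamma:
\[
ll_i = l_i = \frac{w_i}{\phi} \log\left( \frac{w_i y_i}{\phi \mu_i} \right) - \frac{w_i y_i}{\phi \mu_i} - \log(y_i) - \log\Gamma\left( \frac{w_i}{\phi} \right)
\]
• Negative binomial:
\[
l_i = y_i \log\left( \frac{k\mu_i}{w_i} \right) - (y_i + w_i/k) \log\left( 1 + \frac{k\mu_i}{w_i} \right) + \log\left( \frac{\Gamma(y_i + w_i/k)}{\Gamma(w_i/k)} \right)
\]
\[
ll_i = y_i \log\left( \frac{k\mu_i}{w_i} \right) - (y_i + w_i/k) \log\left( 1 + \frac{k\mu_i}{w_i} \right) + \log\left( \frac{\Gamma(y_i + w_i/k)}{\Gamma(y_i + 1)\,\Gamma(w_i/k)} \right)
\]
• Poisson:
\[
l_i = \frac{w_i}{\phi} \left[ y_i \log(\mu_i) - \mu_i \right]
\]
\[
ll_i = w_i \left[ y_i \log(\mu_i) - \mu_i - \log(y_i!) \right]
\]
• Binomial:
\[
l_i = \frac{w_i}{\phi} \left[ r_i \log(p_i) + (n_i - r_i) \log(1 - p_i) \right]
\]
\[
ll_i = w_i \left[ \log\binom{n_i}{r_i} + r_i \log(p_i) + (n_i - r_i) \log(1 - p_i) \right]
\]
• Multinomial (k categories):
\[
l_i = \frac{w_i}{\phi} \sum_{j=1}^{k} y_{ij} \log(\mu_{ij})
\]
\[
ll_i = w_i \left[ \log(m_i!) + \sum_{j=1}^{k} \left( y_{ij} \log(\mu_{ij}) - \log(y_{ij}!) \right) \right]
\]
• Zero-inflated Poisson:
\[
l_i = ll_i =
\begin{cases}
w_i \log\left[ \omega_i + (1 - \omega_i) \exp(-\lambda_i) \right] & y_i = 0 \\[4pt]
w_i \left[ \log(1 - \omega_i) + y_i \log(\lambda_i) - \lambda_i - \log(y_i!) \right] & y_i > 0
\end{cases}
\]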
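• Zero-inflated negative binomial:
\[
l_i = ll_i =
\begin{cases}
\log\left[ \omega_i + (1 - \omega_i) \left( 1 + \dfrac{k}{w_i} \lambda_i \right)^{-1/k} \right] & y_i = 0 \\[10pt]
\log(1 - \omega_i) + y_i \log\left( \dfrac{k\lambda_i}{w_i} \right)
 - \left( y_i + \dfrac{w_i}{k} \right) \log\left( 1 + \dfrac{k\lambda_i}{w_i} \right)
 + \log\left( \dfrac{\Gamma(y_i + \frac{w_i}{k})}{\Gamma(y_i + 1)\,\Gamma(\frac{w_i}{k})} \right) & y_i > 0
\end{cases}
\]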
• Tweedie:
\[
l_i = ll_i = \log\left( f(y_i; \mu_i, \phi/w_i, p) \right)
\]
Maximum Likelihood Fitting
The GENMOD procedure uses a ridge-stabilized Newton-Raphson algorithm to maximize the log-likelihood
function $L(y, \mu, \phi)$ with respect to the regression parameters. By default, the procedure also produces
maximum likelihood estimates of the scale parameter as defined in the section “Response Probability
Distributions” on page 2956 for the normal, inverse Gaussian, negative binomial, and gamma distributions.
On the rth iteration, the algorithm updates the parameter vector $\beta_r$ with
\[
\beta_{r+1} = \beta_r - H^{-1} s
\]
where H is the Hessian (second derivative) matrix, and s is the gradient (first derivative) vector of the
log-likelihood function, both evaluated at the current value of the parameter vector. That is,
\[
s = [s_j] = \left[ \frac{\partial L}{\partial \beta_j} \right]
\]
and
\[
H = [h_{ij}] = \left[ \frac{\partial^{2} L}{\partial \beta_i \, \partial \beta_j} \right]
\]
In some cases, the scale parameter is estimated by maximum likelihood. In these cases, elements corresponding to the scale parameter are computed and included in s and H.
If $\eta_i = x_i'\beta$ is the linear predictor for observation i and g is the link function, then $\eta_i = g(\mu_i)$, so that
$\mu_i = g^{-1}(x_i'\beta)$ is an estimate of the mean of the ith observation, obtained from an estimate of the parameter
vector $\beta$.
The gradient vector and Hessian matrix for the regression parameters are given by
\[
s = \sum_i \frac{w_i (y_i - \mu_i)\, x_i}{V(\mu_i)\, g'(\mu_i)\, \phi}
\]
\[
H = -X' W_o X
\]
where X is the design matrix, $x_i$ is the transpose of the ith row of X, and V is the variance function. The
matrix $W_o$ is diagonal with its ith diagonal element
\[
w_{oi} = w_{ei} + w_i (y_i - \mu_i) \, \frac{V(\mu_i)\, g''(\mu_i) + V'(\mu_i)\, g'(\mu_i)}{(V(\mu_i))^{2} (g'(\mu_i))^{3} \phi}
\]
where
\[
w_{ei} = \frac{w_i}{\phi\, V(\mu_i)\, (g'(\mu_i))^{2}}
\]
The primes denote derivatives of g and V with respect to $\mu$. The negative of H is called the observed
information matrix. The expected value of $W_o$ is a diagonal matrix $W_e$ with diagonal values $w_{ei}$. If you
replace $W_o$ with $W_e$, then the negative of H is called the expected information matrix. $W_e$ is the weight
matrix for the Fisher scoring method of fitting. Either $W_o$ or $W_e$ can be used in the update equation. The
GENMOD procedure uses Fisher scoring for iterations up to the number specified by the SCORING option
in the MODEL statement, and it uses the observed information matrix on additional iterations.
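As a worked check of the weight formulas, consider a Poisson model with the log link. This is only a sketch of one special case, and it shows that the observed and expected weights coincide there, so Newton-Raphson and Fisher scoring take the same steps.

```latex
\begin{align*}
V(\mu) &= \mu, \quad V'(\mu) = 1, \quad
g(\mu) = \log\mu, \quad g'(\mu) = \frac{1}{\mu}, \quad g''(\mu) = -\frac{1}{\mu^{2}} \\
V(\mu)\,g''(\mu) + V'(\mu)\,g'(\mu)
  &= \mu\left(-\frac{1}{\mu^{2}}\right) + \frac{1}{\mu} = 0
  \quad\Longrightarrow\quad w_{oi} = w_{ei} \\
w_{ei} &= \frac{w_i}{\phi\,\mu_i\,(1/\mu_i)^{2}} = \frac{w_i\,\mu_i}{\phi}
\end{align*}
```

For a link other than the log link, the numerator term is generally nonzero, and the two weight matrices differ whenever $y_i \neq \mu_i$.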
Covariance and Correlation Matrix
The estimated covariance matrix of the parameter estimator is given by
\[
\Sigma = -H^{-1}
\]
where H is the Hessian matrix evaluated using the parameter estimates on the last iteration. Note that
the dispersion parameter, whether estimated or specified, is incorporated into H. Rows and columns
corresponding to aliased parameters are not included in $\Sigma$.
The correlation matrix is the normalized covariance matrix. That is, if $\sigma_{ij}$ is an element of $\Sigma$, then the
corresponding element of the correlation matrix is $\sigma_{ij}/(\sigma_i \sigma_j)$, where $\sigma_i = \sqrt{\sigma_{ii}}$.
Goodness of Fit
Two statistics that are helpful in assessing the goodness of fit of a given generalized linear model are the
scaled deviance and Pearson’s chi-square statistic. For a fixed value of the dispersion parameter $\phi$, the scaled
deviance is defined to be twice the difference between the maximum achievable log likelihood and the log
likelihood at the maximum likelihood estimates of the regression parameters.
Note that these statistics are not valid for GEE models.
If $l(y, \mu)$ is the log-likelihood function expressed as a function of the predicted mean values $\mu$ and the vector
y of response values, then the scaled deviance is defined by
\[
D^{*}(y, \mu) = 2\left( l(y, y) - l(y, \mu) \right)
\]
For specific distributions, this can be expressed as
\[
D^{*}(y, \mu) = \frac{D(y, \mu)}{\phi}
\]
where D is the deviance. The following table displays the deviance for each of the probability distributions
available in PROC GENMOD. The deviance cannot be directly calculated for zero-inflated models. Twice
the negative of the log likelihood is reported instead of the proper deviance for the zero-inflated Poisson and
zero-inflated negative binomial.
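Distribution: Deviance

Normal:
\[
\sum_i w_i (y_i - \mu_i)^{2}
\]
Poisson:
\[
2 \sum_i w_i \left[ y_i \log\left( \frac{y_i}{\mu_i} \right) - (y_i - \mu_i) \right]
\]
Binomial:
\[
2 \sum_i w_i m_i \left[ y_i \log\left( \frac{y_i}{\mu_i} \right) + (1 - y_i) \log\left( \frac{1 - y_i}{1 - \mu_i} \right) \right]
\]
Gamma:
\[
2 \sum_i w_i \left[ -\log\left( \frac{y_i}{\mu_i} \right) + \frac{y_i - \mu_i}{\mu_i} \right]
\]
Inverse Gaussian:
\[
\sum_i \frac{w_i (y_i - \mu_i)^{2}}{\mu_i^{2} y_i}
\]
Multinomial:
\[
\sum_i \sum_j w_i y_{ij} \log\left( \frac{y_{ij}}{p_{ij} m_i} \right)
\]
Negative binomial:
\[
2 \sum_i \left[ y_i \log\left( \frac{y_i}{\mu_i} \right) - (y_i + w_i/k) \log\left( \frac{y_i + w_i/k}{\mu_i + w_i/k} \right) \right]
\]
Zero-inflated Poisson:
\[
-2 \sum_i
\begin{cases}
w_i \log\left[ \omega_i + (1 - \omega_i) \exp(-\lambda_i) \right] & y_i = 0 \\[4pt]
w_i \left[ \log(1 - \omega_i) + y_i \log(\lambda_i) - \lambda_i - \log(y_i!) \right] & y_i > 0
\end{cases}
\]
Zero-inflated negative binomial:
\[
-2 \sum_i
\begin{cases}
\log\left[ \omega_i + (1 - \omega_i) \left( 1 + \dfrac{k}{w_i} \lambda_i \right)^{-1/k} \right] & y_i = 0 \\[10pt]
\log(1 - \omega_i) + y_i \log\left( \dfrac{k\lambda_i}{w_i} \right)
 - \left( y_i + \dfrac{w_i}{k} \right) \log\left( 1 + \dfrac{k\lambda_i}{w_i} \right)
 + \log\left( \dfrac{\Gamma(y_i + \frac{w_i}{k})}{\Gamma(y_i + 1)\,\Gamma(\frac{w_i}{k})} \right) & y_i > 0
\end{cases}
\]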
In the binomial case, yi D ri =mi , where ri is a binomial count and mi is the binomial number of trials
parameter.
In the multinomial case, yij refers to the observed number of occurrences of the jth category for the ith
subpopulation defined by the AGGREGATE= variable, mi is the total number in the ith subpopulation, and
pij is the category probability.
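To illustrate how a deviance follows from the definition of the scaled deviance, the sketch below works through the Poisson case. It uses the Poisson contribution $l_i = (w_i/\phi)[y_i \log(\mu_i) - \mu_i]$ from the section “Log-Likelihood Functions” and the usual convention that $y_i \log(y_i) = 0$ when $y_i = 0$.

```latex
\begin{align*}
D^{*}(y, \mu) &= 2\left( l(y, y) - l(y, \mu) \right) \\
 &= 2 \sum_i \frac{w_i}{\phi}
    \left[ \left( y_i \log y_i - y_i \right) - \left( y_i \log \mu_i - \mu_i \right) \right] \\
 &= \frac{1}{\phi} \cdot 2 \sum_i w_i
    \left[ y_i \log\left( \frac{y_i}{\mu_i} \right) - (y_i - \mu_i) \right]
  = \frac{D(y, \mu)}{\phi}
\end{align*}
```

The bracketed sum is the Poisson deviance, and dividing it by the dispersion parameter gives the scaled deviance.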
Pearson’s chi-square statistic is defined as
\[
X^{2} = \sum_i \frac{w_i (y_i - \mu_i)^{2}}{V(\mu_i)}
\]
and the scaled Pearson’s chi-square is $X^{2}/\phi$.
The scaled version of both of these statistics, under certain regularity conditions, has a limiting chi-square
distribution, with degrees of freedom equal to the number of observations minus the number of parameters
estimated. The scaled version can be used as an approximate guide to the goodness of fit of a given model.
Use caution before applying these statistics to ensure that all the conditions for the asymptotic distributions
hold. McCullagh and Nelder (1989) advise that differences in deviances for nested models can be better
approximated by chi-square distributions than the deviances can themselves.
In cases where the dispersion parameter is not known, an estimate can be used to obtain an approximation to
the scaled deviance and Pearson’s chi-square statistic. One strategy is to fit a model that contains a sufficient
number of parameters so that all systematic variation is removed, estimate $\phi$ from this model, and then use
this estimate in computing the scaled deviance of submodels. The deviance or Pearson’s chi-square divided
by its degrees of freedom is sometimes used as an estimate of the dispersion parameter $\phi$. For example, since
the limiting chi-square distribution of the scaled deviance $D^{*} = D/\phi$ has $n - p$ degrees of freedom, where
n is the number of observations and p is the number of parameters, equating $D^{*}$ to its mean and solving for
$\phi$ yields $\hat{\phi} = D/(n - p)$. Similarly, an estimate of $\phi$ based on Pearson’s chi-square $X^{2}$ is $\hat{\phi} = X^{2}/(n - p)$.
Alternatively, a maximum likelihood estimate of $\phi$ can be computed by the procedure, if desired. See the
discussion in the section “Type 1 Analysis” on page 2968 for more about the estimation of the dispersion
parameter.
Other Fit Statistics
The Akaike information criterion (AIC) is a measure of goodness of model fit that balances model fit against
model simplicity. AIC has the form
\[
\mathrm{AIC} = -2\,LL + 2p
\]
where p is the number of parameters estimated in the model, and LL is the log likelihood evaluated at the
value of the estimated parameters. An alternative form is the corrected AIC given by
\[
\mathrm{AICC} = -2\,LL + 2p\,\frac{n}{n - p - 1}
\]
where n is the total number of observations used.
The Bayesian information criterion (BIC) is a similar measure. BIC is defined by
\[
\mathrm{BIC} = -2\,LL + p \log(n)
\]
See Akaike (1981, 1979) for details of AIC and BIC. See Simonoff (2003) for a discussion of using AIC,
AICC, and BIC with generalized linear models. These criteria are useful in selecting among regression
models, with smaller values representing better model fit. PROC GENMOD uses the full log likelihoods
defined in the section “Log-Likelihood Functions” on page 2960, with all terms included, for computing all
of the criteria.
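As a small arithmetic illustration of the three criteria, suppose a fitted model has full log likelihood $LL = -100$ with $p = 3$ parameters and $n = 50$ observations. These numbers are hypothetical and chosen only to make the computation easy to follow.

```latex
\begin{align*}
\mathrm{AIC}  &= -2(-100) + 2(3) = 206 \\
\mathrm{AICC} &= -2(-100) + 2(3)\,\frac{50}{50 - 3 - 1}
               = 200 + \frac{300}{46} \approx 206.52 \\
\mathrm{BIC}  &= -2(-100) + 3 \log(50) \approx 200 + 11.74 = 211.74
\end{align*}
```

The small-sample correction raises AICC slightly above AIC, and BIC penalizes each parameter more heavily than AIC because $\log(50) > 2$.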
Dispersion Parameter
There are several options available in PROC GENMOD for handling the exponential family dispersion
parameter. The NOSCALE and SCALE options in the MODEL statement affect the way in which the
dispersion parameter is treated. If you specify the SCALE=DEVIANCE option, the dispersion parameter is
estimated by the deviance divided by its degrees of freedom. If you specify the SCALE=PEARSON option,
the dispersion parameter is estimated by Pearson’s chi-square statistic divided by its degrees of freedom.
Otherwise, values of the SCALE and NOSCALE options and the resultant actions are displayed in the
following table.
NOSCALE                       SCALE=value    Action
Present                       Present        Scale fixed at value
Present                       Not present    Scale fixed at 1
Not present                   Not present    Scale estimated by ML
Not present                   Present        Scale estimated by ML, starting point at value
Present (negative binomial)   Not present    k fixed at 0
The meaning of the scale parameter displayed in the “Analysis Of Parameter Estimates” table is different
for the gamma distribution than for the other distributions. The relation of the scale parameter as used by
PROC GENMOD to the exponential family dispersion parameter $\phi$ is displayed in the following table. For
the binomial and Poisson distributions, $\phi$ is the overdispersion parameter, as defined in the “Overdispersion”
section, which follows.

Distribution        Scale
Normal              $\sqrt{\phi}$
Inverse Gaussian    $\sqrt{\phi}$
Gamma               $1/\phi$
Binomial            $\sqrt{\phi}$
Poisson             $\sqrt{\phi}$
In the case of the negative binomial distribution, PROC GENMOD reports the “dispersion” parameter
estimated by maximum likelihood. This is the negative binomial parameter k defined in the section “Response
Probability Distributions” on page 2956.
Overdispersion
Overdispersion is a phenomenon that sometimes occurs in data that are modeled with the binomial or Poisson
distributions. If the estimate of dispersion after fitting, as measured by the deviance or Pearson’s chi-square,
divided by the degrees of freedom, is not near 1, then the data might be overdispersed if the dispersion
estimate is greater than 1 or underdispersed if the dispersion estimate is less than 1. A simple way to model
this situation is to allow the variance functions of these distributions to have a multiplicative overdispersion
factor $\phi$:
• Binomial: $V(\mu) = \phi\mu(1 - \mu)$
• Poisson: $V(\mu) = \phi\mu$
An alternative method to allow for overdispersion in the Poisson distribution is to fit a negative binomial
distribution, where $V(\mu) = \mu + k\mu^{2}$, instead of the Poisson. The parameter k can be estimated by maximum
likelihood, thus allowing for overdispersion of a specific form. This is different from the multiplicative
overdispersion factor $\phi$, which can accommodate many forms of overdispersion.
The models are fit in the usual way, and the parameter estimates are not affected by the value of $\phi$. The
covariance matrix, however, is multiplied by $\phi$, and the scaled deviance and log likelihoods used in likelihood
ratio tests are divided by $\phi$. The profile likelihood function used in computing confidence intervals is also
divided by $\phi$. If you specify a WEIGHT statement, $\phi$ is divided by the value of the WEIGHT variable for
each observation. This has the effect of multiplying the contributions of the log-likelihood function, the
gradient, and the Hessian by the value of the WEIGHT variable for each observation.
The SCALE= option in the MODEL statement enables you to specify a value of $\sigma = \sqrt{\phi}$ for the binomial
and Poisson distributions. If you specify the SCALE=DEVIANCE option in the MODEL statement, the
procedure uses the deviance divided by degrees of freedom as an estimate of , and all statistics are adjusted
appropriately. You can use Pearson’s chi-square instead of the deviance by specifying the SCALE=PEARSON
option.
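The following sketch shows one way to request a Pearson-based dispersion estimate for a Poisson model. The data set work.od_demo, the variables x, mu, and y, the seed, and the simulation settings are assumptions made only so that the example is self-contained.

```sas
/* Simulate overdispersed counts (all values are illustrative) */
data work.od_demo;
   call streaminit(42);
   do id = 1 to 200;
      x = rand('uniform');
      /* gamma-mixed Poisson mean, so the variance exceeds the mean */
      mu = exp(1 + x) * rand('gamma', 2) / 2;
      y = rand('poisson', mu);
      output;
   end;
   drop mu;
run;

proc genmod data=work.od_demo;
   /* estimate the dispersion by Pearson's chi-square / DF */
   model y = x / dist=poisson link=log scale=pearson;
run;
```

Replacing SCALE=PEARSON with SCALE=DEVIANCE uses the deviance divided by its degrees of freedom instead. In either case the regression parameter estimates are the same as in the unscaled fit, and only the standard errors and related statistics are adjusted.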
The function obtained by dividing a log-likelihood function for the binomial or Poisson distribution by a
dispersion parameter is not a legitimate log-likelihood function. It is an example of a quasi-likelihood function.
Most of the asymptotic theory for log likelihoods also applies to quasi-likelihoods, which justifies computing
standard errors and likelihood ratio statistics by using quasi-likelihoods instead of proper log likelihoods.
For details on quasi-likelihood functions, see McCullagh and Nelder (1989, Chapter 9), McCullagh (1983),
and Hardin and Hilbe (2003).
Although the estimate of the dispersion parameter is often used to indicate overdispersion or underdispersion,
this estimate might also indicate other problems such as an incorrectly specified model or outliers in the data.
You should carefully assess whether this type of model is appropriate for your data.
Specification of Effects
Each term in a model is called an effect. Effects are specified in the MODEL statement. You specify effects
with a special notation that uses variable names and operators. There are two types of variables, classification
(or CLASS) variables and continuous variables. There are two primary types of operators, crossing and
nesting. A third type, the bar operator, is used to simplify effect specification. Crossing is the type of operator
most commonly used in generalized linear models.
Variables that identify classification levels are called CLASS variables in SAS and are identified in a CLASS
statement. These might also be called categorical, qualitative, discrete, or nominal variables. CLASS
variables can be either character or numeric. The values of CLASS variables are called levels. For example,
the CLASS variable Sex could have the levels ‘male’ and ‘female’.
In a model, an explanatory variable that is not declared in a CLASS statement is assumed to be continuous.
Continuous variables must be numeric. For example, the heights and weights of subjects in an experiment
are continuous variables.
The types of effects most useful in generalized linear models are shown in the following list. Assume that A,
B, and C are classification variables and that X1 and X2 are continuous variables.
• Regressor effects are specified by writing continuous variables by themselves: X1, X2.
• Polynomial effects are specified by joining two or more continuous variables with asterisks: X1*X2.
• Main effects are specified by writing classification variables by themselves: A, B, C.
• Crossed effects (interactions) are specified by joining two or more classification variables with asterisks:
A*B, B*C, A*B*C.
• Nested effects are specified by following a main effect or crossed effect with a classification variable or
list of classification variables enclosed in parentheses: B(A), C(B A), A*B(C). In the preceding example,
B(A) is “B nested within A.”
• Combinations of continuous and classification variables can be specified in the same way by using the
crossing and nesting operators.
The bar operator consists of two effects joined with a vertical bar (|). It is shorthand notation for including
the left-hand side, the right-hand side, and the cross between them as effects in the model. For example, A | B
is equivalent to A B A*B. The effects in the bar operator can be classification variables, continuous variables,
or combinations of effects defined using operators. Multiple bars are permitted. For example, A | B | C means
A B C A*B A*C B*C A*B*C.
You can specify the maximum number of variables in any effect that results from bar evaluation by specifying
the maximum number, preceded by an @ sign. For example, A | B | C@2 results in effects that involve two or
fewer variables: A B C A*B A*C B*C.
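The effect notation can be sketched in a complete step as follows. The data set work.fx_demo, the variables a, b, c, x1, and y, and the simulation settings are hypothetical and serve only to make the example runnable.

```sas
/* Simulate a small three-factor layout with one covariate (illustrative) */
data work.fx_demo;
   call streaminit(7);
   do a = 1 to 2;
      do b = 1 to 3;
         do c = 1 to 2;
            do rep = 1 to 5;
               x1 = rand('normal');
               y = rand('poisson', exp(0.2*a + 0.1*b + 0.1*x1));
               output;
            end;
         end;
      end;
   end;
run;

proc genmod data=work.fx_demo;
   class a b c;
   /* a|b|c@2 requests main effects and two-way crossings only; */
   /* x1 is not in the CLASS statement, so it is a regressor    */
   model y = a|b|c@2 x1 / dist=poisson link=log;
run;
```

Removing the @2 suffix would also add the three-way crossed effect a*b*c to the model.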
Parameterization Used in PROC GENMOD
Design Matrix
The linear predictor part of a generalized linear model is
\[
\eta = X\beta
\]
where ˇ is an unknown parameter vector and X is a known design matrix. By default, all models automatically
contain an intercept term; that is, the first column of X contains all 1s. Additional columns of X are generated
for classification variables, regression variables, and any interaction terms included in the model. It is
important to understand the ordering of classification variable parameters when you use the ESTIMATE or
CONTRAST statement. The ordering of these parameters is displayed in the “CLASS Level Information”
table and in tables displaying the parameter estimates of the fitted model.
When you specify an overparameterized model with the PARAM=GLM option in the CLASS statement,
some columns of X can be linearly dependent on other columns. For example, when you specify a model
consisting of an intercept term and a classification variable, the column corresponding to any one of the
levels of the classification variable is linearly dependent on the other columns of X. The columns of $X'X$ are
checked in the order in which the model is specified for dependence on preceding columns. If a dependency
is found, the parameter corresponding to the dependent column is set to 0 along with its standard error to
indicate that it is not estimated. The order in which the levels of a classification variable are checked for
dependencies can be set by the ORDER= option in the PROC GENMOD statement or by the ORDER=
option in the CLASS statement. For full-rank parameterizations, the columns of the X matrix are designed to
be linearly independent.
You can exclude the intercept term from the model by specifying the NOINT option in the MODEL statement.
Missing Level Combinations
All levels of interaction terms involving classification variables might not be represented in the data. In that
case, PROC GENMOD does not include parameters in the model for the missing levels.
Type 1 Analysis
A Type 1 analysis consists of fitting a sequence of models, beginning with a simple model with only an
intercept term, and continuing through a model of specified complexity, fitting one additional effect on each
step. Likelihood ratio statistics—that is, twice the difference of the log likelihoods—are computed between
successive models. This type of analysis is sometimes called an analysis of deviance since, if the dispersion
parameter is held fixed for all models, it is equivalent to computing differences of scaled deviances. The
asymptotic distribution of the likelihood ratio statistics, under the hypothesis that the additional parameters
included in the model are equal to 0, is a chi-square with degrees of freedom equal to the difference in the
number of parameters estimated in the successive models. Thus, these statistics can be used in a test of
hypothesis of the significance of each additional term fit.
This type of analysis is not available for GEE models, since the deviance is not computed for this type of
model.
If the dispersion parameter is known, it can be included in the models; if it is unknown, there are two
strategies allowed by PROC GENMOD. The dispersion parameter can be estimated from a maximal model
by the deviance or Pearson’s chi-square divided by degrees of freedom, as discussed in the section “Goodness
of Fit” on page 2963, and this value can be used in all models. An alternative is to consider the dispersion to
be an additional unknown parameter for each model and estimate it by maximum likelihood on each step. By
default, PROC GENMOD estimates scale by maximum likelihood at each step.
A table of likelihood ratio statistics is produced, along with associated p-values based on the asymptotic
chi-square distributions.
If you specify either the SCALE=DEVIANCE or the SCALE=PEARSON option in the MODEL statement,
the dispersion parameter is estimated using the deviance or Pearson’s chi-square statistic, and F statistics are
computed in addition to the chi-square statistics for assessing the significance of each additional term in the
Type 1 analysis. See the section “F Statistics” on page 2972 for a definition of F statistics.
This Type 1 analysis has the general property that the results depend on the order in which the terms of the
model are fitted. The terms are fitted in the order in which they are specified in the MODEL statement.
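A Type 1 analysis can be requested as in the following sketch. The data set work.t1_demo, its variables, and the simulation settings are assumptions introduced only for illustration.

```sas
/* Simulate a two-factor count layout (all values are illustrative) */
data work.t1_demo;
   call streaminit(11);
   do a = 1 to 3;
      do b = 1 to 2;
         do rep = 1 to 10;
            y = rand('poisson', exp(0.3*a + 0.2*b));
            output;
         end;
      end;
   end;
run;

proc genmod data=work.t1_demo;
   class a b;
   /* effects enter sequentially: intercept, then a, then b, then a*b */
   model y = a b a*b / dist=poisson link=log type1;
run;
```

Listing the effects as b a a*b instead would change the sequence of fitted models, and therefore the likelihood ratio statistics reported for a and b.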
Type 3 Analysis
A Type 3 analysis is similar to the Type III sums of squares used in PROC GLM, except that likelihood ratios
are used instead of sums of squares. First, a Type III estimable function is defined for an effect of interest
in exactly the same way as in PROC GLM. Then maximum likelihood estimates are computed under the
constraint that the Type III function of the parameters is equal to 0, by using constrained optimization. Let
the resulting constrained parameter estimates be $\tilde{\beta}$ and the log likelihood be $l(\tilde{\beta})$. Then the likelihood ratio
statistic
\[
S = 2\left( l(\hat{\beta}) - l(\tilde{\beta}) \right)
\]
where $\hat{\beta}$ is the unconstrained estimate, has an asymptotic chi-square distribution under the hypothesis that
the Type III contrast is equal to 0, with degrees of freedom equal to the number of parameters associated with
the effect.
When a Type 3 analysis is requested, PROC GENMOD produces a table that contains the likelihood ratio
statistics, degrees of freedom, and p-values based on the limiting chi-square distributions for each effect in
the model. If you specify either the DSCALE or PSCALE option in the MODEL statement, F statistics are
also computed for each effect.
Options for handling the dispersion parameter are the same as for a Type 1 analysis. The dispersion parameter
can be specified to be a known value, estimated from the deviance or Pearson’s chi-square divided by degrees
of freedom, or estimated by maximum likelihood individually for the unconstrained and constrained models.
By default, PROC GENMOD estimates scale by maximum likelihood for each model fit.
The results of this type of analysis do not depend on the order in which the terms are specified in the MODEL
statement.
A Type 3 analysis can consume considerable computation time since a constrained model is fitted for each
effect. Wald statistics for Type 3 contrasts are computed if you specify the WALD option. Wald statistics for
contrasts use less computation time than likelihood ratio statistics but might be less accurate indicators of the
significance of the effect of interest. The Wald statistic for testing $L'\beta = 0$, where $L$ is the contrast matrix, is
defined by
$$S = (L'\hat{\beta})'\,(L'\hat{\Sigma}L)^{-}\,(L'\hat{\beta})$$
where $\hat{\beta}$ is the maximum likelihood estimate and $\hat{\Sigma}$ is its estimated covariance matrix. The asymptotic
distribution of $S$ is chi-square with $r$ degrees of freedom, where $r$ is the rank of $L$.
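For a contrast that consists of a single column, the Wald statistic reduces to a squared estimate divided by its variance. The following Python sketch shows that special case only; the function name and the numeric inputs are illustrative assumptions.

```python
def wald_statistic_single_contrast(contrast, beta_hat, cov):
    """Wald statistic for one contrast vector l: (l'b)^2 / (l' Sigma l)."""
    estimate = sum(l * b for l, b in zip(contrast, beta_hat))
    variance = sum(
        contrast[i] * cov[i][j] * contrast[j]
        for i in range(len(contrast))
        for j in range(len(contrast))
    )
    return estimate * estimate / variance

# Hypothetical estimates and covariance matrix for two parameters.
beta_hat = [1.0, 2.0]
cov = [[0.25, 0.0], [0.0, 0.25]]
S = wald_statistic_single_contrast([1.0, -1.0], beta_hat, cov)
```

Here the contrast compares the two parameters, and S would be referred to a chi-square distribution with one degree of freedom.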
For models that use the less-than-full-rank parameterization (as specified by the PARAM=GLM option in
the CLASS statement), a Type 3 test of an effect of interest is a test of the Type III estimable functions that
are defined for that effect. When there are no missing cells, the Type 3 test of a main effect corresponds to
testing the hypotheses of equal marginal means. For more information about Type III estimable functions,
see Chapter 44, “The GLM Procedure,” and Chapter 15, “The Four Types of Estimable Functions.” Also see
Littell, Freund, and Spector (1991).
For models that use a full-rank parameterization, all parameters are estimable when there are no missing
cells; so it is unnecessary to define estimable functions. The Type 3 test of an effect of interest is the joint test
that the parameters associated with that effect are zero. For a model that uses effects parameterization (as
specified by the PARAM=EFFECT option in the CLASS statement), testing a main effect is equivalent to
testing the equality of marginal means. For a model that uses reference parameterization (as specified by the
PARAM=REF option in the CLASS statement), the Type 3 test is a test of the equality of cell means at the
reference level of the other model effects. For more information about the coding scheme and the associated
interpretation of results, see Muller and Fetterman (2002, Chapter 14).
For a model without an interaction term, the Type 3 tests of main effects are the same regardless of the type
of parameterization that is used. For a model that contains an interaction term and no missing cells, the
Type 3 test of a component main effect is the same under GLM parameterization and effect parameterization,
because both test the equality of cell means. But this differs from reference parameterization, which tests the
equality of cell means at the reference level of the other component main effect. If some cells are missing,
you can obtain meaningful Type 3 tests only by testing a Type III estimable function, so in this case you
should use GLM parameterization.
The results of a Type 3 analysis do not depend on the order in which the terms are specified in the MODEL
statement.
Generalized score tests for Type III contrasts are computed for GEE models if you specify the TYPE3 option
in the MODEL statement when a REPEATED statement is also used. See the section “Generalized Score
Statistics” on page 2987 for more information about generalized score statistics. Wald tests are also available
with the WALD option in the CONTRAST statement. In this case, the robust covariance matrix estimate is
used for $\Sigma$ in the Wald statistic.
Confidence Intervals for Parameters
Likelihood Ratio-Based Confidence Intervals
PROC GENMOD produces likelihood ratio-based confidence intervals, also known as profile likelihood
confidence intervals, for parameter estimates for generalized linear models. These are not computed for
GEE models, since there is no likelihood for this type of model. Suppose that the parameter vector is
$\beta = [\beta_0, \beta_1, \ldots, \beta_p]'$ and that you want a confidence interval for $\beta_j$. The profile likelihood function for $\beta_j$
is defined as
$$l^*(\beta_j) = \max_{\tilde{\beta}} l(\beta)$$
where $\tilde{\beta}$ is the vector $\beta$ with the jth element fixed at $\beta_j$ and $l$ is the log-likelihood function. If $l = l(\hat{\beta})$ is the
log likelihood evaluated at the maximum likelihood estimate $\hat{\beta}$, then $2(l - l^*(\beta_j))$ has a limiting chi-square
distribution with one degree of freedom if $\beta_j$ is the true parameter value. A $(1-\alpha)100\%$ confidence interval
for $\beta_j$ is
$$\left\{\beta_j : l^*(\beta_j) \ge l_0 = l - 0.5\,\chi^2_{1-\alpha,1}\right\}$$
where $\chi^2_{1-\alpha,1}$ is the $100(1-\alpha)$th percentile of the chi-square distribution with one degree of freedom. The
endpoints of the confidence interval can be found by solving numerically for values of $\beta_j$ that satisfy equality
in the preceding relation. PROC GENMOD solves this by starting at the maximum likelihood estimate of $\beta$.
The log-likelihood function is approximated with a quadratic surface, for which an exact solution is possible.
The process is iterated until convergence to an endpoint is attained. The process is repeated for the other
endpoint.
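The interval definition can be illustrated with a small Python sketch for an intercept-only Poisson model with a log link. With a single parameter there are no nuisance parameters, so the profile likelihood equals the log likelihood. The sketch finds the endpoints by plain bisection, which is a substitute chosen for simplicity and not the quadratic-approximation iteration that the procedure uses. All names and data values are hypothetical.

```python
import math

CHI2_95_1DF = 3.841458820694124  # 95th percentile of chi-square with 1 df

def loglik(beta, total, n):
    """Poisson log likelihood (up to a constant) for an intercept-only log-link model."""
    return total * beta - n * math.exp(beta)

def bisect(f, lo, hi, tol=1e-10):
    """Find a root of f in [lo, hi], assuming f(lo) and f(hi) differ in sign."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def profile_interval(total, n):
    """Return the MLE, both 95% interval endpoints, and the threshold l0."""
    beta_hat = math.log(total / n)
    l0 = loglik(beta_hat, total, n) - 0.5 * CHI2_95_1DF
    g = lambda b: loglik(b, total, n) - l0
    lower = bisect(g, beta_hat - 5.0, beta_hat)
    upper = bisect(g, beta_hat, beta_hat + 5.0)
    return beta_hat, lower, upper, l0

# Hypothetical data: 10 observations whose counts sum to 30.
beta_hat, lower, upper, l0 = profile_interval(total=30, n=10)
```

Because the log likelihood is concave in this example, each side of the maximum contains exactly one endpoint, so bracketing by a fixed distance on either side is sufficient here.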
Convergence is controlled by the CICONV= option in the MODEL statement. Suppose $\epsilon$ is the number
specified in the CICONV= option. The default value of $\epsilon$ is $10^{-4}$. Let the parameter of interest be $\beta_j$, and
define $r = u_j$, the unit vector with a 1 in position j and 0s elsewhere. Convergence is declared on the current
iteration if the following two conditions are satisfied:
$$|l^*(\beta_j) - l_0| \le \epsilon$$
$$(s + \lambda r)'\,H^{-1}\,(s + \lambda r) \le \epsilon$$
where $l^*(\beta_j)$, $s$, and $H$ are the log likelihood, the gradient, and the Hessian evaluated at the current parameter
vector and $\lambda$ is a constant computed by the procedure. The first condition for convergence means that the
log-likelihood function must be within $\epsilon$ of the correct value, and the second condition means that the gradient
vector must be proportional to the restriction vector $r$.
When you specify the LRCI option in the MODEL statement, PROC GENMOD computes profile likelihood
confidence intervals for all parameters in the model, including the scale parameter, if there is one. The
interval endpoints are displayed in a table as well as the values of the remaining parameters at the solution.
Wald Confidence Intervals
You can request that PROC GENMOD produce Wald confidence intervals for the parameters. The $(1-\alpha)100\%$
Wald confidence interval for a parameter $\beta$ is defined as
$$\hat{\beta} \pm z_{1-\alpha/2}\,\hat{\sigma}$$
where $z_p$ is the 100pth percentile of the standard normal distribution, $\hat{\beta}$ is the parameter estimate, and $\hat{\sigma}$ is the
estimate of its standard error.
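A minimal Python sketch of the Wald interval, using the standard library normal quantile function. The function name and the example estimate and standard error are assumptions made for illustration.

```python
from statistics import NormalDist

def wald_interval(estimate, std_error, alpha=0.05):
    """Return the (1 - alpha) * 100% Wald confidence limits."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return estimate - z * std_error, estimate + z * std_error

# Hypothetical estimate 1.0 with standard error 0.5.
lower, upper = wald_interval(1.0, 0.5)
```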
F Statistics
Suppose that $D_0$ is the deviance resulting from fitting a generalized linear model and that $D_1$ is the deviance
from fitting a submodel. Then, under appropriate regularity conditions, the asymptotic distribution of
$(D_1 - D_0)/\phi$ is chi-square with r degrees of freedom, where r is the difference in the number of parameters
between the two models and $\phi$ is the dispersion parameter. If $\phi$ is unknown, and $\hat{\phi}$ is an estimate of $\phi$ based
on the deviance or Pearson’s chi-square divided by degrees of freedom, then, under regularity conditions,
$(n - p)\hat{\phi}/\phi$ has an asymptotic chi-square distribution with $n - p$ degrees of freedom. Here, n is the number of
observations and p is the number of parameters in the model that is used to estimate $\phi$. Thus, the asymptotic
distribution of
$$F = \frac{D_1 - D_0}{r\,\hat{\phi}}$$
is the F distribution with r and $n - p$ degrees of freedom, assuming that $(D_1 - D_0)/\phi$ and $(n - p)\hat{\phi}/\phi$ are
approximately independent.
This F statistic is computed for the Type 1 analysis, Type 3 analysis, and hypothesis tests specified in
CONTRAST statements when the dispersion parameter is estimated by either the deviance or Pearson’s
chi-square divided by degrees of freedom, as specified by the DSCALE or PSCALE option in the MODEL
statement. In the case of a Type 1 analysis, model 0 is the higher-order model obtained by including one
additional effect in model 1. For a Type 3 analysis and hypothesis tests, model 0 is the full specified model
and model 1 is the submodel obtained from constraining the Type III contrast or the user-specified contrast to
be 0.
Lagrange Multiplier Statistics
When you select the NOINT or NOSCALE option, restrictions are placed on the intercept or scale parameters.
Lagrange multiplier, or score, statistics are computed in these cases. These statistics assess the validity of the
restrictions, and they are computed as
$$\chi^2 = \frac{s^2}{V}$$
where $s$ is the component of the score vector evaluated at the restricted maximum corresponding to the
restricted parameter and $V = I_{11} - I_{12} I_{22}^{-1} I_{21}$. The matrix $I$ is the information matrix, 1 refers to the
restricted parameter, and 2 refers to the rest of the parameters.
Under regularity conditions, this statistic has an asymptotic chi-square distribution with one degree of
freedom, and p-values are computed based on this limiting distribution.
If you set k = 0 in a negative binomial model, s is the score statistic of Cameron and Trivedi (1998) for testing
for overdispersion in a Poisson model against alternatives of the form $V(\mu) = \mu + k\mu^2$.
See Rao (1973, p. 417) for more details.
Predicted Values of the Mean
Predicted Values
A predicted value, or fitted value, of the mean $\mu_i$ corresponding to the vector of covariates $x_i$ is given by
$$\hat{\mu}_i = g^{-1}(x_i'\hat{\beta})$$
where $g$ is the link function, regardless of whether $x_i$ corresponds to an observation or not. That is, the
response variable can be missing and the predicted value is still computed for valid $x_i$. In the case where
$x_i$ does not correspond to a valid observation, $x_i$ is not checked for estimability. You should check the
estimability of $x_i$ in this case in order to ensure the uniqueness of the predicted value of the mean. If there is
an offset, it is included in the predicted value computation.
Confidence Intervals on Predicted Values
Approximate confidence intervals for predicted values of the mean can be computed as follows. The variance
of the linear predictor $\eta_i = x_i'\hat{\beta}$ is estimated by
$$\sigma_x^2 = x_i'\,\Sigma\,x_i$$
where $\Sigma$ is the estimated covariance of $\hat{\beta}$. The robust estimate of the covariance is used for $\Sigma$ in the case of
models fit with GEEs.
Approximate $100(1-\alpha)\%$ confidence intervals are computed as
$$g^{-1}\left(x_i'\hat{\beta} \pm z_{1-\alpha/2}\,\sigma_x\right)$$
where $z_p$ is the 100pth percentile of the standard normal distribution and $g$ is the link function. If either endpoint
in the argument is outside the valid range of arguments for the inverse link function, the corresponding
confidence interval endpoint is set to missing.
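The following Python sketch works through the predicted mean and its approximate interval for a model with a log link, so that the inverse link is the exponential function. The function name, the covariate vector, and the parameter and covariance values are all hypothetical.

```python
import math
from statistics import NormalDist

def predicted_mean_interval(x, beta_hat, cov, alpha=0.05):
    """Fitted mean and approximate interval for a log-link model (g^{-1} = exp)."""
    eta = sum(xj * bj for xj, bj in zip(x, beta_hat))
    var_eta = sum(
        x[i] * cov[i][j] * x[j] for i in range(len(x)) for j in range(len(x))
    )
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * math.sqrt(var_eta)
    return math.exp(eta), math.exp(eta - half_width), math.exp(eta + half_width)

# Hypothetical covariates, estimates, and covariance matrix.
mean, lower, upper = predicted_mean_interval(
    x=[1.0, 2.0], beta_hat=[0.5, 0.25], cov=[[0.04, 0.0], [0.0, 0.01]]
)
```

The interval is symmetric on the linear predictor scale and therefore asymmetric around the fitted mean after the inverse link is applied.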
Residuals
The GENMOD procedure computes three kinds of residuals. Residuals are available for all generalized linear
models except multinomial models for ordinal response data, for which residuals are not available. Raw
residuals and Pearson residuals are available for models fit with generalized estimating equations (GEEs).
The raw residual is defined as
$$r_i = y_i - \mu_i$$
where $y_i$ is the ith response and $\mu_i$ is the corresponding predicted mean. You can request raw residuals in an
output data set with the keyword RESRAW in the OUTPUT statement.
The Pearson residual is the square root of the ith contribution to the Pearson’s chi-square:
$$r_{Pi} = (y_i - \mu_i)\sqrt{\frac{w_i}{V(\mu_i)}}$$
You can request Pearson residuals in an output data set with the keyword RESCHI in the OUTPUT statement.
Finally, the deviance residual is defined as the square root of the contribution of the ith observation to the
deviance, with the sign of the raw residual:
$$r_{Di} = \sqrt{d_i}\;\mathrm{sign}(y_i - \mu_i)$$
You can request deviance residuals in an output data set with the keyword RESDEV in the OUTPUT
statement.
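To make the three definitions concrete, here is a small Python sketch for a single Poisson observation, for which the variance function is $V(\mu) = \mu$ and the weighted deviance contribution is $d_i = 2 w_i [y_i \log(y_i/\mu_i) - (y_i - \mu_i)]$. The function name and the example values are illustrative assumptions.

```python
import math

def poisson_residuals(y, mu, weight=1.0):
    """Raw, Pearson, and deviance residuals for one Poisson observation (V(mu) = mu)."""
    raw = y - mu
    pearson = raw * math.sqrt(weight / mu)
    # Poisson deviance contribution; y * log(y / mu) is taken as 0 when y == 0.
    term = y * math.log(y / mu) if y > 0 else 0.0
    d = max(2.0 * weight * (term - (y - mu)), 0.0)  # guard against rounding below 0
    deviance = math.copysign(math.sqrt(d), raw)
    return raw, pearson, deviance

# Hypothetical observation y = 4 with fitted mean 2.
raw, pearson, deviance = poisson_residuals(y=4.0, mu=2.0)
```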
The adjusted Pearson, deviance, and likelihood residuals are defined by Agresti (2002); Williams (1987);
Davison and Snell (1991). These residuals are useful for outlier detection and for assessing the influence of
single observations on the fitted model.
For the generalized linear model, the variance of the ith individual observation is given by
$$v_i = \frac{\phi\,V(\mu_i)}{w_i}$$
where $\phi$ is the dispersion parameter, $w_i$ is a user-specified prior weight (if not specified, $w_i = 1$), $\mu_i$ is the
mean, and $V(\mu_i)$ is the variance function. Let
$$w_{ei} = v_i^{-1}\,(g'(\mu_i))^{-2}$$
for the ith observation, where $g'(\mu_i)$ is the derivative of the link function, evaluated at $\mu_i$. Let $W_e$ be the
diagonal matrix with $w_{ei}$ denoting the ith diagonal element. The weight matrix $W_e$ is used in computing the
expected information matrix.
Define $h_i$ as the ith diagonal element of the matrix
$$W_e^{1/2}\,X\,(X' W_e X)^{-1}\,X'\,W_e^{1/2}$$
The Pearson residuals, standardized to have unit asymptotic variance, are given by
$$r_{Pi} = \frac{y_i - \mu_i}{\sqrt{v_i (1 - h_i)}}$$
You can request standardized Pearson residuals in an output data set with the keyword STDRESCHI in the
OUTPUT statement. The deviance residuals, standardized to have unit asymptotic variance, are given by
$$r_{Di} = \frac{\mathrm{sign}(y_i - \mu_i)\sqrt{d_i}}{\sqrt{\phi (1 - h_i)}}$$
where $d_i$ is the contribution to the total deviance from observation i, and $\mathrm{sign}(y_i - \mu_i)$ is 1 if $y_i - \mu_i$ is
positive and $-1$ if $y_i - \mu_i$ is negative. You can request standardized deviance residuals in an output data set
with the keyword STDRESDEV in the OUTPUT statement. The likelihood residuals are defined by
$$r_{Gi} = \mathrm{sign}(y_i - \mu_i)\sqrt{(1 - h_i)\,r_{Di}^2 + h_i\,r_{Pi}^2}$$
You can request likelihood residuals in an output data set with the keyword RESLIK in the OUTPUT
statement.
Multinomial Models
This type of model applies to cases where an observation can fall into one of k categories. Binary data
occur in the special case where k = 2. If there are $m_i$ observations in a subpopulation i, then the probability
distribution of the number falling into the k categories $y_i = (y_{i1}, y_{i2}, \ldots, y_{ik})$ can be modeled by the
multinomial distribution, defined in the section “Response Probability Distributions” on page 2956, with
$\sum_j y_{ij} = m_i$. The multinomial model is an ordinal model if the categories have a natural order.
Residuals are not available in the OBSTATS table or the output data set for multinomial models.
By default, and consistently with binomial models, the GENMOD procedure orders the response categories
for ordinal multinomial models from lowest to highest and models the probabilities of the lower response
levels. You can change the way PROC GENMOD orders the response levels with the RORDER= option in
the PROC GENMOD statement. The order that PROC GENMOD uses is shown in the “Response Profiles”
output table described in the section “Response Profile” on page 3003.
The GENMOD procedure supports only the ordinal multinomial model. If $(p_{i1}, p_{i2}, \ldots, p_{ik})$ are the
category probabilities, the cumulative category probabilities are modeled with the same link functions used
for binomial data. Let $P_{ir} = \sum_{j=1}^{r} p_{ij}$, $r = 1, 2, \ldots, k-1$, be the cumulative category probabilities (note
that $P_{ik} = 1$). The ordinal model is
$$g(P_{ir}) = \mu_r + x_i'\beta \quad \text{for } r = 1, 2, \ldots, k-1$$
where $\mu_1, \mu_2, \ldots, \mu_{k-1}$ are intercept terms that depend only on the categories and $x_i$ is a vector of covariates
that does not include an intercept term. The logit, probit, and complementary log-log link functions $g$ are
available. These are obtained by specifying the MODEL statement options DIST=MULTINOMIAL and
LINK=CUMLOGIT (cumulative logit), LINK=CUMPROBIT (cumulative probit), or LINK=CUMCLL
(cumulative complementary log-log). Alternatively,
$$P_{ir} = F(\mu_r + x_i'\beta) \quad \text{for } r = 1, 2, \ldots, k-1$$
where $F = g^{-1}$ is a cumulative distribution function for the logistic, normal, or extreme-value distribution.
PROC GENMOD estimates the intercept parameters $\mu_1, \mu_2, \ldots, \mu_{k-1}$ and regression parameters $\beta$ by
maximum likelihood.
The subpopulations i are defined by constant values of the AGGREGATE= variable. This has no effect on the
parameter estimates, but it does affect the deviance and Pearson chi-square statistics; it also affects parameter
estimate standard errors if you specify the SCALE=DEVIANCE or SCALE=PEARSON option.
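As an illustration of how the cumulative model determines the individual category probabilities, the following Python sketch evaluates a cumulative logit model for one covariate vector and takes differences of successive cumulative probabilities. The function name and all parameter values are hypothetical.

```python
import math

def category_probabilities(intercepts, x, beta):
    """Category probabilities under a cumulative logit model.

    intercepts: increasing values mu_1 < ... < mu_{k-1}.
    """
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    cumulative = [1.0 / (1.0 + math.exp(-(m + eta))) for m in intercepts]
    cumulative.append(1.0)  # P_ik = 1
    probs = [cumulative[0]]
    probs += [cumulative[r] - cumulative[r - 1] for r in range(1, len(cumulative))]
    return probs

# Hypothetical three-category model with one covariate.
probs = category_probabilities(intercepts=[-1.0, 0.5], x=[1.0], beta=[0.3])
```

The intercepts must be increasing for every category probability to be positive.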
Zero-Inflated Models
Count data that have an incidence of zeros greater than expected for the underlying probability distribution
of counts can be modeled with a zero-inflated distribution. In GENMOD, the underlying distribution can
be either Poisson or negative binomial. See Lambert (1992), Long (1997) and Cameron and Trivedi (1998)
for more information about zero-inflated models. The population is considered to consist of two types of
individuals. The first type gives Poisson or negative binomial distributed counts, which might contain zeros.
The second type always gives a zero count. Let $\lambda$ be the underlying distribution mean and $\omega$ be the probability
of an individual being of the second type. The parameter $\omega$ is called here the zero-inflation probability, and
is the probability of zero counts in excess of the frequency predicted by the underlying distribution. You can
request that the zero-inflation probability be displayed in an output data set with the PZERO keyword. The
probability distribution of a zero-inflated Poisson random variable Y is given by
$$\Pr(Y = y) = \begin{cases} \omega + (1-\omega)\,e^{-\lambda} & \text{for } y = 0 \\ (1-\omega)\,\dfrac{\lambda^{y} e^{-\lambda}}{y!} & \text{for } y = 1, 2, \ldots \end{cases}$$
and the probability distribution of a zero-inflated negative binomial random variable Y is given by
$$\Pr(Y = y) = \begin{cases} \omega + (1-\omega)(1 + k\lambda)^{-1/k} & \text{for } y = 0 \\ (1-\omega)\,\dfrac{\Gamma(y + 1/k)}{\Gamma(y+1)\,\Gamma(1/k)}\,\dfrac{(k\lambda)^{y}}{(1 + k\lambda)^{y + 1/k}} & \text{for } y = 1, 2, \ldots \end{cases}$$
where k is the negative binomial dispersion parameter.
You can model the parameters $\omega$ and $\lambda$ in GENMOD with the regression models:
$$h(\omega_i) = z_i'\gamma \qquad g(\lambda_i) = x_i'\beta$$
where h is one of the binary link functions: logit, probit, or complementary log-log. The link function h is the
logit link by default, or the link function option specified in the ZEROMODEL statement. The link function
g is the log link function by default, or the link function specified in the MODEL statement, for both the
Poisson and the negative binomial. The covariates $z_i$ for observation i are determined by the model specified
in the ZEROMODEL statement, and the covariates $x_i$ are determined by the model specified in the MODEL
statement. The regression parameters $\gamma$ and $\beta$ are estimated by maximum likelihood.
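The mean and variance of Y for the zero-inflated Poisson are given by
$$E(Y) = \mu = (1-\omega)\lambda$$
$$\mathrm{Var}(Y) = \mu + \frac{\omega}{1-\omega}\,\mu^2$$
and for the zero-inflated negative binomial by
$$E(Y) = \mu = (1-\omega)\lambda$$
$$\mathrm{Var}(Y) = \mu + \left(\frac{\omega}{1-\omega} + \frac{k}{1-\omega}\right)\mu^2$$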
You can request that the mean of Y be displayed for each observation in an output data set with the PRED
keyword.
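The zero-inflated Poisson moments can be checked numerically. The following Python sketch sums the probability function over a truncated support and compares the result with the closed-form mean and variance. The symbol names `lam` and `omega`, the parameter values, and the truncation point are assumptions made for illustration.

```python
import math

def zip_pmf(y, lam, omega):
    """Zero-inflated Poisson probability of the count y."""
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    return omega + (1.0 - omega) * poisson if y == 0 else (1.0 - omega) * poisson

lam, omega = 2.0, 0.3          # hypothetical underlying mean and zero-inflation probability
support = range(0, 60)         # truncation point; the omitted tail is negligible here
total = sum(zip_pmf(y, lam, omega) for y in support)
mean = sum(y * zip_pmf(y, lam, omega) for y in support)
var = sum((y - mean) ** 2 * zip_pmf(y, lam, omega) for y in support)

mu = (1.0 - omega) * lam
closed_form_var = mu + omega / (1.0 - omega) * mu ** 2
```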
Tweedie Distribution For Generalized Linear Models
The Tweedie (1984) distribution has nonnegative support and can have a discrete mass at zero, making it
useful to model responses that are a mixture of zeros and positive values. The Tweedie distribution belongs to
the exponential family, so it conveniently fits in the generalized linear models framework. According to such
parameterization, the mean and variance for the Tweedie random variable are $E(Y) = \mu$ and $\mathrm{Var}(Y) = \phi\mu^{p}$,
respectively, where $\phi$ is the dispersion parameter and $p$ is an extra parameter that controls the variance of the
distribution.
The Tweedie family of distributions includes several important distributions for generalized linear models.
When p D 0, the Tweedie distribution degenerates to the normal distribution; when p D 1, it becomes a
Poisson distribution; when p D 2, it becomes a gamma distribution; when p D 3, it is an inverse Gaussian
distribution.
Except for these special cases, the probability density function for the Tweedie distribution does not have a
closed form and can at best be expressed in terms of series. Numerical approximations are needed to evaluate
the density function. Dunn and Smyth (2005) propose using a finite series and provide a formula to determine
its lower and upper indices in order to achieve a desired accuracy. Alternatively, you can apply the Fourier
transformation on the characteristic function (Dunn and Smyth 2008). These approximations tend to be
expensive when a high level of accuracy is demanded or the data volume becomes large. PROC GENMOD
uses the series method unless it becomes complicated to do so. In this case, the method that is based on the
Fourier transformation is used. The accuracy of approximation is controlled by the EPSILON= option, whose
default value is $10^{-5}$.
The Tweedie distribution is not defined when p is between 0 and 1. In practice, the most interesting range
is from 1 to 2, in which the Tweedie distribution gradually loses its mass at 0 as it shifts from a Poisson
distribution to a gamma distribution. In this case, the Tweedie random variable Y can be generated from a
compound Poisson distribution (Smyth 1996) as
$$Y = \sum_{i=1}^{T} X_i, \qquad T \sim \mathrm{Poisson}(\lambda), \qquad X_i \sim \mathrm{gamma}(\alpha, \gamma)$$
where Y = 0 if T = 0, T and $X_i$ are statistically independent, and $\mathrm{gamma}(\alpha, \gamma)$ denotes a gamma random
variable that has mean $\alpha\gamma$ and variance $\alpha\gamma^2$. These parameters are determined by the Tweedie parameters as
follows:
$$\lambda = \frac{\mu^{2-p}}{\phi\,(2-p)}, \qquad \alpha = \frac{2-p}{p-1}, \qquad \gamma = \phi\,(p-1)\,\mu^{p-1}$$
Conversely, given the parameters of the compound Poisson distribution, the Tweedie distributional parameters
are determined as follows:
$$\mu = \lambda\alpha\gamma, \qquad p = \frac{\alpha + 2}{\alpha + 1}, \qquad \phi = \frac{\lambda^{1-p}\,(\alpha\gamma)^{2-p}}{2-p}$$
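The two parameter mappings are inverses of each other, which a short Python sketch can confirm by a round trip. The function names and the starting values are hypothetical; the mapping applies only for p strictly between 1 and 2.

```python
def tweedie_to_compound_poisson(mu, phi, p):
    """Map Tweedie (mu, phi, p), 1 < p < 2, to compound Poisson (lam, alpha, gamma)."""
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))
    alpha = (2.0 - p) / (p - 1.0)
    gamma = phi * (p - 1.0) * mu ** (p - 1.0)
    return lam, alpha, gamma

def compound_poisson_to_tweedie(lam, alpha, gamma):
    """Map compound Poisson (lam, alpha, gamma) back to Tweedie (mu, phi, p)."""
    mu = lam * alpha * gamma
    p = (alpha + 2.0) / (alpha + 1.0)
    phi = lam ** (1.0 - p) * (alpha * gamma) ** (2.0 - p) / (2.0 - p)
    return mu, phi, p

# Hypothetical Tweedie parameters.
lam, alpha, gamma = tweedie_to_compound_poisson(mu=3.0, phi=2.0, p=1.5)
mu, phi, p = compound_poisson_to_tweedie(lam, alpha, gamma)
```

The compound Poisson variance, $\lambda\alpha\gamma^2(1+\alpha)$, should also agree with the Tweedie variance $\phi\mu^p$ for the same parameters.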
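In terms of generalized linear models parameterizations, the canonical parameter for the Tweedie density
can be expressed as
$$\theta = \begin{cases} \dfrac{\mu^{1-p}}{1-p} & p \ne 1 \\ \log\mu & p = 1 \end{cases}$$
and the function $b(\theta)$ is
$$b(\theta) = \begin{cases} \dfrac{\mu^{2-p}}{2-p} & p \ne 2 \\ \log\mu & p = 2 \end{cases}$$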
Because of the intractability of differentiating the gradient functions with respect to the variance parameters,
PROC GENMOD uses a quasi-Newton approach to maximize the likelihood function, where the Hessian
matrix is approximated by taking finite differences of the gradient functions. Convergence is determined by
a union of two criteria: the relative gradient convergence criterion is set to $10^{-9}$, and the relative function
convergence criterion is set to $2 \times 10^{-9}$. Convergence is declared when at least one of the criteria is attained
during the quasi-Newton iteration.
Before PROC GENMOD maximizes the approximate likelihood, it first maximizes the following extended
log quasi-likelihood, which is constructed according to the definition of McCullagh and Nelder (1989, Chapter
9) as
$$Q_p(y, \mu, \phi, p) = \sum_i q(y_i, \mu_i, \phi, p)$$
where the contribution from an observation is
$$q(y_i, \mu_i, \phi, p) = -0.5\,\log\left(\frac{2\pi\phi\,y_i^{p}}{w_i}\right) - \frac{w_i}{\phi}\left(\frac{y_i^{2-p} - (2-p)\,y_i\,\mu_i^{1-p} + (1-p)\,\mu_i^{2-p}}{(1-p)(2-p)}\right)$$
and $w_i$ is the weight for the observation from the WEIGHT statement.
The range of parameter p for the quasi-likelihood is from 1 to 2. For a specified P= value outside this
range, PROC GENMOD skips optimization of the quasi-likelihood. To maintain numerical stability, PROC
GENMOD imposes a lower bound of 1.1 and an upper bound of 1.99 for computation with the quasi-likelihood.
The estimates that are obtained from optimizing the quasi-likelihood are usually near the full-likelihood
solution so that fewer iterations are needed for maximizing the more expensive full likelihood.
Generalized Estimating Equations
Let $y_{ij}$, $j = 1, \ldots, n_i$, $i = 1, \ldots, K$, represent the jth measurement on the ith subject. There are $n_i$
measurements on subject i and $\sum_{i=1}^{K} n_i$ total measurements.
Correlated data are modeled using the same link function and linear predictor setup (systematic component)
as the independence case. The random component is described by the same variance functions as in the
independence case, but the covariance structure of the correlated measurements must also be modeled.
Let the vector of measurements on the ith subject be $Y_i = [y_{i1}, \ldots, y_{in_i}]'$ with corresponding vector of
means $\mu_i = [\mu_{i1}, \ldots, \mu_{in_i}]'$, and let $V_i$ be the covariance matrix of $Y_i$. Let the vector of independent, or
explanatory, variables for the jth measurement on the ith subject be
$$x_{ij} = [x_{ij1}, \ldots, x_{ijp}]'$$
The generalized estimating equation of Liang and Zeger (1986) for estimating the $p \times 1$ vector of regression
parameters $\beta$ is an extension of the independence estimating equation to correlated data and is given by
$$S(\beta) = \sum_{i=1}^{K} D_i'\,V_i^{-1}\,(Y_i - \mu_i(\beta)) = 0$$
where
$$D_i = \frac{\partial\mu_i}{\partial\beta}$$
Since
$$g(\mu_{ij}) = x_{ij}'\beta$$
where g is the link function, the $p \times n_i$ matrix of partial derivatives of the mean with respect to the regression
parameters for the ith subject is given by
$$D_i' = \frac{\partial\mu_i'}{\partial\beta} = \begin{bmatrix} \dfrac{x_{i11}}{g'(\mu_{i1})} & \cdots & \dfrac{x_{in_i1}}{g'(\mu_{in_i})} \\ \vdots & & \vdots \\ \dfrac{x_{i1p}}{g'(\mu_{i1})} & \cdots & \dfrac{x_{in_ip}}{g'(\mu_{in_i})} \end{bmatrix}$$
Working Correlation Matrix
Let $R_i(\alpha)$ be an $n_i \times n_i$ “working” correlation matrix that is fully specified by the vector of parameters $\alpha$.
The covariance matrix of $Y_i$ is modeled as
$$V_i = \phi\,A_i^{1/2}\,W_i^{-1/2}\,R(\alpha)\,W_i^{-1/2}\,A_i^{1/2}$$
where $A_i$ is an $n_i \times n_i$ diagonal matrix with $v(\mu_{ij})$ as the jth diagonal element and $W_i$ is an $n_i \times n_i$ diagonal
matrix with $w_{ij}$ as the jth diagonal, where $w_{ij}$ is a weight specified with the WEIGHT statement. If there is
no WEIGHT statement, $w_{ij} = 1$ for all i and j. If $R_i(\alpha)$ is the true correlation matrix of $Y_i$, then $V_i$ is the
true covariance matrix of $Y_i$.
The working correlation matrix is usually unknown and must be estimated. It is estimated in the iterative
fitting process by using the current value of the parameter vector $\beta$ to compute appropriate functions of the
Pearson residual
$$e_{ij} = \frac{y_{ij} - \mu_{ij}}{\sqrt{v(\mu_{ij})/w_{ij}}}$$
If you specify the working correlation as $R_0 = I$, which is the identity matrix, the GEE reduces to the
independence estimating equation.
Following are the structures of the working correlation supported by the GENMOD procedure and the
estimators used to estimate the working correlations.
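Working Correlation Structure and Estimator

Fixed
Structure: $\mathrm{Corr}(Y_{ij}, Y_{ik}) = r_{jk}$, where $r_{jk}$ is the jkth element of a constant, user-specified correlation matrix $R_0$.
Estimator: The working correlation is not estimated in this case.

Independent
Structure: $\mathrm{Corr}(Y_{ij}, Y_{ik}) = \begin{cases} 1 & j = k \\ 0 & j \ne k \end{cases}$
Estimator: The working correlation is not estimated in this case.

m-dependent
Structure: $\mathrm{Corr}(Y_{ij}, Y_{i,j+t}) = \begin{cases} 1 & t = 0 \\ \alpha_t & t = 1, 2, \ldots, m \\ 0 & t > m \end{cases}$
Estimator: $\hat{\alpha}_t = \dfrac{1}{(K_t - p)\phi}\sum_{i=1}^{K}\sum_{j \le n_i - t} e_{ij}\,e_{i,j+t}$, where $K_t = \sum_{i=1}^{K}(n_i - t)$

Exchangeable
Structure: $\mathrm{Corr}(Y_{ij}, Y_{ik}) = \begin{cases} 1 & j = k \\ \alpha & j \ne k \end{cases}$
Estimator: $\hat{\alpha} = \dfrac{1}{(N^* - p)\phi}\sum_{i=1}^{K}\sum_{j<k} e_{ij}\,e_{ik}$, where $N^* = 0.5\sum_{i=1}^{K} n_i(n_i - 1)$

Unstructured
Structure: $\mathrm{Corr}(Y_{ij}, Y_{ik}) = \begin{cases} 1 & j = k \\ \alpha_{jk} & j \ne k \end{cases}$
Estimator: $\hat{\alpha}_{jk} = \dfrac{1}{(K - p)\phi}\sum_{i=1}^{K} e_{ij}\,e_{ik}$

Autoregressive AR(1)
Structure: $\mathrm{Corr}(Y_{ij}, Y_{i,j+t}) = \alpha^{t}$ for $t = 0, 1, 2, \ldots, n_i - j$
Estimator: $\hat{\alpha} = \dfrac{1}{(K_1 - p)\phi}\sum_{i=1}^{K}\sum_{j \le n_i - 1} e_{ij}\,e_{i,j+1}$, where $K_1 = \sum_{i=1}^{K}(n_i - 1)$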
Dispersion Parameter
The dispersion parameter $\phi$ is estimated by
$$\hat{\phi} = \frac{1}{N - p}\sum_{i=1}^{K}\sum_{j=1}^{n_i} e_{ij}^2$$
where $N = \sum_{i=1}^{K} n_i$ is the total number of measurements and p is the number of regression parameters.
The square root of $\hat{\phi}$ is reported by PROC GENMOD as the scale parameter in the “Analysis of GEE
Parameter Estimates Model-Based Standard Error Estimates” output table. If a fixed scale parameter is
specified with the NOSCALE option in the MODEL statement, then the fixed value is used in estimating the
model-based covariance matrix and standard errors.
Fitting Algorithm
The following is an algorithm for fitting the specified model by using GEEs. Note that this is not in general
a likelihood-based method of estimation, so that inferences based on likelihoods are not possible for GEE
methods.
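1. Compute an initial estimate of $\beta$ with an ordinary generalized linear model assuming independence.
2. Compute the working correlations $R$ based on the standardized residuals, the current $\beta$, and the
assumed structure of $R$.
3. Compute an estimate of the covariance:
$$V_i = \phi\,A_i^{1/2}\,W_i^{-1/2}\,\hat{R}(\alpha)\,W_i^{-1/2}\,A_i^{1/2}$$
4. Update $\beta$:
$$\beta_{r+1} = \beta_r + \left[\sum_{i=1}^{K}\frac{\partial\mu_i'}{\partial\beta}\,V_i^{-1}\,\frac{\partial\mu_i}{\partial\beta}\right]^{-1}\left[\sum_{i=1}^{K}\frac{\partial\mu_i'}{\partial\beta}\,V_i^{-1}\,(Y_i - \mu_i)\right]$$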
5. Repeat steps 2-4 until convergence.
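The steps above can be sketched in Python for a deliberately simple special case: an intercept-only model with identity link, variance function $v(\mu) = 1$, unit weights, and an exchangeable working correlation. Under those assumptions, $R^{-1}\mathbf{1} = \mathbf{1}/(1 + (n_i - 1)\alpha)$, so no matrix inversion is needed and $\phi$ cancels from the update. This is an illustration of the iteration only, not the procedure's implementation; the function name, the convergence tolerance, and the data are hypothetical.

```python
def fit_gee_intercept(clusters, tol=1e-10, max_iter=100):
    """GEE for an intercept-only mean with exchangeable working correlation.

    Assumes identity link, v(mu) = 1, and unit weights.
    """
    n_total = sum(len(c) for c in clusters)
    p = 1  # one regression parameter (the intercept)
    # Step 1: initial estimate under independence is the overall mean.
    beta = sum(sum(c) for c in clusters) / n_total
    alpha = phi = 0.0
    for _ in range(max_iter):
        # Step 2: Pearson residuals, dispersion, and exchangeable correlation.
        resid = [[y - beta for y in c] for c in clusters]
        phi = sum(e * e for r in resid for e in r) / (n_total - p)
        n_star = 0.5 * sum(len(r) * (len(r) - 1) for r in resid)
        cross = sum(
            r[j] * r[k] for r in resid
            for j in range(len(r)) for k in range(j + 1, len(r))
        )
        alpha = cross / ((n_star - p) * phi)
        # Steps 3 and 4: with D_i a vector of ones, D_i' V_i^{-1} reduces to
        # 1' / (phi * c_i), where c_i = 1 + (n_i - 1) * alpha; phi cancels.
        num = den = 0.0
        for r in resid:
            c = 1.0 + (len(r) - 1) * alpha
            num += sum(r) / c
            den += len(r) / c
        step = num / den
        beta += step
        # Step 5: stop when the update is negligible.
        if abs(step) < tol:
            break
    return beta, alpha, phi

# Hypothetical unbalanced clustered data.
beta, alpha, phi = fit_gee_intercept([[1.0, 1.2, 0.8], [3.0, 3.1], [2.0, 2.2, 1.9, 2.1]])
```

With equal cluster sizes the weights are identical and the estimate stays at the overall mean; with unequal sizes the estimate becomes a weighted mean of the cluster means.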
Missing Data
See Diggle, Liang, and Zeger (1994, Chapter 11) for a discussion of missing values in longitudinal data.
Suppose that you intend to take measurements $Y_{i1}, \ldots, Y_{in}$ for the ith unit. Missing values for which $Y_{ij}$ are
missing whenever $Y_{ik}$ is missing for all $j \ge k$ are called dropouts. Otherwise, missing values that occur
intermixed with nonmissing values are intermittent missing values. The GENMOD procedure can estimate
the working correlation from data containing both types of missing values by using the all available pairs
method, in which all nonmissing pairs of data are used in the moment estimators of the working correlation
parameters defined previously. The resulting covariances and standard errors are valid under the missing
completely at random (MCAR) assumption.
For example, for the unstructured working correlation model,
$$\hat{\alpha}_{jk} = \frac{1}{(K' - p)\phi}\sum e_{ij}\,e_{ik}$$
where the sum is over the units that have nonmissing measurements at times j and k, and $K'$ is the number of
units with nonmissing measurements at j and k. Estimates of the parameters for other working correlation
types are computed in a similar manner, using available nonmissing pairs in the appropriate moment
estimators.
The contribution of the ith unit to the parameter update equation is computed by omitting the elements
of $(Y_i - \mu_i)$, the columns of $D_i' = \partial\mu_i'/\partial\beta$, and the rows and columns of $V_i$ corresponding to missing
measurements.
Parameter Estimate Covariances
The model-based estimator of $\mathrm{Cov}(\hat{\beta})$ is given by
$$\Sigma_m(\hat{\beta}) = I_0^{-1}$$
where
$$I_0 = \sum_{i=1}^{K}\frac{\partial\mu_i'}{\partial\beta}\,V_i^{-1}\,\frac{\partial\mu_i}{\partial\beta}$$
This is the GEE equivalent of the inverse of the Fisher information matrix that is often used in generalized
linear models as an estimator of the covariance of the maximum likelihood estimator of $\beta$. It is a
consistent estimator of the covariance matrix of $\hat{\beta}$ if the mean model and the working correlation matrix are
correctly specified.
The estimator
$$\Sigma_e = I_0^{-1}\,I_1\,I_0^{-1}$$
is called the empirical, or robust, estimator of the covariance matrix of $\hat{\beta}$, where
$$I_1 = \sum_{i=1}^{K}\frac{\partial\mu_i'}{\partial\beta}\,V_i^{-1}\,\mathrm{Cov}(Y_i)\,V_i^{-1}\,\frac{\partial\mu_i}{\partial\beta}$$
It has the property of being a consistent estimator of the covariance matrix of $\hat{\beta}$, even if the working
correlation matrix is misspecified (that is, if $\mathrm{Cov}(Y_i) \ne V_i$). For further information about the robust
variance estimate, see Zeger, Liang, and Albert (1988); Royall (1986); White (1982). In computing $\Sigma_e$, $\beta$
and $\phi$ are replaced by estimates, and $\mathrm{Cov}(Y_i)$ is replaced by the estimate
$$(Y_i - \mu_i(\hat{\beta}))\,(Y_i - \mu_i(\hat{\beta}))'$$
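For intuition, the two estimators can be compared in a scalar special case. The Python sketch below assumes an intercept-only model with identity link, $v(\mu) = 1$, unit weights, and an independence working correlation, so that $I_0 = N/\phi$ and $I_1 = \sum_i (\sum_j e_{ij})^2/\phi^2$. The function name and the data are hypothetical.

```python
def intercept_only_covariances(clusters):
    """Model-based and robust variance of the mean under working independence.

    Assumes an identity link, v(mu) = 1, unit weights, and R = I.
    """
    n_total = sum(len(c) for c in clusters)
    beta_hat = sum(sum(c) for c in clusters) / n_total
    resid = [[y - beta_hat for y in c] for c in clusters]
    phi_hat = sum(e * e for r in resid for e in r) / (n_total - 1)
    # I_0 = sum_i n_i / phi and I_1 = sum_i (sum_j e_ij)^2 / phi^2
    i0 = n_total / phi_hat
    i1 = sum(sum(r) ** 2 for r in resid) / phi_hat ** 2
    model_based = 1.0 / i0
    robust = i1 / i0 ** 2
    return beta_hat, model_based, robust

# Hypothetical data: two clusters of two measurements each.
beta_hat, model_based, robust = intercept_only_covariances([[1.0, 2.0], [3.0, 5.0]])
```

In this example the robust variance exceeds the model-based variance because residuals within a cluster share the same sign, which the independence working correlation ignores.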
Multinomial GEEs
Lipsitz, Kim, and Zhao (1994) and Miller, Davis, and Landis (1993) describe how to extend GEEs to
multinomial data. Currently, only the independent working correlation is available for multinomial models in
PROC GENMOD.
Alternating Logistic Regressions
If the responses are binary (that is, they take only two values), then there is an alternative method to account
for the association among the measurements. The alternating logistic regressions (ALR) algorithm of Carey,
Zeger, and Diggle (1993) models the association between pairs of responses with log odds ratios, instead of
with correlations, as ordinary GEEs do.
For binary data, the correlation between the jth and kth response is, by definition,
$$\mathrm{Corr}(Y_{ij}, Y_{ik}) = \frac{\Pr(Y_{ij} = 1, Y_{ik} = 1) - \mu_{ij}\mu_{ik}}{\sqrt{\mu_{ij}(1 - \mu_{ij})\,\mu_{ik}(1 - \mu_{ik})}}$$
The joint probability in the numerator satisfies the following bounds, by elementary properties of probability,
since $\mu_{ij} = \Pr(Y_{ij} = 1)$:
$$\max(0, \mu_{ij} + \mu_{ik} - 1) \le \Pr(Y_{ij} = 1, Y_{ik} = 1) \le \min(\mu_{ij}, \mu_{ik})$$
The correlation, therefore, is constrained to be within limits that depend in a complicated way on the means
of the data.
The odds ratio, defined as
$$\mathrm{OR}(Y_{ij}, Y_{ik}) = \frac{\Pr(Y_{ij} = 1, Y_{ik} = 1)\,\Pr(Y_{ij} = 0, Y_{ik} = 0)}{\Pr(Y_{ij} = 1, Y_{ik} = 0)\,\Pr(Y_{ij} = 0, Y_{ik} = 1)}$$
is not constrained by the means and is preferred, in some cases, to correlations for binary data.
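The contrast between the two association measures can be seen numerically. The following Python sketch computes the range of correlations that a pair of binary responses can attain for given means, and the odds ratio implied by a given joint probability. The function names and the example means are hypothetical.

```python
import math

def correlation_bounds(mu_j, mu_k):
    """Smallest and largest correlation attainable for two binary means."""
    scale = math.sqrt(mu_j * (1 - mu_j) * mu_k * (1 - mu_k))
    low = max(0.0, mu_j + mu_k - 1.0)
    high = min(mu_j, mu_k)
    return (low - mu_j * mu_k) / scale, (high - mu_j * mu_k) / scale

def odds_ratio(p11, mu_j, mu_k):
    """Odds ratio from the joint probability Pr(Y_j = 1, Y_k = 1) and the two means."""
    p10 = mu_j - p11
    p01 = mu_k - p11
    p00 = 1.0 - p11 - p10 - p01
    return (p11 * p00) / (p10 * p01)

# Hypothetical means 0.2 and 0.7.
lower, upper = correlation_bounds(0.2, 0.7)
independent_or = odds_ratio(0.2 * 0.7, 0.2, 0.7)
```

For these means the correlation cannot reach either -1 or 1, whereas the odds ratio equals 1 under independence and can take any positive value as the joint probability moves toward its bounds.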
The ALR algorithm seeks to model the logarithm of the odds ratio, $\gamma_{ijk} = \log(\mathrm{OR}(Y_{ij}, Y_{ik}))$, as
$$\gamma_{ijk} = z_{ijk}'\alpha$$
where $\alpha$ is a $q \times 1$ vector of regression parameters and $z_{ijk}$ is a fixed, specified vector of coefficients.
The parameter $\gamma_{ijk}$ can take any value in $(-\infty, \infty)$ with $\gamma_{ijk} = 0$ corresponding to no association.
The log odds ratio, when modeled in this way with a regression model, can take different values in subgroups
defined by zij k . For example, zij k can define subgroups within clusters, or it can define “block effects”
between clusters.
You specify a GEE model for binary data that uses log odds ratios by specifying a model for the mean, as
in ordinary GEEs, and a model for the log odds ratios. You can use any of the link functions appropriate
for binary data in the model for the mean, such as logistic, probit, or complementary log-log. The ALR
algorithm alternates between a GEE step to update the model for the mean and a logistic regression step to
update the log odds ratio model. Upon convergence, the ALR algorithm provides estimates of the regression
parameters for the mean, ˇ, the regression parameters for the log odds ratios, ˛, their standard errors, and
their covariances.
Specifying Log Odds Ratio Models
Specifying a regression model for the log odds ratio requires you to specify rows of the z matrix $z_{ijk}$ for
each cluster i and each unique within-cluster pair $(j, k)$. The GENMOD procedure provides several methods
of specifying $z_{ijk}$. These are controlled by the LOGOR=keyword option and associated options in the REPEATED
statement. The supported keywords and the resulting log odds ratio models are described as follows.
EXCH
specifies exchangeable log odds ratios. In this model, the log odds ratio is a
constant for all clusters i and pairs $(j, k)$. The parameter $\alpha$ is the common log
odds ratio.
\[ z_{ijk} = 1 \quad \text{for all } i, j, k \]
FULLCLUST
specifies fully parameterized clusters. Each cluster is parameterized in the same
way, and there is a parameter for each unique pair within clusters. If a complete
cluster is of size n, then there are $n(n-1)/2$ parameters in the vector $\alpha$. For example,
if a full cluster is of size 4, then there are $(4 \times 3)/2 = 6$ parameters, and the z matrix
is of the form
\[ Z = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix} \]
The elements of $\alpha$ correspond to log odds ratios for cluster pairs in the following
order:
   Pair     Parameter
   (1,2)    Alpha1
   (1,3)    Alpha2
   (1,4)    Alpha3
   (2,3)    Alpha4
   (2,4)    Alpha5
   (3,4)    Alpha6
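The pair ordering and the identity structure of the FULLCLUST z matrix can be reproduced with a short Python sketch. This is an illustration only; the function names are hypothetical, and PROC GENMOD builds this matrix internally.

```python
from itertools import combinations

def fullclust_pairs(n):
    """Unique within-cluster pairs (j, k), j < k, in the order (1,2), (1,3), ..., (n-1,n)."""
    return list(combinations(range(1, n + 1), 2))

def fullclust_z(n):
    """z matrix with one parameter per pair: an identity matrix of order n(n-1)/2."""
    q = len(fullclust_pairs(n))
    return [[1 if row == col else 0 for col in range(q)] for row in range(q)]

# For a complete cluster of size 4 there are 6 pairs, matched to Alpha1..Alpha6.
labels = {pair: f"Alpha{idx + 1}" for idx, pair in enumerate(fullclust_pairs(4))}
```

The same ordering of pairs is the one required for the rows of a user-supplied z matrix with the ZFULL and ZREP keywords.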
LOGORVAR(variable)
specifies log odds ratios by cluster. The argument variable is a variable name that
defines the “block effects” between clusters. The log odds ratios are constant
within clusters, but they take a different value for each different value of the
variable. For example, if Center is a variable in the input data set taking a different
value for k treatment centers, then specifying LOGOR=LOGORVAR(Center)
requests a model with different log odds ratios for each of the k centers, constant
within center.
NESTK
specifies k-nested log odds ratios.
You must also specify the SUBCLUST=variable option to define subclusters within clusters. Within each
cluster, PROC GENMOD computes a log odds ratio parameter for pairs having
the same value of variable for both members of the pair and one log odds ratio
parameter for each unique combination of different values of variable.
NEST1
specifies 1-nested log odds ratios.
You must also specify the SUBCLUST=variable option to define subclusters within clusters. There are
two log odds ratio parameters for this model. Pairs having the same value of
variable correspond to one parameter; pairs having different values of variable
correspond to the other parameter. For example, if clusters are hospitals and
subclusters are wards within hospitals, then patients within the same ward have
one log odds ratio parameter, and patients from different wards have the other
parameter.
ZFULL
specifies the full z matrix. You must also specify a SAS data set containing
the z matrix with the ZDATA=data-set-name option. Each observation in
the data set corresponds to one row of the z matrix. You must specify the
ZDATA data set as if all clusters are complete—that is, as if all clusters are
the same size and there are no missing observations. The ZDATA data set
has $K[n_{\max}(n_{\max} - 1)/2]$ observations, where K is the number of clusters and
$n_{\max}$ is the maximum cluster size. If the members of cluster i are ordered as
$1, 2, \ldots, n$, then the rows of the z matrix must be specified for pairs in the order
$(1,2), (1,3), \ldots, (1,n), (2,3), \ldots, (2,n), \ldots, (n-1,n)$. The variables specified in the REPEATED statement for the SUBJECT effect must also be present
in the ZDATA= data set to identify clusters. You must specify variables in the
data set that define the columns of the z matrix by the ZROW=variable-list option.
If there are q columns (q variables in variable-list ), then there are q log odds
ratio parameters. You can optionally specify variables indicating the cluster pairs
corresponding to each row of the z matrix with the YPAIR=(variable1, variable2 )
option. If you specify this option, the data from the ZDATA data set are sorted
within each cluster by variable1 and variable2 . See Example 42.6 for an example
of specifying a full z matrix.
ZREP
specifies a replicated z matrix. You specify z matrix data exactly as you do for the
ZFULL case, except that you specify only one complete cluster. The z matrix for
the one cluster is replicated for each cluster. The number of observations in the
ZDATA data set is $n_{\max}(n_{\max} - 1)/2$, where $n_{\max}$ is the size of a complete cluster (a
cluster with no missing observations).
ZREP(matrix )
specifies direct input of the replicated z matrix. You specify the z matrix for
one cluster with the syntax LOGOR=ZREP( (y1 y2) z1 z2 ... zq, ... ), where
y1 and y2 are numbers representing a pair of observations and the values
z1, z2, ..., zq make up the corresponding row of the z matrix. The number
of rows specified is $n_{\max}(n_{\max} - 1)/2$, where $n_{\max}$ is the size of a complete cluster
(a cluster with no missing observations). For example,
   logor = zrep((1 2) 1 0,
                (1 3) 1 0,
                (1 4) 1 0,
                (2 3) 1 1,
                (2 4) 1 1,
                (3 4) 1 1)
specifies the $(4 \times 3)/2 = 6$ rows of the z matrix for a cluster of size 4 with q = 2 log
odds ratio parameters. The log odds ratio for the pairs (1 2), (1 3), (1 4) is $\alpha_1$, and
the log odds ratio for the pairs (2 3), (2 4), (3 4) is $\alpha_1 + \alpha_2$.
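To see how the rows of a replicated z matrix map to log odds ratios, the following Python sketch evaluates the product of each z row with the parameter vector for the six rows of the example. The parameter values are arbitrary illustrative numbers, not estimates, and the function name is hypothetical.

```python
# Rows of the example z matrix, keyed by the pair (y1, y2) they describe.
Z_ROWS = {
    (1, 2): [1, 0],
    (1, 3): [1, 0],
    (1, 4): [1, 0],
    (2, 3): [1, 1],
    (2, 4): [1, 1],
    (3, 4): [1, 1],
}

def log_odds_ratios(z_rows, alpha):
    """Log odds ratio for each pair: the inner product of its z row with alpha."""
    return {pair: sum(z * a for z, a in zip(row, alpha))
            for pair, row in z_rows.items()}

# Illustrative values alpha1 = 0.5, alpha2 = 0.25 (assumed, not from any fit).
gamma = log_odds_ratios(Z_ROWS, [0.5, 0.25])
```

Pairs that involve observation 1 receive the value of the first parameter, and the remaining pairs receive the sum of the two parameters, as described in the text.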
Quasi-likelihood Information Criterion
The quasi-likelihood information criterion (QIC) was developed by Pan (2001) as a modification of the
Akaike information criterion (AIC) to apply to models fit by GEEs.
Define the quasi-likelihood under the independence working correlation assumption, evaluated with the
parameter estimates under the working correlation of interest, as
\[ Q(\hat{\beta}(R), \phi) = \sum_{i=1}^{K} \sum_{j=1}^{n_i} Q(\hat{\beta}(R), \phi; (Y_{ij}, X_{ij})) \]
where the quasi-likelihood contribution of the jth observation in the ith cluster is defined in the section
“Quasi-likelihood Functions” on page 2986 and $\hat{\beta}(R)$ are the parameter estimates obtained from GEEs with
the working correlation of interest R.
QIC is defined as
\[ \mathrm{QIC}(R) = -2Q(\hat{\beta}(R), \phi) + 2\,\mathrm{trace}(\hat{\Omega}_I \hat{V}_R) \]
where $\hat{V}_R$ is the robust covariance estimate and $\hat{\Omega}_I$ is the inverse of the model-based covariance estimate
under the independent working correlation assumption, evaluated at $\hat{\beta}(R)$, the parameter estimates obtained
from GEEs with the working correlation of interest R.
PROC GENMOD also computes an approximation to $\mathrm{QIC}(R)$ defined by Pan (2001) as
\[ \mathrm{QIC}_u(R) = -2Q(\hat{\beta}(R), \phi) + 2p \]
where p is the number of regression parameters.
Pan (2001) notes that QIC is appropriate for selecting regression models and working correlations, whereas
$\mathrm{QIC}_u$ is appropriate only for selecting regression models.
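The two criteria differ only in the penalty term. The following Python sketch is an illustration under stated assumptions: the quasi-likelihood value and the two matrices are supplied by the caller, the function names are hypothetical, and no claim is made about how PROC GENMOD organizes the computation.

```python
def trace_of_product(a, b):
    """trace(A B) for two square matrices given as lists of rows."""
    n = len(a)
    return sum(a[i][j] * b[j][i] for i in range(n) for j in range(n))

def qic(quasi_lik, omega_i, v_r):
    """QIC(R) = -2 Q + 2 trace(Omega_I V_R)."""
    return -2.0 * quasi_lik + 2.0 * trace_of_product(omega_i, v_r)

def qic_u(quasi_lik, p):
    """QICu(R) = -2 Q + 2 p, where p is the number of regression parameters."""
    return -2.0 * quasi_lik + 2.0 * p

# If the robust covariance equals the model-based independence covariance,
# Omega_I V_R is the identity, the trace equals p, and QIC equals QICu.
v_r = [[2.0, 0.0], [0.0, 4.0]]
omega_i = [[0.5, 0.0], [0.0, 0.25]]
same = qic(-10.0, omega_i, v_r) == qic_u(-10.0, 2)
```

When the robust and model-based covariance estimates disagree, the trace term departs from p, which is why QIC, and not its approximation, is sensitive to the choice of working correlation.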
Quasi-likelihood Functions
See McCullagh and Nelder (1989) and Hardin and Hilbe (2003) for discussions of quasi-likelihood functions.
The contribution of observation j in cluster i to the quasi-likelihood function evaluated at the regression
parameters $\beta$ is given by $Q(\beta, \phi; (Y_{ij}, X_{ij})) = Q_{ij}/\phi$, where $Q_{ij}$ is defined in the following list. These are
used in the computation of the quasi-likelihood information criteria (QIC) for goodness of fit of models fit
with GEEs. The $w_{ij}$ are prior weights, if any, specified with the WEIGHT or FREQ statements. Note that the
definition of the quasi-likelihood for the negative binomial differs from that given in McCullagh and Nelder
(1989). The definition used here allows the negative binomial quasi-likelihood to approach the Poisson as
$k \to 0$.
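• Normal:
\[ Q_{ij} = -\frac{1}{2} w_{ij} (y_{ij} - \mu_{ij})^2 \]
• Inverse Gaussian:
\[ Q_{ij} = \frac{w_{ij}(\mu_{ij} - 0.5\,y_{ij})}{\mu_{ij}^2} \]
• Gamma:
\[ Q_{ij} = -w_{ij}\left[ \frac{y_{ij}}{\mu_{ij}} + \log(\mu_{ij}) \right] \]
• Negative binomial:
\[ Q_{ij} = w_{ij}\left[ \log\Gamma\left(y_{ij} + \frac{1}{k}\right) - \log\Gamma\left(\frac{1}{k}\right) + y_{ij}\log\left(\frac{k\mu_{ij}}{1 + k\mu_{ij}}\right) + \frac{1}{k}\log\left(\frac{1}{1 + k\mu_{ij}}\right) \right] \]
• Poisson:
\[ Q_{ij} = w_{ij}\left( y_{ij}\log(\mu_{ij}) - \mu_{ij} \right) \]
• Binomial:
\[ Q_{ij} = w_{ij}\left[ r_{ij}\log(p_{ij}) + (n_{ij} - r_{ij})\log(1 - p_{ij}) \right] \]
• Multinomial (s categories):
\[ Q_{ij} = w_{ij}\sum_{k=1}^{s} y_{ijk}\log(\mu_{ijk}) \]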
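The statement that the negative binomial quasi-likelihood approaches the Poisson as k tends to 0 can be checked numerically. The following Python sketch codes three of the contributions listed above; the function names are hypothetical and the inputs are arbitrary illustrative values.

```python
import math

def q_normal(y, mu, w=1.0):
    """Normal contribution: -(1/2) w (y - mu)^2."""
    return -0.5 * w * (y - mu) ** 2

def q_poisson(y, mu, w=1.0):
    """Poisson contribution: w (y log(mu) - mu)."""
    return w * (y * math.log(mu) - mu)

def q_negbin(y, mu, k, w=1.0):
    """Negative binomial contribution with dispersion parameter k > 0."""
    inv_k = 1.0 / k
    return w * (math.lgamma(y + inv_k) - math.lgamma(inv_k)
                + y * math.log(k * mu / (1.0 + k * mu))
                + inv_k * math.log(1.0 / (1.0 + k * mu)))

# For a very small k the two contributions agree to several decimal places.
gap = abs(q_negbin(3, 2.0, 1e-6) - q_poisson(3, 2.0))
```

Each contribution is divided by the dispersion parameter and summed over observations and clusters to obtain the quasi-likelihood used in QIC.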
Generalized Score Statistics
Boos (1992) and Rotnitzky and Jewell (1990) describe score tests applicable to testing $L'\beta = 0$ in GEEs,
where $L'$ is a user-specified $r \times p$ contrast matrix or a contrast for a Type 3 test of hypothesis.
Let $\tilde{\beta}$ be the regression parameters resulting from solving the GEE under the restricted model $L'\beta = 0$, and
let $S(\tilde{\beta})$ be the generalized estimating equation values at $\tilde{\beta}$.
The generalized score statistic is
\[ T = S(\tilde{\beta})' \Sigma_m L (L' \Sigma_e L)^{-1} L' \Sigma_m S(\tilde{\beta}) \]
where $\Sigma_m$ is the model-based covariance estimate and $\Sigma_e$ is the empirical covariance estimate. The p-values
for T are computed based on the chi-square distribution with r degrees of freedom.
Assessment of Models Based on Aggregates of Residuals
Lin, Wei, and Ying (2002) present graphical and numerical methods for model assessment based on the
cumulative sums of residuals over certain coordinates (such as covariates or linear predictors) or some
related aggregates of residuals. The distributions of these stochastic processes under the assumed model
can be approximated by the distributions of certain zero-mean Gaussian processes whose realizations can
be generated by simulation. Each observed residual pattern can then be compared, both graphically and
numerically, with a number of realizations from the null distribution. Such comparisons enable you to
assess objectively whether the observed residual pattern reflects anything beyond random fluctuation. These
procedures are useful in determining appropriate functional forms of covariates and link function. You use
the ASSESS|ASSESSMENT statement to perform this kind of model-checking with cumulative sums of
residuals, moving sums of residuals, or LOESS smoothed residuals. See Example 42.8 and Example 42.9 for
examples of model assessment.
Let the model for the mean be
\[ g(\mu_i) = x_i'\beta \]
where $\mu_i$ is the mean of the response $y_i$ and $x_i$ is the vector of covariates for the ith observation. Denote the
raw residual resulting from fitting the model as
\[ e_i = y_i - \hat{\mu}_i \]
and let $x_{ij}$ be the value of the jth covariate in the model for observation i. Then to check the functional form
of the jth covariate, consider the cumulative sum of residuals with respect to $x_{ij}$,
\[ W_j(x) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} I(x_{ij} \le x)\, e_i \]
where $I(\cdot)$ is the indicator function. For any x, $W_j(x)$ is the sum of the residuals with values of $x_j$ less than
or equal to x.
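The observed cumulative sum process is straightforward to compute once the residuals are available. The following Python sketch is an illustration only; the function names are hypothetical and the residuals are taken as given instead of being produced by a fitted model.

```python
import math

def cumulative_residual_process(x_values, residuals, grid):
    """W_j(x) on a grid: n^(-1/2) times the sum of e_i over observations with x_ij <= x."""
    n = len(residuals)
    return [sum(e for x, e in zip(x_values, residuals) if x <= g) / math.sqrt(n)
            for g in grid]

def supremum(process):
    """Largest absolute value of the process over the grid."""
    return max(abs(value) for value in process)

# Four observations with residuals that cancel as x increases.
w = cumulative_residual_process([1, 2, 3, 4], [1.0, -1.0, 2.0, -2.0], [0, 1, 2.5, 4])
```

Evaluating the process at the observed covariate values is sufficient in practice, because the process changes only at those points.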
Denote the score, or gradient vector, by
\[ U(\beta) = \sum_{i=1}^{n} h(x_i'\beta)\, x_i \left( y_i - \nu(x_i'\beta) \right) \]
where $\nu(r) = g^{-1}(r)$, and
\[ h(r) = \frac{1}{g'(\nu(r))\, V(\nu(r))} \]
Let J be the Fisher information matrix
\[ J(\beta) = -\frac{\partial U(\beta)}{\partial \beta'} \]
Define
\[ \hat{W}_j(x) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left[ I(x_{ij} \le x) + \eta'(x; \hat{\beta})\, J^{-1}(\hat{\beta})\, x_i\, h(x_i'\hat{\beta}) \right] e_i Z_i \]
where
\[ \eta(x; \beta) = -\sum_{i=1}^{n} I(x_{ij} \le x) \frac{\partial \nu(x_i'\beta)}{\partial \beta} \]
and $Z_i$ are independent $N(0, 1)$ random variables. Then the conditional distribution of $\hat{W}_j(x)$, given
$(y_i, x_i),\ i = 1, \ldots, n$, under the null hypothesis $H_0$ that the model for the mean is correct, is the same
asymptotically as $n \to \infty$ as the unconditional distribution of $W_j(x)$ (Lin, Wei, and Ying 2002).
You can approximate realizations from the null hypothesis distribution of $W_j(x)$ by repeatedly generating normal samples $Z_i,\ i = 1, \ldots, n$, while holding $(y_i, x_i),\ i = 1, \ldots, n$, at their observed values and computing
$\hat{W}_j(x)$ for each sample.
You can assess the functional form of covariate j by plotting a few realizations of $\hat{W}_j(x)$ on the same plot as
the observed $W_j(x)$ and visually comparing to see how typical the observed $W_j(x)$ is of the null distribution
samples.
You can supplement the graphical inspection method with a Kolmogorov-type supremum test. Let $s_j$ be the
observed value of $S_j = \sup_x |W_j(x)|$. The p-value $\Pr[S_j \ge s_j]$ is approximated by $\Pr[\hat{S}_j \ge s_j]$, where
$\hat{S}_j = \sup_x |\hat{W}_j(x)|$. $\Pr[\hat{S}_j \ge s_j]$ is estimated by generating realizations of $\hat{W}_j(\cdot)$ (1,000 is the default
number of realizations).
You can check the link function instead of the jth covariate by using values of the linear predictor $x_i'\hat{\beta}$ in
place of values of the jth covariate $x_{ij}$. The graphical and numerical methods described previously are then
sensitive to inadequacies in the link function.
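Once realizations of the simulated process are available, the supremum test reduces to counting. The following Python sketch assumes that each realization has already been evaluated on a common grid; generating the realizations themselves is not shown, and the function name is hypothetical.

```python
def supremum_p_value(observed_process, simulated_processes):
    """Estimate Pr[S_hat >= s] as the fraction of simulated suprema at least as large as the observed one."""
    s_obs = max(abs(v) for v in observed_process)
    s_sim = [max(abs(v) for v in path) for path in simulated_processes]
    return sum(1 for s in s_sim if s >= s_obs) / len(s_sim)

# Toy input: one observed path and four simulated paths on a three-point grid.
p_value = supremum_p_value(
    [0.2, -1.5, 0.4],
    [[0.1, 0.5, -0.3], [1.0, -0.2, 0.0], [-2.0, 0.3, 0.1], [0.5, 3.0, -1.0]],
)
```

A small estimated p-value indicates that the observed residual pattern is more extreme than is typical under the assumed model.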
An alternative aggregate of residuals is the moving sum statistic
\[ W_j(x, b) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} I(x - b \le x_{ij} \le x)\, e_i \]
If you specify the keyword WINDOW(b), then the moving sum statistic with window size b is used instead
of the cumulative sum of residuals, with $I(x - b \le x_{ij} \le x)$ replacing $I(x_{ij} \le x)$ in the earlier equation.
If you specify the keyword LOESS(f), loess smoothed residuals are used in the preceding formulas, where f is
the fraction of the data to be used at a given point. If f is not specified, $f = \frac{1}{3}$ is used. For data $(Y_i, X_i),\ i =
1, \ldots, n$, define r as the nearest integer to nf and h as the rth smallest among $|X_i - x|,\ i = 1, \ldots, n$. Let
\[ K_i(x) = K\left( \frac{X_i - x}{h} \right) \]
where
\[ K(t) = \frac{70}{81} \left( 1 - |t|^3 \right)^3 I(-1 \le t \le 1) \]
Define
\[ w_i(x) = K_i(x) \left[ S_2(x) - (X_i - x) S_1(x) \right] \]
where
\[ S_1(x) = \sum_{i=1}^{n} K_i(x)(X_i - x) \]
\[ S_2(x) = \sum_{i=1}^{n} K_i(x)(X_i - x)^2 \]
Then the loess estimate of Y at x is defined by
\[ \hat{Y}(x) = \sum_{i=1}^{n} \frac{w_i(x)}{\sum_{i=1}^{n} w_i(x)}\, Y_i \]
Loess smoothed residuals for checking the functional form of the jth covariate are defined by replacing $Y_i$
with $e_i$ and $X_i$ with $x_{ij}$. To implement the graphical and numerical assessment methods, $I(x_{ij} \le x)$ is
replaced with $w_i(x) / \sum_{i=1}^{n} w_i(x)$ in the formulas for $W_j(x)$ and $\hat{W}_j(x)$.
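The loess weights defined above amount to a local linear fit with a tricube kernel. The following Python sketch implements those formulas directly for a single evaluation point. It is an illustration only: the function names are hypothetical, Python's built-in rounding is used for the nearest integer to nf, and the sketch does not guard against a zero bandwidth or a zero sum of weights.

```python
def tricube(t):
    """K(t) = (70/81) (1 - |t|^3)^3 on [-1, 1], and 0 elsewhere."""
    return (70.0 / 81.0) * (1.0 - abs(t) ** 3) ** 3 if -1.0 <= t <= 1.0 else 0.0

def loess_estimate(xs, ys, x, f=1.0 / 3.0):
    """Loess estimate of Y at x using the fraction f of the data."""
    n = len(xs)
    r = max(1, round(n * f))                     # nearest integer to n*f
    h = sorted(abs(xi - x) for xi in xs)[r - 1]  # rth smallest distance to x
    k = [tricube((xi - x) / h) for xi in xs]
    s1 = sum(ki * (xi - x) for ki, xi in zip(k, xs))
    s2 = sum(ki * (xi - x) ** 2 for ki, xi in zip(k, xs))
    w = [ki * (s2 - (xi - x) * s1) for ki, xi in zip(k, xs)]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

# Exactly linear data are reproduced by the local linear fit.
xs = [float(i) for i in range(10)]
ys = [1.0 + 2.0 * xi for xi in xs]
fitted = loess_estimate(xs, ys, 4.3, f=0.5)
```

Because the weights satisfy $\sum_i w_i(x)(X_i - x) = 0$, a response that is exactly linear in X is recovered without bias, which is a convenient check on an implementation.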
You can perform the model checking described earlier for marginal models for dependent responses fit by
generalized estimating equations (GEEs). Let $y_{ik}$ denote the kth measurement on the ith cluster, $i = 1, \ldots, K$,
$k = 1, \ldots, n_i$, and let $x_{ik}$ denote the corresponding vector of covariates. The marginal mean of the response
$\mu_{ik} = E(y_{ik})$ is assumed to depend on the covariate vector by
\[ g(\mu_{ik}) = x_{ik}'\beta \]
where g is the link function.
Define the vector of residuals for the ith cluster as
\[ e_i = (e_{i1}, \ldots, e_{in_i})' = (y_{i1} - \hat{\mu}_{i1}, \ldots, y_{in_i} - \hat{\mu}_{in_i})' \]
You use the following extension of $W_j(x)$ defined earlier to check the functional form of the jth covariate:
\[ W_j(x) = \frac{1}{\sqrt{K}} \sum_{i=1}^{K} \sum_{k=1}^{n_i} I(x_{ikj} \le x)\, e_{ik} \]
where $x_{ikj}$ is the jth component of $x_{ik}$.
The null distribution of $W_j(x)$ can be approximated by the conditional distribution of
\[ \hat{W}_j(x) = \frac{1}{\sqrt{K}} \sum_{i=1}^{K} \left\{ \sum_{k=1}^{n_i} I(x_{ikj} \le x)\, e_{ik} + \eta'(x, \hat{\beta})\, I_0^{-1} \hat{D}_i' \hat{V}_i^{-1} e_i \right\} Z_i \]
where $\hat{D}_i$ and $\hat{V}_i$ are defined as in the section “Generalized Estimating Equations” on page 2979 with the
unknown parameters replaced by their estimated values,
\[ \eta(x, \beta) = -\sum_{i=1}^{K} \sum_{k=1}^{n_i} I(x_{ikj} \le x) \frac{\partial \mu_{ik}}{\partial \beta} \]
\[ I_0 = \sum_{i=1}^{K} \hat{D}_i' \hat{V}_i^{-1} \hat{D}_i \]
and $Z_i,\ i = 1, \ldots, K$, are independent $N(0, 1)$ random variables. You replace $x_{ikj}$ with the linear predictor
$x_{ik}'\hat{\beta}$ in the preceding formulas to check the link function.
Case Deletion Diagnostic Statistics
For ordinary generalized linear models, regression diagnostic statistics developed by Williams (1987) can be
requested in an output data set or in the OBSTATS table by specifying the DIAGNOSTICS | INFLUENCE
option in the MODEL statement. These diagnostics measure the influence of an individual observation on
model fit, and generalize the one-step diagnostics developed by Pregibon (1981) for the logistic regression
model for binary data.
Preisser and Qaqish (1996) further generalized regression diagnostics to apply to models for correlated data fit
by generalized estimating equations (GEEs), where the influence of entire clusters of correlated observations,
or the influence of individual observations within a cluster, is measured. These diagnostic statistics can be
requested in an output data set or in the OBSTATS table if a model for correlated data is specified with a
REPEATED statement.
The next two sections use the following notation:
$\hat{\beta}$
is the maximum likelihood estimate of the regression parameters $\beta$, or, in the case of correlated data,
the solution of the GEEs.
$\hat{\beta}_{[i]}$
is the corresponding estimate evaluated with the ith observation deleted, or, in the case of correlated
data, with the ith cluster deleted.
$p$
is the dimension of the regression parameter vector $\beta$.
$r_{pi}$
is the standardized Pearson residual $\frac{y_i - \mu_i}{\sqrt{v_i(1 - h_i)}}$, where $v_i$ is the variance of the ith response and $h_i$ is
the leverage defined in the section “H | LEVERAGE” on page 2992.
$v_i$
is the variance of response i, $\mathrm{Var}(Y_i) = \phi V(\mu_i)$, where $V(\mu)$ is the variance function and $\phi$ is the
dispersion parameter.
$w_i$
is the prior weight of the ith observation specified with the WEIGHT statement. If there is no WEIGHT
statement, $w_i = 1$ for all i.
All unknown quantities are replaced by their estimated values in the following two sections.
Diagnostics for Ordinary Generalized Linear Models
The following statistics are available for generalized linear models.
DFBETA
The DFBETA statistic for measuring the influence of the ith observation is defined as the one-step approximation to the difference in the MLE of the regression parameter vector and the MLE of the regression parameter
vector without the ith observation. This one-step approximation assumes a Fisher scoring step, and is given
by
\[ \hat{\beta} - \hat{\beta}_{[i]} \approx \mathrm{DFBETA}_i = (X'WX)^{-1} X_i' W_i^{1/2} (1 - h_i)^{-1/2}\, r_{pi} \]
where $h_i$ is the leverage defined in the section “H | LEVERAGE” on page 2992.
DFBETAS
The standardized DFBETA statistic for assessing the influence of the ith observation on the jth regression
parameter is defined as the DFBETA statistic for the jth parameter divided by its estimated standard deviation,
where the standard deviation is estimated from all the data.
\[ \mathrm{DFBETAS}_{ij} = \mathrm{DFBETA}_{ij} / \hat{\sigma}(\beta_j) \]
DOBS | COOKD | COOKSD
In normal linear regression, the influence of observation i can be measured by Cook’s distance (Cook and
Weisberg 1982). A measure of influence of observation i for generalized linear models that is equivalent to
Cook’s distance for normal linear regression is given by
\[ \mathrm{DOBS}_i = p^{-1} h_i (1 - h_i)^{-1} r_{pi}^2 \]
where $h_i$ is the leverage defined in the section “H | LEVERAGE” on page 2992. This measure is the one-step
approximation to $2p^{-1}[L(\hat{\beta}) - L(\hat{\beta}_{[i]})]$, where $L(\beta)$ is the log likelihood evaluated at $\beta$.
H | LEVERAGE
The Fisher scores, or expected, weight for observation i is $w_{ei} = \frac{w_i}{\phi V(\mu_i)(g'(\mu_i))^2}$. Let W be the diagonal
matrix with $w_{ei}$ as the ith diagonal. The leverage $h_i$ of the ith observation is defined as the ith diagonal
element of the hat matrix
\[ H = W^{1/2} X (X'WX)^{-1} X' W^{1/2} \]
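For a model with an intercept and a single covariate, the hat matrix diagonal has a closed form, which makes the leverage and the DOBS statistic easy to illustrate. The following Python sketch makes that simplifying assumption (p = 2); the weights and standardized Pearson residuals are taken as given, and the function names are hypothetical.

```python
def leverages(x, w):
    """h_i for design rows (1, x_i) and expected weights w_i (the w_ei of the text)."""
    s0 = sum(w)
    s1 = sum(wi * xi for wi, xi in zip(w, x))
    s2 = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = s0 * s2 - s1 * s1
    # (X'WX)^{-1} = (1/det) [[s2, -s1], [-s1, s0]], so
    # h_i = w_i (1, x_i) (X'WX)^{-1} (1, x_i)'.
    return [wi * (s2 - 2.0 * s1 * xi + s0 * xi * xi) / det for wi, xi in zip(w, x)]

def dobs(h, r_p, p=2):
    """DOBS_i = h_i r_pi^2 / (p (1 - h_i))."""
    return [hi * ri ** 2 / (p * (1.0 - hi)) for hi, ri in zip(h, r_p)]

# Illustrative inputs (assumed values, not output from a fitted model).
h = leverages([0.0, 1.0, 2.0, 5.0], [1.0, 0.5, 2.0, 1.0])
d = dobs(h, [0.3, -1.2, 0.8, 2.0])
```

The leverages sum to the number of regression parameters, so an observation whose leverage is large relative to p/n contributes disproportionately to DOBS for a given residual.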
Diagnostics for Models Fit by Generalized Estimating Equations (GEEs)
The diagnostic statistics in this section were developed by Preisser and Qaqish (1996). See the section
“Generalized Estimating Equations” on page 2979 for further information and notation for generalized
estimating equations (GEEs). The following additional notation is used in this section.
Partition the design matrix X and response vector Y by cluster; that is, let $X = (X_1', \ldots, X_K')'$, and
$Y = (Y_1', \ldots, Y_K')'$ corresponding to the K clusters.
Let $n_i$ be the number of responses for cluster i, and denote by $N = \sum_{i=1}^{K} n_i$ the total number of observations.
Denote by $A_i$ the $n_i \times n_i$ diagonal matrix with $V(\mu_{ij})$ as the jth diagonal element. If there is a WEIGHT
statement, the diagonal element of $A_i$ is $V(\mu_{ij})/w_{ij}$, where $w_{ij}$ is the specified weight of the jth observation
in the ith cluster. Let B be the $N \times N$ diagonal matrix with $g'(\mu_{ij})$ as diagonal elements, $i = 1, \ldots, K$,
$j = 1, \ldots, n_i$. Let $B_i$ be the $n_i \times n_i$ diagonal matrix corresponding to cluster i with $g'(\mu_{ij})$ as the jth diagonal
element.
Let W be the $N \times N$ block diagonal weight matrix whose ith block, corresponding to the ith cluster, is the
$n_i \times n_i$ matrix
\[ W_{ei} = B_i^{-1} A_i^{-1/2} R_i^{-1}(\hat{\alpha}) A_i^{-1/2} B_i^{-1} \]
where $R_i$ is the working correlation matrix for cluster i.
Let
\[ Q_i = X_i (X'WX)^{-1} X_i' \]
where $X_i$ is the $n_i \times p$ design matrix corresponding to cluster i.
Define the adjusted residual vector as
\[ E = B(Y - \hat{\mu}) \]
and $E_i = B_i(Y_i - \hat{\mu}_i)$, the estimated residual for the ith cluster.
Let the subscript $[i]$ denote estimates evaluated without the ith cluster, $[it]$ estimates evaluated using all the
data except the tth observation of the ith cluster, and let $i[t]$ denote matrices corresponding to the ith cluster
without the tth observation.
The following statistics are available for generalized estimating equation models.
CH | CLUSTERH | CLEVERAGE
The leverage of cluster i is contained in the matrix $H_i = Q_i W_{ei}$, and is summarized by the trace of $H_i$,
\[ c_{h_i} = \mathrm{tr}(H_i) \]
The leverage $h_{it}$ of the tth observation in the ith cluster is the tth diagonal element of $H_i$.
DFBETAC
The effect of deleting cluster i on the estimated parameter vector is given by the following one-step approximation for $\hat{\beta} - \hat{\beta}_{[i]}$:
\[ \mathrm{DBETAC}_i = (X'WX)^{-1} X_i' (W_{ei}^{-1} - Q_i)^{-1} E_i \]
DFBETACS
The cluster deletion statistic DFBETAC can be standardized using the variances of $\hat{\beta}$ based on the complete
data. The standardized one-step approximation for the change in $\hat{\beta}_j$ due to deletion of cluster i is
\[ \mathrm{DBETACS}_{ij} = \frac{\mathrm{DBETAC}_{ij}}{\left[ (X'\hat{W}X)^{-1} \right]_{jj}^{1/2}} \]
DFBETAO
Partition the matrices $W_{ei}$ and $V_i$ as
\[ W_{ei} = \begin{bmatrix} W_{eit} & W_{eit[t]} \\ W_{ei[t]t} & W_{ei[t]} \end{bmatrix} \]
\[ V_i = W_{ei}^{-1} = \begin{bmatrix} V_{it} & V_{it[t]} \\ V_{i[t]t} & V_{i[t]} \end{bmatrix} \]
and let $E_{it} = B_{it}(Y_{it} - \hat{\mu}_{it})$ and $E_{i[t]} = B_{i[t]}(Y_{i[t]} - \hat{\mu}_{i[t]})$.
The effect of deleting the tth observation from the ith cluster is given by the following one-step approximation
to $\hat{\beta} - \hat{\beta}_{[it]}$:
\[ \mathrm{DBETAO}_{it} = \frac{(X'WX)^{-1} \tilde{X}_{it}' \tilde{E}_{it}}{W_{eit}^{-1} - \tilde{Q}_{it}} \]
where $\tilde{X}_{it} = X_{it} - V_{it[t]} V_{i[t]}^{-1} X_{i[t]}$, $\tilde{Q}_{it} = \tilde{X}_{it} (X'WX)^{-1} \tilde{X}_{it}'$,
and $\tilde{E}_{it} = E_{it} - V_{it[t]} V_{i[t]}^{-1} E_{i[t]}$. Note that
$W_{eit}$, $\tilde{Q}_{it}$, and $\tilde{E}_{it}$ are scalars.
DFBETAOS
The observation deletion statistic DFBETAO can be standardized using the variances of $\hat{\beta}$ based on the
complete data. The standardized one-step approximation for the change in $\hat{\beta}_j$ due to deletion of observation
t in cluster i is
\[ \mathrm{DBETAOS}_{itj} = \frac{\mathrm{DBETAO}_{itj}}{\left[ (X'\hat{W}X)^{-1} \right]_{jj}^{1/2}} \]
DCLS | CLUSTERCOOKD | CLUSTERCOOKSD
A measure of the standardized influence of the subset m of observations on the overall fit is
$(\hat{\beta} - \hat{\beta}_{[m]})' (X'WX) (\hat{\beta} - \hat{\beta}_{[m]}) / (p\hat{\phi})$. For deletion of cluster i, this is approximated by
\[ \mathrm{DCLS}_i = E_i' (W_{ei}^{-1} - Q_i)^{-1} Q_i (W_{ei}^{-1} - Q_i)^{-1} E_i / (p\hat{\phi}) \]
DOBS | COOKD | COOKSD
The measure of overall fit in the section “DCLS | CLUSTERCOOKD | CLUSTERCOOKSD” on page 2994
for the deletion of the tth observation in the ith cluster is approximated by
\[ \mathrm{DOBS}_{it} = \frac{\tilde{E}_{it}^2\, \tilde{Q}_{it}}{p\hat{\phi}\, (W_{eit}^{-1} - \tilde{Q}_{it})^2} \]
where $\tilde{E}_{it}$, $\tilde{Q}_{it}$, and $W_{eit}$ are defined in the section “DFBETAO” on page 2993. In the case of the
independence working correlation, this is equal to the measure for ordinary generalized linear models defined
in the section “DOBS | COOKD | COOKSD” on page 2992.
MCLS | CLUSTERDFIT
A studentized distance measure of the type defined in the section “DCLS | CLUSTERCOOKD | CLUSTERCOOKSD” on page 2994 of the influence of the ith cluster is given by
\[ \mathrm{MCLS}_i = E_i' (W_{ei}^{-1} - Q_i)^{-1} H_i E_i / (p\hat{\phi}) \]
Bayesian Analysis
In generalized linear models, the response has a probability distribution from a family of distributions of the
exponential form. That is, the probability density of the response Y for continuous response variables, or the
probability function for discrete responses, can be expressed as
\[ f(y) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\} \]
for some functions a, b, and c that determine the specific distribution. The canonical parameters $\theta$ depend
only on the means of the response $\mu_i$, which are related to the regression parameters $\beta$ through the link
function $g(\mu_i) = x_i'\beta$. The additional parameter $\phi$ is the dispersion parameter. The GENMOD procedure
estimates the regression parameters and the scale parameter $\sigma = \phi^{1/2}$ by maximum likelihood. However, the
GENMOD procedure can also provide Bayesian estimates of the regression parameters and either the scale $\sigma$,
the dispersion $\phi$, or the precision $\tau = \phi^{-1}$ by sampling from the posterior distribution. Except where noted,
the following discussion applies to either $\sigma$, $\phi$, or $\tau$, although $\phi$ is used to illustrate the formulas. Note that the
Poisson and binomial distributions do not have a dispersion parameter, and the dispersion is considered to be
fixed at $\phi = 1$. The ASSESS, CONTRAST, ESTIMATE, OUTPUT, and REPEATED statements, if specified,
are ignored. Also ignored are the PLOTS= option in the PROC GENMOD statement and the following options
in the MODEL statement: ALPHA=, CORRB, COVB, TYPE1, TYPE3, SCALE=DEVIANCE (DSCALE),
SCALE=PEARSON (PSCALE), OBSTATS, RESIDUALS, XVARS, PREDICTED, DIAGNOSTICS, and
SCALE= for Poisson and binomial distributions. The multinomial and zero-inflated Poisson distributions are
not available for Bayesian analysis.
See the section “Assessing Markov Chain Convergence” on page 141 in Chapter 7, “Introduction to Bayesian
Analysis Procedures,” for information about assessing the convergence of the chain of posterior samples.
Several algorithms, specified with the SAMPLING= option in the BAYES statement, are available in
GENMOD for drawing samples from the posterior distribution.
ARMS Algorithm for Gibbs Sampling
This section provides details for Bayesian analysis by Gibbs sampling in generalized linear models. See the
section “Gibbs Sampler” on page 137 in Chapter 7, “Introduction to Bayesian Analysis Procedures,” for a
general discussion of Gibbs sampling. See Gilks, Richardson, and Spiegelhalter (1996) for a discussion of
applications of Gibbs sampling to a number of different models, including generalized linear models.
Let $\theta = (\theta_1, \ldots, \theta_k)'$ be the parameter vector. For generalized linear models, the $\theta_i$s are the regression
coefficients $\beta_i$s and the dispersion parameter $\phi$. Let $L(D|\theta)$ be the likelihood function, where D is the
observed data. Let $p(\theta)$ be the prior distribution. The full conditional distribution of $[\theta_i | \theta_j, i \ne j]$ is
proportional to the joint distribution; that is,
\[ \pi(\theta_i | \theta_j, i \ne j, D) \propto L(D|\theta)\, p(\theta) \]
For instance, the one-dimensional conditional distribution of $\theta_1$ given $\theta_j = \theta_j^*,\ 2 \le j \le k$, is computed as
\[ \pi(\theta_1 | \theta_j = \theta_j^*, 2 \le j \le k, D) \propto L(D | \theta = (\theta_1, \theta_2^*, \ldots, \theta_k^*)')\, p(\theta = (\theta_1, \theta_2^*, \ldots, \theta_k^*)') \]
Suppose you have a set of arbitrary starting values $\{\theta_1^{(0)}, \ldots, \theta_k^{(0)}\}$. Using the ARMS (adaptive rejection
Metropolis sampling) algorithm (Gilks and Wild 1992; Gilks, Best, and Tan 1995), you can do the following:
   draw $\theta_1^{(1)}$ from $[\theta_1 | \theta_2^{(0)}, \ldots, \theta_k^{(0)}]$
   draw $\theta_2^{(1)}$ from $[\theta_2 | \theta_1^{(1)}, \theta_3^{(0)}, \ldots, \theta_k^{(0)}]$
   ...
   draw $\theta_k^{(1)}$ from $[\theta_k | \theta_1^{(1)}, \ldots, \theta_{k-1}^{(1)}]$
This completes one iteration of the Gibbs sampler. After one iteration, you have $\{\theta_1^{(1)}, \ldots, \theta_k^{(1)}\}$. After n
iterations, you have $\{\theta_1^{(n)}, \ldots, \theta_k^{(n)}\}$. PROC GENMOD implements the ARMS algorithm provided by Gilks
(2003) to draw a sample from a full conditional distribution. See the section “Adaptive Rejection Sampling
Algorithm” on page 138 in Chapter 7, “Introduction to Bayesian Analysis Procedures,” for more information
about the ARMS algorithm.
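The cycling through full conditional distributions can be illustrated on a toy target. The following Python sketch runs a Gibbs sampler for a standard bivariate normal distribution with correlation rho, where each full conditional is itself normal and can be drawn exactly. This substitutes exact conditional draws for ARMS and is not the algorithm that PROC GENMOD uses; the function names and the target distribution are assumptions made for illustration.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_iter, seed=0):
    """Gibbs draws from a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)   # conditional standard deviation
    theta1, theta2 = 0.0, 0.0         # arbitrary starting values
    draws = []
    for _ in range(n_iter):
        theta1 = rng.gauss(rho * theta2, sd)  # draw theta1 given current theta2
        theta2 = rng.gauss(rho * theta1, sd)  # draw theta2 given updated theta1
        draws.append((theta1, theta2))
    return draws

def sample_correlation(draws):
    """Pearson correlation of the two coordinates of the draws."""
    n = len(draws)
    m1 = sum(d[0] for d in draws) / n
    m2 = sum(d[1] for d in draws) / n
    s12 = sum((d[0] - m1) * (d[1] - m2) for d in draws)
    s11 = sum((d[0] - m1) ** 2 for d in draws)
    s22 = sum((d[1] - m2) ** 2 for d in draws)
    return s12 / math.sqrt(s11 * s22)

chain = gibbs_bivariate_normal(0.8, 20000)
```

Each pass through the loop is one iteration in the sense described above: every coordinate is updated once, conditional on the most recent values of the others.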
Gamerman Algorithm
The Gamerman algorithm, unlike a Gibbs sampling algorithm, samples parameters from their multivariate
posterior conditional distribution. The algorithm uses the structure of generalized linear models to efficiently
sample from the posterior distribution of the model parameters. For a detailed description and explanation
of the algorithm, see Gamerman (1997) and the section “Gamerman Algorithm” on page 140 in Chapter 7,
“Introduction to Bayesian Analysis Procedures.” The Gamerman algorithm is the default method used to
sample from the posterior distribution, except in the case of a normal distribution with a conjugate prior, in
which case a closed form is available for the posterior distribution. See any of the introductory references in
Chapter 7, “Introduction to Bayesian Analysis Procedures,” for a discussion of conjugate prior distributions
for a linear model with the normal distribution.
Independence Metropolis Algorithm
The independence Metropolis algorithm is another sampling algorithm that draws multivariate samples from
the posterior distribution. See the section “Independence Sampler” on page 139 in Chapter 7, “Introduction
to Bayesian Analysis Procedures,” for more details.
Posterior Samples Output Data Set
You can output posterior samples into a SAS data set through ODS. The following SAS statement outputs the
posterior samples into the SAS data set Post:
   ODS OUTPUT POSTERIORSAMPLE=Post;
You can alternatively create the SAS data set Post with the OUTPOST=Post option in the BAYES statement.
The data set also includes the variables LogPost and LogLike, which represent the log of the posterior
likelihood and the log of the likelihood, respectively.
Priors for Model Parameters
The model parameters are the regression coefficients and the dispersion parameter (or the precision or scale),
if the model has one. The priors for the dispersion parameter and the priors for the regression coefficients
are assumed to be independent, while you can have a joint multivariate normal prior for the regression
coefficients.
Dispersion, Precision, or Scale Parameter
Gamma Prior
The gamma distribution $G(a, b)$ has a probability density function
\[ f(u) = \frac{b(bu)^{a-1} e^{-bu}}{\Gamma(a)}, \quad u > 0 \]
where a is the shape parameter and b is the inverse-scale parameter. The mean is $\frac{a}{b}$ and the variance is $\frac{a}{b^2}$.
Improper Prior
The joint prior density is given by
\[ p(u) \propto u^{-1}, \quad u > 0 \]
Inverse Gamma Prior
The inverse gamma distribution $IG(a, b)$ has a probability density function
\[ f(u) = \frac{b^a}{\Gamma(a)} u^{-(a+1)} e^{-b/u}, \quad u > 0 \]
where a is the shape parameter and b is the scale parameter. The mean is $\frac{b}{a-1}$ if $a > 1$, and the variance is
$\frac{b^2}{(a-1)^2(a-2)}$ if $a > 2$.
Regression Coefficients
Let $\beta$ be the regression coefficients.
Jeffreys’ Prior
The joint prior density is given by
\[ p(\beta) \propto |I(\beta)|^{1/2} \]
where $I(\beta)$ is the Fisher information matrix for the model. If the underlying model has a scale parameter (for
example, a normal linear regression model), then the Fisher information matrix is computed with the scale
parameter set to a fixed value of one.
If you specify the CONDITIONAL option, then Jeffreys’ prior, conditional on the current Markov chain
value of the generalized linear model precision parameter $\tau$, is given by
\[ |\tau I(\beta)|^{1/2} \]
where $\tau$ is the model precision parameter.
See Ibrahim and Laud (1991) for a full discussion, with examples, of Jeffreys’ prior for generalized linear
models.
Normal Prior
Assume $\beta$ has a multivariate normal prior with mean vector $\beta_0$ and covariance matrix $\Sigma_0$.
The joint prior density is given by
\[ p(\beta) \propto e^{-\frac{1}{2}(\beta - \beta_0)' \Sigma_0^{-1} (\beta - \beta_0)} \]
If you specify the CONDITIONAL option, then, conditional on the current Markov chain value of the
generalized linear model precision parameter $\tau$, the joint prior density is given by
\[ p(\beta) \propto e^{-\frac{1}{2}\tau(\beta - \beta_0)' \Sigma_0^{-1} (\beta - \beta_0)} \]
Uniform Prior
The joint prior density is given by
\[ p(\beta) \propto 1 \]
Deviance Information Criterion
Let $\theta_i$ be the model parameters at iteration i of the Gibbs sampler and let $LL(\theta_i)$ be the corresponding model
log likelihood. PROC GENMOD computes the following fit statistics defined by Spiegelhalter et al. (2002):
• Effective number of parameters:
\[ p_D = \overline{LL(\theta)} - LL(\bar{\theta}) \]
• Deviance information criterion (DIC):
\[ \mathrm{DIC} = \overline{LL(\theta)} + p_D \]
where
\[ \overline{LL(\theta)} = \frac{1}{n} \sum_{i=1}^{n} LL(\theta_i) \]
\[ \bar{\theta} = \frac{1}{n} \sum_{i=1}^{n} \theta_i \]
PROC GENMOD uses the full log likelihoods defined in the section “Log-Likelihood Functions” on
page 2960, with all terms included, for computing the DIC.
Posterior Distribution
Denote the observed data by D.
The posterior distribution is
\[ \pi(\beta | D) \propto L_P(D | \beta)\, p(\beta) \]
where $L_P(D | \beta)$ is the likelihood function with regression coefficients $\beta$ as parameters.
Starting Values of the Markov Chains
When the BAYES statement is specified, PROC GENMOD generates one Markov chain containing the
approximate posterior samples of the model parameters. Additional chains are produced when the Gelman-Rubin diagnostics are requested. Starting values (or initial values) can be specified in the INITIAL= data set
in the BAYES statement. If the INITIAL= option is not specified, PROC GENMOD picks its own initial values
for the chains.
Denote $[x]$ as the integral value of x. Denote $\hat{s}(X)$ as the estimated standard error of the estimator X.
Regression Coefficients
For the first chain that the summary statistics and regression diagnostics are based on, the default initial
values are estimates of the mode of the posterior distribution. If the INITIALMLE option is specified, the
initial values are the maximum likelihood estimates; that is,
$$\beta_i^{(0)} = \hat{\beta}_i$$
Initial values for the rth chain ($r \ge 2$) are given by
$$\beta_i^{(0)} = \hat{\beta}_i \pm \left(2 + \left[\frac{r}{2}\right]\right) \hat{s}(\hat{\beta}_i)$$
with the plus sign for odd r and minus sign for even r.
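As a quick numeric check of the rule for additional chains, take the hypothetical values $\hat{\beta}_i = 1.0$ and $\hat{s}(\hat{\beta}_i) = 0.2$, and read $[x]$ as the integer part of x (an assumption about what "integral value" means here). Offsetting the estimate by $2 + [r/2]$ standard errors, with a minus sign for even r and a plus sign for odd r, gives:

```latex
\begin{align*}
r = 2:&\quad \beta_i^{(0)} = 1.0 - (2 + 1)(0.2) = 0.4 \\
r = 3:&\quad \beta_i^{(0)} = 1.0 + (2 + 1)(0.2) = 1.6 \\
r = 4:&\quad \beta_i^{(0)} = 1.0 - (2 + 2)(0.2) = 0.2 \\
r = 5:&\quad \beta_i^{(0)} = 1.0 + (2 + 2)(0.2) = 1.8
\end{align*}
```

Successive chains therefore start on alternating sides of the estimate and move farther from it as r grows.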
Dispersion, Scale, or Precision Parameter
Let $\lambda$ be the generalized linear model parameter you choose to sample, either the dispersion, scale, or
precision parameter. Note that the Poisson and binomial distributions do not have this additional parameter.
For the first chain that the summary statistics and regression diagnostics are based on, the default initial
values are estimates of the mode of the posterior distribution. If the INITIALMLE option is specified, the
initial values are the maximum likelihood estimates; that is,
$$\lambda^{(0)} = \hat{\lambda}$$
The initial values of the rth chain ($r \ge 2$) are given by
$$\lambda^{(0)} = \hat{\lambda} \exp\left(\pm \left(\left[\frac{r}{2}\right] + 2\right) \hat{s}(\hat{\lambda})\right)$$
with the plus sign for odd r and minus sign for even r.
OUTPOST= Output Data Set
The OUTPOST= data set contains the generated posterior samples. There are 3+n variables, where n is the
number of model parameters. The variable Iteration represents the iteration number, the variable LogLike
contains the log of the likelihood, and the variable LogPost contains the log of the posterior. The other n
variables represent the draws of the Markov chain for the model parameters.
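The following sketch shows one way to create and inspect an OUTPOST= data set. The data set mydata and the variables y and x1 are hypothetical, and the sketch assumes that the parameter columns in the output data set are named after the parameters (Intercept and x1).

```sas
proc genmod data=mydata;
   model y = x1 / dist=poisson link=log;
   /* NBI= sets the burn-in length; NMC= sets the number of draws kept */
   bayes seed=27513 nbi=2000 nmc=10000 outpost=post;
run;

/* List the variables: Iteration, LogLike, LogPost, and one per parameter */
proc contents data=post varnum;
run;

/* Posterior mean, standard deviation, and 5th and 95th percentiles */
proc means data=post mean std p5 p95;
   var Intercept x1;
run;
```

Because the Poisson distribution has no dispersion parameter, this model has n = 2 parameters, so the data set would hold 3 + 2 = 5 variables.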
Exact Logistic and Exact Poisson Regression
The theory of exact logistic regression, also called exact conditional logistic regression, is described in the
section “Exact Conditional Logistic Regression” on page 4597 in Chapter 58, “The LOGISTIC Procedure.”
The following discussion of exact Poisson regression, also called exact conditional Poisson regression, uses
the notation given in that section.
Note that in exact logistic regression, the coefficients $C(t)$ are the number of possible response vectors y
that generate t: $C(t) = \|\{y : y'X = t'\}\|$. However, when performing an exact Poisson regression, this
value is replaced by
$$C(t) = \sum_{y \in \Omega} \prod_{i=1}^{n} \frac{N_i^{y_i}}{y_i!}$$
where $\Omega = \{y : y'X = t'\}$ and $N_i = \exp(o_i)$ is the exponential of the offset $o_i$ for observation i. If an offset
variable is not specified, then $N_i = 1$.
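A small worked case may help fix the definition of the Poisson coefficient. Assume a hypothetical intercept-only model with n = 2 observations, no offset (so $N_1 = N_2 = 1$), and an observed sufficient statistic t = 2. The response vectors whose components sum to 2 are (0, 2), (1, 1), and (2, 0), and summing $\prod_i N_i^{y_i} / y_i!$ over them gives:

```latex
\begin{align*}
C(2) &= \frac{1}{0!\,2!} + \frac{1}{1!\,1!} + \frac{1}{2!\,0!} \\
     &= \tfrac{1}{2} + 1 + \tfrac{1}{2} = 2
\end{align*}
```

This agrees with the multinomial identity $(N_1 + N_2)^t / t! = 2^2 / 2 = 2$, which applies to this intercept-only case.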
The probability density function (PDF) for T is created by summing over all candidate sequences y that
generate an observable t:
$$\Pr(T = t) = \frac{C(t) \exp(t'\beta)}{\prod_{i=1}^{n} \exp\left(N_i e^{x_i'\beta}\right)}$$
However, the conditional likelihood of $T_I$ given $T_N = t_N$ has the same form as that for exact logistic
regression.
For details about hypothesis testing and estimation, see the sections “Hypothesis Tests” on page 4599 and
“Inference for a Single Parameter” on page 4600 in Chapter 58, “The LOGISTIC Procedure.” See the section
“Computational Resources for Exact Logistic Regression” on page 4607 in Chapter 58, “The LOGISTIC
Procedure,” for some computational notes about exact analyses.
In exact logistic binary regression, each component $y_i$, $i = 1, \ldots, n$, of y can take a value of 0 or 1, so
there are a finite number, $2^n$, of candidate y vectors to be considered. Since a Poisson-distributed response
variable can take an infinite number of values, exact Poisson regression should evaluate an infinite number
of y vectors. However, by identifying the maximum value of $y_i$ to check, $S_i$, for each observation i, the
number of candidate y vectors to check is reduced to $\prod_{i=1}^{n} S_i$. On a practical level, as $S_i$ becomes large the
probability of the Poisson random variable achieving this value drops to zero, so $S_i$ can be thought of as the
point at which the value does not matter. You can provide these maxima by specifying either an OFFSET=
variable, $o_i$, or an EXACTMAX= variable, $e_i$, or you can let the algorithm choose a maximum for you. The
way these two options interact to provide a maximum is described in the following list:
1. If an EXACTMAX= variable is specified, then $S_i = e_i$.
2. If the EXACTMAX option is specified without a variable, or if neither the EXACTMAX= nor
OFFSET= options are specified, then you must also condition out the intercept or you must specify the
STRATA statement. If you are conditioning out the intercept, then every $S_i$ has an effective maximum
of $\sum_{i=1}^{n} f_i y_{0i}$, where $y_0$ is the observed response and $f_i$ is the frequency of the observation; this is
the sufficient statistic for the intercept term. If you are performing a stratified analysis, these sums are
computed within each stratum.
3. If an offset variable is specified and the EXACTMAX option is not specified (you are modeling
proportions), then $N_i = \exp(o_i)$ must be a positive integer, and $S_i = N_i$ is the maximum possible
value for each observation in the experiment; for example, if you are counting the number of rats in a
cage that acquire a disease, then $N_i$ is the number of rats in cage i.
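The third case in the list can be sketched with the rats-in-a-cage setting. The data set cages, its variables, and its values are hypothetical. The offset is the log of the cage size, so that $\exp(o_i)$ recovers the number of rats; whether the procedure tolerates floating-point round-off in that value is not something the sketch verifies.

```sas
/* Hypothetical data: one row per cage */
data cages;
   input cage dose n_rats diseased;
   log_n = log(n_rats);   /* offset o_i, so that exp(o_i) = n_rats */
   datalines;
1 0 5 1
2 0 4 0
3 1 5 3
4 1 6 4
;

proc genmod data=cages exactonly;
   /* OFFSET= without EXACTMAX: the maximum S_i is the cage size */
   model diseased = dose / dist=poisson offset=log_n;
   exact dose / estimate;
run;
```

To supply the maxima directly instead (the first case in the list), you would add EXACTMAX=variable to the MODEL statement options.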
OUTDIST= Output Data Set
The OUTDIST= data set contains every exact conditional distribution necessary to process the corresponding
EXACT statement. For example, the following statements create one distribution for the x1 parameter and
another for the x2 parameters, and produce the data set dist shown in Table 42.11:
data test;
input y x1 x2 count;
datalines;
0 0 0 1
1 0 0 1
0 1 1 2
1 1 1 1
1 0 2 3
1 1 2 1
1 2 0 3
1 2 1 2
1 2 2 1
;
proc genmod data=test exactonly;
class x2 / param=ref;
model y=x1 x2 / d=b;
exact x1 x2/ outdist=dist;
run;
proc print data=dist;
run;
Table 42.11 OUTDIST= Data Set
Obs   x1   x20   x21   Count     Score      Prob
  1    .     0     0       3   5.81151   0.03333
  2    .     0     1      15   1.66031   0.16667
  3    .     0     2       9   3.12728   0.10000
  4    .     1     0      15   1.46523   0.16667
  5    .     1     1      18   0.21675   0.20000
  6    .     1     2       6   4.58644   0.06667
  7    .     2     0      19   1.61869   0.21111
  8    .     2     1       2   3.27293   0.02222
  9    .     3     0       3   6.27189   0.03333
 10    2     .     .       6   3.03030   0.12000
 11    3     .     .      12   0.75758   0.24000
 12    4     .     .      11   0.00000   0.22000
 13    5     .     .      18   0.75758   0.36000
 14    6     .     .       3   3.03030   0.06000
The first nine observations in the dist data set contain an exact distribution for the parameters of the x2
effect (hence the values for the x1 parameter are missing), and the remaining five observations are for the
x1 parameter. If a joint distribution was created, there would be observations with values for both the x1
and x2 parameters. For CLASS variables, the corresponding parameters in the dist data set are identified by
concatenating the variable name with the appropriate classification level.
The data set contains the possible sufficient statistics of the parameters for the effects specified in the EXACT
statement, and the Count variable contains the number of different responses that yield these statistics. In
particular, there are six possible response vectors y for which the dot product y'x1 was equal to 2, and for
which y'x20, y'x21, and y'1 were equal to their actual observed values (displayed in the “Sufficient Statistics”
table).
NOTE: If you are performing an exact Poisson analysis, then the Count variable is replaced by a variable
named Weight.
When hypothesis tests are performed on the parameters, the Prob variable contains the probability of obtaining
that statistic (which is just the count divided by the total count), and the Score variable contains the score for
that statistic.
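The relationship between Prob and Count can be checked directly on an OUTDIST= data set. The following sketch assumes the dist data set created by the earlier statements, with the variables x1, Count, and Prob; the column name ProbCheck is introduced only for illustration.

```sas
proc sql;
   /* Rows of the x1 distribution are those with a nonmissing x1 value */
   select x1, Count, Prob,
          Count / sum(Count) as ProbCheck format=7.5
   from dist
   where x1 is not missing;
quit;
```

PROC SQL remerges the total count with each row, so ProbCheck is the count divided by the total count within the x1 distribution and can be compared with Prob.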
The OUTDIST= data set can contain a different exact conditional distribution for each specified EXACT
statement. For example, consider the following EXACT statements:
exact 'O1'   x1    /           outdist=o1;
exact 'OJ12' x1 x2 / jointonly outdist=oj12;
exact 'OA12' x1 x2 / joint     outdist=oa12;
exact 'OE12' x1 x2 / estimate  outdist=oe12;
The O1 statement outputs a single exact conditional distribution. The OJ12 statement outputs only the joint
distribution for x1 and x2. The OA12 statement outputs three conditional distributions: one for x1, one for x2,
and one jointly for x1 and x2. The OE12 statement outputs two conditional distributions: one for x1 and the
other for x2. Data set oe12 contains both the x1 and x2 variables; the distribution for x1 has missing values
in the x2 column while the distribution for x2 has missing values in the x1 column.
Missing Values
For generalized linear models, PROC GENMOD ignores any observation with a missing value for any
variable involved in the model. You can score an observation in an output data set by setting only the
response value to missing. For models fit with generalized estimating equations (GEEs), observations with
missing values within a cluster are not used, and all available pairs are used in estimating the working
correlation matrix. Clusters with fewer observations than the full cluster size are treated as having missing
observations occurring at the end of the cluster. You can specify the order of missing observations with
the WITHINSUBJECT= option. See the section “Missing Data” on page 2981 for more information about
missing values in GEEs.
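Scoring by way of a missing response can be sketched as follows. The data sets train and toscore, the variables y, x1, and x2, and the output names are hypothetical.

```sas
/* Hypothetical covariate patterns to score; the response is left missing */
data toscore;
   input x1 x2;
   y = .;
   datalines;
1.5 0
2.0 1
;

data combined;
   set train toscore;
run;

proc genmod data=combined;
   model y = x1 x2 / dist=poisson link=log;
   /* PRED= stores the predicted mean for each row of the input data set */
   output out=scored pred=phat;
run;
```

The rows from toscore do not contribute to the fit because their response is missing, but they carry complete covariates, so they are the rows you would expect to receive a predicted value in the scored data set.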
Displayed Output for Classical Analysis
The following output is produced by the GENMOD procedure. Note that some of the tables are optional and
appear only in conjunction with the REPEATED statement and its options or with options in the MODEL
statement. For details, see the section “ODS Table Names” on page 3012.
Model Information
The “Model Information” table displays the two-level data set name, the response distribution, the link
function, the response variable name, the offset variable name, the frequency variable name, the scale weight
variable name, the number of observations used, the number of events if events/trials format is used for
response, the number of trials if events/trials format is used for response, the sum of frequency weights,
the number of missing values in data set, and the number of invalid observations (for example, negative or
0 response values with gamma distribution or number of observations with events greater than trials with
binomial distribution).
Class Level Information
If you use classification variables in the model, PROC GENMOD displays the levels of classification variables
specified in the CLASS statement and in the MODEL statement. The levels are displayed in the same sorted
order used to generate columns in the design matrix.
Response Profile
If you specify an ordinal model for the multinomial distribution, a table titled “Response Profile” is displayed
containing the ordered values of the response variable and the number of occurrences of the values used in
the model.
Iteration History for Parameter Estimates
If you specify the ITPRINT model option, PROC GENMOD displays a table containing the following for
each iteration in the Newton-Raphson procedure for model fitting: the iteration number, the ridge value, the
log likelihood, and values of all parameters in the model.
Criteria for Assessing Goodness of Fit
In the “Criteria for Assessing Goodness of Fit” table, PROC GENMOD displays the degrees of freedom
for deviance and Pearson’s chi-square, equal to the number of observations minus the number of regression
parameters estimated, the deviance, the deviance divided by degrees of freedom, the scaled deviance, the
scaled deviance divided by degrees of freedom, Pearson’s chi-square, Pearson’s chi-square divided by degrees
of freedom, the scaled Pearson’s chi-square, the scaled Pearson’s chi-square divided by degrees of freedom,
the log likelihood (excludes factorial terms), the full log likelihood, the Akaike information criterion, the
corrected Akaike information criterion, and the Bayesian information criterion. The information in this table
is valid only for maximum likelihood model fitting, and the table is not printed if the REPEATED statement
is specified.
Last Evaluation of the Gradient
If you specify the model option ITPRINT, the GENMOD procedure displays the last evaluation of the
gradient vector.
Last Evaluation of the Hessian
If you specify the model option ITPRINT, the GENMOD procedure displays the last evaluation of the Hessian
matrix.
Analysis of (Initial) Parameter Estimates
The “Analysis of (Initial) Parameter Estimates” table contains the results from fitting a generalized linear
model to the data. If you specify the REPEATED statement, these GLM parameter estimates are used as
initial values for the GEE solution, and are displayed only if the PRINTMLE option in the REPEATED
statement is specified. For each parameter in the model, PROC GENMOD displays the parameter name, as
follows:
• the variable name for continuous regression variables
• the variable name and level for classification variables and interactions involving classification variables
• SCALE for the scale variable related to the dispersion parameter
In addition, PROC GENMOD displays the degrees of freedom for the parameter, the estimate value, the
standard error, the Wald chi-square value, the p-value based on the chi-square distribution, and the confidence
limits (Wald or profile likelihood) for parameters.
Lagrange Multiplier Statistics
If you specify that either the model intercept or the scale parameter is fixed, for those distributions that have
a distribution scale parameter, the GENMOD procedure displays a table of Lagrange multiplier, or score,
statistics for testing the validity of the constrained parameter. The table contains the test statistic and the p-value.
Estimated Covariance Matrix
If you specify the model option COVB, the GENMOD procedure displays the estimated covariance matrix,
defined as the inverse of the information matrix at the final iteration. This is based on the expected information
matrix if the EXPECTED option is specified in the MODEL statement. Otherwise, it is based on the Hessian
matrix used at the final iteration. This is, by default, the observed Hessian unless altered by the SCORING
option in the MODEL statement.
Estimated Correlation Matrix
If you specify the CORRB model option, PROC GENMOD displays the estimated correlation matrix. This is
based on the expected information matrix if the EXPECTED option is specified in the MODEL statement.
Otherwise, it is based on the Hessian matrix used at the final iteration. This is, by default, the observed
Hessian unless altered by the SCORING option in the MODEL statement.
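The MODEL statement options named in the preceding tables can be combined in one run. In this sketch the data set mydata and the variables y, x1, and x2 are hypothetical.

```sas
proc genmod data=mydata;
   /* ITPRINT: iteration history, last gradient, and last Hessian
      COVB, CORRB: estimated covariance and correlation matrices */
   model y = x1 x2 / dist=poisson link=log itprint covb corrb;
run;
```

Adding the EXPECTED option to the same MODEL statement would base both matrices on the expected information matrix rather than on the Hessian used at the final iteration.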
Iteration History for LR Confidence Intervals
If you specify the ITPRINT and LRCI model options, PROC GENMOD displays an iteration history table for
profile likelihood-based confidence intervals. For each parameter in the model, PROC GENMOD displays
the parameter identification number, the iteration number, the log-likelihood value, and parameter values.
Likelihood Ratio-Based Confidence Intervals for Parameters
If you specify the LRCI and the ITPRINT options in the MODEL statement, a table is displayed that
summarizes profile likelihood-based confidence intervals for all parameters. For each parameter in the model,
the table displays the confidence coefficient, the parameter identification number, lower and upper endpoints
of confidence intervals for the parameter, and values of all other parameters at the solution.
LR Statistics for Type 1 Analysis
If you specify the TYPE1 model option, a table is displayed that contains the name of the effect, the
deviance for the model including the effect and all previous effects, the degrees of freedom for the effect,
the likelihood ratio statistic for testing the significance of the effect, and the p-value computed from the
chi-square distribution with the effect’s degrees of freedom.
If you specify either the SCALE=DEVIANCE or SCALE=PEARSON option in the MODEL statement,
columns are displayed that contain the name of the effect, the deviance for the model including the effect and
all previous effects, the numerator degrees of freedom, the denominator degrees of freedom, the chi-square
statistic for testing the significance of the effect, the p-value computed from the chi-square distribution with
numerator degrees of freedom, the F statistic for testing the significance of the effect, and the p-value based
on the F distribution.
Iteration History for Type 3 Contrasts
If you specify the model options ITPRINT and TYPE3, an iteration history table is displayed for fitting the
model with Type 3 contrast constraints for each effect that contains the effect name, the iteration number, the
ridge value, the log likelihood, and values of all parameters.
LR Statistics for Type 3 Analysis
If you specify the TYPE3 model option, a table is displayed that contains, for each effect in the model,
the name of the effect, the likelihood ratio statistic for testing the significance of the effect, the degrees of
freedom for the effect, and the p-value computed from the chi-square distribution.
If you specify either the SCALE=DEVIANCE or SCALE=PEARSON option in the MODEL statement,
columns are displayed that contain the name of the effect, the likelihood ratio statistic for testing the
significance of the effect, the F statistic for testing the significance of the effect, the numerator degrees
of freedom, the denominator degrees of freedom, the p-value based on the F distribution, and the p-value
computed from the chi-square distribution with the numerator’s degrees of freedom.
Wald Statistics for Type 3 Analysis
If you specify the TYPE3 and WALD model options, a table is displayed that contains the name of the
effect, the degrees of freedom of the effect, the Wald statistic for testing the significance of the effect, and the
p-value computed from the chi-square distribution.
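As a sketch of how the Type 1 and Type 3 tables are requested, the following two steps use a hypothetical data set mydata with a response y, a classification variable trt, and a covariate x1.

```sas
proc genmod data=mydata;
   class trt;
   /* TYPE1 and TYPE3 request the likelihood ratio tables;
      SCALE=PEARSON adds the F statistic columns */
   model y = trt x1 / dist=poisson link=log type1 type3 scale=pearson;
run;

proc genmod data=mydata;
   class trt;
   /* TYPE3 together with WALD requests the Wald statistics table */
   model y = trt x1 / dist=poisson link=log type3 wald;
run;
```

Because a Type 1 analysis is sequential, the order of the effects in the MODEL statement affects the Type 1 table.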
Parameter Information
If you specify the ITPRINT, COVB, CORRB, WALDCI, or LRCI option in the MODEL statement, or if you
specify a CONTRAST statement, a table is displayed that identifies parameters with numbers, rather than
names, for use in tables and matrices where a compact identifier for parameters is helpful. For each parameter,
the table contains an index number that identifies the parameter, and the parameter name, including level
information for effects containing classification variables.
Observation Statistics
If you specify the OBSTATS option in the MODEL statement, PROC GENMOD displays a table containing
miscellaneous statistics. Residuals and case deletion diagnostic statistics are not available for the multinomial
distribution. Case deletion diagnostics are not available for zero-inflated models.
For each observation in the input data set, the following are displayed:
• the value of the response variable
• the predicted value of the mean
• the value of the linear predictor. The value of an OFFSET variable is added to the linear predictor.
• the estimated standard error of the linear predictor
• the value of the negative of the weight in the Hessian matrix at the final iteration. This is the expected
weight if the EXPECTED option is specified in the MODEL statement. Otherwise, it is the weight
used in the final iteration. That is, it is the observed weight unless the SCORING= option has been
specified.
• approximate lower and upper endpoints for a confidence interval for the predicted value of the mean
• raw residual
• Pearson residual
• deviance residual
• standardized Pearson residual
• standardized deviance residual
• likelihood residual
• leverage
• Cook’s distance statistic
• DFBETA statistic, for each parameter
• standardized DFBETA statistic, for each parameter
• zero-inflation probability for zero-inflated models
• response mean for zero-inflated models
ESTIMATE Statement Results
If you specify a REPEATED statement, the ESTIMATE statement results apply to the specified GEE model.
Otherwise, they apply to the specified generalized linear model.
For each ESTIMATE statement, the table contains the contrast label, the estimated value of the contrast, the
standard error of the estimate, the significance level $\alpha$, $(1 - \alpha) \times 100\%$ confidence intervals for contrast,
the Wald chi-square statistic for the contrast, and the p-value computed from the chi-square distribution.
The mean of the contrast, defined as the inverse link function applied to the contrast, and $(1 - \alpha) \times 100\%$
confidence intervals for the mean are also displayed.
If you specify the EXP option, an additional row is displayed with statistics for the exponentiated value of
the contrast.
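A minimal ESTIMATE statement sketch follows. The data set mydata and the variables y, trt, and x1 are hypothetical, and the coefficients assume that trt has exactly three levels.

```sas
proc genmod data=mydata;
   class trt;
   model y = trt x1 / dist=poisson link=log;
   /* Difference between the first two trt levels on the log scale;
      EXP adds the exponentiated row, E prints the coefficients */
   estimate 'trt 1 vs 2' trt 1 -1 0 / exp e;
run;
```

With a log link, the exponentiated contrast in this sketch can be read as a ratio of means between the two levels.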
CONTRAST Coefficients
If you specify the CONTRAST or ESTIMATE statement and you specify the E option, a table titled
“Coefficients For Contrast label” is displayed, where label is the label specified in the CONTRAST statement.
The table contains the contrast label and the rows of the contrast matrix.
Iteration History for Contrasts
If you specify the ITPRINT option, an iteration history table is displayed for fitting the model with contrast
constraints for each effect. The table contains the contrast label, the iteration number, the ridge value, the log
likelihood, and values of all parameters.
CONTRAST Statement Results
If you specify a REPEATED statement, the CONTRAST statement results apply to the specified GEE model.
Otherwise, they apply to the specified generalized linear model.
A table is displayed that contains the contrast label, the degrees of freedom for the contrast, and the likelihood
ratio, score, or Wald statistic for testing the significance of the contrast. Score statistics are used in GEE
models, likelihood ratio statistics are used in generalized linear models, and Wald statistics are used in both.
Also displayed are the p-value computed from the chi-square distribution, and the type of statistic computed
for this contrast: Wald, LR, or score.
If you specify either the SCALE=DEVIANCE or SCALE=PEARSON option for generalized linear models,
columns are displayed that contain the contrast label, the likelihood ratio statistic for testing the significance
of the contrast, the F statistic for testing the significance of the contrast, the numerator degrees of freedom,
the denominator degrees of freedom, the p-value based on the F distribution, and the p-value computed from
the chi-square distribution with numerator degrees of freedom.
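The choice between likelihood ratio and Wald statistics for a contrast can be sketched as follows. The data set mydata and the variables are hypothetical, and the coefficients assume that trt has exactly three levels.

```sas
proc genmod data=mydata;
   class trt;
   model y = trt x1 / dist=poisson link=log;
   /* Two-row contrast for equality of the three trt levels;
      a generalized linear model uses the likelihood ratio statistic */
   contrast 'trt overall' trt 1 -1 0,
                          trt 0 1 -1;
   /* Same hypothesis with a Wald statistic and the coefficient table */
   contrast 'trt overall (Wald)' trt 1 -1 0,
                                 trt 0 1 -1 / wald e;
run;
```

Each CONTRAST statement contributes one row to the results table, and the type column distinguishes the LR row from the Wald row.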
LSMEANS Coefficients
If you specify the LSMEANS statement and you specify the E option, the “Coefficients for effect Least
Squares Means” table is displayed, where effect is the effect specified in the LSMEANS statement. The table
contains the effect names and the rows of least squares means coefficients.
Least Squares Means
If you specify the LSMEANS statement, the “Least Squares Means” table is displayed. For each effect, the table
contains the effect name and, for each level of the effect, the following:
• the least squares mean estimate
• standard error
• chi-square value
• p-value computed from the chi-square distribution
If you specify the DIFF option, a table titled “Differences of Least Squares Means” is displayed containing
corresponding statistics for the differences between the least squares means for the levels of each effect.
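The LSMEANS tables described above can be requested as in this sketch, where the data set mydata and the variables y, trt, and x1 are hypothetical.

```sas
proc genmod data=mydata;
   class trt;
   model y = trt x1 / dist=poisson link=log;
   /* DIFF requests differences between levels; E prints the coefficients */
   lsmeans trt / diff e;
run;
```

The least squares means are computed on the scale of the linear predictor, so with a log link they are logs of means rather than means.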
GEE Model Information
If you specify the REPEATED statement, the “GEE Model Information” table displays the correlation
structure of the working correlation matrix or the log odds ratio structure, the within-subject effect, the
subject effect, the number of clusters, the correlation matrix dimension, and the minimum and maximum
cluster size.
Log Odds Ratio Parameter Information
If you specify the REPEATED statement and specify a log odds ratio model for binary data with the LOGOR=
option, then the “Log Odds Ratio Parameter Information” table is displayed showing the correspondence
between data pairs and log odds ratio model parameters.
Iteration History for GEE Parameter Estimates
If you specify the REPEATED statement and the MODEL statement option ITPRINT, the “Iteration History
For GEE Parameter Estimates” table is displayed. The table contains the parameter identification number, the
iteration number, and values of all parameters.
Last Evaluation of the Generalized Gradient and Hessian
If you specify the REPEATED statement and select ITPRINT as a model option, PROC GENMOD displays
the “Last Evaluation Of The Generalized Gradient And Hessian” table.
GEE Parameter Estimate Covariance Matrices
If you specify the REPEATED statement and the COVB option, PROC GENMOD displays the “Covariance
Matrix (Model-Based)” and “Covariance Matrix (Empirical)” tables.
GEE Parameter Estimate Correlation Matrices
If you specify the REPEATED statement and the CORRB option, PROC GENMOD displays the “Correlation
Matrix (Model-Based)” and “Correlation Matrix (Empirical)” tables.
GEE Working Correlation Matrix
If you specify the REPEATED statement and the CORRW option, PROC GENMOD displays the “Working
Correlation Matrix” table.
GEE Fit Criteria
If you specify the REPEATED statement, PROC GENMOD displays the quasi-likelihood information criteria
for model fit QIC and QICu in the “GEE Fit Criteria” table.
Analysis of GEE Parameter Estimates
If you specify the REPEATED statement, PROC GENMOD uses empirical standard error estimates to
compute and display the “Analysis Of GEE Parameter Estimates Empirical Standard Error Estimates” table
that contains the parameter names as follows:
• the variable name for continuous regression variables
• the variable name and level for classification variables and interactions involving classification variables
• “Scale” for the scale variable related to the dispersion parameter
In addition, the parameter estimate, the empirical standard error, a 95% confidence interval, and the Z score
and p-value are displayed for each parameter.
If you specify the MODELSE option in the REPEATED statement, the “Analysis Of GEE Parameter Estimates
Model-Based Standard Error Estimates” table based on model-based standard errors is also produced.
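The GEE tables in this section can be produced with a REPEATED statement such as the one in the following sketch. The data set long, the subject identifier id, the within-subject variable visit, and the model variables are hypothetical; an exchangeable working correlation is chosen only as an example.

```sas
proc genmod data=long descending;
   class id visit;
   model y = trt time / dist=binomial link=logit itprint;
   /* CORRW: working correlation matrix
      COVB, CORRB: model-based and empirical matrices
      MODELSE: adds the model-based standard error table */
   repeated subject=id / within=visit type=exch corrw covb corrb modelse;
run;
```

The WITHIN= variable tells the procedure how observations are ordered inside each cluster, which matters when some visits are missing.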
GEE Observation Statistics
If you specify the OBSTATS option in the REPEATED statement, PROC GENMOD displays a table
containing miscellaneous statistics. For each observation in the input data set, the following are displayed:
• the value of the response variable and all other variables in the model, denoted by the variable names
• the predicted value of the mean
• the value of the linear predictor
• the standard error of the linear predictor
• confidence limits for the predicted values
• raw residual
• Pearson residual
• cluster number
• leverage
• cluster leverage
• cluster Cook’s distance statistic
• studentized cluster Cook’s distance statistic
• individual observation Cook’s distance statistic
• cluster DFBETA statistic for each parameter
• cluster standardized DFBETA statistic for each parameter
• individual observation DFBETA statistic for each parameter
• individual observation standardized DFBETA statistic for each parameter
Displayed Output for Bayesian Analysis
If a Bayesian analysis is requested with a BAYES statement, the displayed output includes the following.
Model Information
The “Model Information” table displays the two-level data set name, the number of burn-in iterations, the
number of iterations after the burn-in, the number of thinning iterations, the response distribution, the link
function, the response variable name, the offset variable name, the frequency variable name, the scale weight
variable name, the number of observations used, the number of events if events/trials format is used for
response, the number of trials if events/trials format is used for response, the sum of frequency weights,
the number of missing values in data set, and the number of invalid observations (for example, negative or
0 response values with gamma distribution or number of observations with events greater than trials with
binomial distribution).
Class Level Information
The “Class Level Information” table displays the levels of classification variables if you specify a CLASS
statement.
Maximum Likelihood Estimates
The “Analysis of Maximum Likelihood Parameter Estimates” table displays the maximum likelihood estimate
of each parameter, the estimated standard error of the parameter estimator, and confidence limits for each
parameter.
Coefficient Prior
The “Coefficient Prior” table displays the prior distribution of the regression coefficients.
Independent Prior Distributions for Model Parameters
The “Independent Prior Distributions for Model Parameters” table displays the prior distributions of additional
model parameters (scale, exponential scale, Weibull scale, Weibull shape, gamma shape).
Initial Values and Seeds
The “Initial Values and Seeds” table displays the initial values and random number generator seeds for the
Gibbs chains.
Fit Statistics
The “Fit Statistics” table displays the deviance information criterion (DIC) and the effective number of
parameters.
Descriptive Statistics of the Posterior Samples
The “Descriptive Statistics of the Posterior Sample” table contains the size of the sample, the mean, the
standard deviation, and the quartiles for each model parameter.
Interval Estimates for Posterior Sample
The “Interval Estimates for Posterior Sample” table contains the HPD intervals and the credible intervals for
each model parameter.
Correlation Matrix of the Posterior Samples
The “Correlation Matrix of the Posterior Samples” table is produced if you include the CORR suboption in
the SUMMARY= option in the BAYES statement. This table displays the sample correlation of the posterior
samples.
Covariance Matrix of the Posterior Samples
The “Covariance Matrix of the Posterior Samples” table is produced if you include the COV suboption in the
SUMMARY= option in the BAYES statement. This table displays the sample covariance of the posterior
samples.
Autocorrelations of the Posterior Samples
The “Autocorrelations of the Posterior Samples” table displays the lag1, lag5, lag10, and lag50 autocorrelations for each parameter.
Gelman and Rubin Diagnostics
The “Gelman and Rubin Diagnostics” table is produced if you include the GELMAN suboption in the
DIAGNOSTIC= option in the BAYES statement. This table displays the estimate of the potential scale
reduction factor and its 97.5% upper confidence limit for each parameter.
Geweke Diagnostics
The “Geweke Diagnostics” table displays the Geweke statistic and its p-value for each parameter.
Raftery and Lewis Diagnostics
The “Raftery Diagnostics” table is produced if you include the RAFTERY suboption in the DIAGNOSTIC=
option in the BAYES statement. This table displays the Raftery and Lewis diagnostics for each variable.
Heidelberger and Welch Diagnostics
The “Heidelberger and Welch Diagnostics” table is displayed if you include the HEIDELBERGER suboption
in the DIAGNOSTIC= option in the BAYES statement. This table shows the results of a stationarity test and a
halfwidth test for each parameter.
Effective Sample Size
The “Effective Sample Size” table displays, for each parameter, the effective sample size, the correlation
time, and the efficiency.
Monte Carlo Standard Errors
The “Monte Carlo Standard Errors” table displays, for each parameter, the Monte Carlo standard error, the
posterior sample standard deviation, and the ratio of the two.
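The optional convergence diagnostics can be requested together, as in the following sketch. The data set mydata and the variables y and x1 are hypothetical. The option is spelled DIAGNOSTICS= here, which is my understanding of its full name in the BAYES statement; the suboption names GELMAN, RAFTERY, and HEIDELBERGER come from the descriptions above.

```sas
proc genmod data=mydata;
   model y = x1 / dist=poisson link=log;
   /* Request the Gelman-Rubin, Raftery-Lewis, and
      Heidelberger-Welch diagnostics tables */
   bayes seed=1 diagnostics=(gelman raftery heidelberger);
run;
```

Requesting the Gelman-Rubin diagnostics causes additional chains to be run, but the posterior summaries are still based on the first chain.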
Displayed Output for Exact Analysis
If an exact analysis is requested with an EXACT statement, the displayed output includes the following tables.
If the METHOD=NETWORKMC option is specified, the test and estimate tables are renamed “Monte Carlo”
tables and a Monte Carlo standard error column ($\sqrt{p(1-p)/n}$) is displayed.
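To get a feel for the size of the Monte Carlo standard error, assume that p is an estimated probability of 0.05 and that n = 10000 Monte Carlo samples were drawn (both values are hypothetical, and the reading of p and n is an assumption). Then:

```latex
\begin{align*}
\sqrt{\frac{p(1-p)}{n}} &= \sqrt{\frac{0.05 \times 0.95}{10000}}
  = \sqrt{4.75 \times 10^{-6}} \approx 0.0022
\end{align*}
```

Under these assumptions, quadrupling n would halve the standard error.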
Sufficient Statistics
Displays if you request an OUTDIST= data set in an EXACT statement. The table lists the parameters and
their observed sufficient statistics.
(Monte Carlo) Conditional Exact Tests
This table tests the hypotheses that the parameters of interest are insignificant. See the section “Exact Logistic
and Exact Poisson Regression” on page 2999 for details.
(Monte Carlo) Exact Parameter Estimates
Displays if you specify the ESTIMATE option in the EXACT statement. This table gives individual parameter
estimates for each variable (conditional on the values of all the other parameters in the model), confidence
limits, and a two-sided p-value (twice the one-sided p-value) for testing that the parameter is zero. See the
section “Exact Logistic and Exact Poisson Regression” on page 2999 for details.
(Monte Carlo) Exact Odds Ratios
Displays if you specify the ESTIMATE=ODDS or ESTIMATE=BOTH option in the EXACT statement. See
the section “Exact Logistic and Exact Poisson Regression” on page 2999 for details.
Strata Summary
Displays if a STRATA statement is also specified. Shows the pattern of the number of events and the number
of nonevents, or of the number of observations, in a stratum. See the section “STRATA Statement” on
page 2953 for more information.
Strata Information
Displays if a STRATA statement is specified with the INFO option.
ODS Table Names
PROC GENMOD assigns a name to each table that it creates. You can use these names to reference the table
when using the Output Delivery System (ODS) to select tables and create output data sets. These names are
listed separately in Table 42.12 for a maximum likelihood analysis, in Table 42.13 for a Bayesian analysis,
and in Table 42.14 for an Exact analysis. For more information about ODS, see Chapter 20, “Using the
Output Delivery System.”
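As a minimal sketch of how these names can be used, the following statements display only two tables and save one of them as a data set. The data set demo, its variables x and y, and the output data set name work.pe are hypothetical and serve only as an illustration; the table names ParameterEstimates and Modelfit are taken from Table 42.12.

```sas
/* Hypothetical data, for illustration only */
data demo;
   input x y @@;
   datalines;
1 2  2 3  3 6  4 7  5 8  6 9  7 10  8 12
;

/* Display only two of the tables listed in Table 42.12 */
ods select ParameterEstimates Modelfit;

/* Save the parameter estimates table as a SAS data set */
ods output ParameterEstimates=work.pe;

proc genmod data=demo;
   model y = x / dist=poisson link=log;
run;

/* Inspect the saved table */
proc print data=work.pe;
run;
```

The ODS SELECT and ODS OUTPUT statements apply to the next procedure step, so they are placed immediately before the PROC GENMOD statement.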
Table 42.12 ODS Tables Produced in PROC GENMOD for a Classical Analysis

ODS Table Name          Description                                                  Statement       Option
AssessmentSummary       Model assessment summary                                     ASSESS          Default
ClassLevels             Classification variable levels                               CLASS           Default
Contrasts               Tests of contrasts                                           CONTRAST        Default
ContrastCoef            Contrast coefficients                                        CONTRAST        E
ConvergenceStatus       Convergence status                                           MODEL           Default
CorrB                   Parameter estimate correlation matrix                        MODEL           CORRB
CovB                    Parameter estimate covariance matrix                         MODEL           COVB
Estimates               Estimates of contrasts                                       ESTIMATE        Default
EstimateCoef            Contrast coefficients                                        ESTIMATE        E
GEEEmpPEst              GEE parameter estimates with empirical standard errors       REPEATED        Default
GEEExchCorr             GEE exchangeable working correlation value                   REPEATED        TYPE=EXCH
GEEFitCriteria          GEE QIC fit criteria                                         REPEATED        Default
GEELogORInfo            GEE log odds ratio model information                         REPEATED        LOGOR=
GEEModInfo              GEE model information                                        REPEATED        Default
GEEModPEst              GEE parameter estimates with model-based standard errors     REPEATED        MODELSE
GEENCorr                GEE model-based correlation matrix                           REPEATED        MCORRB
GEENCov                 GEE model-based covariance matrix                            REPEATED        MCOVB
GEERCorr                GEE empirical correlation matrix                             REPEATED        ECORRB
GEERCov                 GEE empirical covariance matrix                              REPEATED        ECOVB
GEEWCorr                GEE working correlation matrix                               REPEATED        CORRW
IterContrasts           Iteration history for contrasts                              MODEL CONTRAST  ITPRINT
IterLRCI                Iteration history for likelihood ratio confidence intervals  MODEL           LRCI ITPRINT
IterParms               Iteration history for parameter estimates                    MODEL           ITPRINT
IterParmsGEE            Iteration history for GEE parameter estimates                MODEL REPEATED  ITPRINT
IterType3               Iteration history for Type 3 statistics                      MODEL           TYPE3 ITPRINT
LRCI                    Likelihood ratio confidence intervals                        MODEL           LRCI ITPRINT
Coef                    Coefficients for least squares means                         LSMEANS         E
Diffs                   Least squares means differences                              LSMEANS         DIFF
LSMeans                 Least squares means                                          LSMEANS         Default
LagrangeStatistics      Lagrange statistics                                          MODEL           NOINT | NOSCALE
LastGEEGrad             Last evaluation of the generalized gradient and Hessian      MODEL REPEATED  ITPRINT
LastGradHess            Last evaluation of the gradient and Hessian                  MODEL           ITPRINT
LinDep                  Linearly dependent rows of contrasts                         CONTRAST        Default
ModelInfo               Model information                                            MODEL           Default
Modelfit                Goodness-of-fit statistics                                   MODEL           Default without REPEATED
NObs                    Number of observations summary                                               Default
NonEst                  Nonestimable rows of contrasts                               CONTRAST        Default
ObStats                 Observation-wise statistics                                  MODEL           OBSTATS | CL | PREDICTED | RESIDUALS | XVARS
ParameterEstimates      Parameter estimates                                          MODEL           Default without REPEATED | PRINTMLE with REPEATED
ParmInfo                Parameter indices                                            MODEL           Default
ResponseProfile         Frequency counts for multinomial and binary models           MODEL           DIST=MULTINOMIAL | DIST=BINOMIAL
Type1                   Type 1 tests                                                 MODEL           TYPE1
Type3                   Type 3 tests                                                 MODEL           TYPE3
ZeroParameterEstimates  Parameter estimates for zero-inflated model                  ZEROMODEL       Default
Table 42.13 ODS Tables Produced in PROC GENMOD for a Bayesian Analysis

ODS Table Name      Description                                                                    Statement  Option
AutoCorr            Autocorrelations of the posterior samples                                      BAYES      Default
ClassLevels         Classification variable levels                                                 CLASS      Default
CoeffPrior          Prior distribution of the regression coefficients                              BAYES      Default
ConvergenceStatus   Convergence status of maximum likelihood estimation                            MODEL      Default
Corr                Correlation matrix of the posterior samples                                    BAYES      SUMMARY=CORR
ESS                 Effective sample size                                                          BAYES      Default
FitStatistics       Fit statistics                                                                 BAYES      Default
Gelman              Gelman and Rubin convergence diagnostics                                       BAYES      DIAG=GELMAN
Geweke              Geweke convergence diagnostics                                                 BAYES      Default
Heidelberger        Heidelberger and Welch convergence diagnostics                                 BAYES      DIAG=HEIDELBERGER
InitialValues       Initial values of the Markov chains                                            BAYES      Default
IterParms           Iteration history for parameter estimates                                      MODEL      ITPRINT
LastGradHess        Last evaluation of the gradient and Hessian for maximum likelihood estimation  MODEL      ITPRINT
MCError             Monte Carlo standard errors                                                    BAYES      DIAG=MCSE
ModelInfo           Model information                                                              PROC       Default
NObs                Number of observations                                                                    Default
ParameterEstimates  Maximum likelihood estimates of model parameters                               MODEL      Default
ParmInfo            Parameter indices                                                              MODEL      Default
ParmPrior           Prior distribution for scale and shape                                         BAYES      Default
PostIntervals       HPD and equal-tail intervals of the posterior samples                          BAYES      Default
PosteriorSample     Posterior samples (for ODS output data set only)                               BAYES
PostSummaries       Summary statistics of the posterior samples                                    BAYES      Default
Raftery             Raftery and Lewis convergence diagnostics                                      BAYES      DIAG=RAFTERY
Table 42.14 ODS Tables Produced in PROC GENMOD for an Exact Analysis

ODS Table Name  Description                                          Statement  Option
ExactOddsRatio  Exact odds ratios                                    EXACT      ESTIMATE=ODDS, ESTIMATE=BOTH
ExactParmEst    Parameter estimates                                  EXACT      ESTIMATE, ESTIMATE=PARM, ESTIMATE=BOTH
ExactTests      Conditional exact tests                              EXACT      Default
NStrataIgnored  Number of uninformative strata                       STRATA     Default
StrataSummary   Number of strata with specific response frequencies  STRATA     Default
StrataInfo      Event and nonevent frequencies for each stratum      STRATA     INFO
SuffStats       Sufficient statistics                                EXACT      OUTDIST=
ODS Graphics
Statistical procedures use ODS Graphics to create graphs as part of their output. ODS Graphics is described
in detail in Chapter 21, “Statistical Graphics Using ODS.”
Before you create graphs, ODS Graphics must be enabled (for example, by specifying the ODS GRAPHICS ON statement). For more information about enabling and disabling ODS Graphics, see the section
“Enabling and Disabling ODS Graphics” on page 606 in Chapter 21, “Statistical Graphics Using ODS.”
The overall appearance of graphs is controlled by ODS styles. Styles and other aspects of using ODS
Graphics are discussed in the section “A Primer on ODS Statistical Graphics” on page 605 in Chapter 21,
“Statistical Graphics Using ODS.”
Some graphs are produced by default; other graphs are produced by using statements and options. You can
reference every graph produced through ODS Graphics with a name. The names of the graphs that PROC
GENMOD generates are listed in Table 42.15, along with the required statements and options.
ODS Graph Names
PROC GENMOD assigns a name to each graph it creates using ODS. You can use these names to reference
the graphs when using ODS. The names are listed in Table 42.15.
To request these graphs, ODS Graphics must be enabled and you must specify the statement and options
indicated in Table 42.15.
Table 42.15 Graphs Produced by PROC GENMOD

ODS Graph Name           Description                                                 Statement            Option
ADPanel                  Autocorrelation function and density panel                  BAYES                PLOTS=(AUTOCORR DENSITY)
AutocorrPanel            Autocorrelation function panel                              BAYES                PLOTS=AUTOCORR
AutocorrPlot             Autocorrelation function plot                               BAYES                PLOTS(UNPACK)=AUTOCORR
ClusterCooksDPlot        Cluster Cook’s D by cluster number                          PROC                 PLOTS=
ClusterDFFITPlot         Cluster DFFIT by cluster number                             PROC                 PLOTS=
ClusterLeveragePlot      Cluster leverage by cluster number                          PROC                 PLOTS=
CooksDPlot               Cook’s distance                                             PROC                 PLOTS=
CumResidPanel            Panel of aggregates of residuals                            ASSESS               CRPANEL
CumulativeResiduals      Model assessment based on aggregates of residuals           ASSESS               Default
DevianceResidByXBeta     Deviance residuals by linear predictor                      PROC                 PLOTS=
DevianceResidualPlot     Deviance values                                             PROC                 PLOTS=
DFBETAByCluster          Cluster DFBeta by cluster number                            PROC                 PLOTS=
DFBETAPlot               DFBeta                                                      PROC                 PLOTS=
DiagnosticPlot           Panel of residuals, influence, and diagnostic statistics    PROC MODEL REPEATED  PLOTS=
LeveragePlot             Leverage                                                    PROC                 PLOTS=
LikeResidByXBeta         Likelihood residuals by linear predictor                    PROC                 PLOTS=
LikeResidualPlot         Likelihood residuals                                        PROC                 PLOTS=
PearsonResidByXBeta      Pearson residuals by linear predictor                       PROC                 PLOTS=
PearsonResidualPlot      Pearson residuals                                           PROC                 PLOTS=
PredictedByObservation   Predicted values                                            PROC                 PLOTS=
RawResidByXBeta          Raw residuals by linear predictor                           PROC                 PLOTS=
RawResidualPlot          Raw residuals                                               PROC                 PLOTS=
StdDevianceResidByXBeta  Standardized deviance residuals by linear predictor         PROC                 PLOTS=
StdDevianceResidualPlot  Standardized deviance residuals                             PROC                 PLOTS=
StdDFBETAByCluster       Standardized cluster DFBeta by cluster number               PROC                 PLOTS=
StdDFBETAPlot            Standardized DFBeta                                         PROC                 PLOTS=
StdPearsonResidByXBeta   Standardized Pearson residuals by linear predictor          PROC                 PLOTS=
StdPearsonResidualPlot   Standardized Pearson residuals                              PROC                 PLOTS=
TAPanel                  Trace and autocorrelation function panel                    BAYES                PLOTS=(TRACE AUTOCORR)
TADPanel                 Trace, autocorrelation, and density function panel          BAYES                Default
TDPanel                  Trace and density panel                                     BAYES                PLOTS=(TRACE DENSITY)
TracePanel               Trace panel                                                 BAYES                PLOTS=TRACE
TracePlot                Trace plot                                                  BAYES                PLOTS(UNPACK)=TRACE
ZeroInflationProbPlot    Zero-inflation probabilities                                PROC                 PLOTS=
Examples: GENMOD Procedure
The following examples illustrate some of the capabilities of the GENMOD procedure. These are not
intended to represent definitive analyses of the data sets presented here. You should refer to the texts cited in
the references for guidance on complete analysis of data by using generalized linear models.
Example 42.1: Logistic Regression
In an experiment comparing the effects of five different drugs, each drug is tested on a number of different
subjects. The outcome of each experiment is the presence or absence of a positive response in a subject. The
following artificial data represent the number of responses r in the n subjects for the five different drugs,
labeled A through E. The response is measured for different levels of a continuous covariate x for each drug.
The drug type and the continuous covariate x are explanatory variables in this experiment. The number of
responses r is modeled as a binomial random variable for each combination of the explanatory variable values,
with the binomial number of trials parameter equal to the number of subjects n and the binomial probability
equal to the probability of a response.
The following DATA step creates the data set:
data drug;
   input drug$ x r n @@;
   datalines;
A  .1   1  10   A  .23  2  12   A  .67  1   9
B  .2   3  13   B  .3   4  15   B  .45  5  16   B  .78  5  13
C  .04  0  10   C  .15  0  11   C  .56  1  12   C  .7   2  12
D  .34  5  10   D  .6   5   9   D  .7   8  10
E  .2  12  20   E  .34 15  20   E  .56 13  15   E  .8  17  20
;
A logistic regression for these data is a generalized linear model with response equal to the binomial
proportion r/n. The probability distribution is binomial, and the link function is logit. For these data, drug and
x are explanatory variables. The probit and the complementary log-log link functions are also appropriate for
binomial data.
PROC GENMOD performs a logistic regression on the data in the following SAS statements:
proc genmod data=drug;
class drug;
model r/n = x drug / dist = bin
link = logit
lrci;
run;
Since these data are binomial, you use the events/trials syntax to specify the response in the MODEL
statement. Profile likelihood confidence intervals for the regression parameters are computed using the LRCI
option.
General model and data information is produced in Output 42.1.1.
Output 42.1.1 Model Information

The GENMOD Procedure

Model Information
Data Set                    WORK.DRUG
Distribution                Binomial
Link Function               Logit
Response Variable (Events)  r
Response Variable (Trials)  n

The five levels of the CLASS variable DRUG are displayed in Output 42.1.2.

Output 42.1.2 CLASS Variable Levels

Class Level Information
Class  Levels  Values
drug        5  A B C D E
In the “Criteria For Assessing Goodness Of Fit” table displayed in Output 42.1.3, the value of the deviance
divided by its degrees of freedom is less than 1. A p-value is not computed for the deviance; however, a
deviance that is approximately equal to its degrees of freedom is a possible indication of a good model fit.
Asymptotic distribution theory applies to binomial data as the number of binomial trials parameter n becomes
large for each combination of explanatory variables. McCullagh and Nelder (1989) caution against the use of
the deviance alone to assess model fit. The model fit for each observation should be assessed by examination
of residuals. The OBSTATS option in the MODEL statement produces a table of residuals and other useful
statistics for each observation.
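As a minimal sketch, the following statements refit the same model with the OBSTATS option added so that the observation-wise table is displayed. They assume that the drug data set created earlier in this example is available; no other change is made to the model.

```sas
/* Same logistic model as above, with observation-wise statistics requested */
proc genmod data=drug;
   class drug;
   model r/n = x drug / dist = bin
                        link = logit
                        obstats;   /* adds the ObStats table of residuals and related statistics */
run;
```

The resulting table corresponds to the ObStats entry in Table 42.12, so it can also be captured with an ODS OUTPUT statement if further processing is needed.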
Output 42.1.3 Goodness-of-Fit Criteria

Criteria For Assessing Goodness Of Fit
Criterion                  DF       Value  Value/DF
Deviance                   12      5.2751    0.4396
Scaled Deviance            12      5.2751    0.4396
Pearson Chi-Square         12      4.5133    0.3761
Scaled Pearson X2          12      4.5133    0.3761
Log Likelihood                  -114.7732
Full Log Likelihood              -23.7343
AIC (smaller is better)           59.4686
AICC (smaller is better)          67.1050
BIC (smaller is better)           64.8109
In the “Analysis Of Maximum Likelihood Parameter Estimates” table displayed in Output 42.1.4, the Wald
chi-square values indicate that the parameters for x and for drugs A, B, and C are significant at the 0.05 level;
the intercept (p = 0.5057) and the parameter for drug D (p = 0.0773) are not. The
scale parameter is set to 1 for the binomial distribution. When you perform an overdispersion analysis, the
value of the overdispersion parameter is indicated here. See the section “Overdispersion” on page 2966 for a
discussion of overdispersion.
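The following statements are a hedged sketch of one way to request such an analysis for the same model: the SCALE=PEARSON option in the MODEL statement asks PROC GENMOD to estimate the scale parameter from the Pearson chi-square statistic divided by its degrees of freedom instead of holding it at 1. The sketch assumes the drug data set from this example; for these data Value/DF is less than 1, so the statements illustrate the syntax rather than a recommended analysis.

```sas
/* Illustration only: estimate the dispersion from Pearson chi-square / DF */
proc genmod data=drug;
   class drug;
   model r/n = x drug / dist = bin
                        link = logit
                        scale = pearson;   /* scale is no longer fixed at 1 */
run;
```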
Output 42.1.4 Parameter Estimates

Analysis Of Maximum Likelihood Parameter Estimates

                                              Likelihood Ratio
                                  Standard     95% Confidence          Wald
Parameter       DF    Estimate       Error         Limits        Chi-Square    Pr > ChiSq
Intercept        1      0.2792      0.4196    -0.5336    1.1190        0.44        0.5057
x                1      1.9794      0.7660     0.5038    3.5206        6.68        0.0098
drug       A     1     -2.8955      0.6092    -4.2280   -1.7909       22.59        <.0001
drug       B     1     -2.0162      0.4052    -2.8375   -1.2435       24.76        <.0001
drug       C     1     -3.7952      0.6655    -5.3111   -2.6261       32.53        <.0001
drug       D     1     -0.8548      0.4838    -1.8072    0.1028        3.12        0.0773
drug       E     0      0.0000      0.0000     0.0000    0.0000         .           .
Scale            0      1.0000      0.0000     1.0000    1.0000

NOTE: The scale parameter was held fixed.
The preceding table contains the profile likelihood confidence intervals for the explanatory variable parameters
requested with the LRCI option. Wald confidence intervals are displayed by default. Profile likelihood
confidence intervals are considered to be more accurate than Wald intervals (see Aitkin et al. (1989)),
especially with small sample sizes. You can specify the confidence coefficient with the ALPHA= option in
the MODEL statement. The default value of 0.05, corresponding to 95% confidence limits, is used here.
See the section “Confidence Intervals for Parameters” on page 2970 for a discussion of profile likelihood
confidence intervals.
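For instance, the following sketch requests 90% profile likelihood confidence limits instead of the default 95% limits. It assumes the drug data set from this example; the value 0.1 is chosen arbitrarily for illustration.

```sas
/* ALPHA=0.1 gives 100(1 - 0.1)% = 90% confidence limits */
proc genmod data=drug;
   class drug;
   model r/n = x drug / dist = bin
                        link = logit
                        lrci
                        alpha = 0.1;
run;
```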
Example 42.2: Normal Regression, Log Link
Consider the following data, where x is an explanatory variable and y is the response variable. It appears that
y varies nonlinearly with x and that the variance is approximately constant. A normal distribution with a log
link function is chosen to model these data; that is, $\log(\mu_i) = x_i'\beta$ so that $\mu_i = \exp(x_i'\beta)$.
data nor;
input x y;
datalines;
0 5
0 7
0 9
1 7
1 10
1 8
2 11
2 9
3 16
3 13
3 14
4 25
4 24
5 34
5 32
5 30
;
The following SAS statements produce the analysis with the normal distribution and log link:
proc genmod data=nor;
   model y = x / dist = normal
                 link = log;
   output out       = Residuals
          pred      = Pred
          resraw    = Resraw
          reschi    = Reschi
          resdev    = Resdev
          stdreschi = Stdreschi
          stdresdev = Stdresdev
          reslik    = Reslik;
run;
The OUTPUT statement is specified to produce a data set that contains predicted values and residuals for
each observation. This data set can be useful for further analysis, such as residual plotting.
The results from these statements are displayed in Output 42.2.1.
Output 42.2.1 Log-Linked Normal Regression

The GENMOD Procedure

Model Information
Data Set            WORK.NOR
Distribution        Normal
Link Function       Log
Dependent Variable  y

Criteria For Assessing Goodness Of Fit
Criterion                  DF       Value  Value/DF
Deviance                   14     52.3000    3.7357
Scaled Deviance            14     16.0000    1.1429
Pearson Chi-Square         14     52.3000    3.7357
Scaled Pearson X2          14     16.0000    1.1429
Log Likelihood                   -32.1783
Full Log Likelihood              -32.1783
AIC (smaller is better)           70.3566
AICC (smaller is better)          72.3566
BIC (smaller is better)           72.6743

Analysis Of Maximum Likelihood Parameter Estimates

                             Standard       Wald 95%             Wald
Parameter  DF    Estimate       Error   Confidence Limits  Chi-Square    Pr > ChiSq
Intercept   1      1.7214      0.0894    1.5461    1.8966      370.76        <.0001
x           1      0.3496      0.0206    0.3091    0.3901      286.64        <.0001
Scale       1      1.8080      0.3196    1.2786    2.5566

NOTE: The scale parameter was estimated by maximum likelihood.
The PROC GENMOD scale parameter, in the case of the normal distribution, is the standard deviation. By
default, the scale parameter is estimated by maximum likelihood. You can specify a fixed standard deviation
by using the NOSCALE and SCALE= options in the MODEL statement.
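As a minimal sketch of that combination, the following statements refit the model with the standard deviation held fixed. They assume the nor data set from this example, and the value 2 is an arbitrary choice for illustration, not an estimate taken from the output.

```sas
/* NOSCALE holds the scale parameter fixed at the value given by SCALE= */
proc genmod data=nor;
   model y = x / dist = normal
                 link = log
                 noscale
                 scale = 2;
run;
```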
The following statements print the Residuals data set that was created by the OUTPUT statement:
proc print data=Residuals;
run;
Output 42.2.2 Data Set of Predicted Values and Residuals

Obs  x   y     Pred    Reschi    Resraw    Resdev  Stdreschi  Stdresdev    Reslik
  1  0   5   5.5921  -0.59212  -0.59212  -0.59212   -0.34036   -0.34036  -0.34036
  2  0   7   5.5921   1.40788   1.40788   1.40788    0.80928    0.80928   0.80928
  3  0   9   5.5921   3.40788   3.40788   3.40788    1.95892    1.95892   1.95892
  4  1   7   7.9324  -0.93243  -0.93243  -0.93243   -0.54093   -0.54093  -0.54093
  5  1  10   7.9324   2.06757   2.06757   2.06757    1.19947    1.19947   1.19947
  6  1   8   7.9324   0.06757   0.06757   0.06757    0.03920    0.03920   0.03920
  7  2  11  11.2522  -0.25217  -0.25217  -0.25217   -0.14686   -0.14686  -0.14686
  8  2   9  11.2522  -2.25217  -2.25217  -2.25217   -1.31166   -1.31166  -1.31166
  9  3  16  15.9612   0.03878   0.03878   0.03878    0.02249    0.02249   0.02249
 10  3  13  15.9612  -2.96122  -2.96122  -2.96122   -1.71738   -1.71738  -1.71738
 11  3  14  15.9612  -1.96122  -1.96122  -1.96122   -1.13743   -1.13743  -1.13743
 12  4  25  22.6410   2.35897   2.35897   2.35897    1.37252    1.37252   1.37252
 13  4  24  22.6410   1.35897   1.35897   1.35897    0.79069    0.79069   0.79069
 14  5  34  32.1163   1.88366   1.88366   1.88366    1.22914    1.22914   1.22914
 15  5  32  32.1163  -0.11634  -0.11634  -0.11634   -0.07592   -0.07592  -0.07592
 16  5  30  32.1163  -2.11634  -2.11634  -2.11634   -1.38098   -1.38098  -1.38098
The data set of predicted values and residuals (Output 42.2.2) is created by the OUTPUT statement. You can
use the PLOTS= option in the PROC GENMOD statement to create plots of predicted values and residuals.
Note that raw, Pearson, and deviance residuals are equal in this example. This is a characteristic of the normal
distribution and is not true in general for other distributions.
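A hedged sketch of such a request follows. It assumes the nor data set from this example and that ODS Graphics can be enabled in your session; the two plot requests shown, PREDICTED and RESRAW(XBETA), are chosen for illustration and correspond to the PredictedByObservation and RawResidByXBeta graphs in Table 42.15.

```sas
ods graphics on;

/* Request a plot of predicted values and a plot of raw residuals
   against the linear predictor */
proc genmod data=nor plots=(predicted resraw(xbeta));
   model y = x / dist = normal
                 link = log;
run;

ods graphics off;
```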
Example 42.3: Gamma Distribution Applied to Life Data
Life data are sometimes modeled with the gamma distribution. Although PROC GENMOD does not analyze
censored data or provide other useful lifetime distributions such as the Weibull or lognormal, it can be used
for modeling complete (uncensored) data with the gamma distribution, and it can provide a statistical test for
the exponential distribution against other gamma distribution alternatives. See Lawless (2003) or Nelson
(1982) for applications of the gamma distribution to life data.
The following data represent failure times of machine parts, some of which are manufactured by manufacturer
A and some by manufacturer B.
data A;
   input lifetime @@;
   mfg = 'A';
   datalines;
 620  470  260   89  388  242
 103  100   39  460  284 1285
 218  393  106  158  152  477
 403  103   69  158  818  947
 399 1274   32   12  134  660
 548  381  203  871  193  531
 317   85 1410  250   41 1101
  32  421   32  343  376 1512
1792   47   95   76  515   72
1585  253    6  860   89 1055
 537  101  385  176   11  565
 164   16 1267  352  160  195
1279  356  751  500  803  560
 151   24  689 1119 1733 2194
 763  555   14   45  776    1
;

data B;
   input lifetime @@;
   mfg = 'B';
   datalines;
1747  945   12 1453   14  150
  20   41   35   69  195   89
1090 1868  294   96  618   44
 142  892 1307  310  230   30
 403  860   23  406 1054 1935
 561  348  130   13  230  250
 317  304   79 1793  536   12
   9  256  201  733  510  660
 122   27  273 1231  182  289
 667  761 1096   43   44   87
 405  998 1409   61  848   38
 113   25  940   28  304   10
 646  575  219  303  388  308
 195 1061  174  377   39  539
 246  323  198  234 1618  794
  55  729  813 1216  764   19
   6 1566  459  946  141
  35  181  147  116  407
 380  609  546  278   41
;

data lifdat;
   set A B;
run;
The following SAS statements use PROC GENMOD to compute Type 3 statistics to test for differences
between the two manufacturers in machine part life. Type 3 statistics are identical to Type 1 statistics in this
case, since there is only one effect in the model. The log link function is selected to ensure that the mean is
positive.
proc genmod data = lifdat;
class mfg;
model lifetime = mfg / dist=gamma
link=log
type3;
run;
The output from these statements is displayed in Output 42.3.1.
Output 42.3.1 Gamma Model of Life Data

The GENMOD Procedure

Model Information
Data Set            WORK.LIFDAT
Distribution        Gamma
Link Function       Log
Dependent Variable  lifetime

Class Level Information
Class  Levels  Values
mfg         2  A B

Criteria For Assessing Goodness Of Fit
Criterion                   DF        Value  Value/DF
Deviance                   199     287.0591    1.4425
Scaled Deviance            199     237.5335    1.1936
Pearson Chi-Square         199     211.6870    1.0638
Scaled Pearson X2          199     175.1652    0.8802
Log Likelihood                   -1432.4177
Full Log Likelihood              -1432.4177
AIC (smaller is better)           2870.8353
AICC (smaller is better)          2870.9572
BIC (smaller is better)           2880.7453

Analysis Of Maximum Likelihood Parameter Estimates

                                 Standard       Wald 95%             Wald
Parameter      DF    Estimate       Error   Confidence Limits  Chi-Square    Pr > ChiSq
Intercept       1      6.1302      0.1043    5.9257    6.3347     3451.61        <.0001
mfg        A    1      0.0199      0.1559   -0.2857    0.3255        0.02        0.8985
mfg        B    0      0.0000      0.0000    0.0000    0.0000         .           .
Scale           1      0.8275      0.0714    0.6987    0.9800

NOTE: The scale parameter was estimated by maximum likelihood.

LR Statistics For Type 3 Analysis
Source  DF  ChiSquare  Pr > ChiSq
mfg      1       0.02      0.8985
The p-value of 0.8985 for the chi-square statistic in the Type 3 table indicates that there is no significant
difference in the part life between the two manufacturers.
Using the following statements, you can refit the model without using the manufacturer as an effect. The
LRCI option in the MODEL statement is specified to compute profile likelihood confidence intervals for the
mean life and scale parameters.
proc genmod data = lifdat;
model lifetime = / dist=gamma
link=log
lrci;
run;
Output 42.3.2 displays the results of fitting the model with the mfg effect omitted.
Output 42.3.2 Refitting of the Gamma Model: Omitting the mfg Effect

The GENMOD Procedure

Analysis Of Maximum Likelihood Parameter Estimates

                                        Likelihood Ratio
                             Standard    95% Confidence          Wald
Parameter  DF    Estimate       Error        Limits        Chi-Square    Pr > ChiSq
Intercept   1      6.1391      0.0775    5.9904    6.2956     6268.10        <.0001
Scale       1      0.8274      0.0714    0.6959    0.9762

NOTE: The scale parameter was estimated by maximum likelihood.
The intercept is the estimated log mean of the fitted gamma distribution, so that the mean life of the parts is
$\mu = \exp(\mathrm{INTERCEPT}) = \exp(6.1391) = 463.64$
The SCALE parameter used in PROC GENMOD is the inverse of the gamma dispersion parameter, and it
is sometimes called the gamma index parameter. See the section “Response Probability Distributions” on
page 2956 for the definition of the gamma probability density function. A value of 1 for the index parameter
corresponds to the exponential distribution. The estimated value of the scale parameter is 0.8274. The 95%
profile likelihood confidence interval for the scale parameter is (0.6959, 0.9762), which does not contain 1.
The hypothesis of an exponential distribution for the data is, therefore, rejected at the 0.05 level. A confidence
interval for the mean life is obtained by exponentiating the profile likelihood limits for the intercept:
$(\exp(5.9904), \exp(6.2956)) = (399.57, 542.18)$
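The arithmetic above can be checked with a short DATA step. This is only a sketch: the numeric inputs are copied by hand from Output 42.3.2, and the variable names are chosen for illustration.

```sas
data _null_;
   mean       = exp(6.1391);   /* estimated mean life from the intercept      */
   lower      = exp(5.9904);   /* lower profile likelihood limit for the mean */
   upper      = exp(6.2956);   /* upper profile likelihood limit for the mean */
   dispersion = 1 / 0.8274;    /* gamma dispersion is the inverse of SCALE    */
   put mean= lower= upper= dispersion=;
run;
```

The values written to the log should agree with 463.64, 399.57, and 542.18 up to rounding of the inputs.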
Example 42.4: Ordinal Model for Multinomial Data
This example illustrates how you can use the GENMOD procedure to fit a model to data measured on an
ordinal scale. The following statements create a SAS data set called Icecream. The data set contains the
results of a hypothetical taste test of three brands of ice cream. The three brands are rated for taste on a
five-point scale from very good (vg) to very bad (vb). An analysis is performed to assess the differences in
the ratings of the three brands. The variable taste contains the ratings, and the variable brand contains the
brands tested. The variable count contains the number of testers rating each brand in each category.
The following statements create the Icecream data set:
data Icecream;
   input count brand$ taste$;
   datalines;
 70  ice1 vg
 71  ice1 g
151  ice1 m
 30  ice1 b
 46  ice1 vb
 20  ice2 vg
 36  ice2 g
130  ice2 m
 74  ice2 b
 70  ice2 vb
 50  ice3 vg
 55  ice3 g
140  ice3 m
 52  ice3 b
 50  ice3 vb
;
The following statements fit a cumulative logit model to the ordinal data with the variable taste as the
response and the variable brand as a covariate. The variable count is used as a FREQ variable.
proc genmod data=Icecream rorder=data;
freq count;
class brand;
model taste = brand / dist=multinomial
link=cumlogit
aggregate=brand
type1;
estimate 'LogOR12' brand 1 -1 / exp;
estimate 'LogOR13' brand 1 0 -1 / exp;
estimate 'LogOR23' brand 0 1 -1 / exp;
run;
The AGGREGATE=BRAND option in the MODEL statement specifies the variable brand as defining
multinomial populations for computing deviances and Pearson chi-squares. The RORDER=DATA option
specifies that the taste variable levels be ordered by their order of appearance in the input data set—that is,
from very good (vg) to very bad (vb). By default, the response is sorted in increasing ASCII order. Always
check the “Response Profiles” table to verify that response levels are appropriately ordered. The TYPE1
option requests a Type 1 test for the significance of the covariate brand.
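One way to make that check quick, sketched here under the assumption that the Icecream data set above has been created, is to display only the response profile table before running the full analysis. The table name ResponseProfile is taken from Table 42.12.

```sas
/* Show only the response profile to confirm the ordering vg, g, m, b, vb */
ods select ResponseProfile;

proc genmod data=Icecream rorder=data;
   freq count;
   class brand;
   model taste = brand / dist = multinomial
                         link = cumlogit;
run;
```

If the levels appear in a different order, the cumulative logits and therefore the signs of the brand effects are interpreted differently.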
If $\gamma_j(x) = \Pr(\mathrm{taste} \le j)$ is the cumulative probability of the jth or lower taste category, then the odds ratio
comparing $x_1$ to $x_2$ is as follows:
$$\frac{\gamma_j(x_1)/(1-\gamma_j(x_1))}{\gamma_j(x_2)/(1-\gamma_j(x_2))} = \exp[(x_1 - x_2)'\beta]$$
See McCullagh and Nelder (1989, Chapter 5) for details on the cumulative logit model. The ESTIMATE
statements compute log odds ratios comparing each pair of brands. The EXP option in the ESTIMATE statements
exponentiates the log odds ratios to form odds ratio estimates. Standard errors and confidence intervals are
also computed.
Output 42.4.1 displays general information about the model and data, the levels of the CLASS variable brand,
and the total number of occurrences of the ordered levels of the response variable taste.
Output 42.4.1 Ordinal Model Information

The GENMOD Procedure

Model Information
Data Set                   WORK.ICECREAM
Distribution               Multinomial
Link Function              Cumulative Logit
Dependent Variable         taste
Frequency Weight Variable  count

Class Level Information
Class  Levels  Values
brand       3  ice1 ice2 ice3

Response Profile
Ordered             Total
  Value  taste  Frequency
      1  vg           140
      2  g            162
      3  m            421
      4  b            156
      5  vb           166
Output 42.4.2 displays estimates of the intercept terms and covariates and associated statistics. The intercept
terms correspond to the four cumulative logits defined on the taste categories in the order shown in
Output 42.4.1. That is, Intercept1 is the intercept for the first cumulative logit, $\log(p_1/(1-p_1))$, Intercept2 is the
intercept for the second cumulative logit, $\log((p_1+p_2)/(1-(p_1+p_2)))$, and so forth.
Output 42.4.2 Parameter Estimates

Analysis Of Maximum Likelihood Parameter Estimates

                                    Standard   Wald 95% Confidence         Wald
Parameter         DF    Estimate       Error         Limits          Chi-Square    Pr > ChiSq
Intercept1         1     -1.8578      0.1219    -2.0967    -1.6189       232.35        <.0001
Intercept2         1     -0.8646      0.1056    -1.0716    -0.6576        67.02        <.0001
Intercept3         1      0.9231      0.1060     0.7154     1.1308        75.87        <.0001
Intercept4         1      1.8078      0.1191     1.5743     2.0413       230.32        <.0001
brand       ice1   1      0.3847      0.1370     0.1162     0.6532         7.89        0.0050
brand       ice2   1     -0.6457      0.1397    -0.9196    -0.3719        21.36        <.0001
brand       ice3   0      0.0000      0.0000     0.0000     0.0000          .           .
Scale              0      1.0000      0.0000     1.0000     1.0000

NOTE: The scale parameter was held fixed.
The Type 1 test displayed in Output 42.4.3 indicates that Brand is highly significant; that is, there are
significant differences among the brands. The log odds ratios and odds ratios in the “ESTIMATE Statement
Results” table indicate the relative differences among the brands. For example, the odds ratio of 2.8 in the
“Exp(LogOR12)” row indicates that the odds of brand 1 being in lower taste categories is 2.8 times the
odds of brand 2 being in lower taste categories. Since, in this ordering, the lower categories represent the
more favorable taste results, this indicates that brand 1 scored significantly better than brand 2. This is also
apparent from the data in this example.
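The relationship between the brand parameter estimates and this odds ratio can be verified with a short DATA step. This is a sketch only: the two estimates are copied by hand from Output 42.4.2, so the result agrees with the reported values only up to rounding, and the variable names are chosen for illustration.

```sas
data _null_;
   b_ice1  = 0.3847;            /* brand ice1 estimate from Output 42.4.2      */
   b_ice2  = -0.6457;           /* brand ice2 estimate from Output 42.4.2      */
   logor12 = b_ice1 - b_ice2;   /* contrast 1 -1 applied to the brand effects  */
   or12    = exp(logor12);      /* what the EXP option computes from LogOR12   */
   put logor12= or12=;
run;
```

The log odds ratio is about 1.03 and the odds ratio about 2.80, in line with the LogOR12 and Exp(LogOR12) rows.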
Output 42.4.3 Type 1 Tests and Odds Ratios

LR Statistics For Type 1 Analysis
Source      Deviance  DF  ChiSquare  Pr > ChiSq
Intercepts   65.9576
brand         9.8654   2      56.09      <.0001

Contrast Estimate Results

                  Mean           Mean             L'Beta  Standard              L'Beta
Label         Estimate    Confidence Limits     Estimate     Error  Alpha  Confidence Limits  ChiSquare  Pr > ChiSq
LogOR12         0.7370    0.6805     0.7867       1.0305    0.1401   0.05   0.7559    1.3050      54.11      <.0001
Exp(LogOR12)                                      2.8024    0.3926   0.05   2.1295    3.6878
LogOR13         0.5950    0.5290     0.6577       0.3847    0.1370   0.05   0.1162    0.6532       7.89      0.0050
Exp(LogOR13)                                      1.4692    0.2013   0.05   1.1233    1.9217
LogOR23         0.3439    0.2850     0.4081      -0.6457    0.1397   0.05  -0.9196   -0.3719      21.36      <.0001
Exp(LogOR23)                                      0.5243    0.0733   0.05   0.3987    0.6894
Example 42.5: GEE for Binary Data with Logit Link Function
Output 42.5.1 displays a partial listing of a SAS data set of clinical trial data comparing two treatments for a
respiratory disorder. See “Gee Model for Binary Data” in the SAS/STAT Sample Program Library for the
complete data set. These data are from Stokes, Davis, and Koch (2000).
Patients in each of two centers are randomly assigned to groups receiving the active treatment or a placebo.
During treatment, respiratory status, represented by the variable outcome (coded here as 0=poor, 1=good), is
determined for each of four visits. The variables center, treatment, sex, and baseline (baseline respiratory
status) are classification variables with two levels. The variable age (age at time of entry into the study) is a
continuous variable.
Explanatory variables in the model are Intercept ($x_{ij1}$), treatment ($x_{ij2}$), center ($x_{ij3}$), sex ($x_{ij4}$), age
($x_{ij5}$), and baseline ($x_{ij6}$), so that $x_{ij}' = [x_{ij1}, x_{ij2}, \ldots, x_{ij6}]$ is the vector of explanatory variables. Indicator
variables for the classification explanatory variables can be automatically generated by listing them in the
CLASS statement in PROC GENMOD. To be consistent with the analysis in Stokes, Davis, and Koch (2000),
the four classification explanatory variables are coded as follows via options in the CLASS statement:
$x_{ij2}$ = 0 for placebo, 1 for active
$x_{ij3}$ = 0 for center 1, 1 for center 2
$x_{ij4}$ = 0 for male, 1 for female
$x_{ij6}$ = 0 for baseline 0, 1 for baseline 1
Suppose $y_{ij}$ represents the respiratory status of patient i at the jth visit, $j = 1, \ldots, 4$, and $\mu_{ij} = E(y_{ij})$ represents
the mean of the respiratory status. Since the response data are binary, you can use the variance function
for the binomial distribution $v(\mu_{ij}) = \mu_{ij}(1 - \mu_{ij})$ and the logit link function $g(\mu_{ij}) = \log(\mu_{ij}/(1 - \mu_{ij}))$.
The model for the mean is $g(\mu_{ij}) = x_{ij}'\beta$, where $\beta$ is a vector of regression parameters to be estimated.
Output 42.5.1 Respiratory Disorder Data

Obs  center  id  treatment  sex  age  baseline  visit1  visit2  visit3  visit4  visit  outcome
  1       1   1  P          M     46         0       0       0       0       0      1        0
  2       1   1  P          M     46         0       0       0       0       0      2        0
  3       1   1  P          M     46         0       0       0       0       0      3        0
  4       1   1  P          M     46         0       0       0       0       0      4        0
  5       1   2  P          M     28         0       0       0       0       0      1        0
  6       1   2  P          M     28         0       0       0       0       0      2        0
  7       1   2  P          M     28         0       0       0       0       0      3        0
  8       1   2  P          M     28         0       0       0       0       0      4        0
  9       1   3  A          M     23         1       1       1       1       1      1        1
 10       1   3  A          M     23         1       1       1       1       1      2        1
 11       1   3  A          M     23         1       1       1       1       1      3        1
 12       1   3  A          M     23         1       1       1       1       1      4        1
 13       1   4  P          M     44         1       1       1       1       0      1        1
 14       1   4  P          M     44         1       1       1       1       0      2        1
 15       1   4  P          M     44         1       1       1       1       0      3        1
 16       1   4  P          M     44         1       1       1       1       0      4        0
 17       1   5  P          F     13         1       1       1       1       1      1        1
 18       1   5  P          F     13         1       1       1       1       1      2        1
 19       1   5  P          F     13         1       1       1       1       1      3        1
 20       1   5  P          F     13         1       1       1       1       1      4        1
The GEE solution is requested with the REPEATED statement in the GENMOD procedure. The option
SUBJECT=ID(CENTER) specifies that the observations in any single cluster are uniquely identified by both
center and id. An equivalent specification is SUBJECT=ID*CENTER. Since the same id values are used in
each center, one of these specifications is needed. If id values were unique across all centers, SUBJECT=ID
would be specified.
The option TYPE=UNSTR, written in the following statements with the equivalent keyword CORR=UNSTR, specifies the unstructured working correlation structure. The MODEL statement
specifies the regression model for the mean with the binomial distribution variance function. The following
SAS statements perform the GEE model fit:
proc genmod data=resp descend;
class id treatment(ref="P") center(ref="1") sex(ref="M")
baseline(ref="0") / param=ref;
model outcome=treatment center sex age baseline / dist=bin;
repeated subject=id(center) / corr=unstr corrw;
run;
These statements first fit the generalized linear model (GLM) specified in the MODEL statement. The
parameter estimates from the generalized linear model fit are not shown in the output, but they are used as
initial values for the GEE solution. The DESCEND option in the PROC GENMOD statement specifies that
the probability that outcome = 1 be modeled. If the DESCEND option had not been specified, the probability
that outcome = 0 would be modeled by default.
Information about the GEE model is displayed in Output 42.5.2. The results of GEE model fitting are
displayed in Output 42.5.3. Model goodness-of-fit criteria are displayed in Output 42.5.4. If you specify
no other options, the standard errors, confidence intervals, Z scores, and p-values are based on empirical
standard error estimates. You can specify the MODELSE option in the REPEATED statement to create a
table based on model-based standard error estimates.
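As a minimal sketch of the MODELSE request described above, the statements below repeat the GEE fit of this example and add MODELSE to the REPEATED statement. Nothing else is changed; the resulting model-based table is not shown in this example, so no output is reproduced here.

```sas
proc genmod data=resp descend;
   class id treatment(ref="P") center(ref="1") sex(ref="M")
         baseline(ref="0") / param=ref;
   model outcome=treatment center sex age baseline / dist=bin;
   /* MODELSE requests an additional parameter estimates table */
   /* that is based on model-based standard error estimates    */
   repeated subject=id(center) / corr=unstr corrw modelse;
run;
```

Comparing the empirical and model-based tables is one informal way to judge how sensitive the inference is to the working correlation choice.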
Output 42.5.2 Model Fitting Information

The GENMOD Procedure

GEE Model Information

Correlation Structure                     Unstructured
Subject Effect                  id(center) (111 levels)
Number of Clusters                                 111
Correlation Matrix Dimension                         4
Maximum Cluster Size                                 4
Minimum Cluster Size                                 4

Output 42.5.3 Results of Model Fitting

Working Correlation Matrix

          Col1      Col2      Col3      Col4
Row1    1.0000    0.3351    0.2140    0.2953
Row2    0.3351    1.0000    0.4429    0.3581
Row3    0.2140    0.4429    1.0000    0.3964
Row4    0.2953    0.3581    0.3964    1.0000

Output 42.5.3 continued

Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

                            Standard     95% Confidence
Parameter        Estimate      Error         Limits             Z    Pr > |Z|
Intercept         -0.8882     0.4568    -1.7835    0.0071    -1.94     0.0519
treatment  A       1.2442     0.3455     0.5669    1.9214     3.60     0.0003
center     2       0.6558     0.3512    -0.0326    1.3442     1.87     0.0619
sex        F       0.1128     0.4408    -0.7512    0.9768     0.26     0.7981
age               -0.0175     0.0129    -0.0427    0.0077    -1.36     0.1728
baseline   1       1.8981     0.3441     1.2237    2.5725     5.52     <.0001

Output 42.5.4 Model Fit Criteria

GEE Fit Criteria

QIC     512.3416
QICu    499.6081
The nonsignificance of age and sex makes them candidates for omission from the model.
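A hedged sketch of the reduced fit suggested by this remark follows. It simply drops age and sex from the CLASS and MODEL statements of the original program; this example does not report results for the reduced model, so none are given here.

```sas
proc genmod data=resp descend;
   /* sex is removed from CLASS because it no longer appears in the model */
   class id treatment(ref="P") center(ref="1")
         baseline(ref="0") / param=ref;
   model outcome=treatment center baseline / dist=bin;
   repeated subject=id(center) / corr=unstr corrw;
run;
```

The QIC value from this run could then be compared with the value in Output 42.5.4, with smaller values indicating the preferred model.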
Example 42.6: Log Odds Ratios and the ALR Algorithm
Since the respiratory data in Example 42.5 are binary, you can use the ALR algorithm to model the log odds
ratios instead of using working correlations to model associations. In this example, a “fully parameterized
cluster” model for the log odds ratio is fit. That is, there is a log odds ratio parameter for each unique pair
of responses within clusters, and all clusters are parameterized identically. The following statements fit the
same regression model for the mean as in Example 42.5 but use a regression model for the log odds ratios
instead of a working correlation. The LOGOR=FULLCLUST option specifies a fully parameterized log odds
ratio model.
proc genmod data=resp descend;
class id treatment(ref="P") center(ref="1") sex(ref="M")
baseline(ref="0") / param=ref;
model outcome=treatment center sex age baseline / dist=bin;
repeated subject=id(center) / logor=fullclust;
run;
The results of fitting the model are displayed in Output 42.6.1 along with a table that shows the correspondence
between the log odds ratio parameters and the within-cluster pairs. Model goodness-of-fit criteria are shown
in Output 42.6.2. The QIC for the ALR model shown in Output 42.6.2 is 511.86, whereas the QIC for the
unstructured working correlation model shown in Output 42.5.4 is 512.34, indicating that the ALR model is
a slightly better fit.
Output 42.6.1 Results of Model Fitting

The GENMOD Procedure

Log Odds Ratio Parameter Information

Parameter    Group
Alpha1       (1, 2)
Alpha2       (1, 3)
Alpha3       (1, 4)
Alpha4       (2, 3)
Alpha5       (2, 4)
Alpha6       (3, 4)

Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

                            Standard     95% Confidence
Parameter        Estimate      Error         Limits             Z    Pr > |Z|
Intercept         -0.9266     0.4513    -1.8111   -0.0421    -2.05     0.0400
treatment  A       1.2611     0.3406     0.5934    1.9287     3.70     0.0002
center     2       0.6287     0.3486    -0.0545    1.3119     1.80     0.0713
sex        F       0.1024     0.4362    -0.7526    0.9575     0.23     0.8144
age               -0.0162     0.0125    -0.0407    0.0084    -1.29     0.1977
baseline   1       1.8980     0.3404     1.2308    2.5652     5.58     <.0001
Alpha1             1.6109     0.4892     0.6522    2.5696     3.29     0.0010
Alpha2             1.0771     0.4834     0.1297    2.0246     2.23     0.0259
Alpha3             1.5875     0.4735     0.6594    2.5155     3.35     0.0008
Alpha4             2.1224     0.5022     1.1381    3.1068     4.23     <.0001
Alpha5             1.8818     0.4686     0.9634    2.8001     4.02     <.0001
Alpha6             2.1046     0.4949     1.1347    3.0745     4.25     <.0001

Output 42.6.2 Model Fit Criteria

GEE Fit Criteria

QIC     511.8589
QICu    499.6516
You can fit the same model by fully specifying the z matrix. The following statements create a data set
containing the full z matrix:
data zin;
   keep id center z1-z6 y1 y2;
   array zin(6) z1-z6;
   set resp;
   by center id;
   /* write the six z-matrix rows once per cluster, at its first record */
   if first.id then do;
      t = 0;
      do m = 1 to 4;
         do n = m+1 to 4;
            /* clear the row, then flag the column for the pair (m,n) */
            do j = 1 to 6;
               zin(j) = 0;
            end;
            y1 = m;
            y2 = n;
            t + 1;
            zin(t) = 1;
            output;
         end;
      end;
   end;
run;

proc print data=zin (obs=12);
run;
Output 42.6.3 displays the full z matrix for the first two clusters. The z matrix is identical for all clusters in
this example.
Output 42.6.3 Full z Matrix Data Set

Obs   z1   z2   z3   z4   z5   z6   center   id   y1   y2
  1    1    0    0    0    0    0        1    1    1    2
  2    0    1    0    0    0    0        1    1    1    3
  3    0    0    1    0    0    0        1    1    1    4
  4    0    0    0    1    0    0        1    1    2    3
  5    0    0    0    0    1    0        1    1    2    4
  6    0    0    0    0    0    1        1    1    3    4
  7    1    0    0    0    0    0        1    2    1    2
  8    0    1    0    0    0    0        1    2    1    3
  9    0    0    1    0    0    0        1    2    1    4
 10    0    0    0    1    0    0        1    2    2    3
 11    0    0    0    0    1    0        1    2    2    4
 12    0    0    0    0    0    1        1    2    3    4
The following statements fit the model for fully parameterized clusters by fully specifying the z matrix. The
results are identical to those shown previously.
proc genmod data=resp descend;
class id treatment(ref="P") center(ref="1") sex(ref="M")
baseline(ref="0") / param=ref;
model outcome=treatment center sex age baseline / dist=bin;
repeated subject=id(center) / logor=zfull
zdata=zin
zrow =(z1-z6)
ypair=(y1 y2);
run;
Example 42.7: Log-Linear Model for Count Data
In this example the data, from Thall and Vail (1990), concern the treatment of people suffering from epileptic
seizure episodes. These data are also analyzed in Diggle, Liang, and Zeger (1994). The data consist of
the number of epileptic seizures in an eight-week baseline period, before any treatment, and in each of
four two-week treatment periods, in which patients received either a placebo or the drug Progabide in
addition to other therapy. A portion of the data is displayed in Table 42.16. See “Gee Model for Count Data,
Exchangeable Correlation” in the SAS/STAT Sample Program Library for the complete data set.
Table 42.16 Epileptic Seizure Data

Patient ID    Treatment    Baseline    Visit1    Visit2    Visit3    Visit4
104           Placebo            11         5         3         3         3
106           Placebo            11         3         5         3         3
107           Placebo             6         2         4         0         5
.
.
.
101           Progabide          76        11        14         9         8
102           Progabide          38         8         7         9         4
103           Progabide          19         0         4         3         0
.
.
.
Model the data as a log-linear model with $V(\mu) = \mu$ (the Poisson variance function) and

\[
\log(E(Y_{ij})) = \beta_0 + x_{i1}\beta_1 + x_{i2}\beta_2 + x_{i1}x_{i2}\beta_3 + \log(t_{ij})
\]

where

   $Y_{ij}$ = number of epileptic seizures in interval $j$
   $t_{ij}$ = length of interval $j$
   $x_{i1}$ = 1 for weeks 8–16 (treatment), 0 for weeks 0–8 (baseline)
   $x_{i2}$ = 1 for the progabide group, 0 for the placebo group

The correlations between the counts are modeled as $r_{ij} = \alpha$, $i \ne j$ (exchangeable correlations). For
comparison, the correlations are also modeled as independent (identity correlation matrix). In this model, the
regression parameters have the interpretation in terms of the log seizure rate displayed in Table 42.17.

Table 42.17 Interpretation of Regression Parameters

Treatment    Visit       $\log(E(Y_{ij})/t_{ij})$
Placebo      Baseline    $\beta_0$
Placebo      1–4         $\beta_0 + \beta_1$
Progabide    Baseline    $\beta_0 + \beta_2$
Progabide    1–4         $\beta_0 + \beta_1 + \beta_2 + \beta_3$

The difference between the log seizure rates in the pretreatment (baseline) period and the treatment periods is
$\beta_1$ for the placebo group and $\beta_1 + \beta_3$ for the Progabide group. A value of $\beta_3 < 0$ indicates a reduction in
the seizure rate.
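The role of $\beta_3$ can be read directly from the four log rates in Table 42.17. The short derivation below is a sketch of that reasoning: it takes the change from baseline in each group and subtracts the placebo change from the Progabide change. The rate-ratio reading in the last line is a standard consequence of the log link rather than a statement made in this example.

```latex
\begin{align*}
\text{Placebo change:}   \quad & (\beta_0 + \beta_1) - \beta_0 = \beta_1 \\
\text{Progabide change:} \quad & (\beta_0 + \beta_1 + \beta_2 + \beta_3) - (\beta_0 + \beta_2) = \beta_1 + \beta_3 \\
\text{Difference of changes:} \quad & (\beta_1 + \beta_3) - \beta_1 = \beta_3 \\
\text{Ratio of rate ratios:} \quad & \exp(\beta_3)
\end{align*}
```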
Output 42.7.1 lists the first 14 observations of the data, which are arranged as one visit per observation:
Output 42.7.1 Partial Listing of the Seizure Data

Obs     id    y    visit    trt    bline    age
  1    104    5        1      0       11     31
  2    104    3        2      0       11     31
  3    104    3        3      0       11     31
  4    104    3        4      0       11     31
  5    106    3        1      0       11     30
  6    106    5        2      0       11     30
  7    106    3        3      0       11     30
  8    106    3        4      0       11     30
  9    107    2        1      0        6     25
 10    107    4        2      0        6     25
 11    107    0        3      0        6     25
 12    107    5        4      0        6     25
 13    114    4        1      0        8     36
 14    114    4        2      0        8     36
Some further data manipulations create an observation for the baseline measures, a log time interval variable
for use as an offset, and an indicator variable for whether the observation is for a baseline measurement or a
visit measurement. Patient 207 is deleted as an outlier, as in the Diggle, Liang, and Zeger (1994) analysis.
The following statements prepare the data for analysis with PROC GENMOD:
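data new;
   set thall;
   output;
   /* add one baseline record (visit 0) for each patient */
   if visit=1 then do;
      y=bline;
      visit=0;
      output;
   end;
run;

data new;
   set new;
   if id ne 207;             /* delete the outlying patient */
   if visit=0 then do;       /* baseline: eight-week interval */
      x1=0;
      ltime=log(8);
   end;
   else do;                  /* treatment visits: two-week intervals */
      x1=1;
      ltime=log(2);
   end;
run;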
For comparison with the GEE results, an ordinary Poisson regression is first fit. The results are shown in
Output 42.7.2.
Output 42.7.2 Maximum Likelihood Estimates

The GENMOD Procedure

Analysis Of Maximum Likelihood Parameter Estimates

                             Standard        Wald 95%             Wald
Parameter    DF   Estimate      Error    Confidence Limits    Chi-Square    Pr > ChiSq
Intercept     1     1.3476     0.0341     1.2809     1.4144      1565.44        <.0001
x1            1     0.1108     0.0469     0.0189     0.2027         5.58        0.0181
trt           1    -0.1080     0.0486    -0.2034    -0.0127         4.93        0.0264
x1*trt        1    -0.3016     0.0697    -0.4383    -0.1649        18.70        <.0001
Scale         0     1.0000     0.0000     1.0000     1.0000

NOTE: The scale parameter was held fixed.
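The statements that produce the ordinary Poisson fit in Output 42.7.2 are not listed in this example. A minimal sketch, assuming the fit uses the same MODEL statement as the GEE analysis but without a REPEATED statement, is:

```sas
proc genmod data=new;
   /* ordinary Poisson regression: every observation is treated as independent */
   model y=x1 | trt / d=poisson offset=ltime;
run;
```

Because no REPEATED statement is present, the standard errors from this run ignore the within-patient correlation.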
The GEE solution is requested with the REPEATED statement in the GENMOD procedure. The SUBJECT=ID option indicates that the variable id describes the observations for a single cluster, and the CORRW
option displays the working correlation matrix. The TYPE= option specifies the correlation structure; the
value EXCH indicates the exchangeable structure.
The following statements perform the analysis:
proc genmod data=new;
class id;
model y=x1 | trt / d=poisson offset=ltime;
repeated subject=id / corrw covb type=exch;
run;
These statements first fit a generalized linear model (GLM) to these data by maximum likelihood. The
estimates are not shown in the output, but are used as initial values for the GEE solution.
Information about the GEE model is displayed in Output 42.7.3. The results of fitting the model are displayed
in Output 42.7.4. Compare these with the model of independence displayed in Output 42.7.2. The parameter
estimates are nearly identical, but the standard errors for the independence case are underestimated. The
coefficient of the interaction term, $\beta_3$, is highly significant under the independence model and marginally
significant with the exchangeable correlations model.
Output 42.7.3 GEE Model Information

The GENMOD Procedure

GEE Model Information

Correlation Structure              Exchangeable
Subject Effect                   id (58 levels)
Number of Clusters                           58
Correlation Matrix Dimension                  5
Maximum Cluster Size                          5
Minimum Cluster Size                          5

Output 42.7.4 GEE Parameter Estimates

Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

                        Standard     95% Confidence
Parameter    Estimate      Error         Limits             Z    Pr > |Z|
Intercept      1.3476     0.1574     1.0392    1.6560     8.56     <.0001
x1             0.1108     0.1161    -0.1168    0.3383     0.95     0.3399
trt           -0.1080     0.1937    -0.4876    0.2716    -0.56     0.5770
x1*trt        -0.3016     0.1712    -0.6371    0.0339    -1.76     0.0781
Table 42.18 displays the regression coefficients, standard errors, and normalized coefficients that result from
fitting the model with independent and exchangeable working correlation matrices.
Table 42.18 Results of Model Fitting

Variable        Correlation Structure    Coef.    Std. Error    Coef./S.E.
Intercept       Exchangeable              1.35          0.16          8.56
                Independent               1.35          0.03         39.52
Visit ($x_1$)   Exchangeable              0.11          0.12          0.95
                Independent               0.11          0.05          2.36
Treat ($x_2$)   Exchangeable             -0.11          0.19         -0.56
                Independent              -0.11          0.05         -2.22
$x_1 x_2$       Exchangeable             -0.30          0.17         -1.76
                Independent              -0.30          0.07         -4.32
The fitted exchangeable correlation matrix is specified with the CORRW option and is displayed in Output 42.7.5.
Output 42.7.5 Working Correlation Matrix

Working Correlation Matrix

          Col1      Col2      Col3      Col4      Col5
Row1    1.0000    0.5941    0.5941    0.5941    0.5941
Row2    0.5941    1.0000    0.5941    0.5941    0.5941
Row3    0.5941    0.5941    1.0000    0.5941    0.5941
Row4    0.5941    0.5941    0.5941    1.0000    0.5941
Row5    0.5941    0.5941    0.5941    0.5941    1.0000
If you specify the COVB option, you produce both the model-based (naive) and the empirical (robust)
covariance matrices. Output 42.7.6 contains these estimates.
Output 42.7.6 Covariance Matrices

Covariance Matrix (Model-Based)

             Prm1         Prm2         Prm3         Prm4
Prm1      0.01223     0.001520     -0.01223    -0.001520
Prm2     0.001520      0.01519    -0.001520     -0.01519
Prm3     -0.01223    -0.001520      0.02495     0.005427
Prm4    -0.001520     -0.01519     0.005427      0.03748

Covariance Matrix (Empirical)

             Prm1         Prm2         Prm3         Prm4
Prm1      0.02476    -0.001152     -0.02476     0.001152
Prm2    -0.001152      0.01348     0.001152     -0.01348
Prm3     -0.02476     0.001152      0.03751    -0.002999
Prm4     0.001152     -0.01348    -0.002999      0.02931
The two covariance estimates are similar, indicating an adequate correlation model.
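As a hedged cross-check, the empirical standard errors in Output 42.7.4 are the square roots of the diagonal entries of the empirical covariance matrix in Output 42.7.6. The model-based values in the last line are computed here from the model-based diagonal and are not printed anywhere in this example.

```latex
\begin{align*}
\text{Empirical:} \quad & \sqrt{0.02476} \approx 0.1574, \;\; \sqrt{0.01348} \approx 0.1161, \;\;
                          \sqrt{0.03751} \approx 0.1937, \;\; \sqrt{0.02931} \approx 0.1712 \\
\text{Model-based:} \quad & \sqrt{0.01223} \approx 0.111, \;\; \sqrt{0.01519} \approx 0.123, \;\;
                            \sqrt{0.02495} \approx 0.158, \;\; \sqrt{0.03748} \approx 0.194
\end{align*}
```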
Example 42.8: Model Assessment of Multiple Regression Using Aggregates of Residuals
This example illustrates the use of cumulative residuals to assess the adequacy of a normal linear regression
model. Neter et al. (1996, Section 8.2) describe a study of 54 patients undergoing a certain kind of liver
operation in a surgical unit. The data consist of the survival time and certain covariates. After a model
selection procedure, they arrived at the following model:
\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon
\]
where $Y$ is the logarithm (base 10) of the survival time; $X_1$, $X_2$, $X_3$ are blood-clotting score, prognostic index,
and enzyme function, respectively; and $\epsilon$ is a normal error term. A listing of the SAS data set containing the
data is shown in Output 42.8.1. The variables Y, X1, X2, and X3 correspond to $Y$, $X_1$, $X_2$, and $X_3$, and LogX1
is $\log(X_1)$. The PROC GENMOD fit of the model is shown in Output 42.8.2. The analysis first focuses on
the adequacy of the functional form of $X_1$, blood-clotting score.
Output 42.8.1 Surgical Unit Example Data

Obs        Y      X1     X2     X3      LogX1
  1   2.3010     6.7     62     81    0.82607
  2   2.0043     5.1     59     66    0.70757
  3   2.3096     7.4     57     83    0.86923
  4   2.0043     6.5     73     41    0.81291
  5   2.7067     7.8     65    115    0.89209
  6   1.9031     5.8     38     72    0.76343
  7   1.9031     5.7     46     63    0.75587
  8   2.1038     3.7     68     81    0.56820
  9   2.3054     6.0     67     93    0.77815
 10   2.3075     3.7     76     94    0.56820
 11   2.5172     6.3     84     83    0.79934
 12   1.8129     6.7     51     43    0.82607
 13   2.9191     5.8     96    114    0.76343
 14   2.5185     5.8     83     88    0.76343
 15   2.2253     7.7     62     67    0.88649
 16   2.3365     7.4     74     68    0.86923
 17   1.9395     6.0     85     28    0.77815
 18   1.5315     3.7     51     41    0.56820
 19   2.3324     7.3     68     74    0.86332
 20   2.2355     5.6     57     87    0.74819
 21   2.0374     5.2     52     76    0.71600
 22   2.1335     3.4     83     53    0.53148
 23   1.8451     6.7     26     68    0.82607
 24   2.3424     5.8     67     86    0.76343
 25   2.4409     6.3     59    100    0.79934
 26   2.1584     5.8     61     73    0.76343
 27   2.2577     5.2     52     86    0.71600
 28   2.7589    11.2     76     90    1.04922
 29   1.8573     5.2     54     56    0.71600
 30   2.2504     5.8     76     59    0.76343
 31   1.8513     3.2     64     65    0.50515
 32   1.7634     8.7     45     23    0.93952
 33   2.0645     5.0     59     73    0.69897
 34   2.4698     5.8     72     93    0.76343
 35   2.0607     5.4     58     70    0.73239
 36   2.2648     5.3     51     99    0.72428
 37   2.0719     2.6     74     86    0.41497
 38   2.0792     4.3      8    119    0.63347
 39   2.1790     4.8     61     76    0.68124
 40   2.1703     5.4     52     88    0.73239
 41   1.9777     5.2     49     72    0.71600
 42   1.8751     3.6     28     99    0.55630
 43   2.6840     8.8     86     88    0.94448
 44   2.1847     6.5     56     77    0.81291
 45   2.2810     3.4     77     93    0.53148
 46   2.0899     6.5     40     84    0.81291
 47   2.4928     4.5     73    106    0.65321
 48   2.5999     4.8     86    101    0.68124
 49   2.1987     5.1     67     77    0.70757
 50   2.4914     3.9     82    103    0.59106
 51   2.0934     6.6     77     46    0.81954
 52   2.0969     6.4     85     40    0.80618
 53   2.2967     6.4     59     85    0.80618
 54   2.4955     8.8     78     72    0.94448
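The listed LogX1 values agree with base-10 logarithms of X1 (for example, the first observation has X1 = 6.7 and LogX1 = 0.82607). The DATA step below is a sketch of how such a variable could be derived; the output data set name SurgCheck and the variable name LogX1_check are hypothetical and are used only so that the existing LogX1 variable in Surg is not overwritten.

```sas
data SurgCheck;               /* hypothetical data set name */
   set Surg;
   LogX1_check = log10(X1);   /* base-10 logarithm of the blood-clotting score */
run;

proc print data=SurgCheck (obs=5);
   var X1 LogX1 LogX1_check;
run;
```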
In order to assess the adequacy of the fitted multiple regression model, the ASSESS statement in the following
SAS statements is used to create the plots of cumulative residuals against X1 shown in Output 42.8.3 and
Output 42.8.4 and the summary table in Output 42.8.5:
ods graphics on;
proc genmod data=Surg;
model Y = X1 X2 X3 / scale=Pearson;
assess var=(X1) / resample=10000
seed=603708000
crpanel;
run;
Output 42.8.2 Regression Model for Linear X1

The GENMOD Procedure

Analysis Of Maximum Likelihood Parameter Estimates

                             Standard        Wald 95%             Wald
Parameter    DF   Estimate      Error    Confidence Limits    Chi-Square    Pr > ChiSq
Intercept     1     0.4836     0.0426     0.4001     0.5672       128.71        <.0001
X1            1     0.0692     0.0041     0.0612     0.0772       288.17        <.0001
X2            1     0.0093     0.0004     0.0085     0.0100       590.45        <.0001
X3            1     0.0095     0.0003     0.0089     0.0101       966.07        <.0001
Scale         0     0.0469     0.0000     0.0469     0.0469

NOTE: The scale parameter was estimated by the square root of Pearson's
Chi-Square/DOF.
See Lin, Wei, and Ying (2002) for details about model assessment that uses cumulative residual plots. The
RESAMPLE= keyword specifies that a p-value be computed based on a sample of 10,000 simulated residual
paths. A random number seed is specified by the SEED= keyword for reproducibility. If you do not specify
the seed, one is derived from the time of day. The keyword CRPANEL specifies that the panel of four
cumulative residual plots shown in Output 42.8.4 be created, each with two simulated paths. The single
residual plot with 20 simulated paths in Output 42.8.3 is created by default.
To request these graphs, ODS Graphics must be enabled and you must specify the ASSESS statement. For
general information about ODS Graphics, see Chapter 21, “Statistical Graphics Using ODS.” For specific
information about the graphics available in the GENMOD procedure, see the section “ODS Graphics” on
page 3016.
Output 42.8.3 Cumulative Residual Plot for Linear X1 Fit

Output 42.8.4 Cumulative Residual Panel Plot for Linear X1 Fit

Output 42.8.5 Summary of Model Assessment

Assessment Summary

Assessment    Maximum                                           Pr >
Variable      Absolute Value    Replications         Seed    MaxAbsVal
X1                    0.0380           10000    603708000       0.1084
The p-value of 0.1084 reported on Output 42.8.3 and Output 42.8.5 suggests that a more adequate model
might be possible. The observed cumulative residuals in Output 42.8.3 and Output 42.8.4, represented by the
heavy lines, seem atypical of the simulated curves, represented by the light lines, reinforcing the conclusion
that a more appropriate functional form for X1 is possible.
The cumulative residual plots in Output 42.8.6 provide guidance in determining a more appropriate functional
form. The four curves were created from simple forms of model misspecification by using simulated data.
The mean models of the data and the fitted model are shown in Table 42.19.
Output 42.8.6 Typical Cumulative Residual Patterns

Table 42.19 Model Misspecifications

Plot    Data E(Y)            Fitted Model E(Y)
(a)     $\log(X)$            $X$
(b)     $X + X^2$            $X$
(c)     $X + X^2 + X^3$      $X + X^2$
(d)     $I(X > 5)$           $X$
The observed cumulative residual pattern in Output 42.8.3 and Output 42.8.4 most resembles the behavior of
the curve in plot (a) of Output 42.8.6, indicating that $\log(X_1)$ might be a more appropriate term in the model
than $X_1$.
The following SAS statements fit a model with LogX1 in place of X1 and request a model assessment:
proc genmod data=Surg;
model Y = LogX1 X2 X3 / scale=Pearson;
assess var=(LogX1) / resample=10000
seed=603708000;
run;
The revised model fit is shown in Output 42.8.7, the p-value from the simulation is 0.4777, and the cumulative
residuals plotted in Output 42.8.8 show no systematic trend. The log transformation for X1 is more appropriate.
Under the revised model, the p-values for testing the functional forms of X2 and X3 are 0.20 and 0.63,
respectively; and the p-value for testing the linearity of the model is 0.65. Thus, the revised model seems
reasonable.
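The statements behind the additional checks of X2, X3, and the linearity of the model are not listed in this example. The sketch below shows one way they might be requested, using one ASSESS statement per run; it assumes that the linearity check corresponds to the LINK keyword of the ASSESS statement, and it is not guaranteed to reproduce the quoted p-values exactly.

```sas
/* functional form of X2 under the revised model */
proc genmod data=Surg;
   model Y = LogX1 X2 X3 / scale=Pearson;
   assess var=(X2) / resample=10000 seed=603708000;
run;

/* functional form of X3 under the revised model */
proc genmod data=Surg;
   model Y = LogX1 X2 X3 / scale=Pearson;
   assess var=(X3) / resample=10000 seed=603708000;
run;

/* overall check of the link function (assumed here to be the linearity check) */
proc genmod data=Surg;
   model Y = LogX1 X2 X3 / scale=Pearson;
   assess link / resample=10000 seed=603708000;
run;
```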
Output 42.8.7 Multiple Regression Model with Log(X1)

The GENMOD Procedure

Analysis Of Maximum Likelihood Parameter Estimates

                             Standard        Wald 95%             Wald
Parameter    DF   Estimate      Error    Confidence Limits    Chi-Square    Pr > ChiSq
Intercept     1     0.1844     0.0504     0.0857     0.2832        13.41        0.0003
LogX1         1     0.9121     0.0491     0.8158     1.0083       345.05        <.0001
X2            1     0.0095     0.0004     0.0088     0.0102       728.62        <.0001
X3            1     0.0096     0.0003     0.0090     0.0101      1139.73        <.0001
Scale         0     0.0434     0.0000     0.0434     0.0434

NOTE: The scale parameter was estimated by the square root of Pearson's
Chi-Square/DOF.
Output 42.8.8 Cumulative Residual Plot with Log(X1)
Example 42.9: Assessment of a Marginal Model for Dependent Data
This example illustrates the use of cumulative residuals to assess the adequacy of a marginal model for
dependent data fit by generalized estimating equations (GEEs). The assessment methods are applied to CD4
count data from an AIDS clinical trial reported by Fischl, Richman, and Hansen (1990) and reanalyzed by
Lin, Wei, and Ying (2002). The study randomly assigned 360 HIV patients to the drug AZT and 351 patients
to placebo. CD4 counts were measured repeatedly over the course of the study. The data used here are the
4328 measurements taken in the first 40 weeks of the study.
The analysis focuses on the time trend of the response. The first model considered is
\[
E(y_{ik}) = \beta_0 + \beta_1 T_{ik} + \beta_2 T_{ik}^2 + \beta_3 R_i T_{ik} + \beta_4 R_i T_{ik}^2
\]
where $T_{ik}$ is the time (in weeks) of the $k$th measurement on the $i$th patient, $y_{ik}$ is the CD4 count at $T_{ik}$ for
the $i$th patient, and $R_i$ is the indicator of AZT for the $i$th patient. Normal errors and an independent working
correlation are assumed.
The following statements create the SAS data set cd4:
data cd4;
   input Id Y Time Time2 TrtTime TrtTime2;
   Time3 = Time2 * Time;
   TrtTime3 = TrtTime2 * Time;
   datalines;
  1   264.00024   -0.28571      0.08163   -0.28571      0.08163
  1   175.00070    4.14286     17.16327    4.14286     17.16327
  1   306.00150    8.14286     66.30612    8.14286     66.30612
  1   331.99835   12.14286    147.44898   12.14286    147.44898
  1   309.99929   16.14286    260.59184   16.14286    260.59184
  1   185.00077   28.71429    824.51020   28.71429    824.51020
  1   175.00070   40.14286   1611.44898   40.14286   1611.44898
  2   574.99998   -0.57143      0.32653    0.00000      0.00000

   ... more lines ...

711   363.99859    8.14286     66.30612    8.14286     66.30612
711   488.00224   12.14286    147.44898   12.14286    147.44898
711   240.00026   18.14286    329.16327   18.14286    329.16327
;
The following SAS statements fit the preceding model, create the cumulative residual plot in Output 42.9.1,
and compute a p-value for the model.
To request these graphs, ODS Graphics must be enabled and you must specify the ASSESS statement. For
general information about ODS Graphics, see Chapter 21, “Statistical Graphics Using ODS.” For specific
information about the graphics available in the GENMOD procedure, see the section “ODS Graphics” on
page 3016.
Here, the SAS data set variables Time, Time2, TrtTime, and TrtTime2 correspond to $T_{ik}$, $T_{ik}^2$, $R_i T_{ik}$, and
$R_i T_{ik}^2$, respectively. The variable Id identifies individual patients.
ods graphics on;
proc genmod data=cd4;
class Id;
model Y = Time Time2 TrtTime TrtTime2;
repeated sub=Id;
assess var=(Time) / resample
seed=603708000;
run;
Output 42.9.1 Cumulative Residual Plot for Quadratic Time Fit
The cumulative residual plot in Output 42.9.1 displays cumulative residuals versus time for the model and 20
simulated realizations. The associated p-value, also shown in Output 42.9.1, is 0.18. These results indicate
that a more satisfactory model might be possible. The observed cumulative residual pattern most resembles
plot (c) in Output 42.8.6, suggesting cubic time trends.
The following SAS statements fit the model, create the plot in Output 42.9.2, and compute a p-value for a
model with the additional terms $T_{ik}^3$ and $R_i T_{ik}^3$:
proc genmod data=cd4;
class Id;
model Y = Time Time2 Time3 TrtTime TrtTime2 TrtTime3;
repeated sub=Id;
assess var=(Time) / resample
seed=603708000;
run;
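Written out, the mean model fit by the preceding statements extends the quadratic model with the two cubic terms. The coefficient numbering below is an assumption chosen to follow the order of effects in the MODEL statement; it is not taken from this example.

```latex
\[
E(y_{ik}) = \beta_0 + \beta_1 T_{ik} + \beta_2 T_{ik}^2 + \beta_3 T_{ik}^3
          + \beta_4 R_i T_{ik} + \beta_5 R_i T_{ik}^2 + \beta_6 R_i T_{ik}^3
\]
```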
Output 42.9.2 Cumulative Residual Plot for Cubic Time Fit
The observed cumulative residual pattern appears more typical of the simulated realizations, and the p-value
is 0.45, indicating that the model with cubic time trends is more appropriate.
Example 42.10: Bayesian Analysis of a Poisson Regression Model
This example illustrates a Bayesian analysis of a log-linear Poisson regression model. Consider the following
data on patients from clinical trials. The data set is a subset of the data described in Ibrahim, Chen, and
Lipsitz (1999).
data Liver;
   input X1-X6 Y;
   datalines;
19.1358   50.0110   51.000   0   0   1    3
23.5970   18.4959    3.429   0   0   1    9
20.0474   56.7699    3.429   1   1   0    6
28.0277   59.7836    4.000   0   0   1    6
28.6851   74.1589    5.714   1   0   1    1
18.8092   31.0630    2.286   0   1   1   61
28.7201   52.9178   37.286   1   0   1    6
21.3669   61.6603   54.143   0   1   1    6
23.7332   42.2904    0.571   1   0   1   21
20.4783   22.1260   19.000   1   0   1    6

   ... more lines ...

17.0993   48.8384    3.000   0   0   0    9
19.1327   65.3425    2.571   1   0   0    1
17.3010   51.4493    4.429   1   0   0    6
;
The primary interest is in prediction of the number of cancerous liver nodes when a patient enters the trials,
by using six other baseline characteristics. The number of nodes is modeled by a Poisson regression model
with the six baseline characteristics as covariates. The response and regression variables are as follows:
Y     Number of Cancerous Liver Nodes
X1    Body Mass Index
X2    Age, in Years
X3    Time Since Diagnosis of Disease, in Weeks
X4    Two Biochemical Markers (each classified as normal=1 or abnormal=0)
X5    Anti Hepatitis B Antigen
X6    Associated Jaundice (yes=1, no=0)
Two analyses are performed using PROC GENMOD. The first analysis uses noninformative normal prior
distributions, and the second analysis uses an informative normal prior for one of the regression parameters.
In the following BAYES statement, COEFFPRIOR=NORMAL specifies a noninformative independent
normal prior distribution with zero mean and variance $10^6$ for each parameter.
The initial analysis is performed using PROC GENMOD to obtain Bayesian estimates of the regression
coefficients by using the following SAS statements:
proc genmod data=Liver;
model Y = X1-X6 / dist=Poisson link=log;
bayes seed=1 coeffprior=normal;
run;
Maximum likelihood estimates of the model parameters are computed by default. These are shown in the
“Analysis of Maximum Likelihood Parameter Estimates” table in Output 42.10.1.
Output 42.10.1 Maximum Likelihood Parameter Estimates

The GENMOD Procedure

Bayesian Analysis

Analysis Of Maximum Likelihood Parameter Estimates

                             Standard     Wald 95% Confidence
Parameter    DF   Estimate      Error           Limits
Intercept     1     2.4508     0.2284     2.0032      2.8984
X1            1    -0.0044     0.0080    -0.0201      0.0114
X2            1    -0.0135     0.0024    -0.0181     -0.0088
X3            1    -0.0029     0.0022    -0.0072      0.0014
X4            1    -0.2715     0.0795    -0.4272     -0.1157
X5            1     0.3215     0.0832     0.1585      0.4845
X6            1     0.2077     0.0827     0.0456      0.3698
Scale         0     1.0000     0.0000     1.0000      1.0000

NOTE: The scale parameter was held fixed.
Noninformative independent normal prior distributions with zero means and variances of $10^6$ were used in
the initial analysis. These are shown in Output 42.10.2.
Output 42.10.2 Regression Coefficient Priors

The GENMOD Procedure

Bayesian Analysis

Independent Normal Prior for Regression Coefficients

Parameter    Mean    Precision
Intercept       0         1E-6
X1              0         1E-6
X2              0         1E-6
X3              0         1E-6
X4              0         1E-6
X5              0         1E-6
X6              0         1E-6
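The Precision column in Output 42.10.2 is consistent with the stated prior variance, because the precision of a normal distribution is the reciprocal of its variance. The symbols $\tau$ and $\sigma^2$ are introduced here only for illustration:

```latex
\[
\tau = \frac{1}{\sigma^{2}} = \frac{1}{10^{6}} = 10^{-6} \quad (\text{printed as 1E-6})
\]
```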
Initial values for the Markov chain are listed in the “Initial Values and Seeds” table in Output 42.10.3. The
random number seed is also listed so that you can reproduce the analysis. Because SEED=1 was specified in the
BAYES statement, the seed shown is 1. If you do not specify a seed, one is derived from the time of day.
Output 42.10.3 MCMC Initial Values and Seeds

Initial Values of the Chain

Chain   Seed   Intercept         X1         X2         X3         X4         X5         X6
    1      1    2.450813   -0.00435   -0.01347   -0.00291   -0.27149   0.321507   0.207713
Summary statistics for the posterior sample are displayed in the “Fit Statistics,” “Descriptive Statistics for the
Posterior Sample,” “Interval Statistics for the Posterior Sample,” and “Posterior Correlation Matrix” tables
in Output 42.10.4, Output 42.10.5, Output 42.10.6, and Output 42.10.7, respectively. Since noninformative
prior distributions for the regression coefficients were used, the mean and standard deviations of the posterior
distributions for the model parameters are close to the maximum likelihood estimates and standard errors.
Output 42.10.4 Fit Statistics

Fit Statistics

DIC (smaller is better)                 829.810
pD (effective number of parameters)       7.005

Output 42.10.5 Descriptive Statistics

The GENMOD Procedure

Bayesian Analysis

Posterior Summaries

                                   Standard               Percentiles
Parameter        N        Mean    Deviation         25%         50%         75%
Intercept    10000      2.4483       0.2320      2.2903      2.4493      2.6093
X1           10000    -0.00475      0.00809     -0.0101    -0.00466    0.000851
X2           10000     -0.0134      0.00237     -0.0150     -0.0134     -0.0118
X3           10000    -0.00303      0.00220    -0.00445    -0.00298    -0.00150
X4           10000     -0.2703       0.0799     -0.3241     -0.2725     -0.2190
X5           10000      0.3202       0.0828      0.2642      0.3209      0.3775
X6           10000      0.2106       0.0838      0.1533      0.2111      0.2663
Output 42.10.6 Interval Statistics

Posterior Intervals

Parameter  Alpha    Equal-Tail Interval        HPD Interval
Intercept  0.050     1.9903     2.9059     2.0289     2.9321
X1         0.050    -0.0209     0.0108    -0.0211     0.0106
X2         0.050    -0.0181   -0.00870    -0.0184   -0.00908
X3         0.050   -0.00761    0.00105   -0.00745    0.00113
X4         0.050    -0.4257    -0.1063    -0.4314    -0.1152
X5         0.050     0.1563     0.4804     0.1574     0.4811
X6         0.050     0.0450     0.3777     0.0468     0.3788
Output 42.10.7 Posterior Sample Correlation Matrix

Posterior Correlation Matrix

Parameter  Intercept      X1      X2      X3      X4      X5      X6
Intercept      1.000  -0.708  -0.432  -0.046  -0.261  -0.185  -0.422
X1            -0.708   1.000  -0.202  -0.047  -0.035   0.078   0.129
X2            -0.432  -0.202   1.000   0.035   0.076   0.054   0.117
X3            -0.046  -0.047   0.035   1.000   0.027  -0.042  -0.077
X4            -0.261  -0.035   0.076   0.027   1.000  -0.024   0.127
X5            -0.185   0.078   0.054  -0.042  -0.024   1.000  -0.037
X6            -0.422   0.129   0.117  -0.077   0.127  -0.037   1.000
Posterior sample autocorrelations for each model parameter are shown in Output 42.10.8. The autocorrelation
after 10 lags is negligible for all parameters, indicating good mixing in the Markov chain.
Output 42.10.8 Posterior Sample Autocorrelations

The GENMOD Procedure
Bayesian Analysis

Posterior Autocorrelations

Parameter    Lag 1    Lag 5   Lag 10   Lag 50
Intercept   0.3037   0.0152   0.0095  -0.0170
X1          0.3398   0.0025   0.0003   0.0052
X2          0.3036   0.0061   0.0003  -0.0062
X3          0.3489   0.0190  -0.0064  -0.0210
X4          0.2868   0.0213   0.0157  -0.0107
X5          0.2854   0.0108  -0.0288  -0.0012
X6          0.3078   0.0230   0.0073   0.0062
The p-values for the Geweke test statistics shown in Output 42.10.9 all indicate convergence of the MCMC.
See the section “Assessing Markov Chain Convergence” on page 141 in Chapter 7, “Introduction to Bayesian
Analysis Procedures,” for more information about convergence diagnostics and their interpretation.
Output 42.10.9 Geweke Diagnostic Statistics

Geweke Diagnostics

Parameter        z   Pr > |z|
Intercept  -0.6533     0.5135
X1          0.3418     0.7325
X2          0.3609     0.7182
X3         -0.3345     0.7380
X4          0.2851     0.7755
X5         -0.5266     0.5985
X6          1.1285     0.2591
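The p-values in the Geweke table can be checked by hand. The following DATA step is a minimal sketch, assuming that each p-value is the two-sided tail probability of the z statistic under a standard normal distribution; the data set name GewekeCheck and the variable pValue are illustrative names that are not part of the example.

```sas
/* Sketch: recompute two-sided p-values from the Geweke z statistics. */
/* GewekeCheck and pValue are illustrative names.                     */
data GewekeCheck;
   length Parameter $ 9;
   input Parameter $ z;
   /* two-sided tail probability under a standard normal reference */
   pValue = 2*(1 - probnorm(abs(z)));
   datalines;
Intercept -0.6533
X1         0.3418
X2         0.3609
X3        -0.3345
X4         0.2851
X5        -0.5266
X6         1.1285
;

proc print data=GewekeCheck noobs;
   format pValue 6.4;
run;
```

Large p-values, such as those reported in Output 42.10.9, give no evidence that the early and late portions of the chain have different means.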
The effective sample sizes for each parameter are shown in Output 42.10.10.
Output 42.10.10 Effective Sample Sizes

Effective Sample Sizes

                      Autocorrelation
Parameter       ESS              Time   Efficiency
Intercept    4880.3            2.0491       0.4880
X1           4844.2            2.0643       0.4844
X2           5139.3            1.9458       0.5139
X3           4551.2            2.1972       0.4551
X4           4953.6            2.0187       0.4954
X5           5330.5            1.8760       0.5331
X6           4988.1            2.0048       0.4988
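The three columns of the “Effective Sample Sizes” table are related by simple arithmetic. The following DATA step is a sketch of that relationship, assuming that the effective sample size is the posterior sample size divided by the autocorrelation time and that the efficiency is the effective sample size divided by the posterior sample size. The data set name EssCheck and its variable names are illustrative.

```sas
/* Sketch: relate autocorrelation time, ESS, and efficiency. */
/* EssCheck and its variable names are illustrative.         */
data EssCheck;
   length Parameter $ 9;
   input Parameter $ AutocorrTime;
   nSamples   = 10000;                  /* N in the "Posterior Summaries" table */
   ESS        = nSamples/AutocorrTime;  /* effective sample size                */
   Efficiency = ESS/nSamples;           /* fraction of an independent sample    */
   datalines;
Intercept 2.0491
X1        2.0643
X2        1.9458
X3        2.1972
X4        2.0187
X5        1.8760
X6        2.0048
;

proc print data=EssCheck noobs;
   format ESS 8.1 Efficiency 6.4;
run;
```

Because the autocorrelation times in the table are rounded to four decimal places, the recomputed values can differ from the displayed ESS values in the last digit.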
Trace, autocorrelation, and density plots for the seven model parameters are shown in Output 42.10.11
through Output 42.10.17. All indicate satisfactory convergence of the Markov chain.
Output 42.10.11 Diagnostic Plots for Intercept
Output 42.10.12 Diagnostic Plots for X1
Output 42.10.13 Diagnostic Plots for X2
Output 42.10.14 Diagnostic Plots for X3
Output 42.10.15 Diagnostic Plots for X4
Output 42.10.16 Diagnostic Plots for X5
Output 42.10.17 Diagnostic Plots for X6
In order to illustrate the use of an informative prior distribution, suppose that researchers expect that a unit
increase in body mass index (X1) will be associated with an increase in the mean number of nodes of between
10% and 20%, and they want to incorporate this prior knowledge in the Bayesian analysis. For log-linear
models, the mean and linear predictor are related by $\log(\mu_i) = x_i'\beta$. If $X1_1$ and $X1_2$ are two values of body
mass index, $\mu_1$ and $\mu_2$ are the two mean values, and all other covariates remain equal for the two values of
X1, then

$$\frac{\mu_1}{\mu_2} = \exp\left(\beta (X1_1 - X1_2)\right)$$

so that for a unit change in X1,

$$\frac{\mu_1}{\mu_2} = \exp(\beta)$$

If $1.1 \le \mu_1/\mu_2 \le 1.2$, then $1.1 \le \exp(\beta) \le 1.2$, or $0.095 \le \beta \le 0.182$. This gives you guidance in specifying
a prior distribution for the $\beta$ for body mass index. Taking the mean of the prior normal distribution to be the
midrange of the values of $\beta$, and taking $\mu \pm 2\sigma$ to be the extremes of the range, an $N(0.1385, 0.0005)$ is
the resulting prior distribution. The second analysis uses this informative normal prior distribution for the
coefficient of X1 and uses independent noninformative normal priors with zero means and variances equal to
$10^6$ for the remaining model regression parameters.
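The arithmetic behind the informative prior can be traced with a short DATA step. The following is a sketch only: it assumes that the bounds are rounded to three decimal places as quoted in the text, and the variable names (lower, upper, priorMean, priorSD, priorVar, precision) are illustrative.

```sas
/* Sketch: derive the N(0.1385, 0.0005) prior from the 10%-20% belief. */
/* All variable names are illustrative.                                */
data _null_;
   lower     = round(log(1.1), 0.001);   /* lower bound for beta, 0.095 in the text */
   upper     = round(log(1.2), 0.001);   /* upper bound for beta, 0.182 in the text */
   priorMean = (lower + upper)/2;        /* midrange of the two bounds              */
   priorSD   = (upper - lower)/4;        /* bounds taken as mean +/- 2 sigma        */
   priorVar  = priorSD**2;               /* rounds to 0.0005                        */
   precision = 1/0.0005;                 /* reciprocal of the rounded variance      */
   put lower= upper= priorMean= priorSD= priorVar= precision=;
run;
```

The reciprocal of the rounded variance is the precision that PROC GENMOD reports for X1 in its table of regression coefficient priors.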
In the following BAYES statement, COEFFPRIOR=NORMAL(INPUT=NormalPrior) specifies the normal prior distribution for the regression coefficients with means and variances contained in the data set
NormalPrior.
An analysis is performed using PROC GENMOD to obtain Bayesian estimates of the regression coefficients
by using the following SAS statements:
data NormalPrior;
   input _type_ $ Intercept X1-X6;
   datalines;
Var   1e6  0.0005  1e6  1e6  1e6  1e6  1e6
Mean  0.0  0.1385  0.0  0.0  0.0  0.0  0.0
;

proc genmod data=Liver;
   model Y = X1-X6 / dist=Poisson link=log;
   bayes seed=1 plots=none coeffprior=normal(input=NormalPrior);
run;
The prior distributions for the regression parameters are shown in Output 42.10.18.
Output 42.10.18 Regression Coefficient Priors

The GENMOD Procedure
Bayesian Analysis

Independent Normal Prior for Regression Coefficients

Parameter      Mean    Precision
Intercept         0         1E-6
X1           0.1385         2000
X2                0         1E-6
X3                0         1E-6
X4                0         1E-6
X5                0         1E-6
X6                0         1E-6
Initial values for the MCMC are shown in Output 42.10.19. The initial values of the regression coefficients are
joint estimates of their posterior modes. Because the prior distribution for the coefficient of X1 is informative,
its initial value is farther from the MLE than the initial values of the other coefficients. Initial values for the
other coefficients are close to their MLEs, since noninformative prior distributions were specified for them.
Output 42.10.19 MCMC Initial Values and Seeds

Initial Values of the Chain

Chain  Seed  Intercept        X1        X2        X3        X4        X5        X6
    1     1    2.14282  0.010595  -0.01434  -0.00301  -0.28062  0.334983  0.231213
Goodness-of-fit, summary, and interval statistics are shown in Output 42.10.20. Except for X1, the statistics
shown in Output 42.10.20 are very similar to the previous statistics for noninformative priors shown in
Output 42.10.4 through Output 42.10.7. The point estimate for X1 is now positive (a posterior mean of 0.0103,
compared with -0.00475 under the noninformative prior). This is expected because the prior distribution on
$\beta_1$ is quite informative and reflects the belief that the coefficient is positive: the $N(0.1385, 0.0005)$
distribution places almost all of its probability mass on positive values. As a result, the posterior density of
$\beta_1$ places more probability on positive values than in the noninformative case.
Output 42.10.20 Fit Statistics

Fit Statistics

DIC (smaller is better)                833.074
pD (effective number of parameters)      6.869
The GENMOD Procedure
Bayesian Analysis

Posterior Summaries

                              Standard   ---------- Percentiles ----------
Parameter      N      Mean   Deviation        25%        50%        75%
Intercept  10000    2.1419      0.2157     1.9965     2.1430     2.2894
X1         10000    0.0103     0.00684    0.00573     0.0104     0.0150
X2         10000   -0.0143     0.00233    -0.0159    -0.0142    -0.0127
X3         10000  -0.00318     0.00218   -0.00467   -0.00314   -0.00170
X4         10000   -0.2806      0.0800    -0.3336    -0.2793    -0.2266
X5         10000    0.3341      0.0832     0.2788     0.3341     0.3906
X6         10000    0.2333      0.0826     0.1774     0.2325     0.2880
Output 42.10.20 continued

Posterior Intervals

Parameter  Alpha    Equal-Tail Interval        HPD Interval
Intercept  0.050     1.7225     2.5574     1.7293     2.5632
X1         0.050   -0.00344     0.0235   -0.00345     0.0234
X2         0.050    -0.0188   -0.00970    -0.0189   -0.00980
X3         0.050   -0.00757    0.00108   -0.00733    0.00121
X4         0.050    -0.4365    -0.1200    -0.4391    -0.1256
X5         0.050     0.1657     0.4966     0.1682     0.4987
X6         0.050     0.0695     0.3959     0.0725     0.3981
Example 42.11: Exact Poisson Regression
The following data, taken from Cox and Snell (1989, pp. 10–11), consists of the number, Notready, of ingots
that are not ready for rolling, out of Total tested, for several combinations of heating time and soaking time:
data ingots;
   input Heat Soak Notready Total @@;
   lnTotal= log(Total);
   datalines;
7 1.0 0 10  14 1.0 0 31  27 1.0 1 56  51 1.0 3 13
7 1.7 0 17  14 1.7 0 43  27 1.7 4 44  51 1.7 0  1
7 2.2 0  7  14 2.2 2 33  27 2.2 0 21  51 2.2 0  1
7 2.8 0 12  14 2.8 0 31  27 2.8 1 22  51 4.0 0  1
7 4.0 0  9  14 4.0 0 19  27 4.0 1 16
;
The following invocation of PROC GENMOD fits an asymptotic (unconditional) Poisson regression model
to the data. The variable Notready is specified as the response variable, and the continuous predictors Heat
and Soak are defined in the CLASS statement as categorical predictors that use reference coding. Specifying
the offset variable as lnTotal enables you to model the ratio Notready/Total.
proc genmod data=ingots;
class Heat Soak / param=ref;
model Notready=Heat Soak / offset=lnTotal dist=Poisson link=log;
exact Heat Soak / joint estimate;
exactoptions statustime=10;
run;
The EXACT statement is specified to additionally fit an exact conditional Poisson regression model. Specifying the lnTotal offset variable models the ratio Notready/Total; in this case, the Total variable contains
the largest possible response value for each observation. The JOINT option produces a joint test for the
significance of the covariates, along with the usual marginal tests. The ESTIMATE option produces exact
parameter estimates for the covariates. The STATUSTIME=10 option is specified in the EXACTOPTIONS
statement for monitoring the progress of the results; this example can take several minutes to complete due to
the JOINT option. If you run out of memory, see the SAS Companion for your system for information about
how to increase the available memory.
The “Criteria For Assessing Goodness Of Fit” table is displayed in Output 42.11.1. Comparing the deviance
of 10.9363 to an asymptotic chi-square distribution with 11 degrees of freedom, you find that the p-value is
0.449. This indicates that the specified model fits the data reasonably well.
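The goodness-of-fit p-value quoted above is an upper-tail chi-square probability. A minimal sketch of the calculation follows, assuming the deviance and degrees of freedom reported in the “Criteria For Assessing Goodness Of Fit” table; the variable names are illustrative.

```sas
/* Sketch: upper-tail chi-square probability for the deviance. */
/* Variable names are illustrative.                            */
data _null_;
   deviance = 10.9363;                     /* deviance from the fit criteria table */
   df       = 11;                          /* 19 observations minus 8 parameters   */
   pValue   = 1 - probchi(deviance, df);   /* P(chi-square with df > deviance)     */
   put pValue= 6.4;
run;
```

The asymptotic chi-square reference distribution is only approximate for sparse counts such as these, which is one motivation for the exact analysis that follows.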
Output 42.11.1 Unconditional Goodness of Fit Criteria

The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion                   DF       Value    Value/DF
Deviance                    11     10.9363      0.9942
Scaled Deviance             11     10.9363      0.9942
Pearson Chi-Square          11      9.3722      0.8520
Scaled Pearson X2           11      9.3722      0.8520
Log Likelihood                     -7.2408
Full Log Likelihood               -12.9038
AIC (smaller is better)            41.8076
AICC (smaller is better)           56.2076
BIC (smaller is better)            49.3631
From the “Analysis Of Parameter Estimates” table in Output 42.11.2, you can see that only two of the Heat
parameters are deemed significant. Looking at the standard errors, you can see that the unconditional analysis
had convergence difficulties with the Heat=7 parameter (Standard Error=264324.6); every observation with
Heat=7 has Notready=0, so the maximum likelihood estimate of this parameter is not finite. This means you
cannot fit this unconditional Poisson regression model to these data.
Output 42.11.2 Unconditional Maximum Likelihood Parameter Estimates

Analysis Of Maximum Likelihood Parameter Estimates

                                   Standard     Wald 95% Confidence       Wald
Parameter         DF   Estimate       Error            Limits         Chi-Square   Pr > ChiSq
Intercept          1    -1.5700      1.1657     -3.8548      0.7147         1.81       0.1780
Heat       7       1   -27.6129    264324.6     -518094    518039.0         0.00       0.9999
Heat       14      1    -3.0107      1.0025     -4.9756     -1.0458         9.02       0.0027
Heat       27      1    -1.7180      0.7691     -3.2253     -0.2106         4.99       0.0255
Soak       1       1    -0.2454      1.1455     -2.4906      1.9998         0.05       0.8304
Soak       1.7     1     0.5572      1.1217     -1.6412      2.7557         0.25       0.6193
Soak       2.2     1     0.4079      1.2260     -1.9951      2.8109         0.11       0.7394
Soak       2.8     1    -0.1301      1.4234     -2.9199      2.6597         0.01       0.9272
Scale              0     1.0000      0.0000      1.0000      1.0000

NOTE: The scale parameter was held fixed.
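The effect of the inflated standard error on the Wald statistics can be seen by recomputing them for the Heat parameters. The following DATA step is a sketch that assumes the usual Wald formulas (estimate plus or minus a normal quantile times the standard error, and a chi-square statistic with one degree of freedom); the data set name WaldCheck and its variable names are illustrative.

```sas
/* Sketch: recompute Wald limits and tests from estimates and standard errors. */
/* WaldCheck and its variable names are illustrative.                          */
data WaldCheck;
   length Parameter $ 8;
   input Parameter $ Estimate StdErr;
   z      = probit(0.975);            /* normal quantile for 95% limits */
   Lower  = Estimate - z*StdErr;
   Upper  = Estimate + z*StdErr;
   ChiSq  = (Estimate/StdErr)**2;     /* Wald chi-square, 1 df          */
   pValue = 1 - probchi(ChiSq, 1);
   datalines;
Heat7  -27.6129 264324.6
Heat14  -3.0107   1.0025
Heat27  -1.7180   0.7691
;

proc print data=WaldCheck noobs;
   format ChiSq 8.2 pValue 6.4;
run;
```

For the Heat=7 row, the standard error dwarfs the estimate, so the Wald chi-square is essentially zero and the test carries no information about that level.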
Following the output from the asymptotic analysis, the exact conditional Poisson regression results are
displayed, as shown in Output 42.11.3.
Output 42.11.3 Exact Tests

The GENMOD Procedure
Exact Conditional Analysis

Exact Conditional Tests

                                    ---- p-Value ----
Effect   Test           Statistic     Exact       Mid
Joint    Score            18.3665    0.0137    0.0137
         Probability     1.294E-6    0.0471    0.0471
Heat     Score            15.8259    0.0023    0.0022
         Probability     0.000175    0.0063    0.0062
Soak     Score             1.4612    0.8683    0.8646
         Probability      0.00735    0.8176    0.8139
The Joint test in the “Exact Conditional Tests” table in Output 42.11.3 is produced by specifying the JOINT
option in the EXACT statement. The p-values for this test indicate that the parameters for Heat and Soak are
jointly significant as explanatory effects in the model. If the Heat variable is the only explanatory variable
in your model, then the rows of this table labeled as “Heat” show the joint significance of all the Heat
effect parameters in that reduced model. In this case, a model that contains only the Heat parameters still
explains a significant amount of the variability; however, you can see that a model that contains only the
Soak parameters would not be significant.
The “Exact Parameter Estimates” table in Output 42.11.4 displays parameter estimates and tests of significance
for the levels of the CLASS variables. Again, the Heat=7 parameter has some difficulties; however, in the
exact analysis, a median unbiased estimate is computed for the parameter instead of a maximum likelihood
estimate. The confidence limits show that the Heat variable contains some explanatory power, while the
categorical Soak variable is insignificant and can be dropped from the model.
Output 42.11.4 Exact Parameter Estimates

Exact Parameter Estimates

                               Standard      95% Confidence       Two-sided
Parameter         Estimate        Error          Limits             p-Value
Heat       7      -2.7552*            .   -Infinity    -0.7864       0.0199
Heat       14     -3.0255        1.0128     -5.7450    -0.6194       0.0113
Heat       27     -1.7846        0.8065     -3.6779     0.2260       0.0844
Soak       1      -0.3231        1.1717     -2.8673     3.6754       1.0000
Soak       1.7     0.5375        1.1284     -1.8056     4.4588       1.0000
Soak       2.2     0.4035        1.2347     -2.5785     4.5054       1.0000
Soak       2.8    -0.1661        1.4214     -4.5490     4.2168       1.0000

NOTE: * indicates a median unbiased estimate.
NOTE: If you want to make predictions from the exact results, you can obtain an estimate for the intercept
parameter by specifying the INTERCEPT keyword in the EXACT statement. You should also remove the
JOINT option to reduce the amount of time and memory consumed.
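The following statements are a sketch of how the preceding note might be applied to this example. They assume that the INTERCEPT keyword is listed before the effects in the EXACT statement and that all other statements are unchanged from the earlier invocation; the statements have not been run here, and the run time and memory use are not known.

```sas
/* Sketch: request an exact intercept estimate and omit the JOINT option. */
proc genmod data=ingots;
   class Heat Soak / param=ref;
   model Notready=Heat Soak / offset=lnTotal dist=Poisson link=log;
   exact Intercept Heat Soak / estimate;   /* INTERCEPT keyword added, JOINT removed */
   exactoptions statustime=10;
run;
```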
Example 42.12: Tweedie Regression
The following SAS statements simulate 250 observations from an underlying Tweedie generalized linear model
(GLM) by exploiting the connection between the Tweedie distribution and the compound Poisson distribution. A
natural logarithm link function is assumed for modeling the response variable (yTweedie). There are five
categorical variables (C1–C5), each of which has four numerical levels, and two continuous variables (D1 and
D2). By design, two of the categorical variables, C3 and C4, and one of the two continuous variables, D2,
have no effect on the response. The dispersion parameter is set to 0.5, and the power parameter is set to 1.5.
%let nObs   = 250;
%let nClass = 5;
%let nLevs  = 4;
%let seed   = 100;

data tmp1;
   array c{&nClass};
   keep c1-c&nClass yTweedie d1 d2;
   /* Tweedie parms */
   phi=0.5;
   p=1.5;
   do i=1 to &nObs;
      do j=1 to &nClass;
         c{j} = int(ranuni(1)*&nLevs);
      end;
      d1 = ranuni(&seed);
      d2 = ranuni(&seed);
      xBeta = 0.5*((c2<2) - 2*(c1=1) + 0.5*c&nClass + 0.05*d1);
      mu    = exp(xBeta);
      /* Poisson distributions parms */
      lambda = mu**(2-p)/(phi*(2-p));
      /* Gamma distribution parms */
      alpha = (2-p)/(p-1);
      gamma = phi*(p-1)*(mu**(p-1));
      rpoi = ranpoi(&seed,lambda);
      if rpoi=0 then yTweedie=0;
      else do;
         yTweedie=0;
         do j=1 to rpoi;
            yTweedie = yTweedie + rangam(&seed,alpha);
         end;
         yTweedie = yTweedie * gamma;
      end;
      output;
   end;
run;
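The simulation draws a Poisson number of gamma variates and sums them. The following DATA step is a sketch that checks, for one illustrative mean, that the Poisson and gamma parameters used above give a response with mean mu and variance phi*mu**p. It assumes the standard moment formulas for a Poisson sum of independent gamma variates; the value mu=1 and the variables meanY, varY, targetVar, and probZero are illustrative and do not come from the example.

```sas
/* Sketch: moments implied by the compound Poisson-gamma construction. */
/* mu=1 and the names meanY, varY, targetVar, probZero are illustrative. */
data _null_;
   phi = 0.5;
   p   = 1.5;
   mu  = 1;                                   /* illustrative mean value            */
   lambda = mu**(2-p)/(phi*(2-p));            /* Poisson mean (number of terms)     */
   alpha  = (2-p)/(p-1);                      /* gamma shape of each term           */
   gamma  = phi*(p-1)*(mu**(p-1));            /* gamma scale of each term           */
   meanY     = lambda*alpha*gamma;            /* E(Y) = lambda * E(term)            */
   varY      = lambda*alpha*(1+alpha)*gamma**2;  /* Var(Y) = lambda * E(term**2)    */
   targetVar = phi*mu**p;                     /* Tweedie variance phi * mu**p       */
   probZero  = exp(-lambda);                  /* probability that Y is exactly zero */
   put meanY= varY= targetVar= probZero=;
run;
```

In this construction meanY equals mu and varY equals targetVar for any positive mu when 1 < p < 2, and the positive probability of an exact zero is what distinguishes the Tweedie response from a purely continuous one.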
The following SAS statements invoke PROC GENMOD to fit the Tweedie GLM with the log link using all
of the categorical and continuous variables. A Type III analysis is requested by the TYPE3 option in the
MODEL statement.
proc genmod data=tmp1;
class C1-C5;
model yTweedie = C1-C5 D1 D2 / dist=Tweedie type3;
run;
The “Criteria For Assessing Goodness Of Fit” table is displayed in Output 42.12.1. The scaled Pearson chi-square
divided by its degrees of freedom (1.0844) is close to 1, indicating that the specified model fits the data well.
Output 42.12.1 Tweedie Goodness of Fit Criteria

The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion                   DF        Value    Value/DF
Pearson Chi-Square         232     101.9124      0.4393
Scaled Pearson X2          232     251.5826      1.0844
Log Likelihood                    -297.2106
Full Log Likelihood               -297.2106
AIC (smaller is better)            634.4212
AICC (smaller is better)           638.0893
BIC (smaller is better)            704.8504
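The “LR Statistics For Type 3 Analysis” table is displayed in Output 42.12.2. As expected, the p-values for
C3 and D2, which have no effect on the response by design, are not statistically significant at the 5% level.
C4 also has no effect by design, but its p-value of 0.0247 falls below 0.05 in this simulated sample; it is not
significant at the 1% level. D1 is not significant (p = 0.9595), which is consistent with the small coefficient
(0.5 × 0.05 = 0.025) that it receives in the simulation.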
Output 42.12.2 Type III Analysis of Covariate Effects

LR Statistics For Type 3 Analysis

Source    DF   ChiSquare   Pr > ChiSq
c1         3       85.46       <.0001
c2         3       48.18       <.0001
c3         3        0.56       0.9050
c4         3        9.38       0.0247
c5         3       47.76       <.0001
d1         1        0.00       0.9595
d2         1        1.31       0.2518
You can fix the power parameter for fitting the Tweedie GLM by using the P= option. The following SAS
statements fit the model for C1, C2 and D1, while holding the power parameter at 1.5:
proc genmod data=tmp1;
class C1 C2;
model yTweedie = C1 C2 D1 / dist=Tweedie(p=1.5) type3;
run;
The parameter estimates are displayed in Output 42.12.3.
Output 42.12.3 Tweedie Maximum Likelihood Parameter Estimates

The GENMOD Procedure

Analysis Of Maximum Likelihood Parameter Estimates

                                Standard    Wald 95% Confidence       Wald
Parameter        DF   Estimate     Error          Limits          Chi-Square   Pr > ChiSq
Intercept         1     0.3440    0.1347     0.0801     0.6080          6.53       0.0106
c1          0     1    -0.0722    0.1101    -0.2880     0.1436          0.43       0.5120
c1          1     1    -0.8952    0.1196    -1.1296    -0.6607         56.01       <.0001
c1          2     1     0.0770    0.1073    -0.1334     0.2873          0.51       0.4733
c1          3     0     0.0000    0.0000     0.0000     0.0000           .          .
c2          0     1     0.6138    0.1161     0.3862     0.8414         27.93       <.0001
c2          1     1     0.5103    0.1150     0.2849     0.7356         19.70       <.0001
c2          2     1     0.1001    0.1215    -0.1380     0.3381          0.68       0.4099
c2          3     0     0.0000    0.0000     0.0000     0.0000           .          .
d1                1    -0.0211    0.1493    -0.3136     0.2714          0.02       0.8876
Dispersion        1     0.4951    0.0398     0.4172     0.5731
Power             0     1.5000    0.0000     1.5000     1.5000

NOTE: The Tweedie dispersion parameter was estimated by maximum likelihood.
NOTE: The Tweedie power parameter was held fixed.
References
Agresti, A. (2002), Categorical Data Analysis, 2nd Edition, New York: John Wiley & Sons.
Aitkin, M., Anderson, D. A., Francis, B., and Hinde, J. (1989), Statistical Modelling in GLIM, Oxford:
Oxford Science Publications.
Akaike, H. (1979), “A Bayesian Extension of the Minimum AIC Procedure of Autoregressive Model Fitting,”
Biometrika, 66, 237–242.
Akaike, H. (1981), “Likelihood of a Model and Information Criteria,” Journal of Econometrics, 16, 3–14.
Boos, D. (1992), “On Generalized Score Tests,” American Statistician, 46, 327–333.
Cameron, A. C. and Trivedi, P. K. (1998), Regression Analysis of Count Data, Cambridge: Cambridge
University Press.
Carey, V., Zeger, S. L., and Diggle, P. J. (1993), “Modelling Multivariate Binary Data with Alternating
Logistic Regressions,” Biometrika, 80, 517–526.
Collett, D. (2003), Modelling Binary Data, 2nd Edition, London: Chapman & Hall.
Cook, R. D. and Weisberg, S. (1982), Residuals and Influence in Regression, New York: Chapman & Hall.
Cox, D. R. and Snell, E. J. (1989), The Analysis of Binary Data, 2nd Edition, London: Chapman & Hall.
Davison, A. C. and Snell, E. J. (1991), “Residuals and Diagnostics,” in D. V. Hinkley, N. Reid, and E. J.
Snell, eds., Statistical Theory and Modelling, London: Chapman & Hall.
Diggle, P. J., Liang, K.-Y., and Zeger, S. L. (1994), Analysis of Longitudinal Data, Oxford: Clarendon Press.
Dobson, A. (1990), An Introduction to Generalized Linear Models, London: Chapman & Hall.
Dunn, P. K. and Smyth, G. K. (2005), “Series Evaluation of Tweedie Exponential Dispersion Model Densities,”
Statistics and Computing, 15, 267–280.
Dunn, P. K. and Smyth, G. K. (2008), “Series Evaluation of Tweedie Exponential Dispersion Model Densities
by Fourier Inversion,” Statistics and Computing, 18, 73–86.
Firth, D. (1991), “Generalized Linear Models,” in D. V. Hinkley, N. Reid, and E. J. Snell, eds., Statistical
Theory and Modelling, London: Chapman & Hall.
Fischl, M. A., Richman, D. D., and Hansen, N. (1990), “The Safety and Efficacy of Zidovudine (AZT) in the
Treatment of Subjects with Mildly Symptomatic Human Immunodeficiency Virus Type I (HIV) Infection,”
Annals of Internal Medicine, 112, 727–737.
Gamerman, D. (1997), “Sampling from the Posterior Distribution in Generalized Linear Models,” Statistics
and Computing, 7, 57–68.
Gilks, W. R. (2003), “Adaptive Metropolis Rejection Sampling (ARMS),” software from MRC Biostatistics Unit,
Cambridge, UK, http://www.maths.leeds.ac.uk/~wally.gilks/adaptive.rejection/web_page/Welcome.html.
Gilks, W. R., Best, N. G., and Tan, K. K. C. (1995), “Adaptive Rejection Metropolis Sampling within Gibbs
Sampling,” Applied Statistics, 44, 455–472.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996), Markov Chain Monte Carlo in Practice, London:
Chapman & Hall.
Gilks, W. R. and Wild, P. (1992), “Adaptive Rejection Sampling for Gibbs Sampling,” Applied Statistics, 41,
337–348.
Hardin, J. W. and Hilbe, J. M. (2003), Generalized Estimating Equations, Boca Raton, FL: Chapman &
Hall/CRC.
Hilbe, J. M. (1994), “Log Negative Binomial Regression Using the GENMOD Procedure,” in Proceedings of
the Nineteenth Annual SAS Users Group International Conference, Cary, NC: SAS Institute Inc.
Hilbe, J. M. (2007), Negative Binomial Regression, New York: Cambridge University Press.
Hilbe, J. M. (2009), Logistic Regression Models, London: Chapman & Hall/CRC.
Hirji, K. F., Mehta, C. R., and Patel, N. R. (1987), “Computing Distributions for Exact Logistic Regression,”
Journal of the American Statistical Association, 82, 1110–1117.
Hougaard, P. (1986), “Survival Models for Heterogeneous Populations Derived from Stable Distributions,”
Biometrika, 73, 387–396.
Ibrahim, J. G., Chen, M.-H., and Lipsitz, S. R. (1999), “Monte Carlo EM for Missing Covariates in Parametric
Regression Models,” Biometrics, 55, 591–596.
Ibrahim, J. G., Chen, M.-H., and Sinha, D. (2001), Bayesian Survival Analysis, New York: Springer-Verlag.
Ibrahim, J. G. and Laud, P. W. (1991), “On Bayesian Analysis of Generalized Linear Models Using Jeffreys’
Prior,” Journal of the American Statistical Association, 86, 981–986.
Lambert, D. (1992), “Zero-Inflated Poisson Regression with an Application to Defects in Manufacturing,”
Technometrics, 34, 1–14.
Lawless, J. F. (1987), “Negative Binomial and Mixed Poisson Regression,” Canadian Journal of Statistics,
15, 209–225.
Lawless, J. F. (2003), Statistical Models and Methods for Lifetime Data, 2nd Edition, New York: John Wiley
& Sons.
Liang, K.-Y. and Zeger, S. L. (1986), “Longitudinal Data Analysis Using Generalized Linear Models,”
Biometrika, 73, 13–22.
Lin, D. Y., Wei, L. J., and Ying, Z. (2002), “Model-Checking Techniques Based on Cumulative Residuals,”
Biometrics, 58, 1–12.
Lipsitz, S. R., Fitzmaurice, G. M., Orav, E. J., and Laird, N. M. (1994), “Performance of Generalized
Estimating Equations in Practical Situations,” Biometrics, 50, 270–278.
Lipsitz, S. R., Kim, K., and Zhao, L. (1994), “Analysis of Repeated Categorical Data Using Generalized
Estimating Equations,” Statistics in Medicine, 13, 1149–1163.
Littell, R. C., Freund, R. J., and Spector, P. C. (1991), SAS System for Linear Models, 3rd Edition, Cary, NC:
SAS Institute Inc.
Long, J. S. (1997), Regression Models for Categorical and Limited Dependent Variables, Thousand Oaks,
CA: Sage Publications.
McCullagh, P. (1983), “Quasi-likelihood Functions,” Annals of Statistics, 11, 59–67.
McCullagh, P. and Nelder, J. A. (1989), Generalized Linear Models, 2nd Edition, London: Chapman & Hall.
Meeker, W. Q. and Escobar, L. A. (1998), Statistical Methods for Reliability Data, New York: John Wiley &
Sons.
Mehta, C. R., Patel, N. R., and Senchaudhuri, P. (1992), “Exact Stratified Linear Rank Tests for Ordered
Categorical and Binary Data,” Journal of Computational and Graphical Statistics, 1, 21–40.
Miller, M. E., Davis, C. S., and Landis, J. R. (1993), “The Analysis of Longitudinal Polytomous Data:
Generalized Estimating Equations and Connections with Weighted Least Squares,” Biometrics, 49, 1033–
1044.
Muller, K. E. and Fetterman, B. A. (2002), Regression and ANOVA: An Integrated Approach Using SAS
Software, Cary, NC: SAS Institute Inc.
Myers, R. H., Montgomery, D. C., and Vining, G. G. (2002), Generalized Linear Models with Applications
in Engineering and the Sciences, New York: John Wiley & Sons.
Nelder, J. A. and Wedderburn, R. W. M. (1972), “Generalized Linear Models,” Journal of the Royal Statistical
Society, Series A, 135, 370–384.
Nelson, W. (1982), Applied Life Data Analysis, New York: John Wiley & Sons.
Neter, J., Kutner, M. H., Nachtsheim, C. J., and Wasserman, W. (1996), Applied Linear Statistical Models,
4th Edition, Chicago: Irwin.
Pan, W. (2001), “Akaike’s Information Criterion in Generalized Estimating Equations,” Biometrics, 57,
120–125.
Pregibon, D. (1981), “Logistic Regression Diagnostics,” Annals of Statistics, 9, 705–724.
Preisser, J. S. and Qaqish, B. F. (1996), “Deletion Diagnostics for Generalised Estimating Equations,”
Biometrika, 83, 551–562.
Rao, C. R. (1973), Linear Statistical Inference and Its Applications, 2nd Edition, New York: John Wiley &
Sons.
Rotnitzky, A. and Jewell, N. P. (1990), “Hypothesis Testing of Regression Parameters in Semiparametric
Generalized Linear Models for Cluster Correlated Data,” Biometrika, 77, 485–497.
Royall, R. M. (1986), “Model Robust Inference Using Maximum Likelihood Estimators,” International
Statistical Review, 54, 221–226.
Searle, S. R. (1971), Linear Models, New York: John Wiley & Sons.
Simonoff, J. S. (2003), Analyzing Categorical Data, New York: Springer-Verlag.
Smyth, G. K. (1996), “Regression Analysis of Quantity Data with Exact Zeros,” in R. J. Wilson, S. Osaki,
and D. N. P. Murthy, eds., Proceedings of the Second Australia-Japan Workshop on Stochastic Models in
Engineering, Technology, and Management, Queensland: Technology Management Centre, University of
Queensland.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and Van der Linde, A. (2002), “Bayesian Measures of Model
Complexity and Fit,” Journal of the Royal Statistical Society, Series B, 64(4), 583–616, with discussion.
Stokes, M. E., Davis, C. S., and Koch, G. G. (2000), Categorical Data Analysis Using the SAS System, 2nd
Edition, Cary, NC: SAS Institute Inc.
Thall, P. F. and Vail, S. C. (1990), “Some Covariance Models for Longitudinal Count Data with Overdispersion,” Biometrics, 46, 657–671.
Tweedie, M. C. K. (1984), “An Index Which Distinguishes between Some Important Exponential Families,”
in J. K. Ghosh and J. Roy, eds., Statistics: Applications and New Directions—Proceedings of the Indian
Statistical Institute Golden Jubilee International Conference, 579–604.
Ware, J. H., Dockery, S. A., III, Speizer, F. E., and Ferris, B. G., Jr. (1984), “Passive Smoking, Gas Cooking,
and Respiratory Health of Children Living in Six Cities,” American Review of Respiratory Diseases, 129,
366–374.
White, H. (1982), “Maximum Likelihood Estimation of Misspecified Models,” Econometrica, 50, 1–25.
Williams, D. A. (1987), “Generalized Linear Model Diagnostics Using the Deviance and Single Case
Deletions,” Applied Statistics, 36, 181–191.
Zeger, S. L., Liang, K.-Y., and Albert, P. S. (1988), “Models for Longitudinal Data: A Generalized Estimating
Equation Approach,” Biometrics, 44, 1049–1060.
logistic regression
GENMOD procedure, 2873, 3018
Hessian matrix
GENMOD procedure, 2962
offset
GENMOD procedure, 2942, 3005
offset variable
GENMOD procedure, 2877
ordinal model
GENMOD procedure, 3026
output data sets
GENMOD procedure, 2999, 3000
output ODS Graphics table names
GENMOD procedure, 3016
output table names
GENMOD procedure, 3012
overdispersion
GENMOD procedure, 2966
information matrix
expected (GENMOD), 2963
observed (GENMOD), 2963
initial values
GENMOD procedure, 2938, 2950
intercept
GENMOD procedure, 2875, 2878, 2940
inverse Gaussian distribution
GENMOD procedure, 2957
Lagrange multiplier
statistics (GENMOD), 2972
life data
GENMOD procedure, 3023
likelihood residuals
GENMOD procedure, 2974
linear model
GENMOD procedure, 2872, 2873
linear predictor
GENMOD procedure, 2871, 2872, 2878, 2968, 3005
link function
built-in (GENMOD), 2874, 2939
GENMOD procedure, 2871, 2873, 2960
user-defined (GENMOD), 2931
log-likelihood
functions (GENMOD), 2960
main effects
GENMOD procedure, 2967
maximum likelihood
estimation (GENMOD), 2962
model assessment, 3041, 3048
model checking, 3041, 3048
multinomial
distribution (GENMOD), 2958
models (GENMOD), 2975
negative binomial distribution
GENMOD procedure, 2958
nested effects
GENMOD procedure, 2967
Newton-Raphson algorithm
GENMOD procedure, 2962
normal distribution
GENMOD procedure, 2957
parameter estimates
GENMOD procedure, 3010
Pearson residuals
GENMOD procedure, 2973, 2974
Pearson’s chi-square
GENMOD procedure, 2936, 2963, 2964
Poisson distribution
GENMOD procedure, 2958
Poisson regression
GENMOD procedure, 2874, 2876
polynomial effects
GENMOD procedure, 2967
probability distribution
built-in (GENMOD), 2874, 2937
exponential family (GENMOD), 2956
user-defined (GENMOD), 2922
profile likelihood confidence intervals
GENMOD procedure, 2970
programming statements
GENMOD procedure, 2947
quasi-likelihood
functions (GENMOD), 2986
GENMOD procedure, 2967
quasi-likelihood information criterion
(GENMOD), 2986
raw residuals
GENMOD procedure, 2973
regressor effects
GENMOD procedure, 2967
repeated measures
GEE (GENMOD), 2871, 2979
residuals
GENMOD procedure, 2942, 2973, 2974
response variable
sort order of levels (GENMOD), 2903
scale parameter
GENMOD procedure, 2959
score statistics
GENMOD procedure, 2972
singularity criterion
contrast matrix (GENMOD), 2922
information matrix (GENMOD), 2943
standard error
GENMOD procedure, 3010
stratified exact logistic regression
GENMOD procedure, 2953
stratified exact Poisson regression
GENMOD procedure, 2953
subpopulation
GENMOD procedure, 2936
suppressing output
GENMOD procedure, 2904
tweedie
distribution (GENMOD), 2959
Tweedie distribution
GENMOD procedure, 3070
tweedie distribution for generalized linear models, see tweedie GLM
tweedie GLM
GENMOD procedure, 2977
Type 1 analysis
GENMOD procedure, 2875, 2968
Type 3 analysis
GENMOD procedure, 2875, 2969
variance function
GENMOD procedure, 2874
working correlation matrix
GENMOD procedure, 2950, 2951, 2979
zero-inflated
models (GENMOD), 2975
zero-inflated negative binomial
distribution (GENMOD), 2959
zero-inflated Poisson
distribution (GENMOD), 2958
Syntax Index
ABSFCONV option
MODEL statement (GENMOD), 2928
AGGREGATE= option
MODEL statement (GENMOD), 2936
ALPHA= option
ESTIMATE statement (GENMOD), 2925
EXACT statement (GENMOD), 2926
MODEL statement (GENMOD), 2936
ALPHAINIT= option
REPEATED statement (GENMOD), 2949
ASSESS statement
GENMOD procedure, 2904
BAYES statement
GENMOD procedure, 2905
BY statement
GENMOD procedure, 2915
CHECKDEPENDENCY= option
STRATA statement (GENMOD), 2954
CICONV= option
MODEL statement (GENMOD), 2936
CL option
MODEL statement (GENMOD), 2936
CLASS statement
GENMOD procedure, 2916
CLTYPE= option
EXACT statement (GENMOD), 2926
CODE statement
GENMOD procedure, 2919
CODING= option
MODEL statement (GENMOD), 2936
COEFFPRIOR= option
BAYES statement, 2906
CONTRAST statement
GENMOD procedure, 2920
CONVERGE= option
MODEL statement (GENMOD), 2936
REPEATED statement (GENMOD), 2949
CONVH= option
MODEL statement (GENMOD), 2937
CORR= option
REPEATED statement (GENMOD), 2951
CORRB option
MODEL statement (GENMOD), 2937
REPEATED statement (GENMOD), 2950
CORRW option
REPEATED statement (GENMOD), 2950
COVB option
MODEL statement (GENMOD), 2937
REPEATED statement (GENMOD), 2950
CPREFIX= option
CLASS statement (GENMOD), 2916
DATA= option
PROC GENMOD statement, 2899
DESCENDING option
CLASS statement (GENMOD), 2916
PROC GENMOD statement, 2899
DEVIANCE statement, GENMOD procedure, 2922, 2947
DIAGNOSTICS option
BAYES statement, 2907
MODEL statement (GENMOD), 2937
DISPERSIONPRIOR option
BAYES statement, 2909
DIST= option
MODEL statement (GENMOD), 2937
DIVISOR= option
ESTIMATE statement (GENMOD), 2925
DSCALE
MODEL statement (GENMOD), 2942
E option
CONTRAST statement (GENMOD), 2922
ESTIMATE statement (GENMOD), 2925
ECORRB option
REPEATED statement (GENMOD), 2950
ECOVB option
REPEATED statement (GENMOD), 2950
EFFECTPLOT statement
GENMOD procedure, 2923
ERR= option
MODEL statement (GENMOD), 2937
ESTIMATE option
EXACT statement (GENMOD), 2926
ESTIMATE statement
GENMOD procedure, 2924
EXACT statement
GENMOD procedure, 2925
EXACTMAX= option
MODEL statement (GENMOD), 2938
EXACTONLY option
PROC GENMOD statement, 2900
EXACTOPTIONS statement
GENMOD procedure, 2928
EXP option
ESTIMATE statement (GENMOD), 2925
EXPECTED option
MODEL statement (GENMOD), 2938
FCONV= option
MODEL statement (GENMOD), 2928
FREQ statement
GENMOD procedure, 2931
FWDLINK statement, GENMOD procedure, 2931, 2947
GENMOD procedure
syntax, 2898
GENMOD procedure, ASSESS statement, 2904
GENMOD procedure, BAYES statement, 2905
GENMOD procedure, BAYES statement
COEFFPRIOR= option, 2906
DIAGNOSTICS option, 2907
DISPERSIONPRIOR option, 2909
INITIAL= option, 2910
INITIALMLE option, 2910
MCSE option, 2908
METROPOLIS= option, 2910
NBI= option, 2911
NMC= option, 2911
OUTPOST= option, 2911
PLOTS option, 2912
PRECISIONPRIOR= option, 2911
RAFTERY option, 2909
SAMPLING= option, 2913
SCALEPRIOR= option, 2913
SEED= option, 2914
STATISTICS= option, 2914
THINNING= option, 2915
GENMOD procedure, BY statement, 2915
GENMOD procedure, CONTRAST statement, 2920
E option, 2922
SINGULAR= option, 2922
WALD option, 2922
GENMOD procedure, DEVIANCE statement, 2922, 2947
GENMOD procedure, ESTIMATE statement
ALPHA= option, 2925
DIVISOR= option, 2925
E option, 2925
EXP option, 2925
SINGULAR= option, 2925
GENMOD procedure, FREQ statement, 2924, 2931
GENMOD procedure, FWDLINK statement, 2931, 2947
GENMOD procedure, INVLINK statement, 2931, 2947
GENMOD procedure, MODEL statement, 2934
AGGREGATE= option, 2936
ALPHA= option, 2936
CICONV= option, 2936
CL option, 2936
CODING= option, 2936
CONVERGE= option, 2936
CONVH= option, 2937
CORRB option, 2937
COVB option, 2937
DIAGNOSTICS option, 2937
DIST= option, 2937
ERR= option, 2937
EXACTMAX= option, 2938
EXPECTED option, 2938
ID= option, 2938
INFLUENCE option, 2937
INITIAL= option, 2938
INTERCEPT= option, 2939
ITPRINT option, 2939
LINK= option, 2939
LRCI option, 2940
MAXIT= option, 2940
NOINT option, 2940
NOLOGNB option, 2940
NOSCALE option, 2940
OBSTATS option, 2940
OFFSET= option, 2942
PRED option, 2942
PREDICTED option, 2942
RESIDUALS option, 2942
SCALE= option, 2942
SCORING= option, 2943
SINGULAR= option, 2943
TYPE1 option, 2943
TYPE3 option, 2943
WALD option, 2943
WALDCI option, 2943
XVARS option, 2944
GENMOD procedure, OUTPUT statement, 2944
keyword= option, 2944
OUT= option, 2944
GENMOD procedure, PROC GENMOD statement, 2899
DATA= option, 2899
DESCENDING option, 2899
NAMELEN= option, 2900
PLOTS= option, 2900
RORDER= option, 2903
GENMOD procedure, REPEATED statement, 2895, 2948
ALPHAINIT= option, 2949
CONVERGE= option, 2949
CORR= option, 2951
CORRB option, 2950
CORRW option, 2950
COVB option, 2950
ECORRB option, 2950
ECOVB option, 2950
INITIAL= option, 2950
INTERCEPT= option, 2950
LOGOR= option, 2950
MAXITER= option, 2951
MCORRB option, 2951
MCOVB option, 2951
MODELSE option, 2951
PRINTMLE option, 2951
RUPDATE= option, 2951
SORTED option, 2951
SUBCLUSTER= option, 2951
SUBJECT= option, 2949
TYPE= option, 2951
V6CORR option, 2952
WITHIN= option, 2952
WITHINSUBJECT= option, 2952
YPAIR= option, 2952
ZDATA= option, 2952
ZROW= option, 2952
GENMOD procedure, SCWGT statement, 2955
GENMOD procedure, VARIANCE statement, 2955
GENMOD procedure, WEIGHT statement, 2955
GENMOD procedure, ZEROMODEL statement, 2955
LINK= option, 2956
GENMOD procedure, CLASS statement, 2916
CPREFIX= option, 2916
DESCENDING option, 2916
LPREFIX= option, 2916
MISSING option, 2916
ORDER= option, 2916
PARAM= option, 2917
REF= option, 2918
TRUNCATE option, 2918
GENMOD procedure, CODE statement, 2919
GENMOD procedure, EFFECTPLOT statement, 2923
GENMOD procedure, EXACT statement, 2925
ALPHA= option, 2926
CLTYPE= option, 2926
ESTIMATE option, 2926
JOINT option, 2926
JOINTONLY option, 2926
MIDPFACTOR= option, 2926
ONESIDED option, 2927
OUTDIST= option, 2927
GENMOD procedure, EXACTOPTIONS statement, 2928
GENMOD procedure, LSMESTIMATE statement, 2933
GENMOD procedure, MODEL statement
ABSFCONV option, 2928
FCONV= option, 2928
NOLOGSCALE option, 2930
XCONV= option, 2930
GENMOD procedure, PROC GENMOD statement
EXACTONLY option, 2900
ORDER= option, 2900
GENMOD procedure, SLICE statement, 2953
GENMOD procedure, STORE statement, 2953
GENMOD procedure, STRATA statement, 2953
CHECKDEPENDENCY= option, 2954
INFO option, 2954
MISSING option, 2954
NOSUMMARY option, 2954
ID= option
MODEL statement (GENMOD), 2938
INFLUENCE option
MODEL statement (GENMOD), 2937
INFO option
STRATA statement (GENMOD), 2954
INITIAL= option
BAYES statement, 2910
MODEL statement (GENMOD), 2938
REPEATED statement (GENMOD), 2950
INITIALMLE option
BAYES statement, 2910
INTERCEPT= option
MODEL statement (GENMOD), 2939
REPEATED statement (GENMOD), 2950
INVLINK statement, GENMOD procedure, 2931, 2947
ITPRINT option
MODEL statement (GENMOD), 2939
JOINT option
EXACT statement (GENMOD), 2926
JOINTONLY option
EXACT statement (GENMOD), 2926
keyword= option
OUTPUT statement (GENMOD), 2944
LINK= option
MODEL statement (GENMOD), 2939
ZEROMODEL statement (GENMOD), 2956
LOGNB option
MODEL statement (GENMOD), 2940
LOGOR= option
REPEATED statement (GENMOD), 2950
LPREFIX= option
CLASS statement (GENMOD), 2916
LRCI option
MODEL statement (GENMOD), 2940
LSMESTIMATE statement
GENMOD procedure, 2933
MAXIT= option
MODEL statement (GENMOD), 2940
MAXITER= option
REPEATED statement (GENMOD), 2951
MCORRB option
REPEATED statement (GENMOD), 2951
MCOVB option
REPEATED statement (GENMOD), 2951
MCSE option
BAYES statement, 2908
METROPOLIS= option
BAYES statement, 2910
MIDPFACTOR= option
EXACT statement (GENMOD), 2926
MISSING option
CLASS statement (GENMOD), 2916
STRATA statement (GENMOD), 2954
MODEL statement
GENMOD procedure, 2934
MODELSE option
REPEATED statement (GENMOD), 2951
NAMELEN= option
PROC GENMOD statement, 2900
NBI= option
BAYES statement, 2911
NMC= option
BAYES statement, 2911
NOINT option
MODEL statement (GENMOD), 2940
NOLOGNB option
MODEL statement (GENMOD), 2940
NOLOGSCALE option
MODEL statement (GENMOD), 2930
NOSCALE option
MODEL statement (GENMOD), 2940
NOSUMMARY option
STRATA statement (GENMOD), 2954
OBSTATS option
MODEL statement (GENMOD), 2940
OFFSET= option
MODEL statement (GENMOD), 2942
ONESIDED option
EXACT statement (GENMOD), 2927
ORDER= option
CLASS statement (GENMOD), 2916
PROC GENMOD statement, 2900
OUT= option
OUTPUT statement (GENMOD), 2944
OUTDIST= option
EXACT statement (GENMOD), 2927
OUTPOST= option
BAYES statement, 2911
OUTPUT statement
GENMOD procedure, 2944
PARAM= option
CLASS statement (GENMOD), 2917
PLOTS option
BAYES statement, 2912
PLOTS= option
PROC GENMOD statement, 2900
PRECISIONPRIOR= option
BAYES statement, 2911
PRED option
MODEL statement (GENMOD), 2942
PREDICTED option
MODEL statement (GENMOD), 2942
PRINTMLE option
REPEATED statement (GENMOD), 2951
PROC GENMOD statement, see GENMOD procedure
PSCALE
MODEL statement (GENMOD), 2942
RAFTERY option
BAYES statement, 2909
REF= option
CLASS statement (GENMOD), 2918
REPEATED statement
GENMOD procedure, 2895, 2948
RESIDUALS option
MODEL statement (GENMOD), 2942
RORDER= option
PROC GENMOD statement, 2903
RUPDATE= option
REPEATED statement (GENMOD), 2951
SAMPLING= option
BAYES statement, 2913
SCALE= option
MODEL statement (GENMOD), 2942
SCALEPRIOR= option
BAYES statement, 2913
SCORING= option
MODEL statement (GENMOD), 2943
SCWGT statement
GENMOD procedure, 2955
SEED= option
BAYES statement, 2914
SINGULAR= option
CONTRAST statement (GENMOD), 2922
ESTIMATE statement (GENMOD), 2925
MODEL statement (GENMOD), 2943
SLICE statement
GENMOD procedure, 2953
SORTED option
REPEATED statement (GENMOD), 2951
STATISTICS= option
BAYES statement (GENMOD), 2914
STORE statement
GENMOD procedure, 2953
STRATA statement
GENMOD procedure, 2953
SUBCLUSTER= option
REPEATED statement (GENMOD), 2951
SUBJECT= option
REPEATED statement (GENMOD), 2949
THINNING= option
BAYES statement (GENMOD), 2915
TRUNCATE option
CLASS statement (GENMOD), 2918
TYPE1 option
MODEL statement (GENMOD), 2943
TYPE3 option
MODEL statement (GENMOD), 2943
TYPE= option
REPEATED statement (GENMOD), 2951
V6CORR option
REPEATED statement (GENMOD), 2952
VARIANCE statement, GENMOD procedure, 2955
WALD option
CONTRAST statement (GENMOD), 2922
MODEL statement (GENMOD), 2943
WALDCI option
MODEL statement (GENMOD), 2943
WEIGHT statement
GENMOD procedure, 2955
WITHIN= option
REPEATED statement (GENMOD), 2952
WITHINSUBJECT= option
REPEATED statement (GENMOD), 2952
XCONV= option
MODEL statement (GENMOD), 2930
XVARS option
MODEL statement (GENMOD), 2944
YPAIR= option
REPEATED statement (GENMOD), 2952
ZDATA= option
REPEATED statement (GENMOD), 2952
ZEROMODEL statement
GENMOD procedure, 2955
ZROW= option
REPEATED statement (GENMOD), 2952