[R] Base Reference

STATA BASE REFERENCE MANUAL
VOLUME 1
A–H
RELEASE 11
A Stata Press Publication
StataCorp LP
College Station, Texas
Copyright © 1985–2009 by StataCorp LP
All rights reserved
Version 11
Published by Stata Press, 4905 Lakeway Drive, College Station, Texas 77845
Typeset in TeX
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
ISBN-10: 1-59718-066-1 (volumes 1–3)
ISBN-10: 1-59718-067-X (volume 1)
ISBN-10: 1-59718-068-8 (volume 2)
ISBN-10: 1-59718-069-6 (volume 3)
ISBN-13: 978-1-59718-066-5 (volumes 1–3)
ISBN-13: 978-1-59718-067-2 (volume 1)
ISBN-13: 978-1-59718-068-9 (volume 2)
ISBN-13: 978-1-59718-069-6 (volume 3)
This manual is protected by copyright. All rights are reserved. No part of this manual may be reproduced, stored
in a retrieval system, or transcribed, in any form or by any means—electronic, mechanical, photocopy, recording, or
otherwise—without the prior written permission of StataCorp LP unless permitted by the license granted to you by
StataCorp LP to use the software and documentation. No license, express or implied, by estoppel or otherwise, to any
intellectual property rights is granted by this document.
StataCorp provides this manual “as is” without warranty of any kind, either expressed or implied, including, but
not limited to, the implied warranties of merchantability and fitness for a particular purpose. StataCorp may make
improvements and/or changes in the product(s) and the program(s) described in this manual at any time and without
notice.
The software described in this manual is furnished under a license agreement or nondisclosure agreement. The software
may be copied only in accordance with the terms of the agreement. It is against the law to copy the software onto
DVD, CD, disk, diskette, tape, or any other medium for any purpose other than backup or archival purposes.
The automobile dataset appearing on the accompanying media is Copyright © 1979 by Consumers Union of U.S., Inc., Yonkers, NY 10703-1057 and is reproduced by permission from CONSUMER REPORTS, April 1979.
Stata and Mata are registered trademarks and NetCourse is a trademark of StataCorp LP.
Other brand and product names are registered trademarks or trademarks of their respective companies.
For copyright information about the software, type help copyright within Stata.
The suggested citation for this software is
StataCorp. 2009. Stata: Release 11. Statistical Software. College Station, TX: StataCorp LP.
Table of contents
intro . . . . . . . . . . . . . . . . . . Introduction to base reference manual  1

about . . . . . . . . . . . . . . . . . . Display information about your Stata  6
adoupdate . . . . . . . . . . . . . . . . . . . . . Update user-written ado-files  7
alpha . . . . . . . Compute interitem correlations (covariances) and Cronbach’s alpha  11
ameans . . . . . . . . . . . . . . . Arithmetic, geometric, and harmonic means  18
anova . . . . . . . . . . . . . . . . . . Analysis of variance and covariance  22
anova postestimation . . . . . . . . . . . . . . Postestimation tools for anova  61
areg . . . . . . . . . . . . . Linear regression with a large dummy-variable set  78
areg postestimation . . . . . . . . . . . . . . . Postestimation tools for areg  84
asclogit . . . . . . Alternative-specific conditional logit (McFadden’s choice) model  86
asclogit postestimation . . . . . . . . . . . Postestimation tools for asclogit  97
asmprobit . . . . . . . . . . Alternative-specific multinomial probit regression  104
asmprobit postestimation . . . . . . . . . . Postestimation tools for asmprobit  128
asroprobit . . . . . . . . . . Alternative-specific rank-ordered probit regression  138
asroprobit postestimation . . . . . . . . . Postestimation tools for asroprobit  151
BIC note . . . . . . . . . . . . . . . . . . Calculating and interpreting BIC  159
binreg . . . . . . . . Generalized linear models: Extensions to the binomial family  164
binreg postestimation . . . . . . . . . . . . . Postestimation tools for binreg  177
biprobit . . . . . . . . . . . . . . . . . . . . . Bivariate probit regression  181
biprobit postestimation . . . . . . . . . . . Postestimation tools for biprobit  189
bitest . . . . . . . . . . . . . . . . . . . . . . . Binomial probability test  192
bootstrap . . . . . . . . . . . . . . . . . Bootstrap sampling and estimation  197
bootstrap postestimation . . . . . . . . . . Postestimation tools for bootstrap  219
boxcox . . . . . . . . . . . . . . . . . . . . . . Box–Cox regression models  223
boxcox postestimation . . . . . . . . . . . . . Postestimation tools for boxcox  233
brier . . . . . . . . . . . . . . . . . . . . . . . Brier score decomposition  235
bsample . . . . . . . . . . . . . . . . . . . . . Sampling with replacement  241
bstat . . . . . . . . . . . . . . . . . . . . . . . Report bootstrap results  248
centile . . . . . . . . . . . . . . . . Report centile and confidence interval  256
ci . . . . . . . . . . . Confidence intervals for means, proportions, and counts  262
clogit . . . . . . . . . . . . . Conditional (fixed-effects) logistic regression  274
clogit postestimation . . . . . . . . . . . . . Postestimation tools for clogit  290
cloglog . . . . . . . . . . . . . . . . . . Complementary log-log regression  295
cloglog postestimation . . . . . . . . . . . . Postestimation tools for cloglog  304
cnsreg . . . . . . . . . . . . . . . . . . . . Constrained linear regression  307
cnsreg postestimation . . . . . . . . . . . . . Postestimation tools for cnsreg  313
constraint . . . . . . . . . . . . . . . . . . . Define and list constraints  316
copyright . . . . . . . . . . . . . . . . . . Display copyright information  319
copyright lapack . . . . . . . . . . . . . . . LAPACK copyright notification  320
copyright scintilla . . . . . . . . . . . . . . Scintilla copyright notification  321
copyright ttf2pt1 . . . . . . . . . . . . . . . ttf2pt1 copyright notification  322
correlate . . . . . . . . . Correlations (covariances) of variables or coefficients  324
cumul . . . . . . . . . . . . . . . . . . . . . . Cumulative distribution  332
cusum . . . . . . . . . . . . . . . Cusum plots and tests for binary variables  336
db . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Launch dialog 340
diagnostic plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Distributional diagnostic plots 342
display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Substitute for a hand calculator 354
do . . . . . . . . . . . . . . . . . . . . . Execute commands from a file  355
doedit . . . . . . . . . . . . . . . . . . Edit do-files and other text files  356
dotplot . . . . . . . . . . . . . . . . . . . . . Comparative scatterplots  357
dstdize . . . . . . . . . . . . . . . . . Direct and indirect standardization  364
dydx . . . . . . . . . . . . . . Calculate numeric derivatives and integrals  380
eform option . . . . . . . . . . . . . . Displaying exponentiated coefficients  386
eivreg . . . . . . . . . . . . . . . . . . . . Errors-in-variables regression  387
eivreg postestimation . . . . . . . . . . . . . Postestimation tools for eivreg  392
error messages . . . . . . . . . . . . . . . . Error messages and return codes  394
estat . . . . . . . . . . . . . . . . . . . . . . Postestimation statistics  395
estimates . . . . . . . . . . . . . . Save and manipulate estimation results  403
estimates describe . . . . . . . . . . . . . . . Describe estimation results  407
estimates for . . . . . . . . . Repeat postestimation command across models  409
estimates notes . . . . . . . . . . . . . . Add notes to estimation results  411
estimates replay . . . . . . . . . . . . . . . Redisplay estimation results  413
estimates save . . . . . . . . . . . . . . . Save and use estimation results  415
estimates stats . . . . . . . . . . . . . . . . . . . . . Model statistics  419
estimates store . . . . . . . . . . . . Store and restore estimation results  421
estimates table . . . . . . . . . . . . . . . . Compare estimation results  424
estimates title . . . . . . . . . . . . . . Set title for estimation results  430
estimation options . . . . . . . . . . . . . . . . . . . Estimation options  431
exit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Exit Stata  434
exlogistic . . . . . . . . . . . . . . . . . . . . Exact logistic regression  435
exlogistic postestimation . . . . . . . . . Postestimation tools for exlogistic  453
expoisson . . . . . . . . . . . . . . . . . . . . Exact Poisson regression  457
expoisson postestimation . . . . . . . . . . Postestimation tools for expoisson  470
fracpoly . . . . . . . . . . . . . . . . . . Fractional polynomial regression  472
fracpoly postestimation . . . . . . . . . . . Postestimation tools for fracpoly  487
frontier . . . . . . . . . . . . . . . . . . . . . Stochastic frontier models  491
frontier postestimation . . . . . . . . . . . Postestimation tools for frontier  505
fvrevar . . . . . . . . . . Factor-variables operator programming command  507
fvset . . . . . . . . . . . . . . . . . . Declare factor-variable settings  510
gllamm . . . . . . . . . . . . Generalized linear and latent mixed models  516
glm . . . . . . . . . . . . . . . . . . . . . . Generalized linear models  517
glm postestimation . . . . . . . . . . . . . . . Postestimation tools for glm  552
glogit . . . . . . . . . . . . Logit and probit regression for grouped data  558
glogit postestimation . . . Postestimation tools for glogit, gprobit, blogit, and bprobit  569
gmm . . . . . . . . . . . . . Generalized method of moments estimation  571
gmm postestimation . . . . . . . . . . . . . . Postestimation tools for gmm  628
grmeanby . . . . . . . . . Graph means and medians by categorical variables  632
hausman . . . . . . . . . . . . . . . . . . . Hausman specification test  635
heckman . . . . . . . . . . . . . . . . . . . . Heckman selection model  644
heckman postestimation . . . . . . . . . . . Postestimation tools for heckman  660
heckprob . . . . . . . . . . . . . . . Probit model with sample selection  666
heckprob postestimation . . . . . . . . . . Postestimation tools for heckprob  675
help . . . . . . . . . . . . . . . . . . . . . . . . Display online help  680
hetprob . . . . . . . . . . . . . . . . . . . Heteroskedastic probit model  682
hetprob postestimation . . . . . . . . . . . Postestimation tools for hetprob  690
histogram . . . . . . . . . Histograms for continuous and categorical variables  693
hsearch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Search help files 703
inequality . . . . . . . . . . . . . . . . . . . . . . Inequality measures  707
intreg . . . . . . . . . . . . . . . . . . . . . . . . Interval regression  710
intreg postestimation . . . . . . . . . . . . . Postestimation tools for intreg  721
ivprobit . . . . . . . . . Probit model with continuous endogenous regressors  724
ivprobit postestimation . . . . . . . . . . . Postestimation tools for ivprobit  737
ivregress . . . . . . . . . Single-equation instrumental-variables regression  741
ivregress postestimation . . . . . . . . . . Postestimation tools for ivregress  757
ivtobit . . . . . . . . . . Tobit model with continuous endogenous regressors  774
ivtobit postestimation . . . . . . . . . . . . Postestimation tools for ivtobit  785
jackknife . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jackknife estimation 789
jackknife postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for jackknife 801
kappa . . . . . . . . . . . . . . . . . . . . . . . Interrater agreement  802
kdensity . . . . . . . . . . . . . . Univariate kernel density estimation  816
ksmirnov . . . . . . . . . Kolmogorov–Smirnov equality-of-distributions test  826
kwallis . . . . . . . . . . Kruskal–Wallis equality-of-populations rank test  830
ladder . . . . . . . . . . . . . . . . . . . . . . . . Ladder of powers  833
level . . . . . . . . . . . . . . . . . . . Set default confidence level  840
lincom . . . . . . . . . . . . . . . . Linear combinations of estimators  842
linktest . . . . . . . . . Specification link test for single-equation models  849
lnskew0 . . . . . . . . . . Find zero-skewness log or Box–Cox transform  855
log . . . . . . . . . . . . . . . . . . . . Echo copy of session to file  859
logistic . . . . . . . . . . . . Logistic regression, reporting odds ratios  863
logistic postestimation . . . . . . . . . . . Postestimation tools for logistic  874
logit . . . . . . . . . . . . . Logistic regression, reporting coefficients  901
logit postestimation . . . . . . . . . . . . . . Postestimation tools for logit  914
loneway . . . . . . . . Large one-way ANOVA, random effects, and reliability  919
lowess . . . . . . . . . . . . . . . . . . . . . . . . Lowess smoothing  925
lpoly . . . . . . . . . . . . Kernel-weighted local polynomial smoothing  931
lrtest . . . . . . . . . . . . . . . Likelihood-ratio test after estimation  941
lv . . . . . . . . . . . . . . . . . . . . . . . . Letter-value displays  951
margins . . . . . . . . . . . . . . . . . . . . . Marginal means, predictive margins, and marginal effects 957
margins postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for margins 1008
matsize . . . . . . . . . . . . . . . . . . . . . . . . . . . . Set the maximum number of variables in a model 1010
maximize . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Details of iterative maximization 1012
mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Estimate means 1019
mean postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for mean 1030
meta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Meta-analysis 1032
mfp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Multivariable fractional polynomial models 1033
mfp postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for mfp 1044
misstable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tabulate missing values 1045
mkspline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Linear and restricted cubic spline construction 1053
ml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Maximum likelihood estimation 1059
mlogit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Multinomial (polytomous) logistic regression 1084
mlogit postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for mlogit 1098
more . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The —more— message 1108
mprobit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Multinomial probit regression 1110
mprobit postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for mprobit 1118
mvreg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Multivariate regression 1121
mvreg postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for mvreg 1128
nbreg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Negative binomial regression 1130
nbreg postestimation . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for nbreg and gnbreg 1142
nestreg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nested model statistics 1146
net . . . . . . . . . . . . . . . . . . . . . . . Install and manage user-written additions from the Internet 1152
net search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Search the Internet for installable packages 1170
netio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Control Internet connections 1174
news . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Report Stata news 1177
nl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nonlinear least-squares estimation 1178
nl postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for nl 1198
nlcom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nonlinear combinations of estimators 1200
nlogit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nested logit regression 1211
nlogit postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for nlogit 1234
nlsur . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Estimation of nonlinear systems of equations 1239
nlsur postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for nlsur 1261
nptrend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Test for trend across ordered groups 1263
ologit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ordered logistic regression 1267
ologit postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for ologit 1276
oneway . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . One-way analysis of variance 1280
oprobit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ordered probit regression 1290
oprobit postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for oprobit 1296
orthog . . . . . . . . . . . . . . . . . . . Orthogonalize variables and compute orthogonal polynomials 1299
pcorr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Partial and semipartial correlation coefficients 1306
permute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Monte Carlo permutation tests 1309
pk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pharmacokinetic (biopharmaceutical) data 1319
pkcollapse . . . . . . . . . . . . . . . . . . . . . . . . . . . Generate pharmacokinetic measurement dataset 1327
pkcross . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Analyze crossover experiments 1330
pkequiv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Perform bioequivalence tests 1339
pkexamine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Calculate pharmacokinetic measures 1346
pkshape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Reshape (pharmacokinetic) Latin-square data 1352
pksumm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Summarize pharmacokinetic data 1360
poisson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Poisson regression 1365
poisson postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for poisson 1375
predict . . . . . . . . . . . . . . . . . . . . . . . . . . . . Obtain predictions, residuals, etc., after estimation 1381
predictnl . . . . . . . . . . . . . Obtain nonlinear predictions, standard errors, etc., after estimation 1392
probit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Probit regression 1404
probit postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for probit 1416
proportion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Estimate proportions 1420
proportion postestimation . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for proportion 1425
prtest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . One- and two-sample tests of proportions 1426
qc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Quality control charts 1431
qreg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Quantile regression 1446
qreg postestimation . . . . . . . . . . . . . . Postestimation tools for qreg, iqreg, sqreg, and bsqreg 1466
query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Display system parameters 1468
ranksum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Equality tests on unmatched data 1474
ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Estimate ratios 1480
ratio postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for ratio 1489
reg3 . . . . . . . . . . . . . . . . . . . . Three-stage estimation for systems of simultaneous equations 1490
reg3 postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for reg3 1511
regress . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Linear regression 1514
regress postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for regress 1539
regress postestimation time series . . . . . . . Postestimation tools for regress with time series 1584
#review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Review previous commands 1594
roc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Receiver operating characteristic (ROC) analysis 1595
rocfit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fit ROC models 1616
rocfit postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for rocfit 1623
rologit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rank-ordered logistic regression 1627
rologit postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for rologit 1643
rreg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Robust regression 1645
rreg postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for rreg 1652
runtest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Test for random order 1654
sampsi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sample size and power determination 1660
saved results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Saved results 1671
scobit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Skewed logistic regression 1675
scobit postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for scobit 1684
sdtest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Variance-comparison tests 1687
search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Search Stata documentation 1694
serrbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Graph standard error bar chart 1700
set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Overview of system parameters 1703
set defaults . . . . . . . . . . . . . . . . . . . . . . . . Reset system parameters to original Stata defaults 1714
set emptycells . . . . . . . . . . . . . . . . . . . . . . . . Set what to do with empty cells in interactions 1716
set seed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Specify initial value of random-number seed 1717
signrank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Equality tests on matched data 1719
simulate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Monte Carlo simulations 1725
sj . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stata Journal and STB installation instructions 1732
sktest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Skewness and kurtosis test for normality 1735
slogit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stereotype logistic regression 1740
slogit postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for slogit 1754
smooth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Robust nonlinear smoother 1758
spearman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Spearman’s and Kendall’s correlations 1766
spikeplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Spike plots and rootograms 1775
ssc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Install and uninstall packages from SSC 1779
stem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stem-and-leaf displays 1786
stepwise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stepwise estimation 1790
suest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Seemingly unrelated estimation 1800
summarize . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Summary statistics 1819
sunflower . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Density-distribution sunflower plots 1828
sureg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zellner’s seemingly unrelated regression 1834
sureg postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for sureg 1842
swilk . . . . . . . . . . Shapiro–Wilk and Shapiro–Francia tests for normality 1845
symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Symmetry and marginal homogeneity tests 1849
table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tables of summary statistics 1858
tabstat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Display table of summary statistics 1868
tabulate oneway . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . One-way tables of frequencies 1873
tabulate twoway . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Two-way tables of frequencies 1881
tabulate, summarize() . . . . . . . . . . . . . . . . . . One- and two-way tables of summary statistics 1897
test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Test linear hypotheses after estimation 1902
testnl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Test nonlinear hypotheses after estimation 1921
tetrachoric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tetrachoric correlations for binary variables 1929
tobit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tobit regression 1939
tobit postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for tobit 1946
total . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Estimate totals 1951
total postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for total 1957
translate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Print and translate logs 1959
treatreg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Treatment-effects model 1969
treatreg postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for treatreg 1982
truncreg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Truncated regression 1985
truncreg postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for truncreg 1992
ttest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mean-comparison tests 1995
update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Update Stata 2004
vce option . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Variance estimators 2010
view . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . View files and logs 2015
vwls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Variance-weighted least squares 2019
vwls postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for vwls 2025
which . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Display location and version for an ado-file 2027
xi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Interaction expansion 2029
zinb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zero-inflated negative binomial regression 2040
zinb postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for zinb 2046
zip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zero-inflated Poisson regression 2048
zip postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for zip 2054
ztnb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zero-truncated negative binomial regression 2056
ztnb postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for ztnb 2064
ztp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zero-truncated Poisson regression 2067
ztp postestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tools for ztp 2073
Author index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2075
Subject index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2087
Cross-referencing the documentation
When reading this manual, you will find references to other Stata manuals. For example,
[U] 26 Overview of Stata estimation commands
[XT] xtabond
[D] reshape
The first example is a reference to chapter 26, Overview of Stata estimation commands, in the User’s
Guide; the second is a reference to the xtabond entry in the Longitudinal-Data/Panel-Data Reference
Manual; and the third is a reference to the reshape entry in the Data-Management Reference Manual.
All the manuals in the Stata Documentation have a shorthand notation:
[GSM]  Getting Started with Stata for Mac
[GSU]  Getting Started with Stata for Unix
[GSW]  Getting Started with Stata for Windows
[U]    Stata User’s Guide
[R]    Stata Base Reference Manual
[D]    Stata Data-Management Reference Manual
[G]    Stata Graphics Reference Manual
[XT]   Stata Longitudinal-Data/Panel-Data Reference Manual
[MI]   Stata Multiple-Imputation Reference Manual
[MV]   Stata Multivariate Statistics Reference Manual
[P]    Stata Programming Reference Manual
[SVY]  Stata Survey Data Reference Manual
[ST]   Stata Survival Analysis and Epidemiological Tables Reference Manual
[TS]   Stata Time-Series Reference Manual
[I]    Stata Quick Reference and Index
[M]    Mata Reference Manual
Detailed information about each of these manuals may be found online at
http://www.stata-press.com/manuals/
Title
intro — Introduction to base reference manual
Description
This entry describes the organization of the reference manuals.
Remarks
The complete list of reference manuals is as follows:
[R]    Stata Base Reference Manual
         Volume 1, A–H
         Volume 2, I–P
         Volume 3, Q–Z
[D]    Stata Data-Management Reference Manual
[G]    Stata Graphics Reference Manual
[XT]   Stata Longitudinal-Data/Panel-Data Reference Manual
[MI]   Stata Multiple-Imputation Reference Manual
[MV]   Stata Multivariate Statistics Reference Manual
[P]    Stata Programming Reference Manual
[SVY]  Stata Survey Data Reference Manual
[ST]   Stata Survival Analysis and Epidemiological Tables Reference Manual
[TS]   Stata Time-Series Reference Manual
[I]    Stata Quick Reference and Index
[M]    Mata Reference Manual
When we refer to the reference manuals, we mean all manuals listed above except the Mata Reference Manual.
When we refer to the Base Reference Manual, we mean just the three-volume Base Reference Manual, known as [R].
When we refer to the specialty manuals, we mean all the manuals listed above except [R] and [I], the Stata Quick Reference and Index.
Detailed information about each of these manuals can be found online at
http://www.stata-press.com/manuals/
Arrangement of the reference manuals
Each manual contains the following sections:
• Table of contents.
At the beginning of volume 1 of [R], the Base Reference Manual, is a table of contents for the
three volumes.
• Cross-referencing the documentation.
This section lists all the manuals and explains how they are cross-referenced.
• Introduction.
This entry—usually called intro—provides an overview of the manual. In the specialty manuals,
this introduction suggests entries that you might want to read first and provides information about
new features.
Each specialty manual contains an overview of the commands described in it.
• Entries.
Entries are arranged in alphabetical order. Most entries describe Stata commands, but some entries
discuss concepts, and others provide overviews.
Entries that describe estimation commands are followed by an entry discussing postestimation
commands that are available for use after the estimation command. For example, the xtlogit entry
in the [XT] manual is followed by the xtlogit postestimation entry.
• Index.
At the end of each manual is an index. The index for the entire three-volume Base Reference
Manual is found at the end of the third volume.
The Quick Reference and Index, [I], contains a combined index for all the manuals and a subject
table of contents for all the manuals and the User’s Guide. It also contains quick-reference information
on many subjects, such as the estimation commands.
To find information and commands quickly, use Stata’s search command; see [R] search. You can broaden your search to the Internet by using search, all to find commands and extensions written by Stata users.
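For example (the search keyword logistic is merely illustrative),
    . search logistic
lists documentation entries related to logistic, and
    . search logistic, all
broadens the search to include user-written materials available over the Internet.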
Arrangement of each entry
Entries in the Stata reference manuals generally contain the following sections, which are explained
below:
Syntax
Menu
Description
Options
Remarks
Saved results
Methods and formulas
References
Also see
Syntax
A command’s syntax diagram shows how to type the command, indicates all possible options, and
gives the minimal allowed abbreviations for all the items in the command. For instance, the syntax
diagram for the summarize command is
summarize [varlist] [if] [in] [weight] [, options]

  options         description
  -----------------------------------------------------------------------
  Main
    detail        display additional statistics
    meanonly      suppress the display; calculate only the mean;
                    programmer’s option
    format        use variable’s display format
    separator(#)  draw separator line after every # variables; default
                    is separator(5)
  -----------------------------------------------------------------------
varlist may contain factor variables; see [U] 11.4.3 Factor variables.
varlist may contain time-series operators; see [U] 11.4.4 Time-series varlists.
by is allowed; see [D] by.
aweights, fweights, and iweights are allowed. However, iweights may not be used with the detail
option; see [U] 11.1.6 weight.
Items in the typewriter-style font should be typed exactly as they appear in the diagram,
although they may be abbreviated. Underlining indicates the shortest abbreviations where abbreviations are allowed. For instance, summarize may be abbreviated su, sum, summ, etc., or it may be
spelled out completely. Items in the typewriter font that are not underlined may not be abbreviated.
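For instance, each of the following runs the same command (the variable mpg is illustrative):
    . summarize mpg
    . summ mpg
    . sum mpg
    . su mpg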
Square brackets denote optional items. In the syntax diagram above, varlist, if, in, weight, and the
options are optional.
The options are listed in a table immediately following the diagram, along with a brief description
of each.
Items typed in italics represent arguments for which you are to substitute variable names, observation
numbers, and the like.
The diagrams use the following symbols:

  #          Indicates a literal number, e.g., 5; see [U] 12.2 Numbers.
  [ ]        Anything enclosed in brackets is optional.
  { }        At least one of the items enclosed in braces must appear.
  |          The vertical bar separates alternatives.
  %fmt       Any Stata format, e.g., %8.2f; see [U] 12.5 Formats:
               Controlling how data are displayed.
  depvar     The dependent variable in an estimation command; see
               [U] 20 Estimation and postestimation commands.
  exp        Any algebraic expression, e.g., (5+myvar)/2; see
               [U] 13 Functions and expressions.
  filename   Any filename; see [U] 11.6 File-naming conventions.
  indepvars  The independent variables in an estimation command; see
               [U] 20 Estimation and postestimation commands.
  newvar     A variable that will be created by the current command;
               see [U] 11.4.2 Lists of new variables.
  numlist    A list of numbers; see [U] 11.1.8 numlist.
  oldvar     A previously created variable; see [U] 11.4.1 Lists of
               existing variables.
  options    A list of options; see [U] 11.1.7 options.
  range      An observation range, e.g., 5/20; see [U] 11.1.4 in range.
  "string"   Any string of characters enclosed in double quotes; see
               [U] 12.4 Strings.
  varlist    A list of variable names; see [U] 11.4 varlists. If varlist
               allows factor variables, a note to that effect will be
               shown below the syntax diagram; see [U] 11.4.3 Factor
               variables. If varlist allows time-series operators, a note
               to that effect will be shown below the syntax diagram; see
               [U] 11.4.4 Time-series varlists.
  varname    A variable name; see [U] 11.3 Naming conventions.
  weight     A [wgttype=exp] modifier; see [U] 11.1.6 weight and
               [U] 20.18 Weighted estimation.
  xvar       The variable to be displayed on the horizontal axis.
  yvar       The variable to be displayed on the vertical axis.
The Syntax section will indicate whether factor variables or time-series operators may be used
with a command. summarize allows factor variables and time-series operators.
If a command allows prefix commands, this will be indicated immediately following the table of
options. summarize allows by.
If a command allows weights, the types of weights allowed will be specified, with the default
weight listed first. summarize allows aweights, fweights, and iweights, and if the type of weight
is not specified, the default is aweights.
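For instance, assuming illustrative variables mpg and pop,
    . summarize mpg [fweight=pop]
requests frequency weights, whereas
    . summarize mpg [weight=pop]
does not specify the type of weight, so aweights are assumed.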
Menu
A menu indicates how the dialog box for the command may be accessed using the menu system.
Description
Following the syntax diagram is a brief description of the purpose of the command.
Options
If the command allows any options, they are explained here, and for dialog users the location of
the options in the dialog is indicated. For instance, in the logistic entry in this manual, the Options
section looks like this:
Model
...
SE/Robust
...
Reporting
...
Maximization
...
Remarks
The explanations under Description and Options are exceedingly brief and technical; they are
designed to provide a quick summary. The remarks explain in English what the preceding technical
jargon means. Examples are used to illustrate the command.
Saved results
Commands are classified as e-class, r-class, s-class, or n-class, according to whether they save
calculated results in e(), r(), s(), or not at all. These results can then be used in subroutines by
other programs (ado-files). Such saved results are documented here; see [U] 18.8 Accessing results
calculated by other programs and [U] 18.9 Accessing results calculated by estimation commands.
Methods and formulas
The techniques and formulas used in obtaining the results are described here as tersely and
technically as possible. If a command is implemented as an ado-file, that is indicated here.
References
Published sources are listed that either were directly referenced in the preceding text or might be
of interest.
Also see
Other manual entries relating to this entry are listed that might also interest you.
Also see
[U] 1.1 Getting Started with Stata
Title
about — Display information about your Stata
Syntax
about
Menu
Help > About
Description
about displays information about your version of Stata.
Remarks
about displays information about the release number, flavor, serial number, and license for your
Stata. If you are running Stata for Windows, information about memory is also displayed:
. about
Stata/SE 11.0 for Windows (32-bit)
Born 24 Aug 2009
Copyright (C) 1985-2009
Total physical memory:        506600 KB
Available physical memory:     83040 KB
11-user Stata network perpetual license:
Serial number: 4011041234
Licensed to: Alan R. Riley
StataCorp
or
. about
Stata/IC 11.0 for Mac (64-bit Intel)
Born 24 Aug 2009
Copyright (C) 1985-2009
Single-user Stata perpetual license:
Serial number: 30110512345
Licensed to: Chinh Nguyen
StataCorp
Also see
[R] which — Display location and version for an ado-file
[U] 5 Flavors of Stata
Title
adoupdate — Update user-written ado-files
Syntax
adoupdate [pkglist] [, options]

  options     description
  -----------------------------------------------------------------------
  update      perform update; default is to list packages that have
                updates, but not to update them
  all         include packages that might have updates; default is to
                list or update only packages that are known to have
                updates
  ssconly     check only packages obtained from SSC; default is to check
                all installed packages
  dir(dir)    check packages installed in dir; default is to check those
                installed in PLUS
  verbose     provide output to assist in debugging network problems
  -----------------------------------------------------------------------
Description
User-written additions to Stata are called packages. These packages can add remarkable abilities
to Stata. Packages are found and installed by using ssc, findit, and net.
User-written packages are updated by their developers, just as official Stata software is updated
by StataCorp.
To determine whether your official Stata software is up to date, and to update it if it is not, you
use update.
To determine whether your user-written additions are up to date, and to update them if they are
not, you use adoupdate.
Options
update specifies that packages with updates be updated. The default is simply to list the packages
that could be updated without actually performing the update.
The first time you adoupdate, do not specify this option. Once you see adoupdate work, you
will be more comfortable with it. Then type
. adoupdate, update
The packages that can be updated will be listed and updated.
all is rarely specified. Sometimes, adoupdate cannot determine whether a package you previously
installed has been updated. adoupdate can determine that the package is still available over the
web but is unsure whether the package has changed. Usually, the package has not changed, but
if you want to be certain that you are using the latest version, reinstall from the source.
Specifying all does this. Typing
. adoupdate, all
adds such packages to the displayed list as needing updating but does not update them. Typing
. adoupdate, update all
lists such packages and updates them.
ssconly is a popular option. Many packages are available from the Statistical Software Components
(SSC) archive—often called the Boston College Archive—which is provided at http://repec.org.
Many users find most of what they want there. See [R] ssc for more information on the SSC.
ssconly specifies that adoupdate check only packages obtained from that source. Specifying
this option is popular because SSC always provides distribution dates, and so adoupdate can be
certain whether an update exists.
dir(dir) specifies which installed packages be checked. The default is dir(PLUS), and that is
probably correct. If you are responsible for maintaining a large system, however, you may have
previously installed packages in dir(SITE), where they are shared across users. See [P] sysdir
for an explanation of these directory codewords. You may also specify an actual directory name,
such as C:\mydir.
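For instance, a site maintainer might type (a sketch; see [P] sysdir for the meaning of SITE):
    . adoupdate, dir(SITE)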
verbose is specified when you suspect network problems. It provides more detailed output that may
help you diagnose the problem.
Remarks
Do not confuse adoupdate with update. Use adoupdate to update user-written files. Use update
to update the components (including ado-files) of the official Stata software. To use either command,
you must be connected to the Internet.
Remarks are presented under the following headings:
Using adoupdate
Possible problem the first time you run adoupdate and the solution
Notes for developers
Using adoupdate
The first time you try adoupdate, type
. adoupdate
That is, do not specify the update option. adoupdate without update produces a report but does
not update any files. The first time you run adoupdate, you may see messages such as
. adoupdate
(note: package utx was installed more than once; older copy removed)
(remaining output omitted)
Having the same packages installed multiple times is common; adoupdate cleans that up.
The second time you run adoupdate, pick one package to update. Suppose that the report indicates
that package st0008 has an update available. Type
. adoupdate st0008, update
You can specify one or many packages after the adoupdate command. You can even use wildcards
such as st* to mean all packages that start with st or st*8 to mean all packages that start with st
and end with 8. You can do that with or without the update option.
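For example,
    . adoupdate st*, update
updates every installed package whose name starts with st.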
Finally, you can let adoupdate update all your user-written additions:
. adoupdate, update
Possible problem the first time you run adoupdate and the solution
The first time you run adoupdate, you might get many duplicate messages:
. adoupdate
(note: package ___ installed more than once; older copy removed)
(note: package ___ installed more than once; older copy removed)
(note: package ___ installed more than once; older copy removed)
  ...
(note: package ___ installed more than once; older copy removed)
(remaining output omitted)
Some users have hundreds of duplicates. You might even see the same package name repeated
more than once:
(note: package stylus installed more than once; older copy removed)
(note: package stylus installed more than once; older copy removed)
That means that the package was duplicated twice.
Stata tolerates duplicates, and you did nothing wrong when you previously installed and updated
packages. adoupdate, however, needs the duplicates removed, mainly so that it does not keep
checking the same files.
The solution is to just let adoupdate run. adoupdate will run faster next time, when there are
no (or just a few) duplicates.
Notes for developers
adoupdate reports whether an installed package is up to date by comparing its distribution date
with that of the package available over the web.
If you are distributing software, include the line
d Distribution-Date: date
somewhere in your .pkg file. The capitalization of Distribution-Date does not matter, but include
the hyphen and the colon as shown. Code the date in either of two formats:
all numeric:     yyyymmdd, for example, 20090701
Stata standard:  ddMONyyyy, for example, 01jul2009
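For example, a package distributed on 1 July 2009 might carry the following line in its .pkg file (a minimal sketch):
    d Distribution-Date: 20090701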
Saved results
adoupdate saves the following in r():
Macros
r(pkglist) a space-separated list of package names that need updating (update not specified)
or that were updated (update specified)
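For example, after a dry run you can inspect the list (a sketch; r(pkglist) is returned as a macro):
    . adoupdate
    . display "`r(pkglist)'"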
Methods and formulas
adoupdate is implemented as an ado-file.
Also see
[R] ssc — Install and uninstall packages from SSC
[R] search — Search Stata documentation
[R] net — Install and manage user-written additions from the Internet
[R] update — Update Stata
Title
alpha — Compute interitem correlations (covariances) and Cronbach’s alpha
Syntax
alpha varlist [if] [in] [, options]

  options             description
  -----------------------------------------------------------------------
  Options
    asis              take sign of each item as is
    casewise          delete cases with missing values
    detail            list individual interitem correlations and
                        covariances
    generate(newvar)  save the generated scale in newvar
    item              display item-test and item-rest correlations
    label             include variable labels in output table
    min(#)            must have at least # observations for inclusion
    reverse(varlist)  reverse signs of these variables
    std               standardize items in the scale to mean 0, variance 1
  -----------------------------------------------------------------------
by is allowed; see [D] by.
Menu
Statistics > Multivariate analysis > Cronbach’s alpha
Description
alpha computes the interitem correlations or covariances for all pairs of variables in varlist and
Cronbach’s α statistic for the scale formed from them. At least two variables must be specified with
alpha.
Options
Options
asis specifies that the sense (sign) of each item be taken as presented in the data. The default is to
determine the sense empirically and reverse the scorings for any that enter negatively.
casewise specifies that cases with missing values be deleted listwise. The default is pairwise
computation of covariances and correlations.
detail lists the individual interitem correlations and covariances.
generate(newvar) specifies that the scale constructed from varlist be stored in newvar. Unless asis
is specified, the sense of items entering negatively is automatically reversed. If std is also specified,
the scale is constructed by using standardized (mean 0, variance 1) values of the individual items.
Unlike most Stata commands, generate() does not use casewise deletion. A score is created
for every observation for which there is a response to at least one item (one variable in varlist
is not missing). The summative score is divided by the number of items over which the sum is
calculated.
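For instance (the item names q1–q5 are illustrative),
    . alpha q1-q5, gen(score)
computes α for the five items and stores the resulting scale in the new variable score.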
item specifies that item-test and item-rest correlations and the effects of removing an item from the
scale be displayed. item is valid only when more than two variables are specified in varlist.
label requests that the detailed output table be displayed in a compact format that enables the
inclusion of variable labels.
min(#) specifies that only cases with at least # observations be included in the computations.
casewise is a shorthand for min(k), where k is the number of variables in varlist.
reverse(varlist) specifies that the signs (directions) of the variables (items) in varlist be reversed.
Any variables specified in reverse() that are not also included in alpha’s varlist are ignored.
std specifies that the items in the scale be standardized (mean 0, variance 1) before summing.
Remarks
Cronbach’s alpha (Cronbach 1951) assesses the reliability of a summative rating (Likert 1932)
scale composed of the variables (called items) specified. The set of items is often called a test or
battery. A scale is simply the sum of the individual item scores, reversing the scoring for statements
that have negative correlations with the factor (e.g., attitude) being measured. Scales can be formed
by using the raw item scores or standardized item scores.
The reliability α is defined as the square of the correlation between the measured scale and the
underlying factor. If you think of a test as being composed of a random sample of items from a
hypothetical domain of items designed to measure the same thing, α represents the expected correlation
of one test with an alternative form containing the same number of items. The square root of α is
the estimated correlation of a test with errorless true scores (Nunnally and Bernstein 1994, 235).
In addition to reporting α, alpha generates the summative scale from the items (variables) specified
and automatically reverses the sense of any when necessary. Stata’s decision can be overridden by
specifying the reverse(varlist) option.
Because it concerns reliability in measuring an unobserved factor, α is related to factor analysis.
The test should be designed to measure one factor, and, because the scale will be composed of an
unweighted sum, the factor loadings should all contribute roughly equal information to the score.
Both of these assumptions can be verified with factor; see [MV] factor. Equality of factor loadings
can also be assessed by using the item option.
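For instance, a minimal sketch of such a check, assuming the items are in memory; a dominant first factor with loadings of comparable size supports forming an unweighted sum:

. factor price headroom rep78 trunk weight length turn displ, pf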
Example 1
To illustrate alpha, we apply it, first without and then with the item option, to the automobile
dataset after randomly introducing missing values:
. use http://www.stata-press.com/data/r11/automiss
(1978 Automobile Data)
. alpha price headroom rep78 trunk weight length turn displ, std
Test scale = mean(standardized items)
Reversed item:  rep78

Average interitem correlation:        0.5251
Number of items in the scale:              8
Scale reliability coefficient:        0.8984
The scale derived from our somewhat arbitrarily chosen automobile items (variables) appears to be reasonable because the estimated correlation between it and the underlying factor it measures is √0.8984 ≈ 0.9478 and the estimated correlation between this battery of eight items and all other eight-item batteries from the same domain is 0.8984. Because the “items” are not on the same scale,
it is important that std was specified so that the scale and its reliability were based on the sum
of standardized variables. We could obtain the scale in a new variable called sc with the gen(sc)
option.
Though the scale appears reasonable, we include the item option to determine if all the items fit
the scale:
. alpha price headroom rep78 trunk weight length turn displ, std item
Test scale = mean(standardized items)
                                                            average
                             item-test     item-rest     interitem
Item            Obs  Sign   correlation   correlation   correlation     alpha

price            70    +        0.5260        0.3719        0.5993    0.9128
headroom         66    +        0.6716        0.5497        0.5542    0.8969
rep78            61    -        0.4874        0.3398        0.6040    0.9143
trunk            69    +        0.7979        0.7144        0.5159    0.8818
weight           64    +        0.9404        0.9096        0.4747    0.8635
length           69    +        0.9382        0.9076        0.4725    0.8625
turn             66    +        0.8678        0.8071        0.4948    0.8727
displacement     63    +        0.8992        0.8496        0.4852    0.8684

Test scale                                                  0.5251    0.8984
“Test” denotes the additive scale; here 0.5251 is the average interitem correlation, and 0.8984 is the alpha coefficient for a test scale based on all items.
“Obs” shows the number of nonmissing values of the items; “Sign” indicates the direction in
which an item variable entered the scale; “-” denotes that the item was reversed. The remaining four
columns in the table provide information on the effect of one item on the scale.
Column four gives the item-test correlations. Apart from the sign of the correlation for items that
entered the scale in reversed order, these correlations are the same numbers as those computed by
the commands
. alpha price headroom rep78 trunk weight length turn displ, std gen(sc)
. pwcorr sc price headroom rep78 trunk weight length turn displ
Typically, the item-test correlations should be roughly the same for all items. Item-test correlations
may not be adequate to detect items that fit poorly because the poorly fitting items may distort the scale.
Accordingly, it may be more useful to consider item-rest correlations (Nunnally and Bernstein 1994),
i.e., the correlation between an item and the scale that is formed by all other items. The average
interitem correlations (covariances if std is omitted) of all items, excluding one, are shown in column
six. Finally, column seven gives Cronbach’s α for the test scale, which consists of all but the one
item.
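As a sketch of that definition, the item-rest correlation of price can be reproduced by hand (the variable name rest_price is ours), up to differences in how missing values are handled:

. alpha headroom rep78 trunk weight length turn displ, std gen(rest_price)
. pwcorr price rest_price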
Here neither the price item nor the rep78 item seems to fit well in the scale in all respects.
The item-test and item-rest correlations of price and rep78 are much lower than those of the other
items. The average interitem correlation increases substantially by removing either price or rep78;
apparently, they do not correlate strongly with the other items. Finally, we see that Cronbach’s α
coefficient will increase from 0.8984 to 0.9128 if the price item is dropped, and it will increase
from 0.8984 to 0.9143 if rep78 is dropped. For well-fitting items, we would of course expect that
α decreases by shortening the test.
Example 2
The variable names for the automobile data are reasonably informative. This may not always
be true; items in batteries commonly used to measure personality traits, attitudes, values, etc., are
usually named with indexed names such as item12a, item12b, etc. The label option forces alpha
to produce the same statistical information in a more compact format that leaves room to include
variable (item) labels. In this compact format, alpha excludes the number of nonmissing values of
the items, displays the statistics using fewer digits, and uses somewhat cryptic headers:
. alpha price headroom rep78 trunk weight length turn displ, std item label detail
Test scale = mean(standardized items)
Items           S  it-cor  ir-cor  ii-cor   alpha  label

price           +   0.526   0.372   0.599   0.913  Price
headroom        +   0.672   0.550   0.554   0.897  Headroom (in.)
rep78           -   0.487   0.340   0.604   0.914  Repair Record 1978
trunk           +   0.798   0.714   0.516   0.882  Trunk space (cu. ft.)
weight          +   0.940   0.910   0.475   0.863  Weight (lbs.)
length          +   0.938   0.908   0.473   0.862  Length (in.)
turn            +   0.868   0.807   0.495   0.873  Turn Circle (ft.)
displacement    +   0.899   0.850   0.485   0.868  Displacement (cu. in.)

Test scale                          0.525   0.898  mean(standardized items)
Interitem correlations (reverse applied) (obs=pairwise, see below)

                   price  headroom     rep78     trunk
price             1.0000
headroom          0.1174    1.0000
rep78            -0.0479    0.1955    1.0000
trunk             0.2748    0.6841    0.2777    1.0000
weight            0.5093    0.5464    0.3624    0.6486
length            0.4511    0.5823    0.3162    0.7404
turn              0.3528    0.4067    0.4715    0.5900
displacement      0.5537    0.5166    0.3391    0.6471

                  weight    length      turn  displacement
weight            1.0000
length            0.9425    1.0000
turn              0.8712    0.8589    1.0000
displacement      0.8753    0.8422    0.7723    1.0000
Pairwise number of observations

                   price  headroom     rep78     trunk
price                 70
headroom              62        66
rep78                 59        54        61
trunk                 65        61        59        69
weight                60        56        52        60
length                66        61        58        64
turn                  62        58        56        62
displacement          59        58        51        58

                  weight    length      turn  displacement
weight                64
length                60        69
turn                  57        61        66
displacement          54        58        56        63
Because the detail option was also specified, the interitem correlation matrix was printed, together
with the number of observations used for each entry (because these varied across the matrix). Note
the negative sign attached to rep78 in the output, indicating the sense in which it entered the scale.
Better-looking output with less-cryptic headers is produced if the linesize is set to a value of at
least 100:
. set linesize 100
. alpha price headroom rep78 trunk weight length turn displ, std item label
Test scale = mean(standardized items)
                         item-test  item-rest  interitem
Item          Obs Sign     corr.      corr.      corr.     alpha  Label

price          70   +      0.5260     0.3719     0.5993    0.9128 Price
headroom       62   +      0.6716     0.5497     0.5542    0.8969 Headroom (in.)
rep78          59   -      0.4874     0.3398     0.6040    0.9143 Repair Record 1978
trunk          65   +      0.7979     0.7144     0.5159    0.8818 Trunk space (cu. ft.)
weight         60   +      0.9404     0.9096     0.4747    0.8635 Weight (lbs.)
length         66   +      0.9382     0.9076     0.4725    0.8625 Length (in.)
turn           62   +      0.8678     0.8071     0.4948    0.8727 Turn Circle (ft.)
displacement   59   +      0.8992     0.8496     0.4852    0.8684 Displacement (cu. in.)

Test scale                                       0.5251    0.8984 mean(standardized items)
Users of alpha require some standard for judging values of α. We paraphrase Nunnally and
Bernstein (1994, 265): In the early stages of research, modest reliability of 0.70 or higher will suffice;
values in excess of 0.80 often waste time and funds. In contrast, where measurements on individuals
are of interest, a reliability of 0.80 may not be nearly high enough. Even with a reliability of 0.90,
the standard error of measurement is almost one-third as large as the standard deviation of test scores;
a reliability of 0.90 is the minimum that should be tolerated, and a reliability of 0.95 should be
considered the desirable standard.
Saved results
alpha saves the following in r():
Scalars
  r(alpha)               scale reliability coefficient
  r(k)                   number of items in the scale
  r(cov)                 average interitem covariance
  r(rho)                 average interitem correlation if std is specified

Matrices
  r(Alpha)               scale reliability coefficient
  r(ItemTestCorr)        item-test correlation
  r(ItemRestCorr)        item-rest correlation
  r(MeanInterItemCov)    average interitem covariance
  r(MeanInterItemCorr)   average interitem correlation if std is specified
If the item option is specified, results are saved as row matrices for the k subscales when one variable
is removed.
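A minimal sketch of retrieving these results, reusing four of the automobile items from the examples above:

. quietly alpha price headroom rep78 trunk, std item
. matrix A = r(Alpha)
. matrix list A
. display r(alpha)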
Methods and formulas
alpha is implemented as an ado-file.
Let $x_i$, $i = 1, \dots, k$, be the variables over which $\alpha$ is to be calculated. Let $s_i$ be the sign with which $x_i$ enters the scale. If asis is specified, $s_i = 1$ for all $i$. Otherwise, principal-factor analysis is performed on $x_i$, and the first factor's score is predicted; see [MV] factor. $s_i$ is $-1$ if the correlation of $x_i$ and the predicted score is negative and $+1$ otherwise.

Let $r_{ij}$ be the correlation between $x_i$ and $x_j$, $c_{ij}$ be the covariance, and $n_{ij}$ be the number of observations used in calculating the correlation or covariance. The average correlation is

$$r = \frac{\sum_{i=2}^{k} \sum_{j=1}^{i-1} s_i s_j n_{ij} r_{ij}}{\sum_{i=2}^{k} \sum_{j=1}^{i-1} n_{ij}}$$

and the average covariance similarly is

$$c = \frac{\sum_{i=2}^{k} \sum_{j=1}^{i-1} s_i s_j n_{ij} c_{ij}}{\sum_{i=2}^{k} \sum_{j=1}^{i-1} n_{ij}}$$

Let $c_{ii}$ denote the variance of $x_i$, and define the average variance as

$$v = \frac{\sum_{i=1}^{k} n_{ii} c_{ii}}{\sum_{i=1}^{k} n_{ii}}$$

If std is specified, the scale reliability $\alpha$ is calculated as defined by the general form of the Spearman–Brown Prophecy Formula (Nunnally and Bernstein 1994, 232; Allen and Yen 1979, 85–88):

$$\alpha = \frac{kr}{1 + (k - 1)r}$$

This expression corresponds to $\alpha$ under the assumption that the summative rating is the sum of the standardized variables (Nunnally and Bernstein 1994, 234). If std is not specified, $\alpha$ is defined (Nunnally and Bernstein 1994, 232 and 234) as

$$\alpha = \frac{kc}{v + (k - 1)c}$$

Let $x_{ij}$ reflect the value of item $i$ in the $j$th observation. If std is specified, the $j$th value of the scale computed from the $k$ $x_{ij}$ items is

$$S_j = \frac{1}{k_j} \sum_{i=1}^{k} s_i\, S(x_{ij})$$
where $S(\cdot)$ is the function that returns the standardized (mean 0, variance 1) value of $x_{ij}$ if $x_{ij}$ is not missing and returns zero if $x_{ij}$ is missing. $k_j$ is the number of nonmissing values in $x_{ij}$, $i = 1, \dots, k$. If std is not specified, $S(\cdot)$ is the function that returns $x_{ij}$ or returns missing if $x_{ij}$ is missing.
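As a quick numerical check of the standardized formula, the saved results can be plugged back in by hand; the two displayed values should agree (a sketch, reusing four of the automobile items):

. quietly alpha price headroom rep78 trunk, std
. display r(k)*r(rho)/(1 + (r(k)-1)*r(rho))
. display r(alpha)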
Lee Joseph Cronbach (1916–2001) was an American psychometrician and educational psychologist
who worked principally on measurement theory, program evaluation, and instruction. He taught
and researched at the State College of Washington, the University of Chicago, the University of
Illinois, and Stanford. Cronbach’s initial paper on alpha led to a theory of test reliability, called
generalizability theory, a comprehensive statistical model for identifying sources of measurement
error.
Acknowledgment
This improved version of alpha was written by Jeroen Weesie, Department of Sociology, Utrecht
University, The Netherlands.
References
Acock, A. C. 2008. A Gentle Introduction to Stata. 2nd ed. College Station, TX: Stata Press.
Allen, M. J., and W. M. Yen. 1979. Introduction to Measurement Theory. Monterey, CA: Brooks/Cole.
Bleda, M.-J., and A. Tobías. 2000. sg143: Cronbach’s alpha one-sided confidence interval. Stata Technical Bulletin 56: 26–27. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 187–189. College Station, TX: Stata Press.
Cronbach, L. J. 1951. Coefficient alpha and the internal structure of tests. Psychometrika 16: 297–334.
Likert, R. A. 1932. A technique for the measurement of attitudes. Archives of Psychology 140: 5–55.
Nunnally, J. C., and I. H. Bernstein. 1994. Psychometric Theory. 3rd ed. New York: McGraw–Hill.
Shavelson, R. J., and G. Gleser. 2002. Lee J. Cronbach (1916–2001). American Psychologist 57: 360–361.
Tarlov, A. R., J. E. Ware Jr., S. Greenfield, E. C. Nelson, E. Perrin, and M. Zubkoff. 1989. The medical outcomes
study. An application of methods for monitoring the results of medical care. Journal of the American Medical
Association 262: 925–930.
Weesie, J. 1997. sg66: Enhancements to the alpha command. Stata Technical Bulletin 35: 32–34. Reprinted in Stata
Technical Bulletin Reprints, vol. 6, pp. 176–179. College Station, TX: Stata Press.
Also see
[MV] factor — Factor analysis
Title
ameans — Arithmetic, geometric, and harmonic means
Syntax
ameans [varlist] [if] [in] [weight] [, options]
options      description

Main
  add(#)     add # to each variable in varlist
  only       add # only to variables with nonpositive values
  level(#)   set confidence level; default is level(95)
by is allowed; see [D] by.
aweights and fweights are allowed; see [U] 11.1.6 weight.
Menu
Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Arith./geometric/harmonic means
Description
ameans computes the arithmetic, geometric, and harmonic means, with their corresponding
confidence intervals, for each variable in varlist or for all the variables in the data if varlist is
not specified. gmeans and hmeans are synonyms for ameans.
If you simply want arithmetic means and corresponding confidence intervals, see [R] ci.
Options
Main
add(#) adds the value # to each variable in varlist before computing the means and confidence
intervals. This option is useful when analyzing variables with nonpositive values.
only modifies the action of the add(#) option so that it adds # only to variables with at least one
nonpositive value.
level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is
level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals.
Remarks
Example 1
We have a dataset containing 8 observations on a variable named x. The eight values are 5, 4,
−4, −5, 0, 0, missing, and 7.
. ameans x

    Variable    Type             Obs        Mean     [95% Conf. Interval]

    x           Arithmetic         7           1     -3.204405    5.204405
                Geometric          3    5.192494       2.57899    10.45448
                Harmonic           3    5.060241      3.023008     15.5179

. ameans x, add(5)

    Variable    Type             Obs        Mean     [95% Conf. Interval]

    x           Arithmetic         7           6      1.795595     10.2044 *
                Geometric          6    5.477226        2.1096    14.22071 *
                Harmonic           6    3.540984             .           . *

(*) 5 was added to the variables prior to calculating the results.
Missing values in confidence intervals for harmonic mean indicate
that confidence interval is undefined for corresponding variables.
Consult Reference Manual for details.
The number of observations displayed for the arithmetic mean is the number of nonmissing observations.
The number of observations displayed for the geometric and harmonic means is the number of
nonmissing, positive observations. Specifying the add(5) option produces 3 more positive observations.
The confidence interval for the harmonic mean is not reported; see Methods and formulas below.
Saved results
ameans saves the following in r():
Scalars
  r(N)        number of nonmissing observations; used for arithmetic mean
  r(N_pos)    number of nonmissing positive observations; used for geometric and harmonic means
  r(mean)     arithmetic mean
  r(lb)       lower bound of confidence interval for arithmetic mean
  r(ub)       upper bound of confidence interval for arithmetic mean
  r(Var)      variance of untransformed data
  r(mean_g)   geometric mean
  r(lb_g)     lower bound of confidence interval for geometric mean
  r(ub_g)     upper bound of confidence interval for geometric mean
  r(Var_g)    variance of ln x_i
  r(mean_h)   harmonic mean
  r(lb_h)     lower bound of confidence interval for harmonic mean
  r(ub_h)     upper bound of confidence interval for harmonic mean
  r(Var_h)    variance of 1/x_i
Methods and formulas
ameans is implemented as an ado-file.
See Armitage, Berry, and Matthews (2002) or Snedecor and Cochran (1989). For a history of the
concept of the mean, see Plackett (1958).
When restricted to the same set of values (i.e., to positive values), the arithmetic mean ($\bar{x}$) is greater than or equal to the geometric mean, which in turn is greater than or equal to the harmonic mean. Equality holds only if all values within a sample are equal to a positive constant.

The arithmetic mean and its confidence interval are identical to those provided by ci; see [R] ci.

To compute the geometric mean, ameans first creates $u_j = \ln x_j$ for all positive $x_j$. The arithmetic mean of the $u_j$ and its confidence interval are then computed as in ci. Let $\bar{u}$ be the resulting mean, and let $[L, U]$ be the corresponding confidence interval. The geometric mean is then $\exp(\bar{u})$, and its confidence interval is $[\exp(L), \exp(U)]$.

The same procedure is followed for the harmonic mean, except that then $u_j = 1/x_j$. The harmonic mean is then $1/\bar{u}$, and its confidence interval is $[1/U, 1/L]$ if $L$ is greater than zero. If $L$ is not greater than zero, this confidence interval is not defined, and missing values are reported.

When weights are specified, ameans applies the weights to the transformed values, $u_j = \ln x_j$ and $u_j = 1/x_j$, respectively, when computing the geometric and harmonic means. For details on how the weights are used to compute the mean and variance of the $u_j$, see [R] summarize. Without weights, the formula for the geometric mean reduces to

$$\exp\left\{\frac{1}{n} \sum_j \ln(x_j)\right\}$$

Without weights, the formula for the harmonic mean is

$$\frac{n}{\sum_j \frac{1}{x_j}}$$
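A minimal sketch of this back-transformation by hand, assuming a variable x in memory with at least one positive value; the displayed values should match the geometric-mean row reported by ameans:

. generate double u = ln(x) if x > 0
. ci u
. display exp(r(mean))
. display exp(r(lb)), exp(r(ub))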
Acknowledgments
This improved version of ameans is based on the gmci command (Carlin, Vidmar, and Ramalheira 1998) and was written by John Carlin, University of Melbourne, Australia; Suzanna Vidmar,
University of Melbourne, Australia; and Carlos Ramalheira, Coimbra University Hospital, Portugal.
References
Armitage, P., G. Berry, and J. N. S. Matthews. 2002. Statistical Methods in Medical Research. 4th ed. Oxford:
Blackwell.
Carlin, J., S. Vidmar, and C. Ramalheira. 1998. sg75: Geometric means and confidence intervals. Stata Technical
Bulletin 41: 23–25. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 197–199. College Station, TX: Stata
Press.
Keynes, J. M. 1911. The principal averages and the laws of error which lead to them. Journal of the Royal Statistical
Society 74: 322–331.
Plackett, R. L. 1958. Studies in the history of probability and statistics: VII. The principle of the arithmetic mean.
Biometrika 45: 130–135.
Snedecor, G. W., and W. G. Cochran. 1989. Statistical Methods. 8th ed. Ames, IA: Iowa State University Press.
Stigler, S. M. 1985. Arithmetic means. In Vol. 1 of Encyclopedia of Statistical Sciences, ed. S. Kotz and N. L. Johnson, 126–129. New York: Wiley.
Also see
[R] ci — Confidence intervals for means, proportions, and counts
[R] mean — Estimate means
[R] summarize — Summary statistics
[SVY] svy estimation — Estimation commands for survey data
Title
anova — Analysis of variance and covariance
Syntax
anova varname termlist [if] [in] [weight] [, options]
where termlist is a factor-variable list (see [U] 11.4.3 Factor variables) with the following additional
features:
• Variables are assumed to be categorical; use the c. factor-variable operator to override this.
• The | symbol (indicating nesting) may be used in place of the # symbol (indicating interaction).
• The / symbol is allowed after a term and indicates that the following term is the error term
for the preceding terms.
options               description

Model
  repeated(varlist)   variables in terms that are repeated-measures variables
  partial             use partial (or marginal) sums of squares
  sequential          use sequential sums of squares
  noconstant          suppress constant term
  dropemptycells      drop empty cells from the design matrix

Adv. model
  bse(term)           between-subjects error term in repeated-measures ANOVA
  bseunit(varname)    variable representing lowest unit in the between-subjects error term
  grouping(varname)   grouping variable for computing pooled covariance matrix
bootstrap, by, jackknife, and statsby are allowed; see [U] 11.1.10 Prefix commands.
aweights and fweights are allowed; see [U] 11.1.6 weight.
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.
Menu
Statistics > Linear models and related > ANOVA/MANOVA > Analysis of variance and covariance
Description
The anova command fits analysis-of-variance (ANOVA) and analysis-of-covariance (ANCOVA) models
for balanced and unbalanced designs, including designs with missing cells; for repeated-measures
ANOVA; and for factorial, nested, or mixed designs.
The regress command (see [R] regress) will display the coefficients, standard errors, etc., of the
regression model underlying the last run of anova.
If you want to fit one-way ANOVA models, you may find the oneway or loneway command more
convenient; see [R] oneway and [R] loneway. If you are interested in MANOVA or MANCOVA, see
[MV] manova.
Options
Model
repeated(varlist) indicates the names of the categorical variables in the terms that are to be treated
as repeated-measures variables in a repeated-measures ANOVA or ANCOVA.
partial presents the ANOVA table using partial (or marginal) sums of squares. This setting is the
default. Also see the sequential option.
sequential presents the ANOVA table using sequential sums of squares.
noconstant suppresses the constant term (intercept) from the ANOVA or regression model.
dropemptycells drops empty cells from the design matrix. If c(emptycells) is set to keep (see
[R] set emptycells), this option temporarily resets it to drop before running the ANOVA model. If
c(emptycells) is already set to drop, this option does nothing.
Adv. model
bse(term) indicates the between-subjects error term in a repeated-measures ANOVA. This option
is needed only in the rare case when the anova command cannot automatically determine the
between-subjects error term.
bseunit(varname) indicates the variable representing the lowest unit in the between-subjects error
term in a repeated-measures ANOVA. This option is rarely needed because the anova command
automatically selects the first variable listed in the between-subjects error term as the default for
this option.
grouping(varname) indicates a variable that determines which observations are grouped together in
computing the covariance matrices that will be pooled and used in a repeated-measures ANOVA.
This option is rarely needed because the anova command automatically selects the combination
of all variables except the first (or as specified in the bseunit() option) in the between-subjects
error term as the default for grouping observations.
Remarks
Remarks are presented under the following headings:
Introduction
One-way ANOVA
Two-way ANOVA
N-way ANOVA
Weighted data
ANCOVA
Nested designs
Mixed designs
Latin-square designs
Repeated-measures ANOVA
Introduction
anova uses least squares to fit the linear models known as ANOVA or ANCOVA (henceforth referred
to simply as ANOVA models).
If your interest is in one-way ANOVA, you may find the oneway command to be more convenient;
see [R] oneway.
ANOVA was pioneered by Fisher. It features prominently in his texts on statistical methods and his
design of experiments (1925, 1935). Many books discuss ANOVA; see, for instance, Altman (1991); van
Belle et al. (2004); Cobb (1998); Snedecor and Cochran (1989); or Winer, Brown, and Michels (1991).
For a classic source, see Scheffé (1959). Kennedy Jr. and Gentle (1980) discuss ANOVA’s computing
problems. Edwards (1985) is concerned primarily with the relationship between multiple regression
and ANOVA. Acock (2008, chap. 9) and Rabe-Hesketh and Everitt (2007, chap. 4 and 5) illustrate
their discussion with Stata output. Repeated-measures ANOVA is discussed in Winer, Brown, and
Michels (1991); Kuehl (2000); and Milliken and Johnson (1984). Pioneering work in repeated-measures
ANOVA can be found in Box (1954); Geisser and Greenhouse (1958); Huynh and Feldt (1976); and
Huynh (1978).
One-way ANOVA
anova, entered without options, performs and reports standard ANOVA. For instance, to perform a
one-way layout of a variable called endog on exog, you would type anova endog exog.
Example 1
We run an experiment varying the amount of fertilizer used in growing apple trees. We test four
concentrations, using each concentration in three groves of 12 trees each. Later in the year, we
measure the average weight of the fruit.
If all had gone well, we would have had 3 observations on the average weight for each of the
four concentrations. Instead, two of the groves were mistakenly leveled by a confused man on a large
bulldozer. We are left with the following data:
. use http://www.stata-press.com/data/r11/apple
(Apple trees)
. list, abbrev(10) sepby(treatment)
       treatment   weight

  1.           1    117.5
  2.           1    113.8
  3.           1    104.4

  4.           2     48.9
  5.           2     50.4
  6.           2     58.9

  7.           3     70.4
  8.           3     86.9

  9.           4     87.7
 10.           4     67.3
anova — Analysis of variance and covariance
25
To obtain one-way ANOVA results, we type
. anova weight treatment

                  Number of obs =      10     R-squared     =  0.9147
                  Root MSE      = 9.07002    Adj R-squared =  0.8721

       Source |  Partial SS    df       MS           F     Prob > F
 -------------+----------------------------------------------------
        Model |  5295.54433     3   1765.18144     21.46     0.0013
              |
    treatment |  5295.54433     3   1765.18144     21.46     0.0013
              |
     Residual |  493.591667     6   82.2652778
 -------------+----------------------------------------------------
        Total |    5789.136     9   643.237333
We find significant (at better than the 1% level) differences among the four concentrations.
Although the output is a usual ANOVA table, let’s run through it anyway. Above the table is a
summary of the underlying regression. The model was fit on 10 observations, and the root mean
squared error (Root MSE) is 9.07. The R² for the model is 0.9147, and the adjusted R² is 0.8721.
The first line of the table summarizes the model. The sum of squares (Partial SS) for the model is
5295.5 with 3 degrees of freedom (df). This line results in a mean square (MS) of 5295.5/3 ≈ 1765.2.
The corresponding F statistic is 21.46 and has a significance level of 0.0013. Thus the model appears
to be significant at the 0.13% level.
The next line summarizes the first (and only) term in the model, treatment. Because there is
only one term, the line is identical to that for the overall model.
The third line summarizes the residual. The residual sum of squares is 493.59 with 6 degrees of
freedom, resulting in a mean squared error of 82.27. The square root of this latter number is reported
as the Root MSE.
The model plus the residual sum of squares equals the total sum of squares, which is reported as
5789.1 in the last line of the table. This is the total sum of squares of weight after removal of the
mean. Similarly, the model plus the residual degrees of freedom sum to the total degrees of freedom,
9. Remember that there are 10 observations. Subtracting 1 for the mean, we are left with 9 total
degrees of freedom.
Technical note
Rather than using the anova command, we could have performed this analysis by using the
oneway command. Example 1 in [R] oneway repeats this same analysis. You may wish to compare
the output.
Type regress to see the underlying regression model corresponding to an ANOVA model fit using
the anova command.
Example 2
Returning to the apple tree experiment, we found that the fertilizer concentration appears to
significantly affect the average weight of the fruit. Although that finding is interesting, we next want
to know which concentration appears to grow the heaviest fruit. One way to find out is by examining
the underlying regression coefficients.
. regress, baselevels

      Source |       SS       df       MS              Number of obs =      10
 ------------+------------------------------           F(  3,     6) =   21.46
       Model |  5295.54433     3  1765.18144           Prob > F      =  0.0013
    Residual |  493.591667     6  82.2652778           R-squared     =  0.9147
 ------------+------------------------------           Adj R-squared =  0.8721
       Total |    5789.136     9  643.237333           Root MSE      =    9.07

 -----------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 ------------+----------------------------------------------------------------
   treatment |
          1  |     (base)
          2  |  -59.16667   7.405641    -7.99   0.000    -77.28762   -41.04572
          3  |     -33.25   8.279758    -4.02   0.007    -53.50984   -12.99016
          4  |      -34.4   8.279758    -4.15   0.006    -54.65984   -14.14016
             |
       _cons |      111.9   5.236579    21.37   0.000     99.08655    124.7134
 -----------------------------------------------------------------------------
See [R] regress for an explanation of how to read this table. The baselevels option of regress displays a row indicating the base category for our categorical variable, treatment. In summary, we find that concentration 1, the base (omitted) group, produces significantly heavier fruits than concentrations 2, 3, and 4; concentration 2 produces the lightest fruits; and concentrations 3 and 4 appear to be roughly equivalent.
Example 3
We previously typed anova weight treatment to produce and display the ANOVA table for our
apple tree experiment. Typing regress displays the regression coefficients. We can redisplay the
ANOVA table by typing anova without arguments:
. anova

                  Number of obs =      10     R-squared     =  0.9147
                  Root MSE      = 9.07002    Adj R-squared =  0.8721

       Source |  Partial SS    df       MS           F     Prob > F
 -------------+----------------------------------------------------
        Model |  5295.54433     3   1765.18144     21.46     0.0013
              |
    treatment |  5295.54433     3   1765.18144     21.46     0.0013
              |
     Residual |  493.591667     6   82.2652778
 -------------+----------------------------------------------------
        Total |    5789.136     9   643.237333
Two-way ANOVA
You can include multiple explanatory variables with the anova command, and you can specify
interactions by placing ‘#’ between the variable names. For instance, typing anova y a b performs a
two-way layout of y on a and b. Typing anova y a b a#b performs a full two-way factorial layout.
The shorthand anova y a##b does the same.
With the default partial sums of squares, when you specify interacted terms, the order of the terms
does not matter. Typing anova y a b a#b is the same as typing anova y b a b#a.
Example 4
The classic two-way factorial ANOVA problem, at least as far as computer manuals are concerned,
is a two-way ANOVA design from Afifi and Azen (1979).
Fifty-eight patients, each suffering from one of three different diseases, were randomly assigned
to one of four different drug treatments, and the change in their systolic blood pressure was recorded.
Here are the data:
              Drug 1       Drug 2       Drug 3       Drug 4

 Disease 1   42, 44, 36   28, 23, 34    1, 29, 19   24,  9, 22
             13, 19, 22   42, 13                    -2, 15

 Disease 2   33, 26, 33   34, 33, 31   11,  9,  7   27, 12, 12
             21           36            1, -6       -5, 16, 15

 Disease 3   31, -3, 25    3, 26, 28   21,  1,  9   22,  7, 25
             25, 24       32,  4, 16    3            5, 12
Let’s assume that we have entered these data into Stata and stored the data as systolic.dta.
Below we use the data, list the first 10 observations, summarize the variables, and tabulate the
control variables:
. use http://www.stata-press.com/data/r11/systolic
(Systolic Blood Pressure Data)
. list in 1/10
       drug   disease   systolic

  1.      1         1         42
  2.      1         1         44
  3.      1         1         36
  4.      1         1         13
  5.      1         1         19

  6.      1         1         22
  7.      1         2         33
  8.      1         2         26
  9.      1         2         33
 10.      1         2         21
. summarize

    Variable |     Obs        Mean    Std. Dev.       Min        Max
 ------------+------------------------------------------------------
        drug |      58         2.5    1.158493          1          4
     disease |      58    2.017241    .8269873          1          3
    systolic |      58    18.87931    12.80087         -6         44

. tabulate drug disease

               Patient’s Disease
 Drug Used        1        2        3     Total

         1        6        4        5        15
         2        5        4        6        15
         3        3        5        4        12
         4        5        6        5        16

     Total       19       19       20        58
Each observation in our data corresponds to one patient, and for each patient we record drug,
disease, and the increase in the systolic blood pressure, systolic. The tabulation reveals that the
data are not balanced — there are not equal numbers of patients in each drug – disease cell. Stata
does not require that the data be balanced. We can perform a two-way factorial ANOVA by typing
. anova systolic drug disease drug#disease

                  Number of obs =      58     R-squared     =  0.4560
                  Root MSE      = 10.5096    Adj R-squared =  0.3259

         Source |  Partial SS    df       MS           F     Prob > F
 ---------------+----------------------------------------------------
          Model |  4259.33851    11   387.212591      3.51     0.0013
                |
           drug |  2997.47186     3   999.157287      9.05     0.0001
        disease |  415.873046     2   207.936523      1.88     0.1637
   drug#disease |  707.266259     6    117.87771      1.07     0.3958
                |
       Residual |  5080.81667    46   110.452536
 ---------------+----------------------------------------------------
          Total |  9340.15517    57   163.862371
Although Stata’s table command does not perform ANOVA, it can produce useful summary tables
of your data (see [R] table):
. table drug disease, c(mean systolic) row col f(%8.2f)

            Patient’s Disease
 Drug Used      1       2       3   Total

         1  29.33   28.25   20.40   26.07
         2  28.00   33.50   18.17   25.53
         3  16.33    4.40    8.50    8.75
         4  13.60   12.83   14.20   13.50

     Total  22.79   18.21   15.80   18.88
These are simple means and are not influenced by our anova model. More useful is the margins
command (see [R] margins) that provides marginal means and adjusted predictions. Because drug
is the only significant factor in our ANOVA, we now examine the adjusted marginal means for drug.
. margins drug, asbalanced

Adjusted predictions                              Number of obs   =         58

Expression   : Linear prediction, predict()
at           : drug            (asbalanced)
               disease         (asbalanced)

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        drug |
          1  |   25.99444   2.751008     9.45   0.000     20.60257    31.38632
          2  |   26.55556   2.751008     9.65   0.000     21.16368    31.94743
          3  |   9.744444   3.100558     3.14   0.002     3.667462    15.82143
          4  |   13.54444   2.637123     5.14   0.000     8.375778    18.71311
------------------------------------------------------------------------------
These adjusted marginal predictions are not equal to the simple drug means (see the total column from
the table command); they are based upon predictions from our ANOVA model. The asbalanced
option of margins corresponds with the interpretation of the F statistic produced by ANOVA —each
cell is given equal weight regardless of its sample size (see the following three technical notes). You
can omit the asbalanced option and obtain predictive margins that take into account the unequal
sample sizes of the cells.
. margins drug

Predictive margins                                Number of obs   =         58

Expression   : Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        drug |
          1  |   25.89799   2.750533     9.42   0.000     20.50704    31.28893
          2  |   26.41092   2.742762     9.63   0.000      21.0352    31.78664
          3  |   9.722989   3.099185     3.14   0.002     3.648697    15.79728
          4  |   13.55575   2.640602     5.13   0.000     8.380261    18.73123
------------------------------------------------------------------------------
Technical note
How do you interpret the significance of terms like drug and disease in unbalanced data? If you
are familiar with SAS, the sums of squares and the F statistic reported by Stata correspond to SAS
type III sums of squares. (Stata can also calculate sequential sums of squares, but we will postpone
that topic for now.)
Let’s think in terms of the following table:
             Disease 1   Disease 2   Disease 3
  Drug 1        µ11         µ12         µ13        µ1·
  Drug 2        µ21         µ22         µ23        µ2·
  Drug 3        µ31         µ32         µ33        µ3·
  Drug 4        µ41         µ42         µ43        µ4·
                µ·1         µ·2         µ·3        µ··
In this table, µij is the mean increase in systolic blood pressure associated with drug i and disease
j , while µi· is the mean for drug i, µ·j is the mean for disease j , and µ·· is the overall mean.
If the data are balanced, meaning that there are equal numbers of observations going into the
calculation of each mean µij, the row means, µi·, are given by

$$\mu_{i\cdot} = \frac{\mu_{i1} + \mu_{i2} + \mu_{i3}}{3}$$
In our case, the data are not balanced, but we define the µi· according to that formula anyway. The
test for the main effect of drug is the test that
µ1· = µ2· = µ3· = µ4·
To be absolutely clear, the F test of the term drug, called the main effect of drug, is formally
equivalent to the test of the three constraints:
$$\frac{\mu_{11} + \mu_{12} + \mu_{13}}{3} = \frac{\mu_{21} + \mu_{22} + \mu_{23}}{3}$$

$$\frac{\mu_{11} + \mu_{12} + \mu_{13}}{3} = \frac{\mu_{31} + \mu_{32} + \mu_{33}}{3}$$

$$\frac{\mu_{11} + \mu_{12} + \mu_{13}}{3} = \frac{\mu_{41} + \mu_{42} + \mu_{43}}{3}$$
In our data, we obtain a significant F statistic of 9.05 and thus reject those constraints.
Technical note
Stata can display the symbolic form underlying the test statistics it presents, as well as display other
test statistics and their symbolic forms; see Obtaining symbolic forms in [R] anova postestimation.
Here is the result of requesting the symbolic form for the main effect of drug in our data:
. test drug, symbolic
drug
  1  -(r2+r3+r4)
  2   r2
  3   r3
  4   r4
disease
  1   0
  2   0
  3   0
drug#disease
  1 1  -1/3 (r2+r3+r4)
  1 2  -1/3 (r2+r3+r4)
  1 3  -1/3 (r2+r3+r4)
  2 1   1/3 r2
  2 2   1/3 r2
  2 3   1/3 r2
  3 1   1/3 r3
  3 2   1/3 r3
  3 3   1/3 r3
  4 1   1/3 r4
  4 2   1/3 r4
  4 3   1/3 r4
_cons
  0
This says exactly what we said in the previous technical note.
Technical note
Saying that there is no main effect of a variable is not the same as saying that it has no effect at
all. Stata’s ability to perform ANOVA on unbalanced data can easily be put to ill use.
For example, consider the following table of the probability of surviving a bout with one of two
diseases according to the drug administered to you:
              Disease 1   Disease 2
   Drug 1         1           0
   Drug 2         0           1
If you have disease 1 and are administered drug 1, you live. If you have disease 2 and are
administered drug 2, you live. In all other cases, you die.
This table has no main effects of either drug or disease, although there is a large interaction effect.
You might now be tempted to reason that because there is only an interaction effect, you would
be indifferent between the two drugs in the absence of knowledge about which disease infects you.
Given an equal chance of having either disease, you reason that it does not matter which drug is
administered to you — either way, your chances of surviving are 0.5.
You may not, however, have an equal chance of having either disease. If you knew that disease 1
was 100 times more likely to occur in the population, and if you knew that you had one of the two
diseases, you would express a strong preference for receiving drug 1.
When you calculate the significance of main effects on unbalanced data, you must ask yourself
why the data are unbalanced. If the data are unbalanced for random reasons and you are making
predictions for a balanced population, the test of the main effect makes perfect sense. If, however,
the data are unbalanced because the underlying populations are unbalanced and you are making
predictions for such unbalanced populations, the test of the main effect may be practically — if not
statistically — meaningless.
Example 5
Stata can perform ANOVA not only on unbalanced populations, but also on populations that are
so unbalanced that entire cells are missing. For instance, using our systolic blood pressure data, let’s
refit the model eliminating the drug 1–disease 1 cell. Because anova follows the same syntax as all
other Stata commands, we can explicitly specify the data to be used by typing the if qualifier at the
end of the anova command. Here we want to use the data that are not for drug 1 and disease 1:
. anova systolic drug##disease if !(drug==1 & disease==1)

                  Number of obs =      52     R-squared     =  0.4545
                  Root MSE      = 10.1615    Adj R-squared =  0.3215

         Source |  Partial SS    df       MS           F     Prob > F
 ---------------+----------------------------------------------------
          Model |  3527.95897    10   352.795897      3.42     0.0025
                |
           drug |  2686.57832     3   895.526107      8.67     0.0001
        disease |  327.792598     2   163.896299      1.59     0.2168
   drug#disease |  703.007602     5    140.60152      1.36     0.2586
                |
       Residual |  4233.48333    41   103.255691
 ---------------+----------------------------------------------------
          Total |  7761.44231    51   152.185143
Here we used drug##disease as a shorthand for drug disease drug#disease.
Technical note
The test of the main effect of drug in the presence of missing cells is more complicated than that
for unbalanced data. Our underlying tableau now has the following form:
             Disease 1   Disease 2   Disease 3
  Drug 1                    µ12         µ13
  Drug 2        µ21         µ22         µ23        µ2·
  Drug 3        µ31         µ32         µ33        µ3·
  Drug 4        µ41         µ42         µ43        µ4·
                            µ·2         µ·3
The hole in the drug 1–disease 1 cell indicates that the mean is unobserved. Considering the main
effect of drug, the test is unchanged for the rows in which all the cells are defined:
µ2· = µ3· = µ4·
The first row, however, requires special attention. Here we want the average outcome for drug 1,
which is averaged only over diseases 2 and 3, to be equal to the average values of all other drugs
averaged over those same two diseases:
$$\frac{\mu_{12} + \mu_{13}}{2} = \frac{(\mu_{22} + \mu_{23})/2 + (\mu_{32} + \mu_{33})/2 + (\mu_{42} + \mu_{43})/2}{3}$$
Thus the test contains three constraints:
$$\frac{\mu_{21} + \mu_{22} + \mu_{23}}{3} = \frac{\mu_{31} + \mu_{32} + \mu_{33}}{3}$$

$$\frac{\mu_{21} + \mu_{22} + \mu_{23}}{3} = \frac{\mu_{41} + \mu_{42} + \mu_{43}}{3}$$

$$\frac{\mu_{12} + \mu_{13}}{2} = \frac{\mu_{22} + \mu_{23} + \mu_{32} + \mu_{33} + \mu_{42} + \mu_{43}}{6}$$
Stata can calculate two types of sums of squares, partial and sequential. If you do not specify
which sums of squares to calculate, Stata calculates partial sums of squares. The technical notes
above have gone into great detail about the definition and use of partial sums of squares. Use the
sequential option to obtain sequential sums of squares.
Technical note
Before we illustrate sequential sums of squares, consider one more feature of the partial sums. If
you know how such things are calculated, you may worry that the terms must be specified in some
particular order, that Stata would balk or, even worse, produce different results if you typed, say,
anova drug#disease drug disease rather than anova drug disease drug#disease. We assure
you that is not the case.
When you type a model, Stata internally reorganizes the terms, forms the cross-product matrix,
inverts it, converts the result to an upper-Hermite form, and then performs the hypothesis tests. As a
final touch, Stata reports the results in the same order that you typed the terms.
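For instance, reusing the systolic data, the following two commands report identical statistics; only the ordering of the rows follows what you typed:

. anova systolic drug disease drug#disease
. anova systolic drug#disease drug disease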
Example 6
We wish to estimate the effects on systolic blood pressure of drug and disease by using sequential
sums of squares. We want to introduce disease first, then drug, and finally, the interaction of drug
and disease:
. anova systolic disease drug disease#drug, sequential
                  Number of obs =      58     R-squared     =  0.4560
                  Root MSE      = 10.5096    Adj R-squared =  0.3259

         Source |     Seq. SS    df       MS           F     Prob > F
 ---------------+----------------------------------------------------
          Model |  4259.33851    11   387.212591      3.51     0.0013
                |
        disease |  488.639383     2   244.319691      2.21     0.1210
           drug |  3063.43286     3   1021.14429      9.25     0.0001
   disease#drug |  707.266259     6    117.87771      1.07     0.3958
                |
       Residual |  5080.81667    46   110.452536
 ---------------+----------------------------------------------------
          Total |  9340.15517    57   163.862371
The F statistic on disease is now 2.21. When we fit this same model by using partial sums of
squares, the statistic was 1.88.
N-way ANOVA
You may include high-order interaction terms, such as a third-order interaction between the variables
A, B, and C, by typing A#B#C.
Example 7
We wish to determine the operating conditions that maximize yield for a manufacturing process.
There are three temperature settings, two chemical supply companies, and two mixing methods under
investigation. Three observations are obtained for each combination of these three factors.
. use http://www.stata-press.com/data/r11/manuf
(manufacturing process data)
. describe

Contains data from http://www.stata-press.com/data/r11/manuf.dta
  obs:            36                          manufacturing process data
 vars:             4                          2 Jan 2009 13:28
 size:           288 (99.9% of memory free)

              storage  display    value
variable name   type   format     label      variable label

temperature     byte   %9.0g      temp       machine temperature setting
chemical        byte   %9.0g      supplier   chemical supplier
method          byte   %9.0g      meth       mixing method
yield           byte   %9.0g                 product yield

Sorted by:
We wish to perform a three-way factorial ANOVA. We could type
. anova yield temp chem temp#chem meth temp#meth chem#meth temp#chem#meth
but prefer to use the ## factor-variable operator for brevity.
. anova yield temp##chem##meth

                  Number of obs =      36     R-squared     =  0.5474
                  Root MSE      = 2.62996    Adj R-squared =  0.3399

                 Source |  Partial SS    df       MS           F     Prob > F
 -----------------------+----------------------------------------------------
                  Model |      200.75    11        18.25      2.64     0.0227
                        |
            temperature |        30.5     2        15.25      2.20     0.1321
               chemical |       12.25     1        12.25      1.77     0.1958
   temperature#chemical |        24.5     2        12.25      1.77     0.1917
                 method |       42.25     1        42.25      6.11     0.0209
     temperature#method |        87.5     2        43.75      6.33     0.0062
        chemical#method |         .25     1          .25      0.04     0.8508
  temperature#chemical# |
                 method |         3.5     2         1.75      0.25     0.7785
                        |
               Residual |         166    24   6.91666667
 -----------------------+----------------------------------------------------
                  Total |      366.75    35   10.4785714
The interaction between temperature and method appears to be the important story in these data.
A table of means for this interaction is given below.
. table method temp, c(mean yield) row col f(%8.2f)
   mixing     machine temperature setting
   method        low   medium     high    Total

   stir         7.50     6.00     6.00     6.50
   fold         5.50     9.00    11.50     8.67

   Total        6.50     7.50     8.75     7.58
Here our ANOVA is balanced (each cell has the same number of observations), and we obtain the
same values as in the table above (but with additional information such as confidence intervals) by
using the margins command. Because our ANOVA is balanced, using the asbalanced option with
margins would not produce different results. We request the predictive margins for the two terms
that appear significant in our ANOVA: temperature#method and method.
. margins temperature#method method

Predictive margins                                Number of obs   =         36

Expression   : Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 temperature#|
      method |
        1 1  |        7.5   1.073675     6.99   0.000     5.395636    9.604364
        1 2  |        5.5   1.073675     5.12   0.000     3.395636    7.604364
        2 1  |          6   1.073675     5.59   0.000     3.895636    8.104364
        2 2  |          9   1.073675     8.38   0.000     6.895636    11.10436
        3 1  |          6   1.073675     5.59   0.000     3.895636    8.104364
        3 2  |       11.5   1.073675    10.71   0.000     9.395636    13.60436
             |
      method |
          1  |        6.5   .6198865    10.49   0.000     5.285045    7.714955
          2  |   8.666667   .6198865    13.98   0.000     7.451711    9.881622
------------------------------------------------------------------------------
We decide to use the folding method of mixing and a high temperature in our manufacturing
process.
Weighted data
Like all estimation commands, anova can produce estimates on weighted data. See [U] 11.1.6 weight
for details on specifying the weight.
Example 8
We wish to investigate the prevalence of byssinosis, a form of pneumoconiosis that can afflict
workers exposed to cotton dust. We have data on 5,419 workers in a large cotton mill. We know
whether each worker smokes, his or her race, and the dustiness of the work area. The variables are
    smokes      smoker or nonsmoker in the last five years
    race        white or other
    workplace   1 (most dusty), 2 (less dusty), 3 (least dusty)
We wish to fit an ANOVA model explaining the prevalence of byssinosis according to a full factorial
model of smokes, race, and workplace.
The data are unbalanced. Moreover, although we have data on 5,419 workers, the data are grouped
according to the explanatory variables, along with some other variables, resulting in 72 observations.
For each observation, we know the number of workers in the group (pop), the prevalence of byssinosis
(prob), and the values of the three explanatory variables. Thus we wish to fit a three-way factorial
model on grouped data.
We begin by showing a bit of the data, which are from Higgins and Koch (1977).
. use http://www.stata-press.com/data/r11/byssin
(Byssinosis incidence)
. describe

Contains data from http://www.stata-press.com/data/r11/byssin.dta
  obs:            72                          Byssinosis incidence
 vars:             5                          19 Dec 2008 07:04
 size:         1,152 (99.9% of memory free)

              storage  display    value
variable name   type   format     label      variable label

smokes          int    %8.0g      smokes     Smokes
race            int    %8.0g      race       Race
workplace       int    %8.0g      workplace  Dustiness of workplace
pop             int    %8.0g                 Population size
prob            float  %9.0g                 Prevalence of byssinosis

Sorted by:
. list in 1/5, abbrev(10) divider
     smokes    race   workplace   pop       prob

  1.    yes   white        most    40       .075
  2.    yes   white        less    74          0
  3.    yes   white       least   260   .0076923
  4.    yes   other        most   164    .152439
  5.    yes   other        less    88          0
The first observation in the data represents a group of 40 white workers who smoke and work
in a “most” dusty work area. Of those 40 workers, 7.5% have byssinosis. The second observation
represents a group of 74 white workers who also smoke but who work in a “less” dusty environment.
None of those workers has byssinosis.
Almost every Stata command allows weights. Here we want to weight the data by pop. We can,
for instance, make a table of the number of workers by their smoking status and race:
. tabulate smokes race [fw=pop]

               Race
   Smokes     other      white      Total

       no       799      1,431      2,230
      yes     1,104      2,085      3,189

    Total     1,903      3,516      5,419
The [fw=pop] at the end of the tabulate command tells Stata to count each observation as representing
pop persons. When making the tally, tabulate treats the first observation as representing 40 workers,
the second as representing 74 workers, and so on.
Similarly, we can make a table of the dustiness of the workplace:
. tabulate workplace [fw=pop]

  Dustiness
         of
  workplace       Freq.     Percent        Cum.

      least       3,450       63.66       63.66
       less       1,300       23.99       87.65
       most         669       12.35      100.00

      Total       5,419      100.00
We can discover the average incidence of byssinosis among these workers by typing
. summarize prob [fw=pop]

    Variable |     Obs        Mean    Std. Dev.       Min        Max
 ------------+------------------------------------------------------
        prob |    5419    .0304484    .0567373          0    .287037
We discover that 3.04% of these workers have byssinosis. Across all cells, the byssinosis rates vary
from 0 to 28.7%. Just to prove that there might be something here, let’s obtain the average incidence
rates according to the dustiness of the workplace:
. table workplace smokes race [fw=pop], c(mean prob)

 Dustiness
        of              Race and Smokes
 workplace   ------ other ------   ------ white ------
                  no        yes         no        yes

     least  .0107527   .0101523   .0081549   .0162774
      less       .02   .0081633   .0136612   .0143149
      most  .0820896   .1679105   .0833333   .2295082
Let’s now fit the ANOVA model.
. anova prob workplace smokes race workplace#smokes workplace#race
> smokes#race workplace#smokes#race [aweight=pop]
(sum of wgt is   5.4190e+03)

                  Number of obs =      65     R-squared     =  0.8300
                  Root MSE      = .025902    Adj R-squared =  0.7948

                 Source |  Partial SS    df       MS           F     Prob > F
 -----------------------+----------------------------------------------------
                  Model |  .173646538    11   .015786049     23.53     0.0000
                        |
              workplace |  .097625175     2   .048812588     72.76     0.0000
                 smokes |  .013030812     1   .013030812     19.42     0.0001
                   race |  .001094723     1   .001094723      1.63     0.2070
       workplace#smokes |  .019690342     2   .009845171     14.67     0.0000
         workplace#race |  .001352516     2   .000676258      1.01     0.3718
            smokes#race |  .001662874     1   .001662874      2.48     0.1214
  workplace#smokes#race |  .000950841     2    .00047542      0.71     0.4969
                        |
               Residual |  .035557766    53   .000670901
 -----------------------+----------------------------------------------------
                  Total |  .209204304    64   .003268817
Of course, if we want to see the underlying regression, we could type regress.
Above we examined simple means of the cells of workplace#smokes#race. Our ANOVA shows
workplace, smokes, and their interaction as being the only significant factors in our model. We now
examine the predictive marginal mean byssinosis rates for these terms.
. margins workplace#smokes workplace smokes

Predictive margins                                Number of obs   =         65

Expression   : Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   workplace#|
      smokes |
        1 1  |   .0090672   .0062319     1.45   0.146     -.003147    .0212814
        1 2  |   .0141264   .0053231     2.65   0.008     .0036934    .0245595
        2 1  |   .0158872    .009941     1.60   0.110    -.0035967    .0353711
        2 2  |   .0121546   .0087353     1.39   0.164    -.0049663    .0292756
        3 1  |   .0828966   .0182151     4.55   0.000     .0471957    .1185975
        3 2  |   .2078768    .012426    16.73   0.000     .1835222    .2322314
             |
   workplace |
          1  |   .0120701   .0040471     2.98   0.003     .0041379    .0200022
          2  |   .0137273   .0065685     2.09   0.037     .0008533    .0266012
          3  |   .1566225   .0104602    14.97   0.000     .1361208    .1771241
             |
      smokes |
          1  |   .0196915   .0050298     3.91   0.000     .0098332    .0295498
          2  |   .0358626   .0041949     8.55   0.000     .0276408    .0440844
------------------------------------------------------------------------------
Smoking combined with the most dusty workplace produces the highest byssinosis rates.
Ronald Aylmer Fisher (1890–1962) (Sir Ronald from 1952) studied mathematics at Cambridge.
Even before he finished his studies, he had published on statistics. He worked as a statistician at
Rothamsted Experimental Station (1919–1933), as professor of eugenics at University College
London (1933–1943), as professor of genetics at Cambridge (1943–1957), and in retirement at
the CSIRO Division of Mathematical Statistics in Adelaide. His many fundamental and applied contributions to statistics and genetics, including original work on tests of significance, distribution theory, theory of estimation, fiducial inference, and design of experiments, mark him as one of the greatest statisticians of all time.
ANCOVA
You can include multiple explanatory variables with the anova command, but unless you explicitly
state otherwise by using the c. factor-variable operator, all the variables are interpreted as categorical
variables. Using the c. operator, you can designate variables as continuous and thus perform ANCOVA.
Example 9
We have census data recording the death rate (drate) and median age (age) for each state. The
dataset also includes the region of the country in which each state is located (region):
. use http://www.stata-press.com/data/r11/census2
(1980 Census data by state)
. summarize drate age region

    Variable |     Obs        Mean    Std. Dev.       Min        Max
 ------------+------------------------------------------------------
       drate |      50        84.3    13.07318         40        107
         age |      50        29.5    1.752549         24         35
      region |      50        2.66    1.061574          1          4
age is coded in integral years from 24 to 35, and region is coded from 1 to 4, with 1 standing for
the Northeast, 2 for the North Central, 3 for the South, and 4 for the West.
When we examine the data more closely, we discover large differences in the death rate across
regions of the country:
. tabulate region, summarize(drate)

    Census        Summary of Death Rate
    region        Mean   Std. Dev.   Freq.

        NE   93.444444   7.0553368       9
   N Cntrl   88.916667   5.5833899      12
     South     88.3125   8.5457104      16
      West   68.769231   13.342625      13

     Total        84.3   13.073185      50
Naturally, we wonder if these differences might not be explained by differences in the median ages
of the populations. To find out, we fit a regression model (via anova) of drate on region and age.
In the anova example below, we treat age as a categorical variable.
. anova drate region age

                  Number of obs =      50     R-squared     =  0.7927
                  Root MSE      = 6.7583     Adj R-squared =  0.7328

       Source |  Partial SS    df       MS           F     Prob > F
 -------------+----------------------------------------------------
        Model |  6638.86529    11   603.533208     13.21     0.0000
              |
       region |  1320.00973     3   440.003244      9.63     0.0001
          age |  2237.24937     8   279.656171      6.12     0.0000
              |
     Residual |  1735.63471    38   45.6745977
 -------------+----------------------------------------------------
        Total |      8374.5    49   170.908163
We have the answer to our question: differences in median ages do not eliminate the differences in
death rates across the four regions. The ANOVA table summarizes the two terms in the model, region
and age. The region term contains 3 degrees of freedom, and the age term contains 8 degrees of
freedom. Both are significant at better than the 1% level.
The age term contains 8 degrees of freedom. Because we did not explicitly indicate that age was
to be treated as a continuous variable, it was treated as categorical, meaning that unique coefficients
were estimated for each level of age. The only clue of this labeling is that the number of degrees of
freedom associated with the age term exceeds 1. The labeling becomes more obvious if we review
the regression coefficients:
. regress, baselevels

      Source |       SS       df       MS              Number of obs =      50
 ------------+------------------------------           F( 11,    38) =   13.21
       Model |  6638.86529    11  603.533208           Prob > F      =  0.0000
    Residual |  1735.63471    38  45.6745977           R-squared     =  0.7927
 ------------+------------------------------           Adj R-squared =  0.7328
       Total |      8374.5    49  170.908163           Root MSE      =  6.7583

 -----------------------------------------------------------------------------
       drate |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 ------------+----------------------------------------------------------------
      region |
          1  |     (base)
          2  |   .4428387   3.983664     0.11   0.912    -7.621668    8.507345
          3  |  -.2964637   3.934766    -0.08   0.940    -8.261981    7.669054
          4  |  -13.37147   4.195344    -3.19   0.003     -21.8645   -4.878439
             |
         age |
         24  |     (base)
         26  |        -15   9.557677    -1.57   0.125    -34.34851    4.348506
         27  |   14.30833   7.857378     1.82   0.076    -1.598099    30.21476
         28  |   12.66011   7.495513     1.69   0.099     -2.51376    27.83399
         29  |     18.861    7.28918     2.59   0.014     4.104825    33.61717
         30  |   20.87003   7.210148     2.89   0.006     6.273847    35.46621
         31  |   29.91307   8.242741     3.63   0.001     13.22652    46.59963
         32  |   27.02853   8.509432     3.18   0.003     9.802089    44.25498
         35  |     38.925   9.944825     3.91   0.000     18.79275    59.05724
             |
       _cons |   68.37147    7.95459     8.60   0.000     52.26824    84.47469
 -----------------------------------------------------------------------------
The regress command displayed the anova model as a regression table. We used the baselevels
option to display the dropped level (or base) for each term.
If we want to treat age as a continuous variable, we must prepend c. to age in our anova.
. anova drate region c.age
                           Number of obs =      50     R-squared     =  0.7203
                           Root MSE      = 7.21483     Adj R-squared =  0.6954

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |  6032.08254     4   1508.02064      28.97    0.0000
                         |
                  region |  1645.66228     3   548.554092      10.54    0.0000
                     age |  1630.46662     1   1630.46662      31.32    0.0000
                         |
                Residual |  2342.41746    45   52.0537213
              -----------+----------------------------------------------------
                   Total |      8374.5    49   170.908163
The age term now has 1 degree of freedom. The regression coefficients are
. regress, baselevels
      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  4,    45) =   28.97
       Model |  6032.08254     4  1508.02064           Prob > F      =  0.0000
    Residual |  2342.41746    45  52.0537213           R-squared     =  0.7203
-------------+------------------------------           Adj R-squared =  0.6954
       Total |      8374.5    49  170.908163           Root MSE      =  7.2148

       drate |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+------------------------------------------------------------------
      region |
          1  |     (base)
          2  |   1.792526   3.375925     0.53   0.598    -5.006935    8.591988
          3  |   .6979912    3.18154     0.22   0.827     -5.70996    7.105942
          4  |  -13.37578   3.723447    -3.59   0.001    -20.87519   -5.876377
             |
         age |   3.922947   .7009425     5.60   0.000     2.511177    5.334718
       _cons |  -28.60281   21.93931    -1.30   0.199    -72.79085    15.58524
Although we started analyzing these data to explain the regional differences in death rate, let’s focus
on the effect of age for a moment. In our first model, each level of age had a unique death rate
associated with it. For instance, the predicted death rate in a north central state with a median age
of 28 was
0.44 + 12.66 + 68.37 ≈ 81.47
whereas the predicted death rate from our current model is
1.79 + 3.92 × 28 − 28.60 ≈ 82.95
Our previous model had an R² of 0.7927, whereas our current model has an R² of 0.7203. This
“small” loss of predictive power accompanies a gain of 7 degrees of freedom, so we suspect that the
continuous-age model is as good as the discrete-age model.
Technical note
There is enough information in the two ANOVA tables to attach a statistical significance to our
suspicion that the loss of predictive power is offset by the savings in degrees of freedom. Because
the continuous-age model is nested within the discrete-age model, we can perform a standard Chow
test. For those of us who know such formulas off the top of our heads, the F statistic is
        F = [(2342.41746 − 1735.63471)/7] / 45.6745977 = 1.90
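Using anova's saved results (see Saved results below), the same statistic can be computed without
retyping any sums of squares. The following lines are a minimal sketch; the scalar names are arbitrary:

. quietly anova drate region age         // discrete-age model
. scalar rss_d = e(rss)                  // 1735.63471
. scalar df_d = e(df_r)                  // 38
. quietly anova drate region c.age       // continuous-age model
. scalar F = ((e(rss) - rss_d)/7) / (rss_d/df_d)
. display %4.2f F                        // 1.90
. display %6.4f Ftail(7, df_d, F)        // p-value, about 0.0970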
There is, however, a better way.
We can find out whether our continuous model is as good as our discrete model by putting age
in the model twice: once as a continuous variable and once as a categorical variable. The categorical
variable will then measure deviations around the straight line implied by the continuous variable, and
the F test for the significance of the categorical variable will test whether those deviations are jointly
zero.
. anova drate region c.age age
                           Number of obs =      50     R-squared     =  0.7927
                           Root MSE      =  6.7583     Adj R-squared =  0.7328

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |  6638.86529    11   603.533208      13.21    0.0000
                         |
                  region |  1320.00973     3   440.003244       9.63    0.0001
                     age |   699.74137     1    699.74137      15.32    0.0004
                     age |  606.782747     7   86.6832496       1.90    0.0970
                         |
                Residual |  1735.63471    38   45.6745977
              -----------+----------------------------------------------------
                   Total |      8374.5    49   170.908163
We find that the F test for the significance of the (categorical) age variable is 1.90, just as we
calculated above. It is significant at the 9.7% level. If we hold to a 5% significance level, we cannot
reject the null hypothesis that the effect of age is linear.
Example 10
In our census data, we still find significant differences across the regions after controlling for the
median age of the population. We might now wonder whether the regional differences are differences
in level — independent of age — or are instead differences in the regional effects of age. Just as we
can interact categorical variables with other categorical variables, we can interact categorical variables
with continuous variables.
. anova drate region c.age region#c.age
                           Number of obs =      50     R-squared     =  0.7365
                           Root MSE      = 7.24852     Adj R-squared =  0.6926

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |   6167.7737     7   881.110529      16.77    0.0000
                         |
                  region |  188.713602     3   62.9045339       1.20    0.3225
                     age |  873.425599     1   873.425599      16.62    0.0002
            region#c.age |  135.691162     3   45.2303874       0.86    0.4689
                         |
                Residual |   2206.7263    42   52.5411023
              -----------+----------------------------------------------------
                   Total |      8374.5    49   170.908163
The region#c.age term in our model measures the differences in slopes across the regions. We cannot
reject the null hypothesis that there are no such differences. The region effect is now “insignificant”.
This status does not mean that there are no regional differences in death rates because each test is a
marginal or partial test. Here, with region#c.age included in the model, region is being tested at
the point where age is zero. Apart from this value not existing in the dataset, it is also a long way
from the mean value of age, so the test of region at this point is meaningless (although it is valid
if you acknowledge what is being tested).
To obtain a more sensible test of region, we can subtract the mean from the age variable and
use this in the model.
. quietly summarize age
. generate mage = age - r(mean)
. anova drate region c.mage region#c.mage
                           Number of obs =      50     R-squared     =  0.7365
                           Root MSE      = 7.24852     Adj R-squared =  0.6926

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |   6167.7737     7   881.110529      16.77    0.0000
                         |
                  region |  1166.14735     3   388.715783       7.40    0.0004
                    mage |  873.425599     1   873.425599      16.62    0.0002
           region#c.mage |  135.691162     3   45.2303874       0.86    0.4689
                         |
                Residual |   2206.7263    42   52.5411023
              -----------+----------------------------------------------------
                   Total |      8374.5    49   170.908163
region is significant when tested at the mean of the age variable.
Remember that we can specify interactions by typing varname#varname. We have seen examples
of interacting categorical variables with categorical variables and, in the examples above, a categorical
variable (region) with a continuous variable (age or mage).
We can also interact continuous variables with continuous variables. To include an age² term
in our model, we could type c.age#c.age. If we also wanted to interact the categorical variable
region with the age² term, we could type region#c.age#c.age (or even c.age#region#c.age).
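For instance, either of the following commands should be accepted (a sketch; these particular models
are not fit elsewhere in this entry):

. anova drate region c.age c.age#c.age
. anova drate region c.age c.age#c.age region#c.age#c.age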
Nested designs
In addition to specifying interaction terms, nested terms can also be specified in an ANOVA. A
vertical bar is used to indicate nesting: A|B is read as A nested within B. A|B|C is read as A nested
within B, which is nested within C. A|B#C is read as A is nested within the interaction of B and C.
A#B|C is read as the interaction of A and B, which is nested within C.
Different error terms can be specified for different parts of the model. The forward slash is used
to indicate that the next term in the model is the error term for what precedes it. For instance,
anova y A / B|A indicates that the F test for A is to be tested by using the mean square from B|A
in the denominator. Error terms (terms following the slash) are generally not tested unless they are
themselves followed by a slash. Residual error is the default error term.
For example, consider A / B / C, where A, B, and C may be arbitrarily complex terms. Then
anova will report A tested by B and B tested by C. If we add one more slash on the end to form
A / B / C /, then anova will also report C tested by the residual error.
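Schematically, with hypothetical variables y, a, b, and c, the A / B / C / pattern just described
corresponds to

. anova y a / b|a / c|b|a /

so that a is tested by b|a, b|a is tested by c|b|a, and c|b|a is tested by residual error.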
Example 11
We have collected data from a manufacturer that is evaluating which of five different brands
of machinery to buy to perform a particular function in an assembly line. Twenty assembly-line
employees were selected at random for training on these machines, with four employees assigned
to learn a particular machine. The output from each employee (operator) on the brand of machine
for which he trained was measured during four trial periods. In this example, the operator is nested
within machine. Because of sickness and employee resignations, the final data are not balanced. The
following table gives the mean output and sample size for each machine and operator combination.
. use http://www.stata-press.com/data/r11/machine, clear
(machine data)
. table machine operator, c(mean output n output) col f(%8.2f)
five      |
brands of |           operator nested in machine
machine   |        1         2         3         4     Total
----------+--------------------------------------------------
        1 |     9.15      9.48      8.27      8.20      8.75
          |        2         4         3         4        13
          |
        2 |    15.03     11.55     11.45     11.52     12.47
          |        3         2         2         4        11
          |
        3 |    11.27     10.13     11.13               10.84
          |        3         3         3                   9
          |
        4 |    16.10     18.97     15.35     16.60     16.65
          |        3         3         4         3        13
          |
        5 |    15.30     14.35     10.43               13.63
          |        4         4         3                  11
Assuming that operator is random (i.e., we wish to infer to the larger population of possible
operators) and machine is fixed (i.e., only these five machines are of interest), the typical test for
machine uses operator nested within machine as the error term. operator nested within machine
can be tested by residual error. Our earlier warning concerning designs with either unplanned missing
cells or unbalanced cell sizes, or both, also applies to interpreting the ANOVA results from this
unbalanced nested example.
. anova output machine / operator|machine /
                           Number of obs =      57     R-squared     =  0.8661
                           Root MSE      = 1.47089     Adj R-squared =  0.8077

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |  545.822288    17   32.1071934      14.84    0.0000
                         |
                 machine |  430.980792     4   107.745198      13.82    0.0001
        operator|machine |  101.353804    13   7.79644648
              -----------+----------------------------------------------------
        operator|machine |  101.353804    13   7.79644648       3.60    0.0009
                Residual |  84.3766582    39   2.16350406
              -----------+----------------------------------------------------
                   Total |  630.198947    56   11.2535526
operator|machine is preceded by a slash, indicating that it is the error term for the terms before
it (here machine). operator|machine is also followed by a slash that indicates it should be tested
with residual error. The output lists the operator|machine term twice, once as the error term for
machine, and again as a term tested by residual error. A line is placed in the ANOVA table to separate
the two. In general, a dividing line is placed in the output to separate the terms into groups that are
tested with the same error term. The overall model is tested by residual error and is separated from
the rest of the table by a blank line at the top of the table.
The results indicate that the machines are not all equal and that there are significant differences
between operators.
Example 12
Your company builds and operates sewage treatment facilities. You want to compare two particulate
solutions during the particulate reduction step of the sewage treatment process. For each solution,
two area managers are randomly selected to implement and oversee the change to the new treatment
process in two of their randomly chosen facilities. Two workers at each of these facilities are trained
to operate the new process. A measure of particulate reduction is recorded at various times during
the month at each facility for each worker. The data are described below.
. use http://www.stata-press.com/data/r11/sewage
(Sewage treatment)
. describe
Contains data from http://www.stata-press.com/data/r11/sewage.dta
  obs:            64                          Sewage treatment
 vars:             5                          9 May 2009 12:43
 size:           576 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
particulate     byte   %9.0g                  particulate reduction
solution        byte   %9.0g                  2 particulate solutions
manager         byte   %9.0g       manager    2 managers per solution
facility        byte   %9.0g       facility   2 facilities per manager
worker          byte   %9.0g       worker     2 workers per facility
-------------------------------------------------------------------------------
Sorted by:  solution
You want to determine if the two particulate solutions provide significantly different particulate
reduction. You would also like to know if manager, facility, and worker are significant effects.
solution is a fixed factor, whereas manager, facility, and worker are random factors.
In the following anova command, we use abbreviations for the variable names, which can sometimes
make long ANOVA model statements easier to read.
. anova particulate s / m|s / f|m|s / w|f|m|s /, dropemptycells
                           Number of obs =      64     R-squared     =  0.6338
                           Root MSE      = 12.7445     Adj R-squared =  0.5194

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |  13493.6094    15   899.573958       5.54    0.0000
                         |
                solution |  7203.76563     1   7203.76563      17.19    0.0536
        manager|solution |   838.28125     2   419.140625
              -----------+----------------------------------------------------
        manager|solution |   838.28125     2   419.140625       0.55    0.6166
       facility|manager| |
                solution |   3064.9375     4   766.234375
              -----------+----------------------------------------------------
       facility|manager| |
                solution |   3064.9375     4   766.234375       2.57    0.1193
        worker|facility| |
        manager|solution |    2386.625     8   298.328125
              -----------+----------------------------------------------------
        worker|facility| |
        manager|solution |    2386.625     8   298.328125       1.84    0.0931
                Residual |     7796.25    48   162.421875
              -----------+----------------------------------------------------
                   Total |  21289.8594    63   337.934276
While solution is not declared significant at the 5% significance level, it is near enough to
that threshold to warrant further investigation (see example 3 in [R] anova postestimation for a
continuation of the analysis of these data).
Technical note
Why did we use the dropemptycells option with the previous anova? By default, Stata retains
empty cells when building the design matrix and currently treats | and # the same in how it
determines the possible number of cells. Retaining empty cells in an ANOVA with nested terms can
cause your design matrix to become too large. In example 12, there are 1024 = 2 × 4 × 8 × 16
cells that are considered possible for the worker|facility|manager|solution term because the
worker, facility, and manager variables are uniquely numbered. With the dropemptycells
option, the worker|facility|manager|solution term requires just 16 columns in the design
matrix (corresponding to the 16 unique workers).
Why did we not use the dropemptycells option in example 11, where operator is nested in
machine? If you look at the table presented at the beginning of that example, you will see that
operator is compactly instead of uniquely numbered (you need both operator number and machine
number to determine the operator). Here the dropemptycells option would have only reduced
our design matrix from 26 columns down to 24 columns (because there were only 3 operators instead
of 4 for machines 3 and 5).
We suggest that you specify dropemptycells when there are nested terms in your ANOVA. You
could also use the set emptycells drop command to accomplish the same thing; see [R] set.
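As a sketch, setting the default once means that dropemptycells need not be typed on each
subsequent command:

. set emptycells drop
. anova particulate s / m|s / f|m|s / w|f|m|s /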
Mixed designs
An ANOVA can consist of both nested and crossed terms. A split-plot ANOVA design provides an
example.
Example 13
Two reading programs and three skill-enhancement techniques are under investigation. Ten classes
of first-grade students were randomly assigned so that five classes were taught with one reading
program and another five classes were taught with the other. The 30 students in each class were
divided into six groups with 5 students each. Within each class, the six groups were divided randomly
so that each of the three skill-enhancement techniques was taught to two of the groups within each
class. At the end of the school year, a reading assessment test was administered to all the students.
In this split-plot ANOVA, the whole-plot treatment is the two reading programs, and the split-plot
treatment is the three skill-enhancement techniques.
. use http://www.stata-press.com/data/r11/reading
(Reading experiment data)
. describe
Contains data from http://www.stata-press.com/data/r11/reading.dta
  obs:           300                          Reading experiment data
 vars:             5                          9 Mar 2009 18:57
 size:         2,700 (99.9% of memory free)   (_dta has notes)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
score           byte   %9.0g                  reading score
program         byte   %9.0g                  reading program
class           byte   %9.0g                  class nested in program
skill           byte   %9.0g                  skill enhancement technique
group           byte   %9.0g                  group nested in class and skill
-------------------------------------------------------------------------------
Sorted by:
In this split-plot ANOVA, the error term for program is class nested within program. The error
term for skill and the program by skill interaction is the class by skill interaction nested
within program. Other terms are also involved in the model and can be seen below.
Our anova command is too long to fit on one line of this manual. Where we have chosen to break
the command into multiple lines is arbitrary. If we were typing this command into Stata, we would
just type along and let Stata automatically wrap across lines, as necessary.
. anova score prog / class|prog skill prog#skill / class#skill|prog
> / group|class#skill|prog /, dropemptycells
                           Number of obs =     300     R-squared     =  0.3738
                           Root MSE      = 14.6268     Adj R-squared =  0.2199

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |  30656.5167    59   519.601977       2.43    0.0000
                         |
                 program |     4493.07     1      4493.07       8.73    0.0183
           class|program |  4116.61333     8   514.576667
              -----------+----------------------------------------------------
                   skill |  1122.64667     2   561.323333       1.54    0.2450
           program#skill |     5694.62     2      2847.31       7.80    0.0043
     class#skill|program |  5841.46667    16   365.091667
              -----------+----------------------------------------------------
     class#skill|program |  5841.46667    16   365.091667       1.17    0.3463
      group|class#skill| |
                 program |      9388.1    30   312.936667
              -----------+----------------------------------------------------
      group|class#skill| |
                 program |      9388.1    30   312.936667       1.46    0.0636
                Residual |     51346.4   240   213.943333
              -----------+----------------------------------------------------
                   Total |  82002.9167   299   274.257246
The program#skill term is significant, as is the program term for these particular data. Let’s look
at the predictive margins for these two terms.
. margins, within(program skill)

Predictive margins                                Number of obs   =        300

Expression   : Linear prediction, predict()
within       : program skill

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     program#|
       skill |
        1 1  |      68.16   2.068542    32.95   0.000     64.10573    72.21427
        1 2  |      52.86   2.068542    25.55   0.000     48.80573    56.91427
        1 3  |      61.54   2.068542    29.75   0.000     57.48573    65.59427
        2 1  |       50.7   2.068542    24.51   0.000     46.64573    54.75427
        2 2  |      56.54   2.068542    27.33   0.000     52.48573    60.59427
        2 3  |       52.1   2.068542    25.19   0.000     48.04573    56.15427
------------------------------------------------------------------------------

. margins, within(program)

Predictive margins                                Number of obs   =        300

Expression   : Linear prediction, predict()
within       : program

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     program |
          1  |   60.85333   1.194273    50.95   0.000      58.5126    63.19407
          2  |   53.11333   1.194273    44.47   0.000      50.7726    55.45407
------------------------------------------------------------------------------
Because our ANOVA involves nested terms, we used the within() option of margins; see
[R] margins.
skill 2 produces a low score when combined with program 1 and a high score when combined
with program 2, demonstrating the interaction between the reading program and the skill-enhancement
technique. From these tables, you might conclude that the first reading program and the first
skill-enhancement technique perform best when combined. However, notice the overlapping confidence
intervals for the first reading program and the third skill-enhancement technique.
Technical note
There are several valid ways to write complicated anova terms. In the reading experiment
example (example 13), we had a term group|class#skill|program. This term can be read
as group nested within both class and skill and further nested within program. You can
also write this term as group|class#skill#program or group|program#class#skill or
group|skill#class|program, etc. All variations will produce the same result. Some people prefer
having only one ‘|’ in a term and would use group|class#skill#program, which is read as group
nested within class, skill, and program.
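For instance, replacing the group term in the example 13 command with one of its equivalent
spellings should reproduce the same ANOVA table (a sketch):

. anova score prog / class|prog skill prog#skill / class#skill|prog
>     / group|class#skill#prog /, dropemptycells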
Gertrude Mary Cox (1900–1978) was born on a farm near Dayton, Iowa. Initially intending to
become superintendent of an orphanage, she enrolled at Iowa State College, where she majored
in mathematics and attained the college’s first Master’s degree in statistics. She started a PhD in
psychological statistics at Berkeley but returned to Iowa State after only two years to work with
George W. Snedecor. Cox was put in charge of establishing a Computing Laboratory and began
to teach design of experiments, the latter leading to her classic text with William G. Cochran. In
1940, Snedecor showed Cox his all-male list of suggestions to head a new statistics department
at North Carolina State College and, at her urging, added her name. She was selected and built
an outstanding department. Cox retired early to work at the new Research Triangle Institute
between Raleigh and Chapel Hill. She consulted widely, served as editor of Biometrics, and was
elected to the National Academy of Sciences.
Latin-square designs
You can use anova to analyze a Latin-square design. Consider the following example, published
in Snedecor and Cochran (1989).
Example 14
Data from a Latin-square design are as follows:
        Row   Column 1   Column 2   Column 3   Column 4   Column 5
         1     257(B)     230(E)     279(A)     287(C)     202(D)
         2     245(D)     283(A)     245(E)     280(B)     260(C)
         3     182(E)     252(B)     280(C)     246(D)     250(A)
         4     203(A)     204(C)     227(D)     193(E)     259(B)
         5     231(C)     271(D)     266(B)     334(A)     338(E)
In Stata, the data might appear as follows:
. use http://www.stata-press.com/data/r11/latinsq
. list
     +-----------------------------------+
     | row    c1    c2    c3    c4    c5 |
     |-----------------------------------|
  1. |   1   257   230   279   287   202 |
  2. |   2   245   283   245   280   260 |
  3. |   3   182   252   280   246   250 |
  4. |   4   203   204   227   193   259 |
  5. |   5   231   271   266   334   338 |
     +-----------------------------------+
Before anova can be used on these data, the data must be organized so that the outcome
measurement is in one column. reshape is inadequate for this task because there is information
about the treatments in the sequence of these observations. pkshape is designed to reshape this type
of data; see [R] pkshape.
. pkshape row row c1-c5, order(beacd daebc ebcda acdeb cdbae)
. list
     +----------------------------------------------+
     | sequence   outcome   treat   carry   period |
     |----------------------------------------------|
  1. |        1       257       1       0        1 |
  2. |        2       245       5       0        1 |
  3. |        3       182       2       0        1 |
  4. |        4       203       3       0        1 |
  5. |        5       231       4       0        1 |
     |----------------------------------------------|
  6. |        1       230       2       1        2 |
  7. |        2       283       3       5        2 |
  8. |        3       252       1       2        2 |
  9. |        4       204       4       3        2 |
 10. |        5       271       5       4        2 |
     |----------------------------------------------|
 11. |        1       279       3       2        3 |
 12. |        2       245       2       3        3 |
 13. |        3       280       4       1        3 |
 14. |        4       227       5       4        3 |
 15. |        5       266       1       5        3 |
     |----------------------------------------------|
 16. |        1       287       4       3        4 |
 17. |        2       280       1       2        4 |
 18. |        3       246       5       4        4 |
 19. |        4       193       2       5        4 |
 20. |        5       334       3       1        4 |
     |----------------------------------------------|
 21. |        1       202       5       4        5 |
 22. |        2       260       4       1        5 |
 23. |        3       250       3       5        5 |
 24. |        4       259       1       2        5 |
 25. |        5       338       2       3        5 |
     +----------------------------------------------+
. anova outcome sequence period treat
                           Number of obs =      25     R-squared     =  0.6536
                           Root MSE      = 32.4901     Adj R-squared =  0.3073

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |    23904.08    12   1992.00667       1.89    0.1426
                         |
                sequence |    13601.36     4      3400.34       3.22    0.0516
                  period |     6146.16     4      1536.54       1.46    0.2758
                   treat |     4156.56     4      1039.14       0.98    0.4523
                         |
                Residual |    12667.28    12   1055.60667
              -----------+----------------------------------------------------
                   Total |    36571.36    24   1523.80667
These methods will work with any type of Latin-square design, including those with replicated
measurements. For more information, see [R] pk, [R] pkcross, and [R] pkshape.
Repeated-measures ANOVA
One approach for analyzing repeated-measures data is to use multivariate ANOVA (MANOVA); see
[MV] manova. In this approach, the data are placed in wide form (see [D] reshape), and the repeated
measures enter the MANOVA as dependent variables.
A second approach for analyzing repeated measures is to use anova. However, one of the underlying
assumptions for the F tests in ANOVA is independence of observations. In a repeated-measures design,
this assumption is almost certainly violated or is at least suspect. In a repeated-measures ANOVA,
the subjects (or whatever the experimental units are called) are observed for each level of one or
more of the other categorical variables in the model. These variables are called the repeated-measure
variables. Observations from the same subject are likely to be correlated.
The approach used in repeated-measures ANOVA to correct for this lack of independence is to
apply a correction to the degrees of freedom of the F test for terms in the model that involve
repeated measures. This correction factor, ε, lies between the reciprocal of the degrees of freedom
for the repeated term and 1. Box (1954) provided the pioneering work in this area. Milliken and
Johnson (1984) refer to the lower bound of this correction factor as Box's conservative correction
factor. Winer, Brown, and Michels (1991) call it simply the conservative correction factor.
Geisser and Greenhouse (1958) provide an estimate for the correction factor called the Greenhouse–
Geisser ε. This value is estimated from the data. Huynh and Feldt (1976) show that the Greenhouse–
Geisser ε tends to be conservatively biased. They provide a revised correction factor called the
Huynh–Feldt ε. When the Huynh–Feldt ε exceeds 1, it is set to 1. Thus there is a natural ordering
for these correction factors:

        Box's conservative ε ≤ Greenhouse–Geisser ε ≤ Huynh–Feldt ε ≤ 1

A correction factor of 1 is the same as no correction.
anova with the repeated() option computes these correction factors and displays the revised
test results in a table that follows the standard ANOVA table. In the resulting table, H-F stands for
Huynh–Feldt, G-G stands for Greenhouse–Geisser, and Box stands for Box's conservative ε.
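In other words, if df1 and df2 are a term's uncorrected numerator and denominator degrees of
freedom, each corrected p-value in that table is a restatement of the same F ratio,

        p = Pr{ F(ε·df1, ε·df2) > F_observed }

with ε equal to 1 for the regular test or to one of the correction factors above. Example 15 below
illustrates this: there ε = 0.6049 scales 3 and 12 degrees of freedom to 1.8147 and 7.2588.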
Example 15
This example is taken from table 4.3 of Winer, Brown, and Michels (1991). The reaction time for
five subjects each tested with four drugs was recorded in the variable score. Here is a table of the
data (see [P] tabdisp if you are unfamiliar with tabdisp):
. use http://www.stata-press.com/data/r11/t43, clear
(T4.3 -- Winer, Brown, Michels)
. tabdisp person drug, cellvar(score)
          |              drug
   person |    1     2     3     4
----------+------------------------
        1 |   30    28    16    34
        2 |   14    18    10    22
        3 |   24    20    18    30
        4 |   38    34    20    44
        5 |   26    28    14    30
drug is the repeated variable in this simple repeated-measures ANOVA example. The ANOVA is
specified as follows:
. anova score person drug, repeated(drug)
                           Number of obs =      20     R-squared     =  0.9244
                           Root MSE      = 3.06594     Adj R-squared =  0.8803

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |        1379     7          197      20.96    0.0000
                         |
                  person |       680.8     4        170.2      18.11    0.0001
                    drug |       698.2     3   232.733333      24.76    0.0000
                         |
                Residual |       112.8    12          9.4
              -----------+----------------------------------------------------
                   Total |      1491.8    19   78.5157895

Between-subjects error term:  person
                     Levels:  5        (4 df)
     Lowest b.s.e. variable:  person

Repeated variable: drug
                                          Huynh-Feldt epsilon        =  1.0789
                                          *Huynh-Feldt epsilon reset to 1.0000
                                          Greenhouse-Geisser epsilon =  0.6049
                                          Box's conservative epsilon =  0.3333

                                            ------------ Prob > F ------------
                  Source |     df      F    Regular    H-F      G-G      Box
              -----------+----------------------------------------------------
                    drug |      3   24.76   0.0000    0.0000   0.0006   0.0076
                Residual |     12
              -----------+----------------------------------------------------
Here the Huynh–Feldt ε is 1.0789, which is larger than 1. It is reset to 1, which is the same as making
no adjustment to the standard test computed in the main ANOVA table. The Greenhouse–Geisser ε is
0.6049, and its associated p-value is computed from an F ratio of 24.76 using 1.8147 (= 3ε) and
7.2588 (= 12ε) degrees of freedom. Box's conservative ε is set equal to the reciprocal of the degrees
of freedom for the repeated term. Here it is 1/3, so Box's conservative test is computed using 1 and
4 degrees of freedom for the observed F ratio of 24.76.
Even for Box's conservative ε, drug is significant with a p-value of 0.0076. The following table
gives the predictive marginal mean score (i.e., response time) for each of the four drugs:
. margins drug
Predictive margins                                Number of obs   =         20

Expression   : Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        drug |
          1  |       26.4   1.371131    19.25   0.000     23.71263    29.08737
          2  |       25.6   1.371131    18.67   0.000     22.91263    28.28737
          3  |       15.6   1.371131    11.38   0.000     12.91263    18.28737
          4  |         32   1.371131    23.34   0.000     29.31263    34.68737
------------------------------------------------------------------------------
The ANOVA table for this example provides an F test for person, but you should ignore it. An
appropriate test for person would require replication (i.e., multiple measurements for person and
drug combinations). Also, without replication there is no test available for investigating the interaction
between person and drug.
Example 16
Table 7.7 of Winer, Brown, and Michels (1991) provides another repeated-measures ANOVA example.
There are four dial shapes and two methods for calibrating dials. Subjects are nested within calibration
method, and an accuracy score is obtained. The data are shown below.
. use http://www.stata-press.com/data/r11/t77
(T7.7 -- Winer, Brown, Michels)
. tabdisp shape subject calib, cell(score)
          |    2 methods for calibrating dials
          |       and subject nested in calib
          | ------- 1 -------    ------- 2 -------
   4 dial |
   shapes |    1     2     3       1     2     3
----------+---------------------------------------
        1 |    0     3     4       4     5     7
        2 |    0     1     3       2     4     5
        3 |    5     5     6       7     6     8
        4 |    3     4     2       8     6     9
The calibration method and dial shapes are fixed factors, whereas subjects are random. The
appropriate test for calibration method uses the nested subject term as the error term. Both the dial
shape and the interaction between dial shape and calibration method are tested with the dial shape
by subject interaction nested within calibration method. Here we drop this term from the anova
command, and it becomes residual error. The dial shape is the repeated variable because each subject
is tested with all four dial shapes. Here is the anova command that produces the desired results:
. anova score calib / subject|calib shape calib#shape, repeated(shape)
                           Number of obs =      24     R-squared     =  0.8925
                           Root MSE      = 1.11181     Adj R-squared =  0.7939

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |     123.125    11   11.1931818       9.06    0.0003
                         |
                   calib |  51.0416667     1   51.0416667      11.89    0.0261
           subject|calib |  17.1666667     4   4.29166667
              -----------+----------------------------------------------------
                   shape |  47.4583333     3   15.8194444      12.80    0.0005
             calib#shape |  7.45833333     3   2.48611111       2.01    0.1662
                         |
                Residual |  14.8333333    12   1.23611111
              -----------+----------------------------------------------------
                   Total |  137.958333    23   5.99818841

Between-subjects error term:  subject|calib
                     Levels:  6        (4 df)
     Lowest b.s.e. variable:  subject
     Covariance pooled over:  calib    (for repeated variable)

Repeated variable: shape
                                          Huynh-Feldt epsilon        =  0.8483
                                          Greenhouse-Geisser epsilon =  0.4751
                                          Box's conservative epsilon =  0.3333

                                            ------------ Prob > F ------------
                  Source |     df      F    Regular    H-F      G-G      Box
              -----------+----------------------------------------------------
                   shape |      3   12.80   0.0005    0.0011   0.0099   0.0232
             calib#shape |      3    2.01   0.1662    0.1791   0.2152   0.2291
                Residual |     12
              -----------+----------------------------------------------------
The repeated-measure corrections are applied to any terms that are tested in the main ANOVA
table and have the repeated variable in the term. These corrections are given in a table below the
main ANOVA table. Here the repeated-measures tests for shape and calib#shape are presented.
Calibration method is significant, as is dial shape. The interaction between calibration method and
dial shape is not significant. The repeated-measure corrections do not change these conclusions, but
they do change the significance level for the tests on shape and calib#shape. Here, though, unlike
in the previous example, the Huynh–Feldt ε is less than 1.
Here are the predictive marginal mean scores for calibration method and dial shapes. Because the
interaction was not significant, we request only the calib and shape predictive margins.
. margins, within(calib)

Predictive margins                                Number of obs   =         24

Expression   : Linear prediction, predict()
within       : calib

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       calib |
          1  |          3   .3209506     9.35   0.000     2.370948    3.629052
          2  |   5.916667   .3209506    18.43   0.000     5.287615    6.545718
------------------------------------------------------------------------------

. margins, within(shape)

Predictive margins                                Number of obs   =         24

Expression   : Linear prediction, predict()
within       : shape

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       shape |
          1  |   3.833333   .4538926     8.45   0.000      2.94372    4.722947
          2  |        2.5   .4538926     5.51   0.000     1.610387    3.389613
          3  |   6.166667   .4538926    13.59   0.000     5.277053     7.05628
          4  |   5.333333   .4538926    11.75   0.000      4.44372    6.222947
------------------------------------------------------------------------------
Technical note
The computation of the Greenhouse–Geisser and Huynh–Feldt epsilons in a repeated-measures
ANOVA requires the number of levels and degrees of freedom for the between-subjects error term, as
well as a value computed from a pooled covariance matrix. The observations are grouped based on
all but the lowest-level variable in the between-subjects error term. The covariance over the repeated
variables is computed for each resulting group, and then these covariance matrices are pooled. The
dimension of the pooled covariance matrix is the number of levels of the repeated variable (or
combination of levels for multiple repeated variables). In example 16, there are four levels of the
repeated variable (shape), so the resulting covariance matrix is 4 × 4.
The anova command automatically attempts to determine the between-subjects error term and the
lowest-level variable in the between-subjects error term to group the observations for computation of
the pooled covariance matrix. anova issues an error message indicating that the bse() or bseunit()
option is required when anova cannot determine them. You may override the default selections of
anova by specifying the bse(), bseunit(), or grouping() option. The term specified in the bse()
option must be a term in the ANOVA model.
The default selection for the between-subjects error term (the bse() option) is the interaction of the
nonrepeated categorical variables in the ANOVA model. The first variable listed in the between-subjects
error term is automatically selected as the lowest-level variable in the between-subjects error term
but can be overridden with the bseunit(varname) option. varname is often a term, such as subject
or subsample within subject, and is most often listed first in the term because of the nesting notation
of ANOVA. This term makes sense in most repeated-measures ANOVA designs when the terms of
the model are written in standard form. For instance, in example 16, there were three categorical
variables (subject, calib, and shape), with shape being the repeated variable. Here anova looked
for a term involving only subject and calib to determine the between-subjects error term. It found
subject|calib as the term with six levels and 4 degrees of freedom. anova then picked subject
as the default for the bseunit() option (the lowest variable in the between-subjects error term)
because it was listed first in the term.
The grouping of observations proceeds, based on the different combinations of values of the
variables in the between-subjects error term, excluding the lowest level variable (as found by default
or as specified with the bseunit() option). You may specify the grouping() option to change the
default grouping used in computing the pooled covariance matrix.
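For instance, the defaults that anova determined in example 16 could have been made explicit as
follows (a sketch that should be equivalent to the earlier command):

. anova score calib / subject|calib shape calib#shape, repeated(shape)
>     bse(subject|calib) bseunit(subject)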
The between-subjects error term, number of levels, degrees of freedom, lowest variable in the
term, and grouping information are presented after the main ANOVA table and before the rest of the
repeated-measures output.
Example 17
Data with two repeated variables are given in table 7.13 of Winer, Brown, and Michels (1991).
The accuracy scores of subjects making adjustments to three dials during three different periods are
recorded. Three subjects are exposed to a certain noise background level, whereas a different set of
three subjects is exposed to a different noise background level. Here is a table of accuracy scores for
the noise, subject, period, and dial variables:
. use http://www.stata-press.com/data/r11/t713
(T7.13 -- Winer, Brown, Michels)
. tabdisp subject dial period, by(noise) cell(score) stubwidth(11)
noise       |
background  |
and subject |
nested in   |           10 minute time periods and dial
noise       | ------ 1 ------    ------ 2 ------    ------ 3 ------
            |   1    2    3        1    2    3        1    2    3
------------+------------------------------------------------------
1           |
          1 |  45   53   60       40   52   57       28   37   46
          2 |  35   41   50       30   37   47       25   32   41
          3 |  60   65   75       58   54   70       40   47   50
------------+------------------------------------------------------
2           |
          1 |  50   48   61       25   34   51       16   23   35
          2 |  42   45   55       30   37   43       22   27   37
          3 |  56   60   77       40   39   57       31   29   46
noise, period, and dial are fixed, whereas subject is random. Both period and dial are
repeated variables. The ANOVA for this example is specified next.
. anova score noise / subject|noise period noise#period
> / period#subject|noise dial noise#dial / dial#subject|noise
> period#dial noise#period#dial, repeated(period dial)
                           Number of obs =      54     R-squared     =  0.9872
                           Root MSE      = 2.81859     Adj R-squared =  0.9576

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |  9797.72222    37   264.803303      33.33    0.0000
                         |
                   noise |  468.166667     1   468.166667       0.75    0.4348
           subject|noise |  2491.11111     4   622.777778
              -----------+----------------------------------------------------
                  period |  3722.33333     2   1861.16667      63.39    0.0000
            noise#period |         333     2        166.5       5.67    0.0293
    period#subject|noise |  234.888889     8   29.3611111
              -----------+----------------------------------------------------
                    dial |  2370.33333     2   1185.16667      89.82    0.0000
              noise#dial |  50.3333333     2   25.1666667       1.91    0.2102
      dial#subject|noise |  105.555556     8   13.1944444
              -----------+----------------------------------------------------
             period#dial |  10.6666667     4   2.66666667       0.34    0.8499
       noise#period#dial |  11.3333333     4   2.83333333       0.36    0.8357
                Residual |  127.111111    16   7.94444444
              -----------+----------------------------------------------------
                   Total |  9924.83333    53   187.261006

Between-subjects error term:  subject|noise
                     Levels:  6        (4 df)
     Lowest b.s.e. variable:  subject
     Covariance pooled over:  noise    (for repeated variables)

Repeated variable: period
                                          Huynh-Feldt epsilon        =  1.0668
                                          *Huynh-Feldt epsilon reset to 1.0000
                                          Greenhouse-Geisser epsilon =  0.6476
                                          Box's conservative epsilon =  0.5000

                                            ------------ Prob > F ------------
                  Source |     df      F    Regular    H-F      G-G      Box
              -----------+----------------------------------------------------
                  period |      2   63.39   0.0000    0.0000   0.0003   0.0013
            noise#period |      2    5.67   0.0293    0.0293   0.0569   0.0759
    period#subject|noise |      8
              -----------+----------------------------------------------------

Repeated variable: dial
                                          Huynh-Feldt epsilon        =  2.0788
                                          *Huynh-Feldt epsilon reset to 1.0000
                                          Greenhouse-Geisser epsilon =  0.9171
                                          Box's conservative epsilon =  0.5000

                                            ------------ Prob > F ------------
                  Source |     df      F    Regular    H-F      G-G      Box
              -----------+----------------------------------------------------
                    dial |      2   89.82   0.0000    0.0000   0.0000   0.0007
              noise#dial |      2    1.91   0.2102    0.2102   0.2152   0.2394
      dial#subject|noise |      8
              -----------+----------------------------------------------------

Repeated variables: period#dial
                                          Huynh-Feldt epsilon        =  1.3258
                                          *Huynh-Feldt epsilon reset to 1.0000
                                          Greenhouse-Geisser epsilon =  0.5134
                                          Box's conservative epsilon =  0.2500

                                            ------------ Prob > F ------------
                  Source |     df      F    Regular    H-F      G-G      Box
              -----------+----------------------------------------------------
             period#dial |      4    0.34   0.8499    0.8499   0.7295   0.5934
       noise#period#dial |      4    0.36   0.8357    0.8357   0.7156   0.5825
                Residual |     16
              -----------+----------------------------------------------------
For each repeated variable and for each combination of interactions of repeated variables, there are
different correction values. The anova command produces tables for each applicable combination.
The two most significant factors in this model appear to be dial and period. The noise by
period interaction may also be significant, depending on the correction factor you use. Below is a
table of predictive margins for the accuracy score for dial, period, and noise by period.
. margins, within(dial)

Predictive margins                                Number of obs   =         54

Expression   : Linear prediction, predict()
within       : dial

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dial |
          1  |   37.38889   .6643478    56.28   0.000     36.08679    38.69099
          2  |   42.22222   .6643478    63.55   0.000     40.92012    43.52432
          3  |   53.22222   .6643478    80.11   0.000     51.92012    54.52432
------------------------------------------------------------------------------

. margins, within(period)

Predictive margins                                Number of obs   =         54

Expression   : Linear prediction, predict()
within       : period

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      period |
          1  |   54.33333   .6643478    81.78   0.000     53.03124    55.63543
          2  |       44.5   .6643478    66.98   0.000      43.1979     45.8021
          3  |         34   .6643478    51.18   0.000      32.6979     35.3021
------------------------------------------------------------------------------

. margins, within(noise period)

Predictive margins                                Number of obs   =         54

Expression   : Linear prediction, predict()
within       : noise period

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
noise#period |
        1 1  |   53.77778   .9395297    57.24   0.000     51.93633    55.61922
        1 2  |   49.44444   .9395297    52.63   0.000       47.603    51.28589
        1 3  |   38.44444   .9395297    40.92   0.000       36.603    40.28589
        2 1  |   54.88889   .9395297    58.42   0.000     53.04744    56.73033
        2 2  |   39.55556   .9395297    42.10   0.000     37.71411      41.397
        2 3  |   29.55556   .9395297    31.46   0.000     27.71411      31.397
------------------------------------------------------------------------------
Dial shape 3 produces the highest score, and scores decrease over the periods.
Example 17 had two repeated-measurement variables. Up to four repeated-measurement variables
may be specified in the anova command.
Saved results
anova saves the following in e():
Scalars
    e(N)             number of observations
    e(mss)           model sum of squares
    e(df_m)          model degrees of freedom
    e(rss)           residual sum of squares
    e(df_r)          residual degrees of freedom
    e(r2)            R-squared
    e(r2_a)          adjusted R-squared
    e(F)             F statistic
    e(rmse)          root mean squared error
    e(ll)            log likelihood
    e(ll_0)          log likelihood, constant-only model
    e(ss_#)          sum of squares for term #
    e(df_#)          numerator degrees of freedom for term #
    e(ssdenom_#)     denominator sum of squares for term # (when using nonresidual error)
    e(dfdenom_#)     denominator degrees of freedom for term # (when using nonresidual error)
    e(F_#)           F statistic for term # (if computed)
    e(N_bse)         number of levels of the between-subjects error term
    e(df_bse)        degrees of freedom for the between-subjects error term
    e(box#)          Box's conservative epsilon for a particular combination of repeated
                       variables (repeated() only)
    e(gg#)           Greenhouse–Geisser epsilon for a particular combination of repeated
                       variables (repeated() only)
    e(hf#)           Huynh–Feldt epsilon for a particular combination of repeated
                       variables (repeated() only)
    e(version)       2
    e(rank)          rank of e(V)

Macros
    e(cmd)           anova
    e(cmdline)       command as typed
    e(depvar)        name of dependent variable
    e(varnames)      names of the right-hand-side variables
    e(term_#)        term #
    e(errorterm_#)   error term for term # (when using nonresidual error)
    e(sstype)        type of sum of squares; sequential or partial
    e(repvars)       names of repeated variables (repeated() only)
    e(repvar#)       names of repeated variables for a particular combination (repeated() only)
    e(model)         ols
    e(wtype)         weight type
    e(wexp)          weight expression
    e(properties)    b V
    e(estat_cmd)     program used to implement estat
    e(predict)       program used to implement predict
    e(asbalanced)    factor variables fvset as asbalanced
    e(asobserved)    factor variables fvset as asobserved

Matrices
    e(b)             coefficient vector
    e(V)             variance–covariance matrix of the estimators
    e(Srep)          covariance matrix based on repeated measures (repeated() only)

Functions
    e(sample)        marks estimation sample
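For example, after fitting a model you can inspect several of these results directly; the following is
a minimal sketch using the first anova of example 9:

. quietly anova drate region age
. display e(N)            // 50
. display e(r2)           // about 0.7927
. display e(rmse)         // root MSE, about 6.7583
. matrix list e(b)        // underlying regression coefficients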
Methods and formulas
anova is implemented as an ado-file.
References
Acock, A. C. 2008. A Gentle Introduction to Stata. 2nd ed. College Station, TX: Stata Press.
Afifi, A. A., and S. P. Azen. 1979. Statistical Analysis: A Computer Oriented Approach. 2nd ed. New York: Academic
Press.
Altman, D. G. 1991. Practical Statistics for Medical Research. London: Chapman & Hall/CRC.
Anderson, R. L. 1990. Gertrude Mary Cox 1900–1978. Biographical Memoirs, National Academy of Sciences 59:
116–132.
Box, G. E. P. 1954. Some theorems on quadratic forms applied in the study of analysis of variance problems, I. Effect
of inequality of variance in the one-way classification. Annals of Mathematical Statistics 25: 290–302.
Box, J. F. 1978. R. A. Fisher: The Life of a Scientist. New York: Wiley.
Cobb, G. W. 1998. Introduction to Design and Analysis of Experiments. New York: Springer.
Edwards, A. L. 1985. Multiple Regression and the Analysis of Variance and Covariance. 2nd ed. New York: Freeman.
Fisher, R. A. 1925. Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd.
Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver & Boyd.
Fisher, R. A. 1990. Statistical Methods, Experimental Design, and Scientific Inference. Oxford: Oxford University Press.
Geisser, S., and S. W. Greenhouse. 1958. An extension of Box’s results on the use of the F distribution in multivariate
analysis. Annals of Mathematical Statistics 29: 885–891.
Gleason, J. R. 1999. sg103: Within subjects (repeated measures) ANOVA, including between subjects factors. Stata
Technical Bulletin 47: 40–45. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 236–243. College Station,
TX: Stata Press.
Gleason, J. R. 2000. sg132: Analysis of variance from summary statistics. Stata Technical Bulletin 54: 42–46. Reprinted
in Stata Technical Bulletin Reprints, vol. 9, pp. 328–332. College Station, TX: Stata Press.
Higgins, J. E., and G. G. Koch. 1977. Variable selection and generalized chi-square analysis of categorical data applied
to a large cross-sectional occupational health survey. International Statistical Review 45: 51–62.
Huynh, H. 1978. Some approximate tests for repeated measurement designs. Psychometrika 43: 161–175.
Huynh, H., and L. S. Feldt. 1976. Estimation of the Box correction for degrees of freedom from sample data in
randomized block and split-plot designs. Journal of Educational Statistics 1: 69–82.
Kennedy Jr., W. J., and J. E. Gentle. 1980. Statistical Computing. New York: Dekker.
Kuehl, R. O. 2000. Design of Experiments: Statistical Principles of Research Design and Analysis. 2nd ed. Belmont,
CA: Duxbury.
Marchenko, Y. 2006. Estimating variance components in Stata. Stata Journal 6: 1–21.
Milliken, G. A., and D. E. Johnson. 1984. Analysis of Messy Data, Volume 1: Designed Experiments. New York:
Van Nostrand Reinhold.
Rabe-Hesketh, S., and B. S. Everitt. 2007. A Handbook of Statistical Analyses Using Stata. 4th ed. Boca Raton, FL:
Chapman & Hall/CRC.
Scheffé, H. 1959. The Analysis of Variance. New York: Wiley.
Snedecor, G. W., and W. G. Cochran. 1989. Statistical Methods. 8th ed. Ames, IA: Iowa State University Press.
van Belle, G., L. D. Fisher, P. J. Heagerty, and T. S. Lumley. 2004. Biostatistics: A Methodology for the Health
Sciences. 2nd ed. New York: Wiley.
Winer, B. J., D. R. Brown, and K. M. Michels. 1991. Statistical Principles in Experimental Design. 3rd ed. New York:
McGraw–Hill.
Also see
[R] anova postestimation — Postestimation tools for anova
[R] loneway — Large one-way ANOVA, random effects, and reliability
[R] oneway — One-way analysis of variance
[R] regress — Linear regression
[MV] manova — Multivariate analysis of variance and covariance
Title
anova postestimation — Postestimation tools for anova
Description
The following postestimation commands are of special interest after anova:
command              description
-------------------------------------------------------------------------------
dfbeta               DFBETA influence statistics
estat hettest        tests for heteroskedasticity
estat imtest         information matrix test
estat ovtest         Ramsey regression specification-error test for omitted
                       variables
estat szroeter       Szroeter's rank test for heteroskedasticity
estat vif            variance inflation factors for the independent variables
acprplot             augmented component-plus-residual plot
avplot               added-variable plot
avplots              all added-variable plots in one image
cprplot              component-plus-residual plot
lvr2plot             leverage-versus-squared-residual plot
rvfplot              residual-versus-fitted plot
rvpplot              residual-versus-predictor plot
-------------------------------------------------------------------------------
For information about these commands, see [R] regress postestimation.
The following standard postestimation commands are also available:
command       description
-------------------------------------------------------------------------------
estat         AIC, BIC, VCE, and estimation sample summary
estimates     cataloging estimation results
hausman       Hausman's specification test
lincom        point estimates, standard errors, testing, and inference for
                linear combinations of coefficients
linktest      link test for model specification
lrtest        likelihood-ratio test
margins       marginal means, predictive margins, marginal effects, and
                average marginal effects
nlcom         point estimates, standard errors, testing, and inference for
                nonlinear combinations of coefficients
predict       predictions, residuals, influence statistics, and other
                diagnostic measures
predictnl     point estimates, standard errors, testing, and inference for
                generalized predictions
suest         seemingly unrelated estimation
test          Wald tests of simple and composite linear hypotheses
testnl        Wald tests of nonlinear hypotheses
-------------------------------------------------------------------------------
See the corresponding entries in the Base Reference Manual for details.
Special-interest postestimation commands
In addition to the common estat commands (see [R] estat), estat hettest, estat imtest,
estat ovtest, estat szroeter, and estat vif are also available. dfbeta is also available.
The syntax for dfbeta and these estat commands is the same as after regress; see [R] regress
postestimation.
In addition to the standard syntax of test (see [R] test), test after anova allows three additional
syntaxes; see below. test performs Wald tests of expressions involving the coefficients of the
underlying regression model. Simple and composite linear hypotheses are possible.
Syntax for predict
predict after anova follows the same syntax as predict after regress and can provide
predictions, residuals, standardized residuals, studentized residuals, the standard error of the residuals,
the standard error of the prediction, the diagonal elements of the projection (hat) matrix, and Cook’s D.
See [R] regress postestimation for details.
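For instance, after an anova fit, the following sketch obtains several of these quantities (the new
variable names are arbitrary):

. predict yhat                  // linear prediction
. predict res, residuals        // residuals
. predict rstu, rstudent        // studentized residuals
. predict lev, hat              // diagonal elements of the hat matrix
. predict d, cooksd             // Cook's D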
Syntax for test after anova
In addition to the standard syntax of test (see [R] test), test after anova also allows the
following:
syntax a        test, test(matname) [ mtest[(opt)] matvlc(matname) ]

syntax b        test, showorder

syntax c        test term [term ...] [/ term [term ...]] [, symbolic]

syntax a        test expressions involving the coefficients of the underlying
                regression model; you provide information as a matrix

syntax b        show underlying order of design matrix, which is useful when
                constructing the matname argument of the test() option

syntax c        test effects and show symbolic forms
Menu

Statistics > Linear models and related > ANOVA/MANOVA > Test linear hypotheses after anova
Options for test after anova
test(matname) is required with syntax a of test. The rows of matname specify linear combinations
of the underlying design matrix of the ANOVA that are to be jointly tested. The columns correspond
to the underlying design matrix (including the constant if it has not been suppressed). The column
and row names of matname are ignored.
A listing of the constraints imposed by the test() option is presented before the table containing
the tests. You should examine this table to verify that you have applied the linear combinations
you desired. Typing test, showorder allows you to examine the ordering of the columns for
the design matrix from the ANOVA.
anova postestimation — Postestimation tools for anova
63
mtest[(opt)] specifies that tests are performed for each condition separately. opt specifies the method
    for adjusting p-values for multiple testing. Valid values for opt are

            bonferroni        Bonferroni's method
            holm              Holm's method
            sidak             Šidák's method
            noadjust          no adjustment is to be made

    Specifying mtest with no argument is equivalent to mtest(noadjust).
matvlc(matname), a programmer's option, saves the variance–covariance matrix of the linear
    combinations involved in the suite of tests. For the test Lb = c, what is returned in matname is
    LVL', where V is the estimated variance–covariance matrix of b.
showorder causes test to list the definition of each column in the design matrix. showorder is
not allowed with any other option.
symbolic requests the symbolic form of the test rather than the test statistic. When this option
is specified with no terms (test, symbolic), the symbolic form of the estimable functions is
displayed.
Remarks
Remarks are presented under the following headings:
Testing effects
Obtaining symbolic forms
Testing coefficients
Testing effects
After fitting a model using anova, you can test for the significance of effects in the ANOVA table,
as well as for effects that are not reported in the ANOVA table, by using the test command. You
follow test by the list of effects that you wish to test. By default, test uses the residual mean
squared error in the denominator of the F ratio. You can specify other error terms by using the slash
notation, just as you would with anova.
Example 1
Recall our byssinosis example (example 8) in [R] anova:
. anova prob workplace smokes race workplace#smokes workplace#race
> smokes#race workplace#smokes#race [aweight=pop]
(sum of wgt is   5.4190e+03)

                           Number of obs =      65     R-squared     =  0.8300
                           Root MSE      = .025902     Adj R-squared =  0.7948

                   Source |  Partial SS    df       MS           F     Prob > F
              ------------+----------------------------------------------------
                    Model |  .173646538    11   .015786049      23.53    0.0000
                          |
                workplace |  .097625175     2   .048812588      72.76    0.0000
                   smokes |  .013030812     1   .013030812      19.42    0.0001
                     race |  .001094723     1   .001094723       1.63    0.2070
         workplace#smokes |  .019690342     2   .009845171      14.67    0.0000
           workplace#race |  .001352516     2   .000676258       1.01    0.3718
              smokes#race |  .001662874     1   .001662874       2.48    0.1214
    workplace#smokes#race |  .000950841     2    .00047542       0.71    0.4969
                          |
                 Residual |  .035557766    53   .000670901
              ------------+----------------------------------------------------
                    Total |  .209204304    64   .003268817
We can easily obtain a test on a particular term from the ANOVA table. Here are two examples:
. test smokes

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                  smokes |  .013030812     1   .013030812      19.42    0.0001
                Residual |  .035557766    53   .000670901

. test smokes#race

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
             smokes#race |  .001662874     1   .001662874       2.48    0.1214
                Residual |  .035557766    53   .000670901
Both of these tests use residual error by default and agree with the ANOVA table produced earlier.
Technical note
After anova, you can use the ‘/’ syntax in test to perform tests with a variety of non-σ²I error
structures. However, in most unbalanced models, the mean squares are not independent and do not
have equal expectations under the null hypothesis. Also, be warned that you assume responsibility
for the validity of the test statistic.
Example 2
We return to the nested ANOVA example (example 11) in [R] anova, where five brands of machinery
were compared in an assembly line. We can obtain appropriate tests for the nested terms using test,
even if we had run the anova command without initially indicating the proper error terms.
. use http://www.stata-press.com/data/r11/machine
(machine data)
. anova output machine operator|machine
                           Number of obs =      57     R-squared     =  0.8661
                           Root MSE      = 1.47089     Adj R-squared =  0.8077

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |  545.822288    17   32.1071934      14.84    0.0000
                         |
                 machine |  430.980792     4   107.745198      49.80    0.0000
        operator|machine |  101.353804    13   7.79644648       3.60    0.0009
                         |
                Residual |  84.3766582    39   2.16350406
              -----------+----------------------------------------------------
                   Total |  630.198947    56   11.2535526
In this ANOVA table, machine is tested with residual error. With this particular nested design, the
appropriate error term for testing machine is operator nested within machine, which is easily
obtained from test.
. test machine / operator|machine
                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                 machine |  430.980792     4   107.745198      13.82    0.0001
        operator|machine |  101.353804    13   7.79644648
This result from test matches what we obtained from our anova command.
Example 3
The other nested ANOVA example (example 12) in [R] anova was based on the sewage data. The
ANOVA table is presented here again. As before, we will use abbreviations of variable names in typing
the commands.
. use http://www.stata-press.com/data/r11/sewage
(Sewage treatment)
. anova particulate s / m|s / f|m|s / w|f|m|s /, dropemptycells
                           Number of obs =      64     R-squared     =  0.6338
                           Root MSE      = 12.7445     Adj R-squared =  0.5194

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |  13493.6094    15   899.573958       5.54    0.0000
                         |
                solution |  7203.76563     1   7203.76563      17.19    0.0536
        manager|solution |   838.28125     2   419.140625
              -----------+----------------------------------------------------
        manager|solution |   838.28125     2   419.140625       0.55    0.6166
       facility|manager| |
                solution |   3064.9375     4   766.234375
              -----------+----------------------------------------------------
       facility|manager| |
                solution |   3064.9375     4   766.234375       2.57    0.1193
        worker|facility| |
        manager|solution |    2386.625     8   298.328125
              -----------+----------------------------------------------------
        worker|facility| |
        manager|solution |    2386.625     8   298.328125       1.84    0.0931
                Residual |     7796.25    48   162.421875
              -----------+----------------------------------------------------
                   Total |  21289.8594    63   337.934276
In practice, it is often beneficial to pool nonsignificant nested terms to increase the power of
tests on remaining terms. One rule of thumb is to allow the pooling of a term whose p-value is
larger than 0.25. In this sewage example, the p-value for the test of manager is 0.6166. This value
indicates that the manager effect is negligible and might be ignored. Currently, solution is tested by
manager|solution, which has only 2 degrees of freedom. If we pool the manager and facility
terms and use this pooled estimate as the error term for solution, we would have a term with 6
degrees of freedom.
Below are two tests: a test of solution with the pooled manager and facility terms and a
test of this pooled term by worker.
. test s / m|s f|m|s
                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                solution |  7203.76563     1   7203.76563      11.07    0.0159
        manager|solution |
       facility|manager| |
                solution |  3903.21875     6   650.536458

. test m|s f|m|s / w|f|m|s

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
        manager|solution |
       facility|manager| |
                solution |  3903.21875     6   650.536458       2.18    0.1520
        worker|facility| |
        manager|solution |    2386.625     8   298.328125
In the first test, we included two terms after the forward slash (m|s and f|m|s). test after anova
allows multiple terms both before and after the slash. The terms before the slash are combined and
are then tested by the combined terms that follow the slash (or residual error if no slash is present).
The p-value for solution using the pooled term is 0.0159. Originally, it was 0.0536. The increase
in the power of the test is due to the increase in degrees of freedom for the pooled error term.
We can get identical results if we drop manager from the anova model. (This dataset has unique
numbers for each facility so that there is no confusion of facilities when manager is dropped.)
. anova particulate s / f|s / w|f|s /, dropemptycells
                           Number of obs =      64     R-squared     =  0.6338
                           Root MSE      = 12.7445     Adj R-squared =  0.5194

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |  13493.6094    15   899.573958       5.54    0.0000
                         |
                solution |  7203.76562     1   7203.76562      11.07    0.0159
       facility|solution |  3903.21875     6   650.536458
              -----------+----------------------------------------------------
       facility|solution |  3903.21875     6   650.536458       2.18    0.1520
        worker|facility| |
                solution |    2386.625     8   298.328125
              -----------+----------------------------------------------------
        worker|facility| |
                solution |    2386.625     8   298.328125       1.84    0.0931
                Residual |     7796.25    48   162.421875
              -----------+----------------------------------------------------
                   Total |  21289.8594    63   337.934276
This output agrees with our earlier test results.
In the following example, two terms from the anova are jointly tested (pooled).
Example 4
In example 10 of [R] anova, we fit the model anova drate region c.mage region#c.mage.
Now we test for the overall significance of region.
. test region region#c.mage
                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
    region region#c.mage |  1781.35344     6    296.89224       5.65    0.0002
                Residual |   2206.7263    42   52.5411023
The overall F statistic associated with the region and region#c.mage terms is 5.65, and it is
significant at the 0.02% level.
Typing test followed by one term in our model should produce output that exactly matches that
provided by the anova command. In the ANOVA output, the region term, by itself, had a sum of
squares of 1166.15, which, based on 3 degrees of freedom, yielded an F statistic of 7.40 and a
significance level of 0.0004.
. test region

                Source   Partial SS    df       MS          F     Prob > F

                region   1166.14735     3   388.715783     7.40     0.0004
              Residual     2206.7263    42   52.5411023
test yields the same result.
Obtaining symbolic forms
test can also produce the symbolic form of the estimable functions and symbolic forms for
particular tests.
Example 5
After fitting an ANOVA model, we type test, symbolic to obtain the symbolic form of the
estimable functions. For instance, returning to our blood pressure data introduced in example 4 of
[R] anova, let’s begin by reestimating systolic on drug, disease, and drug#disease:
. use http://www.stata-press.com/data/r11/systolic, clear
(Systolic Blood Pressure Data)
. anova systolic drug##disease

                           Number of obs =      58     R-squared     =  0.4560
                           Root MSE      = 10.5096     Adj R-squared =  0.3259

                Source   Partial SS    df       MS          F     Prob > F

                 Model   4259.33851    11   387.212591     3.51     0.0013

                  drug   2997.47186     3   999.157287     9.05     0.0001
               disease   415.873046     2   207.936523     1.88     0.1637
          drug#disease   707.266259     6    117.87771     1.07     0.3958

              Residual   5080.81667    46   110.452536

                 Total   9340.15517    57   163.862371
To obtain the symbolic form of the estimable functions, type
. test, symbolic
drug
     1   -(r2+r3+r4-r0)
     2    r2
     3    r3
     4    r4
disease
     1   -(r6+r7-r0)
     2    r6
     3    r7
drug#disease
   1 1   -(r2+r3+r4+r6+r7-r12-r13-r15-r16-r18-r19-r0)
   1 2    r6 - (r12+r15+r18)
   1 3    r7 - (r13+r16+r19)
   2 1    r2 - (r12+r13)
   2 2    r12
   2 3    r13
   3 1    r3 - (r15+r16)
   3 2    r15
   3 3    r16
   4 1    r4 - (r18+r19)
   4 2    r18
   4 3    r19
_cons
          r0
Example 6
To obtain the symbolic form for a particular test, we type test term [term ...], symbolic. For instance, the symbolic form for the test of the main effect of drug is
. test drug, symbolic
drug
     1   -(r2+r3+r4)
     2    r2
     3    r3
     4    r4
disease
     1    0
     2    0
     3    0
drug#disease
   1 1   -1/3 (r2+r3+r4)
   1 2   -1/3 (r2+r3+r4)
   1 3   -1/3 (r2+r3+r4)
   2 1    1/3 r2
   2 2    1/3 r2
   2 3    1/3 r2
   3 1    1/3 r3
   3 2    1/3 r3
   3 3    1/3 r3
   4 1    1/3 r4
   4 2    1/3 r4
   4 3    1/3 r4
_cons
          0
If we omit the symbolic option, we instead see the result of the test:
. test drug

      Source   Partial SS    df       MS          F     Prob > F

        drug   2997.47186     3   999.157287     9.05     0.0001
    Residual   5080.81667    46   110.452536
Testing coefficients

The test command allows you to perform tests directly on the coefficients of the underlying regression model. For instance, the coefficient on the third drug and the second disease is referred to as 3.drug#2.disease. This could also be written as i3.drug#i2.disease, or _b[3.drug#2.disease], or even _coef[i3.drug#i2.disease]; see [U] 13.5 Accessing coefficients and standard errors.
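For instance, after the anova systolic drug##disease fit shown in example 5, these equivalent spellings can be checked interactively. The following short sketch is ours, not part of the original example:

. display _b[3.drug#2.disease]        // coefficient on third drug, second disease
. display _coef[i3.drug#i2.disease]   // same value, alternative spelling
. test _b[3.drug#2.disease] = 0       // Wald test that this one coefficient is zero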
Example 7
Let’s begin by testing whether the coefficient on the third drug is equal to the coefficient on the
fourth in our blood pressure data. We have already fit the model anova systolic drug##disease
(equivalent to anova systolic drug disease drug#disease), and you can see the results of that
estimation in example 5. Even though we have performed many tasks since we fit the model, Stata
still remembers, and we can perform tests at any time.
. test 3.drug = 4.drug

 ( 1)  3.drug - 4.drug = 0

       F(  1,    46) =    0.13
            Prob > F =    0.7234
We find that the two coefficients are not significantly different, at least at any significance level smaller
than 73%. Let’s now add the constraint that the coefficient on the third drug interacted with the third
disease is equal to the coefficient on the fourth drug, again interacted with the third disease. We do
that by typing the new constraint and adding the accumulate option:
. test 3.drug#3.disease = 4.drug#3.disease, accumulate

 ( 1)  3.drug - 4.drug = 0
 ( 2)  3.drug#3.disease - 4.drug#3.disease = 0

       F(  2,    46) =    0.39
            Prob > F =    0.6791
Let’s continue. Our goal is to determine whether the third drug is significantly different from the
fourth drug. So far, our test includes the equality of the two drug coefficients, along with the equality
of the two drug coefficients when interacted with the third disease. We must add two more equations,
one for each of the remaining two diseases.
. test 3.drug#2.disease = 4.drug#2.disease, accumulate

 ( 1)  3.drug - 4.drug = 0
 ( 2)  3.drug#3.disease - 4.drug#3.disease = 0
 ( 3)  3.drug#2.disease - 4.drug#2.disease = 0

       F(  3,    46) =    0.85
            Prob > F =    0.4761

. test 3.drug#1.disease = 4.drug#1.disease, accumulate

 ( 1)  3.drug - 4.drug = 0
 ( 2)  3.drug#3.disease - 4.drug#3.disease = 0
 ( 3)  3.drug#2.disease - 4.drug#2.disease = 0
 ( 4)  3o.drug#1b.disease - 4o.drug#1b.disease = 0
       Constraint 4 dropped

       F(  3,    46) =    0.85
            Prob > F =    0.4761
The overall F statistic is 0.85, which is hardly significant. We cannot reject the hypothesis that the
third drug has the same effect as the fourth drug.
You may notice that we also got the message “Constraint 4 dropped”. For the technically inclined,
this constraint was unnecessary, given the normalization of the model. You need not worry about
such problems because Stata handles them automatically.
The test() option of test provides a convenient alternative for testing coefficients. Instead of
spelling out each coefficient involved in the test, a matrix representing the test provides the needed
information. test, showorder shows the order of the terms in the ANOVA corresponding to the
order of the columns for the matrix argument of test().
Example 8
We repeat the last test of example 7 above with the test() option. First, we view the definition
and order of the columns underlying the ANOVA performed on the systolic data.
. test, showorder
Order of columns in the design matrix
1: (drug==1)
2: (drug==2)
3: (drug==3)
4: (drug==4)
5: (disease==1)
6: (disease==2)
7: (disease==3)
8: (drug==1)*(disease==1)
9: (drug==1)*(disease==2)
10: (drug==1)*(disease==3)
11: (drug==2)*(disease==1)
12: (drug==2)*(disease==2)
13: (drug==2)*(disease==3)
14: (drug==3)*(disease==1)
15: (drug==3)*(disease==2)
16: (drug==3)*(disease==3)
17: (drug==4)*(disease==1)
18: (drug==4)*(disease==2)
19: (drug==4)*(disease==3)
20: _cons
Columns 1–4 correspond to the four levels of drug. Columns 5–7 correspond to the three levels of disease. Columns 8–19 correspond to the interaction of drug and disease. The last column corresponds to _cons, the constant in the model.
We construct the matrix dr1vs2 with the same four constraints as the last test shown in example 7
and then use the test(dr1vs2) option to perform the test.
. mat dr1vs2 = (0,0,1,-1, 0,0,0, 0,0,0,0,0,0,0,0,0,  0, 0, 0, 0 \
>              0,0,0, 0,  0,0,0, 0,0,0,0,0,0,0,0,1,  0, 0,-1, 0 \
>              0,0,0, 0,  0,0,0, 0,0,0,0,0,0,0,1,0,  0,-1, 0, 0 \
>              0,0,0, 0,  0,0,0, 0,0,0,0,0,0,1,0,0, -1, 0, 0, 0)
. test, test(dr1vs2)

 ( 1)  3.drug - 4.drug = 0
 ( 2)  3.drug#3.disease - 4.drug#3.disease = 0
 ( 3)  3.drug#2.disease - 4.drug#2.disease = 0
 ( 4)  3o.drug#1b.disease - 4o.drug#1b.disease = 0
       Constraint 4 dropped

       F(  3,    46) =    0.85
            Prob > F =    0.4761
Here the effort involved with spelling out the coefficients is similar to that of constructing a matrix
and using it in the test() option. When the test involving coefficients is more complicated, many
people prefer using the test() option.
Technical note
You can use test to perform other, more complicated tests. In such cases, you will probably want
to review the symbolic forms of particular tests, and you will certainly want to review the symbolic
form of the estimable functions. We explained how to do that above.
Let’s check that Stata gives the right answers by laboriously typing the gory details of the test
for the main effect of drug. Stata already told us the symbolic form in the previous subsection. The
obsessed among you have no doubt already worked through the algebra and established that Stata was
correct. Here we spell out each coefficient instead of using the test() option. Using the test()
option would have saved us some typing. Our chances of typing all the constraints correctly, however,
are so small that we typed them into a do-file:
. do mainef

. test 2.drug + (2.drug#1.disease + 2.drug#2.disease +
>      2.drug#3.disease - 1.drug#1.disease -
>      1.drug#2.disease - 1.drug#3.disease)/3 = 0, notest
 (output omitted )

. test 3.drug + (3.drug#1.disease + 3.drug#2.disease +
>      3.drug#3.disease - 1.drug#1.disease -
>      1.drug#2.disease - 1.drug#3.disease)/3 = 0, accumulate notest
 (output omitted )

. test 4.drug + (4.drug#1.disease + 4.drug#2.disease +
>      4.drug#3.disease - 1.drug#1.disease -
>      1.drug#2.disease - 1.drug#3.disease)/3 = 0, accumulate

 ( 1)  2.drug - .3333333*1b.drug#1b.disease - .3333333*1b.drug#2o.disease -
       .3333333*1b.drug#3o.disease + .3333333*2o.drug#1b.disease +
       .3333333*2.drug#2.disease + .3333333*2.drug#3.disease = 0
 ( 2)  3.drug - .3333333*1b.drug#1b.disease - .3333333*1b.drug#2o.disease -
       .3333333*1b.drug#3o.disease + .3333333*3o.drug#1b.disease +
       .3333333*3.drug#2.disease + .3333333*3.drug#3.disease = 0
 ( 3)  4.drug - .3333333*1b.drug#1b.disease - .3333333*1b.drug#2o.disease -
       .3333333*1b.drug#3o.disease + .3333333*4o.drug#1b.disease +
       .3333333*4.drug#2.disease + .3333333*4.drug#3.disease = 0

       F(  3,    46) =    9.05
            Prob > F =    0.0001

end of do-file
We have our result. The F statistic has 3 degrees of freedom and is 9.05. This is the same result we
obtained when we typed test drug. However, typing test drug was much easier.
Tests involving the underlying coefficients of an ANOVA model are often done to obtain various
single-degree-of-freedom tests comparing different levels of a categorical variable. The mtest option
presents each single-degree-of-freedom test in addition to the combined test.
Example 9
Rencher and Schaalje (2008) illustrate single-degree-of-freedom contrasts for an ANOVA comparing
the net weight of cans filled by five machines (labeled A–E). The data were originally obtained from
Ostle and Mensing (1975). Rencher and Schaalje use a cell-means ANOVA model approach for this
problem. We could do the same by using the noconstant option of anova; see [R] anova. Instead,
we obtain the same results by using the standard overparameterized ANOVA approach (i.e., we keep
the constant in the model).
. use http://www.stata-press.com/data/r11/canfill
(Can Fill Data)
. list, sepby(machine)

       machine   weight

  1.         A    11.95
  2.         A    12.00
  3.         A    12.25
  4.         A    12.10

  5.         B    12.18
  6.         B    12.11

  7.         C    12.16
  8.         C    12.15
  9.         C    12.08

 10.         D    12.25
 11.         D    12.30
 12.         D    12.10

 13.         E    12.10
 14.         E    12.04
 15.         E    12.02
 16.         E    12.02
. anova weight machine

                           Number of obs =      16     R-squared     =  0.4123
                           Root MSE      = .087758    Adj R-squared =  0.1986

      Source   Partial SS    df       MS          F     Prob > F

       Model   .059426993     4   .014856748     1.93     0.1757

     machine   .059426993     4   .014856748     1.93     0.1757

    Residual   .084716701    11   .007701518

       Total   .144143694    15    .00960958

. test, showorder
Order of columns in the design matrix
     1: (machine==1)
     2: (machine==2)
     3: (machine==3)
     4: (machine==4)
     5: (machine==5)
     6: _cons
The four single-degree-of-freedom tests of interest between the five machines are A and D versus
B, C, and E; B and E versus C; A versus D; and B versus E. We place the orthogonal (unweighted)
contrasts for these tests into the matrix C1 and use that with test to obtain the results.
. mat C1 = (3,-2,-2,3,-2,0 \ 0,1,-2,0,1,0 \ 1,0,0,-1,0,0 \ 0,1,0,0,-1,0)
. test, test(C1) mtest

 ( 1)  3*1b.machine - 2*2.machine - 2*3.machine + 3*4.machine - 2*5.machine = 0
 ( 2)  2.machine - 2*3.machine + 5.machine = 0
 ( 3)  1b.machine - 4.machine = 0
 ( 4)  2.machine - 5.machine = 0

           |    F(df,11)     df      p
  ---------+----------------------------
       (1) |        0.75      1   0.4055 #
       (2) |        0.31      1   0.5916 #
       (3) |        4.47      1   0.0582 #
       (4) |        1.73      1   0.2150 #
  ---------+----------------------------
       all |        1.93      4   0.1757
  --------------------------------------
  # unadjusted p-values
The mtest option causes test to produce the single-degree-of-freedom tests for each row of the
matrix C1 provided in the test() option.
The significance values above are not adjusted for multiple comparisons. mtest takes an optional
argument for specifying multiple comparison adjustments. We could have produced the Bonferroni
adjusted significance values by using the mtest(bonferroni) option.
. test, test(C1) mtest(bonferroni)

 ( 1)  3*1b.machine - 2*2.machine - 2*3.machine + 3*4.machine - 2*5.machine = 0
 ( 2)  2.machine - 2*3.machine + 5.machine = 0
 ( 3)  1b.machine - 4.machine = 0
 ( 4)  2.machine - 5.machine = 0

           |    F(df,11)     df      p
  ---------+----------------------------
       (1) |        0.75      1   1.0000 #
       (2) |        0.31      1   1.0000 #
       (3) |        4.47      1   0.2329 #
       (4) |        1.73      1   0.8601 #
  ---------+----------------------------
       all |        1.93      4   0.1757
  --------------------------------------
  # Bonferroni adjusted p-values
Example 10
Here there are two factors, A and B, each with three levels. The levels are evenly spaced so that
linear and quadratic contrasts are of interest.
. use http://www.stata-press.com/data/r11/twowaytrend
. anova Y A B A#B

                           Number of obs =      36     R-squared     =  0.9304
                           Root MSE      =  2.6736    Adj R-squared =  0.9097

      Source   Partial SS    df       MS          F     Prob > F

       Model   2578.55556     8   322.319444    45.09     0.0000

           A   2026.72222     2   1013.36111   141.77     0.0000
           B   383.722222     2   191.861111    26.84     0.0000
         A#B   168.111111     4   42.0277778     5.88     0.0015

    Residual          193    27   7.14814815

       Total   2771.55556    35   79.1873016

. test, showorder
Order of columns in the design matrix
     1: (A==1)
     2: (A==2)
     3: (A==3)
     4: (B==1)
     5: (B==2)
     6: (B==3)
     7: (A==1)*(B==1)
     8: (A==1)*(B==2)
     9: (A==1)*(B==3)
    10: (A==2)*(B==1)
    11: (A==2)*(B==2)
    12: (A==2)*(B==3)
    13: (A==3)*(B==1)
    14: (A==3)*(B==2)
    15: (A==3)*(B==3)
    16: _cons
We create the matrices Alin, Aquad, Blin, and Bquad corresponding to the linear and quadratic
single-degree-of-freedom contrasts for terms A and B. These are combined into matrices A and B.
Then the four interaction contrasts are formed from the elementwise multiplication of appropriate
pairs of the single-degree-of-freedom contrasts. We use foreach and the vecdiag() matrix function
to automate the process. For this particular problem, the interaction contrasts are combined into two
matrices: AxBlin corresponding to the 2-degree-of-freedom test for factor A versus linear B and
AxBquad corresponding to the 2-degree-of-freedom test for factor A versus quadratic B.
. mat Alin  = (-3, 0,3,  0, 0,0, -1,-1,-1, 0, 0, 0, 1, 1,1, 0)
. mat Aquad = ( 3,-6,3,  0, 0,0,  1, 1, 1,-2,-2,-2, 1, 1,1, 0)
. mat Blin  = ( 0, 0,0, -3, 0,3, -1, 0, 1,-1, 0, 1,-1, 0,1, 0)
. mat Bquad = ( 0, 0,0,  3,-6,3,  1,-2, 1, 1,-2, 1, 1,-2,1, 0)
. mat A = Alin \ Aquad
. mat B = Blin \ Bquad
. foreach i in lin quad {
  2.     foreach j in lin quad {
  3.         mat A`i'B`j' = vecdiag(A`i''*B`j')
  4.     }
  5. }
. mat AxBlin = AlinBlin \ AquadBlin
. mat AxBquad = AlinBquad \ AquadBquad
We use the test() and mtest options of test to obtain the single-degree-of-freedom tests for
linear and quadratic A and B.
. mat list A, nonames
A[2,16]
  -3   0   3   0   0   0  -1  -1  -1   0   0   0   1   1   1   0
   3  -6   3   0   0   0   1   1   1  -2  -2  -2   1   1   1   0
. test, test(A) mtest

 ( 1)  - 3*1b.A + 3*3.A - 1b.A#1b.B - 1b.A#2o.B - 1b.A#3o.B + 3o.A#1b.B +
       3.A#2.B + 3.A#3.B = 0
 ( 2)  3*1b.A - 6*2.A + 3*3.A + 1b.A#1b.B + 1b.A#2o.B + 1b.A#3o.B - 2*2o.A#1b.B
       - 2*2.A#2.B - 2*2.A#3.B + 3o.A#1b.B + 3.A#2.B + 3.A#3.B = 0

           |    F(df,27)     df      p
  ---------+----------------------------
       (1) |      212.65      1   0.0000 #
       (2) |       70.88      1   0.0000 #
  ---------+----------------------------
       all |      141.77      2   0.0000
  --------------------------------------
  # unadjusted p-values
. mat list B, nonames
B[2,16]
   0   0   0  -3   0   3  -1   0   1  -1   0   1  -1   0   1   0
   0   0   0   3  -6   3   1  -2   1   1  -2   1   1  -2   1   0
. test, test(B) mtest

 ( 1)  - 3*1b.B + 3*3.B - 1b.A#1b.B + 1b.A#3o.B - 2o.A#1b.B + 2.A#3.B -
       3o.A#1b.B + 3.A#3.B = 0
 ( 2)  3*1b.B - 6*2.B + 3*3.B + 1b.A#1b.B - 2*1b.A#2o.B + 1b.A#3o.B + 2o.A#1b.B
       - 2*2.A#2.B + 2.A#3.B + 3o.A#1b.B - 2*3.A#2.B + 3.A#3.B = 0

           |    F(df,27)     df      p
  ---------+----------------------------
       (1) |       26.17      1   0.0000 #
       (2) |       27.51      1   0.0000 #
  ---------+----------------------------
       all |       26.84      2   0.0000
  --------------------------------------
  # unadjusted p-values
All the above tests appear to be significant. In addition to presenting the single-degree-of-freedom
tests, the combined tests for A and B are produced and agree with the original ANOVA results.
Now we explore the interaction between A and B.
. mat list AxBlin, nonames
AxBlin[2,16]
   0   0   0   0   0   0   1   0  -1   0   0   0  -1   0   1   0
   0   0   0   0   0   0  -1   0   1   2   0  -2  -1   0   1   0

. test, test(AxBlin) mtest

 ( 1)  1b.A#1b.B - 1b.A#3o.B - 3o.A#1b.B + 3.A#3.B = 0
 ( 2)  - 1b.A#1b.B + 1b.A#3o.B + 2*2o.A#1b.B - 2*2.A#3.B - 3o.A#1b.B + 3.A#3.B
       = 0

           |    F(df,27)     df      p
  ---------+----------------------------
       (1) |       17.71      1   0.0003 #
       (2) |        0.07      1   0.7893 #
  ---------+----------------------------
       all |        8.89      2   0.0011
  --------------------------------------
  # unadjusted p-values
The 2-degree-of-freedom test of the interaction of A with the linear components of B is significant at the 0.0011 level. When we examine the two single-degree-of-freedom tests that compose this result, however, we see that the significance is due to the linear-A-by-linear-B contrast (significance level of 0.0003). The significance value of 0.7893 for the quadratic-A-by-linear-B contrast indicates that this component is not significant for these data.
. mat list AxBquad, nonames
AxBquad[2,16]
   0   0   0   0   0   0  -1   2  -1   0   0   0   1  -2   1   0
   0   0   0   0   0   0   1  -2   1  -2   4  -2   1  -2   1   0

. test, test(AxBquad) mtest

 ( 1)  - 1b.A#1b.B + 2*1b.A#2o.B - 1b.A#3o.B + 3o.A#1b.B - 2*3.A#2.B + 3.A#3.B
       = 0
 ( 2)  1b.A#1b.B - 2*1b.A#2o.B + 1b.A#3o.B - 2*2o.A#1b.B + 4*2.A#2.B -
       2*2.A#3.B + 3o.A#1b.B - 2*3.A#2.B + 3.A#3.B = 0

           |    F(df,27)     df      p
  ---------+----------------------------
       (1) |        2.80      1   0.1058 #
       (2) |        2.94      1   0.0979 #
  ---------+----------------------------
       all |        2.87      2   0.0741
  --------------------------------------
  # unadjusted p-values
The test of A with the quadratic components of B does not fall below the 0.05 significance level.
Methods and formulas
All postestimation commands listed above are implemented as ado-files.
References
Ostle, B., and R. W. Mensing. 1975. Statistics in Research. 3rd ed. Ames, IA: Iowa State University Press.
Rencher, A. C., and G. B. Schaalje. 2008. Linear Models in Statistics. 2nd ed. New York: Wiley.
Also see
[R] anova — Analysis of variance and covariance
[R] regress postestimation — Postestimation tools for regress
[U] 20 Estimation and postestimation commands
Title
areg — Linear regression with a large dummy-variable set
Syntax

        areg depvar [indepvars] [if] [in] [weight] , absorb(varname) [options]

  options                     description
  -------------------------------------------------------------------------
  Model
  * absorb(varname)           categorical variable to be absorbed

  SE/Robust
    vce(vcetype)              vcetype may be ols, robust, cluster clustvar,
                                bootstrap, or jackknife

  Reporting
    level(#)                  set confidence level; default is level(95)
    display_options           control spacing and display of omitted
                                variables and base and empty cells
  † coeflegend                display coefficients' legend instead of
                                coefficient table
  -------------------------------------------------------------------------
  * absorb(varname) is required.
  † coeflegend does not appear in the dialog box.
  indepvars may contain factor variables; see [U] 11.4.3 Factor variables.
  depvar and indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists.
  bootstrap, by, jackknife, mi estimate, rolling, and statsby are allowed; see [U] 11.1.10 Prefix commands.
  vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix.
  Weights are not allowed with the bootstrap prefix.
  aweights are not allowed with the jackknife prefix.
  aweights, fweights, and pweights are allowed; see [U] 11.1.6 weight.
  See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.
Menu

    Statistics > Linear models and related > Other > Linear regression absorbing one cat. variable
Description
areg fits a linear regression absorbing one categorical factor. areg is designed for datasets with
many groups, but not a number of groups that increases with the sample size. See the xtreg, fe
command in [XT] xtreg for an estimator that handles the case in which the number of groups increases
with the sample size.
Options
Model
absorb(varname) specifies the categorical variable, which is to be included in the regression as if
it were specified by dummy variables. absorb() is required.
SE/Robust
vce(vcetype) specifies the type of standard error reported, which includes types that are derived
from asymptotic theory, that are robust to some kinds of misspecification, that allow for intragroup
correlation, and that use bootstrap or jackknife methods; see [R] vce option.
vce(ols), the default, uses the standard variance estimator for ordinary least-squares regression.
Exercise caution when using the vce(cluster clustvar) option with areg. The effective number
of degrees of freedom for the robust variance estimator is ng − 1, where ng is the number of
clusters. Thus the number of levels of the absorb() variable should not exceed the number of
clusters.
Reporting
level(#); see [R] estimation options.
display_options: noomitted, vsquish, noemptycells, baselevels, allbaselevels; see [R] estimation options.
The following option is available with areg but is not shown in the dialog box:
coeflegend; see [R] estimation options.
Remarks
Suppose that you have a regression model that includes among the explanatory variables a large
number, k, of mutually exclusive and exhaustive dummies:

        y = Xβ + d1γ1 + d2γ2 + · · · + dkγk + ε

For instance, the dummy variables, di, might indicate countries in the world or states of the United
States. One solution would be to fit the model with regress, but this solution is possible only if k
is small enough so that the total number of variables (the number of columns of X plus the number
of di ’s plus one for y) is sufficiently small — meaning less than matsize (see [R] matsize). For
problems with more variables than the largest possible value of matsize (100 for Small Stata, 800
for Stata/IC, and 11,000 for Stata/SE and Stata/MP), regress will not work. areg provides a way
of obtaining estimates of β — but not the γi ’s — in these cases. The effects of the dummy variables
are said to be absorbed.
Example 1
So that we can compare the results produced by areg with Stata’s other regression commands,
we will fit a model in which k is small. areg’s real use, however, is when k is large.
In our automobile data, we have a variable called rep78 that is coded 1, 2, 3, 4, and 5, where 1
means poor and 5 means excellent. Let's assume that we wish to fit a regression of mpg on weight,
gear_ratio, and rep78 (parameterized as a set of dummies).
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. regress mpg weight gear_ratio b5.rep78

      Source |       SS       df       MS              Number of obs =      69
-------------+------------------------------           F(  6,    62) =   21.31
       Model |  1575.97621     6  262.662702           Prob > F      =  0.0000
    Residual |  764.226686    62  12.3262369           R-squared     =  0.6734
-------------+------------------------------           Adj R-squared =  0.6418
       Total |   2340.2029    68  34.4147485           Root MSE      =  3.5109

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0051031   .0009206    -5.54   0.000    -.0069433    -.003263
  gear_ratio |    .901478   1.565552     0.58   0.567    -2.228015    4.030971
             |
       rep78 |
          1  |  -2.036937   2.740728    -0.74   0.460    -7.515574      3.4417
          2  |  -2.419822   1.764338    -1.37   0.175    -5.946682    1.107039
          3  |  -2.557432   1.370912    -1.87   0.067    -5.297846    .1829814
          4  |  -2.788389   1.395259    -2.00   0.050    -5.577473    .0006939
             |
       _cons |   36.23782    7.01057     5.17   0.000     22.22389    50.25175
------------------------------------------------------------------------------
To fit the areg equivalent, we type
. areg mpg weight gear_ratio, absorb(rep78)

Linear regression, absorbing indicators         Number of obs =            69
                                                F(  2,    62) =         41.64
                                                Prob > F      =        0.0000
                                                R-squared     =        0.6734
                                                Adj R-squared =        0.6418
                                                Root MSE      =        3.5109

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0051031   .0009206    -5.54   0.000    -.0069433    -.003263
  gear_ratio |    .901478   1.565552     0.58   0.567    -2.228015    4.030971
       _cons |   34.05889   7.056383     4.83   0.000     19.95338     48.1644
-------------+----------------------------------------------------------------
       rep78 |   F(4, 62) =     1.117   0.356           (5 categories)
------------------------------------------------------------------------------
Both regress and areg display the same R-squared values, root mean squared error, and—for weight
and gear_ratio—the same parameter estimates, standard errors, t statistics, significance levels, and
confidence intervals. areg, however, does not report the coefficients for rep78, and, in fact, they
are not even calculated. This computational trick makes the problem manageable when k is large.
areg reports a test that the coefficients associated with rep78 are jointly zero. Here this test has a
significance level of 35.6%. This F test for rep78 is the same that we would obtain after regress
if we were to specify test 1.rep78 2.rep78 3.rep78 4.rep78; see [R] test.

The model F tests reported by regress and areg also differ. The regress command reports a
test that all coefficients except that of the constant are equal to zero; thus, the dummies are included
in this test. The areg output shows a test that all coefficients excluding the dummies and the constant
are equal to zero. This is the same test that can be obtained after regress by typing test weight
gear_ratio.
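This correspondence is easy to verify. The following sketch (our own check, not part of the original output) reruns regress and forms both tests:

. regress mpg weight gear_ratio b5.rep78
. test weight gear_ratio                   // should match areg's model test: F(2,62) = 41.64
. test 1.rep78 2.rep78 3.rep78 4.rep78     // should match areg's rep78 test: F(4,62) = 1.117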
Technical note
areg is designed for datasets with many groups, but not a number that grows with the sample
size. Consider two different samples from the U.S. population. In the first sample, we have 10,000
individuals and we want to include an indicator for each of the 50 states, whereas in the second
sample we have 3 observations on each of 10,000 individuals and we want to include an indicator for
each individual. areg was designed for datasets similar to the first sample in which we have a fixed
number of groups, the 50 states. In the second sample, the number of groups, which is the number of
individuals, grows as we include more individuals in the sample. For an estimator designed to handle
the case in which the number of groups grows with the sample size, see the xtreg, fe command
in [XT] xtreg.
Although the point estimates produced by areg and xtreg, fe are the same, the estimated VCEs
differ when cluster() is specified because the commands make different assumptions about whether
the number of groups increases with the sample size.
Technical note
The intercept reported by areg deserves some explanation because, given k mutually exclusive
and exhaustive dummies, it is arbitrary. areg identifies the model by choosing the intercept that
makes the prediction calculated at the means of the independent variables equal to the mean of the
dependent variable: ȳ = x̄β̂.
. predict yhat
(option xb assumed; fitted values)
. summarize mpg yhat if rep78 != .

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         mpg |        69    21.28986    5.866408         12         41
        yhat |        69    21.28986    4.383224   11.58643   28.07367
We had to include if rep78 != . in our summarize command because we have missing values in
our data. areg automatically dropped those missing values (as it should) in forming the estimates,
but predict with the xb option will make predictions for cases with missing rep78 because it does
not know that rep78 is really part of our model.
These predicted values do not include the absorbed effects (i.e., the d_i γ_i). For predicted values
that include these effects, use the xbd option of predict (see [R] areg postestimation) or see
[XT] xtreg.
Example 2
areg, vce(robust) is a Huberized version of areg; see [P] robust. Just as areg is equivalent to
using regress with dummies, areg, vce(robust) is equivalent to using regress, vce(robust)
with dummies. You can use areg, vce(robust) when you expect heteroskedastic or nonnormal
errors. areg, vce(robust), like ordinary regression, assumes that the observations are independent,
unless the vce(cluster clustvar) option is specified. If the vce(cluster clustvar) option is
specified, this independence assumption is relaxed and only the clusters identified by equal values of
clustvar are assumed to be independent.
Assume that we were to collect data by randomly sampling 10,000 doctors (from 100 hospitals)
and then sampling 10 patients of each doctor, yielding a total dataset of 100,000 patients in a cluster
sample. If in some regression we wished to include effects of the hospitals to which the doctors
belonged, we would want to include a dummy variable for each hospital, adding 100 variables to our
model. areg could fit this model by
. areg depvar patient_vars, absorb(hospital) vce(cluster doctor)
Saved results

areg saves the following in e():

Scalars
    e(N)                   number of observations
    e(tss)                 total sum of squares
    e(df_m)                model degrees of freedom
    e(rss)                 residual sum of squares
    e(df_r)                residual degrees of freedom
    e(r2)                  R-squared
    e(r2_a)                adjusted R-squared
    e(df_a)                degrees of freedom for absorbed effect
    e(rmse)                root mean squared error
    e(ll)                  log likelihood
    e(ll_0)                log likelihood, constant-only model
    e(N_clust)             number of clusters
    e(F)                   F statistic
    e(F_absorb)            F statistic for absorbed effect (when vce(robust)
                             is not specified)
    e(rank)                rank of e(V)

Macros
    e(cmd)                 areg
    e(cmdline)             command as typed
    e(depvar)              name of dependent variable
    e(absvar)              name of absorb variable
    e(wtype)               weight type
    e(wexp)                weight expression
    e(title)               title in estimation output
    e(clustvar)            name of cluster variable
    e(vce)                 vcetype specified in vce()
    e(vcetype)             title used to label Std. Err.
    e(datasignature)       the checksum
    e(datasignaturevars)   variables used in calculation of checksum
    e(properties)          b V
    e(predict)             program used to implement predict
    e(marginsnotok)        predictions disallowed by margins
    e(asbalanced)          factor variables fvset as asbalanced
    e(asobserved)          factor variables fvset as asobserved

Matrices
    e(b)                   coefficient vector
    e(V)                   variance–covariance matrix of the estimators
    e(V_modelbased)        model-based variance

Functions
    e(sample)              marks estimation sample
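As a quick illustration (not part of the original entry), these saved results can be inspected directly after estimation; we assume the auto-data fit from example 1:

. areg mpg weight gear_ratio, absorb(rep78)
. display e(N)           // number of observations
. display e(df_a)        // degrees of freedom for the absorbed effect
. display e(F_absorb)    // F statistic for the absorbed rep78 effect
. matrix list e(b)       // coefficient vector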
Methods and formulas
areg is implemented as an ado-file.
areg begins by recalculating depvar and indepvars to have mean 0 within the groups specified
by absorb(). The overall mean of each variable is then added back in. The adjusted depvar is then
regressed on the adjusted indepvars with regress, yielding the coefficient estimates. The degrees
of freedom of the variance–covariance matrix of the coefficients is then adjusted to account for the
absorbed variables — this calculation yields the same results (up to numerical roundoff error) as if the
matrix had been calculated directly by the formulas given in [R] regress.
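To make the algorithm concrete, here is a minimal sketch of the within transformation using the auto data from example 1; it is illustrative only, and the standard errors from the final regress are not adjusted for the absorbed degrees of freedom as areg's are:

. use http://www.stata-press.com/data/r11/auto, clear
. keep if rep78 < .
. foreach v of varlist mpg weight gear_ratio {
  2.     egen double `v'_grp = mean(`v'), by(rep78)          // group means
  3.     summarize `v', meanonly
  4.     generate double `v'_adj = `v' - `v'_grp + r(mean)   // demean, add back overall mean
  5. }
. regress mpg_adj weight_adj gear_ratio_adj                  // point estimates match areg's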
areg with vce(robust) or vce(cluster clustvar) works similarly, calling robust after
regress to produce the Huber/White/sandwich estimator of the variance or its clustered version.
See [P] robust, in particular, in Introduction and Methods and formulas. The model F test uses the
robust variance estimates. There is, however, no simple computational means of obtaining a robust
test of the absorbed dummies; thus this test is not displayed when the vce(robust) or vce(cluster
clustvar) option is specified.
The number of groups specified in absorb() is included in the degrees of freedom used in
the finite-sample adjustment of the cluster–robust VCE estimator. This statement is valid only if the
number of groups is small relative to the sample size. (Technically, the number of groups must remain
fixed as the sample size grows.) For an estimator that allows the number of groups to grow with the
sample size, see the xtreg, fe command in [XT] xtreg.
Reference
Blackwell III, J. L. 2005. Estimation and testing of fixed-effect panel-data systems. Stata Journal 5: 202–207.
Also see
[R] areg postestimation — Postestimation tools for areg
[R] regress — Linear regression
[XT] xtreg — Fixed-, between-, and random-effects, and population-averaged linear models
[U] 20 Estimation and postestimation commands
Title
areg postestimation — Postestimation tools for areg
Description
The following postestimation commands are available for areg:
  command        description
  -------------------------------------------------------------------------
  estat          AIC, BIC, VCE, and estimation sample summary
  estimates      cataloging estimation results
  lincom         point estimates, standard errors, testing, and inference
                   for linear combinations of coefficients
  linktest       link test for model specification
  lrtest         likelihood-ratio test
  margins        marginal means, predictive margins, marginal effects, and
                   average marginal effects
  nlcom          point estimates, standard errors, testing, and inference
                   for nonlinear combinations of coefficients
  predict        predictions, residuals, influence statistics, and other
                   diagnostic measures
  predictnl      point estimates, standard errors, testing, and inference
                   for generalized predictions
  test           Wald tests of simple and composite linear hypotheses
  testnl         Wald tests of nonlinear hypotheses
  -------------------------------------------------------------------------
See the corresponding entries in the Base Reference Manual for details.
Syntax for predict

        predict [type] newvar [if] [in] [, statistic]

where y_j = x_j b + d_absorbvar + e_j and statistic is

  statistic      description
  -------------------------------------------------------------------------
  Main
    xb           x_j b, fitted values; the default
    stdp         standard error of the prediction
    dresiduals   d_absorbvar + e_j = y_j - x_j b
  * xbd          x_j b + d_absorbvar
  * d            d_absorbvar
  * residuals    residual
  * score        score; equivalent to residuals
  -------------------------------------------------------------------------
  Unstarred statistics are available both in and out of sample; type
  predict ... if e(sample) ... if wanted only for the estimation sample.
  Starred statistics are calculated only for the estimation sample, even
  when if e(sample) is not specified.
Menu

    Statistics > Postestimation > Predictions, residuals, etc.
Options for predict

  Main

xb, the default, calculates the prediction of x_j b, the fitted values, by using the average effect of the
absorbed variable. Also see xbd below.

stdp calculates the standard error of x_j b.

dresiduals calculates y_j − x_j b, which are the residuals plus the effect of the absorbed variable.

xbd calculates x_j b + d_absorbvar, which are the fitted values including the individual effects of the
absorbed variable.

d calculates d_absorbvar, the individual coefficients for the absorbed variable.

residuals calculates the residuals, that is, y_j − (x_j b + d_absorbvar).

score is a synonym for residuals.
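For example, continuing with the areg fit of mpg from [R] areg, one might compare these statistics (a brief sketch, not part of the original entry):

. areg mpg weight gear_ratio, absorb(rep78)
. predict double fit0, xb      // fitted values using the average absorbed effect
. predict double fit1, xbd     // fitted values including each rep78 effect
. predict double dhat, d       // the absorbed rep78 effects themselves
. summarize fit0 fit1 dhat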
Methods and formulas
All postestimation commands listed above are implemented as ado-files.
Also see
[R] areg — Linear regression with a large dummy-variable set
[U] 20 Estimation and postestimation commands
Title
asclogit — Alternative-specific conditional logit (McFadden’s choice) model
Syntax

        asclogit depvar [indepvars] [if] [in] [weight] , case(varname)
              alternatives(varname) [options]

  options                       description
  -------------------------------------------------------------------------
  Model
  * case(varname)               use varname to identify cases
  * alternatives(varname)       use varname to identify the alternatives
                                  available for each case
    casevars(varlist)           case-specific variables
    basealternative(#|lbl|str)  alternative used as base category
    noconstant                  suppress alternative-specific constant terms
    altwise                     use alternative-wise deletion instead of
                                  casewise deletion
    offset(varname)             include varname in model with coefficient
                                  constrained to 1
    constraints(constraints)    apply specified linear constraints
    collinear                   keep collinear variables

  SE/Robust
    vce(vcetype)                vcetype may be oim, robust, cluster clustvar,
                                  bootstrap, or jackknife

  Reporting
    level(#)                    set confidence level; default is level(95)
    or                          report odds ratios
    noheader                    do not display the header on the coefficient
                                  table
    nocnsreport                 do not display constraints

  Maximization
    maximize_options            control the maximization process; seldom used

  † coeflegend                  display coefficients' legend instead of
                                  coefficient table
  -------------------------------------------------------------------------
  * case(varname) and alternatives(varname) are required.
  † coeflegend does not appear in the dialog box.
  bootstrap, by, jackknife, statsby, and xi are allowed; see [U] 11.1.10 Prefix commands.
  Weights are not allowed with the bootstrap prefix.
  fweights, iweights, and pweights are allowed (see [U] 11.1.6 weight), but they are interpreted to apply to cases as a whole, not to individual observations. See Use of weights in [R] clogit.
  See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.
Menu

    Statistics > Categorical outcomes > Alternative-specific conditional logit
Description
asclogit fits McFadden’s choice model, which is a specific case of the more general conditional
logistic regression model (McFadden 1974). asclogit requires multiple observations for each case
(individual or decision), where each observation represents an alternative that may be chosen. The cases
are identified by the variable specified in the case() option, whereas the alternatives are identified by
the variable specified in the alternatives() option. The outcome or chosen alternative is identified
by a value of 1 in depvar, whereas zeros indicate the alternatives that were not chosen. There can be
multiple alternatives chosen for each case.
asclogit allows two types of independent variables: alternative-specific variables and case-specific
variables. Alternative-specific variables vary across both cases and alternatives and are specified in
indepvars. Case-specific variables vary only across cases and are specified in the casevars() option.
See [R] clogit for a more general application of conditional logistic regression. For example,
clogit would be used when you have grouped data where each observation in a group may be
a different individual, but all individuals in a group have a common characteristic. You may use
clogit to obtain the same estimates as asclogit by specifying the case() variable as the group()
variable in clogit and generating variables that interact the casevars() in asclogit with each
alternative (in the form of an indicator variable), excluding the interaction variable associated with the
base alternative. asclogit takes care of this data-management burden for you. Also, for clogit,
each record (row in your data) is an observation, whereas in asclogit each case, consisting of
several records (the alternatives) in your data, is an observation. This last point is important because
asclogit will drop observations, by default, in a casewise fashion. That is, if there is at least one
missing value in any of the variables for each record of a case, the entire case is dropped from
estimation. To use alternative-wise deletion, specify the altwise option and only the records with
missing values will be dropped from estimation.
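As a concrete illustration of the clogit equivalence just described, the following sketch uses the choice data from example 1 below. The numeric coding of car (1 = American, 2 = Japan, 3 = Europe) is an assumption of this sketch, with American as the base alternative:

. use http://www.stata-press.com/data/r11/choice, clear
. generate japan  = (car == 2)            // alternative indicators, base = American
. generate europe = (car == 3)
. foreach v in sex income {
  2.     generate `v'_jp = `v'*japan      // casevars interacted with each
  3.     generate `v'_eu = `v'*europe     //   nonbase alternative
  4. }
. clogit choice dealer japan europe sex_jp income_jp sex_eu income_eu, group(id)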
Options
Model
case(varname) specifies the numeric variable that identifies each case. case() is required and must
be integer valued.
alternatives(varname) specifies the variable that identifies the alternatives for each case. The
number of alternatives can vary with each case; the maximum number of alternatives cannot exceed
the limits of tabulate oneway; see [R] tabulate oneway. alternatives() is required and may
be a numeric or a string variable.
casevars(varlist) specifies the case-specific numeric variables. These are variables that are constant
for each case. If there are a maximum of J alternatives, there will be J − 1 sets of coefficients
associated with the casevars().
basealternative(# | lbl | str) specifies the alternative that normalizes the latent-variable location
(the level of utility). The base alternative may be specified as a number, label, or string depending
on the storage type of the variable indicating alternatives. The default is the alternative with the
highest frequency.
If vce(bootstrap) or vce(jackknife) is specified, you must specify the base alternative. This
is to ensure that the same model is fit with each call to asclogit.
noconstant suppresses the J − 1 alternative-specific constant terms.
altwise specifies that alternative-wise deletion be used when marking out observations due to
missing values in your variables. The default is to use casewise deletion; that is, the entire group
of observations making up a case is deleted if any missing values are encountered. This option
does not apply to observations that are marked out by the if or in qualifier or the by prefix.
offset(varname), constraints(numlist | matname), collinear; see [R] estimation options.
SE/Robust
vce(vcetype) specifies the type of standard error reported, which includes types that are derived
from asymptotic theory, that are robust to some kinds of misspecification, that allow for intragroup
correlation, and that use bootstrap or jackknife methods; see [R] vce option.
Reporting
level(#); see [R] estimation options.
or reports the estimated coefficients transformed to odds ratios, i.e., eb rather than b. Standard errors
and confidence intervals are similarly transformed. This option affects how results are displayed,
not how they are estimated. or may be specified at estimation or when replaying previously
estimated results.
noheader prevents the coefficient table header from being displayed.
nocnsreport; see [R] estimation options.
Maximization
maximize_options: difficult, technique(algorithm_spec), iterate(#), [no]log, trace,
gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#),
nrtolerance(#), nonrtolerance, from(init_specs); see [R] maximize. These options are
seldom used.

technique(bhhh) is not allowed.

The initial estimates must be specified as from(matname [, copy]), where matname is the
matrix containing the initial estimates and the copy option specifies that only the position of each
element in matname is relevant. If copy is not specified, the column stripe of matname identifies
the estimates.
The following option is available with asclogit but is not shown in the dialog box:
coeflegend; see [R] estimation options.
Remarks
asclogit fits McFadden's choice model (McFadden [1974]; for a brief introduction, see Greene
[2008, sec. 23.11] or Cameron and Trivedi [2009, sec. 15.5]). In this model, we have a set of
unordered alternatives indexed by 1, 2, ..., J. Let y_ij, j = 1, ..., J, be an indicator variable for
the alternative actually chosen by the ith individual (case). That is, y_ij = 1 if individual i chose
alternative j and y_ij = 0 otherwise. The independent variables come in two forms: alternative specific
and case specific. Alternative-specific variables vary among the alternatives (as well as cases), and
case-specific variables vary only among cases. Assume that we have p alternative-specific variables
so that for case i we have a J × p matrix, X_i. Further, assume that we have q case-specific variables
so that we have a 1 × q vector z_i for case i. Our random-utility model can then be expressed as

        u_i = X_i β + (z_i A)′ + ε_i

Here β is a p × 1 vector of alternative-specific regression coefficients and A = (α_1, ..., α_J) is a q × J
matrix of case-specific regression coefficients. The elements of the J × 1 vector ε_i are independent
Type I (Gumbel-type) extreme-value random variables with mean γ (the Euler–Mascheroni constant,
approximately 0.577) and variance π²/6. We must fix one of the α_j to the constant vector to normalize
the location. We set α_k = 0, where k is specified by the basealternative() option. The vector
u_i quantifies the utility that the individual gains from the J alternatives. The alternative chosen by
individual i is the one that maximizes utility.
Daniel Little McFadden was born in 1937 in North Carolina. He studied physics, psychology,
and economics at the University of Minnesota and has taught economics at Pittsburgh, Berkeley,
and MIT. His contributions to logit models were triggered by a student’s project on freeway
routing decisions, and his work consistently links economic theory and applied problems. In
2000, he shared the Nobel Prize in Economics with James J. Heckman.
Example 1
We have data on 295 consumers and their choice of automobile. Each consumer chose among an
American, Japanese, or European car; the variable car indicates the nationality of the car for each
alternative. We want to explore the relationship between the choice of car and the consumer's sex
(variable sex) and income (variable income in thousands of dollars). We also have information on
the number of dealerships of each nationality in the consumer’s city in the variable dealer that we
want to include as a regressor. We assume that consumers’ preferences are influenced by the number
of dealerships in an area but that the number of dealerships is not influenced by consumer preferences
(which we admit is a rather strong assumption). The variable dealer is an alternative-specific variable
(Xi is a 3 × 1 vector in our previous notation), and sex and income are case-specific variables (zi
is a 1 × 2 vector). Each consumer’s chosen car is indicated by the variable choice.
Let’s list some of the data.
. use http://www.stata-press.com/data/r11/choice
. list id car choice dealer sex income in 1/12, sepby(id)
       id        car   choice   dealer      sex   income

  1.    1   American        0       18     male     46.7
  2.    1      Japan        0        8     male     46.7
  3.    1     Europe        1        5     male     46.7

  4.    2   American        1       17     male     26.1
  5.    2      Japan        0        6     male     26.1
  6.    2     Europe        0        2     male     26.1

  7.    3   American        1       12     male     32.7
  8.    3      Japan        0        6     male     32.7
  9.    3     Europe        0        2     male     32.7

 10.    4   American        0       18   female     49.2
 11.    4      Japan        1        7   female     49.2
 12.    4     Europe        0        4   female     49.2
We see, for example, that the first consumer, a male earning $46,700 per year, chose to purchase a
European car even though there are more American and Japanese car dealers in his area. The fourth
consumer, a female earning $49,200 per year, purchased a Japanese car.
We now fit our model.
. asclogit choice dealer, casevars(sex income) case(id) alternatives(car)

Iteration 0:   log likelihood = -273.55685
Iteration 1:   log likelihood = -252.75109
Iteration 2:   log likelihood = -250.78555
Iteration 3:   log likelihood =  -250.7794
Iteration 4:   log likelihood =  -250.7794

Alternative-specific conditional logit         Number of obs      =       885
Case variable: id                              Number of cases    =       295
Alternative variable: car                      Alts per case: min =         3
                                                              avg =       3.0
                                                              max =         3
                                               Wald chi2(5)       =     15.86
Log likelihood = -250.7794                     Prob > chi2        =    0.0072

------------------------------------------------------------------------------
      choice |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
car          |
      dealer |   .0680938   .0344465     1.98   0.048       .00058    .1356076
-------------+----------------------------------------------------------------
American     |  (base alternative)
-------------+----------------------------------------------------------------
Japan        |
         sex |  -.5346039   .3141564    -1.70   0.089    -1.150339    .0811314
      income |   .0325318    .012824     2.54   0.011     .0073973    .0576663
       _cons |  -1.352189   .6911829    -1.96   0.050    -2.706882    .0025049
-------------+----------------------------------------------------------------
Europe       |
         sex |   .5704109   .4540247     1.26   0.209    -.3194612    1.460283
      income |    .032042   .0138676     2.31   0.021      .004862    .0592219
       _cons |  -2.355249   .8526681    -2.76   0.006    -4.026448   -.6840501
------------------------------------------------------------------------------
Displaying the results as odds ratios makes interpretation easier.
. asclogit, or noheader

------------------------------------------------------------------------------
      choice | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
car          |
      dealer |   1.070466   .0368737     1.98   0.048      1.00058    1.145232
-------------+----------------------------------------------------------------
American     |  (base alternative)
-------------+----------------------------------------------------------------
Japan        |
         sex |   .5859013   .1840647    -1.70   0.089     .3165294    1.084513
      income |   1.033067    .013248     2.54   0.011     1.007425    1.059361
-------------+----------------------------------------------------------------
Europe       |
         sex |   1.768994   .8031669     1.26   0.209     .7265404    4.307178
      income |   1.032561   .0143191     2.31   0.021     1.004874    1.061011
------------------------------------------------------------------------------
These results indicate that men (sex = 1) are less likely to pick a Japanese car over an American
car than women (odds ratio 0.59) but that men are more likely to choose a European car over an
American car (odds ratio 1.77). Raising a person’s income increases the likelihood that he or she
purchases a Japanese or European car; interestingly, the effect of higher income is about the same
for these two types of cars.
Technical note
McFadden’s choice model is related to multinomial logistic regression (see [R] mlogit). If all the
independent variables are case specific, then the two models are identical. We verify this supposition
by running the previous example without the alternative-specific variable, dealer.
. asclogit choice, casevars(sex income) case(id) alternatives(car) nolog

Alternative-specific conditional logit         Number of obs      =       885
Case variable: id                              Number of cases    =       295
Alternative variable: car                      Alts per case: min =         3
                                                              avg =       3.0
                                                              max =         3
                                               Wald chi2(4)       =     12.53
Log likelihood = -252.72012                    Prob > chi2        =    0.0138

------------------------------------------------------------------------------
      choice |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
American     |  (base alternative)
-------------+----------------------------------------------------------------
Japan        |
         sex |  -.4694799   .3114939    -1.51   0.132    -1.079997     .141037
      income |   .0276854   .0123666     2.24   0.025     .0034472    .0519236
       _cons |  -1.962652   .6216804    -3.16   0.002    -3.181123   -.7441807
-------------+----------------------------------------------------------------
Europe       |
         sex |   .5388441   .4525279     1.19   0.234    -.3480942    1.425782
      income |   .0273669    .013787     1.98   0.047      .000345    .0543889
       _cons |  -3.180029   .7546837    -4.21   0.000    -4.659182   -1.700876
------------------------------------------------------------------------------
To run mlogit, we must rearrange the dataset. mlogit requires a dependent variable that indicates
the choice—1, 2, or 3—for each individual. We will use car as our dependent variable for those
observations that represent the choice actually chosen.
. keep if choice == 1
(590 observations deleted)
. mlogit car sex income

Iteration 0:   log likelihood =  -259.1712
Iteration 1:   log likelihood = -252.81165
Iteration 2:   log likelihood = -252.72014
Iteration 3:   log likelihood = -252.72012

Multinomial logistic regression                   Number of obs   =        295
                                                  LR chi2(4)      =      12.90
                                                  Prob > chi2     =     0.0118
Log likelihood = -252.72012                       Pseudo R2       =     0.0249

------------------------------------------------------------------------------
         car |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
American     |  (base outcome)
-------------+----------------------------------------------------------------
Japan        |
         sex |  -.4694798   .3114939    -1.51   0.132    -1.079997    .1410371
      income |   .0276854   .0123666     2.24   0.025     .0034472    .0519236
       _cons |  -1.962651   .6216803    -3.16   0.002    -3.181122   -.7441801
-------------+----------------------------------------------------------------
Europe       |
         sex |   .5388443   .4525278     1.19   0.234     -.348094    1.425783
      income |    .027367    .013787     1.98   0.047      .000345    .0543889
       _cons |   -3.18003   .7546837    -4.21   0.000    -4.659182   -1.700877
------------------------------------------------------------------------------
The results are the same except for the model statistic: asclogit uses a Wald test and mlogit
uses a likelihood-ratio test. If you prefer the likelihood-ratio test, you can fit the constant-only model
for asclogit followed by the full model and use [R] lrtest. The following example will carry this
out.
. use http://www.stata-press.com/data/r11/choice, clear
. asclogit choice, case(id) alternatives(car)
. estimates store null
. asclogit choice, casevars(sex income) case(id) alternatives(car)
. lrtest null .
Technical note
We force you to explicitly identify the case-specific variables in the casevars() option to ensure
that the program behaves as you expect. For example, an if or in qualifier may drop observations in
such a way that (what was expected to be) an alternative-specific variable turns into a case-specific
variable. Here you would probably want asclogit to terminate instead of interacting the variable with
the alternative indicators. This situation could also occur if asclogit drops cases, or observations
if you use the altwise option, because of missing values.
Saved results

asclogit saves the following in e():

Scalars
    e(N)                   number of observations
    e(N_case)              number of cases
    e(k)                   number of parameters
    e(k_alt)               number of alternatives
    e(k_indvars)           number of alternative-specific variables
    e(k_casevars)          number of case-specific variables
    e(k_eq)                number of equations in e(b)
    e(k_eq_model)          number of equations in model Wald test
    e(k_autoCns)           number of base, empty, and omitted constraints
    e(df_m)                model degrees of freedom
    e(ll)                  log likelihood
    e(N_clust)             number of clusters
    e(const)               constant indicator
    e(i_base)              base alternative index
    e(chi2)                χ2
    e(F)                   F statistic
    e(p)                   significance
    e(alt_min)             minimum number of alternatives
    e(alt_avg)             average number of alternatives
    e(alt_max)             maximum number of alternatives
    e(rank)                rank of e(V)
    e(ic)                  number of iterations
    e(rc)                  return code
    e(converged)           1 if converged, 0 otherwise

Macros
    e(cmd)                 asclogit
    e(cmdline)             command as typed
    e(depvar)              name of dependent variable
    e(indvars)             alternative-specific independent variable
    e(casevars)            case-specific variables
    e(case)                variable defining cases
    e(altvar)              variable defining alternatives
    e(alteqs)              alternative equation names
    e(alt#)                alternative labels
    e(wtype)               weight type
    e(wexp)                weight expression
    e(title)               title in estimation output
    e(clustvar)            name of cluster variable
    e(offset)              offset
    e(chi2type)            Wald, type of model χ2 test
    e(vce)                 vcetype specified in vce()
    e(vcetype)             title used to label Std. Err.
    e(opt)                 type of optimization
    e(which)               max or min; whether optimizer is to perform
                             maximization or minimization
    e(ml_method)           type of ml method
    e(user)                name of likelihood-evaluator program
    e(technique)           maximization technique
    e(singularHmethod)     m-marquardt or hybrid; method used when Hessian
                             is singular
    e(crittype)            optimization criterion
    e(datasignature)       the checksum
    e(datasignaturevars)   variables used in calculation of checksum
    e(properties)          b V
    e(estat_cmd)           program used to implement estat
    e(predict)             program used to implement predict
    e(marginsnotok)        predictions disallowed by margins

Matrices
    e(b)                   coefficient vector
    e(stats)               alternative statistics
    e(altvals)             alternative values
    e(altfreq)             alternative frequencies
    e(alt_casevars)        indicators for estimated case-specific
                             coefficients: e(k_alt) × e(k_casevars)
    e(ilog)                iteration log (up to 20 iterations)
    e(gradient)            gradient vector
    e(V)                   variance–covariance matrix of the estimators
    e(V_modelbased)        model-based variance

Functions
    e(sample)              marks estimation sample
Methods and formulas
asclogit is implemented as an ado-file.
In this model, we have a set of unordered alternatives indexed by 1, 2, ..., J. Let y_ij, j = 1, ..., J,
be an indicator variable for the alternative actually chosen by the ith individual (case). That is, y_ij = 1
if individual i chose alternative j and y_ij = 0 otherwise. The independent variables come in two
forms: alternative specific and case specific. Alternative-specific variables vary among the alternatives
(as well as cases), and case-specific variables vary only among cases. Assume that we have p
alternative-specific variables so that for case i we have a J × p matrix, X_i. Further, assume that
we have q case-specific variables so that we have a 1 × q vector z_i for case i. The deterministic
component of the random-utility model can then be expressed as

        η_i = X_i β + (z_i A)′
            = X_i β + (z_i ⊗ I_J) vec(A′)
            = (X_i , z_i ⊗ I_J) (β′, vec(A′)′)′
            = X*_i β*

As before, β is a p × 1 vector of alternative-specific regression coefficients, and A = (α_1, ..., α_J)
is a q × J matrix of case-specific regression coefficients; remember that we must fix one of the α_j
to the constant vector to normalize the location. Here I_J is the J × J identity matrix, vec() is the
vector function that creates a vector from a matrix by placing each column of the matrix on top of
the other (see [M-5] vec( )), and ⊗ is the Kronecker product (see [M-2] op kronecker).

We have rewritten the linear equation so that it is in a form that can be used by clogit, namely,
X*_i β*, where

        X*_i = (X_i , z_i ⊗ I_J)        β* = (β′, vec(A′)′)′
With this in mind, see Methods and formulas in [R] clogit for the computational details of the
conditional logit model.
This command supports the clustered version of the Huber/White/sandwich estimator of the
variance using vce(robust) and vce(cluster clustvar). See [P] robust, in particular, in Maximum
likelihood estimators and Methods and formulas. Specifying vce(robust) is equivalent to specifying
vce(cluster casevar), where casevar is the variable that identifies the cases.
References
Cameron, A. C., and P. K. Trivedi. 2009. Microeconometrics Using Stata. College Station, TX: Stata Press.
Greene, W. H. 2008. Econometric Analysis. 6th ed. Upper Saddle River, NJ: Prentice–Hall.
McFadden, D. L. 1974. Conditional logit analysis of qualitative choice behavior. In Frontiers in Econometrics, ed.
P. Zarembka, 105–142. New York: Academic Press.
Also see
[R] asclogit postestimation — Postestimation tools for asclogit
[R] asmprobit — Alternative-specific multinomial probit regression
[R] asroprobit — Alternative-specific rank-ordered probit regression
[R] clogit — Conditional (fixed-effects) logistic regression
[R] logistic — Logistic regression, reporting odds ratios
[R] logit — Logistic regression, reporting coefficients
[R] nlogit — Nested logit regression
[R] ologit — Ordered logistic regression
[U] 20 Estimation and postestimation commands
Title
asclogit postestimation — Postestimation tools for asclogit
Description

The following postestimation commands are of special interest after asclogit:

  commands              description
  -------------------------------------------------------------------------
  estat alternatives    alternative summary statistics
  estat mfx             marginal effects
  -------------------------------------------------------------------------

For information about these commands, see below.

The following standard postestimation commands are also available:

  commands      description
  -------------------------------------------------------------------------
  estat         AIC, BIC, VCE, and estimation sample summary
  estimates     cataloging estimation results
  hausman       Hausman's specification test
  lincom        point estimates, standard errors, testing, and inference for
                  linear combinations of coefficients
  lrtest        likelihood-ratio test
  nlcom         point estimates, standard errors, testing, and inference for
                  nonlinear combinations of coefficients
  predict       predicted probabilities, estimated linear predictor and its
                  standard error
  predictnl     point estimates, standard errors, testing, and inference for
                  generalized predictions
  test          Wald tests of simple and composite linear hypotheses
  testnl        Wald tests of nonlinear hypotheses
  -------------------------------------------------------------------------

See the corresponding entries in the Base Reference Manual for details.
Special-interest postestimation commands
estat alternatives displays summary statistics about the alternatives in the estimation sample.
estat mfx computes probability marginal effects.
Syntax for predict
predict [type] newvar [if] [in] [, statistic options]

predict [type] {stub* | newvarlist} [if] [in], scores

statistic        description

Main
  pr             probability that each alternative is chosen; the default
  xb             linear prediction
  stdp           standard error of the linear prediction

options          description

Main
* k(condspec)    condition on condspec alternatives chosen by each case when
                   computing probabilities
  altwise        use alternative-wise deletion instead of casewise deletion
                   when computing probabilities
  nooffset       ignore the offset() variable specified in asclogit

* k(condspec) may be used only with pr.
These statistics are available both in and out of sample; type
predict ... if e(sample) ... if wanted only for the estimation sample.
Menu
Statistics > Postestimation > Predictions, residuals, etc.
Options for predict
Main
pr computes the probability of choosing each alternative conditioned on each case choosing
k(condspec) alternatives. This is the default statistic with default k(1); one alternative per
case is chosen.
xb computes the linear prediction.
stdp computes the standard error of the linear prediction.
k(# | observed) conditions the probability on choosing # alternatives per case, or use k(observed)
to condition on the observed number of alternatives chosen. The default is k(1). This option may
be used only with the pr option.
altwise specifies that alternative-wise deletion be used when marking out observations due to missing
values in your variables. The default is to use casewise deletion. The xb and stdp options always
use alternative-wise deletion.
nooffset is relevant only if you specified offset(varname) for asclogit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as
xβ rather than as xβ + offset.
scores calculates the scores for each coefficient in e(b). This option requires a new variable list of
length equal to the number of columns in e(b). Otherwise, use the stub* option to have predict
generate enumerated variables with prefix stub.
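For instance, after the asclogit fit shown in example 1 below, all the score variables could be generated in one step by using the stub* form (a usage sketch; sc is an arbitrary stub name):

. predict sc*, scores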
Syntax for estat alternatives
estat alternatives
Menu
Statistics > Postestimation > Reports and statistics
Syntax for estat mfx
estat mfx [if] [in] [, options]

options                                description

Main
  varlist(varlist)                     display marginal effects for varlist
  at(mean [atlist] | median [atlist])  calculate marginal effects at these
                                         values
  k(#)                                 condition on the number of alternatives
                                         chosen to be #

Options
  level(#)                             set confidence interval level; default
                                         is level(95)
  nodiscrete                           treat indicator variables as continuous
  noesample                            do not restrict calculation of means
                                         and medians to the estimation sample
  nowght                               ignore weights when calculating means
                                         and medians
Menu
Statistics > Postestimation > Reports and statistics
Options for estat mfx
Main
varlist(varlist) specifies the variables for which to display marginal effects. The default is all
variables.
at(mean [atlist] | median [atlist]) specifies the values at which the marginal effects are to be calculated. atlist is
alternative:variable = #
variable = #
alternative:offset = #
...
The default is to calculate the marginal effects at the means of the independent variables by using
the estimation sample, at(mean). If offset() is used during estimation, the means of the offsets
(by alternative) are computed by default.
After specifying the summary statistic, you can specify a series of specific values for variables.
You can specify values for alternative-specific variables by alternative, or you can specify one
value for all alternatives. You can specify only one value for case-specific variables. You specify
values for the offset() variable (if present) the same way as for alternative-specific variables. For
example, in the choice dataset (car choice), income is a case-specific variable, whereas dealer
is an alternative-specific variable. The following would be a legal syntax for estat mfx:
. estat mfx, at(mean American:dealer=18 income=40)
When nodiscrete is not specified, at(mean atlist ) or at(median atlist ) has no effect on
computing marginal effects for indicator variables, which are calculated as the discrete change in
the simulated probability as the indicator variable changes from 0 to 1.
The mean and median computations respect any if or in qualifiers, so you can restrict the data
over which the statistic is computed. You can even restrict the values to a specific case, e.g.,
. estat mfx if case==21
k(#) computes the probabilities conditioned on # alternatives chosen. The default is one alternative
chosen.
Options
level(#) sets the confidence level; default is level(95).
nodiscrete specifies that indicator variables be treated as continuous variables. An indicator variable
is one that takes on the value 0 or 1 in the estimation sample. By default, the discrete change in
the simulated probability is computed as the indicator variable changes from 0 to 1.
noesample specifies that the whole dataset be considered instead of only the observations marked in e(sample) as defined by the asclogit command.
nowght specifies that weights be ignored when calculating the means and medians.
Remarks
Remarks are presented under the following headings:
Predicted probabilities
Obtaining estimation statistics
Predicted probabilities
After fitting a McFadden’s choice model with alternative-specific conditional logistic regression,
you can use predict to obtain the estimated probability of alternative choices given case profiles.
Example 1
In example 1 of [R] asclogit, we fit a model of consumer choice of automobile. The alternatives are the nationality of the automobile manufacturer: American, Japanese, or European. There is one alternative-specific variable in the model, dealer, which contains the number of dealerships of each nationality in the consumer's city. The case-specific variables are sex, the consumer's sex, and income, the consumer's income in thousands of dollars.
. use http://www.stata-press.com/data/r11/choice
. asclogit choice dealer, casevars(sex income) case(id) alternatives(car)
(output omitted )
. predict p
(option pr assumed; Pr(car))
. predict p2, k(2)
(option pr assumed; Pr(car))
. format p p2 %6.4f
. list car choice dealer sex income p p2 in 1/9, sepby(id)
            car   choice   dealer    sex   income        p       p2

  1.   American        0       18   male     46.7   0.6025   0.8589
  2.      Japan        0        8   male     46.7   0.2112   0.5974
  3.     Europe        1        5   male     46.7   0.1863   0.5437

  4.   American        1       17   male     26.1   0.7651   0.9293
  5.      Japan        0        6   male     26.1   0.1282   0.5778
  6.     Europe        0        2   male     26.1   0.1067   0.4929

  7.   American        1       12   male     32.7   0.6519   0.8831
  8.      Japan        0        6   male     32.7   0.1902   0.5995
  9.     Europe        0        2   male     32.7   0.1579   0.5174
Obtaining estimation statistics
Here we will demonstrate the specialized estat subcommands after asclogit. Use estat
alternatives to obtain a table of alternative statistics. The table will contain the alternative values,
labels (if any), the number of cases in which each alternative is present, the frequency that the
alternative is selected, and the percent selected.
Use estat mfx to obtain marginal effects after asclogit.
Example 2
We will continue with the automobile choice example, where we first list the alternative statistics
and then compute the marginal effects at the mean income in our sample, assuming that there are
five automobile dealers for each nationality. We will evaluate the probabilities for females because
sex is coded 0 for females, and we will be obtaining the discrete change from 0 to 1.
. estat alternatives
Alternatives summary for car

                Alternative                 Cases    Frequency    Percent
       index          value   label       present     selected   selected

           1              1   American        295          192      65.08
           2              2   Japan           295           64      21.69
           3              3   Europe          295           39      13.22
. estat mfx, at(dealer=0 sex=0) varlist(sex income)
Pr(choice = American|1 selected) = .41964329

    variable      dp/dx   Std. Err.      z    P>|z|   [    95% C.I.   ]        X

casevars
        sex*    .026238     .068311   0.38    0.701   -.107649   .160124        0
      income   -.007891     .002674  -2.95    0.003   -.013132   -.00265   42.097

(*) dp/dx is for discrete change of indicator variable from 0 to 1
Pr(choice = Japan|1 selected) = .42696187

    variable      dp/dx   Std. Err.      z    P>|z|   [    95% C.I.   ]        X

casevars
        sex*   -.161164     .079238  -2.03    0.042   -.316468  -.005859        0
      income    .005861     .002997   1.96    0.051   -.000014   .011735   42.097

(*) dp/dx is for discrete change of indicator variable from 0 to 1
Pr(choice = Europe|1 selected) = .15339484

    variable      dp/dx   Std. Err.      z    P>|z|   [    95% C.I.   ]        X

casevars
        sex*    .134926     .076556   1.76    0.078   -.015122   .284973        0
      income     .00203     .001785   1.14    0.255   -.001469    .00553   42.097

(*) dp/dx is for discrete change of indicator variable from 0 to 1
The marginal effect of income indicates that there is a lower chance for a consumer to buy American
automobiles with an increase in income. There is an indication that men have a higher preference
for European automobiles than women but a lower preference for Japanese automobiles. We did not
include the marginal effects for dealer because we view these as nuisance parameters, so we adjusted
the probabilities by fixing dealer to a constant, 0.
Saved results
estat mfx saves the following in r():
Scalars
  r(pr_alt)    scalars containing the computed probability of each alternative
               evaluated at the value that is labeled X in the table output.
               Here alt are the labels in the macro e(alteqs).

Matrices
  r(alt)       matrices containing the computed marginal effects and associated
               statistics. There is one matrix for each alternative, where alt
               are the labels in the macro e(alteqs). Column 1 of each matrix
               contains the marginal effects; column 2, their standard errors;
               column 3, their z statistics; and columns 4 and 5, the confidence
               intervals. Column 6 contains the values of the independent
               variables used to compute the probabilities r(pr_alt).
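For instance, after the estat mfx call in example 2 above, these saved results could be inspected as follows (a sketch; the matrix names come from the e(alteqs) labels American, Japan, and Europe):

. return list
. display r(pr_American)
. matrix list r(American)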
Methods and formulas
All postestimation commands listed above are implemented as ado-files.
The deterministic component of the random-utility model can be expressed as
$$
\begin{aligned}
\eta &= X\beta + (zA)' \\
     &= X\beta + (z \otimes I_J)\,\mathrm{vec}(A') \\
     &= (X,\; z \otimes I_J)
        \begin{pmatrix} \beta \\ \mathrm{vec}(A') \end{pmatrix} \\
     &= X^*\beta^*
\end{aligned}
$$
where X is the J × p matrix containing the alternative-specific covariates, z is a 1 × q vector
of case-specific variables, β is a p × 1 vector of alternative-specific regression coefficients, and
A = (α1 , . . . , αJ ) is a q × J matrix of case-specific regression coefficients (with one of the αj
fixed to the constant). Here $I_J$ is the $J \times J$ identity matrix, vec() is the function that stacks the columns of a matrix into a single vector (see [M-5] vec( )), and ⊗ is the Kronecker product (see [M-2] op_kronecker).
We have rewritten the linear equation in a form that we all recognize, namely, $\eta = X^*\beta^*$, where

$$
X^* = (X,\; z \otimes I_J), \qquad
\beta^* = \begin{pmatrix} \beta \\ \mathrm{vec}(A') \end{pmatrix}
$$
To compute the marginal effects, we use the derivative of the log likelihood $\partial\ell(y|\eta)/\partial\eta$, where $\ell(y|\eta) = \log \Pr(y|\eta)$ is the log of the probability of the choice indicator vector $y$ given the linear predictor vector $\eta$. Namely,

$$
\frac{\partial \ell(y|\eta)}{\partial\,\mathrm{vec}(X^*)'}
= \frac{1}{\Pr(y|\eta)}\,\frac{\partial \Pr(y|\eta)}{\partial \eta'}\,
  \frac{\partial \eta}{\partial\,\mathrm{vec}(X^*)'}
= \frac{1}{\Pr(y|\eta)}\,\frac{\partial \Pr(y|\eta)}{\partial \eta'}\,
  \left(\beta^{*\prime} \otimes I_J\right)
$$
The standard errors of the marginal effects are computed using the delta method.
Also see
[R] asclogit — Alternative-specific conditional logit (McFadden’s choice) model
[U] 20 Estimation and postestimation commands
Title
asmprobit — Alternative-specific multinomial probit regression
Syntax
asmprobit depvar [indepvars] [if] [in] [weight], case(varname)
      alternatives(varname) [options]

options                           description

Model
* case(varname)                   use varname to identify cases
* alternatives(varname)           use varname to identify the alternatives
                                    available for each case
  casevars(varlist)               case-specific variables
  constraints(constraints)        apply specified linear constraints
  collinear                       keep collinear variables

Model 2
  correlation(correlation)        correlation structure of the latent-variable
                                    errors
  stddev(stddev)                  variance structure of the latent-variable
                                    errors
  structural                      use the structural covariance
                                    parameterization; default is the
                                    differenced covariance parameterization
  factor(#)                       use the factor covariance structure with
                                    dimension #
  noconstant                      suppress the alternative-specific constant
                                    terms
  basealternative(# | lbl | str)  alternative used for normalizing location
  scalealternative(# | lbl | str) alternative used for normalizing scale
  altwise                         use alternative-wise deletion instead of
                                    casewise deletion

SE/Robust
  vce(vcetype)                    vcetype may be oim, robust, cluster
                                    clustvar, opg, bootstrap, or jackknife

Reporting
  level(#)                        set confidence level; default is level(95)
  notransform                     do not transform variance–covariance
                                    estimates to the standard deviation and
                                    correlation metric
  nocnsreport                     do not display constraints
Integration
  intmethod(seqtype)              type of quasi- or pseudouniform point set
  intpoints(#)                    number of points in each sequence
  intburn(#)                      starting index in the Hammersley or Halton
                                    sequence
  intseed(code | #)               pseudouniform random-number seed
  antithetics                     use antithetic draws
  nopivot                         do not use integration interval pivoting
  initbhhh(#)                     use the BHHH optimization algorithm for the
                                    first # iterations
  favor(speed | space)            favor speed or space when generating
                                    integration points

Maximization
  maximize_options                control the maximization process
† coeflegend                      display coefficients' legend instead of
                                    coefficient table

correlation                       description

  unstructured                    one correlation parameter for each pair of
                                    alternatives; correlations with the
                                    basealternative() are zero; the default
  exchangeable                    one correlation parameter common to all
                                    pairs of alternatives; correlations with
                                    the basealternative() are zero
  independent                     constrain all correlation parameters to zero
  pattern matname                 user-specified matrix identifying the
                                    correlation pattern
  fixed matname                   user-specified matrix identifying the fixed
                                    and free correlation parameters

stddev                            description

  heteroskedastic                 estimate standard deviation for each
                                    alternative; standard deviations for
                                    basealternative() and scalealternative()
                                    set to one
  homoskedastic                   all standard deviations are one
  pattern matname                 user-specified matrix identifying the
                                    standard deviation pattern
  fixed matname                   user-specified matrix identifying the fixed
                                    and free standard deviation parameters

seqtype                           description

  hammersley                      Hammersley point set
  halton                          Halton point set
  random                          uniform pseudorandom point set

* case(varname) and alternatives(varname) are required.
† coeflegend does not appear in the dialog box.
bootstrap, by, jackknife, statsby, and xi are allowed; see [U] 11.1.10 Prefix
commands.
Weights are not allowed with the bootstrap prefix.
fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight.
See [U] 20 Estimation and postestimation commands for more capabilities of
estimation commands.
Menu
Statistics > Categorical outcomes > Alternative-specific multinomial probit
Description
asmprobit fits multinomial probit (MNP) models by using maximum simulated likelihood (MSL)
implemented by the Geweke–Hajivassiliou–Keane (GHK) algorithm. By estimating the variance–
covariance parameters of the latent-variable errors, the model allows you to relax the independence
of irrelevant alternatives (IIA) property that is characteristic of the multinomial logistic model.
asmprobit requires multiple observations for each case (decision), where each observation represents an alternative that may be chosen. The cases are identified by the variable specified in the case() option, whereas the alternatives are identified by the variable specified in the alternatives() option. The outcome (chosen alternative) is identified by a value of 1 in depvar, with 0 indicating the alternatives that were not chosen; only one alternative may be chosen for each case.
asmprobit allows two types of independent variables: alternative-specific variables and case-specific variables. Alternative-specific variables vary across both cases and alternatives and are specified in indepvars. Case-specific variables vary only across cases and are specified in the casevars() option.
Options
Model
case(varname) specifies the variable that identifies each case. This variable identifies the individuals
or entities making a choice. case() is required.
alternatives(varname) specifies the variable that identifies the alternatives available for each case.
The number of alternatives can vary with each case; the maximum number of alternatives is 20.
alternatives() is required.
casevars(varlist) specifies the case-specific variables that are constant for each case(). If there are
a maximum of J alternatives, there will be J − 1 sets of coefficients associated with casevars().
constraints(constraints), collinear; see [R] estimation options.
Model 2
correlation(correlation) specifies the correlation structure of the latent-variable errors.
correlation(unstructured) is the most general and has J(J − 3)/2 + 1 unique correlation parameters. This is the default unless stddev() or structural is specified.
correlation(exchangeable) provides for one correlation coefficient common to all latent
variables, except the latent variable associated with the basealternative() option.
correlation(independent) assumes that all correlations are zero.
correlation(pattern matname) and correlation(fixed matname) give you more flexibility
in defining the correlation structure. See Variance structures later in this entry for more
information.
stddev(stddev) specifies the variance structure of the latent-variable errors.
stddev(heteroskedastic) is the most general and has J − 2 estimable parameters. The standard
deviations of the latent-variable errors for the alternatives specified in basealternative()
and scalealternative() are fixed to one.
stddev(homoskedastic) constrains all the standard deviations to equal one.
stddev(pattern matname) and stddev(fixed matname) give you added flexibility in defining
the standard deviation parameters. See Variance structures later in this entry for more information.
structural requests the J × J structural covariance parameterization instead of the default (J − 1) × (J − 1) differenced covariance parameterization (the covariance of the latent errors differenced with that
of the base alternative). The differenced covariance parameterization will achieve the same MSL
regardless of the choice of basealternative() and scalealternative(). On the other hand,
the structural covariance parameterization imposes more normalizations that may bound the model
away from the maximum likelihood estimates and thus prevent convergence with some datasets
or choices of basealternative() and scalealternative().
factor(#) requests that the factor covariance structure of dimension # be used. The factor() option
can be used with the structural option but cannot be used with stddev() or correlation().
A # × J (or # × (J − 1)) matrix, C, is used to factor the covariance matrix as I + C′C, where I is the identity matrix of dimension J (or J − 1). The column dimension of C depends on whether the covariance is structural or differenced. The row dimension of C, #, must be less than or equal to floor((J(J − 1)/2 − 1)/(J − 2)), because there are only J(J − 1)/2 − 1 identifiable
variance–covariance parameters. This covariance parameterization may be useful for reducing the
number of covariance parameters that need to be estimated.
If the covariance is structural, the column of C corresponding to the base alternative contains zeros.
The column corresponding to the scale alternative has a one in the first row and zeros elsewhere.
If the covariance is differenced, the column corresponding to the scale alternative (differenced with
the base) has a one in the first row and zeros elsewhere.
noconstant suppresses the J − 1 alternative-specific constant terms.
basealternative(# | lbl | str) specifies the alternative used to normalize the latent-variable location
(also referred to as the level of utility). The base alternative may be specified as a number, label,
or string. The standard deviation for the latent-variable error associated with the base alternative
is fixed to one, and its correlations with all other latent-variable errors are set to zero. The default
is the first alternative when sorted. If a fixed or pattern matrix is given in the stddev()
and correlation() options, the basealternative() will be implied by the fixed standard
deviations and correlations in the matrix specifications. basealternative() cannot be equal to
scalealternative().
scalealternative(# | lbl | str) specifies the alternative used to normalize the latent-variable scale
(also referred to as the scale of utility). The scale alternative may be specified as a number,
label, or string. The default is to use the second alternative when sorted. If a fixed or pattern
matrix is given in the stddev() option, the scalealternative() will be implied by the
fixed standard deviations in the matrix specification. scalealternative() cannot be equal to
basealternative().
If a fixed or pattern matrix is given for the stddev() option, the base alternative and scale
alternative are implied by the standard deviations and correlations in the matrix specifications, and
they need not be specified in the basealternative() and scalealternative() options.
altwise specifies that alternative-wise deletion be used when marking out observations due to
missing values in your variables. The default is to use casewise deletion; that is, the entire group
of observations making up a case is deleted if any missing values are encountered. This option
does not apply to observations that are marked out by the if or in qualifier or the by prefix.
SE/Robust
vce(vcetype) specifies the type of standard error reported, which includes types that are derived
from asymptotic theory, that are robust to some kinds of misspecification, that allow for intragroup
correlation, and that use bootstrap or jackknife methods; see [R] vce option.
If specifying vce(bootstrap) or vce(jackknife), you must also specify basealternative()
and scalealternative().
Reporting
level(#); see [R] estimation options.
notransform prevents retransforming the Cholesky-factored variance–covariance estimates to the
correlation and standard deviation metric.
This option has no effect if structural is not specified because the default differenced variance–
covariance estimates have no interesting interpretation as correlations and standard deviations.
notransform also has no effect if the correlation() and stddev() options are specified with
anything other than their default values. Here it is generally not possible to factor the variance–
covariance matrix, so optimization is already performed using the standard deviation and correlation
representations.
nocnsreport; see [R] estimation options.
Integration
intmethod(hammersley | halton | random) specifies the method of generating the point sets used in
the quasi–Monte Carlo integration of the multivariate normal density. intmethod(hammersley),
the default, uses the Hammersley sequence; intmethod(halton) uses the Halton sequence; and
intmethod(random) uses a sequence of uniform random numbers.
intpoints(#) specifies the number of points to use in the quasi–Monte Carlo integration. If
this option is not specified, the number of points is 50 × J if intmethod(hammersley) or
intmethod(halton) is used and 100 × J if intmethod(random) is used. Larger values of
intpoints() provide better approximations of the log likelihood, but at the cost of added
computation time.
intburn(#) specifies where in the Hammersley or Halton sequence to start, which helps reduce the
correlation between the sequences of each dimension. The default is 0. This option may not be
specified with intmethod(random).
intseed(code | #) specifies the seed to use for generating the uniform pseudorandom sequence. This
option may be specified only with intmethod(random). code refers to a string that records the
state of the random-number generator runiform(); see [R] set seed. An integer value # may
be used also. The default is to use the current seed value from Stata’s uniform random-number
generator, which can be obtained from c(seed).
antithetics specifies that antithetic draws be used. The antithetic draw for the J − 1 vector
uniform-random variables, x, is 1 − x.
nopivot turns off integration interval pivoting. By default, asmprobit will pivot the wider intervals
of integration to the interior of the multivariate integration. This improves the accuracy of the
quadrature estimate. However, discontinuities may result in the computation of numerical second-order derivatives using finite differencing (for the Newton–Raphson optimize technique, tech(nr)) when few simulation points are used, resulting in a nonpositive-definite Hessian. asmprobit uses the Broyden–Fletcher–Goldfarb–Shanno optimization algorithm, by default, which does not require computing the Hessian numerically using finite differencing.
initbhhh(#) specifies that the Berndt–Hall–Hall–Hausman (BHHH) algorithm be used for the initial
# optimization steps. This option is the only way to use the BHHH algorithm along with other
optimization techniques. The algorithm switching feature of ml’s technique() option cannot
include bhhh.
favor(speed | space) instructs asmprobit to favor either speed or space when generating the
integration points. favor(speed) is the default. When favoring speed, the integration points are
generated once and stored in memory, thus increasing the speed of evaluating the likelihood. This
speed increase can be seen when there are many cases or when the user specifies a large number
of integration points, intpoints(#). When favoring space, the integration points are generated
repeatedly with each likelihood evaluation.
For unbalanced data, where the number of alternatives varies with each case, the estimates computed
using intmethod(random) will vary slightly between favor(speed) and favor(space). This
is because the uniform sequences will not be identical, even when initiating the sequences using the
same uniform seed, intseed(code | #). For favor(speed), ncase blocks of intpoints(#) ×
J − 2 uniform points are generated, where J is the maximum number of alternatives. For
favor(space), the column dimension of the matrices of points varies with the number of
alternatives that each case has.
Maximization
maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace,
gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#),
nrtolerance(#), nonrtolerance, from(init specs); see [R] maximize.
The following options may be particularly useful in obtaining convergence with asmprobit:
difficult, technique(algorithm spec), nrtolerance(#), nonrtolerance, and
from(init specs).
If technique() contains more than one algorithm specification, bhhh cannot be one of them. To
use the BHHH algorithm with another algorithm, use the initbhhh() option and specify the other
algorithm in technique().
Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg).
The following option is available with asmprobit but is not shown in the dialog box:
coeflegend; see [R] estimation options.
Remarks
Remarks are presented under the following headings:
Introduction
Variance structures
Introduction
The MNP model is used with discrete dependent variables that take on more than two outcomes
that do not have a natural ordering. The stochastic error terms are assumed to have a multivariate
normal distribution that is heteroskedastic and correlated. Say that you have a set of J unordered
alternatives that are modeled by a regression of both case-specific and alternative-specific covariates.
A “case” refers to the information on one decision maker. Underlying the model is the set of J latent
variables (utilities),
ηij = xij β + zi αj + ξij                                        (1)
where i denotes cases and j denotes alternatives. xij is a 1 × p vector of alternative-specific variables,
β is a p × 1 vector of parameters, zi is a 1 × q vector of case-specific variables, αj is a q × 1 vector
of parameters for the j th alternative, and ξi = (ξi1 , . . . , ξiJ ) is distributed multivariate normal with
mean zero and covariance matrix Ω. The decision maker selects the alternative whose latent variable
is highest.
Because the MNP model allows for a general covariance structure in ξij , it does not impose the
IIA property inherent in multinomial logistic and conditional logistic models. That is, the MNP model
permits the odds of choosing one alternative over another to depend on the remaining alternatives. For
example, consider the choice of travel mode between two cities: air, train, bus, or car, as a function
of the travel mode cost, travel time (alternative-specific variables), and an individual’s income (a
case-specific variable). The odds of choosing air travel over a bus may not be independent of the train
alternative because both bus and train travel are public ground transportation. That is, the probability
of choosing air travel is Pr(ηair > ηbus , ηair > ηtrain , ηair > ηcar ), and the two events ηair > ηbus
and ηair > ηtrain may be correlated.
An alternative to MNP that will allow a nested correlation structure in ξij is the nested logit model
(see [R] nlogit).
The added flexibility of the MNP model does impose a significant computation burden because of
the need to evaluate probabilities from the multivariate normal distribution. These probabilities are
evaluated using simulation techniques because a closed-form solution does not exist. See Methods
and formulas for more information.
Not all the J sets of regression coefficients αj are identifiable, nor are all J(J + 1)/2 elements of
the variance–covariance matrix Ω. As described by Train (2003), the model requires normalization
because both the location (level) and scale of the latent variable are irrelevant. Increasing the latent
variables by a constant does not change which ηij is the maximum for decision maker i, nor does
multiplying them by a constant. To normalize location, we choose an alternative, indexed by k , say,
and take the difference between the latent variable k and the J − 1 others,
$$
\begin{aligned}
v_{ijk} &= \eta_{ij} - \eta_{ik} \\
        &= (x_{ij} - x_{ik})\beta + z_i(\alpha_j - \alpha_k) + \xi_{ij} - \xi_{ik} \\
        &= \delta_{ij'}\beta + z_i\gamma_{j'} + \epsilon_{ij'} \qquad (2) \\
        &= \lambda_{ij'} + \epsilon_{ij'}
\end{aligned}
$$

where $j' = j$ if $j < k$ and $j' = j - 1$ if $j > k$, so that $j' = 1, \ldots, J - 1$. One can now work with the $(J-1) \times (J-1)$ covariance matrix $\Sigma_{(k)}$ for $\epsilon_i' = (\epsilon_{i1}, \ldots, \epsilon_{i,J-1})$. The $k$th alternative here is the basealternative() in asmprobit. From (2), the probability that decision maker $i$ chooses alternative $k$, for example, is

$$
\begin{aligned}
\Pr(i \text{ chooses } k) &= \Pr(v_{i1k} \le 0, \ldots, v_{i,J-1,k} \le 0) \\
&= \Pr(\epsilon_{i1} \le -\lambda_{i1}, \ldots, \epsilon_{i,J-1} \le -\lambda_{i,J-1})
\end{aligned}
$$
To normalize for scale, one of the diagonal elements of Σ(k) must be fixed to a constant. In
asmprobit, this is the error variance for the alternative specified by scalealternative(). Thus
there are a total of, at most, J(J − 1)/2 − 1 identifiable variance–covariance parameters. See Variance
structures below for more on this issue.
In fact, the model is slightly more general in that not all cases need to have faced all J alternatives.
The model allows for situations in which some cases chose among all possible alternatives, whereas
other cases were given a choice among a subset of them, and perhaps other cases were given a
choice among a different subset. The number of observations for each case is equal to the number
of alternatives faced.
The MNP model is often motivated using a random-utility consumer-choice framework. Equation
(1) represents the utility that consumer i receives from good j . The consumer purchases the good for
which the utility is highest. Because utility is ordinal, all that matters is the ranking of the utilities
from the alternatives. Thus one must normalize for location and scale.
Example 1
Application of MNP models is common in the analysis of transportation data. Greene (2008,
sec. 23.11.7) uses travel-mode choice data between Sydney and Melbourne to demonstrate estimating
parameters of various discrete-choice models. The data contain information on 210 individuals’
choices of travel mode. The four alternatives are air, train, bus, and car, with indices 1, 2, 3, and 4,
respectively. One alternative-specific variable is travelcost, a measure of generalized cost of travel
that is equal to the sum of in-vehicle cost and a wagelike measure times the amount of time spent
traveling. A second alternative-specific variable is the terminal time, termtime, which is zero for car
transportation. Household income, income, is a case-specific variable.
. use http://www.stata-press.com/data/r11/travel
. list id mode choice travelcost termtime income in 1/12, sepby(id)
       id    mode   choice   travel~t   termtime   income

  1.    1     air        0         70         69       35
  2.    1   train        0         71         34       35
  3.    1     bus        0         70         35       35
  4.    1     car        1         30          0       35

  5.    2     air        0         68         64       30
  6.    2   train        0         84         44       30
  7.    2     bus        0         85         53       30
  8.    2     car        1         50          0       30

  9.    3     air        0        129         69       40
 10.    3   train        0        195         34       40
 11.    3     bus        0        149         35       40
 12.    3     car        1        101          0       40
The model of travel choice is
ηij = β1 travelcostij + β2 termtimeij + α1j incomei + α0j + ξij
The alternatives can be grouped as air and ground travel. With this in mind, we set the air alternative
to be the basealternative() and choose train as the scaling alternative. Because these are the
first and second alternatives in the mode variable, they are also the defaults.
. asmprobit choice travelcost termtime, casevars(income) case(id)
> alternatives(mode)
(output omitted )
Alternative-specific multinomial probit          Number of obs      =       840
Case variable: id                                Number of cases    =       210
Alternative variable: mode                       Alts per case: min =         4
                                                                avg =       4.0
                                                                max =         4
Integration sequence:      Hammersley
Integration points:               200            Wald chi2(5)       =     32.06
Log simulated-likelihood = -190.09419            Prob > chi2        =    0.0000

      choice        Coef.   Std. Err.      z   P>|z|     [95% Conf. Interval]

mode
  travelcost    -.0097707   .0027835   -3.51   0.000     -.0152261   -.0043152
    termtime    -.0377034   .0094046   -4.01   0.000     -.0561361   -.0192708

air              (base alternative)

train
      income    -.0291886   .0089232   -3.27   0.001     -.0466778   -.0116995
       _cons     .5615485    .394619    1.42   0.155     -.2118906    1.334988

bus
      income    -.0127473   .0079269   -1.61   0.108     -.0282839    .0027892
       _cons    -.0572738   .4791635   -0.12   0.905     -.9964169    .8818693

car
      income    -.0049067   .0077481   -0.63   0.527     -.0200927    .0102792
       _cons    -1.833159     .81842   -2.24   0.025     -3.437233   -.2290856

     /lnl2_2    -.5499745   .3903368   -1.41   0.159     -1.315021    .2150717
     /lnl3_3    -.6008993   .3354232   -1.79   0.073     -1.258317     .056518

       /l2_1     1.131589   .2125186    5.32   0.000      .7150604    1.548118
       /l3_1     .9720683   .2352248    4.13   0.000      .5110362      1.4331
       /l3_2     .5196988   .2860692    1.82   0.069     -.0409865    1.080384

(mode=air is the alternative normalizing location)
(mode=train is the alternative normalizing scale)
. estimates store full
By default, the differenced covariance parameterization is used, so the covariance matrix for this
model is 3 × 3. There are two free variances to estimate and three correlations. To help ensure that the
covariance matrix remains positive definite, asmprobit uses the square root transformation, where it
optimizes on the Cholesky-factored variance–covariance. To ensure that the diagonal elements of the
Cholesky estimates remain positive, we use the log transformation. The estimates labeled /lnl2_2 and /lnl3_3 in the coefficient table are the log-transformed diagonal elements of the Cholesky matrix. The estimates labeled /l2_1, /l3_1, and /l3_2 are the off-diagonal entries for elements (2, 1), (3, 1), and (3, 2) of the Cholesky matrix.
Although the transformed parameters of the differenced covariance parameterization are difficult
to interpret, you can view them untransformed by using the estat command. Typing
. estat correlation
             train      bus      car
  train     1.0000
    bus     0.8909   1.0000
    car     0.7896   0.8952   1.0000

Note: correlations are for alternatives differenced with air
gives the correlations, and typing
. estat covariance
                train        bus        car
  train            2
    bus     1.600309   1.613382
    car     1.374712    1.39983   1.515656

Note: covariances are for alternatives differenced with air
gives the (co)variances.
We can reduce the number of covariance parameters in the model by using the factor model of Cameron and Trivedi (2005). For large models with many alternatives, the parameter reduction can be dramatic, but for our example we will use factor(1), a one-dimensional factor model, to reduce by 3 the number of parameters associated with the covariance matrix.
. asmprobit choice travelcost termtime, casevars(income) case(id)
> alternatives(mode) factor(1)
(output omitted )
Alternative-specific multinomial probit          Number of obs      =       840
Case variable: id                                Number of cases    =       210
Alternative variable: mode                       Alts per case: min =         4
                                                                avg =       4.0
                                                                max =         4
Integration sequence:      Hammersley
Integration points:               200            Wald chi2(5)       =    107.85
Log simulated-likelihood = -196.85094            Prob > chi2        =    0.0000

      choice        Coef.   Std. Err.      z   P>|z|     [95% Conf. Interval]

mode
  travelcost    -.0093696   .0036329   -2.58   0.010       -.01649   -.0022492
    termtime    -.0593173   .0064585   -9.18   0.000     -.0719757   -.0466589

air              (base alternative)

train
      income    -.0373511   .0098219   -3.80   0.000     -.0566018   -.0181004
       _cons     .1092322   .3949529    0.28   0.782     -.6648613    .8833257

bus
      income    -.0158793   .0112239   -1.41   0.157     -.0378777    .0061191
       _cons    -1.082181   .4678732   -2.31   0.021     -1.999196   -.1651666

car
      income     .0042677   .0092601    0.46   0.645     -.0138817    .0224171
       _cons    -3.765445   .5540636   -6.80   0.000     -4.851389     -2.6795

       /c1_2     1.182805   .3060299    3.86   0.000      .5829972    1.782612
       /c1_3     1.227705   .3401237    3.61   0.000      .5610747    1.894335

(mode=air is the alternative normalizing location)
(mode=train is the alternative normalizing scale)
The estimates labeled /c1_2 and /c1_3 in the coefficient table are the factor loadings. These factor loadings produce the following differenced covariance estimates:
. estat covariance
                train        bus        car
  train            2
    bus     1.182805   2.399027
    car     1.227705   1.452135   2.507259

Note: covariances are for alternatives differenced with air
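As a check on this factorization, the differenced covariance can be rebuilt by hand from the factor loadings in Mata. Per the factor() option described earlier, the entry of C for the scale alternative (train, differenced with air) is fixed to one, and /c1_2 and /c1_3 supply the bus and car entries:

. mata:
: C = (1, 1.182805, 1.227705)      // train entry fixed to one; /c1_2, /c1_3
: I(3) + C'C                       // reproduces the estat covariance table above
: end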
Variance structures
The matrix Ω has J(J + 1)/2 distinct elements because it is symmetric. Selecting a base alternative,
normalizing its error variance to one, and constraining the correlations between its error and the other
errors reduces the number of estimable parameters by J . Moreover, selecting a scale alternative and
normalizing its error variance to one reduces the number by one, as well. Hence, there are at most
m = J(J − 1)/2 − 1 estimable parameters in Ω.
In practice, estimating all m parameters can be difficult, so one must often place more restrictions on
the parameters. The asmprobit command provides the correlation() option to specify restrictions
on the J(J − 3)/2 + 1 correlation parameters not already restricted as a result of choosing the base
alternatives, and it provides stddev() to specify restrictions on the J − 2 standard deviations not
already restricted as a result of choosing the base and scale alternatives.
When the structural option is used, asmprobit fits the model by assuming that all m
parameters can be estimated, which is equivalent to specifying correlation(unstructured) and
stddev(heteroskedastic). The unstructured correlation structure means that all J(J − 3)/2 + 1
of the remaining correlation parameters will be estimated, and the heteroskedastic specification means
that all J − 2 standard deviations will be estimated. With these default settings, the log likelihood is
maximized with respect to the Cholesky decomposition of Ω, and then the parameters are transformed
to the standard deviation and correlation form.
The correlation(exchangeable) option forces the J(J − 3)/2 + 1 correlation parameters
to be equal, and correlation(independent) forces all the correlations to be zero. Using the
stddev(homoskedastic) option forces all J standard deviations to be one. These options may help
in obtaining convergence for a model if the default options do not produce satisfactory results. In
fact, when fitting a complex model, it may be advantageous to first fit a simple one and then proceed
with removing the restrictions one at a time.
Advanced users may wish to specify alternative variance structures of their own choosing, and the
next few paragraphs explain how to do so.
correlation(pattern matname) allows you to give the name of a J × J matrix that identifies
a correlation structure. Sequential positive integers starting at 1 are used to identify each correlation
parameter: if there are three correlation parameters, they are identified by 1, 2, and 3. The integers
can be repeated to indicate that correlations with the same number should be constrained to be equal.
A zero or a missing value (.) indicates that the correlation is to be set to zero. asmprobit considers
only the elements of the matrix below the main diagonal.
Suppose that you have a model with four alternatives, numbered 1–4, and alternative 1 is the
base. The unstructured and exchangeable correlation structures identified in the 4 × 4 lower triangular
matrices are
            unstructured                    exchangeable

           1   2   3   4                   1   2   3   4
      1  ( ·               )          1  ( ·               )
      2  ( 0   ·           )          2  ( 0   ·           )
      3  ( 0   1   ·       )          3  ( 0   1   ·       )
      4  ( 0   2   3   ·   )          4  ( 0   1   1   ·   )
asmprobit labels these correlation structures unstructured and exchangeable, even though the correlations corresponding to the base alternative are set to zero. More formally: these terms are appropriate
when considering the (J − 1) × (J − 1) submatrix Σ(k) defined in the Introduction above.
You can also use the correlation(fixed matname) option to specify a matrix that specifies
fixed and free parameters. Here the free parameters (those that are to be estimated) are identified by
a missing value, and nonmissing values represent correlations that are to be taken as given. Below
is a correlation structure that would set the correlations of alternative 1 to be 0.5:
            1     2     3     4
      1  (  ·                    )
      2  ( 0.5    ·              )
      3  ( 0.5    ·     ·        )
      4  ( 0.5    ·     ·     ·  )
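In Stata, such a fixed matrix could be built with standard matrix commands before being passed to correlation() (a sketch in the style of example 3 below; corfix is an arbitrary matrix name):

. matrix corfix = J(4, 4, .)
. matrix corfix[2, 1] = 0.5
. matrix corfix[3, 1] = 0.5
. matrix corfix[4, 1] = 0.5
. asmprobit choice travelcost termtime, casevars(income) case(id)
>     alternatives(mode) correlation(fixed corfix)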
The order of the elements of the pattern or fixed matrices must be the same as the numeric
order of the alternative levels.
To specify the structure of the standard deviations—the diagonal elements of Ω —you can use the
stddev(pattern matname) option, where matname is a 1 × J matrix. Sequential positive integers
starting at 1 are used to identify each standard deviation parameter. The integers can be repeated to
indicate that standard deviations with the same number are to be constrained to be equal. A missing
value indicates that the corresponding standard deviation is to be set to one. In the four-alternative
example mentioned above, suppose that you wish to set the first and second standard deviations to
one and that you wish to constrain the third and fourth standard deviations to be equal; the following
pattern matrix will do that:
           1   2   3   4
      1  ( ·   ·   1   1 )
Using the stddev(fixed matname) option allows you to identify the fixed and free standard
deviations. Fixed standard deviations are entered as positive real numbers, and free parameters are
identified with missing values. For example, to constrain the first and second standard deviations to
equal one and to allow the third and fourth to be estimated, you would use this fixed matrix:
           1   2   3   4
      1  ( 1   1   ·   · )
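The corresponding Stata setup might look like this (again a sketch; stdfix is an arbitrary matrix name):

. matrix stdfix = (1, 1, ., .)
. asmprobit choice travelcost termtime, casevars(income) case(id)
>     alternatives(mode) stddev(fixed stdfix)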
When supplying either the pattern or the fixed matrices, you must ensure that the model is
properly scaled. At least two standard deviations must be constant for the model to be scaled. A
warning is issued if asmprobit detects that the model is not scaled.
The order of the elements of the pattern or fixed matrices must be the same as the numeric
order of the alternative levels.
Example 2
In example 1, we used the differenced covariance parameterization, the default. We now use
the structural option to view the J − 2 standard deviation estimates and the (J − 1)(J − 2)/2
correlation estimates. Here we will fix the standard deviations for the air and train alternatives to
1 and the correlations between air and the rest of the alternatives to 0.
. asmprobit choice travelcost termtime, casevars(income) case(id)
> alternatives(mode) structural
(output omitted )
Alternative-specific multinomial probit          Number of obs      =       840
Case variable: id                                Number of cases    =       210
Alternative variable: mode                       Alts per case: min =         4
                                                                avg =       4.0
                                                                max =         4
Integration sequence:      Hammersley
Integration points:               200            Wald chi2(5)       =     32.05
Log simulated-likelihood = -190.09418            Prob > chi2        =    0.0000

      choice        Coef.   Std. Err.      z   P>|z|     [95% Conf. Interval]

mode
  travelcost    -.0097703   .0027834   -3.51   0.000     -.0152257   -.0043149
    termtime    -.0377103   .0094092   -4.01   0.000      -.056152   -.0192687

air              (base alternative)

train
      income    -.0291975   .0089246   -3.27   0.001     -.0466895   -.0117055
       _cons     .5616448   .3946529    1.42   0.155     -.2118607     1.33515

bus
      income      -.01275   .0079266   -1.61   0.108     -.0282858    .0027858
       _cons    -.0571664   .4791996   -0.12   0.905     -.9963803    .8820476

car
      income    -.0049085   .0077486   -0.63   0.526     -.0200955    .0102785
       _cons    -1.833444   .8186343   -2.24   0.025     -3.437938     -.22895

   /lnsigma3    -.2447428   .4953363   -0.49   0.621     -1.215584    .7260985
   /lnsigma4    -.3309429   .6494493   -0.51   0.610      -1.60384    .9419543

  /atanhr3_2      1.01193   .3890994    2.60   0.009       .249309    1.774551
  /atanhr4_2     .5786576   .3940461    1.47   0.142     -.1936586    1.350974
  /atanhr4_3     .8885204   .5600561    1.59   0.113     -.2091693     1.98621

      sigma1            1  (base alternative)
      sigma2            1  (scale alternative)
      sigma3     .7829059   .3878017                      .2965368       2.067
      sigma4     .7182462   .4664645                      .2011227    2.564989

      rho3_2      .766559   .1604596                       .244269    .9441061
      rho4_2     .5216891   .2868027                     -.1912734     .874283
      rho4_3     .7106622    .277205                     -.2061713    .9630403

(mode=air is the alternative normalizing location)
(mode=train is the alternative normalizing scale)
When comparing this output to that of example 1, we see that we have achieved the same log
likelihood. That is, the structural parameterization using air as the base alternative and train as
the scale alternative applied no restrictions on the model. This will not always be the case. We leave
it up to you to try different base and scale alternatives, and you will see that not all the different
combinations will achieve the same log likelihood. This is not true for the differenced covariance
parameterization: it will always achieve the same log likelihood (and the maximum possible likelihood)
regardless of the base and scale alternatives. This is why it is the default parameterization.
As an exercise, we can compute the differenced covariance displayed in example 1 by using the following ado-code.
. estat covariance
                  air      train        bus        car
    air             1
  train             0          1
    bus             0   .6001436   .6129416
    car             0   .3747012    .399619   .5158776
. return list
matrices:
r(cov) : 4 x 4
. matrix cov = r(cov)
. matrix M = (1,-1,0,0 \ 1,0,-1,0 \ 1,0,0,-1)
. matrix cov1 = M*cov*M’
. matrix list cov1
symmetric cov1[3,3]
            r1          r2          r3
r1           2
r2   1.6001436   1.6129416
r3   1.3747012    1.399619   1.5158776
The slight difference in the regression coefficients between the example 1 and example 2 coefficient
tables reflects the accuracy of the [M-5] ghk( ) algorithm using 200 points from the Hammersley
sequence.
We now fit the model using the exchangeable correlation matrix and compare the models with a
likelihood-ratio test.
. asmprobit choice travelcost termtime, casevars(income) case(id)
> alternatives(mode) correlation(exchangeable)
(output omitted )
Alternative-specific multinomial probit          Number of obs      =       840
Case variable: id                                Number of cases    =       210
Alternative variable: mode                       Alts per case: min =         4
                                                                avg =       4.0
                                                                max =         4
Integration sequence:      Hammersley
Integration points:               200            Wald chi2(5)       =     53.60
Log simulated-likelihood = -190.4679             Prob > chi2        =    0.0000

      choice        Coef.   Std. Err.      z   P>|z|     [95% Conf. Interval]

mode
  travelcost    -.0084636   .0020452   -4.14   0.000      -.012472   -.0044551
    termtime    -.0345394   .0072812   -4.74   0.000     -.0488103   -.0202684

air              (base alternative)

train
      income    -.0290357   .0083226   -3.49   0.000     -.0453477   -.0127237
       _cons     .5517445   .3719913    1.48   0.138      -.177345    1.280834

bus
      income    -.0132562   .0074133   -1.79   0.074     -.0277859    .0012735
       _cons    -.0052517   .4337932   -0.01   0.990     -.8554708    .8449673

car
      income    -.0060878    .006638   -0.92   0.359     -.0190981    .0069224
       _cons    -1.565918   .6633007   -2.36   0.018     -2.865964     -.265873

  /lnsigmaP1    -.3557589   .1972809   -1.80   0.071     -.7424222    .0309045
  /lnsigmaP2    -1.308596   .8872957   -1.47   0.140     -3.047663    .4304719

   /atanhrP1     1.116589   .3765488    2.97   0.003      .3785667    1.854611

      sigma1            1  (base alternative)
      sigma2            1  (scale alternative)
      sigma3     .7006416   .1382232                      .4759596    1.031387
      sigma4     .2701992   .2397466                      .0474697    1.537983

      rho3_2     .8063791    .131699                      .3614621    .9521783
      rho4_2     .8063791    .131699                      .3614621    .9521783
      rho4_3     .8063791    .131699                      .3614621    .9521783

(mode=air is the alternative normalizing location)
(mode=train is the alternative normalizing scale)
. lrtest full .
Likelihood-ratio test                                 LR chi2(2)  =      0.75
(Assumption: . nested in full)                        Prob > chi2 =    0.6882
The likelihood-ratio test suggests that a common correlation is a plausible hypothesis, but this could
be an artifact of the small sample size. The labeling of the standard deviation and correlation estimates
has changed from /lnsigma and /atanhr, in the previous example, to /lnsigmaP and /atanhrP.
The “P” identifies the parameter’s index in the pattern matrices used by asmprobit. The pattern
matrices are saved in e(stdpattern) and e(corpattern).
Technical note
Another way to fit the model with the exchangeable correlation structure in example 2 is to use the constraint command to define the constraints on the rho parameters manually and then apply them.
. constraint 1 [atanhr3_2]_cons = [atanhr4_2]_cons
. constraint 2 [atanhr3_2]_cons = [atanhr4_3]_cons
. asmprobit choice travelcost termtime, casevars(income) case(id)
>     alternatives(mode) constraints(1 2) structural
With this method, however, we must keep track of what parameterization of the rhos is used in
estimation, and that depends on the options specified.
Example 3
In the last example, we used the correlation(exchangeable) option, reducing the number
of correlation parameters from three to one. We can explore a two–correlation parameter model
by specifying a pattern matrix in the correlation() option. Suppose that we wish to have the
correlation between train and bus be equal to the correlation between bus and car and to have the
standard deviations for the bus and car equations be equal. We will use air as the base category and
train as the scale category.
. matrix define corpat = J(4, 4, .)
. matrix corpat[3,2] = 1
. matrix corpat[4,3] = 1
. matrix corpat[4,2] = 2
. matrix define stdpat = J(1, 4, .)
. matrix stdpat[1,3] = 1
. matrix stdpat[1,4] = 1
. asmprobit choice travelcost termtime, casevars(income) case(id)
> alternatives(mode) correlation(pattern corpat) stddev(pattern stdpat)
Iteration 0:   log simulated-likelihood = -201.33896
Iteration 1:   log simulated-likelihood = -201.00457  (backed up)
Iteration 2:   log simulated-likelihood = -200.80208  (backed up)
Iteration 3:   log simulated-likelihood = -200.79758  (backed up)
Iteration 4:   log simulated-likelihood = -200.55655  (backed up)
Iteration 5:   log simulated-likelihood = -200.5421   (backed up)
Iteration 6:   log simulated-likelihood = -196.24925
(output omitted )
Iteration 20:  log simulated-likelihood = -190.12874
Iteration 21:  log simulated-likelihood = -190.12871
Iteration 22:  log simulated-likelihood = -190.12871

Alternative-specific multinomial probit          Number of obs      =       840
Case variable: id                                Number of cases    =       210
Alternative variable: mode                       Alts per case: min =         4
                                                                avg =       4.0
                                                                max =         4
Integration sequence:      Hammersley
Integration points:               200            Wald chi2(5)       =     41.67
Log simulated-likelihood = -190.12871            Prob > chi2        =    0.0000
      choice        Coef.   Std. Err.      z   P>|z|     [95% Conf. Interval]

mode
  travelcost    -.0100335   .0026203   -3.83   0.000     -.0151692   -.0048979
    termtime    -.0385731    .008608   -4.48   0.000     -.0554445   -.0217018

air              (base alternative)

train
      income     -.029271   .0089739   -3.26   0.001     -.0468595   -.0116824
       _cons       .56528   .4008037    1.41   0.158     -.2202809    1.350841

bus
      income    -.0124658   .0080043   -1.56   0.119     -.0281539    .0032223
       _cons    -.0741685   .4763422   -0.16   0.876     -1.007782     .859445

car
      income    -.0046905   .0079934   -0.59   0.557     -.0203573    .0109763
       _cons    -1.897931   .7912106   -2.40   0.016     -3.448675   -.3471867

  /lnsigmaP1     -.197697   .2751269   -0.72   0.472     -.7369359    .3415418

   /atanhrP1     .9704403   .3286981    2.95   0.003      .3262038    1.614677
   /atanhrP2     .5830923   .3690419    1.58   0.114     -.1402165    1.306401

      sigma1            1  (base alternative)
      sigma2            1  (scale alternative)
      sigma3     .8206185   .2257742                      .4785781    1.407115
      sigma4     .8206185   .2257742                      .4785781    1.407115

      rho3_2     .7488977   .1443485                      .3151056    .9238482
      rho4_2     .5249094   .2673598                     -.1393048     .863362
      rho4_3     .7488977   .1443485                      .3151056    .9238482

(mode=air is the alternative normalizing location)
(mode=train is the alternative normalizing scale)
In the call to asmprobit, we did not need to specify the basealternative() and scalealternative() options because they are implied by the specifications of the pattern matrices.
Technical note
If you experience convergence problems, try specifying nopivot, increasing intpoints(),
specifying antithetics, specifying technique(nr) with difficult, or specifying a switching
algorithm in the technique() option. As a last resort, you can use the nrtolerance() and
showtolerance options. Changing the base and scale alternative in the model specification can also
affect convergence if the structural option is used.
Because simulation methods are used to obtain multivariate normal probabilities, the estimates
obtained have a limited degree of precision. Moreover, the solutions are particularly sensitive to the
starting values used. Experimenting with different starting values may help in obtaining convergence,
and doing so is a good way to verify previous results.
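For instance, several of these remedies could be combined in a single call (a sketch using the travel-mode model from example 1; which settings actually help will vary by dataset):

. asmprobit choice travelcost termtime, casevars(income) case(id)
>     alternatives(mode) intpoints(500) antithetics nopivot
>     technique(nr) difficult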
If you wish to use the BHHH algorithm along with another maximization algorithm, you must
specify the initbhhh(#) option, where # is the number of BHHH iterations to use before switching
to the algorithm specified in technique(). The BHHH algorithm uses an outer-product-of-gradients
approximation for the Hessian, and asmprobit must perform the gradient calculations differently
than for the other algorithms.
Technical note
If there are no alternative-specific variables in your model, the variance–covariance matrix parameters are not identifiable. For such a model to converge, you would therefore need to use correlation(independent) and stddev(homoskedastic). A better alternative is to use mprobit,
which is geared specifically toward models with only case-specific variables. See [R] mprobit.
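For illustration, a model with only the case-specific variable income from the travel-mode examples would therefore have to be specified as follows (a sketch):

. asmprobit choice, casevars(income) case(id) alternatives(mode)
>     correlation(independent) stddev(homoskedastic)

mprobit fits the equivalent model from data with one observation per case.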
Saved results
asmprobit saves the following in e():
Scalars
  e(N)                  number of observations
  e(N_case)             number of cases
  e(k)                  number of parameters
  e(k_alt)              number of alternatives
  e(k_indvars)          number of alternative-specific variables
  e(k_casevars)         number of case-specific variables
  e(k_sigma)            number of variance estimates
  e(k_rho)              number of correlation estimates
  e(k_eq)               number of equations in e(b)
  e(k_eq_model)         number of equations in model Wald test
  e(k_autoCns)          number of base, empty, and omitted constraints
  e(df_m)               model degrees of freedom
  e(ll)                 log simulated-likelihood
  e(N_clust)            number of clusters
  e(const)              constant indicator
  e(i_base)             base alternative index
  e(i_scale)            scale alternative index
  e(mc_points)          number of Monte Carlo replications
  e(mc_burn)            starting sequence index
  e(mc_antithetics)     antithetics indicator
  e(chi2)               χ2
  e(p)                  significance
  e(fullcov)            unstructured covariance indicator
  e(structcov)          1 if structured covariance; 0 otherwise
  e(cholesky)           Cholesky-factored covariance indicator
  e(alt_min)            minimum number of alternatives
  e(alt_avg)            average number of alternatives
  e(alt_max)            maximum number of alternatives
  e(rank)               rank of e(V)
  e(ic)                 number of iterations
  e(rc)                 return code
  e(converged)          1 if converged, 0 otherwise
Macros
  e(cmd)                asmprobit
  e(cmdline)            command as typed
  e(depvar)             name of dependent variable
  e(indvars)            alternative-specific independent variable
  e(casevars)           case-specific variables
  e(case)               variable defining cases
  e(altvar)             variable defining alternatives
  e(alteqs)             alternative equation names
  e(alt#)               alternative labels
  e(wtype)              weight type
  e(wexp)               weight expression
  e(title)              title in estimation output
  e(clustvar)           name of cluster variable
  e(correlation)        correlation structure
  e(stddev)             variance structure
  e(cov_class)          class of the covariance structure
  e(chi2type)           Wald, type of model χ2 test
  e(vce)                vcetype specified in vce()
  e(vcetype)            title used to label Std. Err.
  e(opt)                type of optimization
  e(which)              max or min; whether optimizer is to perform
                          maximization or minimization
  e(ml_method)          type of ml method
  e(mc_method)          technique used to generate sequences
  e(mc_seed)            random-number generator seed
  e(user)               name of likelihood-evaluator program
  e(technique)          maximization technique
  e(singularHmethod)    m-marquardt or hybrid; method used when Hessian
                          is singular
  e(crittype)           optimization criterion
  e(datasignature)      the checksum
  e(datasignaturevars)  variables used in calculation of checksum
  e(properties)         b V
  e(estat_cmd)          program used to implement estat
  e(mfx_dlg)            program used to implement estat mfx dialog
  e(predict)            program used to implement predict
  e(marginsnotok)       predictions disallowed by margins
Matrices
  e(b)                  coefficient vector
  e(Cns)                constraints matrix
  e(stats)              alternative statistics
  e(stdpattern)         variance pattern
  e(stdfixed)           fixed and free standard deviations
  e(altvals)            alternative values
  e(altfreq)            alternative frequencies
  e(alt_casevars)       indicators for estimated case-specific coefficients,
                          e(k_alt) × e(k_casevars)
  e(corpattern)         correlation structure
  e(corfixed)           fixed and free correlations
  e(ilog)               iteration log (up to 20 iterations)
  e(gradient)           gradient vector
  e(V)                  variance–covariance matrix of the estimators
  e(V_modelbased)       model-based variance
Functions
  e(sample)             marks estimation sample
Methods and formulas
asmprobit is implemented as an ado-file.
The simulated maximum likelihood estimates for the MNP are obtained using ml; see [R] ml.
The likelihood evaluator implements the GHK algorithm to approximate the multivariate distribution
function (Geweke 1989; Hajivassiliou and McFadden 1998; Keane and Wolpin 1994). The technique
is also described in detail by Genz (1992), but Genz describes a more general algorithm where both
lower and upper bounds of integration are finite. We briefly describe the GHK simulator and refer you
to Bolduc (1999) for the score computations.
As discussed earlier, the latent variables for a $J$-alternative model are $\eta_{ij} = \mathbf{x}_{ij}\boldsymbol{\beta} + \mathbf{z}_i\boldsymbol{\alpha}_j + \xi_{ij}$, for $j = 1, \dots, J$, $i = 1, \dots, n$, and $\boldsymbol{\xi}_i' = (\xi_{i,1}, \dots, \xi_{i,J}) \sim \mathrm{MVN}(\mathbf{0}, \boldsymbol{\Omega})$. The experimenter observes alternative $k$ for the $i$th observation if $k = \arg\max(\eta_{ij},\ j = 1, \dots, J)$. Let

$$
v_{ij'} = \eta_{ij} - \eta_{ik}
        = (\mathbf{x}_{ij} - \mathbf{x}_{ik})\boldsymbol{\beta} + \mathbf{z}_i(\boldsymbol{\alpha}_j - \boldsymbol{\alpha}_k) + \xi_{ij} - \xi_{ik}
        = \boldsymbol{\delta}_{ij'}\boldsymbol{\beta} + \mathbf{z}_i\boldsymbol{\gamma}_{j'} + \epsilon_{ij'}
$$

where $j' = j$ if $j < k$ and $j' = j-1$ if $j > k$, so that $j' = 1, \dots, J-1$. Further, $\boldsymbol{\epsilon}_i = (\epsilon_{i1}, \dots, \epsilon_{i,J-1}) \sim \mathrm{MVN}(\mathbf{0}, \boldsymbol{\Sigma}_{(k)})$. $\boldsymbol{\Sigma}$ is indexed by $k$ because it depends on the choice made. We denote the deterministic part of the model as $\lambda_{ij'} = \boldsymbol{\delta}_{ij'}\boldsymbol{\beta} + \mathbf{z}_i\boldsymbol{\gamma}_{j'}$, and the probability of this event is

$$
\Pr(y_i = k) = \Pr(v_{i1} \le 0, \dots, v_{i,J-1} \le 0)
             = \Pr(\epsilon_{i1} \le -\lambda_{i1}, \dots, \epsilon_{i,J-1} \le -\lambda_{i,J-1})
             = (2\pi)^{-(J-1)/2} \left|\boldsymbol{\Sigma}_{(k)}\right|^{-1/2}
               \int_{-\infty}^{-\lambda_{i1}} \cdots \int_{-\infty}^{-\lambda_{i,J-1}}
               \exp\!\left(-\tfrac{1}{2}\mathbf{z}'\boldsymbol{\Sigma}_{(k)}^{-1}\mathbf{z}\right) d\mathbf{z}
\tag{3}
$$
Simulated likelihood
For clarity in the discussion that follows, we drop the index denoting case so that for an arbitrary observation $\boldsymbol{\upsilon}' = (v_1, \dots, v_{J-1})$, $\boldsymbol{\lambda}' = (\lambda_1, \dots, \lambda_{J-1})$, and $\boldsymbol{\epsilon}' = (\epsilon_1, \dots, \epsilon_{J-1})$.

The Cholesky-factored variance–covariance, $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}'$, is lower triangular,

$$
\mathbf{L} =
\begin{pmatrix}
l_{11}    & 0         & \cdots & 0 \\
l_{21}    & l_{22}    & \cdots & 0 \\
\vdots    & \vdots    & \ddots & \vdots \\
l_{J-1,1} & l_{J-1,2} & \cdots & l_{J-1,J-1}
\end{pmatrix}
$$
and the correlated latent-variable errors can be expressed as linear functions of uncorrelated normal variates, $\boldsymbol{\epsilon} = \mathbf{L}\boldsymbol{\zeta}$, where $\boldsymbol{\zeta}' = (\zeta_1, \dots, \zeta_{J-1})$ and $\zeta_j \sim \text{iid } N(0,1)$. We now have $\boldsymbol{\upsilon} = \boldsymbol{\lambda} + \mathbf{L}\boldsymbol{\zeta}$, and by defining

$$
z_j =
\begin{cases}
-\dfrac{\lambda_1}{l_{11}} & \text{for } j = 1 \\[2ex]
-\dfrac{\lambda_j + \sum_{i=1}^{j-1} l_{ji}\zeta_i}{l_{jj}} & \text{for } j = 2, \dots, J-1
\end{cases}
\tag{4}
$$

we can express the probability statement (3) as the product of conditional probabilities

$$
\Pr(y_i = k) = \Pr(\zeta_1 \le z_1)\,\Pr(\zeta_2 \le z_2 \mid \zeta_1 \le z_1) \cdots
              \Pr(\zeta_{J-1} \le z_{J-1} \mid \zeta_1 \le z_1, \dots, \zeta_{J-2} \le z_{J-2})
$$

because

$$
\Pr(v_1 \le 0) = \Pr(\lambda_1 + l_{11}\zeta_1 \le 0) = \Pr\!\left(\zeta_1 \le -\frac{\lambda_1}{l_{11}}\right)
$$
$$
\Pr(v_2 \le 0) = \Pr(\lambda_2 + l_{21}\zeta_1 + l_{22}\zeta_2 \le 0)
              = \Pr\!\left(\zeta_2 \le -\frac{\lambda_2 + l_{21}\zeta_1}{l_{22}} \;\Big|\; \zeta_1 \le -\frac{\lambda_1}{l_{11}}\right)
$$

and so on.
The Monte Carlo algorithm then must make draws from the truncated standard normal distribution. It does so by generating $J-1$ uniform variates, $\delta_j$, $j = 1, \dots, J-1$, and computing

$$
\widetilde{\zeta}_j =
\begin{cases}
\Phi^{-1}\!\left\{\delta_1 \Phi\!\left(-\dfrac{\lambda_1}{l_{11}}\right)\right\} & \text{for } j = 1 \\[2ex]
\Phi^{-1}\!\left\{\delta_j \Phi\!\left(\dfrac{-\lambda_j - \sum_{i=1}^{j-1} l_{ji}\widetilde{\zeta}_i}{l_{jj}}\right)\right\} & \text{for } j = 2, \dots, J-1
\end{cases}
$$
Define $\widetilde{z}_j$ by replacing $\widetilde{\zeta}_i$ for $\zeta_i$ in (4) so that the simulated probability for the $l$th draw is

$$
p_l = \prod_{j=1}^{J-1} \Phi(\widetilde{z}_j)
$$
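The recursion above translates directly into code. The following Mata function is an illustrative sketch of a single GHK draw, not asmprobit's internal implementation: lambda holds the (J-1)-vector of λ's, L the Cholesky factor of Σ(k), and delta a (J-1)-vector of uniform variates. Averaging its result over repeated draws yields the simulated likelihood described below.

mata:
// Illustrative sketch only; not asmprobit's internal code.
// Returns the simulated probability p_l for one draw, per (3) and (4).
real scalar ghk_draw(real colvector lambda, real matrix L,
                     real colvector delta)
{
    real scalar    i, j, s, zj, p
    real colvector zeta

    zeta = J(rows(lambda), 1, 0)
    p    = 1
    for (j = 1; j <= rows(lambda); j++) {
        s = 0                                     // sum of l_ji * zeta_i
        for (i = 1; i < j; i++) s = s + L[j, i]*zeta[i]
        zj = (-lambda[j] - s)/L[j, j]             // truncation point from (4)
        p  = p*normal(zj)                         // accumulate prod of Phi()
        zeta[j] = invnormal(delta[j]*normal(zj))  // truncated-normal draw
    }
    return(p)
}
end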
To increase accuracy, the bounds of integration, λj , are ordered so that the largest integration intervals
are on the inside. The rows and columns of the variance–covariance matrix are pivoted accordingly
(Genz 1992).
For a more detailed description of the GHK algorithm in Stata, see Gates (2006).
Repeated draws are made, say, $N$, and the simulated likelihood for the $i$th case, denoted $\widehat{L}_i$, is computed as

$$
\widehat{L}_i = \frac{1}{N} \sum_{l=1}^{N} p_l
$$

The overall simulated log likelihood is $\sum_i \log \widehat{L}_i$.
If the true likelihood is $L_i$, the error bound on the approximation can be expressed as

$$
\left|\widehat{L}_i - L_i\right| \le V(L_i)\, D_N\{(\delta_i)\}
$$

where $V(L_i)$ is the total variation of $L_i$ and $D_N$ is the discrepancy, or nonuniformity, of the set of abscissas. For the uniform pseudorandom sequence, $\delta_i$, the discrepancy is of order $O\{(\log\log N/N)^{1/2}\}$. The order of discrepancy can be improved by using quasirandom sequences.
Quasi–Monte Carlo integration is carried out by asmprobit by replacing the uniform deviates with either the Halton or the Hammersley sequences. These sequences spread the points more evenly than the uniform random sequence and have a smaller order of discrepancy, $O\{(\log N)^{J-1}/N\}$ and $O\{(\log N)^{J-2}/N\}$, respectively. The Halton sequence of dimension $J-1$ is generated from the first $J-1$ primes, $p_k$, so that on draw $l$ we have $\mathbf{h}_l = \{r_{p_1}(l), r_{p_2}(l), \dots, r_{p_{J-1}}(l)\}$, where

$$
r_{p_k}(l) = \sum_{j=0}^{q} b_{jk}(l)\, p_k^{-j-1} \in (0,1)
$$

is the radical inverse function of $l$ with base $p_k$ so that $\sum_{j=0}^{q} b_{jk}(l)\, p_k^{j} = l$, where $p_k^q \le l < p_k^{q+1}$ (Fang and Wang 1994).
This function is demonstrated with base $p_3 = 5$ and $l = 33$, which generates $r_5(33)$. Here $q = 2$, $b_{0,3}(33) = 3$, $b_{1,3}(33) = 1$, and $b_{2,3}(33) = 1$, so that $r_5(33) = 3/5 + 1/25 + 1/125$.
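The computation is easy to verify in code. The following Mata function is a sketch for exposition only, not Stata's implementation; in practice, Mata users would rely on the built-in halton() function or the routines described by Drukker and Gates (2006).

mata:
// Sketch of the radical inverse r_p(l); for illustration only
real scalar radical_inverse(real scalar l, real scalar p)
{
    real scalar r, f
    r = 0
    f = 1/p
    while (l > 0) {
        r = r + mod(l, p)*f    // next base-p digit b_j, scaled by p^(-j-1)
        l = floor(l/p)
        f = f/p
    }
    return(r)
}
radical_inverse(33, 5)    // displays .648 = 3/5 + 1/25 + 1/125
end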
The Hammersley sequence uses an evenly spaced set of points with the first $J-2$ components of the Halton sequence

$$
\mathbf{h}_l = \left\{\frac{2l-1}{2N},\, r_{p_1}(l),\, r_{p_2}(l),\, \dots,\, r_{p_{J-2}}(l)\right\}
$$

for $l = 1, \dots, N$.
For a more detailed description of the Halton and Hammersley sequences, see Drukker and
Gates (2006).
Computations for the derivatives of the simulated likelihood are taken from Bolduc (1999). Bolduc
gives the analytical first-order derivatives for the log of the simulated likelihood with respect to
the regression coefficients and the parameters of the Cholesky-factored variance–covariance matrix.
asmprobit uses these analytical first-order derivatives and numerical second-order derivatives.
This command supports the clustered version of the Huber/White/sandwich estimator of the variance using vce(robust) and vce(cluster clustvar); see [P] robust, particularly the sections Maximum likelihood estimators and Methods and formulas. Specifying vce(robust) is equivalent to specifying vce(cluster casevar), where casevar is the variable that identifies the cases.
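For example, if the cases were themselves sampled in clusters, say, households identified by a hypothetical variable household (not part of the travel dataset), you might type

. asmprobit choice travelcost termtime, case(id) alternatives(mode)
>     casevars(income) vce(cluster household)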
References
Bolduc, D. 1999. A practical technique to estimate multinomial probit models in transportation. Transportation Research
Part B 33: 63–79.
Bunch, D. S. 1991. Estimability of the multinomial probit model. Transportation Research Part B 25: 1–12.
Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods and Applications. New York: Cambridge
University Press.
Cappellari, L., and S. P. Jenkins. 2003. Multivariate probit regression using simulated maximum likelihood. Stata
Journal 3: 278–294.
Drukker, D. M., and R. Gates. 2006. Generating Halton sequences using Mata. Stata Journal 6: 214–228.
Fang, K.-T., and Y. Wang. 1994. Number-theoretic Methods in Statistics. London: Chapman & Hall.
Gates, R. 2006. A Mata Geweke–Hajivassiliou–Keane multivariate normal simulator. Stata Journal 6: 190–213.
Genz, A. 1992. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical
Statistics 1: 141–149.
Geweke, J. 1989. Bayesian inference in econometric models using Monte Carlo integration. Econometrica 57: 1317–1339.
Geweke, J., and M. P. Keane. 2001. Computationally intensive methods for integration in econometrics. In Vol. 5 of
Handbook of Econometrics, ed. J. Heckman and E. Leamer, 3463–3568. Amsterdam: North–Holland.
Greene, W. H. 2008. Econometric Analysis. 6th ed. Upper Saddle River, NJ: Prentice–Hall.
Haan, P., and A. Uhlendorff. 2006. Estimation of multinomial logit models with unobserved heterogeneity using
maximum simulated likelihood. Stata Journal 6: 229–245.
Hajivassiliou, V. A., and D. L. McFadden. 1998. The method of simulated scores for the estimation of LDV models.
Econometrica 66: 863–896.
Hole, A. R. 2007. Fitting mixed logit models by using maximum simulated likelihood. Stata Journal 7: 388–401.
Keane, M. P., and K. I. Wolpin. 1994. The solution and estimation of discrete choice dynamic programming models
by simulation and interpolation: Monte Carlo evidence. Review of Economics and Statistics 76: 648–672.
Train, K. E. 2003. Discrete Choice Methods with Simulation. Cambridge: Cambridge University Press.
Also see
[R] asmprobit postestimation — Postestimation tools for asmprobit
[R] asclogit — Alternative-specific conditional logit (McFadden’s choice) model
[R] asroprobit — Alternative-specific rank-ordered probit regression
[R] mprobit — Multinomial probit regression
[R] mlogit — Multinomial (polytomous) logistic regression
[U] 20 Estimation and postestimation commands
Title
asmprobit postestimation — Postestimation tools for asmprobit
Description
The following postestimation commands are of special interest after asmprobit:
command                description

estat alternatives     alternative summary statistics
estat covariance       variance–covariance matrix of the alternatives
estat correlation      correlation matrix of the alternatives
estat facweights       covariance factor weights matrix
estat mfx              marginal effects
For information about these commands, see below.
The following standard postestimation commands are also available:
command      description

estat        AIC, BIC, VCE, and estimation sample summary
estimates    cataloging estimation results
lincom       point estimates, standard errors, testing, and inference for
               linear combinations of coefficients
lrtest       likelihood-ratio test
nlcom        point estimates, standard errors, testing, and inference for
               nonlinear combinations of coefficients
predict      predicted probabilities, estimated linear predictor and its
               standard error
predictnl    point estimates, standard errors, testing, and inference for
               generalized predictions
test         Wald tests of simple and composite linear hypotheses
testnl       Wald tests of nonlinear hypotheses
See the corresponding entries in the Base Reference Manual for details.
Special-interest postestimation commands
estat alternatives displays summary statistics about the alternatives in the estimation sample
and provides a mapping between the index numbers that label the covariance parameters of the model
and their associated values and labels for the alternative variable.
estat covariance computes the estimated variance–covariance matrix for the alternatives. The
estimates are displayed, and the variance–covariance matrix is stored in r(cov).
estat correlation computes the estimated correlation matrix for the alternatives. The estimates
are displayed, and the correlation matrix is stored in r(cor).
estat facweights displays the covariance factor weights matrix and stores it in r(C).
estat mfx computes the simulated probability marginal effects.
Syntax for predict
predict [type] newvar [if] [in] [, statistic altwise]

predict [type] {stub* | newvarlist} [if] [in], scores

statistic     description

Main
  pr          probability alternative is chosen; the default
  xb          linear prediction
  stdp        standard error of the linear prediction

These statistics are available both in and out of sample; type
predict ... if e(sample) ... if wanted only for the estimation sample.
Menu

Statistics > Postestimation > Predictions, residuals, etc.
Options for predict
Main
pr, the default, calculates the probability that alternative j is chosen in case i.
xb calculates the linear prediction xij β + zi αj for alternative j and case i.
stdp calculates the standard error of the linear predictor.
altwise specifies that alternative-wise deletion be used when marking out observations due to missing
values in your variables. The default is to use casewise deletion. The xb and stdp options always
use alternative-wise deletion.
scores calculates the scores for each coefficient in e(b). This option requires a new variable list of
length equal to the number of columns in e(b). Otherwise, use the stub* option to have predict
generate enumerated variables with prefix stub.
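For example, assuming a previously fit asmprobit model, typing

. predict sc*, scores

creates one enumerated score variable (sc1, sc2, ...) for each column of e(b).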
Syntax for estat alternatives
estat alternatives
Menu

Statistics > Postestimation > Reports and statistics
Syntax for estat covariance
estat covariance [, format(string) border(string) left(#)]
Menu

Statistics > Postestimation > Reports and statistics
Options for estat covariance
format(string) sets the matrix display format. The default is format(%9.0g).
border(string) sets the matrix display border style. The default is border(all). See [P] matlist.
left(#) sets the matrix display left indent. The default is left(2). See [P] matlist.
Syntax for estat correlation
estat correlation [, format(string) border(string) left(#)]
Menu

Statistics > Postestimation > Reports and statistics
Options for estat correlation
format(string) sets the matrix display format. The default is format(%9.4f).
border(string) sets the matrix display border style. The default is border(all). See [P] matlist.
left(#) sets the matrix display left indent. The default is left(2). See [P] matlist.
Syntax for estat facweights
estat facweights [, format(string) border(string) left(#)]
Menu

Statistics > Postestimation > Reports and statistics
Options for estat facweights
format(string) sets the matrix display format. The default is format(%9.0f).
border(string) sets the matrix display border style. The default is border(all). See [P] matlist.
left(#) sets the matrix display left indent. The default is left(2). See [P] matlist.
Syntax for estat mfx
estat mfx [if] [in] [, options]
options                               description

Main
  varlist(varlist)                    display marginal effects for varlist
  at(mean [atlist] | median [atlist])
                                      calculate marginal effects at these
                                        values

Options
  level(#)                            set confidence interval level; default
                                        is level(95)
  nodiscrete                          treat indicator variables as continuous
  noesample                           do not restrict calculation of means
                                        and medians to the estimation sample
  nowght                              ignore weights when calculating means
                                        and medians
Menu

Statistics > Postestimation > Reports and statistics
Options for estat mfx
Main
varlist(varlist) specifies the variables for which to display marginal effects. The default is all
variables.
at(mean [atlist] | median [atlist]) specifies the values at which the marginal effects are to be calculated. atlist is

    [alternative:]variable = # [[alternative:]variable = #] [...]
The default is to calculate the marginal effects at the means of the independent variables at the
estimation sample, at(mean).
After specifying the summary statistic, you can specify a series of specific values for variables.
You can specify values for alternative-specific variables by alternative, or you can specify one
value for all alternatives. You can specify only one value for case-specific variables. For example,
in the travel dataset, income is a case-specific variable, whereas termtime and travelcost
are alternative-specific variables. The following would be a legal syntax for estat mfx:
. estat mfx, at(mean air:termtime=50 travelcost=100 income=60)
When nodiscrete is not specified, at(mean atlist ) or at(median atlist ) has no effect on
computing marginal effects for indicator variables, which are calculated as the discrete change in
the simulated probability as the indicator variable changes from 0 to 1.
The mean and median computations respect any if and in qualifiers, so you can restrict the data
over which the means or medians are computed. You can even restrict the values to a specific
case; e.g.,
. estat mfx if case==21
Options
level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is
level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals.
nodiscrete specifies that indicator variables be treated as continuous variables. An indicator variable
is one that takes on the value 0 or 1 in the estimation sample. By default, the discrete change in
the simulated probability is computed as the indicator variable changes from 0 to 1.
noesample specifies that the whole dataset be considered instead of only the observations marked in the e(sample) defined by the asmprobit command.
nowght specifies that weights be ignored when calculating the means or medians.
Remarks
Remarks are presented under the following headings:
Predicted probabilities
Obtaining estimation statistics
Obtaining marginal effects
Predicted probabilities
After fitting an alternative-specific multinomial probit model, you can use predict to obtain the
simulated probabilities that an individual will choose each of the alternatives. When evaluating the
multivariate normal probabilities via Monte Carlo simulation, predict uses the same method to
generate the random sequence of numbers as the previous call to asmprobit. For example, if you
specified intmethod(Halton) when fitting the model, predict also uses the Halton sequence.
Example 1
In example 1 of [R] asmprobit, we fit a model of individuals’ travel-mode choices. We can obtain
the simulated probabilities that an individual chooses each alternative by using predict:
. use http://www.stata-press.com/data/r11/travel
. asmprobit choice travelcost termtime, casevars(income) case(id)
> alternatives(mode)
(output omitted )
. predict prob
(option pr assumed; Pr(mode))
. list id mode prob choice in 1/12, sepby(id)
        id    mode       prob   choice

  1.     1     air   .1494373        0
  2.     1   train   .3292305        0
  3.     1     bus   .1319851        0
  4.     1     car   .3898138        1

  5.     2     air   .2566063        0
  6.     2   train   .2761403        0
  7.     2     bus   .0116148        0
  8.     2     car   .4556371        1

  9.     3     air   .2098737        0
 10.     3   train   .1082147        0
 11.     3     bus   .1671145        0
 12.     3     car   .5147865        1
Obtaining estimation statistics
Once you have fit a multinomial probit model, you can obtain the estimated variance or correlation
matrices for the model alternatives by using the estat command.
Example 2
To display the correlations of the errors in the latent-variable equations, we type
. estat correlation
                train      bus      car

     train     1.0000
       bus     0.8909   1.0000
       car     0.7896   0.8952   1.0000

Note: correlations are for alternatives differenced with air
The covariance matrix can be displayed by typing
. estat covariance
                  train        bus        car

     train            2
       bus     1.600309   1.613382
       car     1.374712    1.39983   1.515656

Note: covariances are for alternatives differenced with air
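Because estat covariance stores the matrix in r(cov), it can be copied for further computation; for example (the name V is arbitrary),

. matrix V = r(cov)
. matrix list V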
Obtaining marginal effects
The marginal effects are computed as the derivative of the simulated probability for an alternative
with respect to an independent variable. A table of marginal effects is displayed for each alternative,
with the table containing the marginal effect for each case-specific variable and the alternative for
each alternative-specific variable.
By default, the marginal effects are computed at the means of each continuous independent variable
over the estimation sample. For indicator variables, the difference in the simulated probability evaluated
at 0 and 1 is computed by default. Indicator variables will be treated as continuous variables if the
nodiscrete option is used.
Example 3
Continuing with our model from example 1, we obtain the marginal effects for alternatives air,
train, bus, and car evaluated at the mean values of each independent variable. Recall that the
travelcost and termtime variables are alternative specific, taking on different values for each
alternative, so they have a separate marginal effect for each alternative.
. estat mfx
Pr(choice = air) = .29437379

    variable       dp/dx   Std. Err.       z   P>|z|  [    95% C.I.    ]        X

travelcost
         air    -.002689     .000677   -3.97   0.000   -.004015 -.001362   102.65
       train     .000901     .000436    2.07   0.039    .000046  .001755    130.2
         bus     .000376     .000271    1.39   0.166   -.000156  .000907   115.26
         car     .001412      .00051    2.77   0.006    .000412  .002412   95.414

termtime
         air    -.010375      .00271   -3.83   0.000   -.015687 -.005063    61.01
       train     .003475     .001638    2.12   0.034    .000265  .006686    35.69
         bus      .00145     .001008    1.44   0.150   -.000524  .003425   41.657
         car     .005449     .002164    2.52   0.012    .001207   .00969        0

casevars
      income      .00389     .001847    2.11   0.035     .00027   .00751   34.548

Pr(choice = train) = .29535865

    variable       dp/dx   Std. Err.       z   P>|z|  [    95% C.I.    ]        X

travelcost
         air     .000899     .000436    2.06   0.039    .000045  .001753   102.65
       train    -.004081     .001466   -2.78   0.005   -.006954 -.001209    130.2
         bus     .001278      .00063    2.03   0.042    .000043  .002513   115.26
         car     .001904     .000887    2.15   0.032    .000166  .003642   95.414

termtime
         air     .003469     .001638    2.12   0.034    .000259   .00668    61.01
       train    -.015749      .00247   -6.38   0.000   -.020589 -.010908    35.69
         bus     .004932     .001593    3.10   0.002     .00181  .008053   41.657
         car     .007348     .002229    3.30   0.001     .00298  .011716        0

casevars
      income    -.009568     .002223   -4.30   0.000   -.013925 -.005212   34.548

Pr(choice = bus) = .08879634

    variable       dp/dx   Std. Err.       z   P>|z|  [    95% C.I.    ]        X

travelcost
         air      .00038     .000274    1.39   0.165   -.000157  .000916   102.65
       train     .001279      .00063    2.03   0.042    .000044  .002514    130.2
         bus    -.003182     .001174   -2.71   0.007   -.005484  -.00088   115.26
         car     .001524     .000675    2.26   0.024      .0002  .002848   95.414

termtime
         air     .001465     .001016    1.44   0.149   -.000527  .003457    61.01
       train     .004935     .001591    3.10   0.002    .001817  .008053    35.69
         bus      -.01228    .002803   -4.38   0.000   -.017774 -.006786   41.657
         car      .00588     .002255    2.61   0.009    .001461  .010299        0

casevars
      income     .000434     .001461    0.30   0.766   -.002429  .003297   34.548

Pr(choice = car) = .32161847

    variable       dp/dx   Std. Err.       z   P>|z|  [    95% C.I.    ]        X

travelcost
         air      .00141      .00051    2.77   0.006    .000411  .002409   102.65
       train     .001904     .000886    2.15   0.032    .000166  .003641    130.2
         bus     .001523     .000676    2.25   0.024    .000199  .002848   115.26
         car    -.004837     .001539   -3.14   0.002   -.007854  -.00182   95.414

termtime
         air     .005441     .002162    2.52   0.012    .001204  .009678    61.01
       train     .007347     .002228    3.30   0.001     .00298  .011713    35.69
         bus     .005879     .002256    2.61   0.009    .001456  .010301   41.657
         car    -.018666     .003939   -4.74   0.000   -.026387 -.010945        0

casevars
      income     .005246     .002165    2.42   0.015    .001002   .00949   34.548
First, we note that there is a separate marginal effects table for each alternative and that each table begins by reporting the overall probability of choosing that alternative, e.g., 0.2944 for air travel. We see in the first table that a unit increase in terminal time for air travel from 61.01 minutes will decrease the probability of choosing air travel (when the probability is evaluated at the mean of all variables) by approximately 0.01, with a 95% confidence interval of about -0.016 to -0.005. Travel cost has a smaller, though still negative, effect on the probability of choosing air travel (at the average cost of 102.65). Conversely, an increase in terminal time or travel cost for train, bus, or car from these mean values will increase the chance that air travel is chosen. Also, with an increase in income from 34.5, it would appear that an individual would be more likely to choose air or automobile travel over bus or train. (While the marginal effect for bus travel is positive, it is not significant.)
Example 4
Plotting the simulated probability marginal effect evaluated over a range of values for an independent
variable may be more revealing than a table of values. Below are the commands for generating the
simulated probability marginal effect of air travel for increasing air travel terminal time. We fix all
other independent variables at their medians.
. qui gen meff = .
. qui gen tt = .
. qui gen lb = .
. qui gen ub = .
. forvalues i=0/19 {
  2.     local termtime = 5+5*`i'
  3.     qui replace tt = `termtime' if _n == `i'+1
  4.     qui estat mfx, at(median air:termtime=`termtime') var(termtime)
  5.     mat air = r(air)
  6.     qui replace meff = air[1,1] if _n == `i'+1
  7.     qui replace lb = air[1,5] if _n == `i'+1
  8.     qui replace ub = air[1,6] if _n == `i'+1
  9.     qui replace prob = r(pr_air) if _n == `i'+1
 10. }
. label variable tt "terminal time"
. twoway (rarea lb ub tt, pstyle(ci)) (line meff tt, lpattern(solid)),
>     name(meff) legend(off) title(" marginal effect of air travel"
>     "terminal time and" "95% confidence interval", position(3))
. twoway line prob tt, name(prob) title(" probability of choosing"
>     "air travel", position(3)) graphregion(margin(r+9)) ytitle("")
>     xtitle("")
. graph combine prob meff, cols(1) graphregion(margin(l+5 r+5))

  (figure omitted: the combined graph plots the probability of choosing air
  travel and the marginal effect of air travel terminal time, with its 95%
  confidence interval, against terminal time from 0 to 100)
From the graphs, we see that the simulated probability of choosing air travel decreases in a sigmoid fashion. The marginal effects display the rate of change in the simulated probability as a function of the air travel terminal time. The rate of change in the probability of choosing air travel decreases until the air travel terminal time reaches about 45; thereafter, it increases.
Saved results
estat mfx saves the following in r():
Scalars
  r(pr_alt)     scalars containing the computed probability of each
                alternative evaluated at the value that is labeled X in the
                table output.  Here alt are the labels in the macro
                e(alteqs).

Matrices
  r(alt)        matrices containing the computed marginal effects and
                associated statistics.  There is one matrix for each
                alternative, where alt are the labels in the macro
                e(alteqs).  Column 1 of each matrix contains the marginal
                effects; column 2, their standard errors; columns 3 and 4,
                their z statistics and the p-values for the z statistics;
                and columns 5 and 6, the confidence intervals.  Column 7
                contains the values of the independent variables used to
                compute the probabilities r(pr_alt).
Methods and formulas
All postestimation commands listed above are implemented as ado-files.
Marginal effects
The marginal effects are computed as the derivative of the simulated probability with respect to each
independent variable. A set of marginal effects is computed for each alternative; thus, for J alternatives,
there will be J tables. Moreover, the alternative-specific variables will have J entries, one for each
alternative in each table. The details of computing the effects are different for alternative-specific
variables and case-specific variables, as well as for continuous and indicator variables.
We use the latent-variable notation of asmprobit (see [R] asmprobit) for a $J$-alternative model and, for notational convenience, we will drop any subscripts involving observations. We then have the following linear functions $\eta_j = \mathbf{x}_j\boldsymbol{\beta} + \mathbf{z}\boldsymbol{\alpha}_j$, for $j = 1, \dots, J$. Let $k$ index the alternative of interest, and then

$$
v_{j'} = \eta_j - \eta_k = (\mathbf{x}_j - \mathbf{x}_k)\boldsymbol{\beta} + \mathbf{z}(\boldsymbol{\alpha}_j - \boldsymbol{\alpha}_k) + \epsilon_{j'}
$$

where $j' = j$ if $j < k$ and $j' = j-1$ if $j > k$, so that $j' = 1, \dots, J-1$ and $\epsilon_{j'} \sim \mathrm{MVN}(\mathbf{0}, \boldsymbol{\Sigma})$.

Denote $p_k = \Pr(v_1 \le 0, \dots, v_{J-1} \le 0)$ as the simulated probability of choosing alternative $k$ given profile $\mathbf{x}_k$ and $\mathbf{z}$. The marginal effects are then $\partial p_k/\partial \mathbf{x}_k$, $\partial p_k/\partial \mathbf{x}_j$, and $\partial p_k/\partial \mathbf{z}$, where $k = 1, \dots, J$, $j \ne k$. asmprobit analytically computes the first-order derivatives of the simulated probability with respect to the $v$'s, and the marginal effects for the $\mathbf{x}$'s and $\mathbf{z}$ are obtained via the chain rule. The standard errors for the marginal effects are computed using the delta method.
Also see
[R] asmprobit — Alternative-specific multinomial probit regression
[U] 20 Estimation and postestimation commands
Title
asroprobit — Alternative-specific rank-ordered probit regression
Syntax
asroprobit depvar [indepvars] [if] [in] [weight], case(varname)
      alternatives(varname) [options]
options                           description

Model
* case(varname)                   use varname to identify cases
* alternatives(varname)           use varname to identify the alternatives
                                    available for each case
  casevars(varlist)               case-specific variables
  constraints(constraints)        apply specified linear constraints
  collinear                       keep collinear variables

Model 2
  correlation(correlation)        correlation structure of the
                                    latent-variable errors
  stddev(stddev)                  variance structure of the latent-variable
                                    errors
  structural                      use the structural covariance
                                    parameterization; default is the
                                    differenced covariance parameterization
  factor(#)                       use the factor covariance structure with
                                    dimension #
  noconstant                      suppress the alternative-specific constant
                                    terms
  basealternative(#|lbl|str)      alternative used for normalizing location
  scalealternative(#|lbl|str)     alternative used for normalizing scale
  altwise                         use alternative-wise deletion instead of
                                    casewise deletion
  reverse                         interpret the lowest rank in depvar as the
                                    best; the default is the highest rank is
                                    the best

SE/Robust
  vce(vcetype)                    vcetype may be oim, robust, cluster
                                    clustvar, opg, bootstrap, or jackknife

Reporting
  level(#)                        set confidence level; default is level(95)
  notransform                     do not transform variance–covariance
                                    estimates to the standard deviation and
                                    correlation metric
  nocnsreport                     do not display constraints

Integration
  intmethod(seqtype)              type of quasi- or pseudouniform sequence
  intpoints(#)                    number of points in each sequence
  intburn(#)                      starting index in the Hammersley or Halton
                                    sequence
  intseed(code|#)                 pseudouniform random-number seed
  antithetics                     use antithetic draws
  nopivot                         do not use integration interval pivoting
  initbhhh(#)                     use the BHHH optimization algorithm for
                                    the first # iterations
  favor(speed|space)              favor speed or space when generating
                                    integration points

Maximization
  maximize_options                control the maximization process
† coeflegend                      display coefficients' legend instead of
                                    coefficient table
correlation           description

unstructured          one correlation parameter for each pair of
                        alternatives; correlations with the
                        basealternative() are zero; the default
exchangeable          one correlation parameter common to all pairs of
                        alternatives; correlations with the
                        basealternative() are zero
independent           constrain all correlation parameters to zero
pattern matname       user-specified matrix identifying the correlation
                        pattern
fixed matname         user-specified matrix identifying the fixed and free
                        correlation parameters
stddev                description

heteroskedastic       estimate standard deviation for each alternative;
                        standard deviations for basealternative() and
                        scalealternative() set to one
homoskedastic         all standard deviations are one
pattern matname       user-specified matrix identifying the standard
                        deviation pattern
fixed matname         user-specified matrix identifying the fixed and free
                        standard deviations
seqtype               description

hammersley            Hammersley point set
halton                Halton point set
random                uniform pseudorandom point set
* case(varname) and alternatives(varname) are required.
† coeflegend does not appear in the dialog box.
bootstrap, by, jackknife, statsby, and xi are allowed; see [U] 11.1.10 Prefix commands.
Weights are not allowed with the bootstrap prefix.
fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight.
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.
Menu

Statistics > Ordinal outcomes > Rank-ordered probit regression
Description
asroprobit fits rank-ordered probit (ROP) models by using maximum simulated likelihood (MSL).
The model allows you to relax the independence of irrelevant alternatives (IIA) property that is
characteristic of the rank-ordered logistic model by estimating the variance–covariance parameters
of the latent-variable errors. Each unique identifier in the case() variable has multiple alternatives
identified in the alternatives() variable, and depvar contains the ranked alternatives made by each
case. Only the order in the ranks, not the magnitude of their differences, is assumed to be relevant.
By default, the largest rank indicates the more desirable alternative. Use the reverse option if the
lowest rank should be interpreted as the more desirable alternative. Tied ranks are allowed, but they
increase the computation time because all permutations of the tied ranks are used in computing the
likelihood for each case. asroprobit allows two types of independent variables: alternative-specific
variables, in which the values of each variable vary with each alternative, and case-specific variables,
which vary with each case.
The estimation technique of asroprobit is nearly identical to that of asmprobit, and the two
routines share many of the same options; see [R] asmprobit.
Options
Model
case(varname) specifies the variable that identifies each case. This variable identifies the individuals
or entities making a choice. case() is required.
alternatives(varname) specifies the variable that identifies the alternatives available for each case.
The number of alternatives can vary with each case; the maximum number of alternatives is 20.
alternatives() is required.
casevars(varlist) specifies the case-specific variables that are constant for each case(). If there are
a maximum of J alternatives, there will be J − 1 sets of coefficients associated with casevars().
constraints(constraints), collinear; see [R] estimation options.
Model 2
correlation(correlation) specifies the correlation structure of the latent-variable errors.
correlation(unstructured) is the most general and has J(J − 3)/2 + 1 unique correlation
parameters. This is the default unless stddev() or structural are specified.
correlation(exchangeable) provides for one correlation coefficient common to all latent
variables, except the latent variable associated with the basealternative().
correlation(independent) assumes that all correlations are zero.
correlation(pattern matname) and correlation(fixed matname) give you more flexibility in defining the correlation structure. See Variance structures in [R] asmprobit for more
information.
stddev(stddev) specifies the variance structure of the latent-variable errors.
stddev(heteroskedastic) is the most general and has J − 2 estimable parameters. The standard
deviations of the latent-variable errors for the alternatives specified in basealternative()
and scalealternative() are fixed to one.
asroprobit — Alternative-specific rank-ordered probit regression
141
stddev(homoskedastic) constrains all the standard deviations to equal one.
stddev(pattern matname) and stddev(fixed matname) give you added flexibility in defining
the standard deviation parameters. See Variance structures in [R] asmprobit for more information.
structural requests the J × J structural covariance parameterization instead of the default (J-1) × (J-1) differenced covariance parameterization (the covariance of the latent errors differenced with that of the base alternative). The differenced covariance parameterization will achieve the same maximum simulated likelihood regardless of the choice of basealternative() and scalealternative(). On the other hand, the structural covariance parameterization imposes more normalizations that may bound the model away from its maximum likelihood and thus prevent convergence with some datasets or choices of basealternative() and scalealternative().
factor(#) requests that the factor covariance structure of dimension # be used. The factor() option can be used with the structural option but cannot be used with stddev() or correlation(). A # × J (or # × (J-1)) matrix, C, is used to factor the covariance matrix as I + C'C, where I is the identity matrix of dimension J (or J-1). The column dimension of C depends on whether the covariance is structural or differenced. The row dimension of C, #, must be less than or equal to floor{(J(J-1)/2 - 1)/(J-2)}, because there are only J(J-1)/2 - 1 identifiable variance–covariance parameters. For example, with J = 4 alternatives, # can be at most floor(5/2) = 2. This covariance parameterization may be useful for reducing the number of covariance parameters that need to be estimated.

If the covariance is structural, the column of C corresponding to the base alternative contains zeros. The column corresponding to the scale alternative has a one in the first row and zeros elsewhere. If the covariance is differenced, the column corresponding to the scale alternative (differenced with the base) has a one in the first row and zeros elsewhere.
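For example, using the specification from example 1 below, a one-dimensional factor structure might be requested by typing (a sketch; convergence and fit are not guaranteed):

. asroprobit rank high low if noties, casevars(female score) case(id)
>     alternatives(jobchar) reverse factor(1)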
noconstant suppresses the J − 1 alternative-specific constant terms.
basealternative(# | lbl | str) specifies the alternative used to normalize the latent-variable location
(also referred to as the level of utility). The base alternative may be specified as a number, label,
or string. The standard deviation for the latent-variable error associated with the base alternative
is fixed to one, and its correlations with all other latent-variable errors are set to zero. The default
is the first alternative when sorted. If a fixed or pattern matrix is given in the stddev()
and correlation() options, the basealternative() will be implied by the fixed standard
deviations and correlations in the matrix specifications. basealternative() cannot be equal to
scalealternative().
scalealternative(# | lbl | str) specifies the alternative used to normalize the latent-variable scale
(also referred to as the scale of utility). The scale alternative may be specified as a number,
label, or string. The default is to use the second alternative when sorted. If a fixed or pattern
matrix is given in the stddev() option, the scalealternative() will be implied in the
fixed standard deviations in the matrix specification. scalealternative() cannot be equal to
basealternative().
If a fixed or pattern matrix is given for the stddev() option, the base alternative and scale
alternative are implied by the standard deviations and correlations in the matrix specifications, and
they need not be specified in the basealternative() and scalealternative() options.
altwise specifies that alternative-wise deletion be used when marking out observations due to
missing values in your variables. The default is to use casewise deletion; that is, the entire group
of observations making up a case is deleted if any missing values are encountered. This option
does not apply to observations that are marked out by the if or in qualifier or the by prefix.
reverse directs asroprobit to interpret the rank in depvar that is smallest in value as the preferred
alternative. By default, the rank that is largest in value is the favored alternative.
SE/Robust
vce(vcetype) specifies the type of standard error reported, which includes types that are derived
from asymptotic theory, that are robust to some kinds of misspecification, that allow for intragroup
correlation, and that use bootstrap or jackknife methods; see [R] vce option.
If specifying vce(bootstrap) or vce(jackknife), you must also specify basealternative()
and scalealternative().
Reporting
level(#); see [R] estimation options.
notransform prevents retransforming the Cholesky-factored variance–covariance estimates to the
correlation and standard deviation metric.
This option has no effect if structural is not specified because the default differenced variance–
covariance estimates have no interesting interpretation as correlations and standard deviations.
notransform also has no effect if the correlation() and stddev() options are specified with
anything other than their default values. Here it is generally not possible to factor the variance–
covariance matrix, so optimization is already performed using the standard deviation and correlation
representations.
nocnsreport; see [R] estimation options.
Integration
intmethod(hammersley | halton | random) specifies the method of generating the point sets used in
the quasi–Monte Carlo integration of the multivariate normal density. intmethod(hammersley),
the default, uses the Hammersley sequence; intmethod(halton) uses the Halton sequence; and
intmethod(random) uses a sequence of uniform random numbers.
intpoints(#) specifies the number of points to use in the quasi–Monte Carlo integration. If
this option is not specified, the number of points is 50 × J if intmethod(hammersley) or
intmethod(halton) is used and 100 × J if intmethod(random) is used. Larger values of
intpoints() provide better approximations of the log likelihood, but at the cost of added
computation time.
intburn(#) specifies where in the Halton or Hammersley sequence to start, which helps reduce the
correlation between the sequences of each dimension. The default is 0. This option may not be
specified with intmethod(random).
intseed(code | #) specifies the seed to use for generating the uniform pseudorandom sequence. This
option may be specified only with intmethod(random). code refers to a string that records the
state of the random-number generator runiform(); see [R] set seed. An integer value # may
be used also. The default is to use the current seed value from Stata’s uniform random-number
generator, which can be obtained from c(seed).
antithetics specifies that antithetic draws be used. The antithetic draw for the J − 1 vector
uniform-random variables, x, is 1 − x.
nopivot turns off integration interval pivoting. By default, asroprobit will pivot the wider intervals of integration to the interior of the multivariate integration. This improves the accuracy of the quadrature estimate. However, discontinuities may result in the computation of numerical second-order derivatives using finite differencing (for the Newton–Raphson optimize technique, tech(nr)) when few simulation points are used, resulting in a nonpositive-definite Hessian. asroprobit uses the Broyden–Fletcher–Goldfarb–Shanno optimization algorithm by default, which does not require computing the Hessian numerically using finite differencing.
initbhhh(#) specifies that the Berndt–Hall–Hall–Hausman (BHHH) algorithm be used for the initial
# optimization steps. This option is the only way to use the BHHH algorithm along with other
optimization techniques. The algorithm switching feature of ml’s technique() option cannot
include bhhh.
favor(speed | space) instructs asroprobit to favor either speed or space when generating the
integration points. favor(speed) is the default. When favoring speed, the integration points are
generated once and stored in memory, thus increasing the speed of evaluating the likelihood. This
speed increase can be seen when there are many cases or when the user specifies a large number
of integration points, intpoints(#). When favoring space, the integration points are generated
repeatedly with each likelihood evaluation.
For unbalanced data, where the number of alternatives varies with each case, the estimates computed using intmethod(random) will vary slightly between favor(speed) and favor(space). This is because the uniform sequences will not be identical, even when initiating the sequences using the same uniform seed, intseed(code | #). For favor(speed), ncase blocks of intpoints(#) × (J - 2) uniform points are generated, where J is the maximum number of alternatives. For favor(space), the column dimension of the matrices of points varies with the number of alternatives that each case has.
Maximization
maximize_options: difficult, technique(algorithm_spec), iterate(#), [no]log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, from(init_specs); see [R] maximize.

The following options may be particularly useful in obtaining convergence with asroprobit: difficult, technique(algorithm_spec), nrtolerance(#), nonrtolerance, and from(init_specs).
If technique() contains more than one algorithm specification, bhhh cannot be one of them. To
use the BHHH algorithm with another algorithm, use the initbhhh() option and specify the other
algorithm in technique().
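For example, to take the first 5 iterations (an arbitrary illustrative number) with BHHH and then switch to Newton–Raphson, you might type

. asroprobit rank high low if noties, casevars(female score) case(id)
>     alternatives(jobchar) reverse initbhhh(5) technique(nr)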
Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg).
When specifying from(matname[, copy]), the values in matname associated with the latent-variable error variances must be for the log-transformed standard deviations and inverse-hyperbolic-tangent-transformed correlations. This option makes it convenient to use the coefficient vector from a previously fitted asroprobit model as a starting point.
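For example, the following sketch refits the example 1 model, using the coefficient vector saved by the first fit as starting values for the second:

. asroprobit rank high low if noties, casevars(female score) case(id)
>     alternatives(jobchar) reverse
. matrix b = e(b)
. asroprobit rank high low if noties, casevars(female score) case(id)
>     alternatives(jobchar) reverse from(b, copy)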
The following option is available with asroprobit but is not shown in the dialog box:
coeflegend; see [R] estimation options.
Remarks
The mathematical description and numerical computations of the rank-ordered probit model are
similar to that of the multinomial probit model. The only difference is that the dependent variable
of the rank-ordered probit model is ordinal, showing preferences among alternatives, as opposed to
the binary dependent variable of the multinomial probit model, indicating a chosen alternative. We
will describe how the likelihood of a ranking is computed using the latent-variable framework here,
but for details of the latent-variable parameterization of these models and the method of maximum
simulated likelihood, see [R] asmprobit.
Consider the latent-variable parameterization of a J-alternative rank-ordered probit model. Using the notation from asmprobit, we have variables $\eta_{ij}$, $j = 1, \dots, J$, such that

$$
\eta_{ij} = \mathbf{x}_{ij}\boldsymbol{\beta} + \mathbf{z}_i\boldsymbol{\alpha}_j + \xi_{ij}
$$
Here the xij are the alternative-specific independent variables, the zi are the case-specific variables,
and the ξij are multivariate normal with mean zero and covariance Ω. Without loss of generality,
assume that individual i ranks the alternatives in order of the alternative indices j = 1, 2, . . . , J ,
so the alternative J is the preferred alternative and alternative 1 is the least preferred alternative.
The probability of this ranking given β and αj is the probability that ηi,J−1 − ηi,J ≤ 0 and
ηi,J−2 − ηi,J−1 ≤ 0, . . . , and ηi,1 − ηi,2 ≤ 0.
Example 1
Long and Freese (2006) provide an example of a rank-ordered logit model with alternative-specific
variables. We use this dataset to demonstrate asroprobit. The data come from the Wisconsin
Longitudinal Study. This is a study of 1957 Wisconsin high school graduates who were asked to rate their relative preference of four job characteristics: esteem, a job other people regard highly; variety,
a job that is not repetitive and allows you to do a variety of things; autonomy, a job where your
supervisor does not check on you frequently; and security, a job with a low risk of being laid off. The
case-specific covariates are gender, female, an indicator variable for females, and score, a score
on a general mental ability test measured in standard deviations. The alternative-specific variables
are high and low, which indicate whether the respondent’s current job is high or low in esteem,
variety, autonomy, or security. This approach provides three states for a respondent’s current job
status for each alternative, (1, 0), (0, 1), and (0, 0), using the notation (high, low). The score (1, 1)
is omitted because the respondent’s current job cannot be considered both high and low in one of the
job characteristics. The (0, 0) score would indicate that the respondent’s current job does not rank
high or low (is neutral) in a job characteristic. The alternatives are ranked such that 1 is the preferred
alternative and 4 is the least preferred.
. use http://www.stata-press.com/data/r11/wlsrank
(1992 Wisconsin Longitudinal Study data on job values)
. list id jobchar rank female score high low in 1/12, sepby(id)
        id    jobchar    rank   female      score   high   low

  1.     1   security      1        1   .0492111      0     0
  2.     1   autonomy      4        1   .0492111      0     0
  3.     1    variety      1        1   .0492111      0     0
  4.     1     esteem      3        1   .0492111      0     0

  5.     5   security      2        1   2.115012      1     0
  6.     5    variety      2        1   2.115012      1     0
  7.     5     esteem      2        1   2.115012      1     0
  8.     5   autonomy      1        1   2.115012      0     0

  9.     7   autonomy      1        0   1.701852      1     0
 10.     7    variety      1        0   1.701852      0     1
 11.     7     esteem      4        0   1.701852      0     0
 12.     7   security      1        0   1.701852      0     0
The three cases listed have tied ranks. asroprobit will allow ties, but at the cost of increased
computation time. To evaluate the likelihood of the first observation, asroprobit must compute
Pr(esteem = 3, variety = 1, autonomy = 4, security = 2)+
Pr(esteem = 3, variety = 2, autonomy = 4, security = 1)
and both of these probabilities are estimated using simulation. In fact, the full dataset contains 7,237
tied ranks and asroprobit takes a great deal of time to estimate the parameters. For exposition, we
estimate the rank-ordered probit model by using the cases without ties. These cases are marked in
the variable noties.
The model of job preference is
$$
\eta_{ij} = \beta_1 \mathtt{high}_{ij} + \beta_2 \mathtt{low}_{ij} + \alpha_{1j}\mathtt{female}_i + \alpha_{2j}\mathtt{score}_i + \alpha_{0j} + \xi_{ij}
$$

for $j = 1, 2, 3, 4$. The base alternative will be esteem, so $\alpha_{01} = \alpha_{11} = \alpha_{21} = 0$.
. asroprobit rank high low if noties, casevars(female score) case(id)
> alternatives(jobchar) reverse
note: variable high has 107 cases that are not alternative-specific: there is
no within-case variability
note: variable low has 193 cases that are not alternative-specific: there is
no within-case variability
Iteration 0:   log simulated-likelihood = -1103.2768
Iteration 1:   log simulated-likelihood = -1089.3361  (backed up)
  (output omitted)

Alternative-specific rank-ordered probit      Number of obs      =      1660
Case variable: id                             Number of cases    =       415
Alternative variable: jobchar                 Alts per case: min =         4
                                                             avg =       4.0
                                                             max =         4
Integration sequence:      Hammersley
Integration points:               200         Wald chi2(8)       =     34.01
Log simulated-likelihood = -1080.2206         Prob > chi2        =    0.0000

        rank        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

jobchar
        high     .3741029   .0925685     4.04   0.000      .192672    .5555337
         low    -.0697443   .1093317    -0.64   0.524    -.2840305    .1445419

esteem           (base alternative)

variety
      female     .1351487   .1843088     0.73   0.463    -.2260899    .4963873
       score     .1405482   .0977567     1.44   0.151    -.0510515    .3321479
       _cons     1.735016   .1451343    11.95   0.000     1.450558    2.019474

autonomy
      female     .2561828   .1679565     1.53   0.127    -.0730059    .5853715
       score     .1898853   .0875668     2.17   0.030     .0182575     .361513
       _cons     .7009797   .1227336     5.71   0.000     .4604262    .9415333

security
      female      .232622   .2057547     1.13   0.258    -.1706497    .6358938
       score    -.1780076   .1102115    -1.62   0.106    -.3940181     .038003
       _cons     1.343766   .1600059     8.40   0.000     1.030161    1.657372

     /lnl2_2     .1805151   .0757296     2.38   0.017     .0320878    .3289424
     /lnl3_3     .4843091   .0793343     6.10   0.000     .3288168    .6398014
       /l2_1     .6062037   .1169368     5.18   0.000     .3770117    .8353957
       /l3_1     .4509217   .1431183     3.15   0.002     .1704151    .7314283
       /l3_2     .2289447   .1226081     1.87   0.062    -.0113627    .4692521

(jobchar=esteem is the alternative normalizing location)
(jobchar=variety is the alternative normalizing scale)
We specified the reverse option because a rank of 1 is the highest preference. The variance–
covariance estimates are for the Cholesky-factored variance–covariance for the latent-variable errors
differenced with that of alternative esteem. We can view the estimated correlations by entering
. estat correlation
                variety   autonomy   security

   variety       1.0000
  autonomy       0.4516     1.0000
  security       0.2652     0.2399     1.0000

Note: correlations are for alternatives differenced with esteem
and typing
. estat covariance
                 variety   autonomy   security

   variety             2
  autonomy      .8573015    1.80229
  security      .6376996   .5475882   2.890048

Note: covariances are for alternatives differenced with esteem
gives the (co)variances. [R] mprobit explains that if the latent-variable errors are independent, then
the correlations in the differenced parameterization should be ∼0.5 and the variances should be ∼2.0,
which seems to be the case here.
The coefficient estimates for the probit models can be difficult to interpret because of the normalization for location and scale. The regression estimates for the case-specific variables will be relative to the base alternative, and the regression estimates for both the case-specific and alternative-specific variables are affected by the scale normalization. The more pronounced the heteroskedasticity and correlations, the more pronounced the resulting estimate differences when choosing alternatives to normalize for location and scale. However, when using the differenced covariance structure, you will obtain the same model likelihood regardless of which alternatives you choose as the base and scale alternatives. For model interpretation, you can examine the estimated probabilities and marginal effects by using the postestimation routines predict and estat mfx. See [R] asroprobit postestimation.
Saved results
asroprobit saves the following in e():
Scalars
  e(N)                  number of observations
  e(N_case)             number of cases
  e(N_ties)             number of ties
  e(k)                  number of parameters
  e(k_alt)              number of alternatives
  e(k_indvars)          number of alternative-specific variables
  e(k_casevars)         number of case-specific variables
  e(k_sigma)            number of variance estimates
  e(k_rho)              number of correlation estimates
  e(k_eq)               number of equations in e(b)
  e(k_eq_model)         number of equations in model Wald test
  e(df_m)               model degrees of freedom
  e(k_autoCns)          number of base, empty, and omitted constraints
  e(ll)                 log simulated-likelihood
  e(N_clust)            number of clusters
  e(const)              constant indicator
  e(i_base)             base alternative index
  e(i_scale)            scale alternative index
  e(mc_points)          number of Monte Carlo replications
  e(mc_burn)            starting sequence index
  e(mc_antithetics)     antithetics indicator
  e(reverse)            1 if minimum rank is best, 0 if maximum rank is best
  e(chi2)               χ2
  e(p)                  significance
  e(fullcov)            unstructured covariance indicator
  e(structcov)          1 if structured covariance; 0 otherwise
  e(cholesky)           Cholesky-factored covariance indicator
  e(alt_min)            minimum number of alternatives
  e(alt_avg)            average number of alternatives
  e(alt_max)            maximum number of alternatives
  e(rank)               rank of e(V)
  e(ic)                 number of iterations
  e(rc)                 return code
  e(converged)          1 if converged, 0 otherwise
Macros
  e(cmd)                asroprobit
  e(cmd2)               asroprobit
  e(cmdline)            command as typed
  e(depvar)             name of dependent variable
  e(indvars)            alternative-specific independent variable
  e(casevars)           case-specific variables
  e(case)               variable defining cases
  e(altvar)             variable defining alternatives
  e(alteqs)             alternative equation names
  e(alt#)               alternative labels
  e(wtype)              weight type
  e(wexp)               weight expression
  e(title)              title in estimation output
  e(clustvar)           name of cluster variable
  e(correlation)        correlation structure
  e(stddev)             variance structure
  e(cov_class)          class of the covariance structure
  e(chi2type)           Wald, type of model χ2 test
  e(vce)                vcetype specified in vce()
  e(vcetype)            title used to label Std. Err.
  e(opt)                type of optimization
  e(which)              max or min; whether optimizer is to perform
                          maximization or minimization
  e(ml_method)          type of ml method
  e(mc_method)          Hammersley, Halton, or uniform random; technique to
                          generate sequences
  e(mc_seed)            random-number generator seed
  e(user)               name of likelihood-evaluator program
  e(technique)          maximization technique
  e(singularHmethod)    m-marquardt or hybrid; method used when Hessian
                          is singular
  e(crittype)           optimization criterion
  e(datasignature)      the checksum
  e(datasignaturevars)  variables used in calculation of checksum
  e(properties)         b V
  e(estat_cmd)          program used to implement estat
  e(mfx_dlg)            program used to implement estat mfx dialog
  e(predict)            program used to implement predict
  e(marginsnotok)       predictions disallowed by margins
Matrices
  e(b)              coefficient vector
  e(Cns)            constraints matrix
  e(stats)          alternative statistics
  e(stdpattern)     variance pattern
  e(stdfixed)       fixed and free standard deviations
  e(altvals)        alternative values
  e(altfreq)        alternative frequencies
  e(alt_casevars)   indicators for estimated case-specific coefficients—e(k_alt)×e(k_casevars)
  e(corpattern)     correlation structure
  e(corfixed)       fixed and free correlations
  e(ilog)           iteration log (up to 20 iterations)
  e(gradient)       gradient vector
  e(V)              variance–covariance matrix of the estimators
  e(V_modelbased)   model-based variance

Functions
  e(sample)         marks estimation sample
Methods and formulas
asroprobit is implemented as an ado-file.
From a computational perspective, asroprobit is similar to asmprobit and the two programs
share many numerical tools. Therefore, we will use the notation from Methods and formulas in
[R] asmprobit to discuss the rank-ordered probit probability model.
The latent variables for a J-alternative model are η_ij = x_ij β + z_i α_j + ξ_ij, for j = 1, . . . , J,
i = 1, . . . , n, and ξ′_i = (ξ_i1, . . . , ξ_iJ) ∼ MVN(0, Ω). Without loss of generality, assume for
the ith observation that an individual ranks the alternatives in the order of their numeric indices,
y_i = (J, J − 1, . . . , 1), so the first alternative is the most preferred and the last alternative is the
least preferred. We can then difference the latent variables such that

    v_ik = η_i,k+1 − η_ik
         = (x_i,k+1 − x_ik)β + z_i(α_k+1 − α_k) + ξ_i,k+1 − ξ_ik
         = δ_ik β + z_i γ_k + ε_ik

for k = 1, . . . , J − 1, where ε_i = (ε_i1, . . . , ε_i,J−1) ∼ MVN(0, Σ_(i)). Σ is indexed by i because
it is specific to the ranking of individual i. We denote the deterministic part of the model as
λ_ik = δ_ik β + z_i γ_k, and the probability of this event is
    Pr(y_i) = Pr(v_i1 ≤ 0, . . . , v_i,J−1 ≤ 0)
            = Pr(ε_i1 ≤ −λ_i1, . . . , ε_i,J−1 ≤ −λ_i,J−1)
            = (2π)^−(J−1)/2 |Σ_(i)|^−1/2 ∫_−∞^−λ_i1 · · · ∫_−∞^−λ_i,J−1 exp{−(1/2) z′ Σ_(i)^−1 z} dz
The integral has the same form as (3) of Methods and formulas in [R] asmprobit. See [R] asmprobit
for details on evaluating this integral numerically by using simulation.
asroprobit handles tied ranks by enumeration. For k tied ranks, it will generate k! rankings,
where ! is the factorial operator k! = k(k − 1)(k − 2) · · · (2)(1). For two sets of tied ranks of size k1
and k2 , asroprobit will generate k1 !k2 ! rankings. The total probability is the sum of the probability
of each ranking. For example, if there are two tied ranks such that y_i = (J, J, J − 2, . . . , 1), then
asroprobit will evaluate Pr(y_i) = Pr(y_i^(1)) + Pr(y_i^(2)), where y_i^(1) = (J, J − 1, J − 2, . . . , 1)
and y_i^(2) = (J − 1, J, J − 2, . . . , 1).
This command supports the clustered version of the Huber/White/sandwich estimator of the
variance using vce(robust) and vce(cluster clustvar). See [P] robust, in particular, in Maximum
likelihood estimators and Methods and formulas. Specifying vce(robust) is equivalent to specifying
vce(cluster casevar), where casevar is the variable that identifies the cases.
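For example, if cases were sampled within larger groups, you could cluster on the group identifier instead. A minimal sketch, where district is a hypothetical grouping variable (it is not part of the dataset used in the examples below):

    . asroprobit rank high low, case(id) alternatives(jobchar)
    >     casevars(female score) vce(cluster district)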
Reference
Long, J. S., and J. Freese. 2006. Regression Models for Categorical Dependent Variables Using Stata. 2nd ed. College
Station, TX: Stata Press.
Also see
[R] asroprobit postestimation — Postestimation tools for asroprobit
[R] asmprobit — Alternative-specific multinomial probit regression
[R] mprobit — Multinomial probit regression
[R] mlogit — Multinomial (polytomous) logistic regression
[R] oprobit — Ordered probit regression
[U] 20 Estimation and postestimation commands
Title
asroprobit postestimation — Postestimation tools for asroprobit
Description
The following postestimation commands are of special interest after asroprobit:
command               description

estat alternatives    alternative summary statistics
estat covariance      variance–covariance matrix of the alternatives
estat correlation     correlation matrix of the alternatives
estat facweights      covariance factor weights matrix
estat mfx             marginal effects
For information about these commands, see below.
The following standard postestimation commands are also available:
command      description

estat        AIC, BIC, VCE, and estimation sample summary
estimates    cataloging estimation results
lincom       point estimates, standard errors, testing, and inference for linear
               combinations of coefficients
lrtest       likelihood-ratio test
nlcom        point estimates, standard errors, testing, and inference for nonlinear
               combinations of coefficients
predict      predicted probabilities, estimated linear predictor and its standard error
predictnl    point estimates, standard errors, testing, and inference for generalized
               predictions
test         Wald tests of simple and composite linear hypotheses
testnl       Wald tests of nonlinear hypotheses
See the corresponding entries in the Base Reference Manual for details.
Special-interest postestimation commands
estat alternatives displays summary statistics about the alternatives in the estimation sample.
The command also provides a mapping between the index numbers that label the covariance parameters
of the model and their associated values and labels for the alternative variable.
estat covariance computes the estimated variance–covariance matrix for the alternatives. The
estimates are displayed, and the variance–covariance matrix is stored in r(cov).
estat correlation computes the estimated correlation matrix for the alternatives. The estimates
are displayed, and the correlation matrix is stored in r(cor).
estat facweights displays the covariance factor weights matrix and stores it in r(C).
estat mfx computes marginal effects of a simulated probability of a set of ranked alternatives.
The probability is stored in r(pr), the matrix of rankings is stored in r(ranks), and the matrix of
marginal-effect statistics is stored in r(mfx).
Syntax for predict

    predict [type] newvar [if] [in] [, statistic altwise]

    predict [type] {stub* | newvarlist} [if] [in], scores

statistic    description

Main
  pr         probability of each ranking, by case; the default
  pr1        probability that each alternative is preferred
  xb         linear prediction
  stdp       standard error of the linear prediction

These statistics are available both in and out of sample; type predict . . . if e(sample) . . . if wanted
only for the estimation sample.
Menu

Statistics > Postestimation > Predictions, residuals, etc.
Options for predict
Main
pr, the default, calculates the probability of each ranking. For each case, one probability is computed
for the ranks in e(depvar).
pr1 calculates the probability that each alternative is preferred.
xb calculates the linear prediction x_ij β + z_i α_j for alternative j and case i.
stdp calculates the standard error of the linear predictor.
altwise specifies that alternative-wise deletion be used when marking out observations due to missing
values in your variables. The default is to use casewise deletion. The xb and stdp options always
use alternative-wise deletion.
scores calculates the scores for each coefficient in e(b). This option requires a new variable list of
length equal to the number of columns in e(b). Otherwise, use the stub* option to have predict
generate enumerated variables with prefix stub.
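For example, to let predict name the score variables for you, a sketch using an arbitrary prefix sc:

    . predict sc*, scores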
Syntax for estat alternatives
estat alternatives
Menu

Statistics > Postestimation > Reports and statistics
Syntax for estat covariance
    estat covariance [, format(string) border(string) left(#)]
Menu

Statistics > Postestimation > Reports and statistics
Options for estat covariance
format(string) sets the matrix display format. The default is format(%9.0g).
border(string) sets the matrix display border style. The default is border(all). See [P] matlist.
left(#) sets the matrix display left indent. The default is left(2). See [P] matlist.
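For instance, to display the covariance matrix with three decimal places and a wider indent (the particular option values here are arbitrary choices):

    . estat covariance, format(%9.3f) left(4)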
Syntax for estat correlation
    estat correlation [, format(string) border(string) left(#)]
Menu

Statistics > Postestimation > Reports and statistics
Options for estat correlation
format(string) sets the matrix display format. The default is format(%9.4f).
border(string) sets the matrix display border style. The default is border(all). See [P] matlist.
left(#) sets the matrix display left indent. The default is left(2). See [P] matlist.
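Because estat correlation leaves the displayed matrix behind in r(cor), you can save and reuse it; a minimal sketch:

    . estat correlation
    . matrix C = r(cor)
    . matlist C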
Syntax for estat facweights
    estat facweights [, format(string) border(string) left(#)]
Menu

Statistics > Postestimation > Reports and statistics
Options for estat facweights
format(string) sets the matrix display format. The default is format(%9.0f).
border(string) sets the matrix display border style. The default is border(all). See [P] matlist.
left(#) sets the matrix display left indent. The default is left(2). See [P] matlist.
Syntax for estat mfx
    estat mfx [if] [in] [, options]
options                description

Main
  varlist(varlist)     display marginal effects for varlist
  at(median [atlist])  calculate marginal effects at these values
  rank(atlist)         calculate marginal effects for the simulated probability of these
                         ranked alternatives

Options
  level(#)             set confidence interval level; default is level(95)
  nodiscrete           treat indicator variables as continuous
  noesample            do not restrict calculation of the medians to the estimation sample
  nowght               ignore weights when calculating medians
Menu

Statistics > Postestimation > Reports and statistics
Options for estat mfx
Main
varlist(varlist) specifies the variables for which to display marginal effects. The default is all
variables.
at(median [alternative:variable = #] [variable = #] . . . ) specifies the values at which the
marginal effects are to be calculated. The default is to compute the marginal effects at the medians
of the independent variables by using the estimation sample, at(median). You can also specify
specific values for variables. Values for alternative-specific variables can be specified by alternative,
or you can specify one value for all alternatives. You can specify only one value for case-specific
variables. For example, in the wlsrank dataset, female and score are case-specific variables,
whereas high and low are alternative-specific variables. The following would be a legal syntax
for estat mfx:
. estat mfx, at(median high=0 esteem:high=1 low=0 security:low=1 female=1)
When nodiscrete is not specified, at(median atlist ) has no effect on computing marginal
effects for indicator variables, which are calculated as the discrete change in the simulated probability
as the indicator variable changes from 0 to 1.
The median computations respect any if or in qualifiers, so you can restrict the data over which
the medians are computed. You can even restrict the values to a specific case, e.g.,
. estat mfx if case==13
rank(alternative = # alternative = # . . . ) specifies the ranks for the alternatives. The default is
to rank the calculated latent variables. Alternatives excluded from rank() are omitted from the
analysis. You must therefore specify at least two alternatives in rank(). You may have tied ranks
in the rank specification. Only the order in the ranks is relevant.
Options
level(#) sets the confidence level; default is level(95).
nodiscrete specifies that indicator variables be treated as continuous variables. An indicator variable
is one that takes on the value 0 or 1 in the estimation sample. By default, the discrete change in
the simulated probability is computed as the indicator variable changes from 0 to 1.
noesample specifies that the whole dataset be considered instead of only those marked in the
e(sample) defined by the asroprobit command.
nowght specifies that weights be ignored when calculating the medians.
Remarks
Remarks are presented under the following headings:
Predicted probabilities
Obtaining estimation statistics
Predicted probabilities
After fitting an alternative-specific rank-ordered probit model, you can use predict to obtain the
probabilities of alternative rankings or the probabilities of each alternative being preferred. When
evaluating the multivariate normal probabilities via (quasi) Monte Carlo, predict uses the same
method to generate the (quasi) random sequence of numbers as the previous call to asroprobit. For
example, if you specified intmethod(halton) when fitting the model, predict also uses Halton
sequences.
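For example, you could refit the model with Halton sequences and then predict; the predictions will then be evaluated with Halton sequences as well (a sketch based on the model fit in example 1 below):

    . asroprobit rank high low if noties, casevars(female score) case(id)
    >     alternatives(jobchar) reverse intmethod(halton)
    . predict phalton, pr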
Example 1
In example 1 of [R] asroprobit, we fit a model of job characteristic preferences. This is a study
of 1957 Wisconsin high school graduates who were asked to rate their relative preference for four
job characteristics: esteem, a job other people regard highly; variety, a job that is not repetitive and
allows you to do a variety of things; autonomy, a job where your supervisor does not check on you
frequently; and security, a job with a low risk of being laid off. The case-specific covariates are
gender, female, an indicator variable for females, and score, a score on a general mental ability test
measured in standard deviations. The alternative-specific variables are high and low, which indicate
whether the respondent’s current job is high or low in esteem, variety, autonomy, or security. This
approach provides three states for a respondent’s current job status for each alternative, (1, 0), (0, 1),
and (0, 0), using the notation (high, low). The score (1, 1) is omitted because the respondent’s
current job cannot be considered both high and low in one of the job characteristics. The (0, 0)
score would indicate that the respondent’s current job does not rank high or low (is neutral) in a job
characteristic. The alternatives are ranked such that 1 is the preferred alternative and 4 is the least
preferred.
We can obtain the probabilities of the observed alternative rankings, the pr option, and the
probability of each alternative being preferred, the pr1 option, by using predict:
. use http://www.stata-press.com/data/r11/wlsrank
(1992 Wisconsin Longitudinal Study data on job values)
. asroprobit rank high low if noties, casevars(female score) case(id)
> alternatives(jobchar) reverse
(output omitted )
. keep if e(sample)
(11244 observations deleted)
. predict prob, pr
. predict prob1, pr1
. list id jobchar prob prob1 rank female score high low in 1/12

        id    jobchar       prob      prob1   rank   female      score   high   low

  1.    13   security   .0421807   .2784269      3        0   .3246512      0     1
  2.    13   autonomy   .0421807   .1029036      1        0   .3246512      0     0
  3.    13    variety   .0421807   .6026725      2        0   .3246512      1     0
  4.    13     esteem   .0421807   .0160111      4        0   .3246512      0     1

  5.    19   autonomy   .0942025   .1232488      4        1   .0492111      0     0
  6.    19     esteem   .0942025   .0140261      3        1   .0492111      0     0
  7.    19   security   .0942025   .4601368      1        1   .0492111      1     0
  8.    19    variety   .0942025   .4025715      2        1   .0492111      0     0

  9.    22     esteem   .1414177   .0255264      4        1   1.426412      1     0
 10.    22    variety   .1414177   .4549441      1        1   1.426412      0     0
 11.    22   security   .1414177   .2629494      3        1   1.426412      0     0
 12.    22   autonomy   .1414177   .2566032      2        1   1.426412      1     0
The prob variable is constant for each case because it contains the probability of the ranking in
the rank variable. On the other hand, the prob1 variable contains the estimated probability of each
alternative being preferred. For each case, the sum of the values in prob1 will be approximately 1.0.
They do not add up to exactly 1.0 because of approximations due to the GHK algorithm.
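You can verify this by totaling prob1 within case and summarizing the result; a quick check along these lines:

    . bysort id: egen double totpr1 = total(prob1)
    . summarize totpr1

The values of totpr1 will be close to, but not exactly, 1 for every case.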
Obtaining estimation statistics
For examples of the specialized estat subcommands covariance and correlation, see
[R] asmprobit postestimation. That entry also has a good example of computing marginal effects
after asmprobit that is applicable to asroprobit. Below we elaborate further on marginal effects
after asroprobit, manipulating the rank() option.
Example 2
We will continue with the preferred job characteristics example where we first compute the marginal
effects for case id = 13.
. estat mfx if id==13, rank(security=3 autonomy=1 variety=2 esteem=4)
Pr(esteem=4 variety=2 autonomy=1 security=3) = .04218068

    variable       dp/dx   Std. Err.       z   P>|z|   [    95% C.I.    ]        X

    high*
      esteem    -.008713     .001964   -4.44   0.000   -.012562  -.004864        0
      variety   -.009102     .003127   -2.91   0.004   -.015231  -.002973        1
      autonomy   .025535     .007029    3.63   0.000    .011758   .039313        0
      security  -.003745     .001394   -2.69   0.007   -.006477  -.001013        0
    low*
      esteem     .001614     .002646    0.61   0.542   -.003572     .0068        1
      variety    .001809     .003012    0.60   0.548   -.004094   .007712        0
      autonomy  -.003849     .006104   -0.63   0.528   -.015813   .008115        0
      security   .000582     .000985    0.59   0.554   -.001348   .002513        1
    casevars
      female*    .009767     .009064    1.08   0.281   -.007998   .027533        0
      score      .008587     .004488    1.91   0.056    -.00021   .017384   .32465

(*) dp/dx is for discrete change of indicator variable from 0 to 1
Next we compute the marginal effects for the probability that autonomy is preferred given the profile
of case id = 13.
. estat mfx if id==13, rank(security=2 autonomy=1 variety=2 esteem=2)
Pr(esteem=3 variety=4 autonomy=1 security=2) +
Pr(esteem=4 variety=3 autonomy=1 security=2) +
Pr(esteem=2 variety=4 autonomy=1 security=3) +
Pr(esteem=4 variety=2 autonomy=1 security=3) +
Pr(esteem=2 variety=3 autonomy=1 security=4) +
Pr(esteem=3 variety=2 autonomy=1 security=4) = .10276103

    variable       dp/dx   Std. Err.       z   P>|z|   [    95% C.I.    ]        X

    high*
      esteem    -.003524     .001258   -2.80   0.005   -.005989  -.001059        0
      variety   -.036203      .00894   -4.05   0.000   -.053724  -.018681        1
      autonomy   .057279     .013801    4.15   0.000    .030231   .084328        0
      security    -.0128     .002665   -4.80   0.000   -.018024  -.007576        0
    low*
      esteem     .000518     .000833    0.62   0.534   -.001116   .002151        1
      variety    .006409     .010588    0.61   0.545   -.014343   .027161        0
      autonomy  -.008818     .013766   -0.64   0.522   -.035799   .018163        0
      security   .002314     .003697    0.63   0.531   -.004932   .009561        1
    casevars
      female*    .013839     .021607    0.64   0.522   -.028509   .056188        0
      score      .017917     .011062    1.62   0.105   -.003764   .039598   .32465

(*) dp/dx is for discrete change of indicator variable from 0 to 1
The probability computed by estat mfx matches the probability computed by predict, pr1 only
within three digits. This outcome is because of how the computation is carried out and the numeric
inaccuracy of the GHK simulator using a Hammersley point set of length 200. The computation
carried out by estat mfx literally computes all six probabilities listed in the header of the MFX
table and sums them. The computation by predict, pr1 is the same as predict after asmprobit
(multinomial probit): it computes the probability that autonomy is chosen, thus requiring only one
call to the GHK simulator. Hence, there is a difference in the reported values even though the two
probability statements are equivalent.
Saved results
estat mfx saves the following in r():
Scalars
  r(pr)      scalar containing the computed probability of the ranked alternatives

Matrices
  r(ranks)   column vector containing the alternative ranks; the row names identify
               the alternatives
  r(mfx)     matrix containing the computed marginal effects and associated statistics.
               Column 1 of the matrix contains the marginal effects; column 2, their
               standard errors; column 3, their z statistics; and columns 4 and 5, the
               confidence intervals. Column 6 contains the values of the independent
               variables used to compute the probabilities r(pr).
Methods and formulas
All postestimation commands listed above are implemented as ado-files.
Also see
[R] asroprobit — Alternative-specific rank-ordered probit regression
[R] asmprobit — Alternative-specific multinomial probit regression
[U] 20 Estimation and postestimation commands
Title
BIC note — Calculating and interpreting BIC
Description
This entry discusses a statistical issue that arises when using the Bayesian information criterion
(BIC) to compare models.
Stata calculates BIC, assuming N = e(N)—we will explain—but sometimes it would be better if
a different N were used. Commands that calculate BIC have an n() option, allowing you to specify
the N to be used.
In summary,
1. If you are comparing results estimated by the same estimation command, using the default
BIC calculation is probably fine. There is an issue, but most researchers would ignore it.
2. If you are comparing results estimated by different estimation commands, you need to be
on your guard.
a. If the different estimation commands share the same definitions of observations,
independence, and the like, you are back in case 1.
b. If they differ in these regards, you need to think about the value of N that should
be used. For example, logit and xtlogit differ in that the former assumes
independent observations and the latter, independent panels.
c. If estimation commands differ in the events being used over which the likelihood
function is calculated, the information criteria may not be comparable at all. We
say information criteria because this would apply equally to the Akaike information
criterion (AIC), as well as to BIC. For instance, streg and stcox produce such
incomparable results. The events used by streg are the actual survival times,
whereas the events used by stcox are failures within risk pools, conditional on
the times at which failures occurred.
Remarks
Remarks are presented under the following headings:
Background
The problem of determining N
The problem of conformable likelihoods
The first problem does not arise with AIC; the second problem does
Calculating BIC correctly
Background
The AIC and the BIC are two popular measures for comparing maximum likelihood models. AIC
and BIC are defined as
    AIC = −2 × ln(likelihood) + 2 × k
    BIC = −2 × ln(likelihood) + ln(N) × k
where
k = model degrees of freedom
N = number of observations
We are going to discuss AIC along with BIC because AIC has some of the problems that BIC has,
but not all.
AIC and BIC can be viewed as measures that combine fit and complexity. Fit is measured negatively
by −2 × ln(likelihood); the larger the value, the worse the fit. Complexity is measured positively,
either by 2 × k (AIC) or ln(N ) × k (BIC).
Given two models fit on the same data, the model with the smaller value of the information
criterion is considered to be better.
There is a substantial literature on these measures: see Akaike (1974); Raftery (1995); Sakamoto,
Ishiguro, and Kitagawa (1986); and Schwarz (1978).
When Stata calculates the above measures, it uses the rank of e(V) for k and it uses e(N) for
N . e(V) and e(N) are Stata notation for results stored by the estimation command. e(V) is the
variance–covariance matrix of the estimated parameters, and e(N) is the number of observations in
the dataset used in calculating the result.
The problem of determining N
The difference between AIC and BIC is that AIC uses the constant 2 to weight k , whereas BIC uses
ln(N ).
Determining what value of N should be used is problematic. Despite appearances, the definition
“N is the number of observations” is not easy to make operational. N does not appear in the likelihood
function itself, N is not the output of a standard statistical formula, and what is an observation is
often subjective.
Example 1
Often what is meant by N is obvious. Consider a simple logit model. What is meant by N is the
number of observations that are statistically independent and that corresponds to M , the number of
observations in the dataset used in the calculation. We will write N = M .
But now assume that the same dataset has a grouping variable and the data are thought to be
clustered within group. To keep the problem simple, let’s pretend that there are G groups and m
observations within group, so that M = G×m. Because you are worried about intragroup correlation,
you fit your model with xtlogit, grouping on the grouping variable. Now you wish to calculate
BIC. What is the N that should be used? N = M or N = G?
That is a deep question. If the observations really are independent, then you should use N = M.
If the observations within group are not just correlated but are duplicates of one another, and they
had to be so, then you should use N = G. Between those two extremes, you should probably
use a number between G and M, but determining what that number should be from measured
correlations is difficult. Using N = M is conservative in that, if anything, it overweights complexity.
Conservativeness, however, is subjective, too: using N = G could be considered more conservative
in that fewer constraints are being placed on the data.
When the estimated correlation is high, our reaction would be that using N = G is probably more
reasonable. Our first reaction, however, would be that using BIC to compare models is probably a
misuse of the measure.
Stata uses N = M . An informal survey of web-based literature suggests that N = M is the
popular choice.
There is another reason, not so good, to choose N = M . It makes across-model comparisons more
likely to be valid when performed without thinking about the issue. Say that you wish to compare
the logit and xtlogit results. Thus, you need to calculate
    BICp = −2 × ln(likelihoodp) + ln(Np) × k
    BICx = −2 × ln(likelihoodx) + ln(Nx) × k
Whatever N you use, you must use the same N in both formulas. Stata’s choice of N = M at
least meets that test.
Example 2
In the above example, using N = M is reasonable. Now let’s look at when using N = M is
wrong, even if popular.
Consider a model fit by stcox. Using N = M is certainly wrong if for no other reason than
M is not even a well-defined number. The same data can be represented by different datasets with
different numbers of observations. For example, in one dataset, there might be 1 observation per
subject. In another, the same subjects could have two records each, the first recording the first half
of the time at risk and the second recording the remaining part. All statistics calculated by Stata on
either dataset would be the same, but M would be different.
Deciding on the right definition, however, is difficult. Viewed one way, N in the Cox regression
case should be the number of risk pools, R, because the Cox regression calculation is made on the
basis of the independent risk pools. Viewed another way, N should be the number of subjects, Nsubj ,
because, even though the likelihood function is based on risk pools, the parameters estimated are at
the subject level.
You can decide which argument you prefer.
For parametric survival models, in single-record data, N = M is unambiguously correct. For
multirecord data, there is an argument for N = M and for N = Nsubj .
The problem of conformable likelihoods
The problem of conformable likelihoods does not concern N . Researchers sometimes use information criteria such as BIC and AIC to make comparisons across models. For that to be valid, the
likelihoods must be conformable; i.e., the likelihoods must all measure the same thing.
It is common to think of the likelihood function as the Pr(data | parameters), but in fact, the
likelihood is
Pr(particular events in the data | parameters)
You must ensure that the events are the same.
For instance, they are not the same in the semiparametric Cox regression and the various parametric
survival models. In Cox regression, the events are, at each failure time, that the subjects observed to
fail in fact failed, given that failures occurred at those times. In the parametric models, the events
are that each subject failed exactly when the subject was observed to fail.
The formula for AIC and BIC is
measure = −2 × ln(likelihood) + complexity
When you are comparing models, if the likelihoods are measuring different events, even if the
models obtain estimates of the same parameters, differences in the information measures are irrelevant.
The first problem does not arise with AIC; the second problem does
Regardless of model, the problem of defining N never arises with AIC because N is not used in
the AIC calculation. AIC uses a constant 2 to weight complexity as measured by k , rather than ln(N ).
For both AIC and BIC, however, the likelihood functions must be conformable; i.e., they must be
measuring the same event.
Calculating BIC correctly
When using BIC to compare results, and especially when using BIC to compare results from different
models, you should think carefully about how N should be defined. Then specify that number by
using the n() option:
. estimates stats full sub, n(74)

    Model      Obs    ll(null)   ll(model)   df        AIC        BIC

    full       102   -45.03321   -20.59083    4   49.18167   58.39793
    sub        102   -45.03321   -27.17516    3   60.35031   67.26251

    Note:  N = 74 used in calculating BIC
Both estimates stats and estat ic allow the n() option; see [R] estimates stats and [R] estat.
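You can also verify the calculation by hand from the displayed results. For the full model above, with ll(model) = −20.59083, df = 4, and n(74), typing

    . display -2*(-20.59083) + ln(74)*4
    58.39792

reproduces the reported BIC of 58.39793 up to rounding of the displayed log likelihood.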
Methods and formulas
AIC and BIC are defined as
    AIC = −2 × ln(likelihood) + 2 × k
    BIC = −2 × ln(likelihood) + ln(N) × k
where k is the model degrees of freedom calculated as the rank of the variance–covariance matrix of
the parameters e(V) and N is the number of observations used in estimation or, more precisely, the
number of independent terms in the likelihood. Operationally, N is defined as e(N) unless the n()
option is specified.
References
Akaike, H. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control 19:
716–723.
Raftery, A. 1995. Bayesian model selection in social research. In Vol. 25 of Sociological Methodology, ed. P. V.
Marsden, 111–163. Oxford: Blackwell.
Sakamoto, Y., M. Ishiguro, and G. Kitagawa. 1986. Akaike Information Criterion Statistics. Dordrecht, The Netherlands:
Reidel.
Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics 6: 461–464.
Also see
[R] estat — Postestimation statistics
[R] estimates stats — Model statistics
Title
binreg — Generalized linear models: Extensions to the binomial family
Syntax

    binreg depvar [indepvars] [if] [in] [weight] [, options]

options                      description

Model
  noconstant                 suppress constant term
  or                         use logit link and report odds ratios
  rr                         use log link and report risk ratios
  hr                         use log-complement link and report health ratios
  rd                         use identity link and report risk differences
  n(# | varname)             use # or varname for number of trials
  exposure(varname)          include ln(varname) in model with coefficient constrained to 1
  offset(varname)            include varname in model with coefficient constrained to 1
  constraints(constraints)   apply specified linear constraints
  collinear                  keep collinear variables
  mu(varname)                use varname as the initial estimate for the mean of depvar
  init(varname)              synonym for mu(varname)

SE/Robust
  vce(vcetype)               vcetype may be eim, robust, cluster clustvar, oim, opg,
                               bootstrap, jackknife, hac kernel, jackknife1, or unbiased
  t(varname)                 variable name corresponding to time
  vfactor(#)                 multiply variance matrix by scalar #
  disp(#)                    quasi-likelihood multiplier
  scale(x2 | dev | #)        set the scale parameter; default is scale(1)

Reporting
  level(#)                   set confidence level; default is level(95)
  coefficients               report nonexponentiated coefficients
  nocnsreport                do not display constraints
  display_options            control spacing and display of omitted variables and base and
                               empty cells

Maximization
  irls                       use iterated, reweighted least-squares optimization; the default
  ml                         use maximum likelihood optimization
  maximize_options           control the maximization process; seldom used
  fisher(#)                  Fisher scoring steps
  search                     search for good starting values

† coeflegend                 display coefficients' legend instead of coefficient table
† coeflegend does not appear in the dialog box.
indepvars may contain factor variables; see [U] 11.4.3 Factor variables.
depvar and indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists.
bootstrap, by, jackknife, mi estimate, rolling, and statsby are allowed; see [U] 11.1.10 Prefix commands.
vce(bootstrap), vce(jackknife), and vce(jackknife1) are not allowed with the mi estimate prefix.
Weights are not allowed with the bootstrap prefix.
aweights are not allowed with the jackknife prefix.
fweights, aweights, iweights, and pweights are allowed; see [U] 11.1.6 weight.
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.
Menu

Statistics > Generalized linear models > GLM for the binomial family
Description
binreg fits generalized linear models for the binomial family. It estimates odds ratios, risk ratios,
health ratios, and risk differences. The available links are
    Option    Implied link      Parameter

    or        logit             odds ratios = exp(β)
    rr        log               risk ratios = exp(β)
    hr        log complement    health ratios = exp(β)
    rd        identity          risk differences = β
Estimates of odds, risk, and health ratios are obtained by exponentiating the appropriate coefficients.
The or option produces the same results as Stata’s logistic command, and or coefficients
yields the same results as the logit command. When no link is specified, or is assumed.
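To see the correspondence with individual-level binary data, you could compare the two commands directly; a sketch with hypothetical variables y (a 0/1 outcome), x1, and x2:

    . binreg y x1 x2, or
    . logistic y x1 x2

Because the logit link is the canonical link for the binomial family, the expected and observed information matrices coincide, so the two commands report the same odds ratios and standard errors.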
Options
Model
noconstant; see [R] estimation options.
or requests the logit link and results in odds ratios if coefficients is not specified.
rr requests the log link and results in risk ratios if coefficients is not specified.
hr requests the log-complement link and results in health ratios if coefficients is not specified.
rd requests the identity link and results in risk differences.
n(# | varname) specifies either a constant integer to use as the denominator for the binomial family
or a variable that holds the denominator for each observation.
exposure(varname), offset(varname), constraints(constraints), collinear; see [R] estimation options. constraints(constraints) and collinear are not allowed with irls.
mu(varname) specifies varname containing an initial estimate for the mean of depvar. This option
can be useful if you encounter convergence difficulties. init(varname) is a synonym.
SE/Robust
vce(vcetype) specifies the type of standard error reported, which includes types that are robust
to some kinds of misspecification, that allow for intragroup correlation, that are derived from
asymptotic theory, and that use bootstrap or jackknife methods; see [R] vce option.
vce(eim), the default, uses the expected information matrix (EIM) for the variance estimator.
binreg also allows the following:
vce(hac kernel # ) specifies that a heteroskedasticity- and autocorrelation-consistent (HAC)
variance estimate be used. HAC refers to the general form for combining weighted matrices to
form the variance estimate. There are three kernels built into binreg. kernel is a user-written
program or one of
nwest | gallant | anderson
If # is not specified, N − 2 is assumed.
vce(jackknife1) specifies that the one-step jackknife estimate of variance be used.
vce(unbiased) specifies that the unbiased sandwich estimate of variance be used.
t(varname) specifies the variable name corresponding to time; see [TS] tsset. binreg does not
always need to know t(), though it does if vce(hac . . . ) is specified. Then you can either
specify the time variable with t(), or you can tsset your data before calling binreg. When the
time variable is required, binreg assumes that the observations are spaced equally over time.
vfactor(#) specifies a scalar by which to multiply the resulting variance matrix. This option
allows users to match output with other packages, which may apply degrees of freedom or other
small-sample corrections to estimates of variance.
disp(#) multiplies the variance of depvar by # and divides the deviance by #. The resulting
distributions are members of the quasilikelihood family.
scale(x2 | dev | #) overrides the default scale parameter. This option is allowed only with Hessian
(information matrix) variance estimates.
By default, scale(1) is assumed for the discrete distributions (binomial, Poisson, and negative
binomial), and scale(x2) is assumed for the continuous distributions (Gaussian, gamma, and
inverse Gaussian).
scale(x2) specifies that the scale parameter be set to the Pearson chi-squared (or generalized
chi-squared) statistic divided by the residual degrees of freedom, which was recommended by
McCullagh and Nelder (1989) as a good general choice for continuous distributions.
scale(dev) sets the scale parameter to the deviance divided by the residual degrees of freedom.
This option provides an alternative to scale(x2) for continuous distributions and overdispersed
or underdispersed discrete distributions.
scale(#) sets the scale parameter to #.
Reporting
level(#); see [R] estimation options.
coefficients displays the nonexponentiated coefficients and corresponding standard errors and
confidence intervals. This option has no effect when the rd option is specified, because it always
presents the nonexponentiated coefficients.
nocnsreport; see [R] estimation options.
display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels; see [R] estimation options.
Maximization
irls requests iterated, reweighted least-squares (IRLS) optimization of the deviance instead of
Newton–Raphson optimization of the log likelihood. This option is the default.
ml requests that optimization be carried out by using Stata’s ml command.
maximize options: technique(algorithm spec), no log, trace, gradient, showstep, hessian,
showtolerance, difficult, iterate(#), tolerance(#), ltolerance(#),
nrtolerance(#), nonrtolerance, from(init specs); see [R] maximize. These options are
seldom used.
Setting the optimization method to ml, with technique() set to something other than BHHH,
changes the vcetype to vce(oim). Specifying technique(bhhh) changes vcetype to vce(opg).
fisher(#) specifies the number of Newton–Raphson steps that should use the Fisher scoring Hessian
or EIM before switching to the observed information matrix (OIM). This option is available only
if ml is specified and is useful only for Newton–Raphson optimization.
search specifies that the command search for good starting values. This option is available only if
ml is specified and is useful only for Newton–Raphson optimization.
The following option is available with binreg but is not shown in the dialog box:
coeflegend; see [R] estimation options.
Remarks
Wacholder (1986) suggests methods for estimating risk ratios and risk differences from prospective
binomial data. These estimates are obtained by selecting the proper link functions in the generalized
linear-model framework. (See Methods and formulas for details; also see [R] glm.)
Example 1
Wacholder (1986) presents an example, using data from Wright et al. (1983), of an investigation
of the relationship between alcohol consumption and the risk of a low-birthweight baby. Covariates
examined included whether the mother smoked (yes or no), mother’s social class (three levels), and
drinking frequency (light, moderate, or heavy). The data for the 18 possible categories determined
by the covariates are illustrated below.
Let’s first describe the data and list a few observations.
. use http://www.stata-press.com/data/r11/binreg
. list

        cat    d     n   alc   smo   soc

  1.      1   11    84     3     1     1
  2.      2    5    79     2     1     1
  3.      3   11   169     1     1     1
  4.      4    6    28     3     2     1
  5.      5    3    13     2     2     1

  6.      6    1    26     1     2     1
  7.      7    4    22     3     1     2
  8.      8    3    25     2     1     2
  9.      9   12   162     1     1     2
 10.     10    4    17     3     2     2

 11.     11    2     7     2     2     2
 12.     12    6    38     1     2     2
 13.     13    0    14     3     1     3
 14.     14    1    18     2     1     3
 15.     15   12    91     1     1     3

 16.     16    7    19     3     2     3
 17.     17    2    18     2     2     3
 18.     18    8    70     1     2     3
Each observation corresponds to one of the 18 covariate structures. The number of low-birthweight
babies from n in each category is given by the d variable.
We begin by estimating risk ratios:
. binreg d i.soc i.alc i.smo, n(n) rr
Iteration 1:  deviance =  14.2879
Iteration 2:  deviance =   13.607
Iteration 3:  deviance = 13.60503
Iteration 4:  deviance = 13.60503
Generalized linear models                          No. of obs      =        18
Optimization     : MQL Fisher scoring              Residual df     =        12
                   (IRLS EIM)                      Scale parameter =         1
Deviance         =  13.6050268                     (1/df) Deviance =  1.133752
Pearson          = 11.51517095                     (1/df) Pearson  =  .9595976
Variance function: V(u) = u*(1-u/n)                [Binomial]
Link function    : g(u) = ln(u/n)                  [Log]
                                                   BIC             = -21.07943

                             EIM
       d   Risk Ratio   Std. Err.       z   P>|z|     [95% Conf. Interval]

     soc
       2     1.340001    .3127382    1.25   0.210       .848098     2.11721
       3     1.349487    .3291488    1.23   0.219      .8366715    2.176619

     alc
       2     1.191157    .3265354    0.64   0.523      .6960276    2.038503
       3     1.974078    .4261751    3.15   0.002      1.293011    3.013884

   2.smo     1.648444     .332875    2.48   0.013      1.109657    2.448836
By default, Stata reports the risk ratios (the exponentiated regression coefficients) estimated by the
model. We can see that the risk ratio comparing heavy drinkers with light drinkers, after adjusting
for smoking and social class, is 1.974078. That is, mothers who drink heavily during their pregnancy
have approximately twice the risk of delivering low-birthweight babies as mothers who are light
drinkers.
The nonexponentiated coefficients can be obtained with the coefficients option:
. binreg d i.soc i.alc i.smo, n(n) rr coefficients
Iteration 1:  deviance =  14.2879
Iteration 2:  deviance =   13.607
Iteration 3:  deviance = 13.60503
Iteration 4:  deviance = 13.60503
Generalized linear models                          No. of obs      =        18
Optimization     : MQL Fisher scoring              Residual df     =        12
                   (IRLS EIM)                      Scale parameter =         1
Deviance         =  13.6050268                     (1/df) Deviance =  1.133752
Pearson          = 11.51517095                     (1/df) Pearson  =  .9595976
Variance function: V(u) = u*(1-u/n)                [Binomial]
Link function    : g(u) = ln(u/n)                  [Log]
                                                   BIC             = -21.07943

                             EIM
       d        Coef.   Std. Err.       z   P>|z|     [95% Conf. Interval]

     soc
       2     .2926702    .2333866    1.25   0.210     -.1647591    .7500994
       3     .2997244    .2439066    1.23   0.219     -.1783238    .7777726

     alc
       2     .1749248     .274133    0.64   0.523      -.362366    .7122156
       3     .6801017    .2158856    3.15   0.002      .2569737     1.10323

   2.smo     .4998317    .2019329    2.48   0.013      .1040505    .8956129
   _cons    -2.764079    .2031606  -13.61   0.000     -3.162266   -2.365891
Risk differences are obtained with the rd option:
. binreg d i.soc i.alc i.smo, n(n) rd
Iteration 1:  deviance = 18.67277
Iteration 2:  deviance = 14.94364
Iteration 3:  deviance =  14.9185
Iteration 4:  deviance = 14.91762
Iteration 5:  deviance = 14.91758
Iteration 6:  deviance = 14.91758
Iteration 7:  deviance = 14.91758
Generalized linear models                          No. of obs      =        18
Optimization     : MQL Fisher scoring              Residual df     =        12
                   (IRLS EIM)                      Scale parameter =         1
Deviance         = 14.91758277                     (1/df) Deviance =  1.243132
Pearson          = 12.60353235                     (1/df) Pearson  =  1.050294
Variance function: V(u) = u*(1-u/n)                [Binomial]
Link function    : g(u) = u/n                      [Identity]
                                                   BIC             = -19.76688

                             EIM
       d   Risk Diff.   Std. Err.       z   P>|z|     [95% Conf. Interval]

     soc
       2     .0263817    .0232124    1.14   0.256     -.0191137    .0718771
       3     .0365553    .0268668    1.36   0.174     -.0161026    .0892132

     alc
       2     .0122539    .0257713    0.48   0.634     -.0382569    .0627647
       3     .0801291    .0302878    2.65   0.008       .020766    .1394921

   2.smo     .0542415    .0270838    2.00   0.045      .0011582    .1073248
   _cons      .059028    .0160693    3.67   0.000      .0275327    .0905232
The risk difference between heavy drinkers and light drinkers is simply the value of the coefficient for
3.alc = 0.0801291. Because the risk differences are obtained directly from the coefficients estimated
by using the identity link, the coefficients option has no effect here.
Health ratios are obtained with the hr option. The health ratios (exponentiated coefficients for the
log-complement link) are reported directly.
. binreg d i.soc i.alc i.smo, n(n) hr
Iteration 1:  deviance = 21.15233
Iteration 2:  deviance = 15.16467
Iteration 3:  deviance = 15.13205
Iteration 4:  deviance = 15.13114
Iteration 5:  deviance = 15.13111
Iteration 6:  deviance = 15.13111
Iteration 7:  deviance = 15.13111
Generalized linear models                          No. of obs      =        18
Optimization     : MQL Fisher scoring              Residual df     =        12
                   (IRLS EIM)                      Scale parameter =         1
Deviance         = 15.13110545                     (1/df) Deviance =  1.260925
Pearson          = 12.84203917                     (1/df) Pearson  =   1.07017
Variance function: V(u) = u*(1-u/n)                [Binomial]
Link function    : g(u) = ln(1-u/n)                [Log complement]
                                                   BIC             = -19.55336

                             EIM
       d           HR   Std. Err.       z   P>|z|     [95% Conf. Interval]

     soc
       2     .9720541     .024858   -1.11   0.268      .9245342    1.022017
       3     .9597182    .0290412   -1.36   0.174      .9044535     1.01836

     alc
       2     .9871517    .0278852   -0.46   0.647      .9339831    1.043347
       3     .9134243    .0325726   -2.54   0.011      .8517631    .9795493

   2.smo     .9409983    .0296125   -1.93   0.053      .8847125    1.000865

(HR) Health ratios
To see the nonexponentiated coefficients, we can specify the coefficients option.
Saved results
binreg, irls saves the following in e():
Scalars
  e(N)               number of observations
  e(k)               number of parameters
  e(k_eq_model)      number of equations in model Wald test
  e(df_m)            model degrees of freedom
  e(df)              residual degrees of freedom
  e(phi)             model scale parameter
  e(disp)            dispersion parameter
  e(bic)             model BIC
  e(N_clust)         number of clusters
  e(deviance)        deviance
  e(deviance_s)      scaled deviance
  e(deviance_p)      Pearson deviance
  e(deviance_ps)     scaled Pearson deviance
  e(dispers)         dispersion
  e(dispers_s)       scaled dispersion
  e(dispers_p)       Pearson dispersion
  e(dispers_ps)      scaled Pearson dispersion
  e(vf)              factor set by vfactor(), 1 if not set
  e(rank)            rank of e(V)
  e(rc)              return code
Macros
  e(cmd)             binreg
  e(cmdline)         command as typed
  e(depvar)          name of dependent variable
  e(eform)           eform() option implied by or, rr, hr, or rd
  e(varfunc)         name of variance function used
  e(varfunct)        Binomial
  e(varfuncf)        variance function
  e(link)            link function used by glm
  e(linkt)           link title
  e(linkf)           link form
  e(m)               number of binomial trials
  e(wtype)           weight type
  e(wexp)            weight expression
  e(title_fl)        family–link title
  e(clustvar)        name of cluster variable
  e(offset)          offset
  e(cons)            noconstant or not set
  e(hac_kernel)      HAC kernel
  e(hac_lag)         HAC lag
  e(vce)             vcetype specified in vce()
  e(vcetype)         title used to label Std. Err.
  e(opt)             type of optimization
  e(opt1)            optimization title, line 1
  e(opt2)            optimization title, line 2
  e(crittype)        optimization criterion
  e(properties)      b V
  e(predict)         program used to implement predict
  e(marginsnotok)    predictions disallowed by margins
  e(asbalanced)      factor variables fvset as asbalanced
  e(asobserved)      factor variables fvset as asobserved

Matrices
  e(b)               coefficient vector
  e(V)               variance–covariance matrix of the estimators
  e(V_modelbased)    model-based variance

Functions
  e(sample)          marks estimation sample
binreg, ml saves the following in e():
Scalars
  e(N)               number of observations
  e(k)               number of parameters
  e(k_eq)            number of equations in e(b)
  e(k_eq_model)      number of equations in model Wald test
  e(k_dv)            number of dependent variables
  e(k_autoCns)       number of base, empty, and omitted constraints
  e(df_m)            model degrees of freedom
  e(df)              residual degrees of freedom
  e(phi)             model scale parameter
  e(aic)             model AIC, if ml
  e(bic)             model BIC
  e(ll)              log likelihood, if ml
  e(N_clust)         number of clusters
  e(chi2)            χ2
  e(p)               significance of model test
  e(deviance)        deviance
  e(deviance_s)      scaled deviance
  e(deviance_p)      Pearson deviance
  e(deviance_ps)     scaled Pearson deviance
  e(dispers)         dispersion
  e(dispers_s)       scaled dispersion
  e(dispers_p)       Pearson dispersion
  e(dispers_ps)      scaled Pearson dispersion
  e(vf)              factor set by vfactor(), 1 if not set
  e(rank)            rank of e(V)
  e(ic)              number of iterations
  e(rc)              return code
  e(converged)       1 if converged, 0 otherwise
Macros
  e(cmd)               binreg
  e(cmdline)           command as typed
  e(depvar)            name of dependent variable
  e(eform)             eform() option implied by or, rr, hr, or rd
  e(varfunc)           name of variance function used
  e(varfunct)          Binomial
  e(varfuncf)          variance function
  e(link)              link function used by glm
  e(linkt)             link title
  e(linkf)             link form
  e(m)                 number of binomial trials
  e(wtype)             weight type
  e(wexp)              weight expression
  e(title)             title in estimation output
  e(title_fl)          family–link title
  e(clustvar)          name of cluster variable
  e(offset)            offset
  e(cons)              noconstant or not set
  e(hac_kernel)        HAC kernel
  e(hac_lag)           HAC lag
  e(chi2type)          LR; type of model χ2 test
  e(vce)               vcetype specified in vce()
  e(vcetype)           title used to label Std. Err.
  e(opt)               type of optimization
  e(opt1)              optimization title, line 1
  e(which)             max or min; whether optimizer is to perform maximization or minimization
  e(ml_method)         type of ml method
  e(user)              name of likelihood-evaluator program
  e(technique)         maximization technique
  e(singularHmethod)   m-marquardt or hybrid; method used when Hessian is singular
  e(crittype)          optimization criterion
  e(properties)        b V
  e(predict)           program used to implement predict
  e(marginsnotok)      predictions disallowed by margins
  e(asbalanced)        factor variables fvset as asbalanced
  e(asobserved)        factor variables fvset as asobserved

Matrices
  e(b)                 coefficient vector
  e(Cns)               constraints matrix
  e(ilog)              iteration log (up to 20 iterations)
  e(gradient)          gradient vector
  e(V)                 variance–covariance matrix of the estimators
  e(V_modelbased)      model-based variance

Functions
  e(sample)            marks estimation sample
Methods and formulas
binreg is implemented as an ado-file.
Let πi be the probability of success for the ith observation, i = 1, . . . , N , and let Xβ be the linear
predictor. The link function relates the covariates of each observation to its respective probability
through the linear predictor.
In logistic regression, the logit link is used:

    ln{π/(1 − π)} = Xβ
The regression coefficient βk represents the change in the logarithm of the odds associated with a
one-unit change in the value of the Xk covariate; thus exp(βk ) is the ratio of the odds associated
with a change of one unit in Xk .
For risk differences, the identity link π = Xβ is used. The regression coefficient βk represents
the risk difference associated with a change of one unit in Xk . When using the identity link, you can
obtain fitted probabilities outside the interval (0, 1). As suggested by Wacholder, at each iteration,
fitted probabilities are checked for range conditions (and put back in range if necessary). For example,
if the identity link results in a fitted probability that is smaller than 1e–4, the probability is replaced
with 1e–4 before the link function is calculated.
A similar adjustment is made for the logarithmic link, which is used for estimating the risk ratio,
ln(π) = Xβ , where exp(βk ) is the risk ratio associated with a change of one unit in Xk , and for
the log-complement link used to estimate the probability of no disease or health, where exp(βk )
represents the “health ratio” associated with a change of one unit in Xk .
This command supports the Huber/White/sandwich estimator of the variance and its clustered
version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, in particular,
in Maximum likelihood estimators and Methods and formulas.
References
Hardin, J. W., and M. A. Cleves. 1999. sbe29: Generalized linear models: Extensions to the binomial family. Stata
Technical Bulletin 50: 21–25. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 140–146. College Station,
TX: Stata Press.
Kleinbaum, D. G., and M. Klein. 2002. Logistic Regression: A Self-Learning Text. 2nd ed. New York: Springer.
McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. 2nd ed. London: Chapman & Hall/CRC.
Wacholder, S. 1986. Binomial regression in GLIM: Estimating risk ratios and risk differences. American Journal of
Epidemiology 123: 174–184.
Wright, J. T., I. G. Barrison, I. G. Lewis, K. D. MacRae, E. J. Waterson, P. J. Toplis, M. G. Gordon, N. F. Morris,
and I. M. Murray-Lyon. 1983. Alcohol consumption, pregnancy and low birthweight. Lancet 1: 663–665.
Also see
[R] binreg postestimation — Postestimation tools for binreg
[R] glm — Generalized linear models
[U] 20 Estimation and postestimation commands
Title
binreg postestimation — Postestimation tools for binreg
Description
The following postestimation commands are available for binreg:
command      description

estat        AIC, BIC, VCE, and estimation sample summary
estimates    cataloging estimation results
lincom       point estimates, standard errors, testing, and inference for linear combinations
               of coefficients
linktest     link test for model specification
margins      marginal means, predictive margins, marginal effects, and average marginal effects
nlcom        point estimates, standard errors, testing, and inference for nonlinear combinations
               of coefficients
predict      predictions, residuals, influence statistics, and other diagnostic measures
predictnl    point estimates, standard errors, testing, and inference for generalized predictions
test         Wald tests of simple and composite linear hypotheses
testnl       Wald tests of nonlinear hypotheses
See the corresponding entries in the Base Reference Manual for details.
Syntax for predict

    predict [type] newvar [if] [in] [, statistic options]

statistic      description

Main
  mu           expected value of y; the default
  xb           linear prediction η = xβ̂
  eta          synonym for xb
  stdp         standard error of the linear prediction
  anscombe     Anscombe (1953) residuals
  cooksd       Cook's distance
  deviance     deviance residuals
  hat          diagonals of the "hat" matrix as an analog to simple linear regression
  likelihood   weighted average of the standardized deviance and standard Pearson residuals
  pearson      Pearson residuals
  response     differences between the observed and fitted outcomes
  score        first derivative of the log likelihood with respect to x_j β
  working      working residuals

options        description

Options
  nooffset     modify calculations to ignore the offset variable
  standardized multiply residual by the factor (1 − h)^(−1/2)
  studentized  multiply residual by one over the square root of the estimated scale parameter
  modified     modify denominator of residual to be a reasonable estimate of the variance of
                 depvar
  adjusted     adjust deviance residual to make the convergence to the limiting normal
                 distribution faster

These statistics are available both in and out of sample; type predict . . . if e(sample) . . . if wanted
only for the estimation sample.
Menu

Statistics > Postestimation > Predictions, residuals, etc.
Options for predict
Main
mu, the default, specifies that predict calculate the expected value of y, equal to g⁻¹(xβ̂)
[ng⁻¹(xβ̂) for the binomial family].

xb calculates the linear prediction η = xβ̂.
eta is a synonym for xb.
stdp calculates the standard error of the linear prediction.
anscombe calculates the Anscombe (1953) residuals to produce residuals that closely follow a normal
distribution.
cooksd calculates Cook’s distance, which measures the aggregate change in the estimated coefficients
when each observation is left out of the estimation.
deviance calculates the deviance residuals, which are recommended by McCullagh and Nelder (1989)
and others as having the best properties for examining goodness of fit of a GLM. They are
approximately normally distributed if the model is correct and may be plotted against the fitted
values or against a covariate to inspect the model’s fit. Also see the pearson option below.
hat calculates the diagonals of the “hat” matrix as an analog to simple linear regression.
likelihood calculates a weighted average of the standardized deviance and standardized Pearson
(described below) residuals.
pearson calculates the Pearson residuals, which often have markedly skewed distributions for
nonnormal family distributions. Also see the deviance option above.
response calculates the differences between the observed and fitted outcomes.
score calculates the equation-level score, ∂ ln L/∂(xj β).
working calculates the working residuals, which are response residuals weighted according to the
derivative of the link function.
Options
nooffset is relevant only if you specified offset(varname) for binreg. It modifies the calculations
made by predict so that they ignore the offset variable; the linear prediction is treated as xj b
rather than as xj b + offsetj .
standardized requests that the residual be multiplied by the factor (1 − h)^(−1/2), where h is the
diagonal of the hat matrix. This step is done to take into account the correlation between depvar
and its predicted value.

studentized requests that the residual be multiplied by one over the square root of the estimated
scale parameter.

modified requests that the denominator of the residual be modified to be a reasonable estimate
of the variance of depvar. The base residual is multiplied by the factor (k/w)^(−1/2), where k is
either one or the user-specified dispersion parameter and w is the specified weight (or one if left
unspecified).
adjusted adjusts the deviance residual to make the convergence to the limiting normal distribution faster. The adjustment adds to the deviance residual a higher-order term that depends on the variance function family. This option is allowed only when deviance is specified.
Methods and formulas
All postestimation commands listed above are implemented as ado-files.
References
Anscombe, F. J. 1953. Contribution of discussion paper by H. Hotelling “New light on the correlation coefficient and
its transforms”. Journal of the Royal Statistical Society, Series B 15: 229–230.
McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. 2nd ed. London: Chapman & Hall/CRC.
Also see
[R] binreg — Generalized linear models: Extensions to the binomial family
[U] 20 Estimation and postestimation commands
Title
biprobit — Bivariate probit regression
Syntax
Bivariate probit model
    biprobit depvar1 depvar2 [varlist] [if] [in] [weight] [, options]

Seemingly unrelated bivariate probit model
    biprobit equation1 equation2 [if] [in] [weight] [, su_options]

where equation1 and equation2 are specified as
    ([eqname:] depvar [=] [varlist] [, noconstant offset(varname)])
options                     description
Model
noconstant                  suppress constant term
partial                     fit partial observability model
offset1(varname)            offset variable for first equation
offset2(varname)            offset variable for second equation
constraints(constraints)    apply specified linear constraints
collinear                   keep collinear variables

SE/Robust
vce(vcetype)                vcetype may be oim, robust, cluster clustvar, opg, bootstrap,
                              or jackknife

Reporting
level(#)                    set confidence level; default is level(95)
noskip                      perform likelihood-ratio test
nocnsreport                 do not display constraints
display_options             control spacing and display of omitted variables and base and
                              empty cells

Maximization
maximize_options            control the maximization process; seldom used
† coeflegend                display coefficients' legend instead of coefficient table
su_options                  description
Model
partial                     fit partial observability model
constraints(constraints)    apply specified linear constraints
collinear                   keep collinear variables

SE/Robust
vce(vcetype)                vcetype may be oim, robust, cluster clustvar, opg, bootstrap,
                              or jackknife

Reporting
level(#)                    set confidence level; default is level(95)
noskip                      perform likelihood-ratio test
nocnsreport                 do not display constraints
display_options             control spacing and display of omitted variables and base and
                              empty cells

Maximization
maximize_options            control the maximization process; seldom used
† coeflegend                display coefficients' legend instead of coefficient table

† coeflegend does not appear in the dialog box.
indepvars may contain factor variables; see [U] 11.4.3 Factor variables.
depvar1 , depvar2 , varlist, and depvar may contain time-series operators; see [U] 11.4.4 Time-series varlists.
bootstrap, by, jackknife, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands.
Weights are not allowed with the bootstrap prefix.
vce(), noskip, and weights are not allowed with the svy prefix.
pweights, fweights, and iweights are allowed; see [U] 11.1.6 weight.
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.
Menu
biprobit
    Statistics > Binary outcomes > Bivariate probit regression

seemingly unrelated biprobit
    Statistics > Binary outcomes > Seemingly unrelated bivariate probit regression
Description
biprobit fits maximum-likelihood two-equation probit models—either a bivariate probit or a
seemingly unrelated probit (limited to two equations).
Options
Model
noconstant; see [R] estimation options.
partial specifies that the partial observability model be fit. This particular model commonly has
poor convergence properties, so we recommend that you use the difficult option if you want
to fit the Poirier partial observability model; see [R] ml.
This model computes the product of the two dependent variables so that you do not have to replace
each with the product.
offset1(varname), offset2(varname), constraints(constraints), collinear; see [R] estimation options.
SE/Robust
vce(vcetype) specifies the type of standard error reported, which includes types that are derived
from asymptotic theory, that are robust to some kinds of misspecification, that allow for intragroup
correlation, and that use bootstrap or jackknife methods; see [R] vce_option.
Reporting
level(#); see [R] estimation options.
noskip specifies that a full maximum-likelihood model with only a constant for the regression equation
be fit. This model is not displayed but is used as the base model to compute a likelihood-ratio test
for the model test statistic displayed in the estimation header. By default, the overall model test
statistic is an asymptotically equivalent Wald test of all the parameters in the regression equation
being zero (except the constant). For many models, this option can substantially increase estimation
time.
nocnsreport; see [R] estimation options.
display_options: noomitted, vsquish, noemptycells, baselevels, allbaselevels; see [R] estimation options.
Maximization
maximize_options: difficult, technique(algorithm_spec), iterate(#), [no]log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, from(init_specs); see [R] maximize. These options are seldom used.
Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg).
The following option is available with biprobit but is not shown in the dialog box:
coeflegend; see [R] estimation options.
Remarks
For a good introduction to bivariate probit models, see Greene (2008, 817–826) and Pindyck and Rubinfeld (1998). Poirier (1980) explains the partial observability model. Van de Ven and Van Praag (1981) explain the probit model with sample selection; see [R] heckprob for details.
Example 1
We use the data from Pindyck and Rubinfeld (1998, 332). In this dataset, the variables are
whether children attend private school (private), number of years the family has been at the present
residence (years), log of property tax (logptax), log of income (loginc), and whether the head of
the household voted for an increase in property taxes (vote).
We wish to model the bivariate outcomes of whether children attend private school and whether
the head of the household voted for an increase in property tax based on the other covariates.
. use http://www.stata-press.com/data/r11/school
. biprobit private vote years logptax loginc
Fitting comparison equation 1:
Iteration 0:   log likelihood = -31.967097
Iteration 1:   log likelihood = -31.452424
Iteration 2:   log likelihood = -31.448958
Iteration 3:   log likelihood = -31.448958
Fitting comparison equation 2:
Iteration 0:   log likelihood = -63.036914
Iteration 1:   log likelihood = -58.534843
Iteration 2:   log likelihood = -58.497292
Iteration 3:   log likelihood = -58.497288
Comparison:    log likelihood = -89.946246
Fitting full model:
Iteration 0:   log likelihood = -89.946246
Iteration 1:   log likelihood = -89.258897
Iteration 2:   log likelihood = -89.254028
Iteration 3:   log likelihood = -89.254028

Bivariate probit regression                       Number of obs   =         95
                                                  Wald chi2(6)    =       9.59
Log likelihood = -89.254028                       Prob > chi2     =     0.1431

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
private      |
       years |  -.0118884   .0256778    -0.46   0.643    -.0622159    .0384391
     logptax |  -.1066962   .6669782    -0.16   0.873    -1.413949    1.200557
      loginc |   .3762037   .5306484     0.71   0.478     -.663848    1.416255
       _cons |  -4.184694   4.837817    -0.86   0.387    -13.66664    5.297253
-------------+----------------------------------------------------------------
vote         |
       years |  -.0168561   .0147834    -1.14   0.254    -.0458309    .0121188
     logptax |  -1.288707   .5752266    -2.24   0.025    -2.416131   -.1612839
      loginc |    .998286   .4403565     2.27   0.023     .1352031    1.861369
       _cons |  -.5360573   4.068509    -0.13   0.895    -8.510188    7.438073
-------------+----------------------------------------------------------------
     /athrho |  -.2764525   .2412099    -1.15   0.252    -.7492153    .1963102
-------------+----------------------------------------------------------------
         rho |  -.2696186   .2236753                     -.6346806    .1938267
------------------------------------------------------------------------------
Likelihood-ratio test of rho=0:     chi2(1) = 1.38444     Prob > chi2 = 0.2393
The output shows several iteration logs. The first iteration log corresponds to running the univariate
probit model for the first equation, and the second log corresponds to running the univariate probit
for the second equation. If ρ = 0, the sum of the log likelihoods from these two models will equal the
log likelihood of the bivariate probit model; this sum is printed in the iteration log as the comparison
log likelihood.
The final iteration log is for fitting the full bivariate probit model. A likelihood-ratio test of the
log likelihood for this model and the comparison log likelihood is presented at the end of the output.
If we had specified the vce(robust) option, this test would be presented as a Wald test instead of
as a likelihood-ratio test.
We could have fit the same model by using the seemingly unrelated syntax as
. biprobit (private=years logptax loginc) (vote=years logptax loginc)
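The seemingly unrelated syntax is also the way to specify different covariates in each equation, which the first syntax does not allow. A hypothetical variant of the model above that drops logptax from the private equation:

. biprobit (private=years loginc) (vote=years logptax loginc)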
Saved results
biprobit saves the following in e():
Scalars
  e(N)              number of observations
  e(k)              number of parameters
  e(k_eq)           number of equations
  e(k_aux)          number of auxiliary parameters
  e(k_eq_model)     number of equations in model Wald test
  e(k_dv)           number of dependent variables
  e(k_autoCns)      number of base, empty, and omitted constraints
  e(df_m)           model degrees of freedom
  e(ll)             log likelihood
  e(ll_0)           log likelihood, constant-only model (noskip only)
  e(ll_c)           log likelihood, comparison model
  e(N_clust)        number of clusters
  e(chi2)           χ²
  e(chi2_c)         χ² for comparison test
  e(p)              significance
  e(rho)            ρ
  e(rank)           rank of e(V)
  e(rank0)          rank of e(V) for constant-only model
  e(ic)             number of iterations
  e(rc)             return code
  e(converged)      1 if converged, 0 otherwise
Macros
  e(cmd)               biprobit
  e(cmdline)           command as typed
  e(depvar)            names of dependent variables
  e(wtype)             weight type
  e(wexp)              weight expression
  e(title)             title in estimation output
  e(clustvar)          name of cluster variable
  e(offset1)           offset for first equation
  e(offset2)           offset for second equation
  e(chi2type)          Wald or LR; type of model χ² test
  e(chi2_ct)           Wald or LR; type of model χ² test corresponding to e(chi2_c)
  e(vce)               vcetype specified in vce()
  e(vcetype)           title used to label Std. Err.
  e(diparm#)           display transformed parameter #
  e(opt)               type of optimization
  e(which)             max or min; whether optimizer is to perform maximization or minimization
  e(ml_method)         type of ml method
  e(user)              name of likelihood-evaluator program
  e(technique)         maximization technique
  e(singularHmethod)   m-marquardt or hybrid; method used when Hessian is singular
  e(crittype)          optimization criterion
  e(properties)        b V
  e(predict)           program used to implement predict
  e(asbalanced)        factor variables fvset as asbalanced
  e(asobserved)        factor variables fvset as asobserved
Matrices
  e(b)                 coefficient vector
  e(Cns)               constraints matrix
  e(ilog)              iteration log (up to 20 iterations)
  e(gradient)          gradient vector
  e(V)                 variance–covariance matrix of the estimators
  e(V_modelbased)      model-based variance

Functions
  e(sample)            marks estimation sample
Methods and formulas
biprobit is implemented as an ado-file.
The log likelihood, $\ln L$, is given by

$$\xi_j^{\beta} = x_j\beta + \mathrm{offset}_j^{\beta}$$

$$\xi_j^{\gamma} = z_j\gamma + \mathrm{offset}_j^{\gamma}$$

$$q_{1j} = \begin{cases} \phantom{-}1 & \text{if } y_{1j} \neq 0 \\ -1 & \text{otherwise} \end{cases} \qquad q_{2j} = \begin{cases} \phantom{-}1 & \text{if } y_{2j} \neq 0 \\ -1 & \text{otherwise} \end{cases}$$

$$\rho_j^{*} = q_{1j}\,q_{2j}\,\rho$$

$$\ln L = \sum_{j=1}^{n} w_j \ln \Phi_2\!\left( q_{1j}\xi_j^{\beta},\ q_{2j}\xi_j^{\gamma},\ \rho_j^{*} \right)$$
where $\Phi_2()$ is the cumulative bivariate normal distribution function (with mean $[\,0\ \ 0\,]'$) and $w_j$ is an optional weight for observation $j$. This derivation assumes that

$$y_{1j}^{*} = x_j\beta + \epsilon_{1j} + \mathrm{offset}_j^{\beta}$$

$$y_{2j}^{*} = z_j\gamma + \epsilon_{2j} + \mathrm{offset}_j^{\gamma}$$

$$E(\epsilon_1) = E(\epsilon_2) = 0$$

$$\mathrm{Var}(\epsilon_1) = \mathrm{Var}(\epsilon_2) = 1$$

$$\mathrm{Cov}(\epsilon_1, \epsilon_2) = \rho$$

where $y_{1j}^{*}$ and $y_{2j}^{*}$ are the unobserved latent variables; instead, we observe only $y_{ij} = 1$ if $y_{ij}^{*} > 0$ and $y_{ij} = 0$ otherwise (for $i = 1, 2$).
In the maximum likelihood estimation, $\rho$ is not directly estimated; rather, $\mathrm{atanh}\,\rho$ is:

$$\mathrm{atanh}\,\rho = \frac{1}{2}\ln\!\left(\frac{1+\rho}{1-\rho}\right)$$
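The reported rho is therefore the hyperbolic tangent of the reported /athrho, which can be checked directly from the output of example 1; a minimal check, using the displayed digits:

. display tanh(-.2764525)

returns approximately -.2696186, the value reported for rho.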
From the form of the likelihood, if ρ = 0, then the log likelihood for the bivariate probit models
is equal to the sum of the log likelihoods of the two univariate probit models. A likelihood-ratio test
may therefore be performed by comparing the likelihood of the full bivariate model with the sum of
the log likelihoods for the univariate probit models.
This command supports the Huber/White/sandwich estimator of the variance and its clustered
version using vce(robust) and vce(cluster clustvar), respectively. See [P] robust, particularly Maximum likelihood estimators and Methods and formulas.
biprobit also supports estimation with survey data. For details on VCEs with survey data, see
[SVY] variance estimation.
References
De Luca, G. 2008. SNP and SML estimation of univariate and bivariate binary-choice models. Stata Journal 8: 190–220.
Greene, W. H. 2008. Econometric Analysis. 6th ed. Upper Saddle River, NJ: Prentice–Hall.
Hardin, J. W. 1996. sg61: Bivariate probit models. Stata Technical Bulletin 33: 15–20. Reprinted in Stata Technical
Bulletin Reprints, vol. 6, pp. 152–158. College Station, TX: Stata Press.
Heckman, J. 1979. Sample selection bias as a specification error. Econometrica 47: 153–161.
Pindyck, R. S., and D. L. Rubinfeld. 1998. Econometric Models and Economic Forecasts. 4th ed. New York:
McGraw–Hill.
Poirier, D. J. 1980. Partial observability in bivariate probit models. Journal of Econometrics 12: 209–217.
Van de Ven, W. P. M. M., and B. M. S. Van Praag. 1981. The demand for deductibles in private health insurance: A probit model with sample selection. Journal of Econometrics 17: 229–252.
Also see
[R] biprobit postestimation — Postestimation tools for biprobit
[R] mprobit — Multinomial probit regression
[R] probit — Probit regression
[SVY] svy estimation — Estimation commands for survey data
[U] 20 Estimation and postestimation commands
Title
biprobit postestimation — Postestimation tools for biprobit
Description
The following postestimation commands are available for biprobit:
command        description
estat          AIC, BIC, VCE, and estimation sample summary
estat (svy)    postestimation statistics for survey data
estimates      cataloging estimation results
lincom         point estimates, standard errors, testing, and inference for linear
                 combinations of coefficients
lrtest*        likelihood-ratio test
margins        marginal means, predictive margins, marginal effects, and average
                 marginal effects
nlcom          point estimates, standard errors, testing, and inference for nonlinear
                 combinations of coefficients
predict        predictions, residuals, influence statistics, and other diagnostic measures
predictnl      point estimates, standard errors, testing, and inference for generalized
                 predictions
suest          seemingly unrelated estimation
test           Wald tests of simple and composite linear hypotheses
testnl         Wald tests of nonlinear hypotheses

* lrtest is not appropriate with svy estimation results.
See the corresponding entries in the Base Reference Manual for details, but see [SVY] estat for
details about estat (svy).
Syntax for predict
    predict [type] newvar [if] [in] [, statistic nooffset]

    predict [type] {stub* | newvar_eq1 newvar_eq2 newvar_athrho} [if] [in], scores
statistic    description
Main
p11          Φ₂(x_jb, z_jg, ρ), predicted probability Pr(y_1j = 1, y_2j = 1); the default
p10          Φ₂(x_jb, −z_jg, −ρ), predicted probability Pr(y_1j = 1, y_2j = 0)
p01          Φ₂(−x_jb, z_jg, −ρ), predicted probability Pr(y_1j = 0, y_2j = 1)
p00          Φ₂(−x_jb, −z_jg, ρ), predicted probability Pr(y_1j = 0, y_2j = 0)
pmarg1       Φ(x_jb), marginal success probability for equation 1
pmarg2       Φ(z_jg), marginal success probability for equation 2
pcond1       Φ₂(x_jb, z_jg, ρ)/Φ(z_jg), conditional probability of success for equation 1
pcond2       Φ₂(x_jb, z_jg, ρ)/Φ(x_jb), conditional probability of success for equation 2
xb1          x_jb, linear prediction for equation 1
xb2          z_jg, linear prediction for equation 2
stdp1        standard error of the linear prediction for equation 1
stdp2        standard error of the linear prediction for equation 2

where Φ() is the standard normal-distribution function and Φ₂() is the bivariate standard normal-distribution function.
These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Menu
Statistics > Postestimation > Predictions, residuals, etc.
Options for predict
Main
p11, the default, calculates the bivariate predicted probability Pr(y1j = 1, y2j = 1).
p10 calculates the bivariate predicted probability Pr(y1j = 1, y2j = 0).
p01 calculates the bivariate predicted probability Pr(y1j = 0, y2j = 1).
p00 calculates the bivariate predicted probability Pr(y1j = 0, y2j = 0).
pmarg1 calculates the univariate (marginal) predicted probability of success Pr(y1j = 1).
pmarg2 calculates the univariate (marginal) predicted probability of success Pr(y2j = 1).
pcond1 calculates the conditional (on success in equation 2) predicted probability of success
Pr(y1j = 1, y2j = 1)/Pr(y2j = 1).
pcond2 calculates the conditional (on success in equation 1) predicted probability of success
Pr(y1j = 1, y2j = 1)/Pr(y1j = 1).
xb1 calculates the probit linear prediction xj b.
xb2 calculates the probit linear prediction zj g.
stdp1 calculates the standard error of the linear prediction for equation 1.
stdp2 calculates the standard error of the linear prediction for equation 2.
nooffset is relevant only if you specified offset1(varname) or offset2(varname) for biprobit. It modifies the calculations made by predict so that they ignore the offset variables; the linear predictions are treated as x_jb rather than as x_jb + offset1_j and as z_jg rather than as z_jg + offset2_j.
scores calculates equation-level score variables.
The first new variable will contain ∂ ln L/∂(xj β).
The second new variable will contain ∂ ln L/∂(zj γ).
The third new variable will contain ∂ ln L/∂(atanh ρ).
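The identities in the statistic table above can be verified numerically. A minimal sketch, using the school data from example 1 of [R] biprobit; the new variable names p11, pm2, and pc1 are arbitrary, and the tolerance allows for float storage:

. use http://www.stata-press.com/data/r11/school, clear
. biprobit private vote years logptax loginc
. predict p11, p11
. predict pm2, pmarg2
. predict pc1, pcond1
. assert abs(pc1 - p11/pm2) < 1e-5

The assert succeeds silently because pcond1 is, by definition, p11 divided by pmarg2.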
Methods and formulas
All postestimation commands listed above are implemented as ado-files.
Also see
[R] biprobit — Bivariate probit regression
[U] 20 Estimation and postestimation commands
Title
bitest — Binomial probability test
Syntax
Binomial probability test
    bitest varname == #p [if] [in] [weight] [, detail]

Immediate form of binomial probability test
    bitesti #N #succ #p [, detail]

by is allowed with bitest; see [D] by.
fweights are allowed with bitest; see [U] 11.1.6 weight.
Menu
bitest
    Statistics > Summaries, tables, and tests > Classical tests of hypotheses > Binomial probability test

bitesti
    Statistics > Summaries, tables, and tests > Classical tests of hypotheses > Binomial probability test calculator
Description
bitest performs exact hypothesis tests for binomial random variables. The null hypothesis is that the probability of a success on a trial is #p. The total number of trials is the number of nonmissing values of varname (in bitest) or #N (in bitesti). The number of observed successes is the number of 1s in varname (in bitest) or #succ (in bitesti). varname must contain only 0s, 1s, and missing.
bitesti is the immediate form of bitest; see [U] 19 Immediate commands for a general
introduction to immediate commands.
Option
Advanced
detail shows the probability of the observed number of successes, k_obs; the probability of the number of successes on the opposite tail of the distribution that is used to compute the two-sided p-value, k_opp; and the probability of the point next to k_opp. This information can be safely ignored. See the technical note below for details.
Remarks
Remarks are presented under the following headings:
bitest
bitesti
bitest
Example 1
We test 15 university students for high levels of one measure of visual quickness which, from
other evidence, we believe is present in 30% of the nonuniversity population. Included in our data is
quick, taking on the values 1 (“success”) or 0 (“failure”) depending on the outcome of the test.
. use http://www.stata-press.com/data/r11/quick
. bitest quick == 0.3
        Variable |        N   Observed k   Expected k   Assumed p   Observed p
-----------------+-------------------------------------------------------------
           quick |       15            7          4.5     0.30000      0.46667

  Pr(k >= 7)            = 0.131143   (one-sided test)
  Pr(k <= 7)            = 0.949987   (one-sided test)
  Pr(k <= 1 or k >= 7)  = 0.166410   (two-sided test)
The first part of the output reveals that, assuming a true probability of success of 0.3, the expected
number of successes is 4.5 and that we observed seven. Said differently, the assumed frequency under
the null hypothesis H0 is 0.3, and the observed frequency is 0.47.
The first line under the table is a one-sided test; it is the probability of observing seven or
more successes conditional on p = 0.3. It is a test of H0 : p = 0.3 versus the alternative hypothesis
HA : p > 0.3. Said in English, the alternative hypothesis is that more than 30% of university students
score at high levels on this test of visual quickness. The p-value for this hypothesis test is 0.13.
The second line under the table is a one-sided test of H0 versus the opposite alternative hypothesis
HA : p < 0.3.
The third line is the two-sided test. It is a test of H0 versus the alternative hypothesis HA: p ≠ 0.3.
Technical note
The p-value of a hypothesis test is the probability (calculated assuming H0 is true) of observing
any outcome as extreme or more extreme than the observed outcome, with extreme meaning in the
direction of the alternative hypothesis. In example 1, the outcomes k = 8, 9, . . . , 15 are clearly
“more extreme” than the observed outcome kobs = 7 when considering the alternative hypothesis
HA : p 6= 0.3. However, outcomes with only a few successes are also in the direction of this alternative
hypothesis. For two-sided hypotheses, outcomes with k successes are considered “as extreme or more
extreme” than the observed outcome kobs if Pr(k) ≤ Pr(kobs ). Here Pr(k = 0) and Pr(k = 1) are
both less than Pr(k = 7), so they are included in the two-sided p-value.
The detail option allows you to see the probability (assuming that H0 is true) of the observed
successes (k = 7) and the probability of the boundary point (k = 1) of the opposite tail used for the
two-sided p-value.
. bitest quick == 0.3, detail

        Variable |        N   Observed k   Expected k   Assumed p   Observed p
-----------------+-------------------------------------------------------------
           quick |       15            7          4.5     0.30000      0.46667

  Pr(k >= 7)            = 0.131143   (one-sided test)
  Pr(k <= 7)            = 0.949987   (one-sided test)
  Pr(k <= 1 or k >= 7)  = 0.166410   (two-sided test)

  Pr(k == 7)            = 0.081130   (observed)
  Pr(k == 2)            = 0.091560
  Pr(k == 1)            = 0.030520   (opposite extreme)
Also shown is the probability of the point next to the boundary point. This probability, namely,
Pr(k = 2) = 0.092, is certainly close to the probability of the observed outcome Pr(k = 7) = 0.081,
so some people might argue that k = 2 should be included in the two-sided p-value. Statisticians
(at least some we know) would reply that the p-value is a precisely defined concept and that this
is an arbitrary “fuzzification” of its definition. When you compute exact p-values according to the
precise definition of a p-value, your type I error is never more than what you say it is — so no one
can criticize you for being anticonservative. Including the point k = 2 is being overly conservative
because it makes the p-value larger yet. But it is your choice; being overly conservative, at least in
statistics, is always safe. Know that bitest and bitesti always keep to the precise definition of
a p-value, so if you wish to include this extra point, you must do so by hand or by using the r()
saved results; see Saved results below.
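For instance, after the detail run above, the more conservative two-sided p-value that also includes k = 2 can be computed from the saved results:

. quietly bitest quick == 0.3, detail
. display r(p) + r(P_noppk)

which gives 0.166410 + 0.091560 = 0.257970, since r(P_noppk) holds the probability of the point next to the opposite extreme.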
bitesti
Example 2
The binomial test is a function of two statistics and one parameter: N , the number of observations;
kobs , the number of observed successes; and p, the assumed probability of a success on a trial. For
instance, in a city of N = 2,500,000, we observe kobs = 36 cases of a particular disease when the
population rate for the disease is p = 0.00001.
. bitesti 2500000 36 .00001

        N   Observed k   Expected k   Assumed p   Observed p
------------------------------------------------------------
  2500000           36           25     0.00001      0.00001

  Pr(k >= 36)            = 0.022458   (one-sided test)
  Pr(k <= 36)            = 0.985448   (one-sided test)
  Pr(k <= 14 or k >= 36) = 0.034859   (two-sided test)
Example 3
Boice Jr. and Monson (1977) present data on breast cancer cases and person-years of observations
for women with tuberculosis who were repeatedly exposed to multiple x-ray fluoroscopies and for
women with tuberculosis who were not. The data are
                 Breast cancer   Person-years
Exposed                     41         28,010
Not exposed                 15         19,017
Total                       56         47,027
We can thus test whether x-ray fluoroscopic examinations are associated with breast cancer; the
assumed rate of exposure is p = 28010/47027.
. bitesti 56 41 28010/47027

        N   Observed k   Expected k   Assumed p   Observed p
------------------------------------------------------------
       56           41     33.35446     0.59562      0.73214

  Pr(k >= 41)            = 0.023830   (one-sided test)
  Pr(k <= 41)            = 0.988373   (one-sided test)
  Pr(k <= 25 or k >= 41) = 0.040852   (two-sided test)
Saved results
bitest and bitesti save the following in r():

Scalars
  r(N)         number N of trials
  r(P_p)       assumed probability p of success
  r(k)         observed number k of successes
  r(p_l)       lower one-sided p-value
  r(p_u)       upper one-sided p-value
  r(p)         two-sided p-value
  r(k_opp)     opposite extreme k
  r(P_k)       probability of observed k (detail only)
  r(P_oppk)    probability of opposite extreme k (detail only)
  r(k_nopp)    k next to opposite extreme (detail only)
  r(P_noppk)   probability of k next to opposite extreme (detail only)
Methods and formulas
bitest and bitesti are implemented as ado-files.
Let N, k_obs, and p be, respectively, the number of observations, the observed number of successes, and the assumed probability of success on a trial. The expected number of successes is Np, and the observed probability of success on a trial is k_obs/N.
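In example 2, for instance, the expected number of successes is Np = 2,500,000 × 0.00001 = 25, the Expected k shown in the output, and the observed probability of success is k_obs/N = 36/2,500,000 ≈ 0.0000144, displayed rounded as 0.00001.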
bitest and bitesti compute exact p-values based on the binomial distribution. The upper one-sided p-value is

$$\Pr(k \geq k_{\mathrm{obs}}) = \sum_{m=k_{\mathrm{obs}}}^{N} \binom{N}{m} p^{m}(1-p)^{N-m}$$

The lower one-sided p-value is

$$\Pr(k \leq k_{\mathrm{obs}}) = \sum_{m=0}^{k_{\mathrm{obs}}} \binom{N}{m} p^{m}(1-p)^{N-m}$$

If $k_{\mathrm{obs}} \geq Np$, the two-sided p-value is $\Pr(k \leq k_{\mathrm{opp}} \text{ or } k \geq k_{\mathrm{obs}})$, where $k_{\mathrm{opp}}$ is the largest number $\leq Np$ such that $\Pr(k = k_{\mathrm{opp}}) \leq \Pr(k = k_{\mathrm{obs}})$. If $k_{\mathrm{obs}} < Np$, the two-sided p-value is $\Pr(k \leq k_{\mathrm{obs}} \text{ or } k \geq k_{\mathrm{opp}})$, where $k_{\mathrm{opp}}$ is the smallest number $\geq Np$ such that $\Pr(k = k_{\mathrm{opp}}) \leq \Pr(k = k_{\mathrm{obs}})$.
References
Boice Jr., J. D., and R. R. Monson. 1977. Breast cancer in women after repeated fluoroscopic examinations of the
chest. Journal of the National Cancer Institute 59: 823–832.
Hoel, P. G. 1984. Introduction to Mathematical Statistics. 5th ed. New York: Wiley.
Also see
[R] ci — Confidence intervals for means, proportions, and counts
[R] prtest — One- and two-sample tests of proportions
Title
bootstrap — Bootstrap sampling and estimation
Syntax
    bootstrap [exp_list] [, options eform_option] : command
options                   description
Main
reps(#)                   perform # bootstrap replications; default is reps(50)

Options
strata(varlist)           variables identifying strata
size(#)                   draw samples of size #; default is _N
cluster(varlist)          variables identifying resampling clusters
idcluster(newvar)         create new cluster ID variable
saving(filename, ...)     save results to filename; save statistics in double precision;
                            save results to filename every # replications
bca                       compute acceleration for BCa confidence intervals
mse                       use MSE formula for variance estimation

Reporting
level(#)                  set confidence level; default is level(95)
notable                   suppress table of results
noheader                  suppress table header
nolegend                  suppress table legend
verbose                   display the full table legend
nodots                    suppress the replication dots
noisily                   display any output from command
trace                     trace the command
title(text)               use text as title for bootstrap results
display_options           control spacing and display of omitted variables and base and
                            empty cells

Advanced
nodrop                    do not drop observations
nowarn                    do not warn when e(sample) is not set
force                     do not check for weights or svy commands; seldom used
reject(exp)               identify invalid results
seed(#)                   set random-number seed to #

† group(varname)          ID variable for groups within cluster()
† jackknifeopts(jkopts)   options for jackknife
† coeflegend              display coefficients' legend instead of coefficient table
† group(), jackknifeopts(), and coeflegend do not appear in the dialog box.
weights are not allowed in command.
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.
exp_list contains     (name: elist)
                      elist
                      eexp
elist contains        newvar = (exp)
                      (exp)
eexp is               specname
                      [eqno]specname
specname is           _b
                      _b[]
                      _se
                      _se[]
eqno is               ##
                      name

exp is a standard Stata expression; see [U] 13 Functions and expressions.

Distinguish between [ ], which are to be typed, and [ ], which indicate optional arguments.
Menu
Statistics > Resampling > Bootstrap estimation
Description
bootstrap performs bootstrap estimation. Typing
. bootstrap exp_list, reps(#): command

executes command multiple times, bootstrapping the statistics in exp_list by resampling observations (with replacement) from the data in memory # times. This method is commonly referred to as the nonparametric bootstrap.
command defines the statistical command to be executed. Most Stata commands and user-written
programs can be used with bootstrap, as long as they follow standard Stata syntax; see [U] 11 Language syntax. If the bca option is supplied, command must also work with jackknife; see
[R] jackknife. The by prefix may not be part of command.
exp_list specifies the statistics to be collected from the execution of command. If command changes the contents in e(b), exp_list is optional and defaults to _b.
Because bootstrapping is a random process, if you want to be able to reproduce results, set the
random-number seed by specifying the seed(#) option or by typing
. set seed #
where # is a seed of your choosing, before running bootstrap; see [R] set seed.
Many estimation commands allow the vce(bootstrap) option. For those commands, we recommend using vce(bootstrap) over bootstrap because the estimation command already handles
clustering and other model-specific details for you. The bootstrap prefix command is intended
for use with nonestimation commands, such as summarize, user-written commands, or functions of
coefficients.
bs and bstrap are synonyms for bootstrap.
Options
Main
reps(#) specifies the number of bootstrap replications to be performed. The default is 50. A total of
50 – 200 replications are generally adequate for estimates of standard error and thus are adequate
for normal-approximation confidence intervals; see Mooney and Duval (1993, 11). Estimates of
confidence intervals using the percentile or bias-corrected methods typically require 1,000 or more
replications.
Options
strata(varlist) specifies the variables that identify strata. If this option is specified, bootstrap samples
are taken independently within each stratum.
size(#) specifies the size of the samples to be drawn. The default is _N, meaning to draw samples of the same size as the data. If specified, # must be less than or equal to the number of observations within strata().
If cluster() is specified, the default size is the number of clusters in the original dataset. For
unbalanced clusters, resulting sample sizes will differ from replication to replication. For cluster
sampling, # must be less than or equal to the number of clusters within strata().
cluster(varlist) specifies the variables that identify resampling clusters. If this option is specified,
the sample drawn during each replication is a bootstrap sample of clusters.
idcluster(newvar) creates a new variable containing a unique identifier for each resampled cluster.
This option requires that cluster() also be specified.
saving(filename[, suboptions]) creates a Stata data file (.dta file) consisting of, for each statistic in exp_list, a variable containing the bootstrap replicates.
double specifies that the results for each replication be stored as doubles, meaning 8-byte reals.
By default, they are stored as floats, meaning 4-byte reals. This option may be used without
the saving() option to compute the variance estimates by using double precision.
every(#) specifies that results be written to disk every #th replication. every() should be specified
only in conjunction with saving() when command takes a long time for each replication. This
option will allow recovery of partial results should some other software crash your computer.
See [P] postfile.
replace specifies that filename be overwritten, if it exists. This option does not appear in the
dialog box.
bca specifies that bootstrap estimate the acceleration of each statistic in exp_list. This estimate
is used to construct BCa confidence intervals. Type estat bootstrap, bca to display the BCa
confidence interval generated by the bootstrap command.
mse specifies that bootstrap compute the variance by using deviations of the replicates from the
observed value of the statistics based on the entire dataset. By default, bootstrap computes the
variance by using deviations from the average of the replicates.
Reporting
level(#); see [R] estimation options.
notable suppresses the display of the table of results.
noheader suppresses the display of the table header. This option implies nolegend. This option
may also be specified when replaying estimation results.
nolegend suppresses the display of the table legend. This option may also be specified when replaying
estimation results.
verbose specifies that the full table legend be displayed. By default, coefficients and standard errors
are not displayed. This option may also be specified when replaying estimation results.
nodots suppresses display of the replication dots. By default, one dot character is displayed for each
successful replication. A red ‘x’ is displayed if command returns an error or if one of the values
in exp_list is missing.
noisily specifies that any output from command be displayed. This option implies the nodots
option.
trace causes a trace of the execution of command to be displayed. This option implies the noisily
option.
title(text) specifies a title to be displayed above the table of bootstrap results. The default title is
the title saved in e(title) by an estimation command, or if e(title) is not filled in, Bootstrap
results is used. title() may also be specified when replaying estimation results.
display_options: noomitted, vsquish, noemptycells, baselevels, allbaselevels; see [R] estimation options.
Advanced
nodrop prevents observations outside e(sample) and the if and in conditions from being dropped
before the data are resampled.
nowarn suppresses the display of a warning message when command does not set e(sample).
force suppresses the restriction that command not specify weights or be a svy command. This is a
rarely used option. Use it only if you know what you are doing.
reject(exp) identifies an expression that indicates when results should be rejected. When exp is
true, the resulting values are reset to missing values.
seed(#) sets the random-number seed. Specifying this option is equivalent to typing the following
command prior to calling bootstrap:
. set seed #
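As an illustration of reject() (the particular model is arbitrary), one might discard replications in which the prefixed estimation command did not converge, because e(converged) is then 0:

. bootstrap _b, reps(1000) reject(e(converged) == 0): logit foreign price weight

Rejected replications have their results reset to missing values rather than being kept silently.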
The following options are available with bootstrap but are not shown in the dialog box:
eform_option causes the coefficient table to be displayed in exponentiated form: for each coefficient,
exp(b) rather than b is displayed. Standard errors and confidence intervals are also transformed.
Display of the intercept, if any, is suppressed.
command determines which of the following are allowed (eform(string) and eform are always
allowed):
eform_option    description
eform(string)   use string for the column title
eform           exponentiated coefficient, string is exp(b)
hr              hazard ratio, string is Haz. Ratio
shr             subhazard ratio, string is SHR
irr             incidence-rate ratio, string is IRR
or              odds ratio, string is Odds Ratio
rrr             relative-risk ratio, string is RRR
group(varname) re-creates varname containing a unique identifier for each group across the resampled
clusters. This option requires that idcluster() also be specified.
This option is useful for maintaining unique group identifiers when sampling clusters with replacement. Suppose that cluster 1 contains 3 groups. If the idcluster(newclid) option is specified
and cluster 1 is sampled multiple times, newclid uniquely identifies each copy of cluster 1. If
group(newgroupid) is also specified, newgroupid uniquely identifies each copy of each group.
jackknifeopts(jkopts) identifies options that are to be passed to jackknife when it computes the
acceleration values for the BCa confidence intervals; see [R] jackknife. This option requires the
bca option and is mostly used for passing the eclass, rclass, or n(#) option to jackknife.
coeflegend; see [R] estimation options.
Remarks
Remarks are presented under the following headings:
Introduction
Regression coefficients
Expressions
Combining bootstrap datasets
A note about macros
Achieved significance level
Bootstrapping a ratio
Warning messages and e(sample)
Bootstrapping statistics from data with a complex structure
Introduction
With few assumptions, bootstrapping provides a way of estimating standard errors and other measures
of statistical precision (Efron 1979; Efron and Stein 1981; Efron 1982; Efron and Tibshirani 1986;
Efron and Tibshirani 1993; also see Davison and Hinkley [1997]; Guan [2003]; Mooney and Duval
[1993]; Poi [2004]; and Stine [1990]). It provides a way to obtain such measures when no formula
is otherwise available or when available formulas make inappropriate assumptions. Cameron and
Trivedi (2009, chap. 13) discuss many bootstrapping topics and demonstrate how to do them in Stata.
To illustrate bootstrapping, suppose that you have a dataset containing N observations and an
estimator that, when applied to the data, produces certain statistics. You draw, with replacement, N
observations from the N -observation dataset. In this random drawing, some of the original observations
will appear once, some more than once, and some not at all. Using the resampled dataset, you apply
the estimator and collect the statistics. This process is repeated many times; each time, a new random
sample is drawn and the statistics are recalculated.
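One replication of this process can be carried out by hand with the bsample command, which replaces the data in memory with a same-size sample drawn with replacement; a minimal illustration (any statistic could stand in for the sample mean of mpg):

. use http://www.stata-press.com/data/r11/auto, clear
. preserve
. bsample
. summarize mpg
. restore

Repeating these steps many times and collecting the results is exactly what bootstrap automates.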
This process builds a dataset of replicated statistics. From these data, you can calculate the standard error by using the standard formula for the sample standard deviation

$$\widehat{\mathrm{se}} = \left\{ \frac{1}{k-1} \sum_{i=1}^{k} \left( \hat\theta_i - \bar\theta \right)^2 \right\}^{1/2}$$

where $\hat\theta_i$ is the statistic calculated using the $i$th bootstrap sample and $k$ is the number of replications. This formula gives an estimate of the standard error of the statistic, according to Hall and Wilson (1991).
Although the average, $\bar\theta$, of the bootstrapped estimates is used in calculating the standard deviation, it is not used as the estimated value of the statistic itself. Instead, the original observed value of the statistic, $\hat\theta$, is used, meaning the value of the statistic computed using the original $N$ observations. You might think that $\bar\theta$ is a better estimate of the parameter than $\hat\theta$, but it is not. If the statistic is biased, bootstrapping exaggerates the bias. In fact, the bias can be estimated as $\bar\theta - \hat\theta$ (Efron 1982, 33). Knowing this, you might be tempted to subtract this estimate of bias from $\hat\theta$ to produce an unbiased statistic. The bootstrap bias estimate has an indeterminate amount of random error, so this unbiased estimator may have greater mean squared error than the biased estimator (Mooney and Duval 1993; Hinkley 1978). Thus $\hat\theta$ is the best point estimate of the statistic.
The logic behind the bootstrap is that all measures of precision come from a statistic’s sampling
distribution. When the statistic is estimated on a sample of size N from some population, the sampling
distribution tells you the relative frequencies of the values of the statistic. The sampling distribution,
in turn, is determined by the distribution of the population and the formula used to estimate the
statistic.
Sometimes the sampling distribution can be derived analytically. For instance, if the underlying
population is distributed normally and you calculate means, the sampling distribution for the mean is
also normal but has a smaller variance than that of the population. In other cases, deriving the sampling
distribution is difficult, as when means are calculated from nonnormal populations. Sometimes, as in
the case of means, it is not too difficult to derive the sampling distribution as the sample size goes
to infinity (N → ∞). However, such asymptotic distributions may not perform well when applied to
finite samples.
If you knew the population distribution, you could obtain the sampling distribution by simulation:
you could draw random samples of size N , calculate the statistic, and make a tally. Bootstrapping
does precisely this, but it uses the observed distribution of the sample in place of the true population
distribution. Thus the bootstrap procedure hinges on the assumption that the observed distribution
is a good estimate of the underlying population distribution. In return, the bootstrap produces an
estimate, called the bootstrap distribution, of the sampling distribution. From this, you can estimate
the standard error of the statistic, produce confidence intervals, etc.
The accuracy with which the bootstrap distribution estimates the sampling distribution depends on
the number of observations in the original sample and the number of replications in the bootstrap. A
crudely estimated sampling distribution is adequate if you are only going to extract, say, a standard
error. A better estimate is needed if you want to use the 2.5th and 97.5th percentiles of the distribution
to produce a 95% confidence interval. To extract many features simultaneously about the distribution,
an even better estimate is needed. Generally, replications on the order of 1,000 produce very good
estimates, but only 50 – 200 replications are needed for estimates of standard errors. See Poi (2004)
for a method to choose the number of bootstrap replications.
Regression coefficients
Example 1
Let’s say that we wish to compute bootstrap estimates for the standard errors of the coefficients
from the following regression:
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. regress mpg weight gear foreign

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  3,    70) =   46.73
       Model |  1629.67805     3  543.226016           Prob > F      =  0.0000
    Residual |  813.781411    70  11.6254487           R-squared     =  0.6670
-------------+------------------------------           Adj R-squared =  0.6527
       Total |  2443.45946    73  33.4720474           Root MSE      =  3.4096

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   -.006139   .0007949    -7.72   0.000    -.0077245   -.0045536
  gear_ratio |   1.457113   1.541286     0.95   0.348    -1.616884     4.53111
     foreign |  -2.221682   1.234961    -1.80   0.076    -4.684735    .2413715
       _cons |   36.10135   6.285984     5.74   0.000     23.56435    48.63835
------------------------------------------------------------------------------
To run the bootstrap, we simply prefix the above regression command with the bootstrap command
(specifying its options before the colon separator). We must set the random-number seed before calling
bootstrap.
. bootstrap, reps(100) seed(1): regress mpg weight gear foreign
(running regress on estimation sample)

Bootstrap replications (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100

Linear regression                               Number of obs      =        74
                                                Replications       =       100
                                                Wald chi2(3)       =    111.96
                                                Prob > chi2        =    0.0000
                                                R-squared          =    0.6670
                                                Adj R-squared      =    0.6527
                                                Root MSE           =    3.4096

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
         mpg |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   -.006139   .0006498    -9.45   0.000    -.0074127   -.0048654
  gear_ratio |   1.457113   1.297786     1.12   0.262    -1.086501    4.000727
     foreign |  -2.221682   1.162728    -1.91   0.056    -4.500587    .0572236
       _cons |   36.10135    4.71779     7.65   0.000     26.85465    45.34805
------------------------------------------------------------------------------
The displayed confidence interval is based on the assumption that the sampling (and hence bootstrap)
distribution is approximately normal (see Methods and formulas below). Because this confidence
interval is based on the standard error, it is a reasonable estimate if normality is approximately true,
even for a few replications. Other types of confidence intervals are available after bootstrap; see
[R] bootstrap postestimation.
We could instead supply names to our expressions when we run bootstrap. For example,
. bootstrap diff=(_b[weight]-_b[gear]): regress mpg weight gear foreign
would bootstrap a statistic, named diff, equal to the difference between the coefficients on weight
and gear_ratio.
Expressions
Example 2
When we use bootstrap, the list of statistics can contain complex expressions, as long as each
expression is enclosed in parentheses. For example, to bootstrap the range of a variable x, we could
type
. bootstrap range=(r(max)-r(min)), reps(1000): summarize x
Of course, we could also bootstrap the minimum and maximum and later compute the range.
. bootstrap max=r(max) min=r(min), reps(1000) saving(mybs): summarize x
. use mybs, clear
(bootstrap: summarize)
. generate range = max - min
. bstat range, stat(19.5637501)
The difference between the maximum and minimum of x in the sample is 19.5637501.
The stat() option to bstat specifies the observed value of the statistic (range) to be summarized.
This option is useful when, as shown above, the statistic of ultimate interest is not specified directly
to bootstrap but instead is calculated by other means.
Here the observed values of r(max) and r(min) are saved as characteristics of the dataset created
by bootstrap and are thus available for retrieval by bstat; see [R] bstat. The observed range,
however, is unknown to bstat, so it must be specified.
Combining bootstrap datasets
You can combine two datasets from separate runs of bootstrap by using append (see [D] append)
and then get the bootstrap statistics for the combined datasets by running bstat. The runs must
have been performed independently (having different starting random-number seeds), and the original
dataset, command, and bootstrap statistics must have been all the same.
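A hypothetical sketch, assuming the two independent runs were saved with saving(bsrun1) and saving(bsrun2):

. use bsrun1, clear
. append using bsrun2
. bstat

bstat can pick up the observed values of the statistics from the characteristics that bootstrap saved with the first dataset, as described in example 2 above.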
A note about macros
In the previous example, we executed the command
. bootstrap max=r(max) min=r(min), reps(1000) saving(mybs): summarize x
We did not enclose r(max) and r(min) in single quotes, as we would in most other contexts, because
it would not produce what was intended:
. bootstrap ‘r(max)’ ‘r(min)’, reps(1000) saving(mybs): summarize x
To understand why, note that ‘r(max)’, like any reference to a local macro, will evaluate to a literal
string containing the contents of r(max) before bootstrap is even executed. Typing the command
above would appear to Stata as if we had typed
. bootstrap 14.5441234 33.4393293, reps(1000) saving(mybs): summarize x
Even worse, the current contents of r(min) and r(max) could be empty, producing an even more
confusing result. To avoid this outcome, refer to statistics by name (e.g., r(max)) and not by value
(e.g., ‘r(max)’).
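To see the premature expansion directly, display the quoted reference after running the command once; the local macro is substituted immediately, before any prefix command could run. A minimal sketch:

. quietly summarize x
. display "`r(max)'"

The number printed is exactly what bootstrap would have received in place of ‘r(max)’.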
Achieved significance level
Example 3
Suppose that we wish to estimate the achieved significance level (ASL) of a test statistic by using
the bootstrap. ASL is another name for p-value. An example is
$$\mathrm{ASL} = \Pr\!\left( \hat\theta^{*} \geq \hat\theta \mid H_0 \right)$$

for an upper-tailed, alternative hypothesis, where $H_0$ denotes the null hypothesis, $\hat\theta$ is the observed value of the test statistic, and $\hat\theta^{*}$ is the random variable corresponding to the test statistic, assuming that $H_0$ is true.
Here we will compare the mean miles per gallon (mpg) between foreign and domestic cars by
using the two-sample t test with unequal variances. The following results indicate the p-value to be
0.0034 for the two-sided test using Satterthwaite’s approximation. Thus assuming that mean mpg is
the same for foreign and domestic cars, we would expect to observe a t statistic more extreme (in
absolute value) than 3.1797 in about 0.3% of all possible samples of the type that we observed.
Thus we have evidence to reject the null hypothesis that the means are equal.
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. ttest mpg, by(foreign) unequal

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
Domestic |      52    19.82692     .657777    4.743297    18.50638    21.14747
 Foreign |      22    24.77273     1.40951    6.611187    21.84149    27.70396
---------+--------------------------------------------------------------------
combined |      74     21.2973    .6725511    5.785503     19.9569    22.63769
---------+--------------------------------------------------------------------
    diff |           -4.945804    1.555438               -8.120053   -1.771556
------------------------------------------------------------------------------
    diff = mean(Domestic) - mean(Foreign)                         t =  -3.1797
Ho: diff = 0                      Satterthwaite’s degrees of freedom = 30.5463

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0017         Pr(|T| > |t|) = 0.0034          Pr(T > t) = 0.9983
We also place the value of the test statistic in a scalar for later use.
. scalar tobs = r(t)
Efron and Tibshirani (1993, 224) describe an alternative to Satterthwaite’s approximation that
estimates the ASL by bootstrapping the statistic from the test of equal means. Their idea is to recenter
the two samples to the combined sample mean so that the data now conform to the null hypothesis
but that the variances within the samples remain unchanged.
. summarize mpg, meanonly
. scalar omean = r(mean)
. summarize mpg if foreign==0, meanonly
. replace mpg = mpg - r(mean) + scalar(omean) if foreign==0
mpg was int now float
(52 real changes made)
. summarize mpg if foreign==1, meanonly
. replace mpg = mpg - r(mean) + scalar(omean) if foreign==1
(22 real changes made)
. sort foreign
. by foreign: summarize mpg
-> foreign = Domestic

    Variable |      Obs        Mean    Std. Dev.        Min         Max
-------------+----------------------------------------------------------
         mpg |       52     21.2973    4.743297   13.47037    35.47038

-> foreign = Foreign

    Variable |      Obs        Mean    Std. Dev.        Min         Max
-------------+----------------------------------------------------------
         mpg |       22     21.2973    6.611187   10.52457    37.52457
Each sample (foreign and domestic) is a stratum, so the bootstrapped samples must have the same
number of foreign and domestic cars as the original dataset. This requirement is facilitated by the
strata() option to bootstrap. By typing the following, we bootstrap the test statistic using the
modified dataset and save the values in bsauto2.dta:
. keep mpg foreign
. set seed 1
. bootstrap t=r(t), rep(1000) strata(foreign) saving(bsauto2) nodots: ttest mpg,
> by(foreign) unequal
Warning: Because ttest is not an estimation command or does not set
         e(sample), bootstrap has no way to determine which observations are
         used in calculating the statistics and so assumes that all
         observations are used. This means that no observations will be
         excluded from the resampling because of missing values or other
         reasons.
         If the assumption is not true, press Break, save the data, and drop
         the observations that are to be excluded. Be sure that the dataset
         in memory contains only the relevant data.
Bootstrap results
Number of strata   =         2                  Number of obs      =        74
                                                Replications       =      1000

      command:  ttest mpg, by(foreign) unequal
            t:  r(t)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           t |   1.75e-07   1.036437     0.00   1.000    -2.031379    2.031379
------------------------------------------------------------------------------
We can use the data in bsauto2.dta to estimate ASL via the fraction of bootstrap test statistics
that are more extreme than 3.1797.
. use bsauto2, clear
(bootstrap: ttest)
. generate indicator = abs(t)>=abs(scalar(tobs))
. summarize indicator, meanonly
. display "ASLboot = " r(mean)
ASLboot = .005
The result is ASLboot = 0.005. Assuming that the mean mpg is the same between foreign and
domestic cars, we would expect to observe a t statistic more extreme (in absolute value) than 3.1797
in about 0.5% of all possible samples of the type we observed. This finding is still strong evidence
to reject the hypothesis that the means are equal.
Bootstrapping a ratio
Example 4
Suppose that we wish to produce a bootstrap estimate of the ratio of two means. Because summarize
saves results for only one variable, we must call summarize twice to compute the means. Actually,
we could use collapse to compute the means in one call, but calling summarize twice is much
faster. Thus we will have to write a small program that will return the results we want.
We write the program below and save it to a file called myratio.ado (see [U] 17 Ado-files). Our program takes two variable names as input and saves them in the local macros ‘y’ (first variable) and ‘x’ (second variable). It then computes one statistic: the mean of ‘y’ divided by the mean of ‘x’. This value is returned as a scalar in r(ratio). myratio also returns the number of observations used to compute the mean for each variable.
program myratio, rclass
        version 11
        args y x
        confirm var `y'
        confirm var `x'
        tempname ymean
        summarize `y', meanonly
        scalar `ymean' = r(mean)
        return scalar n_`y' = r(N)
        summarize `x', meanonly
        return scalar n_`x' = r(N)
        return scalar ratio = `ymean'/r(mean)
end
Remember to test any newly written commands before using them with bootstrap.
. use http://www.stata-press.com/data/r11/auto, clear
(1978 Automobile Data)
. summarize price

    Variable |      Obs        Mean    Std. Dev.        Min         Max
-------------+----------------------------------------------------------
       price |       74    6165.257    2949.496        3291       15906

. scalar mean1=r(mean)
. summarize weight

    Variable |      Obs        Mean    Std. Dev.        Min         Max
-------------+----------------------------------------------------------
      weight |       74    3019.459    777.1936        1760        4840

. scalar mean2=r(mean)
. di scalar(mean1)/scalar(mean2)
2.0418412
. myratio price weight
. return list
scalars:
              r(ratio) =  2.041841210168278
           r(n_weight) =  74
            r(n_price) =  74
The results of running bootstrap on our program are
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. set seed 1
. bootstrap ratio=r(ratio), reps(1000) nowarn nodots: myratio price weight
Bootstrap results                               Number of obs      =        74
                                                Replications       =      1000

      command:  myratio price weight
        ratio:  r(ratio)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ratio |   2.041841   .0942932    21.65   0.000      1.85703    2.226652
------------------------------------------------------------------------------
As mentioned previously, we should specify the saving() option if we wish to save the bootstrap
dataset.
Warning messages and e(sample)
bootstrap is not meant to be used with weighted calculations. bootstrap determines the presence
of weights by parsing the prefixed command with standard syntax. However, commands like stcox
and streg require that weights be specified in stset, and some user commands may allow weights
to be specified by using an option instead of the standard syntax. Both cases pose a problem for
bootstrap because it cannot determine the presence of weights under these circumstances. In these
cases, we can only assume that you know what you are doing.
bootstrap does not know which variables of the dataset in memory matter to the calculation at hand. You can speed execution by dropping unnecessary variables because, otherwise, they are included in each bootstrap sample.
You should thus drop observations with missing values. Leaving in missing values causes no
problem in one sense because all Stata commands deal with missing values gracefully. It does,
however, cause a statistical problem. Bootstrap sampling is defined as drawing, with replacement,
samples of size N from a set of N observations. bootstrap determines N by counting the number
of observations in memory, not counting the number of nonmissing values on the relevant variables.
The result is that too many observations are resampled; the resulting bootstrap samples, because they
are drawn from a population with missing values, are of unequal sizes.
If the number of missing values relative to the sample size is small, this will make little difference.
If you have many missing values, however, you should first drop the observations that contain them.
Example 5
To illustrate, we use the previous example but replace some of the values of price with missing
values. The number of values of price used to compute the mean for each bootstrap is not constant.
This is the purpose of the Warning message.
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. replace price = . if inlist(_n,1,3,5,7)
(4 real changes made, 4 to missing)
. set seed 1
. bootstrap ratio=r(ratio) np=r(n_price) nw=r(n_weight), reps(100) nodots:
> myratio price weight
Warning: Because myratio is not an estimation command or does not set
         e(sample), bootstrap has no way to determine which observations are
         used in calculating the statistics and so assumes that all
         observations are used. This means that no observations will be
         excluded from the resampling because of missing values or other
         reasons.
         If the assumption is not true, press Break, save the data, and drop
         the observations that are to be excluded. Be sure that the dataset
         in memory contains only the relevant data.

Bootstrap results                               Number of obs      =        74
                                                Replications       =       100

      command:  myratio price weight
        ratio:  r(ratio)
           np:  r(n_price)
           nw:  r(n_weight)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ratio |   2.063051   .0893669    23.09   0.000     1.887896    2.238207
          np |         70   1.872178    37.39   0.000      66.3306     73.6694
          nw |         74          .        .       .            .           .
------------------------------------------------------------------------------
Bootstrapping statistics from data with a complex structure
Here we describe how to bootstrap statistics from data with a complex structure, e.g., longitudinal or
panel data, or matched data. bootstrap, however, is not designed to work with complex survey data.
It is important to include all necessary information about the structure of the data in the bootstrap
syntax to obtain correct bootstrap estimates for standard errors and confidence intervals.
bootstrap offers several options identifying the specifics of the data. These options are strata(),
cluster(), idcluster(), and group(). The usage of strata() was described in example 3 above.
Below we demonstrate several examples that require specifying the other three options.
Example 6
Suppose that the auto data in example 1 above are clustered by rep78. We want to obtain
bootstrap estimates for the standard errors of the difference between the coefficients on weight and
gear ratio, taking into account clustering.
We supply the cluster(rep78) option to bootstrap to request resampling from clusters rather
than from observations in the dataset.
. use http://www.stata-press.com/data/r11/auto, clear
(1978 Automobile Data)
. keep if rep78 < .
(5 observations deleted)
. bootstrap diff=(_b[weight]-_b[gear]), seed(1) cluster(rep78): regress mpg weight
> gear foreign
(running regress on estimation sample)
Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50

Linear regression                               Number of obs      =        69
                                                Replications       =        50

      command:  regress mpg weight gear foreign
         diff:  _b[weight]-_b[gear]

                                (Replications based on 5 clusters in rep78)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        diff |  -1.910396   1.876778    -1.02   0.309    -5.588812    1.768021
------------------------------------------------------------------------------
We drop missing values in rep78 before issuing the command because bootstrap does not allow
missing values in cluster(). See the section above about using bootstrap when variables contain
missing values.
We can also obtain these same results by using the following syntax:
. bootstrap diff=(_b[weight]-_b[gear]), seed(1): regress mpg weight gear foreign,
> vce(cluster rep78)
When only clustered information is provided to the command, bootstrap can pick up the
vce(cluster clustvar) option from the main command and use it to resample from clusters.
Example 7
Suppose now that we have matched data and want to use bootstrap to obtain estimates of the
standard errors of the exponentiated difference between two coefficients (or, equivalently, the ratio
of two odds ratios) estimated by clogit. Consider the example of matched case–control data on
birthweight of infants described in example 2 of [R] clogit.
The infants are paired by being matched on mother’s age. All groups, defined by the pairid
variable, have 1:2 matching. clogit requires that the matching information, pairid, be supplied to
the group() (or, equivalently, strata()) option to be used in computing the parameter estimates.
Because the data are matched, we need to resample from groups rather than from the whole
dataset. However, simply supplying the grouping variable pairid in cluster() is not enough with
bootstrap, as it is with clustered data.
. use http://www.stata-press.com/data/r11/lowbirth2, clear
(Applied Logistic Regression, Hosmer & Lemeshow)
. bootstrap ratio=exp(_b[smoke]-_b[ptd]), seed(1) cluster(pairid): clogit low lwt
> smoke ptd ht ui i.race, group(pairid)
(running clogit on estimation sample)
Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50

Bootstrap results                               Number of obs      =       112
                                                Replications       =        50

      command:  clogit low lwt smoke ptd ht ui i.race, group(pairid)
        ratio:  exp(_b[smoke]-_b[ptd])

                              (Replications based on 56 clusters in pairid)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ratio |   .6654095   17.71791     0.04   0.970    -34.06106    35.39187
------------------------------------------------------------------------------
For the syntax above, imagine that the first pair was sampled twice during a replication. Then the
bootstrap sample has four subjects with pairid equal to one, which clearly violates the original 1:2
matching design. As a result, the estimates of the coefficients obtained from this bootstrap sample
will be incorrect.
Therefore, in addition to resampling from groups, we need to ensure that resampled groups are
uniquely identified in each of the bootstrap samples. The idcluster(newcluster) option is designed
for this. It requests that at each replication bootstrap create the new variable, newcluster, containing
unique identifiers for all resampled groups. Thus, to make sure that the correct matching is preserved
during each replication, we need to specify the grouping variable in cluster(), supply a variable
name to idcluster(), and use this variable as the grouping variable with clogit, as we demonstrate
below.
. bootstrap ratio=exp(_b[smoke]-_b[ptd]), seed(1) cluster(pairid)
> idcluster(newpairid): clogit low lwt smoke ptd ht ui i.race, group(newpairid)
(running clogit on estimation sample)
Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50

Bootstrap results                               Number of obs      =       112
                                                Replications       =        50

      command:  clogit low lwt smoke ptd ht ui i.race, group(newpairid)
        ratio:  exp(_b[smoke]-_b[ptd])

                              (Replications based on 56 clusters in pairid)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ratio |   .6654095   7.919441     0.08   0.933    -14.85641    16.18723
------------------------------------------------------------------------------
Note the difference between the estimates of the bootstrap standard error for the two specifications
of the bootstrap syntax.
Technical note
Similarly, when you have panel (longitudinal) data, all resampled panels must be unique
in each of the bootstrap samples to obtain correct bootstrap estimates of statistics. Therefore,
both cluster(panelvar) and idcluster(newpanelvar) must be specified with bootstrap, and
i(newpanelvar) must be used with the main command. Moreover, you must clear the current xtset
settings by typing xtset, clear before calling bootstrap.
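As a schematic sketch only (the dataset and variable names are hypothetical, and we assume the main command accepts the i() option, as the note requires):

. xtset, clear
. bootstrap _b, seed(1) cluster(id) idcluster(newid): xtreg y x, fe i(newid)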
Example 8
Continuing with our birthweight data, suppose that we have more information about doctors
supervising women’s pregnancies. We believe that the data on the pairs of infants from the same
doctor may be correlated and want to adjust standard errors for possible correlation among the pairs.
clogit offers the vce(cluster clustvar) option to do this.
Let’s add a cluster variable to our dataset. One thing to keep in mind is that to use vce(cluster
clustvar), groups in group() must be nested within clusters.
. use http://www.stata-press.com/data/r11/lowbirth2, clear
(Applied Logistic Regression, Hosmer & Lemeshow)
. set seed 12345
. by pairid, sort: egen byte doctor = total(int(2*runiform()+1)*(_n == 1))
. clogit low lwt smoke ptd ht ui i.race, group(pairid) vce(cluster doctor)
Iteration 0:   log pseudolikelihood = -26.768693
Iteration 1:   log pseudolikelihood = -25.810476
Iteration 2:   log pseudolikelihood = -25.794296
Iteration 3:   log pseudolikelihood = -25.794271
Iteration 4:   log pseudolikelihood = -25.794271

Conditional (fixed-effects) logistic regression   Number of obs   =       112
                                                  Wald chi2(1)    =         .
                                                  Prob > chi2     =         .
Log pseudolikelihood = -25.794271                 Pseudo R2       =    0.3355

                               (Std. Err. adjusted for 2 clusters in doctor)
------------------------------------------------------------------------------
             |               Robust
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lwt |  -.0183757   .0217802    -0.84   0.399    -.0610641    .0243128
       smoke |   1.400656   .0085545   163.73   0.000      1.38389    1.417423
         ptd |   1.808009    .938173     1.93   0.054    -.0307765    3.646794
          ht |   2.361152   1.587013     1.49   0.137    -.7493362     5.47164
          ui |   1.401929   .8568119     1.64   0.102    -.2773913     3.08125
             |
        race |
          2  |   .5713643   .0672593     8.49   0.000     .4395385    .7031902
          3  |  -.0253148   .9149785    -0.03   0.978     -1.81864     1.76801
------------------------------------------------------------------------------
To obtain correct bootstrap standard errors of the exponentiated difference between the two
coefficients in this example, we need to make sure that both resampled clusters and groups within
resampled clusters are unique in each of the bootstrap samples. To achieve this, bootstrap needs
the information about clusters in cluster(), the variable name of the new identifier for clusters
in idcluster(), and the information about groups in group(). We demonstrate the corresponding
syntax of bootstrap below.
. bootstrap ratio=exp(_b[smoke]-_b[ptd]), seed(1) cluster(doctor)
> idcluster(uidoctor) group(pairid): clogit low lwt smoke ptd ht ui i.race,
> group(pairid)
(running clogit on estimation sample)
Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50

Bootstrap results                               Number of obs      =       112
                                                Replications       =        50

      command:  clogit low lwt smoke ptd ht ui i.race, group(pairid)
        ratio:  exp(_b[smoke]-_b[ptd])

                               (Replications based on 2 clusters in doctor)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ratio |   .6654095   .3156251     2.11   0.035     .0467956    1.284023
------------------------------------------------------------------------------
In the above syntax, although we specify group(pairid) with clogit, it is not the group identifiers
of the original pairid variable that are used to compute parameter estimates from bootstrap samples.
The way bootstrap works is that, at each replication, the clusters defined by doctor are resampled
and the new variable, uidoctor, uniquely identifying resampled clusters is created. After that, another
new variable uniquely identifying the (uidoctor, group) combination is created and renamed to
have the same name as the grouping variable, pairid. This newly defined grouping variable is then
used by clogit to obtain the parameter estimates from this bootstrap sample of clusters. After all
replications are performed, the original values of the grouping variable are restored.
Technical note
The same logic must be used when running bootstrap with commands designed for panel (longitudinal)
data that allow specifying the cluster(clustervar) option. To ensure that the combination of
(clustervar, panelvar) values is unique in each of the bootstrap samples, cluster(clustervar),
idcluster(newclustervar), and group(panelvar) must be specified with bootstrap, and i(panelvar)
must be used with the main command.
Bradley Efron was born in 1938 in Minnesota and studied mathematics and statistics at Caltech
and Stanford; he has lived in northern California since 1960. He has worked on empirical Bayes,
survival analysis, exponential families, bootstrap and jackknife methods, and confidence intervals,
in conjunction with applied work in biostatistics, astronomy, and physics.
Saved results
bootstrap saves the following in e():
Scalars
    e(N)                 sample size
    e(N_reps)            number of complete replications
    e(N_misreps)         number of incomplete replications
    e(N_strata)          number of strata
    e(N_clust)           number of clusters
    e(k_eq)              number of equations
    e(k_exp)             number of standard expressions
    e(k_eexp)            number of extended expressions (i.e., _b)
    e(k_extra)           number of extra equations beyond the original ones from e(b)
    e(level)             confidence level for bootstrap CIs
    e(bs_version)        version for bootstrap results
    e(rank)              rank of e(V)

Macros
    e(cmdname)           command name from command
    e(cmd)               same as e(cmdname) or bootstrap
    e(command)           command
    e(cmdline)           command as typed
    e(prefix)            bootstrap
    e(title)             title in estimation output
    e(strata)            strata variables
    e(cluster)           cluster variables
    e(seed)              initial random-number seed
    e(size)              from the size(#) option
    e(exp#)              expression for the #th statistic
    e(mse)               mse, if specified
    e(vce)               bootstrap
    e(vcetype)           title used to label Std. Err.
    e(properties)        b V

Matrices
    e(b)                 observed statistics
    e(b_bs)              bootstrap estimates
    e(reps)              number of nonmissing results
    e(bias)              estimated biases
    e(se)                estimated standard errors
    e(z0)                median biases
    e(accel)             estimated accelerations
    e(ci_normal)         normal-approximation CIs
    e(ci_percentile)     percentile CIs
    e(ci_bc)             bias-corrected CIs
    e(ci_bca)            bias-corrected and accelerated CIs
    e(V)                 bootstrap variance–covariance matrix
    e(V_modelbased)      model-based variance

When exp_list is _b, bootstrap will also carry forward most of the results already in e() from
command.
Methods and formulas
bootstrap is implemented as an ado-file.
Let $\hat\theta$ be the observed value of the statistic, that is, the value of the statistic calculated with the
original dataset. Let $i = 1, 2, \ldots, k$ denote the bootstrap samples, and let $\hat\theta_i$ be the value of the
statistic from the $i$th bootstrap sample.

When the mse option is specified, the standard error is estimated as

$$ \widehat{se}_{\rm MSE} = \Bigl\{ \frac{1}{k} \sum_{i=1}^{k} (\hat\theta_i - \hat\theta)^2 \Bigr\}^{1/2} $$

Otherwise, the standard error is estimated as

$$ \widehat{se} = \Bigl\{ \frac{1}{k-1} \sum_{i=1}^{k} (\hat\theta_i - \bar\theta)^2 \Bigr\}^{1/2} $$

where

$$ \bar\theta = \frac{1}{k} \sum_{i=1}^{k} \hat\theta_i $$

The variance–covariance matrix is similarly computed. The bias is estimated as

$$ \widehat{\rm bias} = \bar\theta - \hat\theta $$

Confidence intervals with nominal coverage rates $1 - \alpha$ are calculated according to the following
formulas. The normal-approximation method yields the confidence intervals

$$ \hat\theta - z_{1-\alpha/2}\,\widehat{se}\,, \qquad \hat\theta + z_{1-\alpha/2}\,\widehat{se} $$

where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$th quantile of the standard normal distribution. If the mse option is
specified, bootstrap will report the normal confidence interval using $\widehat{se}_{\rm MSE}$ instead of $\widehat{se}$. estat
bootstrap only uses $\widehat{se}$ in the normal confidence interval.

The percentile method yields the confidence intervals

$$ \bigl( \theta^*_{\alpha/2},\ \theta^*_{1-\alpha/2} \bigr) $$

where $\theta^*_p$ is the $p$th quantile (the 100$p$th percentile) of the bootstrap distribution $(\hat\theta_1, \ldots, \hat\theta_k)$.

Let

$$ z_0 = \Phi^{-1}\bigl\{ \#(\hat\theta_i \le \hat\theta)/k \bigr\} $$

where $\#(\hat\theta_i \le \hat\theta)$ is the number of elements of the bootstrap distribution that are less than or equal
to the observed statistic and $\Phi$ is the standard cumulative normal. $z_0$ is known as the median bias of
$\hat\theta$. Let

$$ a = \frac{\sum_{i=1}^{n} \bigl(\hat\theta_{(\cdot)} - \hat\theta_{(i)}\bigr)^3}{6 \bigl\{ \sum_{i=1}^{n} \bigl(\hat\theta_{(\cdot)} - \hat\theta_{(i)}\bigr)^2 \bigr\}^{3/2}} $$

where $\hat\theta_{(i)}$ are the leave-one-out (jackknife) estimates of $\hat\theta$ and $\hat\theta_{(\cdot)}$ is their mean. This expression is
known as the jackknife estimate of acceleration for $\hat\theta$. Let

$$ p_1 = \Phi\Bigl\{ z_0 + \frac{z_0 - z_{1-\alpha/2}}{1 - a(z_0 - z_{1-\alpha/2})} \Bigr\} $$

$$ p_2 = \Phi\Bigl\{ z_0 + \frac{z_0 + z_{1-\alpha/2}}{1 - a(z_0 + z_{1-\alpha/2})} \Bigr\} $$

where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$th quantile of the normal distribution. The bias-corrected and accelerated
(BCa) method yields confidence intervals

$$ \bigl( \theta^*_{p_1},\ \theta^*_{p_2} \bigr) $$

where $\theta^*_p$ is the $p$th quantile of the bootstrap distribution as defined previously. The bias-corrected
(but not accelerated) method is a special case of BCa with $a = 0$.
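As a quick arithmetic check of the normal-approximation formula (a sketch using the observed coefficient and bootstrap standard error from the 1,000-replication example above):

. display 2.041841 - invnormal(.975)*.0942932
. display 2.041841 + invnormal(.975)*.0942932

which reproduces the reported interval of (1.85703, 2.226652).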
References
Cameron, A. C., and P. K. Trivedi. 2009. Microeconometrics Using Stata. College Station, TX: Stata Press.
Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Application. Cambridge: Cambridge University
Press.
Efron, B. 1979. Bootstrap methods: Another look at the jackknife. Annals of Statistics 7: 1–26.
. 1982. The Jackknife, the Bootstrap and Other Resampling Plans. Philadelphia: Society for Industrial and Applied
Mathematics.
Efron, B., and C. Stein. 1981. The jackknife estimate of variance. Annals of Statistics 9: 586–596.
Efron, B., and R. J. Tibshirani. 1986. Bootstrap methods for standard errors, confidence intervals, and other measures
of statistical accuracy. Statistical Science 1: 54–77.
. 1993. An Introduction to the Bootstrap. New York: Chapman & Hall/CRC.
Gleason, J. R. 1997. ip18: A command for randomly resampling a dataset. Stata Technical Bulletin 37: 17–22. Reprinted
in Stata Technical Bulletin Reprints, vol. 7, pp. 77–83. College Station, TX: Stata Press.
. 1999. ip18.1: Update to resample. Stata Technical Bulletin 52: 9–10. Reprinted in Stata Technical Bulletin
Reprints, vol. 9, p. 119. College Station, TX: Stata Press.
Gould, W. W. 1994. ssi6.2: Faster and easier bootstrap estimation. Stata Technical Bulletin 21: 24–33. Reprinted in
Stata Technical Bulletin Reprints, vol. 4, pp. 211–223. College Station, TX: Stata Press.
Guan, W. 2003. From the help desk: Bootstrapped standard errors. Stata Journal 3: 71–80.
Hall, P., and S. R. Wilson. 1991. Two guidelines for bootstrap hypothesis testing. Biometrics 47: 757–762.
Hamilton, L. C. 1991. ssi2: Bootstrap programming. Stata Technical Bulletin 4: 18–27. Reprinted in Stata Technical
Bulletin Reprints, vol. 1, pp. 208–220. College Station, TX: Stata Press.
. 1992. Regression with Graphics: A Second Course in Applied Statistics. Belmont, CA: Duxbury.
. 2009. Statistics with Stata (Updated for Version 10). Belmont, CA: Brooks/Cole.
Hinkley, D. V. 1978. Improving the jackknife with special reference to correlation estimation. Biometrika 65: 13–22.
Holmes, S., C. Morris, and R. J. Tibshirani. 2003. Bradley Efron: A conversation with good friends. Statistical Science
18: 268–281.
Mooney, C. Z., and R. D. Duval. 1993. Bootstrapping: A Nonparametric Approach to Statistical Inference. Newbury
Park, CA: Sage.
Poi, B. P. 2004. From the help desk: Some bootstrapping techniques. Stata Journal 4: 312–328.
Stine, R. 1990. An introduction to bootstrap methods: Examples and ideas. In Modern Methods of Data Analysis, ed.
J. Fox and J. S. Long, 353–373. Newbury Park, CA: Sage.
Also see
[R] bootstrap postestimation — Postestimation tools for bootstrap
[R] jackknife — Jackknife estimation
[R] permute — Monte Carlo permutation tests
[R] simulate — Monte Carlo simulations
[U] 13.5 Accessing coefficients and standard errors
[U] 13.6 Accessing results from Stata commands
[U] 20 Estimation and postestimation commands
Title
bootstrap postestimation — Postestimation tools for bootstrap
Description

The following postestimation command is of special interest after bootstrap:

    command             description
    -----------------------------------------------------------------
    estat bootstrap     percentile-based and bias-corrected CI tables
    -----------------------------------------------------------------

For information about estat bootstrap, see below.

The following standard postestimation commands are also available:

    command        description
    -------------------------------------------------------------------------
      estat        AIC, BIC, VCE, and estimation sample summary
      estimates    cataloging estimation results
    * hausman      Hausman's specification test
    * lincom       point estimates, standard errors, testing, and inference
                     for linear combinations of coefficients
    * margins      marginal means, predictive margins, marginal effects, and
                     average marginal effects
    * nlcom        point estimates, standard errors, testing, and inference
                     for nonlinear combinations of coefficients
    * predict      predictions, residuals, influence statistics, and other
                     diagnostic measures
    * predictnl    point estimates, standard errors, testing, and inference
                     for generalized predictions
    * test         Wald tests of simple and composite linear hypotheses
    * testnl       Wald tests of nonlinear hypotheses
    -------------------------------------------------------------------------
    * This postestimation command is allowed if it may be used after command.

See the corresponding entries in the Stata Base Reference Manual for details.

Special-interest postestimation command

estat bootstrap displays a table of confidence intervals for each statistic from a bootstrap
analysis.

Syntax for predict

The syntax of predict (and even if predict is allowed) following bootstrap depends upon
the command used with bootstrap. If predict is not allowed, neither is predictnl.
Syntax for estat bootstrap

    estat bootstrap [, options]

    options       description
    --------------------------------------------------------
    bc            bias-corrected CIs; the default
    bca           bias-corrected and accelerated (BCa) CIs
    normal        normal-based CIs
    percentile    percentile CIs
    all           all available CIs
    noheader      suppress table header
    nolegend      suppress table legend
    verbose       display the full table legend
    --------------------------------------------------------
    bc, bca, normal, and percentile may be used together.

Menu

Statistics > Postestimation > Reports and statistics
Options for estat bootstrap
bc is the default and displays bias-corrected confidence intervals.
bca displays bias-corrected and accelerated confidence intervals. This option assumes that you also
specified the bca option on the bootstrap prefix command.
normal displays normal approximation confidence intervals.
percentile displays percentile confidence intervals.
all displays all available confidence intervals.
noheader suppresses display of the table header. This option implies nolegend.
nolegend suppresses display of the table legend, which identifies the rows of the table with the
expressions they represent.
verbose requests that the full table legend be displayed.
Remarks
Example 1
The estat bootstrap postestimation command produces a table containing the observed value
of the statistic, an estimate of its bias, the bootstrap standard error, and up to four different confidence
intervals.
If we were interested merely in getting bootstrap standard errors for the model coefficients, we
could use the bootstrap prefix with our estimation command. If we were interested in performing
a thorough bootstrap analysis of the model coefficients, we could use the estat bootstrap
postestimation command after fitting the model with the bootstrap prefix.
Using example 1 from [R] bootstrap, we need many more replications for the confidence interval
types other than the normal based, so let's rerun the estimation command. We will reset the
random-number seed (in case we wish to reproduce the results), increase the number of replications,
and save the bootstrap distribution as a dataset called bsauto.dta.
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. set seed 1
. bootstrap _b, reps(1000) saving(bsauto) bca: regress mpg weight gear foreign
(output omitted )
. estat bootstrap, all
Linear regression                               Number of obs      =        74
                                                Replications       =      1000

------------------------------------------------------------------------------
             |    Observed               Bootstrap
         mpg |       Coef.       Bias    Std. Err.  [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.00613903   .0000567     .000628   -.0073699  -.0049082   (N)
             |                                      -.0073044  -.0048548   (P)
             |                                      -.0074355   -.004928  (BC)
             |                                      -.0075282  -.0050258 (BCa)
  gear_ratio |   1.4571134   .1051696   1.4554785   -1.395572   4.309799   (N)
             |                                      -1.262111   4.585372   (P)
             |                                      -1.523927   4.174376  (BC)
             |                                      -1.492223   4.231356 (BCa)
     foreign |  -2.2216815  -.0196361   1.2023286   -4.578202   .1348393   (N)
             |                                      -4.442199   .2677989   (P)
             |                                      -4.155504   .6170642  (BC)
             |                                      -4.216531   .5743973 (BCa)
       _cons |   36.101353   -.502281   5.4089441    25.50002   46.70269   (N)
             |                                       24.48569   46.07086   (P)
             |                                       25.59799   46.63227  (BC)
             |                                       25.85658   47.02108 (BCa)
------------------------------------------------------------------------------
(N)    normal confidence interval
(P)    percentile confidence interval
(BC)   bias-corrected confidence interval
(BCa)  bias-corrected and accelerated confidence interval
The estimated standard errors here differ from our previous estimates using only 100 replications
by, respectively, 8%, 3%, 11%, and 6%; see example 1 of [R] bootstrap. So much for our advice
that 50–200 replications are good enough to estimate standard errors. Well, the more replications the
better; that advice you should believe.
Which of the methods to compute confidence intervals should we use? If the statistic is unbiased,
the percentile (P) and bias-corrected (BC) methods should give similar results. The bias-corrected
confidence interval will be the same as the percentile confidence interval when the observed value of
the statistic is equal to the median of the bootstrap distribution. Thus, for unbiased statistics, the two
methods should give similar results as the number of replications becomes large. For biased statistics,
the bias-corrected method should yield confidence intervals with better coverage probability (closer
to the nominal value of 95% or whatever was specified) than the percentile method. For statistics
with variances that vary as a function of the parameter of interest, the bias-corrected and accelerated
method (BCa) will typically have better coverage probability than the others.
When the bootstrap distribution is approximately normal, all these methods should give similar
confidence intervals as the number of replications becomes large. If we examine the normality of
these bootstrap distributions using, say, the pnorm command (see [R] diagnostic plots), we see that
they closely follow a normal distribution. Thus here, the normal approximation would also be a valid
choice. The chief advantage of the normal-approximation method is that it (supposedly) requires fewer
replications than the other methods. Of course, it should be used only when the bootstrap distribution
exhibits normality.
We can load bsauto.dta containing the bootstrap distributions for these coefficients:
. use bsauto
(bootstrap: regress)
. describe *
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
_b_weight       float  %9.0g                  _b[weight]
_b_gear_ratio   float  %9.0g                  _b[gear_ratio]
_b_foreign      float  %9.0g                  _b[foreign]
_b_cons         float  %9.0g                  _b[_cons]
We can now run other commands, such as pnorm, on the bootstrap distributions. As with all
standard estimation commands, we can use the bootstrap command to replay its output table. The
default variable names assigned to the statistics in exp_list are _bs_1, _bs_2, . . . , and each variable
is labeled with the associated expression. The naming convention for the extended expressions _b
and _se is to prepend _b_ and _se_, respectively, onto the name of each element of the coefficient
vector. Here the first coefficient is _b[weight], so bootstrap named it _b_weight.
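For instance, to check the normality of the bootstrap distribution of the weight coefficient (a minimal sketch using bsauto.dta, loaded above):

. pnorm _b_weight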
Methods and formulas
All postestimation commands listed above are implemented as ado-files.
Also see
[R] bootstrap — Bootstrap sampling and estimation
[U] 20 Estimation and postestimation commands
Title
boxcox — Box–Cox regression models
Syntax
    boxcox depvar [indepvars] [if] [in] [weight] [, options]

    options                    description
    ------------------------------------------------------------------------
    Model
      noconstant               suppress constant term
      model(lhsonly)           left-hand-side Box–Cox model; the default
      model(rhsonly)           right-hand-side Box–Cox model
      model(lambda)            both-sides Box–Cox model with same parameter
      model(theta)             both-sides Box–Cox model with different
                                 parameters
      notrans(varlist)         nontransformed independent variables

    Reporting
      level(#)                 set confidence level; default is level(95)
      lrtest                   perform likelihood-ratio test

    Maximization
      nolog                    suppress full-model iteration log
      nologlr                  suppress restricted-model lrtest iteration log
      maximize_options         control the maximization process; seldom used
    ------------------------------------------------------------------------
    depvar and indepvars may contain time-series operators; see
      [U] 11.4.4 Time-series varlists.
    bootstrap, by, jackknife, rolling, statsby, and xi are allowed; see
      [U] 11.1.10 Prefix commands.
    Weights are not allowed with the bootstrap prefix.
    fweights and iweights are allowed; see [U] 11.1.6 weight.
    See [U] 20 Estimation and postestimation commands for more capabilities
      of estimation commands.

Menu

Statistics > Linear models and related > Box-Cox regression
Description
boxcox finds the maximum likelihood estimates of the parameters of the Box–Cox transform, the
coefficients on the independent variables, and the standard deviation of the normally distributed errors
for a model in which depvar is regressed on indepvars. You can fit the following models:
    Option               Estimates
    ---------------------------------------------------------------------------
    lhsonly              y_j^(θ) = β_1 x_1j + β_2 x_2j + ... + β_k x_kj + ε_j
    rhsonly              y_j = β_1 x_1j^(λ) + β_2 x_2j^(λ) + ... + β_k x_kj^(λ) + ε_j
    rhsonly notrans()    y_j = β_1 x_1j^(λ) + ... + β_k x_kj^(λ) + γ_1 z_1j + ... + γ_l z_lj + ε_j
    lambda               y_j^(λ) = β_1 x_1j^(λ) + β_2 x_2j^(λ) + ... + β_k x_kj^(λ) + ε_j
    lambda notrans()     y_j^(λ) = β_1 x_1j^(λ) + ... + β_k x_kj^(λ) + γ_1 z_1j + ... + γ_l z_lj + ε_j
    theta                y_j^(θ) = β_1 x_1j^(λ) + β_2 x_2j^(λ) + ... + β_k x_kj^(λ) + ε_j
    theta notrans()      y_j^(θ) = β_1 x_1j^(λ) + ... + β_k x_kj^(λ) + γ_1 z_1j + ... + γ_l z_lj + ε_j
    ---------------------------------------------------------------------------
Any variable to be transformed must be strictly positive.
Options
Model
noconstant; see [R] estimation options.
model( lhsonly | rhsonly | lambda | theta ) specifies which of the four models to fit.
model(lhsonly) applies the Box–Cox transform to depvar only. model(lhsonly) is the default.
model(rhsonly) applies the transform to the indepvars only.
model(lambda) applies the transform to both depvar and indepvars, and they are transformed by
the same parameter.
model(theta) applies the transform to both depvar and indepvars, but this time, each side is
transformed by a separate parameter.
notrans(varlist) specifies that the variables in varlist be included as nontransformed independent
variables.
Reporting
level(#); see [R] estimation options.
lrtest specifies that a likelihood-ratio test of significance be performed and reported for each
independent variable.
Maximization
nolog suppresses the iteration log when fitting the full model.
nologlr suppresses the iteration log when fitting the restricted models required by the lrtest option.
maximize_options: iterate(#) and from(init_specs); see [R] maximize.

    Model      Initial value specification
    ------------------------------------------
    lhsonly    from(θ0, copy)
    rhsonly    from(λ0, copy)
    lambda     from(λ0, copy)
    theta      from(λ0 θ0, copy)
Remarks
Remarks are presented under the following headings:
Introduction
Theta model
Lambda model
Left-hand-side-only model
Right-hand-side-only model
Introduction
The Box–Cox transform

$$ y^{(\lambda)} = \frac{y^{\lambda} - 1}{\lambda} $$

has been widely used in applied data analysis. Box and Cox (1964) developed the transformation and
argued that the transformation could make the residuals more closely normal and less heteroskedastic.
Cook and Weisberg (1982) discuss the transform in this light. Because the transform embeds several
popular functional forms, it has received some attention as a method for testing functional forms, in
particular,

$$ y^{(\lambda)} = \begin{cases} y - 1 & \text{if } \lambda = 1 \\ \ln(y) & \text{if } \lambda = 0 \\ 1 - 1/y & \text{if } \lambda = -1 \end{cases} $$

Davidson and MacKinnon (1993) discuss this use of the transform. Atkinson (1985) also gives a good
general treatment.
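As an aside, the transform is easy to compute directly for a given λ; a minimal sketch using a strictly positive variable from the auto data (the value λ = 0.5 and the variable name mpg_bc are arbitrary):

. use http://www.stata-press.com/data/r11/auto, clear
. scalar lambda = 0.5
. generate double mpg_bc = (mpg^lambda - 1)/lambda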
Theta model
boxcox obtains the maximum likelihood estimates of the parameters for four different models.
The most general of the models, the theta model, is
$$ y_j^{(\theta)} = \beta_0 + \beta_1 x_{1j}^{(\lambda)} + \beta_2 x_{2j}^{(\lambda)} + \cdots + \beta_k x_{kj}^{(\lambda)} + \gamma_1 z_{1j} + \gamma_2 z_{2j} + \cdots + \gamma_l z_{lj} + \epsilon_j $$

where ε ~ N(0, σ²). Here the dependent variable, y, is subject to a Box–Cox transform with
parameter θ. Each of the indepvars, x1, x2, . . . , xk, is transformed by a Box–Cox transform with
parameter λ. The z1, z2, . . . , zl specified in the notrans() option are independent variables that are
not transformed.
Box and Cox (1964) argued that this transformation would leave behind residuals that more closely
follow a normal distribution than those produced by a simple linear regression model. Bear in mind
that the normality of ε is assumed and that boxcox obtains maximum likelihood estimates of the
k + l + 4 parameters under this assumption. boxcox does not choose λ and θ so that the residuals are
approximately normally distributed. If you are interested in this type of transformation to normality,
see the official Stata commands lnskew0 and bcskew0 in [R] lnskew0. However, those commands
work on a more restrictive model in which none of the independent variables is transformed.
Example 1
Consider an example using the auto data.
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. boxcox mpg weight price, notrans(foreign) model(theta) lrtest
Fitting comparison model
Iteration 0:   log likelihood = -234.39434
Iteration 1:   log likelihood = -228.26891
Iteration 2:   log likelihood = -228.26777
Iteration 3:   log likelihood = -228.26777
Fitting full model
Iteration 0:   log likelihood = -194.13727
  (output omitted )
Fitting comparison models for LR tests
Iteration 0:   log likelihood = -179.58214
Iteration 1:   log likelihood = -177.59036
Iteration 2:   log likelihood = -177.58739
Iteration 3:   log likelihood = -177.58739
Iteration 0:   log likelihood = -203.92855
Iteration 1:   log likelihood = -201.30202
Iteration 2:   log likelihood = -201.18359
Iteration 3:   log likelihood = -201.18233
Iteration 4:   log likelihood = -201.18233
Iteration 0:   log likelihood = -178.83799
Iteration 1:   log likelihood = -175.98405
Iteration 2:   log likelihood = -175.97931
Iteration 3:   log likelihood = -175.97931

                                                  Number of obs   =         74
                                                  LR chi2(4)      =     105.19
Log likelihood = -175.67343                       Prob > chi2     =      0.000

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     /lambda |    .760169   .6289991     1.21   0.227    -.4726466    1.992985
      /theta |  -.7189315   .3244439    -2.22   0.027     -1.35483   -.0830331
------------------------------------------------------------------------------

Estimates of scale-variant parameters
---------------------------------------------------------------
             |      Coef.    chi2(df)   P>chi2(df)   df of chi2
-------------+-------------------------------------------------
Notrans      |
     foreign |  -.0114338      3.828       0.050          1
       _cons |   1.377399
-------------+-------------------------------------------------
Trans        |
      weight |   -.000239     51.018       0.000          1
       price |  -6.18e-06      0.612       0.434          1
-------------+-------------------------------------------------
      /sigma |   .0143509
---------------------------------------------------------------

-----------------------------------------------------------------
   Test                Restricted      LR statistic    P-value
   H0:               log likelihood        chi2       Prob > chi2
-----------------------------------------------------------------
   theta=lambda = -1    -181.64479        11.94          0.001
   theta=lambda =  0    -178.2406          5.13          0.023
   theta=lambda =  1    -194.13727        36.93          0.000
-----------------------------------------------------------------
The output is composed of the iteration logs and three distinct tables. The first table contains
a standard header for a maximum likelihood estimator and a standard output table for the Box–
Cox transform parameters. The second table contains the estimates of the scale-variant parameters.
The third table contains the output from likelihood-ratio tests on three standard functional form
specifications.
If we were to interpret this output, the right-hand-side transformation would not significantly add
to the regression, whereas the left-hand-side transformation would make the 5% but not the 1%
cutoff. price is certainly not significant, and foreign lies right on the 5% cutoff. weight is clearly
significant. The output also shows that the linear and multiplicative inverse specifications are both
strongly rejected. A natural log specification can be rejected at the 5% level but not at the 1% level.
Technical note
Spitzer (1984) showed that the Wald tests of the joint significance of the coefficients of the
right-hand-side variables, either transformed or untransformed, are not invariant to changes in the
scale of the transformed dependent variable. Davidson and MacKinnon (1993) also discuss this point.
This problem demonstrates that Wald statistics can be manipulated in nonlinear models. Lafontaine
and White (1986) analyze this problem numerically, and Phillips and Park (1988) analyze it by using
Edgeworth expansions. See Drukker (2000b) for a more detailed discussion of this issue. Because the
parameter estimates and their Wald tests are not scale invariant, no Wald tests or confidence intervals
are reported for these parameters. However, when the lrtest option is specified, likelihood-ratio
tests are performed and reported. Schlesselman (1971) showed that, if a constant is included in the
model, the parameter estimates of the Box–Cox transforms are scale invariant. For this reason, we
strongly recommend that you not use the noconstant option.
The lrtest option does not perform a likelihood-ratio test on the constant, so no value for this
statistic is reported: unless the data are properly scaled, the restricted model often fails to converge.
However, if you have a special interest in performing this test, you can do so by fitting the constrained
model separately. If problems with convergence are encountered, rescaling the data by their means
may help.
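A minimal sketch of one way to perform the test by hand, continuing example 1 (convergence of the noconstant fit is not guaranteed, and the resulting estimates are not scale invariant, as noted above):

. boxcox mpg weight price, notrans(foreign) model(theta)
. scalar ll_full = e(ll)
. boxcox mpg weight price, notrans(foreign) model(theta) noconstant
. display chi2tail(1, 2*(ll_full - e(ll)))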
Lambda model
A less general model than the one above is called the lambda model. It specifies that the same
parameter be used in both the left-hand-side and right-hand-side transformations. Specifically,
$$ y_j^{(\lambda)} = \beta_0 + \beta_1 x_{1j}^{(\lambda)} + \beta_2 x_{2j}^{(\lambda)} + \cdots + \beta_k x_{kj}^{(\lambda)} + \gamma_1 z_{1j} + \gamma_2 z_{2j} + \cdots + \gamma_l z_{lj} + \epsilon_j $$

where ε ~ N(0, σ²). Here the depvar variable, y, and each of the indepvars, x1, x2, . . . , xk, are
transformed by a Box–Cox transform with the common parameter λ. Again the z1, z2, . . . , zl are
independent variables that are not transformed.
Left-hand-side-only model
Even more restrictive than a common transformation parameter is transforming the dependent
variable only. Because the dependent variable is on the left-hand side of the equation, this model is
known as the lhsonly model. Here you are estimating the parameters of the model
$$ y_j^{(\theta)} = \beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + \cdots + \beta_k x_{kj} + \epsilon_j $$

where ε ~ N(0, σ²). Here only the depvar, y, is transformed by a Box–Cox transform with the
parameter θ.
Example 2
We again hypothesize mpg to be a function of weight, price, and foreign in a Box–Cox model
in which only mpg is subject to the transform:
. boxcox mpg weight price foreign, model(lhs) lrtest nolog nologlr
Fitting comparison model
Fitting full model
Fitting comparison models for LR tests

                                                  Number of obs   =         74
                                                  LR chi2(3)      =     105.04
Log likelihood = -175.74705                       Prob > chi2     =      0.000

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      /theta |  -.7826999    .281954    -2.78   0.006     -1.33532   -.2300802
------------------------------------------------------------------------------

Estimates of scale-variant parameters
---------------------------------------------------------------
             |      Coef.    chi2(df)   P>chi2(df)   df of chi2
-------------+-------------------------------------------------
Notrans      |
      weight |  -.0000294     58.056       0.000          1
       price |  -4.66e-07      0.469       0.493          1
     foreign |  -.0097564      4.644       0.031          1
       _cons |   1.249845
-------------+-------------------------------------------------
      /sigma |   .0118454
---------------------------------------------------------------

-----------------------------------------------------------------
   Test                Restricted      LR statistic    P-value
   H0:               log likelihood        chi2       Prob > chi2
-----------------------------------------------------------------
   theta = -1           -176.04312         0.59          0.442
   theta =  0           -179.54104         7.59          0.006
   theta =  1           -194.13727        36.78          0.000
-----------------------------------------------------------------
This model rejects both linear and log specifications of mpg but fails to reject the hypothesis
that 1/mpg is linear in the independent variables. These findings are in line with what an engineer
would have expected. In engineering terms, gallons per mile represents actual energy consumption,
and energy consumption should be approximately linear in weight.
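One could check this interpretation by fitting the reciprocal directly (a sketch; the variable name gpm is arbitrary):

. generate double gpm = 1/mpg
. regress gpm weight price foreign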
Right-hand-side-only model
The fourth model leaves the depvar alone and transforms a subset of the indepvars using the
parameter λ. This is the rhsonly model. In this model, the depvar, y , is given by
$$ y_j = \beta_0 + \beta_1 x_{1j}^{(\lambda)} + \beta_2 x_{2j}^{(\lambda)} + \cdots + \beta_k x_{kj}^{(\lambda)} + \gamma_1 z_{1j} + \gamma_2 z_{2j} + \cdots + \gamma_l z_{lj} + \epsilon_j $$

where ε ~ N(0, σ²). Here each of the indepvars, x1, x2, . . . , xk, is transformed by a Box–Cox
transform with the parameter λ. Again the z1, z2, . . . , zl are independent variables that are not
transformed.
Example 3
Here is an example with the rhsonly model. price and foreign are not included in the list of
covariates. (You are invited to use the auto data and check that they fare no better here than above.)
. boxcox mpg weight, model(rhs) lrtest nolog nologlr
Fitting full model
Fitting comparison models for LR tests
Comparison model for LR test on weight is a linear regression.
Lambda is not identified in the restricted model.

                                                  Number of obs   =         74
                                                  LR chi2(2)      =      82.90
Log likelihood = -192.94368                       Prob > chi2     =      0.000

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     /lambda |  -.4460916   .6551108    -0.68   0.496    -1.730085    .8379019
------------------------------------------------------------------------------

Estimates of scale-variant parameters
---------------------------------------------------------------
             |      Coef.    chi2(df)   P>chi2(df)   df of chi2
-------------+-------------------------------------------------
Notrans      |
       _cons |   1359.092
-------------+-------------------------------------------------
Trans        |
      weight |  -614.3874     82.901       0.000          2
-------------+-------------------------------------------------
      /sigma |   3.281854
---------------------------------------------------------------

-----------------------------------------------------------------
   Test                Restricted      LR statistic    P-value
   H0:               log likelihood        chi2       Prob > chi2
-----------------------------------------------------------------
   lambda = -1          -193.2893          0.69          0.406
   lambda =  0          -193.17892         0.47          0.493
   lambda =  1          -195.38869         4.89          0.027
-----------------------------------------------------------------
The interpretation of the output is similar to that in all the cases above, with one caveat. As
requested, a likelihood-ratio test was performed on the lone independent variable. However, when it is
dropped to form the constrained model, the comparison model is not a right-hand-side-only Box–Cox
model but rather a simple linear regression on a constant model. When weight is dropped, there are
no longer any transformed variables. Hence, λ is not identified, and it must also be dropped. This
process leaves a linear regression on a constant as the “comparison model”. It also implies that the
test statistic has 2 degrees of freedom instead of 1. At the top of the output, a more concise warning
informs you of this point.
A similar identification issue can also arise in the lambda and theta models when only one
independent variable is specified. In these cases, warnings also appear on the output.
Saved results
boxcox saves the following in e():
Scalars
    e(N)              number of observations
    e(ll)             log likelihood
    e(chi2)           LR statistic of full vs. comparison
    e(df_m)           full model degrees of freedom
    e(ll0)            log likelihood of the restricted model
    e(df_r)           restricted model degrees of freedom
    e(ll_t1)          log likelihood of model λ=θ=1
    e(chi2_t1)        LR of λ=θ=1 vs. full model
    e(p_t1)           p-value of λ=θ=1 vs. full model
    e(ll_tm1)         log likelihood of model λ=θ=−1
    e(chi2_tm1)       LR of λ=θ=−1 vs. full model
    e(p_tm1)          p-value of λ=θ=−1 vs. full model
    e(ll_t0)          log likelihood of model λ=θ=0
    e(chi2_t0)        LR of λ=θ=0 vs. full model
    e(p_t0)           p-value of λ=θ=0 vs. full model
    e(rank)           rank of e(V)
    e(ic)             number of iterations
    e(rc)             return code

Macros
    e(cmd)            boxcox
    e(cmdline)        command as typed
    e(depvar)         name of dependent variable
    e(model)          lhsonly, rhsonly, lambda, or theta
    e(wtype)          weight type
    e(wexp)           weight expression
    e(ntrans)         yes if nontransformed indepvars
    e(chi2type)       LR; type of model χ2 test
    e(lrtest)         lrtest, if requested
    e(properties)     b V
    e(predict)        program used to implement predict
    e(marginsnotok)   predictions disallowed by margins

Matrices
    e(b)              coefficient vector
    e(V)              variance–covariance matrix of the estimators (see note below)
    e(pm)             p-values for LR tests on indepvars
    e(df)             degrees of freedom of LR tests on indepvars
    e(chi2m)          LR statistics for tests on indepvars

Functions
    e(sample)         marks estimation sample

e(V) contains all zeros, except for the elements that correspond to the parameters of the Box–Cox
transform.
Methods and formulas
boxcox is implemented as an ado-file.
In the internal computations,

$$ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda} & \text{if } |\lambda| > 10^{-10} \\ \ln(y) & \text{otherwise} \end{cases} $$

The unconcentrated log likelihood for the theta model is

$$ \ln L = \Bigl(\frac{-N}{2}\Bigr)\bigl\{ \ln(2\pi) + \ln(\sigma^2) \bigr\} + (\theta - 1)\sum_{i=1}^{N} \ln(y_i) - \frac{1}{2\sigma^2}\,{\rm SSR} $$

where

$$ {\rm SSR} = \sum_{i=1}^{N} \bigl( y_i^{(\theta)} - \beta_0 - \beta_1 x_{i1}^{(\lambda)} - \beta_2 x_{i2}^{(\lambda)} - \cdots - \beta_k x_{ik}^{(\lambda)} - \gamma_1 z_{i1} - \gamma_2 z_{i2} - \cdots - \gamma_l z_{il} \bigr)^2 $$

Writing the SSR in matrix form,

$$ {\rm SSR} = \bigl( \mathbf{Y}^{(\theta)} - \mathbf{X}^{(\lambda)} \mathbf{b}' - \mathbf{Z}\mathbf{g}' \bigr)' \bigl( \mathbf{Y}^{(\theta)} - \mathbf{X}^{(\lambda)} \mathbf{b}' - \mathbf{Z}\mathbf{g}' \bigr) $$

where $\mathbf{Y}^{(\theta)}$ is an $N \times 1$ vector of elementwise transformed data, $\mathbf{X}^{(\lambda)}$ is an $N \times k$ matrix of
elementwise transformed data, $\mathbf{Z}$ is an $N \times l$ matrix of untransformed data, $\mathbf{b}$ is a $1 \times k$ vector of
coefficients, and $\mathbf{g}$ is a $1 \times l$ vector of coefficients. Letting

$$ \mathbf{W}_{\lambda} = \bigl( \mathbf{X}^{(\lambda)}\ \mathbf{Z} \bigr) $$

be the horizontal concatenation of $\mathbf{X}^{(\lambda)}$ and $\mathbf{Z}$ and

$$ \mathbf{d}' = \begin{pmatrix} \mathbf{b}' \\ \mathbf{g}' \end{pmatrix} $$

be the vertical concatenation of the coefficients yields

$$ {\rm SSR} = \bigl( \mathbf{Y}^{(\theta)} - \mathbf{W}_{\lambda}\mathbf{d}' \bigr)' \bigl( \mathbf{Y}^{(\theta)} - \mathbf{W}_{\lambda}\mathbf{d}' \bigr) $$

For given values of $\lambda$ and $\theta$, the solutions for $\mathbf{d}'$ and $\sigma^2$ are

$$ \widehat{\mathbf{d}}' = \bigl( \mathbf{W}_{\lambda}' \mathbf{W}_{\lambda} \bigr)^{-1} \mathbf{W}_{\lambda}' \mathbf{Y}^{(\theta)} $$

and

$$ \widehat{\sigma}^2 = \frac{1}{N} \bigl( \mathbf{Y}^{(\theta)} - \mathbf{W}_{\lambda}\widehat{\mathbf{d}}' \bigr)' \bigl( \mathbf{Y}^{(\theta)} - \mathbf{W}_{\lambda}\widehat{\mathbf{d}}' \bigr) $$

Substituting these solutions into the log-likelihood function yields the concentrated log-likelihood
function

$$ \ln L_c = \Bigl(-\frac{N}{2}\Bigr)\bigl\{ \ln(2\pi) + 1 + \ln(\widehat{\sigma}^2) \bigr\} + (\theta - 1)\sum_{i=1}^{N} \ln(y_i) $$

Similar calculations yield the concentrated log-likelihood function for the lambda model,

$$ \ln L_c = \Bigl(-\frac{N}{2}\Bigr)\bigl\{ \ln(2\pi) + 1 + \ln(\widehat{\sigma}^2) \bigr\} + (\lambda - 1)\sum_{i=1}^{N} \ln(y_i) $$

the lhsonly model,

$$ \ln L_c = \Bigl(-\frac{N}{2}\Bigr)\bigl\{ \ln(2\pi) + 1 + \ln(\widehat{\sigma}^2) \bigr\} + (\theta - 1)\sum_{i=1}^{N} \ln(y_i) $$

and the rhsonly model,

$$ \ln L_c = \Bigl(-\frac{N}{2}\Bigr)\bigl\{ \ln(2\pi) + 1 + \ln(\widehat{\sigma}^2) \bigr\} $$

where $\widehat{\sigma}^2$ is specific to each model and is defined analogously to that in the theta model.
References
Atkinson, A. C. 1985. Plots, Transformations, and Regression: An Introduction to Graphical Methods of Diagnostic
Regression Analysis. Oxford: Oxford University Press.
Box, G. E. P., and D. R. Cox. 1964. An analysis of transformations. Journal of the Royal Statistical Society, Series
B 26: 211–243.
Carroll, R. J., and D. Ruppert. 1988. Transformation and Weighting in Regression. New York: Chapman & Hall.
Cook, R. D., and S. Weisberg. 1982. Residuals and Influence in Regression. New York: Chapman & Hall/CRC.
Davidson, R., and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University
Press.
Drukker, D. M. 2000a. sg130: Box–Cox regression models. Stata Technical Bulletin 54: 27–36. Reprinted in Stata
Technical Bulletin Reprints, vol. 9, pp. 307–319. College Station, TX: Stata Press.
. 2000b. sg131: On the manipulability of Wald tests in Box–Cox regression models. Stata Technical Bulletin 54:
36–42. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 319–327. College Station, TX: Stata Press.
Lafontaine, F., and K. J. White. 1986. Obtaining any Wald statistic you want. Economics Letters 21: 35–40.
Phillips, P. C. B., and J. Y. Park. 1988. On the formulation of Wald tests of nonlinear restrictions. Econometrica 56:
1065–1083.
Schlesselman, J. J. 1971. Power families: A note on the Box and Cox transformation. Journal of the Royal Statistical
Society, Series B 33: 307–311.
Spitzer, J. J. 1984. Variance estimates in models with the Box–Cox transformation: Implications for estimation and
hypothesis testing. Review of Economics and Statistics 66: 645–652.
Also see
[R] boxcox postestimation — Postestimation tools for boxcox
[R] regress — Linear regression
[R] lnskew0 — Find zero-skewness log or Box – Cox transform
[U] 20 Estimation and postestimation commands
Title
boxcox postestimation — Postestimation tools for boxcox
Description

The following postestimation commands are available for boxcox:

    command        description
    -------------------------------------------------------------------------
      estat        AIC, BIC, VCE, and estimation sample summary
      estimates    cataloging estimation results
    * lincom       point estimates, standard errors, testing, and inference
                     for linear combinations of coefficients
    * nlcom        point estimates, standard errors, testing, and inference
                     for nonlinear combinations of coefficients
      predict      predictions, residuals, influence statistics, and other
                     diagnostic measures
    * test         Wald tests of simple and composite linear hypotheses
    * testnl       Wald tests of nonlinear hypotheses
    -------------------------------------------------------------------------
    * Inference is valid only for hypotheses concerning λ and θ.

See the corresponding entries in the Base Reference Manual for details.
Syntax for predict

    predict [type] newvar [if] [in] [, statistic]

    statistic     description
    -----------------------------------------------------
    Main
      xbt         transformed linear prediction; the default
      yhat        predicted value of y
      residuals   residuals
    -----------------------------------------------------
    These statistics are available both in and out of sample; type
    predict ... if e(sample) ... if wanted only for the estimation sample.

Menu

Statistics > Postestimation > Predictions, residuals, etc.
Options for predict
Main
xbt, the default, calculates the “linear” prediction. For all the models except model(lhsonly), all
the indepvars except those specified in the notrans() option are transformed.
yhat calculates the predicted value of y .
residuals calculates the residuals after the predicted value of y has been subtracted from the actual
value.
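For instance (a minimal sketch; the new variable names are arbitrary):

. boxcox mpg weight price, model(lhsonly)
. predict double mpghat, yhat
. predict double res, residuals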
Remarks
boxcox estimates variances only for the λ and θ parameters (see the technical note in [R] boxcox),
so the extent to which postestimation commands can be used following boxcox is limited. Formulas
used in lincom, nlcom, test, and testnl are dependent on the estimated variances. Therefore,
the use of these commands is limited and generally applicable only to inferences on the λ and θ
coefficients.
Methods and formulas
All postestimation commands listed above are implemented as ado-files.
Also see
[R] boxcox — Box–Cox regression models
[R] lnskew0 — Find zero-skewness log or Box – Cox transform
[U] 20 Estimation and postestimation commands
Title
brier — Brier score decomposition
Syntax

    brier outcomevar forecastvar [if] [in] [, group(#)]

by is allowed; see [D] by.

Menu

Statistics > Epidemiology and related > Other > Brier score decomposition
Description
brier computes the Yates, Sanders, and Murphy decompositions of the Brier Mean Probability
Score. outcomevar contains 0/1 values reflecting the actual outcome of the experiment, and forecastvar
contains the corresponding probabilities as predicted by, say, logit, probit, or a human forecaster.
Option
Main
group(#) specifies the number of groups that will be used to compute the decomposition. group(10)
is the default.
Remarks
You have a binary (0/1) response and a formula that predicts the corresponding probabilities of
having observed a positive outcome (1). If the probabilities were obtained from logistic regression,
there are many methods that assess goodness of fit (see, for instance, estat gof in [R] logistic).
However, the probabilities might be computed from a published formula or from a model fit on
another sample, both completely unrelated to the data at hand, or perhaps the forecasts are not from
a formula at all. In any case, you now have a test dataset consisting of the forecast probabilities and
observed outcomes. Your test dataset might, for instance, record predictions made by a meteorologist
on the probability of rain along with a variable recording whether it actually rained.
The Brier score is an aggregate measure of disagreement between the observed outcome and a
prediction — the average squared error difference. The Brier score decomposition is a partition of the
Brier score into components that suggest reasons for discrepancy. These reasons fall roughly into
three groups: 1) lack of overall calibration between the average predicted probability and the actual
probability of the event in your data, 2) misfit of the data in groups defined within your sample, and
3) inability to match actual 0 and 1 responses.
Problem 1 refers to simply overstating or understating the probabilities.
Problem 2 refers to what is standardly called a goodness-of-fit test: the data are grouped, and the
predictions for the group are compared with the outcomes.
Problem 3 refers to an individual-level measure of fit. Imagine that the grouped outcomes are predicted
on average correctly but that within the group, the outcomes are poorly predicted.
Using logit or probit analysis to fit your data will guarantee that there is no lack of fit due to problem
1, and a good model fitter will be able to avoid problem 2. Problem 3 is inherent in any prediction
exercise.
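For instance, forecast probabilities obtained from a logistic regression can be scored directly (a minimal sketch on the auto data; the outcome and predictors are chosen only for illustration):

. use http://www.stata-press.com/data/r11/auto, clear
. logit foreign mpg weight
. predict double phat
. brier foreign phat, group(5)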
Example 1
We have data on the outcomes of 20 basketball games (win) and the probability of victory predicted
by a local pundit (for).
. use http://www.stata-press.com/data/r11/bball
. summarize win for
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         win |        20         .65    .4893605          0          1
         for |        20       .4785    .2147526        .15         .9

. brier win for, group(5)

Mean probability of outcome          0.6500
                    of forecast      0.4785
Correlation                          0.5907
ROC area                             0.8791   p = 0.0030
Brier score                          0.1828
Spiegelhalter's z-statistic         -0.6339   p = 0.7369
Sanders-modified Brier score         0.1861
Sanders resolution                   0.1400
Outcome index variance               0.2275
Murphy resolution                    0.0875
Reliability-in-the-small             0.0461
Forecast variance                    0.0438
Excess forecast variance             0.0285
Minimum forecast variance            0.0153
Reliability-in-the-large             0.0294
2*Forecast-Outcome-Covar             0.1179
The mean probabilities of forecast and outcome are simply the mean of the predicted probabilities
and the actual outcomes (wins/losses). The correlation is the product-moment correlation between
them.
The Brier score measures the total difference between the event (winning) and the forecast
probability of that event as an average squared difference. As a benchmark, a perfect forecaster would
have a Brier score of 0, a perfect misforecaster (predicts probability of win is 1 when loses and 0
when wins) would have a Brier score of 1, and a fence-sitter (forecasts every game as 50/50) would
have a Brier score of 0.25. Our pundit is doing reasonably well.
Spiegelhalter’s z statistic is a standard normal test statistic for testing whether an individual Brier
score is extreme. The ROC area is the area under the receiver operating curve, and the associated test
is a test of whether it is greater than 0.5. The more accurate the forecast probabilities, the larger the
ROC area.
The Sanders-modified Brier score measures the difference between a grouped forecast measure
and the event, where the data are grouped by sorting the sample on the forecast and dividing it into
approximately equally sized groups. The difference between the modified and the unmodified score
is typically minimal. For this and the other statistics that require grouping—the Sanders and Murphy
resolutions and reliability-in-the-small—to be well-defined, group boundaries are chosen so as not
to allocate observations with the same forecast probability to different groups. This task is done by
grouping on the forecast using xtile, n(#), with # being the number of groups; see [D] pctile.
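For instance, the grouping can be reproduced by hand (a sketch; the variable name grp is arbitrary):

. xtile grp = for, nquantiles(5)
. tabulate grp win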
The Sanders resolution measures error that arises from statistical considerations in evaluating
the forecast for a group. A group with all positive or all negative outcomes would have a Sanders
resolution of 0; it would most certainly be feasible to predict exactly what happened to each member
of the group. If the group had 40% positive responses, on the other hand, a forecast that assigned
p = 0.4 to each member of the group would be a good one, and yet, there would be “errors” in
the squared difference sense. The “error” would be (1 − 0.4)2 or (0 − 0.4)2 for each member. The
Sanders resolution is the average across groups of such “expected” errors. The 0.1400 value in our
data from an overall Brier score of 0.1828 or 0.1861 suggests that a substantial portion of the “error”
in our data is inherent.
Outcome index variance is just the variance of the outcome variable. This is the expected value of
the Brier score if all the forecast probabilities were merely the average observed outcome. Remember
that a fence-sitter has an expected Brier score of 0.25; a smarter fence sitter (who would guess
p = 0.65 for these data) would have a Brier score of 0.2275.
The Murphy resolution measures the variation in the average outcomes across groups. If all groups
have the same frequency of positive outcomes, little information in any forecast is possible, and the
Murphy resolution is 0. If groups differ markedly, the Murphy resolution is as large as 0.25. The
0.0875 means that there is some variation but not a lot, and 0.0875 is probably higher than in most
real cases. If you had groups in your data that varied between 40% and 60% positive outcomes, the
Murphy resolution would be 0.01; between 30% and 70%, it would be 0.04.
Reliability-in-the-small measures the error that comes from the average forecast within group not
measuring the average outcome within group, a classical goodness-of-fit measure, with 0 meaning a
perfect fit and 1 meaning a complete lack of fit. The calculated value of 0.0461 shows some amount
of lack of fit. Remember, the number is squared, and we are saying that probabilities could be just
more than √0.0461 = 0.215, or 21.5%, off.
Forecast variance measures the amount of discrimination being attempted, that is, the variation in
the forecasted probabilities. A small number indicates a fence-sitter making constant predictions. If
the forecasts were from a logistic regression model, forecast variance would tend to increase with the
amount of information available. Our pundit shows considerable forecast variance of 0.0438 (standard
deviation √0.0438 = 0.2093), which is in line with the reliability-in-the-small, suggesting that the
forecaster is attempting as much variation as is available in these data.
Excess forecast variance is the amount of actual forecast variance over a theoretical minimum.
The theoretical minimum — called the minimum forecast variance — corresponds to forecasts of p0
for observations ultimately observed to be negative responses and p1 for observations ultimately
observed to be positive outcomes. Moreover, p0 and p1 are set to the average forecasts made for the
ultimate negative and positive outcomes. These predictions would be just as good as the predictions
the forecaster did make, and any variation in the actual forecast probabilities above this is useless.
If this number is large, above 1% – 2%, then the forecaster may be attempting more than is possible.
The 0.0285 in our data suggests this possibility.
Reliability-in-the-large measures the discrepancy between the mean forecast and the observed
fraction of positive outcomes. This discrepancy will be 0 for forecasts made by most statistical
models, at least when measured on the same sample used for estimation, because they, by design,
reproduce sample means. For our human pundit, the 0.0294 says that there is a √0.0294, or
17-percentage-point, difference. (This difference can also be found by calculating the difference in the
averages of the observed outcomes and forecast probabilities: 0.65 − 0.4785 = 0.17.) That difference,
however, is not significant, as we would see if we typed ttest win=for; see [R] ttest. If these data
were larger and the bias persisted, this difference would be a critical shortcoming of the forecast.
Twice the forecast-outcome covariance is a measure of how accurately the forecast corresponds to
the outcome. The concept is similar to that of R-squared in linear regression.
Saved results
brier saves the following in r():
Scalars
    r(p_roc)       significance of ROC area
    r(roc_area)    ROC area
    r(z)           Spiegelhalter's z statistic
    r(p)           significance of z statistic
    r(brier)       Brier score
    r(brier_s)     Sanders-modified Brier score
    r(sanders)     Sanders resolution
    r(oiv)         outcome index variance
    r(murphy)      Murphy resolution
    r(relinsm)     reliability-in-the-small
    r(Var_f)       forecast variance
    r(Var_fex)     excess forecast variance
    r(Var_fmin)    minimum forecast variance
    r(relinla)     reliability-in-the-large
    r(cov_2f)      2 × forecast-outcome covariance
Methods and formulas
brier is implemented as an ado-file.
See Wilks (2006, 284–287, 289–292, 298–299) or Schmidt and Griffith (2005) for a discussion of
the Brier score.
Let $d_j$, $j = 1, \ldots, N$, be the observed outcomes with $d_j = 0$ or $d_j = 1$, and let $f_j$ be the
corresponding forecasted probabilities that $d_j$ is 1, $0 \le f_j \le 1$. Assume that the data are ordered so
that $f_{j+1} \ge f_j$ (brier sorts the data to obtain this order). Divide the data into $K$ nearly equally
sized groups, with group 1 containing observations 1 through $j_2 - 1$, group 2 containing observations
$j_2$ through $j_3 - 1$, and so on.

Define

    $\bar f_0$ = average $f_j$ among $d_j = 0$
    $\bar f_1$ = average $f_j$ among $d_j = 1$
    $\bar f$ = average $f_j$
    $\bar d$ = average $d_j$
    $\tilde f_k$ = average $f_j$ in group $k$
    $\tilde d_k$ = average $d_j$ in group $k$
    $\tilde n_k$ = number of observations in group $k$

The Brier score is $\sum_j (d_j - f_j)^2/N$.

The Sanders-modified Brier score is $\sum_j (d_j - \tilde f_{k(j)})^2/N$.
Let $p_j$ denote the true but unknown probability that $d_j = 1$. Under the null hypothesis that
$p_j = f_j$ for all $j$, Spiegelhalter (1986) determined that the expectation and variance of the Brier
score is given by the following:

$$E(\text{Brier}) = \frac{1}{N}\sum_{j=1}^{N} f_j(1 - f_j)$$

$$\text{Var}(\text{Brier}) = \frac{1}{N^2}\sum_{j=1}^{N} f_j(1 - f_j)(1 - 2f_j)^2$$

Denoting the observed value of the Brier score by $O(\text{Brier})$, Spiegelhalter's $z$ statistic is given by

$$Z = \frac{O(\text{Brier}) - E(\text{Brier})}{\sqrt{\text{Var}(\text{Brier})}}$$

The corresponding p-value is given by the upper-tail probability of $Z$ under the standard normal
distribution.
The area under the ROC curve is estimated by applying the trapezoidal rule to the empirical ROC
curve. This area is Wilcoxon’s test statistic, so the corresponding p-value is just that of a one-sided
Wilcoxon test of the null hypothesis that the distribution of predictions is constant across the two
outcomes.
The Sanders resolution is $\sum_k \tilde n_k\{\tilde d_k(1 - \tilde d_k)\}/N$.

The outcome index variance is $\bar d(1 - \bar d)$.

The Murphy resolution is $\sum_k \tilde n_k(\tilde d_k - \bar d)^2/N$.

Reliability-in-the-small is $\sum_k \tilde n_k(\tilde d_k - \tilde f_k)^2/N$.

The forecast variance is $\sum_j (f_j - \bar f)^2/N$.

The minimum forecast variance is $\bigl\{\sum_{j\in F}(f_j - \bar f_0)^2 + \sum_{j\in S}(f_j - \bar f_1)^2\bigr\}/N$, where $F$ is the
set of observations for which $d_j = 0$ and $S$ is the complement.

The excess forecast variance is the difference between the forecast variance and the minimum
forecast variance.

Reliability-in-the-large is $(\bar f - \bar d)^2$.

Twice the outcome covariance is $2(\bar f_1 - \bar f_0)\bar d(1 - \bar d)$.
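To make the decomposition concrete, the following do-file fragment is a minimal sketch that computes
the Brier score and Spiegelhalter's z statistic directly from the definitions above; it assumes a 0/1
outcome variable win and a forecast variable for (the variable names used in the example earlier in
this entry) and is not brier's internal code.

        * minimal sketch: Brier score and Spiegelhalter's z from the definitions
        * assumes win (0/1 outcome) and for (forecast probability) exist
        quietly count
        local N = r(N)
        generate double sqerr = (win - for)^2
        quietly summarize sqerr
        local O = r(mean)                     // observed Brier score
        generate double e0 = for*(1 - for)
        generate double v0 = for*(1 - for)*(1 - 2*for)^2
        quietly summarize e0
        local E = r(mean)                     // E(Brier) under the null
        quietly summarize v0
        local V = r(sum)/`N'^2                // Var(Brier) under the null
        display "Brier score = " `O'
        display "Spiegelhalter's z = " (`O' - `E')/sqrt(`V')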
Glenn Wilson Brier (1913–1998) was an American meteorological statistician who, after obtaining
degrees in physics and statistics, was for many years head of meteorological statistics at the
U.S. Weather Bureau, Washington, DC. In the latter part of his career, he was associated with
Colorado State University. Brier worked especially on verification and evaluation of predictions
and forecasts, statistical decision making, the statistical theory of turbulence, the analysis of
weather modification experiments, and the application of permutation techniques.
Acknowledgment
We thank Richard Goldstein for his contributions to this improved version of brier.
References
Brier, G. W. 1950. Verification of forecasts expressed in terms of probability. Monthly Weather Review 78: 1–3.
Goldstein, R. 1996. sg55: Extensions to the brier command. Stata Technical Bulletin 32: 21–22. Reprinted in Stata
Technical Bulletin Reprints, vol. 6, pp. 133–134. College Station, TX: Stata Press.
Hadorn, D. C., E. B. Keeler, W. H. Rogers, and R. H. Brook. 1993. Assessing the Performance of Mortality Prediction
Models. Santa Monica, CA: Rand.
Holloway, L., and P. Mielke. 1998. Glenn Wilson Brier 1913–1998. Bulletin of the American Meteorological Society
79: 1438–1439.
Jolliffe, I. T., and D. B. Stephenson, ed. 2003. Forecast Verification: A Practitioner’s Guide in Atmospheric Science.
Chichester, UK: Wiley.
Murphy, A. H. 1973. A new vector partition of the probability score. Journal of Applied Meteorology 12: 595–600.
———. 1997. Forecast verification. In Economic Value of Weather and Climate Forecasts, ed. R. W. Katz and A. H.
Murphy, 19–74. Cambridge: Cambridge University Press.
Redelmeier, D. A., D. A. Bloch, and D. H. Hickam. 1991. Assessing predictive accuracy: How to compare Brier
scores. Journal of Clinical Epidemiology 44: 1141–1146.
Rogers, W. H. 1992. sbe9: Brier score decomposition. Stata Technical Bulletin 10: 20–22. Reprinted in Stata Technical
Bulletin Reprints, vol. 2, pp. 92–94. College Station, TX: Stata Press.
Sanders, F. 1963. On subjective probability forecasting. Journal of Applied Meteorology 2: 191–201.
Schmidt, C. H., and J. L. Griffith. 2005. Multivariate classification rules: Calibration and discrimination. In Vol. 2 of
Encyclopedia of Biostatistics, ed. P. Armitage and T. Colton, 3492–3494. Chichester, UK: Wiley.
Spiegelhalter, D. J. 1986. Probabilistic prediction in patient management and clinical trials. Statistics in Medicine 5:
421–433.
Von Storch, H., and F. W. Zwiers. 1999. Statistical Analysis in Climate Research. Cambridge: Cambridge University
Press.
Wilks, D. S. 2006. Statistical Methods in the Atmospheric Sciences. 2nd ed. Burlington, MA: Academic Press.
Yates, J. F. 1982. External correspondence: Decompositions of the mean probability score. Organizational Behavior
and Human Performance 30: 132–156.
Also see
[R] logistic — Logistic regression, reporting odds ratios
[R] logit — Logistic regression, reporting coefficients
[R] probit — Probit regression
Title
bsample — Sampling with replacement
Syntax
    bsample [exp] [if] [in] [, options]

where exp is a standard Stata expression; see [U] 13 Functions and expressions.

options                 description
-----------------------------------------------------------------
  strata(varlist)       variables identifying strata
  cluster(varlist)      variables identifying resampling clusters
  idcluster(newvar)     create new cluster ID variable
  weight(varname)       replace varname with frequency weights
Menu
    Statistics > Resampling > Draw bootstrap sample
Description
bsample draws random samples with replacement from the data in memory.
exp specifies the size of the sample, which must be less than or equal to the number of sampling
units in the data. The observed number of units is the default when exp is not specified.
For a simple random sample (SRS) of the observations, exp must be less than or equal to _N (the
number of observations in the data; see [U] 13.4 System variables (_variables)).
For stratified SRS, exp must be less than or equal to N within the strata identified by the strata()
option.
For clustered sampling, exp must be less than or equal to Nc (the number of clusters identified by
the cluster() option).
For stratified sampling of clusters, exp must be less than or equal to Nc within the strata identified
by the strata() option.
Observations that do not meet the optional if and in criteria are dropped (not sampled).
Options
strata(varlist) specifies the variables identifying strata. If strata() is specified, bootstrap samples
are selected within each stratum.
cluster(varlist) specifies the variables identifying resampling clusters. If cluster() is specified,
the sample drawn during each replication is a bootstrap sample of clusters.
idcluster(newvar) creates a new variable containing a unique identifier for each resampled cluster.
weight(varname) specifies a variable in which the sampling frequencies will be placed. varname
must be an existing variable, which will be replaced. After bsample, varname can be used as
an fweight in any Stata command that accepts fweights, which can speed up resampling for
commands like regress and summarize. This option cannot be combined with idcluster().
By default, bsample replaces the data in memory with the sampled observations; however,
specifying the weight() option causes only the specified varname to be changed.
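For instance, the generated weights can be fed straight to another command as fweights; a minimal
sketch, assuming the auto dataset shipped with Stata:

        * minimal sketch: reuse bsample's frequency weights with summarize
        sysuse auto, clear
        generate fw = .
        bsample, weight(fw)
        summarize mpg [fweight=fw]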
Remarks
Below is a series of examples illustrating how bsample is used with various sampling schemes.
Example 1: Simple random sampling
We have data on the characteristics of hospital patients and wish to draw a simple random sample
of 200 patients. We type
. use http://www.stata-press.com/data/r11/bsample1
. bsample 200
. count
200
Example 2: Stratified samples with equal sizes
Among the variables in our dataset is female, an indicator for the female patients. To get a
stratified simple random sample of 200 female patients and 200 male patients, we type
. use http://www.stata-press.com/data/r11/bsample1, clear
. bsample 200, strata(female)
. tab female
     female |      Freq.     Percent        Cum.
------------+-----------------------------------
       male |        200       50.00       50.00
     female |        200       50.00      100.00
------------+-----------------------------------
      Total |        400      100.00
Example 3: Stratified samples with unequal sizes
To sample 300 females and 200 males, we must generate a variable that is 300 for females and
200 for males and then use this variable in exp when we call bsample.
. use http://www.stata-press.com/data/r11/bsample1, clear
. gen nsamp = cond(female,300,200)
. bsample nsamp, strata(female)
. tab female

     female |      Freq.     Percent        Cum.
------------+-----------------------------------
       male |        200       40.00       40.00
     female |        300       60.00      100.00
------------+-----------------------------------
      Total |        500      100.00
Example 4: Samples satisfying a condition
For a simple random sample of 200 female patients, we type
. use http://www.stata-press.com/data/r11/bsample1, clear
. bsample 200 if female
. tab female
     female |      Freq.     Percent        Cum.
------------+-----------------------------------
     female |        200      100.00      100.00
------------+-----------------------------------
      Total |        200      100.00
Example 5: Generating frequency weights
To identify the sampled observations using frequency weights instead of dropping unsampled
observations, we use the weight() option (we will need to supply it an existing variable name) and
type
. use http://www.stata-press.com/data/r11/bsample1, clear
. set seed 1234
. gen fw = .
(5810 missing values generated)
. bsample 200 if female, weight(fw)
. tabulate fw female
           |        female
        fw |      male     female |     Total
-----------+----------------------+----------
         0 |     2,392      3,221 |     5,613
         1 |         0        194 |       194
         2 |         0          3 |         3
-----------+----------------------+----------
     Total |     2,392      3,418 |     5,810
Note that (194 × 1) + (3 × 2) = 200.
Example 6: Oversampling observations
bsample requires the expression in exp to evaluate to a number that is less than or equal to the
number of observations. To sample twice as many male and female patients as there are already in
memory, we must expand the data before using bsample. For example,
. use http://www.stata-press.com/data/r11/bsample1, clear
. set seed 1234
. expand 2
(5810 observations created)
. bsample, strata(female)
. tab female
     female |      Freq.     Percent        Cum.
------------+-----------------------------------
       male |      4,784       41.17       41.17
     female |      6,836       58.83      100.00
------------+-----------------------------------
      Total |     11,620      100.00
Example 7: Stratified oversampling with unequal sizes
To sample twice as many female patients as male patients, we must expand the records for the
female patients because there are fewer than twice as many of them as there are male patients. First,
we put the number of observed male patients in a local macro. After expanding the female records,
we generate a variable that contains the number of observations to sample within the two groups.
. use http://www.stata-press.com/data/r11/bsample1, clear
. set seed 1234
. count if !female
2392
. local nmale = r(N)
. expand 2 if female
(3418 observations created)
. gen nsamp = cond(female,2*`nmale',`nmale')
. bsample nsamp, strata(female)
. tab female
     female |      Freq.     Percent        Cum.
------------+-----------------------------------
       male |      2,392       33.33       33.33
     female |      4,784       66.67      100.00
------------+-----------------------------------
      Total |      7,176      100.00
Example 8: Oversampling of clusters
For clustered data, sampling more clusters than are present in the original dataset requires more
than just expanding the data. To illustrate, suppose we wanted a bootstrap sample of eight clusters
from a dataset consisting of five clusters of observations.
. use http://www.stata-press.com/data/r11/bsample2, clear
. tabstat x, stat(n mean) by(group)
Summary for variables: x
  by categories of: group

   group |         N       mean
---------+----------------------
       A |        15  -.3073028
       B |        10    -.00984
       C |        11   .0810985
       D |        11  -.1989179
       E |        29   -.095203
---------+----------------------
   Total |        76  -.1153269
bsample will complain if we simply expand the dataset.
. use http://www.stata-press.com/data/r11/bsample2, clear
. expand 3
(152 observations created)
. bsample 8, cluster(group)
resample size must not be greater than number of clusters
r(498);
Expanding the data will only partly solve the problem. We also need a new variable that uniquely
identifies the copied clusters. We use the expandcl command to accomplish both these tasks; see
[D] expandcl.
. use http://www.stata-press.com/data/r11/bsample2, clear
. set seed 1234
. expandcl 2, generate(expgroup) cluster(group)
(76 observations created)
. tabstat x, stat(n mean) by(expgroup)
Summary for variables: x
  by categories of: expgroup

expgroup |         N       mean
---------+----------------------
       1 |        15  -.3073028
       2 |        15  -.3073028
       3 |        10    -.00984
       4 |        10    -.00984
       5 |        11   .0810985
       6 |        11   .0810985
       7 |        11  -.1989179
       8 |        11  -.1989179
       9 |        29   -.095203
      10 |        29   -.095203
---------+----------------------
   Total |       152  -.1153269
. gen fw = .
(152 missing values generated)
. bsample 8, cluster(expgroup) weight(fw)
. tabulate fw group
           |                         group
        fw |         A          B          C          D          E |     Total
-----------+--------------------------------------------------------+----------
         0 |        15         10          0          0         29 |        54
         1 |        15         10         22         22          0 |        69
         2 |         0          0          0          0         29 |        29
-----------+--------------------------------------------------------+----------
     Total |        30         20         22         22         58 |       152
The results from tabulate on the generated frequency weight variable versus the original cluster ID
(group) show us that the bootstrap sample contains one copy of cluster A, one copy of cluster B, two
copies of cluster C, two copies of cluster D, and two copies of cluster E (1 + 1 + 2 + 2 + 2 = 8).
Example 9: Stratified oversampling of clusters
Suppose that we have a dataset containing two strata with five clusters in each stratum, but the
cluster identifiers are not unique between the strata. To get a stratified bootstrap sample with eight
clusters in each stratum, we first use expandcl to expand the data and get a new cluster ID variable.
We use cluster(strid group) in the call to expandcl; this action will uniquely identify the
2 × 5 = 10 clusters across the strata.
. use http://www.stata-press.com/data/r11/bsample2, clear
. set seed 1234
. tab group strid
           |        strid
     group |         1          2 |     Total
-----------+----------------------+----------
         A |         7          8 |        15
         B |         5          5 |        10
         C |         5          6 |        11
         D |         5          6 |        11
         E |        14         15 |        29
-----------+----------------------+----------
     Total |        36         40 |        76
. expandcl 2, generate(expgroup) cluster(strid group)
(76 observations created)
Now we can use bsample with the expanded data, stratum ID variable, and new cluster ID variable.
. gen fw = .
(152 missing values generated)
. bsample 8, cluster(expgroup) str(strid) weight(fw)
. by strid, sort: tabulate fw group
-> strid = 1

           |                         group
        fw |         A          B          C          D          E |     Total
-----------+--------------------------------------------------------+----------
         0 |         0          5          0          5         14 |        24
         1 |        14          5         10          5          0 |        34
         2 |         0          0          0          0         14 |        14
-----------+--------------------------------------------------------+----------
     Total |        14         10         10         10         28 |        72

-> strid = 2

           |                         group
        fw |         A          B          C          D          E |     Total
-----------+--------------------------------------------------------+----------
         0 |         8         10          0          6          0 |        24
         1 |         8          0          6          6         15 |        35
         2 |         0          0          6          0         15 |        21
-----------+--------------------------------------------------------+----------
     Total |        16         10         12         12         30 |        80
The results from by strid: tabulate on the generated frequency weight variable versus the original
cluster ID (group) show us how many times each cluster was sampled for each stratum. For stratum
1, the bootstrap sample contains two copies of cluster A, one copy of cluster B, two copies of cluster
C, one copy of cluster D, and two copies of cluster E (2 + 1 + 2 + 1 + 2 = 8). For stratum 2, the
bootstrap sample contains one copy of cluster A, zero copies of cluster B, three copies of cluster C,
one copy of cluster D, and three copies of cluster E (1 + 0 + 3 + 1 + 3 = 8).
Methods and formulas
bsample is implemented as an ado-file.
Also see
[R] bootstrap — Bootstrap sampling and estimation
[R] bstat — Report bootstrap results
[R] simulate — Monte Carlo simulations
[D] sample — Draw random sample
Title
bstat — Report bootstrap results
Syntax
Bootstrap statistics from variables

    bstat [varlist] [if] [in] [, options]

Bootstrap statistics from file

    bstat [namelist] using filename [if] [in] [, options]

options              description
---------------------------------------------------------------------------
Main
  stat(vector)       observed values for each statistic
  accel(vector)      acceleration values for each statistic
  mse                use MSE formula for variance estimation

Reporting
  level(#)           set confidence level; default is level(95)
  n(#)               # of observations from which bootstrap samples were taken
  notable            suppress table of results
  noheader           suppress table header
  nolegend           suppress table legend
  verbose            display the full table legend
  title(text)        use text as title for bootstrap results
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.
Menu
    Statistics > Resampling > Report bootstrap results
Description
bstat is a programmer’s command that computes and displays estimation results from bootstrap
statistics.
For each variable in varlist (the default is all variables), bstat computes a covariance
matrix, estimates bias, and constructs several different confidence intervals (CIs). The following CIs
are constructed by bstat:
1. Normal CIs (using the normal approximation)
2. Percentile CIs
3. Bias-corrected (BC) CIs
4. Bias-corrected and accelerated (BCa ) CIs (optional)
estat bootstrap displays a table of one or more of the above confidence intervals; see
[R] bootstrap postestimation.
If there are bootstrap estimation results in e(), bstat replays them. If given the using modifier,
bstat uses the data in filename to compute the bootstrap statistics while preserving the data currently
in memory. Otherwise, bstat uses the data in memory to compute the bootstrap statistics.
The following options may be used to replay estimation results from bstat:
level(#) notable noheader nolegend verbose title(text)
For all other options and the qualifiers using, if, and in, bstat requires a bootstrap dataset.
Options
Main
stat(vector) specifies the observed value of each statistic (i.e., the value of the statistic using the
original dataset).
accel(vector) specifies the acceleration of each statistic, which is used to construct BCa CIs.
mse specifies that bstat compute the variance by using deviations of the replicates from the observed
value of the statistics. By default, bstat computes the variance by using deviations from the
average of the replicates.
Reporting
level(#); see [R] estimation options.
n(#) specifies the number of observations from which bootstrap samples were taken. This value is
used in no calculations but improves the table header when this information is not saved in the
bootstrap dataset.
notable suppresses the display of the output table.
noheader suppresses the display of the table header. This option implies nolegend.
nolegend suppresses the display of the table legend.
verbose specifies that the full table legend be displayed. By default, coefficients and standard errors
are not displayed.
title(text) specifies a title to be displayed above the table of bootstrap results; the default title is
Bootstrap results.
Remarks
Remarks are presented under the following headings:
Bootstrap datasets
Creating a bootstrap dataset
Bootstrap datasets
Although bstat allows you to specify the observed value and acceleration of each bootstrap
statistic via the stat() and accel() options, programmers may be interested in what bstat uses
when these options are not supplied.
When working from a bootstrap dataset, bstat first checks the data characteristics (see [P] char)
that it understands:
_dta[bs_version] identifies the version of the bootstrap dataset. This characteristic may be empty
(not defined), 2, or 3; otherwise, bstat will quit and display an error message. This version
tells bstat which other characteristics to look for in the bootstrap dataset.

bstat uses the following characteristics from version 3 bootstrap datasets:

    _dta[N]
    _dta[N_strata]
    _dta[N_cluster]
    _dta[command]
    varname[observed]
    varname[acceleration]
    varname[expression]

bstat uses the following characteristics from version 2 bootstrap datasets:

    _dta[N]
    _dta[N_strata]
    _dta[N_cluster]
    varname[observed]
    varname[acceleration]

An empty bootstrap dataset version implies that the dataset was created by the bstrap
command in a version of Stata earlier than Stata 8. Here bstat expects varname[bstrap]
to contain the observed value of the statistic identified by varname (varname[observed]
in version 2). All other characteristics are ignored.

_dta[N] is the number of observations in the observed dataset. This characteristic may be overruled
by specifying the n() option.

_dta[N_strata] is the number of strata in the observed dataset.

_dta[N_cluster] is the number of clusters in the observed dataset.

_dta[command] is the command used to compute the observed values of the statistics.
varname[observed] is the observed value of the statistic identified by varname. To specify a different
value, use the stat() option.
varname[acceleration] is the estimate of acceleration for the statistic identified by varname. To
specify a different value, use the accel() option.
varname[expression] is the expression or label that describes the statistic identified by varname.
Creating a bootstrap dataset
Suppose that we are interested in obtaining bootstrap statistics by resampling the residuals from
a regression (which is not possible with the bootstrap command). After loading some data, we
run a regression, save some results relevant to the bstat command, and save the residuals in a new
variable, res.
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. regress mpg weight length
      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =   69.34
       Model |  1616.08062     2  808.040312           Prob > F      =  0.0000
    Residual |  827.378835    71   11.653223           R-squared     =  0.6614
-------------+------------------------------           Adj R-squared =  0.6519
       Total |  2443.45946    73  33.4720474           Root MSE      =  3.4137

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0038515    .001586    -2.43   0.018    -.0070138   -.0006891
      length |  -.0795935   .0553577    -1.44   0.155    -.1899736    .0307867
       _cons |   47.88487    6.08787     7.87   0.000       35.746    60.02374
------------------------------------------------------------------------------
. matrix b = e(b)
. local n = e(N)
. predict res, residuals
We can resample the residual values in res by generating a random observation ID (rid), generating
a new response variable (y), and running the original regression with the new response variable.
. set seed 54321
. gen rid = int(_N*runiform())+1
. matrix score double y = b
. replace y = y + res[rid]
(74 real changes made)
. regress y weight length
      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =  103.41
       Model |  1773.23548     2  886.617741           Prob > F      =  0.0000
    Residual |  608.747732    71  8.57391172           R-squared     =  0.7444
-------------+------------------------------           Adj R-squared =  0.7372
       Total |  2381.98321    73   32.629907           Root MSE      =  2.9281

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0059938   .0013604    -4.41   0.000    -.0087064   -.0032813
      length |  -.0127875   .0474837    -0.27   0.788    -.1074673    .0818924
       _cons |   42.23195    5.22194     8.09   0.000      31.8197     52.6442
------------------------------------------------------------------------------
Instead of programming this resampling inside a loop, it is much more convenient to write a short
program and use the simulate command; see [R] simulate. In the following, mysim_r requires
the user to specify a coefficient vector and a residual variable. mysim_r then retrieves the list of
predictor variables (removing _cons from the list), generates a new temporary response variable with
the resampled residuals, and regresses the new response variable on the predictors.
program mysim_r
        version 11
        syntax name(name=bvector), res(varname)
        tempvar y rid
        // predictor names are the column names of the coefficient vector
        local xvars : colnames `bvector'
        local cons _cons
        local xvars : list xvars - cons
        // linear predictor from the original coefficients
        matrix score double `y' = `bvector'
        // resample the residuals and add them to the linear predictor
        gen long `rid' = int(_N*runiform()) + 1
        replace `y' = `y' + `res'[`rid']
        regress `y' `xvars'
end
We can now give mysim_r a test run, but we first set the random-number seed (to reproduce
results).
. set seed 54321
. mysim_r b, res(res)
(74 real changes made)
      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =  103.41
       Model |  1773.23548     2  886.617741           Prob > F      =  0.0000
    Residual |  608.747732    71  8.57391172           R-squared     =  0.7444
-------------+------------------------------           Adj R-squared =  0.7372
       Total |  2381.98321    73   32.629907           Root MSE      =  2.9281

------------------------------------------------------------------------------
    __000000 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0059938   .0013604    -4.41   0.000    -.0087064   -.0032813
      length |  -.0127875   .0474837    -0.27   0.788    -.1074673    .0818924
       _cons |   42.23195    5.22194     8.09   0.000      31.8197     52.6442
------------------------------------------------------------------------------
Now that we have a program that will compute the results we want, we can use simulate to
generate a bootstrap dataset and bstat to display the results.
. set seed 54321
. simulate, reps(200) nodots: mysim_r b, res(res)
command: mysim_r b, res(res)
. bstat, stat(b) n(`n')

Bootstrap results                               Number of obs     =         74
                                                Replications      =        200

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _b_weight |  -.0038515   .0015715    -2.45   0.014    -.0069316   -.0007713
   _b_length |  -.0795935   .0552415    -1.44   0.150    -.1878649    .0286779
     _b_cons |   47.88487   6.150069     7.79   0.000     35.83096    59.93879
------------------------------------------------------------------------------
Finally, we see that simulate created some of the data characteristics recognized by bstat. All
we need to do is correctly specify the version of the bootstrap dataset, and bstat will automatically
use the relevant data characteristics.
. char list
  _dta[seed]:                X681014b5c43f462544a474abacbdd93d12a1
  _dta[command]:             mysim_r b, res(res)
  _b_weight[is_eexp]:        1
  _b_weight[colname]:        weight
  _b_weight[coleq]:          _
  _b_weight[expression]:     _b[weight]
  _b_length[is_eexp]:        1
  _b_length[colname]:        length
  _b_length[coleq]:          _
  _b_length[expression]:     _b[length]
  _b_cons[is_eexp]:          1
  _b_cons[colname]:          _cons
  _b_cons[coleq]:            _
  _b_cons[expression]:       _b[_cons]
. char _dta[bs_version] 3
. bstat, stat(b) n(`n')

Bootstrap results                               Number of obs     =         74
                                                Replications      =        200

      command:  mysim_r b, res(res)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0038515   .0015715    -2.45   0.014    -.0069316   -.0007713
      length |  -.0795935   .0552415    -1.44   0.150    -.1878649    .0286779
       _cons |   47.88487   6.150069     7.79   0.000     35.83096    59.93879
------------------------------------------------------------------------------

See Poi (2004) for another example of residual resampling.
Saved results
bstat saves the following in e():
Scalars
  e(N)              sample size
  e(N_reps)         number of complete replications
  e(N_misreps)      number of incomplete replications
  e(N_strata)       number of strata
  e(N_clust)        number of clusters
  e(k_aux)          number of auxiliary parameters
  e(k_eq)           number of equations
  e(k_exp)          number of standard expressions
  e(k_eexp)         number of extended expressions (i.e., _b)
  e(k_extra)        number of extra equations beyond the original ones from e(b)
  e(level)          confidence level for bootstrap CIs
  e(bs_version)     version for bootstrap results
  e(rank)           rank of e(V)

Macros
  e(cmd)            bstat
  e(command)        from _dta[command]
  e(cmdline)        command as typed
  e(title)          title in estimation output
  e(exp#)           expression for the #th statistic
  e(prefix)         bootstrap
  e(mse)            mse if specified
  e(vce)            bootstrap
  e(vcetype)        title used to label Std. Err.
  e(properties)     b V

Matrices
  e(b)              observed statistics
  e(b_bs)           bootstrap estimates
  e(reps)           number of nonmissing results
  e(bias)           estimated biases
  e(se)             estimated standard errors
  e(z0)             median biases
  e(accel)          estimated accelerations
  e(ci_normal)      normal-approximation CIs
  e(ci_percentile)  percentile CIs
  e(ci_bc)          bias-corrected CIs
  e(ci_bca)         bias-corrected and accelerated CIs
  e(V)              bootstrap variance–covariance matrix
Methods and formulas
bstat is implemented as an ado-file.
Reference
Poi, B. P. 2004. From the help desk: Some bootstrapping techniques. Stata Journal 4: 312–328.
Also see
[R] bootstrap — Bootstrap sampling and estimation
[R] bsample — Sampling with replacement
Title
centile — Report centile and confidence interval
Syntax
    centile [varlist] [if] [in] [, options]

options               description
-------------------------------------------------------------------------
Main
  centile(numlist)    report specified centiles; default is centile(50)

Options
  cci                 binomial exact; conservative confidence interval
  normal              normal, based on observed centiles
  meansd              normal, based on mean and standard deviation
  level(#)            set confidence level; default is level(95)
-------------------------------------------------------------------------
by is allowed; see [D] by.
Menu
    Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Centiles with CIs
Description
centile estimates specified centiles and calculates confidence intervals. If no varlist is specified,
centile calculates centiles for all the variables in the dataset. If centile() is not specified, medians
(centile(50)) are reported.
Options
Main
centile(numlist) specifies the centiles to be reported. The default is to display the 50th centile.
Specifying centile(5) requests that the fifth centile be reported. Specifying centile(5 50
95) requests that the 5th, 50th, and 95th centiles be reported. Specifying centile(10(10)90)
requests that the 10th, 20th, . . . , 90th centiles be reported; see [U] 11.1.8 numlist.
Options
cci (conservative confidence interval) forces the confidence limits to fall exactly on sample values.
Confidence intervals displayed with the cci option are slightly wider than those with the default
(nocci) option.
normal causes the confidence interval to be calculated by using a formula for the standard error
of a normal-distribution quantile given by Kendall and Stuart (1969, 237). The normal option is
useful when you want empirical centiles — that is, centiles based on sample order statistics rather
than on the mean and standard deviation — and are willing to assume normality.
meansd causes the centile and confidence interval to be calculated based on the sample mean and
standard deviation, and it assumes normality.
level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is
level(95) or as set by set level; see [R] level.
Remarks
The qth centile of a continuous random variable, X, is defined as the value of Cq that fulfills
the condition Pr(X ≤ Cq) = q/100. The value of q must be in the range 0 < q < 100, though q
is not necessarily an integer. By default, centile estimates Cq for the variables in varlist and for
the values of q given in centile(numlist). It makes no assumptions about the distribution of X ,
and, if necessary, uses linear interpolation between neighboring sample values. Extreme centiles (for
example, the 99th centile in samples smaller than 100) are fixed at the minimum or maximum sample
value. An “exact” confidence interval for Cq is also given, using the binomial-based method described
below in Methods and formulas and in Conover (1999, 143–148). Again linear interpolation is used
to improve the accuracy of the estimated confidence limits, but extremes are fixed at the minimum
or maximum sample value.
You can prevent centile from interpolating when calculating binomial-based confidence intervals
by specifying cci. The resulting intervals are generally wider than with the default; that is, the
coverage (confidence level) tends to be greater than the nominal value (given as usual by level(#),
by default 95%).
If the data are believed to be normally distributed (a common case), there are two alternative
methods for estimating centiles. If normal is specified, Cq is calculated, as just described, but its
confidence interval is based on a formula for the standard error (se) of a normal-distribution quantile
given by Kendall and Stuart (1969, 237). If meansd is alternatively specified, Cq is estimated as
x + zq × s, where x and s are the sample mean and standard deviation, and zq is the q th centile of
the standard normal distribution (e.g., z95 = 1.645). The confidence interval is derived from the se
of the estimate of Cq .
Example 1
Using auto.dta, we estimate the 5th, 50th, and 95th centiles of the price variable:
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. format price %8.2fc
. centile price, centile(5 50 95)
                                                       Binom. Interp.
    Variable |     Obs  Percentile    Centile     [95% Conf. Interval]
-------------+--------------------------------------------------------
       price |      74          5     3,727.75     3,291.23    3,914.16
             |                 50     5,006.50     4,593.57    5,717.90
             |                 95    13,498.00    11,061.53   15,865.30
summarize produces somewhat different results from centile; see Methods and formulas.
. summarize price, detail
                            Price
-------------------------------------------------------------
      Percentiles      Smallest
 1%         3291           3291
 5%         3748           3299
10%         3895           3667       Obs                  74
25%         4195           3748       Sum of Wgt.          74

50%       5006.5                      Mean           6165.257
                        Largest       Std. Dev.      2949.496
75%         6342          13466
90%        11385          13594       Variance        8699526
95%        13466          14500       Skewness       1.653434
99%        15906          15906       Kurtosis       4.819188
The confidence limits produced by using the cci option are slightly wider than those produced
without this option:
. centile price, c(5 50 95) cci
                                                       Binomial Exact
    Variable |     Obs  Percentile    Centile     [95% Conf. Interval]
-------------+--------------------------------------------------------
       price |      74          5     3,727.75     3,291.00    3,955.00
             |                 50     5,006.50     4,589.00    5,719.00
             |                 95    13,498.00    10,372.00   15,906.00
If we are willing to assume that price is normally distributed, we could include either the normal
or the meansd option:
. centile price, c(5 50 95) normal
                                        Normal, based on observed centiles
    Variable |     Obs  Percentile    Centile     [95% Conf. Interval]
-------------+--------------------------------------------------------
       price |      74          5     3,727.75     3,211.19    4,244.31
             |                 50     5,006.50     4,096.68    5,916.32
             |                 95    13,498.00     5,426.81   21,569.19
. centile price, c(5 50 95) meansd
                                        Normal, based on mean and std. dev.
    Variable |     Obs  Percentile    Centile     [95% Conf. Interval]
-------------+--------------------------------------------------------
       price |      74          5     1,313.77       278.93    2,348.61
             |                 50     6,165.26     5,493.24    6,837.27
             |                 95    11,016.75     9,981.90   12,051.59
With the normal option, the centile estimates are, by definition, the same as before. The confidence
intervals for the 5th and 50th centiles are similar to the previous ones, but the interval for the
95th centile is different. The results using the meansd option also differ from both previous sets of
estimates.
We can use sktest (see [R] sktest) to check the correctness of the normality assumption:
. sktest price
                    Skewness/Kurtosis tests for Normality
                                                     ------- joint ------
    Variable |   Obs   Pr(Skewness)   Pr(Kurtosis)  adj chi2(2)  Prob>chi2
-------------+-------------------------------------------------------------
       price |    74      0.0000         0.0127        21.77       0.0000
sktest reveals that price is definitely not normally distributed, so the normal assumption is not
reasonable, and the normal and meansd options are not appropriate for these data. We should rely
on the results from the default choice, which does not assume normality. If the data are normally
distributed, however, the precision of the estimated centiles and their confidence intervals will be
ordered (best) meansd > normal > [default] (worst). The normal option is useful when we really
do want empirical centiles (that is, centiles based on sample order statistics rather than on the mean
and standard deviation) but are willing to assume normality.
Saved results
centile saves the following in r():
Scalars
  r(N)          number of observations
  r(n_cent)     number of centiles requested
  r(c_#)        value of the #th centile
  r(lb_#)       lower confidence bound for the #th requested centile
  r(ub_#)       upper confidence bound for the #th requested centile

Macros
  r(centiles)   centiles requested
Methods and formulas
centile is implemented as an ado-file.
Methods and formulas are presented under the following headings:
Default case
Normal case
meansd case
Default case
The calculation is based on the method of Mood and Graybill (1963, 408). Let $x_1 \le x_2 \le \cdots \le x_n$
be a sample of size $n$ arranged in ascending order. Denote the estimated $q$th centile of the $x$'s as
$c_q$. We require that $0 < q < 100$. Let $R = (n+1)q/100$ have integer part $r$ and fractional part $f$;
that is, $r = \operatorname{int}(R)$ and $f = R - r$. (If $R$ is itself an integer, then $r = R$ and $f = 0$.) Note that
$0 \le r \le n$. For convenience, define $x_0 = x_1$ and $x_{n+1} = x_n$. $C_q$ is estimated by

$$c_q = x_r + f \times (x_{r+1} - x_r)$$

that is, $c_q$ is a weighted average of $x_r$ and $x_{r+1}$. Loosely speaking, a (conservative) $p\%$ confidence
interval for $C_q$ involves finding the observations ranked $t$ and $u$, which correspond, respectively, to
the $\alpha = (100 - p)/200$ and $1 - \alpha$ quantiles of a binomial distribution with parameters $n$ and $q/100$,
i.e., $B(n, q/100)$. More precisely, define the $i$th value ($i = 0, \ldots, n$) of the cumulative binomial
distribution function as $F_i = \Pr(S \le i)$, where $S$ has distribution $B(n, q/100)$. For convenience,
let $F_{-1} = 0$ and $F_{n+1} = 1$. $t$ is found such that $F_t \le \alpha$ and $F_{t+1} > \alpha$, and $u$ is found such that
$1 - F_u \le \alpha$ and $1 - F_{u-1} > \alpha$.

With the cci option in force, the (conservative) confidence interval is $(x_{t+1}, x_{u+1})$, and its actual
coverage probability is $F_u - F_t$.
The default case uses linear interpolation on the $F_i$ as follows. Let

$$g = (\alpha - F_t)/(F_{t+1} - F_t)$$
$$h = \{\alpha - (1 - F_u)\}/\{(1 - F_{u-1}) - (1 - F_u)\} = (\alpha - 1 + F_u)/(F_u - F_{u-1})$$

The interpolated lower and upper confidence limits $(c_{qL}, c_{qU})$ for $C_q$ are

$$c_{qL} = x_{t+1} + g \times (x_{t+2} - x_{t+1})$$
$$c_{qU} = x_{u+1} - h \times (x_{u+1} - x_u)$$
Suppose that we want a 95% confidence interval for the median of a sample of size 13. Then $n = 13$,
$q = 50$, $p = 95$, $\alpha = 0.025$, $R = 14 \times 50/100 = 7$, and $f = 0$. Therefore, the median is the 7th
observation. Some example data, $x_i$, and the values of $F_i$ are as follows:

     i      Fi     1-Fi     xi        i      Fi     1-Fi     xi
     0   0.0001  0.9999      -        7   0.7095  0.2905     33
     1   0.0017  0.9983      5        8   0.8666  0.1334     37
     2   0.0112  0.9888      7        9   0.9539  0.0461     45
     3   0.0461  0.9539     10       10   0.9888  0.0112     59
     4   0.1334  0.8666     15       11   0.9983  0.0017     77
     5   0.2905  0.7095     23       12   0.9999  0.0001    104
     6   0.5000  0.5000     28       13   1.0000  0.0000    211
The median is $x_7 = 33$. Also, $F_2 \le 0.025$ and $F_3 > 0.025$, so $t = 2$; $1 - F_{10} \le 0.025$ and
$1 - F_9 > 0.025$, so $u = 10$. The conservative confidence interval is therefore

$$(c_{50L}, c_{50U}) = (x_{t+1}, x_{u+1}) = (x_3, x_{11}) = (10, 77)$$

with actual coverage $F_{10} - F_2 = 0.9888 - 0.0112 = 0.9776$ (97.8% confidence). For the interpolation
calculation, we have

$$g = (0.025 - 0.0112)/(0.0461 - 0.0112) = 0.395$$
$$h = (0.025 - 1 + 0.9888)/(0.9888 - 0.9539) = 0.395$$

So,

$$c_{50L} = x_3 + 0.395 \times (x_4 - x_3) = 10 + 0.395 \times 5 = 11.98$$
$$c_{50U} = x_{11} - 0.395 \times (x_{11} - x_{10}) = 77 - 0.395 \times 18 = 69.89$$
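A minimal sketch of the same interpolation in Stata, using the binomial() cumulative for this
13-observation example (it is not centile's internal code):

        * minimal sketch: interpolated lower limit for the median, n = 13
        local alpha = 0.025
        local F2 = binomial(13, 2, 0.5)          // 0.0112
        local F3 = binomial(13, 3, 0.5)          // 0.0461
        local g  = (`alpha' - `F2')/(`F3' - `F2')
        display 10 + `g'*(15 - 10)               // about 11.98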
Normal case
The value of $c_q$ is as above. Its se is given by the formula

$$s_q = \sqrt{q(100 - q)} \,\Big/\, \Bigl\{100\sqrt{n}\,Z(c_q;\, \bar x, s)\Bigr\}$$

where $\bar x$ and $s$ are the mean and standard deviation of the $x_i$, and

$$Z(Y;\, \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(Y-\mu)^2/2\sigma^2}$$

is the density function of a normally distributed variable $Y$ with mean $\mu$ and standard deviation $\sigma$.

The confidence interval for $C_q$ is $\bigl(c_q - z_{100(1-\alpha)}\,s_q,\; c_q + z_{100(1-\alpha)}\,s_q\bigr)$.
meansd case
The value of $c_q$ is $\bar x + z_q \times s$. Its se is given by the formula

$$s_q^{*} = s\sqrt{1/n + z_q^2/(2n - 2)}$$

The confidence interval for $C_q$ is $\bigl(c_q - z_{100(1-\alpha)} \times s_q^{*},\; c_q + z_{100(1-\alpha)} \times s_q^{*}\bigr)$.
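For example, a minimal sketch that recomputes the meansd 95th centile of price and its confidence
interval from the sample mean and standard deviation (it should match the meansd output in
example 1):

        * minimal sketch: meansd calculation for the 95th centile of price
        quietly summarize price
        local zq = invnormal(.95)
        local cq = r(mean) + `zq'*r(sd)
        local se = r(sd)*sqrt(1/r(N) + `zq'^2/(2*r(N) - 2))
        display `cq'                             // about 11,016.75
        display `cq' - invnormal(.975)*`se'      // about  9,981.90
        display `cq' + invnormal(.975)*`se'      // about 12,051.59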
Acknowledgment
centile was written by Patrick Royston, MRC Clinical Trials Unit, London.
References
Conover, W. J. 1999. Practical Nonparametric Statistics. 3rd ed. New York: Wiley.
Kendall, M. G., and A. Stuart. 1969. The Advanced Theory of Statistics, Vol. 1: Distribution Theory. 3rd ed. London:
Griffin.
Mood, A. M., and F. A. Graybill. 1963. Introduction to the Theory of Statistics. 2nd ed. New York: McGraw–Hill.
Newson, R. 2000. snp16: Robust confidence intervals for median and other percentile differences between two groups.
Stata Technical Bulletin 58: 30–35. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 324–331. College
Station, TX: Stata Press.
Royston, P. 1992. sg7: Centile estimation command. Stata Technical Bulletin 8: 12–15. Reprinted in Stata Technical
Bulletin Reprints, vol. 2, pp. 122–125. College Station, TX: Stata Press.
Stuart, A., and J. K. Ord. 1994. Kendall’s Advanced Theory of Statistics: Distribution Theory, Vol I. 6th ed. London:
Arnold.
Also see
[R] ci — Confidence intervals for means, proportions, and counts
[R] summarize — Summary statistics
[D] pctile — Create variable containing percentiles
Title
ci — Confidence intervals for means, proportions, and counts
Syntax
Syntax for ci

    ci [varlist] [if] [in] [weight] [, options]

Immediate command for variable distributed as normal

    cii #obs #mean #sd [, ciin_option]

Immediate command for variable distributed as binomial

    cii #obs #succ [, ciib_options]

Immediate command for variable distributed as Poisson

    cii #exposure #events, poisson [ciip_options]

options                description
-----------------------------------------------------------------------------
Main
  binomial             binomial 0/1 variables; compute exact confidence intervals
  poisson              Poisson variables; compute exact confidence intervals
  exposure(varname)    exposure variable; implies poisson
  exact                calculate exact confidence intervals; the default
  wald                 calculate Wald confidence intervals
  wilson               calculate Wilson confidence intervals
  agresti              calculate Agresti–Coull confidence intervals
  jeffreys             calculate Jeffreys confidence intervals
  total                add output for all groups combined (for use with by only)
  separator(#)         draw separator line after every # variables; default is
                         separator(5)
  level(#)             set confidence level; default is level(95)
-----------------------------------------------------------------------------
by is allowed with ci; see [D] by.
aweights and fweights are allowed, but aweights may not be specified with the
binomial or poisson options; see [U] 11.1.6 weight.

ciin_option            description
-----------------------------------------------------------------------------
  level(#)             set confidence level; default is level(95)

ciib_options           description
-----------------------------------------------------------------------------
  level(#)             set confidence level; default is level(95)
  exact                calculate exact confidence intervals; the default
  wald                 calculate Wald confidence intervals
  wilson               calculate Wilson confidence intervals
  agresti              calculate Agresti–Coull confidence intervals
  jeffreys             calculate Jeffreys confidence intervals

ciip_options           description
-----------------------------------------------------------------------------
* poisson              numbers are Poisson-distributed counts
  level(#)             set confidence level; default is level(95)
-----------------------------------------------------------------------------
* poisson is required.
Menu
ci
    Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Confidence intervals

cii for variable distributed as normal
    Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Normal CI calculator

cii for variable distributed as binomial
    Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Binomial CI calculator

cii for variable distributed as Poisson
    Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Poisson CI calculator
Description
ci computes standard errors and confidence intervals for each of the variables in varlist.
cii is the immediate form of ci; see [U] 19 Immediate commands for a general discussion of
immediate commands.
In the binomial and Poisson variants of cii, the second number specified (#succ or #events ) must
be an integer or between 0 and 1. If the number is between 0 and 1, Stata interprets it as the fraction
of successes or events and converts it to an integer number representing the number of successes or
events. The computation then proceeds as if two integers had been specified.
Options
Main
binomial tells ci that the variables are 0/1 variables and that binomial confidence intervals will be
calculated. (cii produces binomial confidence intervals when only two numbers are specified.)
poisson specifies that the variables (or numbers for cii) are Poisson-distributed counts; exact Poisson
confidence intervals will be calculated.
exposure(varname) is used only with poisson. You do not need to specify poisson if you specify
exposure(); poisson is assumed. varname contains the total exposure (typically a time or an
area) during which the number of events recorded in varlist were observed.
exact, wald, wilson, agresti, and jeffreys specify that variables are 0/1 and specify how
binomial confidence intervals are to be calculated.
exact is the default and specifies exact (also known in the literature as Clopper–Pearson [1934])
binomial confidence intervals.
wald specifies calculation of Wald confidence intervals.
wilson specifies calculation of Wilson confidence intervals.
agresti specifies calculation of Agresti–Coull confidence intervals.
jeffreys specifies calculation of Jeffreys confidence intervals.
See Brown, Cai, and DasGupta (2001) for a discussion and comparison of the different binomial
confidence intervals.
total is for use with the by prefix. It requests that, in addition to output for each by-group, output
be added for all groups combined.
separator(#) specifies how often separation lines should be inserted into the output. The default is
separator(5), meaning that a line is drawn after every five variables. separator(10) would
draw the line after every 10 variables. separator(0) suppresses the separation line.
level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is
level(95) or as set by set level; see [R] level.
Remarks
Remarks are presented under the following headings:
Ordinary confidence intervals
Binomial confidence intervals
Poisson confidence intervals
Immediate form
Ordinary confidence intervals
Example 1
Without the binomial or poisson options, ci produces “ordinary” confidence intervals, meaning
those that are correct if the variable is distributed normally, and asymptotically correct for all other
distributions satisfying the conditions of the central limit theorem.
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. ci mpg price
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
         mpg |         74     21.2973    .6725511        19.9569    22.63769
       price |         74    6165.257    342.8719       5481.914      6848.6
The standard error of the mean of mpg is 0.67, and the 95% confidence interval is [ 19.96, 22.64 ].
We can obtain wider confidence intervals, 99%, by typing
. ci mpg price, level(99)
    Variable |        Obs        Mean    Std. Err.       [99% Conf. Interval]
-------------+---------------------------------------------------------------
         mpg |         74     21.2973    .6725511       19.51849    23.07611
       price |         74    6165.257    342.8719       5258.405    7072.108
Example 2
by() breaks out the confidence intervals according to by-group; total adds an overall summary.
For instance,
. ci mpg, by(foreign) total
-> foreign = Domestic

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
         mpg |         52    19.82692     .657777       18.50638    21.14747

-> foreign = Foreign

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
         mpg |         22    24.77273     1.40951       21.84149    27.70396

-> Total

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
         mpg |         74     21.2973    .6725511        19.9569    22.63769
Technical note
You can control the formatting of the numbers in the output by specifying a display format for
the variable; see [U] 12.5 Formats: Controlling how data are displayed. For instance,
. format mpg %9.2f
. ci mpg
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
         mpg |         74       21.30        0.67          19.96       22.64
Binomial confidence intervals
Example 3
We have data on employees, including a variable marking whether the employee was promoted
last year.
. use http://www.stata-press.com/data/r11/promo
. ci promoted, binomial
                                                         Binomial Exact
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
    promoted |         20          .1     .067082       .0123485    .3169827
The above interval is the default for binomial data, known equivalently as both the exact binomial
and the Clopper–Pearson interval.
Nominally, the interpretation of a 95% confidence interval is that under repeated samples or
experiments, 95% of the resultant intervals would contain the unknown parameter in question.
However, for binomial data, the actual coverage probability, regardless of method, usually differs from
that interpretation. This result occurs because of the discreteness of the binomial distribution, which
produces only a finite set of outcomes, meaning that coverage probabilities are subject to discrete
jumps and the exact nominal level cannot always be achieved. Therefore, the term exact confidence
interval refers to its being derived from the binomial distribution, the distribution exactly generating
the data, rather than resulting in exactly the nominal coverage.
For the Clopper–Pearson interval, the actual coverage probability is guaranteed to be greater than
or equal to the nominal confidence interval, here 95%. Because of the way it is calculated—see
Methods and formulas—it may also be interpreted as follows: If the true probability of being promoted
were 0.012, the chances of observing a result as extreme or more extreme than the result observed
(20 × 0.1 = 2 or more promotions) would be 2.5%. If the true probability of being promoted were
0.317, the chances of observing a result as extreme or more extreme than the result observed (two
or fewer promotions) would be 2.5%.
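Both tail statements can be checked directly from the endpoints just printed; a minimal sketch using
Stata's binomial tail probabilities:

        * minimal sketch: check the tail interpretation of the exact interval
        display binomialtail(20, 2, .0123485)    // Pr(K >= 2), about .025
        display binomial(20, 2, .3169827)        // Pr(K <= 2), about .025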
Example 4
The Clopper–Pearson interval is desirable because it guarantees nominal coverage; however, by
dropping this restriction, you may obtain accurate intervals that are not as conservative. In this vein,
you might opt for the Wilson (1927) interval,
. ci promoted, binomial wilson
                                                             Wilson
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
    promoted |         20          .1     .067082       .0278665    .3010336
the Agresti–Coull (1998) interval,
. ci promoted, binomial agresti
                                                          Agresti-Coull
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
    promoted |         20          .1     .067082       .0156562    .3132439
or the Bayesian-derived Jeffreys interval (Brown, Cai, and DasGupta 2001),
. ci promoted, binomial jeffreys
                                                            Jeffreys
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
    promoted |         20          .1     .067082       .0213725    .2838533
Picking the best interval is a matter of balancing accuracy (coverage) against precision (average
interval length) and depends on sample size and success probability. Brown, Cai, and DasGupta (2001)
recommend the Wilson or Jeffreys interval for small sample sizes (≤40) yet favor the Agresti–Coull
interval for its simplicity, decent performance for sample sizes less than or equal to 40, and performance
comparable to Wilson/Jeffreys for sample sizes greater than 40. They also deem the Clopper–Pearson
interval to be “wastefully conservative and [. . . ] not a good choice for practical use”, unless of course
one requires, at a minimum, the nominal coverage level.
Finally, the binomial Wald confidence interval is obtained by specifying the binomial and wald
options. The Wald interval is the one taught in most introductory statistics courses and for the above
is simply, for level 1 − α, Mean±zα (Std. Err.), where zα is the 1 − α/2 quantile of the standard
normal. Because its overall poor performance makes it impractical, the Wald interval is available
mainly for pedagogical purposes. The binomial Wald interval is also similar to the interval produced
by treating binary data as normal data and using ci without the binomial option, with two exceptions.
First, when binomial is specified, the calculation of the standard error uses denominator n rather
than n − 1, used for normal data. Second, confidence intervals for normal data are based on the
t distribution rather than the standard normal. Of course, both discrepancies vanish as sample size
increases.
Technical note
Let’s repeat example 3, but this time with data in which there are no promotions over the observed
period:
. use http://www.stata-press.com/data/r11/promonone
. ci promoted, binomial
                                                         Binomial Exact
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
    promoted |         20           0           0              0    .1684335*
(*) one-sided, 97.5% confidence interval
The confidence interval is [ 0, 0.168 ], and this is the confidence interval that most books publish. It
is not, however, a true 95% confidence interval because the lower tail has vanished. As Stata notes,
it is a one-sided, 97.5% confidence interval. If you wanted to put 5% in the right tail, you could type
ci promoted, binomial level(90).
Technical note
ci with the binomial option ignores any variables that do not take on the values 0 and 1
exclusively. For instance, with our automobile dataset,
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. ci mpg foreign, binomial
                                                         Binomial Exact
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
     foreign |         74    .2972973    .0531331        .196584    .4148353
We also requested the confidence interval for mpg, but Stata ignored us. It does that so you can type
ci, binomial and obtain correct confidence intervals for all the variables that are 0/1 in your data.
Poisson confidence intervals
Example 5
We have data on the number of bacterial colonies on a Petri dish. The dish has been divided into
36 small squares, and the number of colonies in each square has been counted. Each observation in
our dataset represents a square on the dish. The variable count records the number of colonies in
each square counted, which varies from 0 to 5.
. use http://www.stata-press.com/data/r11/petri
. ci count, poisson
                                                          Poisson Exact
    Variable |   Exposure        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
       count |         36    2.333333    .2545875       1.861158    2.888825
ci reports that the average number of colonies per square is 2.33. If the expected number of colonies
per square were as low as 1.86, the probability of observing 2.33 or more colonies per square would
be 2.5%. If the expected number were as large as 2.89, the probability of observing 36 × 2.33 = 84
or fewer colonies in total would be 2.5%.
Technical note
The number of “observations” — how finely the Petri dish is divided — makes no difference. The
Poisson distribution is a function only of the count. In example 5, we observed a total of 2.33 × 36 = 84
colonies and a confidence interval of [ 1.86 × 36, 2.89 × 36 ] = [ 67, 104 ]. We would obtain the same
[ 67, 104 ] confidence interval if our dish were divided into, say, 49 squares, rather than 36.
For the counts, it is not even important that all the squares be of the same size. For rates, however,
such differences do matter, but in an easy-to-calculate way. Rates are obtained from counts by dividing
by exposure, which is typically a number multiplied by either time or an area. For our Petri dishes,
we divide by an area to obtain a rate, but if our example were cast in terms of being infected by a
disease, we might divide by person-years to obtain the rate. Rates are convenient because they are
easier to compare: we might have 2.3 colonies per square inch or 0.0005 infections per person-year.
So, let’s assume that we wish to obtain the number of colonies per square inch, and, moreover,
that not all the “squares” on our dish are of equal size. We have a variable called area that records
the area of each “square”:
. ci count, exposure(area)
                                                          Poisson Exact
    Variable |   Exposure        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
       count |          3          28    3.055051        22.3339    34.66591
The rates are now in more familiar terms. In our sample, there are 28 colonies per square inch and
the 95% confidence interval is [ 22.3, 34.7 ]. When we did not specify exposure(), ci assumed that
each observation contributed 1 to exposure.
Technical note
As with the binomial option, if there were no colonies on our dish, ci would calculate a one-sided
confidence interval:
. use http://www.stata-press.com/data/r11/petrinone
. ci count, poisson
                                                          Poisson Exact
    Variable |   Exposure        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
       count |         36           0           0              0    .1024689*
(*) one-sided, 97.5% confidence interval
Immediate form
Example 6
We are reading a soon-to-be-published paper by a colleague. In it is a table showing the number of
observations, mean, and standard deviation of 1980 median family income for the Northeast and West.
We correctly think that the paper would be much improved if it included the confidence intervals.
The paper claims that for 166 cities in the Northeast, the average of median family income is $19,509
with a standard deviation of $4,379:
For the Northeast:

. cii 166 19509 4379
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
             |        166       19509    339.8763       18837.93    20180.07

For the West:

. cii 256 22557 5003
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
             |        256       22557    312.6875       21941.22    23172.78
Example 7
We flip a coin 10 times, and it comes up heads only once. We are shocked and decide to obtain
a 99% confidence interval for this coin:
. cii 10 1, level(99)
                                                         Binomial Exact
    Variable |        Obs        Mean    Std. Err.       [99% Conf. Interval]
-------------+---------------------------------------------------------------
             |         10          .1    .0948683       .0005011    .5442871
Example 8
The number of reported traffic accidents in Santa Monica over a 24-hour period is 27. We need
know nothing else:
. cii 1 27, poisson
                                                          Poisson Exact
    Variable |   Exposure        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
             |          1          27    5.196152       17.79317    39.28358
Saved results
ci and cii save the following in r():

Scalars
  r(N)      number of observations or exposure
  r(mean)   mean
  r(se)     estimate of standard error
  r(lb)     lower bound of confidence interval
  r(ub)     upper bound of confidence interval
Methods and formulas
ci and cii are implemented as ado-files.
Methods and formulas are presented under the following headings:
Ordinary
Binomial
Poisson
Ordinary
Define $n$, $\bar x$, and $s^2$ as, respectively, the number of observations, (weighted) average, and (unbiased)
estimated variance of the variable in question; see [R] summarize.

The standard error of the mean, $s_\mu$, is defined as $\sqrt{s^2/n}$.

Let $\alpha$ be $1 - l/100$, where $l$ is the significance level specified by the user. Define $t_\alpha$ as the
two-sided $t$ statistic corresponding to a significance level of $\alpha$ with $n - 1$ degrees of freedom; $t_\alpha$
is obtained from Stata as invttail(n-1,0.5*α). The lower and upper confidence bounds are,
respectively, $\bar x - s_\mu t_\alpha$ and $\bar x + s_\mu t_\alpha$.
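As a quick check, a minimal sketch that reproduces the 95% confidence interval for mpg from
example 1, using the reported mean and standard error:

        * minimal sketch: t-based 95% CI for mpg from example 1
        display 21.2973 - invttail(73, .025)*.6725511    // about 19.9569
        display 21.2973 + invttail(73, .025)*.6725511    // about 22.6377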
Binomial

Given $k$ successes of $n$ trials, the estimated probability is $\hat p = k/n$ with standard error $\sqrt{\hat p(1-\hat p)/n}$.
ci calculates the exact (Clopper–Pearson) confidence interval $[\,p_1, p_2\,]$ such that

$$\Pr(K \ge k \mid p = p_1) = \alpha/2 \qquad\text{and}\qquad \Pr(K \le k \mid p = p_2) = \alpha/2$$
where K is distributed as binomial(n, p). The endpoints may be obtained directly by using Stata’s
invbinomial() function. If k = 0 or k = n, the calculation of the appropriate tail is skipped.
The Wald interval is $\hat{p} \pm z_\alpha\sqrt{\hat{p}(1-\hat{p})/n}$, where $z_\alpha$ is the $1 - \alpha/2$ quantile of the standard normal distribution. The interval is obtained by inverting the acceptance region of the large-sample Wald test of $H_0\colon p = p_0$ versus the two-sided alternative. That is, the confidence interval is the set of all $p_0$ such that

$$\left|\, \frac{\hat{p} - p_0}{\sqrt{n^{-1}\hat{p}(1-\hat{p})}} \,\right| \le z_\alpha$$

The Wilson interval is a variation on the Wald interval, using the null standard error $\sqrt{n^{-1}p_0(1-p_0)}$ in place of the estimated standard error $\sqrt{n^{-1}\hat{p}(1-\hat{p})}$ in the above expression. Inverting this acceptance region is more complicated, yet it results in the closed form

$$\frac{k + z_\alpha^2/2}{n + z_\alpha^2} \;\pm\; \frac{z_\alpha\, n^{1/2}}{n + z_\alpha^2}\left\{\hat{p}(1-\hat{p}) + \frac{z_\alpha^2}{4n}\right\}^{1/2}$$
The Agresti–Coull interval is basically a Wald interval that borrows its center from the Wilson interval. Defining $\tilde{k} = k + z_\alpha^2/2$, $\tilde{n} = n + z_\alpha^2$, and (hence) $\tilde{p} = \tilde{k}/\tilde{n}$, the Agresti–Coull interval is

$$\tilde{p} \pm z_\alpha\sqrt{\tilde{p}(1-\tilde{p})/\tilde{n}}$$
When α = 0.05, zα is near enough to 2 that pe can be thought of as a typical estimate of proportion
where two successes and two failures have been added to the sample (Agresti and Coull 1998).
This typical estimate of proportion makes the Agresti–Coull interval an easy-to-present alternative
for introductory statistics students.
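A minimal sketch of the Agresti–Coull computation by hand, again for $k = 1$, $n = 10$ at the 95% level (the local macro names z, nt, and pt are illustrative):

. local z  = invnormal(.975)
. local nt = 10 + `z'^2
. local pt = (1 + `z'^2/2)/`nt'
. display `pt' - `z'*sqrt(`pt'*(1 - `pt')/`nt')    // lower bound
. display `pt' + `z'*sqrt(`pt'*(1 - `pt')/`nt')    // upper bound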
The Jeffreys interval is a Bayesian interval and is based on the Jeffreys prior, which is the Beta(1/2, 1/2) distribution. Assigning this prior to $p$ results in a posterior distribution for $p$ that is Beta with parameters $k + 1/2$ and $n - k + 1/2$. The Jeffreys interval is then taken to be the $1 - \alpha$ central posterior probability interval, namely, the $\alpha/2$ and $1 - \alpha/2$ quantiles of the Beta$(k + 1/2,\, n - k + 1/2)$ distribution. These quantiles may be obtained directly by using Stata's invibeta() function.
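A minimal sketch with invibeta(), again for $k = 1$, $n = 10$ at the 95% level (it assumes invibeta(a,b,p) returns the p quantile of the Beta(a,b) distribution):

. display invibeta(1.5, 9.5, .025)       // lower bound
. display invibeta(1.5, 9.5, .975)       // upper bound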
Poisson
Given the total cases, $k$, the estimate of the expected count $\lambda$ is $k$, and its standard error is $\sqrt{k}$. ci calculates the exact confidence interval $[\,\lambda_1, \lambda_2\,]$ such that

$$\Pr(K \ge k \mid \lambda = \lambda_1) = \alpha/2 \qquad \text{and} \qquad \Pr(K \le k \mid \lambda = \lambda_2) = \alpha/2$$

where $K$ is Poisson with mean $\lambda$. The solution is found by Newton's method. If $k = 0$, the calculation of $\lambda_1$ is skipped. All values are then reported as rates, which are the above numbers divided by the total exposure.
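Although ci solves these equations by Newton's method, the exact Poisson bounds also have a well-known equivalent closed form in terms of $\chi^2$ quantiles, $\lambda_1 = \frac{1}{2}\chi^2_{2k,\,\alpha/2}$ and $\lambda_2 = \frac{1}{2}\chi^2_{2k+2,\,1-\alpha/2}$, which makes them easy to check by hand. A sketch for the Santa Monica accidents of Example 8 above ($k = 27$):

. display invchi2(54, .025)/2            // lower bound, about 17.79317
. display invchi2(56, .975)/2            // upper bound, about 39.28358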
Harold Jeffreys (1891–1989) was born near Durham, England, and spent more than 75 years
studying and working at the University of Cambridge, principally on theoretical and observational
problems in geophysics, astronomy, mathematics, and statistics. He developed a systematic
Bayesian approach to inference in his monograph Theory of Probability.
E. B. Wilson (1879–1964) majored in mathematics at Harvard and studied and taught at Yale
and MIT before returning to Harvard in 1922. He worked in mathematics, physics, and statistics.
His method for binomial intervals can be considered a precursor, for a particular problem, of
Neyman’s concept of confidence intervals.
Jerzy Neyman (1894–1981) was born in Bendery, Moldavia. He studied and then taught at
Kharkov University, moving from physics to mathematics. In 1921, Neyman moved to Poland,
where he worked in statistics at Bydgoszcz and then Warsaw. Receiving a Rockefeller Fellowship
to work with Karl Pearson at University College London, he befriended Egon Pearson, Karl’s
son, and they worked together on the theory of hypothesis testing. Life in Poland became
progressively more difficult, and Neyman returned to UCL to work there from 1934 to 1938.
At this time, he published on the theory of confidence intervals. He then was offered a post in
California at Berkeley, where he settled. Neyman established an outstanding statistics department
and remained highly active in research, including applications in astronomy, meteorology, and
medicine. He was one of the great statisticians of the 20th century.
Acknowledgment
We thank Nicholas J. Cox of Durham University for his assistance with the jeffreys and wilson
options.
References
Agresti, A., and B. A. Coull. 1998. Approximate is better than “exact” for interval estimation of binomial proportions.
American Statistician 52: 119–126.
Brown, L. D., T. T. Cai, and A. DasGupta. 2001. Interval estimation for a binomial proportion. Statistical Science 16:
101–133.
Campbell, M. J., D. Machin, and S. J. Walters. 2007. Medical Statistics: A Textbook for the Health Sciences. 4th ed.
Chichester, UK: Wiley.
Clopper, C. J., and E. S. Pearson. 1934. The use of confidence or fiducial limits illustrated in the case of the binomial.
Biometrika 26: 404–413.
Cook, A. 1990. Sir Harold Jeffreys. 2 April 1891–18 March 1989. Biographical Memoirs of Fellows of the Royal
Society 36: 303–333.
Gleason, J. R. 1999. sg119: Improved confidence intervals for binomial proportions. Stata Technical Bulletin 52: 16–18.
Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 208–211. College Station, TX: Stata Press.
Jeffreys, H. 1946. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society
of London, Series A 186: 453–461.
Lindley, D. V. 2001. Harold Jeffreys. In Statisticians of the Centuries, ed. C. C. Heyde and E. Seneta, 402–405. New
York: Springer.
Reid, C. 1982. Neyman—from Life. New York: Springer.
Rothman, K. J., S. Greenland, and T. L. Lash. 2008. Modern Epidemiology. 3rd ed. Philadelphia: Lippincott Williams
& Wilkins.
Seed, P. T. 2001. sg159: Confidence intervals for correlations. Stata Technical Bulletin 59: 27–28. Reprinted in Stata
Technical Bulletin Reprints, vol. 10, pp. 267–269. College Station, TX: Stata Press.
Stigler, S. M. 1997. Wilson, Edwin Bidwell. In Leading Personalities in Statistical Sciences: From the Seventeenth
Century to the Present, ed. N. L. Johnson and S. Kotz, 344–346. New York: Wiley.
Utts, J. M. 2005. Seeing Through Statistics. 3rd ed. Belmont, CA: Brooks/Cole.
Wang, D. 2000. sg154: Confidence intervals for the ratio of two binomial proportions by Koopman’s method. Stata
Technical Bulletin 58: 16–19. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 244–247. College Station,
TX: Stata Press.
Wilson, E. B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American
Statistical Association 22: 209–212.
Also see
[R] bitest — Binomial probability test
[R] prtest — One- and two-sample tests of proportions
[R] ttest — Mean-comparison tests
[R] ameans — Arithmetic, geometric, and harmonic means
[R] centile — Report centile and confidence interval
[R] summarize — Summary statistics
[D] pctile — Create variable containing percentiles
Title
clogit — Conditional (fixed-effects) logistic regression
Syntax
        clogit depvar [indepvars] [if] [in] [weight] , group(varname) [options]

    options                     description
    ------------------------------------------------------------------------
    Model
  * group(varname)              matched group variable
    offset(varname)             include varname in model with coefficient
                                  constrained to 1
    constraints(numlist)        apply specified linear constraints
    collinear                   keep collinear variables

    SE/Robust
    vce(vcetype)                vcetype may be oim, robust, cluster clustvar,
                                  opg, bootstrap, or jackknife
    nonest                      do not check that panels are nested within
                                  clusters

    Reporting
    level(#)                    set confidence level; default is level(95)
    or                          report odds ratios
    nocnsreport                 do not display constraints
    display options             control spacing and display of omitted
                                  variables and base and empty cells

    Maximization
    maximize options            control the maximization process; seldom used

  † coeflegend                  display coefficients' legend instead of
                                  coefficient table
    ------------------------------------------------------------------------
    * group(varname) is required.
    † coeflegend does not appear in the dialog box.
indepvars may contain factor variables; see [U] 11.4.3 Factor variables.
bootstrap, by, fracpoly, jackknife, mfp, mi estimate, nestreg, rolling, statsby, stepwise, and svy are
allowed; see [U] 11.1.10 Prefix commands.
vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix.
Weights are not allowed with the bootstrap prefix.
vce(), nonest, and weights are not allowed with the svy prefix.
fweights, iweights, and pweights are allowed (see [U] 11.1.6 weight), but they are interpreted to apply to groups
as a whole, not to individual observations. See Use of weights below.
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.
Menu
Statistics  >  Categorical outcomes  >  Conditional logistic regression
Description
clogit fits what biostatisticians and epidemiologists call conditional logistic regression for matched
case – control groups (see, for example, Hosmer Jr. and Lemeshow [2000, chap. 7]) and what economists
and other social scientists call fixed-effects logit for panel data (see, for example, Chamberlain [1980]).
Computationally, these models are the same.
See [R] asclogit if you want to fit McFadden’s choice model (McFadden 1974). Also see [R] logistic
for a list of related estimation commands.
Options
Model
group(varname) is required; it specifies an identifier variable (numeric or string) for the matched
groups. strata(varname) is a synonym for group().
offset(varname), constraints(numlist), collinear; see [R] estimation options.
SE/Robust
vce(vcetype) specifies the type of standard error reported, which includes types that are derived
from asymptotic theory, that are robust to some kinds of misspecification, that allow for intragroup
correlation, and that use bootstrap or jackknife methods; see [R] vce option.
nonest, available only with vce(cluster clustvar), prevents checking that matched groups are
nested within clusters. It is the user’s responsibility to verify that the standard errors are theoretically
correct.
Reporting
level(#); see [R] estimation options.
or reports the estimated coefficients transformed to odds ratios, i.e., $e^b$ rather than $b$. Standard errors
and confidence intervals are similarly transformed. This option affects how results are displayed,
not how they are estimated. or may be specified at estimation or when replaying previously
estimated results.
nocnsreport; see [R] estimation options.
display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels; see [R] estimation options.
Maximization
maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace,
gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#),
nrtolerance(#), nonrtolerance, from(init specs); see [R] maximize. These options are
seldom used.
Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg).
The following option is available with clogit but is not shown in the dialog box:
coeflegend; see [R] estimation options.
Remarks
Remarks are presented under the following headings:
Introduction
Matched case–control data
Use of weights
Fixed-effects logit
Introduction
clogit fits maximum likelihood models with a dichotomous dependent variable coded as 0/1
(more precisely, clogit interprets 0 and not 0 to indicate the dichotomy). Conditional logistic analysis
differs from regular logistic regression in that the data are grouped and the likelihood is calculated
relative to each group; i.e., a conditional likelihood is used. See Methods and formulas at the end of
this entry.
Biostatisticians and epidemiologists fit these models when analyzing matched case – control studies
with 1 : 1 matching, 1 : k2i matching, or k1i : k2i matching, where i denotes the ith matched group for
i = 1, 2, . . . , n, where n is the total number of groups. clogit fits a model appropriate for all these
matching schemes or for any mix of the schemes because the matching k1i : k2i can vary from group
to group. clogit always uses the true conditional likelihood, not an approximation. Biostatisticians
and epidemiologists sometimes refer to the matched groups as “strata”, but we will stick to the more
generic term “group”.
Economists and other social scientists fitting fixed-effects logit models have data that look exactly
like the data biostatisticians and epidemiologists call k1i : k2i matched case – control data. In terms
of how the data are arranged, k1i : k2i matching means that in the ith group, the dependent variable
is 1 a total of k1i times and 0 a total of k2i times. There are a total of Ti = k1i + k2i observations
for the ith group. This data arrangement is what economists and other social scientists call “panel
data”, “longitudinal data”, or “cross-sectional time-series data”.
So no matter what terminology you use, the computation and the use of the clogit command are the same. The following example shows how your data should be arranged to use clogit.
Example 1
Suppose that we have grouped data with the variable id containing a unique identifier for each
group. Our outcome variable, y, contains 0s and 1s. If we were biostatisticians, y = 1 would indicate
a case, y = 0 would be a control, and id would be an identifier variable that indicates the groups of
matched case – control subjects.
If we were economists, y = 1 might indicate that a person was unemployed at any time during
a year and y = 0, that a person was employed all year, and id would be an identifier variable for
persons.
If we list the first few observations of this dataset, it looks like
. use http://www.stata-press.com/data/r11/clogitid
. list y x1 x2 id in 1/11
        +---------------------+
        | y   x1   x2      id |
        |---------------------|
     1. | 0    0    4    1014 |
     2. | 0    1    4    1014 |
     3. | 0    1    6    1014 |
     4. | 1    1    8    1014 |
     5. | 0    0    1    1017 |
        |---------------------|
     6. | 0    0    7    1017 |
     7. | 1    1   10    1017 |
     8. | 0    0    1    1019 |
     9. | 0    1    7    1019 |
    10. | 1    1    7    1019 |
        |---------------------|
    11. | 1    1    9    1019 |
        +---------------------+
Pretending that we are biostatisticians, we describe our data as follows. The first group (id = 1014) consists of four matched persons: 1 case (y = 1) and 3 controls (y = 0), i.e., 1:3 matching. The second group has 1:2 matching, and the third, 2:2.

Pretending that we are economists, we describe our data as follows. The first group consists of 4 observations (one per year) for person 1014. This person was unemployed during 1 year of 4. The second person was unemployed during 1 year of 3, and the third during 2 years of 4.
Our independent variables are x1 and x2. To fit the conditional (fixed-effects) logistic model, we
type
. clogit y x1 x2, group(id)
note: multiple positive outcomes within groups encountered.
Iteration 0:   log likelihood = -123.42828
Iteration 1:   log likelihood = -123.41386
Iteration 2:   log likelihood = -123.41386

Conditional (fixed-effects) logistic regression   Number of obs   =       369
                                                  LR chi2(2)      =      9.07
                                                  Prob > chi2     =    0.0107
Log likelihood = -123.41386                       Pseudo R2       =    0.0355

           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |    .653363   .2875215     2.27   0.023     .0898312    1.216895
          x2 |   .0659169   .0449555     1.47   0.143    -.0221943    .1540281
Technical note
The message “note: multiple positive outcomes within groups encountered” at the top of the
clogit output for the previous example merely informs us that we have k1i : k2i matching with
k1i > 1 for at least one group. If your data should be 1 : k2i matched, this message tells you that
there is an error in the data somewhere.
We can see the distribution of k1i and Ti = k1i + k2i for the data of the previous example by
using the following steps:
. by id, sort: gen k1 = sum(y)
. by id: replace k1 = . if _n < _N
(303 real changes made, 303 to missing)
. by id: gen T = sum(y < .)
. by id: replace T = . if _n < _N
(303 real changes made, 303 to missing)
. tab k1

         k1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         48       72.73       72.73
          2 |         12       18.18       90.91
          3 |          4        6.06       96.97
          4 |          2        3.03      100.00
------------+-----------------------------------
      Total |         66      100.00

. tab T

          T |      Freq.     Percent        Cum.
------------+-----------------------------------
          2 |          5        7.58        7.58
          3 |          5        7.58       15.15
          4 |         12       18.18       33.33
          5 |         11       16.67       50.00
          6 |         13       19.70       69.70
          7 |          8       12.12       81.82
          8 |          3        4.55       86.36
          9 |          7       10.61       96.97
         10 |          2        3.03      100.00
------------+-----------------------------------
      Total |         66      100.00
We see that k1i ranges from 1 to 4 and Ti ranges from 2 to 10 for these data.
Technical note
For $k_{1i}:k_{2i}$ matching (and hence in the general case of fixed-effects logit), clogit uses a recursive algorithm to compute the likelihood, which means that there are no limits on the size of $T_i$. However, computation time is proportional to $\sum T_i \min(k_{1i}, k_{2i})$, so clogit will take roughly 10 times longer to fit a model with 10:10 matching than one with 1:10 matching. But clogit is fast, so computation time becomes an issue only when $\min(k_{1i}, k_{2i})$ is around 100 or more. See Methods and formulas for details.
Matched case–control data
Here we give a more detailed example of matched case – control data.
Example 2
Hosmer Jr. and Lemeshow (2000, 25) present data on matched pairs of infants, each pair having
one with low birthweight and another with regular birthweight. The data are matched on age of the
mother. Several possible maternal exposures are considered: race (three categories), smoking status,
presence of hypertension, presence of uterine irritability, previous preterm delivery, and weight at the
last menstrual period.
. use http://www.stata-press.com/data/r11/lowbirth2, clear
(Applied Logistic Regression, Hosmer & Lemeshow)
. describe

Contains data from http://www.stata-press.com/data/r11/lowbirth2.dta
  obs:           112                          Applied Logistic Regression,
                                                Hosmer & Lemeshow
 vars:             9                          26 Apr 2009 09:33
 size:         1,904 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------------------
pairid          byte   %8.0g                  Case-control pair ID
low             byte   %8.0g                  Baby has low birthweight
age             byte   %8.0g                  Age of mother
lwt             int    %8.0g                  Mother's last menstrual weight
smoke           byte   %8.0g                  Mother smoked during pregnancy
ptd             byte   %8.0g                  Mother had previous preterm baby
ht              byte   %8.0g                  Mother has hypertension
ui              byte   %8.0g                  Uterine irritability
race            float  %9.0g                  race of mother: 1=white, 2=black,
                                                3=other
--------------------------------------------------------------------------
Sorted by:
We list the case – control indicator variable, low; the match identifier variable, pairid; and two of
the covariates, lwt and smoke, for the first 10 observations.
. list low lwt smoke pairid in 1/10
        +----------------------------+
        | low   lwt   smoke   pairid |
        |----------------------------|
     1. |   0   135       0        1 |
     2. |   1   101       1        1 |
     3. |   0    98       0        2 |
     4. |   1   115       0        2 |
     5. |   0    95       0        3 |
        |----------------------------|
     6. |   1   130       0        3 |
     7. |   0   103       0        4 |
     8. |   1   130       1        4 |
     9. |   0   122       1        5 |
    10. |   1   110       1        5 |
        +----------------------------+
We fit a conditional logistic model of low birthweight on mother’s weight, race, smoking behavior,
and history.
. clogit low lwt smoke ptd ht ui i.race, group(pairid) nolog
Conditional (fixed-effects) logistic regression   Number of obs   =       112
                                                  LR chi2(7)      =     26.04
                                                  Prob > chi2     =    0.0005
Log likelihood = -25.794271                       Pseudo R2       =    0.3355

         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lwt |  -.0183757   .0100806    -1.82   0.068    -.0381333    .0013819
       smoke |   1.400656   .6278396     2.23   0.026     .1701131    2.631199
         ptd |   1.808009   .7886502     2.29   0.022     .2622828    3.353735
          ht |   2.361152   1.086128     2.17   0.030     .2323796    4.489924
          ui |   1.401929   .6961585     2.01   0.044     .0374836    2.766375
             |
        race |
          2  |   .5713643    .689645     0.83   0.407    -.7803149    1.923044
          3  |  -.0253148   .6992044    -0.04   0.971     -1.39573    1.345101
We might prefer to see results presented as odds ratios. We could have specified the or option when
we first fit the model, or we can now redisplay results and specify or:
. clogit, or
Conditional (fixed-effects) logistic regression   Number of obs   =       112
                                                  LR chi2(7)      =     26.04
                                                  Prob > chi2     =    0.0005
Log likelihood = -25.794271                       Pseudo R2       =    0.3355

         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lwt |   .9817921    .009897    -1.82   0.068     .9625847    1.001383
       smoke |   4.057862   2.547686     2.23   0.026     1.185439    13.89042
         ptd |   6.098293    4.80942     2.29   0.022     1.299894    28.60938
          ht |   10.60316   11.51639     2.17   0.030     1.261599    89.11467
          ui |    4.06303   2.828513     2.01   0.044     1.038195    15.90088
             |
        race |
          2  |   1.770681   1.221141     0.83   0.407     .4582617     6.84175
          3  |    .975003   .6817263    -0.04   0.971     .2476522    3.838573
Smoking, previous preterm delivery, hypertension, uterine irritability, and possibly the mother’s
weight all contribute to low birthweight. 2.race (mother black) and 3.race (mother other) are
statistically insignificant when compared with the 1.race (mother white) omitted group, although
the 2.race effect is large. We can test the joint statistical significance of 2.race and 3.race by
using test:
. test 2.race 3.race
 ( 1)  [low]2.race = 0
 ( 2)  [low]3.race = 0

           chi2(  2) =    0.88
         Prob > chi2 =    0.6436
For a more complete description of test, see [R] test. test presents results in coefficients rather
than odds ratios. Jointly testing that the coefficients on 2.race and 3.race are 0 is equivalent to
jointly testing that the odds ratios are 1.
Here one case was matched to one control, i.e., 1 : 1 matching. From clogit’s point of view,
that was not important — k1 cases could have been matched to k2 controls (k1 : k2 matching), and
we would have fit the model in the same way. Furthermore, the matching can change from group
to group, which we have denoted as k1i : k2i matching, where i denotes the group. clogit does
not care. To fit the conditional logistic regression model, we specified the group(varname) option,
group(pairid). The case and control are stored in separate observations. clogit knew that they
were linked (in the same group) because the related observations share the same value of pairid.
Technical note
clogit provides a way to extend McNemar’s test to multiple controls per case (1 : k2i matching)
and to multiple controls matched with multiple cases (k1i : k2i matching).
In Stata, McNemar’s test is calculated by the mcc command; see [ST] epitab. The mcc command,
however, requires that the matched case and control appear in one observation, so the data will need to
be manipulated from 1 to 2 observations per stratum before using clogit. Alternatively, if you begin
with clogit’s 2-observations-per-group organization, you will have to change it to 1 observation
per group if you wish to use mcc. In either case, reshape provides an easy way to change the
organization of the data. We will demonstrate its use below, but we direct you to [D] reshape for a
more thorough discussion.
In the example above, we used clogit to analyze the relationship between low birthweight and
various characteristics of the mother. Assume that we now want to assess the relationship between
low birthweight and smoking, ignoring the mother’s other characteristics. Using clogit, we obtain
the following results:
. clogit low smoke, group(pairid) or
Iteration 0:   log likelihood = -35.425931
Iteration 1:   log likelihood = -35.419283
Iteration 2:   log likelihood = -35.419282

Conditional (fixed-effects) logistic regression   Number of obs   =       112
                                                  LR chi2(1)      =      6.79
                                                  Prob > chi2     =    0.0091
Log likelihood = -35.419282                       Pseudo R2       =    0.0875

         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       smoke |       2.75   1.135369     2.45   0.014     1.224347    6.176763
Let’s compare our estimated odds ratio and 95% confidence interval with that produced by mcc.
We begin by reshaping the data:
. keep low smoke pairid
. reshape wide smoke, i(pairid) j(low 0 1)
Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                      112   ->   56
Number of variables                   3   ->   3
j variable (2 values)               low   ->   (dropped)
xij variables:
                                  smoke   ->   smoke0 smoke1
-----------------------------------------------------------------------------
We now have the variables smoke0 (formed from smoke and low = 0), recording 1 if the control
mother smoked and 0 otherwise; and smoke1 (formed from smoke and low = 1), recording 1 if the
case mother smoked and 0 otherwise. We can now use mcc:
. mcc smoke1 smoke0
                 |        Controls        |
Cases            |   Exposed   Unexposed  |      Total
-----------------+------------------------+-----------
         Exposed |         8          22  |         30
       Unexposed |         8          18  |         26
-----------------+------------------------+-----------
           Total |        16          40  |         56

McNemar's chi2(1) =      6.53        Prob > chi2 = 0.0106
Exact McNemar significance probability       = 0.0161

Proportion with factor
        Cases       .5357143
        Controls    .2857143      [95% Conf. Interval]
                    ---------     --------------------
        difference  .25            .0519726    .4480274
        ratio       1.875          1.148685    3.060565
        rel. diff.  .35            .1336258    .5663742
        odds ratio  2.75           1.179154    7.143667   (exact)
Both methods estimated the same odds ratio, and the 95% confidence intervals are similar. clogit
produced a confidence interval of [ 1.22, 6.18 ], whereas mcc produced a confidence interval of
[ 1.18, 7.14 ].
Use of weights
With clogit, weights apply to groups as a whole, not to individual observations. For example,
if there is a group in your dataset with a frequency weight of 3, there are a total of three groups
in your sample with the same values of the dependent and independent variables as this one group.
Weights must have the same value for all observations belonging to the same group; otherwise, an
error message will be displayed.
Example 3
We use the example from the above discussion of the mcc command. Here we have a total of 56
matched case – control groups, each with one case matched to one control. We had 8 matched pairs
in which both the case and the control are exposed, 22 pairs in which the case is exposed and the
control is unexposed, 8 pairs in which the case is unexposed and the control is exposed, and 18 pairs
in which they are both unexposed.
With weights, it is easy to enter these data into Stata and run clogit.
. clear
. input id case exposed weight
              id       case    exposed     weight
  1. 1 1 1 8
  2. 1 0 1 8
  3. 2 1 1 22
  4. 2 0 0 22
  5. 3 1 0 8
  6. 3 0 1 8
  7. 4 1 0 18
  8. 4 0 0 18
  9. end

. clogit case exposed [w=weight], group(id) or
(frequency weights assumed)
Iteration 0:   log likelihood = -35.425931
Iteration 1:   log likelihood = -35.419283
Iteration 2:   log likelihood = -35.419282

Conditional (fixed-effects) logistic regression   Number of obs   =       112
                                                  LR chi2(1)      =      6.79
                                                  Prob > chi2     =    0.0091
Log likelihood = -35.419282                       Pseudo R2       =    0.0875

        case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     exposed |       2.75   1.135369     2.45   0.014     1.224347    6.176763
Fixed-effects logit
The fixed-effects logit model can be written as

$$\Pr(y_{it} = 1 \mid \mathbf{x}_{it}) = F(\alpha_i + \mathbf{x}_{it}\boldsymbol\beta)$$

where $F$ is the cumulative logistic distribution

$$F(z) = \frac{\exp(z)}{1 + \exp(z)}$$

$i = 1, 2, \dots, n$ denotes the independent units (called "groups" by clogit), and $t = 1, 2, \dots, T_i$ denotes the observations for the $i$th unit (group).

Fitting this model by using a full maximum-likelihood approach leads to difficulties, however. When $T_i$ is fixed, the maximum likelihood estimates for $\alpha_i$ and $\boldsymbol\beta$ are inconsistent (Andersen 1970; Chamberlain 1980). This difficulty can be circumvented by looking at the probability of $\mathbf{y}_i = (y_{i1}, \dots, y_{iT_i})$ conditional on $\sum_{t=1}^{T_i} y_{it}$. This conditional probability does not involve the $\alpha_i$, so they are never estimated when the resulting conditional likelihood is used. See Hamerle and Ronning (1995) for a succinct and lucid development. See Methods and formulas for the estimation equation.
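The same conditional likelihood is also available through xtlogit with its fe option, which may be convenient when the data have already been declared as a panel. A minimal sketch under that assumption; it should reproduce the estimates from clogit ..., group(idcode) in Example 4 below:

. use http://www.stata-press.com/data/r11/union, clear
. xtset idcode
. xtlogit union age grade not_smsa south, fe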
Example 4
We are studying unionization of women in the United States by using the union dataset; see
[XT] xt. We fit the fixed-effects logit model:
. use http://www.stata-press.com/data/r11/union, clear
(NLS Women 14-24 in 1968)
. clogit union age grade not_smsa south black, group(idcode)
note: multiple positive outcomes within groups encountered.
note: 2744 groups (14165 obs) dropped because of all positive or
all negative outcomes.
note: black omitted because of no within-group variance.
Iteration 0:   log likelihood = -4521.3385
Iteration 1:   log likelihood = -4516.1404
Iteration 2:   log likelihood = -4516.1385
Iteration 3:   log likelihood = -4516.1385

Conditional (fixed-effects) logistic regression   Number of obs   =     12035
                                                  LR chi2(4)      =     68.09
                                                  Prob > chi2     =    0.0000
Log likelihood = -4516.1385                       Pseudo R2       =    0.0075

       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0170301    .004146     4.11   0.000     .0089042    .0251561
       grade |   .0853572   .0418781     2.04   0.042     .0032777    .1674368
    not_smsa |   .0083678   .1127963     0.07   0.941    -.2127088    .2294445
       south |   -.748023   .1251752    -5.98   0.000    -.9933619   -.5026842
We received three messages at the top of the output. The first one, “multiple positive outcomes within
groups encountered”, we expected. Our data do indeed have multiple positive outcomes (union = 1)
in many groups. (Here a group consists of all the observations for a particular individual.)
The second message tells us that 2,744 groups were “dropped” by clogit. When either union = 0
or union = 1 for all observations for an individual, this individual’s contribution to the log-likelihood
is zero. Although these are perfectly valid observations in every sense, they have no effect on the
estimation, so they are not included in the total “Number of obs”. Hence, the reported “Number of
obs” gives the effective sample size of the estimation. Here it is 12,035 observations — only 46% of
the total 26,200.
We can easily check that there are indeed 2,744 groups with union either all 0 or all 1. We will
generate a variable that contains the fraction of observations for each individual who has union = 1.
. by idcode, sort: generate fraction = sum(union)/sum(union < .)
. by idcode: replace fraction = . if _n < _N
(21766 real changes made, 21766 to missing)
. tabulate fraction

   fraction |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      2,481       55.95       55.95
   .0833333 |         30        0.68       56.63
   .0909091 |         33        0.74       57.37
         .1 |         53        1.20       58.57
 (output omitted )
         .9 |         10        0.23       93.59
   .9090909 |         11        0.25       93.84
   .9166667 |         10        0.23       94.07
          1 |        263        5.93      100.00
------------+-----------------------------------
      Total |      4,434      100.00
Because 2481 + 263 = 2744, we confirm what clogit did.
The third warning message from clogit said “black omitted because of no within-group variance”.
Obviously, race stays constant for an individual across time. Any such variables are collinear with the
αi (i.e., the fixed effects), and just as the αi drop out of the conditional likelihood, so do all variables
that are unchanging within groups. Thus they cannot be estimated with the conditional fixed-effects
model.
There are several other estimators implemented in Stata that we could use with these data:
cloglog . . . , vce(cluster idcode)
logit . . . , vce(cluster idcode)
probit . . . , vce(cluster idcode)
scobit . . . , vce(cluster idcode)
xtcloglog . . .
xtgee . . . , family(binomial) link(logit) corr(exchangeable)
xtlogit . . .
xtprobit . . .
See [R] cloglog, [R] logit, [R] probit, [R] scobit, [XT] xtcloglog, [XT] xtgee, [XT] xtlogit, and
[XT] xtprobit for details.
Saved results
clogit saves the following in e():
Scalars
    e(N)                number of observations
    e(N_drop)           number of observations dropped because of all
                          positive or all negative outcomes
    e(N_group_drop)     number of groups dropped because of all positive or
                          all negative outcomes
    e(k)                number of parameters
    e(k_eq)             number of equations in e(b)
    e(k_eq_model)       number of equations in model Wald test
    e(k_dv)             number of dependent variables
    e(k_autoCns)        number of base, empty, and omitted constraints
    e(df_m)             model degrees of freedom
    e(r2_p)             pseudo-R-squared
    e(ll)               log likelihood
    e(ll_0)             log likelihood, constant-only model
    e(N_clust)          number of clusters
    e(chi2)             χ2
    e(p)                significance
    e(rank)             rank of e(V)
    e(ic)               number of iterations
    e(rc)               return code
    e(converged)        1 if converged, 0 otherwise

Macros
    e(cmd)              clogit
    e(cmdline)          command as typed
    e(depvar)           name of dependent variable
    e(group)            name of group() variable
    e(multiple)         multiple if multiple positive outcomes within group
    e(wtype)            weight type
    e(wexp)             weight expression
    e(title)            title in estimation output
    e(clustvar)         name of cluster variable
    e(offset)           offset
    e(chi2type)         LR; type of model χ2 test
    e(vce)              vcetype specified in vce()
    e(vcetype)          title used to label Std. Err.
    e(opt)              type of optimization
    e(which)            max or min; whether optimizer is to perform
                          maximization or minimization
    e(ml_method)        type of ml method
    e(user)             name of likelihood-evaluator program
    e(technique)        maximization technique
    e(singularHmethod)  m-marquardt or hybrid; method used when Hessian
                          is singular
    e(crittype)         optimization criterion
    e(properties)       b V
    e(predict)          program used to implement predict
    e(marginsok)        predictions allowed by margins
    e(marginsnotok)     predictions disallowed by margins
    e(marginsprop)      signals to the margins command
    e(asbalanced)       factor variables fvset as asbalanced
    e(asobserved)       factor variables fvset as asobserved

Matrices
    e(b)                coefficient vector
    e(Cns)              constraints matrix
    e(ilog)             iteration log (up to 20 iterations)
    e(V)                variance–covariance matrix of the estimators
    e(V_modelbased)     model-based variance
    e(gradient)         gradient vector

Functions
    e(sample)           marks estimation sample
Methods and formulas
clogit is implemented as an ado-file.
Breslow and Day (1980, 247–279), Collett (2003, 251–267), and Hosmer Jr. and Lemeshow (2000,
223–259) provide a biostatistical point of view on conditional logistic regression. Hamerle and
Ronning (1995) give a succinct and lucid review of fixed-effects logit; Chamberlain (1980) is a
standard reference for this model. Greene (2008, chap. 23) provides a straightforward textbook
description of conditional logistic regression from an economist’s point of view, as well as a brief
description of choice models.
Let $i = 1, 2, \dots, n$ denote the groups, and let $t = 1, 2, \dots, T_i$ denote the observations for the $i$th group. Let $y_{it}$ be the dependent variable taking on values 0 or 1. Let $\mathbf{y}_i = (y_{i1}, \dots, y_{iT_i})$ be the outcomes for the $i$th group as a whole. Let $\mathbf{x}_{it}$ be a row vector of covariates. Let

$$k_{1i} = \sum_{t=1}^{T_i} y_{it}$$

be the observed number of ones for the dependent variable in the $i$th group. Biostatisticians would say that there are $k_{1i}$ cases matched to $k_{2i} = T_i - k_{1i}$ controls in the $i$th group.

We consider the probability of a possible value of $\mathbf{y}_i$ conditional on $\sum_{t=1}^{T_i} y_{it} = k_{1i}$ (Hamerle and Ronning 1995, eq. 8.33; Hosmer Jr. and Lemeshow 2000, eq. 7.4),

$$\Pr\Bigl(\mathbf{y}_i \Bigm| \sum_{t=1}^{T_i} y_{it} = k_{1i}\Bigr) = \frac{\exp\bigl(\sum_{t=1}^{T_i} y_{it}\mathbf{x}_{it}\boldsymbol\beta\bigr)}{\sum_{\mathbf{d}_i \in S_i} \exp\bigl(\sum_{t=1}^{T_i} d_{it}\mathbf{x}_{it}\boldsymbol\beta\bigr)}$$

where $d_{it}$ is equal to 0 or 1 with $\sum_{t=1}^{T_i} d_{it} = k_{1i}$, and $S_i$ is the set of all possible combinations of $k_{1i}$ ones and $k_{2i}$ zeros. Clearly, there are $\binom{T_i}{k_{1i}}$ such combinations, but we need not count all these combinations to compute the denominator of the above equation. It can be computed recursively. Denote the denominator by

$$f_i(T_i, k_{1i}) = \sum_{\mathbf{d}_i \in S_i} \exp\Bigl(\sum_{t=1}^{T_i} d_{it}\mathbf{x}_{it}\boldsymbol\beta\Bigr)$$

Consider, computationally, how $f_i$ changes as we go from a total of 1 observation in the group to 2 observations to 3, etc. Doing this, we derive the recursive formula

$$f_i(T, k) = f_i(T-1, k) + f_i(T-1, k-1)\exp(\mathbf{x}_{iT}\boldsymbol\beta)$$
where we define fi (T, k) = 0 if T < k and fi (T, 0) = 1.
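As a small worked check of the recursion, take a group with $T_i = 3$ and $k_{1i} = 1$. Starting from the boundary conditions above,

$$f_i(1,1) = f_i(0,1) + f_i(0,0)\exp(\mathbf{x}_{i1}\boldsymbol\beta) = \exp(\mathbf{x}_{i1}\boldsymbol\beta)$$
$$f_i(2,1) = f_i(1,1) + f_i(1,0)\exp(\mathbf{x}_{i2}\boldsymbol\beta) = \exp(\mathbf{x}_{i1}\boldsymbol\beta) + \exp(\mathbf{x}_{i2}\boldsymbol\beta)$$
$$f_i(3,1) = f_i(2,1) + f_i(2,0)\exp(\mathbf{x}_{i3}\boldsymbol\beta) = \sum_{t=1}^{3}\exp(\mathbf{x}_{it}\boldsymbol\beta)$$

which is exactly the sum over the three arrangements of a single one among three observations, obtained without enumerating them.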
The conditional log-likelihood is

$$\ln L = \sum_{i=1}^{n}\Biggl\{\sum_{t=1}^{T_i} y_{it}\mathbf{x}_{it}\boldsymbol\beta - \ln f_i(T_i, k_{1i})\Biggr\}$$
The derivatives of the conditional log-likelihood can also be computed recursively by taking derivatives
of the recursive formula for fi .
Computation time is roughly proportional to

$$p^2 \sum_{i=1}^{n} T_i \min(k_{1i}, k_{2i})$$
where p is the number of independent variables in the model. If min(k1i , k2i ) is small, computation
time is not an issue. But if it is large—say, 100 or more—patience may be required.
If Ti is large for all groups, the bias of the unconditional fixed-effects estimator is not a concern,
and we can confidently use logit with an indicator variable for each group (provided, of course,
that the number of groups does not exceed matsize; see [R] matsize).
This command supports the clustered version of the Huber/White/sandwich estimator of the
variance using vce(robust) and vce(cluster clustvar). See [P] robust, in particular, in Maximum
likelihood estimators and Methods and formulas. Specifying vce(robust) is equivalent to specifying
vce(cluster groupvar), where groupvar is the variable for the matched groups.
clogit also supports estimation with survey data. For details on VCEs with survey data, see
[SVY] variance estimation.
References
Andersen, E. B. 1970. Asymptotic properties of conditional maximum likelihood estimators. Journal of the Royal
Statistical Society, Series B 32: 283–301.
Breslow, N. E., and N. E. Day. 1980. Statistical Methods in Cancer Research: Vol. 1—The Analysis of Case–Control
Studies. Lyon: IARC.
Chamberlain, G. 1980. Analysis of covariance with qualitative data. Review of Economic Studies 47: 225–238.
Collett, D. 2003. Modelling Binary Data. 2nd ed. London: Chapman & Hall/CRC.
Greene, W. H. 2008. Econometric Analysis. 6th ed. Upper Saddle River, NJ: Prentice–Hall.
Hamerle, A., and G. Ronning. 1995. Panel analysis for qualitative variables. In Handbook of Statistical Modeling for
the Social and Behavioral Sciences, ed. G. Arminger, C. C. Clogg, and M. E. Sobel, 401–451. New York: Plenum.
Hole, A. R. 2007. Fitting mixed logit models by using maximum simulated likelihood. Stata Journal 7: 388–401.
Hosmer Jr., D. W., and S. Lemeshow. 2000. Applied Logistic Regression. 2nd ed. New York: Wiley.
Kleinbaum, D. G., and M. Klein. 2002. Logistic Regression: A Self-Learning Text. 2nd ed. New York: Springer.
Long, J. S., and J. Freese. 2006. Regression Models for Categorical Dependent Variables Using Stata. 2nd ed. College
Station, TX: Stata Press.
McFadden, D. L. 1974. Conditional logit analysis of qualitative choice behavior. In Frontiers in Econometrics, ed.
P. Zarembka, 105–142. New York: Academic Press.
Also see
[R] clogit postestimation — Postestimation tools for clogit
[R] asclogit — Alternative-specific conditional logit (McFadden’s choice) model
[R] logistic — Logistic regression, reporting odds ratios
[R] mlogit — Multinomial (polytomous) logistic regression
[R] nlogit — Nested logit regression
[R] ologit — Ordered logistic regression
[R] scobit — Skewed logistic regression
[SVY] svy estimation — Estimation commands for survey data
[XT] xtgee — Fit population-averaged panel-data models by using GEE
[XT] xtlogit — Fixed-effects, random-effects, and population-averaged logit models
[U] 20 Estimation and postestimation commands
Title
clogit postestimation — Postestimation tools for clogit
Description
The following postestimation commands are available for clogit:
    command         description
    ------------------------------------------------------------------------
    estat           AIC, BIC, VCE, and estimation sample summary
    estat (svy)     postestimation statistics for survey data
    estimates       cataloging estimation results
    hausman         Hausman's specification test
    lincom          point estimates, standard errors, testing, and inference
                      for linear combinations of coefficients
    linktest        link test for model specification
    lrtest (1)      likelihood-ratio test
    margins (2)     marginal means, predictive margins, marginal effects,
                      and average marginal effects
    nlcom           point estimates, standard errors, testing, and inference
                      for nonlinear combinations of coefficients
    predict         predictions, residuals, influence statistics, and other
                      diagnostic measures
    predictnl       point estimates, standard errors, testing, and inference
                      for generalized predictions
    suest           seemingly unrelated estimation
    test            Wald tests of simple and composite linear hypotheses
    testnl          Wald tests of nonlinear hypotheses
    ------------------------------------------------------------------------
    (1) lrtest is not appropriate with svy estimation results.
    (2) The default prediction statistic pc1 cannot be correctly handled by
        margins; however, margins can be used after clogit with options
        predict(pu0) and predict(xb).
See the corresponding entries in the Base Reference Manual for details, but see [SVY] estat for
details about estat (svy).
Syntax for predict
        predict [type] newvar [if] [in] [, statistic nooffset]

    statistic       description
    ------------------------------------------------------------------------
    Main
      pc1           probability of a positive outcome; the default
      pu0           probability of a positive outcome, assuming fixed effect
                      is zero
      xb            linear prediction
      stdp          standard error of the linear prediction
    * dbeta         Delta-β influence statistic
    * dx2           Delta-χ2 lack-of-fit statistic
    * gdbeta        Delta-β influence statistic for each group
    * gdx2          Delta-χ2 lack-of-fit statistic for each group
    * hat           Hosmer and Lemeshow leverage
    * residuals     Pearson residuals
    * rstandard     standardized Pearson residuals
      score         first derivative of the log likelihood with respect
                      to $\mathbf{x}_j\boldsymbol\beta$
    ------------------------------------------------------------------------
    Unstarred statistics are available both in and out of sample; type
    predict ... if e(sample) ... if wanted only for the estimation sample.
    Starred statistics are calculated only for the estimation sample, even
    when if e(sample) is not specified.
    Starred statistics are available for the multiple-controls-per-case
    matching design only. They are not available if vce(robust),
    vce(cluster clustvar), or pweights were specified with clogit.
    dbeta, dx2, gdbeta, gdx2, hat, and rstandard are not available if
    constraints() was specified with clogit.
Menu
Statistics  >  Postestimation  >  Predictions, residuals, etc.
Options for predict
Main
pc1, the default, calculates the probability of a positive outcome conditional on one positive outcome
within group.
pu0 calculates the probability of a positive outcome, assuming that the fixed effect is zero.
xb calculates the linear prediction.
stdp calculates the standard error of the linear prediction.
dbeta calculates the Delta-β influence statistic, a standardized measure of the difference in the
coefficient vector that is due to deletion of the observation.
dx2 calculates the Delta-χ2 influence statistic, reflecting the decrease in the Pearson chi-squared that
is due to deletion of the observation.
gdbeta calculates the approximation to the Pregibon stratum-specific Delta-β influence statistic, a
standardized measure of the difference in the coefficient vector that is due to deletion of the entire
stratum.
gdx2 calculates the approximation to the Pregibon stratum-specific Delta-χ2 influence statistic,
reflecting the decrease in the Pearson chi-squared that is due to deletion of the entire stratum.
hat calculates the Hosmer and Lemeshow leverage or the diagonal element of the hat matrix.
residuals calculates the Pearson residuals.
rstandard calculates the standardized Pearson residuals.
score calculates the equation-level score, ∂ ln L/∂(xit β).
nooffset is relevant only if you specified offset(varname) for clogit. It modifies the calculations
made by predict so that they ignore the offset variable; the linear prediction is treated as xj b
rather than as xj b + offsetj . This option cannot be specified with dbeta, dx2, gdbeta, gdx2,
hat, and rstandard.
Remarks
predict may be used after clogit to obtain predicted values of the index $\mathbf{x}_{it}\boldsymbol\beta$. Predicted
probabilities for conditional logistic regression must be interpreted carefully. Probabilities are estimated
for each group as a whole, not for individual observations. Furthermore, the probabilities are conditional
on the number of positive outcomes in the group (i.e., the number of cases and the number of controls),
or it is assumed that the fixed effect is zero. predict may also be used to obtain influence and lack-of-fit statistics for an individual observation and for the whole group, and to compute Pearson residuals, standardized Pearson residuals, and leverage values.
predict may be used for both within-sample and out-of-sample predictions.
Example 1
Suppose that we have 1 : k2i matched data and that we have previously fit the following model:
. use http://www.stata-press.com/data/r11/clogitid
. clogit y x1 x2, group(id)
(output omitted )
To obtain the predicted values of the index, we could type predict idx, xb to create a new
variable called idx. From idx, we could then calculate the predicted probabilities. Easier, however,
would be to type
. predict phat
(option pc1 assumed; probability of success given one success within group)
phat would then contain the predicted probabilities.
As noted previously, the predicted probabilities are really predicted probabilities for the group as a whole (i.e., they are the predicted probability of observing $y_{it} = 1$ and $y_{it'} = 0$ for all $t' \ne t$).
Thus, if we want to obtain the predicted probabilities for the estimation sample, it is important that,
when we make the calculation, predictions be restricted to the same sample on which we estimated
the data. We cannot predict the probabilities and then just keep the relevant ones because the entire
sample determines each probability. Thus, assuming that we are not attempting to make out-of-sample
predictions, we type
. predict phat2 if e(sample)
(option pc1 assumed; probability of success given one success within group)
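A minimal sketch contrasting the two probability statistics after the model just fit (the variable names phat_c and phat_u are illustrative):

. predict double phat_c if e(sample)          // the default, pc1
. predict double phat_u if e(sample), pu0     // unconditional; fixed effect assumed zero
. summarize phat_c phat_u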
Methods and formulas
All postestimation commands listed above are implemented as ado-files.
Recall that i = 1, . . . , n denote the groups and t = 1, . . . , Ti denote the observations for the ith
group.
predict produces probabilities of a positive outcome within group conditional on there being one positive outcome (pc1),

$$\Pr\Bigl(y_{it} = 1 \Bigm| \sum_{t=1}^{T_i} y_{it} = 1\Bigr) = \frac{\exp(\mathbf{x}_{it}\boldsymbol\beta)}{\sum_{t=1}^{T_i}\exp(\mathbf{x}_{it}\boldsymbol\beta)}$$

or predict calculates the unconditional pu0:

$$\Pr(y_{it} = 1) = \frac{\exp(\mathbf{x}_{it}\boldsymbol\beta)}{1 + \exp(\mathbf{x}_{it}\boldsymbol\beta)}$$

Let $N = \sum_{j=1}^{n} T_j$ denote the total number of observations, $p$ denote the number of covariates, and $\hat\theta_{it}$ denote the conditional predicted probabilities of a positive outcome (pc1).

For the multiple-controls-per-case ($1:k_{2i}$) matching, Hosmer Jr. and Lemeshow (2000, 248–251) propose the following diagnostics:

The Pearson residual is

$$r_{it} = \frac{y_{it} - \hat\theta_{it}}{\sqrt{\hat\theta_{it}}}$$

The leverage (hat) value is defined as

$$h_{it} = \hat\theta_{it}\,\tilde{\mathbf{x}}_{it}\bigl(\widetilde{\mathbf{X}}^T\mathbf{U}\widetilde{\mathbf{X}}\bigr)^{-1}\tilde{\mathbf{x}}_{it}^T$$

where $\tilde{\mathbf{x}}_{it} = \mathbf{x}_{it} - \sum_{j=1}^{T_i}\mathbf{x}_{ij}\hat\theta_{ij}$ is the $1 \times p$ row vector of covariate values centered by a weighted stratum-specific mean, $\mathbf{U}_{N\times N} = \mathrm{diag}\{\hat\theta_{it}\}$, and the rows of $\widetilde{\mathbf{X}}_{N\times p}$ are composed of the $\tilde{\mathbf{x}}_{it}$ values.

The standardized Pearson residual is

$$rs_{it} = \frac{r_{it}}{\sqrt{1 - h_{it}}}$$

The lack-of-fit and influence diagnostics for an individual observation are (respectively) computed as

$$\Delta\chi^2_{it} = rs_{it}^2 \qquad \text{and} \qquad \Delta\hat\beta_{it} = \Delta\chi^2_{it}\,\frac{h_{it}}{1 - h_{it}}$$

The lack-of-fit and influence diagnostics for the groups are the group-specific totals of the respective individual diagnostics shown above.
Reference
Hosmer Jr., D. W., and S. Lemeshow. 2000. Applied Logistic Regression. 2nd ed. New York: Wiley.
Also see
[R] clogit — Conditional (fixed-effects) logistic regression
[U] 20 Estimation and postestimation commands
Title
cloglog — Complementary log-log regression
Syntax
        cloglog depvar [indepvars] [if] [in] [weight] [, options]

    options                     description
    ------------------------------------------------------------------------
    Model
    noconstant                  suppress constant term
    offset(varname)             include varname in model with coefficient
                                  constrained to 1
    asis                        retain perfect predictor variables
    constraints(constraints)    apply specified linear constraints
    collinear                   keep collinear variables

    SE/Robust
    vce(vcetype)                vcetype may be oim, robust, cluster clustvar,
                                  opg, bootstrap, or jackknife

    Reporting
    level(#)                    set confidence level; default is level(95)
    eform                       report exponentiated coefficients
    nocnsreport                 do not display constraints
    display options             control spacing and display of omitted
                                  variables and base and empty cells

    Maximization
    maximize options            control the maximization process; seldom used

  † coeflegend                  display coefficients' legend instead of
                                  coefficient table
    ------------------------------------------------------------------------
    † coeflegend does not appear in the dialog box.
indepvars may contain factor variables; see [U] 11.4.3 Factor variables.
depvar and indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists.
bootstrap, by, jackknife, mi estimate, nestreg, rolling, statsby, stepwise, and svy are allowed; see
[U] 11.1.10 Prefix commands.
vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix.
Weights are not allowed with the bootstrap prefix.
vce() and weights are not allowed with the svy prefix.
fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight.
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.
Menu
Statistics  >  Binary outcomes  >  Complementary log-log regression
Description
cloglog fits maximum-likelihood complementary log-log models.
See [R] logistic for a list of related estimation commands.
Options
Model
noconstant, offset(varname); see [R] estimation options.
asis forces retention of perfect predictor variables and their associated perfectly predicted observations
and may produce instabilities in maximization; see [R] probit.
constraints(constraints), collinear; see [R] estimation options.
SE/Robust
vce(vcetype) specifies the type of standard error reported, which includes types that are derived
from asymptotic theory, that are robust to some kinds of misspecification, that allow for intragroup
correlation, and that use bootstrap or jackknife methods; see [R] vce option.
Reporting
level(#); see [R] estimation options.
eform displays the exponentiated coefficients and corresponding standard errors and confidence
intervals.
nocnsreport; see [R] estimation options.
display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels; see [R] estimation options.
Maximization
maximize options: difficult, technique(algorithm spec), iterate(#), no log, trace,
gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#),
nrtolerance(#), nonrtolerance, from(init specs); see [R] maximize. These options are
seldom used.
Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg).
The following option is available with cloglog but is not shown in the dialog box:
coeflegend; see [R] estimation options.
Remarks
Remarks are presented under the following headings:
Introduction to complementary log-log regression
Robust standard errors
Introduction to complementary log-log regression
cloglog fits maximum likelihood models with dichotomous dependent variables coded as 0/1 (or,
more precisely, coded as 0 and not 0).
Example 1
We have data on the make, weight, and mileage rating of 22 foreign and 52 domestic automobiles.
We wish to fit a model explaining whether a car is foreign based on its weight and mileage. Here is
an overview of our data:
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. keep make mpg weight foreign
. describe

Contains data from http://www.stata-press.com/data/r11/auto.dta
  obs:            74                          1978 Automobile Data
 vars:             4                          13 Apr 2009 17:45
 size:         1,998 (99.9% of memory free)  (_dta has notes)

              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
mpg             int    %8.0g                  Mileage (mpg)
weight          int    %8.0gc                 Weight (lbs.)
foreign         byte   %8.0g       origin     Car type
--------------------------------------------------------------------------
Sorted by:
Note:  dataset has changed since last saved
. inspect foreign

foreign:  Car type                      Number of Observations
[histogram omitted]                           Total    Integers   Nonintegers
                            Negative             -           -          -
                            Zero                52          52          -
                            Positive            22          22          -
                            --------------------------------------------
                            Total               74          74          -
                            Missing              -
                                              -----
                                                74
   (2 unique values)

foreign is labeled and all values are documented in the label.
The variable foreign takes on two unique values, 0 and 1. The value 0 denotes a domestic car,
and 1 denotes a foreign car.
The model that we wish to fit is

$$\Pr(\texttt{foreign} = 1) = F(\beta_0 + \beta_1\,\texttt{weight} + \beta_2\,\texttt{mpg})$$

where $F(z) = 1 - \exp\bigl\{-\exp(z)\bigr\}$.
To fit this model, we type
. cloglog foreign weight mpg
Iteration 0:   log likelihood = -34.054593
Iteration 1:   log likelihood = -27.869915
Iteration 2:   log likelihood = -27.742997
Iteration 3:   log likelihood = -27.742769
Iteration 4:   log likelihood = -27.742769

Complementary log-log regression              Number of obs      =        74
                                              Zero outcomes      =        52
                                              Nonzero outcomes   =        22
                                              LR chi2(2)         =     34.58
Log likelihood = -27.742769                   Prob > chi2        =    0.0000

     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0029153   .0006974    -4.18   0.000    -.0042823   -.0015483
         mpg |  -.1422911    .076387    -1.86   0.062    -.2920069    .0074247
       _cons |   10.09694   3.351841     3.01   0.003     3.527448    16.66642
We find that heavier cars are less likely to be foreign and that cars yielding better gas mileage are
also less likely to be foreign, at least when holding the weight of the car constant.
See [R] maximize for an explanation of the output.
Technical note
Stata interprets a value of 0 as a negative outcome (failure) and treats all other values (except
missing) as positive outcomes (successes). Thus, if your dependent variable takes on the values 0 and
1, 0 is interpreted as failure and 1 as success. If your dependent variable takes on the values 0, 1,
and 2, 0 is still interpreted as failure, but both 1 and 2 are treated as successes.
If you prefer a more formal mathematical statement, when you type cloglog y x, Stata fits the model

$$\Pr(y_j \ne 0 \mid \mathbf{x}_j) = 1 - \exp\bigl\{-\exp(\mathbf{x}_j\boldsymbol\beta)\bigr\}$$
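Given the closed form of $F$, fitted probabilities can be recovered by hand from the linear prediction. A minimal sketch, run after the cloglog fit above (the variable names xbhat and phat are illustrative):

. predict double xbhat, xb                        // linear prediction
. generate double phat = 1 - exp(-exp(xbhat))     // Pr(foreign = 1)
. summarize phat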
Robust standard errors
If you specify the vce(robust) option, cloglog reports robust standard errors, as described in
[U] 20.16 Obtaining robust variance estimates. For the model of foreign on weight and mpg, the
robust calculation increases the standard error of the coefficient on mpg by 44%:
. cloglog foreign weight mpg, vce(robust)
Iteration 0:   log pseudolikelihood = -34.054593
Iteration 1:   log pseudolikelihood = -27.869915
Iteration 2:   log pseudolikelihood = -27.742997
Iteration 3:   log pseudolikelihood = -27.742769
Iteration 4:   log pseudolikelihood = -27.742769

Complementary log-log regression              Number of obs      =        74
                                              Zero outcomes      =        52
                                              Nonzero outcomes   =        22
                                              Wald chi2(2)       =     29.74
Log pseudolikelihood = -27.742769             Prob > chi2        =    0.0000

             |               Robust
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0029153   .0007484    -3.90   0.000    -.0043822   -.0014484
         mpg |  -.1422911   .1102466    -1.29   0.197    -.3583704    .0737882
       _cons |   10.09694   4.317305     2.34   0.019     1.635174     18.5587
Without vce(robust), the standard error for the coefficient on mpg was reported to be 0.076, with
a resulting confidence interval of [ −0.29, 0.01 ].
The vce(cluster clustvar) option can relax the independence assumption required by the
complementary log-log estimator to being just independence between clusters. To demonstrate this
ability, we will switch to a different dataset.
We are studying unionization of women in the United States by using the union dataset; see
[XT] xt. We fit the following model, ignoring that women are observed an average of 5.9 times each
in this dataset:
. use http://www.stata-press.com/data/r11/union, clear
(NLS Women 14-24 in 1968)
. cloglog union age grade not_smsa south##c.year
Iteration 0:   log likelihood = -13606.373
Iteration 1:   log likelihood = -13540.726
Iteration 2:   log likelihood = -13540.607
Iteration 3:   log likelihood = -13540.607

Complementary log-log regression              Number of obs      =     26200
                                              Zero outcomes      =     20389
                                              Nonzero outcomes   =      5811
                                              LR chi2(6)         =    647.24
Log likelihood = -13540.607                   Prob > chi2        =    0.0000

       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0185346   .0043616     4.25   0.000      .009986    .0270833
       grade |   .0452772   .0057125     7.93   0.000     .0340809    .0564736
    not_smsa |  -.1886592   .0317801    -5.94   0.000    -.2509471   -.1263712
     1.south |  -1.422292   .3949381    -3.60   0.000    -2.196356    -.648227
        year |  -.0133007   .0049576    -2.68   0.007    -.0230174   -.0035839
             |
south#c.year |
           1 |   .0105659   .0049234     2.15   0.032     .0009161    .0202157
             |
       _cons |  -1.219801   .2952374    -4.13   0.000    -1.798455   -.6411462
The reported standard errors in this model are probably meaningless. Women are observed repeatedly,
and so the observations are not independent. Looking at the coefficients, we find a large southern
effect against unionization and a different time trend for the south. The vce(cluster clustvar)
option provides a way to fit this model and obtains correct standard errors:
. cloglog union age grade not_smsa south##c.year, vce(cluster id) nolog
Complementary log-log regression              Number of obs      =     26200
                                              Zero outcomes      =     20389
                                              Nonzero outcomes   =      5811
                                              Wald chi2(6)       =    160.76
Log pseudolikelihood = -13540.607             Prob > chi2        =    0.0000

                          (Std. Err. adjusted for 4434 clusters in idcode)

             |               Robust
       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0185346   .0084873     2.18   0.029     .0018999    .0351694
       grade |   .0452772   .0125776     3.60   0.000     .0206255     .069929
    not_smsa |  -.1886592   .0642068    -2.94   0.003    -.3145021   -.0628162
     1.south |  -1.422292    .506517    -2.81   0.005    -2.415047   -.4295365
        year |  -.0133007   .0090628    -1.47   0.142    -.0310633     .004462
             |
south#c.year |
           1 |   .0105659   .0063175     1.67   0.094    -.0018162     .022948
             |
       _cons |  -1.219801   .5175129    -2.36   0.018    -2.234107   -.2054942
These standard errors are larger than those reported by the inappropriate conventional calculation.
By comparison, another way we could fit this model is with an equal-correlation population-averaged
complementary log-log model:
. xtcloglog union age grade not_smsa south##c.year, pa nolog
GEE population-averaged model                 Number of obs      =     26200
Group variable:                 idcode        Number of groups   =      4434
Link:                          cloglog        Obs per group: min =         1
Family:                       binomial                       avg =       5.9
Correlation:              exchangeable                       max =        12
                                              Wald chi2(6)       =    234.66
Scale parameter:                     1        Prob > chi2        =    0.0000

       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0153737   .0081156     1.89   0.058    -.0005326      .03128
       grade |   .0549518   .0095093     5.78   0.000     .0363139    .0735897
    not_smsa |  -.1045232   .0431082    -2.42   0.015    -.1890138   -.0200326
     1.south |  -1.714868   .3384558    -5.07   0.000    -2.378229   -1.051507
        year |  -.0115881   .0084125    -1.38   0.168    -.0280763    .0049001
             |
south#c.year |
           1 |   .0149796   .0041687     3.59   0.000     .0068091    .0231501
             |
       _cons |  -1.488278   .4468005    -3.33   0.001    -2.363991   -.6125652
The coefficient estimates are similar, but these standard errors are smaller than those produced by
cloglog, vce(cluster clustvar). This finding is as we would expect. If the within-panel correlation
assumptions are valid, the population-averaged estimator should be more efficient.
In addition to this estimator, we may use the xtgee command to fit a panel estimator (with
complementary log-log link) and any number of assumptions on the within-idcode correlation.
cloglog, vce(cluster clustvar) is robust to assumptions about within-cluster correlation. That
is, it inefficiently sums within cluster for the standard-error calculation rather than attempting to exploit
what might be assumed about the within-cluster correlation (as do the xtgee population-averaged
models).
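As a concrete illustration, here is a minimal sketch (output omitted) of fitting the same model with an AR(1) within-panel correlation structure instead. Whether an AR(1) structure is sensible for these data is a substantive question, and xtgee may require adjustments, or a different structure, if the panels are too irregularly spaced:

. xtset idcode year
. xtgee union age grade not_smsa south##c.year, family(binomial) link(cloglog)
>      corr(ar 1)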
Saved results

cloglog saves the following in e():

Scalars
    e(N)                number of observations
    e(k)                number of parameters
    e(k_eq)             number of equations in e(b)
    e(k_eq_model)       number of equations in model Wald test
    e(k_dv)             number of dependent variables
    e(k_autoCns)        number of base, empty, and omitted constraints
    e(N_f)              number of zero outcomes
    e(N_s)              number of nonzero outcomes
    e(df_m)             model degrees of freedom
    e(ll)               log likelihood
    e(ll_0)             log likelihood, constant-only model
    e(N_clust)          number of clusters
    e(chi2)             χ²
    e(p)                significance
    e(rank)             rank of e(V)
    e(ic)               number of iterations
    e(rc)               return code
    e(converged)        1 if converged, 0 otherwise
Macros
    e(cmd)              cloglog
    e(cmdline)          command as typed
    e(depvar)           name of dependent variable
    e(wtype)            weight type
    e(wexp)             weight expression
    e(title)            title in estimation output
    e(clustvar)         name of cluster variable
    e(offset)           offset
    e(chi2type)         Wald or LR; type of model χ² test
    e(vce)              vcetype specified in vce()
    e(vcetype)          title used to label Std. Err.
    e(opt)              type of optimization
    e(which)            max or min; whether optimizer is to perform
                        maximization or minimization
    e(ml_method)        type of ml method
    e(user)             name of likelihood-evaluator program
    e(singularHmethod)  m-marquardt or hybrid; method used when Hessian
                        is singular
    e(technique)        maximization technique
    e(crittype)         optimization criterion
    e(properties)       b V
    e(predict)          program used to implement predict
    e(asbalanced)       factor variables fvset as asbalanced
    e(asobserved)       factor variables fvset as asobserved
Matrices
    e(b)                coefficient vector
    e(Cns)              constraints matrix
    e(ilog)             iteration log (up to 20 iterations)
    e(gradient)         gradient vector
    e(V)                variance–covariance matrix of the estimators
    e(V_modelbased)     model-based variance
Functions
    e(sample)           marks estimation sample
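As a quick illustration (a sketch only, using the saved results listed above), these results can be inspected after estimation:

. cloglog union age grade not_smsa south##c.year
 (output omitted )
. ereturn list
. display e(N_s) " nonzero and " e(N_f) " zero outcomes out of " e(N)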
Methods and formulas
cloglog is implemented as an ado-file.
Complementary log-log analysis (related to the gompit model, so named because of its relationship
to the Gompertz distribution) is an alternative to logit and probit analysis, but it is unlike these other
estimators in that the transformation is not symmetric. Typically, this model is used when the positive
(or negative) outcome is rare.
The log-likelihood function for complementary log-log is

    \ln L = \sum_{j \in S} w_j \ln F(x_j b) + \sum_{j \notin S} w_j \ln\{1 - F(x_j b)\}

where S is the set of all observations j such that y_j ≠ 0, F(z) = 1 − exp{−exp(z)}, and w_j denotes the optional weights. lnL is maximized as described in [R] maximize.
We can fit a gompit model by reversing the success–failure sense of the dependent variable and
using cloglog.
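For example, a minimal sketch (output omitted) with the automobile data; the variable name domestic is one we create here for illustration:

. use http://www.stata-press.com/data/r11/auto, clear
(1978 Automobile Data)
. generate byte domestic = !foreign
. cloglog domestic weight mpg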
This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] _robust, in particular, Maximum likelihood estimators and Methods and formulas. The scores are calculated as u_j = [exp(x_j b) exp{−exp(x_j b)}/F(x_j b)]x_j for the positive outcomes and {−exp(x_j b)}x_j for the negative outcomes.
cloglog also supports estimation with survey data. For details on VCEs with survey data, see
[SVY] variance estimation.
Acknowledgment
We thank Joseph Hilbe of Arizona State University for providing the inspiration for the cloglog
command (Hilbe 1996, 1998).
References
Clayton, D. G., and M. Hills. 1993. Statistical Models in Epidemiology. Oxford: Oxford University Press.
Hilbe, J. M. 1996. sg53: Maximum-likelihood complementary log-log regression. Stata Technical Bulletin 32: 19–20.
Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 129–131. College Station, TX: Stata Press.
. 1998. sg53.2: Stata-like commands for complementary log-log regression. Stata Technical Bulletin 41: 23.
Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 166–167. College Station, TX: Stata Press.
Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage.
Long, J. S., and J. Freese. 2006. Regression Models for Categorical Dependent Variables Using Stata. 2nd ed. College
Station, TX: Stata Press.
Xu, J., and J. S. Long. 2005. Confidence intervals for predicted outcomes in regression models for categorical outcomes.
Stata Journal 5: 537–559.
Also see
[R] cloglog postestimation — Postestimation tools for cloglog
[R] clogit — Conditional (fixed-effects) logistic regression
[R] glm — Generalized linear models
[R] logistic — Logistic regression, reporting odds ratios
[R] scobit — Skewed logistic regression
[SVY] svy estimation — Estimation commands for survey data
[XT] xtcloglog — Random-effects and population-averaged cloglog models
[U] 20 Estimation and postestimation commands
Title
cloglog postestimation — Postestimation tools for cloglog
Description
The following postestimation commands are available for cloglog:

command        description
------------------------------------------------------------------------------
estat          AIC, BIC, VCE, and estimation sample summary
estat (svy)    postestimation statistics for survey data
estimates      cataloging estimation results
lincom         point estimates, standard errors, testing, and inference for
               linear combinations of coefficients
linktest       link test for model specification
lrtest¹        likelihood-ratio test
margins        marginal means, predictive margins, marginal effects, and
               average marginal effects
nlcom          point estimates, standard errors, testing, and inference for
               nonlinear combinations of coefficients
predict        predictions, residuals, influence statistics, and other
               diagnostic measures
predictnl      point estimates, standard errors, testing, and inference for
               generalized predictions
suest          seemingly unrelated estimation
test           Wald tests of simple and composite linear hypotheses
testnl         Wald tests of nonlinear hypotheses
------------------------------------------------------------------------------
¹ lrtest is not appropriate with svy estimation results.

See the corresponding entries in the Base Reference Manual for details, but see [SVY] estat for details about estat (svy).
Syntax for predict

    predict [type] newvar [if] [in] [, statistic nooffset]

statistic    description
------------------------------------------------------------------------------
Main
  pr         probability of a positive outcome; the default
  xb         linear prediction
  stdp       standard error of the linear prediction
  score      first derivative of the log likelihood with respect to x_j β
------------------------------------------------------------------------------
These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Menu

Statistics > Postestimation > Predictions, residuals, etc.
Options for predict
Main
pr, the default, calculates the probability of a positive outcome.
xb calculates the linear prediction.
stdp calculates the standard error of the linear prediction.
score calculates the equation-level score, ∂lnL/∂(x_j β).

nooffset is relevant only if you specified offset(varname) for cloglog. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as x_j b rather than as x_j b + offset_j.
Remarks
Once you have fit a model, you can obtain the predicted probabilities by using the predict
command for both the estimation sample and other samples; see [U] 20 Estimation and postestimation
commands and [R] predict. Here we will make only a few comments.
predict without arguments calculates the predicted probability of a positive outcome. With the
xb option, it calculates the linear combination xj b, where xj are the independent variables in the
j th observation and b is the estimated parameter vector.
With the stdp option, predict calculates the standard error of the linear prediction, which is not
adjusted for replicated covariate patterns in the data.
Example 1

In example 1 in [R] cloglog, we fit the complementary log-log model cloglog foreign weight mpg. To obtain predicted probabilities,

. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. cloglog foreign weight mpg
 (output omitted )
. predict p
(option pr assumed; Pr(foreign))
. summarize foreign p

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     foreign |        74    .2972973    .4601885          0          1
           p |        74    .2928348      .29732   .0032726   .9446067
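Because pr is simply the complementary log-log transform of the linear prediction, the same probabilities can be rebuilt from xb by hand. A sketch (the variable names xbhat and p2 are ours), using Stata's invcloglog() function, which returns 1 − exp{−exp(z)}; p and p2 should agree up to rounding:

. predict xbhat, xb
. generate double p2 = invcloglog(xbhat)
. summarize p p2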
Methods and formulas
All postestimation commands listed above are implemented as ado-files.
Also see
[R] cloglog — Complementary log-log regression
[U] 20 Estimation and postestimation commands
Title
cnsreg — Constrained linear regression
Syntax

    cnsreg depvar indepvars [if] [in] [weight], constraints(constraints) [options]

options                       description
------------------------------------------------------------------------------
Model
 * constraints(constraints)   apply specified linear constraints
   noconstant                 suppress constant term

SE/Robust
   vce(vcetype)               vcetype may be ols, robust, cluster clustvar,
                              bootstrap, or jackknife

Reporting
   level(#)                   set confidence level; default is level(95)
   nocnsreport                do not display constraints
   display_options            control spacing and display of omitted variables
                              and base and empty cells
 † mse1                       force MSE to be 1
 † coeflegend                 display coefficients' legend instead of
                              coefficient table
------------------------------------------------------------------------------
* constraints(constraints) is required.
† mse1 and coeflegend do not appear in the dialog.
indepvars may contain factor variables; see [U] 11.4.3 Factor variables.
depvar and indepvars may contain time-series operators; see [U] 11.4.4 Time-series varlists.
bootstrap, by, jackknife, mi estimate, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands.
vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix.
Weights are not allowed with the bootstrap prefix.
aweights are not allowed with the jackknife prefix.
vce(), mse1, and weights are not allowed with the svy prefix.
aweights, fweights, pweights, and iweights are allowed; see [U] 11.1.6 weight.
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.
Menu

Statistics > Linear models and related > Constrained linear regression
Description
cnsreg fits constrained linear regression models. cnsreg typed without arguments redisplays the
previous cnsreg results.
Options
Model
constraints(constraints), noconstant; see [R] estimation options.
SE/Robust
vce(vcetype) specifies the type of standard error reported, which includes types that are derived
from asymptotic theory, that are robust to some kinds of misspecification, that allow for intragroup
correlation, and that use bootstrap or jackknife methods; see [R] vce option.
vce(ols), the default, uses the standard variance estimator for ordinary least-squares regression.
Reporting
level(#); see [R] estimation options.
nocnsreport; see [R] estimation options.
display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels; see [R] estimation options.
The following options are available with cnsreg but are not shown in the dialog box:
mse1 is used only in programs and ado-files that use cnsreg to fit models other than constrained linear regression. mse1 sets the mean squared error to 1, thus forcing the variance–covariance matrix of the estimators to be (X'DX)^{-1} (see Methods and formulas in [R] regress) and affecting calculated standard errors. Degrees of freedom for t statistics are calculated as n rather than n − p + c, where p is the total number of parameters (prior to restrictions and including the constant) and c is the number of constraints.
mse1 is not allowed with the svy prefix.
coeflegend; see [R] estimation options.
Remarks
For a discussion of constrained linear regression, see Greene (2008, 88); Hill, Griffiths, and
Lim (2008, 146–148); or Davidson and MacKinnon (1993, 17).
Example 1
In principle, we can obtain constrained linear regression estimates by modifying the list of
independent variables. For instance, if we wanted to fit the model
mpg = β0 + β1 price + β2 weight + u
and constrain β1 = β2 , we could write
mpg = β0 + β1 (price + weight) + u
and run a regression of mpg on price + weight. The estimated coefficient on the sum would be the
constrained estimate of β1 and β2 . Using cnsreg, however, is easier:
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. constraint 1 price = weight
. cnsreg mpg price weight, constraint(1)
Constrained linear regression                     Number of obs =          74
                                                  F(  1,    72) =       37.59
                                                  Prob > F      =      0.0000
                                                  Root MSE      =       4.722

 ( 1)  price - weight = 0

         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       price |  -.0009875   .0001611    -6.13   0.000    -.0013086   -.0006664
      weight |  -.0009875   .0001611    -6.13   0.000    -.0013086   -.0006664
       _cons |   30.36718   1.577958    19.24   0.000     27.22158    33.51278
We define constraints by using the constraint command; see [R] constraint. We fit the model with
cnsreg and specify the constraint number or numbers in the constraints() option.
Just to show that the results above are correct, here is the result of applying the constraint by hand:
. generate x = price + weight
. regress mpg x

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  1,    72) =   37.59
       Model |  838.065767     1  838.065767           Prob > F      =  0.0000
    Residual |  1605.39369    72  22.2971346           R-squared     =  0.3430
-------------+------------------------------           Adj R-squared =  0.3339
       Total |  2443.45946    73  33.4720474           Root MSE      =   4.722

         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |  -.0009875   .0001611    -6.13   0.000    -.0013086   -.0006664
       _cons |   30.36718   1.577958    19.24   0.000     27.22158    33.51278
Example 2

Models can be fit subject to multiple simultaneous constraints. We simply define the constraints and then include the constraint numbers in the constraints() option. For instance, say that we wish to fit the model

    mpg = β0 + β1 price + β2 weight + β3 displ + β4 gear_ratio + β5 foreign + β6 length + u

subject to the constraints

    β1 = β2 = β3 = β6
    β4 = −β5 = β0/20

(This model, like the one in example 1, is admittedly senseless.) We fit the model by typing
. constraint 1 price=weight
. constraint 2 displ=weight
. constraint 3 length=weight
. constraint 5 gear_ratio = -foreign
. constraint 6 gear_ratio = _cons/20
. cnsreg mpg price weight displ gear_ratio foreign length, c(1-3,5-6)
Constrained linear regression                     Number of obs =          74
                                                  F(  2,    72) =      785.20
                                                  Prob > F      =      0.0000
                                                  Root MSE      =      4.6823

 ( 1)  price - weight = 0
 ( 2)  - weight + displacement = 0
 ( 3)  - weight + length = 0
 ( 4)  gear_ratio + foreign = 0
 ( 5)  gear_ratio - .05 _cons = 0

          mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
        price |   -.000923   .0001534    -6.02   0.000    -.0012288   -.0006172
       weight |   -.000923   .0001534    -6.02   0.000    -.0012288   -.0006172
 displacement |   -.000923   .0001534    -6.02   0.000    -.0012288   -.0006172
   gear_ratio |   1.326114   .0687589    19.29   0.000     1.189046    1.463183
      foreign |  -1.326114   .0687589   -19.29   0.000    -1.463183   -1.189046
       length |   -.000923   .0001534    -6.02   0.000    -.0012288   -.0006172
        _cons |   26.52229   1.375178    19.29   0.000     23.78092    29.26365
There are many ways we could have specified the constraints() option (which we abbreviated
c() above). We typed c(1-3,5-6), meaning that we want constraints 1 through 3 and 5 and 6; those
numbers correspond to the constraints we defined. The only reason we did not use the number 4
was to emphasize that constraints do not have to be consecutively numbered. We typed c(1-3,5-6),
but we could have typed c(1,2,3,5,6) or c(1-3,5,6) or c(1-2,3,5,6) or even c(1-6), which
would have worked as long as constraint 4 was not defined. If we had previously defined a constraint
4, then c(1-6) would have included it.
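After estimation, a quick check using Stata's _b[] coefficient notation (a sketch only) confirms that the fitted coefficients obey the constraints we imposed:

. display _b[price] " = " _b[weight] " = " _b[displacement] " = " _b[length]
. display _b[gear_ratio] " = " -_b[foreign] " = " _b[_cons]/20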
Saved results

cnsreg saves the following in e():

Scalars
    e(N)                number of observations
    e(k_autoCns)        number of base, empty, and omitted constraints
    e(df_m)             model degrees of freedom
    e(df_r)             residual degrees of freedom
    e(F)                F statistic
    e(rmse)             root mean squared error
    e(ll)               log likelihood
    e(N_clust)          number of clusters
    e(rank)             rank of e(V)

Macros
    e(cmd)              cnsreg
    e(cmdline)          command as typed
    e(depvar)           name of dependent variable
    e(wtype)            weight type
    e(wexp)             weight expression
    e(title)            title in estimation output
    e(clustvar)         name of cluster variable
    e(vce)              vcetype specified in vce()
    e(vcetype)          title used to label Std. Err.
    e(properties)       b V
    e(predict)          program used to implement predict
    e(asbalanced)       factor variables fvset as asbalanced
    e(asobserved)       factor variables fvset as asobserved

Matrices
    e(b)                coefficient vector
    e(Cns)              constraints matrix
    e(V)                variance–covariance matrix of the estimators
    e(V_modelbased)     model-based variance

Functions
    e(sample)           marks estimation sample
Methods and formulas

cnsreg is implemented as an ado-file.

Let n be the number of observations, p be the total number of parameters (prior to restrictions and including the constant), and c be the number of constraints. The coefficients are calculated as

    b' = T\{(T'X'WXT)^{-1}(T'X'Wy - T'X'WXa')\} + a'

where T and a are as defined in [P] makecns. W = I if no weights are specified. If weights are specified, let v: 1 × n be the specified weights. If fweight frequency weights are specified, W = diag(v). If aweight analytic weights are specified, then W = diag{v/(1'v)(1'1)}, meaning that the weights are normalized to sum to the number of observations.

The mean squared error is s² = (y'Wy − 2b'X'Wy + b'X'WXb)/(n − p + c). The variance–covariance matrix is s²T(T'X'WXT)^{-1}T'.
This command supports the Huber/White/sandwich estimator of the variance and its clustered version using vce(robust) and vce(cluster clustvar), respectively. See [P] _robust, in particular, Introduction and Methods and formulas.
cnsreg also supports estimation with survey data. For details on VCEs with survey data, see [SVY]
variance estimation.
References
Davidson, R., and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University
Press.
Greene, W. H. 2008. Econometric Analysis. 6th ed. Upper Saddle River, NJ: Prentice–Hall.
Hill, R. C., W. E. Griffiths, and G. C. Lim. 2008. Principles of Econometrics. 3rd ed. Hoboken, NJ: Wiley.
Also see
[R] cnsreg postestimation — Postestimation tools for cnsreg
[R] regress — Linear regression
[SVY] svy estimation — Estimation commands for survey data
[U] 20 Estimation and postestimation commands
Title
cnsreg postestimation — Postestimation tools for cnsreg
Description
The following postestimation commands are available for cnsreg:

command        description
------------------------------------------------------------------------------
estat          AIC, BIC, VCE, and estimation sample summary
estat (svy)    postestimation statistics for survey data
estimates      cataloging estimation results
lincom         point estimates, standard errors, testing, and inference for
               linear combinations of coefficients
linktest       link test for model specification
lrtest¹        likelihood-ratio test
margins        marginal means, predictive margins, marginal effects, and
               average marginal effects
nlcom          point estimates, standard errors, testing, and inference for
               nonlinear combinations of coefficients
predict        predictions, residuals, influence statistics, and other
               diagnostic measures
predictnl      point estimates, standard errors, testing, and inference for
               generalized predictions
suest          seemingly unrelated estimation
test           Wald tests of simple and composite linear hypotheses
testnl         Wald tests of nonlinear hypotheses
------------------------------------------------------------------------------
¹ lrtest is not appropriate with svy estimation results.

See the corresponding entries in the Base Reference Manual for details, but see [SVY] estat for details about estat (svy).
Syntax for predict

    predict [type] newvar [if] [in] [, statistic]

statistic      description
------------------------------------------------------------------------------
Main
  xb           linear prediction; the default
  residuals    residuals
  stdp         standard error of the prediction
  stdf         standard error of the forecast
  pr(a,b)      Pr(a < y_j < b)
  e(a,b)       E(y_j | a < y_j < b)
  ystar(a,b)   E(y*_j), y*_j = max{a, min(y_j, b)}
  score        equivalent to residuals
------------------------------------------------------------------------------
These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
stdf is not allowed with svy estimation results.

where a and b may be numbers or variables; a missing (a ≥ .) means −∞, and b missing (b ≥ .) means +∞; see [U] 12.2.1 Missing values.
Menu

Statistics > Postestimation > Predictions, residuals, etc.
Options for predict
Main
xb, the default, calculates the linear prediction.
residuals calculates the residuals, that is, y_j − x_j b.
stdp calculates the standard error of the prediction, which can be thought of as the standard error of
the predicted expected value or mean for the observation’s covariate pattern. The standard error
of the prediction is also referred to as the standard error of the fitted value.
stdf calculates the standard error of the forecast, which is the standard error of the point prediction
for 1 observation. It is commonly referred to as the standard error of the future or forecast value.
By construction, the standard errors produced by stdf are always larger than those produced by
stdp; see Methods and formulas in [R] regress.
pr(a,b) calculates Pr(a < x_j b + u_j < b), the probability that y_j|x_j would be observed in the interval (a, b).

    a and b may be specified as numbers or variable names; lb and ub are variable names;
    pr(20,30) calculates Pr(20 < x_j b + u_j < 30);
    pr(lb,ub) calculates Pr(lb < x_j b + u_j < ub); and
    pr(20,ub) calculates Pr(20 < x_j b + u_j < ub).

    a missing (a ≥ .) means −∞; pr(.,30) calculates Pr(−∞ < x_j b + u_j < 30);
    pr(lb,30) calculates Pr(−∞ < x_j b + u_j < 30) in observations for which lb ≥ .
    and calculates Pr(lb < x_j b + u_j < 30) elsewhere.

    b missing (b ≥ .) means +∞; pr(20,.) calculates Pr(+∞ > x_j b + u_j > 20);
    pr(20,ub) calculates Pr(+∞ > x_j b + u_j > 20) in observations for which ub ≥ .
    and calculates Pr(20 < x_j b + u_j < ub) elsewhere.

e(a,b) calculates E(x_j b + u_j | a < x_j b + u_j < b), the expected value of y_j|x_j conditional on y_j|x_j being in the interval (a, b), meaning that y_j|x_j is censored. a and b are specified as they are for pr().

ystar(a,b) calculates E(y*_j), where y*_j = a if x_j b + u_j ≤ a, y*_j = b if x_j b + u_j ≥ b, and y*_j = x_j b + u_j otherwise, meaning that y*_j is truncated. a and b are specified as they are for pr().
score is equivalent to residuals for linear regression models.
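As a brief illustration (a sketch only; pr20_30 and e20_30 are variable names we choose here), after the constrained regression of example 1 in [R] cnsreg one might compute interval probabilities and censored expectations:

. cnsreg mpg price weight, constraints(1)
 (output omitted )
. predict pr20_30, pr(20,30)
. predict e20_30, e(20,30)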
Methods and formulas
All postestimation commands listed above are implemented as ado-files.
Also see
[R] cnsreg — Constrained linear regression
[U] 20 Estimation and postestimation commands
Title
constraint — Define and list constraints
Syntax

Define constraints
    constraint [define] # [exp = exp | coeflist]

List constraints
    constraint dir [numlist | _all]
    constraint list [numlist | _all]

Drop constraints
    constraint drop [numlist | _all]

Programmer's commands
    constraint get #
    constraint free

where coeflist is as defined in [R] test and # is restricted to the range 1–1,999, inclusive.
Menu

Statistics > Other > Manage constraints
Description
constraint defines, lists, and drops linear constraints. Constraints are for use by models that
allow constrained estimation.
Constraints are defined by the constraint command. The currently defined constraints can be
listed by either constraint list or constraint dir; both do the same thing. Existing constraints
can be eliminated by constraint drop.
constraint get and constraint free are programmer’s commands. constraint get returns
the contents of the specified constraint in macro r(contents) and returns in scalar r(defined) 0
or 1—1 being returned if the constraint was defined. constraint free returns the number of a free
(unused) constraint in macro r(free).
Remarks
Using constraints is discussed in [R] cnsreg, [R] mlogit, and [R] reg3; this entry is concerned only
with practical aspects of defining and manipulating constraints.
Example 1
Constraints are numbered from 1 to 1,999, and we assign the number when we define the constraint:
. use http://www.stata-press.com/data/r11/sysdsn1
. constraint 2 [Indemnity]2.site = 0
The currently defined constraints can be listed by constraint list:
. constraint list
2: [Indemnity]2.site = 0
constraint drop drops constraints:
. constraint drop 2
. constraint list
The empty list after constraint list indicates that no constraints are defined. Below we demonstrate
the various syntaxes allowed by constraint:
. constraint 1 [Indemnity]
. constraint 10 [Indemnity]: 1.site 2.site
. constraint 11 [Indemnity]: 3.site
. constraint 21 [Prepaid=Uninsure]: nonwhite
. constraint 30 [Prepaid]
. constraint 31 [Insure]
. constraint list
1: [Indemnity]
10: [Indemnity]: 1.site 2.site
11: [Indemnity]: 3.site
21: [Prepaid=Uninsure]: nonwhite
30: [Prepaid]
31: [Insure]
. constraint drop 21-25, 31
. constraint list
1: [Indemnity]
10: [Indemnity]: 1.site 2.site
11: [Indemnity]: 3.site
30: [Prepaid]
. constraint drop _all
. constraint list
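For programmers, here is a short sketch of constraint free and constraint get in use (the local macro name num is ours); constraint free leaves an unused constraint number in the macro r(free), and constraint get returns r(contents) and r(defined) as described below:

. constraint free
. local num `r(free)'
. constraint `num' [Indemnity]: 3.site
. constraint get `num'
. display r(defined)
. display "`r(contents)'"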
Technical note
The constraint command does not check the syntax of the constraint itself because a constraint
can be interpreted only in the context of a model. Thus constraint is willing to define constraints
that later will not make sense. Any errors in the constraints will be detected and mentioned at the
time of estimation.
Reference
Weesie, J. 1999. sg100: Two-stage linear constrained estimation. Stata Technical Bulletin 47: 24–30. Reprinted in Stata
Technical Bulletin Reprints, vol. 8, pp. 217–225. College Station, TX: Stata Press.
Also see
[R] cnsreg — Constrained linear regression
Title
copyright — Display copyright information
Syntax
copyright
Description
copyright presents copyright notifications concerning tools, libraries, etc., used in the construction
of Stata.
Remarks
The correct form for a copyright notice is

    Copyright © dates by author/owner

The word "Copyright" is spelled out. You can use the © symbol, but "(C)" has never been given legal recognition. The phrase "All Rights Reserved" was historically required but is no longer needed.

Currently, most works are copyrighted from the moment they are written, and no copyright notice is required. Copyright concerns the protection of the expression and structure of facts and ideas, not the facts and ideas themselves. Copyright concerns the ownership of the expression and not the name given to the expression, which is covered under trademark law.
Copyright law as it exists today began in England in 1710 with the Statute of Anne, An Act for the Encouragement of Learning, by Vesting the Copies of Printed Books in the Authors or Purchasers of Such Copies, during the Times therein mentioned. In 1672, Massachusetts introduced the first copyright law in what was to become the United States. After the Revolutionary War, copyright was introduced into the U.S. Constitution in 1787 and went into effect on May 31, 1790. On June 9, 1790, the first copyright in the United States was registered for The Philadelphia Spelling Book by John Barry.

There are significant differences in the understanding of copyright in the English- and non-English-speaking world. The Napoleonic or Civil Code, the dominant legal system in the non-English-speaking world, splits the rights into two classes: the author's economic rights and the author's moral rights. Moral rights are available only to "natural persons". Legal persons (corporations) have economic rights but not moral rights.
Also see
Copyright page of this book
Title
copyright lapack — LAPACK copyright notification
Description
Stata uses portions of LAPACK, a linear algebra package, with the express permission of the authors
pursuant to the following notice:
Copyright © 1992–2008 The University of Tennessee. All rights reserved.

• Redistributions of source code must retain the above copyright notice, this list of conditions, and the following disclaimer.
• Redistributions in binary form must reproduce the above copyright notice, this list of conditions,
and the following disclaimer, listed in this license in the documentation or other materials provided
with the distribution or both.
• Neither the names of the copyright holders nor the names of its contributors may be used to
endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS
IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY,
OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
Also see
[R] copyright — Display copyright information
Title
copyright scintilla — Scintilla copyright notification
Description
Stata uses portions of Scintilla with the express permission of the author, pursuant to the following
notice:
Copyright © 1998–2002 by Neil Hodgson <neilh@scintilla.org>
All Rights Reserved
Permission to use, copy, modify, and distribute this software and its documentation
for any purpose and without fee is hereby granted, provided that the above copyright
notice appear in all copies and that both that copyright notice and this permission
notice appear in supporting documentation.
NEIL HODGSON DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE,
INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO
EVENT SHALL NEIL HODGSON BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF
USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF THIS SOFTWARE.
Also see
[R] copyright — Display copyright information
Title
copyright ttf2pt1 — ttf2pt1 copyright notification
Description
Stata uses portions of ttf2pt1 to convert TrueType fonts to PostScript fonts, with express permission
of the authors, pursuant to the following notice:
Copyright © 1997–2003 by the AUTHORS:
Andrew Weeks <ccsaw@bath.ac.uk>
Frank M. Siegert <fms@this.net>
Mark Heath <mheath@netspace.net.au>
Thomas Henlich <thenlich@rcs.urz.tu-dresden.de>
Sergey Babkin <babkin@users.sourceforge.net>, <sab123@hotmail.com>
Turgut Uyar <uyar@cs.itu.edu.tr>
Rihardas Hepas <rch@WriteMe.Com>
Szalay Tamas <tomek@elender.hu>
Johan Vromans <jvromans@squirrel.nl>
Petr Titera <P.Titera@sh.cvut.cz>
Lei Wang <lwang@amath8.amt.ac.cn>
Chen Xiangyang <chenxy@sun.ihep.ac.cn>
Zvezdan Petkovic <z.petkovic@computer.org>
Rigel <rigel863@yahoo.com>
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided
that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and
the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions
and the following disclaimer in the documentation and/or other materials provided with the
distribution.
3. All advertising materials mentioning features or use of this software must display the following
acknowledgment: This product includes software developed by the TTF2PT1 Project and its
contributors.
THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Also see
[R] copyright — Display copyright information
Title
correlate — Correlations (covariances) of variables or coefficients
Syntax

Display correlation matrix or covariance matrix
    correlate [varlist] [if] [in] [weight] [, correlate_options]

Display all pairwise correlation coefficients
    pwcorr [varlist] [if] [in] [weight] [, pwcorr_options]
correlate_options    description
------------------------------------------------------------------------------
Options
  means              display means, standard deviations, minimums, and
                     maximums with matrix
  noformat           ignore display format associated with variables
  covariance         display covariances
  wrap               allow wide matrices to wrap
------------------------------------------------------------------------------

pwcorr_options       description
------------------------------------------------------------------------------
Main
  obs                print number of observations for each entry
  sig                print significance level for each entry
  listwise           use listwise deletion to handle missing values
  casewise           synonym for listwise
  print(#)           significance level for displaying coefficients
  star(#)            significance level for displaying with a star
  bonferroni         use Bonferroni-adjusted significance level
  sidak              use Šidák-adjusted significance level
------------------------------------------------------------------------------
varlist may contain time-series operators; see [U] 11.4.4 Time-series varlists.
by is allowed with correlate and pwcorr; see [D] by.
aweights and fweights are allowed; see [U] 11.1.6 weight.
Menu

correlate
    Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Correlations and covariances

pwcorr
    Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Pairwise correlations
Description
The correlate command displays the correlation matrix or covariance matrix for a group of
variables. If varlist is not specified, the matrix is displayed for all variables in the dataset. Also see
the estat vce command in [R] estat.
pwcorr displays all the pairwise correlation coefficients between the variables in varlist or, if
varlist is not specified, all the variables in the dataset.
Options for correlate
Options
means displays summary statistics (means, standard deviations, minimums, and maximums) with the
matrix.
noformat displays the summary statistics requested by the means option in g format, regardless of
the display formats associated with the variables.
covariance displays the covariances rather than the correlation coefficients.
wrap requests that no action be taken on wide correlation matrices to make them readable. It prevents
Stata from breaking wide matrices into pieces to enhance readability. You might want to specify
this option if you are displaying results in a window wider than 80 characters. Then you may need
to set linesize to however many characters you can display across a line; see [R] log.
Options for pwcorr
Main
obs adds a line to each row of the matrix reporting the number of observations used to calculate the
correlation coefficient.
sig adds a line to each row of the matrix reporting the significance level of each correlation coefficient.
listwise handles missing values through listwise deletion, meaning that the entire observation is
omitted from the estimation sample if any of the variables in varlist is missing for that observation.
By default, pwcorr handles missing values by pairwise deletion; all available observations are
used to calculate each pairwise correlation without regard to whether variables outside that pair
are missing.
correlate uses listwise deletion. Thus listwise allows users of pwcorr to mimic correlate’s
treatment of missing values while retaining access to pwcorr’s features.
casewise is a synonym for listwise.
print(#) specifies the significance level of correlation coefficients to be printed. Correlation coefficients with larger significance levels are left blank in the matrix. Typing pwcorr, print(.10)
would list only correlation coefficients significant at the 10% level or better.
star(#) specifies the significance level of correlation coefficients to be starred. Typing pwcorr,
star(.05) would star all correlation coefficients significant at the 5% level or better.
bonferroni makes the Bonferroni adjustment to calculated significance levels. This option affects
printed significance levels and the print() and star() options. Thus pwcorr, print(.05)
bonferroni prints coefficients with Bonferroni-adjusted significance levels of 0.05 or less.
sidak makes the Šidák adjustment to calculated significance levels. This option affects printed
significance levels and the print() and star() options. Thus pwcorr, print(.05) sidak
prints coefficients with Šidák-adjusted significance levels of 0.05 or less.
Remarks
Remarks are presented under the following headings:
correlate
pwcorr
correlate
Typing correlate by itself produces a correlation matrix for all variables in the dataset. If you
specify the varlist, a correlation matrix for just those variables is displayed.
Example 1
We have state data on demographic characteristics of the population. To obtain a correlation matrix,
we type
. use http://www.stata-press.com/data/r11/census13
(1980 Census data by state)
. correlate
(obs=50)
             |    state    brate      pop   medage division   region  mrgrate
-------------+----------------------------------------------------------------
       state |   1.0000
       brate |   0.0208   1.0000
         pop |  -0.0540  -0.2830   1.0000
      medage |  -0.0624  -0.8800   0.3294   1.0000
    division |  -0.1345   0.6356  -0.1081  -0.5207   1.0000
      region |  -0.1339   0.6086  -0.1515  -0.5292   0.9688   1.0000
     mrgrate |   0.0509   0.0677  -0.1502  -0.0177   0.2280   0.2490   1.0000
     dvcrate |  -0.0655   0.3508  -0.2064  -0.2229   0.5522   0.5682   0.7700
    medagesq |  -0.0621  -0.8609   0.3324   0.9984  -0.5162  -0.5239  -0.0202

             |  dvcrate medagesq
-------------+------------------
     dvcrate |   1.0000
    medagesq |  -0.2192   1.0000
Because we did not specify the wrap option, Stata did its best to make the result readable by breaking
the table into two parts.
To obtain the correlations between mrgrate, dvcrate, and medage, we type
. correlate mrgrate dvcrate medage
(obs=50)

             |  mrgrate  dvcrate   medage
-------------+---------------------------
     mrgrate |   1.0000
     dvcrate |   0.7700   1.0000
      medage |  -0.0177  -0.2229   1.0000
Example 2
The pop variable in our previous example represents the total population of the state. Thus, to
obtain population-weighted correlations among mrgrate, dvcrate, and medage, we type
. correlate mrgrate dvcrate medage [w=pop]
(analytic weights assumed)
(sum of wgt is   2.2591e+08)
(obs=50)

             |  mrgrate  dvcrate   medage
-------------+---------------------------
     mrgrate |   1.0000
     dvcrate |   0.5854   1.0000
      medage |  -0.1316  -0.2833   1.0000
With the covariance option, correlate can be used to obtain covariance matrices, as well as
correlation matrices, for both weighted and unweighted data.
Example 3
To obtain the matrix of covariances between mrgrate, dvcrate, and medage, we type correlate
mrgrate dvcrate medage, covariance:
. correlate mrgrate dvcrate medage, covariance
(obs=50)

             |  mrgrate  dvcrate   medage
-------------+---------------------------
     mrgrate |  .000662
     dvcrate |  .000063  1.0e-05
      medage | -.000769 -.001191  2.86775
We could have obtained the pop-weighted covariance matrix by typing correlate mrgrate
dvcrate medage [w=pop], covariance.
pwcorr
correlate calculates correlation coefficients by using casewise deletion; when you request correlations of variables x1, x2, ..., xk, any observation for which any of x1, x2, ..., xk is missing is not used. Thus if x3 and x4 have no missing values, but x2 is missing for half the data, the correlation between x3 and x4 is calculated using only the half of the data for which x2 is not missing. Of course, you can obtain the correlation between x3 and x4 by using all the data by typing correlate x3 x4.

pwcorr makes obtaining such pairwise correlation coefficients easier.
Example 4
Using auto.dta, we investigate the correlation between several of the variables.
. use http://www.stata-press.com/data/r11/auto1
(Automobile Models)
. pwcorr mpg price rep78 foreign, obs sig

             |      mpg    price    rep78  foreign
-------------+------------------------------------
         mpg |   1.0000
             |
             |       74
             |
       price |  -0.4594   1.0000
             |   0.0000
             |       74       74
             |
       rep78 |   0.3739   0.0066   1.0000
             |   0.0016   0.9574
             |       69       69       69
             |
     foreign |   0.3613   0.0487   0.5922   1.0000
             |   0.0016   0.6802   0.0000
             |       74       74       69       74
. pwcorr mpg price headroom rear_seat trunk rep78 foreign, print(.05) star(.01)

             |      mpg    price headroom rear_s~t    trunk    rep78  foreign
-------------+----------------------------------------------------------------
         mpg |   1.0000
       price |  -0.4594*  1.0000
    headroom |  -0.4220*  0.4194*  1.0000
   rear_seat |  -0.5213*  0.3143*  0.5238*  1.0000
       trunk |  -0.5703*           0.6620*  0.6480*  1.0000
       rep78 |   0.3739*          -0.2939  -0.2409  -0.3594*  1.0000
     foreign |   0.3613*                                       0.5922*  1.0000
. pwcorr mpg price headroom rear_seat trunk rep78 foreign, print(.05) bon

             |      mpg    price headroom rear_s~t    trunk    rep78  foreign
-------------+----------------------------------------------------------------
         mpg |   1.0000
       price |  -0.4594   1.0000
    headroom |  -0.4220   0.4194   1.0000
   rear_seat |  -0.5213            0.5238   1.0000
       trunk |  -0.5703            0.6620   0.6480   1.0000
       rep78 |   0.3739                             -0.3594   1.0000
     foreign |   0.3613                                        0.5922   1.0000
Technical note
The correlate command will report the correlation matrix of the data, but there are occasions
when you need the matrix stored as a Stata matrix so that you can further manipulate it. You can
obtain the matrix by typing
. matrix accum R = varlist, nocons dev
. matrix R = corr(R)
The first line places the cross-product matrix of the data in matrix R. The second line converts that
to a correlation matrix. Also see [P] matrix define and [P] matrix accum.
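For instance, a minimal sketch with the automobile data (the matrix name R is as above, and the variable list is ours):

. use http://www.stata-press.com/data/r11/auto, clear
(1978 Automobile Data)
. matrix accum R = mpg weight length, nocons dev
. matrix R = corr(R)
. matrix list R, format(%8.4f)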
Saved results

correlate saves the following in r():

Scalars
    r(N)          number of observations
    r(rho)        ρ (first and second variables)
    r(cov_12)     covariance (covariance only)
    r(Var_1)      variance of first variable (covariance only)
    r(Var_2)      variance of second variable (covariance only)

Matrices
    r(C)          correlation or covariance matrix

pwcorr will leave in its wake only the results of the last call that it makes internally to correlate for the correlation between the last variable and itself. Only rarely is this feature useful.
Methods and formulas

pwcorr is implemented as an ado-file.

For a discussion of correlation, see, for instance, Snedecor and Cochran (1989, 177–195); for an introductory explanation using Stata examples, see Acock (2008, 168–174).

According to Snedecor and Cochran (1989, 180), the term "co-relation" was first proposed by Galton (1888). The product-moment correlation coefficient is often called the Pearson product-moment correlation coefficient because Pearson (1896) and Pearson and Filon (1898) were partially responsible for popularizing its use. See Stigler (1986) for information on the history of correlation.

The estimate of the product-moment correlation coefficient, ρ, is

    \hat{\rho} = \frac{\sum_{i=1}^n w_i (x_i - \bar{x})(y_i - \bar{y})}
                      {\sqrt{\sum_{i=1}^n w_i (x_i - \bar{x})^2}\,
                       \sqrt{\sum_{i=1}^n w_i (y_i - \bar{y})^2}}

where w_i are the weights, if specified, or w_i = 1 if weights are not specified. \bar{x} = (\sum w_i x_i)/(\sum w_i) is the mean of x, and \bar{y} is similarly defined.

The unadjusted significance level is calculated by pwcorr as

    p = 2 \cdot \mathrm{ttail}\left(n-2,\; |\hat{\rho}|\sqrt{n-2} \,/\, \sqrt{1-\hat{\rho}^2}\right)

Let v be the number of variables specified so that k = v(v − 1)/2 correlation coefficients are to be estimated. If bonferroni is specified, the adjusted significance level is p' = min(1, kp). If sidak is specified, p' = min{1, 1 − (1 − p)^k}. In both cases, see Methods and formulas in [R] oneway for a more complete description of the logic behind these adjustments.
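These formulas can be reproduced from the saved results; a sketch (the local macro names rho and n are ours) whose displayed value should match the significance level reported by pwcorr mpg weight, sig up to rounding:

. quietly correlate mpg weight
. local rho = r(rho)
. local n = r(N)
. display 2*ttail(`n'-2, abs(`rho')*sqrt(`n'-2)/sqrt(1-`rho'^2))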
Carlo Emilio Bonferroni (1892–1960) studied in Turin and taught there and in Bari and Florence.
He published on actuarial mathematics, probability, statistics, analysis, geometry, and mechanics.
His work on probability inequalities has been applied to simultaneous statistical inference, although
the method known as Bonferroni adjustment usually relies only on an inequality established
earlier by Boole.
Karl Pearson (1857–1936) studied mathematics at Cambridge. He was professor of applied mathematics (1884–1911) and eugenics (1911–1933) at University College London. His publications
include literary, historical, philosophical, and religious topics. Statistics became his main interest
in the early 1890s after he learned about its application to biological problems. His work centered
on distribution theory, the method of moments, correlation, and regression. Pearson introduced
the chi-squared test and the terms coefficient of variation, contingency table, heteroskedastic,
histogram, homoskedastic, kurtosis, mode, random sampling, random walk, skewness, standard
deviation, and truncation. Despite many strong qualities, he also fell into prolonged disagreements
with others, most notably, William Bateson and R. A. Fisher.
Zbyněk Šidák (1933–1999) was an outstanding Czech statistician and probabilist who worked
on Markov chains, rank tests, multivariate distribution theory and multiple comparison methods.
References
Acock, A. C. 2008. A Gentle Introduction to Stata. 2nd ed. College Station, TX: Stata Press.
Dewey, M. E., and E. Seneta. 2001. Carlo Emilio Bonferroni. In Statisticians of the Centuries, ed. C. C. Heyde and
E. Seneta, 411–414. New York: Springer.
Eisenhart, C. 1974. Pearson, Karl. In Vol. 10 of Dictionary of Scientific Biography, ed. C. C. Gillispie, 447–473. New
York: Charles Scribner’s Sons.
Galton, F. 1888. Co-relations and their measurement, chiefly from anthropometric data. Proceedings of the Royal
Society of London 45: 135–145.
Gleason, J. R. 1996. sg51: Inference about correlations using the Fisher z-transform. Stata Technical Bulletin 32:
13–18. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 121–128. College Station, TX: Stata Press.
Goldstein, R. 1996. sg52: Testing dependent correlation coefficients. Stata Technical Bulletin 32: 18. Reprinted in Stata
Technical Bulletin Reprints, vol. 6, pp. 128–129. College Station, TX: Stata Press.
Pearson, K. 1896. Mathematical contributions to the theory of evolution—III. Regression, heredity, and panmixia.
Philosophical Transactions of the Royal Society of London, Series A 187: 253–318.
Pearson, K., and L. N. G. Filon. 1898. Mathematical contributions to the theory of evolution. IV. On the probable
errors of frequency constants and on the influence of random selection on variation and correlation. Philosophical
Transactions of the Royal Society of London, Series A 191: 229–311.
Porter, T. M. 2004. Karl Pearson: The Scientific Life in a Statistical Age. Princeton, NJ: Princeton University Press.
Rodgers, J. L., and W. A. Nicewander. 1988. Thirteen ways to look at the correlation coefficient. American Statistician
42: 59–66.
Rovine, M. J., and A. von Eye. 1997. A 14th way to look at the correlation coefficient: Correlation as the proportion
of matches. American Statistician 51: 42–46.
Seed, P. T. 2001. sg159: Confidence intervals for correlations. Stata Technical Bulletin 59: 27–28. Reprinted in Stata
Technical Bulletin Reprints, vol. 10, pp. 267–269. College Station, TX: Stata Press.
Seidler, J., J. Vondráček, and I. Saxl. 2000. The life and work of Zbyněk Šidák (1933–1999). Applications of
Mathematics 45: 321–336.
Snedecor, G. W., and W. G. Cochran. 1989. Statistical Methods. 8th ed. Ames, IA: Iowa State University Press.
Stigler, S. M. 1986. The History of Statistics: The Measurement of Uncertainty before 1900. Cambridge, MA: Belknap
Press.
Wolfe, F. 1997. sg64: pwcorrs: Enhanced correlation display. Stata Technical Bulletin 35: 22–25. Reprinted in Stata
Technical Bulletin Reprints, vol. 6, pp. 163–167. College Station, TX: Stata Press.
. 1999. sg64.1: Update to pwcorrs. Stata Technical Bulletin 49: 17. Reprinted in Stata Technical Bulletin Reprints,
vol. 9, p. 159. College Station, TX: Stata Press.
Also see
[R] pcorr — Partial and semipartial correlation coefficients
[R] spearman — Spearman’s and Kendall’s correlations
[R] summarize — Summary statistics
[R] tetrachoric — Tetrachoric correlations for binary variables
Title
cumul — Cumulative distribution
Syntax

    cumul varname [if] [in] [weight], generate(newvar) [options]

options               description
------------------------------------------------------------------------------
Main
 * generate(newvar)   create variable newvar
   freq               use frequency units for cumulative
   equal              generate equal cumulatives for tied values
------------------------------------------------------------------------------
* generate(newvar) is required.
by is allowed; see [D] by.
fweights and aweights are allowed; see [U] 11.1.6 weight.
Menu

Statistics > Summaries, tables, and tests > Distributional plots and tests > Cumulative distribution graph
Description
cumul creates newvar, defined as the empirical cumulative distribution function of varname.
Options
Main
generate(newvar) is required. It specifies the name of the new variable to be created.
freq specifies that the cumulative be in frequency units; otherwise, it is normalized so that newvar
is 1 for the largest value of varname.
equal requests that observations with equal values in varname get the same cumulative value in
newvar.
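Without weights, the freq option, or ties, the normalized cumulative is simply the rank divided by the number of observations. A sketch (the variable name check is ours), using the median-family-income data from the example below; cum and check should agree up to the arbitrary ordering of any tied values unless equal is specified:

. use http://www.stata-press.com/data/r11/hsng, clear
. cumul faminc, gen(cum)
. sort faminc cum
. generate double check = _n/_N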
Jean Baptiste Joseph Fourier (1768–1830) was born in Auxerre in France. He got caught up
in the Revolution and its aftermath and was twice arrested and imprisoned between periods of
studying and teaching mathematics. Fourier joined Napoleon’s army in its invasion of Egypt in
1798 as a scientific adviser, returning to France in 1801, when he was appointed Prefect of
the Department of Isère. While Prefect, Fourier did his important mathematical work on the
theory of heat, based on what are now called Fourier series. This work was published in 1822,
despite the skepticism of Lagrange, Laplace, Legendre, and others—who found the work lacking
in generality and even rigor—and disagreements of both priority and substance with Biot and
Poisson.
Remarks
Example 1
cumul is most often used with graph to graph the empirical cumulative distribution. For instance,
we have data on the median family income of 957 U.S. cities:
. use http://www.stata-press.com/data/r11/hsng
(1980 Census housing data)
. cumul faminc, gen(cum)
. sort cum
. line cum faminc, ylab(, grid) ytitle("") xlab(, grid)
> title("Cumulative of median family income")
> subtitle("1980 Census, 957 U.S. Cities")
[Graph: "Cumulative of median family income," subtitle "1980 Census, 957 U.S. Cities"; the cumulative rises from 0 to 1 as median family income in 1979 runs from about 15,000 to 30,000.]
It would have been enough to type line cum faminc, but we wanted to make the graph look better;
see [G] graph twoway line.
If we had wanted a weighted cumulative, we would have typed cumul faminc [w=pop] at the
first step.
Example 2
To graph two (or more) cumulatives on the same graph, use cumul and stack; see [D] stack. For
instance, we have data on the average January and July temperatures of 956 U.S. cities:
334
cumul — Cumulative distribution
. use http://www.stata-press.com/data/r11/citytemp
(City Temperature Data)
. cumul tempjan, gen(cjan)
. cumul tempjuly, gen(cjuly)
. stack cjan tempjan cjuly tempjuly, into(c temp) wide clear
. line cjan cjuly temp, sort ylab(, grid) ytitle("") xlab(, grid)
>      xtitle("Temperature (F)")
>      title("Cumulatives:" "Average January and July Temperatures")
>      subtitle("956 U.S. Cities") clstyle(. dot)
[Graph: "Cumulatives: Average January and July Temperatures," subtitle "956 U.S. Cities"; cjan and cjuly plotted against temperature (F) from 0 to 100.]
As before, it would have been enough to type line cjan cjuly temp, sort. See [D] stack for an
explanation of how the stack command works.
Technical note
According to Beniger and Robyn (1978), Fourier (1821) published the first graph of a cumulative
frequency distribution, which was later given the name “ogive” by Galton (1875).
Methods and formulas
cumul is implemented as an ado-file.
Acknowledgment
The equal option was added by Nicholas J. Cox, Durham University, Durham, UK.
References
Beniger, J. R., and D. L. Robyn. 1978. Quantitative graphics in statistics: A brief history. American Statistician 32:
1–11.
Clayton, D. G., and M. Hills. 1999. gr37: Cumulative distribution function plots. Stata Technical Bulletin 49: 10–12.
Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 96–98. College Station, TX: Stata Press.
Cox, N. J. 1999. gr41: Distribution function plots. Stata Technical Bulletin 51: 12–16. Reprinted in Stata Technical
Bulletin Reprints, vol. 9, pp. 108–112. College Station, TX: Stata Press.
Fourier, J. B. J. 1821. Notions générales, sur la population. Recherches Statistiques sur la Ville de Paris et le
Département de la Seine 1: 1–70.
Galton, F. 1875. Statistics by intercomparison, with remarks on the law of frequency of error. Philosophical Magazine
49: 33–46.
Wilk, M. B., and R. Gnanadesikan. 1968. Probability plotting methods for the analysis of data. Biometrika 55: 1–17.
Also see
[D] stack — Stack data
[R] diagnostic plots — Distributional diagnostic plots
[R] kdensity — Univariate kernel density estimation
Title
cusum — Graph cumulative spectral distribution
Syntax

    cusum yvar xvar [if] [in] [, options]

options               description
------------------------------------------------------------------------------
Main
  generate(newvar)    save cumulative sum in newvar
  yfit(fitvar)        calculate cumulative sum against fitvar
  nograph             suppress the plot
  nocalc              suppress cusum test statistics

Cusum plot
  connect_options     affect the rendition of the plotted line

Add plots
  addplot(plot)       add plots to the generated graph

Y axis, X axis, Titles, Legend, Overall
  twoway_options      any options other than by() documented in
                      [G] twoway_options
------------------------------------------------------------------------------
Menu

Statistics > Other > Quality control > Cusum plots and tests for binary variables
Description
cusum graphs the cumulative sum (cusum) of a binary (0/1) variable, yvar, against a (usually)
continuous variable, xvar.
Options
Main
generate(newvar) saves the cusum in newvar.
yfit(fitvar) calculates a cusum against fitvar, that is, the running sums of the “residuals” fitvar
minus yvar. Typically, fitvar is the predicted probability of a positive outcome obtained from a
logistic regression analysis.
nograph suppresses the plot.
nocalc suppresses calculation of the cusum test statistics.
Cusum plot
connect options affect the rendition of the plotted line; see [G] connect options.
Add plots
addplot(plot) provides a way to add other plots to the generated graph. See [G] addplot option.
Y axis, X axis, Titles, Legend, Overall
twoway options are any of the options documented in [G] twoway options, excluding by(). These
include options for titling the graph (see [G] title options) and for saving the graph to disk (see
[G] saving option).
Remarks
The cusum is the running sum of the proportion of ones in the sample, a constant number, minus yvar,

    c_j = \sum_{k=1}^{j} \left(f - \mathit{yvar}_{(k)}\right), \qquad 1 \le j \le N

where f = (\sum \mathit{yvar})/N and yvar_(k) refers to the corresponding value of yvar when xvar is placed in ascending order: xvar_(k+1) ≥ xvar_(k). Tied values of xvar are broken at random. If you want them broken the same way in two runs, you must set the random-number seed to the same value before giving the cusum command; see [R] set seed.

A U-shaped or inverted-U-shaped cusum indicates, respectively, a negative or a positive trend of yvar with xvar. A sinusoidal shape is evidence of a nonmonotonic (for example, quadratic) trend. cusum displays the maximum absolute cusum for monotonic and nonmonotonic trends of yvar on xvar. These are nonparametric tests of departure from randomness of yvar with respect to xvar. Approximate values for the tests are given.
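The running sum itself is easy to reproduce with Stata's sum() function; this sketch (the local macro f and variable c are ours) ignores the random breaking of ties, so it can differ slightly from cusum's saved result when xvar has tied values:

. use http://www.stata-press.com/data/r11/auto, clear
(1978 Automobile Data)
. summarize foreign, meanonly
. local f = r(mean)
. sort weight
. generate double c = sum(`f' - foreign)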
Example 1
For the automobile dataset, auto.dta, we wish to investigate the relationship between foreign
(0 = domestic, 1 = foreign) and car weight as follows:
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. cusum foreign weight

[Graph: cusum of foreign against weight; y axis "Cusum (Car type)" from −10 to 0, x axis "Weight (lbs.)" from 2,000 to 5,000.]
 Variable     Obs    Pr(1)   CusumL      zL   Pr>zL   CusumQ      zQ   Pr>zQ

  foreign      74   0.2973    10.30   3.963   0.000     3.32   0.469   0.320
The resulting plot, which is U-shaped, suggests a negative monotonic relationship. The trend is
confirmed by a highly significant linear cusum statistic, labeled CusumL in the output above.
Some 29.73% of the cars are foreign (coded 1). The proportion of foreign cars diminishes with
increasing weight. The domestic cars are crudely heavier than the foreign ones. We could have
discovered that by typing table foreign, stats(mean weight), but such an approach does not
give the full picture of the relationship. The quadratic cusum (CusumQ) is not significant, so we
do not suspect any tendency for the very heavy cars to be foreign rather than domestic. A slightly
enhanced version of the plot shows the preponderance of domestic (coded 0) cars at the heavy end
of the weight axis:
. label values foreign
. cusum foreign weight, s(none) recast(scatter) mlabel(foreign) mlabp(0)

[Graph: the same cusum rendered as a scatterplot with each point marked 0 (domestic) or 1 (foreign); y axis "Cusum (Car type)", x axis "Weight (lbs.)". Domestic cars predominate at the heavy end of the weight axis.]
Variable      Obs    Pr(1)   CusumL      zL   Pr>zL   CusumQ      zQ   Pr>zQ

foreign        74   0.2973    10.30   3.963   0.000     2.92   0.064   0.475
The example is, of course, artificial, because we would not really try to model the probability of a
car being foreign given its weight.
Saved results
cusum saves the following in r():

Scalars
    r(N)         number of observations
    r(prop1)     proportion of positive outcomes
    r(cusuml)    cusum
    r(zl)        test (linear)
    r(P_zl)      p-value for test (linear)
    r(cusumq)    quadratic cusum
    r(zq)        test (quadratic)
    r(P_zq)      p-value for test (quadratic)
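For example, a minimal sketch of picking up the saved results after a run (quietly suppresses the table, nograph the plot):

. quietly cusum foreign weight, nograph
. display "linear cusum = " r(cusuml) ", z = " r(zl) ", p = " r(P_zl)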
Methods and formulas
cusum is implemented as an ado-file.
Acknowledgment
cusum was written by Patrick Royston, MRC Clinical Trials Unit, London.
References
Royston, P. 1992. The use of cusums and other techniques in modelling continuous covariates in logistic regression.
Statistics in Medicine 11: 1115–1129.
. 1993. sqv7: Cusum plots and tests for binary variables. Stata Technical Bulletin 12: 16–17. Reprinted in Stata
Technical Bulletin Reprints, vol. 2, pp. 175–177. College Station, TX: Stata Press.
Also see
[R] logistic — Logistic regression, reporting odds ratios
[R] logit — Logistic regression, reporting coefficients
[R] probit — Probit regression
Title
db — Launch dialog
Syntax
Syntax for db
    db commandname

For programmers
    db commandname [, message(string) debug dryrun]

Set system parameter
    set maxdb # [, permanently]

where # must be between 5 and 1,000.
Description
db is the command-line way to launch a dialog for a Stata command.
The second syntax (which is the same but includes options) is for use by programmers.
If you wish to allow the launching of dialogs from a help file, see [P] smcl for information on the
dialog SMCL directive.
set maxdb sets the maximum number of dialog boxes whose contents are remembered from one
invocation to the next during a session. The default value of maxdb is 50.
Options
message(string) specifies that string be passed to the dialog box, where it can be referred to from
the
MESSAGE STRING property.
debug specifies that the underlying dialog box be loaded with debug messaging turned on.
dryrun specifies that, rather than launching the dialog, db show the commands it would issue to
launch the dialog.
permanently specifies that, in addition to making the change right now, the maxdb setting be
remembered and become the default setting when you invoke Stata.
Remarks
The usual way to launch a dialog is to open the Data, Graphics, or Statistics menu and to make
your selection from there. When you know the name of the command that you want to run, however,
db provides a way to invoke the dialog from the command line.
db follows the same abbreviation rules that Stata’s command-line interface follows. So, to launch
the dialog for regress, you can type
. db regress
or
. db reg
Say that you use the dialog box for regress, either by selecting
Statistics > Linear models and related > Linear regression
or by typing
. db regress
You fit a regression.
Much later during the session, you return to the regress dialog box. It will have the contents
as you left them if 1) you have not typed clear all between the first and second invocations; 2)
you have not typed discard between the two invocations; and 3) you have not used more than 50
different dialog boxes—regardless of how many times you have used each—between the first and
second invocations of regress. If you use 51 or more, the contents of the regress dialog box will
be forgotten.
set maxdb determines how many different dialog boxes are remembered. A dialog box takes, on
average, about 20 KB of memory, so the default of 50 corresponds to allowing dialog boxes to consume
about 1 MB of memory.
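A minimal sketch (the value 100 is arbitrary, for illustration only; at about 20 KB per dialog, it corresponds to roughly 2 MB):

. set maxdb 100, permanently
. db regress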
Methods and formulas
db is implemented as an ado-file.
Also see
[R] query — Display system parameters
Title
diagnostic plots — Distributional diagnostic plots
Syntax
Symmetry plot
    symplot varname [if] [in] [, options1]

Ordered values of varname against quantiles of uniform distribution
    quantile varname [if] [in] [, options1]

Quantiles of varname1 against quantiles of varname2
    qqplot varname1 varname2 [if] [in] [, options1]

Quantiles of varname against quantiles of normal distribution
    qnorm varname [if] [in] [, options2]

Standardized normal probability plot
    pnorm varname [if] [in] [, options2]

Quantiles of varname against quantiles of χ2 distribution
    qchi varname [if] [in] [, options3]

χ2 probability plot
    pchi varname [if] [in] [, options3]

options1                 description
Plot
  marker options         change look of markers (color, size, etc.)
  marker label options   add marker labels; change look or position

Reference line
  rlopts(cline options)  affect rendition of the reference line

Add plots
  addplot(plot)          add other plots to the generated graph

Y axis, X axis, Titles, Legend, Overall
  twoway options         any options other than by() documented in [G] twoway options
options2                 description

Main
  grid                   add grid lines

Plot
  marker options         change look of markers (color, size, etc.)
  marker label options   add marker labels; change look or position

Reference line
  rlopts(cline options)  affect rendition of the reference line

Add plots
  addplot(plot)          add other plots to the generated graph

Y axis, X axis, Titles, Legend, Overall
  twoway options         any options other than by() documented in [G] twoway options
options3                 description

Main
  grid                   add grid lines
  df(#)                  degrees of freedom of χ2 distribution; default is df(1)

Plot
  marker options         change look of markers (color, size, etc.)
  marker label options   add marker labels; change look or position

Reference line
  rlopts(cline options)  affect rendition of the reference line

Add plots
  addplot(plot)          add other plots to the generated graph

Y axis, X axis, Titles, Legend, Overall
  twoway options         any options other than by() documented in [G] twoway options
Menu
symplot
    Statistics > Summaries, tables, and tests > Distributional plots and tests > Symmetry plot
quantile
    Statistics > Summaries, tables, and tests > Distributional plots and tests > Quantiles plot
qqplot
    Statistics > Summaries, tables, and tests > Distributional plots and tests > Quantile-quantile plot
qnorm
    Statistics > Summaries, tables, and tests > Distributional plots and tests > Normal quantile plot
pnorm
    Statistics > Summaries, tables, and tests > Distributional plots and tests > Normal probability plot, standardized
qchi
    Statistics > Summaries, tables, and tests > Distributional plots and tests > Chi-squared quantile plot
pchi
    Statistics > Summaries, tables, and tests > Distributional plots and tests > Chi-squared probability plot
Description
symplot graphs a symmetry plot of varname.
quantile plots the ordered values of varname against the quantiles of a uniform distribution.
qqplot plots the quantiles of varname1 against the quantiles of varname2 (Q – Q plot).
qnorm plots the quantiles of varname against the quantiles of the normal distribution (Q – Q plot).
pnorm graphs a standardized normal probability plot (P – P plot).
qchi plots the quantiles of varname against the quantiles of a χ2 distribution (Q – Q plot).
pchi graphs a χ2 probability plot (P – P plot).
See [R] regress postestimation for regression diagnostic plots and [R] logistic postestimation for
logistic regression diagnostic plots.
Options for symplot, quantile, and qqplot
Plot
marker options affect the rendition of markers drawn at the plotted points, including their shape,
size, color, and outline; see [G] marker options.
marker label options specify if and how the markers are to be labeled; see [G] marker label options.
Reference line
rlopts(cline options) affect the rendition of the reference line; see [G] cline options.
Add plots
addplot(plot) provides a way to add other plots to the generated graph; see [G] addplot option.
Y axis, X axis, Titles, Legend, Overall
twoway options are any of the options documented in [G] twoway options, excluding by(). These
include options for titling the graph (see [G] title options) and for saving the graph to disk (see
[G] saving option).
Options for qnorm and pnorm
Main
grid adds grid lines at the 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, and 0.95 quantiles when specified with
qnorm. With pnorm, grid is equivalent to yline(.25,.5,.75) xline(.25,.5,.75).
Plot
marker options affect the rendition of markers drawn at the plotted points, including their shape,
size, color, and outline; see [G] marker options.
marker label options specify if and how the markers are to be labeled; see [G] marker label options.
Reference line
rlopts(cline options) affect the rendition of the reference line; see [G] cline options.
Add plots
addplot(plot) provides a way to add other plots to the generated graph; see [G] addplot option.
Y axis, X axis, Titles, Legend, Overall
twoway options are any of the options documented in [G] twoway options, excluding by(). These
include options for titling the graph (see [G] title options) and for saving the graph to disk (see
[G] saving option).
Options for qchi and pchi
Main
grid adds grid lines at the 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, and 0.95 quantiles when specified with
qchi. With pchi, grid is equivalent to yline(.25,.5,.75) xline(.25,.5,.75).
df(#) specifies the degrees of freedom of the χ2 distribution. The default is 1.
Plot
marker options affect the rendition of markers drawn at the plotted points, including their shape,
size, color, and outline; see [G] marker options.
marker label options specify if and how the markers are to be labeled; see [G] marker label options.
Reference line
rlopts(cline options) affect the rendition of the reference line; see [G] cline options.
Add plots
addplot(plot) provides a way to add other plots to the generated graph; see [G] addplot option.
Y axis, X axis, Titles, Legend, Overall
twoway options are any of the options documented in [G] twoway options, excluding by(). These
include options for titling the graph (see [G] title options) and for saving the graph to disk (see
[G] saving option).
Remarks
Remarks are presented under the following headings:
symplot
quantile
qqplot
qnorm
pnorm
qchi
pchi
symplot
Example 1
We have data on 74 automobiles. To make a symmetry plot of the variable price, we type
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. symplot price
(figure omitted: symmetry plot of price, titled "Price"; y axis "Distance above median", x axis "Distance below median")
All points would lie along the reference line (defined as y = x) if car prices were symmetrically
distributed. The points in this plot lie above the reference line, indicating that the distribution of car
prices is skewed to the right — the most expensive cars are far more expensive than the least expensive
cars are inexpensive.
The logic works as follows: a variable, z, is distributed symmetrically if

    $$\text{median} - z_{(i)} = z_{(N+1-i)} - \text{median}$$

where $z_{(i)}$ indicates the ith-order statistic of z. symplot graphs $y_i = \text{median} - z_{(i)}$ versus
$x_i = z_{(N+1-i)} - \text{median}$.
For instance, consider the largest and smallest values of price in the example above. The most
expensive car costs $15,906 and the least expensive, $3,291. Let’s compare these two cars with the
typical car in the data and see how much more it costs to buy the most expensive car, and compare
that with how much less it costs to buy the least expensive car. If the automobile price distribution
is symmetric, the price differences would be the same.
Before we can make this comparison, we must agree on a definition for the word “typical”. Let’s
agree that “typical” means median. The price of the median car is $5,006.50, so the most expensive
car costs $10,899.50 more than the median car, and the least expensive car costs $1,715.50 less than
the median car. We now have one piece of evidence that the car price distribution is not symmetric.
We can repeat the experiment for the second-most-expensive car and the second-least-expensive car.
We find that the second-most-expensive car costs $9,494.50 more than the median car, and the
second-least-expensive car costs $1,707.50 less than the median car. We now have more evidence.
We can continue doing this with the third most expensive and the third least expensive, and so on.
Once we have all these numbers, we want to compare each pair and ask how similar, on average,
they are. The easiest way to do that is to plot all the pairs.
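A minimal sketch of computing the first such pair by hand (summarize, detail saves the median in r(p50) and the extremes in r(min) and r(max)):

. quietly summarize price, detail
. display "above median: " r(max) - r(p50)
above median: 10899.5
. display "below median: " r(p50) - r(min)
below median: 1715.5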
quantile
Example 2
We have data on the prices of 74 automobiles. To make a quantile plot of price, we type
. use http://www.stata-press.com/data/r11/auto, clear
(1978 Automobile Data)
. quantile price, rlopts(clpattern(dash))

(figure omitted: quantile plot; y axis "Quantiles of Price", x axis "Fraction of the data")
We changed the pattern of the reference line by specifying rlopts(clpattern(dash)).
In a quantile plot, each value of the variable is plotted against the fraction of the data that
have values less than that value. The diagonal line is a reference line. If automobile prices were
rectangularly (uniformly) distributed, all the data would be plotted along the line. Because all the points are below
the reference line, we know that the price distribution is skewed right.
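Per the Methods and formulas below, quantile uses the Hazen plotting position (i − 0.5)/N, so a minimal hand-rolled sketch of the same plot (without the reference line; the variable name frac is ours, for illustration) is:

. sort price
. generate frac = (_n - 0.5)/_N
. scatter price frac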
qqplot
Example 3
We have data on the weight and country of manufacture of 74 automobiles. We wish to compare
the distributions of weights for domestic and foreign automobiles:
. use http://www.stata-press.com/data/r11/auto, clear
(1978 Automobile Data)
. generate weightd=weight if !foreign
(22 missing values generated)
. generate weightf=weight if foreign
(52 missing values generated)
. qqplot weightd weightf
(figure omitted: plot titled "Quantile−Quantile Plot"; y axis "weightd", x axis "weightf")
qnorm
Example 4
Continuing with our price data on 74 automobiles, we now wish to compare the distribution of
price with the normal distribution:
. qnorm price, grid ylabel(, angle(horizontal) axis(1))
> ylabel(, angle(horizontal) axis(2))
(figure omitted: normal quantile plot of price; y axis "Price", x axis "Inverse Normal"; grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles)
The result shows that the distributions are different.
Technical note
The idea behind qnorm is recommended strongly by Miller Jr. (1997): he calls it probit plotting. His
recommendations from much practical experience should interest many users. “My recommendation
for detecting nonnormality is probit plotting” (Miller Jr. 1997, 10). “If a deviation from normality
cannot be spotted by eye on probit paper, it is not worth worrying about. I never use the Kolmogorov–
Smirnov test (or one of its cousins) or the χ2 test as a preliminary test of normality. They do not tell
you how the sample is differing from normality, and I have a feeling they are more likely to detect
irregularities in the middle of the distribution than in the tails” (Miller Jr. 1997, 13–14).
pnorm
Example 5
Quantile–normal plots emphasize the tails of the distribution. Normal probability plots put the
focus on the center of the distribution:
. pnorm price, grid

(figure omitted: standardized normal probability plot; y axis "Normal F[(price−m)/s]", x axis "Empirical P[i] = i/(N+1)")
qchi
Example 6
Suppose that we want to examine the distribution of the sum of squares of price and mpg,
standardized for their variances.
. egen c1 = std(price)
. egen c2 = std(mpg)
. generate ch = c1^2 + c2^2
. qchi ch, df(2) grid ylabel(, alt axis(2)) xlabel(, alt axis(2))

(figure omitted: χ2 quantile plot of ch; x axis "Expected χ2 d.f. = 2"; grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles)

The quadratic form is clearly not χ2 with 2 degrees of freedom.
pchi
Example 7
We can focus on the center of the distribution by doing a probability plot:
. pchi ch, df(2) grid

(figure omitted: χ2 probability plot; y axis "χ2(ch) d.f. = 2", x axis "Empirical P[i] = i/(N+1)")
Methods and formulas
symplot, quantile, qqplot, qnorm, pnorm, qchi, and pchi are implemented as ado-files. Let
$x_{(1)}, x_{(2)}, \dots, x_{(N)}$ be the data sorted in ascending order.

If a continuous variable, x, has a cumulative distribution function $F(x) = P(X \le x) = p$, the
quantiles $x_{p_i}$ are such that $F(x_{p_i}) = p_i$. For example, if $p_i = 0.5$, then $x_{0.5}$ is the median. When
we plot data, the probabilities, $p_i$, are often referred to as plotting positions. There are many different
conventions for choice of plotting positions, given $x_{(1)} \le \cdots \le x_{(N)}$. Most belong to the family
$(i - a)/(N - 2a + 1)$. $a = 0.5$ (suggested by Hazen) and $a = 0$ (suggested by Weibull) are popular
choices.

For a wider discussion of the calculation of plotting positions, see Cox (2002).

symplot plots $\text{median} - x_{(i)}$ versus $x_{(N+1-i)} - \text{median}$.

quantile plots $x_{(i)}$ versus $(i - 0.5)/N$ (the Hazen position).

qnorm plots $x_{(i)}$ against $q_i$, where $q_i = \Phi^{-1}(p_i)$, $\Phi$ is the cumulative normal distribution, and
$p_i = i/(N+1)$ (the Weibull position).

pnorm plots $\Phi\{(x_i - \hat{\mu})/\hat{\sigma}\}$ versus $p_i = i/(N+1)$, where $\hat{\mu}$ is the mean of the data and $\hat{\sigma}$ is
the standard deviation.

qchi and pchi are similar to qnorm and pnorm; the cumulative χ2 distribution is used in place
of the cumulative normal distribution.
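To make these formulas concrete, here is a minimal hand-rolled sketch of the quantities behind pnorm (the variable names pi and pnorm_y are ours, for illustration only):

. use http://www.stata-press.com/data/r11/auto, clear
(1978 Automobile Data)
. sort price
. generate pi = _n/(_N + 1)
. quietly summarize price
. generate pnorm_y = normal((price - r(mean))/r(sd))
. scatter pnorm_y pi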
qqplot is just a two-way scatterplot of one variable against the other after both variables have been
sorted into ascending order, and both variables have the same number of nonmissing observations. If
the variables have unequal numbers of nonmissing observations, interpolated values of the variable
with more data are plotted against the variable with fewer data.
Ramanathan Gnanadesikan (1932– ) was born in Madras. He obtained degrees from the Universities of Madras and North Carolina. He worked in industry at Procter and Gamble, Bell Labs,
and Bellcore, as well as in universities, retiring from Rutgers in 1998. Among many contributions
to statistics he is especially well known for work on probability plotting, robustness, outlier
detection, clustering, classification, and pattern recognition.
Martin Bradbury Wilk (1922– ) was born in Montreal. He obtained degrees in chemical engineering
and statistics from McGill and Iowa State Universities. After several posts in statistics in industry
and universities (including periods at Princeton, Bell Labs, and Rutgers), Wilk was appointed
Chief Statistician at Statistics Canada (1980–1986). He is especially well known for his work
with Gnanadesikan on probability plotting and with Shapiro on tests for normality.
Acknowledgments
We thank Peter A. Lachenbruch of the Department of Public Health, Oregon State University,
for writing the original version of qchi and pchi. Patrick Royston of the MRC Clinical Trials Unit,
London, also published a similar command in the Stata Technical Bulletin (Royston 1996).
References
Chambers, J. M., W. S. Cleveland, B. Kleiner, and P. A. Tukey. 1983. Graphical Methods for Data Analysis. Belmont,
CA: Wadsworth.
Cox, N. J. 1999. gr42: Quantile plots, generalized. Stata Technical Bulletin 51: 16–18. Reprinted in Stata Technical
Bulletin Reprints, vol. 9, pp. 113–116. College Station, TX: Stata Press.
. 2001. gr42.1: Quantile plots, generalized: Update to Stata 7. Stata Technical Bulletin 61: 10. Reprinted in Stata
Technical Bulletin Reprints, vol. 10, pp. 55–56. College Station, TX: Stata Press.
. 2002. Speaking Stata: On getting functions to do the work. Stata Journal 2: 411–427.
. 2004a. Speaking Stata: Graphing distributions. Stata Journal 4: 66–88.
. 2004b. gr42 2: Software update: Quantile plots, generalized. Stata Journal 4: 97.
. 2005a. Speaking Stata: Density probability plots. Stata Journal 5: 259–273.
. 2005b. Speaking Stata: The protean quantile plot. Stata Journal 5: 442–460.
. 2005c. Speaking Stata: Smoothing in various directions. Stata Journal 5: 574–593.
. 2007. Stata tip 47: Quantile–quantile plots without programming. Stata Journal 7: 275–279.
Daniel, C., and F. S. Wood. 1980. Fitting Equations to Data: Computer Analysis of Multifactor Data. 2nd ed. New
York: Wiley.
Gan, F. F., K. J. Koehler, and J. C. Thompson. 1991. Probability plots and distribution curves for assessing the fit of
probability models. American Statistician 45: 14–21.
Hamilton, L. C. 1992. Regression with Graphics: A Second Course in Applied Statistics. Belmont, CA: Duxbury.
. 2009. Statistics with Stata (Updated for Version 10). Belmont, CA: Brooks/Cole.
Hoaglin, D. C. 1985. Using quantiles to study shape. In Exploring Data Tables, Trends, and Shapes, ed. D. C. Hoaglin,
F. Mosteller, and J. W. Tukey, 417–460. New York: Wiley.
Kettenring, J. R. 2001. A conversation with Ramanathan Gnanadesikan. Statistical Science 16: 295–309.
Miller Jr., R. G. 1997. Beyond ANOVA: Basics of Applied Statistics. London: Chapman & Hall.
Nolan, D., and T. Speed. 2000. Stat Labs: Mathematical Statistics Through Applications. New York: Springer.
Royston, P. 1996. sg47: A plot and a test for the χ2 distribution. Stata Technical Bulletin 29: 26–27. Reprinted in
Stata Technical Bulletin Reprints, vol. 5, pp. 142–144. College Station, TX: Stata Press.
Scotto, M. G. 2000. sg140: The Gumbel quantile plot and a test for choice of extreme models. Stata Technical Bulletin
55: 23–25. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 156–159. College Station, TX: Stata Press.
Wilk, M. B., and R. Gnanadesikan. 1968. Probability plotting methods for the analysis of data. Biometrika 55: 1–17.
Also see
[R] cumul — Cumulative distribution
[R] kdensity — Univariate kernel density estimation
[R] logistic postestimation — Postestimation tools for logistic
[R] lv — Letter-value displays
[R] regress postestimation — Postestimation tools for regress
Title
display — Substitute for a hand calculator
Syntax
display exp
Description
display displays strings and values of scalar expressions.
display really has many more features and a more complex syntax diagram, but the diagram
shown above is adequate for interactive use. For a full discussion of display’s capabilities, see
[P] display.
Remarks
display can be used as a substitute for a hand calculator.
Example 1
display 2+2 produces the output 4. Stata variables may also appear in the expression, such as in
display myvar/2. Because display works only with scalars, the resulting calculation is performed
only for the first observation. You could type display myvar[10]/2 to display the calculation for
the 10th observation. Here are more examples:
. display sqrt(2)/2
.70710678
. display normal(-1.1)
.13566606
. di (57.2-3)/(12-2)
5.42
. display myvar/10
7
. display myvar[10]/2
3.5
Also see
[P] display — Display strings and values of scalar expressions
[U] 13 Functions and expressions
Title
do — Execute commands from a file
Syntax
    {do | run} filename [arguments] [, nostop]

Menu
    File > Do...
Description
do and run cause Stata to execute the commands stored in filename just as if they were entered
from the keyboard. do echoes the commands as it executes them, whereas run is silent. If filename
is specified without an extension, .do is assumed.
Option
nostop allows the do-file to continue executing even if an error occurs. Normally, Stata stops executing
the do-file when it detects an error (nonzero return code).
Remarks
You can create filename (called a do-file) using Stata’s Do-file Editor; see [R] doedit. This file
will be a standard ASCII (text) file. A complete discussion of do-files can be found in [U] 16 Do-files.
You can also create filename by using a non-Stata text editor; see [D] shell for a way to invoke
your favorite editor from inside Stata. Make sure that you save the file in ASCII format.
If the path or filename contains spaces, it should be enclosed in double quotes.
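For instance, a minimal sketch (myanalysis.do and "my analysis.do" are hypothetical file names):

. do myanalysis.do
. run myanalysis.do
. do myanalysis.do, nostop
. do "my analysis.do"

The first line echoes each command as it runs; the second runs the same file silently; the third keeps going past errors; the fourth quotes a name containing spaces.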
Reference
Jenkins, S. P. 2006. Stata tip 32: Do not stop. Stata Journal 6: 281.
Also see
[R] doedit — Edit do-files and other text files
[P] include — Include commands from file
[GSM] 13 Using the Do-file Editor—automating Stata
[GSU] 13 Using the Do-file Editor—automating Stata
[GSW] 13 Using the Do-file Editor—automating Stata
[U] 15 Saving and printing output—log files
[U] 16 Do-files
Title
doedit — Edit do-files and other text files
Syntax
doedit
filename
Menu
Window
>
Do-file Editor
Description
doedit opens a text editor that lets you edit do-files and other text files.
The Do-file Editor lets you submit several commands to Stata at once.
Remarks
Clicking on the Do-file Editor button is equivalent to typing doedit.
doedit, typed by itself, invokes the Editor with an empty document. If you specify filename, that
file is displayed in the Editor.
You may have more than one Do-file Editor open at once. Each time you submit the doedit
command, a new window will be opened.
A tutorial discussion of doedit can be found in the Getting Started with Stata manual. Read
[U] 16 Do-files for an explanation of do-files, and then read [GSW] 13 Using the Do-file Editor—
automating Stata to learn how to use the Do-file Editor to create and execute do-files.
Also see
[GSM] 13 Using the Do-file Editor—automating Stata
[GSU] 13 Using the Do-file Editor—automating Stata
[GSW] 13 Using the Do-file Editor—automating Stata
[U] 16 Do-files
Title
dotplot — Comparative scatterplots
Syntax
Dotplot of varname, with one column per value of groupvar
    dotplot varname [if] [in] [, options]

Dotplot for each variable in varlist, with one column per variable
    dotplot varlist [if] [in] [, options]
options                  description

Options
  over(groupvar)         display one columnar dotplot for each value of groupvar
  nx(#)                  horizontal dot density; default is nx(0)
  ny(#)                  vertical dot density; default is ny(35)
  incr(#)                label every # group; default is incr(1)
  mean | median          plot a horizontal line of pluses at the mean or median
  bounded                use minimum and maximum as boundaries
  bar                    plot horizontal dashed lines at shoulders of each group
  nogroup                use the actual values of yvar
  center                 center the dot for each column

Plot
  marker options         change look of markers (color, size, etc.)
  marker label options   add marker labels; change look or position

Y axis, X axis, Titles, Legend, Overall
  twoway options         any options other than by() documented in [G] twoway options
Menu
    Graphics > Distributional graphs > Distribution dotplot
Description
A dotplot is a scatterplot with values grouped together vertically (“binning”, as in a histogram)
and with plotted points separated horizontally. The aim is to display all the data for several variables
or groups in one compact graphic.
In the first syntax, dotplot produces a columnar dotplot of varname, with one column per value
of groupvar. In the second syntax, dotplot produces a columnar dotplot for each variable in varlist,
with one column per variable; over(groupvar) is not allowed. In each case, the “dots” are plotted
as small circles to increase readability.
Options
Options
over(groupvar) identifies the variable for which dotplot will display one columnar dotplot for
each value of groupvar.
nx(#) sets the horizontal dot density. A larger value of # will increase the dot density, reducing the
horizontal separation between dots. This option will increase the separation between columns if
two or more groups or variables are used.
ny(#) sets the vertical dot density (number of “bins” on the y axis). A larger value of # will result
in more bins and a plot that is less spread out horizontally. # should be determined in conjunction
with nx() to give the most pleasing appearance.
incr(#) specifies how the x axis is to be labeled. incr(1), the default, labels all groups. incr(2)
labels every second group.
mean | median plots a horizontal line of pluses at the mean or median of each group.
bounded forces the minimum and maximum of the variable to be used as boundaries of the smallest
and largest bins. It should be used with one variable whose support is not the whole of the real
line and whose density does not tend to zero at the ends of its support, e.g., a uniform random
variable or an exponential random variable.
bar plots horizontal dashed lines at the shoulders of each group. The “shoulders” are taken to be
the upper and lower quartiles unless mean has been specified; here they will be the mean plus or
minus the standard deviation.
nogroup uses the actual values of yvar rather than grouping them (the default). This option may be
useful if yvar takes on only a few values.
center centers the dots for each column on a hidden vertical line.
Plot
marker options affect the rendition of markers drawn at the plotted points, including their shape,
size, color, and outline; see [G] marker options.
marker label options specify if and how the markers are to be labeled; see [G] marker label options.
Y axis, X axis, Titles, Legend, Overall
twoway options are any of the options documented in [G] twoway options, excluding by(). These
include options for titling the graph (see [G] title options) and for saving the graph to disk (see
[G] saving option).
Remarks
dotplot produces a figure that has elements of a boxplot, a histogram, and a scatterplot. Like a
boxplot, it is most useful for comparing the distributions of several variables or the distribution of 1
variable in several groups. Like a histogram, the figure provides a crude estimate of the density, and,
as with a scatterplot, each symbol (dot) represents 1 observation.
Example 1
dotplot may be used as an alternative to Stata’s histogram graph for displaying the distribution
of one variable.
. set seed 123456789
. set obs 1000
. generate norm = rnormal()
. dotplot norm, title("Normal distribution, sample size 1000")
(figure omitted: dotplot of norm, titled "Normal distribution, sample size 1000"; y axis "norm", x axis "Frequency")
Example 2
The over() option lets us use dotplot to compare the distribution of one variable within different
levels of a grouping variable. The center, median, and bar options create a graph that may be
compared with Stata’s boxplot; see [G] graph box. The next graph illustrates this option with Stata’s
automobile dataset.
. use http://www.stata-press.com/data/r11/auto, clear
(1978 Automobile Data)
. dotplot mpg, over(foreign) nx(25) ny(10) center median bar

(figure omitted: dotplots of mpg for Domestic and Foreign cars; y axis "Mileage (mpg)", x axis "Car type"; dashed lines mark the shoulders of each group and pluses mark the medians)
Example 3
The second version of dotplot lets us compare the distribution of several variables. In the next
graph, all 10 variables contain measurements on tumor volume.
. use http://www.stata-press.com/data/r11/dotgr
. dotplot g1r1-g1r10, ytitle("Tumor volume, cu mm")

(figure omitted: dotplots of g1r1 through g1r10; y axis "Tumor volume, cu mm")
Example 4
When using the first form with the over() option, we can encode a third dimension in a dotplot
by using a different plotting symbol for different groups. The third dimension cannot be encoded
with a varlist. The example is of a hypothetical matched case – control study. The next graph shows
the exposure of each individual in each matched stratum. Cases are marked by the letter ‘x’, and
controls are marked by the letter ‘o’.
. use http://www.stata-press.com/data/r11/dotdose
. label define symbol 0 "o" 1 "x"
. label values case symbol
. dotplot dose, over(strata) m(none) mlabel(case) mlabp(0) center

(figure omitted: dotplots of dose over strata 0–12; y axis "dose", x axis "strata"; cases plotted as "x", controls as "o")
Example 5
dotplot can also be used with two virtually continuous variables as an alternative to jittering the
data to distinguish ties. We must use the xlab option, because otherwise dotplot will attempt to
label too many points on the x axis. It is often useful in such instances to use a value of nx that
is smaller than the default. That was not necessary in this example, partly because of our choice of
symbols.
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. generate byte hi_price = (price>10000) if price < .
. label define symbol 0 "|" 1 "o"
. label values hi_price symbol
. dotplot weight, over(gear_ratio) m(none) mlabel(hi_price) mlabp(0) center
> xlabel(#5)

(figure omitted: dotplots of weight over gear ratio; y axis "Weight (lbs.)", x axis "Gear Ratio"; cars with price above 10,000 plotted as "o", the rest as "|")
Example 6
The following figure is included mostly for aesthetic reasons. It also demonstrates dotplot’s
ability to cope with even very large datasets. The sample size for each variable is 10,000, so it may
take a long time to print.
. clear all
. set seed 123456789
. set obs 10000
. gen norm0 = rnormal()
. gen norm1 = rnormal() + 1
. gen norm2 = rnormal() + 2
. label variable norm0 "N(0,1)"
. label variable norm1 "N(1,1)"
. label variable norm2 "N(2,1)"
. dotplot norm0 norm1 norm2

(figure omitted: dotplots of norm0, norm1, and norm2, labeled N(0,1), N(1,1), and N(2,1))
Saved results
dotplot saves the following in r():
Scalars
r(nx) horizontal dot density
r(ny) vertical dot density
Methods and formulas
dotplot is implemented as an ado-file.
Acknowledgments
dotplot was written by Peter Sasieni of the Wolfson Institute of Preventive Medicine, London,
and Patrick Royston of the MRC Clinical Trials Unit, London.
References
Sasieni, P., and P. Royston. 1994. gr14: dotplot: Comparative scatterplots. Stata Technical Bulletin 19: 8–10. Reprinted
in Stata Technical Bulletin Reprints, vol. 4, pp. 50–54. College Station, TX: Stata Press.
. 1996. Dotplots. Applied Statistics 45: 219–234.
Title
dstdize — Direct and indirect standardization
Syntax
Direct standardization
    dstdize charvar popvar stratavars [if] [in], by(groupvars) [dstdize options]

Indirect standardization
    istdize casevars popvars stratavars [if] [in] using filename,
        {popvars(casevarp popvarp) | rate(ratevarp {# | crudevarp})} [istdize options]

dstdize options          description

Main
* by(groupvars)          study populations
  using(filename)        use standard population from Stata dataset
  base(# | string)       use standard population from a value of grouping variable
  level(#)               set confidence level; default is level(95)

Options
  saving(filename)       save computed standard population distribution as a Stata dataset
  format(%fmt)           final summary table display format; default is %10.0g
  print                  include table summary of standard population in output
  nores                  suppress saving results in r()

* by(groupvars) is required.

istdize options          description

Main
* popvars(casevarp popvarp)
                         for standard population, casevarp is number of cases and
                         popvarp is number of individuals
* rate(ratevarp {# | crudevarp})
                         ratevarp is stratum-specific rates and # or crudevarp is the
                         crude case rate value or variable
  level(#)               set confidence level; default is level(95)

Options
  by(groupvars)          variables identifying study populations
  format(%fmt)           final summary table display format; default is %10.0g
  print                  include table summary of standard population in output

* Either popvars(casevarp popvarp) or rate(ratevarp {# | crudevarp}) must be specified.
Menu
dstdize
    Statistics > Epidemiology and related > Other > Direct standardization
istdize
    Statistics > Epidemiology and related > Other > Indirect standardization
Description
dstdize produces standardized rates for charvar, which are defined as a weighted average of the
stratum-specific rates. These rates can be used to compare the characteristic charvar across different
populations identified by groupvars. Weights used in the standardization are given by popvar; the
strata across which the weights are to be averaged are defined by stratavars.
istdize produces indirectly standardized rates for a study population based on a standard population. This standardization method is appropriate when the stratum-specific rates for the population
being studied are either unavailable or based on small samples and thus are unreliable. The standardization uses the stratum-specific rates of a standard population to calculate the expected number of
cases in the study population(s), sums them, and then compares them with the actual number of cases
observed. The standard population is in another Stata data file specified by using filename, and it
must contain popvar and stratavars.
In addition to calculating rates, the indirect standardization command produces point estimates and
exact confidence intervals of the study population’s standardized mortality ratio (SMR), if death is the
event of interest, or the standardized incidence ratio (SIR) for studies of incidence. Here we refer to
both ratios as SMR.
casevars is the variable name for the study population’s number of cases (usually deaths). It must
contain integers, and for each group, defined by groupvar, each subpopulation identified by stratavars
must have the same values or missing.
popvars identifies the number of subjects represented by each observation in the study population.
stratavars define the strata.
Options for dstdize
Main
by(groupvars) is required for the dstdize command; it specifies the variables identifying the study
populations. If base() is also specified, there must be only one variable in the by() group. If
you do not have a variable for this option, you can generate one by using something like gen
newvar=1 and then use newvar as the argument to this option.
using(filename) or base(# | string) may be used to specify the standard population. You may not
specify both options. using( filename) supplies the name of a .dta file containing the standard
population. The standard population must contain the popvar and the stratavars. If using() is
not specified, the standard population distribution will be obtained from the data. base(# | string)
lets you specify one of the values of groupvar—either a numeric value or a string—to be used
as the standard population. If neither base() nor using() is specified, the entire dataset is used
to determine an estimate of the standard population.
level(#) specifies the confidence level, as a percentage, for a confidence interval of the adjusted
rate. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of
confidence intervals.
Options
saving( filename) saves the computed standard population distribution as a Stata dataset that can be
used in further analyses.
format(% fmt) specifies the format in which to display the final summary table. The default is
%10.0g.
print includes a table summary of the standard population before displaying the study population
results.
nores suppresses saving results in r(). This option is seldom specified. Some saved results are stored
in matrices. If there are more groups than matsize, dstdize will report “matsize too small”.
Then you can either increase matsize or specify nores. The nores option does not change how
results are calculated but specifies that results need not be left behind for use by other programs.
Options for istdize
Main
popvars(casevarp popvarp) or rate(ratevarp {# | crudevarp}) must be specified with the
istdize command. Only one of these two options is allowed. These options are used to describe
the standard population's data.
With popvars(casevarp popvarp ), casevarp records the number of cases (deaths) for each stratum
in the standard population, and popvarp records the total number of individuals in each stratum
(individuals at risk).
With rate(ratevarp # | crudevarp ), ratevarp contains the stratum-specific rates. # | crudevarp
specifies the crude case rate either by a variable name or, optionally, by the crude case rate value.
If a crude rate variable is used, it must be the same for all observations, although it could be
missing for some.
level(#) specifies the confidence level, as a percentage, for a confidence interval of the adjusted
rate. The default is level(95) or as set by set level; see [U] 20.7 Specifying the width of
confidence intervals.
Options
by(groupvars) specifies variables identifying study populations when more than one exists in the
data. If this option is not specified, the entire study population is treated as one group.
format(% fmt) specifies the format in which to display the final summary table. The default is
%10.0g.
print outputs a table summary of the standard population before displaying the study population
results.
Remarks
Remarks are presented under the following headings:
Direct standardization
Indirect standardization
In epidemiology and other fields, you will often need to compare rates for some characteristic
across different populations. These populations often differ on factors associated with the characteristic
under study; thus directly comparing overall rates may be misleading.
See van Belle et al. (2004, 642–684), Fleiss, Levin, and Paik (2003, chap. 19), or Kirkwood and
Sterne (2003, chap. 25) for a discussion of direct and indirect standardization.
Direct standardization
The direct method of adjusting for differences among populations involves computing the overall
rates that would result if, instead of having different distributions, all populations had the same
standard distribution. The standardized rate is defined as a weighted average of the stratum-specific
rates, with the weights taken from the standard distribution. Direct standardization may be applied
only when the specific rates for a given population are available.
dstdize generates adjusted summary measures of occurrence, which can be used to compare
prevalence, incidence, or mortality rates between populations that may differ on certain characteristics
(e.g., age, gender, race). These underlying differences may affect the crude prevalence, mortality, or
incidence rates.
Example 1
We have data (Rothman 1986, 42) on mortality rates for Sweden and Panama for 1962, and we
wish to compare mortality in these two countries:
. use http://www.stata-press.com/data/r11/mortality
(1962 Mortality, Sweden & Panama)
. describe
Contains data from http://www.stata-press.com/data/r11/mortality.dta
  obs:             6                          1962 Mortality, Sweden & Panama
 vars:             4                          14 Apr 2009 16:18
 size:           114 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label

nation          str6   %9s                    Nation
age_category    byte   %9.0g       age_lbl    Age Category
population      float  %10.0gc                Population in Age Category
deaths          float  %9.0gc                 Deaths in Age Category

Sorted by:
. list, sepby(nation) abbrev(12) divider

     nation   age_category   population   deaths

  1. Sweden   0 - 29            3145000    3,523
  2. Sweden   30 - 59           3057000   10,928
  3. Sweden   60+               1294000   59,104

  4. Panama   0 - 29            741,000    3,904
  5. Panama   30 - 59           275,000    1,421
  6. Panama   60+                59,000    2,456
We divide the total number of cases in the population by the population to obtain the crude rate:

. collapse (sum) pop deaths, by(nation)
. list, abbrev(10) divider

     nation   population   deaths

  1. Panama      1075000    7,781
  2. Sweden      7496000   73,555

. generate crude = deaths/pop
. list, abbrev(10) divider

     nation   population   deaths      crude

  1. Panama      1075000    7,781   .0072381
  2. Sweden      7496000   73,555   .0098126
If we examine the total number of deaths in the two nations, the total crude mortality rate in
Sweden is higher than that in Panama. From the original data, we see one possible explanation:
Swedes are older than Panamanians, making direct comparison of the mortality rates difficult.
Direct standardization lets us remove the distortion caused by the different age distributions. The
adjusted rate is defined as the weighted sum of the crude rates, where the weights are given by the
standard distribution. Suppose that we wish to standardize these mortality rates to the following age
distribution:
. use http://www.stata-press.com/data/r11/1962, clear
(Standard Population Distribution)
. list, abbrev(12) divider

     age_category   population

  1. 0 - 29                .35
  2. 30 - 59               .35
  3. 60+                    .3

. sort age_cat
. save 1962
file 1962.dta saved
If we multiply the above weights for the age strata by the crude rate for the corresponding age
category, the sum gives us the standardized rate.
. use http://www.stata-press.com/data/r11/mortality
(1962 Mortality, Sweden & Panama)
. generate crude=deaths/pop
. drop pop
. sort age_cat
. merge m:1 age_cat using 1962
(age_category was byte, now float)

    Result                      # of obs.

    not matched                         0
    matched                             6  (_merge==3)

. list, sepby(age_category) abbrev(12)

     nation   age_category   deaths      crude   population        _merge

  1. Sweden   0 - 29          3,523   .0011202          .35   matched (3)
  2. Panama   0 - 29          3,904   .0052686          .35   matched (3)

  3. Panama   30 - 59         1,421   .0051673          .35   matched (3)
  4. Sweden   30 - 59        10,928   .0035747          .35   matched (3)

  5. Panama   60+             2,456   .0416271           .3   matched (3)
  6. Sweden   60+            59,104   .0456754           .3   matched (3)
. generate product = crude*pop
. by nation, sort: egen adj_rate = sum(product)
. drop _merge
. list, sepby(nation)

     nation   age_ca~y   deaths      crude   popula~n    product   adj_rate

  1. Panama   0 - 29      3,904   .0052686        .35    .001844   .0161407
  2. Panama   30 - 59     1,421   .0051673        .35   .0018085   .0161407
  3. Panama   60+         2,456   .0416271         .3   .0124881   .0161407

  4. Sweden   60+        59,104   .0456754         .3   .0137026   .0153459
  5. Sweden   30 - 59    10,928   .0035747        .35   .0012512   .0153459
  6. Sweden   0 - 29      3,523   .0011202        .35   .0003921   .0153459
Comparing the standardized rates indicates that the Swedes have a slightly lower mortality rate.
To perform the above analysis with dstdize, type

. use http://www.stata-press.com/data/r11/mortality, clear
(1962 Mortality, Sweden & Panama)
. dstdize deaths pop age_cat, by(nation) using(1962)

-> nation= Panama

                        -----Unadjusted-----     Std.
                        Pop.         Stratum     Pop.
Stratum        Pop.    Cases  Dist.  Rate[s]   Dst[P]      s*P

 0 - 29      741000     3904  0.689   0.0053    0.350   0.0018
30 - 59      275000     1421  0.256   0.0052    0.350   0.0018
    60+       59000     2456  0.055   0.0416    0.300   0.0125

Totals:     1075000     7781              Adjusted Cases:  17351.2
                                              Crude Rate:   0.0072
                                           Adjusted Rate:   0.0161
                              95% Conf. Interval: [0.0156, 0.0166]

-> nation= Sweden

                        -----Unadjusted-----     Std.
                        Pop.         Stratum     Pop.
Stratum        Pop.    Cases  Dist.  Rate[s]   Dst[P]      s*P

 0 - 29     3145000     3523  0.420   0.0011    0.350   0.0004
30 - 59     3057000    10928  0.408   0.0036    0.350   0.0013
    60+     1294000    59104  0.173   0.0457    0.300   0.0137

Totals:     7496000    73555              Adjusted Cases: 115032.5
                                              Crude Rate:   0.0098
                                           Adjusted Rate:   0.0153
                              95% Conf. Interval: [0.0152, 0.0155]

Summary of Study Populations:
nation             N       Crude    Adj_Rate        Confidence Interval

Panama       1075000    0.007238    0.016141   [ 0.015645,  0.016637]
Sweden       7496000    0.009813    0.015346   [ 0.015235,  0.015457]
The summary table above lets us make a quick inspection of the results within the study populations,
and the detail tables give the behavior among the strata within the study populations.
Example 2
We have individual-level data on persons in four cities over several years. Included in the data is
a variable indicating whether the person has high blood pressure, together with information on the
person’s age, sex, and race. We wish to obtain standardized high blood pressure rates for each city
for 1990 and 1992, using, as the standard, the age, sex, and race distribution of the four cities and
two years combined.
Our dataset contains
. use http://www.stata-press.com/data/r11/hbp
. describe
Contains data from http://www.stata-press.com/data/r11/hbp.dta
  obs:         1,130
 vars:             7                          21 Feb 2009 06:42
 size:        23,730 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label

id              str10  %10s                   Record identification number
city            byte   %8.0g
year            int    %8.0g
sex             byte   %8.0g       sexfmt
age_group       byte   %8.0g       agefmt
race            byte   %8.0g       racefmt
hbp             byte   %8.0g       yn         high blood pressure

Sorted by:
The dstdize command is designed to work with aggregate data but will work with individual-level
data only if we create a variable recording the population represented by each observation. For
individual-level data, this is one:
. gen pop = 1
Below, we specify print to obtain a listing of the standard population and level(90)
to request 90% rather than 95% confidence intervals. Typing if year==1990 | year==1992 restricts
the data to the two years for both summary tables and the standard population.
. dstdize hbp pop age race sex if year==1990 | year==1992, by(city year) print
> level(90)
Standard Population
Stratum                        Pop.    Dist.

15 - 19   Black      Female      35    0.077
15 - 19   Black      Male        44    0.097
15 - 19   Hispanic   Female       5    0.011
15 - 19   Hispanic   Male        10    0.022
15 - 19   White      Female       7    0.015
15 - 19   White      Male         5    0.011
20 - 24   Black      Female      43    0.095
20 - 24   Black      Male        67    0.147
20 - 24   Hispanic   Female      14    0.031
20 - 24   Hispanic   Male        13    0.029
20 - 24   White      Female       4    0.009
20 - 24   White      Male        21    0.046
25 - 29   Black      Female      17    0.037
25 - 29   Black      Male        44    0.097
25 - 29   Hispanic   Female       7    0.015
25 - 29   Hispanic   Male        13    0.029
25 - 29   White      Female       9    0.020
25 - 29   White      Male        16    0.035
30 - 34   Black      Female      16    0.035
30 - 34   Black      Male        32    0.070
30 - 34   Hispanic   Female       2    0.004
30 - 34   Hispanic   Male         3    0.007
30 - 34   White      Female       5    0.011
30 - 34   White      Male        23    0.051

Total:                          455
(6 observations excluded due to missing values)

-> city year= 1 1990

                                    -----Unadjusted-----     Std.
                                    Pop.         Stratum     Pop.
Stratum                     Pop.   Cases  Dist.  Rate[s]   Dst[P]      s*P

15 - 19   Black    Female      6       2  0.128   0.3333    0.077   0.0256
15 - 19   Black    Male        6       0  0.128   0.0000    0.097   0.0000
15 - 19   Hispanic Male        1       0  0.021   0.0000    0.022   0.0000
20 - 24   Black    Female      3       0  0.064   0.0000    0.095   0.0000
20 - 24   Black    Male       11       0  0.234   0.0000    0.147   0.0000
25 - 29   Black    Female      4       0  0.085   0.0000    0.037   0.0000
25 - 29   Black    Male        6       1  0.128   0.1667    0.097   0.0161
25 - 29   Hispanic Female      2       0  0.043   0.0000    0.015   0.0000
25 - 29   White    Female      1       0  0.021   0.0000    0.020   0.0000
30 - 34   Black    Female      1       0  0.021   0.0000    0.035   0.0000
30 - 34   Black    Male        6       0  0.128   0.0000    0.070   0.0000

Totals:                       47       3           Adjusted Cases:     2.0
                                                       Crude Rate:  0.0638
                                                    Adjusted Rate:  0.0418
                                    90% Conf. Interval: [0.0074, 0.0761]

(output omitted)

-> city year= 5 1992

                                    -----Unadjusted-----     Std.
                                    Pop.         Stratum     Pop.
Stratum                     Pop.   Cases  Dist.  Rate[s]   Dst[P]      s*P

15 - 19   Black    Female      6       0  0.087   0.0000    0.077   0.0000
15 - 19   Black    Male        9       0  0.130   0.0000    0.097   0.0000
15 - 19   Hispanic Female      1       0  0.014   0.0000    0.011   0.0000
15 - 19   Hispanic Male        2       0  0.029   0.0000    0.022   0.0000
15 - 19   White    Female      2       0  0.029   0.0000    0.015   0.0000
15 - 19   White    Male        1       0  0.014   0.0000    0.011   0.0000
20 - 24   Black    Female     13       0  0.188   0.0000    0.095   0.0000
20 - 24   Black    Male       10       0  0.145   0.0000    0.147   0.0000
20 - 24   Hispanic Male        1       0  0.014   0.0000    0.029   0.0000
20 - 24   White    Male        3       0  0.043   0.0000    0.046   0.0000
25 - 29   Black    Female      2       0  0.029   0.0000    0.037   0.0000
25 - 29   Black    Male        2       0  0.029   0.0000    0.097   0.0000
25 - 29   Hispanic Male        3       0  0.043   0.0000    0.029   0.0000
25 - 29   White    Male        1       0  0.014   0.0000    0.035   0.0000
30 - 34   Black    Female      4       0  0.058   0.0000    0.035   0.0000
30 - 34   Black    Male        5       0  0.072   0.0000    0.070   0.0000
30 - 34   Hispanic Male        2       0  0.029   0.0000    0.007   0.0000
30 - 34   White    Female      1       0  0.014   0.0000    0.011   0.0000
30 - 34   White    Male        1       1  0.014   1.0000    0.051   0.0505

Totals:                       69       1           Adjusted Cases:     3.5
                                                       Crude Rate:  0.0145
                                                    Adjusted Rate:  0.0505
                                    90% Conf. Interval: [0.0505, 0.0505]

Summary of Study Populations:
city   year          N       Crude    Adj_Rate        Confidence Interval

   1   1990         47    0.063830    0.041758   [ 0.007427,  0.076089]
   1   1992         56    0.017857    0.008791   [ 0.000000,  0.022579]
   2   1990         64    0.046875    0.044898   [ 0.009072,  0.080724]
   2   1992         67    0.029851    0.014286   [ 0.002537,  0.026035]
   3   1990         69    0.159420    0.088453   [ 0.050093,  0.126813]
   3   1992         37    0.189189    0.046319   [ 0.025271,  0.067366]
   5   1990         46    0.043478    0.022344   [ 0.002044,  0.042644]
   5   1992         69    0.014493    0.050549   [ 0.050549,  0.050549]
Indirect standardization
Standardization of rates can be performed via the indirect method whenever the stratum-specific
rates are either unknown or unreliable. If the stratum-specific rates are known, the direct standardization
method is preferred.
To apply the indirect method, you must have the following information:
• The observed number of cases in each population to be standardized, O. For example, if death
rates in two states are being standardized using the U.S. death rate for the same period, you must
know the total number of deaths in each state.
• The distribution across the various strata for the population being studied, $n_1, \dots, n_k$. If you are
standardizing the death rate in the two states, adjusting for age, you must know the number of
individuals in each of the k age groups.
• The stratum-specific rates for the standard population, $p_1, \dots, p_k$. For example, you must have
the U.S. death rate for each stratum (age group).
• The crude rate of the standard population, C. For example, you must have the U.S. mortality rate
for the year.

The indirect adjusted rate is then

    $$R_{\text{indirect}} = C\,\frac{O}{E}$$
where E is the expected number of cases (deaths) in each population. See Methods and formulas for
a more detailed description of calculations.
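As a quick arithmetic check of this formula, the California figures from the example below (C = 0.00945, O = 166,285, E = 178,078.73) reproduce the adjusted rate that istdize reports:

. display 0.00945 * 166285/178078.73    // ≈ 0.0088, the adjusted rate below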
Example 3
This example is borrowed from Kahn and Sempos (1989, 95–105). We want to compare 1970
mortality rates in California and Maine, adjusting for age. Although we have age-specific population
counts for the two states, we lack age-specific death rates. Direct standardization is not feasible here.
We can use the U.S. population census data for the same year to produce indirectly standardized rates
for these two states.
From the U.S. census, the standard population for this example was entered into Stata and saved
in popkahn.dta.
. use http://www.stata-press.com/data/r11/popkahn, clear
. list age pop deaths rate, sep(4)

     age     population    deaths     rate

  1. <15     57,900,000   103,062   .00178
  2. 15-24   35,441,000    45,261   .00128
  3. 25-34   24,907,000    39,193   .00157
  4. 35-44   23,088,000    72,617   .00315

  5. 45-54   23,220,000   169,517    .0073
  6. 55-64   18,590,000   308,373   .01659
  7. 65-74   12,436,000   445,531   .03583
  8. 75+      7,630,000   736,758   .09656
The standard population contains for each age stratum the total number of individuals (pop) and
both the age-specific mortality rate (rate) and the number of deaths. The standard population need
not contain all three. If we have only the age-specific mortality rate, we can use the rate(ratevarp
crudevarp ) or rate(ratevarp #) option, where crudevarp refers to the variable containing the total
population’s crude death rate or # is the total population’s crude death rate.
Now let’s look at the states’ data (study population):
. use http://www.stata-press.com/data/r11/kahn
. list, sep(4)

      state        age     populat~n     death   st   death_~e

  1.  California   <15     5,524,000   166,285    1      .0016
  2.  California   15-24   3,558,000   166,285    1      .0013
  3.  California   25-34   2,677,000   166,285    1      .0015
  4.  California   35-44   2,359,000   166,285    1      .0028

  5.  California   45-54   2,330,000   166,285    1      .0067
  6.  California   55-64   1,704,000   166,285    1      .0154
  7.  California   65-74   1,105,000   166,285    1      .0328
  8.  California   75+       696,000   166,285    1      .0917

  9.  Maine        <15       286,000    11,051    2      .0019
 10.  Maine        15-24     168,000         .    2      .0011
 11.  Maine        25-34     110,000         .    2      .0014
 12.  Maine        35-44     109,000         .    2      .0029

 13.  Maine        45-54     110,000         .    2      .0069
 14.  Maine        55-64      94,000         .    2      .0173
 15.  Maine        65-74      69,000         .    2       .039
 16.  Maine        75+        46,000         .    2      .1041
For each state, the number of individuals in each stratum (age group) is contained in the pop variable.
The death variable is the total number of deaths observed in the state during the year. It must have
the same value for all observations in the group, as for California, or it could be missing in all but
one observation per group, as for Maine.
To match these two datasets, the strata variables must have the same name in both datasets and
ideally the same levels. If a level is missing from either dataset, that level will not be included in the
standardization.
With kahn.dta in memory, we now execute the command. We will use the print option to
obtain the standard population’s summary table, and because we have both the standard population’s
age-specific count and deaths, we will specify the popvars(casevarp popvarp ) option. Or, we could
specify the rate(rate 0.00945) option because we know that 0.00945 is the U.S. crude death rate
for 1970.
. istdize death pop age using http://www.stata-press.com/data/r11/popkahn,
> by(state) pop(deaths pop) print

            Standard Population
         Stratum        Rate
           <15        0.00178
           15-24      0.00128
           25-34      0.00157
           35-44      0.00315
           45-54      0.00730
           55-64      0.01659
           65-74      0.03583
           75+        0.09656

  Standard population's crude rate:   0.00945
-> state= California

                    Indirect Standardization
                Standard
               Population     Observed         Cases
  Stratum         Rate       Population      Expected
    <15          0.0018        5524000        9832.72
    15-24        0.0013        3558000        4543.85
    25-34        0.0016        2677000        4212.46
    35-44        0.0031        2359000        7419.59
    45-54        0.0073        2330000       17010.10
    55-64        0.0166        1704000       28266.14
    65-74        0.0358        1105000       39587.63
    75+          0.0966         696000       67206.23

  Totals:                     19953000      178078.73

                   Observed Cases:   166285
                    SMR (Obs/Exp):     0.93
     SMR exact 95% Conf. Interval: [0.9293, 0.9383]
                       Crude Rate:   0.0083
                    Adjusted Rate:   0.0088
               95% Conf. Interval: [0.0088, 0.0089]
-> state= Maine

                    Indirect Standardization
                Standard
               Population     Observed         Cases
  Stratum         Rate       Population      Expected
    <15          0.0018         286000         509.08
    15-24        0.0013         168000         214.55
    25-34        0.0016         110000         173.09
    35-44        0.0031         109000         342.83
    45-54        0.0073         110000         803.05
    55-64        0.0166          94000        1559.28
    65-74        0.0358          69000        2471.99
    75+          0.0966          46000        4441.79

  Totals:                       992000       10515.67

                   Observed Cases:    11051
                    SMR (Obs/Exp):     1.05
     SMR exact 95% Conf. Interval: [1.0314, 1.0707]
                       Crude Rate:   0.0111
                    Adjusted Rate:   0.0099
               95% Conf. Interval: [0.0097, 0.0101]

Summary of Study Populations (Rates):
                   Cases
state           Observed        Crude     Adj_Rate       Confidence Interval
California        166285     0.008334     0.008824    [0.008782, 0.008866]
Maine              11051     0.011140     0.009931    [0.009747, 0.010118]

Summary of Study Populations (SMR):
                   Cases        Cases
state           Observed     Expected        SMR     Exact Confidence Interval
California        166285    178078.73      0.934      [0.929290, 0.938271]
Maine              11051     10515.67      1.051      [1.031405, 1.070687]
Saved results

dstdize saves the following in r():

Scalars
    r(k)        number of populations

Macros
    r(by)       variable names specified in by()
    r(c#)       values of r(by) for #th group
    r(se)       standard errors of adjusted rates
    r(ub)       upper bounds of confidence intervals for adjusted rates
    r(lb)       lower bounds of confidence intervals for adjusted rates

Matrices
    r(Nobs)     1 × k vector of number of observations
    r(crude)    1 × k vector of crude rates (*)
    r(adj)      1 × k vector of adjusted rates (*)

(*) If, in a group, the number of observations is 0, then 9 is stored for the corresponding crude and adjusted rates.
Methods and formulas
dstdize and istdize are implemented as ado-files.
The directly standardized rate, $S_R$, is defined by

$$ S_R = \frac{\sum_{i=1}^{k} w_i R_i}{\sum_{i=1}^{k} w_i} $$

(Rothman 1986, 44), where $R_i$ is the stratum-specific rate in stratum $i$ and $w_i$ is the weight for stratum $i$ derived from the standard population.

If $n_i$ is the population of stratum $i$, the standard error, $\operatorname{se}(S_R)$, in stratified sampling for proportions (ignoring the finite population correction) is

$$ \operatorname{se}(S_R) = \frac{1}{\sum_{i} w_i}\,\sqrt{\sum_{i=1}^{k} \frac{w_i^2\, R_i (1 - R_i)}{n_i}} $$

(Cochran 1977, 108), from which the confidence intervals are calculated.
For indirect standardization, define O as the observed number of cases in each population to be
standardized; n1 , . . . , nk as the distribution across the various strata for the population being studied;
R1 , . . . , Rk as the stratum-specific rates for the standard population; and C as the crude rate of the
standard population. The expected number of cases (deaths), E , in each population is obtained by
applying the standard population stratum-specific rates, R1 , . . . , Rk , to the study populations:
$$ E = \sum_{i=1}^{k} n_i R_i $$

The indirectly adjusted rate is then

$$ R_{\text{indirect}} = C\,\frac{O}{E} $$
and O/E is the study population’s SMR if death is the event of interest or the SIR for studies of
disease (or other) incidence.
The exact confidence interval is calculated for each estimated SMR by assuming a Poisson process as described in Breslow and Day (1987, 69–71). These intervals are obtained by first calculating the upper and lower bounds for the confidence interval of the Poisson-distributed observed events, $O$, say, $L$ and $U$, respectively, and then computing $\text{SMR}_L = L/E$ and $\text{SMR}_U = U/E$.
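As a check, Maine's SMR interval from example 3 can be reproduced from the exact Poisson confidence interval for its observed count (a sketch; cii computes the Poisson interval, and dividing by E = 10515.67 converts counts to SMR units):

. cii 1 11051, poisson
. display r(lb)/10515.67 "   " r(ub)/10515.67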
Acknowledgments
We gratefully acknowledge the collaboration of Dr. Joel A. Harrison, CIGNA HealthCare of Texas;
Dr. José Maria Pacheco from the Departamento de Epidemiologia, Faculdade de Saúde Pública/USP,
São Paulo, Brazil; and Dr. John L. Moran from The Queen Elizabeth Hospital, Woodville, Australia.
References
Breslow, N. E., and N. E. Day. 1987. Statistical Methods in Cancer Research: Vol. 2—The Design and Analysis of
Cohort Studies. Lyon: IARC.
Cleves, M. A. 1998. sg80: Indirect standardization. Stata Technical Bulletin 42: 43–47. Reprinted in Stata Technical
Bulletin Reprints, vol. 7, pp. 224–228. College Station, TX: Stata Press.
Cochran, W. G. 1977. Sampling Techniques. 3rd ed. New York: Wiley.
Fleiss, J. L., B. Levin, and M. C. Paik. 2003. Statistical Methods for Rates and Proportions. 3rd ed. New York: Wiley.
Forthofer, R. N., and E. S. Lee. 1995. Introduction to Biostatistics: A Guide to Design, Analysis, and Discovery. New
York: Academic Press.
Juul, S. 2008. An Introduction to Stata for Health Researchers. 2nd ed. College Station, TX: Stata Press.
Kahn, H. A., and C. T. Sempos. 1989. Statistical Methods in Epidemiology. New York: Oxford University Press.
Kirkwood, B. R., and J. A. C. Sterne. 2003. Essential Medical Statistics. 2nd ed. Malden, MA: Blackwell.
McGuire, T. J., and J. A. Harrison. 1994. sbe11: Direct standardization. Stata Technical Bulletin 21: 5–9. Reprinted
in Stata Technical Bulletin Reprints, vol. 4, pp. 88–94. College Station, TX: Stata Press.
Pagano, M., and K. Gauvreau. 2000. Principles of Biostatistics. 2nd ed. Belmont, CA: Duxbury.
Rothman, K. J. 1986. Modern Epidemiology. Boston: Little, Brown.
van Belle, G., L. D. Fisher, P. J. Heagerty, and T. S. Lumley. 2004. Biostatistics: A Methodology for the Health
Sciences. 2nd ed. New York: Wiley.
Wang, D. 2000. sbe40: Modeling mortality data using the Lee–Carter model. Stata Technical Bulletin 57: 15–17.
Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 118–121. College Station, TX: Stata Press.
Also see
[ST] epitab — Tables for epidemiologists
[SVY] direct standardization — Direct standardization of means, proportions, and ratios
Title
dydx — Calculate numeric derivatives and integrals
Syntax

Derivatives of numeric functions
    dydx yvar xvar [if] [in], generate(newvar) [dydx_options]

Integrals of numeric functions
    integ yvar xvar [if] [in] [, integ_options]

  dydx_options         description
  Main
  * generate(newvar)   create variable named newvar
    replace            overwrite the existing variable
  * generate(newvar) is required.

  integ_options        description
  Main
    generate(newvar)   create variable named newvar
    trapezoid          use trapezoidal rule to compute integrals; default is cubic splines
    initial(#)         initial value of integral; default is 0
    replace            overwrite the existing variable

by is allowed with dydx and integ; see [D] by.

Menu

dydx
    Data > Create or change data > Other variable-creation commands > Calculate numerical derivatives

integ
    Data > Create or change data > Other variable-creation commands > Calculate numeric integrals

Description
dydx and integ calculate derivatives and integrals of numeric “functions”.
Options
Main
generate(newvar) specifies the name of the new variable to be created. It must be specified with
dydx.
trapezoid requests that the trapezoidal rule [the sum of $(x_i - x_{i-1})(y_i + y_{i-1})/2$] be used to compute integrals. The default is cubic splines, which give superior results for most smooth functions; for irregular functions, trapezoid may give better results.
initial(#) specifies the initial condition for calculating definite integrals; see Methods and formulas
below. If not specified, the initial condition is taken as 0.
replace specifies that if an existing variable is specified for generate(), it should be overwritten.
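With variables such as the y and x constructed in example 1 below, the two rules can be compared directly (a sketch; iy1 and iy2 are arbitrary new variable names):

. integ y x, generate(iy1)
. integ y x, generate(iy2) trapezoid
. display iy1[_N] - iy2[_N]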
Remarks
dydx and integ let you extend Stata's graphics capabilities beyond data analysis and into mathematics. (See Gould [1993] for another command that draws functions.)
Example 1
We graph $y = e^{-x/6}\sin(x)$ over the interval [0, 12.56]:

. range x 0 12.56 100
obs was 0, now 100
. generate y = exp(-x/6)*sin(x)
. label variable y "exp(-x/6)*sin(x)"
. twoway connected y x, connect(i) yline(0)

[Graph of exp(-x/6)*sin(x) versus x omitted]
We estimate the derivative by using dydx and compute the relative difference between this estimate
and the true derivative.
. dydx y x, gen(dy)
. generate dytrue = exp(-x/6)*(cos(x) - sin(x)/6)
. generate error = abs(dy - dytrue)/dytrue
The error is greatest at the endpoints, as we would expect. The error is approximately 0.5% at each
endpoint, but the error quickly falls to less than 0.01%.
. label variable error "Error in derivative estimate"
. twoway line error x, ylabel(0(.002).006)

[Graph of the error in the derivative estimate versus x omitted]
We now estimate the integral by using integ:
. integ y x, gen(iy)
number of points = 100
integral
= .85316396
. generate iytrue = (36/37)*(1 - exp(-x/6)*(cos(x) + sin(x)/6))
. display iytrue[_N]
.85315901
. display abs(r(integral) - iytrue[_N])/iytrue[_N]
5.799e-06
. generate diff = iy - iytrue
The relative difference between the estimate [stored in r(integral)] and the true value of the integral is about $6 \times 10^{-6}$. A graph of the absolute difference (diff) is shown below. Here the error is cumulative. Again, most of the error is due to a relatively poorer fit near the endpoints.
. label variable diff "Error in integral estimate"
. twoway line diff x, ylabel(0(5.00e-06).00001)
[Graph of the error in the integral estimate versus x omitted]
Saved results

dydx saves the following in r():

Macros
    r(y)           name of yvar

integ saves the following in r():

Scalars
    r(N_points)    number of unique x points
    r(integral)    estimate of the integral
Methods and formulas
dydx and integ are implemented as ado-files.
Consider a set of data points, $(x_1, y_1), \ldots, (x_n, y_n)$, generated by a function $y = f(x)$. dydx and integ first fit these points with a cubic spline, which is then analytically differentiated (integrated) to give an approximation for the derivative (integral) of $f$.

The cubic spline (see, for example, Press et al. [2007]) consists of $n - 1$ cubic polynomials $P_i(x)$, with the $i$th one defined on the interval $[x_i, x_{i+1}]$,

$$ P_i(x) = y_i\,a_i(x) + y_{i+1}\,b_i(x) + y_i''\,c_i(x) + y_{i+1}''\,d_i(x) $$

where

$$ a_i(x) = \frac{x_{i+1} - x}{x_{i+1} - x_i} \qquad\qquad b_i(x) = \frac{x - x_i}{x_{i+1} - x_i} $$

$$ c_i(x) = \frac{1}{6}(x_{i+1} - x_i)^2\, a_i(x)\,[\{a_i(x)\}^2 - 1] \qquad\qquad d_i(x) = \frac{1}{6}(x_{i+1} - x_i)^2\, b_i(x)\,[\{b_i(x)\}^2 - 1] $$

and $y_i''$ and $y_{i+1}''$ are constants whose values will be determined as described below. The notation for these constants is justified because $P_i''(x_i) = y_i''$ and $P_i''(x_{i+1}) = y_{i+1}''$.

Because $a_i(x_i) = 1$, $a_i(x_{i+1}) = 0$, $b_i(x_i) = 0$, and $b_i(x_{i+1}) = 1$, it follows that $P_i(x_i) = y_i$ and $P_i(x_{i+1}) = y_{i+1}$. Thus the $P_i$ jointly define a function that is continuous at the interval boundaries. The first derivative should also be continuous at the interval boundaries; that is,

$$ P_i'(x_{i+1}) = P_{i+1}'(x_{i+1}) $$

The above $n - 2$ equations (one equation for each point except the two endpoints) and the values of the first derivative at the endpoints, $P_1'(x_1)$ and $P_{n-1}'(x_n)$, determine the $n$ constants $y_i''$.

The value of the first derivative at an endpoint is set to the value of the derivative obtained by fitting a quadratic to the endpoint and the two adjacent points; namely, we use

$$ P_1'(x_1) = \frac{y_1 - y_2}{x_1 - x_2} + \frac{y_1 - y_3}{x_1 - x_3} - \frac{y_2 - y_3}{x_2 - x_3} $$

and a similar formula for the upper endpoint.

dydx approximates $f'(x_i)$ by using $P_i'(x_i)$.

integ approximates $F(x_i) = F(x_1) + \int_{x_1}^{x_i} f(x)\,dx$ by using

$$ I_0 + \sum_{k=1}^{i-1} \int_{x_k}^{x_{k+1}} P_k(x)\,dx $$

where $I_0$ (an estimate of $F(x_1)$) is the value specified by the initial(#) option. If the trapezoid option is specified, integ approximates the integral by using the trapezoidal rule:

$$ I_0 + \sum_{k=1}^{i-1} \frac{1}{2}(x_{k+1} - x_k)(y_{k+1} + y_k) $$

If there are ties among the $x_i$, the mean of $y_i$ is computed at each set of ties and the cubic spline is fit to these values.
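The trapezoidal accumulation can also be replicated directly with Stata's running sum() function (a sketch, assuming the data are sorted by x and that $I_0 = 0$; iy_hand is an arbitrary new variable name):

. sort x
. generate double iy_hand = sum(0.5*(x - x[_n-1])*(y + y[_n-1]))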
Acknowledgment
The present versions of dydx and integ were inspired by the dydx2 command written by Patrick
Royston of the MRC Clinical Trials Unit, London.
References
Gould, W. W. 1993. ssi5.1: Graphing functions. Stata Technical Bulletin 16: 23–26. Reprinted in Stata Technical
Bulletin Reprints, vol. 3, pp. 188–193. College Station, TX: Stata Press.
. 1997. crc46: Better numerical derivatives and integrals. Stata Technical Bulletin 35: 3–5. Reprinted in Stata
Technical Bulletin Reprints, vol. 6, pp. 8–12. College Station, TX: Stata Press.
Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. 2007. Numerical Recipes in C: The Art of
Scientific Computing. 3rd ed. Cambridge: Cambridge University Press.
Also see
[D] obs — Increase the number of observations in a dataset
[D] range — Generate numerical range
Title
eform option — Displaying exponentiated coefficients
Description
An eform_option causes the coefficient table to be displayed in exponentiated form: for each coefficient, $e^b$ rather than $b$ is displayed. Standard errors and confidence intervals (CIs) are also transformed. Display of the intercept, if any, is suppressed.
An eform_option is one of the following:

  eform_option     description
  eform(string)    use string for the column title
  eform            exponentiated coefficient, string is exp(b)
  hr               hazard ratio, string is Haz. Ratio
  shr              subhazard ratio, string is SHR
  irr              incidence-rate ratio, string is IRR
  or               odds ratio, string is Odds Ratio
  rrr              relative-risk ratio, string is RRR
Remarks
Example 1
Here is a simple example of the or option with svy: logit. The CI for the odds ratio is computed
by transforming (by exponentiating) the endpoints of the CI for the corresponding coefficient.
. use http://www.stata-press.com/data/r11/nhanes2d
. svy, or: logit highbp female black
(running logit on estimation sample)
(output omitted)

                            Linearized
    highbp   Odds Ratio      Std. Err.       t    P>|t|    [95% Conf. Interval]
    female      .693628        .048676   -5.21    0.000     .6011298    .8003593
     black     1.509155       .2089569    2.97    0.006     1.137872    2.001586
We also could have specified the following command and received the same results as above:
. svy: logit highbp female black, or
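The transformation is easy to verify by hand: quietly refit the model and exponentiate the endpoints of the coefficient's CI (a sketch; the displayed values should match the interval for female above, up to rounding):

. quietly svy: logit highbp female black
. display exp(_b[female] - invttail(e(df_r),.025)*_se[female])
. display exp(_b[female] + invttail(e(df_r),.025)*_se[female])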
Also see
[R] ml — Maximum likelihood estimation
Title
eivreg — Errors-in-variables regression
Syntax

    eivreg depvar indepvars [if] [in] [weight] [, reliab(indepvar # [indepvar # ...]) level(#) display_options coeflegend]

† coeflegend does not appear in the dialog box.
indepvars may contain factor variables; see [U] 11.4.3 Factor variables.
bootstrap, by, jackknife, rolling, and statsby are allowed; see [U] 11.1.10 Prefix commands.
Weights are not allowed with the bootstrap prefix.
aweights are not allowed with the jackknife prefix.
aweights and fweights are allowed; see [U] 11.1.6 weight.
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.

Menu

Statistics > Linear models and related > Errors-in-variables regression
Description
eivreg fits errors-in-variables regression models.
Options
Model
reliab(indepvar # indepvar # . . . ) specifies the measurement reliability for each independent
variable measured with error. Reliabilities are specified as pairs consisting of an independent
variable name (a name that appears in indepvars) and the corresponding reliability r, 0 < r ≤ 1.
Independent variables for which no reliability is specified are assumed to have reliability 1. If the
option is not specified, all variables are assumed to have reliability 1, and the result is thus the
same as that produced by regress (the ordinary least-squares results).
Reporting
level(#); see [R] estimation options.
display_options: noomitted, vsquish, noemptycells, baselevels, allbaselevels; see [R] estimation options.
The following option is available with eivreg but is not shown in the dialog box:
coeflegend; see [R] estimation options.
Remarks
For an introduction to errors-in-variables regression, see Draper and Smith (1998, 89–91) or
Kmenta (1997, 352–357). Treiman (2009, 258–261) compares the results of errors-in-variables regression with conventional regression.
Errors-in-variables regression models are useful when one or more of the independent variables are
measured with additive noise. Standard regression (as performed by regress) would underestimate
the effect of the variable, and the other coefficients in the model can be biased to the extent that
they are correlated with the poorly measured variable. You can adjust for the biases if you know the
reliability:

$$ r = 1 - \frac{\text{noise variance}}{\text{total variance}} $$

That is, given the model $y = X\beta + u$, for some variable $x_i$ in $X$, the $x_i$ is observed with error, $x_i = x_i^* + e$, and the noise variance is the variance of $e$. The total variance is the variance of $x_i$.
Example 1
Say that in our automobile data, the weight of cars was measured with error, and the reliability
of our measured weight is 0.85. The result of this would be to underestimate the effect of weight
in a regression of, say, price on weight and foreign, and it would also bias the estimate of the
coefficient on foreign (because being of foreign manufacture is correlated with the weight of cars).
We would ignore all of this if we fit the model with regress:
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. regress price weight foreign

      Source         SS        df        MS             Number of obs =      74
                                                        F(  2,    71) =   35.35
       Model    316859273       2   158429637           Prob > F      =  0.0000
    Residual    318206123      71   4481776.38          R-squared     =  0.4989
                                                        Adj R-squared =  0.4848
       Total    635065396      73   8699525.97          Root MSE      =    2117

       price        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      weight     3.320737   .3958784     8.39   0.000     2.531378    4.110096
     foreign     3637.001    668.583     5.44   0.000     2303.885    4970.118
       _cons    -4942.844   1345.591    -3.67   0.000    -7625.876   -2259.812
With eivreg, we can account for our measurement error:
. eivreg price weight foreign, r(weight .85)

     assumed                 Errors-in-variables regression
  variable   reliability
                             Number of obs =      74
    weight        0.8500     F(  2,    71) =   50.37
         *        1.0000     Prob > F      =  0.0000
                             R-squared     =  0.6483
                             Root MSE      = 1773.54

       price        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      weight      4.31985    .431431    10.01   0.000     3.459601    5.180099
     foreign      4637.32   624.5362     7.43   0.000      3392.03    5882.609
       _cons    -8257.017   1452.086    -5.69   0.000    -11152.39   -5361.639
The effect of weight is increased — as we knew it would be — and here the effect of foreign manufacture
is also increased. A priori, we knew only that the estimate of foreign might be biased; we did not
know the direction.
Technical note
Swept under the rug in our example is how we would determine the reliability, r. We can easily
see that a variable is measured with error, but we may not know the reliability because the ingredients
for calculating r depend on the unobserved noise.
For our example, we made up a value for r, and in fact we do not believe that weight is measured
with error at all, so the reported eivreg results have no validity. The regress results were the
statistically correct results here.
But let’s say that we do suspect that weight is measured with error and that we do not know r.
We could then experiment with various values of r to describe the sensitivity of our estimates to
possible error levels. We may not know r, but r does have a simple interpretation, and we could
probably produce a sensible range for r by thinking about how the data were collected.
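One way to carry out such a sensitivity analysis is to loop over a grid of candidate reliabilities and watch how the coefficient of interest moves (a sketch; the grid of r values is illustrative):

. foreach r of numlist .70 .80 .90 1 {
  2.   quietly eivreg price weight foreign, r(weight `r')
  3.   display "r = `r'   b[weight] = " %9.4f _b[weight]
  4. }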
If the reliability, r, is less than the R2 from a regression of the poorly measured variable on all
the other variables, including the dependent variable, the information might as well not have been
collected; no adjustment to the final results is possible. For our automobile data, running a regression
of weight on foreign and price would result in an R2 of 0.6743. Thus the reliability must be at
least 0.6743 here. If we specify a reliability that is too small, eivreg will inform us and refuse to
fit the model:
. eivreg price weight foreign, r(weight .6742)
reliability r() too small
r(399);
Returning to our problem of how to estimate r, too small or not, if the measurements are summaries
of scaled items, the reliability may be estimated using the alpha command; see [R] alpha. If the
score is computed from factor analysis and the data are scored using predict’s default options (see
[MV] factor postestimation), the square of the standard deviation of the score is an estimate of the
reliability.
Technical note
Consider a model with more than one variable measured with error. For instance, say that our
model is that price is a function of weight, foreign, and mpg and that both weight and mpg are
measured with error.
. eivreg price weight foreign mpg, r(weight .85 mpg .9)

     assumed                 Errors-in-variables regression
  variable   reliability
                             Number of obs =      74
    weight        0.8500     F(  3,    70) =  429.14
       mpg        0.9000     Prob > F      =  0.0000
         *        1.0000     R-squared     =  0.9728
                             Root MSE      =  496.41

       price        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      weight     12.88302   .6820532    18.89   0.000     11.52271    14.24333
     foreign     8268.951   352.8719    23.43   0.000      7565.17    8972.732
         mpg     999.2043   73.60037    13.58   0.000      852.413    1145.996
       _cons    -56473.19   3710.015   -15.22   0.000    -63872.58    -49073.8
Saved results

eivreg saves the following in e():

Scalars
    e(N)             number of observations
    e(df_m)          model degrees of freedom
    e(df_r)          residual degrees of freedom
    e(r2)            R-squared
    e(F)             F statistic
    e(rmse)          root mean squared error
    e(rank)          rank of e(V)

Macros
    e(cmd)           eivreg
    e(cmdline)       command as typed
    e(depvar)        name of dependent variable
    e(rellist)       indepvars and associated reliabilities
    e(wtype)         weight type
    e(wexp)          weight expression
    e(properties)    b V
    e(predict)       program used to implement predict
    e(asbalanced)    factor variables fvset as asbalanced
    e(asobserved)    factor variables fvset as asobserved

Matrices
    e(b)             coefficient vector
    e(V)             variance–covariance matrix of the estimators

Functions
    e(sample)        marks estimation sample
Methods and formulas
eivreg is implemented as an ado-file.
Let the model to be fit be

$$ y = X^{*}\beta + e $$
$$ X = X^{*} + U $$

where $X^{*}$ are the true values and $X$ are the observed values. Let $W$ be the user-specified weights. If no weights are specified, $W = I$. If weights are specified, let $v$ be the specified weights. If fweight frequency weights are specified, then $W = \operatorname{diag}(v)$. If aweight analytic weights are specified, then $W = \operatorname{diag}\{v\,(\mathbf{1}'\mathbf{1})/(\mathbf{1}'v)\}$, meaning that the weights are normalized to sum to the number of observations.

The estimates $b$ of $\beta$ are obtained as $A^{-1}X'Wy$, where $A = X'WX - S$. $S$ is a diagonal matrix with elements $N(1 - r_i)s_i^2$, where $N$ is the number of observations, $r_i$ is the user-specified reliability coefficient for the $i$th explanatory variable or 1 if not specified, and $s_i^2$ is the (appropriately weighted) variance of the variable.

The variance–covariance matrix of the estimators is obtained as $s^2 A^{-1}X'WX\,A^{-1}$, where $s^2 = (y'Wy - bAb')/(N - p)$, $s$ being the root mean squared error and $p$ the number of estimated parameters.
References
Draper, N., and H. Smith. 1998. Applied Regression Analysis. 3rd ed. New York: Wiley.
Kmenta, J. 1997. Elements of Econometrics. 2nd ed. Ann Arbor: University of Michigan Press.
Treiman, D. J. 2009. Quantitative Data Analysis: Doing Social Research to Test Ideas. San Francisco, CA: Jossey-Bass.
Also see
[R] eivreg postestimation — Postestimation tools for eivreg
[R] regress — Linear regression
[U] 20 Estimation and postestimation commands
Title
eivreg postestimation — Postestimation tools for eivreg
Description
The following postestimation commands are available for eivreg:
  command      description
  estat        VCE and estimation sample summary
  estimates    cataloging estimation results
  lincom       point estimates, standard errors, testing, and inference for linear
               combinations of coefficients
  linktest     link test for model specification
  margins      marginal means, predictive margins, marginal effects, and average
               marginal effects
  nlcom        point estimates, standard errors, testing, and inference for nonlinear
               combinations of coefficients
  predict      predictions, residuals, influence statistics, and other diagnostic measures
  predictnl    point estimates, standard errors, testing, and inference for generalized
               predictions
  test         Wald tests of simple and composite linear hypotheses
  testnl       Wald tests of nonlinear hypotheses

See the corresponding entries in the Base Reference Manual for details.
Syntax for predict
    predict [type] newvar [if] [in] [, statistic]

  statistic      description
  Main
    xb           linear prediction; the default
    residuals    residuals
    stdp         standard error of the prediction
    stdf         standard error of the forecast
    pr(a,b)      Pr(a < y_j < b)
    e(a,b)       E(y_j | a < y_j < b)
    ystar(a,b)   E(y*_j), y*_j = max{a, min(y_j, b)}

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.

where a and b may be numbers or variables; a missing (a ≥ .) means −∞, and b missing (b ≥ .) means +∞; see [U] 12.2.1 Missing values.
Menu
Statistics > Postestimation > Predictions, residuals, etc.
Options for predict
Main
xb, the default, calculates the linear prediction.
residuals calculates the residuals, that is, y_j − x_j b.
stdp calculates the standard error of the prediction, which can be thought of as the standard error of
the predicted expected value or mean for the observation’s covariate pattern. The standard error
of the prediction is also referred to as the standard error of the fitted value.
stdf calculates the standard error of the forecast, which is the standard error of the point prediction
for 1 observation and is commonly referred to as the standard error of the future or forecast value.
By construction, the standard errors produced by stdf are always larger than those produced by
stdp; see Methods and formulas in [R] regress.
pr(a,b) calculates Pr(a < x_j b + u_j < b), the probability that y_j|x_j would be observed in the interval (a, b).

    a and b may be specified as numbers or variable names; lb and ub are variable names;
    pr(20,30) calculates Pr(20 < x_j b + u_j < 30);
    pr(lb,ub) calculates Pr(lb < x_j b + u_j < ub); and
    pr(20,ub) calculates Pr(20 < x_j b + u_j < ub).

    a missing (a ≥ .) means −∞; pr(.,30) calculates Pr(−∞ < x_j b + u_j < 30);
    pr(lb,30) calculates Pr(−∞ < x_j b + u_j < 30) in observations for which lb ≥ . and calculates Pr(lb < x_j b + u_j < 30) elsewhere.

    b missing (b ≥ .) means +∞; pr(20,.) calculates Pr(+∞ > x_j b + u_j > 20);
    pr(20,ub) calculates Pr(+∞ > x_j b + u_j > 20) in observations for which ub ≥ . and calculates Pr(20 < x_j b + u_j < ub) elsewhere.

e(a,b) calculates E(x_j b + u_j | a < x_j b + u_j < b), the expected value of y_j|x_j conditional on y_j|x_j being in the interval (a, b), meaning that y_j|x_j is censored. a and b are specified as they are for pr().

ystar(a,b) calculates E(y*_j), where y*_j = a if x_j b + u_j ≤ a, y*_j = b if x_j b + u_j ≥ b, and y*_j = x_j b + u_j otherwise, meaning that y*_j is truncated. a and b are specified as they are for pr().
Methods and formulas
All postestimation commands listed above are implemented as ado-files.
Also see
[R] eivreg — Errors-in-variables regression
[U] 20 Estimation and postestimation commands
Title
error messages — Error messages and return codes
Description
Whenever Stata detects that something is wrong — that what you typed is uninterpretable, that you
are trying to do something you should not be trying to do, or that you requested the impossible — Stata
responds by typing a message describing the problem, together with a return code. For instance,
. lsit
unrecognized command: lsit
r(199);
. list myvar
variable myvar not found
r(111);
. test a=b
last estimates not found
r(301);
In each case, the message is probably sufficient to guide you to a solution. When we typed
lsit, Stata responded with “unrecognized command”. We meant to type list. When we typed
list myvar, Stata responded with “variable myvar not found”. There is no variable named myvar
in our data. When we typed test a=b, Stata responded with “last estimates not found”. test tests
hypotheses about previously fit models, and we have not yet fit a model.
The numbers in parentheses in the r(199), r(111), and r(301) messages are called the return
codes. To find out more about these messages, type search rc #, where # is the number returned
in the parentheses.
Example 1
. search rc 301
[P]
error messages . . . . . . . . . . . . . . . . . . . . Return code 301
last estimates not found;
You typed an estimation command such as regress without arguments
or attempted to perform a test or typed predict, but there were no
previous estimation results.
Programmers should see [P] error for details on programming error messages.
Also see
[R] search — Search Stata documentation
Title
estat — Postestimation statistics
Syntax

Common subcommands

Obtain information criteria
    estat ic [, n(#)]

Summarize estimation sample
    estat summarize [eqlist] [, estat_summ_options]

Display covariance matrix estimates
    estat vce [, estat_vce_options]

Command-specific subcommands
    estat subcommand1 [, options1]
    estat subcommand2 [, options2]
    ...

  estat_summ_options   description
  equation             display summary by equation
  labels               display variable labels
  noheader             suppress the header
  noweights            ignore weights
  display_options      control spacing and display of omitted variables and base and
                       empty cells

eqlist is rarely used and specifies the variables, with optional equation name, to be summarized. eqlist may be varlist or (eqname1: varlist) (eqname2: varlist) .... varlist may contain time-series operators; see [U] 11.4.4 Time-series varlists.

  estat_vce_options    description
  covariance           display as covariance matrix; the default
  correlation          display as correlation matrix
  equation(spec)       display only specified equations
  block                display submatrices by equation
  diag                 display submatrices by equation; diagonal blocks only
  format(%fmt)         display format for covariances and correlations
  nolines              suppress lines between equations
  display_options      control display of omitted variables and base and empty cells
Menu
Statistics > Postestimation > Reports and statistics
Description
estat displays scalar- and matrix-valued statistics after estimation; it complements predict,
which calculates variables after estimation. Exactly what statistics estat can calculate depends on
the previous estimation command.
Three sets of statistics are so commonly used that they are available after all estimation commands
that store the model log likelihood. estat ic displays Akaike’s and Schwarz’s Bayesian information
criteria. estat summarize summarizes the variables used by the command and automatically restricts
the sample to e(sample); it also summarizes the weight variable and cluster structure, if specified.
estat vce displays the covariance or correlation matrix of the parameter estimates of the previous
model.
Option for estat ic
n(#) specifies the N to be used in calculating BIC; see [R] BIC note.
Options for estat summarize
equation requests that the dependent variables and the independent variables in the equations be
displayed in the equation-style format of estimation commands, repeating the summary information
about variables entered in more than one equation.
labels displays variable labels.
noheader suppresses the header.
noweights ignores the weights, if any, from the previous estimation command. The default when
weights are present is to perform a weighted summarize on all variables except the weight variable
itself. An unweighted summarize is performed on the weight variable.
display_options: noomitted, vsquish, noemptycells, baselevels, allbaselevels; see [R] estimation options.
Options for estat vce
covariance displays the matrix as a variance–covariance matrix; this is the default.
correlation displays the matrix as a correlation matrix rather than a variance–covariance matrix.
rho is a synonym.
equation(spec) selects part of the VCE to be displayed. If spec is eqlist, the VCE for the listed
equations is displayed. If spec is eqlist1 \ eqlist2, the part of the VCE associated with the equations
in eqlist1 (rowwise) and eqlist2 (columnwise) is displayed. If spec is *, all equations are displayed.
equation() implies block if diag is not specified.
block displays the submatrices pertaining to distinct equations separately.
diag displays the diagonal submatrices pertaining to distinct equations separately.
estat — Postestimation statistics
397
format(%fmt) specifies the number format for displaying the elements of the matrix. The default is format(%10.0g) for covariances and format(%8.4f) for correlations. See [U] 12.5 Formats: Controlling how data are displayed for more information.
nolines suppresses lines between equations.
display_options: noomitted, noemptycells, baselevels, allbaselevels; see [R] estimation options.
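For instance, after a multiple-equation fit such as the mlogit model in example 4 below, one might type (a sketch):

. estat vce, equation(Prepaid)
. estat vce, equation(Uninsure \ Prepaid)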
Remarks
estat displays a variety of scalar- and matrix-valued statistics after you have estimated the
parameters of a model. Exactly what statistics estat can calculate depends on the estimation
command used, and command-specific statistics are detailed in that command’s postestimation manual
entry. The rest of this entry discusses three sets of statistics that are available after all estimation
commands.
Remarks are presented under the following headings:
estat ic
estat summarize
estat vce
estat ic
estat ic calculates two information criteria used to compare models. Unlike likelihood-ratio,
Wald, and similar testing procedures, the models need not be nested to compare the information
criteria. Because they are based on the log-likelihood function, information criteria are available only
after commands that report the log likelihood.
In general, “smaller is better”: given two models, the one with the smaller AIC fits the data better
than the one with the larger AIC. As with the AIC, a smaller BIC indicates a better-fitting model. For
AIC and BIC formulas, see Methods and formulas.
Example 1
In [R] mlogit, we fit a model explaining the type of insurance a person has on the basis of age,
gender, race, and site of study. Here we refit the model with and without the site dummies and
compare the models.
. use http://www.stata-press.com/data/r11/sysdsn1
(Health insurance data)
. mlogit insure age male nonwhite
(output omitted)
. estat ic

    Model      Obs     ll(null)    ll(model)     df         AIC         BIC
        .      615    -555.8545    -545.5833      8    1107.167     1142.54

Note: N=Obs used in calculating BIC; see [R] BIC note

. mlogit insure age male nonwhite i.site
(output omitted)
. estat ic

    Model      Obs     ll(null)    ll(model)     df         AIC         BIC
        .      615    -555.8545    -534.3616     12    1092.723    1145.783

Note: N=Obs used in calculating BIC; see [R] BIC note
The AIC indicates that the model including the site dummies fits the data better, whereas the BIC
indicates the opposite. As is often the case, different model-selection criteria have led to conflicting
conclusions.
Technical note
glm and binreg, ml report a slightly different version of AIC and BIC; see [R] glm for the
formulas used. That version is commonly used within the GLM literature; see, for example, Hardin
and Hilbe (2007). The literature on information criteria is vast; see, among others, Akaike (1973),
Sawa (1978), and Raftery (1995). Judge et al. (1985) contains a discussion of using information criteria
in econometrics. Royston and Sauerbrei (2008, chap. 2) examine the use of information criteria as
an alternative to stepwise procedures for selecting model variables.
estat summarize
Often when fitting a model, you will also be interested in obtaining summary statistics, such as
the sample means and standard deviations of the variables in the model. estat summarize makes
this process simple. The output displayed is similar to that obtained by typing
. summarize varlist if e(sample)
without the need to type the varlist containing the dependent and independent variables.
Example 2
Continuing with the previous multinomial logit model, here we summarize the variables by using
estat summarize.
. estat summarize
Estimation sample mlogit                 Number of obs =     615

    Variable        Mean    Std. Dev.        Min        Max
      insure    1.596748     .6225846          1          3
         age    44.46832     14.18523    18.1109    86.0725
        male    .2504065     .4335998          0          1
    nonwhite     .196748     .3978638          0          1
        site
          2     .3707317     .4833939          0          1
          3     .3138211     .4644224          0          1
The output in the previous example contains all the variables in one table, though mlogit presents
its results in a multiple-equation format. For models in which the same variables appear in all
equations, that is fine; but for other multiple-equation models, we may prefer to have the variables
separated by the equation in which they appear. The equation option makes this possible.
Example 3
Systems of simultaneous equations typically have different variables in each equation, and the
equation option of estat summarize is helpful in such situations. In example 2 of [R] reg3, we
have a model of supply and demand. We first refit the model and then call estat summarize.
. use http://www.stata-press.com/data/r11/supDem
. reg3 (Demand:quantity price pcompete income) (Supply:quantity price praw),
> endog(price)
(output omitted)
. estat summarize, equation
Estimation sample reg3                   Number of obs =      49

    Variable        Mean    Std. Dev.        Min        Max
depvar
    quantity    12.61818     2.774952    7.71069    20.0477
    quantity    12.61818     2.774952    7.71069    20.0477
Demand
       price    32.70944     2.882684    26.3819    38.4769
    pcompete    5.929975     3.508264    .207647    11.5549
      income    7.811735      4.18859    .570417    14.0077
Supply
       price    32.70944     2.882684    26.3819    38.4769
        praw    4.740891     2.962565    .151028    9.79881
The first block of the table contains statistics on the dependent (or, more accurately, left-hand-side)
variables, and because we specified quantity as the left-hand-side variable in both equations, it is
listed twice. The second block refers to the variables in the first equation we specified, which we
labeled “Demand” in our call to reg3; and the final block refers to the supply equation.
estat vce
estat vce allows you to display the VCE of the parameters of the previously fit model, as either
a covariance matrix or a correlation matrix.
Example 4
Returning to the mlogit example, we type
. use http://www.stata-press.com/data/r11/sysdsn1
(Health insurance data)
. mlogit insure age male nonwhite, nolog

Multinomial logistic regression                   Number of obs   =        615
                                                  LR chi2(6)      =      20.54
                                                  Prob > chi2     =     0.0022
Log likelihood = -545.58328                       Pseudo R2       =     0.0185

      insure        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
Indemnity          (base outcome)
Prepaid
         age    -.0111915   .0060915    -1.84   0.066    -.0231305    .0007475
        male     .5739825   .2005221     2.86   0.004     .1809665    .9669985
    nonwhite     .7312659    .218978     3.34   0.001      .302077    1.160455
       _cons     .1567003   .2828509     0.55   0.580    -.3976773    .7110778
Uninsure
         age    -.0058414   .0114114    -0.51   0.609    -.0282073    .0165245
        male     .5102237   .3639793     1.40   0.161    -.2031626     1.22361
    nonwhite     .4333141   .4106255     1.06   0.291     -.371497    1.238125
       _cons    -1.811165   .5348606    -3.39   0.001    -2.859473   -.7628578

. estat vce, block
Covariance matrix of coefficients of mlogit model

covariances of equation Indemnity
                  age        male    nonwhite       _cons
      age           0
     male           0           0
 nonwhite           0           0           0
    _cons           0           0           0           0

covariances of equation Prepaid (row) by equation Indemnity (column)
                  age        male    nonwhite       _cons
      age           0           0           0           0
     male           0           0           0           0
 nonwhite           0           0           0           0
    _cons           0           0           0           0

covariances of equation Prepaid
                  age        male    nonwhite       _cons
      age   .00003711
     male  -.00015303    .0402091
 nonwhite  -.00008948   .00470608   .04795135
    _cons  -.00159095  -.00398961  -.00628886   .08000462

covariances of equation Uninsure (row) by equation Indemnity (column)
                  age        male    nonwhite       _cons
      age           0           0           0           0
     male           0           0           0           0
 nonwhite           0           0           0           0
    _cons           0           0           0           0

covariances of equation Uninsure (row) by equation Prepaid (column)
                  age        male    nonwhite       _cons
      age   .00001753  -.00007926  -.00004564  -.00076886
     male  -.00007544   .02188398    .0023186  -.00145923
 nonwhite  -.00004577   .00250588   .02813553  -.00263872
    _cons  -.00077045  -.00130535  -.00257593   .03888032

covariances of equation Uninsure
                  age        male    nonwhite       _cons
      age   .00013022
     male  -.00050406   .13248095
 nonwhite  -.00026145   .01505449   .16861327
    _cons  -.00562159  -.01686629  -.02474852   .28607591
The block option is particularly useful for multiple-equation estimators. The first block of output
here corresponds to the VCE of the estimated parameters for the first equation—the square roots of
the diagonal elements of this matrix are equal to the standard errors of the first equation’s parameters.
Similarly, the final block corresponds to the VCE of the parameters for the second equation. The middle
block shows the covariances between the estimated parameters of the first and second equations.
Saved results

estat ic saves the following in r():

Matrices
    r(S)    1 × 6 matrix of results:
            1. sample size
            2. log likelihood of null model
            3. log likelihood of full model
            4. degrees of freedom
            5. AIC
            6. BIC

estat summarize saves the following in r():

Matrices
    r(stats)    k × 4 matrix of means, standard deviations, minimums, and maximums

estat vce saves the following in r():

Matrices
    r(V)    VCE or correlation matrix
Methods and formulas
estat is implemented as an ado-file.
Akaike's (1974) information criterion is defined as

$$ \text{AIC} = -2\,\ln L + 2k $$

where $\ln L$ is the maximized log likelihood of the model and $k$ is the number of parameters estimated. Some authors define the AIC as the expression above divided by the sample size.

Schwarz's (1978) Bayesian information criterion is another measure of fit, defined as

$$ \text{BIC} = -2\,\ln L + k\,\ln N $$

where $N$ is the sample size. See [R] BIC note for additional information on calculating and interpreting BIC.
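These formulas are easy to verify by hand. A sketch using the first mlogit fit from example 1, where k = 8 parameters were estimated (the df column of estat ic):

. quietly mlogit insure age male nonwhite
. display "AIC = " -2*e(ll) + 2*8
. display "BIC = " -2*e(ll) + 8*ln(e(N))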
Hirotugu Akaike (1927– ) was born in Fujinomiya City, Shizuoka Prefecture, Japan. He gained
BA and PhD degrees from the University of Tokyo. Akaike’s career from 1952 at the Institute
of Statistical Mathematics in Japan culminated in service as Director General; since 1994, he
has been Professor Emeritus. His best known work in a prolific career is on what is now known
as the Akaike information criterion (AIC), which was formulated to help selection of the most
appropriate model from a number of candidates.
Gideon E. Schwarz (1933– ) is Professor Emeritus of Statistics at the Hebrew University,
Jerusalem. He was born in Salzburg, Austria, and obtained an MSc in 1956 from the Hebrew
University and a PhD in 1961 from Columbia University. He is interested in stochastic processes,
sequential analysis, probability, and geometry, and is known for the Bayesian information criterion
(BIC).
References
Akaike, H. 1973. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, ed. B. N. Petrov and F. Csaki, 267–281. Budapest: Akadémiai Kiadó.
. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control 19: 716–723.
Belsley, D. A., E. Kuh, and R. E. Welsch. 1980. Regression Diagnostics: Identifying Influential Data and Sources of
Collinearity. New York: Wiley.
Findley, D. F., and E. Parzen. 1995. A conversation with Hirotugu Akaike. Statistical Science 10: 104–117.
Hardin, J. W., and J. M. Hilbe. 2007. Generalized Linear Models and Extensions. 2nd ed. College Station, TX: Stata
Press.
Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics.
2nd ed. New York: Wiley.
Raftery, A. 1995. Bayesian model selection in social research. In Vol. 25 of Sociological Methodology, ed. P. V.
Marsden, 111–163. Oxford: Blackwell.
Royston, P., and W. Sauerbrei. 2008. Multivariable Model-building: A Pragmatic Approach to Regression Analysis
Based on Fractional Polynomials for Modelling Continuous Variables. Chichester, UK: Wiley.
Sawa, T. 1978. Information criteria for discriminating among alternative regression models. Econometrica 46: 1273–1291.
Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics 6: 461–464.
Also see
[R] estimates — Save and manipulate estimation results
[R] summarize — Summary statistics
[P] estat programming — Controlling estat after user-written commands
[U] 20 Estimation and postestimation commands
Title
estimates — Save and manipulate estimation results
Syntax

  command                                 reference
Save and use results from disk
  estimates save filename                 [R] estimates save
  estimates use filename                  [R] estimates save
  estimates describe using filename       [R] estimates describe
  estimates esample: ...                  [R] estimates save

Store and restore estimates in memory
  estimates store name                    [R] estimates store
  estimates restore name                  [R] estimates store
  estimates query                         [R] estimates store
  estimates dir                           [R] estimates store
  estimates drop namelist                 [R] estimates store
  estimates clear                         [R] estimates store

Set titles and notes
  estimates title: text                   [R] estimates title
  estimates title                         [R] estimates title
  estimates notes: text                   [R] estimates notes
  estimates notes                         [R] estimates notes
  estimates notes list ...                [R] estimates notes
  estimates notes drop ...                [R] estimates notes

Report
  estimates describe name                 [R] estimates describe
  estimates replay namelist               [R] estimates replay

Tables and statistics
  estimates table namelist                [R] estimates table
  estimates stats namelist                [R] estimates stats
  estimates for namelist: ...             [R] estimates for
Description
estimates allows you to store and manipulate estimation results:
• You can save estimation results in a file for use in later sessions.
• You can store estimation results in memory so that you can
a. switch among separate estimation results and
b. form tables combining separate estimation results.
Remarks
estimates is for use after you have fit a model, be it with regress, logistic, etc. You can
use estimates after any estimation command, whether it be an official estimation command of Stata
or a user-written one.
estimates has three separate but related capabilities:
1. You can save estimation results in a file on disk so that you can use them later, even in a
different Stata session.
2. You can store up to 300 estimation results in memory so that they are at your fingertips.
3. You can make tables comparing any results you have stored in memory.
Remarks are presented under the following headings:
Saving and using estimation results
Storing and restoring estimation results
Comparing estimation results
Jargon
Saving and using estimation results
After you have fit a model, say, with regress, type
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. regress mpg weight displ foreign
(output omitted )
You can save the results in a file:
. estimates save basemodel
(file basemodel.ster saved)
Later, say, in a different session, you can reload those results:
. estimates use basemodel
The situation is now nearly identical to what it was immediately after you fit the model. You can
replay estimation results:
. regress
(output omitted )
You can perform tests:
. test foreign==0
(output omitted )
estimates — Save and manipulate estimation results
405
And you can use any postestimation command or postestimation capability of Stata. The only difference
is that Stata no longer knows what the estimation sample, e(sample) in Stata jargon, was. When
you reload the estimation results, you might not even have the original data in memory. That is okay.
Stata will know to refuse to calculate anything that can be calculated only on the original estimation
sample.
If it is important that you use a postestimation command that can be used only on the original
estimation sample, there is a way you can do that. You use the original data and then use estimates
esample to tell Stata what the original sample was.
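A sketch of that workflow, continuing the basemodel example above (the varlist given to estimates esample should be the variables used in the original fit):

. use http://www.stata-press.com/data/r11/auto
. estimates use basemodel
. estimates esample: mpg weight displ foreign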
See [R] estimates save for details.
Storing and restoring estimation results
Storing and restoring estimation results in memory is much like saving them to disk. You type
. estimates store base
to save the current estimation results under the name base, and you type
. estimates restore base
to get them back later. You can find out what you have stored by typing
. estimates dir
Saving estimation results to disk is more permanent than storing them in memory, so why would
you want merely to store them? The answer is that, once they are stored, you can use other estimates
commands to produce tables and reports from them.
See [R] estimates store for details about the estimates store and restore commands.
Comparing estimation results
Let’s say that you have done the following:
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. regress mpg weight displ
(output omitted )
. estimates store base
. regress mpg weight displ foreign
(output omitted )
. estimates store alt
You can now get a table comparing the coefficients:
. estimates table base alt

      Variable        base          alt
        weight   -.00656711   -.00677449
  displacement    .00528078    .00192865
       foreign                -1.6006312
         _cons    41.847949    40.084522
estimates table can do much more; see [R] estimates table. Also see [R] estimates stats, which works similarly to estimates table but produces model comparisons in terms of BIC and AIC.
Jargon
You know that if you fit a model, say, by typing
. regress mpg weight displacement
then you can later replay the results by typing
. regress
and you can do tests and calculate other postestimation statistics by typing
. test displacement==0
. estat vif
. predict mpghat
As a result, we often refer to the estimation results or the current estimation results or the most
recent estimation results or the last estimation results or the estimation results in memory.
With estimates store and estimates restore, you can have many estimation results in
memory. One set of those, the set most recently estimated, or the set most recently restored, are the
current or active estimation results, which you can replay, which you can test, or from which you
can calculate postestimation statistics.
Current and active are the two words we will use interchangeably from now on.
Also see
[R] estimates save — Save and use estimation results
[R] estimates describe — Describe estimation results
[R] estimates store — Store and restore estimation results
[R] estimates title — Set title for estimation results
[R] estimates notes — Add notes to estimation results
[R] estimates replay — Redisplay estimation results
[R] estimates table — Compare estimation results
[R] estimates stats — Model statistics
[R] estimates for — Repeat postestimation command across models
Title
estimates describe — Describe estimation results
Syntax
estimates describe
estimates describe name
estimates describe using filename [, number(#)]

Menu

Statistics > Postestimation > Manage estimation results > Describe results
Description
estimates describe describes the current (active) estimates. Reported are the command line
that produced the estimates, any title that was set by estimates title (see [R] estimates title), and
any notes that were added by estimates notes (see [R] estimates notes).
estimates describe name does the same but reports results for estimates stored by estimates
store (see [R] estimates store).
estimates describe using filename does the same but reports results for estimates saved by
estimates save (see [R] estimates save). If filename contains multiple sets of estimates (saved in
it by estimates save, append), the number of sets of estimates is also reported. If filename is
specified without an extension, .ster is assumed.
Option
number(#) specifies that the #th set of estimation results from filename be described. This assumes
that multiple sets of estimation results have been saved in filename by estimates save, append.
The default is number(1).
Remarks
estimates describe can be used to describe the estimation results currently in memory,
. estimates describe
Estimation results produced by
. regress mpg weight displ if foreign
or to describe results saved by estimates save in a .ster file:
. estimates describe using final
Estimation results "Final results" saved on 12apr2009 14:20, produced by
. logistic myopic age sex drug1 drug2 if complete==1
Notes:
1. Used file patient.dta
2. "datasignature myopic age sex drug1 drug2 if complete==1"
reports 148:5(58763):2252897466:3722318443
3. must be reviewed by rgg
Example 1
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. regress mpg weight displ if foreign
(output omitted )
. estimates notes: file `c(filename)'
. datasignature
74:12(71728):3831085005:1395876116
. estimates notes: datasignature report `r(datasignature)'
. estimates save foreign
file foreign.ster saved
. regress mpg weight displ if !foreign
(output omitted )
. estimates describe using foreign
Estimation results saved on 02may2009 10:33, produced by
. regress mpg weight displ if foreign
Notes:
1. file http://www.stata-press.com/data/r11/auto.dta
2. datasignature report 74:12(71728):3831085005:1395876116
Saved results
estimates describe and estimates describe name save the following in r():
Macros
r(title)
title
r(cmdline) original command line
estimates describe using filename saves the above and the following in r():
Scalars
r(datetime)
r(nestresults)
%tc value of date/time file saved
number of sets of estimation results in file
Methods and formulas
estimates describe is implemented as an ado-file.
Also see
[R] estimates — Save and manipulate estimation results
Title
estimates for — Repeat postestimation command across models
Syntax
    estimates for namelist [, options]: postestimation_command

where namelist is a name, a list of names, all, or *. A name may be ., meaning the current (active) estimates. all and * mean the same thing.

  options     description
  noheader    do not display title
  nostop      do not stop if command fails
Description
estimates for performs postestimation command on each estimation result specified.
Options
noheader suppresses the display of the header as postestimation command is executed each time.
nostop specifies that execution of postestimation command is to be performed on the remaining
models even if it fails on some.
Remarks
In the example that follows, we fit a model two different ways, store the results, and then use
estimates for to perform the same test on both of them:
Example 1
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. gen fwgt = foreign*weight
. gen dwgt = !foreign*weight
. gen gpm = 1/mpg
. regress gpm fwgt dwgt displ foreign
(output omitted )
. estimates store reg
. qreg gpm fwgt dwgt displ foreign
(output omitted )
. estimates store qreg
. estimates for reg qreg: test fwgt==dwgt

Model reg

 ( 1)  fwgt - dwgt = 0

       F(  1,    69) =     4.87
            Prob > F =   0.0307

Model qreg

 ( 1)  fwgt - dwgt = 0

       F(  1,    69) =     0.07
            Prob > F =   0.7937
Methods and formulas
estimates for is implemented as an ado-file.
Also see
[R] estimates — Save and manipulate estimation results
Title
estimates notes — Add notes to estimation results
Syntax
estimates notes: text
estimates notes
estimates notes list [in noterange]
estimates notes drop in noterange

where noterange is # or #/# and where # may be a number, the letter f (meaning first), or the letter l (meaning last).
Description
estimates notes: text adds a note to the current (active) estimation results.
estimates notes and estimates notes list list the current notes.
estimates notes drop in noterange eliminates the specified notes.
Remarks
After adding or removing notes, if estimates have been stored, do not forget to store them again.
If estimates have been saved, do not forget to save them again.
Notes are most useful when you intend to save estimation results in a file; see [R] estimates save.
For instance, after fitting a model, you might type
. estimates note: I think these are final
. estimates save lock2
and then, later when going through your files, you could type
. estimates use lock2
. estimates notes
1. I think these are final
Up to 9,999 notes can be attached to estimation results. If estimation results are important, we
recommend that you add a note identifying the .dta dataset you used. The best way to do that is to
type
. estimates notes: file `c(filename)'

because `c(filename)' will expand to include not just the name of the file but also its full path;
see [P] creturn.
If estimation results took a long time to estimate—say, they were produced by asmprobit or
gllamm (see [R] asmprobit and http://www.gllamm.org)—it is also a good idea to add a data signature.
A data signature takes less time to compute than reestimation when you need proof that you really
have the right dataset. The easy way to do that is to type
. datasignature
74:12(71728):3831085005:1395876116
. estimates notes: datasignature reports `r(datasignature)'
Now when you ask to see the notes, you will see
. estimates notes
1. I think these are final
2. file C:\project\one\pat4.dta
3. datasignature reports 74:12(71728):3831085005:1395876116
See [D] datasignature.
Notes need not be positive. You might set a note to be, “I need to check that age is defined
correctly.”
Example 1
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. regress mpg weight displ if foreign
(output omitted )
. estimates notes: file `c(filename)'
. datasignature
74:12(71728):3831085005:1395876116
. estimates notes: datasignature report `r(datasignature)'
. estimates save foreign
file foreign.ster saved
. estimates notes list in 1/2
1. file http://www.stata-press.com/data/r11/auto.dta
2. datasignature report 74:12(71728):3831085005:1395876116
. estimates notes drop in 2
(1 note dropped)
. estimates notes
1. file http://www.stata-press.com/data/r11/auto.dta
Methods and formulas
estimates notes is implemented as an ado-file.
Also see
[R] estimates — Save and manipulate estimation results
Title
estimates replay — Redisplay estimation results
Syntax
estimates replay

estimates replay namelist

where namelist is a name, a list of names, _all, or *. A name may be ., meaning the current (active)
estimates. _all and * mean the same thing.
Menu
Statistics > Postestimation > Manage estimation results > Redisplay estimation output
Description
estimates replay redisplays the current (active) estimation results, just as typing the name of
the estimation command would do.
estimates replay namelist redisplays each specified estimation result. The active estimation
results are left unchanged.
Remarks
In the example that follows, we fit a model two different ways, store the results, use estimates
for to perform the same test on both of them, and then replay the results:
Example 1
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. gen fwgt = foreign*weight
. gen dwgt = !foreign*weight
. gen gpm = 1/mpg
. regress gpm fwgt dwgt displ foreign
(output omitted )
. estimates store reg
. qreg gpm fwgt dwgt displ foreign
(output omitted )
. estimates store qreg
. estimates for reg qreg: test fwgt==dwgt

Model reg

 ( 1)  fwgt - dwgt = 0

       F(  1,    69) =    4.87
            Prob > F =    0.0307

Model qreg

 ( 1)  fwgt - dwgt = 0

       F(  1,    69) =    0.07
            Prob > F =    0.7937

. estimates replay

Model qreg

Median regression                                    Number of obs =        74
  Raw sum of deviations .7555689 (about .05)
  Min sum of deviations .3201479                     Pseudo R2     =    0.5763

         gpm        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

        fwgt     .0000155   2.87e-06     5.40   0.000     9.76e-06    .0000212
        dwgt     .0000147   1.88e-06     7.81   0.000     .0000109    .0000184
displacement     .0000179   .0000147     1.22   0.226    -.0000113    .0000471
     foreign     .0065352   .0078098     0.84   0.406     -.009045    .0221153
       _cons     .0003134   .0042851     0.07   0.942    -.0082351    .0088618
. estimates replay reg

Model reg

      Source         SS       df       MS              Number of obs =      74
                                                       F(  4,    69) =   61.62
       Model    .009342436     4  .002335609           Prob > F      =  0.0000
    Residual    .002615192    69  .000037901           R-squared     =  0.7813
                                                       Adj R-squared =  0.7686
       Total    .011957628    73  .000163803           Root MSE      =  .00616

         gpm        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

        fwgt       .00002   3.27e-06     6.12   0.000     .0000135    .0000265
        dwgt     .0000123   2.30e-06     5.36   0.000     7.75e-06    .0000169
displacement     .0000296   .0000187     1.58   0.119    -7.81e-06     .000067
     foreign    -.0117756   .0086088    -1.37   0.176    -.0289497    .0053986
       _cons     .0053352   .0046748     1.14   0.258    -.0039909    .0146612

Methods and formulas
estimates replay is implemented as an ado-file.

Also see
[R] estimates — Save and manipulate estimation results
Title
estimates save — Save and use estimation results
Syntax
estimates save filename [, append replace]

estimates use filename [, number(#)]

estimates esample: [varlist] [if] [in] [weight] [, replace stringvars(varlist) zeroweight]

estimates esample
Menu

estimates save
Statistics > Postestimation > Manage estimation results > Save to disk

estimates use
Statistics > Postestimation > Manage estimation results > Load from disk
Description
estimates save filename saves the current (active) estimation results in filename.
estimates use filename loads the results saved in filename into the current (active) estimation
results.
In both cases, if filename is specified without an extension, .ster is assumed.
estimates esample: (note the colon) resets e(sample). After estimates use filename,
e(sample) is set to contain 0, meaning that none of the observations currently in memory was used
in obtaining the estimates.
estimates esample (without a colon) displays how e(sample) is currently set.
Options
append, used with estimates save, specifies that results be appended to an existing file. If the file
does not already exist, a new file is created.
replace, used with estimates save, specifies that filename can be replaced if it already exists.
number(#), used with estimates use, specifies that the #th set of estimation results from filename
be loaded. This assumes that multiple sets of estimation results have been saved in filename by
estimates save, append. The default is number(1).
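A sketch of the append-and-retrieve workflow (the filename mymodels is hypothetical):

. regress mpg weight
. estimates save mymodels, replace
. regress mpg weight foreign
. estimates save mymodels, append
. estimates use mymodels, number(2)

The first save creates mymodels.ster, append adds a second set of results to it, and number(2) then loads that second set.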
replace, used with estimates esample:, specifies that e(sample) can be replaced even if it is
already set.
stringvars(varlist), used with estimates esample:, specifies string variables. Observations
containing variables that contain "" will be omitted from e(sample).
zeroweight, used with estimates esample:, specifies that observations with zero weights are to
be included in e(sample).
Remarks
See [R] estimates for an overview of the estimates commands.
For a description of estimates save and estimates use, see Saving and using estimation
results in [R] estimates.
The rest of this entry concerns e(sample).
Remarks are presented under the following headings:
Setting e(sample)
Resetting e(sample)
Determining who set e(sample)
Setting e(sample)
After estimates use filename, the situation is nearly identical to what it was immediately after
you fit the model. The one difference is that e(sample) is set to 0.
e(sample) is Stata’s function to mark which observations among those currently in memory were
used in producing the estimates. For instance, you might type
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. regress mpg weight displ if foreign
(output omitted )
. summarize mpg if e(sample)
(output omitted )
and summarize would report the summary statistics for the observations regress in fact used, which
would exclude not only observations for which foreign = 0 but also any observations for which
mpg, weight, or displ was missing.
If you saved the above estimation results and then reloaded them, however, summarize mpg if
e(sample) would produce
. summarize mpg if e(sample)

    Variable         Obs        Mean    Std. Dev.       Min        Max

         mpg           0
Stata thinks that none of these observations was used in producing the estimates currently loaded.
What else could Stata think? When you estimates use filename, you do not have to have the
original data in memory. Even if you do have data in memory that look like the original data, it might
not be. Setting e(sample) to 0 is the safe thing to do. There are some postestimation statistics, for
instance, that are appropriate only when calculated on the estimation sample. Setting e(sample) to
0 ensures that, should you ask for one of them, you will get back a null result.
We recommend that you leave e(sample) set to 0. But what if you really need to calculate that
postestimation statistic? Well, you can get it, but you are going to take responsibility for setting
e(sample) correctly. Here we just happen to know that all the foreign observations were used, so
we can type
. estimates esample: if foreign

If all the observations had been used, we could simply type

. estimates esample:

The safe thing to do, however, is to look at the estimation command (estimates describe will
show it to you) and then type

. estimates esample: mpg weight displ if foreign
Resetting e(sample)
estimates esample: will allow you to not only set but also reset e(sample). If e(sample)
has already been set (say that you just fit the model) and you try to set it, you will see
. estimates esample: mpg weight displ if foreign
no; e(sample) already set
r(322);
Here you can specify the replace option:
. estimates esample: mpg weight displ if foreign, replace
We do not recommend resetting e(sample), but the situation can arise where you need to. Imagine
that you estimates use filename, you set e(sample), and then you realize that you set it wrong.
Here you would want to reset it.
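A minimal sketch of that recovery, assuming a hypothetical file myest.ster in which the wrong if condition was set first:

. estimates use myest
. estimates esample: mpg weight displ if !foreign
. estimates esample: mpg weight displ if foreign, replace

Without replace, the second estimates esample: would fail with the r(322) error shown above.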
Determining who set e(sample)
estimates esample without a colon will report whether and how e(sample) was set. You might
see
. estimates esample
e(sample) set by estimation command
or
. estimates esample
e(sample) set by user
or
. estimates esample
e(sample) not set (0 assumed)
Saved results
estimates esample without the colon saves macro r(who), which will contain cmd, user, or
zero'd.
Methods and formulas
estimates save, estimates use, estimates esample:, and estimates esample are implemented as ado-files.
Also see
[R] estimates — Save and manipulate estimation results
Title
estimates stats — Model statistics
Syntax
estimates stats [namelist] [, n(#)]

where namelist is a name, a list of names, _all, or *. A name may be ., meaning the current (active)
estimates. _all and * mean the same thing.
Menu
Statistics > Postestimation > Manage estimation results > Table of fit statistics
Description
estimates stats reports model-selection statistics, including the Akaike information criterion
(AIC) and the Bayesian information criterion (BIC). These measures are appropriate for maximum
likelihood models.
If estimates stats is used for a non–likelihood-based model, such as qreg, missing values are
reported.
Option
n(#) specifies the N to be used in calculating BIC; see [R] BIC note.
Remarks
If you type estimates stats without arguments, a table for the most recent estimation results
will be shown:

. logistic foreign mpg weight displ
(output omitted )
. estimates stats

       Model       Obs    ll(null)   ll(model)     df        AIC        BIC

           .        74   -45.03321   -20.59083      4   49.18167   58.39793

Note:  N=Obs used in calculating BIC; see [R] BIC note

Regarding the note at the bottom of the table, N is an ingredient in the calculation of BIC; see
[R] BIC note. The note changes if you specify the n() option, which tells estimates stats what
N to use. N = Obs is the default.
Regarding the table itself, ll(null) is the log likelihood for the constant-only model, ll(model)
is the log likelihood for the model, df is the number of degrees of freedom, and AIC and BIC are
the Akaike and Bayesian information criteria.
Models with smaller values of an information criterion are considered preferable.
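Here AIC = -2 ll(model) + 2 df and BIC = -2 ll(model) + df ln(N) (see [R] BIC note), so the tabled values can be reproduced by hand from the saved results; a quick sketch:

. quietly logistic foreign mpg weight displ
. display -2*e(ll) + 2*4
. display -2*e(ll) + 4*ln(e(N))

Both displays match the AIC and BIC entries above (4 is the df reported in the table).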
estimates stats can compare estimation results:
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. logistic foreign mpg weight displ
(output omitted )
. estimates store full
. logistic foreign mpg weight
(output omitted )
. estimates store sub
. estimates stats full sub

       Model       Obs    ll(null)   ll(model)     df        AIC        BIC

        full        74   -45.03321   -20.59083      4   49.18167   58.39793
         sub        74   -45.03321   -27.17516      3   60.35031   67.26251

Note:  N=Obs used in calculating BIC; see [R] BIC note
Saved results
estimates stats saves the following in r():
Matrices
r(S) matrix with 6 columns (N, ll0, ll, df, AIC, and BIC) and rows corresponding to models in table
Methods and formulas
estimates stats is implemented as an ado-file.
See [R] BIC note.
Also see
[R] estimates — Save and manipulate estimation results
Title
estimates store — Store and restore estimation results
Syntax
estimates store name [, nocopy]

estimates restore name

estimates query

estimates dir [namelist]

estimates drop namelist

estimates clear

where namelist is a name, a list of names, _all, or *. _all and * mean the same thing.
Menu

estimates store
Statistics > Postestimation > Manage estimation results > Store in memory

estimates restore
Statistics > Postestimation > Manage estimation results > Restore from memory

estimates dir
Statistics > Postestimation > Manage estimation results > List results stored in memory

estimates drop
Statistics > Postestimation > Manage estimation results > Drop from memory
Description
estimates store name saves the current (active) estimation results under the name name.
estimates restore name loads the results saved under name into the current (active) estimation
results.
estimates query tells you whether the current (active) estimates have been stored and, if so,
the name.
estimates dir displays a list of the stored estimates.
estimates drop namelist drops the specified stored estimation results.
estimates clear drops all stored estimation results.
estimates clear, estimates drop all, and estimates drop * do the same thing. estimates
drop and estimates clear do not eliminate the current (active) estimation results.
Option
nocopy, used with estimates store, specifies that the current (active) estimation results are to be
moved into name rather than copied. Typing
. estimates store hold, nocopy
is the same as typing
. estimates store hold
. ereturn clear
except that the former is faster. The nocopy option is sometimes used by programmers.
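A short sketch of the difference:

. regress mpg weight
. estimates store hold, nocopy
. ereturn list
. estimates restore hold

After the nocopy store, ereturn list displays nothing because the active results were moved rather than copied; estimates restore hold makes them active again.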
Remarks
estimates store stores estimation results in memory so that you can access them later.
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. regress mpg weight displ
(output omitted )
. estimates store myreg
. ... you do other things, including fitting other models ...
. estimates restore myreg
. regress
(same output shown again)
After estimates restore myreg, things are once again just as they were, estimationwise, just
after you typed regress mpg weight displ.
estimates store stores results in memory. When you exit Stata, those stored results vanish. If
you wish to make a permanent copy of your estimation results, see [R] estimates save.
The purpose of making copies in memory is 1) so that you can quickly switch between them and
2) so that you can make tables comparing estimation results. Concerning the latter, see [R] estimates
table and [R] estimates stats.
Saved results
estimates dir saves the following in r():
Macros
r(names) names of stored results
Methods and formulas
estimates store, estimates restore, estimates query, estimates dir, estimates drop,
and estimates clear are implemented as ado-files.
References
Jann, B. 2005. Making regression tables from stored estimates. Stata Journal 5: 288–308.
Jann, B. 2007. Making regression tables simplified. Stata Journal 7: 227–244.
Also see
[R] estimates — Save and manipulate estimation results
Title
estimates table — Compare estimation results
Syntax
estimates table [namelist] [, options]

where namelist is a name, a list of names, _all, or *. A name may be ., meaning the current (active)
estimates. _all and * mean the same thing.
options                  description

Main
  stats(scalarlist)      report scalarlist in table
  star[(#1 #2 #3)]       use stars to denote significance levels

Options
  keep(coeflist)         report coefficients in order specified
  drop(coeflist)         omit specified coefficients from table
  equations(matchlist)   match equations of models as specified

Numerical formats
  b[(%fmt)]              how to format coefficients, which are always reported
  se[(%fmt)]             report standard errors and use optional format
  t[(%fmt)]              report t or z and use optional format
  p[(%fmt)]              report p-values and use optional format
  stfmt(%fmt)            how to format scalar statistics

General format
  varwidth(#)            use # characters to display variable names and statistics
  modelwidth(#)          use # characters to display model names
  eform                  display coefficients in exponentiated form
  label                  display variable labels rather than variable names
  newpanel               display statistics in separate table from coefficients
  style(oneline)         put vertical line after variable names; the default
  style(columns)         put vertical line separating every column
  style(noline)          suppress all vertical lines
  coded                  display compact table

Reporting
  display_options        control spacing and display of omitted variables and base and empty cells
† title(string)          title for table

† title() does not appear in the dialog box.
where

• A scalarlist is a list of any or all of the names of scalars stored in e(), plus aic, bic, and
  rank.

• #1 #2 #3 are three numbers such as .05 .01 .001.

• A coeflist is a list of coefficient names, each name of which may be simple (e.g., price),
  an equation name followed by a colon (e.g., mean:), or a full name (e.g., mean:price).
  Names are separated by blanks.

• A matchlist specifies how equations from different estimation results are to be matched. If
  you need to specify a matchlist, the solution is usually 1, as in equations(1). The full
  syntax is

      matchlist := term [, term ...]
          term := [eqname =] #:#...:#
                  [eqname =] #

  See equations() under Options below.

• %fmt is any valid Stata numerical display format.
Menu
Statistics > Postestimation > Manage estimation results > Table of fit statistics
Description
estimates table displays a table of coefficients and statistics for one or more sets of estimation
results.
Options
Main
stats(scalarlist) specifies one or more scalar statistics to be displayed in the table. scalarlist may
contain

    aic     Akaike's information criterion
    bic     Schwarz's Bayesian information criterion
    rank    rank of e(V) (# of free parameters in model)

along with the names of any scalars stored in e(). The specified statistics do not have to be
available for all estimation results being displayed.

For example, stats(N ll chi2 aic) specifies that e(N), e(ll), e(chi2), and AIC be included.
In Stata, e(N) records the number of observations; e(ll), the log likelihood; and e(chi2), the
chi-squared test that all coefficients in the first equation of the model are equal to zero.
star and star(#1 #2 #3) specify that stars (asterisks) are to be used to mark significance. The
second syntax specifies the significance levels for one, two, and three stars. If you specify simply
star, that is equivalent to specifying star(.05 .01 .001), which means one star (*) if p < 0.05,
two stars (**) if p < 0.01, and three stars (***) if p < 0.001.
The star and star() options may not be combined with the se, t, or p options.
Options
keep(coeflist) and drop(coeflist) are alternatives; they specify coefficients to be included or omitted
from the table. The default is to display all coefficients.
If keep() is specified, it specifies not only the coefficients to be included but also the order in
which they appear.
A coeflist is a list of coefficient names, each name of which may be simple (e.g., price), an
equation name followed by a colon (e.g., mean:), or a full name (e.g., mean:price). Names are
separated from each other by blanks.
When full names are not specified, all coefficients that match the partial specification are included.
For instance, drop(_cons) would omit _cons for all equations.
equations(matchlist) specifies how the equations of the models in namelist are to be matched. The
default is to match equations by name. Matching by name usually works well when all results were
fit by the same estimation command. When you are comparing results from different estimation
commands, however, specifying equations() may be necessary.
The most common usage is equations(1), which indicates that all first equations are to be
matched into one equation named #1.
matchlist has the syntax

    term [, term ...]

where term is

    [eqname =] #:#...:#        (syntax 1)
    [eqname =] #               (syntax 2)
In syntax 1, each # is a number or a period (.). If a number, it specifies the position of the equation
in the corresponding model; 1:3:1 would indicate that equation 1 in the first model matches
equation 3 in the second, which matches equation 1 in the third. A period indicates that there
is no corresponding equation in the model; 1:.:1 indicates that equation 1 in the first matches
equation 1 in the third.
In syntax 2, you specify just one number, say, 1 or 2, and that is shorthand for 1:1...:1 or
2:2...:2, meaning that equation 1 matches across all models specified or that equation 2 matches
across all models specified.
Now that you can specify a term, you can put that together into a matchlist by separating one term
from the other by commas. In what follows, we will assume that three names were specified,
. estimates table alpha beta gamma, ...
equations(1) is equivalent to equations(1:1:1); we would be saying that the first equations
match across the board.
equations(1:.:1) would specify that equation 1 matches in models alpha and gamma but that
there is nothing corresponding in model beta.
equations(1,2) is equivalent to equations(1:1:1, 2:2:2). We would be saying that the first
equations match across the board and so do the second equations.
equations(1, 2:.:2) would specify that the first equations match across the board, that the
second equations match for models alpha and gamma, and that there is nothing equivalent to
equation 2 in model beta.
If equations() is specified, equations not matched by position are matched by name.
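Written out as commands, the matchings described above would be requested as in this sketch:

. estimates table alpha beta gamma, equations(1)
. estimates table alpha beta gamma, equations(1, 2:.:2)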
Numerical formats
b(%fmt) specifies how the coefficients are to be displayed. You might specify b(%9.2f) to make
decimal points line up. There is also a b option, which specifies that coefficients are to be displayed,
but that is just included for consistency with the se, t, and p options. Coefficients are always
displayed.

se, t, and p specify that standard errors, t or z statistics, and significance levels are to be displayed.
The default is not to display them. se(%fmt), t(%fmt), and p(%fmt) specify that each is to be
displayed and specify the display format to be used.

stfmt(%fmt) specifies the format for displaying the scalar statistics included by the stats() option.
General format
varwidth(#) specifies the number of character positions used to display the names of the variables
and statistics. The default is 12.
modelwidth(#) specifies the number of character positions used to display the names of the models.
The default is 12.
eform displays coefficients in exponentiated form. For each coefficient, exp(β) rather than β is
displayed, and standard errors are transformed appropriately. Display of the intercept, if any, is
suppressed.
label specifies that variable labels be displayed instead of variable names.
newpanel specifies that the statistics be displayed in a table separated by a blank line from the table
with coefficients rather than in the style of another equation in the table of coefficients.
style(stylespec) specifies the style of the coefficient table.
style(oneline) specifies that a vertical line be displayed after the variables but not between
the models. This is the default.
style(columns) specifies that vertical lines be displayed after each column.
style(noline) specifies that no vertical lines be displayed.
coded specifies that a compact table be displayed. This format is especially useful for comparing
variables that are included in a large collection of models.
Reporting
display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels; see [R] estimation options.
The following option is available with estimates table but is not shown in the dialog box:
title(string) specifies the title to appear above the table.
Remarks
If you type estimates table without arguments, a table of the most recent estimation results
will be shown:
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. regress mpg weight displ
(output omitted )
. estimates table

    Variable        active

      weight      -.00656711
displacement       .00528078
       _cons       40.084522
The real use of estimates table, however, is for comparing estimation results, and that requires
using it after estimates store:
. regress mpg weight displ
(output omitted )
. estimates store base
. regress mpg weight displ foreign
(output omitted )
. estimates store alt
. qreg mpg weight displ foreign
(output omitted )
. estimates store qreg
. estimates table base alt qreg, stats(r2)

    Variable        base           alt           qreg

      weight     -.00656711    -.00677449    -.00595056
displacement      .00528078     .00192865     .00018552
     foreign                   -1.6006312    -2.1326004
       _cons      40.084522     41.847949     39.213348

          r2       .6529307     .66287957
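The table can be refined with the options described above; one possible sketch for the comparison just shown:

. estimates table base alt qreg, stats(r2) b(%10.5f) se(%10.5f) newpanel

This displays coefficients and standard errors in a fixed five-decimal format and sets the r2 statistic off in its own panel.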
Saved results
estimates table saves the following in r():

Macros
  r(names)   names of results used

Matrices
  r(coef)    matrix M: n × 2m
             M[i, 2j−1] = ith parameter estimate for model j;
             M[i, 2j] = variance of M[i, 2j−1]; i = 1, ..., n; j = 1, ..., m
  r(stats)   matrix S: k × m (if option stats() specified)
             S[i, j] = ith statistic for model j; i = 1, ..., k; j = 1, ..., m
Methods and formulas
estimates table is implemented as an ado-file.
Also see
[R] estimates — Save and manipulate estimation results
Title
estimates title — Set title for estimation results
Syntax
estimates title: [text]

estimates title
Menu
Statistics > Postestimation > Manage estimation results > Title/retitle results
Description
estimates title: (note the colon) sets or clears the title for the current estimation results. The
title is used by estimates table and estimates stats (see [R] estimates table and [R] estimates
stats).
estimates title without the colon displays the current title.
Remarks
After setting the title, if estimates have been stored, do not forget to store them again:
. use http://www.stata-press.com/data/r11/auto
(1978 Automobile Data)
. regress mpg gear turn
(output omitted )
. estimates store reg
Now let’s add a title:
. estimates title: "My regression"
. estimates store reg
Methods and formulas
estimates title: and estimates title are implemented as ado-files.
Also see
[R] estimates — Save and manipulate estimation results
Title
estimation options — Estimation options
Description
This entry describes the options common to many estimation commands. Not all the options
documented below work with all estimation commands. See the documentation for the particular
estimation command; if an option is listed there, it is applicable.
Options
Model
noconstant suppresses the constant term (intercept) in the model.
offset(varname) specifies that varname be included in the model with the coefficient constrained
to be 1.
exposure(varname) specifies a variable that reflects the amount of exposure over which the depvar
events were observed for each observation; ln(varname) with coefficient constrained to be 1 is
entered into the log-link function.
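For example, in a Poisson model of event counts, exposure() enters the log of the time at risk with its coefficient constrained to 1. A sketch with hypothetical variables deaths (the count) and pyears (person-years at risk):

. poisson deaths smokes, exposure(pyears)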
constraints(numlist | matname) specifies the linear constraints to be applied during estimation.
The default is to perform unconstrained estimation. See [R] reg3 for the use of constraints in
multiple-equation contexts.
constraints(numlist) specifies the constraints by number after they have been defined by using
the constraint command; see [R] constraint. Some commands (for example, slogit) allow
only constraints(numlist).
constraints(matname) specifies a matrix containing the constraints; see [P] makecns.
constraints(clist) is used by some estimation commands, such as mlogit, where clist has the
form #[-#] [, #[-#] ...].
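A minimal sketch of numbered constraints, using cnsreg (for which constraints() is required) and auto-dataset variables:

. constraint 1 weight = displacement
. cnsreg mpg weight displacement, constraints(1)

constraint 1 forces the coefficients on weight and displacement to be equal during estimation; see [R] constraint.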
collinear specifies that the estimation command not omit collinear variables. Usually, there is no
reason to leave collinear variables in place, and, in fact, doing so usually causes the estimation
to fail because of the matrix singularity caused by the collinearity. However, with certain models,
the variables may be collinear, yet the model is fully identified because of constraints or other
features of the model. In such cases, using the collinear option allows the estimation to take
place, leaving the equations with collinear variables intact. This option is seldom used.
force specifies that estimation be forced even though the time variable is not equally spaced.
This is relevant only for correlation structures that require knowledge of the time variable. These
correlation structures require that observations be equally spaced so that calculations based on lags
correspond to a constant time change. If you specify a time variable indicating that observations
are not equally spaced, the (time dependent) model will not be fit. If you also specify force,
the model will be fit, and it will be assumed that the lags based on the data ordered by the time
variable are appropriate.
Correlation
corr(correlation) specifies the within-group correlation structure; the default corresponds to the
equal-correlation model, corr(exchangeable).
When you specify a correlation structure that requires a lag, you indicate the lag after the structure’s
name with or without a blank; e.g., corr(ar 1) or corr(ar1).
If you specify the fixed correlation structure, you specify the name of the matrix containing the
assumed correlations following the word fixed, e.g., corr(fixed myr).
Reporting
level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is
level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals.
noskip specifies that a full maximum-likelihood model with only a constant for the regression equation
be fit. This model is not displayed but is used as the base model to compute a likelihood-ratio test
for the model test statistic displayed in the estimation header. By default, the overall model test
statistic is an asymptotically equivalent Wald test of all the parameters in the regression equation
being zero (except the constant). For many models, this option can substantially increase estimation
time.
nocnsreport specifies that no constraints be reported. The default is to display user-specified
constraints above the coefficient table.
noomitted specifies that variables that were omitted because of collinearity not be displayed. The
default is to include in the table any variables omitted because of collinearity and to label them
as “(omitted)”.
vsquish specifies that the blank space separating factor-variable terms or time-series–operated variables
from other variables in the model be suppressed.
noemptycells specifies that empty cells for interactions of factor variables not be displayed. The
default is to include in the table interaction cells that do not occur in the estimation sample and
to label them as “(empty)”.
baselevels and allbaselevels control whether the base levels of factor variables and interactions
are displayed. The default is to exclude from the table all base categories.
baselevels specifies that base levels be reported for factor variables and for interactions whose
bases cannot be inferred from their component factor variables.
allbaselevels specifies that all base levels of factor variables and interactions be reported.
Integration
intmethod(intmethod) specifies the integration method to be used for the random-effects model. It
accepts one of three arguments: mvaghermite, the default, performs mean and variance adaptive
Gauss–Hermite quadrature first on every and then on alternate iterations; aghermite performs
mode and curvature adaptive Gauss–Hermite quadrature on the first iteration only; ghermite
performs nonadaptive Gauss–Hermite quadrature.
intpoints(#) specifies the number of integration points to use for integration by quadrature.
The default is intpoints(12); the maximum is intpoints(195). Increasing this value slightly
improves the accuracy but also increases computation time. Computation time is roughly proportional
to its value.
The following option is not shown in the dialog box:
coeflegend specifies that the legend of the coefficients and how to specify them in an expression
be displayed rather than the coefficient table.
Also see
[U] 20 Estimation and postestimation commands
Title
exit — Exit Stata
Syntax
exit [, clear]
Description
Typing exit causes Stata to stop processing and return control to the operating system. If the
dataset in memory has changed since the last save command, you must specify the clear option
before Stata will let you exit.
exit may also be used for exiting do-files or programs; see [P] exit.
Stata for Windows users may also exit Stata by clicking on the Close button or by pressing Alt+F4.
Stata for Mac users may also exit Stata by pressing Command+Q.
Stata(GUI) users may also exit Stata by clicking on the Close button.
Option
clear permits you to exit, even if the current dataset has not been saved.
Remarks
Type exit to leave Stata and return to the operating system. If the dataset in memory has changed
since the last time it was saved, however, Stata will refuse. At that point, you can either save the
dataset and then type exit, or type exit, clear:
. exit
no; data in memory would be lost
r(4);
. exit, clear
Also see
[P] exit — Exit from a program or do-file
Title
exlogistic — Exact logistic regression
Syntax
exlogistic depvar indepvars [if] [in] [weight] [, options]
options                     description

Model
  condvars(varlist)         condition on variables in varlist
  group(varname)            groups/strata are stratified by unique values of varname
  binomial(varname | #)     data are in binomial form and the number of trials is contained in
                              varname or in #
  estconstant               estimate constant term; do not condition on the number of successes
  noconstant                suppress constant term

Terms
  terms(termsdef)           terms definition

Options
  memory(#[b|k|m|g])        set limit on memory usage; default is memory(10m)
  saving(filename)          save the joint conditional distribution to filename

Reporting
  level(#)                  set confidence level; default is level(95)
  coef                      report estimated coefficients
  test(testopt)             report significance of observed sufficient statistic, conditional
                              scores test, or conditional probabilities test
  mue(varlist)              compute the median unbiased estimates for varlist
  midp                      use the mid-p-value rule
  nolog                     do not display the enumeration log

by, statsby, and xi are allowed; see [U] 11.1.10 Prefix commands.
fweights are allowed; see [U] 11.1.6 weight.
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.
Menu
Statistics > Exact statistics > Exact logistic regression
Description
exlogistic fits an exact logistic regression model of depvar on indepvars.
exlogistic is an alternative to logistic, the standard maximum-likelihood–based logistic
regression estimator; see [R] logistic. exlogistic produces more-accurate inference in small samples
because it does not depend on asymptotic results and exlogistic can better deal with one-way
causation, such as the case where all females are observed to have a positive outcome.
exlogistic with the group(varname) option is an alternative to clogit, the conditional logistic
regression estimator; see [R] clogit. Like clogit, exlogistic conditions on the number of positive
outcomes within stratum.
depvar can be specified in two ways. It can be zero/nonzero, with zero indicating failure and
nonzero representing positive outcomes (successes), or if you specify the binomial(varname | #)
option, depvar may contain the number of positive outcomes within each trial.
exlogistic is computationally intensive. Unlike most estimators, it does not calculate coefficients
for all independent variables at once; instead, results for each independent variable are calculated
separately, with the other independent variables temporarily conditioned out. You can save considerable
computer time by skipping the parameter calculations for variables that are not of direct interest.
Specify such variables in the condvars() option rather than among the indepvars; see condvars()
below.
Unlike Stata’s other estimation commands, you may not use test, lincom, or other postestimation
commands after exlogistic. Given the method used to calculate estimates, hypothesis tests must
be performed during estimation by using exlogistic’s terms() option; see terms() below.
Options
Model
condvars(varlist) specifies variables whose parameter estimates are not of interest to you. You can
save substantial computer time and memory moving such variables from indepvars to condvars().
Understand that you will get the same results for x1 and x3 whether you type
. exlogistic y x1 x2 x3 x4
or
. exlogistic y x1 x3, condvars(x2 x4)
group(varname) specifies the variable defining the strata, if any. A constant term is assumed for
each stratum identified in varname, and the sufficient statistics for indepvars are conditioned on
the observed number of successes within each group. This makes the model estimated equivalent
to that estimated by clogit, Stata’s conditional logistic regression command (see [R] clogit).
group() may not be specified with noconstant or estconstant.
binomial(varname | #) indicates that the data are in binomial form and depvar contains the number
of successes. varname contains the number of trials for each observation. If all observations have
the same number of trials, you can instead specify the number as an integer. The number of trials
must be a positive integer at least as great as the number of successes. If binomial() is not
specified, the data are assumed to be Bernoulli, meaning that depvar equaling zero or nonzero
records one failure or success.
estconstant estimates the constant term. By default, the models are assumed to have an intercept
(constant), but the value of the intercept is not calculated. That is, the conditional distribution of
the sufficient statistics for the indepvars is computed given the number of successes in depvar,
thus conditioning out the constant term of the model. Use estconstant if you want the estimate
of the intercept reported. estconstant may not be specified with group().
noconstant; see [R] estimation options. noconstant may not be specified with group().
Terms
terms(termname = variable ... variable [, termname = variable ... variable ...]) defines additional
terms of the model on which you want exlogistic to perform joint-significance hypothesis tests.
By default, exlogistic reports tests individually on each variable in indepvars. For instance,
if variables x1 and x3 are in indepvars, and you want to jointly test their significance, specify
terms(t1=x1 x3). To also test the joint significance of x2 and x4, specify terms(t1=x1 x3,
t2=x2 x4). Each variable can be assigned to only one term.
Joint tests are computed only for the conditional scores tests and the conditional probabilities tests.
See the test() option below.
Options
memory(#[b|k|m|g]) sets a limit on the amount of memory exlogistic can use when computing
the conditional distribution of the parameter sufficient statistics. The default is memory(10m),
where m stands for megabyte, or 1,048,576 bytes. The following are also available: b stands for
byte; k stands for kilobyte, which is equal to 1,024 bytes; and g stands for gigabyte, which is
equal to 1,024 megabytes. The minimum setting allowed is 1m and the maximum is 2048m or 2g,
but do not attempt to use more memory than is available on your computer.
saving(filename [, replace]) saves the joint conditional distribution to filename. This distribution
is conditioned on those variables specified in condvars(). Use replace to replace an existing
file with filename. A Stata data file is created containing all the feasible values of the parameter
sufficient statistics. The variable names are the same as those in indepvars, in addition to a variable
named _f_ containing the feasible value frequencies (sometimes referred to as the condition
numbers).
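For instance, a sketch combining these options (y, x1, x2, and x3 are hypothetical variables):

. exlogistic y x1, condvars(x2 x3) memory(40m) saving(jointdist, replace)

This estimates only the x1 parameter, conditions out x2 and x3, caps enumeration memory at 40 megabytes, and saves the joint conditional distribution in the file jointdist.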
Reporting
level(#); see [R] estimation options. The level(#) option will not work on replay because
confidence intervals are based on estimator-specific enumerations. To change the confidence level,
you must refit the model.
coef reports the estimated coefficients rather than odds ratios (exponentiated coefficients). coef may
be specified when the model is fit or upon replay. coef affects only how results are displayed and
not how they are estimated.
test(sufficient | score | probability) reports the significance level of the observed sufficient
statistics, the conditional scores tests, or the conditional probabilities tests, respectively. The default
is test(sufficient). If terms() is included in the specification, the conditional scores test
and the conditional probabilities test are applied to each term providing conditional inference for
several parameters simultaneously. All the statistics are computed at estimation time regardless of
which is specified. Each statistic may thus also be displayed postestimation without having to refit
the model; see [R] exlogistic postestimation.
mue(varlist) specifies that median unbiased estimates (MUEs) be reported for the variables in varlist.
By default, the conditional maximum likelihood estimates (CMLEs) are reported, except for those
parameters for which the CMLEs are infinite. Specify mue(_all) if you want MUEs for all the
indepvars.
midp instructs exlogistic to use the mid-p-value rule when computing the MUEs, significance
levels, and confidence intervals. This adjustment is for the discreteness of the distribution and
halves the value of the discrete probability of the observed statistic before adding it to the p-value.
The mid-p-value rule cannot be applied to MUEs whose corresponding parameter CMLE is infinite.
nolog prevents the display of the enumeration log. By default, the enumeration log is displayed,
showing the progress of computing the conditional distribution of the sufficient statistics.
Remarks
Exact logistic regression is the estimation of the logistic model parameters by using the conditional
distribution of the parameter sufficient statistics. The estimates are referred to as the conditional
maximum likelihood estimates (CMLEs). This technique was first introduced by Cox and Snell (1989)
as an alternative to using maximum likelihood estimation, which can perform poorly for small sample
sizes. For stratified data, exact logistic regression is a small-sample alternative to conditional logistic
regression. See [R] logit, [R] logistic, and [R] clogit to obtain maximum likelihood estimates (MLEs)
for the logistic model and the conditional logistic model. For a comprehensive overview of exact
logistic regression, see Mehta and Patel (1995).
Let $Y_i$ denote a Bernoulli random variable where we observe the outcome $Y_i = y_i$, $i = 1, \ldots, n$.
Associated with each independent observation is a $1 \times p$ vector of covariates, $x_i$. We will denote
$\pi_i = \Pr(Y_i \mid x_i)$ and let the logit function model the relationship between $Y_i$ and $x_i$,

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \theta + x_i \beta$$

where the constant term $\theta$ and the $p \times 1$ vector of regression parameters $\beta$ are unknown. The
probability of observing $Y_i = y_i$, $i = 1, \ldots, n$, is

$$\Pr(Y = y) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}$$

where $Y = (Y_1, \ldots, Y_n)$ and $y = (y_1, \ldots, y_n)$. The MLEs for $\theta$ and $\beta$ maximize the log of this
function.

The sufficient statistics for $\theta$ and $\beta_j$, $j = 1, \ldots, p$, are $M = \sum_{i=1}^{n} Y_i$ and
$T_j = \sum_{i=1}^{n} Y_i x_{ij}$, respectively, and we observe $M = m$ and $T_j = t_j$. By default,
exlogistic tallies the conditional distribution of $T = (T_1, \ldots, T_p)$ given $M = m$. This distribution
will have a size of $\binom{n}{m}$. (It would have a size of $2^n$ without conditioning on $M = m$.)
Denote one of these vectors $T^{(k)} = (t_1^{(k)}, \ldots, t_p^{(k)})$, $k = 1, \ldots, N$, with combinatorial
coefficient (frequency) $c_k$, $\sum_{k=1}^{N} c_k = \binom{n}{m}$.

For each independent variable $x_j$, $j = 1, \ldots, p$, we reduce the conditional distribution further by
conditioning on all other observed sufficient statistics $T_l = t_l$, $l \neq j$. The conditional probability of
observing $T_j = t_j$ has the form

$$\Pr(T_j = t_j \mid T_l = t_l,\ l \neq j,\ M = m) = \frac{c\, e^{t_j \beta_j}}{\sum_k c_k\, e^{t_j^{(k)} \beta_j}}$$

where the sum is over the subset of $T^{(k)}$ vectors satisfying $t_l^{(k)} = t_l$ for all $l \neq j$, and $c$ is
the combinatorial coefficient associated with the observed $t$. The CMLE for $\beta_j$ maximizes
the log of this function.
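As a toy illustration of this conditioning (our example, not part of the original entry): take $n = 3$ observations, one covariate with values $x = (0, 1, 2)$, and suppose $m = 1$ success is observed. The three response vectors with $M = 1$ are $(1,0,0)$, $(0,1,0)$, and $(0,0,1)$, giving sufficient statistics $T = 0, 1, 2$, each with frequency $c_k = 1$, so

$$\Pr(T = t \mid M = 1) = \frac{e^{t\beta}}{1 + e^{\beta} + e^{2\beta}}, \qquad t \in \{0, 1, 2\}$$

and the CMLE of $\beta$ maximizes the log of this probability evaluated at the observed $t$.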
Specifying nuisance variables in condvars() will reduce the size of the conditional distribution by
conditioning on their observed sufficient statistics as well as conditioning on M = m. This reduces
the amount of memory consumed at the cost of not obtaining regression estimates for those variables
specified in condvars().
Inferences from MLEs rely on asymptotics, and if your sample size is small, these inferences may
not be valid. On the other hand, inferences from the CMLEs are exact in the sense that they use the
conditional distribution of the sufficient statistics outlined above.
For small datasets, it is common for the dependent variable to be completely determined by the
data. Here the MLEs and the CMLEs are unbounded. exlogistic will instead compute the MUE,
the regression estimate that places the observed sufficient statistic at the median of the conditional
distribution.
Example 1
One example presented by Mehta and Patel (1995) is data from a prospective study of perinatal
infection and human immunodeficiency virus type 1 (HIV-1). We use a variation of this dataset. There
was an investigation (Hutto et al. 1991) into whether the blood serum levels of glycoproteins CD4
and CD8 measured in infants at 6 months of age might predict their development of HIV infection.
The blood serum levels are coded as ordinal values 0, 1, and 2.
. use http://www.stata-press.com/data/r11/hiv1
(prospective study of perinatal infection of HIV-1)
. list

       hiv   cd4   cd8

  1.     1     0     0
  2.     0     1     2
  3.     0     1     0
  4.     0     0     2
  5.     1     1     0
       (output omitted )
 46.     0     2     0
 47.     0     1     2
We first obtain the MLEs from logistic so that we can compare the estimates and associated statistics
with the CMLEs from exlogistic.
. logistic hiv cd4 cd8, coef

Logistic regression                               Number of obs   =        47
                                                  LR chi2(2)      =     15.75
                                                  Prob > chi2     =    0.0004
Log likelihood = -20.751687                       Pseudo R2       =    0.2751

         hiv        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

         cd4    -2.541669   .8392231    -3.03   0.002    -4.186517   -.8968223
         cd8     1.658586    .821113     2.02   0.043     .0492344    3.267938
       _cons     .5132389   .6809007     0.75   0.451    -.8213019     1.84778
. exlogistic hiv cd4 cd8, coef

Enumerating sample-space combinations:
observation 1:   enumerations =         2
observation 2:   enumerations =         3
  (output omitted )
observation 46:  enumerations =       601
observation 47:  enumerations =       326

Exact logistic regression                         Number of obs =          47
                                                  Model score   =    13.34655
                                                  Pr >= score   =      0.0006

         hiv        Coef.    Suff.   2*Pr(Suff.)     [95% Conf. Interval]

         cd4    -2.387632       10        0.0004    -4.699633   -.8221807
         cd8     1.592366       12        0.0528    -.0137905    3.907876
exlogistic produced a log showing how many records are generated as it processes each observation.
The primary purpose of the log is to provide feedback because generating the distribution can be time
consuming, but we also see from the last entry that the joint distribution for the sufficient statistics
for cd4 and cd8 conditioned on the total number of successes has 326 unique values (but a size of
$\binom{47}{14}$ = 341,643,774,795).
The statistics for logistic are based on asymptotics: for a large sample size, each Z statistic
will be approximately normally distributed (with a mean of zero and a standard deviation of one)
if the associated regression parameter is zero. The question is whether a sample size of 47 is large
enough.
On the other hand, the p-values computed by exlogistic are from the conditional distributions
of the sufficient statistics for each parameter given the sufficient statistics for all other parameters.
In this sense, these p-values are exact. By default, exlogistic reports the sufficient statistics for
the regression parameters and the probability of observing a more extreme value. These are single-parameter
tests for $H_0\colon \beta_{cd4} = 0$ and $H_0\colon \beta_{cd8} = 0$ versus the two-sided alternatives. The conditional
scores test, located in the coefficient table header, jointly tests $H_0\colon \beta_{cd4} = 0$ and $H_0\colon \beta_{cd8} = 0$.
We find these p-values to be in fair agreement with the Wald and likelihood-ratio tests from logistic.
The confidence intervals for exlogistic are computed from the exact conditional distributions.
The exact confidence intervals are asymmetrical about the estimate and are wider than the normal-based
confidence intervals from logistic.
Both estimation techniques indicate that the incidence of HIV infection decreases with increasing
CD4 blood serum levels and increases with increasing CD8 blood serum levels. The constant term is
missing from the exact logistic coefficient table because we conditioned out its observed sufficient
statistic when tallying the joint distribution of the sufficient statistics for the cd4 and cd8 parameters.
The test() option provides two other test statistics used in exact logistic: the conditional scores
test, test(score), and the conditional probabilities test, test(probability). For comparison, we
display the individual parameter conditional scores tests.
. exlogistic, test(score) coef

Exact logistic regression                         Number of obs =          47
                                                  Model score   =    13.34655
                                                  Pr >= score   =      0.0006

         hiv        Coef.       Score   Pr>=Score     [95% Conf. Interval]

         cd4    -2.387632    12.88022      0.0003    -4.699633   -.8221807
         cd8     1.592366    4.604816      0.0410    -.0137905    3.907876
For the probabilities test, the probability statistic is computed from (1) in Methods and formulas
with $\beta = 0$. For this example, the significance of the probabilities tests matches the scores tests,
so they are not displayed here.
Technical note
Typically, the value of θ, the constant term, is of little interest, as well as perhaps some of the
parameters in β, but we need to include all parameters in the model to correctly specify it. By
conditioning out the nuisance parameters, we can reduce the size of the joint conditional distribution
that is used to estimate the regression parameters of interest. The condvars() option allows you to
specify a varlist of nuisance variables. By default, exlogistic conditions on the sufficient statistic
of θ, which is the number of successes. You can save computation time and computer memory by
using the condvars() option because infeasible values of the sufficient statistics associated with the
variables in condvars() can be dropped from consideration before all n observations are processed.
Specifying some of your independent variables in condvars() will not change the estimated
regression coefficients of the remaining independent variables. For instance, in example 1, if we
instead type
. exlogistic hiv cd4, condvars(cd8) coef
the regression coefficient for cd4 (as well as all associated inference) will be identical.
One reason to have multiple variables in indepvars is to make conditional inference of several
parameters simultaneously by using the terms() option. If you do not wish to test several parameters
simultaneously, it may be more efficient to obtain estimates for individual variables by calling
exlogistic multiple times with one variable in indepvars and all other variables listed in condvars().
The estimates will be the same as those with all variables in indepvars.
Technical note
If you fit a clogit model to the HIV data from example 1, you will find that the estimates differ
from those with exlogistic. (To fit the clogit model, you will have to create a group variable that
includes all observations.) The regression estimates will be different because clogit conditions on
the constant term only, whereas the estimates from exlogistic condition on the sufficient statistic
of the other regression parameter as well as the constant term.
Example 2
The HIV data presented in table IV of Mehta and Patel (1995) are in a binomial form, where the
variable hiv contains the HIV cases that tested positive and the variable n contains the number of
individuals with the same CD4 and CD8 levels, the binomial number-of-trials parameter. Here depvar
is hiv, and we use the binomial(n) option to identify the number-of-trials variable.
. use http://www.stata-press.com/data/r11/hiv_n, clear
(prospective study of perinatal infection of HIV-1; binomial form)
. list
       cd4   cd8   hiv     n

  1.     0     2     1     1
  2.     1     2     2     2
  3.     0     0     4     7
  4.     1     1     4    12
  5.     2     2     1     3

  6.     1     0     2     7
  7.     2     0     0     2
  8.     2     1     0    13
Further, the cd4 and cd8 variables of the hiv dataset are actually factor variables, where each has
the ordered levels of (0, 1, 2). Another approach to the analysis is to use indicator variables, and
following Mehta and Patel (1995), we used a 0–1 coding scheme that will give us the odds ratio of
level 0 versus 2 and level 1 versus 2.
. gen byte cd4_0 = (cd4==0)
. gen byte cd4_1 = (cd4==1)
. gen byte cd8_0 = (cd8==0)
. gen byte cd8_1 = (cd8==1)
. exlogistic hiv cd4_0 cd4_1 cd8_0 cd8_1, terms(cd4=cd4_0 cd4_1,
> cd8=cd8_0 cd8_1) binomial(n) test(probability) saving(dist) nolog
note: saving distribution to file dist
note: CMLE estimate for cd4_0 is +inf; computing MUE
note: CMLE estimate for cd4_1 is +inf; computing MUE
note: CMLE estimate for cd8_0 is -inf; computing MUE
note: CMLE estimate for cd8_1 is -inf; computing MUE
Exact logistic regression                         Number of obs =          47
Binomial variable: n                              Model prob.   =    3.19e-06
                                                  Pr <= prob.   =      0.0011

         hiv   Odds Ratio       Prob.   Pr<=Prob.     [95% Conf. Interval]

cd4                           .0007183      0.0055
       cd4_0    18.82831*      .007238      0.0072     1.714079        +Inf
       cd4_1    11.53732*     .0063701      0.0105     1.575285        +Inf

cd8                           .0053212      0.0323
       cd8_0    .1056887*     .0289948      0.0290            0    1.072531
       cd8_1    .0983388*     .0241503      0.0242            0    .9837203

(*) median unbiased estimates (MUE)
. matrix list e(sufficient)

e(sufficient)[1,4]
        cd4_0   cd4_1   cd8_0   cd8_1
    r1      5       8       6       4

. display e(n_possible)
1091475
Here we used terms() to specify two terms in the model, cd4 and cd8, that make up the cd4 and cd8
indicator variables. By doing so, we obtained a conditional probabilities test for cd4, simultaneously
testing both cd4_0 and cd4_1, and for cd8, simultaneously testing both cd8_0 and cd8_1. The
significance levels for the two terms are 0.0055 and 0.0323, respectively.
This example also illustrates instances where the dependent variable is completely determined by
the independent variables and CMLEs are infinite. If we try to obtain MLEs, logistic will drop each
variable and then terminate with a no-data error, error number 2000.
. use http://www.stata-press.com/data/r11/hiv_n, clear
(prospective study of perinatal infection of HIV-1; binomial form)
. gen byte cd4_0 = (cd4==0)
. gen byte cd4_1 = (cd4==1)
. gen byte cd8_0 = (cd8==0)
. gen byte cd8_1 = (cd8==1)
. expand n
(39 observations created)
. logistic hiv cd4_0 cd4_1 cd8_0 cd8_1
note: cd4_0 != 0 predicts success perfectly
cd4_0 dropped and 8 obs not used
note: cd4_1 != 0 predicts success perfectly
cd4_1 dropped and 21 obs not used
note: cd8_0 != 0 predicts failure perfectly
cd8_0 dropped and 2 obs not used
outcome = cd8_1 <= 0 predicts data perfectly
r(2000);
In the previous example, exlogistic generated the joint conditional distribution of $T_{cd4\_0}$, $T_{cd4\_1}$,
$T_{cd8\_0}$, and $T_{cd8\_1}$ given $M = 14$ (the number of individuals that tested positive), and for reference,
we listed the observed sufficient statistics that are stored in the matrix e(sufficient). Below we
take that distribution and further condition on $T_{cd4\_1} = 8$, $T_{cd8\_0} = 6$, and $T_{cd8\_1} = 4$, giving the
conditional distribution of $T_{cd4\_0}$. Here we see that the observed sufficient statistic $T_{cd4\_0} = 5$ is last
in the sorted listing or, equivalently, $T_{cd4\_0}$ is at the domain boundary of the conditional probability
distribution. When this occurs, the conditional probability distribution is monotonically increasing in
$\beta_{cd4\_0}$ and a maximum does not exist.
. use dist, clear
. keep if cd4_1==8 & cd8_0==6 & cd8_1==4
(4139 observations deleted)
. list, sep(0)

            _f_   cd4_0   cd4_1   cd8_0   cd8_1

  1.    1668667       0       8       6       4
  2.   18945542       1       8       6       4
  3.   55801053       2       8       6       4
  4.   55867350       3       8       6       4
  5.   17423175       4       8       6       4
  6.    1091475       5       8       6       4
When the CMLEs are infinite, the MUEs are computed (Hirji, Tsiatis, and Mehta 1989). For the cd4_0
estimate, we compute the value $\bar{\beta}_{cd4\_0}$ such that

$$\Pr(T_{cd4\_0} \geq 5 \mid \beta_{cd4\_0} = \bar{\beta}_{cd4\_0},\ T_{cd4\_1} = 8,\ T_{cd8\_0} = 6,\ T_{cd8\_1} = 4,\ M = 14) = 1/2$$

using (1) in Methods and formulas.
The output is in agreement with example 1: there is an increase in risk of HIV infection for a CD4
blood serum level of 0 relative to a level of 2 and for a level of 1 relative to a level of 2; there is a
decrease in risk of HIV infection for a CD8 blood serum level of 0 relative to a level of 2 and for a
level of 1 relative to a level of 2.

We also displayed e(n_possible). This is the combinatorial coefficient associated with the
observed sufficient statistics. The same value is found in the _f_ variable of the conditional distribution
dataset listed above. The size of the distribution is $\binom{47}{14}$ = 341,643,774,795. This can be verified
by summing the _f_ variable of the generated conditional distribution dataset.

. use dist, clear
. summarize _f_, meanonly
. di %15.1f r(sum)
341643774795.0
Example 3
One can think of exact logistic regression as a covariate-adjusted exact binomial. To demonstrate
this point, we will use exlogistic to compute a binomial confidence interval for m successes of
n trials, by fitting the constant-only model, and we will compare it with the confidence interval
computed by ci (see [R] ci). We will use the saving() option to retain the dataset containing the
feasible values for the constant term sufficient statistic, namely, the number of successes, m, given
n trials and their associated combinatorial coefficients (n choose m), m = 0, 1, ..., n.
. input y

          y
  1. 1
  2. 0
  3. 1
  4. 0
  5. 1
  6. 1
  7. end

. ci y, binomial

                                                    -- Binomial Exact --
    Variable        Obs        Mean    Std. Err.   [95% Conf. Interval]

           y          6    .6666667    .1924501    .2227781    .9567281

. exlogistic y, estconstant nolog coef saving(binom)
note: saving distribution to file binom

Exact logistic regression                     Number of obs =          6

           y       Coef.   Suff.   2*Pr(Suff.)   [95% Conf. Interval]

       _cons    .6931472       4        0.6875    -1.24955    3.096017
We use the postestimation program estat predict to transform the estimated constant term and its
confidence bounds by using the inverse logit function, invlogit() (see [D] functions). The standard
error for the estimated probability is computed using the delta method.
. estat predict

           y    Predicted   Std. Err.   [95% Conf. Interval]

 Probability       0.6667      0.1925     0.2228      0.9567
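The same transformation can be applied by hand with invlogit() (a verification sketch; these
display commands are not part of the original output):

    . display invlogit(.6931472)
    . display invlogit(-1.24955)
    . display invlogit(3.096017)

Up to rounding, these reproduce the predicted probability 0.6667 and the confidence bounds
0.2228 and 0.9567 reported above.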
. use binom, replace
. list, sep(0)

        _f_   _cons_

  1.      1        0
  2.      6        1
  3.     15        2
  4.     20        3
  5.     15        4
  6.      6        5
  7.      1        6
Examining the listing of the generated data, the values contained in the variable _cons_ are the
feasible values of M, and the values contained in the variable _f_ are the binomial coefficients
(6 choose m), with total sum over m = 0, ..., 6 equal to 2^6 = 64. In the coefficient table, the
sufficient statistic for the constant term, labeled Suff., is m = 4. This value is located at record 5
of the dataset. Therefore, the two-tailed probability of the sufficient statistic is computed as
0.6875 = 2(15 + 6 + 1)/64.
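The arithmetic can be confirmed directly (a check that is not part of the original output; 15, 6,
and 1 are the _f_ values of records 5-7):

    . display 2*(15+6+1)/64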
The constant term is the value of θ that maximizes the probability of observing M = 4; see (1)
of Methods and formulas:

    Pr(M = 4 | θ) = 15e^{4θ} / (1 + 6e^θ + 15e^{2θ} + 20e^{3θ} + 15e^{4θ} + 6e^{5θ} + e^{6θ})
The maximum is at the value θ = log 2, which is demonstrated in the figure below.

    [Figure: probability Pr(M = 4 | θ) plotted against the constant term θ, peaking at (log(2), 0.33)]
The lower and upper confidence bounds are the values of θ such that Pr(M ≥ 4 | θ) = 0.025
and Pr(M ≤ 4 | θ) = 0.025, respectively. These probabilities are plotted in the figure below for
θ ∈ [−2, 4].

    [Figure: cumulative probabilities Pr(M >= 4) and Pr(M <= 4) plotted against the constant
    term θ; the confidence bounds are marked at (−1.25, .025) and (3.1, .025)]
Example 4
This example demonstrates the group() option, which allows the analysis of stratified data. Here
the logistic model is

    log{πik/(1 − πik)} = θk + xki β

where k indexes the s strata, k = 1, ..., s, and θk is the strata-specific constant term whose sufficient
statistic is Mk = Σ_{i=1}^{nk} Yki.
Mehta and Patel (1995) use a case–control study to demonstrate this model, which is useful in
comparing the estimates from exlogistic and clogit. This study was intended to determine the role
of birth complications in people with schizophrenia (Garsd 1988). Siblings from seven families took
part in the study, and each individual was classified as normal or schizophrenic. A birth complication
index is recorded for each individual that ranges from 0, an uncomplicated birth, to 15, a very
complicated birth. Some of the frequencies contained in variable f are greater than 1, and these count
different births at different times where the individual has the same birth complications index, found
in variable BCindex.
. use http://www.stata-press.com/data/r11/schizophrenia, clear
(case-control study on birth complications for people with schizophrenia)
. list, sepby(family)
       family   BCindex   schizo   f

  1.        1         6        0   1
  2.        1         7        0   1
  3.        1         3        0   2
  4.        1         2        0   3
  5.        1         5        0   1
  6.        1         0        0   1
  7.        1        15        1   1

  8.        2         2        1   1
  9.        2         0        0   1

 10.        3         2        0   1
 11.        3         9        1   1
 12.        3         1        0   1

 13.        4         2        1   1
 14.        4         0        0   4

 15.        5         3        1   1
 16.        5         6        0   1
 17.        5         0        1   1

 18.        6         3        0   1
 19.        6         0        1   1
 20.        6         0        0   2

 21.        7         2        0   1
 22.        7         6        1   1
. exlogistic schizo BCindex [fw=f], group(family) test(score) coef
Enumerating sample-space combinations:
observation 1:   enumerations =         2
observation 2:   enumerations =         3
observation 3:   enumerations =         4
observation 4:   enumerations =         5
observation 5:   enumerations =         6
  (output omitted )
observation 21:  enumerations =        72
observation 22:  enumerations =        40
Exact logistic regression                     Number of obs      =        29
Group variable: family                        Number of groups   =         7
                                              Obs per group: min =         2
                                                             avg =       4.1
                                                             max =        10
                                              Model score        =   6.32803
                                              Pr >= score        =    0.0167

      schizo       Coef.      Score   Pr>=Score   [95% Conf. Interval]

     BCindex    .3251178   6.328033      0.0167    .0223423    .7408832
The asymptotic alternative for this model can be estimated using clogit (equivalently, xtlogit,
fe) and is listed below for comparison. We must expand the data because clogit will not accept
frequency weights if they are not constant within the groups.
. expand f
(7 observations created)
. clogit schizo BCindex, group(family) nolog
note: multiple positive outcomes within groups encountered.

Conditional (fixed-effects) logistic regression   Number of obs   =        29
                                                  LR chi2(1)      =      5.20
                                                  Prob > chi2     =    0.0226
Log likelihood = -6.2819819                       Pseudo R2       =    0.2927

      schizo       Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]

     BCindex    .3251178   .1678981     1.94   0.053    -.0039565     .654192
Both techniques compute the same regression estimate for the BCindex, which might not be too
surprising because both estimation techniques condition on the total number of successes in each group.
The difference lies in the p-values and confidence intervals. The p-value testing H0 : βBCindex = 0
is approximately 0.0167 for the exact conditional scores test and 0.053 for the asymptotic Wald test.
Moreover, the exact confidence interval is asymmetric about the estimate and does not contain zero.
Technical note
The memory(#) option limits the amount of memory that exlogistic will consume when
computing the conditional distribution of the parameter sufficient statistics. memory() is independent
of the system setting c(memory) (see set memory in [D] memory), and it is possible for exlogistic
to exceed the memory limit specified in c(memory) without terminating. By default, a log is provided
that displays the number of enumerations (the size of the conditional distribution) after processing each
observation. Typically, you will see the number of enumerations increase, and then at some point they
will decrease as the multivariate shift algorithm (Hirji, Mehta, and Patel 1987) determines that some
of the enumerations cannot achieve the observed sufficient statistics of the conditioning variables.
When the algorithm is complete, however, it is necessary to store the conditional distribution of the
parameter sufficient statistics as a dataset. It is possible, therefore, to get a memory error when the
algorithm has completed and c(memory) is not large enough to store the conditional distribution.
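For instance, to raise the limit to 50 megabytes for the model fit in example 2 (a hypothetical
invocation; any allowed memory() value could be substituted):

    . exlogistic hiv cd4_0 cd4_1 cd8_0 cd8_1, binomial(n) memory(50m)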
Technical note
Computing the conditional distributions and reported statistics requires data sorting and numerical
comparisons. If there is at least one single-precision variable specified in the model, exlogistic
will make comparisons with a relative precision of 2^−5. Otherwise, a relative precision of 2^−11 is
used. Be careful if you use recast to promote a single-precision variable to double precision (see
[D] recast). You might try listing the data in full precision (maybe %20.15g; see [D] format) to make
sure that this is really what you want. See [D] data types for information on precision of numeric
storage types.
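For instance, to inspect a hypothetical float variable x at full precision before deciding whether
to promote it (a sketch, not taken from the original entry):

    . format x %20.15g
    . list x in 1/5
    . recast double x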
Saved results

exlogistic saves the following in e():

Scalars
  e(N)                    number of observations
  e(k_groups)             number of groups
  e(n_possible)           number of distinct possible outcomes where sum(sufficient)
                            equals observed e(sufficient)
  e(n_trials)             binomial number-of-trials parameter
  e(sum_y)                sum of depvar
  e(k_indvars)            number of independent variables
  e(k_terms)              number of model terms
  e(k_condvars)           number of conditioning variables
  e(condcons)             conditioned on the constant(s) indicator
  e(midp)                 mid-p-value rule indicator
  e(eps)                  relative difference tolerance

Macros
  e(cmd)                  exlogistic
  e(cmdline)              command as typed
  e(title)                Exact logistic regression
  e(depvar)               dependent variable
  e(indvars)              independent variables
  e(condvars)             conditional variables
  e(groupvar)             group variable
  e(binomial)             binomial number-of-trials variable
  e(level)                confidence level
  e(wtype)                weight type
  e(wexp)                 weight expression
  e(datasignature)        the checksum
  e(datasignaturevars)    variables used in calculation of checksum
  e(properties)           b
  e(estat_cmd)            program used to implement estat
  e(predict)              program used to implement predict
  e(marginsnotok)         predictions disallowed by margins

Matrices
  e(b)                    coefficient vector
  e(mue_indicators)       indicator for elements of e(b) estimated using MUE instead of CMLE
  e(se)                   e(b) standard errors (CMLEs only)
  e(ci)                   matrix of e(level) confidence intervals for e(b)
  e(sum_y_groups)         sum of e(depvar) for each group
  e(N_g)                  number of observations in each group
  e(sufficient)           sufficient statistics for e(b)
  e(p_sufficient)         p-value for e(sufficient)
  e(scoretest)            conditional scores tests for indepvars
  e(p_scoretest)          p-values for e(scoretest)
  e(probtest)             conditional probabilities tests for indepvars
  e(p_probtest)           p-value for e(probtest)
  e(scoretest_m)          conditional scores tests for model terms
  e(p_scoretest_m)        p-value for e(scoretest_m)
  e(probtest_m)           conditional probabilities tests for model terms
  e(p_probtest_m)         p-value for e(probtest_m)

Functions
  e(sample)               marks estimation sample
Methods and formulas
exlogistic is implemented as an ado-file.
Methods and formulas are presented under the following headings:
Sufficient statistics
Conditional distribution and CMLE
Median unbiased estimates and exact CI
Conditional hypothesis tests
Sufficient-statistic p-value
Sufficient statistics
Let {Y1, Y2, ..., Yn} be a set of n independent Bernoulli random variables, each of which can
realize two outcomes, {0, 1}. For each i = 1, ..., n, we observe Yi = yi, and associated with each
observation is the covariate row vector of length p, xi = (xi1, ..., xip). Denote β = (β1, ..., βp)^T to
be the column vector of regression parameters and θ to be the constant. The sufficient statistic
for βj is Tj = Σ_{i=1}^n Yi xij, j = 1, ..., p, and for θ is M = Σ_{i=1}^n Yi. We observe Tj = tj,
tj = Σ_{i=1}^n yi xij, and M = m, m = Σ_{i=1}^n yi. The probability of observing
(Y1 = y1, Y2 = y2, ..., Yn = yn) is

    Pr(Y1 = y1, ..., Yn = yn | β, X) = exp(mθ + tβ) / Π_{i=1}^n {1 + exp(θ + xiβ)}

where t = (t1, ..., tp) and X = (x1^T, ..., xn^T)^T.

The joint distribution of the sufficient statistics T is obtained by summing over all possible binary
sequences Y1, ..., Yn such that T = t and M = m. This probability function is

    Pr(T1 = t1, ..., Tp = tp, M = m | β, X) = c(t, m) exp(mθ + tβ) / Π_{i=1}^n {1 + exp(θ + xiβ)}

where c(t, m) is the combinatorial coefficient of (t, m) or the number of distinct binary sequences
Y1, ..., Yn such that T = t and M = m (Cox and Snell 1989).
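In Stata terms, the observed sufficient statistics are just sums over the estimation sample. A
minimal sketch for a hypothetical depvar y and covariate x1, where the first sum displayed is t1
and the second is m:

    . generate double yx1 = y*x1
    . summarize yx1, meanonly
    . display r(sum)
    . summarize y, meanonly
    . display r(sum)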
Conditional distribution and CMLE
Without loss of generality, we will restrict our discussion to computing the CMLE of β1. If we
condition on observing M = m and T2 = t2, ..., Tp = tp, the probability function of (T1 | β1, T2 =
t2, ..., Tp = tp, M = m) is

    Pr(T1 = t1 | β1, T2 = t2, ..., Tp = tp, M = m)
        = c(t, m)e^{t1β1} / Σ_u c(u, t2, ..., tp, m)e^{uβ1}                             (1)

where the sum in the denominator is over all possible values of T1 such that M = m and
T2 = t2, ..., Tp = tp and c(u, t2, ..., tp, m) is the combinatorial coefficient of (u, t2, ..., tp, m)
(Cox and Snell 1989). The CMLE for β1 is the value β̂1 that maximizes the log of (1). This optimization
task is carried out by ml, using the conditional frequency distribution of (T1 | T2 = t2, ..., Tp =
tp, M = m) as a dataset. Generating the joint conditional distribution is efficiently computed using
the multivariate shift algorithm described by Hirji, Mehta, and Patel (1987).

Difficulties in computing β̂1 arise if the observed (T1 = t1, ..., Tp = tp, M = m) lies on
the boundaries of the distribution of (T1 | T2 = t2, ..., Tp = tp, M = m), where the conditional
probability function is monotonically increasing (or decreasing) in β1. Here the CMLE is plus infinity if
it is on the upper boundary, Pr(T1 ≤ t1 | T2 = t2, ..., Tp = tp, M = m) = 1, and is minus infinity
if it is on the lower boundary of the distribution, Pr(T1 ≥ t1 | T2 = t2, ..., Tp = tp, M = m) = 1.
This concept is demonstrated in example 2. When infinite CMLEs occur, the MUE is computed.
Median unbiased estimates and exact CI
The MUE is computed using the technique outlined by Hirji, Tsiatis, and Mehta (1989). First, we
find the values β1^(u) and β1^(l) such that

    Pr(T1 ≤ t1 | β1 = β1^(u), T2 = t2, ..., Tp = tp, M = m) = 1/2
    Pr(T1 ≥ t1 | β1 = β1^(l), T2 = t2, ..., Tp = tp, M = m) = 1/2                       (2)

The MUE is then β̄1 = (β1^(l) + β1^(u))/2. However, if T1 is equal to the minimum of the domain of
the conditional distribution, β^(l) does not exist and β̄1 = β^(u). If T1 is equal to the maximum of the
domain of the conditional distribution, β^(u) does not exist and β̄1 = β^(l).

Confidence bounds for β are computed similarly, except that we substitute α/2 for 1/2 in (2),
where 1 − α is the confidence level. Here β1^(u) would then be the lower confidence bound and β1^(l)
would be the upper confidence bound (see example 3).
Conditional hypothesis tests
To test H0: β1 = 0 versus H1: β1 ≠ 0, we obtain the exact p-value from Σ_{u∈E} f1(u) − f1(t1)/2
if the mid-p-value rule is used and Σ_{u∈E} f1(u) otherwise. Here E is a critical region, and we define
f1(u) = Pr(T1 = u | β1 = 0, T2 = t2, ..., Tp = tp, M = m) for ease of notation. There are two
popular ways to define the critical region: the conditional probabilities test and the conditional scores
test (Mehta and Patel 1995). The critical region when using the conditional probabilities test is all
values of the sufficient statistic for β1 that have a probability less than or equal to that of the observed
t1, Ep = {u : f1(u) ≤ f1(t1)}. The critical region of the conditional scores test is defined as all
values of the sufficient statistic for β1 such that its score is greater than or equal to that of t1,

    Es = {u : (u − μ1)²/σ1² ≥ (t1 − μ1)²/σ1²}
Here μ1 and σ1² are the mean and variance of (T1 | β1 = 0, T2 = t2, ..., Tp = tp, M = m).

The score statistic is defined as

    {∂ℓ(β)/∂β}² {−E[∂²ℓ(β)/∂β²]}^{−1}

evaluated at H0: β = 0, where ℓ is the log of (1). The score test simplifies to (t − E[T | β])²/var(T | β)
(Hirji 2006), where the mean and variance are computed from the conditional distribution of the
sufficient statistic with β = 0 and t is the observed sufficient statistic.
Sufficient-statistic p-value
The p-value for testing H0: β1 = 0 versus the two-sided alternative when (T1 = t1 | T2 =
t2, ..., Tp = tp) is computed as 2 × min(pl, pu), where

    pl = Σ_{u≤t1} c(u, t2, ..., tp, m) / Σ_u c(u, t2, ..., tp, m)
    pu = Σ_{u≥t1} c(u, t2, ..., tp, m) / Σ_u c(u, t2, ..., tp, m)

It is the probability of observing a more extreme T1.
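For instance, in the constant-only model of example 3, t1 = m = 4 and the coefficients c(u) for
u = 0, ..., 6 are 1, 6, 15, 20, 15, 6, 1, so pl = (1 + 6 + 15 + 20 + 15)/64 = 0.890625 and
pu = (15 + 6 + 1)/64 = 0.34375, giving 2 × min(pl, pu) = 0.6875, the value labeled 2*Pr(Suff.)
in that example.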
References
Cox, D. R., and E. J. Snell. 1989. Analysis of Binary Data. 2nd ed. London: Chapman & Hall.
Garsd, A. 1988. Schizophrenia and birth complications. Unpublished manuscript.
Hirji, K. F. 2006. Exact Analysis of Discrete Data. Boca Raton: Chapman & Hall/CRC.
Hirji, K. F., C. R. Mehta, and N. R. Patel. 1987. Computing distributions for exact logistic regression. Journal of the
American Statistical Association 82: 1110–1117.
Hirji, K. F., A. A. Tsiatis, and C. R. Mehta. 1989. Median unbiased estimation for binary data. American Statistician
43: 7–11.
Hutto, C., W. P. Parks, S. Lai, M. T. Mastrucci, C. Mitchell, J. Muñoz, E. Trapido, I. M. Master, and G. B. Scott.
1991. A hospital-based prospective study of perinatal infection with human immunodeficiency virus type 1. Journal
of Pediatrics 118: 347–353.
Mehta, C. R., and N. R. Patel. 1995. Exact logistic regression: Theory and examples. Statistics in Medicine 14:
2143–2160.
Also see
[R] exlogistic postestimation — Postestimation tools for exlogistic
[R] binreg — Generalized linear models: Extensions to the binomial family
[R] clogit — Conditional (fixed-effects) logistic regression
[R] expoisson — Exact Poisson regression
[R] logit — Logistic regression, reporting coefficients
[R] logistic — Logistic regression, reporting odds ratios
[U] 20 Estimation and postestimation commands
Title
exlogistic postestimation — Postestimation tools for exlogistic
Description
The following postestimation commands are of special interest after exlogistic:

command          description
estat predict    single-observation prediction
estat se         report odds ratio or coefficient asymptotic standard errors

For information about these commands, see below.

The following standard postestimation command is also available:

command          description
estat summarize  estimation sample summary

estat summarize is not allowed if the binomial() option was specified in exlogistic.
Special-interest postestimation commands
estat predict computes a predicted probability (or linear predictor), its asymptotic standard
error, and its exact confidence interval for 1 observation. Predictions are carried out by estimating the
constant coefficient after shifting the independent variables and conditioned variables by the values
specified in the at() option or by their medians. Therefore, predictions must be done with the
estimation sample in memory. If a different dataset is used or if the dataset is modified, then an error
will result.
estat se reports odds ratio or coefficient asymptotic standard errors. The estimates are stored in
the matrix r(estimates).
Syntax for estat predict
estat predict [, options]

options               description
pr                    probability; the default
xb                    linear effect
at(atspec)            use the specified values for the indepvars and condvars()
level(#)              set confidence level for the predicted value; default is level(95)
memory(#[b|k|m|g])    set limit on memory usage; default is memory(10m)
nolog                 do not display the enumeration log

These statistics are available only for the estimation sample.
Menu
Statistics > Postestimation > Predictions, residuals, etc.
Options for estat predict
pr, the default, calculates the probability.
xb calculates the linear effect.
at(varname=# [varname=# ...]) specifies values to use in computing the predicted
value. Here varname is one of the independent variables, indepvars, or the conditioned variables,
condvars(). The default is to use the median of each independent and conditioned variable.
level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is
level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals.
memory(#[b|k|m|g]) sets a limit on the amount of memory estat predict can use when
generating the conditional distribution of the constant parameter sufficient statistic. The default is
memory(10m), where m stands for megabyte, or 1,048,576 bytes. The following are also available:
b stands for byte; k stands for kilobyte, which is equal to 1,024 bytes; and g stands for gigabyte,
which is equal to 1,024 megabytes. The minimum setting allowed is 1m and the maximum is 512m
or 0.5g, but do not attempt to use more memory than is available on your computer. Also see
Remarks in [R] exlogistic for details on enumerating the conditional distribution.
nolog prevents the display of the enumeration log. By default, the enumeration log is displayed
showing the progress of enumerating the distribution of the observed successes conditioned on the
independent variables shifted by the values specified in at() (or by their medians). See Methods
and formulas in [R] exlogistic for details of the computations.
Syntax for estat se
estat se [, coef]
Menu
Statistics > Postestimation > Reports and statistics
Option for estat se
coef requests that the estimated coefficients and their asymptotic standard errors be reported. The
default is to report the odds ratios and their asymptotic standard errors.
Remarks
Predictions must be done using the estimation sample. This is because the prediction is really an
estimated constant coefficient (the intercept) after shifting the independent variables and conditioned
variables by the values specified in at() or by their medians. The justification for this approach can
be seen by rewriting the model as

    log{πi/(1 − πi)} = (α + x0β) + (xi − x0)β

where x0 are the specified values for the indepvars (Mehta and Patel 1995). Because the estimation
of the constant term is required, this technique is not appropriate for stratified models that used the
group() option.
Example 1
To demonstrate, we return to example 2 in [R] exlogistic, which uses data from a prospective study
of perinatal infection and HIV-1. Here there was an investigation into whether the blood serum levels
of CD4 and CD8 measured in infants at 6 months of age might predict their development of HIV
infection. The blood serum levels are coded as ordinal values 0, 1, and 2. These data are used by
Mehta and Patel (1995) as an exposition of exact logistic regression.
. use http://www.stata-press.com/data/r11/hiv_n
(prospective study of perinatal infection of HIV-1; binomial form)
. gen byte cd4_0 = (cd4==0)
. gen byte cd4_1 = (cd4==1)
. gen byte cd8_0 = (cd8==0)
. gen byte cd8_1 = (cd8==1)
. exlogistic hiv cd4_0 cd4_1 cd8_0 cd8_1, terms(cd4=cd4_0 cd4_1,
>     cd8=cd8_0 cd8_1) binomial(n) test(probability) saving(dist, replace)
  (output omitted )
. estat predict
Enumerating sample-space combinations:
observation 1:   enumerations =         3
observation 2:   enumerations =        12
observation 3:   enumerations =         5
observation 4:   enumerations =         5
observation 5:   enumerations =         5
observation 6:   enumerations =        35
observation 7:   enumerations =        15
observation 8:   enumerations =        15
observation 9:   enumerations =         9
observation 10:  enumerations =         9
observation 11:  enumerations =         5
observation 12:  enumerations =        18
note: CMLE estimate for _cons is -inf; computing MUE
Predicted value at cd4_0 = 0, cd4_1 = 0, cd8_0 = 0, cd8_1 = 1
         hiv    Predicted   Std. Err.   [95% Conf. Interval]

 Probability      0.0390*         N/A     0.0000      0.1962

(*) identifies median unbiased estimates (MUE); because an MUE
    is computed, there is no SE estimate
Because we did not specify values by using the at() option, the median values of the indepvars
are used for the prediction. By default, medians are used instead of means because we want to use
values that are observed in the dataset. If the means of the binary variables cd4_0–cd8_1 were
used, we would have created floating point variables in (0, 1) that not only do not properly represent
the indicator variables but also would be a source of computational inefficiency in generating the
conditional distribution. Because the MUE is computed for the predicted value, there is no standard-error
estimate.
From the example discussions in [R] exlogistic, the infants at highest risk are those with a CD4
level of 0 and a CD8 level of 2. Below we use the at() option to make a prediction at these blood
serum levels.
. estat predict, at(cd4_0=1 cd4_1=0 cd8_0=0 cd8_1=0) nolog
note: CMLE estimate for _cons is +inf; computing MUE
Predicted value at cd4_0 = 1, cd4_1 = 0, cd8_0 = 0, cd8_1 = 0
         hiv    Predicted   Std. Err.   [95% Conf. Interval]

 Probability      0.9063*         N/A     0.4637      1.0000

(*) identifies median unbiased estimates (MUE); because an MUE
    is computed, there is no SE estimate
Saved results
estat predict saves the following in r():
Scalars
  r(imue)       1 if r(pred) is an MUE and 0 if a CMLE
  r(pred)       estimated probability or the linear effect
  r(se)         asymptotic standard error of r(pred)

Macros
  r(estimate)   prediction type: pr or xb
  r(level)      confidence level

Matrices
  r(ci)         confidence interval
  r(x)          indepvars and condvars() values
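These results can be picked up like any r-class results after estat predict (a usage sketch, not
part of the original entry):

    . estat predict, nolog
    . display r(pred)
    . matrix list r(ci)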
Methods and formulas
All postestimation commands listed above are implemented as ado-files using Mata.
Reference
Mehta, C. R., and N. R. Patel. 1995. Exact logistic regression: Theory and examples. Statistics in Medicine 14:
2143–2160.
Also see
[R] exlogistic — Exact logistic regression
[U] 20 Estimation and postestimation commands
Title
expoisson — Exact Poisson regression
Syntax
expoisson depvar indepvars [if] [in] [weight] [, options]

options                 description

Model
  condvars(varlist)     condition on variables in varlist
  group(varname)        groups/strata are stratified by unique values of varname
  exposure(varname_e)   include ln(varname_e) in model with coefficient constrained to 1
  offset(varname_o)     include varname_o in model with coefficient constrained to 1

Options
  memory(#[b|k|m|g])    set limit on memory usage; default is memory(25m)
  saving(filename)      save the joint conditional distribution to filename

Reporting
  level(#)              set confidence level; default is level(95)
  irr                   report incidence-rate ratios
  test(testopt)         report significance of observed sufficient statistic, conditional
                          scores test, or conditional probabilities test
  mue(varlist)          compute the median unbiased estimates for varlist
  midp                  use the mid-p-value rule
  nolog                 do not display the enumeration log

by, statsby, and xi are allowed; see [U] 11.1.10 Prefix commands.
fweights are allowed; see [U] 11.1.6 weight.
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.
Menu
Statistics > Exact statistics > Exact Poisson regression
Description
expoisson fits an exact Poisson regression model of depvar on indepvars. Exact Poisson regression
is an alternative to standard maximum-likelihood–based Poisson regression (see [R] poisson) that
offers more accurate inference in small samples because it does not depend on asymptotic results.
For stratified data, expoisson is an alternative to fixed-effects Poisson regression (see xtpoisson,
fe in [XT] xtpoisson); like fixed-effects Poisson regression, exact Poisson regression conditions on
the number of events in each stratum.
Exact Poisson regression is computationally intensive, so if you have regressors whose parameter
estimates are not of interest (i.e., nuisance parameters), you should specify those variables in the
condvars() option instead of in indepvars.
Options
Model
condvars(varlist) specifies variables whose parameter estimates are not of interest to you. You
can save substantial computer time and memory by moving such variables from indepvars to
condvars(). Understand that you will get the same results for x1 and x3 whether you type
. expoisson y x1 x2 x3 x4
or
. expoisson y x1 x3, condvars(x2 x4)
group(varname) specifies the variable defining the strata, if any. A constant term is assumed for
each stratum identified in varname, and the sufficient statistics for indepvars are conditioned on
the observed number of successes within each group (as well as other variables in the model).
The group variable must be integer valued.
exposure(varname_e), offset(varname_o); see [R] estimation options.
Options
memory(#[b|k|m|g]) sets a limit on the amount of memory expoisson can use when computing
the conditional distribution of the parameter sufficient statistics. The default is memory(25m),
where m stands for megabyte, or 1,048,576 bytes. The following are also available: b stands for
byte; k stands for kilobyte, which is equal to 1,024 bytes; and g stands for gigabyte, which is
equal to 1,024 megabytes. The minimum setting allowed is 1m and the maximum is 2048m or 2g,
but do not attempt to use more memory than is available on your computer.
saving(filename [, replace]) saves the joint conditional distribution for each independent variable
specified in indepvars. There is one file for each variable, and it is named using the prefix filename
with the variable name appended. For example, saving(mydata) with an independent variable
named X would generate a data file named mydata X.dta. Use replace to replace an existing
file. Each file contains the conditional distribution for one of the independent variables specified in
indepvars conditioned on all other indepvars and those variables specified in condvars(). There
are two variables in each data file: the feasible sufficient statistics for the variable’s parameter and
their associated weights. The weights variable is named _w_.
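For instance (hypothetical model and variable names, following the file-naming rule just described):

    . expoisson y x1 x2, saving(mydata, replace)
    . use mydata_x1, clear
    . describe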
Reporting
level(#); see [R] estimation options. The level(#) option will not work on replay because
confidence intervals are based on estimator-specific enumerations. To change the confidence level,
you must refit the model.
irr reports estimated coefficients transformed to incidence-rate ratios, that is, exp(β) rather than β .
Standard errors and confidence intervals are similarly transformed. This option affects how results
are displayed, not how they are estimated or stored. irr may be specified at estimation or when
replaying previously estimated results.
test(sufficient | score | probability) reports the significance level of the observed sufficient statistic, the conditional scores test, or the conditional probabilities test. The default is
test(sufficient). All the statistics are computed at estimation time, and each statistic may be
displayed postestimation; see [R] expoisson postestimation.
mue(varlist) specifies that median unbiased estimates (MUEs) be reported for the variables in varlist.
By default, the conditional maximum likelihood estimates (CMLEs) are reported, except for those
parameters for which the CMLEs are infinite. Specify mue(_all) if you want MUEs for all the
indepvars.
midp instructs expoisson to use the mid-p-value rule when computing the MUEs, significance levels,
and confidence intervals. This adjustment for the discreteness of the distribution halves the value
of the discrete probability of the observed statistic before adding it to the p-value. The
mid-p-value rule cannot be applied to MUEs whose corresponding parameter CMLE is infinite.
nolog prevents the display of the enumeration log. By default, the enumeration log is displayed,
showing the progress of computing the conditional distribution of the sufficient statistics.
Remarks
Exact Poisson regression estimates the model parameters by using the conditional distributions
of the parameters’ sufficient statistics, and the resulting parameter estimates are known as CMLEs.
Exact Poisson regression is a small-sample alternative to the maximum-likelihood (ML) Poisson model.
See [R] poisson and [XT] xtpoisson to obtain maximum likelihood estimates (MLEs) for the Poisson
model and the fixed-effects Poisson model.
Let Yi denote a Poisson random variable where we observe the outcome Yi = yi , i = 1, . . . , n.
Associated with each independent observation is a 1 × p vector of covariates, xi . We will denote
µi = E [Yi | xi ] and use the log linear model to model the relationship between Yi and xi ,
log (µi ) = θ + xi β
where the constant term, θ, and the p × 1 vector of regression parameters, β, are unknown. The
probability of observing Yi = yi , i = 1, . . . , n, is
    Pr(Y = y) = Π_{i=1}^n μi^{yi} e^{−μi} / yi!
where Y = (Y1 , . . . , Yn ) and y = (y1 , . . . , yn ). The MLEs for θ and β maximize the log of this
function.
The sufficient statistics for θ and βj, j = 1, ..., p, are M = Σ_{i=1}^n Yi and Tj = Σ_{i=1}^n Yi xij,
respectively, and we observe M = m and Tj = tj. expoisson tallies the conditional distribution
for each Tj, given the other sufficient statistics Tl = tl, l ≠ j and M = m. Denote one of these
values to be tj^(k), k = 1, ..., N, with weight wk that accounts for all the generated Y vectors that
give rise to tj^(k). The conditional probability of observing Tj = tj has the form

    Pr(Tj = tj | Tl = tl, l ≠ j, M = m) = w e^{tjβj} / Σ_k wk e^{tj^(k)βj}              (1)

where the sum is over the subset of T vectors such that (T1^(k) = t1, ..., Tj^(k) = tj^(k), ..., Tp^(k) = tp)
and w is the weight associated with the observed t. The CMLE for βj maximizes the log of this
function.
Specifying nuisance variables in condvars() prevents expoisson from estimating their associated
regression coefficients. These variables are still conditional variables when tallying the conditional
distribution for the variables in indepvars.
Inferences from MLEs rely on asymptotics, and if your sample size is small, these inferences may
not be valid. On the other hand, inferences from the CMLEs are exact in that they use the conditional
distribution of the sufficient statistics outlined above.
For small datasets, the dependent variable can be completely determined by the data. Here the MLEs
and the CMLEs are unbounded. When this occurs, expoisson will compute the MUE, the regression
estimate that places the observed sufficient statistic at the median of the conditional distribution.
See [R] exlogistic for a more thorough discussion of exact estimation and related statistics.
Example 1
Armitage, Berry, and Matthews (2002, 499–501) fit a log-linear model to data containing the
number of cerebrovascular accidents experienced by 41 men during a fixed period, each of whom
had recovered from a previous cerebrovascular accident and was hypertensive. Sixteen men received
treatment, and in the original data, there are three age groups (40–49, 50–59, ≥60), but we pool the
first two age groups to simplify the example. Armitage, Berry, and Matthews point out that this was
not a controlled trial, but the data are useful to inquire whether there is evidence of fewer accidents
for the treatment group and if age may be an important factor. The dependent variable count contains
the number of accidents, variable treat is an indicator for the treatment group (1 = treatment, 0 =
control), and variable age is an indicator for the age group (0 = 40−59; 1 = ≥60).
First, we load the data, list it, and tabulate the cerebrovascular accident counts by treatment and
age group.
. use http://www.stata-press.com/data/r11/cerebacc
(cerebrovascular accidents in hypotensive-treated and control groups)
. list

             treat   count     age

  1.       control       0   40/59
  2.       control       0    >=60
  3.       control       1   40/59
  4.       control       1    >=60
  5.       control       2   40/59
  6.       control       2    >=60
  7.       control       3   40/59
           (output omitted )
 35.     treatment       0   40/59
 36.     treatment       0   40/59
 37.     treatment       0   40/59
 38.     treatment       0   40/59
 39.     treatment       1   40/59
 40.     treatment       1   40/59
 41.     treatment       1   40/59
. tabulate treat age [fw=count]

hypotensive |
       drug |       age group
  treatment |     40/59       >=60 |     Total
------------+----------------------+----------
    control |        15         10 |        25
  treatment |         4          0 |         4
------------+----------------------+----------
      Total |        19         10 |        29
Next we estimate the CMLE with expoisson and, for comparison, the MLE with poisson.
. expoisson count treat age
Estimating: treat
Enumerating sample-space combinations:
observation 1:   enumerations =        11
observation 2:   enumerations =        11
observation 3:   enumerations =        11
  (output omitted )
observation 39:  enumerations =       410
observation 40:  enumerations =       410
observation 41:  enumerations =        30
Estimating: age
Enumerating sample-space combinations:
observation 1:   enumerations =         5
observation 2:   enumerations =        15
observation 3:   enumerations =        15
  (output omitted )
observation 39:  enumerations =       455
observation 40:  enumerations =       455
observation 41:  enumerations =        30

Exact Poisson regression                      Number of obs =         41

       count       Coef.   Suff.   2*Pr(Suff.)   [95% Conf. Interval]

       treat   -1.594306       4        0.0026   -3.005089   -.4701708
         age   -.5112067      10        0.2794   -1.416179    .3429232

. poisson count treat age, nolog

Poisson regression                            Number of obs   =        41
                                              LR chi2(2)      =     10.64
                                              Prob > chi2     =    0.0049
Log likelihood = -38.97981                    Pseudo R2       =    0.1201

       count       Coef.   Std. Err.      z   P>|z|    [95% Conf. Interval]

       treat   -1.594306    .5573614   -2.86   0.004   -2.686714   -.2677391
         age   -.5112067    .4043525   -1.26   0.206   -1.303723    .2813096
       _cons     .233344    .2556594    0.91   0.361   -.5018975    .7344271
expoisson generates an enumeration log for each independent variable in indepvars. The conditional
distribution of the parameter sufficient statistic is tallied for each independent variable. The
conditional distribution for treat, for example, has 30 records containing the weights, wk, and
feasible sufficient statistics, t_treat^(k). In essence, the set of points (wk, t_treat^(k)), k = 1, ..., 30,
tallied by expoisson now become the data to estimate the regression coefficient for treat, using (1)
as the likelihood. Remember that one of the 30 (wk, t_treat^(k)) must contain the observed sufficient
statistic, t_treat = Σ_{i=1}^41 treat_i × count_i = 4, and its relative position in the sorted set of
points (sorted by t_treat^(k)) is how the sufficient-statistic significance is computed. This algorithm
is repeated for the age variable.
The regression coefficients for treat and age are numerically identical for both Poisson models.
Both models indicate that the treatment significantly reduces the rate of cerebrovascular accidents,
e^−1.59 ≈ 0.204, or a reduction of about 80%. There is no significant age effect.
The p-value for the treatment regression-coefficient sufficient statistic indicates that the treatment
effect is a bit more significant than for the corresponding asymptotic Z statistic from poisson.
However, the exact confidence intervals are wider than their asymptotic counterparts.
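The implied rate ratio can be recovered from the reported coefficient directly (a quick check, not
part of the original output):

    . display exp(-1.594306)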
Example 2
Agresti (2002) used the data from Laird and Olivier (1981) to demonstrate the Poisson model
for modeling rates. The data consist of patient survival after heart valve replacement operations. The
sample consists of 109 patients who are classified by type of heart valve (aortic, mitral) and by age
(<55, ≥55). Follow-up observations cover lengths from 3 to 97 months, and the time at risk, or
exposure, is stored in the variable TAR. The response is whether the subject died. First, we take a
look at the data and then estimate the incidence rates (IRs) with expoisson and poisson.
. use http://www.stata-press.com/data/r11/heartvalve
(heart valve replacement data)
. list

         age    valve   deaths    TAR

  1.   < 55   aortic        4   1259
  2.   < 55   mitral        1   2082
  3.   >=55   aortic        7   1417
  4.   >=55   mitral        9   1647
The age variable is coded 0 for age <55 and 1 for age ≥55, and the valve variable is coded 0 for
the aortic valve and 1 for the mitral valve. The total number of deaths, M = 21, is small enough that
enumerating the conditional distributions for age and valve type is feasible and asymptotic inferences
associated with standard ML Poisson regression may be questionable.
. expoisson deaths age valve, exposure(TAR) irr
Estimating: age
Enumerating sample-space combinations:
observation 1:   enumerations =        11
observation 2:   enumerations =        11
observation 3:   enumerations =       132
observation 4:   enumerations =        22
Estimating: valve
Enumerating sample-space combinations:
observation 1:   enumerations =        17
observation 2:   enumerations =        17
observation 3:   enumerations =       102
observation 4:   enumerations =        22

Exact Poisson regression                      Number of obs =          4

      deaths        IRR   Suff.   2*Pr(Suff.)   [95% Conf. Interval]

         age   3.390401      16        0.0194    1.182297    11.86935
       valve   .7190197      10        0.5889    .2729881    1.870068
         TAR   (exposure)
. poisson deaths age valve, exposure(TAR) irr nolog

Poisson regression                            Number of obs   =         4
                                              LR chi2(2)      =      7.62
                                              Prob > chi2     =    0.0222
Log likelihood = -8.1747285                   Pseudo R2       =    0.3178

      deaths        IRR   Std. Err.      z   P>|z|    [95% Conf. Interval]

         age   3.390401   1.741967    2.38   0.017     1.238537    9.280965
       valve   .7190197   .3150492   -0.75   0.452     .3046311      1.6971
         TAR   (exposure)
The CMLE and the MLE are numerically identical. The death rate for the older age group is about
3.4 times higher than that of the younger age group, and this difference is significant at the 5% level. This
means that for every death in the younger group each month, we would expect about three deaths
in the older group. The IR estimate for valve type is approximately 0.72, but it is not significantly
different from one. The exact Poisson confidence intervals are a bit wider than the asymptotic CIs.
You can use ir (see [ST] epitab) to estimate IRs and exact CIs for one covariate, and we compare
these CIs with those from expoisson, where we estimate the incidence rate by using age only.
. ir deaths age TAR

                 |    age of patient      |
                 |   Exposed   Unexposed  |       Total
-----------------+------------------------+------------
number of deaths |        16           5  |          21
    time at risk |      3064        3341  |        6405
-----------------+------------------------+------------
  Incidence rate |  .0052219    .0014966  |    .0032787
                 |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
 Inc. rate diff. |         .0037254       |    .00085      .0066007
 Inc. rate ratio |         3.489295       |  1.221441      12.17875  (exact)
 Attr. frac. ex. |         .7134092       |  .1812948      .9178898  (exact)
 Attr. frac. pop |         .5435498       |
                 +-------------------------------------------------
                   (midp)   Pr(k>=16) =                   0.0049  (exact)
                   (midp) 2*Pr(k>=16) =                   0.0099  (exact)
. expoisson deaths age, exposure(TAR) irr midp nolog
Exact Poisson regression                      Number of obs =          4

      deaths        IRR   Suff.   2*Pr(Suff.)   [95% Conf. Interval]

         age   3.489295      16        0.0099    1.324926    10.64922
         TAR   (exposure)

mid-p-value computed for the probabilities and CIs
Both ir and expoisson give identical IRs and p-values. Both report the two-sided exact significance
by using the mid-p-value rule that accounts for the discreteness in the distribution by subtracting
p_{1/2} = Pr(T = t)/2 from p_l = Pr(T ≤ t) and p_g = Pr(T ≥ t), computing
2 × min(p_l − p_{1/2}, p_g − p_{1/2}). By default, expoisson will not use the mid-p-value rule
(when you exclude the midp option), and here the two-sided exact significance would be
2 × min(p_l, p_g) = 0.0158. The confidence intervals differ because expoisson uses the mid-p-value
rule when computing the confidence intervals, yet ir does not. You can verify this by executing
expoisson without the midp option for this example; you will get the same CIs as ir.
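That check is simply the previous call without midp:

    . expoisson deaths age, exposure(TAR) irr nolog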
You can replay expoisson to view the conditional scores test or the conditional probabilities test
by using the test() option.
. expoisson, test(score) irr

Exact Poisson regression                      Number of obs =          4

      deaths        IRR      Score   Pr>=Score   [95% Conf. Interval]

         age   3.489295    6.76528      0.0113    1.324926    10.64922
         TAR   (exposure)

mid-p-value computed for the probabilities and CIs
All the statistics for expoisson are defined in Methods and formulas of [R] exlogistic. Apart
from enumerating the conditional distributions for the logistic and Poisson sufficient statistics, computationally, the primary difference between exlogistic and expoisson is the weighting values in
the likelihood for the parameter sufficient statistics.
Example 3
In this example, we fabricate data that will demonstrate the difference between the CMLE and
the MUE when the CMLE is not infinite. A difference in these estimates will be more pronounced
when the probability of the coefficient sufficient statistic is skewed when plotted as a function of the
regression coefficient.
. clear
. input y x
          y          x
1. 0 2
2. 1 1
3. 1 0
4. 0 0
5. 0 .5
6. 1 .5
7. 2 .01
8. 3 .001
9. 4 .0001
10. end
. expoisson y x, test(score)
Enumerating sample-space combinations:
observation 1:   enumerations =        13
observation 2:   enumerations =        91
observation 3:   enumerations =       169
observation 4:   enumerations =       169
observation 5:   enumerations =       313
observation 6:   enumerations =       313
observation 7:   enumerations =      1469
observation 8:   enumerations =      5525
observation 9:   enumerations =      5479

Exact Poisson regression                      Number of obs =          9

           y       Coef.      Score   Pr>=Score   [95% Conf. Interval]

           x   -1.534468   2.955316      0.0810   -3.761718    .0485548
. expoisson y x, test(score) mue(x) nolog

Exact Poisson regression                      Number of obs =          9

           y        Coef.      Score   Pr>=Score   [95% Conf. Interval]

           x   -1.309268*   2.955316      0.0810   -3.761718    .0485548

(*) median unbiased estimates (MUE)
We observe (xi, yi), i = 1, ..., 9. If we condition on m = Σ_{i=1}^9 yi = 12, the conditional
distribution of Tx = Σ_i Yi xi has a size of 5,479 elements. For each entry in this enumeration,
a realization of Yi = yi^(k), k = 1, ..., 5,479, is generated such that Σ_i yi^(k) = 12. One of these
realizations produces the observed tx = Σ_i yi xi ≈ 1.5234.
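The observed sufficient statistic is easy to verify from the input data (a check, not part of the
original output):

    . display 1*1 + 1*.5 + 2*.01 + 3*.001 + 4*.0001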
Below is a graphical display comparing the CMLE with the MUE. We plot Pr(Tx = tx | M = 12, βx)
versus βx, −6 ≤ βx ≤ 1, in the upper panel and the cumulative probabilities, Pr(Tx ≤ tx | M =
12, βx) and Pr(Tx ≥ tx | M = 12, βx), in the lower panel.

    [Figure: upper panel, probability plotted against the x coefficient, with the CMLE marked by a
    dashed line and the MUE by a dotted line; lower panel, the two cumulative probability profiles]

The location of the CMLE, indicated by the dashed line, is at the mode of the probability profile, and
the MUE, indicated by the dotted line, is to the right of the mode. If we solve for the βx^(u) and βx^(l)
such that Pr(Tx ≤ tx | M = 12, βx^(u)) = 1/2 and Pr(Tx ≥ tx | M = 12, βx^(l)) = 1/2, the MUE is
(βx^(u) + βx^(l))/2. As you can see in the lower panel, the MUE cuts through the intersection of these
cumulative probability profiles.
Technical note
The memory(#) option limits the amount of memory that expoisson will consume when computing
the conditional distribution of the parameter sufficient statistics. memory() is independent of the system
setting c(memory) (see set memory in [D] memory), and expoisson can exceed the memory limit
specified in c(memory) without terminating. By default, a log is provided that displays the number
of enumerations (the size of the conditional distribution) after processing each observation. Typically,
you will see the number of enumerations increase, and then at some point they will decrease as the
multivariate shift algorithm (Hirji, Mehta, and Patel 1987) determines that some of the enumerations
cannot achieve the observed sufficient statistics of the conditioning variables. When the algorithm
is complete, however, it is necessary to store the conditional distribution of the parameter sufficient
statistics as a dataset. It is possible, therefore, to get a memory error when the algorithm has completed
and c(memory) is not large enough to store the conditional distribution.
Technical note
Computing the conditional distributions and reported statistics requires data sorting and numerical
comparisons. If there is at least one single-precision variable specified in the model, expoisson
will make comparisons with a relative precision of 2^−5. Otherwise, a relative precision of 2^−11 is
used. Be careful if you use recast to promote a single-precision variable to double precision (see
[D] recast). You might try listing the data in full precision (maybe %20.15g; see [D] format) to make
sure that this is really what you want. See [D] data types for information on precision of numeric
storage types.
Saved results

expoisson saves the following in e():

Scalars
  e(N)                    number of observations
  e(k_groups)             number of groups
  e(relative_weight)      relative weight for the observed e(sufficient) and e(condvars)
  e(sum_y)                sum of depvar
  e(k_indvars)            number of independent variables
  e(k_condvars)           number of conditioning variables
  e(midp)                 mid-p-value rule indicator
  e(eps)                  relative difference tolerance

Macros
  e(cmd)                  expoisson
  e(cmdline)              command as typed
  e(title)                Exact Poisson regression
  e(depvar)               dependent variable
  e(indvars)              independent variables
  e(condvars)             conditional variables
  e(groupvar)             group variable
  e(exposure)             exposure variable
  e(level)                confidence level
  e(wtype)                weight type
  e(wexp)                 weight expression
  e(datasignature)        the checksum
  e(datasignaturevars)    variables used in calculation of checksum
  e(properties)           b V
  e(estat_cmd)            program used to implement estat
  e(predict)              program used to implement predict
  e(marginsnotok)         predictions disallowed by margins

Matrices
  e(b)                    coefficient vector
  e(mue_indicators)       indicator for elements of e(b) estimated using MUE instead of CMLE
  e(se)                   e(b) standard errors (CMLEs only)
  e(ci)                   matrix of e(level) confidence intervals for e(b)
  e(sum_y_groups)         sum of e(depvar) for each group
  e(N_g)                  number of observations in each group
  e(sufficient)           sufficient statistics for e(b)
  e(p_sufficient)         p-value for e(sufficient)
  e(scoretest)            conditional scores tests for indepvars
  e(p_scoretest)          p-values for e(scoretest)
  e(probtest)             conditional probability tests for indepvars
  e(p_probtest)           p-value for e(probtest)

Functions
  e(sample)               marks estimation sample
Methods and formulas
expoisson is implemented as an ado-file.
Let {Y1, Y2, ..., Yn} be a set of n independent Poisson random variables. For each i = 1, ..., n,
we observe Yi = yi ≥ 0, and associated with each observation is the covariate row vector of length
p, xi = (xi1, ..., xip). Denote β = (β1, ..., βp)^T to be the column vector of regression parameters
and θ to be the constant. The sufficient statistic for βj is Tj = Σ_{i=1}^n Yi xij, j = 1, ..., p, and
for θ is M = Σ_{i=1}^n Yi. We observe Tj = tj, tj = Σ_{i=1}^n yi xij, and M = m, m = Σ_{i=1}^n yi.
Let κi be the exposure for the ith observation. Then the probability of observing
(Y1 = y1, Y2 = y2, ..., Yn = yn) is

    Pr(Y1 = y1, ..., Yn = yn | β, X, κ)
        = {Π_{i=1}^n κi^{yi}/yi!} exp(mθ + tβ) / exp{Σ_{i=1}^n κi exp(θ + xiβ)}

where t = (t1, ..., tp), X = (x1^T, ..., xn^T)^T, and κ = (κ1, ..., κn)^T.

The joint distribution of the sufficient statistics (T, M) is obtained by summing over all possible
sequences Y1 ≥ 0, ..., Yn ≥ 0 such that T = t and M = m. This probability function is

    Pr(T1 = t1, ..., Tp = tp, M = m | β, X, κ)
        = {Σ_u Π_{i=1}^n κi^{ui}/ui!} exp(mθ + tβ) / exp{Σ_{i=1}^n κi exp(θ + xiβ)}

where the sum Σ_u is over all nonnegative vectors u of length n such that Σ_{i=1}^n ui = m and
Σ_{i=1}^n ui xi = t.
Conditional distribution
Without loss of generality, we will restrict our discussion to the conditional distribution of the
sufficient statistic for β1, T1. If we condition on observing M = m and T2 = t2, ..., Tp = tp, the
probability function of (T1 | β1, T2 = t2, ..., Tp = tp, M = m) is

    Pr(T1 = t1 | β1, T2 = t2, ..., Tp = tp, M = m)
        = {Σ_u Π_{i=1}^n κi^{ui}/ui!} e^{t1β1} / Σ_v {Π_{i=1}^n κi^{vi}/vi!} e^{β1 Σ_i vi xi1}      (2)

where the sum Σ_u is over all nonnegative vectors u of length n such that Σ_{i=1}^n ui = m and
Σ_{i=1}^n ui xi = t, and the sum Σ_v is over all nonnegative vectors v of length n such that
Σ_{i=1}^n vi = m, Σ_{i=1}^n vi xi2 = t2, ..., Σ_{i=1}^n vi xip = tp. The CMLE for β1 is the value
that maximizes the log of (1). This optimization task is carried out by ml (see [R] ml), using the
conditional distribution of (T1 | T2 = t2, ..., Tp = tp, M = m) as a dataset. This dataset consists
of the feasible values and weights for T1,

    {(s1, Π_{i=1}^n κi^{vi}/vi!) : Σ_{i=1}^n vi = m, Σ_{i=1}^n vi xi1 = s1,
        Σ_{i=1}^n vi xi2 = t2, ..., Σ_{i=1}^n vi xip = tp}

Computing the CMLE, MUE, confidence intervals, conditional hypothesis tests, and sufficient-statistic
p-values is discussed in Methods and formulas of [R] exlogistic. The only difference between the
two techniques is the use of the weights; that is, the weights for exact logistic are the combinatorial
coefficients, c(t, m), in (1) of Methods and formulas in [R] exlogistic. expoisson and exlogistic
use the same ml likelihood evaluator to compute the CMLEs as well as the same ado-programs and
Mata functions to compute the MUEs and estimate statistics.
References
Agresti, A. 2002. Categorical Data Analysis. 2nd ed. Hoboken, NJ: Wiley.
Armitage, P., G. Berry, and J. N. S. Matthews. 2002. Statistical Methods in Medical Research. 4th ed. Oxford:
Blackwell.
Cox, D. R., and E. J. Snell. 1989. Analysis of Binary Data. 2nd ed. London: Chapman & Hall.
Hirji, K. F., C. R. Mehta, and N. R. Patel. 1987. Computing distributions for exact logistic regression. Journal of the
American Statistical Association 82: 1110–1117.
Laird, N. M., and D. Olivier. 1981. Covariance analysis of censored survival data using log-linear analysis techniques.
Journal of the American Statistical Association 76: 231–240.
Also see
[R] expoisson postestimation — Postestimation tools for expoisson
[R] poisson — Poisson regression
[XT] xtpoisson — Fixed-effects, random-effects, and population-averaged Poisson models
[U] 20 Estimation and postestimation commands
Title
expoisson postestimation — Postestimation tools for expoisson
Description

The following postestimation command is of special interest after expoisson:

command     description
estat se    report incidence-rate ratio or coefficient asymptotic standard errors

For information about this command, see below.

The following standard postestimation command is also available:

command          description
estat summarize  estimation sample summary

Special-interest postestimation command

estat se reports regression coefficients or incidence-rate-ratio asymptotic standard errors. The estimates
are stored in the matrix r(estimates).
Syntax for estat se
estat se [, irr]
Menu
Statistics > Postestimation > Reports and statistics
Option for estat se
irr reports estimated coefficients transformed to incidence-rate ratios, that is, exp(β) rather than
β . Standard errors and confidence intervals are similarly transformed. The default is to report the
regression coefficients and their asymptotic standard errors.
Remarks
Example 1
To demonstrate estat se after expoisson, we use the British physicians smoking data.
. use http://www.stata-press.com/data/r11/smokes
(cigarette smoking and lung cancer among British physicians (45-49 years))
. expoisson cases smokes, exposure(peryrs) irr nolog

Exact Poisson regression                      Number of obs =          7

       cases        IRR   Suff.   2*Pr(Suff.)   [95% Conf. Interval]

      smokes   1.077718   797.4        0.0000     1.04552    1.111866
      peryrs   (exposure)

. estat se, irr

       cases        IRR   Std. Err.

      smokes   1.077718    .0168547
Methods and formulas

All postestimation commands listed above are implemented as ado-files.

Also see

[R] expoisson — Exact Poisson regression
[U] 20 Estimation and postestimation commands
Title
fracpoly — Fractional polynomial regression
Syntax

Fractional polynomial regression

    fracpoly [, fracpoly_options]: regression_cmd [yvar1 [yvar2]]
        xvar1 [# [# ...]] [xvar2 [# [# ...]]] [...] [xvarlist]
        [if] [in] [weight] [, regression_cmd_options]

Display table showing the best fractional polynomial model for each degree

    fracpoly, compare

Create variables containing fractional polynomial powers

    fracgen varname # [# ...] [if] [in] [, fracgen_options]

fracpoly_options          description

Model
  degree(#)               degree of fractional polynomial to fit; default is degree(2)

Model 2
  noscaling               suppress scaling of first independent variable
  noconstant              suppress constant term
  powers(numlist)         list of fractional polynomial powers from which models are chosen
  center(cent_list)       specification of centering for the independent variables

Reporting
  log                     display iteration log
  compare                 compare models by degree
  all                     include out-of-sample observations in generated variables

regression_cmd_options    description

Model 2
  regression_cmd_options  options appropriate to the regression command in use

All weight types supported by regression_cmd are allowed; see [U] 11.1.6 weight.
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.
where

cent_list is a comma-separated list with elements varlist:{mean | # | no}, except that the first
element may optionally be of the form {mean | # | no} to specify the default for all variables.

regression_cmd may be clogit, glm, intreg, logistic, logit, mlogit, nbreg, ologit,
oprobit, poisson, probit, qreg, regress, rreg, stcox, streg, or xtgee.
fracgen_options             description

Main
  center(no | mean | #)     center varname as specified; default is center(no)
  noscaling                 suppress scaling of varname
  restrict([varname] [if])  compute centering and scaling using specified subsample
  replace                   replace variables if they exist
Menu

fracpoly

    Statistics > Linear models and related > Fractional polynomials >
        Fractional polynomial regression

fracgen

    Statistics > Linear models and related > Fractional polynomials >
        Create fractional polynomial powers
Description
fracpoly fits fractional polynomials (FPs) in xvar1 as part of the specified regression model.
After execution, fracpoly leaves variables in the dataset named Ixvar__1, Ixvar__2, ..., where
xvar__ represents the first four letters of the name of xvar1. The new variables contain the best-fitting
FP powers of xvar1.
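For instance, a hypothetical run on the auto data; the names Iweig__1 and Iweig__2 follow the
convention just described:

    . sysuse auto
    . fracpoly: regress mpg weight
    . describe Iweig*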
Covariates other than xvar1 , which are optional, are specified in xvar2 , . . . , and xvarlist. They
may be modeled linearly and with specified FP transformations. Fractional polynomial powers are
specified by typing numbers after the variable’s name. A variable name typed without numbers is
entered linearly.
fracgen creates new variables named varname_1, varname_2, . . . , containing FP powers of varname by using the powers (# [# ...]) specified.
See [R] fracpoly postestimation for information on fracplot and fracpred.
See [R] mfp for multivariable FP model fitting.
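For instance, a minimal sketch (sqrtigg, age, died, and weight are hypothetical variable names standing in for your own data):

    . fracpoly: regress sqrtigg age
    . fracpoly, degree(1) compare: logit died age weight

The first command finds the best-fitting FP of degree 2 (the default) in age; the second fits the best FP of degree 1 in age with weight entered linearly and reports the closed-test comparison table.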
Options for fracpoly
Model
degree(#) determines the degree of FP to be fit. The default is degree(2), i.e., a model with two
power terms.
Model 2
noscaling suppresses scaling of xvar1 and its powers.
noconstant suppresses the regression constant if this is permitted by regression cmd.
powers(numlist) is the set of FP powers from which models are to be chosen. The default is
powers(-2,-1,-.5,0,.5,1,2,3) (0 means log).
center(cent_list) defines the centering for the covariates xvar1, xvar2, . . . , xvarlist. The default is center(mean). A typical item in cent_list is varlist:{mean | # | no}. Items are separated by commas. The first item is special because varlist: is optional, and if omitted, the default is (re)set to the specified value (mean or # or no). For example, center(no, age:mean) sets the default to no and sets the centering for age to mean.
regression_cmd_options are options appropriate to the regression command in use. For example, for stcox, regression_cmd_options may include efron or some alternate method for handling tied failures.
Reporting
log displays deviances and (for regress) residual standard deviations for each FP model fit.
compare reports a closed-test comparison between FP models.
all includes out-of-sample observations when generating the best-fitting FP powers of xvar1 , xvar2 ,
etc. By default, the generated FP variables contain missing values outside the estimation sample.
Options for fracgen
Main
center(no | mean | #) specifies whether varname is to be centered; the default is center(no).
noscaling suppresses scaling of varname.
restrict([varname] [if]) specifies that centering and scaling be computed using the subsample identified by varname and if.

    The subsample is defined by the observations for which varname ≠ 0 that also meet the if conditions. Typically, varname = 1 defines the subsample and varname = 0 indicates observations not belonging to the subsample. For observations whose subsample status is uncertain, varname should be set to a missing value; such observations are dropped from the subsample.

    By default, fracgen computes the centering and scaling by using the sample of observations identified in the if and in options. The restrict() option identifies a subset of this sample.
replace specifies that any existing variables named varname_1, varname_2, . . . may be replaced.
Remarks
Remarks are presented under the following headings:
Introduction
fracpoly
Centering
Output with the compare option
fracgen
Models with several continuous covariates
Examples
Introduction
Regression models based on FP functions of a continuous covariate are described by Royston and
Altman (1994b). Detailed examples using an earlier and rather more complex version of this set of
commands are presented by Royston and Altman (1994a).
FPs increase the flexibility afforded by the family of conventional polynomial models. Although
polynomials are popular in data analysis, linear and quadratic functions are severely limited in their
range of curve shapes, whereas cubic and higher-order curves often produce undesirable artifacts,
such as edge effects and waves.
A polynomial of degree m may be written as

    \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_m x^m

whereas an FP of degree m has m integer and/or fractional powers p_1 < \cdots < p_m,

    \beta_0 + \beta_1 x^{(p_1)} + \beta_2 x^{(p_2)} + \cdots + \beta_m x^{(p_m)}

where, for a power p,

    x^{(p)} = \begin{cases} x^p & \text{if } p \neq 0 \\ \log x & \text{if } p = 0 \end{cases}
x must be positive. An FP of first degree (m = 1) involves one power or log transformation of x.
This family of FP functions may be extended in a mathematically natural way to include repeated powers. An FP of degree m with exactly m repeated powers of p is defined as

    \beta_0 + \beta_1 x^{(p)} + \beta_2 x^{(p)} \log x + \cdots + \beta_m x^{(p)} (\log x)^{m-1}

For example, an FP of second degree (m = 2) with repeated powers of 0.5 is

    \beta_0 + \beta_1 x^{0.5} + \beta_2 x^{0.5} \log x

A general FP may include some unique and some repeated powers. For example, one with powers (−1, 1, 3, 3) is

    \beta_0 + \beta_1 x^{-1} + \beta_2 x + \beta_3 x^3 + \beta_4 x^3 \log x
The permitted powers are restricted to the set {−2, −1, −0.5, 0, 0.5, 1, 2, 3} because our experience
using FPs in data analysis indicates that including extra powers in the set is not often worthwhile.
Now we consider using FPs in regression modeling. If the values of the powers p1 , . . . , pm
were known, the FP would resemble a conventional multiple linear regression model with coefficients
β0 , β1 , . . . , βm . However, the powers are not (usually) known and must be estimated, together with the
coefficients, from the data. Estimation involves a systematic search for the best power or combination
of powers from the permitted set. For each possible combination, a linear regression model as just
described is fit, and the corresponding deviance (defined as minus twice the log likelihood) is noted.
The model with the lowest deviance is deemed to have the best fit, and the corresponding powers and
regression coefficients constitute the final FP model. In practice, m = 2 is often sufficient (Royston
and Sauerbrei 2008, 76).
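To make the search concrete, the following do-file fragment enumerates the degree-1 powers by hand for hypothetical variables y and x (with x > 0) and picks the power giving the lowest deviance, using the normal-errors deviance formula given under Methods and formulas. This is only an illustrative sketch, not how fracpoly is implemented internally:

    local best = .
    foreach p in -2 -1 -.5 0 .5 1 2 3 {
        quietly fracgen x `p', replace              // creates x_1 = x^p (p = 0 means ln x)
        quietly regress y x_1
        local dev = e(N)*(1 + log(2*_pi*e(rss)/e(N)))   // deviance with equal weights
        if `dev' < `best' {                         // missing (.) compares as +infinity
            local best = `dev'
            local bestp "`p'"
        }
    }
    display "best power: `bestp'  deviance: `best'"

Any default scaling applied by fracgen is a linear rescaling of the regressor and so does not change the deviance of each fit.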
fracpoly
fracpoly finds and reports a multiple regression model comprising the best-fitting powers of xvar1 together with other covariates specified by xvar2, . . . , xvarlist. The model that is fit depends on the type of regression_cmd used.

The regression output for the best-fitting model may be reproduced by typing regression_cmd without variables or options. predict, test, etc., may be used after fracpoly; the results will depend on regression_cmd.
The standard errors of the fitted values (as estimated after use of fracpoly by using predict
or fracpred with the stdp option) are somewhat too low because no allowance has been made for
the estimation of the powers.
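For example, a sketch using the IgG variables from the examples below:

    . fracpoly: regress sqrtigg age     // fit and report the best FP(2) model
    . regress                           // redisplay the underlying regression
    . predict yhat, xb                  // fitted values from the FP model
    . predict sefit, stdp               // SEs of fitted values (slightly too low; see above)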
If xvar1 has any negative or zero values, fracpoly subtracts the minimum of xvar1 from xvar1 and then adds the rounding (or counting) interval. The interval is defined as the smallest positive difference between the ordered values of xvar1. After this change of origin, the minimum value of xvar1 is positive, so FPs (which require xvar1 > 0) can be used. Unless the noscaling option is used, fracpoly scales the resulting variable by a power of 10 calculated from the data. The scaling is designed to improve numerical stability when fitting FP models.
After execution, fracpoly leaves in the dataset variables named Ixvar__1, Ixvar__2, . . . , which are the best-fitting FP powers of xvar1 (calculated, if necessary, after a change in origin and scale as just described, and if centering is specified, with a constant added to or subtracted from the values after FP transformation). Other variables, whose names follow the same convention, are left in the dataset if xvar2 has been specified.
Centering
As discussed by Garrett (1995, 1998), covariate centering is a sensible, indeed often essential, step
when reporting and interpreting the results of multiple regression models. For this and other reasons,
centering has been introduced as the default option in fracpoly. As written, the familiar straight-line regression function E(y|x) = β_0 + β_1 x is "centered" to 0 in that β_0 = E(y|0). This is fine if x = 0 is a sensible base point. However, the sample values of x may not even encompass 0 (this is usually the case when FP models are contemplated). Then β_0 is a meaningless intercept, and the standard error of its estimate will be large. For the FP model E(y|x) = β_0 + β_1 x^{(p)}, the point x^{(p)} = 0 may even correspond to x = ∞ (consider p < 0). The scheme adopted by fracpoly is to center on the mean of x. For example, for the FP E(y|x) = β_0 + β_1 x^p + β_2 x^q, fracpoly actually fits the model

    E(y \mid x) = \beta_0 + \beta_1 (x^p - \bar{x}^p) + \beta_2 (x^q - \bar{x}^q)

where \bar{x} is the sample mean of the x values and E(y|\bar{x}) = β_0, giving β_0 a respectable interpretation as the predicted value of y at the mean of x. This approach has the advantage that plots of the fitted values and 95% confidence intervals for E(y|x) as a function of x, even within a multiple regression model, are always sensible (provided that the other predictors are suitably centered; otherwise, the confidence limits can be alarmingly wide).
Sometimes centering on the mean is not appropriate, an example being a binary covariate where
often you will want to center on the lower value, usually 0 (i.e., not center). You should then use the
center() option to override the default. An example is center(x1:mean,x2-x5:no,x6:1).
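For example, with a continuous covariate x1 and a binary covariate x2 (hypothetical names), you might center x1 on its mean and leave x2 uncentered:

    . fracpoly, center(x1:mean, x2:no): regress y x1 x2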
Output with the compare option
If the compare option is used, fracpoly displays a table showing the best FP model for each
degree k < m (including the model without x and the model linear in x). Deviance differences
between each FP model and the degree m model are also reported along with the corresponding
p-values (Royston and Altman 1994b; Royston and Sauerbrei 2008).
The compare option implements a closed-test approach to selecting an FP model. It has the
advantage of preserving the type I error probability at a nominal value. For example, suppose a
nominal 5% significance level was chosen, and the test of FP2 versus the null model (i.e., omitting x)
was not significant. No further tests among FP models would then be done, and x would be considered
nonsignificant, regardless of the results of any further model comparisons.
fracgen
The basic syntax of fracgen is

    fracgen varname # [# ...]

Each power (represented by # in the syntax diagram) should be separated by a space. fracgen creates new variables called varname_1, varname_2, etc. Each variable is labeled according to its power, preliminary linear transformation, and centering, if applied.
Positive or negative powers of varname are defined in the usual way. A power of zero is interpreted
as log.
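For example, assuming a strictly positive variable age,

    . fracgen age 0 2

creates age_1 containing the (possibly scaled) log of age and age_2 containing the corresponding square; specify noscaling to suppress the scaling or center() to request centering.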
Models with several continuous covariates
fracpoly estimates powers for FP models in just one continuous covariate (xvar1 ), though other
covariates of any kind (xvar2 , . . . , xvarlist) may be included as linear or predefined FP terms. An
algorithm was suggested by Royston and Altman (1994b) for the joint estimation of FP models in
several continuous covariates. It was later refined by Sauerbrei and Royston (1999) and is implemented
in the Stata command mfp. See [R] mfp as well as Royston and Ambler (1998) and Royston and
Sauerbrei (2008).
Examples
Example 1
Consider the serum immunoglobulin G (IgG) dataset from Isaacs et al. (1983), which consists of
298 independent observations in young children. The dependent variable sqrtigg is the square root
of the IgG concentration, and the independent variable age is the age of each child. (Preliminary
Box–Cox analysis shows that a square root transformation removes the skewness in IgG.) The aim is
to find a model that accurately predicts the mean of sqrtigg given age. We use fracpoly to find
the best FP model of degree 2 (the default option) and graph the resulting fit and 95% confidence
interval:
. use http://www.stata-press.com/data/r11/igg
(Immunoglobulin in children)
. fracpoly: regress sqrtigg age
........
-> gen double Iage__1 = age^-2-.1299486216 if e(sample)
-> gen double Iage__2 = age^2-7.695349038 if e(sample)

      Source         SS       df       MS              Number of obs =     298
                                                        F(  2,   295) =   64.49
       Model    22.2846976     2  11.1423488            Prob > F      =  0.0000
    Residual    50.9676492   295  .172771692            R-squared     =  0.3042
                                                        Adj R-squared =  0.2995
       Total    73.2523469   297  .246640898            Root MSE      =  .41566

     sqrtigg        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

     Iage__1    -.1562156    .027416    -5.70   0.000    -.2101713     -.10226
     Iage__2     .0148405   .0027767     5.34   0.000     .0093757    .0203052
       _cons     2.283145   .0305739    74.68   0.000     2.222974    2.343315

Deviance: 319.45. Best powers of age among 44 models fit: -2 2.
. fracplot age, msize(small)
(figure omitted: fracplot of the fitted Fractional Polynomial (−2 2) with component-plus-residuals; y axis "Predictor+residual of sqrtigg", x axis "Age (years)")
The new variables created by fracpoly contain the best-fitting FP powers of age, as centered
by fracpoly. For example, Iage__1 is centered by subtracting the mean of age raised to the
power −2. In general, the variables created by fracpoly are centered and possibly scaled, which
is reflected in the estimated regression coefficients and intercept. Centering does have its advantages
(see the Centering section earlier in this entry); however, sometimes you may want estimation for
uncentered variables. To obtain regression results for uncentered and unscaled FP variables, specify
options center(no) and noscaling to fracpoly. For a more detailed discussion, see Royston and
Sauerbrei (2008, sec. 4.11).
The fitted curve has an asymmetric S shape. This model has powers (−2, 2) and deviance 319.45.
As many as 44 models have been quietly fit in the search for the best powers. Now let’s look at
models of degree ≤ 4:
. fracpoly, degree(4) compare: regress sqrtigg age
...............................................................................
> .............................................................................
> ........
-> gen double Iage__1 = ln(age)-1.020308063 if e(sample)
-> gen double Iage__2 = age^3-21.34727694 if e(sample)
-> gen double Iage__3 = age^3*ln(age)-21.78079878 if e(sample)
-> gen double Iage__4 = age^3*ln(age)^2-22.22312461 if e(sample)

      Source         SS       df       MS              Number of obs =     298
                                                        F(  4,   293) =   32.63
       Model    22.5754541     4  5.64386353            Prob > F      =  0.0000
    Residual    50.6768927   293  .172958678            R-squared     =  0.3082
                                                        Adj R-squared =  0.2987
       Total    73.2523469   297  .246640898            Root MSE      =  .41588

     sqrtigg        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

     Iage__1     .8761824   .1898721     4.61   0.000     .5024962    1.249869
     Iage__2    -.1922029   .0684934    -2.81   0.005    -.3270044   -.0574015
     Iage__3     .2043794    .074947     2.73   0.007     .0568767    .3518821
     Iage__4    -.0560067   .0212969    -2.63   0.009     -.097921   -.0140924
       _cons     2.238735   .0482705    46.38   0.000     2.143734    2.333736

Deviance: 317.74. Best powers of age among 494 models fit: 0 3 3 3.

Fractional polynomial model comparisons:

         age   df   Deviance   Res. SD   Dev. dif.   P (*)   Powers

Not in model    0    427.539    .49663     109.795   0.000
      Linear    1    337.561    .42776      19.818   0.006   1
       m = 1    2    327.436   .420554       9.692   0.140   0
       m = 2    4    319.448   .415658       1.705   0.794   -2 2
       m = 3    6    319.275   .416243       1.532   0.473   -2 1 1
       m = 4    8    317.744   .415883                       0 3 3 3

(*) P-value from deviance difference comparing reported model with m = 4 model
It appears that the degree 4 FP model is not significantly different from the other FP models (at the
5% level).
Let’s compare the curve shape from the m = 2 model with that from a conventional quartic
polynomial, whose fit turns out to be significantly better than a cubic (not shown). We use the ability
of fracpoly both to generate the required powers of age, namely, (1, 2, 3, 4) for the quartic and
(−2, 2) for the second-degree FP, and to fit the model. We fit both models and graph the resulting
curves:
. fracpoly: regress sqrtigg age 1 2 3 4
-> gen double Iage__1 = age-2.774049213 if e(sample)
-> gen double Iage__2 = age^2-7.695349038 if e(sample)
-> gen double Iage__3 = age^3-21.34727694 if e(sample)
-> gen double Iage__4 = age^4-59.21839681 if e(sample)

      Source         SS       df       MS              Number of obs =     298
                                                        F(  4,   293) =   32.65
       Model    22.5835458     4  5.64588646            Prob > F      =  0.0000
    Residual     50.668801   293  .172931061            R-squared     =  0.3083
                                                        Adj R-squared =  0.2989
       Total    73.2523469   297  .246640898            Root MSE      =  .41585

     sqrtigg        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

     Iage__1     2.047831   .4595962     4.46   0.000     1.143302    2.952359
     Iage__2    -1.058902   .2822803    -3.75   0.000    -1.614456   -.5033479
     Iage__3     .2284917   .0667591     3.42   0.001     .0971037    .3598798
     Iage__4    -.0168534   .0053321    -3.16   0.002    -.0273475   -.0063594
       _cons     2.240012   .0480157    46.65   0.000     2.145512    2.334511

Deviance: 317.70.
. predict fit1
(option xb assumed; fitted values)
. fracpoly: regress sqrtigg age -2 2
-> gen double Iage__1 = age^-2-.1299486216 if e(sample)
-> gen double Iage__2 = age^2-7.695349038 if e(sample)

      Source         SS       df       MS              Number of obs =     298
                                                        F(  2,   295) =   64.49
       Model    22.2846976     2  11.1423488            Prob > F      =  0.0000
    Residual    50.9676492   295  .172771692            R-squared     =  0.3042
                                                        Adj R-squared =  0.2995
       Total    73.2523469   297  .246640898            Root MSE      =  .41566

     sqrtigg        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

     Iage__1    -.1562156    .027416    -5.70   0.000    -.2101713     -.10226
     Iage__2     .0148405   .0027767     5.34   0.000     .0093757    .0203052
       _cons     2.283145   .0305739    74.68   0.000     2.222974    2.343315

Deviance: 319.45.
. predict fit2
(option xb assumed; fitted values)
. scatter sqrtigg fit1 fit2 age, c(. l l) m(o i i) msize(small)
> clpattern(. -_.) ytitle("Square root of IgG") xtitle("Age, years")

(figure omitted: observed sqrtigg with the quartic and FP(−2 2) fitted curves plotted against age; y axis "Square root of IgG", x axis "Age, years")
The quartic curve has an unsatisfactory wavy appearance that is implausible for the known behavior of
IgG, the serum level of which increases throughout early life. The FP curve increases monotonically
and is therefore biologically the more plausible curve. The two models have approximately the same
deviance.
Example 2
Data from Smith et al. (1992) contain times to complete healing of leg ulcers in a randomized
controlled clinical trial of two treatments in 192 elderly patients. Several covariates were available,
of which an important one is mthson, the number of months since the recorded onset of the ulcer.
Because the response variable is time to an event of interest and some (in fact, about one-half) of
the times are censored, using Cox regression to analyze the data is appropriate. We consider FPs in
mthson, adjusting for four other covariates: age; ulcarea, the area of tissue initially affected by
the ulcer; deepppg, a binary variable indicating the presence or absence of deep vein involvement;
and treat, a binary variable indicating treatment type. We fit FPs of degree 1 and 2:
. use http://www.stata-press.com/data/r11/legulcer, clear
(Leg ulcer clinical trial)
. stset ttevent, fail(cens)
(output omitted )
. fracpoly, compare: stcox mthson age ulcarea deepppg treat, nohr
-> gen double Iage__1 = age-73.453125 if e(sample)
-> gen double Iulca__1 = ulcarea-1326.203125 if e(sample)
-> gen double Itrea__1 = treat-1 if e(sample)
........
-> gen double Imths__1 = X^.5-.4930242557 if e(sample)
-> gen double Imths__2 = X^.5*ln(X)+.6973304564 if e(sample)
   (where: X = (mthson+1)/100)
         failure _d:  censored
   analysis time _t:  ttevent
Iteration 0:   log likelihood = -422.65089
Iteration 1:   log likelihood = -390.49313
Iteration 2:   log likelihood = -383.44258
Iteration 3:   log likelihood = -374.28707
Iteration 4:   log likelihood = -369.31417
Iteration 5:   log likelihood = -368.38104
Iteration 6:   log likelihood = -368.35448
Iteration 7:   log likelihood = -368.35446
Refining estimates:
Iteration 0:   log likelihood = -368.35446
Cox regression -- Breslow method for ties
No. of subjects =          192                  Number of obs   =       192
No. of failures =           92
Time at risk    =        13825
                                                LR chi2(6)      =    108.59
Log likelihood  =   -368.35446                  Prob > chi2     =    0.0000

          _t        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

    Imths__1     -2.81425   .6996385    -4.02   0.000    -4.185516   -1.442984
    Imths__2     1.541451   .4703143     3.28   0.001     .6196521     2.46325
     Iage__1    -.0261111   .0087983    -2.97   0.003    -.0433556   -.0088667
    Iulca__1    -.0017491    .000359    -4.87   0.000    -.0024527   -.0010455
     deepppg    -.5850499   .2163173    -2.70   0.007    -1.009024   -.1610758
    Itrea__1    -.1624663   .2171048    -0.75   0.454    -.5879838    .2630513

Deviance: 736.71. Best powers of mthson among 44 models fit: .5 .5.

Fractional polynomial model comparisons:

      mthson   df   Deviance   Dev. dif.   P (*)   Powers

Not in model    0    754.345      17.636   0.001
      Linear    1    751.680      14.971   0.002   1
       m = 1    2    738.969       2.260   0.323   -.5
       m = 2    4    736.709                       .5 .5

(*) P-value from deviance difference comparing reported model with m = 2 model
The best-fit FP of degree 2 has powers (0.5, 0.5) and deviance 736.71. However, this model does not
fit significantly better than the FP of degree 1 (at the 5% level), which has power −0.5 and deviance
738.97. We prefer the model with m = 1, for which the partial linear predictor is shown on the next
page.
. quietly fracpoly, degree(1): stcox mthson age ulcarea deepppg treat, nohr
. fracplot, ytitle(Partial linear predictor) m(i) ciopts(bcolor(white))
(figure omitted: fracplot of the Fractional Polynomial (−.5), adjusted for covariates; y axis "Partial linear predictor", x axis "Months since onset")
The hazard for healing is much higher for patients whose ulcer is of recent onset than for those who
have had an ulcer for many months.
fracpoly has automatically centered the predictors on their mean values, but because in Cox
regression there is no constant term, we cannot see the effects of centering in the table of regression
estimates. The effects would be present if we were to graph the baseline hazard or survival function
because these functions are defined with all predictors set equal to 0.
A more appropriate analysis of this dataset, if one wished to model all the predictors, possibly
with FP functions, would be to use mfp; see [R] mfp.
Saved results
In addition to what regression_cmd saves, fracpoly saves the following in e():

Scalars
    e(fp_N)                    number of nonmissing observations
    e(fp_dev)                  deviance for FP model of degree m
    e(fp_df)                   FP model degrees of freedom
    e(fp_d0)                   deviance for model without xvar1
    e(fp_s0)                   residual SD for model without xvar1
    e(fp_dlin)                 deviance for model linear in xvar1
    e(fp_slin)                 residual SD for model linear in xvar1
    e(fp_d1), e(fp_d2), ...    deviances for FP models of degree 1,2,...,m
    e(fp_s1), e(fp_s2), ...    residual SDs for FP models of degree 1,2,...,m

Macros
    e(fp_cmd)                  fracpoly
    e(cmdline)                 command as typed
    e(fp_depv)                 yvar1 (yvar2)
    e(fp_rhs)                  xvar1
    e(fp_base)                 variables in xvar2, . . . , xvarlist after centering and FP transformation
    e(fp_xp)                   Ixvar__1, Ixvar__2, etc.
    e(fp_fvl)                  variables in model finally estimated
    e(fp_wgt)                  weight type or ""
    e(fp_wexp)                 weight expression if `e(fp_wgt)' != ""
    e(fp_pwrs)                 powers for FP model of degree m
    e(fp_x1), e(fp_x2), ...    xvar1 and variables in model
    e(fp_k1), e(fp_k2), ...    powers for FP models of degree 1,2,...,m

Residual SDs are stored only when regression_cmd is regress.
Methods and formulas
fracpoly and fracgen are implemented as ado-files.
The general definition of an FP, accommodating possible repeated powers, may be written for functions H_1(x), . . . , H_m(x) as

    \beta_0 + \sum_{j=1}^{m} \beta_j H_j(x)

where H_1(x) = x^{(p_1)} and, for j = 2, . . . , m,

    H_j(x) = \begin{cases} x^{(p_j)} & \text{if } p_j \neq p_{j-1} \\ H_{j-1}(x) \log x & \text{if } p_j = p_{j-1} \end{cases}

For example, an FP of degree 3 with powers (1, 3, 3) has H_1(x) = x, H_2(x) = x^3, and H_3(x) = x^3 log x and equals β_0 + β_1 x + β_2 x^3 + β_3 x^3 log x.

An FP model of degree m is taken to have 2m + 1 degrees of freedom (df): one for β_0 and one for each β_j and its associated power. Because the powers in an FP are chosen from a finite set rather than from the entire real line, the df defined in this way are approximate.
The deviance D of a model is defined as −2 times its maximized log likelihood. For normal-errors models, we use the formula

    D = n \left\{ 1 - \bar{l} + \log \left( \frac{2\pi \, \mathrm{RSS}}{n} \right) \right\}

where n is the sample size, \bar{l} is the mean of the log normalized weights (\bar{l} = 0 if the weights are all equal), and RSS is the residual sum of squares as fit by regress.
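With equal weights (\bar{l} = 0), the deviance can be recovered by hand from the results left behind by regress; a sketch with hypothetical variables y and x:

    . quietly regress y x
    . display e(N)*(1 + log(2*_pi*e(rss)/e(N)))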
The compare option causes fracpoly to report a table comparing FP models of degree k < m to the degree m FP model. Suppose that we are comparing FP regression models with degrees k and m. The p-values reported by compare are calculated differently for normal and nonnormal regressions. Let D_k and D_m be the deviances of the models with degrees k and m, respectively. For normal-errors models such as regress, a variance ratio F is calculated as

    F = \frac{n_2}{n_1} \left\{ \exp \left( \frac{D_k - D_m}{n} \right) - 1 \right\}

where n_1 is the numerator df (here, 2m − 2k) and n_2 is the denominator df (equal to rdf − 2m, where rdf is the residual df for the regression model involving only the covariates in xvar2, if any, but not x). The p-value is obtained by referring F to an F distribution on (2, rdf) df.

For nonnormal models (clogit, glm, logistic, . . . , stcox, or streg; not regress), the p-value is obtained by referring D_k − D_m to a χ² distribution on 2m − 2k df. These p-values for comparing models are approximate and are typically somewhat conservative (Royston and Altman 1994b).
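As a sketch, if the two deviances and the relevant df were stored in scalars (all names hypothetical), the p-values above could be computed with Stata's tail-probability functions:

    . display Ftail(2, rdf, F)                  // normal-errors models
    . display chi2tail(2*m - 2*k, Dk - Dm)      // nonnormal models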
The component-plus-residual values graphed by fracplot are calculated as follows: let the data consist of triplets (y_i, x_i, z_i), i = 1, . . . , n, where z_i is the vector of covariates for the ith observation, after applying possible FP transformation and centering as described earlier. Let

    \hat{\eta}_i = \hat{\beta}_0 + \{H(x_i) - H(x_0)\}' \hat{\beta} + z_i' \hat{\gamma}

be the linear predictor from the FP model, as given by the fracpred command, or equivalently, by the predict command with the xb option, after the use of fracpoly. Here H(x_i) = {H_1(x_i), . . . , H_m(x_i)}' is the vector of FP functions described above, H(x_0) = {H_1(x_0), . . . , H_m(x_0)}' is the vector of centering to x_0 (x_0 is often chosen to be the mean of the x_i), \hat{\beta} is the estimated parameter vector, and \hat{\gamma} is the estimated parameter vector for the covariates. The values

    \hat{\eta}_i^{*} = \hat{\beta}_0 + \{H(x_i) - H(x_0)\}' \hat{\beta}

represent the behavior of the FP model for x at fixed values z = 0 of the (centered) covariates. The ith component-plus-residual is defined as \hat{\eta}_i^{*} + d_i, where d_i is the deviance residual for the ith observation. For normal-errors models, d_i = \sqrt{w_i}\,(y_i - \hat{\eta}_i), where w_i is the case weight (or 1, if weight is not specified). For logistic, Cox, and generalized linear regression models, see [R] logistic, [R] probit, [ST] stcox, and [R] glm, respectively, for the formula for d_i. The formula for poisson models is the same as that for glm with family(poisson). For stcox, d_i is the partial martingale residual (see [ST] stcox postestimation).
Acknowledgment
fracpoly and fracgen were written by Patrick Royston, MRC Clinical Trials Unit, London.
References
Becketti, S. 1995. sg26.2: Calculating and graphing fractional polynomials. Stata Technical Bulletin 24: 14–16. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 129–132. College Station, TX: Stata Press.

Garrett, J. M. 1995. sg33: Calculation of adjusted means and adjusted proportions. Stata Technical Bulletin 24: 22–25. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 161–165. College Station, TX: Stata Press.

———. 1998. sg33.1: Enhancements for calculation of adjusted means and adjusted proportions. Stata Technical Bulletin 43: 16–24. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 111–123. College Station, TX: Stata Press.

Isaacs, D., D. G. Altman, C. E. Tidmarsh, H. B. Valman, and A. D. Webster. 1983. Serum immunoglobulin concentrations in preschool children measured by laser nephelometry: Reference ranges for IgG, IgA, IgM. Journal of Clinical Pathology 36: 1193–1196.

Royston, P. 1995. sg26.3: Fractional polynomial utilities. Stata Technical Bulletin 25: 9–13. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 82–87. College Station, TX: Stata Press.

Royston, P., and D. G. Altman. 1994a. sg26: Using fractional polynomials to model curved regression relationships. Stata Technical Bulletin 21: 11–23. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 110–128. College Station, TX: Stata Press.

———. 1994b. Regression using fractional polynomials of continuous covariates: Parsimonious parametric modelling (with discussion). Applied Statistics 43: 429–467.

Royston, P., and G. Ambler. 1998. sg81: Multivariable fractional polynomials. Stata Technical Bulletin 43: 24–32. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 123–132. College Station, TX: Stata Press.

———. 1999a. sg112: Nonlinear regression models involving power or exponential functions of covariates. Stata Technical Bulletin 49: 25–30. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 173–179. College Station, TX: Stata Press.

———. 1999b. sg81.1: Multivariable fractional polynomials: Update. Stata Technical Bulletin 49: 17–23. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 161–168. College Station, TX: Stata Press.

———. 1999c. sg112.1: Nonlinear regression models involving power or exponential functions of covariates: Update. Stata Technical Bulletin 50: 26. Reprinted in Stata Technical Bulletin Reprints, vol. 9, p. 180. College Station, TX: Stata Press.

———. 1999d. sg81.2: Multivariable fractional polynomials: Update. Stata Technical Bulletin 50: 25. Reprinted in Stata Technical Bulletin Reprints, vol. 9, p. 168. College Station, TX: Stata Press.

Royston, P., and W. Sauerbrei. 2008. Multivariable Model-building: A Pragmatic Approach to Regression Analysis Based on Fractional Polynomials for Modelling Continuous Variables. Chichester, UK: Wiley.

Sauerbrei, W., and P. Royston. 1999. Building multivariable prognostic and diagnostic models: Transformation of the predictors by using fractional polynomials. Journal of the Royal Statistical Society, Series A 162: 71–94.

Smith, J. M., C. J. Doré, A. Charlett, and J. D. Lewis. 1992. A randomized trial of Biofilm dressing for venous leg ulcers. Phlebology 7: 108–113.
Also see
[R] fracpoly postestimation — Postestimation tools for fracpoly
[R] mfp — Multivariable fractional polynomial models
[U] 20 Estimation and postestimation commands
Title
fracpoly postestimation — Postestimation tools for fracpoly
Description
The following postestimation commands are of special interest after fracpoly:
command     description

fracplot    plot data and fit from most recently fit fractional polynomial model
fracpred    create variable containing prediction, deviance residuals, or SEs of fitted values
For information about these commands, see below.
The following standard postestimation commands are also available if available after regression_cmd:

command       description

estat         AIC, BIC, VCE, and estimation sample summary
estimates     cataloging estimation results
lincom        point estimates, standard errors, testing, and inference for linear combinations
                of coefficients
linktest      link test for model specification
lrtest        likelihood-ratio test
margins       marginal means, predictive margins, marginal effects, and average marginal effects
nlcom         point estimates, standard errors, testing, and inference for nonlinear combinations
                of coefficients
predict       predictions, residuals, influence statistics, and other diagnostic measures
predictnl     point estimates, standard errors, testing, and inference for generalized predictions
test          Wald tests of simple and composite linear hypotheses
testnl        Wald tests of nonlinear hypotheses
See the corresponding entries in the Base Reference Manual for details.
Special-interest postestimation commands
fracplot plots the data and fit, with 95% confidence limits, from the most recently fit fractional
polynomial (FP) model. The data and fit are plotted against varname, which may be xvar1 or another
of the covariates (xvar2 , . . . , or a variable from xvarlist). If varname is not specified, xvar1 is
assumed.
fracpred creates newvar containing the fitted index or deviance residuals for the whole model,
or the fitted index or its standard error for varname, which may be xvar1 or another covariate.
Syntax for predict
The behavior of predict following fracpoly is determined by regression_cmd. See the corresponding regression_cmd postestimation entry for available predict options.
Also see information on fracpred below.
Syntax for fracplot and fracpred
Plot data and fit from most recently fit fractional polynomial model

    fracplot [varname] [if] [in] [, fracplot_options]

Create variable containing the prediction, deviance residuals, or SEs of fitted values

    fracpred newvar [, fracpred_options]

fracplot_options             description

Plot
  marker_options             change look of markers (color, size, etc.)
  marker_label_options       add marker labels; change look or position

Fitted line
  lineopts(cline_options)    affect rendition of the fitted line

CI plot
  ciopts(area_options)       affect rendition of the confidence bands

Add plots
  addplot(plot)              add other plots to the generated graph

Y axis, X axis, Titles, Legend, Overall
  twoway_options             any options other than by() documented in [G] twoway_options

fracpred_options             description

  for(varname)               compute prediction for varname
  dresid                     compute deviance residuals
  stdp                       compute standard errors of the fitted values varname
fracplot is not allowed after fracpoly with clogit and mlogit. fracpred, dresid is not
allowed after fracpoly with clogit or mlogit.
Menu

fracplot
    Statistics > Linear models and related > Fractional polynomials > Fractional polynomial regression plot

fracpred
    Statistics > Linear models and related > Fractional polynomials > Fractional polynomial prediction
Options for fracplot
Plot

marker_options affect the rendition of markers drawn at the plotted points, including their shape, size, color, and outline; see [G] marker_options.

marker_label_options specify if and how the markers are to be labeled; see [G] marker_label_options.

Fitted line

lineopts(cline_options) affect the rendition of the fitted line; see [G] cline_options.

CI plot

ciopts(area_options) affect the rendition of the confidence bands; see [G] area_options.

Add plots

addplot(plot) provides a way to add other plots to the generated graph. See [G] addplot_option.

Y axis, X axis, Titles, Legend, Overall

twoway_options are any of the options documented in [G] twoway_options, excluding by(). These include options for titling the graph (see [G] title_options) and for saving the graph to disk (see [G] saving_option).
Options for fracpred
for(varname) specifies (partial) prediction for variable varname. The fitted values are adjusted to
the value specified by the adjust() option in fracpoly.
dresid specifies that deviance residuals be calculated.
stdp specifies calculation of the standard errors of the fitted values varname, adjusted for all the
other predictors at the values specified by adjust().
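For example, after an FP fit (a sketch using the IgG model from [R] fracpoly):

    . quietly fracpoly: regress sqrtigg age
    . fracpred ihat                     // fitted index for the whole model
    . fracpred dres, dresid             // deviance residuals
    . fracpred se_age, for(age) stdp    // SEs of the fitted values for age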
Remarks
fracplot actually produces a component-plus-residual plot. For normal-error models with constant
weights and one covariate, this amounts to a plot of the observations with the fitted line inscribed.
For other normal-error models, weighted residuals are calculated and added to the fitted values.
For models with additional covariates, the line is the partial linear predictor for the variable in
question (xvar1 or a covariate) and includes the intercept β0 .
For generalized linear and Cox models, the fitted values are plotted on the scale of the “index” (linear
predictor). Deviance residuals are added to the (partial) linear predictor to give component-plus-residual
values. These values are plotted as small circles.
See [R] fracpoly for examples using fracplot.
Methods and formulas
All postestimation commands listed above and fracplot and fracpred are implemented as
ado-files.
See Methods and formulas in [R] fracpoly for notation.
The component-plus-residual values graphed by fracplot are calculated as follows: Let the data consist of triplets (y_i, x_i, z_i), i = 1, . . . , n, where z_i is the vector of covariates for the ith observation, after applying possible fractional polynomial transformation and adjustment as described earlier. Let

    \hat{\eta}_i = \hat{\beta}_0 + \{H(x_i) - H(x_0)\}' \hat{\beta} + z_i' \hat{\gamma}

be the linear predictor from the FP model, as given by the fracpred command or, equivalently, by the predict command with the xb option, following fracpoly. Here H(x_i) = {H_1(x_i), . . . , H_m(x_i)}' is the vector of FP functions described above, H(x_0) = {H_1(x_0), . . . , H_m(x_0)}' is the vector of adjustments to x_0 (often, x_0 is chosen to be the mean of the x_i), \hat{\beta} is the estimated parameter vector, and \hat{\gamma} is the estimated parameter vector for the covariates. The values

    \hat{\eta}_i^{*} = \hat{\beta}_0 + \{H(x_i) - H(x_0)\}' \hat{\beta}

represent the behavior of the FP model for x at fixed values z = 0 of the (adjusted) covariates. The ith component-plus-residual is defined as \hat{\eta}_i^{*} + d_i, where d_i is the deviance residual for the ith observation. For normal-errors models, d_i = \sqrt{w_i}\,(y_i - \hat{\eta}_i), where w_i is the case weight (or 1, if weight is not specified). For logistic, Cox, and generalized linear regression models, see [R] logistic, [R] probit, [ST] stcox, and [R] glm for the formula for d_i. The formula for poisson models is the same as that for glm with family(poisson). For stcox, d_i is the partial martingale residual (see [ST] stcox postestimation).

fracplot plots the values of d_i and the curve represented by \hat{\eta}_i^{*} against x_i. The confidence interval for \hat{\eta}_i^{*} is obtained from the variance–covariance matrix of the entire model and takes into account the uncertainty in estimating β_0, β, and γ (but not in estimating the FP powers for x).

fracpred with the for(varname) option calculates the predicted index at x_i = x_0 and z_i = 0; that is, \hat{\eta}_i = \hat{\beta}_0 + \{H(x_i) - H(x_0)\}' \hat{\beta}. The standard error is calculated from the variance–covariance matrix of (\hat{\beta}_0, \hat{\beta}), again ignoring estimation of the powers.
Acknowledgment
fracplot and fracpred were written by Patrick Royston of the MRC Clinical Trials Unit, London.
Also see
[R] fracpoly — Fractional polynomial regression
[U] 20 Estimation and postestimation commands
Title
frontier — Stochastic frontier models
Syntax
    frontier depvar [indepvars] [if] [in] [weight] [, options]
options                         description

Model
  noconstant                    suppress constant term
  distribution(hnormal)         half-normal distribution for the inefficiency term
  distribution(exponential)     exponential distribution for the inefficiency term
  distribution(tnormal)         truncated-normal distribution for the inefficiency term
  ufrom(matrix)                 specify untransformed log likelihood; only with d(tnormal)
  cm(varlist[, noconstant])     fit conditional mean model; only with d(tnormal); use
                                  noconstant to suppress constant term

Model 2
  constraints(constraints)      apply specified linear constraints
  collinear                     keep collinear variables
  uhet(varlist[, noconstant])   explanatory variables for technical inefficiency variance
                                  function; use noconstant to suppress constant term
  vhet(varlist[, noconstant])   explanatory variables for idiosyncratic error variance
                                  function; use noconstant to suppress constant term
  cost                          fit cost frontier model; default is production frontier model

SE
  vce(vcetype)                  vcetype may be oim, opg, bootstrap, or jackknife

Reporting
  level(#)                      set confidence level; default is level(95)
  nocnsreport                   do not display constraints
  display_options               control spacing and display of omitted variables and base and
                                  empty cells

Maximization
  maximize_options              control the maximization process; seldom used

† coeflegend                    display coefficients' legend instead of coefficient table

† coeflegend does not appear in the dialog box.
indepvars and varlist may contain factor variables; see [U] 11.4.3 Factor variables.
bootstrap, by, jackknife, rolling, and statsby are allowed; see [U] 11.1.10 Prefix commands.
Weights are not allowed with the bootstrap prefix.
fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight.
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.
Menu

    Statistics > Linear models and related > Frontier models
Description
frontier fits stochastic production or cost frontier models; the default is a production frontier
model. It provides estimators for the parameters of a linear model with a disturbance that is assumed
to be a mixture of two components, which have a strictly nonnegative and symmetric distribution,
respectively. frontier can fit models in which the nonnegative distribution component (a measurement
of inefficiency) is assumed to be from a half-normal, exponential, or truncated-normal distribution.
See Kumbhakar and Lovell (2000) for a detailed introduction to frontier analysis.
Options
Model
noconstant; see [R] estimation options.
distribution(distname) specifies the distribution for the inefficiency term as half-normal (hnormal),
exponential, or truncated-normal (tnormal). The default is hnormal.
ufrom(matrix) specifies a 1 × K matrix of untransformed starting values when the distribution is
truncated-normal (tnormal). frontier can estimate the parameters of the model by maximizing
either the log likelihood or a transformed log likelihood (see Methods and formulas). frontier
automatically transforms the starting values before passing them on to the transformed log likelihood.
The matrix must have the same number of columns as there are parameters to estimate.
cm(varlist[, noconstant]) may be used only with distribution(tnormal). Here frontier will fit a conditional mean model in which the mean of the truncated-normal distribution is modeled as a linear function of the set of covariates specified in varlist. Specifying noconstant suppresses the constant in the mean function.
Model 2
constraints(constraints), collinear; see [R] estimation options.
By default, when fitting the truncated-normal model or the conditional mean model, frontier
maximizes a transformed log likelihood. When constraints are applied, frontier will maximize
the untransformed log likelihood with constraints defined in the untransformed metric.
uhet(varlist[, noconstant]) specifies that the technical inefficiency component is heteroskedastic, with the variance function depending on a linear combination of varlist_u. Specifying noconstant suppresses the constant term from the variance function. This option may not be specified with distribution(tnormal).

vhet(varlist[, noconstant]) specifies that the idiosyncratic error component is heteroskedastic, with the variance function depending on a linear combination of varlist_v. Specifying noconstant suppresses the constant term from the variance function. This option may not be specified with distribution(tnormal).
cost specifies that frontier fit a cost frontier model.
SE
vce(vcetype) specifies the type of standard error reported, which includes types that are derived from
asymptotic theory and that use bootstrap or jackknife methods; see [R] vce option.
Reporting
level(#); see [R] estimation options.
nocnsreport; see [R] estimation options.
display_options: noomitted, vsquish, noemptycells, baselevels, allbaselevels; see [R] estimation options.
Maximization
maximize_options: difficult, technique(algorithm_spec), iterate(#), [no]log, trace, gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, from(init_specs); see [R] maximize. These options are seldom used.
Setting the optimization type to technique(bhhh) resets the default vcetype to vce(opg).
The following option is available with frontier but is not shown in the dialog box:
coeflegend; see [R] estimation options.
Remarks
Stochastic production frontier models were introduced by Aigner, Lovell, and Schmidt (1977) and
Meeusen and van den Broeck (1977). Since then, stochastic frontier models have become a popular
subfield in econometrics. Kumbhakar and Lovell (2000) provide a good introduction.
frontier fits three stochastic frontier models with distinct parameterizations of the inefficiency
term and can fit stochastic production or cost frontier models.
Let’s review the nature of the stochastic frontier problem. Suppose that a producer has a production function f(z_i, β). In a world without error or inefficiency, the ith firm would produce

    q_i = f(z_i, \beta)

Stochastic frontier analysis assumes that each firm potentially produces less than it might due to a degree of inefficiency. Specifically,

    q_i = f(z_i, \beta)\,\xi_i

where ξ_i is the level of efficiency for firm i; ξ_i must be in the interval (0, 1]. If ξ_i = 1, the firm is achieving the optimal output with the technology embodied in the production function f(z_i, β). When ξ_i < 1, the firm is not making the most of the inputs z_i given the technology embodied in the production function f(z_i, β). Because the output is assumed to be strictly positive (i.e., q_i > 0), the degree of technical efficiency is assumed to be strictly positive (i.e., ξ_i > 0).

Output is also assumed to be subject to random shocks, implying that

    q_i = f(z_i, \beta)\,\xi_i \exp(v_i)
Taking the natural log of both sides yields

    \ln(q_i) = \ln\{f(z_i, \beta)\} + \ln(\xi_i) + v_i

Assuming that there are k inputs and that the production function is linear in logs, defining u_i = −ln(ξ_i) yields

    \ln(q_i) = \beta_0 + \sum_{j=1}^{k} \beta_j \ln(z_{ji}) + v_i - u_i    (1)

Because u_i is subtracted from ln(q_i), restricting u_i ≥ 0 implies that 0 < ξ_i ≤ 1, as specified above.

Kumbhakar and Lovell (2000) provide a detailed version of the above derivation, and they show that performing an analogous derivation in the dual cost function problem allows us to specify the problem as

    \ln(c_i) = \beta_0 + \beta_q \ln(q_i) + \sum_{j=1}^{k} \beta_j \ln(p_{ji}) + v_i + u_i    (2)

where q_i is output, the z_{ji} are input quantities, c_i is cost, and the p_{ji} are input prices.
Intuitively, the inefficiency effect is required to lower output or raise expenditure, depending on the
specification.
Technical note
The model that frontier actually fits is of the form

    y_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ji} + v_i - s\,u_i

where

    s = \begin{cases} \phantom{-}1, & \text{for production functions} \\ -1, & \text{for cost functions} \end{cases}

so, in the context of the discussion above, y_i = ln(q_i) and x_{ji} = ln(z_{ji}) for a production function; and for a cost function, y_i = ln(c_i), and the x_{ji} are the ln(p_{ji}) and ln(q_i). You must take the natural logarithm of the data before fitting a stochastic frontier production or cost model. frontier performs no transformations on the data.
Different specifications of the u_i and the v_i terms give rise to distinct models. frontier provides estimators for the parameters of three basic models in which the idiosyncratic component, v_i, is assumed to be independently N(0, σ_v²) distributed over the observations. The basic models differ in their specification of the inefficiency term, u_i, as follows:

    exponential:  the u_i are independently exponentially distributed with variance σ_u²
    hnormal:      the u_i are independently half-normally N⁺(0, σ_u²) distributed
    tnormal:      the u_i are independently N⁺(μ, σ_u²) distributed with truncation point at 0

For half-normal or exponential distributions, frontier can fit models with heteroskedastic error components, conditional on a set of covariates. For a truncated-normal distribution, frontier can also fit a conditional mean model in which the mean is modeled as a linear function of a set of covariates.
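For example, a minimal sketch (output, capital, labor, and size are hypothetical variable names; remember that frontier expects logged data):

    . generate lnq = ln(output)
    . generate lnk = ln(capital)
    . generate lnl = ln(labor)
    . frontier lnq lnk lnl, distribution(exponential) uhet(size)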
Example 1
For our first example, we demonstrate the half-normal and exponential models by reproducing a
study found in Greene (2008), which uses data originally published in Zellner and Revankar (1969).
In this study of the transportation-equipment manufacturing industry, observations on value added,
capital, and labor are used to estimate a Cobb–Douglas production function. The variable lnv is the
log-transformed value added, lnk is the log-transformed capital, and lnl is the log-transformed labor.
OLS estimates are compared with those from stochastic frontier models using both the half-normal
and exponential distribution for the inefficiency term.
. use http://www.stata-press.com/data/r11/greene9
. regress lnv lnk lnl
      Source         SS       df       MS              Number of obs =      25
                                                        F(  2,    22) =  397.54
       Model    44.1727741     2   22.086387            Prob > F      =  0.0000
    Residual    1.22225984    22  .055557265            R-squared     =  0.9731
                                                        Adj R-squared =  0.9706
       Total    45.3950339    24  1.89145975            Root MSE      =  .23571

         lnv        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

         lnk     .2454281   .1068574     2.30   0.032     .0238193    .4670368
         lnl      .805183   .1263336     6.37   0.000     .5431831    1.067183
       _cons     1.844416   .2335928     7.90   0.000     1.359974    2.328858

. frontier lnv lnk lnl
Iteration 0:   log likelihood = 2.3357572
Iteration 1:   log likelihood = 2.4673009
Iteration 2:   log likelihood = 2.4695125
Iteration 3:   log likelihood = 2.4695222
Iteration 4:   log likelihood = 2.4695222
Stoc. frontier normal/half-normal model         Number of obs   =        25
                                                Wald chi2(2)    =    743.71
Log likelihood =  2.4695222                     Prob > chi2     =    0.0000

         lnv        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

         lnk     .2585478    .098764     2.62   0.009     .0649738    .4521218
         lnl     .7802451   .1199399     6.51   0.000     .5451672    1.015323
       _cons     2.081135    .281641     7.39   0.000     1.529128    2.633141

    /lnsig2v     -3.48401   .6195353    -5.62   0.000    -4.698277   -2.269743
    /lnsig2u    -3.014599    1.11694    -2.70   0.007    -5.203761   -.8254368

     sigma_v     .1751688   .0542616                      .0954514    .3214633
     sigma_u     .2215073   .1237052                       .074134    .6618486
      sigma2     .0797496   .0426989                     -.0039388     .163438
      lambda     1.264536   .1678684                      .9355204    1.593552

Likelihood-ratio test of sigma_u=0: chibar2(01) = 0.43   Prob>=chibar2 = 0.256
. predict double u_h, u
. frontier lnv lnk lnl, distribution(exponential)
Iteration 0:   log likelihood = 2.7270659
Iteration 1:   log likelihood = 2.8551532
Iteration 2:   log likelihood = 2.8604815
Iteration 3:   log likelihood = 2.8604897
Iteration 4:   log likelihood = 2.8604897
Stoc. frontier normal/exponential model         Number of obs   =        25
                                                Wald chi2(2)    =    845.68
Log likelihood =  2.8604897                     Prob > chi2     =    0.0000

         lnv        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

         lnk     .2624859   .0919988     2.85   0.004     .0821717    .4428002
         lnl     .7703795   .1109569     6.94   0.000     .5529079    .9878511
       _cons     2.069242   .2356159     8.78   0.000     1.607444    2.531041

    /lnsig2v    -3.527598   .4486176    -7.86   0.000    -4.406873   -2.648324
    /lnsig2u    -4.002457   .9274575    -4.32   0.000    -5.820241   -2.184674

     sigma_v     .1713925   .0384448                      .1104231    .2660258
     sigma_u     .1351691   .0626818                      .0544692    .3354317
      sigma2     .0476461   .0157921                       .016694    .0785981
      lambda     .7886525    .087684                       .616795    .9605101

Likelihood-ratio test of sigma_u=0: chibar2(01) = 1.21   Prob>=chibar2 = 0.135
. predict double u_e, u
. list state u_h u_e

              state          u_h          u_e

  1.        Alabama     .2011338    .14592865
  2.     California    .14480966     .0972165
  3.    Connecticut     .1903485    .13478797
  4.        Florida    .51753139     .5903303
  5.        Georgia    .10397912    .07140994

  6.       Illinois    .12126696     .0830415
  7.        Indiana    .21128212    .15450664
  8.           Iowa    .24933153    .20073081
  9.         Kansas    .10099517    .06857629
 10.       Kentucky    .05626919    .04152443

 11.      Louisiana    .20332731    .15066405
 12.          Maine    .22263164    .17245793
 13.       Maryland    .13534062    .09245501
 14.  Massachusetts    .15636999    .10932923
 15.       Michigan    .15809566    .10756915

 16.       Missouri    .10288047     .0704146
 17.      NewJersey    .09584337    .06587986
 18.        NewYork    .27787793    .22249416
 19.           Ohio    .22914231    .16981857
 20.   Pennsylvania     .1500667    .10302905

 21.          Texas    .20297875    .14552218
 22.       Virginia    .14000132    .09676078
 23.     Washington    .11047581    .07533251
 24.   WestVirginia    .15561392    .11236153
 25.      Wisconsin    .14067066     .0970861
The parameter estimates and the estimates of the inefficiency terms closely match those published in
Greene (2008), but the standard errors of the parameter estimates are estimated differently (see the
technical note below).
The output from frontier includes estimates of the standard deviations of the two error components, σ_v and σ_u, which are labeled sigma_v and sigma_u, respectively. In the log likelihood, they are parameterized as ln σ_v² and ln σ_u², and these estimates are labeled /lnsig2v and /lnsig2u in the output. frontier also reports two other useful parameterizations. The estimate of the total error variance, σ_S² = σ_v² + σ_u², is labeled sigma2, and the estimate of the ratio of the standard deviation of the inefficiency component to the standard deviation of the idiosyncratic component, λ = σ_u/σ_v, is labeled lambda.
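These parameterizations are related by simple transformations; for example, the estimate of σ_u could be recovered by hand after the half-normal fit above (a sketch using the equation-qualified coefficient syntax):

    . display exp([lnsig2u]_b[_cons]/2)    // sigma_u = exp(lnsig2u/2)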
At the bottom of the output, frontier reports the results of a test that there is no technical inefficiency component in the model. This is a test of the null hypothesis H₀: σ_u² = 0 against the alternative hypothesis H₁: σ_u² > 0. If the null hypothesis is true, the stochastic frontier model reduces to an OLS model with normal errors. However, because the test lies on the boundary of the parameter space of σ_u², the standard likelihood-ratio test is not valid, and a one-sided generalized likelihood-ratio test must be constructed; see Gutierrez, Carter, and Drukker (2001). For this example, the output shows LR = 0.43 with a p-value of 0.256 for the half-normal model and LR = 1.21 with a p-value of 0.135 for the exponential model. There are several possible reasons for the failure to reject the null hypothesis, but the fact that the test is based on an asymptotic distribution and the sample size was 25 is certainly a leading candidate among those possibilities.
Technical note
frontier maximizes the log-likelihood function of a stochastic frontier model by using the
Newton–Raphson method, and the estimated variance–covariance matrix is calculated as the inverse
of the negative Hessian (matrix of second partial derivatives); see [R] ml. When comparing the results
with those published using other software, be aware of the difference in the optimization methods,
which may result in different, yet asymptotically equivalent, variance estimates.
Example 2
Often the error terms may not have constant variance. frontier allows you to model heteroskedasticity in either error term as a linear function of a set of covariates. The variance of either the technical
inefficiency or the idiosyncratic component may be modeled as
σi2 = exp(wi δ)
The default constant included in wi may be suppressed by appending a noconstant option to the
list of covariates. Also, you can simultaneously specify covariates for both σui and σvi .
In the example below, we use a sample of 756 observations of fictional firms producing a manufactured good by using capital and labor. The firms are hypothesized to use a constant returns-to-scale technology, but the sizes of the firms differ. Believing that this size variation will introduce heteroskedasticity into the idiosyncratic error term, we estimate the parameters of a Cobb–Douglas production function. To do this, we use a conditional heteroskedastic half-normal model, with the size of the firm as an explanatory variable in the variance function for the idiosyncratic error. We also perform a test of the hypothesis that the firms use a constant returns-to-scale technology.
. use http://www.stata-press.com/data/r11/frontier1, clear
. frontier lnoutput lnlabor lncapital, vhet(size)
Iteration 0:   log likelihood = -1508.3692
Iteration 1:   log likelihood = -1501.583
Iteration 2:   log likelihood = -1500.3942
Iteration 3:   log likelihood = -1500.3794
Iteration 4:   log likelihood = -1500.3794
Stoc. frontier normal/half-normal model         Number of obs   =       756
                                                Wald chi2(2)    =      9.68
Log likelihood = -1500.3794                     Prob > chi2     =    0.0079

    lnoutput        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

lnoutput
     lnlabor     .7090933   .2349374     3.02   0.003     .2486244    1.169562
   lncapital     .3931345   .5422173     0.73   0.468    -.6695919    1.455861
       _cons     1.252199    3.14656     0.40   0.691    -4.914946    7.419344

lnsig2v
        size    -.0016951   .0004748    -3.57   0.000    -.0026256   -.0007645
       _cons     3.156091   .9265826     3.41   0.001     1.340023     4.97216

lnsig2u
       _cons     1.947487   .1017653    19.14   0.000     1.748031    2.146943

     sigma_u     2.647838    .134729                      2.396514    2.925518

. test _b[lnlabor] + _b[lncapital] = 1
 ( 1)  [lnoutput]lnlabor + [lnoutput]lncapital = 1
           chi2(  1) =    0.03
         Prob > chi2 =    0.8622
The output above indicates that the variance of the idiosyncratic error term is a function of firm size.
Also, we failed to reject the hypothesis that the firms use a constant returns-to-scale technology.
Technical note
In small samples, the conditional heteroskedastic estimators will lack precision for the variance
parameters and may fail to converge altogether.
Example 3
Let’s turn our attention to the truncated-normal model. Once again, we will use fictional data. For
this example, we have 1,231 observations on the quantity of output, the total cost of production for
each firm, the prices that each firm paid for labor and capital services, and a categorical variable
measuring the quality of each firm’s management. After taking the natural logarithm of the costs
(lncost), prices (lnp_k and lnp_l), and output (lnout), we fit a stochastic cost frontier model
and specify the distribution for the inefficiency term to be truncated normal.
. use http://www.stata-press.com/data/r11/frontier2
. frontier lncost lnp_k lnp_l lnout, distribution(tnormal) cost
Iteration 0:   log likelihood = -2386.9523
Iteration 1:   log likelihood = -2386.5146
Iteration 2:   log likelihood = -2386.2704
Iteration 3:   log likelihood = -2386.2504
Iteration 4:   log likelihood = -2386.2493
Iteration 5:   log likelihood = -2386.2493
Stoc. frontier normal/truncated-normal model    Number of obs      =      1231
                                                Wald chi2(3)       =      8.82
Log likelihood = -2386.2493                     Prob > chi2        =    0.0318

------------------------------------------------------------------------------
      lncost |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       lnp_k |   .3410717   .2363861     1.44   0.149    -.1222366      .80438
       lnp_l |   .6608628   .4951499     1.33   0.182    -.3096131    1.631339
       lnout |   .7528653   .3468968     2.17   0.030     .0729601    1.432771
       _cons |   2.602609   1.083004     2.40   0.016     .4799595    4.725259
-------------+----------------------------------------------------------------
         /mu |   1.095705    .881517     1.24   0.214     -.632037    2.823446
   /lnsigma2 |     1.5534   .1873464     8.29   0.000     1.186208    1.920592
  /ilgtgamma |   1.257862   .2589522     4.86   0.000     .7503255    1.765399
-------------+----------------------------------------------------------------
      sigma2 |   4.727518   .8856833                      3.274641    6.825001
       gamma |   .7786579   .0446303                      .6792496    .8538846
    sigma_u2 |   3.681119   .7503408                      2.210478     5.15176
    sigma_v2 |   1.046399   .2660035                      .5250413    1.567756
------------------------------------------------------------------------------
H0: No inefficiency component:   z = 5.595    Prob>=z = 0.000
In addition to the coefficients, the output reports estimates for several parameters. sigma_v2 is the
estimate of σ_v². sigma_u2 is the estimate of σ_u². gamma is the estimate of γ = σ_u²/σ_S². sigma2
is the estimate of σ_S² = σ_v² + σ_u². Because γ must be between 0 and 1, the optimization is
parameterized in terms of the inverse logit of γ, and this estimate is reported as ilgtgamma. Because
σ_S² must be positive, the optimization is parameterized in terms of ln(σ_S²), whose estimate is
reported as lnsigma2. Finally, mu is the estimate of µ, the mean of the truncated-normal distribution.
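As a quick check, applying invlogit() and exp() to the estimates of ilgtgamma and lnsigma2
reproduces the reported values of gamma and sigma2 (approximately 0.7787 and 4.7275):

. display invlogit(1.257862)
. display exp(1.5534)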
In the output above, the generalized log-likelihood test for the presence of the inefficiency term
has been replaced with a test based on the third moment of the OLS residuals. When µ = 0 and
σ_u = 0, the truncated-normal model reduces to a linear regression model with normally distributed
errors. However, the distribution of the test statistic under the null hypothesis is not well established,
because it becomes impossible to evaluate the log likelihood as σ_u approaches zero, prohibiting the
use of the likelihood-ratio test.
However, Coelli (1995) noted that the presence of an inefficiency term would negatively skew the
residuals from an OLS regression. By identifying negative skewness in the residuals with the presence
of an inefficiency term, Coelli derived a one-sided test for the presence of the inefficiency term. The
results of this test are given at the bottom of the output. For this example, the null hypothesis of no
inefficiency component is rejected.
In the example below, we fit a truncated-normal model and detect a statistically significant
inefficiency term in the model. We might question whether the inefficiency term is identically
distributed over all firms or whether there might be heterogeneity across firms. frontier provides
an extension to the truncated-normal model by allowing the mean of the inefficiency term to be
modeled as a linear function of a set of covariates. In our dataset, we have a categorical variable
that measures the quality of a firm's management. We refit the model, including the cm() option,
specifying a set of binary indicator variables representing the different categories of the
quality-measurement variable as covariates.
. frontier lncost lnp_k lnp_l lnout, distribution(tnormal) cm(i.quality) cost
Iteration 0:   log likelihood = -2386.9523
Iteration 1:   log likelihood =  -2384.936
Iteration 2:   log likelihood = -2382.3942
Iteration 3:   log likelihood =  -2382.324
Iteration 4:   log likelihood = -2382.3233
Iteration 5:   log likelihood = -2382.3233
Stoc. frontier normal/truncated-normal model    Number of obs      =      1231
                                                Wald chi2(3)       =      9.31
Log likelihood = -2382.3233                     Prob > chi2        =    0.0254

------------------------------------------------------------------------------
      lncost |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lncost       |
       lnp_k |   .3611204   .2359749     1.53   0.126    -.1013819    .8236227
       lnp_l |    .680446   .4934935     1.38   0.168    -.2867835    1.647675
       lnout |   .7605533   .3466102     2.19   0.028     .0812098    1.439897
       _cons |   2.550769   1.078911     2.36   0.018     .4361417    4.665396
-------------+----------------------------------------------------------------
mu           |
     quality |
           2 |   .5056067   .3382907     1.49   0.135    -.1574309    1.168644
           3 |    .783223    .376807     2.08   0.038     .0446947    1.521751
           4 |   .5577511   .3355061     1.66   0.096    -.0998288    1.215331
           5 |   .6792882   .3428073     1.98   0.048     .0073981    1.351178
             |
       _cons |   .6014025    .990167     0.61   0.544    -1.339289    2.542094
-------------+----------------------------------------------------------------
   /lnsigma2 |   1.541784   .1790926     8.61   0.000     1.190769    1.892799
  /ilgtgamma |   1.242302   .2588968     4.80   0.000      .734874    1.749731
-------------+----------------------------------------------------------------
      sigma2 |    4.67292   .8368852                      3.289611    6.637923
       gamma |   .7759645   .0450075                      .6758739    .8519189
    sigma_u2 |    3.62602   .7139576                      2.226689    5.025351
    sigma_v2 |     1.0469   .2583469                      .5405491    1.553251
------------------------------------------------------------------------------
The conditional mean model was developed in the context of panel-data estimators, and we can
apply frontier’s conditional mean model to panel data.
Saved results
frontier saves the following in e():
Scalars
    e(N)                number of observations
    e(df_m)             model degrees of freedom
    e(k)                number of parameters
    e(k_eq)             number of equations
    e(k_eq_model)       number of equations in model Wald test
    e(k_dv)             number of dependent variables
    e(k_autoCns)        number of base, empty, and omitted constraints
    e(chi2)             χ²
    e(ll)               log likelihood
    e(ll_c)             log likelihood for H0: σ_u = 0
    e(z)                test for negative skewness of OLS residuals
    e(sigma_u)          standard deviation of technical inefficiency
    e(sigma_v)          standard deviation of v_i
    e(p)                significance
    e(chi2_c)           LR test statistic
    e(p_z)              p-value for z
    e(rank)             rank of e(V)
    e(ic)               number of iterations
    e(rc)               return code
    e(converged)        1 if converged, 0 otherwise

Macros
    e(cmd)              frontier
    e(cmdline)          command as typed
    e(depvar)           name of dependent variable
    e(function)         production or cost
    e(wtype)            weight type
    e(wexp)             weight expression
    e(title)            title in estimation output
    e(chi2type)         Wald; type of model χ² test
    e(dist)             distribution assumption for u_i
    e(het)              heteroskedastic components
    e(u_hetvar)         varlist in uhet()
    e(v_hetvar)         varlist in vhet()
    e(vce)              vcetype specified in vce()
    e(vcetype)          title used to label Std. Err.
    e(opt)              type of optimization
    e(which)            max or min; whether optimizer is to perform maximization or minimization
    e(ml_method)        type of ml method
    e(user)             name of likelihood-evaluator program
    e(technique)        maximization technique
    e(singularHmethod)  m-marquardt or hybrid; method used when Hessian is singular
    e(crittype)         optimization criterion
    e(properties)       b V
    e(predict)          program used to implement predict
    e(asbalanced)       factor variables fvset as asbalanced
    e(asobserved)       factor variables fvset as asobserved

Matrices
    e(b)                coefficient vector
    e(Cns)              constraints matrix
    e(ilog)             iteration log (up to 20 iterations)
    e(gradient)         gradient vector
    e(V)                variance–covariance matrix of the estimators
    e(V_modelbased)     model-based variance

Functions
    e(sample)           marks estimation sample
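These saved results can be inspected or reused after estimation; for example, typing ereturn list
displays everything saved in e(), and individual results can be used directly in expressions:

. ereturn list
. display e(sigma_u)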
Methods and formulas
frontier is implemented as an ado-file.
Consider an equation of the form

$$y_i = x_i\beta + v_i - su_i$$

where y_i is the dependent variable, x_i is a 1 × k vector of observations on the independent variables
(included covariates), β is a k × 1 vector of coefficients, and

$$s = \begin{cases} \phantom{-}1, & \text{for production functions} \\ -1, & \text{for cost functions} \end{cases}$$
The log-likelihood functions are as follows.

Normal/half-normal model:

$$\ln L = \sum_{i=1}^{N}\left\{\frac{1}{2}\ln\left(\frac{2}{\pi}\right) - \ln\sigma_S + \ln\Phi\left(-\frac{s\epsilon_i\lambda}{\sigma_S}\right) - \frac{\epsilon_i^2}{2\sigma_S^2}\right\}$$

Normal/exponential model:

$$\ln L = \sum_{i=1}^{N}\left\{-\ln\sigma_u + \frac{\sigma_v^2}{2\sigma_u^2} + \ln\Phi\left(\frac{-s\epsilon_i - \frac{\sigma_v^2}{\sigma_u}}{\sigma_v}\right) + \frac{s\epsilon_i}{\sigma_u}\right\}$$

Normal/truncated-normal model:

$$\ln L = \sum_{i=1}^{N}\left[-\frac{1}{2}\ln(2\pi) - \ln\sigma_S - \ln\Phi\left(\frac{\mu}{\sigma_S\sqrt{\gamma}}\right) + \ln\Phi\left\{\frac{(1-\gamma)\mu - s\gamma\epsilon_i}{\left\{\sigma_S^2\gamma(1-\gamma)\right\}^{1/2}}\right\} - \frac{1}{2}\left(\frac{\epsilon_i + s\mu}{\sigma_S}\right)^2\right]$$
where σ_S = (σ_u² + σ_v²)^{1/2}, λ = σ_u/σ_v, γ = σ_u²/σ_S², ε_i = y_i − x_iβ, and Φ() is the
cumulative distribution function of the standard normal distribution.

To obtain estimates of u_i, you can use either the mean or the mode of the conditional distribution
f(u_i | ε_i).

$$E(u_i \mid \epsilon_i) = \mu_{*i} + \sigma_*\left\{\frac{\phi(-\mu_{*i}/\sigma_*)}{\Phi(\mu_{*i}/\sigma_*)}\right\}$$

$$M(u_i \mid \epsilon_i) = \begin{cases} \mu_{*i}, & \text{if } \mu_{*i} \geq 0 \\ 0, & \text{otherwise} \end{cases}$$
Then the technical efficiency (s = 1) or cost efficiency (s = −1) will be estimated by

$$E_i = E\{\exp(-su_i) \mid \epsilon_i\} = \left[\frac{1 - \Phi(s\sigma_* - \mu_{*i}/\sigma_*)}{1 - \Phi(-\mu_{*i}/\sigma_*)}\right]\exp\left(-s\mu_{*i} + \frac{1}{2}\sigma_*^2\right)$$
where µ∗i and σ∗ are defined for the normal/half-normal model as

$$\mu_{*i} = -\frac{s\epsilon_i\sigma_u^2}{\sigma_S^2}, \qquad \sigma_* = \frac{\sigma_u\sigma_v}{\sigma_S}$$

for the normal/exponential model as

$$\mu_{*i} = -s\epsilon_i - \frac{\sigma_v^2}{\sigma_u}, \qquad \sigma_* = \sigma_v$$

and for the normal/truncated-normal model as

$$\mu_{*i} = \frac{-s\epsilon_i\sigma_u^2 + \mu\sigma_v^2}{\sigma_S^2}, \qquad \sigma_* = \frac{\sigma_u\sigma_v}{\sigma_S}$$
In the half-normal and exponential models, when heteroskedasticity is assumed, the standard
deviations, σ_u or σ_v, will be replaced in the above equations by

$$\sigma_i^2 = \exp(w_i\delta)$$

where w is the vector of explanatory variables in the variance function.

In the conditional mean model, the mean parameter of the truncated-normal distribution, µ, is
modeled as a linear combination of the set of covariates, w:

$$\mu = w_i\delta$$
Therefore, the log-likelihood function can be rewritten as

$$\ln L = \sum_{i=1}^{N}\left[-\frac{1}{2}\ln(2\pi) - \ln\sigma_S - \ln\Phi\left(\frac{w_i\delta}{\sqrt{\sigma_S^2\gamma}}\right) + \ln\Phi\left\{\frac{(1-\gamma)w_i\delta - s\gamma\epsilon_i}{\sqrt{\sigma_S^2\gamma(1-\gamma)}}\right\} - \frac{1}{2}\left(\frac{\epsilon_i + sw_i\delta}{\sigma_S}\right)^2\right]$$
The z test reported in the output of the truncated-normal model is a third-moment test developed by
Coelli (1995) as an extension of a test previously developed by Pagan and Hall (1983). Coelli shows
that under the null of normally distributed errors, the statistic

$$z = \frac{m_3}{\left(6m_2^3/N\right)^{1/2}}$$

has a standard normal distribution, where m_2 and m_3 are the second and third moments of the
residuals from the OLS regression. Because the residuals are either negatively skewed (production
function) or positively skewed (cost function), a one-sided p-value is used.
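As an illustration, the statistic can be computed by hand from an OLS fit. The sketch below assumes
the frontier2 data are in memory and that no observations are dropped from the regression; the
variable and local names are illustrative.

. regress lncost lnp_k lnp_l lnout
. predict double e, residuals
. generate double e2 = e^2
. generate double e3 = e^3
. summarize e2, meanonly
. local m2 = r(mean)
. summarize e3, meanonly
. local m3 = r(mean)
. display "z = " `m3'/sqrt(6*`m2'^3/_N)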
References
Aigner, D., C. A. K. Lovell, and P. Schmidt. 1977. Formulation and estimation of stochastic frontier production function
models. Journal of Econometrics 6: 21–37.
Caudill, S. B., J. M. Ford, and D. M. Gropper. 1995. Frontier estimation and firm-specific inefficiency measures in
the presence of heteroscedasticity. Journal of Business and Economic Statistics 13: 105–111.
Coelli, T. J. 1995. Estimators and hypothesis tests for a stochastic frontier function: A Monte Carlo analysis. Journal
of Productivity Analysis 6: 247–268.
Gould, W. W., J. S. Pitblado, and W. M. Sribney. 2006. Maximum Likelihood Estimation with Stata. 3rd ed. College
Station, TX: Stata Press.
Greene, W. H. 2008. Econometric Analysis. 6th ed. Upper Saddle River, NJ: Prentice–Hall.
Gutierrez, R. G., S. Carter, and D. M. Drukker. 2001. sg160: On boundary-value likelihood-ratio tests. Stata Technical
Bulletin 60: 15–18. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 269–273. College Station, TX:
Stata Press.
Kumbhakar, S. C., and C. A. K. Lovell. 2000. Stochastic Frontier Analysis. Cambridge: Cambridge University Press.
Meeusen, W., and J. van den Broeck. 1977. Efficiency estimation from Cobb–Douglas production functions with
composed error. International Economic Review 18: 435–444.
Pagan, A. R., and A. D. Hall. 1983. Diagnostic tests as residual analysis. Econometric Reviews 2: 159–218.
Petrin, A., B. P. Poi, and J. Levinsohn. 2004. Production function estimation in Stata using inputs to control for
unobservables. Stata Journal 4: 113–123.
Stevenson, R. E. 1980. Likelihood functions for generalized stochastic frontier estimation. Journal of Econometrics 13:
57–66.
Zellner, A., and N. S. Revankar. 1969. Generalized production functions. Review of Economic Studies 36: 241–250.
Also see
[R] frontier postestimation — Postestimation tools for frontier
[R] regress — Linear regression
[XT] xtfrontier — Stochastic frontier models for panel data
[U] 20 Estimation and postestimation commands
Title
frontier postestimation — Postestimation tools for frontier
Description
The following postestimation commands are available for frontier:
command       description

estat         AIC, BIC, VCE, and estimation sample summary
estimates     cataloging estimation results
lincom        point estimates, standard errors, testing, and inference for
              linear combinations of coefficients
linktest      link test for model specification
lrtest        likelihood-ratio test
margins       marginal means, predictive margins, marginal effects, and
              average marginal effects
nlcom         point estimates, standard errors, testing, and inference for
              nonlinear combinations of coefficients
predict       predictions, residuals, influence statistics, and other
              diagnostic measures
predictnl     point estimates, standard errors, testing, and inference for
              generalized predictions
test          Wald tests of simple and composite linear hypotheses
testnl        Wald tests of nonlinear hypotheses
See the corresponding entries in the Base Reference Manual for details.
Syntax for predict
predict [type] newvar [if] [in] [, statistic]

predict [type] {stub* | newvarxb newvarv newvaru} [if] [in], scores

statistic     description

Main
  xb          linear prediction; the default
  stdp        standard error of the prediction
  u           estimates of minus the natural log of the technical efficiency
              via E(u_i | ε_i)
  m           estimates of minus the natural log of the technical efficiency
              via M(u_i | ε_i)
  te          estimates of the technical efficiency via E{exp(−su_i) | ε_i}

where s = 1 for production functions and s = −1 for cost functions.

These statistics are available both in and out of sample; type
predict ... if e(sample) ... if wanted only for the estimation sample.
Menu
Statistics > Postestimation > Predictions, residuals, etc.
Options for predict
Main
xb, the default, calculates the linear prediction.
stdp calculates the standard error of the linear prediction.
u produces estimates of minus the natural log of the technical efficiency via E(u_i | ε_i).
m produces estimates of minus the natural log of the technical efficiency via M(u_i | ε_i).
te produces estimates of the technical efficiency via E{exp(−su_i) | ε_i}.
scores calculates equation-level score variables.
    The first new variable will contain ∂lnL/∂(x_iβ).
    The second new variable will contain ∂lnL/∂(lnsig2v).
    The third new variable will contain ∂lnL/∂(lnsig2u).
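For example, after the truncated-normal cost-frontier fit shown earlier, one might obtain inefficiency
and efficiency estimates as follows (the new variable names u_hat and te_hat are illustrative):

. predict double u_hat, u
. predict double te_hat, te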
Methods and formulas
All postestimation commands listed above are implemented as ado-files.
Also see
[R] frontier — Stochastic frontier models
[U] 20 Estimation and postestimation commands
Title
fvrevar — Factor-variables operator programming command
Syntax
fvrevar varlist [if] [in] [, substitute tsonly list]
You must tsset your data before using fvrevar if varlist contains time-series operators; see [TS] tsset.
Description
fvrevar creates an equivalent, temporary variable list for a varlist that might contain factor
variables, interactions, or time-series–operated variables so that the resulting variable list can be used
by commands that do not otherwise support factor variables or time-series–operated variables. The
resulting list also could be used in a program to speed execution at the cost of using more memory.
Options
substitute specifies that equivalent, temporary variables be substituted for any factor variables,
interactions, or time-series–operated variables in varlist. substitute is the default action taken
by fvrevar; you do not need to specify the option.
tsonly specifies that equivalent, temporary variables be substituted for only the time-series–operated
variables in varlist.
list specifies that all factor-variable operators and time-series operators be removed from varlist
and the resulting list of base variables be returned in r(varlist). No new variables are created
with this option.
Remarks
fvrevar might create no new variables, one new variable, or many new variables, depending on
the number of factor variables, interactions, and time-series operators appearing in varlist. Any new
variables created are temporary. The new, equivalent varlist is returned in r(varlist). The new
varlist corresponds one to one with the original varlist.
Example 1
Typing
. use http://www.stata-press.com/data/r11/auto
. fvrevar i.rep78 mpg turn
creates five temporary variables corresponding to the levels of rep78. No new variables are created
for variables mpg and turn because they do not contain factor-variable or time-series operators.
The resulting variable list is
. display "‘r(varlist)’"
__000000 __000001 __000002 __000003 __000004 mpg turn
(Your temporary variable names may be different, but that is of no consequence.)
Temporary variables automatically vanish when the program concludes.
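Because the temporary variables vanish when the calling program ends, fvrevar is typically used
inside a program. A minimal sketch (the program name mysum is hypothetical):

program mysum
        // accept factor-variable and time-series syntax
        syntax varlist(fv ts)
        // substitute equivalent temporary variables
        fvrevar `varlist'
        // pass the plain variable list to a command that does not
        // support factor variables
        summarize `r(varlist)'
end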
Example 2
Suppose we want to create temporary variables for specific levels of a factor variable. To do this,
we can use the parenthesis notation of factor-variable syntax.
. fvrevar i(2,3)bn.rep78 mpg
creates two temporary variables corresponding to levels 2 and 3 of rep78. Notice that we specified
that neither level 2 nor 3 be set as the base level by using the bn notation. If we did not specify bn,
level 2 would have been treated as the base level.
The resulting variable list is
. display "‘r(varlist)’"
__00000E __00000F mpg
We can see the results by listing the new variables alongside the original value of rep78.

. list rep78 ‘r(varlist)’ in 1/5

     +-------------------------------------+
     | rep78   __00000E   __00000F   mpg   |
     |-------------------------------------|
  1. |     3          1          0    22   |
  2. |     3          1          0    17   |
  3. |     .          .          .    22   |
  4. |     3          1          0    20   |
  5. |     4          0          1    15   |
     +-------------------------------------+
If we had needed only the base-variable names, we could have specified
. fvrevar i(2,3)bn.rep78 mpg, list
. display "‘r(varlist)’"
mpg rep78
The order of the list will probably differ from that of the original list; base variables are listed only
once.
Example 3
Now let’s assume we have a varlist containing both an interaction and time-series–operated variables.
If we want to create temporary variables for the entire equivalent varlist, we can specify fvrevar
with no options.
. generate t = _n
. tsset t
. fvrevar c.turn#i(2,3).rep78 L.mpg
The resulting variable list is
. display "‘r(varlist)’"
__00000I __00000K __00000M
If we want to create temporary variables only for the time-series–operated variables, we can specify
the tsonly option.
. fvrevar c.turn#i(2,3).rep78 L.mpg, tsonly
The resulting variable list is
. display "‘r(varlist)’"
2b.rep78#c.turn 3.rep78#c.turn __00000M