SURVEY METHODOLOGY

A JOURNAL
PUBLISHED BY
STATISTICS CANADA

JUNE 1999 • VOLUME 25 • NUMBER 1
Published by authority of the Minister
responsible for Statistics Canada
© Minister of Industry, 1999
All rights reserved. No part of this publication may be reproduced,
stored in a retrieval system or transmitted in any form or by any
means, electronic, mechanical, photocopying, recording or otherwise
without prior written permission from Licence Services,
Marketing Division, Statistics Canada,
Ottawa, Ontario, Canada K1A 0T6.
September 1999
Catalogue no. 12-001-XPB
Frequency: Semi-annual
ISSN 0714-0045
Ottawa
SURVEY METHODOLOGY
A Journal Published by Statistics Canada
Survey Methodology is abstracted in The Survey Statistician, Statistical Theory and Methods Abstracts and SRM Database of
Social Research Methodology, Erasmus University and is referenced in the Current Index to Statistics, and Journal Contents in
Qualitative Methods.
MANAGEMENT BOARD
Chairman
G.J. Brackstone
Members
D. Binder
G.J.C. Hole
F. Mayda (Production Manager)
C. Patrick
R. Platek (Past Chairman)
D. Roy
M.P. Singh
EDITORIAL BOARD
Editor
M.P. Singh, Statistics Canada
Associate Editors
D.R. Bellhouse, University of Western Ontario
D. Binder, Statistics Canada
J.-C. Deville, INSEE
J.D. Drew, Statistics Canada
J. Eltinge, Texas A&M University
W.A. Fuller, Iowa State University
R.M. Groves, University of Maryland
M.A. Hidiroglou, Statistics Canada
D. Holt, Central Statistical Office, U.K.
G. Kalton, Westat, Inc.
R. Lachapelle, Statistics Canada
P. Lahiri, University of Nebraska-Lincoln
S. Linacre, Australian Bureau of Statistics
G. Nathan, Central Bureau of Statistics, Israel
D. Pfeffermann, Hebrew University
J.N.K. Rao, Carleton University
L.-P. Rivest, Universite Laval
I. Sande, Bell Communications Research, U.S.A.
F.J. Scheuren, Ernst and Young, LLP
J. Sedransk, Case Western Reserve University
R. Sitter, Simon Fraser University
C.J. Skinner, University of Southampton
R. Valliant, Westat, Inc.
V.K. Verma, University of Essex
P.J. Waite, U.S. Bureau of the Census
J. Waksberg, Westat, Inc.
K.M. Wolter, National Opinion Research Center
A. Zaslavsky, Harvard University
Assistant Editors
P. Dick, H. Mantel, B. Quenneville and D. Stukel, Statistics Canada
EDITORIAL POLICY
Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such
as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error,
survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data
integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the
development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be
refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily
those of the Editorial Board or of Statistics Canada.
Submission of Manuscripts
Survey Methodology is published twice a year. Authors are invited to submit their manuscripts in either English or French to
the Editor, Dr. M.P. Singh, Household Survey Methods Division, Statistics Canada, Tunney's Pasture, Ottawa, Ontario, Canada
K1A 0T6. Four nonreturnable copies of each manuscript prepared following the guidelines given in the Journal are requested.
Subscription Rates
The price of Survey Methodology (Catalogue no. 12-001-XPB) is $47 per year in Canada and US $47 per year outside Canada.
Subscription orders should be sent to Statistics Canada, Operations and Integration Division, Circulation Management,
120 Parkdale Avenue, Ottawa, Ontario, Canada K1A 0T6, or by dialling (613) 951-7277 or 1 800 700-1033, by fax (613) 951-1584
or 1 800 889-9734, or by Internet: order@statcan.ca. A reduced price is available to members of the American Statistical
Association, the International Association of Survey Statisticians, the American Association for Public Opinion Research, and
the Statistical Society of Canada.
SURVEY METHODOLOGY
A Journal Published by Statistics Canada
Volume 25, Number 1, June 1999

CONTENTS

In This Issue ........................................................................................ 1

H. KROGER, C.-E. SARNDAL and I. TEIKARI
    Poisson Mixture Sampling: A Family of Designs for Coordinated Selection Using Permanent Random Numbers ...... 3

W.R. BELL and M. KRAMER
    Toward Variances for X-11 Seasonal Adjustments ............................................ 13

J. DE HAAN, E. OPPERDOES and C.M. SCHUT
    Item Selection in the Consumer Price Index: Cut-off Versus Probability Sampling ........... 31

P. DUCHESNE
    Robust Calibration Estimators .............................................................. 43

Y. TILLE
    Estimation in Surveys Using Conditional Inclusion Probabilities: Complex Design ........... 57

N.G.N. PRASAD and J.N.K. RAO
    On Robust Small Area Estimation Using a Simple Random Effects Model ....................... 67

F.A.S. MOURA and D. HOLT
    Small Area Estimation Using Multilevel Models .............................................. 73

M. CHATTOPADHYAY, P. LAHIRI, M. LARSEN and J. REIMNITZ
    Composite Estimation of Drug Prevalences for Sub-State Areas .............................. 81

S. RUBIN BLEUER and M. KOVACEVIC
    Some Issues in the Estimation of Income Dynamics ........................................... 87

P.F. TATE
    Utilising Longitudinally Linked Data from the British Labour Force Survey ................. 99

S. GABLER, S. HAEDER and P. LAHIRI
    A Model Based Justification of Kish's Formula for Design Effects for Weighting and Clustering ... 105
Survey Methodology, June 1999
Vol. 25, No. 1, pp. 1-2
Statistics Canada
In This Issue
Dear Readers,
I would like to share with you good news on two fronts. First, the upcoming December issue will
mark the 25th anniversary of Survey Methodology. This issue of the journal will be slightly larger
than usual and will contain papers from some very prominent statisticians of our time. Second, we
are looking into producing an electronic version of the Journal. Our current plan is to make the
December 1999 issue available on a special Web site. All current subscribers will be able to
download the Journal free of charge. Based on the response to this trial we will see if it is feasible
to offer the Journal in that medium instead of, or in addition to, the current paper version. Watch for
further information in the next issue. As usual your comments and suggestions are always welcome.
This issue covers a variety of topics - three papers on small area estimation, four papers on
general estimation issues and two each on new sampling designs and data analysis.
Kroger, Sarndal and Teikari introduce a new family of sampling designs, called Poisson
Mixture Sampling, which comprises a weighted mixture of Poisson and Bernoulli sampling.
Through a Monte Carlo study using Finnish data, they empirically show that, for a variety of point
estimators, Poisson Mixture Sampling is more efficient than the usual Poisson sampling.
Bell and Kramer deal with the long standing problem of estimating the variance of X-11
estimators. Each month, statistical bureaus throughout the world publish the raw estimates of
variables along with a corresponding measure of error, usually a standard error or a coefficient of
variation. However, the corresponding seasonally adjusted or trend estimates, obtained by
application of the X-11 method, do not have such an associated measure of error. Bell and Kramer
present an interesting approach that offers a practical solution to this problem. They calculate two
sources of error: one resulting from the sampling error and the other resulting from the use of
ARIMA extrapolations at the two ends of the series.
De Haan, Opperdoes and Schut discuss sampling the items in a commodity group for input to the
Consumer Price Index using scanner data. While most statistical offices currently use a judgmental
selection procedure, this naturally leads to biased estimates. The authors address the question of
whether probability sampling would lead to better results in terms of mean square ertor, with
interesting results.
Pierre Duchesne considers a new class of robust calibration estimators used to obtain weights
constrained to given intervals. The process involves changing carefully selected robust default weights
into calibrated weights. In a brief empirical study, the new estimators are illustrated and compared
to estimators which have already been proposed.
Tille investigates a repeated sampling approach which takes into account auxiliary information.
First he generalizes the use of conditional inclusion probabilities for use with any sampling design.
He then constructs estimators that can be viewed as optimal linear estimators, and compares them
with the GREG-estimator. He contrasts all of the estimators via a set of simulations. Finally he
discusses the problem of interaction between the design and the auxiliary variables.
Prasad and Rao consider the problem of small area estimation through the use of a random
effects model. While traditional methods rely on model-based methods to obtain estimates of small
area means, Prasad and Rao obtain design-based (model-assisted) estimates by integrating survey
weights. Corresponding model-based estimators of the mean squared errors (MSE) of the small area
estimates are also derived. Through simulation results, they show that their MSE estimator has low
bias and is quite stable.
In their paper on small area estimation, Moura and Holt focus on multilevel models, which make
use of auxiliary information at both the unit and the small area levels, and allow small area random
effects for both the intercepts and the regression slopes. The fixed and random effects parameters
are estimated using restricted iterative generalized least squares. The mean square ertor is
approximated. Simulations show that the model can lead to better small area estimators than those
based on simpler models, that overspecification of the model does not lead to a serious loss of
efficiency, and that the MSE approximation and associated MSE estimator work well.
Chattopadhyay, Lahiri, Larsen and Reimnitz consider estimation of proportions for rare events
in small areas. Their method is illustrated and compared to other approaches using data from a
telephone survey of alcohol and drug use. Their proposed estimator combines
census-based demographic estimates of population within age/sex/county groups with survey-based
empirical Bayes estimates of proportions within those groups. A jackknife estimator of mean square
error is proposed which captures variability due to estimation of model parameters.
The problem of estimating longitudinal low income proportions from a longitudinal survey having
a complex design is studied in Rubin-Bleuer and Kovacevic. Two design-based estimators are
considered: one based on both the longitudinal and cross-sectional sample, called the "mixed
estimator", and one based entirely on the longitudinal sample. Through simulation, the two
estimators are compared in the presence of attrition using models of compensation that assume
"missing at random" and "completely missing at random" underlying mechanisms. The results are
illustrated using data from two longitudinal surveys.
Tate considers linking data on the same individuals from subsequent quarters of the British
Labour Force Survey, a rotating panel survey in which one fifth of the sample is renewed at each
occasion. She analyzes the various factors which can introduce bias into analyses derived from such
linked data. In particular, she studies the possible effects of sample attrition, respondent errors and
proxy respondents. She also considers various approaches to adjusting for these biases.
Finally, in a short note, Gabler, Haeder and Lahiri present a model-based justification for Kish's
well known formula for design effects. They show that the result is actually a conservative value
for the actual design effect.
The Editor
Survey Methodology, June 1999
Vol. 25, No. 1, pp. 3-11
Statistics Canada
Poisson Mixture Sampling: A Family of Designs for
Coordinated Selection Using Permanent Random Numbers
HANNU KROGER, CARL-ERIK SARNDAL and ISMO TEIKARI
Hannu Kroger and Ismo Teikari, P.O. Box 3A, FIN-00022, Statistics Finland, Finland; Carl-Erik Sarndal, 11th floor, R.H. Coats Bldg., Statistics Canada, Ottawa, Ontario, K1A 0T6, Canada.
ABSTRACT
This paper introduces Poisson Mixture sampling, a family of sampling designs so named because each member of the family
is a mixture of two Poisson sampling designs, Poisson πps sampling and Bernoulli sampling. These two designs are at
opposite ends of a continuous spectrum, indexed by a continuous parameter. Poisson Mixture sampling is conceived for
use with the highly skewed populations often arising in business surveys. It gives the statistician a range of different options
for the extent of the sample coordination and the control of response burden. Some Poisson Mixture sampling designs give
considerably more precise estimates than the usual Poisson πps sampling. This result is noteworthy, because Poisson πps
is in itself highly efficient, assuming it is based on a strong measure of size.
KEY WORDS: Business surveys; Skewed populations; Response burden; Regression estimators.
1. THE OBJECTIVES OF POISSON
MIXTURE SAMPLING
Poisson Mixture (Pomix) sampling is a family of sampling designs suitable for business surveys with their often
highly skewed populations. The Pomix family contains the traditional Bernoulli sampling and Poisson πps sampling
designs as two special cases, situated at the two extremes of a range of possibilities indexed by a continuous parameter.
This parameter, called the Bernoulli width and denoted B, satisfies 0 ≤ B ≤ f_R, where f_R is the predetermined expected
sampling fraction in the "take-some" portion of the population, that is, the portion where randomized selection is applied.
Random numbers, in the form of independent realizations of the Unif(0,1) random variable, are commonly used
in modern computerized sample selection. Fan, Muller and Rezucha (1962) introduced several sequential (unit by unit)
drawing mechanisms based on random numbers. Now, Pomix sampling is based on the Permanent Random
Number (PRN) technique, which calls for assigning at birth a random number to each unit in the frame (the business
register, in the case of a business survey). The random number is permanent in the sense of remaining attached to
the unit during its entire lifetime. The PRN technique makes it easy to achieve coordination of samples and
control of response burden. Early references to sampling with the aid of PRN's are Brewer, Early and Joyce (1972)
and Atmer, Thulin and Backlund (1975). A recent review of different PRN techniques, and important extensions, is
given in Ohlsson (1995).
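The coordination idea behind PRN's is easy to illustrate; the following minimal sketch (in Python; all names are ours, not from the paper) shows how two Bernoulli-type selections that share the same permanent numbers automatically overlap as much as possible.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
N = 1000
prn = rng.uniform(size=N)      # permanent random numbers, assigned once "at birth"

# Two Bernoulli surveys with expected sampling fractions 5% and 10%.
# Selecting the units whose PRN falls below the sampling fraction makes
# the 5% sample a subset of the 10% sample: maximal positive coordination.
sample_a = prn <= 0.05
sample_b = prn <= 0.10
print(sample_a.sum(), sample_b.sum(), (sample_a & sample_b).sum())
```

Negative coordination (minimal overlap) is obtained analogously by starting the second survey's selection interval at a different point of the unit interval.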
Poisson πps sampling has the desirable feature of selecting large units with relatively greater probability than
small units, whose contribution to estimated population totals will in any case be relatively minor. Coordination of
Poisson πps samples with the aid of PRN's was introduced by Brewer et al. (1972) and is discussed subsequently by
several authors, including Sunter (1977) and Ohlsson (1995).
As in Poisson πps sampling, Pomix sampling allows
control of the response burden, as explained in the next
section. Larger enterprises will be selected relatively more
often than smaller ones. The selection is controlled through
rotation so as to distribute the response burden. Another
objective of Pomix sampling is for all (or a substantial
portion) of the population units to be included in sample
(therefore observed, so that their basic data can be updated)
with regularity over a period of time. The objective can be,
for example, that every enterprise should be in sample at
least once during a ten or twelve year period.
2. THE SELECTION PROCEDURE UNDER
POISSON MIXTURE SAMPLING
Denote the finite population as U = {1, ..., k, ..., N},
where the integer k represents the k-th population unit.
Denote by y the variable of interest and by y_k its value for
unit k; y_k is unknown before sample selection and
observation. With the unit k ∈ U is also associated a known
positive size measure x_k. Its role in Pomix sampling is to
bring about a more frequent selection of the larger units; in
addition, the size variable should be used as an auxiliary
variable at the estimation stage.
A sample, s, is realized from the population U. The size
of s may be random; its expected size, denoted n, is a
number fixed in advance. We allow s to consist of two
nonoverlapping parts, s = s_C ∪ s_R, where s_C is called the
certainty part of s and s_R the randomization part of s. The
part s_C, consisting of very large units selected with
probability one, is designated in a preliminary step, with the
aid of the known size measures x_k. One procedure for this
is given in section 3. Depending on the population characteristics, it could happen that no certainty part is designated,
so that s_C is the empty set, but this eventuality is rather
exceptional with the highly skewed populations usually
occurring in business surveys.

A frequently used synonymous term for the certainty part
is take-all stratum. If the take-all stratum is denoted U_C, a
probabilistic description is to say that s_C is drawn from the
take-all stratum U_C so that s_C = U_C with probability one.
We denote the size of s_C = U_C by n_C.

Next, the randomization sample, s_R, is selected from the
rest of the population, U_R = U - U_C, of size N_R = N - n_C.
It consists of units with inclusion probability π_k strictly less
than unity. In this paper, s_R is drawn by Pomix sampling
(thus it uses the PRN technique). The size of s_R is random;
its expected size, denoted n_R, is fixed by the equation
n_R = n - n_C.

In this paper, we use the term Poisson sampling for
selection corresponding to independent unit by unit
Bernoulli trials with any inclusion probabilities π_k. More
specifically, by Poisson πps we mean Poisson sampling
with π_k directly proportional to a measure of size. Bernoulli
sampling is the special case of Poisson sampling where all
π_k are equal.
For Pomix sampling, we need some more notation. For
unit k ∈ U_R, define the relative size measure

$$A_k = n_R x_k \Big/ \sum_{U_R} x_k. \qquad (2.1)$$

For Poisson πps sampling, the inclusion probability of
unit k is π_k = A_k given by (2.1). Coordination of PRN-based
Poisson samples was introduced by Brewer et al. (1972)
using a graphical representation corresponding to B = 0 in
Figures 1 and 2. At each occasion, the selection area is then
a triangle; the unit's PRN on the horizontal axis is plotted
against the unit's size measure, A_k, on the vertical axis.
Coordination is obtained by "moving the selection area
over" to the right. Coordination of Pomix samples is
realized in a similar fashion.

We can from now on assume that A_k < 1 for all k ∈ U_R,
because if A_k < 1 had not been true for certain units k ∈ U_R,
then the procedure in section 3 for constructing the certainty
part of the sample would in effect have assigned those units
to the certainty part s_C.

We now define Pomix sampling with the aid of a two-dimensional diagram. On the horizontal axis, a unit's PRN
is plotted. On the vertical axis, a size-related measure, Q_k,
is plotted. At each survey occasion, a new sampling selection region is designated by rotation in this diagram, and
sample coordination is realized in the manner that we now
describe.

Pomix sampling is characterized by two parameters, B
and f_R, where f_R such that 0 < f_R = n_R/N_R < 1 denotes the
fixed expected sampling rate in U_R = U - U_C. The parameter B, called the Bernoulli width, is such that 0 ≤ B ≤ f_R.
For every unit k ∈ U_R, define

$$Q_k = \frac{(1 - B/f_R)\, A_k}{1 - B}, \qquad (2.2)$$

where A_k is given by (2.1). For B = 0, we have Q_k = A_k,
which is the size measure for the usual Poisson πps sampling. At the other extreme, B = f_R, we have Q_k = 0 for all
k ∈ U_R; in this case, size will play no role in the selection
from U_R, which will be seen to reduce to Bernoulli
sampling. The measures Q_k are used in Pomix selection of
coordinated samples, as we now describe.

Start with a plot of the points (r_k, Q_k) for k ∈ U_R, where
r_k denotes the PRN attached at birth to unit k, and Q_k is
given by (2.2). With reference to Figure 1, Pomix sampling
is defined as follows: Include in the randomization sample, s_R,
all units having PRN's r_k falling in the (0, B] interval, and
also include some units having PRN's r_k in the (B, 1]
interval, namely, those for which Q_k is at least equal to a
threshold value situated on the line joining the points (B, 0)
and (1, 1). The selection area is thus the shaded part of
Figure 1. Note that since A_k < 1 for all k ∈ U_R, we have by
(2.2) that Q_k ≤ (1 - B/f_R)/(1 - B) ≤ 1 for all k ∈ U_R.

[Figure 1. Sampling at time 1]
[Figure 2. Sampling at time 2]
Figures 1 and 2 illustrate how coordinated Pomix sampling from U_R = U - U_C can be carried out at two consecutive survey occasions. In each of the two figures, the
sample is defined as the set of units for which the point
(r_k, Q_k) falls in the shaded area. The "starting point" on the
PRN axis is the point to the right of which we start to count
units for inclusion in the sample. At time 1 (Figure 1), the
starting point is 0; at time 2 (Figure 2), the starting point is
D. (In general, the starting point can be a randomly selected
point in the unit interval; in other words, the sample identified in Figure 2 is also the one that would be selected at
time t = 1 if at that time the randomly selected starting
point on the PRN axis had been equal to D.)

A convenient way to achieve sample rotation is through
the constant shift method, which implies that the starting
point is moved over to the right by a fixed amount at every
new occasion of sampling; see Ohlsson (1995). The constant D is called the constant shift. The starting point at time
3 would thus be 2D, and so on.
In the following we examine Pomix sampling and estimation at a single occasion, and we can concentrate on
Figure 1 (time 1), with starting point 0 on the PRN axis.
The algorithm for Pomix sampling with parameters B and
f_R, and starting point 0, is thus as follows: From Figure 1,
unit k is included in the randomization part, s_R, (i) if
0 < r_k ≤ B, or (ii) if B < r_k ≤ 1 and Q_k ≥ (r_k - B)/(1 - B).
Consequently, k is included in s_R if

$$0 < r_k \le B + Q_k (1 - B).$$

Because r_k ~ Unif(0,1), the first order inclusion probabilities under Pomix sampling are

$$\pi_k = \begin{cases} B + Q_k(1 - B) & \text{for } k \in U_R \\ 1 & \text{for } k \in U_C. \end{cases} \qquad (2.3)$$

It is easy to see that the inclusion probabilities satisfy the
necessary requirement that their sum must equal the
expected sample size fixed in advance:

$$\sum_U \pi_k = \sum_{U_C} 1 + \sum_{U_R} \{B + Q_k(1 - B)\} = n_C + n_R = n.$$

We now note two extreme cases of the family of Pomix
sampling schemes: Bernoulli sampling is obtained if B = f_R
in the Pomix algorithm, because then Q_k = 0 for all k ∈ U_R,
and the algorithm becomes: Include unit k ∈ U_R in s_R if
0 < r_k ≤ f_R, which is Bernoulli sampling. Poisson πps
sampling is obtained if B = 0 in the Pomix algorithm,
because then the algorithm becomes: Include unit k ∈ U_R
in s_R if 0 < r_k ≤ A_k, where A_k is given by (2.1). But this is
Poisson πps sampling from U_R, the inclusion probability
being π_k = A_k, that is, directly proportional to the size
measure x_k.

Pomix sampling is a mixture of Poisson πps and
Bernoulli in that the Pomix inclusion probability,
π_k = B + Q_k(1 - B), equals a linear combination of the
inclusion probabilities that apply under the two extreme
designs, weighted by the relative Bernoulli width, λ = B/f_R,
such that 0 ≤ λ ≤ 1. We have π_k = λ π_k^Ber + (1 - λ) π_k^πps, where
π_k^Ber = f_R for all k (Bernoulli) and π_k^πps = A_k (Poisson πps).

The character of Pomix sampling is determined by its
two parameters, B and f_R. To illustrate, we note from (2.1)
and (2.3) that the inclusion probability of unit k ∈ U_R is
π_k = B + (1 - B/f_R) A_k. Thus, for a unit k that is large (but
not large enough to qualify for s_C = U_C), so that A_k is near
unity, we have, to close approximation, π_k = B + 1 - B/f_R.
By contrast, for a unit that is small, so that its value A_k is
very near zero, we have, to close approximation, π_k = B,
independently of the size. For example, with f_R = 10%, the
following Table 1 shows how the inclusion probability π_k
varies with B, where B = 0 is Poisson πps, and
B = f_R = 0.10 is Bernoulli.

Table 1
Values of the Inclusion Probability π_k as a Function of the Parameter B
and the Relative Size A_k, When the Fixed Expected Sampling Rate, f_R, is 0.1

                           Value of B
Values of π_k          0      0.03    0.05    0.07    0.10
A_k = 0 (small unit)   0      0.03    0.05    0.07    0.10
A_k = 1 (large unit)   1      0.73    0.55    0.37    0.10

This illustrates that for a Pomix sampling design close to
Bernoulli (B near 0.10), the inclusion probabilities of large
and small units alike lie near the fixed expected sampling
rate, 0.10. By contrast, in a Pomix design close to Poisson
πps (B near 0), a small unit is practically certain not to be in
sample, and a large unit is practically sure to be in sample.
The table also illustrates how Pomix sampling with an intermediate value of B will modify the inclusion probabilities:
a small unit's chances to appear in the sample are decreased
somewhat compared to Bernoulli, and the selection of a
large unit becomes less probable than under Poisson πps.

The implications for response burden are: The total
response burden on the population remains the same for all
values of B; an expected total of n = n_R + n_C units are
always asked to report. Compared to Poisson πps (B = 0),
the fixing of a value B in the interior of the interval [0, f_R]
will have the effect of shifting some of the response burden
from larger onto smaller units; at the same time the precision of the estimates is increased in many cases (see
sections 5 and 6).

Finally, we need to mention the second order inclusion
probabilities under Pomix sampling, because they are
required for the design-based variance calculation. They are
simple. If π_kl denotes the probability that units k and l are
both in the sample, then

$$\pi_{kl} = \pi_k \pi_l \qquad (2.4)$$

for k ≠ l ∈ U = U_R ∪ U_C, because the PRN's r_k are
independent realizations of the Unif(0,1) random variable.
For k = l, we have π_kl = π_kk = π_k. The multiplicative feature
(2.4) of the π_kl greatly simplifies the design-based variance
calculation. We get a simple, single-sum variance estimator,
as in (4.2) below.
3. DETERMINING THE CERTAINTY
PART OF THE SAMPLE

If the population is highly skewed, a set of units (the
certainty part of the sample, s_C) will be sampled with
probability one, and Pomix sampling can be used for
randomized selection in the remaining part of the population, U_R. Several procedures could be considered for the
construction of the certainty set; here we give one that is
reasonable (though not necessarily optimal) and used in the
Monte Carlo simulation reported later in section 5. The
certainty set is designated with the aid of the known positive size measures x_k through the following procedure in
one or more steps.

An expected sample size, n, is fixed for
the whole sample, s = s_C ∪ s_R. In step one, compute the
relative size measures A_k(1) = n x_k / Σ_U x_k for k ∈ U. Those
units k, if any, for which A_k(1) ≥ 1 are assigned to the
certainty part. They form a set denoted U_C(1); let its size be
n_C(1). The procedure is then repeated to see if additional
units should be assigned to the certainty part. In step two,
calculate the relative size measures A_k(2) = (n - n_C(1)) x_k / Σ x_k,
where the summation extends over the set U - U_C(1). If
A_k(2) < 1 for all k ∈ U - U_C(1), the procedure stops, and the
final certainty part is s_C = U_C(1). But if A_k(2) ≥ 1 for some
units, then these are also assigned to the certainty part, and
so on, until a step is reached where all intermediate relative
size measures are less than unity. The ultimate certainty
part s_C will contain, say, n_C units, and we then have
A_k < 1 for all k ∈ U - s_C, where A_k = n_R x_k / Σ x_k with
n_R = n - n_C and the sum extends over U - s_C.
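A direct transcription of this stepwise procedure into code might look as follows (a sketch; all names are ours):

```python
import numpy as np

def certainty_part(x, n):
    """Iteratively assign units with relative size measure >= 1 to the
    certainty part s_C, given size measures x and expected sample size n."""
    N = len(x)
    in_C = np.zeros(N, dtype=bool)
    while True:
        rest = ~in_C
        A = (n - in_C.sum()) * x / x[rest].sum()   # intermediate relative size measures
        new = rest & (A >= 1.0)                    # units forced into the certainty part
        if not new.any():
            return in_C                            # now A_k < 1 for all k in U - s_C
        in_C |= new

rng = np.random.default_rng(seed=3)
x = rng.lognormal(mean=3.0, sigma=1.2, size=1000)
s_C = certainty_part(x, n=100)
print(s_C.sum())   # number of take-all units, n_C
```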
4. ESTIMATION FOLLOWING POMIX
SAMPLING

Although the auxiliary variable serves a useful purpose
at the sampling stage, we advocate using it also at the estimation stage. To estimate the population total, Y = Σ_U y_k,
consider the generalized regression (GREG) estimator

$$\hat{Y}_{GREG} = \sum_s a_k g_k y_k, \qquad (4.1)$$

where a_k = 1/π_k is the sampling weight and the second
weight, the g-weight, is given by

$$g_k = 1 + (X - \hat{X})' T_s^{-1} x_k / c_k, \qquad T_s = \sum_s a_k x_k x_k' / c_k,$$

where x_k denotes the auxiliary vector value for unit k and
X̂ = Σ_s a_k x_k. The auxiliary information requirement is that the vector total X = Σ_U x_k must
be known from a reliable source. The unidimensional size
variable x_k used for computing the Q_k in the Pomix
sampling scheme can be one of the components of x_k or it
can be a linear combination of the components of x_k. In the
empirical study reported in section 5, x_k is unidimensional,
and x_k = x_k. The constants c_k, specified by the user,
provide a means of weighing the data, in addition to the
survey weights a_k. An often used choice is c_k = 1 for all k.

A commonly used estimator of the variance of the
GREG estimator (see Sarndal, Swensson and Wretman
1992, Ch. 6) is given as a quadratic form in the quantities
a_k g_k e_k, where e_k is the regression residual e_k = y_k - x_k' b̂,
with b̂ = T_s^{-1} Σ_s a_k x_k y_k / c_k. One of the advantages of Pomix
sampling is that the corresponding variance estimation is
simple. This is because the PRN's r_k are independent
realizations from the Unif(0,1) distribution. Hence equation
(2.4) applies, and all product terms of the quadratic form
are zero. With only the squared terms left, the variance
estimator becomes simply V̂ = Σ_s a_k(a_k - 1) g_k² e_k². Finally,
because a_k - 1 = 0 for all k ∈ s_C, we get the variance
estimator used in the Monte Carlo study reported later in
section 5, namely,

$$\hat{V} = \sum_{s_R} a_k (a_k - 1)\, g_k^2 e_k^2. \qquad (4.2)$$
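For the scalar auxiliary x_k = x_k used in section 5, (4.1) and (4.2) can be computed in a few lines; the following sketch (all names are ours) returns both the point estimate and its variance estimate.

```python
import numpy as np

def greg_and_variance(y, x, pi, c, X_total):
    """GREG point estimator (4.1) and single-sum variance estimator (4.2)
    for a scalar auxiliary variable x (so T_s and b-hat are scalars)."""
    a = 1.0 / pi                                        # sampling weights
    T = np.sum(a * x * x / c)
    b = np.sum(a * x * y / c) / T                       # regression slope b-hat
    g = 1.0 + (X_total - np.sum(a * x)) * x / (c * T)   # g-weights
    e = y - b * x                                       # regression residuals
    Y_hat = np.sum(a * g * y)
    # eq. (2.4) kills all cross terms of the usual quadratic form, and
    # a_k - 1 = 0 on the certainty part, so a single sum remains:
    V_hat = np.sum(a * (a - 1.0) * g**2 * e**2)
    return Y_hat, V_hat

# Toy usage: a Bernoulli sample (pi = 0.1 for all units) with c_k = x_k,
# which reproduces the combined ratio estimator of section 5.
rng = np.random.default_rng(seed=0)
x = rng.lognormal(mean=3.0, sigma=0.5, size=1000)
y = 5.0 * x * (1.0 + 0.3 * rng.normal(size=1000))
s = rng.uniform(size=1000) <= 0.1
Y_hat, V_hat = greg_and_variance(y[s], x[s], np.full(s.sum(), 0.1), x[s], x.sum())
print(Y_hat, y.sum(), V_hat)
```

The common default c_k = 1 gives the ordinary regression estimator instead.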
5. A MONTE CARLO STUDY OF POMIX
SAMPLING USING FINNISH DATA

To illustrate various aspects of Pomix sampling, we
conducted a Monte Carlo study involving four different
estimators of the population total Y. The experiment
involved repeated draws of samples as well as repeated
assignments of the set of N PRN's to the population units.
Note that since every assignment of the N PRN's to the
population units is a random outcome, a proper Monte
Carlo study also requires repetitions of the PRN assignments. Therefore, after the first assignment of the N
PRN's, we selected 100 Pomix samples, using a fixed
value of the Bernoulli width B. (Each sample was realized
using a new, randomly selected starting point on the PRN
axis.) Then a new set of N PRN's was assigned, 100
additional samples were drawn, and so on, until we had
reached 100 x 100 = 10,000 PRN/sample pairs, for the
given value of B. For each of the 10,000 pairs, we
computed the four point estimators, the corresponding four
variance estimators, and the corresponding four confidence
intervals. With 10,000 repetitions, we expect the Monte
Carlo error to be rather small.

The four estimators used in the Monte Carlo study have
the following expressions, where a_k = 1/π_k is the sampling
weight of unit k, and π_k is given by (2.3):

1. The Horvitz-Thompson estimator,

$$\hat{Y}_1 = \sum_s a_k y_k = \sum_{U_C} y_k + \sum_{s_R} a_k y_k.$$

2. The (combined) ratio estimator,

$$\hat{Y}_2 = X \hat{b}_2,$$

where X = Σ_U x_k and b̂_2 = Σ_s a_k y_k / Σ_s a_k x_k. It is a
special case of (4.1) such that x_k = x_k and c_k = x_k.

3. The GREG estimator,

$$\hat{Y}_3 = \sum_{U_C} y_k + \Big\{ \sum_{s_R} a_k y_k + \hat{b}_3 \Big( X_R - \sum_{s_R} a_k x_k \Big) \Big\},$$

where X_R = Σ_{U_R} x_k and b̂_3 = Σ_{s_R} a_k(a_k - 1) y_k x_k / Σ_{s_R} a_k(a_k - 1) x_k².
It is a special case of (4.1) such that x_k = x_k and c_k = (a_k - 1)^{-1}.

4. The (separate) ratio estimator,

$$\hat{Y}_4 = \sum_{U_C} y_k + X_R \hat{b}_4,$$

where X_R = Σ_{U_R} x_k and b̂_4 = Σ_{s_R} a_k y_k / Σ_{s_R} a_k x_k.
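In code, the four point estimators can be sketched as follows (a sketch under the setup above; π_k = 1 on the certainty part, and all names are ours):

```python
import numpy as np

def four_estimators(y_C, x_C, y_R, x_R, pi_R, X_total):
    """Compute Y1 (HT), Y2 (combined ratio), Y3 (GREG), Y4 (separate ratio).

    y_C, x_C : y- and x-values of the certainty part (pi = 1 there)
    y_R, x_R : values for the *sampled* units of the randomization part
    pi_R     : their inclusion probabilities, eq. (2.3)
    X_total  : known population total of x
    """
    a = 1.0 / pi_R
    X_R = X_total - x_C.sum()                     # total of x over U_R (x known for all units)
    ht_R = np.sum(a * y_R)
    Y1 = y_C.sum() + ht_R                                        # Horvitz-Thompson
    b2 = (y_C.sum() + ht_R) / (x_C.sum() + np.sum(a * x_R))
    Y2 = X_total * b2                                            # combined ratio
    w = a * (a - 1.0)
    b3 = np.sum(w * y_R * x_R) / np.sum(w * x_R**2)
    Y3 = y_C.sum() + ht_R + b3 * (X_R - np.sum(a * x_R))         # GREG
    b4 = ht_R / np.sum(a * x_R)
    Y4 = y_C.sum() + X_R * b4                                    # separate ratio
    return Y1, Y2, Y3, Y4
```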
For Poisson πps sampling (B = 0), we have b̂_4 =
(Σ_{s_R} y_k / x_k) / n_R, where n_R is the random size of s_R; the
corresponding estimator Ŷ_4 was considered by Brewer
et al. (1972). Now Ŷ_2 and Ŷ_4 differ in that the regression
slope is calculated in Ŷ_2 on the pooled sample s, but in Ŷ_4
the slope is calculated separately for the randomization
sample s_R. Finally, Ŷ_3 differs from Ŷ_2 and Ŷ_4 in that it
uses the weighting a_k(a_k - 1), instead of just a_k. Note that
all of Ŷ_2, Ŷ_3 and Ŷ_4 are members of the GREG family
of estimators given by (4.1). By equating Ŷ_2, Ŷ_3 and Ŷ_4 to
(4.1), we find the g-weights implied by each of the three
estimators. These weights are required for the variance
estimation. We can expect the simulation to show that Ŷ_2, Ŷ_3
and Ŷ_4, which use the auxiliary variable both at the design
stage and at the estimation stage, will improve on (have
smaller variance than) the HT estimator Ŷ_1, which uses the
auxiliary information only at the sampling stage, but the
extent of the improvement is unpredictable and interesting
to observe.

We used a real data population for the Monte Carlo
simulation. This population consists of N = 1,000 Finnish
enterprises. For enterprise k, k = 1, ..., 1,000, y_k is number
of employees (full time equivalents) x 10, and x_k is the
wages paid by the enterprise to its employees, in thousands
of FIM (Finnish Marks). The auxiliary information (wages
paid) comes from the Finnish tax authority's VAT register.
The employment variable is the one requiring estimation.
The 1,000 units were selected (in an essentially random
manner) from an original larger population of Finnish
enterprises. Units with a value of zero either on y_k or on x_k
were eliminated so that the simulation results would not be
disturbed by extraneous factors. Consequently, as for the
values y_k and x_k, the population used in the simulation is
a natural one, but because of the elimination of units, its
features (mean, standard deviation, skewness, etc.) differ
from those of the original larger population.

The population y-total to be estimated is Y = Σ_U y_k =
169,168. We fixed the expected sample size for the total
sample, s = s_C ∪ s_R, as n = 100. The procedure described
in section 3 was used to determine the certainty part s_C of
the sample. This resulted in a certainty part s_C consisting
of the largest 29 = n_C units. The rest of the population,
U_R = U - s_C, has the following descriptive characteristics:
Its size is N_R = 1,000 - 29 = 971; the total of y is Y_R =
46,138 (which equals 27% of the entire population total Y
= 169,168); the coefficient of variation (standard deviation
divided by mean) is 1.78 for the variable y and 1.94 for
the variable x; the coefficient of correlation between x and
y is 0.965. The randomization part s_R, of expected size n_R
= 100 - 29 = 71 units, is realized, in the simulation, by
repeated Pomix sample selection from U_R. A plot of
(x_k, y_k) for the units k in U_R is shown in the Appendix.

To see the effect of the Bernoulli width, we carried out
the simulation for a range of different B-values: B = 0, 0.01,
..., 0.07, and, in addition, B = f_R = n_R/N_R = 71/971 = 0.073
(which gives Bernoulli). For each value of B, 100 x 100 =
10,000 PRN/sample pairs were realized, and the results
were used to calculate, for each of the four point estimators,
five Monte Carlo summary statistics. These are as follows,
if Ŷ denotes one of the four point estimators, V̂ the corresponding variance estimator obtained from (4.2), and
(Ŷ - z_{1-α/2} V̂^{1/2}, Ŷ + z_{1-α/2} V̂^{1/2}) the corresponding confidence
interval for Y at the nominal confidence level 1 - α, where
z_{1-α/2} is the standard normal score, z_{1-α/2} = 1.960 for
α = 5%, and z_{1-α/2} = 1.645 for α = 10%:

(1) MCE Ŷ = the Monte Carlo expectation of the point
estimator Ŷ, that is, the arithmetic mean of the 10,000
point estimates;

(2) MCV Ŷ = the Monte Carlo variance of the point
estimator Ŷ, that is, the variance of the 10,000 point
estimates;

(3) MCE V̂ = the Monte Carlo expectation of the variance
estimator V̂, that is, the arithmetic mean of the 10,000
variance estimates;

(4) MCRTE95 = Monte Carlo coverage rate for nominal
95% confidence intervals, that is, the number of times
that the target parameter Y is contained in the
confidence interval, divided by 10,000, and expressed
in per cent;

(5) MCRTE90 = Monte Carlo coverage rate for nominal
90% confidence intervals; its definition is analogous to
that of MCRTE95.

The simulation results are shown in Table 2 (Average
sample size, Monte Carlo variance, Monte Carlo expectation of variance estimator) and in Table 3 (Monte Carlo
coverage rates). The tables do not show MCE Ŷ, because
in all cases this quantity was very close to the target parameter value Y = 169,168, confirming that all estimators are
essentially unbiased. The deviation of MCE Ŷ from Y was
in all cases less than 0.14%, in most cases considerably
less. The average sample size over the 10,000 repetitions is
seen to be very close to n = 100, as it should.
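The simulation design itself (repeated PRN assignments, each followed by repeated draws with random starting points) is compact; the sketch below computes MCE Ŷ and MCV Ŷ for the HT estimator, reusing the hypothetical helpers pomix_sample and certainty_part from the earlier sketches:

```python
import numpy as np

def monte_carlo_ht(y, x, n, B, n_prn=100, n_draws=100, seed=4):
    """MCE and MCV of the HT estimator under Pomix sampling:
    n_prn PRN assignments, n_draws sample draws per assignment."""
    rng = np.random.default_rng(seed)
    in_C = certainty_part(x, n)                 # take-all part (sketch after section 3)
    y_C, y_R, x_R = y[in_C], y[~in_C], x[~in_C]
    n_R = n - in_C.sum()
    f_R = n_R / len(x_R)
    A = n_R * x_R / x_R.sum()                   # eq. (2.1)
    pi = B + (1.0 - B / f_R) * A                # eq. (2.3), since Q_k (1 - B) = (1 - B/f_R) A_k
    estimates = []
    for _ in range(n_prn):
        prn = rng.uniform(size=len(x_R))        # a fresh assignment of the PRN's
        for _ in range(n_draws):
            s = pomix_sample(x_R, prn, f_R, B, shift=rng.uniform())
            estimates.append(y_C.sum() + np.sum(y_R[s] / pi[s]))
    est = np.asarray(estimates)
    return est.mean(), est.var()                # MCE and MCV of the point estimator
```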
Table 2
Results of Simulation Study for Different Bernoulli Widths B: Average Sample Size, MCV Ŷ and MCE V̂;
est. j Refers to Estimator Ŷ_j, j = 1, ..., 4. (Values in the Last Eight Columns to be Multiplied by 10^8.)

Bernoulli  Average                MCV Ŷ                            MCE V̂
width B    sample size   est. 1  est. 2  est. 3  est. 4    est. 1  est. 2  est. 3  est. 4
0.000       99.95         24.56    3.63    3.43    3.46     24.92    3.67    3.43    3.46
0.010      100.05         22.74    1.96    1.84    1.85     23.53    1.98    1.86    1.87
0.020      100.04         24.75    1.82    1.77    1.78     25.37    1.85    1.77    1.78
0.025      100.09         25.51    1.82    1.78    1.79     26.86    1.87    1.81    1.82
0.030      100.06         28.03    1.83    1.80    1.81     28.58    1.91    1.87    1.88
0.040       99.86         35.17    2.01    2.03    2.06     33.54    2.11    2.11    2.15
0.050      100.02         42.25    2.56    2.64    2.67     41.42    2.48    2.51    2.59
0.060       99.99         56.08    3.42    3.65    3.67     55.70    3.20    3.24    3.44
0.070      100.05         90.73    4.80    5.47    5.59     91.28    4.89    4.72    5.37
0.073      100.02        119.13    6.06    7.09    7.43    116.27    6.00    5.34    6.49
Table 3
Results of Simulation Study for Different Bernoulli Widths B: Coverage Rates in % of Nominal 95% and
90% Confidence Intervals, MCRTE95 and MCRTE90; est. j Refers to Estimator Ŷ_j, j = 1, ..., 4

Bernoulli   nominal 95% confidence level       nominal 90% confidence level
width B    est. 1  est. 2  est. 3  est. 4     est. 1  est. 2  est. 3  est. 4
0.000       94.50   92.03   92.75   92.48      89.70   86.36   87.35   86.74
0.010       95.20   93.13   93.47   93.52      90.43   88.06   87.93   88.02
0.020       95.06   94.23   93.88   93.88      90.36   89.36   88.49   88.55
0.025       95.06   93.73   94.56   94.70      90.64   89.15   89.73   89.72
0.030       94.63   94.12   94.09   94.19      89.85   89.06   88.70   88.86
0.040       93.84   94.44   94.47   94.64      88.77   89.60   89.41   89.60
0.050       93.97   94.38   93.76   93.82      88.67   88.67   88.08   88.53
0.060       93.54   93.57   92.12   92.69      89.10   87.57   85.99   87.27
0.070       92.93   94.99   90.67   92.03      88.40   88.62   84.27   86.11
0.073       91.03   95.02   88.03   90.46      86.53   88.62   81.26   83.86
Tables 2 and 3 generate these comments:

1. Let us begin the examination of Table 2 by a comparison
of Monte Carlo variances across estimators, for a fixed
Bernoulli width B. This shows that, for every value of
B, there is little to choose between Ŷ_2, Ŷ_3 and Ŷ_4 in
terms of variance. By contrast, the HT estimator Ŷ_1 has
considerably greater variance. To illustrate, the ratio
MCV Ŷ_3 / MCV Ŷ_1 equals 3.43/24.56 = 0.140 for B = 0
(Poisson πps), 1.80/28.03 = 0.064 for B = 0.03, and
7.09/119.13 = 0.060 for B = 0.073 (Bernoulli). This
confirms that the HT estimator is a poor choice
compared to an alternative that uses the strongly correlated auxiliary variable. This is true, not surprisingly, for
Bernoulli, but also for B-values near the lower end of the
[0, f_R] interval, which shows that the sampling design
alone does not extract all the power of the auxiliary
variable, even though with B near zero, we are close to
a strict πps selection (thus supposedly highly efficient).
Part of the reason that the HT estimator has a
comparatively large variance is that the randomness of
the sample size under Pomix sampling penalizes the HT
estimator (but not the GREG estimators). Since the HT
estimator is inefficient, we do not further discuss it.

2. Examining the small differences between Ŷ_2, Ŷ_3 and Ŷ_4,
we note in Table 2: As measured by the Monte Carlo
variance, Ŷ_3 is better than Ŷ_4 for all Bernoulli widths B,
but only marginally so. Also, Ŷ_3 and Ŷ_4 are better than Ŷ_2
at the lower end of the range of B-values, possibly
because of the fact that in Ŷ_2 we allow the certainty part
of the sample to contribute to the slope estimate, somewhat
inappropriately, since there is only an estimation
problem for the randomization part. But at the upper
end, the relation is reversed, and for the upper extreme
B = 0.073 (Bernoulli), Ŷ_2 is clearly better than Ŷ_3.
That the differences between Ŷ_2, Ŷ_3 and Ŷ_4 are so small
is not surprising, because all are varieties of the GREG
estimator (4.1) using essentially the same auxiliary
information.
3. Table 2 confirms that the proposed variance estimator V̂
works well, as we would expect; MCE V̂ is with few
exceptions very close to the target that V̂ aims at
estimating, that is, the variance of Ŷ, measured here by
MCV Ŷ. This holds for all estimators and all values of
B, with a few notable exceptions, namely, in the case of Ŷ_3
and Ŷ_4 when B is close to the upper extreme (Bernoulli).
Then the variance estimator underestimates the variance.

4. The most interesting result in Table 2 we consider to be
the fact that the variance of Ŷ_2 or Ŷ_3 or Ŷ_4, when
viewed as a function of the Bernoulli width B, does not
attain its minimum at B = 0 (Poisson πps), as one might
have initially guessed, but rather for a value of B
somewhere between 0.02 and 0.03. Moreover, the
improvement of the case B = 0.02 over the case B = 0 is
substantial for all of Ŷ_2, Ŷ_3 and Ŷ_4. Measuring this
improvement by MCV(Ŷ | B = 0.02) divided by
MCV(Ŷ | B = 0), we find that this ratio is only about
50% for all of Ŷ_2, Ŷ_3 and Ŷ_4. More precisely, for Ŷ_2 the
ratio is 1.82/3.63 = 0.501, for Ŷ_3 it is 1.77/3.43 = 0.516,
and for Ŷ_4 it is 1.78/3.46 = 0.514. In view of these
results, we added a simulation for B = 0.025, a value not
examined in the original round of simulations. The
results, also displayed in Table 2, confirm that a
minimum variance is obtained, for all three estimators Ŷ_2, Ŷ_3
and Ŷ_4, at a point in the vicinity of B = 0.025. One
possible explanation of why it is considerably better to
take B to be a value distinctly greater than B = 0 (which
gives Poisson πps) is the following: When B is 0 (or
very near 0), the units with the smallest x-values, when
selected, will have unduly large weights, which induces
high variability. This is avoided by choosing B clearly
away from zero.

5. The Monte Carlo results in Table 3 concerning the coverage rates show that the variance estimation and the
confidence interval procedure function to satisfaction.
As theory leads us to expect, MCRTE95 and MCRTE90
are close, for all four estimators, to their theoretical
values, 95% and 90%, respectively. Only for Ŷ_3
and Ŷ_4, when B gets close to the upper extreme
(Bernoulli), do we notice any marked tendency for the
MCRTE to drop below the nominal value, resulting in
part from the underestimation of variance mentioned
earlier.
6. FURTHER EVIDENCE THAT POMIX
SAMPLING IS MORE EFFICIENT THAN
POISSON πps SAMPLING

Initially we had no strong reason to believe that Pomix
sampling combined with a GREG estimator would be more
efficient for some Bernoulli widths B in the interior of
[0, f_R] than for Poisson πps (B = 0). The strong improvement - a variance reduction of around 50% for our
particular population - was rather surprising. For other
populations, the variance reduction can be more or less than
the 50% we found. Because our finding is data dependent,
it is desirable to provide some more general evidence in
support of the proposition that Pomix sampling with a
B-value well into the interior of [0, f_R] is better than
Poisson πps sampling (B = 0). We now present some
evidence of this kind.
We examined the Taylor variance of Ŷ (that is, the
variance of the Taylor linearized statistic). It is given by
(see Sarndal et al. 1992, Ch. 6)

$$V_{Taylor}\,\hat{Y} = \sum_U (a_k - 1) E_k^2,$$

where E_k is the population analogue of the sample based
residual e_k used in the variance estimator (4.2). For
example, for the estimator Ŷ_3, the residual in question is
E_k = y_k - B_3 x_k with B_3 = Σ_{U_R}(a_k - 1) y_k x_k / Σ_{U_R}(a_k - 1) x_k²;
for Ŷ_4, B_4 = Σ_{U_R} y_k / Σ_{U_R} x_k replaces B_3.

It is reasonable to model the squared residual as
E_k² = σ² x_k^p (1 + δ_k), where p satisfies 0 ≤ p ≤ 2, and δ_k is
near zero. This corresponds to assuming a superpopulation
model y_k = β x_k + ε_k, where the ε_k are independent errors
with model expected value zero for every k, and ε_k has
model variance σ² x_k^p. Using the approximation E_k² ≈ σ²
x_k^p, and that a_k = 1/π_k with π_k given by (2.3), we have

$$V_{Taylor}\,\hat{Y} \approx \sum_{U_R} (a_k - 1)\,\sigma^2 x_k^p = \sigma^2 \{H(B, p) - T(p)\},$$

where

$$H(B, p) = \bar{x}_U \sum_{U_R} x_k^p \,[B \bar{x}_U + (f_R - B) x_k]^{-1},$$

with x̄_U = Σ_{U_R} x_k / N_R, and T(p) = Σ_{U_R} x_k^p. Now consider a fixed value of p such that
0 ≤ p ≤ 2. We want to find out if H(B, p) has a smaller
value for some B in the interior of the interval [0, f_R],
compared to its value at B = 0, which is H(0, p). To this
end, let us examine if the derivative H'(B, p) =
∂H(B, p)/∂B is negative at B = 0. We find

$$H'(B, p) = \bar{x}_U \sum_{U_R} x_k^p (x_k - \bar{x}_U)\,[B \bar{x}_U + (f_R - B) x_k]^{-2}.$$

Its value at B = 0 is H'(0, p) = (x̄_U / f_R²) Σ_{U_R} x_k^{p-2}(x_k - x̄_U).
The sign of H'(0, p) is the same as that of
Σ_{U_R} x_k^{p-2}(x_k - x̄_U). But this quantity equals, apart from the
factor 1/(N_R - 1), the covariance in U_R between x_k^{p-2} and
x_k - x̄_U (note that x_k - x̄_U has zero mean). When p
satisfies 0 ≤ p < 2, this covariance is negative: when x_k
increases, x_k - x̄_U increases steadily, and x_k^{p-2} decreases
steadily (and remains always positive). The sign of
H'(0, p) is therefore negative; consequently, it is not at
B = 0 that H(B, p) attains its minimum value, but for some
B in the interior of [0, f_R]. Now for p = 2, H'(0, p) = 0, and
H(B, p) has a minimum at B = 0.

These considerations raise the question whether the
population used for our simulation in section 5 corresponds
to a value of p ∈ [0, 2], but distinctly less than 2, so that we
can expect significant gains from Pomix sampling. To
obtain an answer, we estimated p by fitting the logarithmic
version of the model E_k² = σ² x_k^p (1 + δ_k) to the data available for U_R = U - U_C. That is, we fitted w_k = α + p z_k,
where w_k = log E_k²; E_k = y_k - b_4 x_k with b_4 = Σ_{U_R} y_k / Σ_{U_R}
x_k; z_k = log x_k, and α is an intercept term. We obtained the
value p̂ = 1.45, by treating p as a linear regression slope
estimated as

$$\hat{p} = \sum_{U_R} (w_k - \bar{w}_U)(z_k - \bar{z}_U) \Big/ \sum_{U_R} (z_k - \bar{z}_U)^2.$$

Since this p-value is considerably less than 2, our Monte
Carlo population is indeed one where one can expect
significant gains from the use of Pomix sampling with a
value of B in the interior of [0, f_R].
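Because H(B, p) depends only on the size measures, its minimizing B is easy to explore numerically; a sketch (with a simulated skewed population standing in for U_R; all names are ours):

```python
import numpy as np

def H(B, p, x, f_R):
    """H(B, p) = x_bar * sum_k x_k^p / (B * x_bar + (f_R - B) * x_k) over U_R."""
    x_bar = x.mean()
    return x_bar * np.sum(x**p / (B * x_bar + (f_R - B) * x))

rng = np.random.default_rng(seed=5)
x = rng.lognormal(mean=3.0, sigma=1.2, size=971)     # skewed sizes standing in for U_R
f_R = 71 / 971
grid = np.linspace(0.0, f_R, 200)[1:-1]              # interior of [0, f_R]
for p in (1.0, 1.45, 2.0):
    values = [H(B, p, x, f_R) for B in grid]
    print(p, grid[int(np.argmin(values))])           # B minimizing H(B, p) on the grid
```

For p well below 2 the minimizing B lies clearly inside the interval, in line with the sign argument for H'(0, p) above.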
7. CONCLUDING DISCUSSION AND
TENTATIVE RECOMMENDATIONS

The survey sampler will ask: If I consider using Pomix
sampling for my survey, combined with a GREG estimator,
what is an appropriate choice for B? Recall that in this
paper we found, for one particular population, that a large
efficiency gain (roughly 50% variance reduction compared
to Poisson πps) is realized by fixing the Pomix parameter
B at around 30% of f_R. We were led to suspect that the
variance gain is related to residual error characteristics, and
this was confirmed in Section 6, which presented evidence
that when the squared residual pattern conforms to
E_k² = σ² x_k^p, where p satisfies 0 ≤ p < 2, as is the case in
many business survey populations, then Pomix sampling
with B in the interior of the interval [0, f_R] may well be
advantageous. However, the present paper does not address
the question of the optimal choice of B. A difficulty is that
in practice the value of B must be fixed at the design stage,
and that the optimal B depends on unknown population
characteristics. Prior knowledge of the population, notably
about its residual variance structure, can guide the choice of
B.

Our tentative recommendations based on this paper are:
If prior information suggests a squared residual pattern
conforming to σ² x_k^p with p < 2, then use Pomix sampling
with B = 0.3 f_R. On the other hand, if in reality the
unknown p is such that p > 2, then, although the best choice
in this case might be B = 0 (Poisson πps), little harm would
probably be done to use B = 0.3 f_R, because the variance
viewed as a function of B is likely to increase at a gentle
rate. Therefore, B = 0.3 f_R seems a reasonable all-purpose
suggestion. These recommendations are tentative; the
question merits a further study that lies beyond the scope of
this paper.

ACKNOWLEDGEMENTS

The authors gratefully acknowledge the work of a
referee, whose suggestions led to valuable improvements in
the original manuscript.
APPENDIX
SCATTER PLOT OF X AND Y

[Figure 3. Scatter plot of x (yearly wages, in 1,000 FIM) against y (employment) for the portion U_R of the Monte Carlo
population of 1,000 Finnish enterprises.]
REFERENCES
ATMER, J., THULIN, G., and BACKLUND, S. (1975). Coordination of samples with the JALES technique. Statistisk
Tidskrift, 13, 443-450.

BREWER, K.R.W., EARLY, L.J., and JOYCE, S.F. (1972). Selecting several samples from a single population. Australian
Journal of Statistics, 14, 231-239.

FAN, C.T., MULLER, M.E., and REZUCHA, I. (1962). Development of sampling plans by using sequential (item by item)
techniques and digital computers. Journal of the American Statistical Association, 57, 387-402.

OHLSSON, E. (1995). Coordination of samples using permanent random numbers. In Business Survey Methods (Eds. B.G. Cox,
D.A. Binder, B.N. Chinnappa, A. Christianson, M.J. Colledge, and P.S. Kott). New York: Wiley, 153-169.

SARNDAL, C.-E., SWENSSON, B., and WRETMAN, J. (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.

SUNTER, A.B. (1977). Response burden, sample rotation, and classification renewal in economic surveys. International
Statistical Review, 45, 209-222.
Survey Methodology, June 1999
Vol. 25, No. 1, pp. 13-29
Statistics Canada
Toward Variances for X-11 Seasonal Adjustments
WILLIAM R. BELL and MATTHEW KRAMER
William R. Bell and Matthew Kramer, Statistical Research Division, Room 3000-4, U.S. Bureau of the Census, Washington, D.C. 20233-9100, U.S.A.
ABSTRACT
We develop an approach to estimating variances for X-11 seasonal adjustments that recognizes the effects of sampling error
and errors from forecast extension. In our approach, seasonal adjustment error in the central values of a sufficiently long
series results only from the effect of the X-11 filtering on the sampling errors. Towards either end of the series, we also
recognize the contribution to seasonal adjustment error from forecast and backcast errors. We extend the approach to
produce variances of errors in X-11 trend estimates, and to recognize error in estimation of regression coefficients used to
model, e.g., calendar effects. In empirical results, the contribution of sampling error often dominated the seasonal
adjustment variances. Trend estimate variances, however, showed large increases at the ends of series due to the effects
of fore/backcast error. Nonstationarities in the sampling errors produced striking patterns in the seasonal adjustment and
trend estimate variances.
KEY WORDS: Sampling error; Forecast error; Trading-day; ARIMA model.
1. INTRODUCTION
The problem of how to obtain variances for seasonally
adjusted data is long-standing (President's Committee to
Appraise Employment and Unemployment Statistics 1962).
Model-based methods of seasonal adjustment (see Bell and
Hillmer 1984, for a discussion) use results from signal
extraction theory to produce estimates and associated error
variances of the seasonal and nonseasonal components.
Most official seasonal adjustments, however, are made
using empirical methods, most notably X-11 (Shishkin,
Young and Musgrave 1967) or X-11-ARIMA (Dagum
1975). These methods are based on fixed filters, not
models, and so it is not obvious how to calculate variances
of the seasonal adjustment errors. Various approaches for
obtaining variances for X-11 seasonal adjustments have
been proposed, as summarized below.
Wolter and Monsour (1981) suggested two approaches.
They recognized that many time series that are seasonally
adjusted are estimates from repeated sample surveys, and
thus are subject to sampling error. Their first approach
accounts only for the effect of sampling error on the
variance associated with seasonal adjustments. Their
second approach tries to also reflect uncertainty due to
stochastic time series variation in the seasonal adjustment
variances. However, this second approach assumes that,
apart from regression terms, the time series is stationary.
This type of model is now seldom used for seasonal time
series. Also, their second approach contains a conceptual
error: it produces the variance of the seasonally adjusted
estimate, instead of the desired variance of the ertor in the
seasonally adjusted estimate.
Burridge and Wallis (1985) investigated use of the
steady-state Kalman filter for calculation of model-based
seasonal adjustment variances, and applied this approach to
a model they obtained previously (Burridge and Wallis
1984) for approximating the X-11 filters. They suggested
that this approach could be used to, "provide measures of
the variability of the X-11 method when it is applied to data
for which it is optimal," (p. 551), but cautioned against
doing this when the X-11 filter would be suboptimal (i.e.,
very different from the optimal model-based filter).
Hausman and Watson (1985) suggested an approach to
estimating the mean squared error for X-11 when it is used
in suboptimal situations. Bell and Hillmer (1984, section
4.3.4) pointed out a problem with the use of model-based
approximations to X-11 for calculating seasonal adjustment
variances. The problem is that X-11 filters (or any seasonal
adjustment filter, for that matter) are not sufficient to
uniquely determine models for the observed series and its
components.
Pfeffermann (1994) developed an approach that
recognizes the contributions of sampling error and irregular
variation (time series variation in the irregular component)
to X-11 seasonal adjustment variances. The properties of
the combined error (sampling error plus irregular) are
estimated using the X-11 estimated irregular. These
properties are then used to estimate two types of seasonal
adjustment variances. A drawback to this approach is that
it relies on an assumption that the X-11 adjustment filter
annihilates the seasonal component and reproduces the
trend component. (Note Pfeffermann (1994, p. 90),
discussion surrounding equation (2.7).) Violations of this
assumption in practice compromise the approach to an
extent which appears difficult to assess. Thus, this
assumption seems to us highly questionable and also, in any
particular case, uncheckable. A second drawback is that
one of the variance types proposed by Pfeffermann assumes
that the X-11 seasonally adjusted series, rather than the
trend estimate, is taken as an estimate of the trend. Breidt
(1992) and Pfeffermann, Morry and Wong (1993) further
develop Pfeffermann's general approach.
The goal of this paper is the development and application
of an approach to obtaining variances for X-11 seasonal
adjustments accounting for two sources of error. The first
error source is sampling error. The second is error that
arises from the need to extend the time series with forecasts
and backcasts before applying the symmetric X-11 filters.
These latter errors lead to seasonal adjustment revisions
(Pierce 1980). Note that revisions eventually vanish as
sufficient data beyond the time point being adjusted become
available. Also note that a seasonally adjusted series will
not contain sampling error if the corresponding unadjusted
series does not. This is the case for certain economic time
series, e.g., export and import statistics for most countries.
Our approach assumes that the X-11 seasonal adjustment
target (what we assume application of X-11 is intended to
estimate) is what would result from application of the
symmetric linear X-11 filter (with no forecast and backcast
extension required) if the series contained no sampling
error. While this definition of target might be criticized for
ignoring time series variation in the underlying seasonal
and nonseasonal components, we think this may be
appropriate for typical users of X-11 seasonally adjusted
data. Such users are most likely to be concerned about
uncertainty reflected in differences between initial adjustments and final adjustments, i.e., in revisions. Some of
these users will also be aware that the unadjusted series
consists of sample-based estimates of the true underlying
population quantities, and will realize that the effects of
sampling error on adjustments should also be reflected in
seasonal adjustment variances.
Our development is based on use of the symmetric linear
X-11 filters. We assume that the symmetric filters are
applied to the series extended with minimum mean squared
error forecasts and backcasts. In practice, the forecasts and
backcasts are obtained from afittedtimeseries model. This
is in the spirit of the X-11-ARIMA method of Dagum
(1975), but with full forecast and backcast extension, as
recommended by Geweke (1978), Pierce (1980), and
Bobbitt and Otto (1990). Our results apply directly to the
use of additive or log-additive X-11 (with forecast and
backcast extension), and the log-additive results are
assumed to apply approximately (Young 1968) to
multiplicative X-11.
Section 2 of this paper develops our approach, which builds on the first approach of Wolter and Monsour (1981). The differences between the two approaches are discussed in section 2.4. Section 3 then discusses three extensions to the results of section 2. The first is to note that our approach works equally well with seasonal, trend, or irregular estimates, and that more generality is easily accommodated by allowing different filter choices for different months. The second extension produces variances of estimates of month-to-month or year-to-year change. Finally, when seasonal adjustment involves estimation of regression effects (e.g., for trading-day or holiday variation), the results are extended to allow for additional variance due to error in estimating the regression parameters.
Section 4 then presents several examples illustrating the basic approach and the extensions given in section 3. One thing evident from the examples is that for time series with sampling error, our seasonal adjustment variances will often be dominated by the contribution of the sampling error. In the center of the series, our results effectively reduce to the first approach results of Wolter and Monsour. Our results do differ from those of Wolter and Monsour near the end of the series. This is important since the most recent seasonally adjusted values receive the most scrutiny. Also, the contribution of forecast and backcast error to trend estimate variances can be very large at the ends of a time series. Other results of particular interest are the effects of certain nonstationarities in the sampling errors. The examples of section 4 show that nonstationarities such as sampling error variances that change over time, or periodic independent redrawings of the sample, can yield striking changes in the pattern of the variances of seasonally adjusted data or trend estimates over time.
Section 5 provides concluding remarks.
2. METHODOLOGY
Define the observed unadjusted time series as y_t for t = 1, ..., n. Time series that are seasonally adjusted are often estimates obtained from repeated (monthly or quarterly) sample surveys, and thus can be viewed as composed of a true underlying time series Y_t and a series of sampling errors e_t assumed uncorrelated with Y_t. (See Bell and Hillmer 1990.) In vector notation, y_o = Y_o + e_o, where the subscript o indicates that the time span of these vectors is the set of observed time points 1, ..., n. In certain cases y_t may arise from repeated censuses (as is typically the case for national export and import statistics, for example), in which case there is no sampling error, i.e., e_o = 0.

The development that follows assumes that both Y_t and e_t follow known time series models. The model for Y_t will generally involve differencing, as in ARIMA (autoregressive-integrated-moving average) and ARIMA component (structural) models. The model for Y_t may be extended to include regression terms. (This will be considered in section 3.3.) The series e_t is assumed to not require differencing, but it may nonetheless exhibit certain nonstationarities, such as variances that change over time. Any such nonstationarities are assumed to be accounted for in the model for e_t. In practice, the models will be developed from observed data, as is discussed by, e.g., Bell and Hillmer (1990, submitted), Binder and Dick (1989, 1990), and Tiller (1992).
In applying a symmetric X-11 filter of length 2m + 1 for seasonal adjustment with full forecast and backcast extension, the vector y_o needs to be augmented by m backcasts and m forecasts. The vector holding the m values of y_t prior to the observed data, and the corresponding m × 1 vectors for Y_t and e_t, are denoted y_b, Y_b, and e_b. The analogous vectors of the m future values of y_t, Y_t, and e_t are denoted y_f, Y_f, and e_f. Thus,

\[
\begin{pmatrix} y_b \\ y_o \\ y_f \end{pmatrix}
= \begin{pmatrix} Y_b \\ Y_o \\ Y_f \end{pmatrix}
+ \begin{pmatrix} e_b \\ e_o \\ e_f \end{pmatrix}. \tag{2.1}
\]

The full vectors in (2.1), hereafter denoted as y, Y, and e, have length n + 2m.
The backcasts and forecasts used to augment y_o are assumed to be minimum mean squared error (MMSE) linear predictions of y_b and y_f (using y_o) obtained from the known time series model. (In practice, the model will be fitted to the data y_o.) Under normality, the backcasts and forecasts are E(y_b | y_o) and E(y_f | y_o). The vector of observed data augmented with the backcasts and forecasts is denoted ŷ = (ŷ_b', y_o', ŷ_f')', where ŷ_b = E(y_b | y_o) and ŷ_f = E(y_f | y_o). To simplify notation, from now on we will take expressions such as (y_b, y_o, y_f) to mean the column vector (y_b', y_o', y_f')'.
Let the linear symmetric X-11 seasonal adjustment filter be written ω(B) = Σ_{j=-m}^{m} ω_j B^j, where B is the backshift operator and the ω_j are the filter weights (ω_j = ω_{-j}). Calculation of the ω_j is discussed by Young (1968) and Wallis (1982). Results of Bell and Monsell (1992) were used here. Application of ω(B) to the forecast and backcast extended series can be written as Ωŷ, where Ω is a matrix of dimension n × (n + 2m). Each row of Ω contains the filter weights (ω_{-m}, ..., ω_0, ..., ω_m), preceded and followed by the appropriate number of zeroes such that the center weight of the X-11 filter (ω_0) multiplies the observation being adjusted. Thus, in the first row of Ω there are no preceding zeroes and n - 1 trailing zeroes, in the second row there is one preceding zero and n - 2 trailing zeroes, etc. For the default X-11 filter, m = 84. Choice of alternative seasonal or trend moving averages in X-11 changes the value of m from a low of 70 to a high of 149.
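The construction of Ω is easy to express in code. The following sketch is ours (not from the paper), and the three-term weights in the usage line are toy values for illustration, not actual X-11 weights:

    import numpy as np

    def filter_matrix(omega, n):
        # omega holds (omega_{-m}, ..., omega_0, ..., omega_m), length 2m + 1.
        # Row t of the n x (n + 2m) result places the weights so that the
        # center weight omega_0 multiplies observation t of the extended series.
        m = (len(omega) - 1) // 2
        Omega = np.zeros((n, n + 2 * m))
        for t in range(n):
            Omega[t, t:t + 2 * m + 1] = omega
        return Omega

    # Toy usage: a symmetric 3-term filter (m = 1) applied to n = 6 points.
    Omega = filter_matrix(np.array([0.25, 0.5, 0.25]), n=6)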
The question arises as to what Ωŷ is estimating. As noted in the introduction, we define the "target" of the seasonal adjustment as the adjusted series that would result if there were no sampling error and there were sufficient data before and after all time points of interest for the symmetric filter to be applied. The target is thus ω(B)Y_t, or in vector notation ΩY, and the seasonal adjustment error vector is v = Ω(Y - ŷ). We are interested in the variance-covariance matrix var(v) = Ω var(Y - ŷ) Ω'. This can be easily computed once var(Y - ŷ) is obtained. From here through section 2.3 we discuss the calculation of var(Y - ŷ).
We start by writing Y - ŷ = (y - e) - ŷ = (b, 0, f) - e, where b = y_b - ŷ_b is the m × 1 vector of backcast errors, and f = y_f - ŷ_f is the m × 1 vector of forecast errors. Given the models for Y_t and e_t, we calculate var(Y - ŷ) by separately computing var(e), var(b, 0, f), and cov[(b, 0, f), e], as discussed in sections 2.1 to 2.3. Then var(Y - ŷ) easily follows as var(b, 0, f) + var(e) - cov[(b, 0, f), e] - cov[(b, 0, f), e]'. Thus,

\[
\mathrm{var}(v) = \Omega \{ \mathrm{var}(b, 0, f) + \mathrm{var}(e) - \mathrm{cov}[(b, 0, f), e] - \mathrm{cov}[(b, 0, f), e]' \} \Omega'.
\]
Example - U.S. 5+ Unit Housing Starts. As the computations for each piece of var(Y - ŷ) are explained, we illustrate the results graphically for an example series: housing starts in the U.S. for buildings of five or more units from January 1975 through November 1988 (167 observations). The original series, seasonally adjusted series, and estimated trend are shown in Figure 1. In practice, seasonal adjustment at the Census Bureau of this series uses a multiplicative decomposition with a 3 × 9 seasonal moving average and a 13-term Henderson trend filter. The following model for this series was developed in Bell and Hillmer (submitted):

\[
y_t = Y_t + e_t,
\]
\[
(1 - B)(1 - B^{12}) Y_t = (1 - 0.67B + 0.36B^2)(1 - 0.8753B^{12}) a_t, \quad \sigma_a^2 = 0.0191, \tag{2.2}
\]
\[
e_t = (1 - 0.11B - 0.10B^2) b_t, \quad \sigma_b^2 = 0.00714.
\]
Figure 1. U.S. Housing Starts with Five or More Units. The top panel gives the original series from January 1975 through November 1988. The strong seasonality of the series is apparent from the yearly dips that typically occur during the winter months. The bottom panel gives the X-11 seasonally adjusted series (solid line) and trend estimate (dotted line) for the same period. The seasonal adjustment is multiplicative, using a 3 × 9 seasonal moving average and 13-term Henderson trend filter.
Here, y_t denotes the logarithm of the original time series (e^{y_t}), so that (2.2) implies a multiplicative decomposition for the original series (e^{y_t} = e^{Y_t} e^{e_t}).
2.1 Computation of Var(e)
If e_t follows a stationary ARMA model, then var(e) can be computed from standard results, e.g., McLeod (1975, 1977), Wilson (1979). If var(e_t) changes over time, we write e_t = h_t ẽ_t, where h_t^2 = var(e_t), and ẽ_t has variance one and the same autocorrelation function as e_t. (See Bell and Hillmer submitted.) Then, writing e = Hẽ, where H = diag(h_{1-m}, ..., h_{n+m}), we have var(e) = H var(ẽ) H'. Var(ẽ) is the autocorrelation matrix of e, and it can be computed as just noted using the model for ẽ_t.
If the sample is independently redrawn at certain times, then var(e) will be block diagonal, with blocks corresponding to the time points when each distinct sample is in effect. Each diagonal block of var(e) can be computed as just discussed. These two types of nonstationarities in e, variance changing over time and "covariance breaks" due to independent redrawings of the sample, are those that arise in the examples of section 4.
Example - U.S. 5+ Unit Housing Starts (continued). Autocovariances for the MA(2) model for e_t given in (2.2) are easily computed. The resulting var(e) is a band matrix, with var(e_t) = 0.007298 on the diagonal, cov(e_t, e_{t-1}) = -0.000707 on the first sub- and super-diagonals, and cov(e_t, e_{t-2}) = -0.000714 on the second sub- and super-diagonals. The rest of var(e) is zero. Following pre- and post-multiplication by the seasonal adjustment filter matrices Ω and Ω', the contribution of the sampling error to the variance of the seasonally adjusted series is constant for each observation (Figure 2). This occurs because the result of a time invariant linear filter applied to a stationary series (ω(B)e_t) is a stationary series, which has a constant variance.
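These entries can be reproduced from the standard MA(2) autocovariance formulas; the following minimal sketch (ours, not code from the paper) uses the parameter values from (2.2):

    import numpy as np

    theta1, theta2, sig2b = 0.11, 0.10, 0.00714   # e_t = (1 - 0.11B - 0.10B^2) b_t
    gamma0 = sig2b * (1 + theta1**2 + theta2**2)  # var(e_t)          =  0.007298
    gamma1 = sig2b * (-theta1 + theta1 * theta2)  # cov(e_t, e_{t-1}) = -0.000707
    gamma2 = sig2b * (-theta2)                    # cov(e_t, e_{t-2}) = -0.000714

    # Assemble the banded var(e); N is a toy length standing in for n + 2m.
    N = 20
    var_e = (gamma0 * np.eye(N)
             + gamma1 * (np.eye(N, k=1) + np.eye(N, k=-1))
             + gamma2 * (np.eye(N, k=2) + np.eye(N, k=-2)))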
2.2 Computation of Var (b, 0, f)
The central n rows and n columns of var(b, 0, f) are all zeroes. We require computation of var(b), var(f), and cov(b, f) for the corner blocks of var(b, 0, f). Although computation of variances of forecast (or backcast) errors for given models is standard in time series analysis, it is complicated here by the component representation of y_t as Y_t + e_t, and by differencing in the model for Y_t. Although computations for such models are often handled by the Kalman filter (Bell and Hillmer submitted; Binder and Dick 1989, 1990; Tiller 1992), this is inconvenient here since we require covariances of all distinct pairs of random variables from among the m forecast and m backcast errors. We instead use a direct matrix approach due to Bell and Hillmer (1988).
Assume that the differencing operator required to render Y_t stationary is δ(B), which is of degree d. Since e_t is assumed not to require differencing, δ(B) is also the differencing operator required by y_t. Define δ(B)y_t = w_t; thus w_t = δ(B)Y_t + δ(B)e_t. We introduce the matrix Δ, corresponding to δ(B), defined such that Δy = w is the vector of differenced y. The vector w = (w_b, w_o, w_f), which is of length n + 2m - d, is partitioned so that w_b and w_f are m × 1 vectors, and w_o is the (n - d) × 1 vector of differenced observed data. Thus, Δ has dimensions (n + 2m - d) × (n + 2m). Note that, because d observations are lost in differencing, w_b and w_o start d time points later than y_b and y_o, respectively. That is, y_b and y_o start at time points 1 - m and 1, but w_b and w_o start at time points 1 - m + d and d + 1.
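A small sketch (ours) of the construction of Δ, using the differencing operator δ(B) = (1 - B)(1 - B^{12}) from this example, so d = 13; the length N stands in for n + 2m:

    import numpy as np

    def differencing_matrix(delta, N):
        # delta = (delta_0, delta_1, ..., delta_d) are the coefficients of
        # delta(B); each row computes w_t = sum_j delta_j * y_{t-j}, so the
        # result has dimension (N - d) x N.
        d = len(delta) - 1
        D = np.zeros((N - d, N))
        rev = np.asarray(delta)[::-1]   # highest-lag coefficient comes first
        for i in range(N - d):
            D[i, i:i + d + 1] = rev
        return D

    # delta(B) = (1 - B)(1 - B^12) = 1 - B - B^12 + B^13
    delta = np.zeros(14)
    delta[0], delta[1], delta[12], delta[13] = 1.0, -1.0, -1.0, 1.0
    Delta = differencing_matrix(delta, N=60)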
Figure 2. U.S. 5+ Units Housing Starts: Variance Decomposition after X-11 Seasonal Adjustment. The top panel gives the variance of the seasonally adjusted series as the total of the three components. The second panel gives the contribution of sampling error (e), which is the largest component and constant across the series. The third panel gives the contribution of back/forecast error (b, 0, f), which is zero in the middle of the series, where no back/forecasts are needed, but increases towards either end of the series as more back/forecasts are used. The bottom panel is the sum of the two covariance terms (cov[e, (b, 0, f)] + cov[(b, 0, f), e]), which tend to offset the contribution from back/forecast error.
Define u = (u_{1-m+d}, ..., u_{n+m})' = ΔY. The time series u_t is stationary. Since w = u + Δe, with u and e uncorrelated with each other, var(w) = var(u) + Δ var(e) Δ'. We partition var(w) as

\[
\mathrm{var}(w) = \begin{pmatrix}
\Sigma_{11} & \Sigma_{12} & \Sigma_{13} \\
\Sigma_{21} & \Sigma_{22} & \Sigma_{23} \\
\Sigma_{31} & \Sigma_{32} & \Sigma_{33}
\end{pmatrix},
\]

where Σ_{11} is var(w_b), Σ_{12} is cov(w_b, w_o), etc.
Since y, when differenced to w using δ(B), has lost d data values, y cannot be obtained from w without also knowing a sequence of d "starting values". Consider obtaining y_f from w_f and the starting values y_* = (y_{n+1-d}, ..., y_n)'. Theorem 1 in Bell (1984a) can be used to show that

\[
y_f = A y_* + C w_f \tag{2.3}
\]

for matrices A and C determined by δ(B). The rows of the m × m matrix C consist of the coefficients of ξ(B) = 1 + ξ_1 B + ξ_2 B^2 + ... = δ(B)^{-1} in the form

\[
C = \begin{pmatrix}
1 & 0 & \cdots & 0 & 0 \\
\xi_1 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\xi_{m-2} & \xi_{m-3} & \cdots & 1 & 0 \\
\xi_{m-1} & \xi_{m-2} & \cdots & \xi_1 & 1
\end{pmatrix}.
\]
A is an m × d matrix which accounts for the effect of the d starting values in y_* on y_f. The exact form of A is given in Bell (1984a) and, since it will exactly cancel in our application, it will not be given here. In (2.3) y_* is known since it is part of y_o, the observed data. Thus, from (2.3) the MMSE forecast of y_f is ŷ_f = A y_* + C ŵ_f, where ŵ_f is the MMSE forecast of w_f. Therefore, f = y_f - ŷ_f = (A y_* + C w_f) - (A y_* + C ŵ_f) = C(w_f - ŵ_f), and var(f) = C var(w_f - ŵ_f) C'. Under Assumption A of Bell (1984a), which leads to the standard results for forecasting nonstationary series (as in, e.g., Box and Jenkins 1976, Chapter 5), ŵ_f = Σ_{32} Σ_{22}^{-1} w_o. Note that this uses only the differenced data w_o in forecasting y_f. Then, from standard results on linear prediction, var(w_f - ŵ_f) = Σ_{33} - Σ_{32} Σ_{22}^{-1} Σ_{32}'. Thus, var(f) = C(Σ_{33} - Σ_{32} Σ_{22}^{-1} Σ_{32}')C'.
To obtain var(b) and cov(b, f) we note that results obtained by Bell (1984a, p. 651) imply similar calculations hold for the backcast errors b. In fact, it can be shown that b = (-1)^r C'(w_b - ŵ_b), where ŵ_b is the MMSE backcast of w_b, and r is the number of times (1 - B) appears in the polynomial δ(B). (The appearance of C' in this expression instead of C stems from the indexing of w_b and ŵ_b forward through time although the backcasting process proceeds backwards through time.) Thus, var(b) = C' var(w_b - ŵ_b) C = C'(Σ_{11} - Σ_{12} Σ_{22}^{-1} Σ_{12}')C. Similarly, cov(f, b) = (-1)^r C(Σ_{31} - Σ_{32} Σ_{22}^{-1} Σ_{12}')C. In practice, to avoid inverting Σ_{22}, var(f), var(b), and cov(f, b) can be computed using the Cholesky decomposition of Σ_{22}. (See Appendix A.)
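These block formulas translate directly into code. The sketch below is ours, using a Cholesky solve in place of the explicit inverse of Σ_{22}, as the text suggests:

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def fore_backcast_error_covs(S11, S12, S22, S31, S32, S33, C, r):
        # Blocks S_ij partition var(w) as in the text; C is the m x m
        # lower-triangular matrix of xi-weights; r counts the (1 - B)
        # factors in delta(B).
        cf = cho_factor(S22)
        var_f = C @ (S33 - S32 @ cho_solve(cf, S32.T)) @ C.T
        var_b = C.T @ (S11 - S12 @ cho_solve(cf, S12.T)) @ C
        cov_fb = (-1) ** r * C @ (S31 - S32 @ cho_solve(cf, S12.T)) @ C
        return var_f, var_b, cov_fb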
Example - U.S. 5+ Unit Housing Starts (continued). The contribution to seasonal adjustment variance from var(b, 0, f) is shown in Figure 2. This is zero or essentially zero for observations in the middle of the series, where no or few fore/backcasts need be made to apply the symmetric adjustment filter. Towards the ends of the series, the contribution of fore/backcast error becomes more substantial since an increasing number of observations need to be fore/backcast to apply the filter. The jumps in the graph occur when an additional fore/backcasted observation is multiplied by a weight in the adjustment filter at a lag that is a multiple of the seasonal period, since these weights have the greatest magnitude (Bell and Monsell 1992). Note that the contributions from var(b, 0, f) at the very ends of the series are smaller than the contributions from var(e), but are not negligible.
2.3 Computation of Cov[(b, 0, f), e]
To compute cov(f, e), we first note from results of the preceding section that f = y_f - ŷ_f = C(w_f - ŵ_f) = C(w_f - Σ_{32} Σ_{22}^{-1} w_o) = C[0 | -Σ_{32} Σ_{22}^{-1} | I]w = C[0 | -Σ_{32} Σ_{22}^{-1} | I]Δy. Since cov(y, e) = cov(Y + e, e) = 0 + var(e), we see that cov(f, e) = C[0 | -Σ_{32} Σ_{22}^{-1} | I]Δ var(e). Cov(b, e) is computed in an analogous fashion by noting that b = (-1)^r C'(w_b - ŵ_b) = (-1)^r C'(w_b - Σ_{12} Σ_{22}^{-1} w_o) = (-1)^r C'[I | -Σ_{12} Σ_{22}^{-1} | 0]Δy, so that cov(b, e) = (-1)^r C'[I | -Σ_{12} Σ_{22}^{-1} | 0]Δ var(e).
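Continuing the same sketch (ours), the covariances with the sampling errors use the selection-style matrices [0 | -Σ_{32} Σ_{22}^{-1} | I] and [I | -Σ_{12} Σ_{22}^{-1} | 0], with Delta and var_e as constructed earlier:

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def covs_with_e(S12, S22, S32, C, r, Delta, var_e, m):
        cf = cho_factor(S22)
        S22inv = cho_solve(cf, np.eye(S22.shape[0]))
        # [0 | -S32 S22^{-1} | I] and [I | -S12 S22^{-1} | 0] act on w = Delta y.
        Pf = np.hstack([np.zeros((m, m)), -S32 @ S22inv, np.eye(m)])
        Pb = np.hstack([np.eye(m), -S12 @ S22inv, np.zeros((m, m))])
        cov_fe = C @ Pf @ Delta @ var_e
        cov_be = (-1) ** r * C.T @ Pb @ Delta @ var_e
        return cov_fe, cov_be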
Example - U.S. 5+ Unit Housing Starts (continued). Figure 2 shows that the contribution of cov[(b, 0, f), e] is zero or near zero in the middle of the series, but it becomes increasingly negative towards the ends of the series, in a pattern similar, though opposite in sign and of smaller magnitude, to that of var(b, 0, f). At the very ends of the series, however, the pattern reverses and the covariance increases. The elements of cov[(b, 0, f), e] are mainly positive, so its contribution to the seasonal adjustment variance is negative because cov[(b, 0, f), e] and its transpose are subtracted from var(e) + var(b, 0, f). The net effect is that subtracting Ω{cov[(b, 0, f), e] + cov[(b, 0, f), e]'}Ω' tends to offset the effect of adding Ω var(b, 0, f) Ω', except near the very ends of the series. Thus, the graph of the variances of the seasonally adjusted series in Figure 2 is very similar to the graph of the contribution of var(e), except near the very ends of the series. We observed this type of "cancellation effect" in several other examples, including those of section 4.
2.4 Comparison with the First Approach of Wolter
and Monsour
The first approach of Wolter and Monsour (1981) proposed use of Ω_{WM} var(e_o) Ω_{WM}' as the variance-covariance matrix of the X-11 seasonal adjustment errors, where Ω_{WM} is an n × n matrix whose rows contain the X-11 linear filter weights, both symmetric and asymmetric. That is, the middle rows (rows t such that m < t < n - m + 1, assuming n > 2m) of Ω_{WM} contain the X-11 symmetric filter weights, but the first and last m rows of Ω_{WM} contain X-11's asymmetric filter weights. The middle rows of Ω_{WM} and Ω thus contain the same filter weights, but the first and last m rows do not. This means that our approach will give the same results as that of Wolter and Monsour for m < t < n - m + 1, that is, for time points at which the symmetric filter is being used. The results of the two approaches will differ for the first and last m time points. Since the most recent seasonally adjusted data receive the most attention, this difference is potentially important.
Wolter and Monsour also considered use of a matrix Ω* instead of Ω_{WM}, where Ω* is (n + 12) × (n + 12) to include 12 additional rows of weights corresponding to year-ahead seasonal adjustment filters. Though year-ahead adjustment was the common practice through the early 1980s, it has now mostly been replaced in the United States by concurrent adjustment (McKenzie 1984).
The differences between our approach and that of Wolter and Monsour can be viewed in two ways. One view is that since Wolter and Monsour did not consider forecast and backcast extension, their approach ignores the contribution of forecast and backcast errors to seasonal adjustment error. This contribution affects results for the first and last m time points, although the examples of section 4 show that this contribution is often small. However, in some cases it is not small, including those time series not subject to sampling error. For such series Wolter and Monsour's approach would assign zero variance to the adjustments, even though initial adjustments would be revised as new data became available.

The other way to view the differences between the approaches centers on the difference in "targets". The seasonal adjustment error under Wolter and Monsour's approach can be thought of as Ω_{WM}(Y_o - y_o) = -Ω_{WM} e_o. Since this results in zero error for series with no sampling error (Y_o = y_o), Wolter and Monsour implicitly define the seasonal adjustment target to be Ω_{WM} Y_o. This definition of target has the undesirable property that the target value for a given time point changes as additional data are acquired, since the rows of Ω_{WM} contain different filter weights. Our target value for any given time point t is always ω(B)Y_t.
Example - U.S. 5+ Unit Housing Starts (continued). We compared results using our methodology with those of Wolter and Monsour's, using the default X-11 seasonal adjustment filter although, as noted earlier, this example series is adjusted using the optional 3 × 9 seasonal moving average filter. This comparison used the default filter for convenience: asymmetric X-11 filter weights are needed to obtain results for the Wolter-Monsour approach and we were given a computer program by Nash Monsour that produced them only for the default filter. Figure 3 gives the results for both approaches. The non-constant variances over time from the Wolter-Monsour approach result from applying different filters at different time points. An interesting consequence of this is that, despite the stationarity of the sampling error, the Wolter-Monsour seasonal adjustment variance is noticeably higher in the middle of the series than for many time points toward (but not close to) either end of the series. This carries the implausible implication that use of less data produces estimators with lower variance. Similar behavior can be observed in several examples presented by Pfeffermann (1994).
Figure 3. U.S. 5+ Units Housing Starts: Comparison with Approach of Wolter and Monsour (1981). The panel descriptions are as for Figure 2. The Wolter and Monsour approach (dotted lines) uses the asymmetric X-11 filters for the ends of the series and accounts only for sampling error. The approaches agree in the middle of the series, where there is no contribution from back/forecast error. The Wolter and Monsour variances inappropriately decrease near the ends of the series, suggesting that use of less data produces estimates with lower variances. The results here, in contrast to Figures 1, 2, 4, 5, and 6, use default X-11 filters. (See text.)
The results from using the default X-11 seasonal adjustment filter with our approach are also useful for comparison with the 3 × 9 seasonal moving average filter, for which results are given in Figure 2. Differences between results from using the two filters are not great. The contribution of the sampling error is somewhat lower and that of the fore/backcast error somewhat higher when using the default seasonal adjustment filter.
3. EXTENSIONS TO THE METHODOLOGY
This section discusses three extensions to the general
methodology of section 2. The first two extensions are
straightforward, the third more involved.
3.1 Variances for Seasonal, Trend, and Irregular Estimates; Variances with Time-Varying Filters
The only way the nonseasonal (seasonally adjusted) component is distinguished in the derivation of section 2 is through the filter weights placed in the matrix Ω. Therefore, corresponding variances for X-11 estimates of the seasonal, trend, and irregular components follow from the same expressions simply by changing the matrix Ω to contain the desired filter weights. This also changes the dimension of Ω, since the length of the seasonal adjustment, trend, and irregular filters (for given options) differs, and the filter length determines the size of Ω.

A similar extension handles the case of different seasonal moving averages (MAs) selected for different months (or quarters), an option allowed by X-11. This changes the seasonal adjustment (and seasonal, trend, and irregular) filters applied in the different months. The results of section 2 also accommodate this extension through a simple modification of Ω. Since the rows of Ω correspond to the time points being adjusted, we simply define row t of Ω to contain the weights (along with sufficient zeroes) from whatever filter is being applied in month t. Some care must be taken to dimension Ω appropriately if the longest selected MA is not used in the first and last months of the series.
Example - U.S. 5+ Unit Housing Starts (continued). Figure 4 shows the variance of the X-11 trend estimate, using the 3 × 9 seasonal MA and 13-term Henderson filter. The most obvious difference from the seasonal adjustment results is the substantial effect of fore/backcast error at the very ends of the series. This occurs because the largest weights of the trend filter (ω^{(T)}(B)) are the center weight (ω_0^{(T)}) and the adjacent weights (ω_1^{(T)}, ω_2^{(T)}, ω_3^{(T)}) that are applied to data immediately before and after the observation being adjusted (Bell and Monsell 1992). At the very ends of the series, the weights (ω_1^{(T)}, ω_2^{(T)}, ω_3^{(T)}) apply to fore/backcasted observations, which results in large increases in the contribution of fore/backcast error there. The result is that uncertainty about the trend increases sharply at the ends of the series. In the center of the series, however, the trend variances of Figure 4 are substantially lower than the seasonal adjustment variances of Figure 3, due to the smoothing of the sampling error by the trend filter.
Figure 4. U.S. 5+ Units Housing Starts: Variance Decomposition of the Trend Estimate. The panel descriptions are as for Figure 2. Note the large jump in trend estimate variances at the ends of the series due to the contribution of back/forecast error (third panel).
3.2 Variances for Seasonally Adjusted
Month-to-Month and Year-to-Year Changes
The variances of the errors of the seasonally adjusted estimates of month-to-month change are the quantities var(v_t - v_{t-1}), t = 2, ..., n. Given var(v), the complete error covariance matrix for the seasonally adjusted month-to-month changes can be calculated as A_1 var(v) A_1', where
\[
A_1 = \begin{pmatrix}
-1 & 1 & 0 & \cdots & 0 & 0 \\
0 & -1 & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & \ddots & & \vdots \\
0 & 0 & 0 & \cdots & -1 & 1
\end{pmatrix}
\]

is of dimension (n - 1) × n. The error covariance matrix for the seasonally adjusted year-to-year changes in a quarterly series is calculated similarly as A_4 var(v) A_4', where

\[
A_4 = \begin{pmatrix}
-1 & 0 & 0 & 0 & 1 & 0 & \cdots & 0 & 0 \\
0 & -1 & 0 & 0 & 0 & 1 & \cdots & 0 & 0 \\
\vdots & & & & & & \ddots & & \vdots \\
0 & 0 & 0 & 0 & \cdots & -1 & 0 & \cdots & 1
\end{pmatrix}
\]

is of dimension (n - 4) × n. The corresponding (n - 12) × n matrix A_{12} for monthly series follows a similar pattern with additional zeroes.
Variances of month-to-month or year-to-year changes in
the trend are also easily obtained, as can be seen from this
discussion and that of section 3.1.
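For illustration (a sketch of ours, not code from the paper), the change matrices are immediate to build, and the change covariance matrices follow as matrix sandwiches:

    import numpy as np

    def change_matrix(n, lag):
        # (n - lag) x n matrix mapping v to the changes v_{t+lag} - v_t.
        return np.eye(n - lag, n, k=lag) - np.eye(n - lag, n)

    A1 = change_matrix(167, 1)    # month-to-month changes
    A12 = change_matrix(167, 12)  # year-to-year changes for a monthly series
    # Given var_v = var(v) from section 2:
    #   var of month-to-month changes = A1 @ var_v @ A1.T
    #   var of year-to-year changes   = A12 @ var_v @ A12.T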
Example - U.S. 5+ Unit Housing Starts (continued). We produced the standard errors for seasonally adjusted month-to-month and year-to-year changes for this series (Figure 5). Since this time series has been log transformed, standard errors can be approximately interpreted as percentages on the original (unlogged) scale. Compared to the standard errors for the seasonally adjusted series, there are slight increases in the standard errors of the month-to-month changes near the ends of the series, but the standard errors of the year-to-year changes show almost no such increase. Thus, for this series and filter, the uncertainty about month-to-month and year-to-year percent change in the seasonally adjusted data is almost constant across the series. The standard errors of the month-to-month and year-to-year changes are both about 50 percent higher than those for the seasonally adjusted series.
3.3 Variances of X-11 Seasonal Adjustments with
Estimated Regression Effects
Seasonal adjustment often involves the estimation of certain regression effects to account for such things as calendar variation, known interventions, and outliers (Young 1965; Cleveland and Devlin 1982; Hillmer, Bell, and Tiao 1983; Findley, Monsell, Bell, Otto, and Chen submitted). (Outlier effects are often estimated in the same way as known interventions even though inference about outliers should ideally take account of the fact that the series was searched for the most "significant" outliers.) This section shows how the results already obtained can be extended to include the contribution to seasonal adjustment error of error in estimating regression parameters. We still assume the other model parameters, which determine the covariance structures of Y and e, are known. In practice these other model parameters will also be estimated, but accounting for error in estimating them is much more difficult. A Bayesian approach for doing so in the context of model-based seasonal adjustment is investigated by Bell and Otto (submitted).
Figure 5. U.S. 5+ Units Housing Starts: Standard Errors. These panels contrast the standard errors (not variances, as in previous figures) of the seasonally adjusted data (top panel) with the larger standard errors of seasonally adjusted month-to-month (middle panel) and year-to-year (bottom panel) change estimates.
We extend the model for Y_t to include regression terms by writing Y_t = x_t'β + Z_t, where x_t is the vector of regression variables at time t, β is the vector of regression parameters, and Z_t is the series of true population quantities with regression effects removed. Extending our matrix-vector notation, we write Y = Xβ + Z, Y_o = X_o β + Z_o, etc. The regression matrix X can be partitioned by its rows corresponding to the backcast, observation, and forecast periods: X = (X_b' | X_o' | X_f')'. We assume e_t has mean zero, so its model does not involve any regression effects. We then have y = Y + e = (Xβ + Z) + e, with the usual partitioning applying. Letting z_t denote the series y_t with the regression effects removed, we have z = y - Xβ = Z + e.

An additional partition is needed of the matrix X and vector β. This is because some of the regression effects in x_t'β may be assigned to the nonseasonal component while
others, such as trading-day or holiday effects, may be removed as part of the seasonal adjustment. See Bell (1984b) for a discussion. Partition x_t as (x_{St}' | x_{Nt}')', where x_{Nt} represents the regression variables assigned to the nonseasonal component and x_{St} the variables whose effects are to be removed in the seasonal adjustment. Correspondingly partition β so that x_t'β = x_{St}'β_S + x_{Nt}'β_N and Xβ = X_S β_S + X_N β_N = (X_S | X_N)(β_S' | β_N')'. (x_{St}'β_S is assigned to the "combined" seasonal component.) The matrix X can thus be partitioned two ways: by seasonal versus nonseasonal regression effects, and by the backcast, observation, and forecast periods. Thus we write

\[
X = \begin{pmatrix} X_{Sb} & X_{Nb} \\ X_{So} & X_{No} \\ X_{Sf} & X_{Nf} \end{pmatrix}.
\]
If β were known we could compute z_o = y_o - X_o β = Z_o + e_o, forecast and backcast extend this series (call the extended series z̄), adjust z̄ by X-11 (Ω z̄), and add back the required regression effects X_{No} β_{No}. The target of the seasonal adjustment would be X_{No} β_{No} + ΩZ = X_{No} β_{No} + Ω(Y - Xβ), and the seasonal adjustment error would then be (X_{No} β_{No} + ΩZ) - (X_{No} β_{No} + Ω z̄) = Ω(Z - z̄). Thus, if the regression parameters were known they would not contribute to the seasonal adjustment error, and the results already given could be used to compute var(Ω(Z - z̄)).

In practice, β will be estimated as part of the model fitting, say by maximum likelihood assuming normality. Given the estimates of the other model parameters, and taking these parameters as if they were known, the maximum likelihood estimate of β and its variance are given by

\[
\hat\beta = [X_o' \Delta_o' \Sigma_{22}^{-1} \Delta_o X_o]^{-1} X_o' \Delta_o' \Sigma_{22}^{-1} \Delta_o y_o, \tag{3.1}
\]
\[
\mathrm{var}(\hat\beta) = [X_o' \Delta_o' \Sigma_{22}^{-1} \Delta_o X_o]^{-1}, \tag{3.2}
\]

where Δ_o is of dimension (n - d) × n, containing that part of the larger matrix Δ which differences the observed series y_o. The expressions (3.1) and (3.2) are generalized least squares results using the regression equation for the differenced data, w_o = Δ_o y_o = (Δ_o X_o)β + (u_o + Δ_o e_o), where the error term, u_o + Δ_o e_o, has covariance matrix var(w_o) = Σ_{22}, which is determined by the other model parameters.

Given the estimated regression parameters β̂, the seasonally adjusted series would be obtained by subtracting the estimated regression effects from the data (call the resulting series ẑ_o = y_o - X_o β̂), extending this series with forecasts and backcasts using the model (denote this extended series ẑ = [ẑ_b, ẑ_o, ẑ_f]), applying X-11 to the extended series (Ω ẑ), and adding back the estimated regression effects assigned to the nonseasonal component (Ω ẑ + X_{No} β̂_{No}). The target of the seasonal adjustment is still X_{No} β_{No} + ΩZ, as discussed above. The seasonal adjustment error is then v = (X_{No} β_{No} + ΩZ) - (Ω ẑ + X_{No} β̂_{No}) = X_{No}(β_{No} - β̂_{No}) + Ω(Z - ẑ).

The expression for v can be simplified by rewriting ẑ. First, let G = [B' | I | F']', where F is the matrix that produces the forecasts ŷ_f from y_o and B is the corresponding matrix that produces the backcasts ŷ_b from y_o. We will not need explicit expressions for F or B. G applied to z_o produces z̄, while G applied to ẑ_o produces ẑ. Therefore, ẑ = z̄ - [G(y_o - X_o β) - G(y_o - X_o β̂)] = z̄ + G X_o(β - β̂). Note that G X_o is obtained by applying the procedure for forecast and backcast extension (from the model for z_t) to each column of X_o. The approach we used to do this is described in Appendix B. Continuing, we have

\[
v = X_{No}(\beta_{No} - \hat\beta_{No}) + \Omega[(Z - \bar z) - G X_o(\beta - \hat\beta)]
  = \Omega(Z - \bar z) + \{[0 \mid X_{No}] - \Omega G X_o\}(\beta - \hat\beta).
\]

Now, Z - z̄ = (z - e) - z̄ = [b | 0 | f] - e. Note that [b | 0 | f], the error vector from projecting z on z_o or y_o, is orthogonal to (uncorrelated with) β̂ - β, since β̂ is a linear function of the data y_o. Therefore, letting K = [0 | X_{No}] - Ω G X_o, we have the variance-covariance matrix of the seasonal adjustment error allowing for error in estimating β:

\[
\mathrm{var}(v) = \Omega\, \mathrm{var}(Z - \bar z)\, \Omega' + K\, \mathrm{var}(\hat\beta)\, K'
+ \Omega\, \mathrm{cov}(e, \hat\beta)\, K' + K\, \mathrm{cov}(\hat\beta, e)\, \Omega', \tag{3.3}
\]

where var(β̂) is given by (3.2). In (3.3), Ω var(Z - z̄) Ω' is computed by the results of section 2, and computation of K var(β̂) K' is straightforward once G X_o has been computed. Computing the other two terms requires

\[
\mathrm{cov}(\hat\beta, e)
= \mathrm{cov}([X_o' \Delta_o' \Sigma_{22}^{-1} \Delta_o X_o]^{-1} X_o' \Delta_o' \Sigma_{22}^{-1} \Delta_o y_o,\; e)
\]
\[
= \mathrm{cov}([X_o' \Delta_o' \Sigma_{22}^{-1} \Delta_o X_o]^{-1} X_o' \Delta_o' \Sigma_{22}^{-1} [u_o + \Delta_o e_o],\; e)
\]
\[
= [X_o' \Delta_o' \Sigma_{22}^{-1} \Delta_o X_o]^{-1} X_o' \Delta_o' \Sigma_{22}^{-1} \Delta_o\, [0_{n \times m} \mid I_{n \times n} \mid 0_{n \times m}]\, \mathrm{var}(e). \tag{3.4}
\]

Note that [0_{n×m} | I_{n×n} | 0_{n×m}] var(e) = [cov(e_o, e_b) | var(e_o) | cov(e_o, e_f)] is the middle n rows of var(e). Using (3.4) and the aforementioned results, (3.3) can be computed. We can compare the resulting diagonal elements of var(v) with those of the sum of the last three terms in (3.3), to see if allowing for the error due to estimating the regression parameters is important.
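Given the pieces above, assembling (3.3) is mechanical; a sketch of ours, where cov_beta_e is the matrix from (3.4):

    import numpy as np

    def var_v_with_regression(Omega, var_Zbar_diff, K, var_beta, cov_beta_e):
        # Equation (3.3): Omega var(Z - z_bar) Omega' + K var(beta_hat) K'
        #                 + Omega cov(e, beta_hat) K' + K cov(beta_hat, e) Omega'
        term1 = Omega @ var_Zbar_diff @ Omega.T
        term2 = K @ var_beta @ K.T
        term3 = Omega @ cov_beta_e.T @ K.T   # cov(e, beta_hat) = cov(beta_hat, e)'
        return term1 + term2 + term3 + term3.T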
There is an important qualification to make about the results of this section. Since the first term on the right hand side of (3.3), Ω var(Z - z̄) Ω', is the seasonal adjustment variance we would get by ignoring error in estimating the regression parameters, it is tempting to interpret the sum of
the last three terms in (3.3) as the contribution to seasonal adjustment variance of error due to estimating regression parameters. Unfortunately, this sum is not itself a variance (it can in fact be written as var(Kβ̂ + Ωe) - var(Ωe)), and so it can actually be negative. When this happens the seasonal adjustment variances that allow for error due to estimating regression parameters are actually lower than those that ignore this error. We were in fact able to achieve such a result by artificially modifying model parameters in the following example with trading-day variables (though, as in the results shown, the effects were quite small). This situation contrasts with comparable results for model-based approaches, which express the seasonal adjustment error as the sum of two orthogonal terms: the error when all parameters are known, plus the contribution to error from estimating regression parameters. The seasonal adjustment variance in this case is thus the sum of the variances of these two terms, and so the "regression contribution" is always nonnegative. This result is analogous to Ω var(Z - z̄) Ω' + K var(β̂) K' in (3.3). The problem in (3.3) is that the X-11 estimate Ω ẑ is not an optimal (MMSE) estimator of the target ΩZ, hence the error Ω(Z - ẑ) is correlated with β̂ through the sampling error e, leading to the two covariance terms in (3.3). This situation results partly from our choice of target (X_{No} β_{No} + ΩZ) and partly from the fact that X-11 cannot be assumed to produce an optimal estimator of anything (note comments related to this in the Introduction).
Example - U.S. 5+ Unit Housing Starts (continued). We use the same example to illustrate the contribution to seasonal adjustment error of adding trading-day variables (Bell and Hillmer 1983), although the corresponding regression coefficients were not statistically significant when estimated with this series. Figure 6a shows the results. In this illustration, the lowest line is the "contribution" to the seasonal adjustment variance from estimating the trading-day effects (but see remarks above). When added to the original estimate of variance (dotted line), we obtain the variance of the seasonally and trading-day adjusted series, allowing for error in estimating the trading-day coefficients (top solid line). We see that, for this example, the increase in variance due to including estimated trading-day effects in the model is slight. Figure 6b gives results for the trend filter. Here the contribution to trend uncertainty due to estimating the trading-day coefficients is certainly negligible.
The contribution to seasonal adjustment variance of adding three additive outlier variables and one level shift variable is illustrated in Figure 6c. These regression variables were identified as potential outlier effects using the Regarima program (produced by the Time Series Staff at the U.S. Census Bureau) with a critical t-statistic of 2.5. Regarima uses an outlier detection methodology similar to those discussed in Bell (1983) and Chang, Tiao, and Chen (1988). The contributions of the additive outliers appear as three spikes, while that of the level shift is a single smaller hump in the middle of the series. In comparison to the trading-day regression variables, the effect of these outlier variables is mainly local but much stronger. In particular, there is additional uncertainty about seasonal adjustments for observations considered additive outliers.
Figure 6. U.S. 5+ Units Housing Starts: Including the "Contribution" from Regression Effects in the Variance Estimates. The top panel shows both the original variances from the first panel of Figure 2 (dotted curve) and the variances allowing for additional uncertainty due to estimating trading-day regression effects (top solid curve). The regression contribution is also shown (bottom solid curve). The second panel shows the corresponding results for the variances of the trend estimates. Note that the regression contribution to the seasonal adjustment variances is small, and to the trend estimate variances it is essentially zero. The third and fourth panels show analogous results when the trading-day regression effects are replaced by three additive outliers and a level shift. Notice that these have important local effects on the seasonal adjustment and trend estimate variances.
Results for the trend filter (Figure 6d) differ in that uncertainty is much greater around the observation where a level shift was detected, approaching the level of uncertainty at the ends of the series. A level shift is considered part of the trend, so an estimated level shift effect would first be subtracted from the series (in Xβ̂), and then added back following application of the X-11 trend filter. (This is analogous to the treatment of regression effects assigned to the nonseasonal or seasonal components in seasonal adjustment as discussed above.) In contrast, since both additive outliers and level shifts are considered part of the nonseasonal component, all four effects were added back as part of the seasonal adjustment when producing results for Figures 6a and 6b.

Actually, these sorts of results for outliers should only be regarded as crude approximations, since they treat the times of occurrence and types of outliers as known, leaving only the magnitudes of the effects to be estimated. Ideally, one would like to recognize that the series was searched for significant outliers, but this is much more difficult.
4. EXAMPLES
We illustrate our approach using several additional
economic time series whose sampling errors follow different models. The models used for these example series are
taken from previous work as noted.
4.1 Retail Sales of Department Stores
Department store sales are estimated in the Census Bureau's monthly retail trade survey. Essentially all sales come from department store chains, all of which are included in the survey; hence, there is virtually no sampling error in the estimates. Thus, the variance of the X-11 seasonal adjustment comes only from fore/backcast error and from error in estimating regression effects. (Note that the Wolter-Monsour seasonal adjustment variance would be zero for this series.) The model used for this series (Bell and Wilcox 1993), for the period August 1972 through March 1989 for the logs of the observations, is

\[
(1 - B)(1 - B^{12})[y_t - x_t'\beta] = (1 - 0.53B)(1 - 0.52B^{12}) a_t
\]

with σ_a² = 4.32 × 10^{-4}, where x_t includes variables to account for trading-day and Easter holiday effects, and Y_t = y_t is the log of the original series divided by length-of-month factors. In adjusting the series at the Census Bureau, the default X-11 adjustment filter and 13-term Henderson trend filter are used.
Figure 7a shows the standard errors for the seasonally adjusted data over time, with and without the contribution of regression effects. Unlike the 5+ units housing starts series, there are marked increases in the standard errors of seasonally adjusted data at the ends of the series, due entirely to fore/backcast error. The contribution to the standard error due to estimating regression effects is also more pronounced for this series. An interesting feature in Figure 7 is the sets of small downward projecting spikes that occur one year apart in triplets. These occur at non-leap-year Februaries, for which there is no trading-day effect (the trading-day regression variables are all zero). There is still a small regression contribution to seasonal adjustment error at these time points since the adjustment averages in these contributions from adjacent time points. (Dips at non-leap-year Februaries are also visible on close inspection of Figure 6a.) In addition, for some years, the error in estimating the Easter effect produces a noticeable upward projecting spike involving the two months March and April.
Figure 7. U.S. Department Stores, with Trading-Day and Easter Effects. This series has no sampling error. The four panels give standard errors with and without the contribution from estimating regression effects. For the seasonally adjusted data and corresponding month-to-month and year-to-year changes (first three panels), the "contribution" from estimating regression effects is substantial and erratic in the middle of the series (where it is the sole contributor) but, at either end, diminishes for reasons explained in the text. The regression contribution to the trend estimate standard errors is small.
The relative contribution of the regression effects to the seasonal adjustment standard errors diminishes towards the ends of the series. This results from two factors: (1) the magnitude of the regression contribution to var(v_t) decreases somewhat towards the ends of the series, and, more importantly, (2) var(Z_t - z̄_t) increases dramatically towards the ends of the series, diminishing the relative contribution to var(v_t) due to regression (and this is further accentuated when square roots are taken).
The pattern of the standard errors of seasonally adjusted month-to-month changes (Figure 7b) is similar to that for the standard errors of the seasonally adjusted data (Figure 7a). The regression contribution is slightly larger than it is for the seasonally adjusted data. Standard errors of year-to-year changes (Figure 7c) follow similar patterns, but the regression contribution is considerably larger than it is for the month-to-month changes, and it remains important at the ends of the series.

A similar set of calculations was performed using the default X-11 trend filter, and results for the standard errors of the trend estimates, with and without the regression contribution, are depicted in Figure 7d. The patterns over time of these standard errors are similar to the corresponding figures for the 5+ units housing starts series, but the standard errors are much smaller due to the absence of sampling error. The regression contribution is small.

The standard errors for all plots in Figure 7 are small; none exceed 0.8 percent. For this series, the regression contribution is small and probably ignorable near the very ends of the series, for all but the year-to-year changes. However, in the middle of the series, the sole contributor to the standard errors is that due to the regression effects.
4.2 Teenage Unemployment
The Bureau of Labor Statistics (BLS) publishes the monthly time series of the number of U.S. unemployed teenagers estimated from the Current Population Survey (CPS). Data from January 1972 to December 1983 (n = 144) were used by Bell and Hillmer (submitted) to develop a model for this series. The sampling error variance h_t² changes over time, so e_t is nonstationary. The sampling error model they developed is

\[
e_t = h_t \tilde e_t, \quad \text{where } (1 - 0.6B)\tilde e_t = (1 - 0.3B)\epsilon_t, \tag{4.1}
\]

with σ_ε² = 0.87671 so that var(ẽ_t) = 1. CPS sampling error variances can be approximated by generalized variance functions (Wolter 1985, Chapter 5; Hanson 1968). The generalized variance function Bell and Hillmer used for the teenage unemployment series is

\[
h_t^2 = 1.971\, y_t - (1.53 \times 10^{-5})\, y_t^2, \tag{4.2}
\]

where y_t is the estimate of the number in thousands of unemployed teenagers at time t. The estimated model for the signal component Y_t is

\[
(1 - B)(1 - B^{12}) Y_t = (1 - 0.27B)(1 - 0.68B^{12}) a_t, \tag{4.3}
\]

with σ_a² = 4294. There are no regression effects in the model and the series is not transformed. BLS uses the default X-11 seasonal adjustment filter (so m = 84).

In applying the methods of this paper to this example, problems arise from the fact that the (estimated) sampling error variance h_t² depends on the estimate y_t through the generalized variance function (4.2). In the backcast and forecast periods y_t is unknown. To obtain h_t in these periods we forecast and backcast y_t using a simple ARIMA(0 1 1)(0 1 1)_{12} model for y_t (not for Y_t, as in (4.3)). The resulting 84 forecasts and backcasts were then used in (4.2) to produce h_t in the forecast and backcast periods. More refined treatments are possible, such as using the component model given by (4.1) and (4.3) to forecast y_t.
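The generalized variance function step is a one-liner; in the sketch below (ours), the level values are hypothetical stand-ins for the fore/backcasts of y_t:

    import numpy as np

    def gvf_h(y):
        # h_t from (4.2): h_t^2 = 1.971 y_t - (1.53e-5) y_t^2, y_t in thousands.
        return np.sqrt(1.971 * y - 1.53e-5 * y ** 2)

    y_extended = np.array([1450.0, 1500.0, 1525.0])  # hypothetical fore/backcasts
    h = gvf_h(y_extended)                            # h_t over the extended span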
Figure 8. Teenage Unemployment, with Default X-11 Options. The panel descriptions are as for Figure 2. The seasonal pattern of the sampling error variance contribution (second panel) results from its dependence on the level of the series through a generalized variance function (see text).
The seasonal adjustment variance for this series (Figure 8a) is dominated at most times t by the sampling error contribution (Figure 8b). This is because, while the contribution of var(b, 0, f) is substantial for this series (Figure 8c), it tends to be offset by the contribution of cov[(b, 0, f), e] + cov[(b, 0, f), e]' (Figure 8d), except at the first and last
few time points. The patterns of variances of seasonally adjusted month-to-month changes and year-to-year changes (not shown) are similar to that of Figure 8a. The variances of the month-to-month changes are slightly larger than those of the adjusted series; those of the year-to-year changes are larger still.
4.3 Retail Sales of Drinking Places
Retail sales of drinking places are estimated in the Census Bureau's monthly retail trade survey. In this survey, (noncertainty) sample cases are independently redrawn approximately every 5 years, so the covariance matrix of the sampling errors is block diagonal. Bell and Hillmer (1990) developed the following model for the sampling error of the logged series within a given sample:

\[
(1 - 0.15B - 0.66B^2 + 0.50B^3)(1 - 0.71B^{12})\, e_t = (1 + 0.13B)\, b_t, \tag{4.4}
\]
with σ_b² = 9.301 × 10^{-4}. For time points t and j in different samples, cov(e_t, e_j) = 0. Bell and Hillmer developed a model for the signal component of the logged series using unbenchmarked estimates from September 1977 to December 1986. We shall instead use the following model fit by Bell and Wilcox (1993) using additional data through October 1989:

\[
(1 - B)(1 - B^{12})[Y_t - x_t'\beta] = (1 - 0.23B)(1 - 0.88B^{12})\, a_t,
\]

where x_t contains trading-day regression variables, and σ_a² = 4.16 × 10^{-4}.

In seasonally adjusting this series, the default X-11 filters are used. The contribution of error due to estimating regression parameters is small for this series, and so is not included in the results to follow. Since the contribution of sampling error overwhelms the contributions from fore/backcast error and cov[(b, 0, f), e] + cov[(b, 0, f), e]', we also do not illustrate these separate variance contributions. Figure 9a gives the standard error of the seasonally adjusted data (shown over 232 observations to better illustrate the pattern, with vertical lines indicating sample redraws) and Figure 9b the standard error of seasonally adjusted month-to-month changes.
Note the strong pattern in Figures 9a and 9b due to the redrawing of the sample every five years. In particular, this produces a large spike in the standard error of seasonally adjusted month-to-month changes (Figure 9b) when the sample is redrawn. Similar jumps in standard deviations of year-to-year changes occur for the first year of a new sample. We also found similar patterns for other series from the retail trade survey using models from Bell and Wilcox (1993).

The preceding discussion and results ignored certain aspects of how estimation for the retail trade survey is actually carried out. In fact, to avoid large increases in variances of change estimates around the sample redraw, such as those reflected in Figure 9b, simple modifications are made to estimates in a newly introduced sample to make their level consistent with that from the old sample. The simplest version of the modification is as follows. Let z_{(old)t} (with y_{(old)t} = log(z_{(old)t})) denote estimates from the old sample, and z_{(new)t} unmodified estimates from the new sample. Assume that the old sample provides estimates for t ≤ τ, and that the new sample is to provide estimates for t > τ. To provide overlap data for the modification, the new sample is begun one month early, so that both z_{(old)τ} and z_{(new)τ} are available. The modified new sample estimates are defined as z'_{(new)t} = z_{(new)t}(z_{(old)τ}/z_{(new)τ}) for t > τ. This modification is carried out each time a new sample is introduced. In terms of the corresponding logged estimates y_t, the modification is y'_{(new)t} = y_{(new)t} + (y_{(old)τ} - y_{(new)τ}).
Since the modification to y_t is linear, it is easy to account for its effects on the seasonal adjustment variance calculations here. The month-to-month change at time τ + 1 before the modification (and without seasonal adjustment) is y_{(new)τ+1} - y_{(old)τ}. Note that this change has a large variance since y_{(new)τ+1} and y_{(old)τ} come from different, independent samples. After modification, this change is y'_{(new)τ+1} - y_{(old)τ} = y_{(new)τ+1} - y_{(new)τ}, which has a much lower variance due to the strong positive correlation between y_{(new)τ+1} and y_{(new)τ} (arising from the sampling error model (4.4)). Unadjusted month-to-month change estimates for time points other than τ + 1 are unaltered by the modification.
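The effect of the modification on the sampling error contribution to the variance of the change at τ + 1 can be written down directly; a sketch of ours, with gamma0 and gamma1 the lag-0 and lag-1 autocovariances of e_t within one sample under (4.4):

    def change_var_sampling(gamma0, gamma1, modified):
        # Sampling error contribution to the variance of the month-to-month
        # change at time tau + 1. Unmodified: the two estimates come from
        # independent samples, so the covariance term vanishes. Modified:
        # both come from the new sample, and the (positive) lag-1
        # autocovariance is subtracted, shrinking the variance.
        if modified:
            return 2.0 * gamma0 - 2.0 * gamma1
        return 2.0 * gamma0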
Figure 9d shows that modifying new sample estimates eliminates the large increases in the standard deviation of seasonally adjusted month-to-month changes at the transitions to new samples. Similar effects were seen for year-to-year changes over a one year period. The price paid for this improvement is a steadily increasing error in the level estimates (Figure 9c) following the introduction of new samples. This occurs because the modification introduces a transient error into the level estimates that persists throughout the new sample. Thus, the modification trades off worse accuracy of level estimates for improvements in change estimates. (Figure 9c shows no increase for the first five years because we assume the estimates there are not modified to agree with those from a previous sample.) Moreover, the strong patterns in Figure 9a occur because the sampling errors from unmodified estimates in adjacent samples are uncorrelated. On the other hand, sampling errors in the modified estimates are fairly strongly correlated between adjacent samples. The effect of this, after applying the seasonal adjustment filter, is a much different pattern (almost no pattern) in the first five years of Figure 9c, and slight oscillations around the linear increase thereafter.
The standard errors for the X-11 trend estimates and changes (not shown) look like smoothed versions of those shown in Figure 9.
Figure 9. Retail Sales of Drinking Places: Samples Redrawn Every Five Years. The top panel shows the standard error of the seasonally adjusted data and the second panel the standard error of the corresponding month-to-month changes. The strong pattern results from independently drawing a new sample every five years (at the dotted vertical lines). For month-to-month changes, this produces large increases in standard errors at the time of the sample redraw. To eliminate this problem, a new sample is drawn to overlap with the previous sample for one or more months and the new sample's estimates are modified using data from the overlap to make them consistent in level with estimates from the previous sample (see text). This eliminates the increases in standard errors of change estimates when the sample is redrawn (fourth panel), but introduces a transient error into the modified level estimates, whose effects accumulate over time (third panel).
In practice, final estimates from the retail trade survey are even more complicated than what was just described and illustrated. First, more than one month of overlapping data are collected and may be used to modify level estimates when a new sample is introduced. More importantly, monthly estimates are benchmarked to agree with annual totals obtained from the more accurate annual retail trade survey or five-year economic census. Benchmarking should thus alleviate the problem of level variances increasing over time seen in Figure 9c. However, since benchmarking imposes linear sum constraints on the original (unlogged) estimates, its effects on seasonal adjustment variances are difficult to investigate under the approach developed here, and we have not done so. (We have used a model for unbenchmarked data to avoid this problem.) Durbin and Quenneville (1995) develop a model-based approach to benchmarking that accounts for the nonlinearities that such benchmark constraints impose on logged data.
5. CONCLUSIONS
This paper presented an approach to the long-standing problem of obtaining variances for X-11 seasonal adjustments. Our goal was the development and application of an approach to obtain variances accounting for two sources of error. The first error source is sampling error ($e_t$), which arises because we do not observe the true series, $Y_t$, but instead observe estimates $y_t = Y_t + e_t$ from a repeated survey. The second error source results from the need to extend the observed series with forecasts and backcasts to apply the symmetric X-11 filters. This second error source leads to seasonal adjustment revisions. To account for these two sources of error, we defined the seasonal adjustment variance as the variance of the error in using the X-11 adjustment to estimate a specific target. This target, $\omega(B)Y_t$, is what would result from applying the symmetric, linear X-11 filter, $\omega(B)$, to the true series if its values were available far enough into the future and past for the symmetric filter to be used. (The application to additive X-11 with fore/backcast extension is immediate, and log-additive X-11 is taken as an approximation to multiplicative X-11.)
Our approach was also applied to produce variances of X-11 trend estimates, and to produce variances of month-to-month and year-to-year changes in both the seasonally adjusted data and trend estimates. A further extension was made to allow for error in estimating regression parameters (e.g., to model calendar effects), though this was more involved and had some limitations.
The variances we obtain ignore uncertainty due to time series variation in the seasonal and nonseasonal components. We argued in section 2 that this may be appropriate for typical users of X-11 seasonally adjusted data. If one desires to account for this time series variation, however, we suggest that consideration be given to model-based approaches to seasonal adjustment, since time series models provide a means to explicitly account for variation in all the components. Alternatively, Pfeffermann (1994) developed an approach to X-11 seasonal adjustments that attempts to account for irregular variation and sampling error.
Our approach builds on the first approach suggested by Wolter and Monsour (1981), by accounting for the contribution of forecast and backcast error that was ignored by them. An alternative view of the difference between our approach and theirs is that we define a consistent seasonal adjustment target, whereas, in using X-11's asymmetric filters, Wolter and Monsour implicitly used targets that change over time. Because of this, our approach avoids the unrealistic feature of seasonal adjustment variances that decrease towards the ends of the series, which can be seen in results of Wolter and Monsour, and also of Pfeffermann.
In the empirical results presented, the contribution of sampling error often dominated the seasonal adjustment variances. This is partly because sampling error was often large relative to fore/backcast error, and partly because the contribution of fore/backcast error tended to be offset by the contribution of the covariance of fore/backcast error with the sampling error. On the other hand, empirical results for trend estimate variances showed large increases at the ends of series due to the effects of fore/backcast error. Since the largest contribution of fore/backcast error occurs at the ends of the series, and variances for the most recent seasonal adjustments and trend estimates are of the most interest, one should not ignore the contribution of fore/backcast error.
The relative contribution to our variances of error in estimating trading-day or holiday regression coefficients tended to be small, unless the series had no sampling error. Error due to estimating additive outlier and level shift effects was substantial around the time point of the outlier. The effects of AOs were large on seasonal adjustment variances; the effects of LSs were large on trend estimate variances.
Nonstationarities in the sampling errors produced interesting patterns in the seasonal adjustment and trend estimate variances. Two types of sampling error nonstationarities were examined. Seasonal patterns in sampling error variances produced corresponding seasonal patterns in seasonal adjustment variances. Independent redrawings of the sample, which yield sampling errors correlated within but not across samples, produced erratic patterns in seasonal adjustment and trend estimate variances over time within a sample. These patterns approximately repeat across different samples if the samples remain in force for approximately equal time spans.
Computations for the examples shown (given the fitted models, which were obtained from the references cited) were done by programming the expressions of Sections 2 and 3 in the S+ statistical programming language. The resulting computer code is available on request.
APPENDIX A

Several expressions to be calculated in this paper are of the general form

$$A\Sigma^{-1}B \qquad (A.1)$$

where $\Sigma$ is a positive definite matrix, and $A$ and $B$ are conformable to $\Sigma$. Let $\Sigma = LL'$ be the Cholesky decomposition of $\Sigma$. Then $A\Sigma^{-1}B = A(L^{-1})'L^{-1}B$ and (A.1) can be computed as follows:

(1) Solve $LQ_1 = B$ for $Q_1$.
(2) Solve $LQ_2 = A'$ for $Q_2$.
(3) Compute $A\Sigma^{-1}B = Q_2'Q_1$.

Steps (1) and (2) can be solved efficiently since $L$ is lower triangular.
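As an illustration, the three steps above map directly onto standard linear algebra routines. The following is a minimal Python/NumPy sketch (the paper's own computations were done in S+; the function name below is ours):

```python
# Minimal sketch of Appendix A: computing A * Sigma^{-1} * B through the
# Cholesky factorization Sigma = L L', without forming Sigma^{-1}.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def a_siginv_b(A, Sigma, B):
    L = cholesky(Sigma, lower=True)             # Sigma = L L'
    Q1 = solve_triangular(L, B, lower=True)     # step (1): L Q1 = B
    Q2 = solve_triangular(L, A.T, lower=True)   # step (2): L Q2 = A'
    return Q2.T @ Q1                            # step (3): A Sigma^{-1} B = Q2' Q1
```

Both triangular solves cost far less than forming a general inverse, which is what makes the Cholesky route attractive for the repeated evaluations of (A.1).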
APPENDIX B
Two steps are required to obtain the extended regression matrix used in section 3.3. The first step produces "forecast" and "backcast" extension of the differenced regression variables. The second step uses these results and the difference equation to produce forecast and backcast extension of the original (undifferenced) regression variables.

Let $R_0 = \Delta X$, where $\Delta$ is that part of the matrix $A$ which differences the observed series $y_t$. Analogous to the computation of the forecast and backcast extensions $w_f$ and $w_b$ in section 2.2, forecast extensions of the differenced regression variables are calculated as $R_f = \Sigma_{32}\Sigma_{22}^{-1}R_0$ and backcast extensions as $R_b = \Sigma_{12}\Sigma_{22}^{-1}R_0$. $R_f$ and $R_b$ are of the form (A.1) and can be computed by the technique given above.
For the second step, let $x_t$ denote any one of the regression variables in $X$. Let the required forecast extensions be denoted $\hat x_{n+l}$, for $l = 1, 2, \ldots, m$. Let the differencing operator in the model be $\delta(B) = 1 - \delta_1 B - \cdots - \delta_d B^d$, and let $\hat r_{n+l}$ be the forecast extension of $\delta(B)x_t = r_t$ at time $n + l$ ($\hat r_{n+l}$ is an element of $R_f$). The $\hat x_{n+l}$ are calculated iteratively as

$$\hat x_{n+l} = \delta_1 \hat x_{n+l-1} + \cdots + \delta_d \hat x_{n+l-d} + \hat r_{n+l}, \qquad l = 1, \ldots, m,$$

where $\hat x_{n+j} = x_{n+j}$ if $j \le 0$.
The required backcast extensions of $x_t$ are denoted $\hat x_{1-l}$ for $l = 1, \ldots, m$. These are also obtained recursively from the difference equation $\delta(B)x_t = r_t$, by solving for $\hat x_{1-l}$ in the expression

$$\delta(B)\hat x_{d+1-l} = \hat r_{d+1-l}$$

and substituting previously computed backcasts as needed. Thus,

$$\hat x_{1-l} = \delta_d^{-1}\left(\hat x_{d+1-l} - \delta_1 \hat x_{d-l} - \cdots - \delta_{d-1}\hat x_{2-l} - \hat r_{d+1-l}\right), \qquad l = 1, \ldots, m,$$

where $\hat x_{1-j} = x_{1-j}$ for $j \le 0$.
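A small sketch of this second step may help; the code below (Python, our own naming, not the authors' program) iterates the forecast recursion forward and the backcast recursion backward from the observed values:

```python
# Sketch of Appendix B, step 2: extending one regression variable x by
# forecasts and backcasts via the difference equation delta(B) x_t = r_t.
import numpy as np

def extend_variable(x, delta, r_fore, r_back):
    """x: observed x_1..x_n; delta: coefficients [d1,...,dd] of
    delta(B) = 1 - d1*B - ... - dd*B^d; r_fore[l-1] = r-hat_{n+l};
    r_back[l-1] = r-hat_{d+1-l}."""
    d = len(delta)
    xf = list(x)
    for rl in r_fore:  # forecasts: x_{n+l} = d1 x_{n+l-1} + ... + dd x_{n+l-d} + r_{n+l}
        xf.append(sum(delta[i] * xf[-1 - i] for i in range(d)) + rl)
    xb = list(x)       # xb currently starts at x_1
    for rl in r_back:  # backcasts: solve delta(B) x_{d+1-l} = r_{d+1-l} for x_{1-l}
        val = xb[d - 1] - sum(delta[i] * xb[d - 2 - i] for i in range(d - 1)) - rl
        xb.insert(0, val / delta[d - 1])   # x_{1-l} = (...) / dd
    return np.array(xb[:len(r_back)]), np.array(xf[len(x):])
```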
ACKNOWLEDGEMENTS

We thank Nash Monsour for providing the FORTRAN program to calculate default X-11 asymmetric filter weights. In addition, he patiently explained how estimates from the monthly retail trade surveys are modified when new samples are drawn. This paper reports the general results of research undertaken by Census Bureau staff. The views expressed are attributed to the authors and do not necessarily reflect those of the Census Bureau.
REFERENCES

BELL, W.R. (1983). A computer program for detecting outliers in time series. Proceedings of the Business and Economic Statistics Section, American Statistical Association, 634-639.

BELL, W.R. (1984a). Signal extraction for nonstationary time series. Annals of Statistics, 12, 646-664.

BELL, W.R. (1984b). Seasonal Decomposition of Deterministic Effects. Research Report Number 84/01, Statistical Research Division, Bureau of the Census.

BELL, W.R., and HILLMER, S.C. (1983). Modeling time series with calendar variation. Journal of the American Statistical Association, 78, 526-534.

BELL, W.R., and HILLMER, S.C. (1984). Issues involved with the seasonal adjustment of economic time series (with discussion). Journal of Business and Economic Statistics, 2, 291-320.

BELL, W.R., and HILLMER, S.C. (1988). A Matrix Approach to Signal Extraction and Likelihood Evaluation for ARIMA Component Time Series Models. Research Report Number 88/22, Statistical Research Division, Bureau of the Census.

BELL, W.R., and HILLMER, S.C. (1990). The time series approach to estimation for repeated surveys. Survey Methodology, 16, 195-215.

BELL, W.R., and HILLMER, S.C. (submitted). Applying time series models in survey estimation.

BELL, W.R., and MONSELL, B.C. (1992). X-11 Symmetric Linear Filters and Their Transfer Functions. Research Report 92/15, Statistical Research Division, Bureau of the Census.

BELL, W.R., and OTTO, M.C. (submitted). Bayesian Assessment of Uncertainty in Seasonal Adjustment with Sampling Error Present.

BELL, W.R., and WILCOX, D.W. (1993). The effect of sampling error on the time series behavior of consumption data. Journal of Econometrics, 55, 235-265.

BINDER, D.A., and DICK, J.P. (1989). Modelling and estimation for repeated surveys. Survey Methodology, 15, 29-45.

BINDER, D.A., and DICK, J.P. (1990). A method for the analysis of seasonal ARIMA models. Survey Methodology, 16, 239-253.

BOBBITT, L., and OTTO, M.C. (1990). Effects of forecasts on the revisions of seasonally adjusted values using the X-11 seasonal adjustment procedure. Proceedings of the Business and Economic Statistics Section, American Statistical Association, 449-454.

BOX, G.E.P., and JENKINS, G.M. (1976). Time Series Analysis: Forecasting and Control. San Francisco: Holden Day.

BREIDT, F.J. (1992). Variance estimation in the frequency domain for seasonally adjusted time series. Proceedings of the Business and Economic Statistics Section, American Statistical Association, 337-342.

BURRIDGE, P., and WALLIS, K.F. (1984). Unobserved components models for seasonal adjustment filters. Journal of Business and Economic Statistics, 2, 350-359.

BURRIDGE, P., and WALLIS, K.F. (1985). Calculating the variance of seasonally adjusted series. Journal of the American Statistical Association, 80, 541-552.

CHANG, I., TIAO, G.C., and CHEN, C. (1988). Estimation of time series parameters in the presence of outliers. Technometrics, 30, 193-204.

CLEVELAND, W.S., and DEVLIN, S.J. (1982). Calendar effects in monthly time series: modeling and adjustment. Journal of the American Statistical Association, 77, 520-528.

DAGUM, E.B. (1975). Seasonal factor forecasts from ARIMA models. Proceedings of the 40th Session of the International Statistical Institute, Warsaw, Poland, 206-219.

DURBIN, J., and QUENNEVILLE, B. (1995). Benchmarking monthly time series with structural time series models. Proceedings of the Survey Methods Section, 23rd Annual Meeting of the Statistical Society of Canada, Montreal, Canada, 13-18.

FINDLEY, D.F., MONSELL, B.C., BELL, W.R., OTTO, M.C., and CHEN, B. (1998). New capabilities and methods of the X-12-ARIMA seasonal adjustment program. Journal of Business and Economic Statistics, 16, 127-152.

GEWEKE, J. (1978). Revision of Seasonally Adjusted Time Series. SSRI Report No. 7822, Department of Economics, University of Wisconsin.

HANSON, R.H. (1968). The Current Population Survey: Design and Methodology. Technical Paper 40, U.S. Census Bureau, Washington, D.C.: Government Printing Office.

HAUSMAN, J.A., and WATSON, M.W. (1985). Errors in variables and seasonal adjustment procedures. Journal of the American Statistical Association, 80, 531-540.

HILLMER, S.C., BELL, W.R., and TIAO, G.C. (1983). Modeling considerations in the seasonal adjustment of economic time series. Applied Time Series Analysis of Economic Data, (Ed. A. Zellner), U.S. Department of Commerce, Bureau of the Census, 74-100.

McKENZIE, S.K. (1984). Concurrent seasonal adjustment with Census X-11. Journal of Business and Economic Statistics, 2, 235-249.

McLEOD, I. (1975). Derivation of the theoretical autocovariance function of autoregressive-moving average time series. Applied Statistics, 24, 255-256.

McLEOD, I. (1977). Correction to derivation of the theoretical autocovariance function of autoregressive-moving average time series. Applied Statistics, 26, 194.

PFEFFERMANN, D. (1994). A general method for estimating the variances of X-11 seasonally adjusted estimators. Journal of Time Series Analysis, 15, 85-116.

PFEFFERMANN, D., MORRY, M., and WONG, P. (1993). Estimation of the Variances of X-11-ARIMA Seasonally Adjusted Estimators for a Multiplicative Decomposition and Heteroscedastic Variances. Working Paper METH-93-005, Time Series Research and Analysis Division, Statistics Canada, Ottawa, Canada.

PIERCE, D.A. (1980). Data revisions with moving average seasonal adjustment procedures. Journal of Econometrics, 14, 95-114.

PRESIDENT'S COMMITTEE TO APPRAISE EMPLOYMENT AND UNEMPLOYMENT STATISTICS (1962). Measuring Employment and Unemployment. Washington, D.C.: U.S. Government Printing Office.

SHISKIN, J., YOUNG, A.H., and MUSGRAVE, J.C. (1967). The X-11 Variant of the Census Method II Seasonal Adjustment Program. Technical Paper No. 15, U.S. Department of Commerce, Bureau of Economic Analysis.

TILLER, R.B. (1992). Time series modeling of sample survey data from the U.S. Current Population Survey. Journal of Official Statistics, 8, 149-166.

WALLIS, K.F. (1982). Seasonal adjustment and revision of current data: linear filters for the X-11 method. Journal of the Royal Statistical Society, Series A, 145, 74-85.

WILSON, G.T. (1979). Some efficient computational procedures for high order ARIMA models. Journal of Statistical Computation and Simulation, 8, 301-309.

WOLTER, K.M. (1985). Introduction to Variance Estimation. New York: Springer-Verlag.

WOLTER, K.M., and MONSOUR, N.J. (1981). On the problem of variance estimation for a deseasonalized series. In Current Topics in Survey Sampling, (Eds. D. Krewski, R. Platek and J.N.K. Rao). New York: Academic Press, 367-403.

YOUNG, A. (1965). Estimating Trading-day Variation in Monthly Economic Time Series. Technical Paper 12, Bureau of the Census.

YOUNG, A.H. (1968). Linear approximations to the Census and BLS seasonal adjustment methods. Journal of the American Statistical Association, 63, 445-471.
Survey Methodology, June 1999
Vol. 25, No. 1, pp. 31-41
Statistics Canada
Item Selection in the Consumer Price Index:
Cut-off Versus Probability Sampling
JAN DE HAAN, EDDY OPPERDOES and CECILE M. SCHUT¹
ABSTRACT
Most statistical offices select the sample of commodities of which prices are collected for their Consumer Price Indexes with non-probability techniques. In the Netherlands, and in many other countries as well, those judgemental sampling methods come close to some kind of cut-off selection, in which a large part of the population (usually the items with the lowest expenditures) is deliberately left unobserved. This method obviously yields biased price index numbers. The question arises whether probability sampling would lead to better results in terms of the mean square error. We have considered simple random sampling, stratified sampling and systematic sampling proportional to expenditure. Monte Carlo simulations using scanner data on coffee, baby's napkins and toilet paper were carried out to assess the performance of the four sampling designs. Surprisingly perhaps, cut-off selection is shown to be a successful strategy for item sampling in the consumer price index.
KEY WORDS: Laspeyres price index; Monte Carlo simulation; Sampling; Scanner data; Substitution bias.
1. INTRODUCTION
Outsiders may think that measuring inflation is an easy
job: just visit shops, collect a lot of prices and average
them. However, statisticians engaged in the compilation of
the Consumer Price Index (CPI), which is the most widely
used measure of inflation, face many theoretical and
practical problems. In most countries the CPI is essentially
a Laspeyres price index. This index weights the partial price
indexes of the various commodities by expenditure shares
that are fixed at base period levels. Sampling procedures are
needed to estimate the population value. Ideally, the mean
square error of the estimator would be minimized. Even
though the Laspeyres index formula is extremely simple,
the estimation procedures applied to the CPI make it a
rather complex statistic. Described in a stylized way, the
estimation involves three different kinds of samples. A
sample of households taking part in an expenditure survey
is used to estimate the commodity group expenditure
weights. From each commodity group a sample of
commodities (items for short) is selected. The prices of
these items are collected in a sample of outlets.
In this paper we focus on the sampling of items. Only a
few statistical agencies, e.g., the U.S. Bureau of Labor
Statistics, use probability sampling to select items to be
priced. Most others, for instance Statistics Netherlands, rely
on the judgements of experts working at the central office
for determining which items should represent the commodity group. In the past this method could be defended by referring to the lack of appropriate sampling frames. Due to the rapidly increasing automation of the retail industry, registers of consumer goods become more and more available, and probability sampling of items comes in sight. Before changing over to a new sampling strategy, however, it seems worthwhile to experiment with alternative strategies in order to assess their impact on the accuracy of the estimated price index numbers. The question to be answered is whether current non-probabilistic selection practices perform worse, in terms of the mean square error, than probability techniques. This is the main topic of the present paper. Simulation studies were carried out for three commodity groups, i.e., coffee, disposable baby's napkins and toilet paper.
Not so long ago, empirical price index number research
was hampered by the fact that highly disaggregated expenditure and quantity information at the individual outlet level
was lacking or at best available for small samples. Nowadays, some market research firms have managed to set up
vast micro data bases on sales of consumer goods, especially in the field of fast moving consumer goods. These are
derived from electronic scanning by bar-code reader or the
associated bar-code typed in at the cashier's desk. Bradley,
Cook, Leaver and Moulton (1997) give an overview of
potential uses of scanner data in CPI construction. Processing large scanner data bases is a rather time-consuming
task. For CPI compilation as such, this could prevent an
extensive use in the near future. But scanner data certainly
provide a rich source of information for empirical analysis.
In addition to studies into sampling, scanner data also
enable us to calculate price index numbers according to
different index formulas. The (fixed weight) Laspeyres
price index does not take the households' reactions to
relative price changes into account. We therefore examined
to what extent the Laspeyres population price indexes of the
three commodity groups are biased with respect to the
Fisher price index, an index that does account for
commodity substitution.
¹ Jan de Haan, Eddy Opperdoes and Cecile M. Schut, Division Socio-economic Statistics, Statistics Netherlands, P.O. Box 4000, 2270 JM, Voorburg, The Netherlands, e-mail: jhhn@cbs.nl
Section 2 gives an overview of the scanner data that we
used. Section 3 addresses four different commodity
sampling designs. Three of these (i.e., simple random
sampling, stratified sampling, and sampling proportional to
size) are probability techniques, whereas the fourth (cut-off sampling) is a judgemental one that mimics official
practices in the Netherlands. Section 4 describes Monte
Carlo experiments we performed to determine the accuracy
of the estimated commodity group price indexes under the
various sampling designs mentioned. Section 5 deals with
the use of Fisher indexes at the item level and the item
group level, respectively. The within-group substitution
bias of the Laspeyres commodity group price indexes is
shown. Section 6 summarizes and discusses the findings.
2. BAR-CODE SCANNING DATA
2.1 An Overview
Scannable products are defined in Europe by the
European Article Number (EAN). Manufacturers should
assign one and only one EAN to every variety, size, type of
packaging, etc. of a product. This has two implications. In
the first place, EANs sometimes change very rapidly, for
instance because of a new packaging. Clearly, this makes it
difficult to follow a specific item over time. Secondly, some
EANs have negligible expenditures. It seems that the
system of classification is too detailed; what is really one
item has been classified as a multitude of items. In a test
study using scanner data on coffee, Reinsdorf (1995) also
found that "items that are, for all practical purposes, the
same may occasionally have different UPC's" (the US
Universal Product Code). Some aggregation over EANs is
required. Fortunately, several product characteristics such
as brandname and subname are included in the scanner data
sets. We will treat EANs having the same product characteristics as identical items. If the number of characteristics
is insufficient, there will of course be a danger of overaggregation, that is of putting heterogeneous items together.
From A.C. Nielsen (Nederland) B.V. we received
scanner data sets containing weekly supermarket sales on
coffee, disposable baby's napkins and toilet paper. The
initial data sets contained 320, 569 and 294 different EANs,
respectively. For each EAN, the number of packages sold
and the corresponding value is included, together with
several product and outlet characteristics. Prices are not
included; average prices (unit values) must be calculated
from the values and quantities. The coffee data relate to
sales over a period of two and a half years, beginning with
week 1 of 1994 and ending in week 24 of 1996, in a sample
of 20 supermarkets located in a Dutch urban area unknown
to us. The data on the other two item groups refer to a
sample of 149 shops spread over the whole country, and
cover a period of two years, beginning with week 1 of 1995
and ending with week 52 of 1996.
For reasons of convenience we deleted the minor brands.
In the case of coffee, only the 15 brands with the highest
turnover during the entire observation period were selected
from the 55 brands actually sold. After aggregating over
EANs with identical product characteristics, we further
limited the population to those items that were sold in the
base year 1994 and every month thereafter in order to have
a complete data set for each month. We ended up with a
total of 68 items (excluding coffee beans), among which 40
items of ground coffee (including decaffeinated coffee) and
28 items of instant coffee. These account for 94.5% of total
base year coffee expenditure in the initial data set. For
napkins and toilet paper (leaving out moist toilet paper), the
brands with a turnover share of less than 1% were removed.
Next, only those items were selected that were sold in the
base year 1995 and at least eight months thereafter. This
resulted in 58 napkins items and 70 toilet paper items,
accounting for 90% and 86% of total 1995 expenditure in
the initial data sets.
2.2 Descriptive Statistics
The most striking feature of the item expenditures is the
skewness of the distribution. Figure 1 shows the inequality
of the base period expenditures in our adjusted data sets by
means of so-called Lorenz curves. The vertical axis depicts
the cumulative expenditure total, the horizontal axis the
cumulative number of items, both expressed as percentages.
The items are sorted in increasing order of expenditure. In
case of equal expenditures, the Lorenz curve would lie on
the diagonal. The more unequal the distribution becomes,
the lower its position will be. Coffee item expenditures are
distributed extremely unequal, with the three largest items
accounting for over half of total base year (1994) coffee
expenditure. For baby's napkins and toilet paper the largest
six and eight items, respectively, account for nearly half of
total base year (1995) expenditure.
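For concreteness, a Lorenz curve of this kind can be computed in a few lines; the sketch below (Python, hypothetical variable names) sorts the base period expenditures in increasing order and accumulates them:

```python
# Sketch: Lorenz curve of item expenditures, as plotted in Figure 1.
import numpy as np

def lorenz_curve(e0):
    e = np.sort(e0)                               # increasing expenditure
    x = 100 * np.arange(1, e.size + 1) / e.size   # cumulative % of items
    y = 100 * np.cumsum(e) / e.sum()              # cumulative % of expenditure
    return x, y
```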
Figure 2 shows unit value index numbers, that is the change in the value per package, irrespective of quantity, brandname, type etc., taken over all outlets. This gives a first impression of the change in "prices" during the period under study. For coffee, there was a remarkable decrease in the second half of 1995 following large price rises in 1994 due to bad harvests in Brazil. Coffee prices are largely determined by world market prices for coffee beans. We did not find evidence of significant differences in price changes between outlets. Baby's napkins differ in this respect. Heavy competition was going on between the various producers (which may have caused the decline of the unit values during 1996), while discounts and other kinds of special actions were offered frequently. Hence, the unit value taken over all items and outlets gives an inaccurate picture of the aggregate price change of baby's napkins.
[Figure 1 appears here: Lorenz curves; horizontal axis: percentage of items (0-100%); vertical axis: cumulative percentage of base period expenditure.]
Figure 1. Distribution of base period item expenditures
Figure 2. Unit value index numbers (1994=100 for coffee, 1995=100 for baby's napkins and toilet paper)
3. ESTIMATING LASPEYRES-TYPE PRICE INDEXES
We start this section by introducing some notation. Let
commodity group A consist of a finite number, say N, of
commodities (items); $g \in A$ means that item $g$ belongs to
group A. We assume that A is fixed during time. In real life
this is not true: some products disappear from the market,
while new products enter. In the short run, however, the
constant item group assumption seems reasonable. Note
that we adjusted our initial data set accordingly. The reason
behind this is that we want to concentrate solely on the
sampling aspect. The Laspeyres (fixed weight) price index
of commodity group A in period t is
$$P^t = \frac{\sum_{g \in A} e_g^0 P_g^t}{\sum_{g \in A} e_g^0} = \sum_{g \in A} w_g^0 P_g^t \qquad (1)$$
where $P_g^t$ denotes the price index of item $g$, $e_g^0$ the expenditure on $g$ during base period 0 and $w_g^0$ the corresponding expenditure share of $g$ within item group $A$. In the base period a sample $a$ with fixed size $n$ is taken from $A$. Because $A$ is supposed to be fixed during time, it seems natural to keep $a$ fixed as well.
3.1 Simple Random Sampling

Probability sampling refers to situations in which all possible samples have a known probability of selection. Under simple random sampling (without replacement), all possible samples have equal selection probabilities. The Horvitz-Thompson estimator $\hat P_A^t = (N/n) \sum_{g \in a} w_g^0 P_g^t$ is unbiased for $P^t$, that is $E(\hat P_A^t) = P^t$, where the expectation $E(.)$ denotes the mean over all possible samples under a given sampling design, in this particular case simple random sampling. Despite its unbiasedness, $\hat P_A^t$ will not be used in practice because of two undesirable properties.
Firstly, if the price indexes of all sampled items are equal, the estimated item group index differs from that value, unless the population and sample means of expenditures coincide. Price index makers probably dislike this feature. Secondly, and more importantly, $\hat P_A^t$ is bound to exhibit extraordinarily large sampling variance. To overcome both difficulties, $P^t$ is estimated by taking unbiased estimators of the numerator and denominator:

$$\hat P_B^t = \frac{(N/n)\sum_{g \in a} e_g^0 P_g^t}{(N/n)\sum_{g \in a} e_g^0} = \sum_{g \in a} \hat w_g^0 P_g^t \qquad (2)$$

where $\hat w_g^0$ is the expenditure share of item $g$ in the sample. Using a first-order Taylor linearization (Särndal, Swensson and Wretman 1992, pp. 172-176), the variance of $\hat P_B^t$ can be written as

$$V(\hat P_B^t) \approx V(\hat P_A^t) - (P^t)^2 (2\rho^t \beta^t - 1)(\sigma^0)^2$$

where $\sigma^0$ denotes the coefficient of variation or relative standard error of the sample mean of base period expenditures, $\beta^t$ is the ratio of the relative standard errors of the average base period expenditures expressed in prices of period $t$ and 0, and $\rho^t$ is the correlation coefficient between the average base period expenditures in prices of $t$ and 0 (which is expected to have a positive sign). The choice for $\hat P_B^t$ instead of $\hat P_A^t$ can thus be elucidated by the fact that the former exploits the panel character of the sample; with $\beta^t > 1/(2\rho^t)$, a substantial reduction in variance is expected. An alternative expression for the variance of $\hat P_B^t$ is:

$$V(\hat P_B^t) \approx \frac{1-f}{n}\,\frac{N^2}{N-1} \sum_{g \in A} (w_g^0)^2 (P_g^t - P^t)^2 \qquad (3)$$

which can be estimated using sample data provided that the sampling fraction $f = n/N$ is known. This formula, earlier mentioned by Balk (1989), shows that the variance depends on the within-group dispersion of the item price indexes. Hence, the variance could be lowered either by constructing item groups made up of items having similar price changes or by enlarging the sample. Särndal et al. (1992, p. 176) caution that "the Taylor linearization method has a tendency to lead to underestimated variances in not so large samples". The CPI item samples are generally quite small. For some item groups there may even be only one or two representative items. Thus, besides being unstable (having a large variance itself), the variance will probably also be underestimated when based on (3).

We note that estimator $\hat P_B^t$, being a ratio, suffers from small sample bias of approximately $O(1/n)$. It can easily be verified that its absolute value $|B(\hat P_B^t)| \le \sigma^0 \sqrt{V(\hat P_B^t)}$. If $\sigma^0$ is small, say less than 0.1, the bias of $\hat P_B^t$ may safely be regarded as negligible in relation to its standard error. However, with a small item sample and a large variability of base period expenditures, $\sigma^0$ could easily exceed 0.1 by far. We add that the all-items CPI is unlikely to be biased to a large extent on this account, since the bias is a (weighted) average of positive and negative biases of the various item group indexes.

3.2 Sampling Proportional to Size

Sampling proportional to size has the advantage that the most important items have a big chance of being sampled. We will restrict ourselves to fixed size sampling without replacement, since this seems most likely to be chosen in case of item sampling proportional to size (see for example the Swedish case described by Dalén and Ohlsson 1995). Base period expenditure acts as our measure of size, and the required first-order inclusion probability for item $g$ is $\pi_g = n e_g^0 / e^0 = n w_g^0$, where $e^0 = \sum_{g \in A} e_g^0$. It follows that $\sum_{g \in a} P_g^t / n$ is an unbiased estimator of $P^t$.

Sampling proportional to size without replacement, combined with the Horvitz-Thompson or $\pi$ estimator, is sometimes called $\pi$ps sampling. Most existing schemes for fixed-size $\pi$ps sampling are draw-sequential and rather complicated. We will therefore use systematic $\pi$ps selection instead. This scheme can be described by imagining the expenditures $e_g^0$ ($g \in A$) as cumulatively laid out on a horizontal axis, starting at the origin and ending at $e^0$. A real number is randomly chosen in the interval $(0, e^0/n]$, and we proceed systematically by taking the items $g$ identified by points at the constant distance $e^0/n$ apart. This method yields exactly the desired sample size. For commodity groups with large variation in base period expenditures, it may not always be possible to select an item sample strictly proportional to expenditure. Obviously, $\pi_g \le 1$ must be satisfied for all $g$. If $n > 1$ and some $e_g^0$ values are extremely large, it may be true for some items that $n e_g^0/e^0 > 1$, contradicting the requirement $\pi_g \le 1$. The conflict will be dealt with as follows. The $N$ items are ordered according to descending expenditures. First, if $e_1^0 > e^0/n$, we set $\pi_1 = 1$. Next, if $e_2^0 > (e^0 - e_1^0)/(n-1)$, we also set $\pi_2 = 1$. The procedure is repeated until the requirement for sampling proportional to base period expenditure is met for all remaining items. Our recursive approach differs somewhat from the method proposed by Särndal et al. (1992, p. 90). They suggest to set $\pi_g = 1$ for all $g$ with $e_g^0 > e^0/n$. In our data sets this would lead to unnecessarily large numbers of items with $\pi_g = 1$. The subgroup $A_H$ of items with the highest base period expenditures which is selected with certainty will be called the self-selecting part of the sample. From the remaining low-expenditure subgroup $A_L$ a sample $a_L$ with size $n_L$ is drawn strictly proportional to expenditure. The resulting unbiased estimator is an expenditure weighted average of $P^t(H)$, the population Laspeyres price index of $A_H$, and $\sum_{g \in a_L} P_g^t / n_L$, the estimated price index of $A_L$.

3.3 Stratified Sampling

The obvious advantage of simple random sampling as opposed to sampling proportional to expenditure is that,
apart from a register of items serving as a sampling frame, no other data are required. See also Balk (1994). With very unequally distributed item expenditures chances are big that the market leaders fall outside the sample, a situation that seems intuitively unappealing. We will argue that it would indeed be preferable if they were selected. Recall that the variance of the item group price index under simple random sampling depends on the within-group dispersion of the item price indexes. A variance reduction could be achieved if it were possible to stratify the item group into homogeneous subgroups according to their price changes. However, a priori knowledge of item price changes is not available. Another way to lower the variance might be to stratify the item group into two subgroups, one ($A_H$) with high base period expenditures which is observed entirely and the other one ($A_L$) with low expenditures from which a random sample $a_L$ is taken. The new item group price index estimator is an expenditure weighted average of $\hat P_B^t(L)$, the Laspeyres index of the low-expenditure subgroup, estimated in accordance with (2), and $P^t(H)$. Its sampling variance is $(1 - \tau_H)^2 V[\hat P_B^t(L)]$, where $\tau_H$ is the expenditure share of $A_H$ within $A$. This method does not necessarily reduce the variance of the estimated price index, but it is likely to do so under certain conditions. The variance of the new estimator will be smaller than the variance of $\hat P_B^t$ when

$$1 - \tau_H < \frac{\mathrm{se}(\hat P_B^t)}{\mathrm{se}[\hat P_B^t(L)]}, \qquad (4)$$

where se(.) denotes standard error. Inequality (4) is expected to hold if the item expenditures are distributed extremely unequally, since $1 - \tau_H$ will then become much smaller than 1. Stratification may be especially productive as the overall sample size $n$ increases.
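A minimal sketch of this stratified estimator (Python; the names and the random generator argument are ours, not the authors' code) reads:

```python
# Sketch of section 3.3: take-all stratum A_H observed entirely, simple
# random sample from the low-expenditure stratum A_L estimated via (2).
import numpy as np

def stratified_index(e0, pindex, n_H, n_L, rng):
    order = np.argsort(e0)[::-1]                 # descending base expenditures
    H, L = order[:n_H], order[n_H:]              # take-all stratum and the rest
    s = rng.choice(L, size=n_L, replace=False)   # SRS within A_L
    p_H = np.sum(e0[H] * pindex[H]) / np.sum(e0[H])   # P^t(H), known exactly
    p_L = np.sum(e0[s] * pindex[s]) / np.sum(e0[s])   # estimator (2) on A_L
    tau_H = e0[H].sum() / e0.sum()               # expenditure share of A_H
    return tau_H * p_H + (1 - tau_H) * p_L
```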
The choice of $\tau_H$ and thus of the size $N_H$ of the "take-all" stratum $A_H$ is a bit of a problem. Preferably we would have some optimality criterion in order to minimize the variance. But since a priori knowledge of item price changes is lacking and past trends do not forecast future price changes very accurately, the optimal size of $A_H$ can hardly be computed in practice. In the empirical analysis we will try two different relative sample sizes $\lambda_H = N_H/n$ of $A_H$, namely $\lambda_H = 1/3$ and $\lambda_H = 2/3$. These values suffice to give a clear indication of the performance.
3.4 Cut-off Sampling

When the sample size is very small it seems rather likely that stratification with $\lambda_H = 2/3$ leads to a larger standard error of the estimated price index than with $\lambda_H = 1/3$. But what happens if $A_L$ is not observed at all, so that $\lambda_H = 1$ and thus $n = N_H$? We would then be using (a special type of) cut-off sampling. The item group price index is estimated simply by $\hat P_C^t = P^t(H)$. All $g \in A_H$ now have an inclusion probability of 1, whereas all $g \in A_L$ have zero inclusion probability (Särndal et al. 1992, pp. 531-533). Since we know exactly which items will be selected there is no randomness involved and the sampling variance of $\hat P_C^t$ is zero by definition. The bias equals the actual error, i.e., the difference between the estimated value and the true population index:

$$\hat P_C^t - P^t = (1 - \tau_H)[P^t(H) - P^t(L)]. \qquad (5)$$

With an extremely unequal distribution of item expenditures, even a small sample size would cause a large value for $\tau_H$. In that case cut-off estimation may outperform stratification, in terms of the mean square error. We may either fix the cut-off rate $\tau_H$, so that the sample size $n$ is determined by $\tau_H$, or fix the sample size, in which case $\tau_H$ depends on the choice of $n$. The latter option was chosen by us since fixed size sampling designs are common practice in selecting CPI items, and because this allows a suitable comparison with other fixed size designs.
The use of cut-off procedures can be justified on the
grounds that i) the costs prohibit the construction of a
reliable sampling frame for the whole population, and ii)
the bias is deemed negligible. Assumption ii) cannot be
verified in general, of course. The deliberate exclusion of
part of the target population from sample selection may
nevertheless give satisfactory results when appropriate
corrections are made. Statistics Netherlands makes use of
cut-off sampling in various other business surveys, for
instance in production and foreign trade statistics where
very small enterprises are left unobserved. In the Dutch
National Accounts, that use production and foreign trade
data as important inputs, explicit estimates are being made
for small firms. The cut-off method for CPI item selection, on the other hand, does not correct for the excluded items.
In addition to cost-considerations, this method is sometimes
defended by the belief that, at least in the longer run, the
price changes of less important items will not differ much
from those of the market leaders within the same product
category because of similar production cost structures.
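Since cut-off selection involves no randomness, the corresponding "estimator" is just the population Laspeyres index of the take-all group; a one-function sketch (Python, our naming):

```python
# Sketch of section 3.4: fixed-size cut-off selection -- the n items with
# the highest base period expenditures are priced with certainty, so the
# sampling variance is zero and the error equals the bias in (5).
import numpy as np

def cutoff_index(e0, pindex, n):
    top = np.argsort(e0)[-n:]                    # n largest-expenditure items
    return np.sum(e0[top] * pindex[top]) / np.sum(e0[top])
```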
4. EMPIRICAL ESTIMATION
4.1 Monte Carlo Simulation
With the exception of cut-off selection it is difficult to
find reliable measures of the sampling distributions based
on a single sample. Under simple random sampling the
estimator $\hat P_B^t$ has an unknown bias whereas variance estimation based on Taylor linearization techniques gives inaccurate results because CPI item samples are generally very small. Systematic $\pi$ps sampling raises the question of
how to estimate the variance of the estimator since the
second-order inclusion probabilities are unknown. To
obtain the exact sampling distribution we would have to
consider all samples $a$ that are possible under a certain sampling design. For every $a$ the probability of drawing $a$
and the estimated value of the commodity group price index
must be known in order to calculate the exact values of the
expected value, the bias and the variance of the estimator.
de Haan, Opperdoes and Schut: Item Selection in the Consumer Price Index
36
This is virtually impossible because of the extremely large
number of possible samples. To describe the sampling
distribution, we will therefore carry out Monte Carlo simulations. A large number of samples, say K, is drawn from
the (same) population A according to the given design and
for each sample the estimate is calculated. If K is large
enough, the distribution of the K estimates will closely
approximate the exact sampling distribution. Let $\hat P_k^t$ denote the result for the $k$-th sample under a certain sampling design. Then
$$\bar P^t = \frac{1}{K} \sum_{k=1}^{K} \hat P_k^t$$

is an unbiased estimate of the expected value $E(\hat P^t)$. We will calculate

$$\hat b = \bar P^t - P^t,$$

which is an unbiased estimate of the bias $B(\hat P^t)$;

$$s_P^2 = \frac{1}{K-1} \sum_{k=1}^{K} (\hat P_k^t - \bar P^t)^2,$$

which is an unbiased estimate of the variance $V(\hat P^t)$; and

$$\widehat{\mathrm{rmse}} = \sqrt{(\bar P^t - P^t)^2 + s_P^2},$$

which is an approximately unbiased estimate of the root mean square error (rmse) of $\hat P^t$. Särndal et al. (1992, p. 280) remark that "the imperfection caused by the finite number of repetitions is more keenly felt in the case of a variance measure... than in the case of measures calculated as means".

4.2 Results

Monte Carlo simulations were carried out with three different sample sizes: n=3, n=6 and n=12. The number of repetitions (K) per experiment was set to 500,000. Table 1 shows the results for coffee in January 1995 (1994=100), Tables 2 and 3 those for baby's napkins and toilet paper, respectively, in January 1996 (1995=100). The choice of the formula with which individual price observations are aggregated into a single item price index is discussed in the Appendix. Throughout this section all item price indexes are calculated as unit value indexes over all outlets. Simple random sampling performs particularly badly. For example, with n=3 the true (Laspeyres) coffee price increase of 17.2% is understated by 1.4%-points. Together with a standard error of 5.1%-points, the rmse amounts to 5.3%-points, that is almost one third of the true price increase. Even with n=12, so that the sampling fraction is 0.18 (which would be unusually large in practical situations), the rmse still remains considerably high. Notice that, as expected, the small sample bias is halved when the sample size is doubled. Stratification works reasonably well with larger sample sizes but gives disappointing results with n=3. In the latter case, stratification increases the rmse compared to simple random sampling for baby's napkins and toilet paper when $N_H = 2$ (that is, when $\lambda_H = 2/3$). Our favourite probabilistic design would be systematic sampling proportional to expenditure because the estimates are unbiased and their standard errors relatively low. The most surprising finding perhaps is the good performance of cut-off selection. Except for n=3 and n=6 in case of baby's napkins, this method produces the best results.
Table 1
Estimated Laspeyres Price Index Numbers for Coffee (1994=100), January 1995 (N=68)

Sampling scheme     |        n=3               |        n=6               |        n=12
                    | exp.val  se   bias rmse  | exp.val  se   bias rmse  | exp.val  se   bias rmse
S.R. *)             |  115.7   5.1  -1.4  5.3  |  116.4   3.4  -0.7  3.5  |  116.7   2.3  -0.4  2.3
πps                 |  117.2   2.2   0    2.2  |  117.2   1.3   0    1.3  |  117.2   0.7   0    0.7
Stratified λ_H=1/3  |  116.4   3.9  -0.7  4.0  |  116.6   2.3  -0.5  2.3  |  117.0   1.2  -0.1  1.2
Stratified λ_H=2/3  |  115.6   4.5  -1.5  4.7  |  116.4   2.5  -0.7  2.6  |  117.0   1.1  -0.2  1.1
Cut-off             |  117.0   0    -0.2  0.2  |  117.2   0     0.0  0.0  |  117.5   0     0.3  0.3

*) Simple random
Table 2
Estimated Laspeyres Price Index Numbers for Baby's Napkins (1995=100), January 1996 (N=58)

Sampling scheme     |        n=3               |        n=6               |        n=12
                    | exp.val  se   bias rmse  | exp.val  se   bias rmse  | exp.val  se   bias rmse
S.R.                |   99.4   5.0   2.3  5.5  |   98.7   3.9   1.5  4.2  |   97.9   2.9   0.8  3.0
πps                 |   97.2   2.8   0    2.8  |   97.2   1.6   0    1.6  |   97.2   1.5   0    1.5
Stratified λ_H=1/3  |   98.9   5.0   1.8  5.3  |   98.1   3.3   1.0  3.4  |   97.4   1.7   0.3  1.7
Stratified λ_H=2/3  |   98.3   5.8   1.1  5.9  |   97.4   3.3   0.2  3.3  |   97.0   1.6  -0.2  1.6
Cut-off             |   92.0   0    -5.1  5.1  |   93.4   0    -3.8  3.8  |   95.5   0    -1.6  1.6
Table 3
Estimated Laspeyres Price Index Numbers for Toilet Paper (1995=100), January 1996 (N=70)

Sampling scheme     |        n=3               |        n=6               |        n=12
                    | exp.val  se   bias rmse  | exp.val  se   bias rmse  | exp.val  se   bias rmse
S.R.                |  103.9   4.5   0.1  4.5  |  103.9   3.5  -0.1  3.5  |  103.9   2.6   0.1  2.6
πps                 |  103.9   3.4   0    3.4  |  103.9   1.8   0    1.8  |  103.9   1.2   0    1.2
Stratified λ_H=1/3  |  103.5   4.3  -0.3  4.3  |  103.7   3.2  -0.1  3.2  |  104.0   2.1   0.1  2.1
Stratified λ_H=2/3  |  103.7   4.6  -0.2  4.6  |  104.2   3.4   0.4  3.4  |  103.9   1.6   0.0  1.6
Cut-off             |  105.0   0     1.1  1.1  |  104.0   0     0.1  0.1  |  104.0   0     0.1  0.1
For coffee we also tried another form of stratified
sampling. The entire population of items was subdivided
into ground coffee and instant coffee, and we took random
samples from each stratum. Although the price changes of
instant coffee are smoothed and lag behind as compared to
ground coffee, Monte Carlo results using stratified
sampling were similar to those using unstratified sampling
for all four sampling methods. This contradicts earlier
findings (see De Haan and Opperdoes 1997a). The reason
is that we deleted some instant coffee items for this study to
have a complete data set for each month, and ended up with
a minor fraction (8%) of instant coffee in total base year
coffee expenditures.
It would be hazardous to draw conclusions about the
performance of the various sampling designs based on
simulations for one particular month since it is likely that
the outcomes depend on the frequency distribution of the
item price indexes. Figure 3 shows these distributions for
coffee and baby's napkins in two months. Both distributions move to the left, indicating that the unweighted mean
has declined. Apart from that, the frequency distribution for
coffee remains quite stable. The shape of the curve for
napkins, on the other hand, changes dramatically; the
variance of the item indexes has grown.
Monte Carlo experiments were run for each month of the
period under study. Figure 4 shows the rmse with n = 3.
The pattern that emerges for coffee and toilet paper is surprisingly robust: cut-off selection always comes out as best. Apparently, if sample sizes are small, the exclusion of the smaller items does not seem to matter much. This is what many statistical offices have been appreciating for a long time, without being able to test it empirically before. The reason why cut-off selection performs better than sampling proportional to expenditure is, in the case of toilet paper, partly caused by the fact that there is no self-selecting part under the latter sampling scheme. With larger samples the results under cut-off selection and sampling proportional to size are very much alike. For baby's napkins the outcomes differ somewhat. Because of the high volatility of the item indexes, the rmse under cut-off selection varies considerably; it seems to meander around the rmse resulting from systematic sampling proportional to expenditure. The high variability of the error can be considered a drawback of cut-off selection.
Figure 3. Frequency distribution of item price index numbers (panels: Coffee, 1994=100; Baby's napkins, 1995=100)
[Figure 4 appears here: three panels (Coffee; Baby's Napkins; Toilet paper), each plotting the rmse by month (9503-9612) for SRS, πps, cut-off, and stratified selection with N_H = 1 and N_H = 2.]
Figure 4. Rmse of estimated Laspeyres price indexes (n=3)
5. THE USE OF FISHER INDEXES
5.1 Unit Value Versus Fisher Item Indexes
In section 4 the item price indexes were calculated as
unit value indexes over all outlets. To assess the impact of
the choice of the item index formula on the outcomes of the
simulation study, Table 4 compares Monte Carlo results
with n=3 based on unit value item index numbers (as in
tables 1-3) and Fisher item price index numbers; see the
Appendix for details. For coffee, we notice hardly any
differences. For napkins and toilet paper, on the other hand.
the rmse decreases when Fisher index numbers are used
instead, especiaUy in case of simple random sampUng. This
is caused by the fact that unit value indexes tend to show a
more ertatic pattem. If physically identical types of napkins
or toilet paper are deemed heterogeneous across outlets, so
that the Fisher formula would be more appropriate, the use
of unit value indexes overstates the price variability of
particularly small items and exaggerates the poor performance of simple random sampUng. Nevertheless, we would
still have to conclude that simple random sampling does not
work very well.
5.2 Within-group Substitution Bias
The Fisher index is one of the best-known superlative
indexes. When applied to the item group level, the difference between the population price indexes calculated
according to the Laspeyres and the Fisher formula can be
interpreted as within-group item substitution bias (Figure 5). For coffee it is less than 1%-point per year. For toilet paper and particularly for napkins the biases are very large, about 1.5-3%-points per year. Within-group substitution bias is generally positive and increases over time. Notice, however, that for baby's napkins in a few months of the first half of 1996 the Laspeyres index numbers are lower than the Fisher index numbers. This unexpected effect, and possibly also the large magnitude of the positive bias in other months, may be due to a deficiency of the data set which only contains supermarkets. It is well-known that baby's napkins are bought in the Netherlands also in other kinds of shops such as drugstores that do not make use of bar-code scanning. Substitution between the included and excluded outlets in the data base may damage our population index numbers as accurate approximations of the true values. We are convinced though that it does not seriously affect the assessment of the sampling methods presented in section 4.
Many statistical agencies and users are of the opinion that the CPI should be an approximation to the true cost of living index. This theoretical concept is derived from microeconomics and measures the change in the minimum costs for a representative consumer, or household, necessary to retain the same standard of living or utility. Since utility cannot be measured, a feasible index formula should be chosen that closely approximates the concept. Diewert (1976) showed that (what he calls) superlative indexes provide second order approximations to the cost of living index. The most important feature of superlative price indexes is that they take account of consumers' substitution towards goods and services exhibiting relatively small price increases. These index formulas make use of expenditure data relating to both the base period 0 and the current period t. In practice it takes some time before expenditure data are known, so that superlative indexes cannot be compiled in real time. For the sake of timeliness most national statistical offices adopt the Laspeyres (fixed weight) formula for constructing their CPIs.
Table 4
Estimated Laspeyres Price Index Numbers Using Alternative Item Indexes (n=3)

                    | Coffee, Jan. 1995 (1994=100) | Napkins, Jan. 1996 (1995=100) | Toilet paper, Jan. 1996 (1995=100)
                    |  (1)           (2)           |  (1)           (2)            |  (1)           (2)
Sampling scheme     | exp.val rmse   exp.val rmse  | exp.val rmse   exp.val rmse   | exp.val rmse   exp.val rmse
S.R. *)             | 115.7   5.3    115.8   5.3   |  99.4   5.5    100.4   4.2    | 103.9   4.5    104.0   3.5
πps                 | 117.2   2.2    117.2   2.2   |  97.2   2.8     98.6   2.1    | 103.9   3.4    104.3   3.6
Stratified λ_H=1/3  | 116.4   4.0    116.5   4.0   |  98.9   5.3    100.1   4.1    | 103.5   4.3    103.3   3.6
Stratified λ_H=2/3  | 115.6   4.7    115.6   4.8   |  98.3   5.9     99.5   4.6    | 103.7   4.6    103.2   3.9
Cut-off             | 117.0   0.2    117.0   0.2   |  92.0   5.1     94.8   3.8    | 105.5   1.1    104.7   1.0

(1) Based on unit value item index numbers
(2) Based on Fisher item index numbers
*) Simple random
[Figure 5 appears here: monthly series of the difference, in %-points, for toilet paper, baby's napkins and coffee.]
Figure 5. Difference between Laspeyres and Fisher population price index numbers
6. DISCUSSION
Although bar-code scanning data have some deficiencies, they provide an excellent opportunity to undertake
empirical research into various sampling issues concerning
CPIs. Our simulations show that, for coffee, disposable
baby's napkins and toilet paper at least, simple random
sampling of items should be advised against. We believe
that this recommendation can be extended to all item groups
where the distribution of expenditures is very skewed. If
statistical offices want to apply probability sampling, they
would do a better job using sampling proportional to
expenditure. However, cut-off selection might be a good or
even better alternative for those item groups where the
various item price changes are not too volatile. As a matter
of fact, as far as we are aware this is the first study to supply
empirical evidence in support of cut-off CPI item selection
methods. Aggregated scanner data - that is, scanner data
aggregated over outlets - should give a clear indication of
the required cut-off rate. Statistics Netherlands already
made use of aggregated Nielsen data on a range of
commodity groups in the past in order to select items for the
CPI sample.
Cut-off methods are applied extensively in the
Netherlands and many other European countries (Boon
1997). In the Netherlands the actual item selection is a little
more complex than the situation described above. First, a
number of item subgroups instead of specific items are
chosen using the cut-off method. Next, a number of specific
items are selected from each subgroup through so-called
judgemental sampling. The selection of these representative
items is based on the judgement of experts working at the
central office who should have a firm knowledge of the
consumer market in question. Usually the most frequently
bought items or those with the highest turnover will be
selected, so that the entire sampling scheme is a two-stage
cut-off procedure. It is unlikely that such a two-stage
method would yield results much different from the
single-stage procedure we have used in this paper.
In some other European countries, e.g., the United
Kingdom, cut-off selection does not take place at the central
office but by field staff at the outlets where prices are
measured. To illustrate this method, we choose one item per
outlet, namely the item with the highest base period sales in
the outlet. For coffee, baby's napkins and toilet paper this
yields 2, 12 and 24 different items, respectively. The
Laspeyres item group index is estimated in accordance with
expression (2), where the item price indexes are calculated
as outlet-specific unit value indexes and weighted by
outlet-specific weights. Figure 6 shows the rmse resulting
from this method. If we compare this with Figure 4 (cut-off
selection done at the central office for n=3), the accuracy of
both cut-off selection methods seems "on average" to be of
the same order of magnitude, although the pattern is slightly more erratic under selection at the outlets. But such a
comparison is quite arbitrary. Why not compare cut-off
selection at the outlets with cut-off selection at the central
office for n=6, or «=12, or indeed for any other sample
size? Another problem is that we treated the item price
indexes as if they were known with certainty. In reality they
will be based on a sample of outlets, so that our results are
conditional on this sample. For a proper assessment of both
cut-off selection procedures we need to take both the
sampling of items and the sampling of outlets into account.
However, that is beyond the scope of this paper.
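To make the outlet-level variant concrete, the following sketch (Python; the two-dimensional data layout is a hypothetical simplification of the scanner files) picks the top-selling item in each outlet and weights the outlet-specific indexes by outlet expenditure shares:

```python
# Sketch: cut-off selection at the outlets, one item per outlet.
import numpy as np

def outlet_cutoff_index(e0_bo, pindex_bo):
    """e0_bo, pindex_bo: (outlets x items) arrays of base period
    expenditures and outlet-specific item price indexes."""
    top = np.argmax(e0_bo, axis=1)             # top-selling item per outlet
    w_b = e0_bo.sum(axis=1) / e0_bo.sum()      # outlet expenditure weights
    rows = np.arange(e0_bo.shape[0])
    return np.sum(w_b * pindex_bo[rows, top])  # weighted average of indexes
```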
Scanner data not only offer challenging perspectives for
statistical research in the field of CPI sampling issues, they
also enable us to compile all sorts of index numbers,
including superlative indexes, using real and highly disaggregated data at the individual outlet level. We demonstrated that the Laspeyres item group price indexes used by
statistical agencies can be biased by more than +1%-point
on a yearly basis with respect to the (superlative) Fisher
price index that accounts for item substitution. A related
type of bias, caused by neglecting products that are introduced after the base period (see e.g., Boskin, Dulberger,
Gordon, Griliches and Jorgenson 1996), was not addressed
by us. Scanner data do provide a good opportunity to
investigate this new goods bias.
Figure 6. Rmse resulting from cut-off selection at the outlets
ACKNOWLEDGEMENTS
This research was partially supported by Eurostat (the
Statistical Office of the European Communities) under
SUP-COM 1996, Lot 1: Development of methodologies in
consumer price indices and purchasing power parities. The
authors are grateful to A.C. Nielsen (Nederland) B.V. for
providing scanner data at marginal costs. They also wish to
thank Bert M. Balk, Leendert Hoven and two anonymous
referees for helpful comments on an earlier draft.
APPENDIX: THE CHOICE OF THE
ITEM INDEX FORMULA
To perform sampling simulations we need item index
numbers. What index formula should be chosen? Statistical
offices are generally forced to calculate indexes at the
lowest level of aggregation based on price data alone
because quantity or expenditure data is lacking. See Szulc
(1987), Dalén (1991), Balk (1994), and Diewert (1995) for
a comprehensive treatment of this subject. With scanner
microdata at hand, we are in the unique position to
construct genuine price indexes (Silver 1995, Hawkes 1997). Consider a set of outlets $B_g$, assumed fixed during time, where item $g$ can be bought; $b \in B_g$ means that $g$ can be bought in outlet $b$. The price of $g$ at outlet $b$ in period $s$ ($s = 0, t$) and the corresponding quantity sold are denoted $p_{gb}^s$ and $x_{gb}^s$, respectively. The item will be taken as the lowest aggregation level where price indexes are constructed. As a start we restrict ourselves to item indexes that can be written as ratios of weighted arithmetic mean prices in period $t$ and period 0:
    P_g^t = (Σ_{b ∈ B_g} w_gb^s p_gb^t) / (Σ_{b ∈ B_g} w_gb^u p_gb^0),    (6)
where w_gb^z = x_gb^z / Σ_{b ∈ B_g} x_gb^z denotes the share of outlet b in the total quantity sold of item g in period z (z = s, u). If u = 0 and s = t, the prices in period 0 and period t are weighted by the corresponding relative quantities. The average prices are then called unit values, and P_g^t is a unit value index. De Haan and Opperdoes (1997b) and Balk (1998) discuss its merits. Adding up quantities makes sense only if item g can be conceived of as being homogeneous, that is, identical across all b ∈ B_g. Unit values then yield the appropriate average transaction prices, and the unit value index is the appropriate item price index.

The problem, of course, is to define homogeneity. It can be argued that physically identical products sold in different outlets are not identical items because of the different services that accompany the transactions, so that homogeneity across outlets never occurs. Another index formula should then be chosen. If u = s in expression (6), P_g^t can be called a fixed quantity price index, with u acting as the quantity reference period. For u = s = 0, P_g^t turns into the Laspeyres price index, and for u = s = t, P_g^t is the Paasche price index. On theoretical grounds we cannot favour either one. For reasons of symmetry it seems natural to take the (unweighted) geometric average of the Paasche and the Laspeyres indexes, which is the Fisher (ideal) price index.
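For concreteness, these formulas can be coded directly. The following minimal Python sketch (our illustration, not the authors' code; the price and quantity arrays for one homogeneous item are hypothetical) computes the unit value, Laspeyres, Paasche and Fisher indexes defined around expression (6):

```python
import numpy as np

# Hypothetical data for one item g sold in three outlets of B_g.
p0 = np.array([2.10, 2.25, 1.95])    # prices in base period 0
pt = np.array([2.30, 2.20, 2.05])    # prices in current period t
x0 = np.array([120.0, 80.0, 200.0])  # quantities sold in period 0
xt = np.array([100.0, 95.0, 210.0])  # quantities sold in period t

w0 = x0 / x0.sum()   # outlet shares of total quantity, period 0
wt = xt / xt.sum()   # outlet shares of total quantity, period t

def index_6(w_num, w_den):
    """Expression (6): ratio of weighted arithmetic mean prices,
    numerator weights applied to p^t, denominator weights to p^0."""
    return (w_num * pt).sum() / (w_den * p0).sum()

unit_value = index_6(wt, w0)             # s = t, u = 0
laspeyres = index_6(w0, w0)              # u = s = 0
paasche = index_6(wt, wt)                # u = s = t
fisher = np.sqrt(laspeyres * paasche)    # geometric mean of the last two
```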
REFERENCES
BALK, B.M. (1989). On Calculating the Precision of Consumer Price Indices. Report, Department of Price Statistics, Statistics Netherlands, Voorburg.

BALK, B.M. (1994). On the first step in the calculation of a consumer price index. Proceedings of the First International Conference on Price Indices, Statistics Canada, Ottawa.

BALK, B.M. (1998). On the use of unit value indices as consumer price subindices. Proceedings of the Fourth International Conference on Price Indices, U.S. Bureau of Labor Statistics, Washington, D.C.

BOON, M. (1997). Sampling Designs in Constructing Consumer Price Indices: Current Practices at Statistical Offices. Research Paper no. 9717, Research and Development Division, Statistics Netherlands, Voorburg.

BOSKIN, M.J., DULBERGER, E.R., GORDON, R.J., GRILICHES, Z., and JORGENSON, D. (1996). Toward a More Accurate Measure of the Cost of Living. Final report to the U.S. Senate Finance Committee, Washington, D.C.

BRADLEY, R., COOK, B., LEAVER, S.G., and MOULTON, B.R. (1997). An overview of research on potential uses of scanner data in the U.S. CPI. Proceedings of the Third International Conference on Price Indices, Statistics Netherlands, Voorburg.

DALÉN, J. (1991). Computing elementary aggregates in the Swedish consumer price index. Journal of Official Statistics, 8, 129-147.

DALÉN, J., and OHLSSON, E. (1995). Variance estimation in the Swedish consumer price index. Journal of Business & Economic Statistics, 13, 347-356.

DIEWERT, W.E. (1976). Exact and superlative index numbers. Journal of Econometrics, 4, 115-145.

DIEWERT, W.E. (1995). Axiomatic and Economic Approaches to Elementary Price Indexes. Working Paper no. 5104, National Bureau of Economic Research, Cambridge.

HAAN, J. de, and OPPERDOES, E. (1997a). Estimation of the Coffee Price Index Using Scanner Data: The Sampling of Commodities. Research Paper, Socio-economic Statistics Division, Statistics Netherlands, Voorburg.

HAAN, J. de, and OPPERDOES, E. (1997b). Estimation of the coffee price index using scanner data: The choice of the micro index. Proceedings of the Third International Conference on Price Indices, Statistics Netherlands, Voorburg.

HAWKES, W.J. (1997). Reconciliation of consumer price index trends with corresponding trends in average prices for quasi-homogeneous goods using scanning data. Proceedings of the Third International Conference on Price Indices, Statistics Netherlands, Voorburg.

REINSDORF, M. (1995). Constructing Basic Component Indexes for the U.S. CPI from Scanner Data: A Test Using Data on Coffee. Presented at the NBER Conference on Productivity, Cambridge, Mass., July 17, 1995.

SÄRNDAL, C.-E., SWENSSON, B., and WRETMAN, J. (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.

SILVER, M. (1995). Elementary aggregates, micro-indices and scanner data: some issues in the compilation of consumer price indices. Review of Income and Wealth, 41, 427-438.

SZULC, B.J. (1987). Price indices below the basic aggregation level. ILO Bulletin of Labour Statistics. Reprinted in Turvey (1989).

TURVEY, R. (1989). Consumer Price Indices: An ILO Manual. Geneva: International Labour Office.
Survey Methodology, June 1999
Vol. 25, No. 1, pp. 43-56
Statistics Canada
Robust Calibration Estimators
PIERRE DUCHESNE¹
ABSTRACT
We consider the use of calibration estimators when outliers occur. An extension is obtained for the class of Deville and Särndal (1992) calibration estimators, based on the Wright (1983) QR estimators. It is also obtained by minimizing a general metric subject to constraints on the calibration variables and weights. As an application, this class of estimators helps us consider robust calibration estimators by choosing the parameters carefully. This makes it possible, e.g., for cosmetic reasons, to limit the robust weights to a predetermined interval. The use of robust estimators with a high breakdown point is also considered. In the specific case of the mean square metric, the estimator proposed by the author is a generalization of a Lee (1991) proposition. The new methodology is illustrated by means of a short simulation study.
KEY WORDS: Calibration estimator; Regression estimator; Range restrictions; Robustness.
1. INTRODUCTION

The problem of outliers is an important one in all branches of statistics. In sampling theory, the background is different from that of parametric statistics, since the objective is often to estimate the total of a variable of interest y. An outlier may have its full weight within the population total. Moreover, methodologists may assume, at the estimation stage, that the values of the units are recorded without error, since the gathered units are often processed within an editing system (Särndal, Swensson and Wretman 1992, section 1.7). This step is part of the sampling procedure in large statistical agencies such as Statistics Canada. Lee (1995) has provided an overview of robustness developments within sampling theory.

Nevertheless, since populations for economic surveys are often asymmetric, some units might be extreme as compared to others, as was discussed by Kish (1965). The complete elimination of such units would lead to biased estimates, while maintaining them with their full weight might make an estimator such as the generalized regression (GREG) estimator highly variable. This would suggest a compromise between bias and variance. When outliers occur, the challenge is to propose robust estimators of the total that are little affected by certain units that deviate sharply from the others. Such estimators should have little bias and a small mean square error. Traditionally, sampling theory has been deeply involved in the development of unbiased or asymptotically design unbiased (ADU) estimators; see for example Särndal et al. (1992, Section 7.12). However, this ADU property is perhaps undesirable within the context of outliers. This was discussed by Chambers and Kokic (1993), who showed the conflict between the ADU property and the robustness of an estimator.

We consider the Horvitz-Thompson (HT) estimator defined by T̂_HT = Σ_s d_k y_k, where d_k = π_k⁻¹, π_k being the inclusion probability. (If A is a set of units, A ⊂ U, then Σ_A is a notation signifying Σ_{k ∈ A}.) Let us assume a positive variable of interest y and an asymmetric population. As the HT estimator is a mean weighted by the d_k, it is vulnerable to large values of y. A unit with a high weight d_k may also have a considerable impact at the estimation stage by yielding variable estimates. Lee (1995) defined these units as influential. An extreme unit is not necessarily influential if its weight d_k is sufficiently small. Traditionally, methodologists have sought to limit the impact of influential units, when they are known prior to sampling, by assigning for example sampling weights close to 1 to extreme units. Gambino (1987) and Lee (1995) have nevertheless discussed situations in which this cannot be done. In a major article, Hidiroglou and Srinath (1981) considered changing the sampling weights when outliers occur. Their approach gave much legitimacy to weight modification within sampling procedures.

Many of the first robust alternatives to the total were based on M-estimators and GM-estimators. Nevertheless, much interest has been shown recently for estimators that also provide good overall robustness, as measured by the breakdown point of an estimator. These concepts are discussed for example in Donoho and Huber (1983), Hampel, Ronchetti, Rousseeuw and Stahel (1986) and Rousseeuw and Leroy (1987). The breakdown point measures the percentage of outliers within the sample that the estimator can tolerate while nonetheless providing a good estimate of a given characteristic of the population. Lee, Ghangurde, Mach and Yung (1992) required estimators of the total that were based on robust estimators with a high breakdown point.

We will be considering calibration estimators of the total T, written as Σ_s w_k y_k. These estimators were developed for example in Deville and Särndal (1992). We are looking for weights w_k that are as close as possible to the sampling weights d_k = π_k⁻¹, while meeting benchmark constraints, denoted CE (also known as calibration constraints),

    Σ_s w_k x_k = T_x,    (1.1)
¹ Pierre Duchesne, Département de mathématiques et de statistique, Université de Montréal, C.P. 6128, succursale Centre-ville, Montréal, Québec, H3C 3J7.
where x_k is a vector of dimension m that corresponds to the available auxiliary information, of known total T_x = Σ_U x_k. These estimators are popular as they are easily interpreted, since methodologists are used to assigning weights w_k to units y_k. Several metrics have been studied to measure the proximity between the d_k and the w_k. The GREG estimator is an important example, with w_k = d_k(1 + (T_x − T̂_xπ)' M_s⁻¹ x_k/c_k), where M_s = Σ_s d_k x_k x_k'/c_k. It is obtained by minimizing the mean square metric Σ_s c_k(w_k − d_k)²/d_k. The constants c_k are weighting factors which can take into account, for example, problems of heteroscedasticity; Särndal (1996) discussed the selection of these constants. However, since the g-weights g_k = w_k/d_k of the GREG estimator are not generally restricted, other metrics have been proposed as a means of limiting them so that they meet certain constraints applicable to the range of values (CARV). Specifically, this makes it possible to avoid undesirable negative weights w_k. See Deville and Särndal (1992), Singh and Mohl (1996) and Stukel, Hidiroglou and Särndal (1996).

As was noted by Fuller, Loughin and Baker (1994, p. 81), there is a link between calibration estimators and robust methods. However, it is wrong to assume that calibration estimators necessarily have good robustness properties, given that all the calibration estimators considered by Deville and Särndal (1992) are asymptotically equivalent to the GREG estimator, which, being ADU, is not robust. Moreover, a traditional calibration estimator is not robust, since it depends linearly on y_k and the weights w_k do not take y_k into account.

The purpose of this paper is to build estimators of the form Σ_s w_k y_k in which the weights w_k provide robustness while meeting constraints on the calibration variables and on the weights w_k themselves. The starting point of our approach is the class of QR estimators of Wright (1983). Let us assume we have available constants {(q_k, r_k), q_k > 0, r_k ≥ 0, ∀k ∈ U} such that Σ_U q_k x_k x_k' > 0 and Σ_s q_k x_k x_k' > 0. (If A is a symmetric matrix, A > 0 means that A is positive definite.) The QR estimators are defined on the basis of the q_k and r_k by the relation

    T̂_yQR = T_x' B̂_q + Σ_s r_k e_k,    (1.2)

where B̂_q assumes a form weighted by the q_k,

    B̂_q = (Σ_s q_k x_k x_k')⁻¹ Σ_s q_k x_k y_k,    (1.3)

and

    e_k = y_k − x_k' B̂_q.    (1.4)

It will be shown in section 2 that the QR estimators are calibration estimators, and a new class of estimators, denoted RQR, will be introduced, also based on the choice of the constants q_k and r_k. It generalizes to a certain extent the QR estimators as well as the class of Deville and Särndal (1992) estimators. The RQR class is interesting in that it makes it possible to obtain weights w_k that are limited to a given interval, say [L, U]. Some of the properties of the QR and RQR classes are provided in section 2.

Section 3 describes applications of the RQR class to the building of robust calibration estimators. The main goal is to modify robust default weights so that they meet the calibration constraints. Section 3.1 discusses the choice of the constants q_k and r_k using arguments suited to calibration estimators. This is a new and unifying approach, and in section 3.2 it guides our choice of q_k and r_k when auxiliary information is available. One important element is the use of a robust estimator allowing for the weighted form (1.3), which provides the q_k; note that this is the case for GM-estimators. Usually, estimators with a high breakdown point do not have a weighted form. Consideration is given to reweighting these estimators, which keeps the breakdown point under control while making it possible to write the estimators in the form (1.3). See Rousseeuw and Leroy (1987) and Simpson and Chang (1997). We then discuss the choice of q_k and r_k so as to calculate an RQR estimator and obtain a robust calibration estimator with restricted weights. Various robust estimators, including the Lee (1991) and Chambers (1986) estimators, are compared in section 4 with RQR estimators, as well as with the GREG estimator and a calibration estimator with limited weights considered in Deville and Särndal (1992). The Lee (1991) estimator can be considered a specific case of our approach, which also allows us to consider a new estimator with restricted weights. Four populations that have already been studied in the literature are considered. It will be noted that the estimators free of weight constraints are subject to negative-weight problems. With the RQR class of estimators, robust estimators having positive weights can be obtained, and they compare well with the estimators free of weight constraints. Finally, conclusions are drawn in section 5. Appendix B contains a list of abbreviations, and Appendix C contains a list of the various constants found in this paper, with definitions.

2. RQR CLASS ESTIMATORS

Consider a finite population U = {1, 2, ..., N} of size N whose total T = Σ_U y_k we wish to estimate for a positive variable of interest y. A sample s of size n_s is drawn following a sampling design p(s). The inclusion probability of unit k is denoted π_k, and the second-order inclusion probabilities are denoted π_kl. We assume that the auxiliary information x_k is known from a reliable source ∀k ∈ U.

Wright (1983) introduced the class of QR estimators written in the form (1.2) with the primary objective of unifying a large number of common estimators. We find the best linear unbiased prediction (BLUP) estimator of Royall (1970), derived from the model-based theory, obtained by assuming (q_k, r_k) = (1/c_k, 1), and the GREG estimator of Cassel, Särndal and Wretman (1976), obtained by considering the choice (q_k, r_k) = (d_k/c_k, d_k).
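As an illustration (this is our sketch, not the paper's S-PLUS code; the function and argument names are hypothetical), a QR estimator (1.2)-(1.4) can be computed as follows:

```python
import numpy as np

def qr_estimator(y, X, q, r, Tx):
    """QR estimator (1.2): y (n,), X (n, m), constants q and r (n,),
    Tx (m,) known population totals of the auxiliary variables."""
    Mq = (X * q[:, None]).T @ X                       # sum_s q_k x_k x_k'
    Bq = np.linalg.solve(Mq, (X * q[:, None]).T @ y)  # B_q of (1.3)
    e = y - X @ Bq                                    # residuals (1.4)
    return Tx @ Bq + (r * e).sum()
```

With q = 1/c and r = 1 this reproduces the BLUP form, and with q = d/c and r = d the GREG form mentioned above.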
Alternately, (1.2) can be written as

    T̂_yQR = Σ_s d_k g_k y_k,

where d_k g_k satisfies

    d_k g_k = r_k + (T_x − T̂_xr)' (Σ_s q_k x_k x_k')⁻¹ q_k x_k,    (2.1)

with T̂_xr = Σ_s r_k x_k. Assuming r_k = d_k, g_k corresponds to the g-weight of the GREG estimator.

The QR estimators are calibration estimators, obtained by minimizing the mean square metric subject to the CEs:

    min_w Σ_s (w_k − r_k)²/q_k  subject to  Σ_s w_k x_k = T_x.    (2.2)

The weights w_k are chosen as close as possible to the r_k, and the q_k are weighting factors. In other words, the starting weights r_k are transformed into calibration weights w_k. The solution to problem (2.2) is w_k = d_k g_k, where d_k g_k is given by formula (2.1).

Nothing, however, guarantees that the weights w_k of the QR estimator are positive, which might be undesirable in practice; see Brewer (1994), who formalized the interpretation of weights. To limit the weights w_k to [L, U], we wish to solve

    min_w Σ_s G(w_k; q_k, r_k)  subject to  Σ_s w_k x_k = T_x and w_k ∈ [L, U].    (2.3)

The calibration estimator of the total is

    T̂_yRQR = Σ_s w_k y_k,    (2.4)

where the w_k are obtained by solving problem (2.3). It is assumed that the function G(w; q, r) is strictly convex and differentiable in w for fixed r and q. We denote g(u; q, r) = G'(u; q, r) and h(u; q, r) = g⁻¹(u; q, r). Moreover, it is assumed that h(0; q, r) = r and h'(0; q, r) = q. The resulting estimators are called restricted QR (RQR) calibration estimators.

Fuller et al. (1994) favoured regression estimators having reasonable invariance properties. It can be shown that RQR estimators are regression and scale equivariant when the constants q_k and r_k are transformation invariant. Useful definitions may be found in Bolfarine and Zacks (1992).

There is no guarantee that a solution to problem (2.3) exists; we refer to the simulation study in Stukel et al. (1996). There may, for example, be realizations of the sample for which even the CEs (1.1) cannot be satisfied: the sample is so imbalanced that it is impossible for the weighted sum of the components, in each dimension, to equal the corresponding population total. The only recourse for the practitioner, then, is to relax the constraints by reducing the number of auxiliary variables. See also the discussion in Fuller et al. (1994). For the calibration estimators considered in Deville and Särndal (1992), it was shown, in their result 1, that a solution exists with probability approaching one. Under certain conditions, this result can be adapted to the RQR class of estimators.

The metric on which we will focus our attention, so that the weights satisfy the CARVs, is a slight modification of case No. 7 in Deville and Särndal (1992). We call it the restricted mean square metric. The G-function that corresponds to this metric is

    G(w_k; q_k, r_k) = (w_k − r_k)²/q_k  if w_k ∈ [L, U],  and +∞ otherwise,

whereas the h-function is

    h(x_k'λ; q_k, r_k) = L  if r_k + q_k x_k'λ < L;
                         r_k + q_k x_k'λ  if r_k + q_k x_k'λ ∈ [L, U];
                         U  if r_k + q_k x_k'λ > U.

Given this modification, it is the weight w_k that is constrained, and not only w_k/d_k as in case No. 7 of Deville and Särndal (1992). In our situation, w_k can "correct" an initial weight that is an outlier. It will be noted that, as formulated, the Deville and Särndal (1992) metric subtly inserts the constraints on the w_k into the G-function. In order to calculate the estimator (2.4) according to this metric, it is sufficient to follow the same approach as Deville and Särndal (1992), which leads to a solution, using Newton's method, of the following equation in λ:

    Σ_s h(x_k'λ; q_k, r_k) x_k = T_x.    (2.5)

The final estimator is T̂_yRQR = Σ_s h(x_k'λ_s; q_k, r_k) y_k, where λ_s is the solution to equation (2.5).
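A minimal sketch (our illustration, under the notation of this section) of the Newton solution of equation (2.5) for the restricted mean square metric, where the h-function clips r_k + q_k x_k'λ to [L, U]:

```python
import numpy as np

def rqr_weights(X, q, r, Tx, L, U, tol=1e-8, max_iter=50):
    """Solve (2.5) for lambda and return w_k = h(x_k' lambda; q_k, r_k)."""
    lam = np.zeros(X.shape[1])
    for _ in range(max_iter):
        u = r + q * (X @ lam)
        w = np.clip(u, L, U)              # h-function of the metric
        F = X.T @ w - Tx                  # residual of equation (2.5)
        if np.abs(F).max() < tol:
            return w
        active = (u > L) & (u < U)        # units not at the bounds
        J = (X[active] * q[active, None]).T @ X[active]   # Jacobian
        lam -= np.linalg.solve(J, F)      # Newton step
    raise RuntimeError("no solution found; constraints may be unsatisfiable")
```

The possible failure signalled by the exception mirrors the remark above that problem (2.3) need not have a solution.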
It is interesting to know whether the weight constraint changes the properties of the estimator as compared to a QR estimator that is free of weight constraints. The following result (proven in the Appendix) shows that, under certain conditions, the two estimators are asymptotically equivalent. In practice, using the restricted mean square metric, we have not observed any significant deviations.

Proposition 1. Under hypotheses C₁ and C₂ given in the Appendix,

    N⁻¹ |T̂_yQR − T̂_yRQR| = o_p(n⁻¹ᐟ²).    (2.6)

This result can possibly be obtained using the approach leading to result No. 5 in Deville and Särndal (1992), dealing with the asymptotic equivalence between the GREG and the calibration estimators considered by those authors. In any case, proposition 1 is of some use to understand the type of conditions needed to reach such a result in our situation.

Since (1.2) shows the same asymptotic behaviour as the quantity T_x' B_U + Σ_s r_k E_k, where E_k = y_k − x_k' B_U and B_U = (Σ_U q_k x_k x_k')⁻¹ Σ_U q_k x_k y_k, this would suggest as variance estimator

    V̂(T̂_yQR) = Σ_s Σ_s (Δ_kl / π_kl)(r_k e_k)(r_l e_l),    (2.7)

where Δ_kl = π_kl − π_k π_l and the e_k are given by (1.4). See Särndal et al. (1989) and Särndal et al. (1992, p. 234). It can be shown that the asymptotic bias of a QR estimator is given, under general conditions, by Σ_U (π_k r_k − 1) E_k. A possible bias estimator is then b̂ = Σ_s d_k (π_k r_k − 1) e_k, which can be used in conjunction with formula (2.7) to build an estimator of the mean square error of a QR or RQR estimator, using proposition 1.

RQR estimators make it possible to obtain calibration estimators with constrained weights: given the sets of constants q_k and r_k, it is sufficient to solve problem (2.3). In the sections which follow, the RQR class is applied within a context of robustness. We will show how to direct the selection of the constants q_k and r_k, chosen in practice using the sample s.

3. BUILDING ROBUST AND CALIBRATED ESTIMATORS

3.1 Methods Based on Weight Reduction and Value Modification

Lee (1995) discussed various propositions based on the weight reduction method for simple random sampling. Once outlier observations have been detected, these methods consist in reducing the weight of the extreme observations. These methods are to be preferred to those which eliminate doubtful observations entirely, since all the observations in the sample are legitimate, as was discussed by Lee et al. (1992).

With respect to calibration estimators, we begin by considering the situation in which no auxiliary information is available and the only constraint is Σ_s w_k = N. This case will guide our path. Consider the QR estimator with q_k = r_k. For the sake of our discussion, we consider the constants r_k known and fixed. The weights minimizing (2.2) subject to Σ_s w_k = N are w_k = C_s(r) r_k, where C_s(r) = N / Σ_s r_k, so that T̂_yQR becomes

    T̂_yQR = C_s(r) Σ_s r_k y_k.    (3.1)

Whenever an observation is extreme, it might represent few units like itself within the population, and its weight should perhaps be reduced. In order to satisfy the CEs, this means finding weights w_k that come closest to the sampling weights d_k for units which are not outliers, but come as close as possible to a reduction factor r for outlier units, where r is chosen by the statistician. Specifically, we write s = s₁ ∪ s₂, where s₁, of cardinality n₁, represents the units that are not reported as outliers, whereas s₂ = s − s₁, of cardinality n₂ = n − n₁, represents the outlier units of s. The reduction factor r will typically satisfy r ≤ d_k, ∀k ∈ s₂. For example, consider the estimator (3.1) with q_k = r_k = B_k = d_k I_1k + r(1 − I_1k), where I_1k is the variable indicating membership in s₁. In this way, the constants q_k and r_k are reduced for the units of s₂ so as to reflect the fact that they are extreme. The estimator (3.1) becomes

    T̂_yQR = C_s(B)(Σ_s1 d_k y_k + r Σ_s2 y_k).

In the case of simple random sampling, d_k = N/n and we obtain T̂_yQR = C_s(B) T̂_B, where T̂_B = (N/n) Σ_s1 y_k + r Σ_s2 y_k is the Bershad (1960) estimator discussed in Lee (1995). Other methods based on weight reduction have been discussed in Lee (1995), who also discussed the choice of r.

One disadvantage of methods based on weight reduction is that the analyst must identify the outlier units. Methods based on value modification avoid this difficulty by providing gradual weight reduction for units that are more extreme. We consider the case of simple random sampling. We assume

    m(y_k; t, a, b) = b + (a − b) min(1, t/y_k).    (3.2)

Thus, this function assigns a starting weight of value a for the y_k ≤ t, and gradually reduces this to a final weight b as y_k becomes extreme. The value t is called the threshold. The constants a, b and t are chosen by the statistician, and several values for a and b have been considered in the literature. Thus, instead of assigning a fixed reduction factor to the units of s₂, we select q_k = r_k = W_k = m(y_k; t, N/n, fN/n), where f is a constant between 0 and 1. The estimator (3.1) becomes

    T̂_yQR = C_s(W) Σ_s W_k y_k = C_s(W) T̂_W.

The estimator T̂_W has been discussed in Gross, Taylor and Lloyd-Smith (1986) as well as in Chambers and Kokic (1993), who called it the winsorized estimator. This is a special case of the approach used by Chambers (1982, 1986). When f = 0, the estimator (3.1) becomes T̂_yQR = C_s(W₁) T̂_W1, with q_k = r_k = W_1k = m(y_k; t, N/n, 0), where T̂_W1 = (N/n)(Σ_st y_k + n₂ t), s_t denoting the part of s containing the units that satisfy y_k ≤ t. The estimator T̂_W1 has been discussed in Lee (1995), as well as in Gross et al. (1986), who called it the type I winsorized estimator. When f = n/N, Gross et al. (1986) called the result the type II winsorized estimator; it has also been discussed in Bruce (1991).

For a πps design, Dalén (1987) brought the design in by taking D_k = m(y_k; t d_k, d_k, 1). Thus, if k and l are two extreme observations such that y_k = y_l, then the observation whose sampling weight is largest will have the higher weight D_k. Selecting r_k = q_k = D_k makes it possible to obtain essentially the Dalén estimator, T̂_yQR = C_s(D) Σ_s D_k y_k. The estimator T̂_D = Σ_s D_k y_k was also studied, for example, in Tambay (1981).
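The estimators of this section share the form (3.1). The sketch below (our illustration, with simulated data) implements the value modification function (3.2) and the winsorized variants under simple random sampling:

```python
import numpy as np

def m(y, t, a, b):
    """Formula (3.2): weight a for y <= t, decreasing gradually to b."""
    return b + (a - b) * np.minimum(1.0, t / y)

def t_yqr(y, rk, N):
    """Estimator (3.1): C_s(r) * sum_s r_k y_k, with C_s(r) = N / sum_s r_k."""
    return (N / rk.sum()) * (rk * y).sum()

N, n = 1000, 50
y = np.random.default_rng(1).lognormal(3.0, 1.0, n)  # skewed sample
t = np.quantile(y, 0.95)            # threshold chosen by the statistician
W = m(y, t, N / n, 0.5 * N / n)     # winsorized, f = 0.5
W1 = m(y, t, N / n, 0.0)            # type I winsorized, f = 0
W2 = m(y, t, N / n, 1.0)            # type II winsorized, f = n/N
print(t_yqr(y, W, N), t_yqr(y, W1, N), t_yqr(y, W2, N))
```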
Table 3.1
Estimators (3.1) Based on Weight Reduction and Value Modification

Estimator            Value of q_k = r_k
Bershad              B_k = d_k I_1k + r(1 − I_1k)
Winsorized           W_k = m(y_k; t, N/n, fN/n)
Winsorized, type I   W_1k = m(y_k; t, N/n, 0)
Winsorized, type II  W_2k = m(y_k; t, N/n, 1)
Dalén                D_k = m(y_k; t d_k, d_k, 1)

Note: m(y_k; t, a, b) = b + (a − b) min(1, t/y_k).
The approach used in this section suggests that we may occasionally seek estimators whose weights are close to constants r_k rather than to the sampling weights d_k. The constants r_k will themselves be chosen close to d_k for the proper units, but will be reduced once a unit is declared extreme. The QR estimators allow the weight reduction and value modification methods to be unified. Methods based on value modification help us choose weights that are adapted to the specific sample s that was selected. As was noted in Chambers and Kokic (1993), this is not surprising, since the problem of outliers occurs after the selection of the sample s: we must use the sample at our disposal to overcome the problem. These methods are generalized in the following section using auxiliary information.
3.2 Estimators of the Total Based on Robust Statistics

One of the first attempts to obtain robust alternatives for population totals using auxiliary information can be found in Chambers (1982, 1986), who proposed a robust ratio estimator based on a BLUP estimator decomposition. One recent extension of the work carried out by Chambers can be found in Welsh and Ronchetti (1998). Gwet and Rivest (1992) also proposed a robust version of the ratio estimator, using a design-based approach in simple random sampling. Rivest and Rouillard (1991) carried out a comparative study of several robust estimators and examined several estimators of the mean square error. For designs with unequal probabilities, Hulliger (1995) considered robustifying the HT estimator when the inclusion probabilities are obtained using auxiliary information. Gwet and Rivest (1992) and Hulliger (1995) considered a version of the influence function for finite populations, emphasizing the need for procedures having good local robustness properties and the use of estimators having limited influence functions. Influence functions are discussed generally in Hampel et al. (1986).

The following sections deal with building robust estimators having constrained weights. The building of such estimators is based on the following steps:
- identifying the constants q_k and r_k, which provides a QR estimator;
- solving problem (2.3), which provides an RQR estimator.

In terms of robustness, the coefficients q_k are selected such that B̂_q is a robust estimator. Thus, the first part of the QR estimator, T_x' B̂_q, provides a good predicted value for the entire population. The second part, Σ_s r_k e_k, corrects the first part for the y_k observed in the sample. The constants r_k ensure that, with this correction, the outliers in the sample do not return with full weight.
3.2.1 Choice of q_k Based on a GM-Estimator

Consider the estimator (1.2) in which B̂ is replaced by a robust estimator B̂_R of a regression coefficient. Such estimators have been discussed, for example, in Huber (1981) and Hampel et al. (1986). We thus obtain

    T̂ = T_x' B̂_R + Σ_s r_k (y_k − x_k' B̂_R).    (3.3)

The estimator (3.3) does not have the form of a QR estimator unless B̂_R assumes a weighted form. This is the case if B̂_R is a GM-estimator defined by the equation

    Σ_s d_k h_k x_k ψ((y_k − x_k' B̂) / (σ h_k^a f_k)) / f_k = 0,    (3.4)

with f_k = c_k^(1/2), since the solution to (3.4) can be expressed as

    B̂_g = (Σ_s d_k h_k^(1−a) û_k x_k x_k' / c_k)⁻¹ Σ_s d_k h_k^(1−a) û_k x_k y_k / c_k,

where

    û_k = ψ((y_k − x_k' B̂_g) / (σ h_k^a f_k)) / ((y_k − x_k' B̂_g) / (σ h_k^a f_k)).

The properties of GM-estimators have been discussed in Simpson and Chang (1997). To simplify our discussion, σ is assumed to be known, and the role of the c_k is the same as in the case of the GREG estimator. The function ψ is determined by the analyst. A common example is the Huber function

    ψ_Hub(x; c) = c if x > c;  x if |x| ≤ c;  −c if x < −c.    (3.5)

A value of c around 2 is often used in calculating GM-estimators. See for example Hampel et al. (1986), Gwet and Rivest (1992) and Hulliger (1995).

The choice of the h_k makes it possible to limit the influence of auxiliary information that is too extreme. The constant a = 0 leads to the Mallows choice, whereas a = 1 yields the Schweppe version. The Schweppe version is sometimes preferred; see Coakley and Hettmansperger (1993) and Hampel et al. (1986, p. 322). When there is minimal auxiliary information, i.e., when only a real variable x_k, ∀k ∈ U, is available, a possible choice for the function h_k is

    h_k = min{1, t / (x_k / med(x_l))},    (3.6)

where med(x_l) denotes the median of the x values. For a πps design, a modification of h_k following Dalén (1987), so as to take the various sampling weights into consideration, would perhaps be desirable. The constant t must be specified by the statistician; a value of t around 1.5 is found in the applications. See for example Rivest and Rouillard (1991), who also provide other choices for the functions h_k.

Writing B̂_g as a weighted estimator makes it possible to write the estimator (3.3) as a QR estimator with

    (q_k, r_k) = (d_k h_k^(1−a) û_k / c_k, r_k).

The choice of the constants r_k is discussed in section 3.2.3.

3.2.2 Choice of q_k Based on a High Breakdown Point Estimator

The choice of a GM-estimator is only a first step towards obtaining a very robust estimator of the total. In fact, although the influence function of GM-estimators is bounded, such estimators do not have a high breakdown point, and their breakdown point usually diminishes with the dimension of the auxiliary information (Rousseeuw and Leroy 1987, p. 13). This section explains how to build robust calibration estimators based on high breakdown point estimators. As such estimators do not usually assume a weighted form, we consider reweighting them. This allows us to obtain, as in the previous section, the constants q_k needed to compute the RQR estimator metric. Specifically, the following weights ũ_k are considered:

    ũ_k = ψ((y_k − x_k' B̂₀) / (σ h_k^a f_k)) / ((y_k − x_k' B̂₀) / (σ h_k^a f_k)),    (3.7)

where B̂₀ is an equivariant estimator with a high breakdown point meeting certain regularity conditions. The reweighted estimator is

    B̂ = (Σ_s d_k h_k^(1−a) ũ_k x_k x_k' / c_k)⁻¹ Σ_s d_k h_k^(1−a) ũ_k x_k y_k / c_k.    (3.8)

The asymptotic properties of this type of estimator have been studied in Simpson and Chang (1997).

The estimator B̂₀ that is considered is the one-step GM-estimator of Coakley and Hettmansperger (1993). This estimator has a high breakdown point. It is obtained as the first iteration of the Newton formula in equation (3.4), where the Schweppe version is used, assuming a = 1. Other robust estimators could have been chosen; however, the efficiency and robustness properties of the Coakley and Hettmansperger (1993) estimator make it a good choice. Thus, the proposed constant is

    q_k = d_k h_k^(1−a) ũ_k / c_k,

with the weights ũ_k computed from B̂_CH, which denotes the Coakley and Hettmansperger (1993) estimator.
3.2.3 Choice of r_k

Once the constants q_k have been determined, the constants r_k must be selected. If r_k = d_k, then under general conditions the QR estimator is an ADU estimator; however, such a choice of r_k yields an estimator that is sensitive to outliers. Alternately, choosing r_k = 0 provides a robust estimator that might be very biased, as was emphasized in Gwet and Rivest (1992, p. 1180). Lee (1991) suggested choosing r_k = θ d_k, where θ ∈ [0, 1]. The asymptotic bias then becomes, under general conditions, (θ − 1) Σ_U E_k, where the E_k represent the residuals obtained by fitting a robust estimator over the entire population. Choosing θ makes it possible to control the bias of the estimator. The discussion in section 3 leads us to suggest constants r_k that are close to the d_k for good units, and reduced gradually for doubtful observations. We suggest choosing

    r_k = d_k u*_k,    (3.9)

where

    u*_k = ψ*((y_k − x_k' B̂_r) / (σ f_k)) / ((y_k − x_k' B̂_r) / (σ f_k)).

The function ψ* which we will be considering is a modification of the Huber function:

    ψ*(x) = x if |x| ≤ a;  a sign(x) if a < |x| ≤ a/b;  bx if |x| > a/b.    (3.10)
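A short sketch (our illustration) of the modified Huber function (3.10) and of the resulting default weights r_k = d_k u*_k of (3.9):

```python
import numpy as np

def psi_star(x, a=9.0, b=0.25):
    """Formula (3.10), with the suggested a = 9 and b = 1/4."""
    ax = np.abs(x)
    return np.where(ax <= a, x,
                    np.where(ax <= a / b, a * np.sign(x), b * x))

def default_weights(y, X, d, B_r, sigma, f):
    """r_k of (3.9): full weight d_k for small standardized residuals,
    shrinking gradually to d_k / 4 for extreme ones."""
    t = (y - X @ B_r) / (sigma * f)
    u_star = np.ones_like(t)
    nz = t != 0
    u_star[nz] = psi_star(t[nz]) / t[nz]
    return d * u_star
```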
We choose a = 9, b = 1/4. The reason for this modification is that we do not want the outliers comprising large residuals or extreme auxiliary information to have weights that are too reduced. In this way, the sampling weight is fully maintained when the argument of u*_k is between −9 and 9, and reduced gradually to one quarter. If the weight of large residuals is reduced too much, then the bias becomes too great; this led to the choice of ψ*. The choice of the constants r_k has been made empirically, and it seems to work well in practice.

Thus, we will consider the choice of constants q_k and r_k given by

    (q_k, r_k) = (d_k h_k^(1−a) ũ_k / c_k, d_k u*_k).    (3.11)

We suggest a generalization of the Lee (1991) proposition: instead of considering r_k = θ d_k where θ is fixed, r_k = d_k u*_k will adapt automatically (or adaptively) to the sample. Having this choice of constants q_k and r_k at our disposal, and with the usual mean square metric, we obtain a QR estimator, but it is subject to negative-weight problems. However, with the constants (3.11), we can consider the restricted mean square metric, solve problem (2.3) and obtain a robust estimator meeting the CEs and the CARVs.

Proposition 1 offers possible leads on the asymptotic behaviour of the resulting RQR estimator as compared to the QR estimator free of weight constraints. However, since the constants (3.11) depend on s in a complex way, there can be no automatic conclusion about asymptotic equivalence. Nevertheless, the simulation study in section 4 seems to suggest a very comparable behaviour for the estimator with and without constraints on the weights, with respect to the Monte Carlo mean square error. Thus, empirical evidence shows that if the q_k and r_k are chosen in such a way that the estimator without constraints on the weights is robust, then the version with constraints on the weights will also be robust.

Finally, the following is a summary of the steps of the proposed method used to obtain a robust RQR estimator:

1. Choice of the constants q_k and r_k. We suggest the constants given in equation (3.11). For this step, it is necessary to compute B̂_CH.
2. Choice of the metric and, if need be, of the constants L and U. These constants are chosen such that L ≤ r_k ≤ U, ∀k ∈ s.
3. Solution of equation (2.5) using Newton's method.
4. Set w_k = h(x_k' λ_s; q_k, r_k), where λ_s is the solution obtained in step 3.
5. Set T̂ = Σ_s w_k y_k, which is the proposed RQR estimator.

The procedure requires a certain number of constants. The constants a, t and c are found in the calculation of q_k and r_k. The choice of these values is nevertheless justified using robustness theory, which helps guide the practitioner. Thus, the value of c in the Huber function can be obtained by taking into account efficiency concerns under normal errors; see Hampel et al. (1986, p. 333) and Gwet and Rivest (1992). The constants a and b are more directly linked to the proposed estimators. The constant b represents the maximum weight reduction that can be allowed when specifying the default weights r_k, and for this reason there is a link with the suggestion made by Lee (1991). The constant which it is most important to specify is possibly the value of a. We suggest here a = 9; however, in our simulations, a value of a between 6 and 12 yielded relatively comparable results. The choice of the limits L and U rests on cosmetic considerations, so that the weights may be limited to an interval. This last consideration is perhaps secondary for the practitioner. As a result, it would seem that the most important aspect is to choose a value of r_k that is close to d_k for the proper values, and then reduced as an observation is deemed extreme; that is the goal which has guided our choice of r_k in this section. Nevertheless, it would be useful to make a choice of r_k that satisfies a certain optimality criterion.

3.3 Chambers Model-Assisted Estimator

Another approach is based on a decomposition proposed by Chambers (1982, 1986), which we now apply to QR estimators. Note that a QR estimator can always be written in the form

    T̂_yQR = Σ_s r_k y_k + (T_x − T̂_xr)' B̂ + Σ_s z_k (y_k − x_k' B̂),

where z_k = (T_x − T̂_xr)' (Σ_s q_l x_l x_l')⁻¹ q_k x_k, T̂_xr = Σ_s r_k x_k, and B̂ is arbitrary. Chambers (1986) had considered the specific case (q_k, r_k) = (1/c_k, 1) for the ratio estimator. In order to limit the influence of outlier units, Chambers proposed

    T̂_CHAM = Σ_s r_k y_k + (T_x − T̂_xr)' B̂ + Σ_s z_k σ ψ(σ⁻¹(y_k − x_k' B̂)).    (3.12)

The function ψ helps limit the influence of large residuals. The choice for B̂ is a robust estimator, e.g., B̂_CH. One function ψ considered in Chambers (1986) was

    ψ(t) = t exp{−0.25 (|t| − 6)₊²},    (3.13)

where (u)₊ = max(u, 0). It is interesting to note that (3.12) can be written as

    T̂_CHAM = T̂_yQR − Σ_s (1 − λ_k)(d_k g_k − r_k) e_k(B̂),

where e_k(B̂) = y_k − x_k' B̂, g_k is defined in formula (2.1) calculated using q_k and r_k, and

    λ_k = σ ψ(σ⁻¹(y_k − x_k' B̂)) / (y_k − x_k' B̂).    (3.14)
Thus, the residuals e_k(B̂) are weighted using a relation similar to formula (3.2). If λ_k = 1, then d_k g_k is applied to the residuals e_k(B̂), and it is easy to verify that we obtain the estimator T̂_yQR. Alternately, if λ_k = 0, we obtain (3.3) if we assume B̂ = B̂_R. If in (3.12) we assume (q_k, r_k) = (1/c_k, 1) and B̂ = B̂_R, then the Chambers estimator represents a compromise between the BLUP and a robust estimator based on a GM-estimator. Note that formally (3.12) is a QR estimator with

    (q_k, r_k') = (d_k h_k^(1−a) û_k / c_k, r_k + (d_k g_k − r_k) λ_k).

However, since r_k' is not necessarily positive, it is not always possible to undertake a change of metric in this case.
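Putting the pieces together, the five-step recipe of section 3.2.3 can be sketched as follows (our illustration, reusing the hypothetical helpers gm_estimator, default_weights and rqr_weights defined earlier; the robust coefficient is approximated by an iterated GM fit rather than B̂_CH itself):

```python
import numpy as np

def robust_rqr_total(y, X, d, h, c_k, sigma, Tx, L, U):
    # Step 1: constants q_k and r_k as in (3.11), with a = 1 (Schweppe),
    # so that h_k^(1-a) = 1.
    B_rob, u = gm_estimator(y, X, d, h, c_k, sigma, a=1.0)
    q = d * u / c_k
    r = default_weights(y, X, d, B_rob, sigma, np.sqrt(c_k))
    # Steps 2-4: restricted mean square metric with limits [L, U];
    # Newton solution of (2.5) gives w_k = h(x_k' lambda_s; q_k, r_k).
    w = rqr_weights(X, q, r, Tx, L, U)
    # Step 5: the proposed robust RQR estimator of the total.
    return (w * y).sum()
```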
4. EMPIRICAL STUDY

To study the performance of the robust calibration estimators, we carried out a Monte Carlo simulation study. We considered four populations comprising data from readily available works on sampling theory. For each population, K = 2000 samples were drawn using simple random sampling, for various sample sizes. Our main objective was to determine whether it is possible to obtain estimators having good empirical properties (bias, mean square error) while satisfying the CEs and the CARVs. Note that all the programs were written in S-PLUS (Statistical Sciences 1991) and are available from the author.
4.1 Populations Under Study

The population graphs can be found in Figure 4.1. The first population, comprising 51 units, can be found in Mosteller and Tukey (1977, p. 560). It consists of the U.S. population in 1960 and in 1970 for each of the 50 states and the federal District of Columbia. It is called POPUSA. Looking at the scattergram of the 1970 population in terms of the 1960 population, we notice that all the units seem to lie on the same straight line, with some good leverage points; an example of a good leverage point is the circled point for this population. The second population, with 34 units, can be found in Singh and Chaudhary (1986, p. 177). It deals with the area of fields sown in 1971 and in 1974. This population is called AREA. There is a bad leverage point (see the circled point) in this population, since the point (4170, 99) does not respect the linear trend of the majority of units.
[Figure 4.1. The four populations under study. Scatterplots: POPUSA (x-axis: population in thousands, 1960), AREA (x-axis: area under wheat in acres, 1971), MU284 (x-axis: total number of seats in municipal council) and MU281 (x-axis: real estate values according to the 1984 assessment, in millions of kronor).]
Samples of size 10 and 15 are drawn from POPUSA and AREA. The third population, namely MU284 of Särndal et al. (1992), comprises the 284 municipalities of Sweden. We considered the variables x = S82, the total number of seats in the municipal council, and y = P85, the population of Sweden in 1985. There are vertical outliers (e.g., the circled point) and one bad leverage point. Finally, we considered the population MU281, made up of MU284 from which the three largest municipalities were excluded. The variables considered were x = REV84, representing the values of landed property based on the 1984 assessment, and y = RMT85, representing municipal tax revenues in 1985. The unit of measurement was one million kronor for both variables. It seems this population has several bad leverage points. Samples of size n = 30 and n = 60 were drawn from MU284 and MU281. Table 4.1 contains the totals for the various populations.
Table 4.1
Totals for the Various Populations and Totals Known From Auxiliary Information

Population   T_x       T         N
POPUSA       179,972   203,923   51
AREA         29,118    6,781     34
MU284        13,500    8,339     284
MU281        757,246   53,124    281
4.2 Description of the Estimators

The two basic estimators were the GREG estimator and the estimator obtained by considering case No. 7 in Deville and Särndal (1992), i.e., a GREG estimator with restricted weights. These estimators are denoted GREG/U and GREG/R respectively. We selected c_k = 1 for the populations POPUSA and AREA, and chose c_k = x_k for the populations MU284 and MU281. Our choice for the c_k was motivated by the relationship between these constants and the heteroscedasticity of the superpopulation model. Of the robust estimators, we studied the Chambers (1986) estimator, denoted CHAM and based on B̂_CH, by considering

    (q_k, r_k) = (d_k û_k(B̂_CH)/c_k, 1 + (d_k g_k − 1) λ_k(B̂_CH)),

where in formula (2.1) (q_k, r_k) = (1/c_k, 1). The constants û_k(B̂_CH) were obtained from formula (3.7). The selection a = 1 was used throughout the simulation. Huber's function ψ was used with the constant c = 1.345 for B̂_CH. The functions h_k are those given by formula (3.6), where we selected t = 1.46. The function λ_k is defined by equation (3.14); the function ψ it involves was that given by equation (3.13). The scale was estimated as in Coakley and Hettmansperger (1993). We also considered the model-assisted BLUP estimator in which the generalized least squares estimator is replaced by the estimator B̂_CH, which we call MODEL. Moreover, we considered the Lee (1991) estimator based on B̂_CH, with r_k = 0.25 d_k, using the mean square metric. We also studied an extension of the Lee (1991) estimator by considering the restricted mean square metric. These estimators are denoted LEE25/U and LEE25/R respectively. Finally, we considered the new method of section 3.2.3, selecting (q_k, r_k) as given by equation (3.11), with the mean square metric and with the restricted mean square metric; these estimators are denoted QRROB/U and QRROB/R respectively. The choice of the function ψ* was given by formula (3.10).
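In summary form (our reading of the description above; the strings are informal labels, not executable formulas), the eight estimators compared in the study are:

```python
# "U" = (unrestricted) mean square metric, "R" = restricted mean square metric.
ESTIMATORS = {
    "GREG/U":  "q_k = d_k/c_k, r_k = d_k, metric U",
    "GREG/R":  "q_k = d_k/c_k, r_k = d_k, metric R",
    "CHAM":    "q_k = d_k u_k(B_CH)/c_k, r_k = 1 + (d_k g_k - 1) lambda_k(B_CH), metric U",
    "MODEL":   "BLUP form (q_k, r_k) = (1/c_k, 1) with GLS replaced by B_CH",
    "LEE25/U": "q_k based on B_CH, r_k = 0.25 d_k, metric U",
    "LEE25/R": "q_k based on B_CH, r_k = 0.25 d_k, metric R",
    "QRROB/U": "(q_k, r_k) as in (3.11), metric U",
    "QRROB/R": "(q_k, r_k) as in (3.11), metric R",
}
```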
Table 4.2
Monte Carlo Results for Sampling From the POPUSA Population

Estimators   VARM   MSEM   CVM    BRM     MIN     MAX    CARV¹   CONV
n = 10
GREG/U       34.90  34.92  2.90   -0.07   -6.24   26.75   86.7
GREG/R       35.29  35.30  2.91   -0.04    0.20   32.00  100.0   98.4
CHAM         32.43  33.75  2.85   -0.56  -19.61   40.96   84.0
MODEL        27.66  30.69  2.72   -0.85  -19.71   40.86   82.8
LEE25/U      27.48  30.07  2.69   -0.79  -19.38   39.64   83.2
LEE25/R      28.67  30.90  2.73   -0.73    0.20   32.00  100.0   98.4
QRROB/U      27.40  28.40  2.61   -0.49  -15.68   40.10   83.2
QRROB/R      28.33  29.18  2.65   -0.45    0.20   32.00  100.0   98.4
n = 15
GREG/U       21.90  21.95  2.30   -0.10   -3.13   15.32   94.7
GREG/R       22.12  22.15  2.31   -0.09    0.20   16.00  100.0   99.5
CHAM         18.11  20.14  2.20   -0.70   -5.79   16.44   92.4
MODEL        15.43  19.03  2.14   -0.93   -6.09   16.92   91.0
LEE25/U      15.44  19.54  2.17   -0.99   -6.19   17.06   90.8
LEE25/R      15.72  19.68  2.18   -0.98    0.20   16.00  100.0   99.5
QRROB/U      14.68  16.44  1.99   -0.65   -4.48   16.41   90.9
QRROB/R      14.85  16.56  2.00   -0.64    0.20   16.00  100.0   99.5

¹ The limits for the CARVs are [0.20, 32] for n = 10 and [0.20, 16] for n = 15.
4.3 Frequency Measurements

The eight estimators of section 4.2 were calculated for each sample. The results can be found in Tables 4.2, 4.3, 4.4 and 4.5. Since one asset of the new methods is the CARVs, statistics were calculated on the weights. The columns MIN and MAX in the tables of results contain the minimum and maximum values of the weights calculated during the simulation for each estimator. Also shown, in the CARV column, is the percentage of samples for which the weights lie within the CARVs. The CONV column gives the percentage of samples for which the limited estimators converged. The intervals [L, U] used for the limited estimators are specified in the tables. In all cases, the various statistics were calculated using the samples for which all the estimators converged.

Another significant feature is related to the bias and the efficiency of the proposed methods. Let T̂ denote an estimator of the total T, and let T̂_i be the estimator of the total calculated using sample i, i = 1, ..., K. The relative Monte Carlo bias BR_M, the mean value E_M(T̂) and the variance V_M are given by the usual formulas, i.e.,

    BR_M = 100 (E_M(T̂) − T)/T,  E_M(T̂) = K⁻¹ Σ_{i=1}^K T̂_i,  and  V_M = K⁻¹ Σ_{i=1}^K (T̂_i − E_M(T̂))².

Our main criterion for efficiency is the Monte Carlo mean square error, defined by MSE_M = K⁻¹ Σ_{i=1}^K (T̂_i − T)². The coefficients of variation CV_M are calculated as (MSE_M)^(1/2)/T. The variance and the mean square error are expressed in millions; the coefficient of variation, the relative bias, the CARVs and the convergence of the limited versions are expressed as percentages.

4.4 Discussion

The POPUSA population had no outliers that did not satisfy the linear model. During sampling, the coefficients of variation of the estimators were small, which could be expected given the trend of the population. The columns MSE and VAR are very similar, indicating that bias is not a problem for this population. All relative biases were less than 1%. The QRROB/U estimator provided a reduction in variance, as compared to GREG/U, that exceeded 21% for n = 10 and 30% for n = 15.

The size of the AREA population was small. This population had a bad leverage point leading to very high empirical relative bias for all the estimators. The GREG/U estimator had a relative bias of more than 7%, in spite of a 44% sampling fraction for this population. The robust estimators had the most significant bias, though it was relatively comparable to the bias of the GREG/U estimator. The most significant reduction in variance was achieved for the QRROB/U estimator, but at the cost of a relative bias of about 10%.

Population MU284 had a vertical outlier and bad leverage points. The robust estimators reduced the variance radically, since they were not affected by the three extreme units in y which were clearly moving away from the linear trend. The CHAM, QRROB/R and QRROB/U estimators were more than four times less variable than the GREG/U estimator. However, this led to a much higher negative bias. All the robust estimators were severely biased. The MODEL estimator showed a negative bias of more than 13%, whereas QRROB/U had a negative bias of the order of 11%. As for QRROB, a better choice of constants in the function ψ* might help remove a larger part of the bias, at the cost of a lower variance reduction. Increasing the sample size to n = 60 made it possible to reduce the bias of the CHAM and QRROB estimators below 10%, but the other robust estimators remained more biased.

Population MU281 contained a fairly large number of bad leverage points. The variance dominated the MSE for this population. The LEE25 estimator was the least variable, with a reduction of more than 35% as compared to GREG/U. However, although θ = 0.25 functions well for this population, our study shows that it is not always the best choice.

Note that all the robust estimators were more efficient than the GREG or its limited version. As was confirmed by the results of Deville and Särndal (1992), the limited version of the GREG estimator showed essentially the same behaviour as the GREG in terms of both bias and Monte Carlo variance for each population. Of all the estimators that were considered, GREG/U and GREG/R were the least biased; the robust versions all exhibited greater bias. However, this is more than offset by the reduction in variance, so that the efficiency of the robust estimators is always greater than that of the GREG/U or GREG/R estimators.

Concerning the constraints on the weights, it will be noted that the GREG/U, CHAM, MODEL, LEE25/U and QRROB/U estimators are all subject to problems of negative weighting, as can be seen in the MIN column. This problem is avoided with the limited estimators. The CARV column shows that the constraints were relatively frequently not met; depending on the population and the sample size, the failure rate varied between 5% and 60%. The general behaviour of the two limited robust estimators was comparable to that of their non-limited versions. Moreover, QRROB/R, in addition to meeting the CARVs, provided interesting efficiency properties as compared to the other robust estimators. The limited versions were less prone to convergence problems when the sample sizes were larger. Note that we had to use wider bands in the case of POPUSA in order to obtain satisfactory convergence rates.
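The summary measures of section 4.3 translate directly into code; a minimal sketch (our illustration, with assumed inputs) follows:

```python
import numpy as np

def monte_carlo_summary(T_hat, T):
    """T_hat: array of K estimates of the total; T: the true total.
    Returns the measures of section 4.3 (BR_M and CV_M in percent)."""
    E_M = T_hat.mean()
    V_M = ((T_hat - E_M) ** 2).mean()      # Monte Carlo variance
    MSE_M = ((T_hat - T) ** 2).mean()      # Monte Carlo mean square error
    return {"BR_M": 100.0 * (E_M - T) / T,
            "V_M": V_M,
            "MSE_M": MSE_M,
            "CV_M": 100.0 * np.sqrt(MSE_M) / T}
```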
Table 4.3
Monte Carlo Results for Sampling From the AREA Population

Estimators   VARM   MSEM   CVM    BRM    MIN    MAX    CARV¹   CONV
n = 10
GREG/U       1.334  1.700  19.23  8.92  -3.35  14.94   86.6
GREG/R       1.295  1.629  18.82  8.53   0.20  14.00  100.0    99.0
CHAM         1.187  1.541  18.30  8.77  -4.09  14.90   87.2
MODEL        1.291  1.580  18.54  7.93  -5.23  16.75   86.8
LEE25/U      1.279  1.593  18.61  8.26  -5.28  16.89   86.6
LEE25/R      1.284  1.596  18.63  8.24   0.20  14.00  100.0    99.0
QRROB/U      1.026  1.440  17.70  9.50  -4.74  15.38   87.6
QRROB/R      1.028  1.437  17.68  9.43   0.20  14.00  100.0    99.0
n = 15
GREG/U       0.940  1.178  16.00  7.18  -1.40   7.03   93.0
GREG/R       0.928  1.154  15.85  7.01   0.20   6.00  100.0    99.8
CHAM         0.708  0.989  14.67  7.82  -1.52   7.92   93.7
MODEL        0.757  0.997  14.73  7.22  -1.66   8.39   93.1
LEE25/U      0.672  1.059  15.18  9.18  -1.68   9.40   92.0
LEE25/R      0.671  1.056  15.15  9.15   0.20   6.00  100.0    99.8
QRROB/U      0.485  0.990  14.68  10.48 -1.59   8.90   93.9
QRROB/R      0.485  0.986  14.64  10.44  0.20   6.00  100.0    99.8

¹ The limits for the CARVs are [0.20, 14] for n = 10 and [0.20, 6] for n = 15.
Table 4.4
Monte Carlo Results for Sampling From the MU284 Population

Estimators   VARM   MSEM   CVM    BRM      MIN     MAX    CARV¹   CONV
n = 30
GREG/U       2.833  2.925  20.51   -3.64   -6.83  23.90   89.8
GREG/R       2.813  2.910  20.46   -3.73    0.20  16.00  100.0    99.2
CHAM         0.645  1.639  15.35  -11.95  -11.80  31.26   77.0
MODEL        0.709  2.037  17.11  -13.82  -12.06  31.91   68.4
LEE25/U      0.887  1.877  16.43  -11.93  -11.06  30.93   73.5
LEE25/R      0.871  1.847  16.30  -11.85    0.20  26.00  100.0    99.2
QRROB/U      0.719  1.532  14.84  -10.81   -9.46  25.84   86.5
QRROB/R      0.720  1.525  14.81  -10.76    0.20  16.00  100.0    99.2
n = 60
GREG/U       1.473  1.489  14.63   -1.49   -1.19  10.03   90.1
GREG/R       1.467  1.484  14.61   -1.57    0.20   7.00  100.0    99.7
CHAM         0.357  0.990  11.93   -9.54   -2.53  15.59   69.8
MODEL        0.380  1.255  13.43  -11.22   -4.93  14.52   58.1
LEE25/U      0.403  1.201  13.14  -10.72   -4.80  14.20   60.3
LEE25/R      0.396  1.203  13.16  -10.78    0.20   7.00  100.0    99.7
QRROB/U      0.308  0.976  11.85   -9.80   -2.36  10.99   86.1
QRROB/R      0.308  0.979  11.87   -9.82    0.20   7.00  100.0    99.7

¹ The limits for the CARVs are [0.20, 16] for n = 30 and [0.20, 7] for n = 60.
Table 4.5
Monte Carlo Results for Sampling From the MU281 Population

Estimators   VARM   MSEM   CVM   BRM    MIN     MAX    CARV¹   CONV
n = 30
GREG/U       17.33  17.35  7.84  -0.26  -38.97  34.56   86.0
GREG/R       17.40  17.41  7.86  -0.24    0.20  25.00  100.0    99.8
CHAM         13.23  13.26  6.86  -0.33  -47.09  39.08   56.9
MODEL        11.30  11.91  6.50   1.47  -66.22  41.43   47.9
LEE25/U      11.21  11.60  6.41   1.17  -59.75  37.03   53.3
LEE25/R      11.26  11.73  6.45   1.29    0.20  25.00  100.0    99.8
QRROB/U      12.92  13.29  6.86   1.15  -54.14  39.73   70.8
QRROB/R      12.94  13.34  6.88   1.20    0.20  25.00  100.0    99.8
n = 60
GREG/U       7.57   7.57   5.18  -0.10  -12.77  15.34   86.4
GREG/R       7.58   7.58   5.18  -0.09    0.20   9.00  100.0    99.9
CHAM         5.85   5.90   4.57  -0.43  -22.97  11.49   51.4
MODEL        4.53   5.23   4.30   1.57  -24.02  14.58   38.7
LEE25/U      4.55   5.18   4.28   1.49  -23.74  14.41   41.2
LEE25/R      4.50   5.21   4.30   1.58    0.20   9.00  100.0    99.9
QRROB/U      5.40   6.16   4.67   1.64  -21.08  21.07   68.6
QRROB/R      5.39   6.17   4.67   1.66    0.20   9.00  100.0    99.9

¹ The limits for the CARVs are [0.20, 25] for n = 30 and [0.20, 9] for n = 60.
5. CONCLUSION

The goal of this paper has been to introduce calibration estimators having good properties of robustness. Traditional calibration estimators are easy to use, since it is sufficient to have a set of starting weights, usually the sampling weights d_k, which are transformed into calibrated weights. The steps used in this paper have been the same, i.e., the robust default weights r_k have been transformed into calibrated weights, and the constants q_k have been chosen such that B̂ is a robust estimator. The proposed choice of r_k is given by formula (3.9), with a = 9, b = 1/4. It remains to develop a theory for the optimal choice of r_k. The suggestion is made, for applications, to vary the constant a, between say 6 and 12, in order to determine the influence of this constant on the estimate. The limits L and U can be used to limit the weights, e.g., to make them all positive. We suggest the general use of L = 0.2 and U = kN/n, where k is about 3.

Note that robust calibration estimators are not meant to replace the GREG estimator, but to be used in conjunction with it. Thus, if the robust estimator and the GREG estimator are very different, a more in-depth analysis might help determine the reason. The proposed estimators could be useful as diagnostic tools.

It would be interesting to pursue the empirical studies of section 4, by examining for example the effect of the sampling design on the proposed procedures. Another important area of development is the estimation of variance. Multipurpose surveys are yet another area of interest. In fact, in applications there is rarely a single variable of interest, and methodologists would like to use a single set of weights for all the variables of interest. In terms of robustness, a solution has been proposed in the conclusion of the paper by Gwet and Rivest (1992), where robust weights were calculated for each variable of interest y^(i), i = 1, ..., I. For a given unit, the final weight corresponds to the minimum weight among the weights obtained. Alternately, to obtain robust and calibrated estimators, we could calculate robust default weights for each variable of interest, providing a set of r_k(y^(i)), and set r_k = min_i r_k(y^(i)), where the minimum is over i = 1, ..., I. These weights could then be transformed into calibrated weights. This procedure should be assessed in greater detail.
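The multipurpose suggestion above amounts to one line of code; a sketch (our illustration) for I variables of interest:

```python
import numpy as np

def combined_default_weights(per_variable_r):
    """per_variable_r: (I, n) array whose row i holds the robust default
    weights r_k(y^(i)); returns r_k = min over i of r_k(y^(i))."""
    return np.asarray(per_variable_r).min(axis=0)
```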
ACKNOWLEDGEMENTS

I wish to thank Carl-Erik Särndal for introducing me to sampling theory and for suggesting that I consider the problem of outliers in sampling theory. I also thank Roch Roy and Christian Léger for helping me during various stages of the development of this paper. My sincere thanks go to the Associate Editor, the Assistant Editor and two referees for comments which led to significant improvements in both content and layout.
APPENDIX A
PROOF OF PROPOSITION 1

Let Δ(u; q, r) = r + qu − h(u; q, r), and let z_k be a variable of interest. We assume the following conditions, where λ_s is a solution to equation (2.5) and λ̃_s = (Σ_s q_k x_k x_k')⁻¹ (T_x − T̂_xr):

    C₁: N⁻¹ Σ_s q_k z_k x_k' (λ̃_s − λ_s) = o_p(n⁻¹ᐟ²);

    C₂: N⁻¹ Σ_s Δ(x_k' λ_s; q_k, r_k) z_k = o_p(n⁻¹ᐟ²).

Note that Σ_s (r_k + q_k x_k' λ̃_s) x_k = T_x, so that the weights r_k + q_k x_k' λ̃_s = d_k g_k of (2.1) satisfy the CEs, and also that Σ_s h(x_k' λ_s; q_k, r_k) x_k = T_x. Thus,

    N⁻¹ (T̂_yQR − T̂_yRQR) = N⁻¹ Σ_s (r_k + q_k x_k' λ̃_s) y_k − N⁻¹ Σ_s h(x_k' λ_s; q_k, r_k) y_k
                         = N⁻¹ Σ_s q_k y_k x_k' (λ̃_s − λ_s) + N⁻¹ Σ_s Δ(x_k' λ_s; q_k, r_k) y_k
                         = o_p(n⁻¹ᐟ²),

using C₁ and C₂ with z_k = y_k, which proves (2.6).

APPENDIX B
LIST OF ABBREVIATIONS

ADU: Asymptotically design unbiased.
BLUP: Best linear unbiased predictor (Royall 1970).
CARV: Constraints applicable to the range of values of the weights w_k, requiring for example that all w_k ∈ [L, U].
CE: Calibration constraints, Σ_s w_k x_k = T_x, where T_x = Σ_U x_k.
CH: Robust estimator proposed by Coakley and Hettmansperger (1993), a one-step GM-estimator that is robust and efficient.
CHAM: Robust Chambers (1982, 1986) estimator.
GM: Generalized M-estimators, derived from robustness theory (see for example Hampel et al. 1986).
GREG: Generalized regression estimator proposed by Cassel et al. (1976).
HT: Horvitz-Thompson estimator Σ_s d_k y_k, where d_k = π_k⁻¹.
QR: Wright (1983) estimators, in the form T_x' B̂_q + Σ_s r_k e_k.
RQR: Generalization of the Wright (1983) estimators, obtained using a general metric as well as constraints on the weights.

APPENDIX C
LIST OF THE PRINCIPAL CONSTANTS

c_k: Factor capable of accounting for heteroscedasticity problems.
d_k: Sampling weights, d_k = π_k⁻¹.
g_k: g-weight, defined by w_k/d_k.
h_k: Quantity used to reduce the influence of outlier auxiliary information in B̂.
π_k, π_kl: Inclusion probabilities of first and second order, respectively.
q_k, r_k: Quantities defining a QR estimator. The q_k are used to build the regression coefficient involved in the first part, T_x' B̂_q; the r_k are used for the second part, Σ_s r_k e_k.
û_k, ũ_k: Weights used to build B̂ in a robust way.
u*_k: Weights used to build a robust correction factor in r_k = d_k u*_k.
w_k: Calibrated weight attributed to y_k, to form Σ_s w_k y_k.
REFERENCES
BERSHAD, M.A. (1960).
Some Observations on Outiiers.
Unpublished memorandum. Statistical Research Division, U.S.
Bureau of the Census.
APPENDIX B
LIST OF ABBREVIATIONS
BOLFARINE, H., and ZACKS, S. (1992). Prediction Theory for
Finite Population. New York: Springer-Verlag.
ADU: Asymptotically Design Unbiased.
BLUP: Best Linear Unbiased Predictor (Royall 1970).
CARV: Constraints applicable to the range of values for
the weights w^, by requiring for example that all
the w^e[L, [/].
CE:
Calibration constraints, Y,s^k^k ~ ^x' where
Px = I t / ^ r
CH:
Robust estimator proposed by Coakley and
Hettmansperger (1993), a single-step GMestimator that is robust and efficient.
CHAM: Robust Chambers (1982, 1986) estimator.
GM: Generalized M estimators, derived from robustness
theory (see for example Hampel et al, 1986).
GREG: Generalized regression estimator proposed by
Cassel ef a/., (1976).
HT:
Horvitz-Thompson estimator Y.s'^kYk' where
BREWER, K.R.W. (1994). Survey sampling inference: Some past
perspectives and present prospects. Pakistan Joumal of Statistics
10,213-233.
BRUCE, A.G. (1991). Robust Estimation and Diagnostics for
Repeated Sample Surveys. Mathematical Statistics Working
Paper 1991/1, Statistics New Zealand.
CASSEL, CM., SARNDAL, C.-E., and WRETMAN, J.H. (1976).
Some results on generalized difference estimation and generalized
regression estimation for finite population. Biometrika 63,615620.
CHAMBERS, R.L. (1982). Robust Finite Population Estimation.
Ph. D. thesis, Johns Hopkins University, Dept. of Biostatistics.
CHAMBERS, R.L. (1986). Outiier robust finite population
estimation. Joumal of the American Statistical Association, 81,
1063-1069.
CHAMBERS, R.L., and KOKIC, P.N. (1993). An integrated
approach for the treatment of outiiers in sub-aimual surveys.
Proceedings on the 49'' Session, International Statistical Institute.
QR:
Wri|ht (1983)
T'B + y r.e..
form
COAKLEY, C.W., and HETTMANSPERGER, T.P. (1993). A
bounded influence, high breakdown, efficient regression
estimator. Joumal of the American Statistical Association 88,
872-880.
RQR:
Generalization of tiie Wright (1983) estimators,
obtained using a general metric as well as
constraints on the weights.
DALfiN, J. (1987). Practical Estimators of a Population Total Which
Reduce the Impact of Large Observations. R & D Report,
Statistics Sweden.
dk = Th-
X
q
'-•s
k
estimators,
in
the
k
Duchesne: Robust Calibration Estimators
56
DEVILLE, J.-C, and SARNDAL, C.-E. (1992). Calibration
estimators in survey sampling. Joumal of the American Statistical
Association, 87, 376-382.
DONOHO, D.L., and HUBER, P.J. (1983). The notion of breakdown
point. In A Festschrift for Erich Lehmann, (Eds. P.J. Bickel, K.A.
Doksum and J.L. Hodges). Belmont, CA: Wadsworth.
FULLER, W.A., LOUGHIN, M.M., and BAKER, H.D. (1994).
Regression weighting in the presence of nonresponse with
application to the 1987-1988 Nationwide Food Consumption
Survey. Survey Methodology, 20, 75-85.
GAMBINO, J. (1987). Dealing With Outiiers: A Look at Some
Methods Used at Statistics Canada. Technical report. Business
Survey Division, Statistics Canada.
GROSS, W.F., BODE, G., TAYLOR, J.M., and LLOYD-SMITH,
CW. (1986). Some finite population estimators which reduce the
contribution of outliers. Proceedings of the Pacific Statistical
Congress. Aucklaud, New Zealand, 20-24 May 1985.
GWET, J.-P., and RIVEST, L.-P. (1992). Outiier resistant
alternatives to the ratio estimator. Joumal of the American
Statistical Association, 87, 1174-1182.
HAMPEL, F.R., RONCHETTI, E.M., ROUSSEEUW, P.J., and
STAHEL, W.A. (1986). Robust Statistics: The Approach Based
on Influence Functions. New York: Wiley.
HIDIROGLOU, M.A., and SRINATH, K.P. (1981). Some estimators
of the population total from simple random samples containing
large units. Joumal ofthe American Statistical Association, 76,
690-695.
HUBER, P.J. (1981). Robust Statistics. ^ev/Yor\i:
Wiley.
HULLIGER, B. (1995).
Outlier robust Horvitz-Thompson
estimators. Survey Methodology, 21,79-87.
KISH, L. (1965). Survey Sampling. New York: Wiley.
LEE, H. (1991). Model-based estimators that are robust to outiiers.
Proceedings of the 1991 Annual Research Conference. U.S.
Bureau of Census.
LEE, H. (1995). Outiiers in business surveys. In Business Survey
Methods, (Eds. B.G. Cox, D.A. Binder, B.N. Chinnappa,
A. Christianson, M.J. CoUedge and P.S. Kott). New York: Wiley.
RIVEST, L.P., and ROUILLARD, E. (1991). M-estimators and
outlier resistant alternatives to the ratio estimator. Proceedings:
Symposium 90, Measurement and Improvement of Data Quality,
Statistics Canada, 245-257.
ROUSSEEUW, P.J., and LEROY, A.M. (1987). Robust Regression
and Outlier Detection. New York: Wiley.
ROYALL, R.M. (1970). On finite population sampling under certain
linear regression models. Biometrika 57, 377-387.
SARNDAL, C.-E. (1996). Efficient estimators with simple variance
in unequal probability sampling. Joumal of the American
Statistical Association, 91, 1289-1300.
SARNDAL, C.-E., SWENSSON, B., and WRETMAN, J.H. (1989).
The weighted residual technique for estimating the variance of the
general regression estimator of the finite population total.
Biometrika, 76, 527-537.
S A R N D A L , C.-E., SWENSSON, B., and WRETMAN, J.H. (1992).
Model Assisted Survey Sampling. New York: Springer-Verlag.
SIMPSON, D.G., and CHANG, Y.-C.l. (1997). Reweighted
approximate GM-estimators: asymptotics and residual-based
graphics. Joumal of Statistical Planning and Inference, 57, 273293.
SINGH, A.C, and MOHL, CA. (1996). Understanding calibration
estimators in survey sampling. Survey Methodology, 22,107-115.
SINGH, D., and CHAUDHARY, F.S. (1986). Theory and Analysis
of Sample Survey Designs. New York: Wiley.
STATISTICAL SCIENCES, INC. (1991). S-PLUS
Manual. Seattle: Statistical Science, Inc.
Reference
STUKEL, D.M., HIDIROGLOU, M.A., and SARNDAL, C.-E.
(1996). Variance estimation for calibration estimators: A
comparison of jackknifing versus Taylor linearizarion. Survey
Methodology, 22, 111-125.
TAMBAY, J.-L. (1988). An integrated approach for the treatment of
outliers in sub-annual surveys. Proceedings on the Section on
Survey Research Methods, American Statistical Association, 229234.
LEE, H., GHANGURDE, P.D., MACH, L., and YUNG, W. (1992).
Outliers in Sample Surveys. Methodology Branch Working Paper
BSMD-92-008E, Statistics Canada.
WELSH, A.H., and RONCHETTI, E. (1998). Bias-calibrated
estimation from sample surveys containing outtiers. Joumal of
the Royal Statistical Society, Series B, 60, 413-428.
MOSTELLER, F., and TUKEY, J.W. (1977). Data Analysis and
Regression, A Second Course in Statistics. Redding, MA:
Addison-Wesley.
WRIGHT, R.L. (1983). Finite population sampling widi multivariate
auxiliary information. Joumal of the American Statistical
Association, 78, 879-884.
57
Survey Methodology, June 1999
Vol. 25, No. 1, pp. 57-66
Statistics Canada
Estimation in Surveys Using Conditional Inclusion Probabilities:
Complex Design
YVES TILLE'
ABSTRACT
This paper investigates a repeated sampling approach to take into account auxiliary information in order to improve the
precision of estimators. The objective is to build an estimator with a small conditional bias by weighting the observed
values by the inverses ofthe conditional inclusion probabilities. A general approximation is proposed in cases when the
auxiliary statistic is a vector of Horvitz-Thompson estimators. This approximation is quite close to the optimal estimator
discussed by Fuller and Isaki (1981), Montanari (1987, 1997), Deville (1992) and Rao (1994, 1997). Next, the optimal
estimator is applied to a stratified sampling design and it is shown that the optimal estimator can be viewed as an generalised
regression estimator for which the stratification indicator variables are also used at the estimation stage. Finally, the
application field of this estimator is discussed in the general context ofthe use of auxiliary information.
KEY WORDS: Conditional estimation; Weighted observation; Generalised regression estimator; Complex survey.
1. INTRODUCTION
At the estimation stage, practitioners of survey sampling
often have auxiliary information available. This information can be the knowledge of a set of population means
or totals. Sometimes, the available information is detailed,
for instance when the values taken by a variable on all the
units ofthe population are known. This information can be
used to improve the precision of the estimators.
Our aim is to dealt with the use of auxiliary information
based on a conditional principle. Conditional inference has
been largely studied in the survey sampling literature.
Indeed, the optimal estimator was discussed by Fuller and
Isaki (1981), Montanari (1987, 1997), Deville (1992) and
Rao (1994, 1997). The conditional properties of the poststratified estimators has been studied by Casady and
ValUant (1993). hi an eariier paper (Tille 1998), a general
technique that allows to build a mean or total estimator that
has a small conditional bias has been proposed for simple
random sampling. This technique is based on the use of
conditional inclusion probabilities and allows one to take
into account auxiliary information without any reference to
a superpopulation model.
In this paper the use of conditional inclusion probabilities is generalised to any sampling design. It is shown that
this technique allows to construct an estimator very similar
to the optimal estimator discussed by Montanari (1987),
Deville (1992) and Rao (1994). This family of estimators
provides a vaUd conditional inference and can also be
viewed as the optimal linear estimator. Next, these estimators are applied in the stratification case and are
compared to the GREG-estimator. The GREG-estimator is
generally conditionally biased. Nevertheless, it is shown
that, in regression, the optimal estimator is a particularly
case of the GREG-estimator. Indeed, when the sttatification
variables are re-used as auxiliary variables in the GREGestimator, it is equal to the optimal estimator. Next, a set
of simulations is given that shows the interest ofthe optimal
estimator in stratification. The gain of precision can be very
important when the stratification variables are very cortelated to the interest variable. Finally we discuss the general
estimation problem in survey sampling that can be viewed
as a third-order problem where three sets of variables
interact: the planning variables, the calibration variables
and the interest variables.
The paper is organised as follows. In section 2 the notation is defined. In section 3, the problem of conditional
inference is presented. In section 4, an approximation ofthe
SCW-estimator is given for complex designs under
technical hypotheses. These hypotheses are discussed in
section 5. In section 6 the optimal estimator and the SCWestimator are compared to the generalised regression
(GREG) estimator in the stratification framework. It is
shown that the optimal estimator can be viewed as a
GREG-estimator for which the stratification indicator
variables are also used a posteriori. Next a set of
simulations is presented in section 7 in order to compare the
discussed estimators. Finally, the problem of interaction
between the design and the auxiliary variables is discussed
in section 8.
2. PROBLEM AND NOTATION
Consider a finite population U = {l,...,k,...,N] and
suppose that a random sample 5 is drawn without replacement from this population foUowing a sampling design p{.).
The probabiUty of selecting the sample s is Pr{S =s) =p{s),
Yves Till6, Crest - Ensai, &ole Nationale de la Statistique et de I'Analyse de I'lnformation, rue Blaise Pascal, Campus de Ker Lann, 35170 Bruz, France,
e-mail: tille@ensai.fr.
58
Tille: Estimation in Surveys Using Conditional Inclusion Probabilities
for aU sc U. The indicator variables /^ take the value 1 if
unit k is in the sample and 0 otherwise, for all ke U. The
inclusion probabiUty of unitfcis 7i^ = £'(/^), where symbol
E{.) is the expectation with respect to the sampling design.
The joint inclusion probability for unit k and / is TI^, =
E{IJi). Let y^ denote the value of the variable y for the
k-th unit of the population. The aim is to estimate the
population mean of y:
supposed that the x^ are known for each unit of the
population. Later, it will be considered the more restrictive
case where only one function ofthe x^ such as
Ntu *
is known. Consider also the Horvitz-Thompson estimator of
X given by
i
y^-^YkN keu
If 7i^>0,for all k€ U the Horvitz-Thompson estimator
(1952) given by
=JLyli
" Nts n,
If 7t^>0, for all keU,x^ is an unbiased estimator of x
£(x„) = i .
The variance of x^^ is given by
N keS 7t^
provides an unbiased estimator of y.
Let T be a statistic. The objective is to estimate y with
a conditional bias as small as possible with respect to
statistic T. Define the first-order conditional inclusion
probabilities to be Jij^ir" Pi^k\T) for aU keU and the
conditional joint inclusion probabilities to be 7tjt,|7- =
E{IJiIT) for all keU,leU,k* I. The simple conditionally
weighted estimator (SCW) is defined by
_ I y
'\T
Nkes
(3.2)
yk
(2.1)
2=Var(x„) = - L i : x , ^ ( l - 7 t , ) .
N^ leu
71/
-|^EE^(-.-,-JA^"^ leU meU JC, Jt^
m*l
(3.3)
Suppose now that vector {y^ x^^)' has a multinormal
distribution. Under this hypothesis, it can be derived a
conditional unbiased estimator (see for instance, Deville
1992). First the conditional bias is computed:
Tt^ir
P(yn\K) = P(yn\K)-y = (K -x)Var(x„)-' Cov(x„,y„).
This estimator is not exactly conditionally unbiased. Indeed,
a conditionally unbiased estimator exists if and only if
7t^I^> 0 for all keU. For this reason, it is useful to enlarge
the definition of conditional unbiasedness: an estimator is
said to be virtually conditionally unbiased (VCU) if the
conditional bias only depends on the units having null
conditional inclusion probabilities. The SCW-estimator is
VCU, indeed:
B{y^^\T)=E{y^.p\T)-y
Tf
^
^*-
This estimator generalises some classic results (see Tille
1998) like post-stratification. Moreover, it allows us to
build an original estimator for a contingency table when the
population marginal totals are known. Unfortunately, the
computation ofthe TI^,^ becomes very difficult in complex
sampling designs. A general approximation for the SCWestimator will however be given when using a vector of
Horvitz-Thompson estimators as auxiliary statistic.
USE OF A COMPLEX AUXILIARY
STATISTIC
Suppose that the auxiliary information is represented by
the vector x^ = {Xi^y...,Xi^;,...,x
:.., ...,Xi^j)' of values taken by the J
auxiliary variables on the k-th unit of U. In a first step, it is
If an estimator of B{yJXj^) is available, the HorvitzThompson estimator can be cortected in the foUowing way:
yc=y.-P(yn\K)
= ^« + (^ - \) Var (xj-i Cov(x„, y„).
This estimator is related to the optimal Unear estimator
discussed by Fuller and Isaki (1981), Montanari (1987) and
Rao (1994). Indeed, Montanari showed that the best
estimator in the sense of the smallest mean square ertor
(MSE)ofthefomi
3'p=)'n +
(^-\P
(3.4)
occurs when p takes the value:
Popr =S-'Cov(i„,?„).
The optimal linear estimator presented by Montanari
leads thus to a very similar result to the conditional
approach, although Montanari did not start with a conditional point of view. In Montanari's approach, the optimal
estimator is found in a class of linear estimators defined by
(3.4) without any reference to conditional properties.
Nevertheless, Rao (1994) has pointed out that this estimator
leads to valid conditional inference. The general problem
Survey Methodology, June 1999
59
of the optimal estimator is that PQ^J. is not known and must
thus be estimated. By estimating POPT> the optimal
properties of the estimator are lost. ^
In order to estimate PQPT' (or ^(ynlx^)) two cases can
be distinguished. In the first one, the values taken by the
auxiUary variable on all the units of the population are
known. In this case, Y. is thus known and
PQPP
is thus estimated by (E is supposed non singular)
PoFr = ^"'Cov2(x„,y„).
By estimating Popp. another asymptotically optimal
(AOPT) estimator can be given:
3 ' A 0 F n " 3 ^ n + { ' ^ n - x ) ' P : OPT-
N^ keu
\
^3',
r^EE
A^'' keU leU
( % - '^t'^/)
Tt^t,
-Y
(X|,-x)y,
N keU
can be unbiasedly estimated by
(3.8)
The difference between the AOPTl and A0PT2 estimator
is die way we estimate Cov (x^, y^^) and £ . However, the
AOPTI-estimator needs more complete auxiliary
information.
The generalised regression (GREG) estimator defined by
Cassel, Samdal and Wretman (1976), Wright (1983),
Samdal, Swensson and Wretman (1992, p. 225) is also an
estimator of the linear class given by expression (3.4). For
the GREG-estimator P is defined by
-1
(3.5)
C«^>K'^n) =NT;E
PGREG
,
keS
^k^k
keU
E
t^t
keU
and can be estimated by
where
""l^kl
^\k-E[K\k^s) N= , ]-Y
,, 71,71,
leU
k I
By using (3.5), afirstasymptotically optimal estimator can
be constructed
kovri =y.-(^-
^l*-x
K)^''jX
^ ^ ^ r
N keS
(3-7)
''k'^k
-'GREG
keS
kes n^c^
where quantities c^>0,^e [/, are weights defined for all tiie
population units. TTie GREG-estimator does not have good
conditional properties. It is generally conditionally biased
(Rao 1994).
{l-n,)^
4. APPROXIMATION OF THE
SCW-ESTIMATOR
Another way to construct a conditionally unbiased
estimator is tofindan approximation ofthe SCW-estimator
given in (2.1). Indeed this estimator has good unbiasedness
properties because it is VCU. If x^^ is used as an auxiliary
statistic, we shall seek an approximation of
n{
\\T
1
Jt^C^
\
In the second case, only the population mean x is
known, E must thus be estimated and Cov(x^, y^) can not
be estimated using (3.5). Montanari proposes to estimate E
and Cov(Xjj, y^j) by the classic Horvitz-Thompson
estimator:
^--X
N' kes
"iX,
(3.6)
n.N
k
E(h\U
V - V - X,X, 7t,,-7l,7t,
N^ kes teS T^k^l
l*k
••kl
If the random vector x^^ takes for instance the value z, we
get by Bayes's theorem that
and
E(l,\i^
N^ kes nl
N^ keS leS TC^Tl,
7t^,
=
z)=Pr(keS\x^=z)=K,
Pr{i^=z\k€S)
Pr(x„ = z)
:
In order to compute the conditional inclusion probabilities, it is thus necessary to know the probability distribution of Xjj unconditionally and conditionally on the
presence of each unit in the sample. Except for some
particular case, this probability distribution is very complex;
for this reason an approximation will be constructed.
60
Tille: Estimation in Surveys Using Conditional Inclusion Probabilities
It is possible to derive the means and variances of x^
unconditionally and conditionally on the presence of each
unit in the sample. Indeed £'(x„) is given in (3.2 ),
Var(x„) in {3.3 ),E{\^\keS)
in (3.6), and
E, = War(^J
kes)
X;X,7l^,
7C,
K.n,
7t,.
A^ leu
l*k
These three hypotheses are verified for simple random
sampling without replacement when only one auxiliary
variable is available. Indeed, in this case, we have 7 = 1,
x , = x , , x = x , i | ^ = X | ^ x „ = x „ . Weget
\
n
n n- I
A^ "'
NN-l
By (3.6), (3.3), (4.9),
7c^ = — , 71^; =
k I
—E E
N'^
5. DISCUSSION ABOUT THE HYPOTHESES
leu
l*k
melJ
m*'
m*k
X;X„
^ki^km
(4.9)
X|^=X +
'•klm
Jt^Jt/Tt^
where 7t^,^ is the third-order inclusion probability. Matrixes
E and E^^ are assumed to be non singular^
As the probability distribution of x^ is generally
unknown, the following three assumptions will be used to
construct an approximation of conditional inclusion
probabilities.
(i) If the sample size n is large, x^ has a multivariate
normal distribution unconditionally and conditionally
on the presence of each unit in the sample.
N
,
a n d 71,•klm
- n
^k
N-l
•"•
(5.11)
n
N-n
N-l
Var(x„)
n n- I n -2
N N-l
N-2
^
n
(5.12)
V a r ( t | f c e 5 ) = ^ M z i O ( - l I ],?-.^fi:^l(5.13)
N-l
{N-2){N-l)n'
where
(ii) R-' -R ' -Oj^j{n-^) for aU keU where R=\-^'^
ij:^Y-m ,V denotesaJxJdiagonal
EV -1/2 R,=V-"2^
matrix having the elements of the diagonal of E on its
diagonal and Oj^j {n"") denotes a matrix of quantities
that when multiplied by « ° remains bounded as n - «>.
<^l-^X(-k-^)'N keU
Now, consider the three hypotheses for this particular
case.
(iii) Y^ = V-''2(X|^ -x) = Oj{n '^'^) where Oj {n "°) denotes - Hypothesis (i) was proved by Madow (1948) under
a vector of quantities that when multiplied by n°
some conditions.
remains bounded as n - «>.
- Hypothesis (ii) becomes
These three hypotheses are made on the sample size. It
is thus supposed that when n increases, A^ increases at least
as quickly as n. Nevertheless, no hypothesis are made on
/ = nIN. Assuming that the hypotheses given in section 3
are verified, the following result gives an approximation of
the SCW-estimator:
Varf;c„ IkeS)
^-^
^-1
Var(x„)
By (5.12) and (5.13), we get
Var(x„ \kES)
_N{n-l)
Var(xJ
" (A^ - 2)n
Result 1: Assuining (i), (ii) and (iii), and if the auxiliary
statistic used is x_, then
^ _l_jN-2n
V=?. *(^-\)'2:'' N keS
=0{n'
(«-•).
^N{n-l)
{x,-x)^
{N - Do,"
(x^-x)^
^ l * - ' '
n \{N-2)
yk
{N-2)
I
= 1 +0
*Op{n-)^y^Q^y
(4 10)
-
where n x O (n"') is a quantity bounded in probability.
Proof of Result 1 is given in the appendix.
Hypothesis (iii) becomes
^\k
\/y^{\)
^
_/^/•„l/2^
0{n"').
(N-l)ol
61
Survey Methodology, June 1999
By (5.11) (5.12), we get
^/N^X^-X
1k =
where
O
In simple random sampUng, these hypotheses can better
be interpreted. Hypothesis (i) is the classic assumption of
normality that was also needed for the construction of the
optimal estimator. In simple random sampling, it is easy to
verify that Hypothesis (iii) implies hypothesis (ii). Both
technical hypotheses simply imply that a particular unit
cannot take a |x^ - x | value much more important than the
other ones.
The three hypotheses are thus valid under simple random
sampling when only one variable is available. This result
can also be extended to stratified sampling when the
number of strata is fixed and the sample size within each
stratum is large. In cluster sampling, when the number of
clusters is large and the clusters are selected with a simple
random sampling design, these hypotheses are still
applicable. Hypothesis (i) was also partially showed by
Rosen (1972) for sampling with unequal probabilities.
Actually, the proof of Rosen is restricted to a rejective
sampling design.
The proposed hypotheses, are generally less restrictive
than a superpopulation model. Indeed, a superpopulation
model is a set of hypotheses on the interest variables while
the three hypotheses presented only affect the auxiliary
variables. In a superpopulation model, the relation between
the interest variable and the auxiliary variables are the most
extensive contribution of the model. In the conditional
approach, no hypothesis is made on the interest variable. If
the hypotheses presented are debatable, it is thus clear that
a superpopulation model is a set of hypotheses much more
restrictive than those used in the conditional approach.
6. APPLICATION TO STRATIFIED SAMPLING
6.1 The Problem
In stratification, auxiUary information is used it a priori
to improve the estimation. In this case, three sets of
variables interact: the stratification variables, the auxiliary
variables used a posteriori and the interest variable.
Suppose that the population is partitioned into H strata
Uf^,h = \,...,H, of size Nf^,h = l,...,H. The population
means of the strata are denoted y^ =A^;," Hkeu yk ^^^
Xf^=Nf^ Ejtey x^. A simple random sample 5^ of fixed
size Wy, (2!A=I "A ~ ") ^^ selected without replacement independently in each stratum. From the general theory of stratification (see for instance Samdal, Swensson and Wretman
1992, p. 100), we get
3'„ = T : E ^hYh and x„ = - J ] A^,x,
N At:= l
N A=l
- E 3't a n d i , = — 5 ^ ^ r
Moreover, we have that
Cov(x„,y„) =
-^t^b'~^'
N^
h-i
' ,E(x.-x,)(y,-n)'
n,
N,-l
keu,
and
1
"
N'^ /i=i
1-/.
n,
1
E (Xt-x,)(x,-x,)'
N^-l keu,
where/^ =nJNf^,h = l,...,H.
6.2 AOPTl-estimator
If keU^, by extending expression (3.6) to stratified
sampUng, we get
A^„^
X|^ = E{xjk€ S) =x +
l-/a
N{N^-i)
\
(x*-x„)
and
Ttj,
A^ kes
I ^
NI
l-f, I
- T E T;^—
E
(x*-x,)y,.
N^h-i N^-l
n, n^kes,
From (3.7) the AOPTl-estimator of can be derived as
W i = i + (x-x„)'
-1
H
2W .
EN,
''=1
"nA
H
Nb
xE
H-.i N^-l
1
^A
E
(Xi-x,)(x,-x,)'
~ 1 keU,
1 -fb 1
E (Xt-x,)y,.
«, n^kes,
The use of this estimator requires the knowledge of very
substantial auxiUary information. The population means x ^
of the auxiliary variables must be known for each stratum as
weU as the stratum sizes A^^. Moreover, the values taken by
the auxiliary variables must be known for each unit of U.
62
Tille: Estimation in Surveys Using Conditional Inclusion Probabilities
However yAOPri ^^^ i" stratification an important
drawback: it is not calibrated on the strata size N^^ Le.,
when the objective consists in estimating the strata sizes
Nf^, generally ^AOiTt '' ^A- ^^^^ drawback can easily be
overcome by centring the interest variable. We thus get:
''GREG
'^^GREG
"
^
Nb
1 -/A 1
H-i N^-l
«,
E
n^kes.
(x,-x,)(>',-y,).
N.GREG ••N,-N{i-ij
yAot^=k*(^-K)'
E^-
A=l
XN,
A= l
"A
2^-f,
"A
"A
-1
E
(x,-x,)(x,-X;,)'
" 1 kes.
1
E
1" A. -- 1 *e5,
(X,-XA)(>'t-3'A)-
The AOPT2-estimator only needs the knowledge of the
population mean vector x and of the stratum sizes N^. It
has however a drawback, the x^ are estimated and thus
JxH degrees of freedom are lost. If the number of strata is
large, this loss of degrees of freedom could increase the
instability of this estimator when 7 x / / is large.
6.4
GREG-Estimator
The GREG-estimator does not take into account the joint
inclusion probabilities. It is given by
)'GREG=>'.*(X-X„)' •
'^ A^
X X '
Y —Y
x*)-*
A=i n^ kes,
q
y^ ****
A=l
E-^E
n^ kes,
E
cYk
C^
|A=I
N.
— E x^x,
«A
kes,
"
The AOPT2-estimator can also be used in stratification.
In this case, from (3.8) we get
1
1-1
Since, in stratified sampling, N,^^ '^H' ^® ^^^
6.3 AOPT2-Estimator
2^-h
X x'
A=i «A kes.
X
^
^ ^ ' 'n,^ TN^-l
T ^ keu,
^ K-XA)(X.-X,)'
L=i
N
= N,„.A^(x-i„)' \y^ ^
Y^E
^^^
h-l n^ kes,
Although this estimator is more stable, it is conditionally
biased. Moreover, if we want to estimate the stratum sizes A^^
by the GREG-estimator, we do not find exactly N^.
Indeed, if y^ = 1 when keU^ and y^ = 0 when k$U^,
then
N
c^
X
A=i n^ kes, q
(6.14)
Expression (6.14) shows that generally ^GREG^'^AThus, the GREG-estimator destroys the stratification effect
because it does not take the stratification into account.
Indeed, the stratification is represented by the joint
inclusion probabilities. In the GREG-estimator, only the
first-order inclusion probabilities are used. On the other
hand, it is easy to verify that the AOPTl and AOPT2estimator of A^^ are exactly equal to A^^. The AOPTestimators is thus calibrated on the N^.
We propose to use the GREG-estimator in the following
cases: when the sample size is small or if the number of
strata is large and when the stratification gives poor
auxiliary information on the interest variable. Indeed, in
this case, the loss of precision due to the loss of degrees of
freedom, wiU be more important than the precision benefit
due to the optimality of the estimator. An interesting
analysis of the benefit due to the optimal estimator is also
given in Montanari (1998).
6.5 GREG-Estimator With use of the Stratification
Variables
A variant of use of the GREG-estimator consists in
re-using the stratification variables at the estimation stage.
Consider the column vector
w.
iz,kl- -'^kh'—'Z,•km X^jt/ ) '
where zkh 1 ifkeUf^ and 0 if not. This vector is thus
composed of the values taken by the indicator variables of
the presence of unit kin the H strata and of the values taken
by the x-auxiliary variables.
Now if w denotes the population mean of vectors w^
and Wjj its Horvitz-Thompson estimator, the GREGestimator using the auxiliary information w is given by
Survey Methodology, June 1999
63
>'GREGW=3'„ + ( W - W J '
" N^
A=l n.
keS,
c.
Y —E
w . yk
A=i n. kes,
c.
•
(6.15)
The presentation of expression (6.15) can be simplified.
Indeed, the foUowing result was proved by Tille (1994) and
generalised by Samdal (1996):
Result 2:
When the stratification variables are re-used at the
estimation stage, and if the c,^ are equal into the strata
{c^. = Cf^,k€Ui^) the GREG-estimator can be written
The populations are generated by means of the following
models: T*,:Xj^ = aj^,y^ = e^.^et/, (total independence),
'^r^k-^k'Yk-^-'^k'^ ^k'^^(/, (dependencebetweenXand
y), 5*3: x^ = a^,y^^ = Xj^ + 2h{k) + ei^,keU, (dependencebetween x, y and the strata), J*^: Jc^ = a^.y^^ = exp(10 + 2x^
(+ I0h{k)-^e^,ks [/, non-linearity and dependence between
X, y and the strata), T^ x^ = a^,y^ = exp(e^ + 3xj^) + 3h{k),
k€U, (non-linearity and dependence between x,y and the
strata), 7*^: x^ = a^, y^ = 3h{k) + e^^, /: 6 (/, (strong dependence between y and the strata), J*.j:Xj^ = aj^,y^ =
50h{k) + ei^,keU, (very strong dependence between y and
the strata), where a^^ and e^^ are independent normal
variable with mean equal to 0 and variance equal to 1, and
h{k) is the number of the stratum of unit k. Results of the
simulations is given in Table 1.
^GREGW
Table 1
Results of 10,000 Simulations
-I
i-(x-x,)' E — E
A=l
(x,-x,)(x,-x,)'
n^C^keS,
"
N
^
x E — ^ E
•P,
^
(Xt-x^)Cy^-y^). (6.16)
T,
•P.
v.
y.
y.
y.
M, 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
A = l n^Ci^kes,
A proof of Result 2 is given in the Appendix. Note that
expression (6.16) is equal to the AOPT2-estimator when
c.=C^
"
n, I - lln.
^,
^A
M2 1.0070 0.0906 0.5180 0.9261 0.9263 1.1047 38.5104
M^ 1.0069 0.0906 0.4835 0.9277 0.9269 1.0015 1.0123
M^ 1.0060 0.0936 0.4850 0.9257 0.9239 1.0006 1.0111
1-/A
for /i = 1,..., H, where C>0 isaconstant. Whenthe/^ are
small and the «^ are large and proportional to the A^^ both
estimators are equivalent. This result shows that with the
conditional approach, the fact that the sampling design is
stratified is automatically taken into account in the estimation method. The GREG-estimator does not take into
account the stratification effect and thus it is necessary to
reintroduce the stratification variables at the estimation
stage so as not to lose the stratification effect.
Table 1 shows that the GREG-estimator provides a good
estimation when the stratification variables are not
cortelated to the interest variable. Nevertheless, the more is
the dependence between the stratification variable and the
interest variable, the more is the gain of precision of
>'AotTic ^^^ ^AOPT2- ^ ^ '°^^ of degrees of freedom ofthe
optimal estimator does not seem to affect the precision for
this sample size. Moreover, the gain obtained by the
knowledge of the population stratum is not significant for
this sample size. For aU these cases, the optimal estimator
is thus clearly preferable to the GREG-estimator.
7. SIMULATIONS
8. A THIRD-ORDER PROBLEM
A set of simulations was carried out in order to compare
the four following estimators: y„,>'Aopric, >'AOFn,
y^^Q.The population is made up of 4 strata of 250 units
(N= 1,000). A stratified sampling design is applied with
proportional allocation. For each simulation, 10,000
samples of size n =100 are selected and the foUowing ratios
has been estimated:
The complexity of determining the conditional weights
is not a specific problem of the SCW-estimator. It is due to
the general problem of estimation with auxiUary information used a posteriori when an auxiliary variable is
already used a priori in the sampling design. This problem
can be presented as a third-order interaction problem among
M, =MSE(y„)/MSE(y„)=l,
-
the interest variables;
M2=MSE(|G^O)/MSE(^„),
-
the sampling design and thus the auxiliary variables
used a priori;
-
the auxiliary variables used a posteriori.
M3 = MSE(^A0Pric)/MSE(y„),
M,=MSE(y^oj^/MSE(yJ.
64
Tille: Estimation in Surveys Using Conditional Inclusion Probabilities
Indeed, the use of auxiliary information at the estimation
stage leads to the following problem: how do these
auxiliary variables used a posteriori interact with the
interest variable through a given sampling design? The
problem being complex, we have to take into account the
relationships between each set of variables above as well as
the third-order interactions among these three sets of
variables.
It is very difficult to find a really operational estimator
which uses the three second-order interactions and the
tiiird-order interaction. For this reason, one can attempt to
simplify the problem. The neutraUsation of one of the
aspects of this problem significantly simplifies the research
of an estimator. Most of the possible simplifications have
already been studied. We can cite some of these:
-
-
-
-
-
If no auxiliary information is used a posteriori (except
the population size N ) v/e can only construct the
Horvitz-Thompson estimator or Hajek's ratio (1971).
Searching general solutions using auxiliary information
for simple random sampling does not pose major
problems. In this case, no auxiUary information is used
a priori.
Using a superpopulation model allows one to fix a
relation existing between the interest variable and the
auxiliary variables used a posteriori. In this case, it is
possible to determine the optimal estimator (under the
model).
For the GREG-estimator and also for the calibration
methods see Deville and Samdal 1992), in the designbased inference framework,only thefirst-orderprobabilities are retained from the sampling design. A simple
random sampling is thus treated in the same way as a
stratified design for which the first-order inclusion
probabilities are all equal. For this reason, a regression
estimator applied to a stratified design generally
destroys the calibration on the stratum frequencies
given by the a priori stratification. In this case, the
simpUfication arises because aU the contiibutions of the
auxiliary variables used a priori to the sampling design
can be described only by the first-order probabilities.
Finally, for the optimal linear estimator, it is implicitly
supposed that the dependence between HorvitzThompson estimators ofthe variables x and y is linear.
Obviously, these estimators neglect the non-linear
dependence between the estimators. Nevertheless, it
takes into account the joint inclusion probabilities.
When the sampling design is stratified, the estimator
remains calibrated on the population stratum
frequencies.
The CW-estimator takes into account this third-order
interaction. Moreover, in this case, auxiliary information
does not necessarily intervene in a linear way. The weights
depend on both the sampUng design and the auxiliary
statistic. These weights appUed to the values taken by the
interest variable take into account all the interactions
between the three variable groups.
The methods using conditional inclusion probabilities
are interesting for different reasons: they give a general
frameallowing to search and conceive estimators using
auxiliary information without reference to a superpopulation model and lead to valid conditional inference. They
bring into prominence all the complexity of the estimation
problem with auxiliary nformation. According to the known
auxiliary information, we can find either known results (as
for example post-stratification) or very complex and not
really operational estimators. However, a first approximation leads to a known result, Le. the optimal linear
estimator.
ACKNOWLEDGEMENTS
The author is grateful to two anonymous referees and an
Associate Editor for constructive suggestions and to
Professor Carl Samdal for interesting comments on a
previous version of this paper.
APPENDIX: PROOF OF RESULT 1 AND 2
Lemma 1 will be used in the proof of Result 1.
Lemma 1 If R^' - R'' =0^xy(n"'), then | R J - ' | R | =
1 + 0{n"'), where R and R^ are defined as in hypothesis
(ii).
Proof
[R;'-R-']R=
Oj^j{n-')R
and thus
|R,'R|=|/+0,,,(n-')R| =
[1 +0(n-')]^ + 0(n-') = l + 0 ( n - ' )
where I is a 7x7 identity matrix. Thus,
R,
1
= 1 +0(n"').
|R|
l+0(«-')
Note that lemma 1 is a consequence of hypothesis (ii)
Proof of Result 1
If we define
d, = - ^ , for an keU.
A^Tl,
by hypothesis (i), we get:
«Pr(x„)
«i(x„)
Tt.i^A^
Nn^Pr{xJkeS)
/(xJ
Vi(x„)
Survey Methodology, June 1999
65
where/(resp. /^) is the density function of a multivariate
normal variable with mean x (resp. x,^) and variancecovariance matrix E (resp. E^).Thus,
= ^41. O/n-')].
-l-exp-l(i„-x-)'E-(x„-x-)
«t(x„)=^/i
1
c, = d,[l . 0 ( / z - ' ) f e x p | l i : ' 0 , , / « - > ) ^ „ j
(8.21)
(8.17)
-exp--(x„-X|,)'E;'(i„-X|,)
By (8.20) and (8.21), we get
«(x„) =rfj 1-0^-'(«-')]
If we also note
R=V-''^EV-''^,
{l-T*K'-0.,.(n-'))i%0^(«-'))
R,=V-'^E^V-'^,
= c/,{i-Y;R-'i;.o^(n-')}
x: = V-''^(i„-i),
= ^.{l-(x-x„)'E''(X|,-x).0^(n.,)}.
Finally, we get
and
c, = ^ , ^ ^ ^ e x p - ^ - x : ' ( R " ' - R - • ) x „ ^
IRI''^
2
^
(8.18)
'
V = " E ^^(xjyi
'^keS
=?.^(x--x„)'E-'iE^^^^^^o>N keS
71.
we get
1 | R r ' ^ e x p - - ^ x c; RD-lC^<^
-'x
''t(x„)=^t-
Proof of Result 2
Rj--'^exp-l(l;-Y,)'R-;(i;-Y,)
^texp-Y,R-'(7,-2x„^).
In Samdal (1980), we see that the GREG-estimator
presented in (6.15) can also be written:
(8.19)
By using a Taylor development for the vector y^ of (8.19),
we get
«.(x„)=c,(l-Y,R^'x„^)+/?(Yr).
(8.20)
where
R(yT)-Ck e x p | Y r R ; ' ( y f - 2 x : )
^GREG.=i-^-'(i;vW^-l5n5'W5)'
(w;c-;n-;w,)->w^^c,'n-;y,
where l^(resp. 1^) is a column vector composed of A^
(resp. n) ones, 11^ (resp. 11^) is a diagonal matrix having
the inclusion probabilities of the population (resp. sample)
units on its diagonal, C^ is a diagonal matiix having the c^
of the sample units on its diagonal, y^ is a column vector
composed of the values taken by the interest variable y in
the sample.
Z-IH Xj
xy;K(yr-x:)][R^'(Yf-x:)]'-Ri')Y.
,.(0)
and Yt is a vector whose elements are included between
the cortespondent elements of y^ and 0. By hypothesis (iii),
we directly get
w =
^w
Z-NH X/v
and W^ is a n X ( ^ + 7) matrix composed ofthe n rows of
W^ cortesponding to the units selected in the sample.
R(yf^) = o(n-').
The matrix to invert can be partitioned into four parts:
On the other hand, we have by hypothesis (ii), lemma 1 and
(8.18) that
(w;c-;n-'w,)
A D
D' B
Tille: Estimation in Surveys Using Conditional Inclusion Probabilities
66
where A is an //x H matrix having N^ICf^,h = 1,..., H, on
its diagonal,
B = E^Ex,x;
DEVILLE J.-C. (1992). Constrained samples, conditional inference,
weighting: three aspects of the utilisation of auxiliary information.
Proceeding ofthe Workshop: Auxiliary Information in Surveys,
Orebro.
DEVILLE J.-C, and SARNDAL, C.-E. (1992). Calibration
estimators in survey sampling. Joumal of the American Statistical
Association, 376-382.
and
N,x,
D'
A^^x^
By using the technique of matiix inversion by partition, we
get
( A - D B ' D ' ) i -A 'DQ
irr-li
(w;c-'n,w,)
-QD'A •
Q
0.
(i;w;.-i;n;w,)
MADOW, W.G. (1948). On the Hmiting distributions of estimates
based on samples from finite universes. Annals of Mathematical
Statistics, 19,535-545.
x„-x
where 0^ is a column vector composed of H zeros, we get
(i;w^-i;n-'w,)'(w.;c-'n-;w,)-'
= (i„-x)'Q[-D'A->I(,,,)]
where Ly^y) is a 7 x 7 identity matrix. Since
Q=
[-DA-' I^j^j^] = [-x ... - x ^ J(y^,)]
RAO, J.N.K. (1997). Development in sample survey theory: an
appraisal. Canadian Joumal of Statistics, 25, 1-21.
and
ROSEN, B. (1972). Asymptotic theory for successive sampling with
varying probabilities without replacement. Annals of
Mathematical Statistics, 43, 373-397.
^5^1
SARNDAL, C.-E. (1980). On 71-inverse weighting versus best linear
unbiased weighting in probability sampling. Biometrika, 67,
639-650.
^H^H
M
Y^Yx.yk
*=1 n^c^kes,
MONTANARI, G.E. (1997). On conditional properties of finite
population mean estimators. Proceeding ofthe 5Ist Session ofthe
Intemational Statistical Institute, Contributed paper, 351-352.
RAO, J.N.K. (1994). Estimating totals and distribution functions
using auxiliary information at the estimation stage. Joumal of
Official Statistics, 10, 153-165.
A = l n^c^kes,
H
MONTANARI, G. E. (1987). Post sampling efficient QR-prediction
in large sample survey. International Statistical Review, 55,
191-202.
MONTANARI, G. E. (1998). On regression estimation of finite
population means. Survey Methodology, 24,69-77.
E—E(x,-x,)(x,-i,)'[ ,
w;c-;ni'y,^
H A JEK, J. (1971). Comment on a essay of D. Basu. Foundations of
statistical inference, (fids. V.P. Godambe, et D.A. SprotO,
Toronto: Holt, Rinehart and Winston, 236.
HORVITZ, D.G., and THOMPSON, D.J. (1952).A generalization of
sampling without replacement from a finite universe. Joumal of
the American Statistical Association, 7, 663-685.
where Q = (B - D'A"'D)"'. Since
r-ll
FULLER, W.A, and ISAKI, C.T. (1981). Survey design under
superpopulation models. In Current Topics in Survey Sampling.
(Eds. D. Krewski, R. Platek, J.N.K. Rao, and M.P. Singh). New
York: Academic Press, 196-226.
(8.23)
we get Result 2 by multiplication of (8.22) and (8.23).
REFERENCES
CASADY, R.J., and VALLIANT, R. (1993). Conditional properties
of post-stratified estimators under normal theory. Survey
Methoddology, 19,183-192.
CASSEL, C.-M., SARNDAL, C.-E., and WRETMAN, J. H. (1976).
Some results on generalized difference estimation and generalized
regression estimation for finite population. Biometrika, 63,
615-620.
SARNDAL, C.-E (1996). Efficient estimators with simple variance
in unequal probability sampling. Joumal of the American
Statistical Association, 91, 1289-1300.
S A R N D A L , C.-E., SWENSSON, B., and WRETMAN (1992). Model
Assisted Survey Sampling. New York: Springer Veriag.
TILLE, Y. (1994). Utilisation d'information auxiliaire en th6orie des
sondages sans r6f6rence & un mod&le. Ph.D Thesis, Universite
Libre de Braxelles, Institut de Statistique.
TILLE, Y. (1998). Estimation in surveys using conditional inclusion
probabilities: simple random sampling. Intemational Statistical
Review, 66, 303-322.
WRIGHT, R.L. (1983). Finite population with multivariate auxiliary
information. Joumal ofthe American Statistical Association, 78,
879-883.
67
Survey Methodology, June 1999
Vol. 25, No. 1, pp. 67-72
Statistics Canada
On Robust Small Area Estimation Using a Simple
Random Effects Model
N.G.N. PRASAD and J.N.K. RAO'
ABSTRACT
Robust small area estimation is studied under a simple random effects model consisting of a basic (or fixed effects) model
and a linking model that treats the fixed effects as realizations of a random variable. Under this model a model-assisted
estimator of a small area mean is obtained. This estimator depends on the survey weights and remains design-consistent.
A model-based estimator of its mean squared error (MSE) is also obtained. Simulation results suggest that the proposed
estimator and Kott's (1989) model-assisted estimator are equally efficient, and that the proposed MSE estimator is often
much more stable than Kott's MSE estimator, even under moderate deviations ofthe linking model. The method is also
extended to nested error regression models.
KEY WORDS: Design consistent; Linking model; Mean squared error; Survey weights.
1. INTRODUCTION
Unit-level random effects models are often used in small
area estimation to obtain efficient model-based estimators
of small area means. Such estimators typically do not make
use ofthe survey weights {e.g., Ghosh and Meeden 1986;
Battese, Harter and Fuller 1988; Prasad and Rao 1990). As
a result, the estimators are not design consistent unless the
sampling design is self-weighting within areas. We refer the
reader to Ghosh and Rao (1994) for an appraisal of small
area estimation methods.
Kott (1989) advocated the use of design-consistent
model-based estimators {Le., model assisted estimators)
because such estimators provide protection against model
failure as the small area sample size increases. He derived
a design-consistent estimator of a small area mean under a
simple random effects model. This model has two components: the basic (or fixed effects) model and the linking
model. The basic model is given by
Assuming that the model (1) also holds for the sample
{y. ,7 = 1,2, ...,n:, i = l,2, ...,m} and combining the sample
model with the linking model, Kott (1989) obtained the
familiar unit-level random effects model
+ V. +
I
e..,i
ij'-'
1,2, . . . , « • / = 1,2, ...,m,
(3)
also calledtiiecomponents-of-variance model. It is customary to assume equal variances a, = a^, although the case of
random error variances has also been studied (Kleffe and
Rao 1992; Arora and Lahiri 1997).
2
9
Assuming a, =a , Kott (1989) derived an efficient
estimator Q^^ of 9,. which is both model-unbiased under (3)
and design-consistent. He also proposed an estimator of its
mean squared ertor (MSE) which is model unbiased under
the basic model (1) as well as design-consistent. But this
MSE estimator can be quite unstable and can even take
negative values, as noted by Kott (1989) in his empirical
example. Kott (1989) used his MSE estimators mainly to
Yij = Q, •" ^ij' 7 = 1,2,..., N.; 1 = 1,2,..., m
(1)compare the overall reduction in MSE from using Q-^r in
place of a direct design-based estimator y .^ given by (4)
where the y. are the population values and the e.. are
below. He remarked that more stable MSE estimators are
y
" 2
uncortelated random ertors with mean zero and variance o, needed.
The main purpose of this paper is to obtain a pseudo
for each small area i{= 1,2, .„,m). For simplicity, we
empirical
best linear unbiased prediction (EBLUP) estimator
take 8 as the smaU area mean F = y.y. / A^., where A^. is
the number of population units in the i-th area. Note that of 9; which depends on the survey weights and is designconsistent (section 2). A stable model-based MSE estimator
Y. = Q. + E. and E. = Y,je.jlN. = 0 if A^,. is large.
The linking model assumes that 9, is a realization of a is also obtained (section 3). Results of a simulation study in
section 4 show that the proposed MSE estimator is often
random variable satisfying the model
0. = M + V.
(2) much more stable than the MSE estimator of Kott, as
measured by their coefficient of variation, even under
where the v. are uncortelated random variables with mean moderate deviations of the linking model (2). Results under
zero and variance o^,. Further, {v^.} and {e..} are assumed the simple model (3) are also extended to a nested ertor
to be uncortelated.
regression model (section 5).
N.G.N. Prasad, Department of Mathematical Sciences, University of Alberta, Edmonton, Alberta, T6G 201; J.N.K. Rao, Department of Mathematics and
Statistics, Carleton University, Ottawa, Ontario, KIS 5B6.
Prasad and Rao: On Robust Small Area Estimation Using a Simple Random Effects Model
68
2.
PSEUDO EBLUP ESTIMATOR
Suppose w.. denotes the basic design weight attached to
the y-th sample unit {j = l,2,...,n.) in the i-th area
(/ = 1,2,..., m). A direct design-based estimator of 9,. is
then given by the ratio estimator
The estimator 9. will be referred to as pseudo-EBLUP
estimator. We use standard estimators of oJ and o^, based
on the within-area sums of squares
TT \ 2
!2., = E E (Yij-Yi)
•
J
and the between-area sums of squares
yi.-Ej^yyjEj^o
= Ej^yyii
(4)
Qb =
where w.. = vv../ ^.vv... The direct estimator y .^ is designconsistent but fails to borrow sti-ength from the other areas.
To get a more efficient estimator, we consider the
following reduced model obtained from the combined
model (3) with o, = o^:
En,{yry)"'
i
where y = £^«,y, / Y^i^t is the overall sample mean. We have
o' =
Q./{Enrm^
and 6^ = ma\{6^, 0) where
o, =
(5)
where the e. are uncortelated random variables with mean
zero and variance 5^ = o V ^,7 • The reduced model (5) is
an area-level model similar to the well-known Fay-Herriot
model (Fay and Herriot 1979). It now follows from the
standard best linear unbiased prediction (BLUP) theory
{e.g., Prasad and Rao 1990) that the BLUP estimator of
0. = |i + V; for the reduced model (5) is given by
0 , = M,
(6)
+ V.
where
tiw^^iw
^w'
with pi =y.Y y / ^ y - and Y =Ov/(o„+5.). Note
that 9,. is different from tiie BLUJ* estimator under the full
model (3). We therefore denote 0 . as a pseudo-BLUP estimator. The estimator (6) may also be written as a convex
combination of the direct estimator y. and jl :
0.=Y.
V.
+ ( 1 - Y ) M -
I
V
' m-'
+(1-Y
iw
^
)M
' iw'
,
with
«*=E",-E"//E",It may be noted that o^ and o^ are either not estimable or
poorly estimated from the reduced model (5) due to identifiability problems. Following Kackar and Harville (1984),
it can be shown that the pseudo-EBLUP estimator 9,. is
model-unbiased for 9^ under the original model (3) for
symmetricaUy distributed errors {v^.} and {e.j}, not necessarily normal. It is also design consistent, assuming that
n . £ . w | is bounded as n. increases, because Y,,^ converges
in probability to 1 as «,-«> regardless ofthe validity ofthe
model (3), assuming a^ and 6^ converge in probability to
some values, say, 6* and o'^.
Kott's (1989) model-based estimator of 9, is obtained by
taking a weighted combination of y.,^ and Y.i*i^i' Y p ^^^^
is,
/;.(a.,c«) = ( l - a , . ) y , , + a , E ^ / " > ^ / .
l*i
(7)
The estimator 0_. depends on the parameters a^ and a^
which are generally unknown in practice. We therefore
replace oJ and o^ in (7) by model-consistent estimators 6^
and 6^ under the original unit-level model (3) to obtain the
estimator
0.=Y
[Q,-{m-l)c']ln*
(8)
and then minimizing the model mean squared ertor (MSE)
of/j.(a., c^'^) with respect to a. and c/'' subject to modelunbiasedness condition: X/*,^/ ~ 1- This leads to
9,^=/;.(ct.,a('>)
with
«,=
" w'
;/JE»',j*Ef',7",*( •*!:«';•"](»'.'#)
w.
where
and
and
A(')
r* w
^-^i
* iw y iw f
I-^i
» iw'
i<f^)-nr]/E
'
h*i
(d:/d^) + «;'
(9)
Survey Methodology, June 1999
69
-2
The estimator 0^.^^ is also model-unbiased and design- see Appendix 1. The variances and covariances of a^
and
consistent. In a previous version of this paper, we proposed 6^ are also given in the Appendix 1. It can be shown that
an estimator similar to (9). It uses the best estimators of \i g^.{6^,,6^) + gj.{a^,6^) is approximately unbiased for
under the unit-level model, based on the unweighted means y., g^i{ol,a^) in the sense that its bias is of lower order than
rather than fi^, the best estimator of \i under the reduced m"' (see Appendix 2). Similarly, g2^{al,6^) and
model (4), based on the survey-weighted means y. .
^3,.(6J,,6^) are approximately unbiased for ^^/(Ov.o^)
and g^.{al,a^), respectively. It now follows that an approximately model-unbiased estimator of MSE(9j.) is given by
3. ESTIMATORS OF MSE
'-^2 -2
.^2 -2>
:;2 ; ; 2 .
mse(9,.) = g„.(a;,
6^) + ^^..(a;,
6^) + 2g,.{6l
6^). (15)
It is straightforward to derive the MSE of the pseudoBLUP estimator 9 . under the unit level model (3). We For the estimator 9 .j^ given by (9), Kott (1989) proposed an
have
estimator of MSE as
MSE(9 .) =£(9 r%)-gii«'
,2 „ 2 N
,2 „2^
^') ^82i«'
^ )
(10)
with
and
g2,(o^a2) = o^(l-Y.^)2/5;^.Y,v
.2 „2^
The leading term, ^1,(0,,, o ) is of order 0(1), while the
second term, g2i(^v' ^^)' due to estimation of \i is of order
0 ( / M " ' ) for large m.
A naive MSE estimator of the pseudo-EBLUP estimator
9. is obtained by estimating MSE(9j.) given by (10):
msef^(9,.) = g,,.(a^,, 6^) + g2i{ol,6^).
(12)
under normaUty of the ertors {y,} and [ey] so that
MSE(§,.) is always smallertiianMSE(0 j); see Kackar and
Harville (1984).
To get a "cortect" estimator of MSE(9(), we first
approximate the second order term J?(9. - 9 .)^in (12) for
large m, assuming that {v,} and {eij} are normally distributed. Following Prasad and Rao (1990), we have
where the neglected terms are of lower order than m"', and
{V{dl)-2{ollo^)Cov{dl,e^)^
2/„2\2\r„^rA2\\.
(a:/a^)^Var(6^)};
-E^( 0 - \ 2
/
yi\
(16)
where v *(y .^^) is both a design-consistent estimator ofthe
design-MSE of y .^ and a model-unbiased estimator of the
model-variance of y .^ under the basic model (1). Since d.
converges in probability to zero as n.^°°, it follows from
(16) that mse(9j.j^.) is also both design-consistent and model
unbiased assuming only the basic model (1). However,
mse(0j.^) is unstable and can even take negative values
when d; exceeds 0.5, as noted by Kott (1989).
Note that our MSE estimator, mse(0|.) is based on the full
model (3) obtained by combining the basic model (1) with
the linking model (2). However, our simulation results in
section 4 show that it may perform well even under moderate
deviations from the Unking model.
(11)
But (11) could lead to significant underestimation of
MSE (9 ) because it ignores the uncertainty associated with
a„ and o^. Note that
MSE(9,.) = MSE(9,.) + E{Q. -Qf
mse(9.j,) = ( l - 2 d , . ) v ' ( y , J ++ «;[:
at y,,
(14)
4. SIMULATION STUDY
We conducted a limited simulation study to evaluate the
performances ofthe proposed estimator 9^., given by (8), and
its estimator of MSE, given by (15), relative to Kott's
estimator 9 .^, given by (9), and its estimator of MSE, given
by (16). We studied the performances under two different
approaches: (i) For each simulation run, a finite population
of m = 30 small areas with A^^. = 200 population units in each
area is generated from the assumed unit-level model and then
a PPS (probability proportional to size) sample within each
small area is drawn independently, using n. = 20. (ii) A
fixed finite population is first generated from the assumed
unit-level model and then for each simulation run a PPS
sample within each small area is drawn independently,
employing the fixed finite population. Approach (i) refers
to both the design and the linking model whereas approach
(ii) is design-based in the sense that it refers only to the
design. The ertors {v,.} and {e..} are assumed to be
normally distributed in generating the finite populations
{y.., i = l,2,..., 30; j = 1,2,..., 200}. We considered two
cases: (1) The linking model (2) is true with |i = 50. (2) The
linking model is violated by letting |i vary across areas:
|i,. = 50,/ = l,2,..., 10; ^1,. = 55, J = 11,12, ...,20; n,. = 60,
70
Prasad and Rao: On Robust Small Area Estimation Using a Simple Random Effects Model
J = 21, 22,..., 30. To implement PPS sampling within
each area, size measures z..(/ = l,2,...,30;y = l,2,...,200)
were generated from an exponential distribution with mean
200. Using these z-values, we computed selection
probabilities py = Zy I £ Zy for each area i and then used
them to select PPS with replacement samples of sizes
n. = n,hy taking n = 20, and the associated sample values
{y..} were observed.
The basic design weights are given by $w_{ij} = (n p_{ij})^{-1}$, so that the normalized weights are $\tilde{w}_{ij} = w_{ij} / \sum_j w_{ij} = p_{ij}^{-1} / \sum_j p_{ij}^{-1}$. Using these weights and the associated sample values $y_{ij}$, we computed the estimates $\hat{\theta}_i$ and $\hat{\theta}_{iK}$ and associated estimates of MSE, and also the direct ratio estimate $\bar{y}_{iw}$, for each simulation run; the formula for $v^*(\bar{y}_{iw})$ under PPS sampling is given in Appendix 3. This process was repeated $R = 10{,}000$ times to get from each run $r\ (= 1, 2, \ldots, R)$ the estimates $\hat{\theta}_i(r)$ and $\hat{\theta}_{iK}(r)$ and associated MSE estimates $\mathrm{mse}(\hat{\theta}_i(r))$ and $\mathrm{mse}(\hat{\theta}_{iK}(r))$, and also the direct estimate $\bar{y}_{iw}(r)$. Using these values, empirical relative efficiencies (RE) of $\hat{\theta}_i$ and $\hat{\theta}_{iK}$ over $\bar{y}_{iw}$ were computed as

$$\mathrm{RE}(\hat{\theta}_i) = \mathrm{MSE}_*(\bar{y}_{iw}) / \mathrm{MSE}_*(\hat{\theta}_i)$$

and

$$\mathrm{RE}(\hat{\theta}_{iK}) = \mathrm{MSE}_*(\bar{y}_{iw}) / \mathrm{MSE}_*(\hat{\theta}_{iK}),$$

where $\mathrm{MSE}_*$ denotes the MSE over the $R = 10{,}000$ runs. For example, $\mathrm{MSE}_*(\hat{\theta}_i) = \sum_{r=1}^{R} [\hat{\theta}_i(r) - \bar{Y}_i(r)]^2 / R$, where $\bar{Y}_i(r)$ is the $i$-th area population mean for the $r$-th run. Note that $\bar{Y}_i(r)$ remains the same over the runs $r$ under the design-based approach because the finite population is fixed over the simulation runs.

Similarly, the relative biases of the MSE estimators were computed as

$$\mathrm{RB}[\mathrm{mse}(\hat{\theta}_i)] = [\mathrm{MSE}_*(\hat{\theta}_i) - E_*\,\mathrm{mse}(\hat{\theta}_i)] / \mathrm{MSE}_*(\hat{\theta}_i)$$

and

$$\mathrm{RB}[\mathrm{mse}(\hat{\theta}_{iK})] = [\mathrm{MSE}_*(\hat{\theta}_{iK}) - E_*\,\mathrm{mse}(\hat{\theta}_{iK})] / \mathrm{MSE}_*(\hat{\theta}_{iK}),$$

where $E_*$ denotes the expectation over the $R = 10{,}000$ runs; for example, $E_*\,\mathrm{mse}(\hat{\theta}_i) = \sum_{r=1}^{R} \mathrm{mse}(\hat{\theta}_i(r)) / R$. Finally, the empirical coefficients of variation (CV) of the MSE estimators were computed as

$$\mathrm{CV}[\mathrm{mse}(\hat{\theta}_i)] = [\mathrm{MSE}_*\{\mathrm{mse}(\hat{\theta}_i)\}]^{1/2} / \mathrm{MSE}_*(\hat{\theta}_i)$$

and

$$\mathrm{CV}[\mathrm{mse}(\hat{\theta}_{iK})] = [\mathrm{MSE}_*\{\mathrm{mse}(\hat{\theta}_{iK})\}]^{1/2} / \mathrm{MSE}_*(\hat{\theta}_{iK}).$$

Note that $\mathrm{MSE}_*[\mathrm{mse}(\hat{\theta}_i)] = \sum_{r=1}^{R} [\mathrm{mse}(\hat{\theta}_i(r)) - \mathrm{MSE}_*(\hat{\theta}_i)]^2 / R$, and a similar expression holds for $\mathrm{MSE}_*[\mathrm{mse}(\hat{\theta}_{iK})]$.

Table 1 reports summary measures of the values of percent RE, |RB| and CV for cases (1) and (2) under approach (i). Summary measures under approach (ii) are reported in Table 2. The summary measures considered are the mean and the median (med) over the small areas $i = 1, 2, \ldots, 30$.

Table 1
Relative Efficiency (RE) of Estimators, Absolute Relative Bias (|RB|) and Coefficient of Variation (CV) of MSE Estimators (σ = 5.0, n = 20): Approach (i)

                    RE%               |RB|%                     CV%
σ_v          θ̂_iK   θ̂_i   mse(θ̂_iK)  mse(θ̂_i)    mse(θ̂_iK)  mse(θ̂_i)
Case 1
 1   Mean    190    177     15.3        3.5          148         25
     Med     190    182     14.8        2.6          148         25
 2   Mean    126    123      5.1        3.2           48          8
     Med     127    124      5.6        2.9           48          8
 3   Mean    113    111      3.5        2.7           35          6
     Med     112    111      3.2        3.0           35          6
Case 2
 1   Mean    108    103     10.4        7.9           39          6
     Med     108    104     11.1        7.7           38          5
 2   Mean    108    104     13.3        8.9           39          6
     Med     108    104     13.6        7.9           37          6
 3   Mean    104    103     11.5        7.2           37          5
     Med     105    105     13.1        8.0           36          6

Case 1: μ_i = 50, i = 1, 2, ..., 30; Case 2: μ_i = 50, i = 1, ..., 10; μ_i = 55, i = 11, ..., 20; μ_i = 60, i = 21, ..., 30.

It is clear from Tables 1 and 2 that $\hat{\theta}_{iK}$ and $\hat{\theta}_i$ perform similarly with respect to RE, which decreases as $\sigma_v^2/\sigma^2$ increases. Under approach (ii), RE is large for both cases 1 and 2 when $\sigma_v^2/\sigma^2 \le 0.4$, whereas it decreases significantly under approach (i) if the linking model is violated (case 2); the direct estimator $\bar{y}_{iw}$ is quite unstable under approach (ii).

Turning to the performance of MSE estimators under approach (i), Table 1 shows that |RB| of $\mathrm{mse}(\hat{\theta}_i)$ is negligible (<4%) when the linking model holds (Case 1) and that it is small (<10%) even when the linking model is violated, although it increases. The estimator $\mathrm{mse}(\hat{\theta}_{iK})$ has a larger |RB|, but it is less than 15%. The CV of $\mathrm{mse}(\hat{\theta}_i)$ is much smaller than the CV of $\mathrm{mse}(\hat{\theta}_{iK})$ for both Cases 1 and 2. For example, when the model holds (Case 1) the median CV is 25% for $\mathrm{mse}(\hat{\theta}_i)$ compared to 148% for $\mathrm{mse}(\hat{\theta}_{iK})$ when $\sigma_v = 1$; the median CV decreases to 8% for $\mathrm{mse}(\hat{\theta}_i)$ compared to 48% for $\mathrm{mse}(\hat{\theta}_{iK})$ when $\sigma_v = 2$. This pattern is retained when the model is violated (Case 2). It may be noted that the probability of $\mathrm{mse}(\hat{\theta}_{iK})$ taking a negative value is quite large (>0.3) when $\sigma_v^2/\sigma^2 \le 0.4$.

Under approach (ii), Table 2 shows that |RB| of $\mathrm{mse}(\hat{\theta}_i)$ is larger than the value under approach (i) and ranges from 15% to 25%. On the other hand, |RB| of $\mathrm{mse}(\hat{\theta}_{iK})$ is smaller and ranges from 4% to 15%.
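For readers replicating this evaluation, here is a sketch of computing the empirical RE, |RB| and CV measures defined above from the $R$ stored runs (array names are ours):

```python
import numpy as np

def summary_measures(theta_hat, mse_hat, ybar_w, Ybar):
    """theta_hat, mse_hat, ybar_w: (R, m) arrays of per-run estimates;
    Ybar: (R, m) array of realized area means Ybar_i(r) (constant over r
    when the finite population is held fixed, approach (ii))."""
    MSE_theta = np.mean((theta_hat - Ybar) ** 2, axis=0)   # MSE*(theta_i)
    MSE_ybar = np.mean((ybar_w - Ybar) ** 2, axis=0)       # MSE*(ybar_iw)
    RE = 100 * MSE_ybar / MSE_theta                        # percent RE
    RB = (MSE_theta - mse_hat.mean(axis=0)) / MSE_theta    # relative bias
    CV = np.sqrt(np.mean((mse_hat - MSE_theta) ** 2, axis=0)) / MSE_theta
    return RE, 100 * np.abs(RB), 100 * CV                  # RE%, |RB|%, CV%
```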
The CV of $\mathrm{mse}(\hat{\theta}_{iK})$, however, is much larger than under approach (i). For example, the median CV for Case 1 is 295% compared to 38% for $\mathrm{mse}(\hat{\theta}_i)$ when $\sigma_v = 1$, which decreases to 122% compared to 23% when $\sigma_v = 2$. A similar pattern holds for case 2, where the fixed finite population is generated from the model with varying means.

Table 2
Relative Efficiency (RE) of Estimators, Absolute Relative Bias (|RB|) and Coefficient of Variation (CV) of MSE Estimators (σ = 5.0, n = 20): Approach (ii)

                    RE%               |RB|%                     CV%
σ_v          θ̂_iK   θ̂_i   mse(θ̂_iK)  mse(θ̂_i)    mse(θ̂_iK)  mse(θ̂_i)
Case 1
 1   Mean    283    281     14.2       25.4          289         39
     Med     275    279     15.0       24.7          295         38
 2   Mean    180    182      7.3       18.7          115         24
     Med     177    181      6.9       19.2          122         23
 3   Mean    129    129      4.8       13.9           68         24
     Med     129    128      4.2       14.8           65         24
Case 2
 1   Mean    278    276     15.7       26.8          291         41
     Med     271    275     16.6       26.2          297         40
 2   Mean    175    177      8.8       20.7          117         26
     Med     173    177      8.5       20.3          124         25
 3   Mean    124    124      6.3       16.2           70         25
     Med     125    124      6.8       15.5           67         26

Case 1: μ_i = 50, i = 1, 2, ..., 30; Case 2: μ_i = 50, i = 1, ..., 10; μ_i = 55, i = 11, ..., 20; μ_i = 60, i = 21, ..., 30.

To reduce |RB| of $\mathrm{mse}(\hat{\theta}_i)$ under approach (ii), one could combine it with $\mathrm{mse}(\hat{\theta}_{iK})$ by taking a weighted average, but it appears difficult to choose the appropriate weights. The weighted average will be more stable than $\mathrm{mse}(\hat{\theta}_{iK})$.

5. NESTED ERROR REGRESSION MODEL

The results in sections 2 and 3 can be extended to nested error regression models

$$y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + v_i + e_{ij}, \quad j = 1, 2, \ldots, n_i;\ i = 1, 2, \ldots, m, \tag{17}$$

using the results of Prasad and Rao (1990), where $\mathbf{x}_{ij}$ is a $p$-vector of auxiliary variables with known population mean $\bar{\mathbf{X}}_i$ and related to $y_{ij}$, and $\boldsymbol{\beta}$ is the $p$-vector of regression coefficients. The reduced model is given by

$$\bar{y}_{iw} = \bar{\mathbf{x}}_{iw}'\boldsymbol{\beta} + v_i + \bar{e}_{iw}, \tag{18}$$

with $\bar{\mathbf{x}}_{iw}' = \sum_j \tilde{w}_{ij}\mathbf{x}_{ij}'$. Model-consistent estimates $\hat{\sigma}_v^2$ and $\hat{\sigma}^2$ are obtained from the unit-level model (17), employing either the method of fitting constants (Prasad and Rao 1990) or REML (restricted maximum likelihood) estimation (Datta and Lahiri 1997).

The pseudo-EBLUP of $\theta_i = \bar{\mathbf{X}}_i'\boldsymbol{\beta} + v_i$ is given by

$$\hat{\theta}_i = \hat{\gamma}_i\bar{y}_{iw} + (\bar{\mathbf{X}}_i - \hat{\gamma}_i\bar{\mathbf{x}}_{iw})'\hat{\boldsymbol{\beta}}_w,$$

where

$$\hat{\boldsymbol{\beta}}_w = \Big[\sum_i \hat{\gamma}_i\bar{\mathbf{x}}_{iw}\bar{\mathbf{x}}_{iw}'\Big]^{-1}\Big[\sum_i \hat{\gamma}_i\bar{\mathbf{x}}_{iw}\bar{y}_{iw}\Big]. \tag{19}$$

An approximate model-unbiased estimator of $\mathrm{MSE}(\hat{\theta}_i)$ is given by (15) with

$$g_{1i}(\hat{\sigma}_v^2, \hat{\sigma}^2) = (1 - \hat{\gamma}_i)\hat{\sigma}_v^2$$

as before,

$$g_{2i}(\hat{\sigma}_v^2, \hat{\sigma}^2) = \hat{\sigma}_v^2(\bar{\mathbf{X}}_i - \hat{\gamma}_i\bar{\mathbf{x}}_{iw})'\Big[\sum_i \hat{\gamma}_i\bar{\mathbf{x}}_{iw}\bar{\mathbf{x}}_{iw}'\Big]^{-1}(\bar{\mathbf{X}}_i - \hat{\gamma}_i\bar{\mathbf{x}}_{iw}),$$

and $g_{3i}(\hat{\sigma}_v^2, \hat{\sigma}^2)$, obtained from (14), involves the estimated variances and covariances of $\hat{\sigma}_v^2$ and $\hat{\sigma}^2$. The latter can be obtained from Prasad and Rao (1990) for the method of fitting constants and from Datta and Lahiri (1997) for REML.
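A minimal sketch of computing the pseudo-EBLUP and the survey-weighted estimator (19) from area-level summaries, assuming the $\hat{\gamma}_i$, the weighted sample means and the population means are already available (all names are ours):

```python
import numpy as np

def pseudo_eblup(gamma, xbar_w, ybar_w, Xbar):
    """gamma: (m,) estimated gamma_i; xbar_w, Xbar: (m, p) weighted sample
    means and population means of the auxiliaries; ybar_w: (m,) weighted
    sample means of y."""
    # Survey-weighted estimator (19):
    # beta_w = [sum_i g_i xbar_iw xbar_iw']^{-1} sum_i g_i xbar_iw ybar_iw
    A = np.einsum('i,ij,ik->jk', gamma, xbar_w, xbar_w)
    b = np.einsum('i,ij,i->j', gamma, xbar_w, ybar_w)
    beta_w = np.linalg.solve(A, b)
    # theta_i = gamma_i * ybar_iw + (Xbar_i - gamma_i * xbar_iw)' beta_w
    return gamma * ybar_w + (Xbar - gamma[:, None] * xbar_w) @ beta_w
```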
6. CONCLUSION

We have proposed a model-assisted estimator of a small area mean under a simple unit-level random effects model. This estimator depends on the survey weights and is design-consistent. We have also obtained a model-based MSE estimator. Results of our simulation study have shown that the proposed MSE estimator performs well, even under moderate deviations from the linking model. The proposed approach is also extended to a nested error regression model.

ACKNOWLEDGEMENTS

This work was supported by research grants from the Natural Sciences and Engineering Research Council of Canada. We are thankful to the Associate Editor and the referee for constructive comments and suggestions.

APPENDIX 1

Proof of (13):

From general results (Prasad and Rao 1990) we have

$$E(\hat{\theta}_i - \tilde{\theta}_i)^2 \approx \mathrm{tr}\,[A_i(\sigma_v^2, \sigma^2)\,B_i(\sigma_v^2, \sigma^2)],$$

where $B_i(\sigma_v^2, \sigma^2)$ is the $2 \times 2$ covariance matrix of $\hat{\sigma}_v^2$ and $\hat{\sigma}^2$, and $A_i(\sigma_v^2, \sigma^2)$ is the $2 \times 2$ covariance matrix of $(\partial\hat{\theta}_i/\partial\sigma_v^2,\ \partial\hat{\theta}_i/\partial\sigma^2)'$.
Now, noting that

$$\frac{\partial\hat{\theta}_i}{\partial\sigma_v^2} = \frac{\gamma_i(1 - \gamma_i)}{\sigma_v^2}(\bar{y}_{iw} - \hat{\mu}_w), \qquad \frac{\partial\hat{\theta}_i}{\partial\sigma^2} = -\frac{\gamma_i(1 - \gamma_i)}{\sigma^2}(\bar{y}_{iw} - \hat{\mu}_w),$$

and $V(\bar{y}_{iw}) = \sigma_v^2 + \delta_i = \sigma_v^2/\gamma_i$, we get

$$A_i(\sigma_v^2, \sigma^2) = \frac{\gamma_i(1 - \gamma_i)^2}{\sigma_v^2}\begin{pmatrix} 1 & -\sigma_v^2/\sigma^2 \\ -\sigma_v^2/\sigma^2 & (\sigma_v^2/\sigma^2)^2 \end{pmatrix},$$

and hence the result (14).

Covariance matrix of $\hat{\sigma}_v^2$ and $\hat{\sigma}^2$: Under normality, we have

$$V(\hat{\sigma}^2) = 2\sigma^4\Big/\Big(\sum_i n_i - m\Big),$$

$$V(\hat{\sigma}_v^2) = 2n_*^{-2}\Big\{\sigma^4(m - 1)\Big(\sum_i n_i - 1\Big)\Big(\sum_i n_i - m\Big)^{-1} + 2n_*\sigma_v^2\sigma^2 + n_{**}\sigma_v^4\Big\}$$

and

$$\mathrm{Cov}(\hat{\sigma}^2, \hat{\sigma}_v^2) = -(m - 1)\,n_*^{-1}\,V(\hat{\sigma}^2),$$

where $n_* = \sum_i n_i - \sum_i n_i^2/\sum_i n_i$ and $n_{**} = \sum_i n_i^2 - 2\sum_i n_i^3/\sum_i n_i + \big(\sum_i n_i^2\big)^2/\big(\sum_i n_i\big)^2$; see Searle, Casella and McCulloch (1992, p. 428).

APPENDIX 2

Proof of $E[g_{1i}(\hat{\sigma}_v^2, \hat{\sigma}^2) + g_{3i}(\hat{\sigma}_v^2, \hat{\sigma}^2)] \approx g_{1i}(\sigma_v^2, \sigma^2)$:

By a Taylor expansion of $g_{1i}(\hat{\sigma}_v^2, \hat{\sigma}^2)$ around $(\sigma_v^2, \sigma^2)$ to second order, and noting that $E(\hat{\sigma}^2 - \sigma^2) \approx 0$ and $E(\hat{\sigma}_v^2 - \sigma_v^2) \approx 0$, we get

$$E[g_{1i}(\hat{\sigma}_v^2, \hat{\sigma}^2) - g_{1i}(\sigma_v^2, \sigma^2)] \approx \tfrac{1}{2}\,\mathrm{tr}\,[D_i(\sigma_v^2, \sigma^2)\,B_i(\sigma_v^2, \sigma^2)],$$

where $D_i(\sigma_v^2, \sigma^2)$ is the $2 \times 2$ matrix of second order derivatives of $g_{1i}(\sigma_v^2, \sigma^2)$ with respect to $\sigma_v^2$ and $\sigma^2$. It is easy to verify that

$$-\tfrac{1}{2}\,\mathrm{tr}\,[D_i(\sigma_v^2, \sigma^2)\,B_i(\sigma_v^2, \sigma^2)] = g_{3i}(\sigma_v^2, \sigma^2).$$

Now, noting that $E[g_{3i}(\hat{\sigma}_v^2, \hat{\sigma}^2)] \approx g_{3i}(\sigma_v^2, \sigma^2)$, we get the desired result.
APPENDIX 3

The design-based estimator of the variance of $\bar{y}_{iw}$ under PPS sampling is given by

$$v(\bar{y}_{iw}) = \sum_j \tilde{w}_{ij}^2(y_{ij} - \bar{y}_{iw})^2.$$

Kott's (1989) model-assisted variance estimator is

$$v^*(\bar{y}_{iw}) = \{V(\bar{y}_{iw})/E\,v(\bar{y}_{iw})\}\,v(\bar{y}_{iw}),$$

where $E$ and $V$ denote expectation and variance with respect to the basic model (1). Here $V(\bar{y}_{iw}) = \sigma_v^2 + \delta_i = \sigma_v^2/\gamma_i$ and $E\,v(\bar{y}_{iw}) = \sigma^2\sum_j \tilde{w}_{ij}^2\big(1 - 2\tilde{w}_{ij} + \sum_j \tilde{w}_{ij}^2\big)$.

REFERENCES

ARORA, V., and LAHIRI, P. (1997). On the superiority of the Bayesian method over the BLUP in small area estimation problems. Statistica Sinica, 7, 1053-1063.

BATTESE, G.E., HARTER, R., and FULLER, W.A. (1988). An error component model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association, 83, 28-36.

DATTA, G.S., and LAHIRI, P. (1997). A Unified Measure of Uncertainty of Estimated Best Linear Unbiased Predictor in Small-Area Estimation Problems. Technical Report, University of Nebraska-Lincoln.

FAY, R.E., and HERRIOT, R.A. (1979). Estimates of income for small places: an application of James-Stein procedures to census data. Journal of the American Statistical Association, 74, 269-277.

GHOSH, M., and MEEDEN, G. (1986). Empirical Bayes estimation in finite population sampling. Journal of the American Statistical Association, 81, 1058-1069.

GHOSH, M., and RAO, J.N.K. (1994). Small area estimation: an appraisal. Statistical Science, 9, 55-93.

KACKAR, R.N., and HARVILLE, D.A. (1984). Approximations for standard errors of estimators of fixed and random effects in mixed linear models. Journal of the American Statistical Association, 79, 853-862.

KLEFFE, J., and RAO, J.N.K. (1992). Estimation of mean square error of empirical best linear unbiased predictors under a random error variance linear model. Journal of Multivariate Analysis, 43, 1-15.

KOTT, P. (1989). Robust small domain estimation using random effects modelling. Survey Methodology, 15, 3-12.

PRASAD, N.G.N., and RAO, J.N.K. (1990). The estimation of mean squared errors of small-area estimators. Journal of the American Statistical Association, 85, 163-171.

SEARLE, S.R., CASELLA, G., and McCULLOCH, C.E. (1992). Variance Components. New York: John Wiley and Sons.
Survey Methodology, June 1999, Vol. 25, No. 1, pp. 73-80, Statistics Canada
Small Area Estimation Using Multilevel Models
FERNANDO A.S. MOURA and DAVID HOLT*
ABSTRACT
In this paper a general multilevel model framework is used to provide estimates for small areas using survey data. This class
of models allows for variation between areas because of: (i) differences in the distributions of unit level variables between
areas, (ii) differences in the distribution of area level variables between areas and (iii) area specific components of variance
which make provision for additional local variation which cannot be explained by unit-level or area-level covariates. Small
area estimators are derived for this multilevel model formulation and an approximation to the mean square error (MSE) of
each small area estimate for this general class of mixed models is provided together with an estimator of this MSE. Both
the approximations to the MSE and the estimator of MSE take into account three sources of variation: (i) the prediction
MSE assuming that both the fixed and components of variance terms in the multilevel model are known, (ii) the additional
component due to the fact that the fixed coefficients must be estimated, and (iii) the fiirther component due to the fact that
the components of variance in the model must be estimated. The proposed methods are estimated using a large data set as
a basis for numerical investigation. The results confirm that the extra components of variance contained in multilevel models
as well as small area covariates can improve small area estimates and that the MSE approximation and estimator are
satisfactory.
KEY WORDS: Small area estimation; Mixed models; Multilevel models; EBLUE.
1. INTRODUCTION
The need for small area (and small domain) estimates
from survey data has long been recognized. The difficulty
with the production of such estimates is that for most, if not
all, small areas, the sample size achieved by a survey
designed for national purposes is too small for direct estimates to be made with acceptable precision. Early attempts
to tackle this problem using methods such as synthetic
estimation (Gonzalez 1973) involved the use of auxiliary
information and the pooling of information across small
areas. An excellent review and bibliography are given by
Ghosh and Rao (1994).
Empirical studies show that such methods made too little
provision for local variation and consequently the resulting
small area estimates were shrunk too far towards a predicted mean. More recent approaches (e.g., Battese and
Fuller 1981 and Battese, Harter and Fuller 1988) use some
components of variance model, or equivalent, to provide for
local variation. Empirical studies show the superiority of
this approach (e.g., Prasad and Rao 1990).
This paper proposes a general multilevel model framework for small area estimation. This involves the potential
to use auxiliary information at both the unit and small area
level. In addition any of the regression parameters, rather
than just the intercept as proposed by Battese and Fuller
(1981), may be treated as varying randomly between small
areas. The local variation is provided for by using differences between the means of unit level auxiliary variables,
the small area level variables, and the various components
of variance which allow variation between areas.
For this general model, the small area predictor is
obtained. In addition, an approximation to the mean square
error (MSE) of each separate small area prediction and an
estimator of this MSE are developed.
The numerical study, based on a large data set from
Brazil, shows that such models may be useful for predicting small area estimates. The robustness of the approach to misspecification of the variance-covariance matrix of the small area random effects and misspecification of small area covariates are also investigated. Further numerical results demonstrate the success of the MSE approximation and its
estimator.
2. THE MULTILEVEL MODEL FRAMEWORK
2.1 Introduction
We consider the following multilevel model for
predicting the small area means:
$$Y_i = X_i\beta_i + e_i,$$
$$\beta_i = Z_i\gamma + v_i, \quad i = 1, \ldots, m, \tag{2.1}$$

where $Y_i$ is the vector of length $n_i$ of the characteristic of interest for the sample units in the $i$-th small area, $i = 1, \ldots, m$; $X_i$ is the matrix of explanatory variables at sample unit level; $Z_i$ is the design matrix of small area variables; $\gamma$ is the vector of length $q$ of fixed coefficients; and $v_i = (v_{i0}, \ldots, v_{ip})'$ is the vector of length $(p + 1)$ of random effects for the $i$-th small area. We assume the
Fernando A.S. Moura, Instituto de Matemática, UFRJ, Rio de Janeiro, Brazil, CP: 68530, CEP: 21941-590, e-mail: fmoura@dme.ufrj.br; David Holt, Office for National Statistics, 1 Drummond Gate, London, SW1P 2QQ, e-mail: tholt@ons.go.uk.
following about the distribution of the random vectors: (a) the $v_i$ are independent between small areas and have a joint distribution within each small area with $E(v_i) = 0$ and $V(v_i) = \Omega$; (b) the $e_i$'s and $v_i$'s are independent and $V(e_i) = \sigma^2 I$.

For the whole population, (2.1) applies with $n_i$ replaced by $N_i$, the small area population sizes.

The set of $m$ equations in (2.1) can be concisely written by stacking them as

$$Y = XZ\gamma + Xv + E. \tag{2.2}$$

It is worth noting that the random intercept model (see section 2.3) can be regarded as a special case of the model (2.1) where $Z_i$ is equal to the identity matrix for each small area and $\Omega$ has all terms constrained to be zero except the one corresponding to the variance of the intercept term. Other intermediate models exist, for instance, when $\Omega$ is diagonal so that the small area regression coefficients are random but uncorrelated between covariates.

Holt (in Ghosh 1994, page 82) observes that the advantage of the model (2.1) over other competitors is that it effectively integrates the use of unit level and area level covariates into a single model. Besides, the use of extra random effects for the regression coefficients gives greater flexibility in situations where it is not appropriate to assume that the same slope coefficients apply for all small areas.
2.2 Fixed and Component of Variance Parameter Estimates

The fixed and components of variance parameters in the model (2.1) are $\gamma$ and $\theta = ([\mathrm{vech}(\Omega)]', \sigma^2)'$ respectively. Various methods for estimating these model parameters in the case of a general mixed linear model are available. Most of them, based on iterative algorithms, lead to the maximum likelihood estimator (MLE) or the restricted maximum likelihood estimator (RMLE) under certain regularity conditions.

Goldstein (1986) shows how consistent estimators can be obtained by applying iterative generalised least squares procedures (IGLS). He also proved their equivalence to the maximum likelihood estimator under normality. Later, Goldstein (1989) proposed a slight modification of his algorithm (namely, restricted iterative generalised least squares (RIGLS)) which is equivalent to RMLE under normality. Unlike the IGLS estimates, the RIGLS estimation procedures provide unbiased estimates of the component of variance parameters by taking into account the loss in degrees of freedom resulting from estimating the fixed parameters. This work is confined to the RIGLS approach as in Goldstein (1989). The RIGLS procedure is described in detail in Appendix A.

2.3 The Estimator of the Small Area Mean

Assuming the model (2.1) and considering that the population size $N_i$ in the $i$-th small area is large, we can write the mean for the $i$-th small area as

$$\mu_i = \bar{X}_i' Z_i\gamma + \bar{X}_i' v_i, \tag{2.3}$$

where $\bar{X}_i$ is the $(p + 1)$ population mean vector for the $i$-th small area.

An estimator of $\mu_i$ may be obtained by plugging the RIGLS estimators of $\gamma$ and $\theta$ into the respective terms of equation (2.3), where the predictor of the $i$-th small area random effect $v_i$ is given by $\hat{v}_i = \hat{\Omega} X_i' \hat{V}_i^{-1}(Y_i - X_i Z_i\hat{\gamma})$, with $\hat{V}_i^{-1} = \hat{\sigma}^{-2} I - \hat{\sigma}^{-4} X_i \hat{\Omega}\hat{G}_i^{-1} X_i'$ and $\hat{G}_i = I + \hat{\sigma}^{-2} X_i' X_i \hat{\Omega}$. This estimator of $\mu_i$ is known as the Empirical Best Linear Unbiased Estimator (EBLUE):

$$\hat{\mu}_i = \bar{X}_i' Z_i\hat{\gamma} + \bar{X}_i'\hat{v}_i. \tag{2.4}$$

Battese et al. (1981, 1988) propose and apply a random intercept model to provide small area estimates. In this case, the Empirical Best Linear Unbiased Estimator is

$$\hat{\mu}_{i(\mathrm{RI})} = \bar{x}_i'\hat{\beta} + \hat{v}_{i0}.$$

We use the label (RI) to imply a random intercept model, since only the intercept of each small area is random while the other components of $\beta$ remain fixed.
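As an illustration, here is a minimal sketch of the plug-in computation of (2.4) for a single area, forming $\hat{V}_i$ directly rather than through the inverse formula above (function and argument names are ours; the RIGLS estimates are assumed to be already available):

```python
import numpy as np

def eblue_small_area_mean(Y_i, X_i, Z_i, Xbar_i, gamma_hat, Omega_hat, sigma2_hat):
    """EBLUE (2.4) for one area: Y_i (n_i,), X_i (n_i, p+1),
    Z_i ((p+1), q) design matrix of area-level variables,
    Xbar_i (p+1,) population means of the unit-level variables."""
    n_i = len(Y_i)
    V_i = sigma2_hat * np.eye(n_i) + X_i @ Omega_hat @ X_i.T   # Var(Y_i)
    resid = Y_i - X_i @ (Z_i @ gamma_hat)
    v_i = Omega_hat @ X_i.T @ np.linalg.solve(V_i, resid)      # predicted effects
    return Xbar_i @ (Z_i @ gamma_hat) + Xbar_i @ v_i           # mu_hat_i (2.4)
```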
2.4 Approximation to the Mean Square Error (MSE)

Kackar and Harville (1984) show that, if $\hat{\theta}$ is a translation-invariant estimator of $\theta$ and the random terms are normally distributed, the mean square error of a predictor of a linear combination of a fixed and random effect can be decomposed into two terms. The first one is due to the variability in estimating the fixed parameters when the components of variance are known; the second term comes from estimating the components of variance.

Since under normality the RIGLS estimator is equivalent to the RMLE estimator and the RMLE is translation-invariant, Kackar and Harville's (1984) results can be applied to the small area mean estimators $\hat{\mu}_i$, $i = 1, \ldots, m$:

$$\mathrm{MSE}(\hat{\mu}_i) = E[\hat{\mu}_i - \mu_i]^2 = E[\tilde{\mu}_i - \mu_i]^2 + E[\hat{\mu}_i - \tilde{\mu}_i]^2, \tag{2.5}$$

where $\tilde{\mu}_i$ is the BLUE of $\mu_i$.

The first term of (2.5), that is $\mathrm{MSE}(\tilde{\mu}_i)$, can be obtained by direct calculation as

$$\mathrm{MSE}(\tilde{\mu}_i) = \bar{X}_i' G_i^{-1}\Omega\bar{X}_i + \sigma^2\bar{X}_i' G_i^{-1} Z_i\Big[\sum_{i=1}^{m} Z_i' G_i^{-1} X_i' X_i Z_i\Big]^{-1} Z_i'(G_i^{-1})'\bar{X}_i, \tag{2.6}$$

where $G_i = I + \sigma^{-2} X_i' X_i\Omega$. Kackar and Harville (1984) point out that the second term of (2.5) is not tractable, except for special cases, and propose an approximation to
it. Prasad and Rao (1990) propose an approximation to this
second term and work out the details of their approximation
for three particular cases: the random intercept model,
random regression coefficient model and the Fay-Herriot
model. They also give some regularity conditions for their
approximation to be of the second order, and prove that
their MSE approximation for the Fay-Herriot model is of
the second order. Nevertheless, it seems to be more difficult
to give general conditions for more complex models such as
model (2.1).
Applying Prasad and Rao's approach, an approximation to the second term of (2.5) is developed in Appendix B. It is worth noting that the MSE approximation of $\hat{\mu}_i$ can be decomposed into three terms:

$$\mathrm{MSE}(\hat{\mu}_i) \approx T_1 + T_2 + T_3, \tag{2.7}$$

where $T_1$ and $T_2$ are respectively the first and the second term of equation (2.6) and $T_3$ is described in Appendix B. The term $T_1$ is the variability of $\tilde{\mu}_i$ when all parameters are known, the second term $T_2$ is due to estimating the fixed effects, and the third term $T_3$ comes from estimating the components of variance.

When sampling fractions are not negligible, estimators of the small area means can be built in the spirit of the finite population approach by predicting specifically for the non-sampled units:

$$\hat{\mu}_i^F = f_i\bar{y}_i + (\bar{X}_i - f_i\bar{x}_i)'(Z_i\hat{\gamma} + \hat{v}_i), \tag{2.8}$$

where $\bar{x}_i$ is the $(p + 1)$ vector of sample means and the superscript $F$ indicates that a correction for the finite population sampling fraction $f_i$ was used. The $\mathrm{MSE}(\hat{\mu}_i^F)$ can be obtained by noting that the population mean can be written as

$$\bar{Y}_i = f_i\bar{y}_i + (1 - f_i)\big[\bar{X}_i^{F\prime}(Z_i\gamma + v_i) + \bar{e}_i^F\big],$$

where $\bar{X}_i^F = (1 - f_i)^{-1}(\bar{X}_i - f_i\bar{x}_i)$ and $\bar{e}_i^F$ is the mean of the $e_{ij}$ for the non-sampled units in the $i$-th small area. Therefore

$$\mathrm{MSE}(\hat{\mu}_i^F) = (1 - f_i)^2\big[\mathrm{MSE}^*(\hat{\mu}_i) + N_i^{-1}(1 - f_i)^{-1}\sigma^2\big], \tag{2.9}$$

where $\mathrm{MSE}^*(\hat{\mu}_i)$ is the equation (2.7) with $\bar{X}_i$ replaced by $\bar{X}_i^F$.

2.5 Estimation of Mean Square Error

It is common practice to estimate the MSE of a linear combination of the fixed and random effects in a mixed model as in (2.1) by replacing estimates of the components of variance in the expression of the MSE. This estimator ignores the contribution to MSE due to estimating the components of variance parameters. Several studies (see for example Singh, Stukel and Pfeffermann 1998 or Harville and Jeske 1992) argue that this procedure tends to underestimate the MSE. Prasad and Rao (1990) reported a simulation study which showed that the use of this "naive" estimator leads to severe downwards bias. They also showed for the Fay-Herriot model (a special case of the model (2.1)), using "truncated Henderson" estimates for the variance components, that

$$E(\hat{T}_1) = T_1 + o(m^{-1}); \quad E(\hat{T}_2) = T_2 + o(m^{-1}); \quad E(\hat{T}_3) = T_3 + o(m^{-1}).$$

Harville and Jeske (1992) establish some conditions for the unbiasedness of Prasad and Rao's mean square error estimator. However, considering the more general model (2.1), again it seems more difficult to give general conditions under which the order of bias of Prasad and Rao's estimator is $o(m^{-1})$, especially if iterative procedures such as RIGLS are used to obtain the parameter estimates. Nevertheless, motivated by the simulation study summarised in Section 3.4 and an extensive simulation study described in Moura (1994), we propose to use an estimator similar to Prasad and Rao's for $\mathrm{MSE}(\hat{\mu}_i)$:

$$\widehat{\mathrm{MSE}} = \hat{T}_1 + \hat{T}_2 + 2\hat{T}_3, \tag{2.10}$$

where the $\hat{T}_j$ are obtained from (2.7) by replacing $\sigma^2$ and $\Omega$ by their respective RIGLS estimators.

From equation (2.9) we can also obtain an estimator for $\mathrm{MSE}(\hat{\mu}_i^F)$ as follows:

$$\widehat{\mathrm{MSE}}(\hat{\mu}_i^F) = (1 - f_i)^2\big[\widehat{\mathrm{MSE}}{}^*(\hat{\mu}_i) + N_i^{-1}(1 - f_i)^{-1}\hat{\sigma}^2\big], \tag{2.11}$$

where $\widehat{\mathrm{MSE}}{}^*(\hat{\mu}_i)$ is the equation (2.10) with $\bar{X}_i$ replaced by $\bar{X}_i^F$.

3. A MODEL-BASED NUMERICAL INVESTIGATION
3.1 Comparison of the Estimators
In order to investigate the properties of alternative estimators, data were used from 38,740 households in the enumeration districts of one county in Brazil. The head of household's income was treated as the dependent variable. Two unit level independent variables were identified: the educational attainment of the head of household (ordinal scale of 0-5) and the number of rooms in the household (1-11+).

The assumed model is

$$y_{ij} = \beta_{i0} + \beta_{i1} X_{1ij} + \beta_{i2} X_{2ij} + e_{ij}, \quad i = 1, \ldots, m;\ j = 1, \ldots, N_i,$$
$$\beta_{i0} = \gamma_{00} + v_{i0}; \quad \beta_{i1} = \gamma_{10} + v_{i1}; \quad \beta_{i2} = \gamma_{20} + v_{i2}, \tag{3.1}$$
where $X_1$ and $X_2$ respectively represent the number of rooms and the educational attainment of the head of the household (centred about their respective population means).

The parameter values for the fitted model and their respective standard errors are

$\gamma_{00} = 8.456\,(0.108)$; $\gamma_{10} = 1.223\,(0.046)$; $\gamma_{20} = 2.596\,(0.086)$;
$\omega_{00} = 1.385\,(0.194)$; $\omega_{01} = 0.354\,(0.66)$; $\omega_{02} = 0.492\,(0.117)$;
$\omega_{12} = 0.333\,(0.054)$; $\omega_{11} = 0.234\,(0.35)$; $\omega_{22} = 0.926\,(0.124)$;
$\sigma^2 = 47.74\,(0.345)$.
To carry out numerical investigations within the model-based framework, a simulation was carried out keeping the enumeration district identifiers and the values of the two explanatory variables ($X$) fixed. Initially the area population means $\bar{X}_{1i}$ and $\bar{X}_{2i}$ were calculated for the whole data set and a randomly selected subsample of 10% of records from each small area was identified. This same subset was retained throughout the simulations (the simulation subset).
The data generation for the simulations was carried out in two stages using a data generation model which was the General Model (G), the Diagonal Model (D), or the Random Intercept Model (RI), as appropriate. In the first case the parameter values were taken from the estimates mentioned earlier. In the second case the off-diagonal terms were set to zero; in the third case only $\omega_{00} = 1.385$ was non-zero.

The first stage of the data simulation process was to generate the level 2 random terms (that is, the non-zero elements of $v_{i0}$, $v_{i1}$ and $v_{i2}$) depending on the choice of the data generation model. These random terms were Normally distributed (jointly Normal in the case of the General Data Generation Model and the Diagonal Data Generation Model). At this stage the expected value of the mean for the $i$-th area, conditional on the area level random effects generated by the model $m_1$ = G, D, RI in the $r$-th simulation, could be obtained:

$$\mu_{i, m_1}^{(r)} = \beta_{i0}^{(r)} + \beta_{i1}^{(r)}\bar{X}_{1i} + \beta_{i2}^{(r)}\bar{X}_{2i}.$$

At the second stage of the data simulation process, unit values ($Y_i$) were created for each of the data generation models. Having generated the data for the simulation subset under one of the data generation models, all three of the estimation models (G, D and RI) could be fitted to the simulated data to obtain parameter estimates and predictors for the small area means.

For each data generation model $m_1$ = G, D, RI the whole simulation process was repeated $R = 5000$ times to yield a set of small area means $\mu_{i, m_1}^{(r)}$ and predicted means $\hat{\mu}_{i, m_2, m_1}^{(r)}$, $r = 1, \ldots, R$, for each small area $i$, $i = 1, \ldots, m$, and for the three estimation models $m_2$ = G, D, RI. For each small area and for data generated under model $m_1$ = G, D, RI, the Mean Square Error (MSE) of the prediction process for each estimation model $m_2$ may be defined as
$$\mathrm{MSE}[\hat{\mu}_{i, m_2, m_1}] = R^{-1}\sum_{r=1}^{R}\big(\hat{\mu}_{i, m_2, m_1}^{(r)} - \mu_{i, m_1}^{(r)}\big)^2,$$

and the absolute relative error (ARE) by

$$\mathrm{ARE}[\hat{\mu}_{i, m_2, m_1}] = R^{-1}\sum_{r=1}^{R}\big|\hat{\mu}_{i, m_2, m_1}^{(r)} - \mu_{i, m_1}^{(r)}\big|\big/\mu_{i, m_1}^{(r)}.$$

For comparative purposes we contrast the properties of each estimator with those of the estimator which is the same as the data generation model. Hence we define the Ratio of Mean Square Errors (RMSE):

$$\mathrm{RMSE}_{m_2, m_1} = \Big[\sum_i \mathrm{MSE}[\hat{\mu}_{i, m_2, m_1}]\Big/\sum_i \mathrm{MSE}[\hat{\mu}_{i, m_1, m_1}]\Big] \times 100,$$

and the Ratio of Absolute Relative Errors (RARE):

$$\mathrm{RARE}_{m_2, m_1} = \Big[\sum_i \mathrm{ARE}[\hat{\mu}_{i, m_2, m_1}]\Big/\sum_i \mathrm{ARE}[\hat{\mu}_{i, m_1, m_1}]\Big] \times 100.$$
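These ratios can be computed from the stored simulation output along the following lines (a sketch; array names are ours):

```python
import numpy as np

def rmse_rare(mu_hat_m2, mu_hat_m1, mu_true):
    """mu_hat_m2, mu_hat_m1: (R, m) predictions from estimator m2 and from
    the estimator matching the data generation model m1; mu_true: (R, m)
    realized small area means under m1."""
    mse = lambda est: np.mean((est - mu_true) ** 2, axis=0)          # per area
    are = lambda est: np.mean(np.abs(est - mu_true) / mu_true, axis=0)
    RMSE = 100 * mse(mu_hat_m2).sum() / mse(mu_hat_m1).sum()
    RARE = 100 * are(mu_hat_m2).sum() / are(mu_hat_m1).sum()
    return RMSE, RARE
```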
It will be seen that when the data are generated from a simpler model (e.g., RI) the more complex estimation procedures do not suffer any appreciable worsening of efficiency or bias. On the other hand, when the data are generated from a more complex model, the simpler estimators have inferior properties. However, the difference between the Diagonal and General estimators is much less than between these and the Random Intercept estimator. From Table 1 one would conclude that it is worth introducing additional random coefficients of some kind, beyond the simple Random Intercept model assumptions, but not necessarily the full General Model.

Table 1
Ratios of Mean Square Errors and Ratios of Absolute Relative Errors (in parentheses) for the Three Estimators and Three Data Generation Models

                          Data Generation Model
Estimator              G               D               RI
General (G)        100.0 (100.0)   101.2 (100.6)   101.8 (100.9)
Diagonal (D)       108.8 (100.1)   100.0 (100.0)   100.2 (82.6)
R. Intercept (RI)  131.9 (176.9)   109.1 (105.6)   100.0 (100.0)

The summary measures in Table 1 are average properties over all small areas. A careful analysis of the MSE performance of the estimators for each small area shows that there is a modest increase in the MSE for the Diagonal Estimator
compared to the General Estimator for all areas, whereas for the Random Intercept estimator a relatively small number of areas exhibit a substantial increase in MSE. A similar pattern occurs between the Diagonal and Random Intercept estimators when the Diagonal Data Generation Model is used.
3.2 Introducing a Small-Area Level Covariate
In this section an attempt is made to investigate the
impact on small area estimates of introducing an area covariate Z. Unfortunately for the data set used, it was not
possible to identify a single contextual area level covariate
which had a substantial effect on the multilevel models.
Nevertheless, the number of cars per household in each
small area was a useful covariate for the individual level random slope coefficients for "Room" and "Edu", but not for the random intercept term. This was observed after some preliminary model fitting analysis
on the real data. Although the "numbers of cars" was the
best small area level covariate found to explain between
area variation, it was not as powerful at the individual level
as "Room" and "Edu", the individual level covariates
chosen.
The model above with the small area covariate Z can be
written as
$$y_{ij} = \beta_{i0} + \beta_{i1} X_{1ij} + \beta_{i2} X_{2ij} + e_{ij}, \quad i = 1, \ldots, m;\ j = 1, \ldots, N_i,$$
$$\beta_{i0} = \gamma_{00} + v_{i0}; \quad \beta_{i1} = \gamma_{10} + \gamma_{11} z_i + v_{i1}; \quad \beta_{i2} = \gamma_{20} + \gamma_{22} z_i + v_{i2}. \tag{3.2}$$

The small area random effects were assumed uncorrelated in order to avoid convergence failure in the simulation study.

Table 2 reports the parameter estimates and their respective standard errors obtained by fitting the Diagonal Model with the Z covariate (3.2) and without the Z covariate (2.1). It is worth noting the significant reduction of all the components of variance estimates, except $\omega_{00}$ and $\sigma^2$, after introducing the explanatory area covariate Z.

Table 2
Parameter Estimates and Standard Errors for General Model with Area Level Covariate: Demographic Data

Parameter   Diagonal Model with Z   Diagonal Model
γ00         8.442 (0.112)           8.688 (0.136)
γ10         0.451 (0.179)           1.321 (0.085)
γ20         0.744 (0.272)           2.636 (0.134)
γ11         3.779 (0.507)           -
γ22         1.659 (0.323)           -
ω00         0.745 (0.308)           0.637 (0.303)
ω11         0.237 (0.083)           0.471 (0.116)
ω22         0.700 (0.197)           1.472 (0.295)
σ²          44.00 (1.05)            44.01 (1.05)

In order to investigate the effect of misspecification of the Z variable, the model based simulation procedure described in section 3.1 was applied to the two models above, where the data generation was done according to the parameters presented in Table 2. Table 3 summarises the simulation results.

Table 3
Ratios of Mean Square Errors and Ratios of Absolute Relative Errors (in parentheses) for the Diagonal and the Diagonal with Z Estimators Under the Two Respective Data Generation Models

                        Data Generation Model
Estimator            Diagonal         Diagonal with Z
Diagonal             100.0 (100.0)    110.3 (125.4)
Diagonal with Z      126.2 (107.5)    100.0 (100.0)

It is worth noting that in both cases there is a significant loss of efficiency by using an unsuitable estimator. It can also be seen from an individual analysis of MSE for each small area that a considerable gain in efficiency is achieved with the introduction of a small area covariate Z over the diagonal model. For many small areas the MSE of the Diagonal with Z is significantly less than the MSE of the corresponding estimator without Z. Even for those few areas in which the MSE of the Diagonal with Z is unchanged or even slightly increased by the introduction of Z, the difference is not appreciable.
3.3 Comparisons with Regression Estimator

One essential advantage of the multilevel models over regression models is that they recognize that groups (here the small areas) share common features; the areas are not completely independent, as would be assumed by using, for example, a separate linear regression model for each small area. Nevertheless, the relatively small intraclass correlation observed for the data set used, plus the fact that each small area has on average 28 units, could make one think that in this case the use of the multilevel model would not result in great improvement in the small area estimators. However, it is gratifying to know that even in these circumstances the multilevel model small area estimator performs on average better than the synthetic separate regression estimator, under either the multilevel model or even under the regression model. Table 4 illustrates this finding.

The multilevel data generation model used was the General one with the parameters given in section 3.1. The parameters used in the data generation regression model were obtained by fitting a separate regression for each small area.

It can be seen from Table 4 that the Separate Regression estimator, which does not exploit the differences between small areas through small area random effects, shows substantial loss of efficiency when compared with the General estimator.
Table 4
Ratios of Mean Square Errors and Ratios of Absolute Relative Errors (in parentheses) for the General and the Separate Regression Estimators Under the Two Respective Data Generation Models

                         Data Generation Model
Estimator             General          Separate Regression
General               100.0 (100.0)    88.1 (83.1)
Separate Regression   247.6 (154.7)    100.0 (100.0)
Figure 1 illustrates this fact by showing a plot of the ratio of mean square errors between the General estimator and the Separate Regression estimator for each small area. To demonstrate the effect of the small area sample size on the efficiency, the ratio of the MSEs is plotted against the sample size for each small area. It is clear from Figure 1 that the gain in efficiency tends to decrease as the sample size increases.
[Figure 1 appeared here: a scatter plot of the ratio of the MSEs against small area sample size (roughly 0 to 60), with the ratio largest for the smallest areas.]

Figure 1. Model-based efficiencies of the general estimator compared with the separate regression estimator for each small area
3.4 An Evaluation of the MSE Approximation and the MSE Estimator

From the simulation results we may investigate the properties of the MSE approximation (2.7). If we consider the General estimator when the General Data Generation model is used, the MSE approximation appears to be very good. The average underestimation of the MSE approximation was 0.31% of the MSE value, with a range from a largest underestimate of 5.4% of the MSE value through to a largest overestimate of 4.8% of the MSE value. For the situation considered here $T_1$ contributed on average 94.6% of the total variation and $T_2$ a further 4.3%. Given the large component of variance due to $\sigma^2$, these results are not unexpected. For individual areas the component $T_1$ varied between 87.4% and 99.1% of the total, and $T_2$ varied between 0.7% and 10.5% of the total. The component $T_3$ never contributed more than 2.2% of the total MSE for any area.

We also investigated the performance of the MSE estimator represented by equation (2.10) against the "naive" estimator of the MSE, which does not consider the last term of (2.10). The average Root Mean Squared Error of the proposed MSE estimator is 17.5%, ranging from 4.7% to 32.3%, while for the naive estimator the average is 20.9%, ranging from 5.2% to 47.5%. The MSE estimator is on average unbiased, while the naive MSE estimator underestimates the MSE on average by 9.1%, its relative bias ranging from -23.5% to -0.9%. Our results agree with others, see Singh, Stukel and Pfeffermann (1998) and Prasad and Rao (1990), which show that the naive estimator can exhibit severe bias.

4. DISCUSSION

Prasad and Rao (1990) and Battese et al. (1981, 1988) have demonstrated that models which include small area specific components of variance can provide greatly improved small area estimators. Some of the numerical results in this paper show that within the model-based simulation framework even better estimators can be obtained by allowing the small area slopes as well as the intercept to be random.

The overall conclusions from this investigation for this set of parameter values are that: a component of variance model more complex than the Random Intercept estimator is beneficial; overspecification of the model (e.g., using the General estimator with data generated under the Random Intercept Model) does not lead to serious loss of efficiency; the use of small area covariates can also improve the small area estimates; and the use of multilevel models should be preferred rather than the Separate Regression Model. The simulation study confirms that the MSE approximation appears to be precise and that the MSE estimation is approximately unbiased, reflecting the variation in MSE between areas, but further theoretical investigation of the exact order of the approximation should be done.

Clearly model fitting and diagnostics are crucial. If we apply a general mixed model in circumstances where it is only a poor fit to the data, then the results may be disappointing. Considerably more investigation is needed to understand what characteristics of specific small areas are likely to provide efficiency gains if general mixed models are used rather than simpler models.

ACKNOWLEDGEMENTS

We would like to thank the referees and the Editor for their helpful comments on the earlier version of this paper.
APPENDIX A: RESTRICTED ITERATIVE GENERALIZED LEAST SQUARES PROCEDURE

The generalised least squares estimator of $\gamma$ in the model (2.1) is given by

$$\hat{\gamma} = (Z'X'V^{-1}XZ)^{-1}(Z'X'V^{-1}Y) = \Big[\sum_{i=1}^{m} Z_i'X_i'V_i^{-1}X_iZ_i\Big]^{-1}\Big[\sum_{i=1}^{m} Z_i'X_i'V_i^{-1}Y_i\Big], \tag{A.1}$$

where $V = \mathrm{Diag}(V_1, \ldots, V_m)$ and $V_i = \sigma^2 I + X_i\Omega X_i'$ is the covariance matrix of $Y_i$, $i = 1, \ldots, m$.

However, $V$ is assumed to be a function of unknown parameters, thus $\gamma$ cannot be estimated using (A.1). On the other hand, if $\gamma$ is known then

$$Y^* = \mathrm{vech}[(Y - XZ\gamma)(Y - XZ\gamma)'] \tag{A.2}$$

is an unbiased estimator of $\mathrm{vech}(V)$. Furthermore $\mathrm{vech}(V)$ is a linear function of $\theta$. Then we can consider the following linear model:

$$Y^* = F\theta + \xi, \tag{A.3}$$

where $F = \partial\,\mathrm{vech}(V)/\partial\theta$ and $\xi$ is a random variable with mean $0 = (0, \ldots, 0)'$, the covariance of $\xi$ being given by $V_\xi = 2\varphi_n(V \otimes V)\varphi_n'$. The matrix $\varphi_n$ is the linear transformation of $\mathrm{vec}(A)$ into $\mathrm{vech}(A)$, $A$ being any $n \times n$ matrix such that $\mathrm{vech}(A) = \varphi_n\,\mathrm{vec}(A)$; see Fuller (1987) for further details. Then, assuming that $F$ has full rank and $V_\xi$ is known and non-singular, it may be shown that the generalised least squares estimator of $\theta$ is given by

$$\hat{\theta}_{\mathrm{GLS}} = \mathrm{cov}(\hat{\theta}_{\mathrm{GLS}})\,\frac{1}{2}\Big[\frac{\partial\,\mathrm{vec}(V)'}{\partial\theta}\Big](V^{-1} \otimes V^{-1})\,\mathrm{vec}(\tilde{Y}\tilde{Y}'), \tag{A.4}$$

where

$$\mathrm{cov}(\hat{\theta}_{\mathrm{GLS}}) = \Big\{\frac{1}{2}\Big[\frac{\partial\,\mathrm{vec}(V)'}{\partial\theta}\Big](V^{-1} \otimes V^{-1})\Big[\frac{\partial\,\mathrm{vec}(V)}{\partial\theta'}\Big]\Big\}^{-1}$$

and $\tilde{Y} = Y - XZ\hat{\gamma}$.

Note that $\hat{\theta}_{\mathrm{GLS}}$ depends on $\theta$ and $\gamma$, so both may be iteratively estimated. The IGLS procedure starts with an initial estimate of $V$ (that is, setting initial values of $\theta$) which produces an estimate of $\gamma$. Hence replacing the initial estimate of $V$ together with the estimate of $\gamma$ in (A.4) provides an improved estimate of $\theta$. In most cases convergence is achieved after a few iterations between equations (A.1) and (A.4), although it is not always guaranteed.

The RIGLS approach is based on the fact that if $\gamma$ is estimated by using generalised least squares with $V$ known, then

$$E[(Y - XZ\hat{\gamma})(Y - XZ\hat{\gamma})'] = V - XZ(Z'X'V^{-1}XZ)^{-1}Z'X'.$$

The equation above suggests that we use

$$(Y - XZ\hat{\gamma})(Y - XZ\hat{\gamma})' + XZ(Z'X'V^{-1}XZ)^{-1}Z'X' \tag{A.5}$$

instead of $(Y - XZ\hat{\gamma})(Y - XZ\hat{\gamma})'$ at each iteration cycle described above, in order to obtain an approximately unbiased estimator of $V$ and consequently of $\theta$.

As pointed out by Goldstein (1986, 1989), if we start with a consistent estimate of $\gamma$, say the ordinary least squares estimator, then the final estimates will be consistent providing finite fourth moments exist.

It is worth noting that it is possible for the above procedure to yield negative estimates of variances. This problem can be avoided by imposing constraints at each iteration. For further details on this issue see Goldstein (1986).
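The iteration between (A.1) and the moment step can be sketched as follows for the random intercept special case, with the simplification that the moment step uses unweighted least squares rather than the full $(V^{-1} \otimes V^{-1})$ weighting (a rough illustration under these assumptions, not the authors' implementation; all names are ours):

```python
import numpy as np

def rigls_random_intercept(Y, X, Z, n_iter=50):
    """Sketch of the (A.1)/(A.5) cycle for the random intercept case:
    V_i = sigma2 * I + omega00 * J (J = all-ones matrix). Y, X, Z are lists
    of per-area responses, unit-level and area-level design matrices."""
    theta = np.array([1.0, 1.0])                     # (omega00, sigma2)
    for _ in range(n_iter):
        omega00, sigma2 = theta
        Vs = [sigma2 * np.eye(len(y)) + omega00 * np.ones((len(y), len(y)))
              for y in Y]
        # GLS step (A.1) for gamma with V held fixed
        XZ = [x @ z for x, z in zip(X, Z)]
        A = sum(d.T @ np.linalg.solve(V, d) for d, V in zip(XZ, Vs))
        b = sum(d.T @ np.linalg.solve(V, y) for d, y, V in zip(XZ, Y, Vs))
        gamma = np.linalg.solve(A, b)
        # Moment step: regress the bias-corrected cross-products (A.5) on the
        # derivatives of V w.r.t. (omega00, sigma2); GLS weights omitted here,
        # so this is an unweighted simplification of (A.4).
        Ainv = np.linalg.inv(A)
        num, den = np.zeros(2), np.zeros((2, 2))
        for d, y, V in zip(XZ, Y, Vs):
            R = np.outer(y - d @ gamma, y - d @ gamma) + d @ Ainv @ d.T  # (A.5)
            F = [np.ones_like(V), np.eye(len(y))]   # dV/d omega00, dV/d sigma2
            num += [np.sum(f * R) for f in F]
            den += [[np.sum(fa * fb) for fb in F] for fa in F]
        theta = np.linalg.solve(den, num)           # may go negative; see text
    return gamma, theta
```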
APPENDIX B: AN APPROXIMATION TO $E[\hat{\mu}_i - \tilde{\mu}_i]^2$

Prasad and Rao (1990), based on Kackar and Harville (1984), developed a second order approximation to the second term of (2.5) under some regularity conditions:

$$E[\hat{\mu}_i - \tilde{\mu}_i]^2 \approx \mathrm{tr}\Big\{\Big[\frac{\partial d_i'}{\partial\theta}\Big]\,V\,\Big[\frac{\partial d_i'}{\partial\theta}\Big]'\,E[(\hat{\theta} - \theta)(\hat{\theta} - \theta)']\Big\}, \tag{B.1}$$

where, for the model (2.1),

$$d_i' = \bar{X}_i' K_i (I \otimes \Omega) X' V^{-1},$$

$K_i = [0, \ldots, I, \ldots, 0]$ is the $(p+1) \times (p+1)m$ matrix with the identity matrix $I$ of order $p+1$ in the $i$-th position and $0$ as the null matrix of order $p+1$, and $\hat{\theta}$ is any translation-invariant estimator of $\theta = (\theta_1, \ldots, \theta_s)'$, where $\theta_s = \sigma^2$ and $\theta_k$, $k = 1, \ldots, s-1$, are the distinct elements of $\Omega$. Goldstein (1989) proves that under normality of the random terms of model (2.1), the RIGLS estimator of $\theta$ is equivalent to the Restricted Maximum Likelihood Estimator (RMLE), which is translation invariant.

Let us approximate $E[(\hat{\theta} - \theta)(\hat{\theta} - \theta)']$ by the asymptotic covariance matrix of the RMLE estimator, $B$. The $jk$-th element of $B^{-1}$ is given by (see Harville (1977))

$$(B^{-1})_{jk} = \frac{1}{2}\sum_{i=1}^{m}\mathrm{tr}\Big[P_i\frac{\partial V_i}{\partial\theta_j}P_i\frac{\partial V_i}{\partial\theta_k}\Big] \quad \text{for } j \text{ and } k = 1, \ldots, s,$$

where

$$P_i = V_i^{-1} - V_i^{-1}X_iZ_i\Big(\sum_{i=1}^{m} Z_i'X_i'V_i^{-1}X_iZ_i\Big)^{-1}Z_i'X_i'V_i^{-1}.$$

Let $b_{jk}$ be the $jk$-th element of $B$. After some matrix algebra, it can be shown that

$$T_3 = \bar{X}_i'(G_i^{-1})'\Big[\sum_{j=1}^{s-1}\sum_{k=1}^{s-1} b_{jk} A_j C_i A_k'\Big]G_i^{-1}\bar{X}_i - 2\bar{X}_i'(G_i^{-1})'\Big[\sum_{k=1}^{s-1} b_{sk} A_k\Big]R_i\Omega\bar{X}_i + b_{ss}\bar{X}_i'\Omega S_i\Omega\bar{X}_i, \tag{B.2}$$

where $C_i = \sigma^{-2}G_i^{-1}X_i'X_i$; $R_i = \sigma^{-4}G_i^{-2}X_i'X_i$; $S_i = \sigma^{-6}G_i^{-3}X_i'X_i$; and

$$A_k = \frac{\partial\Omega}{\partial\theta_k}, \quad k = 1, \ldots, s-1,$$

is the $(p+1)$-square derivative matrix with respect to $\theta_k$, $k = 1, \ldots, s-1$.
REFERENCES

BATTESE, G.E., and FULLER, W.A. (1981). Prediction of county crop areas using survey and satellite data. Proceedings of the Section on Survey Research Methods, American Statistical Association, 500-505.

BATTESE, G.E., HARTER, R.M., and FULLER, W.A. (1988). An error components model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association, 83, 28-36.

FULLER, W.A. (1987). Measurement Error Models. Chichester: John Wiley.

GHOSH, M., and RAO, J.N.K. (1994). Small area estimation: an appraisal. Statistical Science, 9, 55-93.

GOLDSTEIN, H. (1986). Multilevel mixed linear model analysis using iterative generalised least squares estimation. Biometrika, 73, 43-56.

GOLDSTEIN, H. (1989). Restricted unbiased iterative generalised least squares estimation. Biometrika, 76, 622-623.

GONZALEZ, M.E. (1973). Use and evaluation of synthetic estimates. Proceedings of the Social Statistics Section, American Statistical Association, 33-36.

HARVILLE, D.A. (1977). Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association, 72, 320-340.

HARVILLE, D.A., and JESKE, D.R. (1992). Mean squared error of estimation or prediction under a general linear model. Journal of the American Statistical Association, 87, 724-731.

HENDERSON, C.R. (1975). Best linear unbiased estimation and prediction under a selection model. Biometrics, 31, 423-447.

HOLT, D., and MOURA, F. (1993a). Mixed models for making small area estimates. In: Small Area Statistics and Survey Design, (G. Kalton, J. Kordos, and R. Platek, Eds.), 1, 221-231. Warsaw: Central Statistical Office.

HOLT, D., and MOURA, F. (1993b). Small area estimation using multilevel models. Proceedings of the Section on Survey Research Methods, American Statistical Association, 21-31.

KACKAR, R.N., and HARVILLE, D.A. (1984). Approximations for standard errors of estimators of fixed and random effects in mixed linear models. Journal of the American Statistical Association, 79, 853-862.

LONGFORD, N. (1987). A fast scoring algorithm for maximum likelihood estimation in unbalanced mixed models with nested random effects. Biometrika, 74, 817-827.

MOURA, F.A.S. (1994). Small Area Estimation Using Multilevel Models. Unpublished Ph.D. Thesis, University of Southampton.

PRASAD, N.G.N., and RAO, J.N.K. (1990). The estimation of the mean squared error of small-area estimators. Journal of the American Statistical Association, 85, 163-171.

SINGH, A.C., STUKEL, D., and PFEFFERMANN, D. (1998). Bayesian versus frequentist measures of error in small area estimation. Journal of the Royal Statistical Society, Series B, 60, 377-396.
Survey Methodology, June 1999, Vol. 25, No. 1, pp. 81-86, Statistics Canada
Composite Estimation of Drug Prevalences
for Sub-State Areas
MANAS CHATTOPADHYAY, PARTHA LAHIRI, MICHAEL LARSEN and JOHN REIMNITZ'
ABSTRACT
The Gallup Organization has been conducting household surveys to study state-wide prevalences of alcohol and drug (e.g.,
cocaine, marijuana, etc.) use. Traditional design-based survey estimates of use and dependence for counties and select
demographic groups have unacceptably large standard errors because sample sizes in sub-state groups are too small.
Synthetic estimation incorporates demographic information and social indicators in estimates of prevalence through an
implicit regression model. Synthetic estimates tend to have smaller variances than design-based estimates, but can be very
homogeneous across counties when auxiliary variables are homogeneous. Composite estimates for small areas are weighted
averages of design-based survey estimates and synthetic estimates. A second problem generally not encountered at the state
level but present for sub-state areas and groups concerns estimating standard errors of estimated prevalences that are
to zero. This difficulty affects not only telephone household survey estimates, but also composite estimates. A hierarchical
model is proposed to address this problem. Empirical Bayes composite estimators, which incorporate survey weights, of
prevalences and jackknife estimators of their mean squared errors are presented and illustrated.
KEY WORDS: Alcohol abuse; Drug abuse; Empirical Bayes; Jackknife; Mean squared error; Small area estimation;
Synthetic estimation.
1. INTRODUCTION
The Gallup Organization has been conducting a series of
household surveys for different states to study state-wide
prevalences of the use of alcohol and drugs (e.g., cocaine,
marijuana) among civilian, non-institutionalized adults and
adolescents. The common goal of these surveys is to estimate the use and dependence prevalences for alcohol and
drugs and, on that basis, to project the treatment needs of
dependent users. For planning and resource allocation,
states need precise estimates of prevalences for certain
subgroups of the target population. For example, it is of
interest to estimate prevalences for sub-state planning
regions and counties in demographic subpopulations (e.g.,
older white males).
Traditional design-based procedures to estimate use and
dependence for subpopulations have two drawbacks. First,
if the traditional design-based survey estimate for a subgroup is positive, but sample size is small, then the corresponding standard error is unacceptably large. Second, since
the problem is to estimate the proportion of a rare event, it
is possible that the design-based procedure produces an
estimate of zero, and standard error estimation formulas for a particular subgroup, if applied, would give a false impression of the true underlying variability.
To improve on the traditional design-based estimators,
one can use certain supplementary information usually
available from administrative records in conjunction with
the telephone survey data. This generally is done by using either implicit or explicit models that "borrow strength" or
incorporate additional information that relates the various
groups, counties, and planning regions to one another. The
method proposed here combines information across
counties in order to deal with the problem of zero estimates in
some counties. It is derived from a model that bounds the
proportions away from 1, which is reasonable in an application with proportions expected to be very small, and estimates parameters using empirical Bayes methods. The
procedure also incorporates the survey sampling weights in
estimation.
For a detailed account of small-area estimation methods,
see Ghosh and Rao (1994). Other recent references can be
found in Farrell, MacGibbon and Tomberlin (1997) and
Malec, Sedransk, Moriarity and Leclere (1997). Farrell
et al. (1997) propose estimating small-area proportions
with empirical Bayes procedures. They model the
proportions via a logistic regression that relates expected
proportions to respondent variables and includes random
effects for the small areas. Malec et al. (1997) use
hierarchical Bayes models. They use logistic regression
models to relate individual characteristics to probabilities of
an outcome and then use a linear regression model to relate
coefficients across small areas. Most existing methods,
including those of Farrell et al. (1997) and Malec et al.
(1997) do not directly use survey sampling weights in
estimation.
The survey design used by Gallup is described in
section 2. In section 3, notations used in the paper are
introduced. A direct design-based estimator and two
synthetic estimators are presented in section 4. In section 5,
several composite estimators of prevalences of alcohol and
drug use and dependence are given. In this section, certain
Manas Chattopadhyay, The Gallup Organization; Partha Lahiri, University of Nebraska-Lincoln; Michael Larsen, Harvard University, Department of Statistics,
Science Center, One Oxford Street, Cambridge, MA 02138, U.S.A.; John Reimnitz, The Gallup Organization.
empirical Bayes estimators and jackknife estimators of their
mean squared errors (MSE) are proposed. In section 6,
estimators presented in sections 4 and 5 are applied to a
data set from a particular state. The focus of the analysis in
this study is taken to be county level estimates. Sample size
planning considerations originally were concerned with
larger sub-state planning areas.
2. SURVEY
For sampling purposes, the state is divided into a few
planning regions and samples are collected independently
for each planning region using a truncated stratified random
digit dialing (RDD) method of Casady and Lepkowski
(1993). This design stratifies the Bellcore (BCR) frame
into two strata: a high density stratum consisting of 100banks with one or more listed residential numbers and a low
density stratum consisting of all the remaining nufnbers in
the BCR frame. About 52 percent of the numbers in the
high-density stratum are estimated to be working residential
numbers whereas in the low-density stratum, the corresponding percentage is only about 2 percent. The CasadyLepkowski procedure exploits the significant difference in
the cost of sampling between the two strata by optimally
determining the sample size in each stratum. In the
truncated version of the procedure, sampling is done only
from the high-density stratum.
Sample size in the original study was determined in order
to estimate statewide prevalence with a desired degree of
accuracy. Sample sizes were allocated to the planning
regions using an optimal allocation scheme. Data on drug
treatment admissions for the adult population in each
county were used to compute the index prevalence (rate of
admissions) percent in every planning region. These indices
were then used to calculate the optimum sample size for
each planning region. As a result of optimal allocation,
relatively larger sample sizes were allocated to planning
regions with higher index prevalences. The optimal allocation also minimizes the variance of the estimators. Gallup
also oversampled the 18-45 age group by planning region,
because it is the age group with relatively higher rates of
illicit drug use. Due to optimal allocation (which may be
disproportional), the age oversampling and the complex
design, weighting was needed to compute estimates from
the sample data. The necessary weights, commonly known
as sampling weights, were computed using current
estimates of the population based on census data.
Due to budgetary constraints, it is not possible to
increase sample size for all sub-state regions and groups in
order to achieve the desired accuracy. To estimate alcohol
and drug prevalences, we consider empirical Bayes procedures (see Efron and Morris 1973, Fay and Herriot 1979,
Ghosh and Lahiri 1987, among others) to improve on usual
design-based estimates of drug prevalences by taking
advantage of demographic measurements and social
indicator data.
Other variables that possibly are related to use and
dependence prevalence by county and that are available
from Census include the percent of population that is over
65, under 30, white, male, married, and renters. Local
governments can provide data by county on social
indicators, such as DUI (Driving Under the Influence) rate,
mortality rate, per capita liquor licenses, and drug and
alcohol treatment admission rates. The more closely auxiliary variables relate to use and dependence prevalence, the
more likely it is that methods that "borrow strength" across
areas and groups, such as the empirical Bayes methods
presented here, can be employed to meet the desired
accuracy levels for sub-state areas.
3. NOTATIONS
Let $n_i$ be the sample size allocated to the $i$-th planning region, $i = 1, \ldots, I$ ($n = \sum_i n_i$). Samples are drawn independently in each planning region using RDD telephone surveys. After the sample is observed, suppose each region is post-stratified into $K$ demographic groups. These groups are formed by cross-classifying gender (Male, Female) and age (18-24, 25-44, 45-64, 65+), resulting in $K = 2 \times 4 = 8$ groups. Suppose there are $J_i$ counties in the $i$-th planning region ($i = 1, \ldots, I$) and $n_{ijk}$ observations within the $k$-th demographic group in the $j$-th county belonging to the $i$-th planning region ($i = 1, \ldots, I$; $j = 1, \ldots, J_i$; $k = 1, \ldots, K$).

Since typically $n_{ijk}$ is small, there is a good chance that some of the $K$ demographic groups are not represented in a particular county. Let $S_{ij}$ be the set of demographic groups in the $j$-th county within the $i$-th stratum ($i = 1, \ldots, I$; $j = 1, \ldots, J_i$) for which individuals have completed surveys.

Let $y_{ijkl}$ be the $l$-th observation (0 or 1) for the $k$-th demographic group in the $j$-th county belonging to the $i$-th planning area ($i = 1, \ldots, I$; $j = 1, \ldots, J_i$; $k \in S_{ij}$; $l = 1, \ldots, n_{ijk}$). Let $w_{ijkl}$ be the corresponding sampling weight available from the survey. The goal is to estimate $\pi_{ij}$, the true prevalence of substance use or dependence for the $j$-th county within the $i$-th planning area ($i = 1, \ldots, I$; $j = 1, \ldots, J_i$).
4. DIRECT SURVEY ESTIMATOR AND
SYNTHETIC ESTIMATORS
The direct sample survey estimator of $\pi_{ij}$ is given by

$$\hat{\pi}_{ij}^D = \sum_{k \in S_{ij}}\sum_{l=1}^{n_{ijk}} w_{ijkl}\,y_{ijkl}\Big/\sum_{k \in S_{ij}}\sum_{l=1}^{n_{ijk}} w_{ijkl}.$$

The sample size available from a county could be very small (sometimes as small as 3 or 4). Thus, the estimator is highly unreliable. Other direct survey estimators are defined similarly. For example, the direct survey estimator of $\pi_{ik}$, the true prevalence in the $k$-th demographic group in the $i$-th planning region, is

$$\hat{\pi}_{ik}^D = \sum_{j: k \in S_{ij}}\sum_{l=1}^{n_{ijk}} w_{ijkl}\,y_{ijkl}\Big/\sum_{j: k \in S_{ij}}\sum_{l=1}^{n_{ijk}} w_{ijkl},$$

where the notation $j: k \in S_{ij}$ means that the summation is over counties $j$ in which demographic group $k$ is observed.

Additional problems arise when estimating the proportion of rare events. It is quite likely that all observations in a county may be zero, resulting in a zero estimate for a county. If usual estimates of standard error were applied, an estimated zero standard error of the estimate would give a false impression of the uncertainty of the estimate. Thus, it is very important to improve on the direct survey estimator.

Synthetic estimators borrow strength from related counties through implicit modeling of supplementary data from the U.S. Census Bureau along with the telephone survey data. A synthetic estimator, which has been used in the past to estimate alcohol prevalence at the county level, is given by

$$\hat{\pi}_{ij}^{S1} = \sum_{k=1}^{K} a_{ijk}\,\hat{\pi}_k^D,$$

where $\hat{\pi}_k^D$ is the statewide direct survey estimator of prevalence of alcohol for the $k$-th demographic group and $a_{ijk}$ is the proportion of individuals belonging to the $k$-th demographic group in the $j$-th county within the $i$-th planning area ($i = 1, \ldots, I$; $j = 1, \ldots, J_i$; $k = 1, \ldots, K$). The value $a_{ijk}$ is available from current census estimates. For the household survey reported in this paper, the $a_{ijk}$ values were obtained from database vendors like Claritas Data Services of Ithaca, New York. Based on the latest available census data, the $a_{ijk}$ values are typically estimated using projection models. In practice, therefore, the $a_{ijk}$ values are not true proportions but are current census estimates of reasonable precision. Outdated or inaccurate $a_{ijk}$ values cause the estimators using them to be biased. If population projections are used to calculate poststratification weighting adjustments in the survey, the direct survey estimator also suffers from this source of bias. It is beyond the scope of this paper to study the impact of alternate population projections. In proposing $\hat{\pi}_{ij}^{S1}$, it is implicitly assumed that the prevalences for alcohol and drug use for the $k$-th group in all the counties are the same (or nearly the same).

A less restrictive synthetic estimator of prevalence of alcohol and drug use is given by

$$\hat{\pi}_{ij}^{S2} = \sum_{k=1}^{K} a_{ijk}\,\hat{\pi}_{ik}^D,$$

where $\hat{\pi}_{ik}^D$ is a direct survey estimator of $\pi_{ik}$, the prevalence of alcohol or drug use for the $k$-th demographic group in the $i$-th planning region. It is implicitly assumed that the prevalences in the $k$-th group are the same (or nearly the same) for all the counties in a planning region. This assumption is more "regional" or less restrictive than the one made in proposing $\hat{\pi}_{ij}^{S1}$. A similar direct survey estimator $\hat{\pi}_{ijk}^D$ for the $k$-th demographic group within a specific county $j$ in region $i$ may be defined by restricting the sample to county $j$ only. As compared to $\hat{\pi}_{ijk}^D$, the estimator $\hat{\pi}_{ik}^D$ will have relatively lower variance, although it may have some bias since it does not distinguish the counties. $\hat{\pi}_{ijk}^D$, on the other hand, may be based on a very small sample size and hence may be significantly less reliable in terms of its variability.

The above synthetic estimators achieve reductions in variances at the cost of increasing bias. The synthetic estimators distinguish counties only through an indirect variable $a_{ijk}$ obtained from the census, whereas the direct estimator treats each county separately.
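A minimal sketch of the direct, synthetic and (as introduced in the next section) simple composite estimators for the counties of one planning region (array names and shapes are ours):

```python
import numpy as np

def direct_estimate(w, y):
    """Weighted direct estimator for one domain; w, y are 1-d arrays."""
    return np.sum(w * y) / np.sum(w) if len(w) else np.nan

# pi_k: region- or statewide direct estimates by demographic group;
# a_jk[j, k]: census proportion of group k in county j.
def synthetic_estimate(a_jk, pi_k):
    return a_jk @ pi_k                      # one value per county

def simple_composite(a_jk, pi_jk_direct, pi_k, observed):
    """observed[j, k] is True when group k has completed surveys in county j
    (the set S_ij); use the county-level direct estimate there and the
    group-level estimate elsewhere."""
    return np.sum(np.where(observed, a_jk * pi_jk_direct, a_jk * pi_k), axis=1)
```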
COMPOSITE ESTIMATORS OF TI..
USING TELEPHONE
SURVEY AND CENSUS DATA
A compromise between a direct survey estimator and a synthetic estimator is a composite estimator. A number of different composite estimators are proposed here based on the following identity:

$$\pi_{ij} = \sum_{k \in S_{ij}} a_{ijk}\,\pi_{ijk} + \sum_{k \notin S_{ij}} a_{ijk}\,\pi_{ijk},$$

where $\pi_{ijk}$ is the prevalence for alcohol and drug use, $a_{ijk}$, as defined above, is the proportion of individuals belonging to the $k$-th demographic group in the $j$-th county within the $i$-th planning area ($i = 1, \dots, I$; $j = 1, \dots, J_i$; $k = 1, \dots, K$), and $S_{ij}$ is the set of demographic groups observed in the sample for county $j$ of planning area $i$.

A simple composite estimator of $\pi_{ij}$ is obtained when, for $k \in S_{ij}$, $\pi_{ijk}$ is estimated by $\hat{\pi}_{ijk}^{(D)}$, the direct survey estimator of $\pi_{ijk}$, and, for $k \notin S_{ij}$, $\pi_{ijk}$ is estimated by $\hat{\pi}_{ik}$. The estimator is then given by

$$\hat{\pi}_{ij}^{(C)} = \sum_{k \in S_{ij}} a_{ijk}\,\hat{\pi}_{ijk}^{(D)} + \sum_{k \notin S_{ij}} a_{ijk}\,\hat{\pi}_{ik}.$$

In the above formula, $\hat{\pi}_{ijk}^{(D)}$ ($k \in S_{ij}$) is estimated using a small sample and thus there is the possibility of improving on $\hat{\pi}_{ijk}^{(D)}$ (and, hence, on $\hat{\pi}_{ij}^{(C)}$) by borrowing strength from relevant sources. To this end, an empirical Bayes estimate of $\pi_{ij}$ is proposed based on the following model.
Model

1. Given the $\pi_{ijk}$'s, the $y_{ijkl}$'s are uncorrelated with one another with $E(y_{ijkl} \mid \pi_{ijk}) = \pi_{ijk}$ and $\mathrm{Var}(y_{ijkl} \mid \pi_{ijk}) = \pi_{ijk}(1 - \pi_{ijk})$ for $i = 1, \dots, I$; $j = 1, \dots, J_i$; $k = 1, \dots, K$; $l = 1, \dots, n_{ijk}$.

2. The $\pi_{ijk}$'s are uncorrelated with $E(\pi_{ijk}) = \mu_{ik}$ and $\mathrm{Var}(\pi_{ijk}) = d\,\mu_{ik}^2$ ($i = 1, \dots, I$; $j = 1, \dots, J_i$; $k = 1, \dots, K$).

If $\pi_{ijk} \sim \mathrm{Uniform}(0, 2\mu_{ik})$, then in statement (2) $d = 1/3$. Thus, unlike the implicit assumption made in the synthetic estimator $\hat{\pi}_{ij}^{(S2)}$ (i.e., $\pi_{ijk} = \mu_{ik}$), some variability of proportions across counties within a region for a particular demographic group is allowed.

The first assumption of the model implies that, given $\pi_{ijk}$, the $\hat{\pi}_{ijk}^{(D)}$'s are uncorrelated with one another, with $E(\hat{\pi}_{ijk}^{(D)} \mid \pi_{ijk}) = \pi_{ijk}$ and $\mathrm{Var}(\hat{\pi}_{ijk}^{(D)} \mid \pi_{ijk}) = c_{ijk}\,\pi_{ijk}(1 - \pi_{ijk})$, where $c_{ijk} = \sum_l w_{ijkl}^2 / (\sum_l w_{ijkl})^2$ for $i = 1, \dots, I$; $j = 1, \dots, J_i$; $k = 1, \dots, K$. The linear Bayes estimator of $\pi_{ij}$, under the model and squared error loss function, is given by

$$\hat{\pi}_{ij}^{(B)} = \sum_{k \in S_{ij}} a_{ijk}\left(B_{ijk}\,\hat{\pi}_{ijk}^{(D)} + (1 - B_{ijk})\,\mu_{ik}\right) + \sum_{k \notin S_{ij}} a_{ijk}\,\mu_{ik},$$

where $B_{ijk} = d\mu_{ik}^2 \big/ \left(d\mu_{ik}^2 + c_{ijk}\left(\mu_{ik} - (d + 1)\mu_{ik}^2\right)\right)$. The weight or shrinkage factor $B_{ijk}$ is the ratio of the variance of $\pi_{ijk}$ in the model to the (unconditional) variance of $\hat{\pi}_{ijk}^{(D)}$.

Since the Bayes estimator involves the unknown parameter $\mu_{ik}$, it cannot be used in practice. The following empirical Bayes estimator of $\pi_{ij}$ is obtained when $\mu_{ik}$ is replaced by an estimator, say $\hat{\mu}_{ik}$, of $\mu_{ik}$:

$$\hat{\pi}_{ij}^{(EB)} = \sum_{k \in S_{ij}} a_{ijk}\left(\hat{B}_{ijk}\,\hat{\pi}_{ijk}^{(D)} + (1 - \hat{B}_{ijk})\,\hat{\mu}_{ik}\right) + \sum_{k \notin S_{ij}} a_{ijk}\,\hat{\mu}_{ik},$$

where $\hat{B}_{ijk} = d\hat{\mu}_{ik}^2 \big/ \left(d\hat{\mu}_{ik}^2 + c_{ijk}\left(\hat{\mu}_{ik} - (d + 1)\hat{\mu}_{ik}^2\right)\right)$. The estimator of $\mu_{ik}$ is taken to be $\hat{\mu}_{ik} = \hat{\pi}_{ik}$.

Mean Square Errors

The mean squared error (MSE) of the Bayes estimator $\hat{\pi}_{ij}^{(B)}$ is defined as $\mathrm{MSE}(\hat{\pi}_{ij}^{(B)}) = E(\hat{\pi}_{ij}^{(B)} - \pi_{ij})^2$, where the (unconditional) expectation is taken with respect to the model. It can be checked that

$$\mathrm{MSE}(\hat{\pi}_{ij}^{(B)}) = \mathrm{Var}(\hat{\pi}_{ij}^{(B)} - \pi_{ij}) = \mathrm{Var}(\pi_{ij}) + \mathrm{Var}(\hat{\pi}_{ij}^{(B)}) - 2\,\mathrm{Cov}(\hat{\pi}_{ij}^{(B)}, \pi_{ij}) = \mathrm{Var}(\pi_{ij}) - \mathrm{Var}(\hat{\pi}_{ij}^{(B)}) = d\left(\sum_{k \in S_{ij}} a_{ijk}^2 (1 - B_{ijk})\,\mu_{ik}^2 + \sum_{k \notin S_{ij}} a_{ijk}^2\,\mu_{ik}^2\right).$$

It is customary to take $\mathrm{MSE}(\hat{\pi}_{ij}^{(B)})$ as the MSE of the empirical Bayes estimator $\hat{\pi}_{ij}^{(EB)}$. However, $\mathrm{MSE}(\hat{\pi}_{ij}^{(B)})$ will underestimate the MSE of $\hat{\pi}_{ij}^{(EB)}$ since it does not incorporate the uncertainty due to the estimation of the parameter $\mu_{ik}$. See Prasad and Rao (1990) and Lahiri and Rao (1995) in this context. Using a standard Bayesian argument, it can be shown that

$$\mathrm{MSE}(\hat{\pi}_{ij}^{(EB)}) = \mathrm{MSE}(\hat{\pi}_{ij}^{(B)}) + E(\hat{\pi}_{ij}^{(EB)} - \hat{\pi}_{ij}^{(B)})^2.$$

It is necessary to estimate $\mathrm{MSE}(\hat{\pi}_{ij}^{(B)})$ since it contains the unknown parameter $\mu_{ik}$. The first term, $\mathrm{MSE}(\hat{\pi}_{ij}^{(B)})$, can be estimated by

$$\mathrm{mse}_J(\hat{\pi}_{ij}^{(B)}) = \mathrm{mse}(\hat{\pi}_{ij}^{(B)}) - \frac{J_i - 1}{J_i} \sum_{u=1}^{J_i} \left(\mathrm{mse}_{(-u)}(\hat{\pi}_{ij}^{(B)}) - \mathrm{mse}(\hat{\pi}_{ij}^{(B)})\right),$$

where

$$\mathrm{mse}(\hat{\pi}_{ij}^{(B)}) = d\left(\sum_{k \in S_{ij}} a_{ijk}^2 (1 - \hat{B}_{ijk})\,\hat{\mu}_{ik}^2 + \sum_{k \notin S_{ij}} a_{ijk}^2\,\hat{\mu}_{ik}^2\right)$$

and

$$\mathrm{mse}_{(-u)}(\hat{\pi}_{ij}^{(B)}) = d\left(\sum_{k \in S_{ij}} a_{ijk}^2 (1 - \hat{B}_{ijk(-u)})\,\hat{\mu}_{ik(-u)}^2 + \sum_{k \notin S_{ij}} a_{ijk}^2\,\hat{\mu}_{ik(-u)}^2\right),$$

with

$$\hat{\mu}_{ik(-u)} = \sum_{j \ne u} \sum_{l} w_{ijkl}\,y_{ijkl} \Big/ \sum_{j \ne u} \sum_{l} w_{ijkl}$$

and

$$\hat{B}_{ijk(-u)} = d\hat{\mu}_{ik(-u)}^2 \Big/ \left(d\hat{\mu}_{ik(-u)}^2 + c_{ijk}\left(\hat{\mu}_{ik(-u)} - (d + 1)\hat{\mu}_{ik(-u)}^2\right)\right).$$

See Jiang, Lahiri, and Wan (1998) for comments on these estimators. The second term, $E(\hat{\pi}_{ij}^{(EB)} - \hat{\pi}_{ij}^{(B)})^2$, can be estimated with the following jackknife estimator:

$$E_J = \frac{J_i - 1}{J_i} \sum_{u=1}^{J_i} \left(\hat{\pi}_{ij(-u)}^{(EB)} - \hat{\pi}_{ij}^{(EB)}\right)^2,$$

where

$$\hat{\pi}_{ij(-u)}^{(EB)} = \sum_{k \in S_{ij}} a_{ijk}\left(\hat{B}_{ijk(-u)}\,\hat{\pi}_{ijk}^{(D)} + (1 - \hat{B}_{ijk(-u)})\,\hat{\mu}_{ik(-u)}\right) + \sum_{k \notin S_{ij}} a_{ijk}\,\hat{\mu}_{ik(-u)}.$$

Thus $\mathrm{MSE}(\hat{\pi}_{ij}^{(EB)})$ is estimated by

$$\mathrm{mse}(\hat{\pi}_{ij}^{(EB)}) = \mathrm{mse}_J(\hat{\pi}_{ij}^{(B)}) + E_J.$$

Jackknife methods are reviewed in the recent text by Shao and Tu (1995).
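A compact sketch of the empirical Bayes computation and of the jackknife second term may help fix the notation. It assumes NumPy arrays indexed over the $K$ demographic groups and is an illustration under the model above, not the authors' implementation:

```python
import numpy as np

d = 1.0 / 3.0  # model variance ratio; d = 1/3 under the Uniform(0, 2*mu_ik) prior

def shrinkage(mu, c):
    """B_ijk = d*mu^2 / (d*mu^2 + c*(mu - (d + 1)*mu^2)), elementwise over k."""
    return d * mu ** 2 / (d * mu ** 2 + c * (mu - (d + 1.0) * mu ** 2))

def eb_estimate(a, pi_direct, observed, mu, c):
    """Empirical Bayes composite estimate of pi_ij for one county.

    a         : census proportions a_ijk (length K)
    pi_direct : direct county estimates; entries for k not in S_ij are ignored
    observed  : boolean mask for S_ij, the groups observed in the county
    mu        : hat(mu)_ik, here the region-level direct estimates pi_ik
    c         : c_ijk = sum_l w_ijkl^2 / (sum_l w_ijkl)^2, per group
    """
    B = shrinkage(mu, c)
    county = np.where(observed, B * np.nan_to_num(pi_direct) + (1.0 - B) * mu, mu)
    return float(np.sum(a * county))

def jackknife_second_term(eb_full, eb_delete_one):
    """E_J = ((J_i - 1)/J_i) * sum_u (eb_(-u) - eb)^2, where eb_(-u) is the EB
    estimate recomputed with hat(mu)_ik(-u) after deleting county u."""
    lo = np.asarray(eb_delete_one, dtype=float)
    J = lo.size
    return (J - 1.0) / J * float(np.sum((lo - eb_full) ** 2))
```

A full MSE estimate would pair this jackknife term with the bias-corrected estimate of $\mathrm{mse}(\hat{\pi}_{ij}^{(B)})$ described above.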
6. AN EXAMPLE
In this study, the primary objective is to provide information about treatment need. Anyone who meets the criteria for lifetime dependence or abuse as defined by the National Technical Center's DSM-III-R criteria is considered a member of the group of respondents who may have needed treatment during the last year. Several indicator variables were created in the dataset to identify respondents with a diagnosis for substance dependence or abuse for alcohol or drugs. For the purpose of numerical calculations, these indicator variables with 0 and 1 as possible values were treated as response variables ($y_{ijkl}$).

In order to save space, results are presented for the outcome variable Alcohol Dependence only. Results on other response variables can be obtained from the authors. In order to preserve confidentiality, results for only 40 counties, identified as counties 1 through 40, are reported. Table 1 contains five different estimates of prevalence for alcohol dependence. In general, the direct estimates are highly variable and are often zero. The first synthetic estimator (S1) is the most stable, producing no zero estimates and estimates with little variability. The second synthetic estimator (S2) is similar to S1, but not as restrictive. The first synthetic estimates are very homogeneous, while the second synthetic estimates are homogeneous within the four planning areas. The estimates produced by the composite estimator are more variable than the other estimates. The empirical Bayes estimator produces estimates very similar to those of S2. In the model leading to the empirical Bayes estimator, $d$ was chosen to be one third.
Table 1
Five Estimators of Alcohol Dependence Prevalence Expressed as Percents for Forty Counties.
Estimated Standard Errors for Direct (Est.se) and Square Roots of Estimated Mean Square Errors for Empirical Bayes (√Est.mse) Estimates in Parentheses, Also as Percents

County  Sample Size  Groups Observed  Direct (Est.se)  Synthetic 1  Synthetic 2  Composite  Empirical Bayes (√Est.mse)
   1         30            8            1.7  (2.4)        3.4          1.6          0.9        1.6  (0.33)
   2        111            8            4.4  (2.0)        3.8          1.8          7.2        2.1  (0.35)
   3         36            8            0.0  (0.0)        3.6          3.3          0.0        3.0  (0.85)
   4          6            5            0.0  (0.0)        3.3          5.6          1.6        5.3  (1.79)
   5         37            8            9.4  (4.8)        3.3          5.6         14.1        6.9  (1.78)
   6        136            8            1.6  (1.1)        3.4          3.0          1.7        2.7  (0.67)
   7         25            6            9.3  (5.8)        3.4          3.1          9.9        3.1  (0.81)
   8         20            7            0.0  (0.0)        3.6          3.2          0.4        3.1  (0.84)
   9          3            3            0.0  (0.0)        3.4          5.8          5.6        5.8  (1.93)
  10         81            8            1.5  (1.3)        3.4          2.1          0.7        1.9  (0.54)
  11         58            8            0.0  (0.0)        3.3          1.6          0.0        1.5  (0.33)
  12         14            6            7.0  (6.8)        3.5          1.7          5.0        1.8  (0.35)
  13         37            8            5.7  (3.8)        3.3          5.5         12.9        6.4  (1.75)
  14         12            4            0.0  (0.0)        3.5          1.7          0.8        1.6  (0.33)
  15        120            8            2.4  (1.4)        3.3          5.6          2.0        4.4  (1.56)
  16         32            7            4.1  (3.5)        3.3          3.0          2.5        3.0  (0.77)
  17         48            8            2.8  (2.4)        3.8          1.8          1.3        1.8  (0.37)
  18        316            8            3.9  (1.1)        3.4          3.0          3.2        3.2  (0.60)
  19         19            5            0.0  (0.0)        3.4          5.7          3.7        5.7  (1.95)
  20         20            6            3.1  (3.9)        3.6          3.2         14.9        3.2  (0.82)
  21        102            8            2.7  (1.6)        3.3          5.6          4.1        5.8  (1.50)
  22        124            8            4.2  (1.8)        3.3          2.1          1.8        2.2  (0.42)
  23        121            8            9.7  (2.7)        4.3          8.0         11.8        8.8  (2.11)
  24         22            6            0.0  (0.0)        3.3          2.0          0.2        1.9  (0.54)
  25         32            6            7.8  (4.7)        3.3          1.6          2.8        1.8  (0.33)
  26         28            7            0.0  (0.0)        3.5          1.7          0.0        1.6  (0.37)
  27         63            8            2.2  (1.8)        3.2          5.6          1.6        4.9  (1.74)
  28          5            5           10.5 (13.7)        3.4          1.6         14.2        1.7  (0.35)
  29         12            5            0.0  (0.0)        3.5          3.1          1.8        3.0  (0.81)
  30         11            6            0.0  (0.0)        3.2          1.5          0.0        1.5  (0.33)
  31         44            8            4.6  (3.2)        3.5          5.9         17.0        5.8  (1.87)
  32         52            8            8.4  (3.8)        3.7          3.4          8.4        4.1  (0.84)
  33        144            8            2.5  (1.3)        3.4          2.2          2.5        2.1  (0.50)
  34         49            7            2.9  (2.4)        3.6          1.7          1.3        1.7  (0.35)
  35         22            8            0.0  (0.0)        3.3          3.0          0.0        2.8  (0.77)
  36         17            6            0.0  (0.0)        3.4          3.1          0.3        2.9  (0.82)
  37         26            6            4.2  (4.0)        3.0          2.0          3.4        2.1  (0.54)
  38         16            6            0.0  (0.0)        3.4          5.8          3.7        5.7  (1.97)
  39         10            6            0.0  (0.0)        3.5          3.1          0.6        3.0  (0.81)
  40        144            8            5.3  (1.9)        3.4          3.1          2.9        3.5  (0.69)
Table 2
Summary of Five Estimators of Alcohol Dependence Prevalence for All Counties. Results Expressed as Percents.

Estimator        minimum  1st quartile  median  3rd quartile  maximum  mean  standard deviation
Direct             0.0        0.0         2.2       4.3         10.5    2.8         3.2
Synthetic 1        3.0        3.3         3.4       3.5          4.3    3.5         0.2
Synthetic 2        1.5        1.8         3.0       4.4          8.0    3.2         1.7
Composite          0.0        0.4         1.7       4.6         17.5    3.7         4.8
Empirical Bayes    1.5        1.8         2.8       4.2          8.8    3.2         4.8

Table 1 also displays the estimated standard errors of the direct estimates and the square roots of the estimated mean squared errors (see section 4) of the empirical Bayes estimates. The standard errors of the direct estimates, which are calculated as

$$\left\{\hat{\pi}_{ij}^{(D)}\left(1 - \hat{\pi}_{ij}^{(D)}\right) \big/ n_{ij}\right\}^{1/2},$$

are often (incorrectly) estimated to be zero and are quite variable. The square roots of the estimated MSE of the empirical Bayes estimates are relatively stable and always below .025.

Table 2 summarizes the alcohol dependence estimates in the previous table for all counties in the state. The means of the synthetic and composite estimates are higher than the mean of the direct estimates, because there are fewer zero estimates and the means in the summary tables are unweighted.

7. CONCLUSION

We have proposed simple empirical Bayes estimators to estimate county level prevalences. Empirical Bayes estimators are found to be very effective when sample sizes for the counties are small and when prevalences are extremely small. We have introduced a measure of uncertainty of the proposed empirical Bayes estimator based on the jackknife method. The proposed measure incorporates additional sources of variability due to the estimation of various model parameters. In the model presented in this paper, we have implicitly assumed that the selection probabilities are unrelated to $y_{ijkl}$. In the household study reported in this paper, the selection probabilities were unequal and depended on several factors like the number of telephone lines and the number of adult household members in the household. None of these variables were related to $y_{ijkl}$. The sample allocation to different regions, however, was done based on the number of "treatment admissions" in each region. Hence, the selection probabilities might be indirectly related to $y_{ijkl}$. In this paper, we have not addressed the issue of sample selection bias, which can be handled appropriately by following procedures discussed in Pfeffermann (1993).

In this paper, we have not considered the use of auxiliary variables in the model to relate small areas to one another and to facilitate improved estimation. The use of available auxiliary data from the U.S. Census and other administrative records may be a sensible use of resources that can be used to improve planning for treatment of drug and alcohol abuse and dependence. We plan to do further work in this area with an actual example in a future paper.
ACKNOWLEDGEMENTS
We wish to thank an anonymous referee who had many
useful suggestions on improving our paper and the Gallup
Organization for partial support. Additionally, Partha Lahiri
wishes to acknowledge partial support from U.S. National
Science Foundation Grant SBR-9705574.
REFERENCES

CASADY, R.J., and LEPKOWSKI, J.M. (1993). Stratified telephone survey designs. Survey Methodology, 19, 103-113.

EFRON, B., and MORRIS, C. (1973). Stein's estimation rule and its competitors - an empirical Bayes approach. Journal of the American Statistical Association, 68, 117-130.

FARRELL, P.J., MacGIBBON, B., and TOMBERLIN, T.J. (1997). Empirical Bayes estimators of small area proportions in multistage designs. Statistica Sinica, 7, 1065-1083.

FAY, R., and HERRIOT, R. (1979). Estimates of income for small places: an application of James-Stein procedures to census data. Journal of the American Statistical Association, 74, 269-277.

GHOSH, M., and LAHIRI, P. (1987). Robust empirical Bayes estimation of means from stratified samples. Journal of the American Statistical Association, 82, 1153-1162.

GHOSH, M., and RAO, J.N.K. (1994). Small area estimation: an appraisal. Statistical Science, 9, 55-93.

JIANG, J., LAHIRI, P., and WAN, S. (1998). Jackknifing Mean Squared Error of Empirical Best Predictor. Unpublished manuscript.

LAHIRI, P., and RAO, J.N.K. (1995). Robust estimation of mean squared error of small area estimators. Journal of the American Statistical Association, 90, 758-766.

MALEC, D., SEDRANSK, J., MORIARITY, C.L., and LECLERE, F.B. (1997). Small area inference for binary variables in the National Health Interview Survey. Journal of the American Statistical Association, 92, 815-826.

PFEFFERMANN, D. (1993). The role of sampling weights when modeling survey data. International Statistical Review, 61, 317-337.

PRASAD, N.G.N., and RAO, J.N.K. (1990). The estimation of mean squared errors of small area estimators. Journal of the American Statistical Association, 85, 163-171.

SHAO, J., and TU, D. (1995). The Jackknife and the Bootstrap. New York: Springer-Verlag.
Survey Methodology, June 1999
Vol. 25, No. 1, pp. 87-98
Statistics Canada
Some Issues in the Estimation of Income Dynamics
SUSANA RUBIN BLEUER and MILORAD KOVAČEVIĆ
ABSTRACT

Two design-based estimators of gross flows and transition rates are considered. One makes use of the cross-sectional samples for the estimation of the income class boundaries at each time period and the longitudinal sample for the estimation of counts of units in the longitudinal population (longitudinal counts); this is the mixed estimator. The other one is entirely based on the longitudinal sample, both for the estimation of the class boundaries and the longitudinal counts; this is the longitudinal estimator. We compare the two estimators in the presence of large attrition rates, by means of a simulation. We find that under a less than perfect model of compensation for attrition, the mixed estimator is usually more sensitive to model bias than the longitudinal estimator. Furthermore, we find that for the mixed estimator, the magnitude of this bias overshadows the small gain in precision when compared to the longitudinal estimator. The results are illustrated with data from the Survey of Labour and Income Dynamics and the Longitudinal Administrative Database of Statistics Canada.

KEY WORDS: Attrition; Gross flows; Transition rates; Longitudinal weighting; Cross-sectional weighting; Bootstrap variance estimator.
1. INTRODUCTION

Gross flows are counts of transitions from one time point to the other between a number of states for individuals in a population. Related parameters are longitudinal proportions and transition rates. Longitudinal proportions are relative gross flows, while transition rates are relative gross flows conditional on the initial transition state. Estimates of these parameters for transitions between different income classes are required in studies of income dynamics and can be obtained from longitudinal surveys. The boundaries of the income classes often have to be estimated from the survey as well. An example is the low income measure defined as half of the median income, where income is adjusted for family size. Thus, in this case, estimators of counts of transitions to and from the "low income state" require the estimation of the income medians at the time period of interest.

The income class boundaries usually refer to the respective cross-sectional populations and have to be estimated from the cross-sectional samples to obtain unbiased estimators. If the change in population from one wave to the other (that is, the number of "births" and "deaths") is negligible, a longitudinal sample may represent the respective populations at both time points, and we may estimate the income class boundaries from the longitudinal sample. Otherwise, estimation of income class boundaries from the longitudinal sample may yield biased estimates. By "deaths" we mean real deaths and/or emigration; similarly, "births" means real births and/or immigration.
Two design-based approaches are considered for estimation of longitudinal parameters involving two waves. One approach is based on the cross-sectional samples for the estimation of the class boundaries at each time period and on the longitudinal sample for the estimation of counts of units in the longitudinal population (longitudinal counts). This results in an estimator that we term the mixed estimator. The other approach uses an estimator based on the longitudinal sample for both the class boundaries and the longitudinal counts, and we call it the longitudinal estimator. The main objective of this study is to compare the two approaches in terms of their performance under different attrition adjustment models.

In order to make the comparison we address two related issues: the impact of attrition on the considered estimators and the estimation of their variance. Attrition refers to the type of non-response that occurs from a certain wave on, until the end of the period of observation. The real issue with attrition is that non-respondents accumulate over time, and the longer the study lasts, the greater is the non-response. In some surveys, like SIPP (the Survey of Income and Program Participation), attrition reached 20% by the time of the third wave (Rizzo, Kalton, and Brick 1996). Even if extra care is taken in the development of adjustments to compensate for the missing data, the resulting estimators may still be sensitive to a less-than-perfect model of compensation. We investigate empirically the sensitivity to attrition of the estimators considered.

Variance estimation is also an issue because the parameters of interest are non-linear functions of the observations and are dependent on the income class boundaries. The problem of variance estimation of low income proportions and other measures of income inequality from complex cross-sectional samples was studied by Shao and Rao (1993), Binder and Kovačević (1995) and Kovačević and Yung (1997), among others. In the longitudinal situation, changes in the population over time imply the need to combine different samples and different systems of weights, which complicates variance estimation.

Susana Rubin Bleuer, Business Survey Methods Division, Statistics Canada, Ottawa, Ontario, K1A 0T6, e-mail: rubisus@statcan.ca; Milorad S. Kovačević, Social Survey Methods Division, Statistics Canada, Ottawa, Ontario, K1A 0T6, e-mail: kovamil@statcan.ca.
The ultimate units in a longitudinal sample may belong to different primary sampling units (PSU's) at different waves, some PSU's in the sample at one wave may not belong to the sample at another wave, etc. In this study, we develop an appropriate bootstrap variance estimator for estimators of income dynamics and for the complex design used in the example.

The data used for illustration come from Statistics Canada's Survey of Labour and Income Dynamics (SLID) and Longitudinal Administrative Database (LAD); both are sources with quite accurate longitudinal income data obtained from income tax returns. At the time of the study, SLID had response data from only two waves and the attrition rate was about 10%.

Section 2 outlines the general assumptions for the population under study and for the design. In section 3 we deal with the issue of estimation of the longitudinal low income proportion and the impact of attrition on it by means of a small simulation study performed on an artificial data set created assuming the log-normal distribution. Section 4 deals with bootstrap variance estimation for longitudinal complex surveys. A more extensive simulation of various attrition models, using different adjustment methods and data from a complex design, is described in section 5, and the results are presented and discussed in section 6.
2. LONGITUDINAL POPULATION, SAMPLE, AND WEIGHTING

Let $U_0$ represent the population at time 0 and $U_1$ represent the population at time 1. In this study we only consider parameters involving two periods of time, and therefore the longitudinal population is defined in terms of two waves by $U_L$, where $U_L = U_0 \cap U_1$. "Deaths" and "births" from one time period to the next cause a change in the population. If we denote by $U_d$ the set of individuals who belong to the population $U_0$ at time $t = 0$ and do not belong to the population $U_1$ at $t = 1$ due to "death", and by $U_b$ the set of "births" from time 0 to 1, then the longitudinal population can be expressed as $U_L = U_0 \setminus U_d = U_1 \setminus U_b$.

Similarly, we denote by $s_0$ a representative sample of $U_0$, by $s_1$ a representative sample of $U_1$, and by $s_d$ and $s_b$ the respective subsamples of individuals in $s_0$ who "died" between $t = 0$ and $t = 1$, and of individuals in $s_1$ "born" between $t = 0$ and $t = 1$. Hence, the longitudinal sample, representing $U_L$, is defined by $s_L = s_0 \cap s_1 = s_0 \setminus s_d = s_1 \setminus s_b$. Non-respondents to the initial wave at $t = 0$ exist, but they are relatively few compared to non-respondents in later waves. For the sake of simplicity, assume that $s_0$ is the sample without the initial non-response and with the associated weights already adjusted for it. Attrition from wave 0 to wave 1 will be represented by a subset of individuals in $s_0$ denoted by $s_A$. Hence, the longitudinal sample affected by attrition can be expressed by $s_L^A = s_0 \setminus (s_d \cup s_A)$. Note that for some parameters of interest $s_d$ should remain in the longitudinal sample for weighting purposes (Tambay, Schiopu-Kratina, Mayda, Stukel and Nation 1997). The parameters considered in this paper refer to the longitudinal population $U_L$, and therefore units that "die" from one wave to another are out of scope.

Large scale surveys often employ stratified multistage designs with a large number of strata and relatively few clusters or primary sampling units (PSU's) sampled within each stratum. The selected PSU's are subsampled in one or more stages until the ultimate units are obtained. Here we assume that the number of strata and clusters within strata does not change from one wave to the other.

We assume that the cross-sectional samples $s_t$ consist of $n_h$ PSU's sampled with replacement within stratum $h$ and $m_{hi}$ units sampled within the $i$-th PSU in stratum $h$, for $t = 0, 1$, $h = 1, \dots, H$ and $i = 1, \dots, n_h$. Let $\{w_{hij}\}$, $j = 1, 2, \dots, m_{hi}$, be the set of survey weights corresponding to the cross-sectional sample $s_t$. We assume that the survey weights provide approximately unbiased estimators of population totals, so that $E_p(\sum_{s_t} w_{hij}) \approx N^t$, where $N^t$ is the size of $U_t$ for $t = 0, 1$. Here $E_p$ is the expectation with respect to the design $p(s)$. When the set $s_A$ of attritors is large, the original weights $w_{hij}$ have to be adjusted to account for the missing units, and the adjusted weights $\tilde{w}_{hij}$ should add up, on average, to the size $N_L$ of the longitudinal population:

$$E_p E_m\left(\sum_{s_0 \setminus (s_d \cup s_A)} \tilde{w}_{hij}\right) = N_L.$$

Here the expectation $E_m$ is taken with respect to the model $m$ assumed for the probability of response.

Examples:

1. In the Survey of Labour and Income Dynamics (SLID), every wave has an added component that consists of "cohabitants", i.e., individuals who live in the households of the longitudinal individuals (Lavallee and Hunter 1992). SLID has a stratified two-stage design with approximately $H = 400$ strata at each wave. The number of clusters within stratum $h$ may change if there is growth in it. The number of sampled clusters is usually 2 or 3. When a new panel is selected or an old one is replaced from time $t = 0$ to time $t = 1$, then the number of sample clusters per stratum may vary.

2. The Longitudinal Administrative Database (LAD) of Statistics Canada is a longitudinal sample obtained from administrative data files and is a representative sample of the income-tax-filing population at any year. The LAD is a collection of many panels, since a panel is "born" at each wave (year). Here non-response is approximately 5% of the cross-sectional sample every year. Longitudinal administrative samples, like LAD, do not have attrition, but are subject to wave non-response caused usually by late filing (Rubin Bleuer 1996). The design for LAD is non-stratified and single stage. We use LAD as a base for a simulation of attrition because its data are representative of the Canadian income population at every wave.
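A minimal sketch of such a weight adjustment, assuming response probabilities constant within known response classes (the function and variable names are ours, for illustration only):

```python
import numpy as np

def adjust_weights(w, responded, response_class):
    """Inflate respondents' weights by the inverse of the estimated response
    probability of their class, so that the adjusted weights still sum, on
    average, to the size N_L of the longitudinal population."""
    w = np.asarray(w, dtype=float)
    responded = np.asarray(responded, dtype=bool)
    response_class = np.asarray(response_class)
    w_adj = np.zeros_like(w)
    for c in np.unique(response_class):
        in_c = response_class == c
        p_c = w[in_c & responded].sum() / w[in_c].sum()  # weighted response rate
        w_adj[in_c & responded] = w[in_c & responded] / p_c
    return w_adj  # zero for attritors, inflated for respondents

# e.g., two response classes ("low", "high") with unit design weights:
w_tilde = adjust_weights([1, 1, 1, 1], [True, False, True, True],
                         ["low", "low", "high", "high"])
# -> [2., 0., 1., 1.]; the total weight is preserved in expectation
```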
3. ESTIMATION UNDER ATTRITION

Without loss of generality (wlog), in the following we define and explain the two estimation methods in terms of longitudinal low income proportions. In section 6, results on the impact of attrition are given for other two-wave parameters like gross flows and transition rates. We also assume a negligible amount of "births" and "deaths" in the finite population, relative to the attrition rate. In fact, we assume that $U_0 = U_1$ and that, though the units are the same at both time points, the incomes attached to the units can vary.

Let $y_{hij}^t$ be the value of the characteristic of interest (family income adjusted for family size) for the $j$-th ultimate unit in the $i$-th PSU of stratum $h$, $j = 1, \dots, M_{hi}$, $i = 1, \dots, N_h$, $h = 1, \dots, H$, and $t = 0, 1$. Then the longitudinal proportion of individuals with income less than or equal to $x$ at $t = 0$ and income less than or equal to $y$ at $t = 1$ is given by

$$F(x, y) = \frac{1}{N_L} \sum_{h=1}^{H} \sum_{i=1}^{N_h} \sum_{j=1}^{M_{hi}} I(y_{hij}^0 \le x)\, I(y_{hij}^1 \le y), \tag{3.1}$$

where, since the two populations coincide,

$$N_L = \sum_{h=1}^{H} \sum_{i=1}^{N_h} M_{hi}$$

coincides with the size of the original population $U_0$. $I$ is the indicator function of the incomes smaller than or equal to $x$ and $y$, respectively. $F(x, y)$ is the bivariate distribution function of incomes at times 0 and 1. Let us now denote by $M_0/2$ half the median income at time $t = 0$, and by $M_1/2$ half the median income at $t = 1$. Then the longitudinal low income proportion is defined by

$$\theta = F(M_0/2, M_1/2). \tag{3.2}$$

Under complete response, and $U_0 = U_1$, $\theta$ is the bivariate version of the cross-sectional low income proportion which was studied, among others, by Shao and Rao (1993). Under a framework for the development of asymptotic theory in the design space, and under certain regularity conditions on the design and the income distributions, Shao and Rao proved that the estimator of the cross-sectional low income proportion $\hat{F}_t(\hat{M}_t/2)$ is consistent (as the number of PSU's, $N_{\mathrm{PSU}}$, approaches infinity) for general stratified multistage designs where the PSU's are selected with replacement. The framework assumes: (i) the existence of a sequence of finite populations with either an increasing number $N_{\mathrm{PSU}}$ of PSU's or an increasing number of independent units if the population is not clustered, and (ii) the existence of a corresponding sequence of probability designs with the first stage sample size $n_{\mathrm{PSU}}$ increasing to infinity as $N_{\mathrm{PSU}} \to \infty$. This result is easily extended to the longitudinal situation for the estimator

$$\hat{\theta} = \hat{F}(\hat{M}_0/2, \hat{M}_1/2) \tag{3.3}$$

under the assumptions of no change in the population from $t = 0$ to $t = 1$ and of no attrition.

Let $\hat{M}_t$ denote the estimator of the median income at time $t$ based on the cross-sectional sample $s_t$ and corresponding cross-sectional weights $\{w_{hij}\}$, for $t = 0, 1$:

$$\hat{M}_t = \inf\{y : y \in s_t,\ \hat{F}_t(y) \ge 1/2\}, \quad\text{where}\quad \hat{F}_t(y) = \sum_{s_t} w_{hij}\, I(y_{hij}^t \le y) \Big/ \sum_{s_t} w_{hij},$$

and let $\tilde{M}_t$ denote the estimator of the median income at time $t$ based on the longitudinal sample $s_L$ and the longitudinal weights $\{\tilde{w}_{hij}\}$:

$$\tilde{M}_t = \inf\{y : y \in s_L,\ \tilde{F}_t(y) \ge 1/2\}, \quad\text{where}\quad \tilde{F}_t(y) = \sum_{s_L} \tilde{w}_{hij}\, I(y_{hij}^t \le y) \Big/ \sum_{s_L} \tilde{w}_{hij}.$$

Then, there are two possible ways to estimate the longitudinal parameter (3.2):

$$\hat{\theta}_{\text{mixed}} = \sum_{s_L} \tilde{w}_{hij}\, I(y_{hij}^0 \le \hat{M}_0/2)\, I(y_{hij}^1 \le \tilde{M}_1/2) \Big/ \sum_{s_L} \tilde{w}_{hij} \tag{3.4}$$

and

$$\hat{\theta}_{\text{long}} = \sum_{s_L} \tilde{w}_{hij}\, I(y_{hij}^0 \le \tilde{M}_0/2)\, I(y_{hij}^1 \le \tilde{M}_1/2) \Big/ \sum_{s_L} \tilde{w}_{hij}. \tag{3.5}$$

The first estimator is termed "mixed" because it combines the longitudinal and cross-sectional samples. The second is only based on the longitudinal sample. Note that when there are no "births" or "deaths" from one wave to the next, the median at $t = 1$ can only be estimated from the longitudinal sample, and thus we use $\tilde{M}_1$ in the definition of the mixed estimator.

Under attrition, most of the missing data may correspond to individuals who are different from the rest of the population, and failure to account for this may result in biased estimates. Hence, weights are adjusted to compensate for the missing information according to a model. The estimates will become more sensitive to model misspecification as attrition increases. Thus, estimators that are robust to the choice of the model for non-response adjustments are desirable.
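The two estimators differ only in which first-wave median is plugged into the indicator. A small sketch, assuming simple arrays of incomes and adjusted longitudinal weights (the names are illustrative):

```python
import numpy as np

def weighted_median(y, w):
    """hat(M) = inf{ y : F(y) >= 1/2 }, with F the weighted empirical cdf."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    o = np.argsort(y)
    cdf = np.cumsum(w[o]) / np.sum(w)
    return y[o][np.searchsorted(cdf, 0.5)]

def theta_hat(y0, y1, w_long, m0, m1):
    """Estimator (3.4) or (3.5): feeding the cross-sectional median for m0
    gives the mixed estimator; feeding the longitudinal median gives the
    longitudinal one (m1 is always the longitudinal median)."""
    y0, y1, w = np.asarray(y0), np.asarray(y1), np.asarray(w_long, float)
    flag = (y0 <= m0 / 2.0) & (y1 <= m1 / 2.0)
    return float(np.sum(w * flag) / np.sum(w))
```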
In order to compare estimators (3.4) and (3.5) regarding their robustness against incorrect non-response adjustments, we made a simple simulation study to empirically estimate the expected (with respect to the design and the attrition model) values of $\hat{\theta}_{\text{mixed}}$ and $\hat{\theta}_{\text{long}}$ when the adjustment model was both the correct one and an incorrect one.

As we already pointed out, if there is no change in the population from one wave to the next, and thus no new sample is selected in the second wave to represent the change, the estimation of the median in the second wave can only be based on the longitudinal sample. Thus the only difference between the two estimators lies in the estimation of the low income measure in the first wave. Hence, wlog, we consider for our simulation the parameter $\theta = F(M_0/2, \infty)$. In that case, $\theta$ coincides with the cross-sectional low income proportion, and the estimator of $\theta$ under complete response, $\hat{\theta}_{\text{cross}} = \hat{F}_0(\hat{M}_0/2)$, is consistent (and thus asymptotically unbiased) as $N_{\mathrm{PSU}}$ tends to infinity.

The simulation study, described in detail in Appendix A, consisted in simulating 1,000 samples of size 1,000 from a log-normal income population similar to the Canadian income population. We first selected a simple random sample without replacement (SRSWOR) from a large finite population of incomes and then we simulated attrition in that sample. Here we call a model of attrition missing at random (MAR) if the probability of non-response in the second wave is constant within response classes; and we call a model of attrition missing completely at random (MCAR) if the probability of non-response in the second wave is constant in the whole population. The attrition was simulated following a missing at random model where the non-response was induced in a low income class. The boundary of the low income class was the first quintile of the finite population, known a priori. For every sample, we calculated $\hat{\theta}_{\text{mixed}}$, $\hat{\theta}_{\text{long}}$ and $\hat{\theta}_{\text{cross}}$ with adjustments under both the correct (MAR) and a MCAR attrition model. The arithmetic mean of the estimates approximates the double expectation (with respect to the model and the design) of the first two estimators, and approximates the expectation of $\hat{\theta}_{\text{cross}}$ with respect to the design. This last expectation approximates, in turn, the parameter $\theta$, since $\hat{\theta}_{\text{cross}}$ is asymptotically unbiased as $n_{\mathrm{PSU}} \to \infty$. When the weight adjustments are calculated under the MCAR (incorrect) attrition model, the following relationship is empirically found:

$$\left|E_p E_m(\hat{\theta}_{\text{mixed}}) - \theta\right| > \left|E_p E_m(\hat{\theta}_{\text{long}}) - \theta\right|,$$

where $m$ refers to the simulated attrition model and $p$ refers to the design under SRSWOR. Note that attrition from low income individuals will always bias upwards the estimator of the median, and thus we will always obtain $\hat{\theta}_{\text{mixed}} \le \hat{\theta}_{\text{long}}$. The somewhat surprising result is that the estimator which utilises less information is, on average, nearer the true parameter, meaning that more information, if it is not used well, does not improve the estimator. Similarly, when $\theta_c$ is the proportion of incomes higher than the income category boundary estimated from the sample, and attrition is heavier in the lower income categories, we will always have the inequality

$$\hat{\theta}_{\text{long}} \le \hat{\theta}_{\text{mixed}},$$

and, as with the low income proportion $\theta$, the estimator of $\theta_c$ using less information is, on average, nearer the truth. The description of the simulation and the numerical results are in Appendix A.

The question now is whether the bias caused by model misspecification is larger than the increase in variance caused by the attrition. In sections 5 and 6 we tackle this issue by simulating attrition on data from SLID and LAD, calculating $\hat{\theta}$, given by (3.3), $\hat{\theta}_{\text{mixed}}$ and $\hat{\theta}_{\text{long}}$, and calculating the design variance of the estimators as well.

4. BOOTSTRAP VARIANCE ESTIMATION FOR LONGITUDINAL SAMPLES

In order to compare the two approaches to estimation, we need to study them in terms of variance and bias under different attrition situations. The estimators $\hat{\theta}_{\text{mixed}}$ and $\hat{\theta}_{\text{long}}$ defined in section 3 are nonlinear functions of the observations; in addition, the income data come from complex surveys. The variances of these estimators cannot be expressed in simple terms, and we have to rely on approximate variance estimation techniques. We seek a method that is easy to apply to many different complex parameters and under different designs. We would like to evaluate the two estimation approaches for any parameter, using the same criteria and a consistent method of variance estimation. We concentrate on developing a bootstrap variance estimator that can be applied to a stratified multistage longitudinal design. It is important to emphasize that only the primary sampling units are resampled, not the units within them.

Kovačević and Yung (1997) compared several resampling methods and the Taylor linearization method for variance estimation of cross-sectional estimators of income inequality under a complex survey design. They found, by means of a simulation study, that the best method (in terms of relative bias, coverage properties, stability, robustness against assumptions, etc.) is the Taylor linearization method via the estimating equation approach, and that the next best is the bootstrap method.

In the calculation of the number of individuals in one income class at time 0 and another income class at time 1, the units in the longitudinal sample $s_L$ are involved, and the bootstrap sampling scheme must ensure the selection of units in $s_L$. However, if we were confined only to the resampling of units in $s_L$, we would not allow enough variability for the consistent estimation of the variance of the cross-sectional quantile estimators $\hat{M}_0$ and $\hat{M}_1$. Therefore, the bootstrap sample should contain as well elements from $s_0 \setminus s_L$, $s_d$ and $s_1 \setminus s_L$ at each iteration.
We assume a stratified two-stage design, and we assume that the primary sampling units (PSU's) are exactly the same at both $t = 0$ and $t = 1$, that is, there are no "deaths" or "births" of PSU's from one wave to the next. This is the case for the first and second waves of SLID. The "births" and cohabitants that appear in the second wave live in dwellings with individuals who were selected in the first wave. Every unit $u$ in $s_1$ or $s_L$ is assigned to a PSU that was selected in the first wave:

- if $u \in s_0 \cap s_1$, then we assign $u$ to the PSU corresponding to its original dwelling at time $t = 0$;
- if $u \in s_0 \setminus s_1$, then we assign $u$ to the PSU it belonged to at $t = 0$;
- if $u \in s_1 \setminus s_0$ and $u$ lives with $v \in s_0 \cap s_1$, then we assign $u$ to the PSU of $v$.

In this way we reduce the problem to a cross-sectional situation. Then we perform the following steps. Suppose that the original weights of an individual are $w_{hij}^{(0)}$, $w_{hij}^{(1)}$ and $\tilde{w}_{hij}$.

1. We select a simple random sample with replacement (SRSWR) of PSU's of size $n_h - 1$, independently in each stratum. A union of such samples is denoted by $s_{\text{boot}}$. It contains a subsample $s_{0,\text{boot}}$ of units from $s_0$ that are not in $s_1$, a subsample $s_{1,\text{boot}}$ of units from $s_1$ that are not in $s_0$, and a subsample $s_{L,\text{boot}}$ of units that are in both $s_0$ and $s_1$.

2. Let $m_{hi}$ be the number of times the $hi$-th PSU is selected; the bootstrap modifications of the weights are

$$w_{hij}^{(0)*} = \frac{n_h}{n_h - 1}\, m_{hi}\, w_{hij}^{(0)}, \qquad w_{hij}^{(1)*} = \frac{n_h}{n_h - 1}\, m_{hi}\, w_{hij}^{(1)}, \qquad \tilde{w}_{hij}^{*} = \frac{n_h}{n_h - 1}\, m_{hi}\, \tilde{w}_{hij}.$$

Then the estimate $\hat{\theta}^{*}_{\text{mixed}}$ computed from a bootstrap sample $s_{\text{boot}}$ is

$$\hat{\theta}^{*}_{\text{mixed}} = \sum_{s_{L,\text{boot}}} \tilde{w}_{hij}^{*}\, I(y_{hij}^0 \le \hat{M}_0^{*}/2)\, I(y_{hij}^1 \le \tilde{M}_1^{*}/2) \Big/ \sum_{s_{L,\text{boot}}} \tilde{w}_{hij}^{*},$$

where

$$\hat{M}_t^{*} = \inf\{y \in s_{t,\text{boot}} \cup s_{L,\text{boot}} : \hat{F}_t^{*}(y) \ge 1/2\} \quad\text{and}\quad \hat{F}_t^{*}(y) = \sum w_{hij}^{(t)*}\, I(y_{hij}^t \le y) \Big/ \sum w_{hij}^{(t)*}, \quad t = 0, 1.$$

The estimate $\hat{\theta}^{*}_{\text{long}}$ computed from $s_{L,\text{boot}}$ is

$$\hat{\theta}^{*}_{\text{long}} = \sum_{s_{L,\text{boot}}} \tilde{w}_{hij}^{*}\, I(y_{hij}^0 \le \tilde{M}_0^{*}/2)\, I(y_{hij}^1 \le \tilde{M}_1^{*}/2) \Big/ \sum_{s_{L,\text{boot}}} \tilde{w}_{hij}^{*},$$

where the medians $\tilde{M}_t^{*}$ are estimated from the longitudinal sample $s_{L,\text{boot}}$.

3. Repeat steps 1 and 2 a large number of times, say $B$. A Monte Carlo estimate of the variance is obtained as

$$v_B(\hat{\theta}_{\text{mixed}}) = \frac{1}{B} \sum_{b=1}^{B} \left(\hat{\theta}^{*}_{\text{mixed},b} - \bar{\theta}^{*}_{\text{mixed}}\right)^2, \qquad v_B(\hat{\theta}_{\text{long}}) = \frac{1}{B} \sum_{b=1}^{B} \left(\hat{\theta}^{*}_{\text{long},b} - \bar{\theta}^{*}_{\text{long}}\right)^2,$$

where $\bar{\theta}^{*}_{\text{mixed}} = B^{-1} \sum_b \hat{\theta}^{*}_{\text{mixed},b}$ and $\bar{\theta}^{*}_{\text{long}} = B^{-1} \sum_b \hat{\theta}^{*}_{\text{long},b}$.

By resampling the original PSU's, we reduce the problem of variance estimation in a longitudinal survey to that of a cross-sectional framework. This is an extension of the bootstrap variance estimator developed by Rao and Wu (1988), and later Kovačević and Yung (1997), for variance estimation of cross-sectional income inequality measures from a stratified multistage sample survey.

In order to accommodate attrition, we look at the original data set as a set of longitudinal records. Then attrition can be viewed as item non-response, and accordingly the weight adjustment for attrition can be considered as ratio imputation (Hajek-ratio). The weights $\tilde{w}_{hij}$ and $w_{hij}^{(1)}$ are obtained from the original weights by multiplying by an attrition adjustment factor. The adjustment factors are the inverses of the response probabilities (assumed different for each response class). These probabilities are estimated from the original data set. The process of estimation of the response probabilities and subsequent adjustment is imitated in the bootstrap resampling: for each bootstrap sample $s_{\text{boot}}$ the adjustments are recalculated, to produce new $\tilde{w}_{hij}^{*}$ and $w_{hij}^{(1)*}$.

Indeed, recalling that $s_L = s_L^A \cup s_A$, let us denote by $y_{hij}^{(1)*}$ ($hij \in s_A$) the ratio-imputed value of wave one information, based on the observed data in the longitudinal sample; and let us denote by $y_{hij}^{(0)*}$ ($hij \in s_A$) the ratio-imputed value of wave zero information based on the longitudinal sample ($s_L^A$). Note that the values $y_{hij}^{(0)}$ ($hij \in s_A$) are not missing, but we need $y_{hij}^{(0)*}$ to represent the weight adjustment in the longitudinal sample.
The estimation of $\theta$ with weight adjustments for attrition is equivalent to estimation on a "complete" imputed data set weighted by $\tilde{w}_{hij}$ ($hij \in s_L$). The set

$$\mathcal{D}_{\text{mixed}} = \left\{(y_{hij}^{0}, y_{hij}^{1}),\ hij \in s_L^A;\ (y_{hij}^{0}, y_{hij}^{(1)*}),\ hij \in s_A\right\}$$

is used to calculate $\hat{\theta}_{\text{mixed}}$, and the set

$$\mathcal{D}_{\text{long}} = \left\{(y_{hij}^{0}, y_{hij}^{1}),\ hij \in s_L^A;\ (y_{hij}^{(0)*}, y_{hij}^{(1)*}),\ hij \in s_A\right\}$$

is used to calculate $\hat{\theta}_{\text{long}}$.

We noted above that adjustment due to attrition is equivalent to ratio imputation for item non-response. Hence the variance estimator proposed here has the same properties as the cross-sectional variance estimators for imputed survey data: consistency now follows from the consistency of the bootstrap variance for imputed survey data (Shao and Sitter 1996), and good coverage properties and small relative bias hold, as documented by Kovačević and Yung (1997). In the case of a small number of PSU's per stratum (SLID has two or three PSU's per stratum), Kovar, Rao and Wu (1988) showed empirically that the bootstrap variance estimate overestimates the true variance by no more than 10%.
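A skeletal version of the resampling loop may clarify the scheme. It assumes the PSU-level data are already grouped by stratum, and the `estimator` argument would recompute the medians (and, in a fuller treatment, the attrition adjustments) from the re-weighted data; this is a sketch of the $n_h - 1$ rescaling scheme, not production code:

```python
import numpy as np

rng = np.random.default_rng(1999)

def bootstrap_variance(strata, estimator, B=500):
    """Rao-Wu (n_h - 1) rescaling bootstrap for a stratified design.

    strata    : dict mapping stratum label -> list of PSU records (each PSU
                record carries the y0, y1 and weights of its ultimate units)
    estimator : function recomputing theta from [(psu, multiplier), ...]
    """
    thetas = np.empty(B)
    for b in range(B):
        resample = []
        for psus in strata.values():
            n_h = len(psus)                              # needs n_h >= 2
            picks = rng.integers(0, n_h, size=n_h - 1)   # SRSWR of n_h - 1 PSU's
            m = np.bincount(picks, minlength=n_h)        # m_hi, times each drawn
            resample += [(psu, n_h / (n_h - 1) * m_i)    # weight multiplier
                         for psu, m_i in zip(psus, m) if m_i > 0]
        thetas[b] = estimator(resample)
    return float(np.var(thetas))  # v_B = (1/B) * sum_b (theta*_b - mean)^2
```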
5. EMPIRICAL STUDY

In order to compare the two approaches to estimation, we now consider two real longitudinal surveys, SLID and LAD, and simulate different attrition situations. The study begins with a sample of complete respondents in two waves. The response pattern of the individuals in this sample can be presented as

                                                     t=0    t=1
  Responded in both waves                             X      X
  Responded in the first wave, not in the second      X      0

Here X means that the response is available for the individual at the corresponding wave, and 0 means the opposite. We omitted "births" and "deaths" from both SLID and LAD.

The respondents in the first wave are divided into two response classes, the low income class and else. The boundary is given by $\hat{M}_0/2$, where $\hat{M}_0$ is the estimate of the median income at time $t = 0$, based on all respondents in the first wave. The size of the SLID sample (Ontario) is approximately 10,000, from which 2,000 were low income individuals.

We simulate three different attrition situations. In all of them we select a subsample of the complete respondents with pattern XX and convert them to the pattern X0.

1) 10% attrition. We select a subsample at random of individuals with low income $\{y_{hij}^0 \le \hat{M}_0/2\}$ at time $t = 0$, and make them non-respondents. In order to have 10% of the overall sample missing at time $t = 1$, we convert 50% of low income individuals into non-respondents in the second wave.

2) 20% attrition. We select 70% of individuals from $\{y_{hij}^0 \le \hat{M}_0/2\}$ and 7.5% from $\{y_{hij}^0 > \hat{M}_0/2\}$ at random and convert them into non-respondents at the second wave. This results in 20% overall attrition for SLID.

3) 30% attrition. We select 80% of individuals from $\{y_{hij}^0 \le \hat{M}_0/2\}$ and 17.5% from $\{y_{hij}^0 > \hat{M}_0/2\}$ at random and convert them into non-respondents at the second wave. This results in 30% attrition for SLID.

We then consider two different adjustment models for each situation:

Model 1: The non-respondents are missing completely at random (MCAR). This is the worst possible model that we might use, given that attrition is usually experienced by the group of low income individuals.

Model 2: The non-respondents are missing at random from the low-income class. We allow for a small increase of the upper boundary for the low-income class, so that the response classes are defined as $\{y_{hij}^0 \le \hat{M}_0/2 + \hat{M}_0/10\}$ and $\{y_{hij}^0 > \hat{M}_0/2 + \hat{M}_0/10\}$. We regard this model as one of the best possible models under our setup, since it recognizes the response classes as separated by low income boundaries.

We chose these two models of adjustment because they represent the two extremes. In practice, we may only be able to choose a model between these two.

Let $l_t$ denote the low income measure defined as half the median: $l_t = M_t/2$, $t = 0, 1$. Several longitudinal parameters were studied. We define some of them in Table 5.1. The values of the estimates are presented in Tables B.1 to B.3. The standard errors were obtained using the bootstrap method described in section 4, assuming that the corresponding adjusted weights are known a priori and do not change for each bootstrap sample.

Table 5.1
Some Longitudinal Parameters Evaluated in the Empirical Study

Proportions
  p(y_0 ≤ l_0, y_1 ≤ l_1)    The proportion of individuals with income (adjusted family income) below l_t at t = 0, 1.
  p(y_0 ≤ l_0, y_1 > l_1')   The proportion of individuals with income below l_0 at t = 0 and above l_1' = (11/10) l_1 at t = 1; the factor 11/10 is used to be able to detect true transitions from one state to the other.
  p(y_0 > l_0', y_1 ≤ l_1)   The proportion of individuals with income above l_0' = (11/10) l_0 at t = 0 and below l_1 at t = 1.
  p(y_0 > l_0', y_1 > l_1')  The proportion of individuals with income above l_0' and l_1' at times 0 and 1, respectively.

Conditional Rates
  p(y_1 ≤ l_1 | y_0 ≤ l_0)   The probability of an individual having low income at the second wave (second year), given that he or she had low income in the first wave.
  p(y_1 > l_1' | y_0 ≤ l_0)  The probability of not having low income at the second wave, given that the individual had low income at the first wave.
6. RESULTS AND DISCUSSION

The empirical study shows that attrition does affect estimates adversely, but different outcomes result depending on whether the parameter of interest is cross-sectional or longitudinal. In the estimation of later-wave cross-sectional parameters, estimators based on the actual longitudinal sample are more biased than estimators based on the cross-sectional sample, whether the model of adjustment is sound or not (see, for example, the estimates of the median at t = 0 in Tables B.1-B.3). However, in the estimation of longitudinal parameters, longitudinal estimators (based entirely on the longitudinal sample) are less biased than mixed estimators (based on the three samples).

Tables 6.1a and 6.1b present gross flows estimated from SLID and LAD data, respectively. The estimates are calculated with the complete data set and after 20% non-response is simulated in the second wave. For the complete data set the longitudinal and the mixed estimators coincide (this is the no-attrition situation). As explained in the previous section, 20% of non-response was simulated by eliminating 70% of the responses from individuals who were low-income and 7.5% of individuals with income higher than l_0 in the first wave. The adjustment for non-response was done assuming that the individuals were missing completely at random.

The applied adjustment model means that the original survey weights were adjusted with a factor of 1.25 (representing 20% attrition) across the sample, whereas the correct adjustment should have been with a factor of 3.33 (representing 70% attrition) in the domain of individuals who were low income in the first wave (1993), and with a factor of 1.08 (representing 7.5% attrition) in the domain of individuals who had an income higher than the LIM in 1993. Thus, by adjusting with an incorrect model, we incur a much greater error in the estimation of one domain (y_0 ≤ M_0/2) than in the estimation of the other (y_0 > M_0/2).

We see from these tables that both the mixed and longitudinal estimates seriously underestimate the parameter of interest in the first column and overestimate it in the second column, assuming that estimation based on the complete data set results in reasonable and acceptably good estimates. It is obvious that the mixed estimates are more affected by the wrong adjustment, verifying the inequality stated in section 3.

Table 6.1
Gross Flows Estimated From SLID and LAD, 20% Attrition (70% of the Low Income Missing)

a. SLID, Ontario
                                          1993
  1994                          y_0 ≤ M_0/2    y_0 > M_0/2
  y_1 ≤ M_1/2     No attrition    1,602,000        113,000
                  Mixed             425,000        152,000
                  Longitudinal      710,000        125,700
  y_1 > 1.1 M_1/2 No attrition       70,000      8,080,000
                  Mixed              15,000      8,975,000
                  Longitudinal       30,000      8,870,000

b. LAD, Sub-Area from Toronto
                                          1991
  1992                          y_0 ≤ M_0/2    y_0 > 1.1 M_0/2
  y_1 ≤ M_1/2     No attrition        2,700            640
                  Mixed               1,100            800
                  Longitudinal        1,500            740
  y_1 > 1.1 M_1/2 No attrition          580         10,420
                  Mixed                 190         12,150
                  Longitudinal          380         11,650

Tables B.1 to B.3, given in Appendix B, show the results for SLID at three different attrition levels: 10%, 20% and 30%. For each parameter, the estimates and their corresponding bootstrap standard errors were calculated using both the longitudinal and mixed estimators. First, we calculated them for the ideal longitudinal "no attrition" sample, and then for the reduced sample adjusted under the two non-response models described in section 5. We provide the estimate of the model bias as the difference Δ between the estimate obtained under the model and the "no attrition" estimate.

The numbers in Tables B.1 to B.3 repeat the same pattern that was shown for gross flows in Tables 6.1a and 6.1b, i.e., the estimates obtained using the longitudinal estimators are "less sensitive" to the choice of the non-response adjustment model. At the same time, there is almost no difference between the corresponding standard errors (given in parentheses). Overall, we found that the estimates of the standard errors of $\hat{\theta}_{\text{mixed}}$ are slightly smaller than those for $\hat{\theta}_{\text{long}}$, and that this negligible difference in favour of $\hat{\theta}_{\text{mixed}}$ is not enough to compensate for the larger bias affecting $\hat{\theta}_{\text{mixed}}$, induced by the wrong adjustment for non-response.

There is no difference between $\hat{\theta}_{\text{mixed}}$ and $\hat{\theta}_{\text{long}}$ when the second (best) adjustment for non-response is used. This is true for most parameters except for the conditional rates: for example, in Table B.1 (10% attrition) the empirical bias of the mixed estimate of the conditional rate of remaining low income in 1994 was found statistically significant, whereas the empirical biases of the longitudinal estimates of the two conditional rates were found non-significant (see Appendix C). Of course, the "perfect" model for adjustment (not shown here) yields exactly the same numbers from the two estimation methods, and $\hat{\theta}_{\text{mixed}}$ is approximately equal to $\hat{\theta}_{\text{long}}$ for any parameter considered, but their variances differ.

We introduce a single "sensitivity measure" that combines information on sampling standard error and model bias caused by the applied attrition adjustment model:

$$s(\hat{\theta}^{A}_{\text{mixed}}) = \frac{(\hat{\theta}^{A}_{\text{mixed}} - \hat{\theta}_0)^2 + \text{s.e.}^2(\hat{\theta}^{A}_{\text{mixed}})}{\text{s.e.}^2(\hat{\theta}_0)}. \tag{6.1}$$

Here, $\hat{\theta}_0$ and $\hat{\theta}^{A}_{\text{mixed}}$ ($A$ = MCAR or MAR) denote estimates obtained under "no attrition" and under an attrition adjustment model, respectively, and s.e.(·) stands for the standard error due to sampling. Similarly, we define $s(\hat{\theta}^{A}_{\text{long}})$. If an attrition adjustment does not change by much the value of an estimator and its standard error (compared with the estimate obtained using another attrition adjustment), we say that the estimator is relatively insensitive to the applied attrition model. The ratios of the sensitivity measures of the two applied adjustment models are defined by

$$\text{ratio}_{\text{mixed}} = s(\hat{\theta}^{\text{MCAR}}_{\text{mixed}}) \big/ s(\hat{\theta}^{\text{MAR}}_{\text{mixed}}) \quad\text{and}\quad \text{ratio}_{\text{long}} = s(\hat{\theta}^{\text{MCAR}}_{\text{long}}) \big/ s(\hat{\theta}^{\text{MAR}}_{\text{long}}). \tag{6.2}$$

Values of the ratios for different attrition scenarios are presented in Charts B.1-B.3 in Appendix B.
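For concreteness, the sensitivity measure (6.1) and ratio (6.2) are straightforward to compute; the sketch below plugs in the mixed estimates of p(y_0 ≤ l_0, y_1 ≤ l_1) from Table B.1 (10% attrition):

```python
def sensitivity(theta_adj, se_adj, theta_0, se_0):
    """Sensitivity measure (6.1)."""
    return ((theta_adj - theta_0) ** 2 + se_adj ** 2) / se_0 ** 2

# Mixed estimator of p(y_0 <= l_0, y_1 <= l_1), Table B.1 values:
s_mcar = sensitivity(0.092, 0.008, 0.156, 0.010)  # approx. 41.6
s_mar = sensitivity(0.124, 0.009, 0.156, 0.010)   # approx. 11.1
ratio_mixed = s_mcar / s_mar                      # approx. 3.8
```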
From Charts B.1-B.3 it is evident that the ratio of the sensitivity measures (6.2) is systematically lower for the longitudinal estimator. This further means that the sensitivity measures of the longitudinal estimator under the applied adjustment models are more alike than those of the mixed estimator. The longitudinal estimator seems to be more insensitive to the applied adjustment models. We refer to this as "robustness". Regarding the simulated attrition rates, the charts show that as the attrition rate increases the sensitivity measures ratio approaches 1. This means that both models of adjustment perform alike for higher attrition rates, and then the choice of the estimators becomes more important than the model used in the adjustments.

We summarize our findings as follows:

1) In the estimation of later wave cross-sectional parameters, estimators based on the actual longitudinal sample are more biased than estimators based on the cross-sectional sample.

2) In the estimation of longitudinal parameters, both the mixed and longitudinal estimators are considerably biased if the wrong model of attrition adjustment is used.

3) The longitudinal estimator is more robust against an inappropriate adjustment for attrition than the mixed estimator. Under the perfect adjustment model, these two estimators perform alike.

4) In general, the sampling variance of the mixed estimator is smaller than the variance of the longitudinal estimator. This relationship remains steady over different attrition rates and different adjustment models.

5) For the mixed estimators, the magnitude of the bias coming from inappropriate non-response adjustments overshadows the small gain in precision when compared to the longitudinal estimates.

6) Different models of adjustment perform alike for higher attrition rates. In this case the choice of the estimator is more important than the efforts at model improvement.

ACKNOWLEDGEMENTS

The authors would like to thank the associate editor and the referee for their helpful comments and suggestions.

APPENDIX A

Description of Simulation for Section 3
The simulation consisted of the following steps.

1. Let $X$ be a log-normal random variable, $X \sim \exp\{N(\mu = 10.3, \sigma^2 = 0.64)\}$. These parameters correspond to a median similar to that of the SLID estimate for 1992, and a spread similar to that of the Canadian population. The low income boundary was set to the first quintile of the income population. The first quintile $q_1$ was estimated from a simulated sample of size 50,000. The value obtained was $q_1 = 14{,}901$.

2. From this infinite population, 1,000 independent random samples of size 1,000 were selected.

3. To simulate attrition, from each sample, 50% of the units with income below $q_1$ were selected at random and dropped from the sample for the calculations pertinent to the second wave. Thus 10% attrition from the low income class was simulated (MAR model).

4. For each sample $i$, the low income proportion estimators $\hat{\theta}_{\text{mixed}}(i)$ and $\hat{\theta}_{\text{long}}(i)$ were estimated with weights adjusted under both the correct attrition model (MAR) and under the incorrect assumption of units missing completely at random (MCAR).

5. Also, for each sample $i$, we calculated the cross-sectional estimator under complete response, $\hat{\theta}_{\text{cross}}(i)$, which is entirely based on the sample before attrition. For each type of adjustment, the expectation with respect to the design and the attrition model of $\hat{\theta}_{\text{mixed}}$, $\hat{\theta}_{\text{long}}$ and $\hat{\theta}_{\text{cross}}$ was estimated by their respective arithmetic means over 1,000 samples with the incorrect adjustment, and over 539 samples for the correct adjustment.

Result of the Simulation

The next two tables show that under the incorrect specification the longitudinal estimator has less bias than the mixed estimator, and that under the correct specification for the adjustment both estimators are unbiased (with respect to the model and the design).

The arithmetic mean of $\hat{\theta}_{\text{cross}}$ over the 1,000 different values was 0.193, and the standard deviation of the 1,000 values was 0.012. This implies that for a SRSWOR sample of size 1,000 the estimate $\hat{\theta}_{\text{cross}}$ is quite stable and its expectation can be used as a surrogate for $\theta$. The expected values of $\hat{\theta}_{\text{mixed}}$ and $\hat{\theta}_{\text{long}}$ are estimated by 0.109 and 0.145, with standard deviations of 0.011 and 0.013, respectively.

Expected Values Under Mis-Specification of the Adjustment Model

  Estimator   Expected value   Standard deviation   Number of samples
  mixed           0.109              0.011                1,000
  long            0.145              0.013                1,000
  cross           0.193              0.012                1,000

Expected Values Under the Correct Adjustment of the Attrition Model

  Estimator   Expected value   Standard deviation   Number of samples
  mixed           0.193              0.016                  539
  long            0.193              0.014                  539
  cross           0.194              0.012                  539

We see from the table above that both estimators approximate the true value of the parameter if the adjustment model is correct.
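The simulation is easy to replicate in outline. The sketch below (ours, not the authors' code) draws incomes directly from the log-normal law rather than from a large finite population, induces the MAR attrition of step 3, and computes the estimators under either adjustment; averaging the returned values over repeated samples approximates the expectations reported above:

```python
import numpy as np

rng = np.random.default_rng(2024)
Q1 = 14_901.0  # first quintile of the income population (low income boundary)

def weighted_median(y, w):
    o = np.argsort(y)
    cdf = np.cumsum(w[o]) / np.sum(w)
    return y[o][np.searchsorted(cdf, 0.5)]

def one_sample(adjustment="MAR", n=1_000):
    y = rng.lognormal(mean=10.3, sigma=np.sqrt(0.64), size=n)  # step 1
    low = y <= Q1
    respond = ~(low & (rng.random(n) < 0.5))                   # step 3: MAR attrition
    if adjustment == "MAR":   # correct model: class-specific factors
        w = np.where(low, 1.0 / respond[low].mean(), 1.0 / respond[~low].mean())
    else:                     # MCAR: a single factor for the whole sample
        w = np.full(n, 1.0 / respond.mean())
    w = np.where(respond, w, 0.0)  # adjusted longitudinal weights
    theta_mixed = np.sum(w * (y <= np.median(y) / 2.0)) / np.sum(w)
    theta_long = np.sum(w * (y <= weighted_median(y[respond], w[respond]) / 2.0)) / np.sum(w)
    theta_cross = np.mean(y <= np.median(y) / 2.0)  # complete-response estimator
    return theta_mixed, theta_long, theta_cross
```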
APPENDIX B
Table B.1
Estimates of Different Population Characteristics and Their Standard Errors Obtained From the Complete Data Set and Under 10% Attrition in the Second Wave

                                        No Attrition         MCAR                       MAR
Parameter              Estimator    Estimate (s.e.)     Estimate (s.e.)     Δ       Estimate (s.e.)     Δ

Quantiles
M_0 (median, t=0)          M        29,300 (1,000)     29,300 (1,000)       0       29,300 (1,100)      0
                           L        29,300 (1,000)     30,900 (1,000)    -1,600     29,300 (1,200)      0
M_1 (median, t=1)          M        28,600 (1,100)     30,400   (900)    -1,800     28,600 (1,100)      0
                           L        28,600 (1,000)     30,400 (1,000)    -1,800     28,600 (1,200)      0

Proportions
p(y_0 ≤ l_0, y_1 ≤ l_1)    M         0.156 (0.010)      0.092 (0.008)     0.064      0.124 (0.009)    0.032
                           L         0.156 (0.010)      0.111 (0.008)     0.045      0.124 (0.009)    0.032
p(y_0 ≤ l_0, y_1 > l_1')   M         0.007 (0.002)      0.003 (0.001)     0.004      0.005 (0.002)    0.002
                           L         0.007 (0.002)      0.004 (0.001)     0.003      0.005 (0.002)    0.002
p(y_0 > l_0', y_1 ≤ l_1)   M         0.011 (0.002)      0.013 (0.003)    -0.002      0.012 (0.002)    0.001
                           L         0.011 (0.002)      0.011 (0.003)     0.000      0.012 (0.002)   -0.001
p(y_0 > l_0', y_1 > l_1')  M         0.790 (0.040)      0.840 (0.040)    -0.050      0.804 (0.042)   -0.005
                           L         0.790 (0.040)      0.831 (0.042)    -0.041      0.804 (0.043)   -0.005

Conditional Rates
p(y_1 ≤ l_1 | y_0 ≤ l_0)   M         0.923 (0.023)      0.546 (0.025)     0.377      0.734 (0.033)    0.189
                           L         0.923 (0.023)      0.926 (0.030)    -0.003      0.921 (0.031)    0.002
p(y_1 > l_1' | y_0 ≤ l_0)  M         0.040 (0.010)      0.018 (0.006)     0.022      0.030 (0.009)    0.010
                           L         0.040 (0.010)      0.036 (0.010)     0.004      0.038 (0.012)    0.002

l_t' = (11/10) l_t; l_t' is used to identify true transitions from one wave to the next.
M denotes Mixed ($\hat{\theta}_{\text{mixed}}$) and L denotes Longitudinal ($\hat{\theta}_{\text{long}}$) estimates.
s.e. denotes the standard error of the estimate, and Δ is the difference between the corresponding estimates obtained using the attrition adjustment model and assuming no attrition.
Table B.2
Estimates of Different Population Characteristics and Their Standard Errors Obtained From the Complete Data Set and Under 20% Attrition in the Second Wave

                                        No Attrition         MCAR                       MAR
Parameter              Estimator    Estimate (s.e.)     Estimate (s.e.)     Δ       Estimate (s.e.)     Δ

Quantiles
M_0 (median, t=0)          M        29,300 (1,000)     29,300 (1,000)       0       29,300 (1,000)      0
                           L        29,300 (1,000)     31,800 (1,100)    -2,500     29,300 (1,000)      0
M_1 (median, t=1)          M        28,600 (1,100)     31,100   (800)    -2,500     28,600 (1,000)      0
                           L        28,600 (1,000)     31,100   (900)    -2,500     28,600 (1,000)      0

Proportions
p(y_0 ≤ l_0, y_1 ≤ l_1)    M         0.156 (0.010)      0.055 (0.008)     0.101      0.096 (0.012)    0.060
                           L         0.156 (0.010)      0.080 (0.006)     0.076      0.096 (0.013)    0.060
p(y_0 ≤ l_0, y_1 > l_1')   M         0.007 (0.002)      0.001 (0.001)     0.006      0.004 (0.002)    0.003
                           L         0.007 (0.002)      0.003 (0.001)     0.004      0.004 (0.002)    0.003
p(y_0 > l_0', y_1 ≤ l_1)   M         0.011 (0.002)      0.014 (0.003)    -0.003      0.012 (0.002)   -0.001
                           L         0.011 (0.002)      0.012 (0.002)    -0.002      0.012 (0.002)   -0.001
p(y_0 > l_0', y_1 > l_1')  M         0.790 (0.040)      0.864 (0.041)    -0.074      0.820 (0.049)   -0.030
                           L         0.790 (0.040)      0.855 (0.042)    -0.065      0.820 (0.048)   -0.030

Conditional Rates
p(y_1 ≤ l_1 | y_0 ≤ l_0)   M         0.923 (0.023)      0.323 (0.051)     0.600      0.570 (0.057)    0.353
                           L         0.923 (0.023)      0.914 (0.040)     0.009      0.928 (0.058)   -0.005
p(y_1 > l_1' | y_0 ≤ l_0)  M         0.040 (0.010)      0.007 (0.005)     0.033      0.026 (0.010)    0.014
                           L         0.040 (0.010)      0.030 (0.012)     0.010      0.042 (0.014)   -0.002
Table B.3
Estimates of Different Population Characteristics and Their Standard Errors Obtained From the Complete Data Set and Under 30% Attrition in the Second Wave

                                        No Attrition         MCAR                       MAR
Parameter              Estimator    Estimate (s.e.)     Estimate (s.e.)     Δ       Estimate (s.e.)     Δ

Quantiles
M_0 (median, t=0)          M        29,300 (1,000)     29,300 (1,000)       0       29,300 (1,000)      0
                           L        29,300 (1,000)     32,000   (900)    -2,700     29,300 (1,000)      0
M_1 (median, t=1)          M        28,600 (1,100)     31,200   (800)    -2,600     28,600 (1,000)      0
                           L        28,600 (1,000)     31,300   (900)    -2,700     28,600 (1,100)      0

Proportions
p(y_0 ≤ l_0, y_1 ≤ l_1)    M         0.156 (0.01)       0.04  (0.011)     0.116      0.080 (0.014)    0.076
                           L         0.156 (0.01)       0.07  (0.014)     0.086      0.080 (0.016)    0.076
p(y_0 ≤ l_0, y_1 > l_1')   M         0.007 (0.002)      0.001 (0.001)     0.006      0.004 (0.002)    0.003
                           L         0.007 (0.002)      0.003 (0.001)     0.004      0.004 (0.002)    0.003
p(y_0 > l_0', y_1 ≤ l_1)   M         0.011 (0.002)      0.015 (0.003)    -0.005      0.013 (0.002)   -0.002
                           L         0.011 (0.002)      0.012 (0.003)    -0.001      0.013 (0.002)   -0.002
p(y_0 > l_0', y_1 > l_1')  M         0.790 (0.04)       0.874 (0.037)    -0.084      0.828 (0.045)   -0.039
                           L         0.790 (0.04)       0.864 (0.038)    -0.074      0.829 (0.045)   -0.038

Conditional Rates
p(y_1 ≤ l_1 | y_0 ≤ l_0)   M         0.923 (0.023)      0.245 (0.049)     0.678      0.465 (0.071)    0.458
                           L         0.923 (0.023)      0.885 (0.051)     0.038      0.930 (0.099)   -0.007
p(y_1 > l_1' | y_0 ≤ l_0)  M         0.040 (0.010)      0.008 (0.006)     0.032      0.022 (0.012)    0.018
                           L         0.040 (0.010)      0.037 (0.016)     0.003      0.044 (0.017)   -0.004
Charts B
Sensitivity Measures: Ratio of the Mixed and Longitudinal Estimators
[Three panels of bar charts, one per attrition rate in the second wave (1: 10%; 2: 20%; 3: 30%). Each panel plots, for the parameters p(y(0)≤L0, y(1)≤L1), p(y(0)≤L0, y(1)>L1), p(y(0)>L0, y(1)≤L1), p(y(0)>L0, y(1)>L1), p(y(1)≤L1 | y(0)≤L0) and p(y(1)>L1 | y(0)≤L0), the MCAR/MAR ratio for the Mixed estimator and for the Longitudinal estimator.]
APPENDIX C
We found the empirical bias of the mixed estimates (under a MAR adjustment model) of the conditional rate of remaining in low income in 1994, given that the individual was in low income in 1993, statistically significant when performing a conservative test of the form

$$(\hat{\theta}_0 - \hat{\theta}_{\text{mixed}}) \big/ \sqrt{\operatorname{var}(\hat{\theta}_{\text{mixed}}) + \operatorname{var}(\hat{\theta}_0)}.$$

We found the empirical bias of the mixed estimates of the conditional rate of having an income higher than the LIM in 1994, given that the individual was in low income in 1993 (under a MAR adjustment model), non-significant when performing a "radical" test of the form

$$(\hat{\theta}_0 - \hat{\theta}_{\text{mixed}}) \big/ \text{s.e.}(\hat{\theta}_{\text{mixed}}).$$

Similarly, we found the empirical bias of the longitudinal estimates (under a MAR adjustment model) of both conditional rates non-significant when performing the same type of "radical" test as above, i.e., when assuming that the estimate under no attrition is non-stochastic. These results hold for 10%, 20% and 30% attrition rates.
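The distinction between the two tests can be made concrete in a few lines of code. This is an illustrative sketch only; the function names and the numbers are hypothetical, not taken from the simulation results.

```python
import math

def conservative_z(theta_0, theta_mixed, var_0, var_mixed):
    # Both estimates treated as stochastic: larger denominator, fewer rejections.
    return (theta_0 - theta_mixed) / math.sqrt(var_mixed + var_0)

def radical_z(theta_0, theta_mixed, se_mixed):
    # No-attrition estimate treated as non-stochastic (the "radical" test).
    return (theta_0 - theta_mixed) / se_mixed

# Illustrative values only: a no-attrition rate of 0.92 (s.e. 0.02) versus a
# mixed, MAR-adjusted rate of 0.85 (s.e. 0.03).
print(conservative_z(0.92, 0.85, 0.02**2, 0.03**2))  # approx. 1.94
print(radical_z(0.92, 0.85, 0.03))                   # approx. 2.33
```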
Survey Methodology, June 1999
Vol. 25, No. 1, pp. 99-103
Statistics Canada
Utilising Longitudinally Linked Data from
the British Labour Force Survey
PAM F. TATE

Pam F. Tate, Office for National Statistics, Room RG/11, 1 Drummond Gate, London SW1V 2QQ, United Kingdom.
ABSTRACT
The British Labour Force Survey (LFS) uses a rotating sample design, with each sample household retained for five
consecutive quarters. Linking together the information on the same persons across quarters produces a potentially very rich
source of longitudinal data. There are however serious risks of distortion in the results from such longitudinal linking,
mainly arising from sample attrition, and from response errors, which can produce spurious flows between economic activity
states. This paper describes the initial results of investigations by the Office for National Statistics (ONS) into the nature
and extent of the problems.
KEY WORDS: Longitudinal data; Labour Force Survey; Economic activity; Attrition bias; Response error.
1. INTRODUCTION
The British Labour Force Survey (LFS) is a household
survey, gathering information on a wide range of labour
force characteristics and related topics. Since 1992 it has
been conducted on a quarterly basis, with each sample
household retained for five consecutive quarters, and a fifth
of the sample replaced each quarter.
The survey is designed to produce cross-sectional data,
but in recent years it has been recognised that linking
together data on each individual across quarters could
produce a rich source of longitudinal data, the uses of
which include estimation of labour force gross flows.
The process of linking information on the same individual from different quarters in the LFS is relatively straightforward. However, there are methodological problems which pose serious risks of distortion in the results from this new, hitherto untested use of LFS data. Similar problems have been identified in other countries' labour force surveys, but there are as yet no generally accepted methods of dealing with them. The Office for National Statistics (ONS) has therefore undertaken a programme of work to address this issue.
This paper describes the results so far of investigations into the nature and extent of the problems, and the proposed methods of dealing with them. The issues fall into two main groups: biases arising from sample attrition and related factors; and biases arising from response errors, particularly their effects in producing spurious flows between economic activity states. These are considered in turn.
2. SAMPLE ATTRITION AND ITS BIASING
EFFECTS
Some sample members are lost at the initial stage, because of nonresponse at the first interview, either because it has not been possible for them to be contacted during the narrow time window available, or because they have refused to be interviewed. After that, further sample members are lost from each successive quarterly interview round, either because they have moved house (the basic sampling unit for this survey being the dwelling), or because it proves impossible to contact them or they refuse to continue. All these groups of people are, in different ways, atypical of the population as a whole, so their loss from the sample can introduce biases.
Some of these biases are compensated for in the course of applying the normal LFS weighting procedure, which produces population level estimates which are consistent with census-based control totals by sex, age group and region. This process will compensate for biases arising, at all stages of the survey, from differential attrition by sex, age and region. However, biases in other characteristics which are not themselves used in the weighting procedure will not be compensated for (and may even be increased) in that process, except when they are related to age, sex or region, in such a way that the bias is caused entirely by the under- or over-representation of particular age, sex or region categories.
Work on this subject therefore looked first at what
characteristics are more or less represented in the LFS
sample than the whole population, and in different waves of
the LFS sample. (Each quarter, the sample is made up of
five waves, the people in the first wave having their first
interview, those in the second wave their second interview,
and so on.) It then examined whether and to what extent
these characteristics are related to each other, and whether
it is possible to define a set of variables which characterise
those people who are likely to be under-represented.
3. CHARACTERISTICS OF
NON-RESPONDENTS
Analysis of the proportions which could not be linked to the next quarter, by wave, for key demographic and economic variables (Table 1 gives an illustration for broad age/sex groups), showed that, consistently across all waves, there is a greater propensity to be under-represented for young people aged 18 to 29 (and especially 18 to 24), single people, those living in London, people in rented accommodation (especially privately rented), the unemployed, and those in temporary employment. Most of these characteristics have also been found (by Foster (1994), in a study which linked data from the 1991 Census with non-responding LFS sample households) to be associated with high nonresponse at the first interview, particularly young adults, single people, one-person households, and those living in London.
Table 1
Percentage of Unlinked Cases by Sex and Age Group by Wave

                      Unlinked percentage
Age & sex             Wave 1    Wave 4
All persons             8.6       4.8
Male                    8.6       4.9
  15-17                 6.8       4.9
  18-29                13.8      10.2
  30-44                 6.8       4.1
  45-64                 7.2       2.5
Female                  8.7       4.7
  15-17                 5.9       3.0
  18-29                13.9      10.9
  30-44                 6.3       2.8
  45-59                 7.5       2.2
Note: More detailed analyses are available from the author.

Several of the characteristics of those who are lost to the sample appear likely to be related, and this was investigated in the first instance using logistic regression. The variables identified as being independently associated with whether the cases were lost from the sample were found to be largely consistent for the four waves. In each case they included age group, marital status, tenure (i.e., whether the accommodation was owned, rented from a private landlord, or rented from a local authority or housing association), qualification level, and a combined economic activity variable incorporating broad economic activity (employed, unemployed or inactive) and, for the employed, employment status, part-time/full-time and temporary/permanent. Region was found to be independently associated in only two of the four waves, and sex in none.

For the five variables consistently appearing for all waves, there was a good degree of consistency concerning which categories were associated with sample attrition. Table 2 gives the multiplying factors for the odds ratio for all categories with a consistent association with increasing attrition. Being in the younger age groups, between 18 and 29 (and especially 18 to 24), has a particularly strong effect, as does being in privately rented accommodation. Being single (i.e., never married and not cohabiting) has a moderate association. There are no consistent associations with particular categories of economic activity or qualification level, except for a slight one with full-time, temporary employees. The effect of region is not consistent even for the two waves in which it appears.

Table 2
Multiplying Factors for Odds Ratios: Categories Associated with High Attrition

                                   Multiplying factor for odds ratio
Variable & category                Wave 1   Wave 2   Wave 3   Wave 4
Age group
  18-19                              1.89     2.56     2.86     1.92
  20-24                              1.79     2.08     2.10     2.83
  25-29                              1.17     1.30     1.44     1.55
Tenure
  Privately rented                   2.12     1.52     1.86     2.29
Marital status
  Single                             1.25    (1.12)    1.27     1.49
Economic activity/status
  Employee, full-time, temporary    (1.12)   (1.36)   (1.13)    1.75
Note: ( ) indicates the coefficient is not significant at the 5% level.
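As a rough sketch of the kind of model behind Table 2, assuming a person-level analysis file with illustrative column names (this is not the ONS production code), the multiplying factors are the exponentiated logistic regression coefficients:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical person-level file: one row per case, with 'unlinked' = 1 when
# the case could not be linked to the next quarter. All column names are
# illustrative, not the actual LFS variable names.
df = pd.read_csv("lfs_wave_cases.csv")

fit = smf.logit(
    "unlinked ~ C(age_group) + C(marital_status) + C(tenure)"
    " + C(qualification) + C(econ_activity)",
    data=df,
).fit()

# Exponentiated coefficients are the multiplying factors for the odds ratio,
# the quantity reported in Table 2.
print(np.exp(fit.params).round(2))
```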
The logistic regression analysis performed did not allow
for interactions between the variables, and to investigate
this possibility a further analysis was performed, using the
CHAID module of SPSS to produce a segmentation of the
data set into groups which have as great a variation as
possible with respect to the proportion of unlinked cases.
The results of this were however very similar to those of the
logistic regression analysis. Overall, the main characteristics independently associated with a high proportion of
sample loss were the younger adult age groups (18 to 29,
especially 18 to 24) and living in privately rented accommodation, with some relatively minor additional effects of
being in temporary employment for the youngest age
groups. Separate analyses of the characteristics of those
sample members who had been lost through moving away,
and those lost through non-contact (or, more rarely, refusal)
produced similar results.
4. COMPENSATING FOR ATTRITION BIAS
The analysis so far has been directed at the biasing effect
of sample attrition on the cross-sectional characteristics of
the longitudinal sample, and has identified the characteristics independently associated with greater nonresponse.
A possible approach to compensating for the bias arising
from this is to incorporate tenure as well as age into the
weighting procedure for the longitudinal data. This is being
explored, using a calibration approach with CALMAR
software, and including prior weights derived from the
work described above to compensate for differential
nonresponse by tenure.
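CALMAR is a SAS macro; purely as an illustration of the underlying calibration idea, a raking (iterative proportional fitting) step can be sketched in Python as follows. The weights, category codes and control totals here are all invented:

```python
import numpy as np

def rake(weights, cats, margins, max_iter=100, tol=1e-8):
    """Iterative proportional fitting: scale weights until the weighted
    category totals match the control totals for every calibration variable.
    cats: dict name -> integer category code per case;
    margins: dict name -> array of population totals per category."""
    w = np.asarray(weights, dtype=float).copy()
    for _ in range(max_iter):
        worst = 0.0
        for name, codes in cats.items():
            totals = np.bincount(codes, weights=w, minlength=len(margins[name]))
            factors = margins[name] / totals
            w *= factors[codes]
            worst = max(worst, np.abs(factors - 1.0).max())
        if worst < tol:
            break
    return w

# Hypothetical illustration: prior weights (already adjusted for differential
# nonresponse by tenure) calibrated to age-group and tenure control totals.
rng = np.random.default_rng(0)
n = 1000
age = rng.integers(0, 3, size=n)        # 3 age groups, coded 0..2
tenure = rng.integers(0, 2, size=n)     # owner = 0, renter = 1
prior = np.ones(n)
calibrated = rake(prior,
                  {"age": age, "tenure": tenure},
                  {"age": np.array([40_000, 35_000, 25_000], dtype=float),
                   "tenure": np.array([70_000, 30_000], dtype=float)})
print(calibrated.sum())                 # matches the population total, 100,000
```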
However, there may be a problem which would limit the effectiveness of this approach. The propensity not to respond may be directly dependent on the unobserved labour force status of the individual, possibly independently of their observed characteristics. Nonresponse of this kind is an example of non-ignorable nonresponse (Rubin 1976), and its presence would imply that estimates of the important measures of labour force gross flows would be biased even after the application of a weighting process of the type being explored.
There are two indirect approaches which give some
indication of whether non-ignorable nonresponse might be
a problem for longitudinal LFS data. One is to investigate
whether the proportion of the gross flows in the sample
which are transitions between different economic activity
states systematically decreases (or increases) from wave 1-2 to wave 4-5; if so, this would suggest that people changing from one state to another are more (or less) likely to be nonrespondents than those in a stable state. However, Table 3 shows that there is no consistent systematic pattern across waves - though this does not exclude the possibility of other patterns of differential nonresponse by labour force flows category.
Table 3
Percentage of Transitions Between Different Economic Activity States by Wave for Pairs of Adjacent Quarters

Data set                       Wave 1-2  Wave 2-3  Wave 3-4  Wave 4-5
Summer/autumn 1995                8.0       7.3       7.3       7.1
Autumn 1995/winter 1995-1996      7.2       6.7       6.5       6.5
Summer/autumn 1996                7.6       7.0       7.3       7.5
Autumn 1996/winter 1996-1997      6.8       6.5       6.2       6.5
Another possibility is that people moving addresses (and thereby lost to the LFS sample) may have a different pattern of labour force flows than the rest of the population. We do not have any information on the people who have moved away, but we do know something about the people who have moved into the sample addresses from elsewhere. These movers-in can reasonably be taken to represent the movers-out, since they are equally samples from the same population of movers (ignoring the possible effects of the small proportion of international moves). Table 4 shows the distribution of the linked sample (all adults whose records were able to be matched) and of the identifiable movers-in for a pair of adjacent quarters in 1995. (It should however be noted that the flows categories are not strictly comparable, since the previous economic activity state for the movers-in is obtained by retrospective reporting.) It is clear that the sample of movers does differ, with a lower proportion in stable inactivity, and a higher proportion in all the other flows categories, and in particular a greater proportion of people changing their economic activity state; but the movers make up such a small proportion overall that the effect on the whole sample is negligible.
Table 4
Gross Flows for Movers-in Compared with Linked Sample

Activity states   Linked sample (%)  Movers-in (%)  Linked + movers (%)
EE                      55.1             56.9             55.1
EU                       0.8              1.5              0.8
EN                       1.1              1.6              1.1
UE                       1.0              1.7              1.0
UU                       2.9              6.5              3.0
UN                       0.7              1.1              0.7
NE                       1.2              2.5              1.3
NU                       1.0              2.3              1.1
NN                      36.2             26.0             35.9
All transitions          5.9             10.6              6.0
TOTAL (no.)           80,664            1,790           82,454
Note: E represents in employment, U represents ILO unemployed, and N represents economically inactive; hence EE represents in employment at both quarters, EU in employment then ILO unemployed, etc.
These indirect approaches do not indicate any very strong effect of non-ignorable nonresponse, but they do not rule it out. This possibility is therefore being investigated by work involving the modelling of nonresponse in the LFS.
5. RESPONSE ERROR AND ITS BIASING
EFFECTS
All surveys in general, and household surveys in particular, are subject to response error, when the information given by the respondent is not an accurate reflection of the actuality. This may occur for a variety of reasons: the respondent may misunderstand the question; the interviewer may misunderstand or misrecord the response; the respondent may not know or remember the correct answer; or the respondent may knowingly give an incorrect answer for reasons of embarrassment, prestige, fear of breach of confidentiality or a wish to give the "expected" answer.
In the field of labour force surveys it has generally been found (for an overview see Lemaitre 1994) that, for cross-sectional data, there is no particular tendency for the errors to be systematic, so that on average they tend to cancel out. However, for longitudinal data produced by linking together data collected on the same person at different points in time, this cancellation may not occur.
In particular, this is likely to be the case for data on gross flows between economic activity states. The numbers of people who move from one state (in employment, unemployed, economically inactive) to another during the relatively short period usually considered (a month, a quarter, or perhaps a year) are small compared with the numbers of people who remain in the same state. A response error at one point of time is much more likely to lead to an apparent change of state when the true situation is one of stability, than the reverse. Thus response errors are likely to have a very disproportionate effect in upwardly biasing flows between reported states. In the LFS, they may arise from the use of proxy respondents, where one person answers questions on behalf of someone else in the same household; and from respondent errors. We will consider these in turn.
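The asymmetry is easy to see in a small simulation. In the sketch below (all rates invented), every respondent truly stays in the same state across the two quarters, yet a 2% independent classification error at each interview manufactures an apparent transition rate of about 4%:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_err = 100_000, 0.02              # hypothetical error rate per interview

# Everyone truly stays in the same state across the two quarters.
true_state = rng.choice(3, size=n, p=[0.55, 0.05, 0.40])   # E, U, N shares

def observe(state):
    # With probability p_err, record one of the other two states at random.
    err = rng.random(n) < p_err
    shift = rng.integers(1, 3, size=n)
    return np.where(err, (state + shift) % 3, state)

obs_q1, obs_q2 = observe(true_state), observe(true_state)
print((obs_q1 != obs_q2).mean())      # approx. 2 * 0.02 = 4% spurious flows
```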
6. THE EFFECT OF PROXY RESPONDENTS
To investigate the effect of proxy respondents, we need
to look at the distribution of activity states at the two
quarters according to whether the first quarter's interview
was in person or by proxy, and whether the second quarter's
interview was in person or by proxy. Very young adults
under 20 are both exceptionally likely to be represented by
proxies and also likely to be particularly volatile in terms of
their economic activity category, and so may distort any
relationship between these two factors. Table 5 therefore
shows the distribution of activity states at the two quarters,
for men aged 20 to 64 and women aged 20 to 59. There is
a higher proportion of transitions for personal followed by
proxy interviews than for personal at both quarters, but
proxy followed by personal interviews show only a very
slightly higher proportion than personal at both quarters.
Thus switching between proxy and personal interviews
does not show a consistently greater proportion of
transitions. Cases with both interviews by proxy have the
lowest proportion of transitions of all, and the inclusion of
these brings the overall proportion to a level consistent with
that for personal interviews at both quarters. Thus there do
appear to be differences between the various combinations
of interview types, which merit further investigation, but in
the LFS the use of proxy respondents does not of itself
produce an exaggerated estimate of gross flows.
Table 5
Percentage of Transitions by Interview Type

                     Men (20-64)             Women (20-59)
Interview type       Sample no.  % trans.    Sample no.  % trans.
Personal/personal      14,527      5.2         19,582      7.3
Personal/proxy          2,044      7.0          1,597      8.3
Proxy/personal          2,214      5.4          1,632      7.6
Proxy/proxy             8,602      4.7          4,206      5.4
All                    27,387      5.2         27,017      7.1
7. RESPONDENT ERRORS
By their nature, respondent errors are impossible to identify directly (except perhaps by re-interview, and even then there may be doubt about what is the correct answer). It is however sometimes possible to identify internal inconsistencies in the survey data, which may indicate response error. In the LFS, respondents who are in employment, and respondents who are unemployed, are asked how long they have been in that state. If the period is greater than three months, but they stated in the previous quarter that they were in a different state, there is an inconsistency which may indicate a false transition between economic activity states.
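A check of this kind reduces to a simple filter on the linked file. The following sketch assumes illustrative column names, not the actual LFS variable names:

```python
import pandas as pd

# Hypothetical linked file: one row per person and adjacent-quarter pair, with
# economic activity states in {"E", "U", "N"} and the duration (in months)
# reported for the current state.
linked = pd.read_csv("lfs_linked_pairs.csv")

is_transition = linked["state_q1"] != linked["state_q2"]
# A reported duration of more than three months in the current state
# contradicts the different state reported at the previous quarter.
is_inconsistent = is_transition & (linked["duration_months"] > 3)

share = is_inconsistent.sum() / is_transition.sum()
print(f"Inconsistent share of apparent transitions: {share:.1%}")
```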
Table 6 shows the percentage of inconsistencies for various kinds of transitions; these are high throughout. Transitions from economic inactivity produce the highest percentages, especially when the transition is into unemployment. (There are no large or consistent differences between the different subcategories of the inactive.) Separating those in employment into part-time and full-time shows that there is a consistent pattern of a greater proportion of inconsistencies for part-time employment, and similar but less pronounced results were found for the self-employed.
Table 6
Percentage of Inconsistencies by Transition Type

                               Percentage of inconsistencies
Transition type                All (%)  Full time (%)  Part time (%)
Unempl. to Employment             8.7        7.8           12.2
Inactive to Employment           26.2       18.1           30.4
Employment to Unemployment       18.7       14.7           23.3
Inactive to Unemployment         49.5
All                              23.9
It is possible that the inconsistencies may have arisen through errors in the reported length of time in the economic activity state at the second quarter, rather than in the initial state at the first quarter. The distribution of the length of time does not however show heaping at around four to five months (as would be expected in the case of errors in the duration data). Also, the duration data reported in consecutive quarters for people in a stable state were found to be very consistent. These findings tend to suggest, though the evidence is indirect and by no means conclusive, that the errors are more likely to be in the reporting of economic activity at one or other of the interviews. This is not the only possibility; for example, it may be that some respondents have correct transition data, but incorrect duration data through using an interpretation of their past economic activity which is not consistent with the standard definitions applied to the reporting of their current state. But the findings so far suggest that it is likely to be the most common.
Some light on which of the inconsistent categories is correct may emerge by looking at the pattern of responses over three interviews. Table 7 shows the proportions of each group of inconsistent transitions from one quarter to the next which are followed by each economic activity category in the third quarter. (All relevant waves are combined in order to obtain reasonable sample sizes.) It is clear that of the transitions into employment in the second quarter, the great majority remain in that category in the third quarter. The transitions into unemployment show a much more mixed pattern, with a little over half remaining in unemployment, but a substantial group of about 30 to 40 per cent reverting to the state reported in the first quarter. It is noteworthy that scarcely any of the transitions from the second to the third quarter for this group were found to have a repeated inconsistency between the transition and the reported duration data. The results so far suggest that, in the case of an inconsistent transition into employment, that is likely to be the correct state, but more investigation is needed to achieve further clarification.

Table 7
Percentages of Inconsistent Transitions by Economic Activity at Following Quarter

                          Total           Activity state in next quarter
Transition type           inconsistent    Employed (%)  Unempl. (%)  Inactive (%)
Unempl. to Employment          60              90             7            3
Inactive to Employment        159              79             4           17
Employment to Unempl.          87              39            53            8
Inactive to Unempl.           229              17            55           28
8. ADJUSTING FOR RESPONSE ERROR BIAS
It is clear from the above that there is likely to be a substantial level of response error affecting the raw data on gross flows. Work on adjusting for such errors has so far been largely confined to the USA and Canada. A review of three methods proposed for USA data is given by Flaim and Hogue (1985), and a later proposal for Canadian data is given in Singh and Rao (1995), but to date, to the author's knowledge, no official adjusted gross flows data are being published, though several countries are publishing unadjusted data while drawing attention to their limitations. The adjustment methods so far proposed all rely on assumptions about the nature of the errors which seem unlikely to be met in practice: either full independence of the classification errors, or very limited departures from that assumption. (See Lemaitre 1994 for a review of problems with these adjustment methods.)
It seems worthwhile to explore different routes to the development of methods of adjustment or compensation for response error bias. As a first stage, work is continuing on the investigation of the characteristics and circumstances of cases of inconsistency, and of other possible ways of identifying false transitions. It is also proposed to investigate the circumstances of people giving inconsistent responses of the kind analysed above, by means of more detailed follow-up interviews. This should provide better indications of the extent to which the inconsistencies do represent response error, and may provide results useful for both reducing, and adjusting for, response error. Both these strands will provide inputs to a third element of the forward programme, in which it is proposed to develop models of classification error in reporting economic activity.
ACKNOWLEDGEMENTS
The author wishes to thank the editor and two referees
for their helpful comments.
REFERENCES
FLAIM, P. O., and HOGUE, C. R. (1985). Measuring labor force
flows: a special conference examines the problems. Monthly
Labor Review, July 1985, U.S. Bureau of Labor Statistics.
FOSTER, K. (1994). The Labour Force Survey - Report of the 1991
Census-linked Study of Survey Nonrespondents. Office of
Population Censuses and Surveys.
LEMAITRE, G. (1994). Data on Labour Force Dynamics from
Labour Force Surveys. Organisation for Economic Co-operation
and Development.
RUBIN, D.B. (1976). Inference and missing data. Biometrika, 63,
581-592.
SINGH, A.C., and RAO, J.N.K. (1995). On the adjustment of gross flow estimates for classification error with application to data from the Canadian Labour Force Survey. Journal of the American Statistical Association, 90, 478-488.
Survey Methodology, June 1999
Vol. 25, No. 1, pp. 105-106
Statistics Canada
A Model Based Justification of Kish's Formula for
Design Effects for Weighting and Clustering
SIEGFRIED GABLER, SABINE HAEDER and PARTHA LAHIRI

Siegfried Gabler and Sabine Haeder, ZUMA, B 2,1, D-68159 Mannheim; Partha Lahiri, University of Nebraska-Lincoln, NE 68588-0323.
ABSTRACT
In this short note, we demonstrate that the well-known formula for the design effect intuitively proposed by Kish has a
model-based justification. The formula can be interpreted as a conservative value for the actual design effect.
KEY WORDS: Cluster size; Intraclass correlation coefficient; Selection probabilities.
1. INTRODUCTION

We consider multistage, clustered sample designs where each observation belongs to a weighting class. For example, the clusters are blocks which are selected proportional to the number of their households. Within each block the same number of households is selected with equal probabilities. A randomly chosen person of the household has to be interviewed. Then, the household sizes determine the weighting classes. Kish (1987) proposed the following formula for determining the design effect, in order to incorporate the effects due to both weighting, needed to counter unequal selection probabilities, and clustered selection:

$$\operatorname{deff}_{\text{Kish}} = m\,\frac{\sum_{i=1}^{I} w_i^2 m_i}{\bigl(\sum_{i=1}^{I} w_i m_i\bigr)^2}\,\bigl[1 + (\bar b - 1)\rho\bigr],$$

where $m_i$ and $w_i$ denote the number of observations and the weight attached to the $i$-th weighting class $(i = 1, \dots, I)$, $m = \sum_{i=1}^{I} m_i$ is the total sample size, $\bar b$ is the average cluster size and $\rho$ is the intraclass correlation coefficient. Kish's formula is very intuitive and novel, but he said that his "treatment may be incomplete and imperfect."

Kish's formula is now used by many survey samplers. In fact, the above formula will be used in the sample size determination in the European Social Surveys to be conducted by its member countries. The purpose of this note is to provide a model-based justification for using Kish's formula.

2. A MODEL BASED JUSTIFICATION OF KISH'S FORMULA

Let $m_{ic}$ be the number of observations in the $c$-th sampled cluster belonging to the $i$-th weighting class $(i = 1, \dots, I;\ c = 1, \dots, C)$. Then $m_i = \sum_{c=1}^{C} m_{ic}$, the number of observations in the $i$-th weighting class. Let $b_c = \sum_{i=1}^{I} m_{ic}$, the number of observations in the $c$-th cluster, so that $\bar b = C^{-1}\sum_{c=1}^{C} b_c$. Let $y_{cj}$ and $w_{cj}$ be the observation and the weight for the $j$-th sampling unit in the $c$-th cluster $(c = 1, \dots, C;\ j = 1, \dots, b_c)$. The usual design-based estimator for the population mean is defined as

$$\bar y_w = \frac{\sum_{c=1}^{C}\sum_{j=1}^{b_c} w_{cj}\,y_{cj}}{\sum_{c=1}^{C}\sum_{j=1}^{b_c} w_{cj}}.$$

To justify Kish's formula, we assume the following model:

$$\operatorname{Var}(y_{cj}) = \sigma^2 \quad (c = 1, \dots, C;\ j = 1, \dots, b_c),$$
$$\operatorname{Cov}(y_{cj}, y_{c'j'}) = \begin{cases} \rho\sigma^2 & \text{if } c = c',\ j \neq j', \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

The above model is appropriate to account for the cluster effect and was used earlier by others (see, e.g., Skinner, Holt and Smith 1989). We shall then define the design effect as $\operatorname{deff} = \operatorname{Var}_1(\bar y_w)/\operatorname{Var}_2(\bar y)$, where $\operatorname{Var}_1(\bar y_w)$ is the variance of $\bar y_w$ under model (1) and $\operatorname{Var}_2(\bar y)$ is the variance of the overall sample mean $\bar y = \sum_{c=1}^{C}\sum_{j=1}^{b_c} y_{cj}/m$, computed under the following model:

$$\operatorname{Var}(y_{cj}) = \sigma^2 \quad (c = 1, \dots, C;\ j = 1, \dots, b_c),$$
$$\operatorname{Cov}(y_{cj}, y_{c'j'}) = 0 \quad \text{for all } (c, j) \neq (c', j'). \tag{2}$$

Note that model (2) is appropriate under simple random sampling and provides the usual formula $\sigma^2/m$ for $\operatorname{Var}_2(\bar y)$.

Now, turning our attention to $\operatorname{Var}_1(\bar y_w)$, first note that

$$\operatorname{Var}_1\Bigl(\sum_{c=1}^{C}\sum_{j=1}^{b_c} w_{cj}\,y_{cj}\Bigr) = \sum_{c=1}^{C}\Bigl[\sum_{j=1}^{b_c} w_{cj}^2\operatorname{Var}(y_{cj}) + \sum_{j \neq j'} w_{cj}\,w_{cj'}\operatorname{Cov}(y_{cj}, y_{cj'})\Bigr]$$
$$= \sigma^2\Bigl[\sum_{i=1}^{I} w_i^2 m_i + \rho\sum_{c=1}^{C}\Bigl(\sum_{i=1}^{I} w_i m_{ic}\Bigr)^2 - \rho\sum_{i=1}^{I} w_i^2 m_i\Bigr], \tag{3}$$

since $\sum_{c=1}^{C}\sum_{j=1}^{b_c} w_{cj}^2 = \sum_{i=1}^{I} w_i^2 m_i$ and $\sum_{c=1}^{C}\sum_{j \neq j'} w_{cj}\,w_{cj'} = \sum_{c=1}^{C}\bigl(\sum_{i=1}^{I} w_i m_{ic}\bigr)^2 - \sum_{i=1}^{I} w_i^2 m_i$. Noting that $\sum_{c=1}^{C}\sum_{j=1}^{b_c} w_{cj} = \sum_{i=1}^{I} w_i m_i$, we have

$$\operatorname{Var}_1(\bar y_w) = \sigma^2\,\frac{(1-\rho)\sum_{i=1}^{I} w_i^2 m_i + \rho\sum_{c=1}^{C}\bigl(\sum_{i=1}^{I} w_i m_{ic}\bigr)^2}{\bigl(\sum_{i=1}^{I} w_i m_i\bigr)^2},$$

so that

$$\operatorname{deff} = m\,\frac{\sum_{i=1}^{I} w_i^2 m_i}{\bigl(\sum_{i=1}^{I} w_i m_i\bigr)^2}\,\bigl[1 + (b^* - 1)\rho\bigr], \tag{4}$$

where $b^* = \sum_{c=1}^{C}\bigl(\sum_{i=1}^{I} w_i m_{ic}\bigr)^2 \big/ \sum_{i=1}^{I} w_i^2 m_i$. Using the Cauchy-Schwarz inequality, we get

$$\Bigl(\sum_{i=1}^{I} w_i m_{ic}\Bigr)^2 \le b_c \sum_{i=1}^{I} w_i^2 m_{ic},$$

so that

$$b^* \le \frac{\sum_{c=1}^{C} b_c \sum_{i=1}^{I} w_i^2 m_{ic}}{\sum_{i=1}^{I} w_i^2 m_i} = \bar b_w,\ \text{say}. \tag{5}$$

Thus (4) and (5) yield

$$\operatorname{deff} \le m\,\frac{\sum_{i=1}^{I} w_i^2 m_i}{\bigl(\sum_{i=1}^{I} w_i m_i\bigr)^2}\,\bigl[1 + (\bar b_w - 1)\rho\bigr]. \tag{6}$$

Note that $\bar b_w$ can be interpreted as an average (weighted) cluster size. If $\bar b_w$ is equal to $\bar b$, e.g., if all $b_c$ are equal, the upper bound of deff is simply Kish's formula. Thus Kish's formula serves as a conservative value for the actual design effect.
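The inequality in (6) is easy to verify numerically. The following Python sketch (all design quantities invented) computes the exact model-based deff in (4), the Cauchy-Schwarz bound in (6) and Kish's value:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical design: C clusters, I weighting classes, all quantities invented.
C, I, rho = 8, 3, 0.05
w = np.array([1.0, 2.0, 4.0])               # class weights w_i
m_ic = rng.integers(1, 6, size=(I, C))      # m_ic: class-i count in cluster c
m_i = m_ic.sum(axis=1)                      # observations per weighting class
b_c = m_ic.sum(axis=0)                      # cluster sizes b_c
m, b_bar = m_ic.sum(), b_c.mean()

k = m * (w**2 @ m_i) / (w @ m_i) ** 2       # the weighting part of the formula

# Exact model-based design effect, equation (4):
b_star = ((w @ m_ic) ** 2).sum() / (w**2 @ m_i)
deff = k * (1 + (b_star - 1) * rho)

# Cauchy-Schwarz bound (6) and Kish's value:
b_w = (b_c * (w**2 @ m_ic)).sum() / (w**2 @ m_i)
deff_bound = k * (1 + (b_w - 1) * rho)
deff_kish = k * (1 + (b_bar - 1) * rho)

# deff <= deff_bound always; the bound coincides with Kish's formula when
# all cluster sizes b_c are equal.
print(deff, deff_bound, deff_kish)
```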
ACKNOWLEDGEMENTS

The authors are thankful to the editor and the referees for their remarks which led to an improvement of the paper. The work was completed while the last author was a Guest Professor at ZUMA, the Center for Survey Research and Methodology, Mannheim, Germany.
REFERENCES
KISH, L. (1987). Weighting in Deft². The Survey Statistician, June 1987.
SKINNER, C.J., HOLT, D., and SMITH, T.M.F. (Eds.) (1989).
Analysis of Complex Surveys. Chichester: Wiley.
JOURNAL OF OFFICIAL STATISTICS
An International Review Published by Statistics Sweden

JOS is a scholarly quarterly that specializes in statistical methodology and applications. Survey methodology and other issues pertinent to the production of statistics at national offices and other statistical organizations are emphasized. All manuscripts are rigorously reviewed by independent referees and members of the Editorial Board.

Contents
Volume 14, Number 4, 1998

Introduction to the Special Issue: Disclosure Limitation Methods for Protecting the Confidentiality of Statistical Data
  Stephen E. Fienberg and Leon C.R.J. Willenborg ... 337
A Database System Prototype for Remote Access to Information Based on Confidential Data
  Sallie Keller-McNulty and Elizabeth A. Unger ... 347
Estimating the Re-identification Risk Per Record in Microdata
  C.J. Skinner and D.J. Holmes ... 361
A Bayesian Species-Sampling-Inspired Approach to the Uniques Problem in Microdata Disclosure Risk Assessment
  Stephen M. Samuels ... 373
Confidentiality, Uniqueness, and Disclosure Limitation for Categorical Data
  Stephen E. Fienberg and Udi E. Makov ... 385
Synthetic and Combined Estimators in Statistical Disclosure Control
  Jeroen Pannekoek and Ton de Waal ... 399
Balancing Disclosure Risk Against the Loss of Nonpublication
  Alan M. Zaslavsky and Nicolas J. Norton ... 411
Optimal Local Suppression in Microdata
  A.G. de Waal and L.C.R.J. Willenborg ... 421
Models and Methods for the Microdata Protection Problem
  C.A.J. Hurkens and S.R. Tiourine ... 437
Masking Microdata Using Micro-Aggregation
  D. Defays and M.N. Anwar ... 449
Post Randomisation for Statistical Disclosure Control: Theory and Implementation
  J.M. Gouweleeuw, P. Kooiman, L.C.R.J. Willenborg and P.-P. de Wolf ... 463
Comment
  G. Sande ... 479
Disclosure Limitation Using Perturbation and Related Methods for Categorical Data
  Stephen E. Fienberg, Udi E. Makov, and Russell J. Steel ... 485
Comment
  Peter Kooiman ... 503
Rejoinder
  Stephen E. Fienberg, Udi E. Makov, and Russell J. Steel ... 509
Comparison of Systems Implementing Automated Cell Suppression for Economic Statistics
  N. Kirkendall and G. Sande ... 513
Using Noise for Disclosure Limitation of Establishment Tabular Data
  Timothy Evans, Laura Zayatz, and John Slanta ... 537
Experiments with Controlled Rounding for Statistical Disclosure Control in Tabular Data with Linear Constraints
  Matteo Fischetti and Juan-José Salazar-González ... 553
Editorial Collaborators ... 567
Index to Volume 14, 1998 ... 571

All inquiries about submissions and subscriptions should be directed to the Chief Editor:
Lars Lyberg, R&D Department, Statistics Sweden, Box 24 300, S-104 51 Stockholm, Sweden.
The Canadian Journal of Statistics
La Revue Canadienne de Statistique

CONTENTS / TABLE DES MATIÈRES
Volume 27, No. 1, March/mars 1999

Christian GENEST
  Editor's report / Rapport du rédacteur en chef
Mousumi BANERJEE, Michelle CAPOZZOLI, Laura MCSWEENEY and Debajyoti SINHA
  Beyond kappa: A review of interrater agreement measures
L. MANCHESTER, C.A. FIELD and A. MCDOUGALL
  Regression for overdetermined systems: A fisheries example
E. MONGA and S. TARDIF
  Asymptotic optimality of a class of rank tests for replicated Latin square designs
Guohua PAN and Winson TAAM
  Distribution-free subset selection for incompletely ranked data
Gary SNEDDON
  Smoothing in an underdetermined linear model with random explanatory variables
John R. COLLINS
  Robust M-estimators of scale: Minimax bias versus maximal variance
Dongchu SUN and Keying YE
  Reference priors for a product of normal means when variances are unknown
Sonia PETRONE
  Bayesian density estimation using Bernstein polynomials
M. PENSKY and R.S. SINGH
  Empirical Bayes estimation of reliability characteristics for an exponential family
M.C. FINOCCHIARO and D. SACCHETTI
  The variance function of the natural exponential family generated by a measure on n integers
John E. KOLASSA
  Confidence intervals for parameters lying in a random polygon
Nabendu PAL and Wooi K. LIM
  Second order properties of intraclass correlation estimators for a symmetric normal distribution
Andrew HEARD and Tim SWARTZ
  Extended voting measures
Michael BARON
  Convergence rates of change-point estimators and tail probabilities of the first-passage-time process
Célestin C. KOKONENDJI
  Le problème d'Anscombe pour les lois binomiales négatives généralisées
Helge BLAKER
  A class of shrinkage estimators in linear regression
Acknowledgement of referees' services / Remerciements aux membres des jurys
Forthcoming papers / Articles à paraître
The Canadian Journal of Statistics
La Revue Canadienne de Statistique

CONTENTS / TABLE DES MATIÈRES
Volume 27, No. 2, June/juin 1999

Andrey FEUERVERGER, John ROBINSON and Augustine WONG
  On the relative accuracy of certain bootstrap procedures
Biao ZHANG
  Bootstrapping with auxiliary information
Steven N. MACEACHERN, Merlise CLYDE and Jun S. LIU
  Sequential importance sampling for nonparametric Bayes models: The next generation
George D. PAPANDONATOS and Seymour GEISSER
  Bayesian interim analysis of lifetime data
Paul DAMIEN and Stephen WALKER
  A full Bayesian analysis of circular data using the von Mises distribution
Craig A. COOLEY and Steven N. MACEACHERN
  Prior elicitation in the classification problem
Schultz CHAN and Malay GHOSH
  A geometric optimality of Cox's partial likelihood
Y. LEE and John A. NELDER
  The robustness of the quasilikelihood estimator
Hui CHEN and Joseph P. ROMANO
  An invariance principle for triangular arrays of dependent variables with application to autocovariance estimation
Kanchan MUKHERJEE
  The asymptotic behavior of a class of L-estimators under long-range dependence
Ross H. TAPLIN
  Robust F-tests for linear models
Ping ZHANG
  The optimal prediction of cross-sectional proportions in categorical panel-data analysis
Dan NETTLETON
  Order-restricted hypothesis testing in a variation of the normal mixture model
K. KRISHNAMOORTHY and Maruthy K. PANNALA
  Confidence estimation of a normal mean vector with incomplete data
Marten H. WEGKAMP
  Quasi-universal bandwidth selection for kernel density estimators
Christian GENEST
  Probability and statistics: A tale of two worlds?
Forthcoming Papers / Articles à paraître
Journal of the Royal Statistical Society
Series D (The Statistician)
Edited by N.R.J. Fieller and U.T. Moorthy

Covering a broad range of topics of interest to professional statisticians, The Statistician includes applied papers on education, business, sport, industry and agriculture, statistical computing and professional affairs, and obituaries of eminent statisticians.

Recent and forthcoming highlights:
Some Statistical Heresies (with Discussion), J.K. Lindsey
Can Takeover Targets be Identified by Statistical Techniques? Some UK Evidence, P. Barnes
A Fractional Factorial Design for Bench-mark Testing of a Bayesian Method for Multilocation Audits, V. Barnett and J. Haworth
Using Maximum Entropy to Double One's Expected Winnings in the UK National Lottery, S.J. Cox, G.J. Daniell and D.A. Nicole
Demonstrating the Durbin-Watson Statistic, R. Champion, C.T. Lenard and T.M. Mills
Tiers, Structure Formulae and the Analysis of Complicated Experiments, C.J. Brien and R.W. Payne

Journal of the Royal Statistical Society: Series D (The Statistician)
ISSN 0039-0526. Published in March, June, September and December.
Subscription rates, Vol. 48/1999: Europe £80, N. America $149, Rest of World £90.

To subscribe to The Statistician please use the order form on the Blackwell website: http://www.blackwellpublishers.co.uk, send an email to jnlinfo@blackwellpublishers.co.uk, or contact either of the following:
- Blackwell Publishers Journals, PO Box 805, 108 Cowley Road, Oxford OX4 1FH, UK. Tel: +44 (0)1865 244083, fax +44 (0)1865 381381
- Journals Marketing (RSSD), Blackwell Publishers, 350 Main Street, Malden, MA 02148, USA. Tel: +1 (781) 388 8200, fax +1 (781) 388 8210

For further information or to request a sample copy please visit our website: http://www.blackwellpublishers.co.uk
GUIDELINES FOR MANUSCRIPTS

Before having a manuscript typed for submission, please examine a recent issue of Survey Methodology (Vol. 19, No. 1 and onward) as a guide and note particularly the points below. Accepted articles must be submitted in machine-readable form, preferably in WordPerfect. Other word processors are acceptable, but these also require paper copies for formulas and figures.

1. Layout
1.1 Manuscripts should be typed on white bond paper of standard size (8½ x 11 inch), one side only, entirely double spaced with margins of at least 1½ inches on all sides.
1.2 The manuscripts should be divided into numbered sections with suitable verbal titles.
1.3 The name and address of each author should be given as a footnote on the first page of the manuscript.
1.4 Acknowledgements should appear at the end of the text.
1.5 Any appendix should be placed after the acknowledgements but before the list of references.

2. Abstract
The manuscript should begin with an abstract consisting of one paragraph followed by three to six key words. Avoid mathematical expressions in the abstract.

3. Style
3.1 Avoid footnotes, abbreviations, and acronyms.
3.2 Mathematical symbols will be italicized unless specified otherwise, except for functional symbols such as "exp(·)" and "log(·)", etc.
3.3 Short formulae should be left in the text, but everything in the text should fit in single spacing. Long and important equations should be separated from the text and numbered consecutively with arabic numerals on the right if they are to be referred to later.
3.4 Write fractions in the text using a solidus.
3.5 Distinguish between ambiguous characters (e.g., w, ω; o, O, 0; l, 1).
3.6 Italics are used for emphasis. Indicate italics by underlining on the manuscript.

4. Figures and Tables
4.1 All figures and tables should be numbered consecutively with arabic numerals, with titles which are as nearly self-explanatory as possible, at the bottom for figures and at the top for tables.
4.2 They should be put on separate pages with an indication of their appropriate placement in the text. (Normally they should appear near where they are first referred to.)

5. References
5.1 References in the text should be cited with authors' names and the date of publication. If part of a reference is cited, indicate after the reference, e.g., Cochran (1977, p. 164).
5.2 The list of references at the end of the manuscript should be arranged alphabetically and, for the same author, chronologically. Distinguish publications of the same author in the same year by attaching a, b, c to the year of publication. Journal titles should not be abbreviated. Follow the same format used in recent issues.