SURVEY METHODOLOGY
A Journal Published by Statistics Canada
June 1999 • Volume 25 • Number 1

Published by authority of the Minister responsible for Statistics Canada
© Minister of Industry, 1999
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise without prior written permission from Licence Services, Marketing Division, Statistics Canada, Ottawa, Ontario, Canada K1A 0T6.
September 1999
Catalogue no. 12-001-XPB
Frequency: Semi-annual
ISSN 0714-0045
Ottawa

Survey Methodology is abstracted in The Survey Statistician, Statistical Theory and Methods Abstracts and the SRM Database of Social Research Methodology, Erasmus University, and is referenced in the Current Index to Statistics and Journal Contents in Qualitative Methods.

MANAGEMENT BOARD
Chairman: G.J. Brackstone
Members: D. Binder, G.J.C. Hole, F. Mayda (Production Manager), C. Patrick, R. Platek (Past Chairman), D. Roy, M.P. Singh

EDITORIAL BOARD
Editor: M.P. Singh, Statistics Canada

Associate Editors: D.R. Bellhouse, University of Western Ontario; D. Binder, Statistics Canada; J.-C. Deville, INSEE; J.D. Drew, Statistics Canada; J. Eltinge, Texas A&M University; W.A. Fuller, Iowa State University; R.M. Groves, University of Maryland; M.A. Hidiroglou, Statistics Canada; D. Holt, Central Statistical Office, U.K.; G. Kalton, Westat, Inc.; R. Lachapelle, Statistics Canada; P. Lahiri, University of Nebraska-Lincoln; S. Linacre, Australian Bureau of Statistics; G. Nathan, Central Bureau of Statistics, Israel; D. Pfeffermann, Hebrew University; J.N.K. Rao, Carleton University; L.-P. Rivest, Université Laval; I. Sande, Bell Communications Research, U.S.A.; F.J.
Scheuren, Ernst and Young, LLP; J. Sedransk, Case Western Reserve University; R. Sitter, Simon Fraser University; C.J. Skinner, University of Southampton; R. Valliant, Westat, Inc.; V.K. Verma, University of Essex; P.J. Waite, U.S. Bureau of the Census; J. Waksberg, Westat, Inc.; K.M. Wolter, National Opinion Research Center; A. Zaslavsky, Harvard University

Assistant Editors: P. Dick, H. Mantel, B. Quenneville and D. Stukel, Statistics Canada

EDITORIAL POLICY
Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers, and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Submission of Manuscripts
Survey Methodology is published twice a year. Authors are invited to submit their manuscripts in either English or French to the Editor, Dr. M.P. Singh, Household Survey Methods Division, Statistics Canada, Tunney's Pasture, Ottawa, Ontario, Canada K1A 0T6. Four nonreturnable copies of each manuscript prepared following the guidelines given in the Journal are requested.

Subscription Rates
The price of Survey Methodology (Catalogue no. 12-001-XPB) is $47 per year in Canada and US $47 per year outside Canada.
Subscription orders should be sent to Statistics Canada, Operations and Integration Division, Circulation Management, 120 Parkdale Avenue, Ottawa, Ontario, Canada K1A 0T6, by dialling (613) 951-7277 or 1 800 700-1033, by fax (613) 951-1584 or 1 800 889-9734, or by Internet: order@statcan.ca. A reduced price is available to members of the American Statistical Association, the International Association of Survey Statisticians, the American Association for Public Opinion Research, and the Statistical Society of Canada.

SURVEY METHODOLOGY
A Journal Published by Statistics Canada
Volume 25, Number 1, June 1999

CONTENTS

In This Issue ... 1
H. KRÖGER, C.-E. SÄRNDAL and I. TEIKARI
  Poisson Mixture Sampling: A Family of Designs for Coordinated Selection Using Permanent Random Numbers ... 3
W.R. BELL and M. KRAMER
  Toward Variances for X-11 Seasonal Adjustments ... 13
J. DE HAAN, E. OPPERDOES and C.M. SCHUT
  Item Selection in the Consumer Price Index: Cut-off Versus Probability Sampling ... 31
P. DUCHESNE
  Robust Calibration Estimators ... 43
Y. TILLÉ
  Estimation in Surveys Using Conditional Inclusion Probabilities: Complex Design ... 57
N.G.N. PRASAD and J.N.K. RAO
  On Robust Small Area Estimation Using a Simple Random Effects Model ... 67
F.A.S. MOURA and D. HOLT
  Small Area Estimation Using Multilevel Models ... 73
M. CHATTOPADHYAY, P. LAHIRI, M. LARSEN and J. REIMNITZ
  Composite Estimation of Drug Prevalences for Sub-State Areas ... 81
S. RUBIN BLEUER and M. KOVACEVIC
  Some Issues in the Estimation of Income Dynamics ... 87
P.F. TATE
  Utilising Longitudinally Linked Data from the British Labour Force Survey ... 99
S. GABLER, S. HAEDER and P. LAHIRI
  A Model Based Justification of Kish's Formula for Design Effects for Weighting and Clustering ... 105

Survey Methodology, June 1999
Vol. 25, No. 1, pp. 1-2
Statistics Canada

In This Issue

Dear Readers,

I would like to share with you good news on two fronts. First, the upcoming December issue will mark the 25th anniversary of Survey Methodology.
This issue of the Journal will be slightly larger than usual and will contain papers from some very prominent statisticians of our time. Second, we are looking into producing an electronic version of the Journal. Our current plan is to make the December 1999 issue available on a special Web site. All current subscribers will be able to download the Journal free of charge. Based on the response to this trial we will see if it is feasible to offer the Journal in that medium instead of, or in addition to, the current paper version. Watch for further information in the next issue. As usual, your comments and suggestions are always welcome.

This issue covers a variety of topics: three papers on small area estimation, four papers on general estimation issues, and two each on new sampling designs and data analysis.

Kröger, Särndal and Teikari introduce a new family of sampling designs, called Poisson Mixture Sampling, which comprises a weighted mixture of Poisson and Bernoulli sampling. Through a Monte Carlo study using Finnish data, they empirically show that, for a variety of point estimators, Poisson Mixture Sampling is more efficient than the usual Poisson sampling.

Bell and Kramer deal with the long-standing problem of estimating the variance of X-11 estimators. Each month, statistical bureaus throughout the world publish the raw estimates of variables along with a corresponding measure of error, usually a standard error or a coefficient of variation. However, the corresponding seasonally adjusted or trend estimates, obtained by application of the X-11 method, do not have such an associated measure of error. Bell and Kramer present an interesting approach that offers a practical solution to this problem. They calculate two sources of error: one resulting from the sampling error and the other resulting from the use of ARIMA extrapolations at the two ends of the series.
De Haan, Opperdoes and Schut discuss sampling the items in a commodity group for input to the Consumer Price Index using scanner data. While most statistical offices currently use a judgmental selection procedure, this naturally leads to biased estimates. The authors address the question of whether probability sampling would lead to better results in terms of mean square error, with interesting results.

Pierre Duchesne considers a new class of robust calibration estimators used to obtain constrained weights at given intervals. The process involves changing carefully selected robust default weights into calibrated weights. In a brief empirical study, the new estimators are illustrated and compared to estimators which have already been proposed.

Tillé investigates a repeated sampling approach which takes into account auxiliary information. First he generalizes the use of conditional inclusion probabilities for use with any sampling design. He then constructs estimators that can be viewed as optimal linear estimators, and compares them with the GREG estimator. He contrasts all of the estimators via a set of simulations. Finally he discusses the problem of interaction between the design and the auxiliary variables.

Prasad and Rao consider the problem of small area estimation through the use of a random effects model. While traditional methods rely on model-based methods to obtain estimates of small area means, Prasad and Rao obtain design-based (model-assisted) estimates by integrating survey weights. Corresponding model-based estimators of the mean squared errors (MSE) of the small area estimates are also derived. Through simulation results, they show that their MSE estimator has low bias and is quite stable.

In their paper on small area estimation, Moura and Holt focus on multilevel models, which make use of auxiliary information at both the unit and the small area levels, and allow small area random effects for both the intercepts and the regression slopes.
The fixed and random effects parameters are estimated using restricted iterative generalized least squares. The mean square error is approximated. Simulations show that the model can lead to better small area estimators than those based on simpler models, that overspecification of the model does not lead to a serious loss of efficiency, and that the MSE approximation and associated MSE estimator work well.

Chattopadhyay, Lahiri, Larsen and Reimnitz consider estimation of proportions for rare events in small areas. Their method is illustrated and compared to other approaches using data from a telephone survey of alcohol and drug use. Their proposed estimator combines census-based demographic estimates of population within age/sex/county groups with survey-based empirical Bayes estimates of proportions within those groups. A jackknife estimator of mean square error is proposed which captures variability due to estimation of model parameters.

The problem of estimating longitudinal low income proportions from a longitudinal survey having a complex design is studied in Rubin-Bleuer and Kovacevic. Two design-based estimators are considered: one based on both the longitudinal and cross-sectional samples, called the "mixed estimator", and one based entirely on the longitudinal sample. Through simulation, the two estimators are compared in the presence of attrition using models of compensation that assume "missing at random" and "completely missing at random" underlying mechanisms. The results are illustrated using data from two longitudinal surveys.

Tate considers linking data on the same individuals from subsequent quarters of the British Labour Force Survey, a rotating panel survey in which one fifth of the sample is renewed at each occasion. She analyzes the various factors which can introduce bias into analyses derived from such linked data. In particular, she studies the possible effects of sample attrition, respondent errors and proxy respondents.
She also considers various approaches to adjusting for these biases.

Finally, in a short note, Gabler, Haeder and Lahiri present a model-based justification for Kish's well known formula for design effects. They show that the result is actually a conservative value for the actual design effect.

The Editor

Survey Methodology, June 1999
Vol. 25, No. 1, pp. 3-11
Statistics Canada

Poisson Mixture Sampling: A Family of Designs for Coordinated Selection Using Permanent Random Numbers

HANNU KRÖGER, CARL-ERIK SÄRNDAL and ISMO TEIKARI

ABSTRACT

This paper introduces Poisson Mixture sampling, a family of sampling designs so named because each member of the family is a mixture of two Poisson sampling designs, Poisson πps sampling and Bernoulli sampling. These two designs are at opposite ends of a continuous spectrum, indexed by a continuous parameter. Poisson Mixture sampling is conceived for use with the highly skewed populations often arising in business surveys. It gives the statistician a range of different options for the extent of the sample coordination and the control of response burden. Some Poisson Mixture sampling designs give considerably more precise estimates than the usual Poisson πps sampling. This result is noteworthy, because Poisson πps is in itself highly efficient, assuming it is based on a strong measure of size.

KEY WORDS: Business surveys; Skewed populations; Response burden; Regression estimators.

1. THE OBJECTIVES OF POISSON MIXTURE SAMPLING

Poisson Mixture (Pomix) sampling is a family of sampling designs suitable for business surveys, with their often highly skewed populations. The Pomix family contains the traditional Bernoulli sampling and Poisson πps sampling designs as two special cases, situated at the two extremes of a range of possibilities indexed by a continuous parameter.
This parameter, called the Bernoulli width and denoted B, satisfies 0 ≤ B ≤ f_R, where f_R is the predetermined expected sampling fraction in the "take-some" portion of the population, that is, the portion where randomized selection is applied.

Random numbers, in the form of independent realizations of the Unif(0,1) random variable, are commonly used in modern computerized sample selection. Fan, Muller and Rezucha (1962) introduced several sequential (unit by unit) drawing mechanisms based on random numbers. Now, Pomix sampling is based on the Permanent Random Number (PRN) technique, which calls for assigning at birth a random number to each unit in the frame (the business register, in the case of a business survey). The random number is permanent in the sense of remaining attached to the unit during its entire lifetime. The PRN technique makes it easy to achieve coordination of samples and control of response burden. Early references to sampling with the aid of PRN's are Brewer, Early and Joyce (1972) and Atmer, Thulin and Backlund (1975). A recent review of different PRN techniques, and important extensions, is given in Ohlsson (1995).

Poisson πps sampling has the desirable feature of selecting large units with relatively greater probability than small units, whose contribution to estimated population totals will in any case be relatively minor. Coordination of Poisson πps samples with the aid of PRN's was introduced by Brewer et al. (1972) and is discussed subsequently by several authors, including Sunter (1977) and Ohlsson (1995). Similarly as in Poisson πps, Pomix sampling allows control of the response burden, as explained in the next section. Larger enterprises will be selected relatively more often than smaller ones. The selection is controlled through rotation so as to distribute the response burden.
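The PRN mechanism just described is easy to sketch in code. The following is a minimal illustration (ours, not from the paper; function and variable names are hypothetical) of Bernoulli selection driven by permanent random numbers, with a movable starting point that yields coordinated, rotating samples:

```python
import random

def assign_prns(frame_ids, seed=1):
    """Attach a permanent Unif(0,1) random number to each frame unit.

    On a real business register the PRN is stored at the unit's
    'birth' and never changes; the seed here is only so that the
    sketch is reproducible.
    """
    rng = random.Random(seed)
    return {k: rng.random() for k in frame_ids}

def prn_bernoulli_sample(prns, f, start=0.0):
    """Select every unit whose PRN falls in (start, start + f],
    wrapping around 1.  Shifting `start` between occasions rotates
    the sample; two surveys whose selection intervals overlap obtain
    positively coordinated samples from the same PRNs."""
    return [k for k, r in prns.items() if 0.0 < (r - start) % 1.0 <= f]
```

With f = 0.1 and 10,000 frame units, the realized sample size is random with expectation 1,000, and moving `start` by a constant shift at each occasion gives the controlled rotation discussed below in section 2.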
Another objective of Pomix sampling is for all (or a substantial portion) of the population units to be included in sample (therefore observed, so that their basic data can be updated) with regularity over a period of time. The objective can be, for example, that every enterprise should be in sample at least once during a ten or twelve year period.

2. THE SELECTION PROCEDURE UNDER POISSON MIXTURE SAMPLING

Denote the finite population as U = {1, ..., k, ..., N}, where the integer k represents the k-th population unit. Denote by y the variable of interest and by y_k its value for unit k; y_k is unknown before sample selection and observation. With the unit k ∈ U is also associated a known positive size measure x_k. Its role in Pomix sampling is to bring about a more frequent selection of the larger units; in addition, the size variable should be used as an auxiliary variable at the estimation stage.

A sample, s, is realized from the population U. The size of s may be random; its expected size, denoted n, is a number fixed in advance. We allow s to consist of two nonoverlapping parts, s = s_C ∪ s_R, where s_C is called the certainty part of s and s_R the randomization part of s. The part s_C, consisting of very large units selected with probability one, is designated in a preliminary step, with the aid of the known size measures x_k. One procedure for this is given in section 3. Depending on the population characteristics, it could happen that no certainty part is designated, so that s_C is the empty set, but this eventuality is rather exceptional with the highly skewed populations usually occurring in business surveys. A frequently used synonymous term for the certainty part is take-all stratum.

Hannu Kröger and Ismo Teikari, P.O. Box 3A, FIN-00022, Statistics Finland, Finland; Carl-Erik Särndal, 11th floor, R.H. Coats Bldg., Statistics Canada, Ottawa, Ontario, K1A 0T6, Canada.
If the take-all stratum is denoted U_C, a probabilistic description is to say that s_C is drawn from the take-all stratum U_C so that s_C = U_C with probability one. We denote the size of s_C = U_C by n_C. Next, the randomization sample, s_R, is selected from the rest of the population, U_R = U - U_C, of size N_R = N - n_C. It consists of units with inclusion probability π_k strictly less than unity. In this paper, s_R is drawn by Pomix sampling (thus it uses the PRN technique). The size of s_R is random; its expected size, denoted n_R, is fixed by the equation n_R = n - n_C.

In this paper, we use the term Poisson sampling for selection corresponding to independent unit-by-unit Bernoulli trials with any inclusion probabilities π_k. More specifically, by Poisson πps we mean Poisson sampling with π_k directly proportional to a measure of size. Bernoulli sampling is the special case of Poisson sampling where all π_k are equal.

For Pomix sampling, we need some more notation. For unit k ∈ U_R, define the relative size measure

    A_k = n_R x_k / Σ_{U_R} x_k.    (2.1)

For Poisson πps sampling, the inclusion probability of unit k is π_k = A_k given by (2.1). Coordination of PRN-based Poisson samples was introduced by Brewer et al. (1972) using a graphical representation corresponding to B = 0 in Figures 1 and 2. At each occasion, the selection area is then a triangle; the unit's PRN on the horizontal axis is plotted against the unit's size measure, A_k, on the vertical axis. Coordination is obtained by "moving the selection area over" to the right. Coordination of Pomix samples is realized in a similar fashion.

We can from now on assume that A_k < 1 for all k ∈ U_R, because if A_k < 1 had not been true for certain units k ∈ U_R, then the procedure in section 3 for constructing the certainty part of the sample would in effect have assigned those units to the certainty part s_C. We now define Pomix sampling with the aid of a two-dimensional diagram.
On the horizontal axis, a unit's PRN is plotted. On the vertical axis, a size-related measure, Q_k, is plotted. At each survey occasion, a new sample selection region is designated by rotation in this diagram, and sample coordination is realized in the manner that we now describe.

Pomix sampling is characterized by two parameters, B and f_R, where f_R, such that 0 < f_R = n_R/N_R < 1, denotes the fixed expected sampling rate in U_R = U - U_C. The parameter B, called the Bernoulli width, is such that 0 ≤ B ≤ f_R. For every unit k ∈ U_R, define

    Q_k = [(1 - B/f_R) / (1 - B)] A_k,    (2.2)

where A_k is given by (2.1). Note that since A_k < 1 for all k ∈ U_R, we have by (2.2) that Q_k < (1 - B/f_R)/(1 - B) ≤ 1 for all k ∈ U_R. When B = 0, we have Q_k = A_k, which is the size measure for the usual Poisson πps sampling. At the other extreme, B = f_R, we have Q_k = 0 for all k ∈ U_R; in this case, size will play no role in the selection from U_R, which will be seen to reduce to Bernoulli sampling. The measures Q_k are used in Pomix selection of coordinated samples, as we now describe.

Start with a plot of the points (r_k, Q_k) for k ∈ U_R, where r_k denotes the PRN attached at birth to unit k, and Q_k is given by (2.2). With reference to Figure 1, Pomix sampling is defined as follows: Include in the randomization sample, s_R, all units having PRN's r_k falling in the (0, B] interval, and also include some units having PRN's r_k in the (B, 1] interval, namely, those for which Q_k is at least equal to a threshold value situated on the line joining the points (B, 0) and (1, 1). The selection area is thus the shaded part of Figure 1.

[Figure 1. Sampling at time 1.]  [Figure 2. Sampling at time 2.]

Figures 1 and 2 illustrate how coordinated Pomix sampling from U_R = U - U_C can be carried out at two consecutive survey occasions. In each of the two figures, the sample is defined as the set of units for which the point (r_k, Q_k) falls in the shaded area.
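This selection rule is simple enough to sketch in a few lines of code. The sketch below is our own illustration (the paper defines the design only mathematically, and all names here are hypothetical); it takes the starting point on the PRN axis to be 0 and includes unit k exactly when its PRN falls at or below B + Q_k(1 - B), i.e., when the point (r_k, Q_k) lies in the shaded area of Figure 1:

```python
def pomix_prob(A, B, f_R):
    """Probability that a unit with relative size A (eq. 2.1) enters
    the randomization sample under Pomix sampling with Bernoulli
    width B and expected sampling rate f_R, where 0 <= B <= f_R."""
    Q = (1 - B / f_R) / (1 - B) * A   # eq. (2.2)
    return B + Q * (1 - B)

def pomix_sample(sizes, prns, B, f_R):
    """Select the randomization sample s_R.

    `sizes` maps unit id -> size measure x_k (all units assumed to
    have A_k < 1); `prns` maps unit id -> permanent random number r_k.
    Unit k is included when 0 < r_k <= B + Q_k (1 - B)."""
    n_R = f_R * len(sizes)
    total_x = sum(sizes.values())
    return [k for k, x in sizes.items()
            if 0.0 < prns[k] <= pomix_prob(n_R * x / total_x, B, f_R)]
```

For example, with f_R = 0.1 and B = 0.03, a unit with A_k = 1 gets pomix_prob = 0.73 and a unit with A_k near 0 gets about 0.03; B = 0 returns A_k (Poisson πps) and B = f_R returns f_R for every unit (Bernoulli).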
The "starting point" on the PRN axis is the point to the right of which we start to count units for inclusion in the sample. At time 1 (Figure 1), the starting point is 0; at time 2 (Figure 2), the starting point is D. (In general, the starting point can be a randomly selected point in the unit interval; in other words, the sample identified in Figure 2 is also the one that would be selected at time t = 1 if at that time the randomly selected starting point on the PRN axis had been equal to D.) A convenient way to achieve sample rotation is through the constant shift method, which implies that the starting point is moved over to the right by a fixed amount at every new occasion of sampling; see Ohlsson (1995). The constant D is called the constant shift. The starting point at time 3 would thus be 2D, and so on. In the following we examine Pomix sampling and estimation at a single occasion, and we can concentrate on Figure 1 (time 1), with starting point 0 on the PRN axis.

The algorithm for Pomix sampling with parameters B and f_R, and starting point 0, is thus as follows: From Figure 1, unit k is included in the randomization part, s_R, (i) if 0 < r_k ≤ B, or (ii) if B < r_k ≤ 1 and Q_k ≥ (r_k - B)/(1 - B). Consequently, k is included in s_R if

    0 < r_k ≤ B + Q_k (1 - B).

Because r_k ~ Unif(0,1), the first order inclusion probabilities under Pomix sampling are

    π_k = B + Q_k (1 - B)  for k ∈ U_R;    π_k = 1  for k ∈ U_C.    (2.3)

It is easy to see that the inclusion probabilities satisfy the necessary requirement that their sum must equal the expected sample size fixed in advance:

    Σ_U π_k = Σ_{U_C} 1 + Σ_{U_R} {B + Q_k (1 - B)} = n_C + n_R = n.

We now note two extreme cases of the family of Pomix sampling schemes. Bernoulli sampling is obtained if B = f_R in the Pomix algorithm, because then Q_k = 0 for all k ∈ U_R, and the algorithm becomes: Include unit k ∈ U_R in s_R if 0 < r_k ≤ f_R, which is Bernoulli sampling. Poisson πps sampling is obtained if B = 0 in the Pomix algorithm, because then the algorithm becomes: Include unit k ∈ U_R in s_R if 0 < r_k ≤ A_k, where A_k is given by (2.1). But this is Poisson πps sampling from U_R, the inclusion probability being π_k = A_k, that is, directly proportional to the size measure x_k.

Pomix sampling is a mixture of Poisson πps and Bernoulli in that the Pomix inclusion probability, π_k = B + Q_k (1 - B), equals a linear combination of the inclusion probabilities that apply under the two extreme designs, weighted by the relative Bernoulli width, λ = B/f_R, such that 0 ≤ λ ≤ 1. We have π_k = λ π_k^Ber + (1 - λ) π_k^πps, where π_k^Ber = f_R for all k (Bernoulli) and π_k^πps = A_k (Poisson πps).

The character of Pomix sampling is determined by its two parameters, B and f_R. To illustrate, we note from (2.1) and (2.3) that the inclusion probability of unit k ∈ U_R is π_k = B + (1 - B/f_R) A_k. Thus, for a unit k that is large (but not large enough to qualify for s_C = U_C), so that A_k is near unity, we have, to close approximation, π_k = B + 1 - B/f_R. By contrast, for a unit that is small, so that its value A_k is very near zero, we have, to close approximation, π_k = B, independently of the size. For example, with f_R = 10%, the following Table 1 shows how the inclusion probability π_k varies with B, where B = 0 is Poisson πps, and B = f_R = 0.10 is Bernoulli.

Table 1
Values of the Inclusion Probability π_k as a Function of the Parameter B and the Relative Size A_k, When the Fixed Expected Sampling Rate, f_R, is 0.1

    Value of B:                      0      0.03   0.05   0.07   0.10
    π_k for A_k = 0 (small unit):    0      0.03   0.05   0.07   0.10
    π_k for A_k = 1 (large unit):    1      0.73   0.55   0.37   0.10

This illustrates that for a Pomix sampling design close to Bernoulli (B near 0.10), the inclusion probabilities of large and small units alike lie near the fixed expected sampling rate, 0.10. By contrast, in a Pomix design close to Poisson πps (B near 0), a small unit is practically certain not to be in sample, and a large unit is practically sure to be in sample. The table also illustrates how Pomix sampling with an intermediate value of B will modify the inclusion probabilities: a small unit's chances to appear in the sample are decreased somewhat compared to Bernoulli, and the selection of a large unit becomes less probable than under Poisson πps.

The implications for response burden are: The total response burden on the population remains the same for all values of B; an expected total of n = n_C + n_R units are always asked to report. Compared to Poisson πps (B = 0), the fixing of a value B in the interior of the interval [0, f_R] will have the effect of shifting some of the response burden from larger onto smaller units; at the same time the precision of the estimates is increased in many cases (see sections 5 and 6).

Finally, we need to mention the second order inclusion probabilities under Pomix sampling, because they are required for the design-based variance calculation. They are simple. If π_kl denotes the probability that units k and l are both in the sample, then

    π_kl = π_k π_l    (2.4)

for k ≠ l ∈ U = U_R ∪ U_C, because the PRN's r_k are independent realizations of the Unif(0,1) random variable. For k = l, we have π_kl = π_kk = π_k. The multiplicative feature (2.4) of the π_kl greatly simplifies the design-based variance calculation. We get a simple, single-sum variance estimator, as in (4.2) below.

3. DETERMINING THE CERTAINTY PART OF THE SAMPLE

If the population is highly skewed, a set of units (the certainty part of the sample, s_C) will be sampled with probability one, and Pomix sampling can be used for randomized selection in the remaining part of the population, U_R. Several procedures could be considered for the construction of the certainty set; here we give one that is
reasonable (though not necessarily optimal) and used in the 0 < r , ^ 5 + e^(l -B). Kroger, Sarndal and Teikari: Poisson Mixture Sampling Monte Carlo simulation reported later in section 5. The certainty set is designated with the aid of the known positive size measures x^ through the following procedure in one or more steps. An expected sample size, n, is fixed for the whole sample, 5 = j ^ u 5^. In step one, compute the relative size measures A^^-j. = nx^/ J^^x^ for k£ U. Those units k, if any, for which A^,j. ^ 1 are assigned to the certainty part. They form a set denoted t/„j.; let its size be «P(i^. The procedure is then repeated to see if additional units should be assigned to the certainty part. In step two, calculate the relative size measures A k{2) («-«C(l)) xjY^x^ where the summation extends over the set ^ ~ ^c(i) - ^ ^k(2) *-1 for all ^ e [/ - f/^^,^, the procedure stops, and the final certainty part is s^ = U^,^y But if Aj^,2) ^ 1 for some units, then these are also assigned to the certainty part, and so on until a step / is reached where all intermediate relative size measures A^^^.^ are less than unity. The ultimate certainty part s^ will contain, say, n^ units, and we then have A^< 1 for all keU-s^, where ^k ^ "R-'^J'L^k ^^* "R"^" - "c ^"^ ^^ ^"™extends over U-s, 'c- 4. ESTIMATION FOLLOWING POMIX SAMPLING Although the auxiliary variable serves a useful purpose at the sampling stage, we advocate using it also at the estimation stage. To estimate the population total, Y = X^y^iconsider the generalized regression (GREG) estimator Hs'^kSkYk (4-1) where a^ = 1 /TI^ is the sampling weight and the second weight, the g-weight, is given by 'GREG 8=iHX-xyT: (2.4) applies, and all product terms of the quadratic form are zero. 
where x_k denotes the auxiliary vector value for unit k, X = Σ_U x_k, and X̂_s = Σ_s a_k x_k. The auxiliary information requirement is that the vector total X = Σ_U x_k must be known from a reliable source. The unidimensional size variable x_k used for computing the π_k in the Pomix sampling scheme can be one of the components of the auxiliary vector, or it can be a linear combination of its components. In the empirical study reported in section 5, the auxiliary vector is unidimensional and coincides with the size variable x_k. The constants c_k, specified by the user, provide a means of weighting the data, in addition to the survey weights a_k. An often used choice is c_k = 1 for all k.

A commonly used estimator of the variance of the GREG estimator (see Särndal, Swensson and Wretman 1992, Ch. 6) is given as a quadratic form in the g_k e_k, where e_k = y_k − x_k′ b̂ is the regression residual. One of the advantages of Pomix sampling is that the corresponding variance estimation is simple. This is because the PRN's r_k are independent realizations from the Unif(0,1) distribution, so the cross-product terms of the quadratic form drop out. With only the squared terms left, the variance estimator becomes simply V̂ = Σ_s a_k(a_k − 1) g_k² e_k². Finally, because a_k − 1 = 0 for all k in s_c, we get the variance estimator used in the Monte Carlo study reported later in section 5, namely,

V̂ = Σ_{s_R} a_k(a_k − 1) g_k² e_k².   (4.2)

5. A MONTE CARLO STUDY OF POMIX SAMPLING USING FINNISH DATA

To illustrate various aspects of Pomix sampling, we conducted a Monte Carlo study involving four different estimators of the population total Y. The experiment involved repeated draws of samples as well as repeated assignments of the set of N PRN's to the population units. Note that since every assignment of the N PRN's to the population units is a random outcome, a proper Monte Carlo study also requires repetitions of the PRN assignments. Therefore, after the first assignment of the N PRN's, we selected 100 Pomix samples, using a fixed value of the Bernoulli width B. (Each sample was realized using a new, randomly selected starting point on the PRN axis.) Then a new set of N PRN's was assigned, 100 additional samples were drawn, and so on, until we had reached 100 × 100 = 10,000 PRN/sample pairs, for the given value of B. For each of the 10,000 pairs, we computed the four point estimators, the corresponding four variance estimators, and the corresponding four confidence intervals. With 10,000 repetitions, we expect the Monte Carlo error to be rather small.

The four estimators used in the Monte Carlo study have the following expressions, where a_k = 1/π_k is the sampling weight of unit k, and π_k is given by (2.3):

1. The Horvitz-Thompson estimator,

Ŷ₁ = Σ_s a_k y_k = Σ_{s_c} y_k + Σ_{s_R} a_k y_k.

2. The (combined) ratio estimator,

Ŷ₂ = X b̂₂,

where X = Σ_U x_k and b̂₂ = Σ_s a_k y_k / Σ_s a_k x_k. It is a special case of (4.1), with the auxiliary vector taken as the scalar size variable x_k and a suitable choice of c_k.

3. The GREG estimator,

Ŷ₃ = Σ_{s_c} y_k + Σ_{s_R} a_k y_k + (X_R − Σ_{s_R} a_k x_k) b̂₃,

where X_R = Σ_{U_R} x_k and b̂₃ = Σ_{s_R} a_k(a_k − 1) y_k x_k / Σ_{s_R} a_k(a_k − 1) x_k². It is a special case of (4.1) such that the auxiliary vector is the scalar x_k and c_k = (a_k − 1)⁻¹.

4. The (separate) ratio estimator,

Ŷ₄ = Σ_{s_c} y_k + X_R b̂₄,

where X_R = Σ_{U_R} x_k and b̂₄ = Σ_{s_R} a_k y_k / Σ_{s_R} a_k x_k. For Poisson πps sampling (B = 0), we have b̂₄ = (Σ_{s_R} y_k/x_k)/n_R, where n_R is the random size of s_R; the corresponding estimator Ŷ₄ was considered by Brewer et al. (1972).

Now Ŷ₂ and Ŷ₄ differ in that the regression slope in Ŷ₂ is calculated on the pooled sample s, whereas in Ŷ₄ the slope is calculated separately for the randomization sample s_R. Finally, Ŷ₃ differs from Ŷ₂ and Ŷ₄ in that it uses the weighting a_k(a_k − 1), instead of just a_k. Note that all of Ŷ₂, Ŷ₃ and Ŷ₄ are members of the GREG family of estimators given by (4.1). By equating Ŷ₂, Ŷ₃ and Ŷ₄ to (4.1), we find the g-weights implied by each of the three estimators. These weights are required for the variance estimation.

We can expect the simulation to show that Ŷ₂, Ŷ₃ and Ŷ₄, which use the auxiliary variable both at the design stage and at the estimation stage, will improve on (have smaller variance than) the HT estimator Ŷ₁, which uses the auxiliary information only at the sampling stage, but the extent of the improvement is unpredictable and interesting to observe.

The randomization part s_R, of expected size n_R = 100 − 29 = 71 units, is realized, in the simulation, by repeated Pomix sample selection from U_R. A plot of (x_k, y_k) for the units k in U_R is shown in the Appendix. To see the effect of the Bernoulli width, we carried out the simulation for a range of different B-values: B = 0, 0.01, ..., 0.07, and, in addition, B = f_R = n_R/N_R = 71/971 = 0.073 (which gives Bernoulli sampling). For each value of B, 100 × 100 = 10,000 PRN/sample pairs were realized, and the results were used to calculate, for each of the four point estimators, five Monte Carlo summary statistics. These are defined as follows, if Ŷ denotes one of the four point estimators, V̂ the corresponding variance estimator obtained from (4.2), and (Ŷ − z_{1−α/2}√V̂, Ŷ + z_{1−α/2}√V̂) the corresponding confidence interval for Y at the nominal confidence level 1 − α, where z_{1−α/2} is the standard normal score, z_{1−α/2} = 1.960 for α = 5% and z_{1−α/2} = 1.645 for α = 10%:

(1) MCE Ŷ = the Monte Carlo expectation of the point estimator Ŷ, that is, the arithmetic mean of the 10,000 point estimates;

(2) MCV Ŷ = the Monte Carlo variance of the point estimator Ŷ, that is, the variance of the 10,000 point estimates.

We used a real data population for the Monte Carlo simulation. This population consists of N = 1,000 Finnish enterprises.
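As an illustration of the mechanics just described, the following sketch simulates Pomix-style sampling and checks that the Horvitz-Thompson estimator Ŷ₁ is essentially unbiased. It is a minimal sketch, not the authors' code: the small synthetic population, the sample sizes, and the specific form π_k = B + (f − B)x_k/x̄ (our reading of equation (2.3), which is not reproduced in this excerpt) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic population (assumption: the real study used 1,000 Finnish enterprises).
N = 200
x = rng.uniform(1.0, 10.0, N)          # size variable
y = 2.0 * x + rng.normal(0.0, 1.0, N)  # study variable, strongly correlated with x
Y = y.sum()                            # population total to estimate

n = 40                  # expected sample size
f = n / N               # sampling fraction
B = 0.3 * f             # Bernoulli width; the paper's all-purpose suggestion
pi = B + (f - B) * x / x.mean()        # assumed Pomix form of (2.3); sum(pi) = n
assert pi.max() < 1.0                  # no certainty units in this toy population

def pomix_sample(rng, pi):
    """Draw one Pomix sample: unit k is selected if its PRN falls in an
    interval of length pi_k placed at a random start on the [0,1) axis."""
    r = rng.uniform(0.0, 1.0, len(pi))  # PRN's (redrawn here for each pair)
    z = rng.uniform()                   # random starting point on the PRN axis
    return ((r - z) % 1.0) < pi

reps = 5000
ht = np.empty(reps)
for i in range(reps):
    s = pomix_sample(rng, pi)
    ht[i] = (y[s] / pi[s]).sum()        # Horvitz-Thompson estimator

print(f"Y = {Y:.1f}, Monte Carlo mean of HT = {ht.mean():.1f}")
```

Over many repetitions the Monte Carlo mean of the HT estimator settles near the true total Y, mirroring the near-zero bias reported for all four estimators in the study.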
For enterprise k, k = 1, ..., 1,000, y_k is the number of employees (full-time equivalents) multiplied by 10, and x_k is the wages paid by the enterprise to its employees, in thousands of FIM (Finnish marks). The auxiliary information (wages paid) comes from the Finnish tax authority's VAT register. The employment variable is the one requiring estimation. The 1,000 units were selected (in an essentially random manner) from an original larger population of Finnish enterprises. Units with a value of zero either on y_k or on x_k were eliminated so that the simulation results would not be disturbed by extraneous factors. Consequently, as for the values y_k and x_k, the population used in the simulation is a natural one, but because of the elimination of units, its features (mean, standard deviation, skewness, etc.) differ from those of the original larger population.

The population y-total to be estimated is Y = Σ_U y_k = 169,168. We fixed the expected sample size for the total sample, s = s_c ∪ s_R, as n = 100. The procedure described in section 3 was used to determine the certainty part s_c of the sample. This resulted in a certainty part s_c consisting of the largest n_c = 29 units. The rest of the population, U_R = U − s_c, has the following descriptive characteristics: its size is N_R = 1,000 − 29 = 971; the total of y is Y_R = 46,138 (which equals 27% of the entire population total Y = 169,168); the coefficient of variation (standard deviation divided by mean) is 1.78 for the variable y and 1.94 for the variable x; the coefficient of correlation between x and y is 0.965.

The remaining three summary statistics are:

(3) MCE V̂ = the Monte Carlo expectation of the variance estimator V̂, that is, the arithmetic mean of the 10,000 variance estimates;

(4) MCRTE95 = the Monte Carlo coverage rate for nominal 95% confidence intervals, that is, the number of times that the target parameter Y is contained in the confidence interval, divided by 10,000, and expressed in per cent;

(5) MCRTE90 = the Monte Carlo coverage rate for nominal 90% confidence intervals; its definition is analogous to that of MCRTE95.

The simulation results are shown in Table 2 (average sample size, Monte Carlo variance, and Monte Carlo expectation of the variance estimator) and in Table 3 (Monte Carlo coverage rates). The tables do not show MCE Ŷ, because in all cases this quantity was very close to the target parameter value Y = 169,168, confirming that all estimators are essentially unbiased. The deviation of MCE Ŷ from Y was in all cases less than 0.14%, in most cases considerably less. The average sample size over the 10,000 repetitions is seen to be very close to n = 100, as it should be.

Kroger, Sarndal and Teikari: Poisson Mixture Sampling

Table 2
Results of the Simulation Study for Different Bernoulli Widths B: Average Sample Size, MCV Ŷ and MCE V̂; est. j Refers to Estimator Ŷ_j, j = 1, ..., 4. (Values in the Last Eight Columns to be Multiplied by 10⁶.)

Bernoulli  Average       MCV Ŷ                              MCE V̂
width B    sample size   est.1   est.2  est.3  est.4       est.1   est.2  est.3  est.4
0.000       99.95         24.56   3.63   3.43   3.46        24.92   3.67   3.43   3.46
0.010      100.05         22.74   1.96   1.84   1.85        23.53   1.98   1.86   1.87
0.020      100.04         24.75   1.82   1.77   1.78        25.37   1.85   1.77   1.78
0.025      100.09         25.51   1.82   1.78   1.79        26.86   1.87   1.81   1.82
0.030      100.06         28.03   1.83   1.80   1.81        28.58   1.91   1.87   1.88
0.040       99.86         35.17   2.01   2.03   2.06        33.54   2.11   2.11   2.15
0.050      100.02         42.25   2.56   2.64   2.67        41.42   2.48   2.51   2.59
0.060       99.99         56.08   3.42   3.65   3.67        55.70   3.20   3.24   3.44
0.070      100.05         90.73   4.80   5.47   5.59        91.28   4.89   4.72   5.37
0.073      100.02        119.13   6.06   7.09   7.43       116.27   6.00   5.34   6.49

Table 3
Results of the Simulation Study for Different Bernoulli Widths B: Coverage Rates in % of Nominal 95% and 90% Confidence Intervals, MCRTE95 and MCRTE90; est. j Refers to Estimator Ŷ_j, j = 1, ..., 4

Bernoulli  nominal 95% confidence level    nominal 90% confidence level
width B    est.1  est.2  est.3  est.4      est.1  est.2  est.3  est.4
0.000      94.50  92.03  92.75  92.48      89.70  86.36  87.35  86.74
0.010      95.20  93.13  93.47  93.52      90.43  88.06  87.93  88.02
0.020      95.06  94.23  93.88  93.88      90.36  89.36  88.49  88.55
0.025      95.06  93.73  94.56  94.70      90.64  89.15  89.73  89.72
0.030      94.63  94.12  94.09  94.19      89.85  89.06  88.70  88.86
0.040      93.84  94.44  94.47  94.64      88.77  89.60  89.41  89.60
0.050      93.97  94.38  93.76  93.82      88.67  88.67  88.08  88.53
0.060      93.54  93.57  92.12  92.69      89.10  87.57  85.99  87.27
0.070      92.93  94.99  90.67  92.03      88.40  88.62  84.27  86.11
0.073      91.03  95.02  88.03  90.46      86.53  88.62  81.26  83.86

Tables 2 and 3 generate the following comments:

1. Let us begin the examination of Table 2 with a comparison of Monte Carlo variances across estimators, for a fixed Bernoulli width B. This shows that, for every value of B, there is little to choose between Ŷ₂, Ŷ₃ and Ŷ₄ in terms of variance. By contrast, the HT estimator Ŷ₁ has considerably greater variance. To illustrate, the ratio MCV Ŷ₃/MCV Ŷ₁ equals 3.43/24.56 = 0.140 for B = 0 (Poisson πps), 1.80/28.03 = 0.064 for B = 0.03, and 7.09/119.13 = 0.060 for B = 0.073 (Bernoulli). This confirms that the HT estimator is a poor choice compared to an alternative that uses the strongly correlated auxiliary variable. This is true, not surprisingly, for Bernoulli, but also for B-values near the lower end of the [0, f_R] interval, which shows that the sampling design alone does not extract all the power of the auxiliary variable, even though with B near zero we are close to a strict πps selection (thus supposedly highly efficient). Part of the reason that the HT estimator has a comparatively large variance is that the randomness of the sample size under Pomix sampling penalizes the HT estimator (but not the GREG estimators). Since the HT estimator is inefficient, we do not discuss it further.

2. Examining the small differences between Ŷ₂, Ŷ₃ and Ŷ₄, we note in Table 2: As measured by the Monte Carlo variance, Ŷ₃ is better than Ŷ₄ for all Bernoulli widths B, but only marginally so.
Also, Ŷ₃ and Ŷ₄ are better than Ŷ₂ at the lower end of the range of B-values, possibly because in Ŷ₂ we allow the certainty part of the sample to contribute to the slope estimate, somewhat inappropriately, since there is an estimation problem only for the randomization part. But at the upper end, the relation is reversed, and for the upper extreme B = 0.073 (Bernoulli), Ŷ₂ is clearly better than Ŷ₃. That the differences between Ŷ₂, Ŷ₃ and Ŷ₄ are so small is not surprising, because all are varieties of the GREG estimator (4.1) using essentially the same auxiliary information.

3. Table 2 confirms that the proposed variance estimator V̂ works well, as we would expect; MCE V̂ is with few exceptions very close to the target that V̂ aims at estimating, that is, the variance of Ŷ, measured here by MCV Ŷ. This holds for all estimators and all values of B, with a few notable exceptions, namely, in the case of Ŷ₃ and Ŷ₄ when B is close to the upper extreme (Bernoulli). There the variance estimator underestimates the variance.

4. The most interesting result in Table 2 we consider to be the fact that the variance of Ŷ₂, Ŷ₃ or Ŷ₄, when viewed as a function of the Bernoulli width B, does not attain its minimum at B = 0 (Poisson πps), as one might have initially guessed, but rather for a value of B somewhere between 0.02 and 0.03. Moreover, the improvement of the case B = 0.02 over the case B = 0 is substantial for all of Ŷ₂, Ŷ₃ and Ŷ₄. Measuring this improvement by MCV(Ŷ | B = 0.02) divided by MCV(Ŷ | B = 0), we find that this ratio is only about 50% for all of Ŷ₂, Ŷ₃ and Ŷ₄. More precisely, for Ŷ₂ the ratio is 1.82/3.63 = 0.501, for Ŷ₃ it is 1.77/3.43 = 0.516, and for Ŷ₄ it is 1.78/3.46 = 0.514. In view of these results, we added a simulation for B = 0.025, a value not examined in the original round of simulations. The results, also displayed in Tables 2 and 3, confirm that a minimum variance is obtained, for all three estimators, Ŷ₂,
Ŷ₃ and Ŷ₄, at a point in the vicinity of B = 0.025. One possible explanation of why it is considerably better to take B to be a value distinctly greater than B = 0 (which gives Poisson πps) is the following: When B is 0 (or very near 0), the units with the smallest x-values will, when selected, have unduly large weights, which induces high variability. This is avoided by choosing B clearly away from zero.

5. The Monte Carlo results in Table 3 concerning the coverage rates show that the variance estimation and the confidence interval procedure function to satisfaction. As theory leads us to expect, MCRTE95 and MCRTE90 are close, for all four estimators, to their theoretical values, 95% and 90%, respectively. Only for Ŷ₃ and Ŷ₄, when B gets close to the upper extreme (Bernoulli), do we notice any marked tendency for the MCRTE to drop below the nominal value, resulting in part from the underestimation of variance mentioned earlier.

6. FURTHER EVIDENCE THAT POMIX SAMPLING IS MORE EFFICIENT THAN POISSON πps SAMPLING

Initially we had no strong reason to believe that Pomix sampling combined with a GREG estimator would be more efficient for some Bernoulli widths B in the interior of [0, f_R] than for Poisson πps (B = 0). The strong improvement, a variance reduction of around 50% for our particular population, was rather surprising. For other populations, the variance reduction can be more or less than the 50% we found. Because our finding is data dependent, it is desirable to provide some more general evidence in support of the proposition that Pomix sampling with a B-value well into the interior of [0, f_R] is better than Poisson πps sampling (B = 0). We now present some evidence of this kind.

We examined the Taylor variance of Ŷ (that is, the variance of the Taylor linearized statistic). It is given by (see Särndal et al. 1992, Ch. 6)

V_Tay Ŷ = Σ_U (a_k − 1) E_k²,

where E_k is the population analogue of the sample-based residual e_k used in the variance estimator (4.2).
For example, for the estimator Ŷ₃, the residual in question is E_k = y_k − B₃ x_k, with B₃ = Σ_{U_R} (a_k − 1) y_k x_k / Σ_{U_R} (a_k − 1) x_k²; for Ŷ₄, B₄ = Σ_{U_R} y_k / Σ_{U_R} x_k replaces B₃. It is reasonable to model the squared residual as E_k² = σ² x_k^p (1 + δ_k), where p satisfies 0 ≤ p ≤ 2, and δ_k is near zero. This corresponds to assuming a superpopulation model y_k = x_k β + ε_k, where the ε_k are independent errors with model expected value zero for every k, and ε_k has model variance σ² x_k^p. Using the approximation E_k² ≈ σ² x_k^p, and the fact that a_k = 1/π_k with π_k given by (2.3), we have

V_Tay Ŷ ≈ σ² Σ_{U_R} (a_k − 1) x_k^p = σ² {H(B, p) − T(p)},

where

H(B, p) = x̄_{U_R} Σ_{U_R} x_k^p [B x̄_{U_R} + (f_R − B) x_k]⁻¹  and  T(p) = Σ_{U_R} x_k^p.

Now consider a fixed value of p such that 0 ≤ p ≤ 2. We want to find out if H(B, p) has a smaller value for some B in the interior of the interval [0, f_R], compared to its value at B = 0, which is H(0, p). To this end, let us examine whether the derivative H′(B, p) = ∂H(B, p)/∂B is negative at B = 0. We find

H′(B, p) = x̄_{U_R} Σ_{U_R} x_k^p (x_k − x̄_{U_R}) [B x̄_{U_R} + (f_R − B) x_k]⁻².

Its value at B = 0 is

H′(0, p) = (x̄_{U_R}/f_R²) Σ_{U_R} x_k^{p−2} (x_k − x̄_{U_R}).

The sign of H′(0, p) is the same as that of Σ_{U_R} x_k^{p−2} (x_k − x̄_{U_R}). But this quantity equals, apart from the factor 1/(N_R − 1), the covariance in U_R between x_k^{p−2} and x_k − x̄_{U_R} (note that x_k − x̄_{U_R} has zero mean). When p satisfies 0 ≤ p < 2, this covariance is negative: when x_k increases, x_k − x̄_{U_R} increases steadily, and x_k^{p−2} decreases steadily (and remains always positive). The sign of H′(0, p) is therefore negative; consequently, it is not at B = 0 that H(B, p) attains its minimum value, but at some B in the interior of [0, f_R]. For p = 2, on the other hand, H′(0, p) = 0, and H(B, p) has a minimum at B = 0. These considerations raise the question whether the population used for our simulation in section 5 corresponds to a value of p in [0, 2] that is distinctly less than 2, so that we can expect significant gains from Pomix sampling.
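The sign argument above is easy to check numerically. The sketch below evaluates H(B, p) on a grid for a synthetic size variable (an assumption; the paper's x-values are not reproduced here) and confirms that for p < 2 the minimum lies in the interior of [0, f_R], while for p = 2 it sits at B = 0.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1.0, 10.0, 500)   # synthetic size variable for U_R (assumption)
f_R = 0.2                          # sampling fraction for the randomization part
xbar = x.mean()

def H(B, p):
    # H(B, p) = xbar * sum_k x_k^p / (B*xbar + (f_R - B)*x_k), as derived above
    return xbar * np.sum(x**p / (B * xbar + (f_R - B) * x))

grid = np.linspace(0.0, f_R, 201)
for p in (0.0, 1.0, 1.45, 2.0):
    vals = np.array([H(B, p) for B in grid])
    B_min = grid[vals.argmin()]
    print(f"p = {p:4.2f}: argmin of H(B, p) over [0, f_R] is B = {B_min:.4f}")
```

For p well below 2 the minimizing B lands clearly inside the interval, in line with the interior optimum (near 0.3 f_R) observed in the simulation study.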
To obtain an answer, we estimated p by fitting the logarithmic version of the model E_k² = σ² x_k^p (1 + δ_k) to the data available for U_R = U − s_c. That is, we fitted w_k = α + p z_k, where w_k = log E_k²; E_k = y_k − B₄ x_k with B₄ = Σ_{U_R} y_k / Σ_{U_R} x_k; z_k = log x_k; and α is an intercept term. We obtained the value p̂ = 1.45, by treating p as a linear regression slope estimated as p̂ = Σ_{U_R} (w_k − w̄_{U_R})(z_k − z̄_{U_R}) / Σ_{U_R} (z_k − z̄_{U_R})². Since this p-value is considerably less than 2, our Monte Carlo population is indeed one where one can expect significant gains from the use of Pomix sampling with a value of B in the interior of [0, f_R].

7. CONCLUDING DISCUSSION AND TENTATIVE RECOMMENDATIONS

The survey sampler will ask: If I consider using Pomix sampling for my survey, combined with a GREG estimator, what is an appropriate choice for B? Recall that in this paper we found, for one particular population, that a large efficiency gain (roughly a 50% variance reduction compared to Poisson πps) is realized by fixing the Pomix parameter B at around 30% of f_R. We were led to suspect that the variance gain is related to residual error characteristics, and this was confirmed in section 6, which presented evidence that when the squared residual pattern conforms to E_k² = σ² x_k^p, where p satisfies 0 ≤ p < 2, as is the case in many business survey populations, then Pomix sampling with B in the interior of the interval [0, f_R] may well be advantageous. However, the present paper does not address the question of the optimal choice of B. A difficulty is that in practice the value of B must be fixed at the design stage, and that the optimal B depends on unknown population characteristics. Prior knowledge of the population, notably about its residual variance structure, can guide the choice of B. Our tentative recommendations based on this paper are: If prior information suggests a squared residual pattern conforming to σ² x_k^p with p < 2, then use Pomix sampling with B = 0.3 f_R. On the other hand, if in reality the unknown p is such that p > 2, then, although the best choice in this case might be B = 0 (Poisson πps), little harm would probably be done by using B = 0.3 f_R, because the variance, viewed as a function of B, is likely to increase at a gentle rate. Therefore, B = 0.3 f_R seems a reasonable all-purpose suggestion. These recommendations are tentative; the question merits further study that lies beyond the scope of this paper.

ACKNOWLEDGEMENTS

The authors gratefully acknowledge the work of a referee, whose suggestions led to valuable improvements in the original manuscript.

APPENDIX
SCATTER PLOT OF X AND Y

Figure 3. Scatter plot of x (yearly wages, in 1000 FIM) against y (employment) for the portion U_R of the Monte Carlo population of 1,000 Finnish enterprises.

REFERENCES

ATMER, J., THULIN, G., and BACKLUND, S. (1975). Coordination of samples with the JALES technique. Statistisk Tidskrift, 13, 443-450.

BREWER, K.R.W., EARLY, L.J., and JOYCE, S.F. (1972). Selecting several samples from a single population. Australian Journal of Statistics, 14, 231-239.

FAN, C.T., MULLER, M.E., and REZUCHA, I. (1962). Development of sampling plans by using sequential (item by item) techniques and digital computers. Journal of the American Statistical Association, 57, 387-402.

OHLSSON, E. (1995). Coordination of samples using permanent random numbers. In Business Survey Methods (Eds. B.G. Cox, D.A. Binder, B.N. Chinnappa, A. Christianson, M.J. Colledge, and P.S. Kott). New York: Wiley, 153-169.

SARNDAL, C.-E., SWENSSON, B., and WRETMAN, J. (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.

SUNTER, A.B. (1977). Response burden, sample rotation, and classification renewal in economic surveys. International Statistical Review, 45, 209-222.
Survey Methodology, June 1999, Vol. 25, No. 1, pp. 13-29
Statistics Canada

Toward Variances for X-11 Seasonal Adjustments

WILLIAM R. BELL and MATTHEW KRAMER

ABSTRACT

We develop an approach to estimating variances for X-11 seasonal adjustments that recognizes the effects of sampling error and errors from forecast extension. In our approach, seasonal adjustment error in the central values of a sufficiently long series results only from the effect of the X-11 filtering on the sampling errors. Towards either end of the series, we also recognize the contribution to seasonal adjustment error from forecast and backcast errors. We extend the approach to produce variances of errors in X-11 trend estimates, and to recognize error in the estimation of regression coefficients used to model, e.g., calendar effects. In empirical results, the contribution of sampling error often dominated the seasonal adjustment variances. Trend estimate variances, however, showed large increases at the ends of series due to the effects of fore/backcast error. Nonstationarities in the sampling errors produced striking patterns in the seasonal adjustment and trend estimate variances.

KEY WORDS: Sampling error; Forecast error; Trading-day; ARIMA model.

1. INTRODUCTION

The problem of how to obtain variances for seasonally adjusted data is long-standing (President's Committee to Appraise Employment and Unemployment Statistics 1962). Model-based methods of seasonal adjustment (see Bell and Hillmer 1984 for a discussion) use results from signal extraction theory to produce estimates and associated error variances of the seasonal and nonseasonal components. Most official seasonal adjustments, however, are made using empirical methods, most notably X-11 (Shishkin, Young and Musgrave 1967) or X-11-ARIMA (Dagum 1975). These methods are based on fixed filters, not models, and so it is not obvious how to calculate variances of the seasonal adjustment errors.
Various approaches for obtaining variances for X-11 seasonal adjustments have been proposed, as summarized below. Wolter and Monsour (1981) suggested two approaches. They recognized that many time series that are seasonally adjusted are estimates from repeated sample surveys, and thus are subject to sampling error. Their first approach accounts only for the effect of sampling error on the variance associated with seasonal adjustments. Their second approach tries to also reflect uncertainty due to stochastic time series variation in the seasonal adjustment variances. However, this second approach assumes that, apart from regression terms, the time series is stationary. This type of model is now seldom used for seasonal time series. Also, their second approach contains a conceptual error: it produces the variance of the seasonally adjusted estimate, instead of the desired variance of the error in the seasonally adjusted estimate. Burridge and Wallis (1985) investigated use of the steady-state Kalman filter for calculation of model-based seasonal adjustment variances, and applied this approach to a model they obtained previously (Burridge and Wallis 1984) for approximating the X-11 filters. They suggested that this approach could be used to "provide measures of the variability of the X-11 method when it is applied to data for which it is optimal" (p. 551), but cautioned against doing this when the X-11 filter would be suboptimal (i.e., very different from the optimal model-based filter). Hausman and Watson (1985) suggested an approach to estimating the mean squared error for X-11 when it is used in suboptimal situations. Bell and Hillmer (1984, section 4.3.4) pointed out a problem with the use of model-based approximations to X-11 for calculating seasonal adjustment variances. The problem is that X-11 filters (or any seasonal adjustment filter, for that matter) are not sufficient to uniquely determine models for the observed series and its components.
Pfeffermann (1994) developed an approach that recognizes the contributions of sampling error and irregular variation (time series variation in the irregular component) to X-11 seasonal adjustment variances. The properties of the combined error (sampling error plus irregular) are estimated using the X-11 estimated irregular. These properties are then used to estimate two types of seasonal adjustment variances. A drawback to this approach is that it relies on an assumption that the X-11 adjustment filter annihilates the seasonal component and reproduces the trend component. (Note Pfeffermann (1994, p. 90), discussion surrounding equation (2.7).) Violations of this assumption in practice compromise the approach to an extent which appears difficult to assess. Thus, this assumption seems to us highly questionable and also, in any particular case, uncheckable. A second drawback is that one of the variance types proposed by Pfeffermann assumes that the X-11 seasonally adjusted series, rather than the trend estimate, is taken as an estimate of the trend. Breidt (1992) and Pfeffermann, Morry and Wong (1993) further develop Pfeffermann's general approach.

William R. Bell and Matthew Kramer, Statistical Research Division, Room 3000-4, U.S. Bureau of the Census, Washington, D.C. 20233-9100, U.S.A.

The goal of this paper is the development and application of an approach to obtaining variances for X-11 seasonal adjustments accounting for two sources of error. The first error source is sampling error. The second is error that arises from the need to extend the time series with forecasts and backcasts before applying the symmetric X-11 filters. These latter errors lead to seasonal adjustment revisions (Pierce 1980). Note that revisions eventually vanish as sufficient data beyond the time point being adjusted become available.
Also note that a seasonally adjusted series will not contain sampling error if the corresponding unadjusted series does not. This is the case for certain economic time series, e.g., export and import statistics for most countries. Our approach assumes that the X-11 seasonal adjustment target (what we assume application of X-11 is intended to estimate) is what would result from application of the symmetric linear X-11 filter (with no forecast and backcast extension required) if the series contained no sampling error. While this definition of target might be criticized for ignoring time series variation in the underlying seasonal and nonseasonal components, we think it may be appropriate for typical users of X-11 seasonally adjusted data. Such users are most likely to be concerned about uncertainty reflected in differences between initial adjustments and final adjustments, i.e., in revisions. Some of these users will also be aware that the unadjusted series consists of sample-based estimates of the true underlying population quantities, and will realize that the effects of sampling error on adjustments should also be reflected in seasonal adjustment variances. Our development is based on use of the symmetric linear X-11 filters. We assume that the symmetric filters are applied to the series extended with minimum mean squared error forecasts and backcasts. In practice, the forecasts and backcasts are obtained from a fitted time series model. This is in the spirit of the X-11-ARIMA method of Dagum (1975), but with full forecast and backcast extension, as recommended by Geweke (1978), Pierce (1980), and Bobbitt and Otto (1990). Our results apply directly to the use of additive or log-additive X-11 (with forecast and backcast extension), and the log-additive results are assumed to apply approximately (Young 1968) to multiplicative X-11. Section 2 of this paper develops our approach, which builds on the first approach of Wolter and Monsour (1981).
The differences between the two approaches are discussed in section 2.4. Section 3 then discusses three extensions to the results of section 2. The first is to note that our approach works equally well with seasonal, trend, or irregular estimates, and that more generality is easily accommodated by allowing different filter choices for different months. The second extension produces variances of estimates of month-to-month or year-to-year change. Finally, when seasonal adjustment involves estimation of regression effects (e.g., for trading-day or holiday variation), the results are extended to allow for additional variance due to error in estimating the regression parameters. Section 4 then presents several examples illustrating the basic approach and the extensions given in section 3. One thing evident from the examples is that, for time series with sampling error, our seasonal adjustment variances will often be dominated by the contribution of the sampling error. In the center of the series, our results effectively reduce to the first-approach results of Wolter and Monsour. Our results do differ from those of Wolter and Monsour near the end of the series. This is important since the most recent seasonally adjusted values receive the most scrutiny. Also, the contribution of forecast and backcast error to trend estimate variances can be very large at the ends of a time series. Other results of particular interest are the effects of certain nonstationarities in the sampling errors. The examples of section 4 show that nonstationarities such as sampling error variances that change over time, or periodic independent redrawings of the sample, can yield striking changes in the pattern of the variances of seasonally adjusted data or trend estimates over time. Section 5 provides concluding remarks.

2. METHODOLOGY

Define the observed unadjusted time series as y_t for t = 1, ..., n.
Time series that are seasonally adjusted are often estimates obtained from repeated (monthly or quarterly) sample surveys, and thus can be viewed as composed of a true underlying time series Y_t and a series of sampling errors e_t assumed uncorrelated with Y_t. (See Bell and Hillmer 1990.) In vector notation, y_o = Y_o + e_o, where the subscript o indicates that the time span of these vectors is the set of observed time points 1, ..., n. In certain cases y_t may arise from repeated censuses (as is typically the case for national export and import statistics, for example), in which case there is no sampling error, i.e., e_o = 0. The development that follows assumes that both Y_t and e_t follow known time series models. The model for Y_t will generally involve differencing, as in ARIMA (autoregressive-integrated-moving average) and ARIMA component (structural) models. The model for Y_t may be extended to include regression terms. (This will be considered in section 3.3.) The series e_t is assumed to not require differencing, but it may nonetheless exhibit certain nonstationarities, such as variances that change over time. Any such nonstationarities are assumed to be accounted for in the model for e_t. In practice, the models will be developed from observed data, as is discussed by, e.g., Bell and Hillmer (1990, submitted), Binder and Dick (1989, 1990), and Tiller (1992).

In applying a symmetric X-11 filter of length 2m + 1 for seasonal adjustment with full forecast and backcast extension, the vector y_o needs to be augmented by m backcasts and m forecasts. The vector holding the m values of y_t prior to the observed data, and the corresponding m × 1 vectors for Y_t and e_t, are denoted y_b, Y_b, and e_b. The analogous vectors of the m future values of y_t, Y_t, and e_t are denoted y_f, Y_f, and e_f. Thus,

(y_b, y_o, y_f) = (Y_b, Y_o, Y_f) + (e_b, e_o, e_f).   (2.1)

The full vectors in (2.1), hereafter denoted y, Y, and e, have length n + 2m. The backcasts and forecasts used to augment y_o are assumed to be minimum mean squared error (MMSE) linear predictions of y_b and y_f (using y_o) obtained from the known time series model. (In practice, the model will be fitted to the data y_o.) Under normality, the backcasts and forecasts are E(y_b | y_o) and E(y_f | y_o). The vector of observed data augmented with the backcasts and forecasts is denoted ŷ = (ŷ_b, y_o, ŷ_f), where ŷ_b = E(y_b | y_o) and ŷ_f = E(y_f | y_o). To simplify notation, from now on we will take expressions such as (y_b, y_o, y_f) to mean the column vector (y_b′, y_o′, y_f′)′.

Let the linear symmetric X-11 seasonal adjustment filter be written ω(B) = Σ_{j=−m}^{m} ω_j B^j, where B is the backshift operator and the ω_j are the filter weights (ω_j = ω_{−j}). Calculation of the ω_j is discussed by Young (1968) and Wallis (1982). Results of Bell and Monsell (1992) were used here. Application of ω(B) to the forecast- and backcast-extended series can be written as Ωŷ, where Ω is a matrix of dimension n × (n + 2m). Each row of Ω contains the filter weights (ω_{−m}, ..., ω_0, ..., ω_m), preceded and followed by the appropriate number of zeroes such that the center weight of the X-11 filter (ω_0) multiplies the observation being adjusted. Thus, in the first row of Ω there are no preceding zeroes and n − 1 trailing zeroes, in the second row there is one preceding zero and n − 2 trailing zeroes, etc. For the default X-11 filter, m = 84. Choice of alternative seasonal or trend moving averages in X-11 changes the value of m from a low of 70 to a high of 149. The question arises as to what Ωŷ is estimating.
As noted in the introduction, we define the "target" of the seasonal adjustment as the adjusted series that would result if there were no sampling error and there were sufficient data before and after all time points of interest for the symmetric filter to be applied. The target is thus ω(B)Y_t, or in vector notation ΩY, and the seasonal adjustment error vector is v = Ω(Y − ŷ). We are interested in the variance-covariance matrix var(v) = Ω var(Y − ŷ) Ω′. This can easily be computed once var(Y − ŷ) is obtained. From here through section 2.3 we discuss the calculation of var(Y − ŷ). We start by writing Y − ŷ = (y − e) − ŷ = (b, 0, f) − e, where b = y_b − ŷ_b is the m × 1 vector of backcast errors, and f = y_f − ŷ_f is the m × 1 vector of forecast errors. Given the models for Y_t and e_t, we calculate var(Y − ŷ) by separately computing var(e), var(b, 0, f), and cov[(b, 0, f), e], as discussed in sections 2.1 to 2.3. Then var(Y − ŷ) easily follows as var(b, 0, f) + var(e) − cov[(b, 0, f), e] − cov[(b, 0, f), e]′. Thus,

var(v) = Ω {var(b, 0, f) + var(e) − cov[(b, 0, f), e] − cov[(b, 0, f), e]′} Ω′.

Example: U.S. 5+ Unit Housing Starts. As the computations for each piece of var(Y − ŷ) are explained, we illustrate the results graphically for an example series: housing starts in the U.S. for buildings of five or more units, from January 1975 through November 1988 (167 observations). The original series, seasonally adjusted series, and estimated trend are shown in Figure 1. In practice, seasonal adjustment of this series at the Census Bureau uses a multiplicative decomposition with a 3 × 9 seasonal moving average and a 13-term Henderson trend filter. The following model for this series was developed in Bell and Hillmer (submitted):

y_t = Y_t + e_t
(1 − B)(1 − B¹²) Y_t = (1 − 0.67B + 0.36B²)(1 − 0.8753B¹²) a_t,  σ_a² = 0.0191
e_t = (1 − 0.11B − 0.10B²) b_t,  σ_b² = 0.00714   (2.2)

Figure 1. U.S. Housing Starts with Five or More Units. The top panel gives the original series from January 1975 through November 1988. The strong seasonality of the series is apparent from the yearly dips that typically occur during the winter months.
The bottom panel gives the X-11 seasonally adjusted series (solid line) and trend estimate (dotted line) for the same period. The seasonal adjustment is multiplicative, using a 3 × 9 seasonal moving average and 13-term Henderson trend filter.

16 Bell and Kramer: Toward Variances for X-11 Seasonal Adjustments

Here, y_t denotes the logarithms of the original time series (e^{y_t}), so that (2.2) implies a multiplicative decomposition for the original series (e^{y_t} = e^{Y_t} e^{e_t}).

2.1 Computation of Var(e)

If e_t follows a stationary ARMA model, then var(e) can be computed from standard results, e.g., McLeod (1975, 1977), Wilson (1979). If var(e_t) changes over time, we write e_t = h_t ẽ_t, where h_t² = var(e_t), and ẽ_t has variance one and the same autocorrelation function as e_t. (See Bell and Hillmer submitted.) Then, writing e = Hẽ, where H = diag(h_{1−m}, ..., h_{n+m}), we have var(e) = H var(ẽ)H′. Var(ẽ) is the autocorrelation matrix of e, and it can be computed as just noted using the model for ẽ_t. If the sample is independently redrawn at certain times, then var(e) will be block diagonal, with blocks corresponding to the time points when each distinct sample is in effect. Each diagonal block of var(e) can be computed as just discussed. These two types of nonstationarity in e (variance changing over time and "covariance breaks" due to independent redrawings of the sample) are those that arise in the examples of section 4.

Example - U.S. 5+ Unit Housing Starts (continued). Autocovariances for the MA(2) model for e_t given in (2.2) are easily computed. The resulting var(e) is a band matrix, with var(e_t) = 0.007298 on the diagonal, cov(e_t, e_{t−1}) = −0.000707 on the first sub- and super-diagonals, and cov(e_t, e_{t−2}) = −0.000714 on the second sub- and super-diagonals. The rest of var(e) is zero.
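The banded var(e) quoted above can be reproduced directly from the MA(2) form of e_t. The sketch below is ours, not the authors' software; it assumes the sign convention e_t = b_t − θ_1 b_{t−1} − θ_2 b_{t−2} with θ_1 = 0.11 and θ_2 = 0.10, and takes σ_b² = 0.00714, the innovation variance implied by the quoted diagonal value 0.007298.

```python
import numpy as np

def ma2_autocov(theta1, theta2, sigma2):
    """Autocovariances (lags 0, 1, 2) of e_t = b_t - theta1*b_{t-1} - theta2*b_{t-2},
    with b_t white noise of variance sigma2; lags beyond 2 are zero."""
    g0 = (1 + theta1**2 + theta2**2) * sigma2
    g1 = (-theta1 + theta1 * theta2) * sigma2
    g2 = -theta2 * sigma2
    return g0, g1, g2

def var_e(theta1, theta2, sigma2, size):
    """Band matrix var(e) for the MA(2) sampling error model."""
    g0, g1, g2 = ma2_autocov(theta1, theta2, sigma2)
    V = g0 * np.eye(size)
    V += g1 * (np.eye(size, k=1) + np.eye(size, k=-1))
    V += g2 * (np.eye(size, k=2) + np.eye(size, k=-2))
    return V
```

Running `ma2_autocov(0.11, 0.10, 0.00714)` reproduces the three quoted autocovariances to the precision printed in the text.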
Following pre- and post-multiplication by the seasonal adjustment filter matrices Ω and Ω′, the contribution of the sampling error to the variance of the seasonally adjusted series is constant for each observation (Figure 2). This occurs because the result of a time-invariant linear filter applied to a stationary series (ω(B)e_t) is a stationary series, which has a constant variance.

Figure 2. U.S. 5+ Units Housing Starts: Variance Decomposition after X-11 Seasonal Adjustment. The top panel gives the variance of the seasonally adjusted series as the total of the three components. The second panel gives the contribution of sampling error (e), which is the largest component and constant across the series. The third panel gives the contribution of back/forecast error (b 0 f), which is zero in the middle of the series, where no back/forecasts are needed, but increases towards either end of the series as more back/forecasts are used. The bottom panel is the sum of the two covariance terms (cov(e, (b 0 f)) + cov((b 0 f), e)), which tend to offset the contribution from back/forecast error.

Survey Methodology, June 1999

2.2 Computation of Var(b, 0, f)

The central n rows and n columns of var(b, 0, f) are all zeroes. We require computation of var(b), var(f), and cov(b, f) for the corner blocks of var(b, 0, f). Although computation of variances of forecast (or backcast) errors for given models is standard in time series analysis, it is complicated here by the component representation of y_t as Y_t + e_t, and by differencing in the model for Y_t. Although computations for such models are often handled by the Kalman filter (Bell and Hillmer submitted; Binder and Dick 1989, 1990; Tiller 1992), this is inconvenient here since we require covariances of all distinct pairs of random variables from among the m forecast and m backcast errors. We instead use a direct matrix approach due to Bell and Hillmer (1988).

Assume that the differencing operator required to render Y_t stationary is δ(B), which is of degree d. Since e_t is assumed not to require differencing, δ(B) is also the differencing operator required by y_t. Define δ(B)y_t = w_t; thus w_t = δ(B)Y_t + δ(B)e_t. We introduce the matrix Δ, corresponding to δ(B), defined such that Δy = w is the vector of differenced y. The vector w = (w_b, w_o, w_f), which is of length n + 2m − d, is partitioned so that w_b and w_f are m × 1 vectors, and w_o is the (n − d) × 1 vector of differenced observed data. Thus, Δ has dimensions (n + 2m − d) × (n + 2m). Note that, because d observations are lost in differencing, w_b and w_o start d time points later than y_b and y_o, respectively. That is, y_b and y_o start at time points 1 − m and 1, but w_b and w_o start at time points 1 − m + d and d + 1.

Define u_t = δ(B)Y_t and u = (u_{1−m+d}, ..., u_{n+m})′ = ΔY. The time series u_t is stationary. Since w = u + Δe, with u and e uncorrelated with each other, var(w) = var(u) + Δ var(e)Δ′. We partition var(w) as

var(w) = [ Σ_11  Σ_12  Σ_13
           Σ_21  Σ_22  Σ_23
           Σ_31  Σ_32  Σ_33 ]

where Σ_11 is var(w_b), Σ_12 is cov(w_b, w_o), etc. Since y_t, when differenced to w_t using δ(B), has lost d data values, y cannot be obtained from w without also knowing a sequence of d "starting values". Consider obtaining y_f from w_f and starting values y_* = (y_{n+1−d}, ..., y_n)′. Theorem 1 in Bell (1984a) can be used to show that

y_f = A y_* + C w_f   (2.3)

for matrices A and C determined by δ(B). The rows of the m × m matrix C consist of the coefficients of ξ(B) = 1 + ξ_1 B + ξ_2 B² + ... = δ(B)⁻¹ in the form

C = [ 1        0        0    ...  0
      ξ_1      1        0    ...  0
      ξ_2      ξ_1      1    ...  0
      ...
      ξ_{m−1}  ξ_{m−2}  ...  ξ_1  1 ].

A is an m × d matrix which accounts for the effect of the d starting values in y_* on y_f. The exact form of A is given in Bell (1984a) and, since it will exactly cancel in our application, it will not be given here.
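The ξ_j weights and the matrix C of (2.3) are straightforward to generate: since δ(B)ξ(B) = 1, the ξ_j satisfy ξ_0 = 1 and ξ_j = −(d_1 ξ_{j−1} + ... + d_d ξ_{j−d}), where δ(B) = 1 + d_1 B + ... + d_d B^d. A small sketch of our own (the helper names are hypothetical):

```python
import numpy as np

def xi_weights(delta, m):
    """First m coefficients of xi(B) = delta(B)^{-1}, where
    delta = [1, d1, ..., dd] holds the coefficients of delta(B)."""
    xi = [1.0]
    for j in range(1, m):
        xi.append(-sum(delta[i] * xi[j - i]
                       for i in range(1, min(j, len(delta) - 1) + 1)))
    return np.array(xi)

def C_matrix(delta, m):
    """Lower-triangular Toeplitz matrix C of (2.3), built from the xi weights."""
    xi = xi_weights(delta, m)
    C = np.zeros((m, m))
    for r in range(m):
        C[r, :r + 1] = xi[r::-1]  # row r: xi_r, xi_{r-1}, ..., xi_0 = 1
    return C
```

For δ(B) = 1 − B the ξ_j are all one, and for δ(B) = (1 − B)² they grow linearly (ξ_j = j + 1), reflecting the accumulation of differencing.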
In (2.3) y_* is known since it is part of y_o, the observed data. Thus, from (2.3) the MMSE forecast of y_f is ŷ_f = A y_* + C ŵ_f, where ŵ_f is the MMSE forecast of w_f. Therefore, f = y_f − ŷ_f = A y_* + C w_f − (A y_* + C ŵ_f) = C(w_f − ŵ_f), and var(f) = C var(w_f − ŵ_f)C′. Under Assumption A of Bell (1984a), which leads to the standard results for forecasting nonstationary series (as in, e.g., Box and Jenkins 1976, Chapter 5), ŵ_f = Σ_32 Σ_22⁻¹ w_o. Note that this uses only the differenced data w_o in forecasting y_f. Then, from standard results on linear prediction, var(w_f − ŵ_f) = Σ_33 − Σ_32 Σ_22⁻¹ Σ_32′. Thus, var(f) = C(Σ_33 − Σ_32 Σ_22⁻¹ Σ_32′)C′. To obtain var(b) and cov(b, f) we note that results obtained by Bell (1984a, p. 651) imply similar calculations hold for the backcast errors b. In fact, it can be shown that b = (−1)^r C′(w_b − ŵ_b), where ŵ_b is the MMSE backcast of w_b, and r is the number of times (1 − B) appears in the polynomial δ(B). (The appearance of C′ in this expression instead of C stems from the indexing of w_b and ŵ_b forward through time although the backcasting process proceeds backwards through time.) Thus, var(b) = C′ var(w_b − ŵ_b)C = C′(Σ_11 − Σ_12 Σ_22⁻¹ Σ_21)C. Similarly, cov(f, b) = (−1)^r C(Σ_31 − Σ_32 Σ_22⁻¹ Σ_21)C. In practice, to avoid inverting Σ_22, var(f), var(b), and cov(f, b) can be computed using the Cholesky decomposition of Σ_22. (See Appendix A.)

Example - U.S. 5+ Unit Housing Starts (continued). The contribution to seasonal adjustment variance from var(b, 0, f) is shown in Figure 2. This is zero or essentially zero for observations in the middle of the series, where no or few fore/backcasts need be made to apply the symmetric adjustment filter. Towards the ends of the series, the contribution of fore/backcast error becomes more substantial since an increasing number of observations need to be fore/backcast to apply the filter.
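The Cholesky device just mentioned (detailed in Appendix A) can be sketched as follows; this is our own illustration, not the appendix code. Writing Σ_22 = LL′ with L lower triangular, Σ_32 Σ_22⁻¹ Σ_32′ = X′X with X = L⁻¹ Σ_32′, so no explicit inverse is formed.

```python
import numpy as np

def forecast_error_cov(C, S22, S32, S33):
    """var(f) = C (Sigma_33 - Sigma_32 Sigma_22^{-1} Sigma_32') C',
    computed via the Cholesky factor of Sigma_22 instead of an inverse."""
    L = np.linalg.cholesky(S22)      # Sigma_22 = L L'
    X = np.linalg.solve(L, S32.T)    # X = L^{-1} Sigma_32'
    return C @ (S33 - X.T @ X) @ C.T
```

The triangular solve is both cheaper and numerically more stable than forming Σ_22⁻¹; the same factor L can be reused for var(b) and cov(f, b).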
The jumps in the graph occur when an additional fore/backcasted observation is multiplied by a weight in the adjustment filter at a lag that is a multiple of the seasonal period, since these weights have the greatest magnitude (Bell and Monsell 1992). Note that the contributions from var(b, 0, f) at the very ends of the series are smaller than the contributions from var(e), but are not negligible.

2.3 Computation of Cov[(b 0 f), e]

To compute cov(f, e), we first note from results of the preceding section that f = y_f − ŷ_f = C(w_f − ŵ_f) = C(w_f − Σ_32 Σ_22⁻¹ w_o) = C[0 | −Σ_32 Σ_22⁻¹ | I]w = C[0 | −Σ_32 Σ_22⁻¹ | I]Δy. Since cov(y, e) = cov(Y + e, e) = 0 + var(e), we see that cov(f, e) = C[0 | −Σ_32 Σ_22⁻¹ | I]Δ var(e). Cov(b, e) is computed in an analogous fashion by noting that b = (−1)^r C′(w_b − ŵ_b) = (−1)^r C′(w_b − Σ_12 Σ_22⁻¹ w_o) = (−1)^r C′[I | −Σ_12 Σ_22⁻¹ | 0]Δy, so that cov(b, e) = (−1)^r C′[I | −Σ_12 Σ_22⁻¹ | 0]Δ var(e).

Example - U.S. 5+ Unit Housing Starts (continued). Figure 2 shows that the contribution of cov[(b, 0, f), e] is zero or near zero in the middle of the series, but it becomes increasingly negative towards the ends of the series, in a pattern similar, though opposite in sign and of smaller magnitude, to that of var(b, 0, f). At the very ends of the series, however, the pattern reverses and the covariance increases. The elements of cov[(b, 0, f), e] are mainly positive, so its contribution to the seasonal adjustment variance is negative because cov[(b, 0, f), e] and its transpose are subtracted from var(e) + var(b, 0, f). The net effect is that subtracting Ω{cov[(b, 0, f), e] + cov[(b, 0, f), e]′}Ω′ tends to offset the effect of adding Ω var(b, 0, f)Ω′, except near the very ends of the series. Thus, the graph of the variances of the seasonally adjusted series in Figure 2 is very similar to the graph of the contribution of var(e), except near the very ends of the series. We observed this type of "cancellation effect" in several other examples, including those of section 4.
2.4 Comparison with the First Approach of Wolter and Monsour

The first approach of Wolter and Monsour (1981) proposed use of Ω_WM var(e_o)Ω_WM′ as the variance-covariance matrix of the X-11 seasonal adjustment errors, where Ω_WM is an n × n matrix whose rows contain the X-11 linear filter weights, both symmetric and asymmetric. That is, the middle rows (rows t such that m < t < n − m + 1, assuming n > 2m) of Ω_WM contain the X-11 symmetric filter weights, but the first and last m rows of Ω_WM contain X-11's asymmetric filter weights. The middle rows of Ω_WM and Ω thus contain the same filter weights, but the first and last m rows do not. This means that our approach will give the same results as that of Wolter and Monsour for m < t < n − m + 1, that is, for time points at which the symmetric filter is being used. The results of the two approaches will differ for the first and last m time points. Since the most recent seasonally adjusted data receive the most attention, this difference is potentially important. Wolter and Monsour also considered use of a matrix Ω* instead of Ω_WM, where Ω* is (n + 12) × (n + 12) to include 12 additional rows of weights corresponding to year-ahead seasonal adjustment filters. Though year-ahead adjustment was the common practice through the early 1980s, it has now mostly been replaced in the United States by concurrent adjustment (McKenzie 1984). The differences between our approach and that of Wolter and Monsour can be viewed in two ways. One view is that since Wolter and Monsour did not consider forecast and backcast extension, their approach ignores the contribution of forecast and backcast errors to seasonal adjustment error. This contribution affects results for the first and last m time points, although the examples of section 4 show that this contribution is often small. However, in some cases it is not small, including those time series not subject to sampling error.
For such series Wolter and Monsour's approach would assign zero variance to the adjustments, even though initial adjustments would be revised as new data became available. The other way to view the differences between the approaches centers on the difference in "targets". The seasonal adjustment error under Wolter and Monsour's approach can be thought of as Ω_WM(Y_o − y_o) = −Ω_WM e_o. Since this results in zero error for series with no sampling error (Y_o = y_o), Wolter and Monsour implicitly define the seasonal adjustment target to be Ω_WM Y_o. This definition of target has the undesirable property that the target value for a given time point changes as additional data are acquired, since the rows of Ω_WM contain different filter weights. Our target value for any given time point t is always ω(B)Y_t.

Example - U.S. 5+ Unit Housing Starts (continued). We compared results using our methodology with that of Wolter and Monsour using the default X-11 seasonal adjustment filter although, as noted earlier, this example series is adjusted using the optional 3 × 9 seasonal moving average filter. This comparison used the default filter for convenience: asymmetric X-11 filter weights are needed to obtain results for the Wolter-Monsour approach, and we were given a computer program by Nash Monsour that produced them only for the default filter. Figure 3 gives the results for both approaches. The non-constant variances over time from the Wolter-Monsour approach result from applying different filters at different time points. An interesting consequence of this is that, despite the stationarity of the sampling error, the Wolter-Monsour seasonal adjustment variance is noticeably higher in the middle of the series than for many time points toward (but not close to) either end of the series. This carries the implausible implication that use of less data produces estimators with lower variance. Similar behavior can be observed in several examples presented by Pfeffermann (1994).
Figure 3. U.S. 5+ Units Housing Starts: Comparison with Approach of Wolter and Monsour (1981). The panel descriptions are as for Figure 2. The Wolter and Monsour approach (dotted lines) uses the asymmetric X-11 filters for the ends of the series and accounts only for sampling error. Our approaches agree in the middle of the series where there is no contribution from back/forecast error. The Wolter and Monsour variances inappropriately decrease near the ends of the series, suggesting that use of less data produces estimates with lower variances. The results here, in contrast to Figures 1, 2, 4, 5, and 6, use default X-11 filters. (See text.)

The results from using the default X-11 seasonal adjustment filter with our approach are also useful for comparing with the 3 × 9 seasonal moving average filter, for which results are given in Figure 2. Differences between results from using the two filters are not great. The contribution of the sampling error is somewhat lower and that of the fore/backcast error somewhat higher when using the default seasonal adjustment filter.

3. EXTENSIONS TO THE METHODOLOGY

This section discusses three extensions to the general methodology of section 2. The first two extensions are straightforward, the third more involved.

3.1 Variances for Seasonal, Trend, and Irregular Estimates; Variances with Time-Varying Filters

The only way the nonseasonal (seasonally adjusted) component is distinguished in the derivation of section 2 is through the filter weights placed in the matrix Ω. Therefore, corresponding variances for X-11 estimates of the seasonal, trend, and irregular components follow from the same expressions simply by changing the matrix Ω to contain the desired filter weights. This also changes the dimension of Ω, since the length of the seasonal adjustment, trend, and irregular filters (for given options) differs, and the filter length determines the size of Ω. A similar extension handles the case of different seasonal moving averages (MAs) selected for different months (or quarters), an option allowed by X-11. This changes the seasonal adjustment (and seasonal, trend, and irregular) filters applied in the different months. The results of section 2 also accommodate this extension through a simple modification of Ω. Since the rows of Ω correspond to the time points being adjusted, we simply define row t of Ω to contain the weights (along with sufficient zeroes) from whatever filter is being applied in month t. Some care must be taken to dimension Ω appropriately if the longest selected MA is not used in the first and last months of the series.

Example - U.S. 5+ Unit Housing Starts (continued). Figure 4 shows the variance of the X-11 trend estimate, using the 3 × 9 seasonal MA and 13-term Henderson. The most obvious difference from the seasonal adjustment results is the substantial effect of fore/backcast error at the very ends of the series. This occurs because the largest weights of the trend filter (ω^(T)(B)) are the center weight (ω_0^(T)) and the adjacent weights (ω_1^(T), ω_2^(T), ω_3^(T)) that are applied to data immediately before and after the observation being adjusted (Bell and Monsell 1992). At the very ends of the series, the weights (ω_1^(T), ω_2^(T), ω_3^(T)) apply to fore/backcasted observations, which results in large increases in the contribution of fore/backcast error there. The result is that uncertainty about the trend increases sharply at the ends of the series. In the center of the series, however, the trend variances of Figure 4 are substantially lower than the seasonal adjustment variances of Figure 3, due to the smoothing of the sampling error by the trend filter.

Figure 4. U.S. 5+ Units Housing Starts: Variance Decomposition of the Trend Estimate. The panel descriptions are as for Figure 2. Note the large jump in trend estimate variances at the ends of the series due to the contribution of back/forecast error (third panel).

3.2 Variances for Seasonally Adjusted Month-to-Month and Year-to-Year Changes

The variances of the errors of the seasonally adjusted estimates of month-to-month change are the quantities var(v_t − v_{t−1}), t = 2, ..., n. Given var(v), the complete error covariance matrix for the seasonally adjusted month-to-month changes can be calculated as Δ_1 var(v)Δ_1′, where

Δ_1 = [ −1   1   0  ...   0   0
         0  −1   1  ...   0   0
        ...
         0   0   0  ...  −1   1 ]

is of dimension (n − 1) × n. The error covariance matrix for the seasonally adjusted year-to-year changes in a quarterly series is calculated similarly as Δ_4 var(v)Δ_4′, where

Δ_4 = [ −1   0   0   0   1   0  ...  0
         0  −1   0   0   0   1  ...  0
        ...
         0  ...  0  −1   0   0   0   1 ]

is of dimension (n − 4) × n. The corresponding (n − 12) × n matrix Δ_12 for monthly series follows a similar pattern with additional zeroes. Variances of month-to-month or year-to-year changes in the trend are also easily obtained, as can be seen from this discussion and that of section 3.1.

Example - U.S. 5+ Unit Housing Starts (continued). We produced the standard errors for seasonally adjusted month-to-month and year-to-year changes for this series (Figure 5).
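The change matrices of section 3.2 are simple difference operators, and propagating var(v) through them is one line of linear algebra. A sketch of our own (the helper names are hypothetical):

```python
import numpy as np

def change_matrix(n, lag):
    """(n - lag) x n matrix whose rows compute v_t - v_{t-lag}:
    lag=1 gives month-to-month changes; lag=4 (quarterly) or
    lag=12 (monthly) gives year-to-year changes."""
    D = np.zeros((n - lag, n))
    for t in range(n - lag):
        D[t, t] = -1.0
        D[t, t + lag] = 1.0
    return D

def change_covariance(var_v, lag):
    """Error covariance matrix D var(v) D' of the adjusted changes."""
    D = change_matrix(var_v.shape[0], lag)
    return D @ var_v @ D.T
```

As a sanity check, with uncorrelated adjustment errors of unit variance each change has variance 2, the sum of the two variances involved.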
Since this time series has been log transformed, standard errors can be approximately interpreted as percentages on the original (unlogged) scale. Compared to the standard errors for the seasonally adjusted series, there are slight increases in the standard errors of the month-to-month changes near the ends of the series, but the standard errors of the year-to-year changes show almost no such increase. Thus, for this series and filter, the uncertainty about month-to-month and year-to-year percent change in the seasonally adjusted data is almost constant across the series. The standard errors of the month-to-month and year-to-year changes are both about 50 percent higher than those for the seasonally adjusted series.

3.3 Variances of X-11 Seasonal Adjustments with Estimated Regression Effects

Seasonal adjustment often involves the estimation of certain regression effects to account for such things as calendar variation, known interventions, and outliers (Young 1965; Cleveland and Devlin 1982; Hillmer, Bell, and Tiao 1983; Findley, Monsell, Bell, Otto, and Chen submitted). (Outlier effects are often estimated in the same way as known interventions even though inference about outliers should ideally take account of the fact that the series was searched for the most "significant" outliers.) This section shows how the results already obtained can be extended to include the contribution to seasonal adjustment error of error in estimating regression parameters. We still assume the other model parameters, which determine the covariance structures of Y and e, are known. In practice these other model parameters will also be estimated, but accounting for error in estimating them is much more difficult. A Bayesian approach for doing so in the context of model-based seasonal adjustment is investigated by Bell and Otto (submitted).
Figure 5. U.S. 5+ Units Housing Starts: Standard Errors. These panels contrast the standard errors (not variances, as in previous figures) of the seasonally adjusted data (top panel) with the larger standard errors of seasonally adjusted month-to-month (middle panel) and year-to-year (bottom panel) change estimates.

We extend the model for Y_t to include regression terms by writing Y_t = x_t′β + Z_t, where x_t is the vector of regression variables at time t, β is the vector of regression parameters, and Z_t is the series of true population quantities with regression effects removed. Extending our matrix-vector notation, we write Y = Xβ + Z, Y_o = X_o β + Z_o, etc. The regression matrix X can be partitioned by its rows corresponding to the backcast, observation, and forecast periods: X = (X_b′ | X_o′ | X_f′)′. We assume e_t has mean zero, so its model does not involve any regression effects. We then have y = Y + e = (Xβ + Z) + e, with the usual partitioning applying. Letting z_t denote the series y_t with the regression effects removed, we have z = y − Xβ = Z + e. An additional partition is needed of the matrix X and vector β. This is because some of the regression effects in x_t′β may be assigned to the nonseasonal component while others, such as trading-day or holiday effects, may be removed as part of the seasonal adjustment. See Bell (1984b) for a discussion. Partition x_t as (x_{St}′ | x_{Nt}′)′, where x_{Nt} represents the regression variables assigned to the nonseasonal component and x_{St} the variables whose effects are to be removed in the seasonal adjustment. Correspondingly partition β so x_t′β = x_{St}′β_S + x_{Nt}′β_N and Xβ = X_S β_S + X_N β_N = (X_S | X_N)(β_S′ | β_N′)′.
(x_{St}′β_S is assigned to the "combined" seasonal component.) The matrix X can thus be partitioned two ways: by seasonal versus nonseasonal regression effects, and by the backcast, observation, and forecast periods. Thus we write

X = [ X_{Sb}  X_{Nb}
      X_{So}  X_{No}
      X_{Sf}  X_{Nf} ].

If β were known we could compute z_o = y_o − X_o β = Z_o + e_o, forecast and backcast extend this series (call the extended series ẑ), adjust ẑ by X-11 (Ωẑ), and add back the required regression effects X_{No} β_N. The target of the seasonal adjustment would be X_{No} β_N + ΩZ = X_{No} β_N + Ω(Y − Xβ), and the seasonal adjustment error would then be (X_{No} β_N + ΩZ) − (X_{No} β_N + Ωẑ) = Ω(Z − ẑ). Thus, if the regression parameters were known they would not contribute to the seasonal adjustment error, and the results already given could be used to compute var(Ω(Z − ẑ)).

In practice, β will be estimated as part of the model fitting, say by maximum likelihood assuming normality. Given the estimates of the other model parameters, and taking these parameters as if they were known, the maximum likelihood estimate of β and its variance are given by

β̂ = [X_o′ Δ_o′ Σ_22⁻¹ Δ_o X_o]⁻¹ X_o′ Δ_o′ Σ_22⁻¹ Δ_o y_o   (3.1)

var(β̂) = [X_o′ Δ_o′ Σ_22⁻¹ Δ_o X_o]⁻¹,   (3.2)

where Δ_o is of dimension (n − d) × n, containing that part of the larger matrix Δ which differences the observed series y_o. The expressions (3.1) and (3.2) are generalized least squares results using the regression equation for the differenced data, w_o = Δ_o y_o = (Δ_o X_o)β + (u_o + Δ_o e_o), where the error term, u_o + Δ_o e_o, has covariance matrix var(w_o) = Σ_22, which is determined by the other model parameters.

Given the estimated regression parameters β̂, the seasonally adjusted series would be obtained by subtracting the estimated regression effects from the data (call the resulting series z̃_o = y_o − X_o β̂), extending this series with forecasts and backcasts using the model (denote this extended series z̃ = (z̃_b, z̃_o, z̃_f)), applying X-11 to the extended series (Ωz̃), and adding back the estimated regression effects assigned to the nonseasonal component (Ωz̃ + X_{No} β̂_N). The target of the seasonal adjustment is still X_{No} β_N + ΩZ, as discussed above. The seasonal adjustment error is then v = (X_{No} β_N + ΩZ) − (Ωz̃ + X_{No} β̂_N) = X_{No}(β_N − β̂_N) + Ω(Z − z̃).

The expression for v can be simplified by rewriting z̃. First, let G = [B′|I|F′]′, where F is the matrix that produces forecasts ẑ_f from z_o, and B is the corresponding matrix that produces backcasts ẑ_b from z_o. We will not need explicit expressions for F or B. G applied to z_o produces ẑ, while G applied to z̃_o produces z̃. Therefore, z̃ = ẑ − (ẑ − z̃) = ẑ − [G(y_o − X_o β) − G(y_o − X_o β̂)] = ẑ + GX_o(β − β̂). Note that GX_o is obtained by applying the procedure for forecast and backcast extension (from the model for z_t) to each column of X_o. The approach we used to do this is described in Appendix B. Continuing, we have v = X_{No}(β_N − β̂_N) + Ω[(Z − ẑ) − GX_o(β − β̂)] = Ω(Z − ẑ) + {[0 | X_{No}] − ΩGX_o}(β − β̂).

Now, Z − ẑ = (z − e) − ẑ = [b|0|f] − e. Note that [b|0|f], the error vector from projecting z on z_o (or y_o), is orthogonal to (uncorrelated with) β − β̂, since β̂ is a linear function of the data y_o. Therefore, letting K = [0 | X_{No}] − ΩGX_o, we have the variance-covariance matrix of the seasonal adjustment error allowing for error in estimating β:

var(v) = Ω var(Z − ẑ)Ω′ + K var(β̂)K′ + Ω cov(e, β̂)K′ + K cov(β̂, e)Ω′,   (3.3)

where var(β̂) is given by (3.2). In (3.3), Ω var(Z − ẑ)Ω′ is computed by the results of section 2, and computation of K var(β̂)K′ is straightforward once GX_o has been computed. To compute the other two terms requires cov(β̂, e):

cov(β̂, e) = cov([X_o′ Δ_o′ Σ_22⁻¹ Δ_o X_o]⁻¹ X_o′ Δ_o′ Σ_22⁻¹ Δ_o y_o, e)
          = cov([X_o′ Δ_o′ Σ_22⁻¹ Δ_o X_o]⁻¹ X_o′ Δ_o′ Σ_22⁻¹ [u_o + Δ_o e_o], e)
          = [X_o′ Δ_o′ Σ_22⁻¹ Δ_o X_o]⁻¹ X_o′ Δ_o′ Σ_22⁻¹ Δ_o [0_{n×m} | I_{n×n} | 0_{n×m}] var(e).   (3.4)

Note that [0_{n×m} | I_{n×n} | 0_{n×m}] var(e) = [cov(e_o, e_b) | var(e_o) | cov(e_o, e_f)] is the middle n rows of var(e).
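The generalized least squares computation in (3.1)-(3.2) can be sketched directly; this is our own illustration (the differencing matrix Δ_o, design X_o, and Σ_22 are taken as given), not the authors' software.

```python
import numpy as np

def gls_beta(Xo, Do, S22, yo):
    """GLS estimate (3.1) and variance (3.2) from the differenced
    regression w_o = (Do Xo) beta + noise, where the noise
    u_o + Do e_o has covariance Sigma_22."""
    Xd = Do @ Xo                      # differenced regressors
    wd = Do @ yo                      # differenced data w_o
    SiX = np.linalg.solve(S22, Xd)    # Sigma_22^{-1} (Do Xo)
    var_beta = np.linalg.inv(Xd.T @ SiX)
    beta_hat = var_beta @ (SiX.T @ wd)
    return beta_hat, var_beta
```

With Σ_22 = I this reduces to ordinary least squares on the differenced data, and a noiseless series is recovered exactly, a useful check on the differencing bookkeeping.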
Using (3.4) and the aforementioned results, (3.3) can be computed. We can compare the resulting diagonal elements of var(v) with those of the sum of the last three terms in (3.3), to see if allowing for the error due to estimating the regression parameters is important.

There is an important qualification to make about the results of this section. Since the first term on the right hand side of (3.3), Ω var(Z − ẑ)Ω′, is the seasonal adjustment variance we would get by ignoring error in estimating the regression parameters, it is tempting to interpret the sum of the last three terms in (3.3) as the contribution to seasonal adjustment variance of error due to estimating regression parameters. Unfortunately, this sum is not itself a variance (it can in fact be written as var(Kβ̂ + Ωe) − var(Ωe)), and so it can actually be negative. When this happens, the seasonal adjustment variances that allow for error due to estimating regression parameters are actually lower than those that ignore this error. We were in fact able to achieve such a result by artificially modifying model parameters in the following example with trading-day variables (though, as in the results shown, the effects were quite small). This situation contrasts with comparable results for model-based approaches, which express the seasonal adjustment error as the sum of two orthogonal terms: the error when all parameters are known, plus the contribution to error from estimating regression parameters. The seasonal adjustment variance in this case is thus the sum of the variances of these two terms, and so the "regression contribution" is always nonnegative. This result is analogous to Ω var(Z − ẑ)Ω′ + K var(β̂)K′ in (3.3). The problem in (3.3) is that the X-11 estimate Ωz̃ is not an optimal (MMSE) estimator of the target ΩZ, hence the error Ω(Z − z̃) is correlated with β̂ through the sampling error e, leading to the two covariance terms in (3.3).
This situation results partly from our choice of target (X_{No} β_N + ΩZ) and partly from the fact that X-11 cannot be assumed to produce an optimal estimator of anything (note comments related to this in the Introduction).

Example - U.S. 5+ Unit Housing Starts (continued). We use the same example to illustrate the contribution to seasonal adjustment error of adding trading-day variables (Bell and Hillmer 1983), although the corresponding regression coefficients were not statistically significant when estimated with this series. Figure 6a shows the results. In this illustration, the lowest line is the "contribution" to the seasonal adjustment variance from estimating the trading-day effects (but see remarks above). When added to the original estimate of variance (dotted line), we obtain the variance of the seasonally and trading-day adjusted series, allowing for error in estimating the trading-day coefficients (top solid line). We see that, for this example, the increase in variance due to including estimated trading-day effects in the model is slight. Figure 6b gives results for the trend filter. Here the contribution to trend uncertainty due to estimating the trading-day coefficients is certainly negligible. The contribution to seasonal adjustment variance of adding three additive outlier variables and one level shift variable is illustrated in Figure 6c. These regression variables were identified as potential outlier effects using the Regarima program (produced by the Time Series Staff at the U.S. Census Bureau) with a critical t-statistic of 2.5. Regarima uses an outlier detection methodology similar to those discussed in Bell (1983) and Chang, Tiao, and Chen (1988). The contributions of the additive outliers appear as three spikes, while that of the level shift is a single smaller hump in the middle of the series. In comparison to the trading-day regression variables, the effect of these outlier variables is mainly local but much stronger.
In particular, there is additional uncertainty about seasonal adjustments for observations considered additive outliers.

Figure 6. U.S. 5+ Units Housing Starts: Including the "Contribution" from Regression Effects in the Variance Estimates. The top panel shows both the original variances from the first panel of Figure 2 (dotted curve) and the variances allowing for additional uncertainty due to estimating trading-day regression effects (top solid curve). The regression contribution is also shown (bottom solid curve). The second panel shows the corresponding results for the variances of the trend estimates. Note that the regression contribution to the seasonal adjustment variances is small, and to the trend estimate variances it is essentially zero. The third and fourth panels show analogous results when the trading-day regression effects are replaced by three additive outliers and a level shift. Notice that these have important local effects on the seasonal adjustment and trend estimate variances.

Results for the trend filter (Figure 6d) differ in that uncertainty is much greater around the observation where a level shift was detected, approaching the level of uncertainty at the ends of the series. A level shift is considered part of the trend, so an estimated level shift effect would first be subtracted from the series (in Xβ̂), and then added back following application of the X-11 trend filter.
(This is analogous to the treatment of regression effects assigned to the nonseasonal or seasonal components in seasonal adjustment, as discussed above.) In contrast, since both additive outliers and level shifts are considered part of the nonseasonal component, all four effects were added back as part of the seasonal adjustment when producing results for Figures 6a and 6b. Actually, these sorts of results for outliers should only be regarded as crude approximations, since they treat the times of occurrence and types of outliers as known, leaving only the magnitudes of the effects to be estimated. Ideally, one would like to recognize that the series was searched for significant outliers, but this is much more difficult. An interesting feature in Figure 7 is the sets of small downward projecting spikes that occur one year apart in triplets. These occur at non-leap year Februaries, for which there is no trading-day effect (the trading-day regression variables are all zero). There is still a small regression contribution to seasonal adjustment error at these time points since the adjustment averages in these contributions from adjacent time points. (Dips at non-leap year Februaries are also visible on close inspection of Figure 6a.) In addition, for some years, the error in estimating the Easter effect produces a noticeable upward projecting spike involving the two months March and April.

4. EXAMPLES

We illustrate our approach using several additional economic time series whose sampling errors follow different models. The models used for these example series are taken from previous work as noted.

4.1 Retail Sales of Department Stores

Department store sales are estimated in the Census Bureau's monthly retail trade survey. Essentially all sales come from department store chains, all of which are included in the survey; hence, there is virtually no sampling error in the estimates.
Thus, the variance of the X-11 seasonal adjustment comes only from fore/backcast error and from error in estimating regression effects. (Note that the Wolter-Monsour seasonal adjustment variance would be zero for this series.) The model used for this series (Bell and Wilcox 1993), for the period August 1972 through March 1989 for the logs of the observations, is

(1 - B)(1 - B^12)[Y_t - x_t'β] = (1 - 0.53B)(1 - 0.52B^12)a_t, with σ_a^2 = 4.32 × 10^-4,

where x_t includes variables to account for trading-day and Easter holiday effects, and Y_t = y_t is the log of the original series divided by length-of-month factors. In adjusting the series at the Census Bureau, the default X-11 adjustment filter and 13-term Henderson trend filter are used. Figure 7a shows the standard errors for the seasonally adjusted data over time, with and without the contribution of regression effects. Unlike the 5+ units housing starts series, there are marked increases in the standard errors of seasonally adjusted data at the ends of the series, due entirely to fore/backcast error. The contribution to the standard error due to estimating regression effects is also more pronounced for this series.

Figure 7. U.S. Department Stores, with Trading-Day and Easter Effects. This series has no sampling error. The four panels give standard errors with and without the contribution from estimating regression effects. For the seasonally adjusted data and corresponding month-to-month and year-to-year changes (first three panels), the "contribution" from estimating regression effects is substantial and erratic in the middle of the series (where it is the sole contributor) but, at either end, diminishes for reasons explained in the text.
The regression contribution to the trend estimate standard errors is small.

Bell and Kramer: Toward Variances for X-11 Seasonal Adjustments

The relative regression contribution to the seasonal adjustment standard errors diminishes towards the ends of the series. This results from two factors: (1) the magnitude of the regression contribution to var(v_t) decreases somewhat towards the ends of the series, and, more importantly, (2) var(Ẑ_t - Z_t) increases dramatically towards the ends of the series, diminishing the relative contribution to var(v_t) due to regression (and this is further accentuated when square roots are taken). The pattern of the standard errors of seasonally adjusted month-to-month changes (Figure 7b) is similar to that for the standard errors of the seasonally adjusted data (Figure 7a). The regression contribution is slightly larger than it is for the seasonally adjusted data. Standard errors of year-to-year changes (Figure 7c) follow similar patterns, but the regression contribution is considerably larger than it is for the month-to-month changes, and it remains important at the ends of the series. A similar set of calculations was performed using the default X-11 trend filter, and results for the standard errors of the trend estimates, with and without the regression contribution, are depicted in Figure 7d. The patterns over time of these standard errors are similar to the corresponding figures for the 5+ units housing starts series, but the standard errors are much smaller due to the absence of sampling error. The regression contribution is small. The standard errors for all plots in Figure 7 are small - none exceed 0.8 percent. For this series, the regression contribution is small and probably ignorable near the very ends of the series, for all but the year-to-year changes. However, in the middle of the series, the sole contributor to standard errors is the regression effects.
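Trading-day regressors of the kind used in these models are built from day-of-week counts for each month; one common parameterization (following Bell and Hillmer 1983) uses the six contrasts (#Mondays − #Sundays, ..., #Saturdays − #Sundays). A sketch of how such variables can be constructed (illustrative, not the authors' code):

```python
import calendar
import numpy as np

def trading_day_vars(year, month):
    # Count occurrences of each weekday (Mon=0, ..., Sun=6) in the month,
    # then form the six contrasts (#Mon - #Sun, ..., #Sat - #Sun).
    counts = np.zeros(7)
    _, ndays = calendar.monthrange(year, month)
    for day in range(1, ndays + 1):
        counts[calendar.weekday(year, month, day)] += 1
    return counts[:6] - counts[6]
```

A 28-day February contains exactly four of each weekday, so all six contrasts vanish; this is why non-leap-year Februaries carry no trading-day effect, as noted earlier.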
4.2 Teenage Unemployment

The Bureau of Labor Statistics (BLS) publishes the monthly time series of the number of U.S. unemployed teenagers estimated from the Current Population Survey (CPS). Data from January 1972 to December 1983 (n = 144) were used by Bell and Hillmer (submitted) to develop a model for this series. The sampling error variance h_t changes over time, so the sampling error is nonstationary. The sampling error model they developed is

e_t = h_t^(1/2) ẽ_t, where (1 - 0.6B)ẽ_t = (1 - 0.3B)b_t, (4.1)

with σ_b^2 = 0.87671 so that var(ẽ_t) = 1. CPS sampling error variances can be approximated by generalized variance functions (Wolter 1985, Chapter 5; Hanson 1968). The generalized variance function Bell and Hillmer used for the teenage unemployment series is

h_t = 1.971 y_t - (1.53 × 10^-5) y_t^2, (4.2)

where y_t is the estimate of the number (in thousands) of unemployed teenagers at time t. The estimated model for the signal component Y_t is

(1 - B)(1 - B^12)Y_t = (1 - 0.27B)(1 - 0.68B^12)a_t, (4.3)

with σ_a^2 = 4294. There are no regression effects in the model and the series is not transformed. BLS uses the default X-11 seasonal adjustment filter (so m = 84). In applying the methods of this paper to this example, problems arise from the fact that the (estimated) sampling error variance h_t depends on the estimate y_t through the generalized variance function (4.2). In the backcast and forecast periods y_t is unknown. To obtain h_t in these periods we forecast and backcast y_t using a simple ARIMA(0 1 1)(0 1 1)_12 model for y_t (not for Y_t as in (4.3)). The resulting 84 forecasts and backcasts were then used in (4.2) to produce h_t in the forecast and backcast periods. More refined treatments are possible, such as using the component model given by (4.1) and (4.3) to forecast y_t.

[Figure 8 panels: (a) variance of seasonally adjusted series; (b) contribution of sampling error (e); (c) contribution of fore/backcast error (b, 0, f); (d) contribution of covariances of e and (b, 0, f).]

Figure 8.
Teenage Unemployment, with Default X-11 Options. The panel descriptions are as for Figure 2. The seasonal pattern of the sampling error variance contribution (second panel) results from its dependence on the level of the series through a generalized variance function (see text).

The seasonal adjustment variance for this series (Figure 8a) is dominated at most times t by the sampling error contribution (Figure 8b). This is because, while the contribution of var[(b, 0, f)] is substantial for this series (Figure 8c), it tends to be offset by the contribution of cov[(b, 0, f), e] + cov[(b, 0, f), e]' (Figure 8d), except at the first and last few time points. The patterns of variances of seasonally adjusted month-to-month changes and year-to-year changes (not shown) are similar to that of Figure 8a. The variances of the month-to-month changes are slightly larger than those of the adjusted series; those of the year-to-year changes are larger still.

4.3 Retail Sales of Drinking Places

Retail sales of drinking places are estimated in the Census Bureau's monthly retail trade survey. In this survey, (noncertainty) sample cases are independently redrawn approximately every 5 years, so the covariance matrix of the sampling errors is block diagonal. Bell and Hillmer (1990) developed the following model for the sampling error of the logged series within a given sample:

(1 - 0.15B - 0.66B^2 + 0.50B^3)(1 - 0.71B^12)e_t = (1 + 0.13B)b_t, (4.4)

with σ_b^2 = 9.301 × 10^-5. For time points t and j in different samples, cov(e_t, e_j) = 0. Bell and Hillmer developed a model for the signal component of the logged series using unbenchmarked estimates from September 1977 to December 1986. We shall instead use the following model fit by Bell and Wilcox (1993) using additional data through October 1989:

(1 - B)(1 - B^12)[Y_t - x_t'β] = (1 - 0.23B)(1 - 0.58B^12)a_t,

where x_t contains trading-day regression variables, and σ_a^2 = 4.16 × 10^-4. In seasonally adjusting this series, the default X-11 filters are used. The contribution of error due to estimating regression parameters is small for this series, and so is not included in the results to follow. Since the contribution of sampling error overwhelms the contributions from fore/backcast error and cov[(b, 0, f), e] + cov[(b, 0, f), e]', we also do not illustrate these separate variance contributions. Figure 9a gives the standard error of the seasonally adjusted data (shown over 232 observations to better illustrate the pattern, with vertical lines indicating sample redraws) and Figure 9b the standard error of seasonally adjusted month-to-month changes. Note the strong pattern in Figures 9a and 9b due to the redrawing of the sample every five years. In particular, this produces a large spike in the standard error of seasonally adjusted month-to-month changes (Figure 9b) when the sample is redrawn. Similar jumps in standard deviations of year-to-year changes occur for the first year of a new sample. We also found similar patterns for other series from the retail trade survey using models from Bell and Wilcox (1993).

The preceding discussion and results ignored certain aspects of how estimation for the retail trade survey is actually carried out. In fact, to avoid large increases in variances of change estimates around the sample redraw, such as those reflected in Figure 9b, simple modifications are made to estimates in a newly introduced sample to make their level consistent with that from the old sample. The simplest version of the modification is as follows. Let z_(old)t (= exp(y_(old)t)) denote estimates from the old sample, and z_(new)t unmodified estimates from the new sample. Assume that the old sample provides estimates for t ≤ T, and that the new sample is to provide estimates for t > T. To provide overlap data for the modification, the new sample is begun one month early, so that both z_(old)T and z_(new)T are available. The modified new sample estimates are defined as z'_(new)t = z_(new)t (z_(old)T / z_(new)T) for t > T. This modification is carried out each time a new sample is introduced. In terms of the corresponding logged estimates y_t, the modification is

y'_(new)t = y_(new)t + (y_(old)T - y_(new)T). (4.5)

Since the modification to y_t is linear, it is easy to account for its effects on the seasonal adjustment variance calculations here. The month-to-month change at time T + 1 before the modification (and without seasonal adjustment) is y_(new)T+1 - y_(old)T. Note that this change has a large variance since y_(new)T+1 and y_(old)T come from different, independent samples. After modification, this change is y_(new)T+1 - y_(new)T, which has a much lower variance due to strong positive correlation between y_(new)T+1 and y_(new)T (arising from the sampling error model (4.4)). Unadjusted month-to-month change estimates for time points other than T + 1 are unaltered by the modification. Figure 9d shows that modifying new sample estimates eliminates the large increases in the standard deviation of seasonally adjusted month-to-month changes at the transitions to new samples. Similar effects were seen for year-to-year changes over a one-year period. The price paid for this improvement is a steadily increasing error in the level estimates (Figure 9c) following introduction of new samples. This occurs because the modification introduces a transient error into the level estimates that persists throughout the new sample. Thus, the modification trades off worse accuracy of level estimates for improvements in change estimates. (Figure 9c shows no increase for the first five years because we assume the estimates there are not modified to agree with those from a previous sample.) Moreover, the strong patterns in Figure 9a occur because the sampling errors from unmodified estimates in adjacent samples are uncorrelated.
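The overlap modification just described is a constant shift on the log scale. A sketch with simulated logged estimates (all numbers hypothetical), showing that the modification matches the new sample's level to the old sample at the overlap month while leaving within-sample month-to-month changes untouched:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 60                                     # last month covered by the old sample
y_old = rng.normal(5.00, 0.02, T + 1)      # logged estimates from the old sample, t = 0..T
y_new = rng.normal(5.05, 0.02, 61)         # unmodified logged estimates, new sample
# the new sample is begun one month early, so y_new[0] aligns with month T
y_new_mod = y_new + (y_old[T] - y_new[0])  # logged form of the level modification

# level agrees with the old sample at the overlap month
assert np.isclose(y_new_mod[0], y_old[T])
# month-to-month changes within the new sample are unchanged by the shift
assert np.allclose(np.diff(y_new_mod), np.diff(y_new))
```

The shift is why the change at the transition month inherits the strongly correlated new-sample errors, while every level estimate in the new sample absorbs the fixed overlap error y_old[T] − y_new[0], producing the transient level error noted above.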
On the other hand, sampling errors in the modified estimates are fairly strongly correlated between adjacent samples. The effect of this, after applying the seasonal adjustment filter, is a much different pattern (almost no pattern) in the first five years of Figure 9c, and slight oscillations around the linear increase thereafter. The standard errors for the X-11 trend estimates and changes (not shown) look like smoothed versions of those shown in Figure 9.

Figure 9. Retail Sales of Drinking Places: Samples Redrawn Every Five Years. The top panel shows the standard error of the seasonally adjusted data and the second panel the standard error of the corresponding month-to-month changes. The strong pattern results from independently drawing a new sample every five years (at the dotted vertical lines). For month-to-month changes, this produces large increases in standard errors at the time of the sample redraw. To eliminate this problem, a new sample is drawn to overlap with the previous sample for one or more months, and the new sample's estimates are modified using data from the overlap to make them consistent in level with estimates from the previous sample (see text). This eliminates the increases in standard errors of change estimates when the sample is redrawn (fourth panel), but introduces a transient error into the modified level estimates, whose effects accumulate over time (third panel).

In practice, final estimates from the retail trade survey are even more complicated than what was just described and illustrated. First, more than one month of overlapping
data are collected and may be used to modify level estimates when a new sample is introduced. More importantly, monthly estimates are benchmarked to agree with annual totals obtained from the more accurate annual retail trade survey or five-year economic census. Benchmarking should thus alleviate the problem of level variances increasing over time seen in Figure 9c. However, since benchmarking imposes linear sum constraints on the original (unlogged) estimates, its effects on seasonal adjustment variances are difficult to investigate under the approach developed here, and we have not done so. (We have used a model for unbenchmarked data to avoid this problem.) Durbin and Quenneville (1995) develop a model-based approach to benchmarking that accounts for the nonlinearities that such benchmark constraints impose on logged data.

5. CONCLUSIONS

This paper presented an approach to the long-standing problem of obtaining variances for X-11 seasonal adjustments. Our goal was the development and application of an approach to obtain variances accounting for two sources of error. The first error source is sampling error (e_t), which arises because we do not observe the true series Y_t, but instead observe estimates y_t = Y_t + e_t from a repeated survey. The second error source results from the need to extend the observed series with forecasts and backcasts to apply the symmetric X-11 filters. This second error source leads to seasonal adjustment revisions. To account for these two sources of error, we defined the seasonal adjustment variance as the variance of the error in using the X-11 adjustment to estimate a specific target. This target, ω(B)Y_t, is what would result from applying the symmetric, linear X-11 filter, ω(B), to the true series if its values were available far enough into the future and past for the symmetric filter to be used.
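The target definition can be made concrete with a toy symmetric filter. Here a centered 13-term 2×12 moving average stands in for the much longer symmetric linear X-11 filter ω(B) (purely illustrative; the actual filter has on the order of 84 weights on each side), and a naive random-walk forecast stands in for the model-based extension:

```python
import numpy as np

# symmetric 2x12 moving-average weights (lags -6..+6), summing to 1
w = np.r_[0.5, np.ones(11), 0.5] / 12.0

def symmetric_filter(series, t):
    # the "target" at time t: the symmetric filter applied to the series
    return w @ series[t - 6:t + 7]

rng = np.random.default_rng(0)
y_true = np.cumsum(rng.normal(0.0, 1.0, 140))   # stand-in for the true series
n = 128
y_obs = y_true[:n]
t = n - 1                                        # the most recent observation
# extend the observed series with naive random-walk forecasts
y_ext = np.r_[y_obs, np.full(6, y_obs[-1])]
estimate = symmetric_filter(y_ext, t)            # computable today
target = symmetric_filter(y_true, t)             # needs future data
revision_error = estimate - target               # error due to forecast extension
```

The variability of revision_error over repeated realizations is the kind of fore/backcast contribution the paper's variance expressions account for; at interior time points the extension is unused and this error vanishes.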
(The application to additive X-11 with fore/backcast extension is immediate, and log-additive X-11 is taken as an approximation to multiplicative X-11.) Our approach was also applied to produce variances of X-11 trend estimates, and to produce variances of month-to-month and year-to-year changes in both the seasonally adjusted data and trend estimates. A further extension was made to allow for error in estimating regression parameters (e.g., to model calendar effects), though this was more involved and had some limitations. The variances we obtain ignore uncertainty due to time series variation in the seasonal and nonseasonal components. We argued in section 2 that this may be appropriate for typical users of X-11 seasonally adjusted data. If one desires to account for this time series variation, however, we suggest that consideration be given to model-based approaches to seasonal adjustment, since time series models provide a means to explicitly account for variation in all the components. Alternatively, Pfeffermann (1994) developed an approach to X-11 seasonal adjustments that attempts to account for irregular variation and sampling error. Our approach builds on the first approach suggested by Wolter and Monsour (1981), by accounting for the contribution of forecast and backcast error that was ignored by them. An alternative view of the difference between our approach and theirs is that we define a consistent seasonal adjustment target, whereas, in using X-11's asymmetric filters, Wolter and Monsour implicitly used targets that change over time. Because of this, our approach avoids the unrealistic feature of seasonal adjustment variances that decrease towards the ends of the series, which can be seen in the results of Wolter and Monsour, and also of Pfeffermann. In the empirical results presented, the contribution of sampling error often dominated the seasonal adjustment variances.
This is partly because sampling error was often large relative to fore/backcast error, and partly because the contribution of fore/backcast error tended to be offset by the contribution of the covariance of fore/backcast error with the sampling error. On the other hand, empirical results for trend estimate variances showed large increases at the ends of series due to the effects of fore/backcast error. Since the largest contribution of fore/backcast error occurs at the ends of the series, and variances for the most recent seasonal adjustments and trend estimates are of the most interest, one should not ignore the contribution of fore/backcast error. The relative contribution to our variances of error in estimating trading-day or holiday regression coefficients tended to be small, unless the series had no sampling error. Error due to estimating additive outlier and level shift effects was substantial around the time point of the outlier. The effects of AOs were large on seasonal adjustment variances; the effects of LSs were large on trend estimate variances. Nonstationarities in the sampling errors produced interesting patterns in the seasonal adjustment and trend estimate variances. Two types of sampling error nonstationarities were examined. Seasonal patterns in sampling error variances produced corresponding seasonal patterns in seasonal adjustment variances. Independent redrawings of the sample, which yield sampling errors correlated within but not across samples, produced erratic patterns in seasonal adjustment and trend estimate variances over time within a sample. These patterns approximately repeat across different samples if the samples remain in force for approximately equal time spans. Computations for the examples shown (given the fitted models, which were obtained from the references cited) were done by programming the expressions of Sections 2 and 3 in the S+ statistical programming language. The resulting computer code is available on request.
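These computations reduce to dense linear algebra; the core technique of Appendix A (evaluating A Σ⁻¹ B from a Cholesky factor via two triangular solves, rather than forming Σ⁻¹ explicitly) can be sketched in Python as follows (an illustration, not the authors' S+ code):

```python
import numpy as np

def a_siginv_b(A, Sigma, B):
    """Compute A @ inv(Sigma) @ B without forming inv(Sigma).

    With Sigma = L L' (Cholesky), A Sigma^{-1} B = (L^{-1} A')' (L^{-1} B):
    (1) solve L Q1 = B, (2) solve L Q2 = A', (3) return Q2' Q1.
    (A production version would use a dedicated triangular solver to
    exploit the lower-triangular structure of L.)
    """
    L = np.linalg.cholesky(Sigma)        # lower triangular factor
    Q1 = np.linalg.solve(L, B)
    Q2 = np.linalg.solve(L, A.T)
    return Q2.T @ Q1

# check against the direct (inverse-forming) computation on random inputs
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
Sigma = M @ M.T + 5.0 * np.eye(5)        # positive definite
A = rng.standard_normal((2, 5))
B = rng.standard_normal((5, 3))
assert np.allclose(a_siginv_b(A, Sigma, B), A @ np.linalg.inv(Sigma) @ B)
```

Avoiding the explicit inverse is both faster and numerically more stable for the large covariance matrices that arise here.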
APPENDIX A

Several expressions to be calculated in this paper are of the general form

A Σ^-1 B, (A.1)

where Σ is a positive definite matrix, and A and B are conformable to Σ. Let Σ = LL' be the Cholesky decomposition of Σ. Then A Σ^-1 B = A(L^-1)'L^-1 B, and (A.1) can be computed as follows: (1) solve LQ_1 = B for Q_1; (2) solve LQ_2 = A' for Q_2; (3) compute A Σ^-1 B = Q_2'Q_1. Steps (1) and (2) can be solved efficiently since L is lower triangular.

APPENDIX B

Two steps are required to obtain GX, used in section 3.3. The first step produces "forecast" and "backcast" extensions of the differenced regression variables. The second step uses these results and the difference equation to produce forecast and backcast extensions of the original (undifferenced) regression variables. Let R_0 = Δ_1 X, where Δ_1 is that part of the matrix Δ which differences the observed series y_1. Analogous to the computation of w_b and w_f in section 2.2, forecast extensions of the differenced regression variables are calculated as R_f = Σ_32 Σ_22^-1 R_0 and backcast extensions as R_b = Σ_12 Σ_22^-1 R_0. R_f and R_b are of the form (A.1) and can be computed by the technique given above.

For the second step, let x_t denote any one of the regression variables in X. Let the required forecast extensions be denoted x_{n+l}, for l = 1, 2, ..., m. Let the differencing operator in the model be δ(B) = 1 - δ_1 B - ... - δ_d B^d, and let r_{n+l} be the forecast extension of δ(B)x_t = r_t at time n + l (r_{n+l} is an element of R_f). The x_{n+l} are calculated recursively as

x_{n+l} = δ_1 x_{n+l-1} + ... + δ_d x_{n+l-d} + r_{n+l}, for l = 1, ..., m,

where x_{n+j} is the observed value for j ≤ 0. The required backcast extensions of x_t are denoted x_{1-l} for l = 1, ..., m. These are also obtained recursively from the difference equation δ(B)x_t = r_t, by solving for x_{1-l} in the expression

x_{1-l} = δ_d^-1 (x_{d+1-l} - δ_1 x_{d-l} - ... - δ_{d-1} x_{2-l} - r_{d+1-l}), for l = 1, ..., m,

and substituting previously computed backcasts as needed, where x_{1-j} is the observed value for j ≤ 0.

ACKNOWLEDGEMENTS

We thank Nash Monsour for providing the FORTRAN program to calculate default X-11 asymmetric filter weights. In addition, he patiently explained how estimates from the monthly retail trade surveys are modified when new samples are drawn. This paper reports the general results of research undertaken by Census Bureau staff. The views expressed are attributed to the authors and do not necessarily reflect those of the Census Bureau.

REFERENCES

BELL, W. R. (1983). A computer program for detecting outliers in time series. Proceedings of the Business and Economic Statistics Section, American Statistical Association, 634-639.

BELL, W. R. (1984a). Signal extraction for nonstationary time series. Annals of Statistics, 12, 646-664.

BELL, W. R. (1984b). Seasonal Decomposition of Deterministic Effects. Research Report Number 84/01, Statistical Research Division, Bureau of the Census.

BELL, W. R., and HILLMER, S. C. (1983). Modeling time series with calendar variation. Journal of the American Statistical Association, 78, 526-534.

BELL, W. R., and HILLMER, S. C. (1984). Issues involved with the seasonal adjustment of economic time series (with discussion). Journal of Business and Economic Statistics, 2, 291-320.

CHANG, I., TIAO, G. C., and CHEN, C. (1988). Estimation of time series parameters in the presence of outliers. Technometrics, 30, 193-204.

CLEVELAND, W. S., and DEVLIN, S. J. (1982). Calendar effects in monthly time series: modeling and adjustment. Journal of the American Statistical Association, 77, 520-528.

DAGUM, E. B. (1975). Seasonal factor forecasts from ARIMA models. Proceedings of the 40th Session of the International Statistical Institute, Warsaw, Poland, 206-219.

DURBIN, J., and QUENNEVILLE, B. (1995). Benchmarking monthly time series with structural time series models.
Proceedings of the Survey Methods Section, 23rd Annual Meeting of the Statistical Society of Canada, Montreal, Canada, 13-18.

FINDLEY, D. F., MONSELL, B. C., BELL, W. R., OTTO, M. C., and CHEN, B. (1998). New capabilities and methods of the X-12-ARIMA seasonal adjustment program. Journal of Business and Economic Statistics, 16, 127-152.

BELL, W. R., and HILLMER, S. C. (1988). A Matrix Approach to Signal Extraction and Likelihood Evaluation for ARIMA Component Time Series Models. Research Report Number 88/22, Statistical Research Division, Bureau of the Census.

GEWEKE, J. (1978). Revision of Seasonally Adjusted Time Series. SSRI Report No. 7822, Department of Economics, University of Wisconsin.

BELL, W. R., and HILLMER, S. C. (1990). The time series approach to estimation for repeated surveys. Survey Methodology, 16, 195-215.

HANSON, R. H. (1968). The Current Population Survey: Design and Methodology. Technical Paper 40, U.S. Census Bureau, Washington, D.C.: Government Printing Office.

BELL, W. R., and HILLMER, S. C. (submitted). Applying time series models in survey estimation.

HAUSMAN, J. A., and WATSON, M. W. (1985). Errors in variables and seasonal adjustment procedures. Journal of the American Statistical Association, 80, 531-540.

BELL, W. R., and MONSELL, B. C. (1992). X-11 Symmetric Linear Filters and Their Transfer Functions. Research Report 92/15, Statistical Research Division, Bureau of the Census.

BELL, W. R., and OTTO, M. C. (submitted). Bayesian Assessment of Uncertainty in Seasonal Adjustment with Sampling Error Present.

BELL, W. R., and WILCOX, D. W. (1993). The effect of sampling error on the time series behavior of consumption data. Journal of Econometrics, 55, 235-265.

BINDER, D. A., and DICK, J. P. (1989). Modelling and estimation for repeated surveys. Survey Methodology, 15, 29-45.

BINDER, D. A., and DICK, J. P. (1990). A method for the analysis of seasonal ARIMA models. Survey Methodology, 16, 239-253.

BOBBITT, L., and OTTO, M. C. (1990).
Effects of forecasts on the revisions of seasonally adjusted values using the X-11 seasonal adjustment procedure. Proceedings of the Business and Economic Statistics Section, American Statistical Association, 449-454.

BOX, G. E. P., and JENKINS, G. M. (1976). Time Series Analysis: Forecasting and Control. San Francisco: Holden-Day.

BREIDT, F. J. (1992). Variance estimation in the frequency domain for seasonally adjusted time series. Proceedings of the Business and Economic Statistics Section, American Statistical Association, 337-342.

BURRIDGE, P., and WALLIS, K. F. (1984). Unobserved components models for seasonal adjustment filters. Journal of Business and Economic Statistics, 2, 350-359.

BURRIDGE, P., and WALLIS, K. F. (1985). Calculating the variance of seasonally adjusted series. Journal of the American Statistical Association, 80, 541-552.

HILLMER, S. C., BELL, W. R., and TIAO, G. C. (1983). Modeling considerations in the seasonal adjustment of economic time series. Applied Time Series Analysis of Economic Data, (Ed. A. Zellner), U.S. Department of Commerce, Bureau of the Census, 74-100.

McKENZIE, S. K. (1984). Concurrent seasonal adjustment with Census X-11. Journal of Business and Economic Statistics, 2, 235-249.

McLEOD, I. (1975). Derivation of the theoretical autocovariance function of autoregressive-moving average time series. Applied Statistics, 24, 255-256.

McLEOD, I. (1977). Correction to derivation of the theoretical autocovariance function of autoregressive-moving average time series. Applied Statistics, 26, 194.

PFEFFERMANN, D. (1994). A general method for estimating the variances of X-11 seasonally adjusted estimators. Journal of Time Series Analysis, 15, 85-116.

PFEFFERMANN, D., MORRY, M., and WONG, P. (1993). Estimation of the Variances of X-11-ARIMA Seasonally Adjusted Estimators for a Multiplicative Decomposition and Heteroscedastic Variances. Working Paper METH-93-005, Time Series Research and Analysis Division, Statistics Canada, Ottawa, Canada.

PIERCE, D.
A. (1980). Data revisions with moving average seasonal adjustment procedures. Journal of Econometrics, 14, 95-114.

PRESIDENT'S COMMITTEE TO APPRAISE EMPLOYMENT AND UNEMPLOYMENT STATISTICS (1962). Measuring Employment and Unemployment. Washington, D.C.: U.S. Government Printing Office.

SHISKIN, J., YOUNG, A. H., and MUSGRAVE, J. C. (1967). The X-11 Variant of the Census Method II Seasonal Adjustment Program. Technical Paper No. 15, U.S. Department of Commerce, Bureau of Economic Analysis.

TILLER, R. B. (1992). Time series modeling of sample survey data from the U.S. Current Population Survey. Journal of Official Statistics, 8, 149-166.

WALLIS, K. F. (1982). Seasonal adjustment and revision of current data: linear filters for the X-11 method. Journal of the Royal Statistical Society, Series A, 145, 74-85.

WILSON, G. T. (1979). Some efficient computational procedures for high order ARIMA models. Journal of Statistical Computation and Simulation, 8, 301-309.

WOLTER, K. M. (1985). Introduction to Variance Estimation. New York: Springer-Verlag.

WOLTER, K. M., and MONSOUR, N. J. (1981). On the problem of variance estimation for a deseasonalized series. In Current Topics in Survey Sampling, (Eds. D. Krewski, R. Platek, and J. N. K. Rao). New York: Academic Press, 367-403.

YOUNG, A. (1965). Estimating Trading-Day Variation in Monthly Economic Time Series. Technical Paper 12, Bureau of the Census.

YOUNG, A. H. (1968). Linear approximations to the Census and BLS seasonal adjustment methods. Journal of the American Statistical Association, 63, 445-471.

Survey Methodology, June 1999, Vol. 25, No. 1, pp. 31-41
Statistics Canada

Item Selection in the Consumer Price Index: Cut-off Versus Probability Sampling

JAN DE HAAN, EDDY OPPERDOES and CECILE M. SCHUT

ABSTRACT

Most statistical offices select the sample of commodities for which prices are collected for their Consumer Price Indexes with non-probability techniques.
In the Netherlands, and in many other countries as well, those judgemental sampling methods come close to some kind of cut-off selection, in which a large part of the population (usually the items with the lowest expenditures) is deliberately left unobserved. This method obviously yields biased price index numbers. The question arises whether probability sampling would lead to better results in terms of the mean square error. We have considered simple random sampling, stratified sampling and systematic sampling proportional to expenditure. Monte Carlo simulations using scanner data on coffee, baby's napkins and toilet paper were carried out to assess the performance of the four sampling designs. Surprisingly perhaps, cut-off selection is shown to be a successful strategy for item sampling in the consumer price index.

KEY WORDS: Laspeyres price index; Monte Carlo simulation; Sampling; Scanner data; Substitution bias.

1. INTRODUCTION

Outsiders may think that measuring inflation is an easy job: just visit shops, collect a lot of prices and average them. However, statisticians engaged in the compilation of the Consumer Price Index (CPI), which is the most widely used measure of inflation, face many theoretical and practical problems. In most countries the CPI is essentially a Laspeyres price index. This index weights the partial price indexes of the various commodities by expenditure shares that are fixed at base-period levels. Sampling procedures are needed to estimate the population value. Ideally, the mean square error of the estimator would be minimized. Even though the Laspeyres index formula is extremely simple, the estimation procedures applied to the CPI make it a rather complex statistic. Described in a stylized way, the estimation involves three different kinds of samples. A sample of households taking part in an expenditure survey is used to estimate the commodity group expenditure weights.
From each commodity group a sample of commodities (items for short) is selected. The prices of these items are collected in a sample of outlets. In this paper we focus on the sampling of items. Only a few statistical agencies, e.g., the U.S. Bureau of Labor Statistics, use probability sampling to select items to be priced. Most others, for instance Statistics Netherlands, rely on the judgements of experts working at the central office for determining which items should represent the commodity group. In the past this method could be defended by referring to the lack of appropriate sampling frames. Due to the rapidly increasing automation of the retail industry, registers of consumer goods become more and more available, and probability sampling of items comes in sight. Before changing over to a new sampling strategy, however, it seems worthwhile to experiment with alternative strategies in order to assess their impact on the accuracy of the estimated price index numbers. The question to be answered is whether current non-probabilistic selection practices perform worse, in terms of the mean square error, than probability techniques. This is the main topic of the present paper. Simulation studies were carried out for three commodity groups, i.e., coffee, disposable baby's napkins and toilet paper. Not so long ago, empirical price index number research was hampered by the fact that highly disaggregated expenditure and quantity information at the individual outlet level was lacking or at best available for small samples. Nowadays, some market research firms have managed to set up vast micro data bases on sales of consumer goods, especially in the field of fast moving consumer goods. These are derived from electronic scanning by bar-code reader or the associated bar-code typed in at the cashier's desk. Bradley, Cook, Leaver and Moulton (1997) give an overview of potential uses of scanner data in CPI construction.
Processing large scanner data bases is a rather time-consuming task. For CPI compilation as such, this could prevent an extensive use in the near future. But scanner data certainly provide a rich source of information for empirical analysis. In addition to studies into sampling, scanner data also enable us to calculate price index numbers according to different index formulas. The (fixed weight) Laspeyres price index does not take the households' reactions to relative price changes into account. We therefore examined to what extent the Laspeyres population price indexes of the three commodity groups are biased with respect to the Fisher price index, an index that does account for commodity substitution.

Jan de Haan, Eddy Opperdoes and Cecile M. Schut, Division Socio-economic Statistics, Statistics Netherlands, P.O. Box 4000, 2270 JM, Voorburg, The Netherlands, e-mail: jhhn@cbs.nl

Section 2 gives an overview of the scanner data that we used. Section 3 addresses four different commodity sampling designs. Three of these (i.e., simple random sampling, stratified sampling, and sampling proportional to size) are probability techniques, whereas the fourth (cut-off sampling) is a judgemental one that mimics official practices in the Netherlands. Section 4 describes Monte Carlo experiments we performed to determine the accuracy of the estimated commodity group price indexes under the various sampling designs mentioned. Section 5 deals with the use of Fisher indexes at the item level and the item group level, respectively. The within-group substitution bias of the Laspeyres commodity group price indexes is shown. Section 6 summarizes and discusses the findings.

2. BAR-CODE SCANNING DATA

2.1 An Overview

Scannable products are defined in Europe by the European Article Number (EAN). Manufacturers should assign one and only one EAN to every variety, size, type of packaging, etc. of a product.
This has two implications. In the first place, EANs sometimes change very rapidly, for instance because of a new packaging. Clearly, this makes it difficult to follow a specific item over time. Secondly, some EANs have negligible expenditures. It seems that the system of classification is too detailed; what is really one item has been classified as a multitude of items. In a test study using scanner data on coffee, Reinsdorf (1995) also found that "items that are, for all practical purposes, the same may occasionally have different UPC's" (the US Universal Product Code). Some aggregation over EANs is required. Fortunately, several product characteristics such as brandname and subname are included in the scanner data sets. We will treat EANs having the same product characteristics as identical items. If the number of characteristics is insufficient, there will of course be a danger of overaggregation, that is of putting heterogeneous items together. From A.C. Nielsen (Nederland) B.V. we received scanner data sets containing weekly supermarket sales on coffee, disposable baby's napkins and toilet paper. The initial data sets contained 320, 569 and 294 different EANs, respectively. For each EAN, the number of packages sold and the corresponding value is included, together with several product and outlet characteristics. Prices are not included; average prices (unit values) must be calculated from the values and quantities. The coffee data relate to sales over a period of two and a half years, beginning with week 1 of 1994 and ending in week 24 of 1996, in a sample of 20 supermarkets located in a Dutch urban area unknown to us. The data on the other two item groups refer to a sample of 149 shops spread over the whole country, and cover a period of two years, beginning with week 1 of 1995 and ending with week 52 of 1996. For reasons of convenience we deleted the minor brands.
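Since the data sets contain values and quantities but no prices, item "prices" are unit values: total sales value divided by total packages sold. A minimal sketch of that calculation (the numbers below are hypothetical, not taken from the Nielsen data, and the function name is ours):

```python
def unit_value(values, quantities):
    """Average price per package: total sales value / total packages sold."""
    return sum(values) / sum(quantities)

# Hypothetical weekly records for one item across two outlets.
base = unit_value(values=[500.0, 300.0], quantities=[100, 60])  # 5.00 per package
curr = unit_value(values=[660.0, 330.0], quantities=[110, 55])  # 6.00 per package
index = 100 * curr / base  # unit value index, base period = 100
```

Note that a unit value index mixes genuine price change with shifts in the outlet and package mix, which is why Section 5 compares it with a Fisher item index.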
In the case of coffee, only the 15 brands with the highest turnover during the entire observation period were selected from the 55 brands actually sold. After aggregating over EANs with identical product characteristics, we further limited the population to those items that were sold in the base year 1994 and every month thereafter in order to have a complete data set for each month. We ended up with a total of 68 items (excluding coffee beans), among which 40 items of ground coffee (including decaffeinated coffee) and 28 items of instant coffee. These account for 94.5% of total base year coffee expenditure in the initial data set. For napkins and toilet paper (leaving out moist toilet paper), the brands with a turnover share of less than 1% were removed. Next, only those items were selected that were sold in the base year 1995 and at least eight months thereafter. This resulted in 58 napkins items and 70 toilet paper items, accounting for 90% and 86% of total 1995 expenditure in the initial data sets.

2.2 Descriptive Statistics

The most striking feature of the item expenditures is the skewness of the distribution. Figure 1 shows the inequality of the base period expenditures in our adjusted data sets by means of so-called Lorenz curves. The vertical axis depicts the cumulative expenditure total, the horizontal axis the cumulative number of items, both expressed as percentages. The items are sorted in increasing order of expenditure. In case of equal expenditures, the Lorenz curve would lie on the diagonal. The more unequal the distribution becomes, the lower its position will be. Coffee item expenditures are distributed extremely unequally, with the three largest items accounting for over half of total base year (1994) coffee expenditure. For baby's napkins and toilet paper the largest six and eight items, respectively, account for nearly half of total base year (1995) expenditure.
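The Lorenz curves of Figure 1 plot cumulative expenditure shares against the cumulative number of items, sorted in increasing order of expenditure. A small sketch of the computation (hypothetical expenditures; the helper name is ours):

```python
def lorenz_points(expenditures):
    """Cumulative expenditure share after each item, with items sorted in
    increasing order of expenditure (the curve plotted in Figure 1)."""
    xs = sorted(expenditures)
    total = sum(xs)
    points, running = [], 0.0
    for x in xs:
        running += x
        points.append(running / total)
    return points

# Hypothetical skewed group: the largest of five items holds 60% of expenditure.
shares = lorenz_points([5, 5, 10, 20, 60])
# shares -> [0.05, 0.10, 0.20, 0.40, 1.00]
```

The farther the curve sags below the diagonal (here, 80% of the items cover only 40% of expenditure), the more unequal the distribution.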
Figure 2 shows unit value index numbers, that is the change in the value per package, irrespective of quantity, brandname, type etc., taken over all outlets. This gives a first impression of the change in "prices" during the period under study. For coffee, there was a remarkable decrease in the second half of 1995 following large price rises in 1994 due to bad harvests in Brazil. Coffee prices are largely determined by world market prices for coffee beans. We did not find evidence of significant differences in price changes between outlets. Baby's napkins differ in this respect. A heavy competition was going on between the various producers (which may have caused the decline of the unit values during 1996), while discounts and other kinds of special actions were offered frequently. Hence, the unit value taken over all items and outlets gives an inaccurate picture of the aggregate price change of baby's napkins.

Figure 1. Distribution of base period item expenditures

Figure 2. Unit value index numbers (1994=100 for coffee, 1995=100 for baby's napkins and toilet paper)

3. ESTIMATING LASPEYRES-TYPE PRICE INDEXES

We start this section by introducing some notation. Let commodity group A consist of a finite number, say N, of commodities (items); g ∈ A means that item g belongs to group A. We assume that A is fixed during time. In real life this is not true: some products disappear from the market, while new products enter. In the short run, however, the constant item group assumption seems reasonable. Note that we adjusted our initial data set accordingly. The reason behind this is that we want to concentrate solely on the sampling aspect. The Laspeyres (fixed weight) price index of commodity group A in period t is

P^t = sum_{g in A} e_g^0 P_g^t / sum_{g in A} e_g^0 = sum_{g in A} w_g^0 P_g^t,   (1)

where P_g^t denotes the price index of item g, e_g^0 the expenditure on g during base period 0 and w_g^0 the corresponding expenditure share of g within item group A.
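In code, equation (1) is simply an expenditure share weighted mean of the item price indexes. A minimal sketch with hypothetical numbers (the function name is ours):

```python
def laspeyres_group_index(base_expenditures, item_indexes):
    """Equation (1): expenditure share weighted mean of item price indexes,
    with weights fixed at base period expenditure shares."""
    total = sum(base_expenditures)
    return sum(e0 * p for e0, p in zip(base_expenditures, item_indexes)) / total

# Three hypothetical items: the dominant item (80% share) rises 10%,
# the two small items are unchanged.
p = laspeyres_group_index([80.0, 15.0, 5.0], [110.0, 100.0, 100.0])  # -> 108.0
```

The skewness documented in Section 2.2 means one or two items can dominate this weighted mean, which is what makes cut-off selection plausible.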
In the base period a sample Â with fixed size n is taken from A. Because A is supposed to be fixed during time, it seems natural to keep Â fixed as well.

3.1 Simple Random Sampling

Probability sampling refers to situations in which all possible samples have a known probability of selection. Under simple random sampling (without replacement), all possible samples have equal selection probabilities. The Horvitz-Thompson estimator P̂_A^t = (N/n) sum_{g in Â} w_g^0 P_g^t is unbiased for P^t, that is E(P̂_A^t) = P^t, where the expectation E(.) denotes the mean over all possible samples under a given sampling design, in this particular case simple random sampling. Despite its unbiasedness, P̂_A^t will not be used in practice because of two undesirable properties. Firstly, if the price indexes of all sampled items are equal, the estimated item group index differs from that value, unless the population and sample means of expenditures coincide. Price index makers probably dislike this feature. Secondly, and more importantly, P̂_A^t is bound to exhibit extraordinarily large sampling variance. To overcome both difficulties, P^t is estimated by taking unbiased estimators of the numerator and denominator:

P̂_B^t = (N/n) sum_{g in Â} e_g^0 P_g^t / (N/n) sum_{g in Â} e_g^0 = sum_{g in Â} ŵ_g^0 P_g^t,   (2)

where ŵ_g^0 is the expenditure share of item g in the sample. Using a first-order Taylor linearization (Särndal, Swensson and Wretman 1992, pp. 172-176), the variance of P̂_B^t can be written as

V(P̂_B^t) ≈ V(P̂_A^t) - (P^t)^2 (2ρ^t p^t - 1)(σ^0)^2,

where σ^0 denotes the coefficient of variation or relative standard error of the sample mean of base period expenditures, p^t is the ratio of the relative standard errors of the average base period expenditures expressed in prices of period t and 0, and ρ^t is the correlation coefficient between the average base period expenditures in prices of t and 0 (which is expected to have a positive sign). The choice for P̂_B^t instead of P̂_A^t can thus be elucidated by the fact that the former exploits the panel character of the sample; with ρ^t > 1/(2p^t), a substantial reduction in variance is expected. An alternative expression for the variance of P̂_B^t is:

V(P̂_B^t) ≈ ((1 - f)/n) (N^2/(N - 1)) sum_{g in A} (w_g^0)^2 (P_g^t - P^t)^2,   (3)

which can be estimated using sample data provided that the sampling fraction f = n/N is known. This formula, earlier mentioned by Balk (1989), shows that the variance depends on the within-group dispersion of the item price indexes. Hence, the variance could be lowered either by constructing item groups made up of items having similar price changes or by enlarging the sample. Särndal et al. (1992, p. 176) caution that "the Taylor linearization method has a tendency to lead to underestimated variances in not so large samples". The CPI item samples are generally quite small. For some item groups there may even be only one or two representative items. Thus, besides being unstable (having a large variance itself), the variance will probably also be underestimated when based on (3). We note that estimator P̂_B^t, being a ratio, suffers from small sample bias of approximately O(1/n). It can easily be verified that its absolute value |B(P̂_B^t)| ≤ σ^0 √V(P̂_B^t). If σ^0 is small, say less than 0.1, the bias of P̂_B^t may safely be regarded as negligible in relation to its standard error. However, with a small item sample and a large variability of base period expenditures, σ^0 could easily exceed 0.1 by far. We add that the all items CPI is unlikely to be biased to a large extent on this account, since the bias is a (weighted) average of positive and negative biases of the various item group indexes.

3.2 Sampling Proportional to Size

Sampling proportional to size has the advantage that the most important items have a big chance of being sampled. We will restrict ourselves to fixed size sampling without replacement, since this seems most likely to be chosen in case of item sampling proportional to size (see for example the Swedish case described by Dalen and Ohlsson 1995). Base period expenditure acts as our measure of size, and the required first-order inclusion probability for item g is π_g = n e_g^0 / e^0 = n w_g^0, where e^0 = sum_{g in A} e_g^0. It follows that sum_{g in Â} P_g^t / n is an unbiased estimator of P^t. Sampling proportional to size without replacement, combined with the Horvitz-Thompson or π estimator, is sometimes called πps sampling. Most existing schemes for fixed-size πps sampling are draw-sequential and rather complicated. We will therefore use systematic πps selection instead. This scheme can be described by imagining the expenditures e_g^0 (g ∈ A) as cumulatively laid out on a horizontal axis, starting at the origin and ending at e^0. A real number is randomly chosen in the interval (0, e^0/n], and we proceed systematically by taking the items g identified by points at the constant distance e^0/n apart. This method yields exactly the desired sample size. For commodity groups with large variation in base period expenditures, it may not always be possible to select an item sample strictly proportional to expenditure. Obviously, π_g ≤ 1 must be satisfied for all g. If n > 1 and some e_g^0 values are extremely large, it may be true for some items that n e_g^0 / e^0 > 1, contradicting the requirement π_g ≤ 1. The conflict will be dealt with as follows. The N items are ordered according to descending expenditures. First, if e_1^0 > e^0/n, we set π_1 = 1. Next, if e_2^0 > (e^0 - e_1^0)/(n - 1), we also set π_2 = 1. The procedure is repeated until the requirement for sampling proportional to base period expenditure is met for all remaining items. Our recursive approach differs somewhat from the method proposed by Särndal et al. (1992, p. 90). They suggest to set π_g = 1 for all g with e_g^0 > e^0/n. In our data sets this would lead to unnecessarily large numbers of items with π_g = 1. The subgroup A_H of items with the highest base period expenditures which is selected with certainty will be called the self-selecting part of the sample. From the remaining low-expenditure subgroup A_L a sample Â_L with size n_L is drawn strictly proportional to expenditure. The resulting estimator is an expenditure weighted average of P^t(H), the population Laspeyres price index of A_H, and the estimated price index of A_L.

3.3 Stratified Sampling

The obvious advantage of simple random sampling as opposed to sampling proportional to expenditure is that, apart from a register of items serving as a sampling frame, no other data are required. See also Balk (1994). With very unequally distributed item expenditures chances are big that the market leaders fall outside the sample, a situation that seems intuitively unappealing. We will argue that it would indeed be preferable if they were selected. Recall that the variance of the item group price index under simple random sampling depends on the within-group dispersion of the item price indexes. A variance reduction could be achieved if it were possible to stratify the item group into homogeneous subgroups according to their price changes. However, a priori knowledge of item price changes is not available. Another way to lower the variance might be to stratify the item group into two subgroups, one (A_H) with high base period expenditures which is observed entirely and the other one (A_L) with low expenditures from which a random sample Â_L is taken. The new item group price index estimator is an expenditure weighted average of P̂_B^t(L), the Laspeyres index of the low-expenditure subgroup, estimated in accordance with (2), and P^t(H). Its sampling variance is (1 - τ_H)^2 V[P̂_B^t(L)], where τ_H is the expenditure share of A_H within A.
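The systematic πps scheme of Section 3.2, including the recursive take-all rule for items whose size would imply an inclusion probability above 1, can be sketched as follows. This is our own illustrative implementation of the scheme described in the text, not the authors' production code:

```python
import random

def systematic_pps(expenditures, n, rng=random):
    """Systematic selection proportional to base period expenditure with the
    recursive take-all rule of Section 3.2: the largest remaining item is
    selected with certainty whenever its implied inclusion probability
    exceeds 1, and the rest are drawn systematically."""
    order = sorted(range(len(expenditures)), key=lambda g: -expenditures[g])
    take_all, rest = [], list(order)
    # Recursive rule: peel off items g with n_rest * e_g > e_rest.
    while rest:
        e_rest = sum(expenditures[g] for g in rest)
        n_rest = n - len(take_all)
        if n_rest <= 0 or n_rest * expenditures[rest[0]] <= e_rest:
            break
        take_all.append(rest.pop(0))
    # Systematic pps draw from the remaining low-expenditure items:
    # lay expenditures on an axis, pick a random start in (0, step],
    # then take points a constant distance `step` apart.
    sample = list(take_all)
    n_rest = n - len(take_all)
    if n_rest > 0:
        step = sum(expenditures[g] for g in rest) / n_rest
        point = rng.uniform(0, step)
        cum = 0.0
        for g in rest:
            cum += expenditures[g]
            while point < cum:
                sample.append(g)
                point += step
    return sample
```

With expenditures [60, 10, 10, 10, 10] and n = 3, the dominant item is self-selecting (3 × 60 > 100) and two of the four equal small items are drawn systematically, so the sample always has size 3.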
This method does not necessarily reduce the variance of the estimated price index, but it is likely to do so under certain conditions. The variance of the new estimator will be smaller than the variance of P̂_B^t when

1 - τ_H < se(P̂_B^t) / se[P̂_B^t(L)],   (4)

where se(.) denotes standard error. Inequality (4) is expected to hold if the item expenditures are distributed extremely unequally, since 1 - τ_H will then become much smaller than 1. Stratification may be especially productive as the overall sample size n increases. The choice of τ_H and thus of the size N_H of the "take-all" stratum A_H is a bit of a problem. Preferably we would have some optimality criterion in order to minimize the variance. But since a priori knowledge of item price changes is lacking and past trends do not forecast future price changes very accurately, the optimal size of A_H can hardly be computed in practice. In the empirical analysis we will try two different relative sample sizes λ_H = N_H/n of A_H, namely λ_H = 1/3 and λ_H = 2/3. These values suffice to give a clear indication of the performance.

3.4 Cut-off Sampling

When the sample size is very small it seems rather likely that stratification with λ_H = 2/3 leads to a larger standard error of the estimated price index than with λ_H = 1/3. But what happens if A_L is not observed at all, so that λ_H = 1 and thus n = N_H? We would then be using (a special type of) cut-off sampling. The item group price index is estimated simply by P̂_C^t = P^t(H). All g ∈ A_H now have an inclusion probability of 1, whereas all g ∈ A_L have zero inclusion probability (Särndal et al. 1992, pp. 531-533). Since we know exactly which items will be selected there is no randomness involved and the sampling variance of P̂_C^t is zero by definition. The bias equals the actual error, i.e., the difference between the estimated value and the true population index:

P̂_C^t - P^t = (1 - τ_H)[P^t(H) - P^t(L)].
(5)

With an extremely unequal distribution of item expenditures, even a small sample size would cause a large value for τ_H. In that case cut-off estimation may outperform stratification, in terms of the mean square error. We may either fix the cut-off rate τ_H, so that the sample size n is determined by τ_H, or fix the sample size, in which case τ_H depends on the choice of n. The latter option was chosen by us since fixed size sampling designs are common practice in selecting CPI items, and because this allows a suitable comparison with other fixed size designs. The use of cut-off procedures can be justified on the grounds that i) the costs prohibit the construction of a reliable sampling frame for the whole population, and ii) the bias is deemed negligible. Assumption ii) cannot be verified in general, of course. The deliberate exclusion of part of the target population from sample selection may nevertheless give satisfactory results when appropriate corrections are made. Statistics Netherlands makes use of cut-off sampling in various other business surveys, for instance in production and foreign trade statistics where very small enterprises are left unobserved. In the Dutch National Accounts, that use production and foreign trade data as important inputs, explicit estimates are being made for small firms. The cut-off method for CPI item selection, on the other hand, does not correct for the excluded items. In addition to cost considerations, this method is sometimes defended by the belief that, at least in the longer run, the price changes of less important items will not differ much from those of the market leaders within the same product category because of similar production cost structures.

4. EMPIRICAL ESTIMATION

4.1 Monte Carlo Simulation

With the exception of cut-off selection it is difficult to find reliable measures of the sampling distributions based on a single sample. Under simple random sampling the estimator P̂_B^t
has an unknown bias, whereas variance estimation based on Taylor linearization techniques gives inaccurate results because CPI item samples are generally very small. Systematic πps sampling raises the question of how to estimate the variance of the estimator, since the second-order inclusion probabilities are unknown. To obtain the exact sampling distribution we would have to consider all samples Â that are possible under a certain sampling design. For every Â the probability of drawing Â and the estimated value of the commodity group price index must be known in order to calculate the exact values of the expected value, the bias and the variance of the estimator. This is virtually impossible because of the extremely large number of possible samples. To describe the sampling distribution, we will therefore carry out Monte Carlo simulations. A large number of samples, say K, is drawn from the (same) population A according to the given design and for each sample the estimate is calculated. If K is large enough, the distribution of the K estimates will closely approximate the exact sampling distribution. Let P̂_k^t denote the result for the k-th sample under a certain sampling design. Then

P̄^t = (1/K) sum_{k=1}^{K} P̂_k^t

is an unbiased estimate of the expected value E(P̂^t). We will calculate P̄^t - P^t, which is an unbiased estimate of the bias B(P̂^t);

s^2 = (1/(K - 1)) sum_{k=1}^{K} (P̂_k^t - P̄^t)^2,

which is an unbiased estimate of the variance V(P̂^t); and

rmse = sqrt((P̄^t - P^t)^2 + s^2),

which is an approximately unbiased estimate of the root mean square error (rmse) of P̂^t. Särndal et al. (1992, p. 280) remark that "the imperfection caused by the finite number of repetitions is more keenly felt in the case of a variance measure... than in the case of measures calculated as means".

4.2 Results

Monte Carlo simulations were carried out with three different sample sizes: n=3, n=6 and n=12. The number of repetitions (K) per experiment was set to 500,000. Table 1 shows the results for coffee in January 1995 (1994=100), tables 2 and 3 those for baby's napkins and toilet paper, respectively, in January 1996 (1995=100). The choice of the formula with which individual price observations are aggregated into a single item price index is discussed in the Appendix. Throughout this section all item price indexes are calculated as unit value indexes over all outlets.

Simple random sampling performs particularly badly. For example, with n=3 the true (Laspeyres) coffee price increase of 17.2% is understated by 1.4%-points. Together with a standard error of 5.1%-points, the rmse amounts to 5.3%-points, that is almost one third of the true price increase. Even with n=12, so that the sampling fraction is 0.18 (which would be unusually large in practical situations), the rmse still remains considerably high. Notice that, as expected, the small sample bias is halved when the sample size is doubled. Stratification works reasonably well with larger sample sizes but gives disappointing results with n=3. In the latter case, stratification increases the rmse compared to simple random sampling for baby's napkins and toilet paper when N_H = 2 (that is, when λ_H = 2/3). Our favourite probabilistic design would be systematic sampling proportional to expenditure because the estimates are unbiased and their standard errors relatively low. The most surprising finding perhaps is the good performance of cut-off selection. Except for n=3 and n=6 in case of baby's napkins, this method produces the best results.

Table 1. Estimated Laspeyres Price Index Numbers for Coffee (1994=100), January 1995 (N=68)

Sampling scheme          exp. value    se    bias   rmse
n=3
  S.R. *)                   115.7     5.1   -1.4    5.3
  πps                       117.2     2.2    0      2.2
  Stratified λ_H = 1/3      116.4     3.9   -0.7    4.0
  Stratified λ_H = 2/3      115.6     4.5   -1.5    4.7
  Cut-off                   117.0     0     -0.2    0.2
n=6
  S.R.                      116.4     3.4   -0.7    3.5
  πps                       117.2     1.3    0      1.3
  Stratified λ_H = 1/3      116.6     2.3   -0.5    2.3
  Stratified λ_H = 2/3      116.4     2.5   -0.7    2.6
  Cut-off                   117.0     0     -0.2    0.2
n=12
  S.R.                      116.7     2.3   -0.4    2.3
  πps                       117.2     0.7    0      0.7
  Stratified λ_H = 1/3      117.0     1.2   -0.1    1.2
  Stratified λ_H = 2/3      117.0     1.1   -0.2    1.1
  Cut-off                   117.5     0      0.3    0.3
*) Simple random.

Table 2. Estimated Laspeyres Price Index Numbers for Baby's Napkins (1995=100), January 1996 (N=58)

Sampling scheme          exp. value    se    bias   rmse
n=3
  S.R.                       99.4     5.0    2.3    5.5
  πps                        97.2     2.8    0      2.8
  Stratified λ_H = 1/3       98.9     5.0    1.8    5.3
  Stratified λ_H = 2/3       98.3     5.8    1.1    5.9
  Cut-off                    92.0     0     -5.1    5.1
n=6
  S.R.                       98.7     3.9    1.5    4.2
  πps                        97.2     1.6    0      1.6
  Stratified λ_H = 1/3       98.1     3.3    1.0    3.4
  Stratified λ_H = 2/3       97.4     3.3    0.3    3.3
  Cut-off                    93.4     0     -3.8    3.8
n=12
  S.R.                       97.9     2.9    0.8    3.0
  πps                        97.2     1.5    0      1.5
  Stratified λ_H = 1/3       97.4     1.7    0.2    1.7
  Stratified λ_H = 2/3       97.0     1.6   -0.2    1.6
  Cut-off                    95.5     0     -1.6    1.6

Table 3. Estimated Laspeyres Price Index Numbers for Toilet Paper (1995=100), January 1996 (N=70)

Sampling scheme          exp. value    se    bias   rmse
n=3
  S.R.                      103.9     4.5    0.1    4.5
  πps                       103.9     3.4    0      3.4
  Stratified λ_H = 1/3      103.5     4.3   -0.3    4.3
  Stratified λ_H = 2/3      103.7     4.6   -0.2    4.6
  Cut-off                   105.0     0      1.1    1.1
n=6
  S.R.                      103.9     3.5   -0.1    3.5
  πps                       103.9     1.8    0      1.8
  Stratified λ_H = 1/3      103.7     3.2   -0.1    3.2
  Stratified λ_H = 2/3      104.2     3.4    0.4    3.4
  Cut-off                   104.0     0      0.1    0.1
n=12
  S.R.                      103.9     2.6    0.1    2.6
  πps                       103.9     1.2    0      1.2
  Stratified λ_H = 1/3      104.0     2.1    0.1    2.1
  Stratified λ_H = 2/3      103.9     1.6    0.0    1.6
  Cut-off                   104.0     0      0.1    0.1

For coffee we also tried another form of stratified sampling. The entire population of items was subdivided into ground coffee and instant coffee, and we took random samples from each stratum. Although the price changes of instant coffee are smoothed and lag behind as compared to ground coffee, Monte Carlo results using stratified sampling were similar to those using unstratified sampling for all four sampling methods. This contradicts earlier findings (see De Haan and Opperdoes 1997a).
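The Monte Carlo machinery of Section 4.1 is easy to replicate on a small scale. The sketch below estimates the expected value, bias, standard error and rmse of estimator (2) under simple random sampling; the data and replication counts are hypothetical and far smaller than the paper's K = 500,000, and the function names are ours:

```python
import random

def ratio_estimate(e0, p, sample):
    """Estimator (2): base expenditure weighted mean of the sampled item indexes."""
    return sum(e0[g] * p[g] for g in sample) / sum(e0[g] for g in sample)

def monte_carlo_srs(e0, p, n, K, seed=1):
    """Section 4.1 machinery for simple random sampling: draw K samples of
    size n, then return the expected value, bias, se and rmse of (2)."""
    rng = random.Random(seed)
    N = len(e0)
    true_index = ratio_estimate(e0, p, range(N))  # population Laspeyres index
    draws = [ratio_estimate(e0, p, rng.sample(range(N), n)) for _ in range(K)]
    exp_value = sum(draws) / K
    bias = exp_value - true_index
    var = sum((d - exp_value) ** 2 for d in draws) / (K - 1)
    return exp_value, bias, var ** 0.5, (bias ** 2 + var) ** 0.5
```

Replacing `rng.sample` with another selection routine (stratified, systematic πps, or the deterministic cut-off set) reproduces the comparison underlying Tables 1-3.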
The reason is that we deleted some instant coffee items for this study to have a complete data set for each month, and ended up with a minor fraction (8%) of instant coffee in total base year coffee expenditures. It would be hazardous to draw conclusions about the performance of the various sampling designs based on simulations for one particular month, since it is likely that the outcomes depend on the frequency distribution of the item price indexes. Figure 3 shows these distributions for coffee and baby's napkins in two months. Both distributions move to the left, indicating that the unweighted mean has declined. Apart from that, the frequency distribution for coffee remains quite stable. The shape of the curve for napkins, on the other hand, changes dramatically; the variance of the item indexes has grown. Monte Carlo experiments were run for each month of the period under study. Figure 4 shows the rmse with n=3. The pattern that emerges for coffee and toilet paper is surprisingly robust: cut-off selection always comes out as best. Apparently, if sample sizes are small, the exclusion of the smaller items does not seem to matter much. This is what many statistical offices have been appreciating for a long time, without being able to test it empirically before. The reason why cut-off selection performs better than sampling proportional to expenditure is in case of toilet paper partly caused by the fact that there is no self-selecting part under the latter sampling scheme. With larger samples the results under cut-off selection and sampling proportional to size are very much alike. For baby's napkins the outcomes differ somewhat. Because of the high volatility of the item indexes, the rmse under cut-off selection varies considerably; it seems to meander around the rmse resulting from systematic sampling proportional to expenditure. The high variability of the error can be considered a drawback of cut-off selection.
Figure 3. Frequency distribution of item price index numbers (panels: Coffee, 1994=100; Baby's napkins, 1995=100)

Figure 4. Rmse of estimated Laspeyres price indexes (n=3) (panels: Coffee, Baby's Napkins, Toilet paper; designs: SRS, cut-off, πps, stratified N_H=1, stratified N_H=2)

5. THE USE OF FISHER INDEXES

5.1 Unit Value Versus Fisher Item Indexes

In section 4 the item price indexes were calculated as unit value indexes over all outlets. To assess the impact of the choice of the item index formula on the outcomes of the simulation study, Table 4 compares Monte Carlo results with n=3 based on unit value item index numbers (as in tables 1-3) and Fisher item price index numbers; see the Appendix for details. For coffee, we notice hardly any differences. For napkins and toilet paper, on the other hand, the rmse decreases when Fisher index numbers are used instead, especially in case of simple random sampling. This is caused by the fact that unit value indexes tend to show a more erratic pattern. If physically identical types of napkins or toilet paper are deemed heterogeneous across outlets, so that the Fisher formula would be more appropriate, the use of unit value indexes overstates the price variability of particularly small items and exaggerates the poor performance of simple random sampling. Nevertheless, we would still have to conclude that simple random sampling does not work very well.

5.2 Within-group Substitution Bias

The Fisher index is one of the best-known superlative indexes.
When applied to the item group level, the difference between the population price indexes calculated according to the Laspeyres and the Fisher formula can be interpreted as within-group item substitution bias (Figure 5). For coffee it is less than 1%-point per year. For toilet paper and particularly for napkins the biases are very large, about 1.5-3%-points per year. Within-group substitution bias is generally positive and increases over time. Notice, however, that for baby's napkins in a few months of the first half of 1996 the Laspeyres index numbers are lower than the Fisher index numbers. This unexpected effect, and possibly also the large magnitude of the positive bias in other months, may be due to a deficiency of the data set, which only contains supermarkets. It is well-known that baby's napkins are bought in the Netherlands also in other kinds of shops such as drugstores that do not make use of bar-code scanning. Substitution between the included and excluded outlets in the data base may damage our population index numbers as accurate approximations of the true values. We are convinced though that it does not seriously affect the assessment of the sampling methods presented in section 4. Many statistical agencies and users are of the opinion that the CPI should be an approximation to the true cost of living index. This theoretical concept is derived from microeconomics and measures the change in the minimum costs for a representative consumer, or household, necessary to retain the same standard of living or utility. Since utility cannot be measured, a feasible index formula should be chosen that closely approximates the concept. Diewert (1976) showed that (what he calls) superlative indexes provide second order approximations to the cost of living index. The most important feature of superlative price indexes is that they take account of consumers' substitution towards goods and services exhibiting relatively small price increases.
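The Laspeyres, Paasche and Fisher formulas can be compared directly on a toy example (hypothetical prices and quantities, not from the scanner data) in which consumers substitute away from the item whose price doubled; the fixed-weight Laspeyres index then overstates the Fisher measure:

```python
def laspeyres(p0, pt, q0):
    """Laspeyres: current prices valued at base period quantities."""
    return sum(b * q for b, q in zip(pt, q0)) / sum(a * q for a, q in zip(p0, q0))

def paasche(p0, pt, qt):
    """Paasche: price change valued at current period quantities."""
    return sum(b * q for b, q in zip(pt, qt)) / sum(a * q for a, q in zip(p0, qt))

def fisher(p0, pt, q0, qt):
    """Fisher: geometric mean of the Laspeyres and Paasche indexes."""
    return (laspeyres(p0, pt, q0) * paasche(p0, pt, qt)) ** 0.5

# Hypothetical two-item group: item 1 doubles in price and consumers
# substitute towards item 2, whose price is unchanged.
p0, pt = [1.0, 1.0], [2.0, 1.0]
q0, qt = [50.0, 50.0], [20.0, 80.0]
L = laspeyres(p0, pt, q0)   # 1.50
F = fisher(p0, pt, q0, qt)  # about 1.34: Laspeyres overstates the price change
```

Because the Paasche leg needs current period quantities, the Fisher index can only be compiled once the expenditure data arrive, which is the timeliness problem noted below.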
These index formulas make use of expenditure data relating to both the base period 0 and the current period t. In practice it takes some time before expenditure data are known, so that superlative indexes cannot be compiled in real time. For the sake of timeliness most national statistical offices adopt the Laspeyres (fixed weight) formula for constructing their CPIs.

Table 4
Estimated Laspeyres Price Index Numbers Using Alternative Item Indexes (n=3)

                        Coffee, Jan. 1995      Napkins, Jan. 1996     Toilet paper, Jan. 1996
                        (1994=100)             (1995=100)             (1995=100)
Sampling scheme          (1)         (2)        (1)         (2)        (1)         (2)
                        exp.  rmse  exp.  rmse exp.  rmse  exp.  rmse exp.  rmse  exp.  rmse
S.R.S.                  115.7  5.3 115.8  5.3  99.4  5.5  100.4  4.2 103.9  4.5  104.0  3.5
πps                     117.2  2.2 117.2  2.2  97.2  2.8   98.6  2.1 103.9  3.4  104.3  3.6
Stratified, λ_H = 1/3   116.4  4.0 116.5  4.0  98.9  5.3  100.1  4.1 103.5  4.3  103.3  3.6
Stratified, λ_H = 2/3   115.6  4.7 115.6  4.8  98.3  5.9   99.5  4.6 103.7  4.6  103.2  3.9
Cut-off                 117.0  0.2 117.0  0.2  92.0  5.1   94.8  3.8 105.5  1.1  104.7  1.0

(1) Based on unit value item index numbers
(2) Based on Fisher item index numbers

[Figure 5. Difference between Laspeyres and Fisher population price index numbers, monthly series for toilet paper, baby's napkins and coffee]

6. DISCUSSION

Although bar-code scanning data have some deficiencies, they provide an excellent opportunity to undertake empirical research into various sampling issues concerning CPIs. Our simulations show that, for coffee, disposable baby's napkins and toilet paper at least, simple random sampling of items should be advised against. We believe that this recommendation can be extended to all item groups where the distribution of expenditures is very skewed.
If statistical offices want to apply probability sampling, they would do a better job using sampling proportional to expenditure. However, cut-off selection might be a good or even better alternative for those item groups where the various item price changes are not too volatile. As a matter of fact, as far as we are aware this is the first study to supply empirical evidence in support of cut-off CPI item selection methods. Aggregated scanner data - that is, scanner data aggregated over outlets - should give a clear indication of the required cut-off rate. Statistics Netherlands already made use of aggregated Nielsen data on a range of commodity groups in the past in order to select items for the CPI sample.

Cut-off methods are applied extensively in the Netherlands and many other European countries (Boon 1997). In the Netherlands the actual item selection is a little more complex than the situation described above. First, a number of item subgroups instead of specific items are chosen using the cut-off method. Next, a number of specific items are selected from each subgroup through so-called judgemental sampling. The selection of these representative items is based on the judgement of experts working at the central office, who should have a firm knowledge of the consumer market in question. Usually the most frequently bought items or those with the highest turnover will be selected, so that the entire sampling scheme is in effect a two-stage cut-off procedure. It is unlikely that such a two-stage method would yield results much different from the single-stage procedure we have used in this paper.

In some other European countries, e.g., the United Kingdom, cut-off selection does not take place at the central office but is done by field staff at the outlets where prices are measured. To illustrate this method, we choose one item per outlet, namely the item with the highest base period sales in the outlet.
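A central-office cut-off selection of the kind described above can be sketched as follows (hypothetical item shares and helper names of our own; the paper's selections use the actual scanner expenditure data):

```python
# Cut-off selection sketch: rank items by base-period expenditure share and
# retain the top n; the covered share indicates the implied cut-off rate.
def cutoff_select(shares, n):
    """shares: dict mapping item -> base-period expenditure share."""
    ranked = sorted(shares, key=shares.get, reverse=True)
    return ranked[:n]

shares = {"A": 0.40, "B": 0.25, "C": 0.15, "D": 0.12, "E": 0.08}
sample = cutoff_select(shares, 3)           # the three largest items
covered = sum(shares[i] for i in sample)    # expenditure share covered
```

For a skewed expenditure distribution like this one, three items already cover 80% of expenditure, which is why cut-off selection can perform well when item price changes are not too volatile.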
For coffee, baby's napkins and toilet paper this yields 2, 12 and 24 different items, respectively. The Laspeyres item group index is estimated in accordance with expression (2), where the item price indexes are calculated as outlet-specific unit value indexes and weighted by outlet-specific weights. Figure 6 shows the rmse resulting from this method. If we compare this with Figure 4 (cut-off selection done at the central office for n=3), the accuracy of both cut-off selection methods seems "on average" to be of the same order of magnitude, although the pattern is slightly more erratic under selection at the outlets. But such a comparison is quite arbitrary. Why not compare cut-off selection at the outlets with cut-off selection at the central office for n=6, or n=12, or indeed for any other sample size? Another problem is that we treated the item price indexes as if they were known with certainty. In reality they will be based on a sample of outlets, so that our results are conditional on this sample. For a proper assessment of both cut-off selection procedures we need to take both the sampling of items and the sampling of outlets into account. However, that is beyond the scope of this paper.

Scanner data not only offer challenging perspectives for statistical research in the field of CPI sampling issues, they also enable us to compile all sorts of index numbers, including superlative indexes, using real and highly disaggregated data at the individual outlet level. We demonstrated that the Laspeyres item group price indexes used by statistical agencies can be biased by more than +1%-point on a yearly basis with respect to the (superlative) Fisher price index that accounts for item substitution. A related type of bias, caused by neglecting products that are introduced after the base period (see e.g., Boskin, Dulberger, Gordon, Griliches and Jorgenson 1996), was not addressed by us. Scanner data do provide a good opportunity to investigate this new goods bias.
[Figure 6. Rmse resulting from cut-off selection at the outlets]

ACKNOWLEDGEMENTS

This research was partially supported by Eurostat (the Statistical Office of the European Communities) under SUP-COM 1996, Lot 1: Development of methodologies in consumer price indices and purchasing power parities. The authors are grateful to A.C. Nielsen (Nederland) B.V. for providing scanner data at marginal costs. They also wish to thank Bert M. Balk, Leendert Hoven and two anonymous referees for helpful comments on an earlier draft.

APPENDIX: THE CHOICE OF THE ITEM INDEX FORMULA

To perform sampling simulations we need item index numbers. What index formula should be chosen? Statistical offices are generally forced to calculate indexes at the lowest level of aggregation based on price data alone because quantity or expenditure data are lacking. See Szulc (1987), Dalen (1991), Balk (1994), and Diewert (1995) for a comprehensive treatment of this subject. With scanner microdata at hand, we are in the unique position to construct genuine price indexes (Silver 1995, Hawkes 1997).

Consider a set of outlets B_g, assumed fixed over time, where item g can be bought; b ∈ B_g means that g can be bought in outlet b. The price of g at outlet b in period s (s = 0, t) and the corresponding quantity sold are denoted p_{gb}^s and x_{gb}^s, respectively. The item will be taken as the lowest aggregation level where price indexes are constructed. As a start we restrict ourselves to item indexes that can be written as ratios of weighted arithmetic mean prices in period t and period 0:

P_g^{0t} = (Σ_{b∈B_g} w_{gb}^s p_{gb}^t) / (Σ_{b∈B_g} w_{gb}^u p_{gb}^0),  (6)

where w_{gb}^z = x_{gb}^z / Σ_{b∈B_g} x_{gb}^z denotes the share of outlet b in the total quantity sold of item g in period z (z = s, u). If u = 0 and s = t, the prices in period 0 and period t are weighted by the corresponding relative quantities. The average prices are then called unit values, and P_g^{0t} is a unit value index.
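As a small illustration of expression (6) with u = 0 and s = t, the sketch below computes a unit value index for one item sold in two outlets (toy numbers and a helper name of our own, not the Nielsen data):

```python
# Toy sketch of a unit value index (expression (6) with u = 0, s = t):
# compute the average transaction price per period, then take their ratio.
def unit_value_index(p0, x0, pt, xt):
    """p: prices per outlet; x: quantities sold per outlet."""
    uv0 = sum(p * x for p, x in zip(p0, x0)) / sum(x0)  # base-period unit value
    uvt = sum(p * x for p, x in zip(pt, xt)) / sum(xt)  # current-period unit value
    return 100 * uvt / uv0

p0, x0 = [2.00, 2.20], [50, 50]   # two outlets, base period
pt, xt = [2.10, 2.30], [80, 20]   # prices up ~5%; sales shift to the cheap outlet
idx = unit_value_index(p0, x0, pt, xt)
# Both outlet prices rose by roughly 5%, yet the index stays below 102 because
# the quantity shift toward the cheaper outlet lowers the average price paid.
```

This is exactly the sensitivity to quantity shifts across outlets that makes unit value indexes more erratic than Fisher item indexes when items are heterogeneous across outlets.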
De Haan and Opperdoes (1997b) and Balk (1998) discuss its merits. Adding up quantities makes sense only if item g can be conceived of as being homogeneous, that is, identical across all b ∈ B_g. Unit values then yield the appropriate average transaction prices and the unit value index is the appropriate item price index. The problem, of course, is to define homogeneity. It can be argued that physically identical products sold in different outlets are not identical items because of the different services that accompany the transactions, so that homogeneity across outlets never occurs. Another index formula should then be chosen. If u = s in expression (6), P_g^{0t} can be called a fixed quantity price index with u acting as the quantity reference period. For u = s = 0, P_g^{0t} turns into the Laspeyres price index, and for u = s = t, P_g^{0t} is the Paasche price index. On theoretical grounds we cannot favour either one. For reasons of symmetry it seems natural to take the (unweighted) geometric average of the Paasche and the Laspeyres index, which is the Fisher (ideal) price index.

REFERENCES

BALK, B.M. (1989). On Calculating the Precision of Consumer Price Indices. Report, Department of Price Statistics, Statistics Netherlands, Voorburg.

BALK, B.M. (1994). On the first step in the calculation of a consumer price index. Proceedings of the First International Conference on Price Indices, Statistics Canada, Ottawa.

BALK, B.M. (1998). On the use of unit value indices as consumer price subindices. Proceedings of the Fourth International Conference on Price Indices, U.S. Bureau of Labor Statistics, Washington, D.C.

BOON, M. (1997). Sampling Designs in Constructing Consumer Price Indices: Current Practices at Statistical Offices. Research Paper no. 9717, Research and Development Division, Statistics Netherlands, Voorburg.

BOSKIN, M.J., DULBERGER, E.R., GORDON, R.J., GRILICHES, Z., and JORGENSON, D. (1996). Toward a More Accurate Measure of the Cost of Living. Final report to the U.S.
Senate Finance Committee, Washington, D.C.

BRADLEY, R., COOK, B., LEAVER, S.G., and MOULTON, B.R. (1997). An overview of research on potential uses of scanner data in the U.S. CPI. Proceedings of the Third International Conference on Price Indices, Statistics Netherlands, Voorburg.

DALEN, J. (1991). Computing elementary aggregates in the Swedish consumer price index. Journal of Official Statistics, 8, 129-147.

DALEN, J., and OHLSSON, E. (1995). Variance estimation in the Swedish consumer price index. Journal of Business & Economic Statistics, 13, 347-356.

DIEWERT, W.E. (1976). Exact and superlative index numbers. Journal of Econometrics, 4, 115-145.

DIEWERT, W.E. (1995). Axiomatic and Economic Approaches to Elementary Price Indexes. Working Paper no. 5104, National Bureau of Economic Research, Cambridge.

HAAN, J. de, and OPPERDOES, E. (1997a). Estimation of the Coffee Price Index Using Scanner Data: The Sampling of Commodities. Research Paper, Socio-economic Statistics Division, Statistics Netherlands, Voorburg.

HAAN, J. de, and OPPERDOES, E. (1997b). Estimation of the coffee price index using scanner data: The choice of the micro index. Proceedings of the Third International Conference on Price Indices, Statistics Netherlands, Voorburg.

HAWKES, W.J. (1997). Reconciliation of consumer price index trends with corresponding trends in average prices for quasi-homogeneous goods using scanning data. Proceedings of the Third International Conference on Price Indices, Statistics Netherlands, Voorburg.

REINSDORF, M. (1995). Constructing Basic Component Indexes for the U.S. CPI from Scanner Data: A Test Using Data on Coffee. Presented at the NBER Conference on Productivity, Cambridge, Mass., July 17, 1995.

SÄRNDAL, C.-E., SWENSSON, B., and WRETMAN, J. (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.

SILVER, M. (1995). Elementary aggregates, micro-indices and scanner data: some issues in the compilation of consumer price indices.
Review of Income and Wealth, 41, 427-438. SZULC, B.J. (1987). Price indices below the basic aggregation level. ILO Bulletin of Labour Statistics, Reprinted in Turvey (1989). TURVEY, R. (1989). Consumer Price Indices: An ILO Manual. Geneva: International Labour Office. 43 Survey Methodology, June 1999 Vol. 25, No. 1, pp. 43-56 Statistics Canada Robust Calibration Estimators PIERRE DUCHESNE' ABSTRACT We consider the use of calibration estimators when outiiers occur. An extension is obtained for the class of Deville and Samdal (1992) calibration estimators based on Wright (1983) QR estimators. It is also obtained by minimizing a general metric subject to constraints on the calibration variables and weights. As an apptication, this class of estimators helps us consider robust calibration estimators by choosing parameters carefully. This makes it possible, e.g., for cosmetic reasons, to limit robust weights to a predetermined interval. The use of robust estimators with a high breakdown point is also considered. In the specific case of the mean square metric, the estimator proposed by the author is a generalization of a Lee (1991) proposition. The new methodology is illustrated by means of a short simulation study. KEY WORDS: Calibration estimator; Regression estimator; Range restrictions; Robustness. 1. INTRODUCTION is a notation signifying X^E/t)- Let us assume a positive variable of interest y and an asymmetric population. As the The problem of outliers is an important one in all HT estimator is a mean weighted by the d^, it is vulnerable branches of statistics. In sampUng theory, the background to large values of y. A unit with a high weight d,^ may also is different from that of parametric statistics since the have a considerable impact on the estimation step by objective is often to estimate the total of a variable of including variable estimates. Lee (1995) defined these units interest y. An outlier may have its full weight within the as influential. 
An extreme unit is not necessiuily influential population total. Moreover, methodologists may assume, if its weight d^. is sufficiently small. TraditionaUy, methoat the estimation stage, that the values of units are recorded dologists have sought to Umit the impact of influential units without error, since the gathered units are often processed when they are known prior to sampling, by assigning for within an editing system (Samdal, Swensson and Wretman example sampling weights close to 1 to extreme units. 1992, section 1.7). This step is part ofthe sampling proceGambino (1987) and Lee (1995) have nevertheless dure in large statistical agencies such as Statistics Canada. discussed situations in which this cannot be done. In a Lee (1995) has provided an overview of robustness major article, Hidiroglou and Srinath (1981) considered developments within sampling theory. changing the sampling weights when outliers occur. Their Nevertheless, since populations for economic surveys are approach gave much legitimacy to weight modification often asymmetric, some units might be extreme as compared within sampling procedures. to others, as was discussed by Kish (1965). The complete Many of the first robust alternatives to the total were elimination of such units would lead to biased estimates, based on M estimators and GM-estimators. Nevertheless, while maintaining them with their fuU weight might make an much interest has been shown recently for estimators that estimator such as the generalized regression (GREG) also provide good overall robustness, as measured by the estimator highly variable. This would suggest a compromise breakdown point of an estimator. These concepts are between bias and variance. When outliers occur, the discussed for example in Donoho and Huber (1983), challenge is to propose robust estimators of the total that are Hampel, Ronchetti, Rousseeuw and Stahel (1986) and little affected by certain units that deviate sharply from Rousseeuw and Leroy (1987). 
The breakdown point meaothers. Such estimators should have littie bias and a small sures the percentage of outiiers within the sample that the mean square ertor. Traditionally, sampUng theory has been estimator can tolerate while providing nonetheless a good deeply involved in the development of unbiased or estimate of a given characteristic of the population. Lee, asymptotically design unbiased (ADU) estimators. See for Ghangurde, Mach and Yung (1992) required estimators of example Samdal etal., (1992, Section 7.12). However, this the total that were based on robust estimators with a high ADU property is perhaps undesirable within the context of breakdown point. outiiers. This was discussed by Chambers and Kokic (1993), We will be considering calibration estimators of the total who showed the conflict between the ADU property and the T written as Yjs^kYk- These estimators were developed robustness of an estimator. for example in Deville and Samdal (1992). We are looking for weights w^ that are as close as possible to sampling We consider the Horvitz-Thompson (HT) estimator defined ^Y fiji = Y d y , where dj^=Til , TI^ being the weights di^=nl , while meeting benchmark constraints, denoted CE (also known as calibration constraints), inclusion probabiUty (If A is a set of units. A c t / , then Y,A Pierre Duchesne, Departement de Mathematiques et de Statistique, University de Montreal, C.P. 6128, succursale centre-ville, Montr6al, Quebec, H3C 3J7. 44 Duchesne: Robust Calibration Estimators that it makes it possible to obtain weights w^ that are limited to a given interval, say [L, U]. Some of the properties of classes QR and RQR are provided in where x^ is a vector of dimension m that corresponds to the available auxiliary information of known total T^ = Ly -^jt • section 2. Section 3 describes applications of the RQR class in the These estimators are popular as they are easily interpreted, building of robust calibration estimators. 
The main goal is since methodologists are used to assigning weights w^ to to modify robust default weights so that they meet units y^. Several metrics are studied to measure the proxicalibration constraints. Section 3.1 discusses the choice of mity between <i^ and w^. The GREG estimator is an constants ^^ and r^ using arguments suited to calibration important example with w^=J^(I + {T^-T^.^^)' M^ ^J^^k)' where M^ = Yjsd^Xi^x'JCi^. It is obtained by minimizing the estimators. This is a new and unifying approach, and in mean square metric Y.s^k^'^k ~ ^k)^^^k- Constants c^ are section 3.2 it guides our choice of q^ and r^ when there is auxiliary information. One important element is the use of weighting factors which can take into account problems of a robust estimator allowing for the weighted form (1.3), heteroscedasticity (for example). Samdal (1996) discussed providing the ^^. Note that this is the case for GMthe selection of these constants. However, since the gweights gi^ = wjdj^ of the GREG estimator are not gene- estimators. Usually, estimators with a high breakdown point do not have a weighted form. Consideration is given rally restricted, other metrics are proposed as a means of to reweighting these estimators, allowing the breakdown limiting them so that they might meet certain constraints point to be kept under control and making it possible to applicable to the range of values (CARV). Specifically, have estimators written in the form (1.3). See Rousseeuw this makes it possible to avoid undesirable negative weights and Leroy (1987) and Simpson and Chang (1997). We then w^. See Deville and Samdal (1992), Singh and Mohl discuss the choice of q,^ and r^, so as to calculate an RQR (1996) and Stukel, Hidiroglou and Samdal (1996). estimator and obtain a robust calibration estimator with As was noted by Fuller, Loughin and Baker (1994, restricted weights. Various robust estimators, including the p. 
81), there is a link between calibration estimators and Lee (1991) estimator and the Chambers (1986) estimator, robust methods. However, it is wrong to assume that are compared in section 4 with RQR estimators as well as calibration estimators necessarily have good properties of with the GREG estimator and one calibration estimator robustness, given that all the calibration estimators considered in Deville and Samdal (1992) whose weights considered by Deville and Samdal (1992) were asymptoare Umited. The Lee (1991) estimator can be considered a tically equivalent to the GREG estimator, which, being specific case of our approach. It allows us to also consider ADU, is not robust. Moreover, a traditional calibration a new estimator with restricted weights. Four populations estimator is not robust as it depends linearly on w^, and w^ that have already been studied in the literature are and does not take into account y^.. considered. It will be noted that estimators free of weight The purpose of this paper is to build estimators in the constraints are subject to negative weighting problems. form of Yjs'^kyk where the weights w^ provide robustness With the RQR class of estimators, robust estimators having while meeting constraints on the calibration variables and positive weights can be obtained, and they compare well the weights w^. The starting point of our approach is the with estimators free of weight constraints. Finally, class of Wright (1983) estimators QR. Let us assume we have available constants {(^^, r^), ^^ > 0, r^^ ^ 0, V^e [/}, conclusions are drawn in section 5. Appendix B contains such that Y.u'^klk^k^'k ^ ^ ^""^ T^s^k^k^k > ^' ^^- (^'^ '^ a list of abbreviations, and Appendix C contains a Ust of the various constants found in this paper, with definitions. a symmetric matrix, A >0means that A is definite as positive.) The QR estimators are defined on the basis of ^^ and r^ by the relation 2. 
RQR CLASS ESTIMATORS (1.2) T'B + •yQR .s hh' X q Consider a finite population U = [1,2, ...,N} of size N whose total T = ]C(/>'t ^^ ^'^h to estimate for a variable of where B assumes a form weighted by the ^^ interest y that is positive. A sample s of size n^ is drawn (1.3) Pq^^s 'ik^kKYT.s Ik^kYk' foUowing a sampling design p(s}. The inclusion probability of a unit k is denoted Ji^, and the second-order inclusion and probabilities are denoted Jtj^,. We assume that the auxiliary (1.4) ^k=yk-^'kP, information x^ is of unit value, i.e., x^^ is known from a reliable source V^e U. It will be shown in section 2 that the QR estimators are Wright (1983) introduced a class of QR estimators calibration estimators, and a new class of estimators, written in the form (1.2) with the primary objective of denoted RQR, wiU be inti-oduced, also based on the choice unifying a large number of common estimators. We find of constants ^^ and r^. It generalizes to a certain extent the best linear unbiased prediction (BLUP) estimator of the QR estimators as well as the class of Deville and Royall (1970) derived from the model-based theory, Samdal (1992) estimators. The RQR class is interesting in 2>, w*^t = ^x' (1.1) Survey Methodology, June 1999 45 obtained by assuming (^j^, r^) = (1/c^, 1), and the GREG estimator of Cassel, Samdal and Wretman (1976) by considering the choice {q,^, r^) = {d^^lc,^, d,^). Altemately, (1.2) can be written as recourse for the practitioner, then, is to relax the constraints by reducing the dimension of the number of auxiliary variables. See also the discussion in Fuller et al., (1994). As for the calibration estimators considered in Deville and Samdal (1992), it was shown, in resuU 1, that there is a •'j'QR ^ l^s "•kSkYk' solution with a probability approaching one. Under certain where ^^g^ satisfies conditions, this result can be adapted to class RQR estimators. 
The metric on which we will focus our attention so that dkSk = h ^ (L - P.r)'(Es <lk^k^kY^k^k' (2-1) the weights may satisfy the CARVs is a sUght modification with T^^ = Y^j^Xj^. Assuming ri^=dj^, g^ corresponds toof case No. 7 in Deville and Samdal (1992). We caU it the restricted mean square metric. The G-function that corresthe g-weight ofthe GREG estimator. ponds to the choice of this metric is The QR estimators are calibration estimators, obtained by minimizing the mean square metric subject to the CEs {w^-r^)Vq^ if w^e[L,U], G(w,;9,,r,) ™ n ^ E . K - rkfl<ik' as of Ys '^k^'k = ^.- (2.2) otherwise, The weights w^^ are chosen as close as possible to the r^ and the ^^ are weighting factors. In other words, the starting weights r^ are transformed into caUbration weights w^. The solution to problem (2.2) is w^ = <i^g^, where df^gj^ is given by the formula (2.1). Nothing, however, guarantees that the weights w^ of the QR estimator are positive, which might be undesirable in practice. See Brewer (1994), who formalized the interpretation of weights. To limit the weights w^^ in [L, U], we wish to resolve min5^^G(w,;^,,r,), as ofYs k k -T and W.E[L,U]- (2.3) The calibration estimator of the total is ^vROR 2.^5 ^kYk' (2.4) where the w^ are obtained by resolving problem (2.3). It is assumed that function G{w,q,r) is strictly convex and can be derived in w for fixed r and q. We denote g{u;q,r) = G'{u; q, r) and h{u; q,r)=g"'(«; q, r). Moreover, it is assumed that /i(0; q,r) = r and h' (0; q, r) = 9.The resulting estimators are caUed QR (RQR) restricted caUbration estimators. Fuller et al, (1994) favoured regression estimators having reasonable invariance properties. It can be shown that RQR estimators are regression equivariant and to scale when constants q,^ and r^ are transformation invariant. Useful definitions may be found in Bolfarine and Zacks (1992). There is no guarantee that there is a solution to problem (2.3). 
We refer to the simulation study in Stukel et al, (1996). There may, for example, be realizations of the sample for which even the CEs cannot be satisfied (1.1). Thus, the sample is so imbalanced that it is impossible for the weighted sum of the components for each dimension to provide the corresponding population total. The only whereas the /i-function is h{x'X;q.,r.) L r^ + ^^x^' X<L, r^^ + q^x^X r^+ q^x;^Xe[L,U], U r^^q^xlX>U. Given this modification, it is the weight w^ that is constrained and not only w^/(i^ as for case No. 7 in Deville and Samdal (1992). In our situation, w^ can "cortect" an initial weight that is an outlier. It will be noted that, as it is formulated, the Deville and Samdal metric (1992) subtly inserts the constraints on the w^ in the G-function. In order to calculate the estimator (2.4) according to this metiic, it is sufficient to follow the same approach as Deville and Samdal (1992), which leads us to a solution, using Newton's method, for the following equation in X Ysh{x;,X;q,^,r^)Xi^ = T^ (2.5) Thefinalestimator is TJ.RQR = E,^('^t^„; ^k' ''4))'^. where X^ is the solution to equation (2.5). It is interesting to know whether the weight constraint changes the properties ofthe estimator as compared to a QR estimator that is free of weight constraints. The following result (as proven in the Appendix) shows that, under certain conditions, the two estimators are asymptotically equivalent. In practice, using the restricted mean square metric, we have not observed any significant deviations. Proposition 1. According to hypotheses C, and Cj given in the Appendix, N-^\T yQR T I Op{n -1/2 ) • (2.6) This result can possibly be obtained using the approach leading to resuU No. 5 in Deville and Samdal (1992) dealing with the asymptotic equivalence between the GREG and calibration estimators considered by the authors. Duchesne: Robust Calibration Estimators 46 perhaps be reduced. 
In order to satisfy the CEs, this means finding weights w^ that come closest to the sampling weights di^ for units which are not outliers but come as close as possible to a reduction factor r for outlier units, where r is chosen by the statistician. Specifically, we denote J =5, U52. where s^ of cardinaUty n^ represents those units that are not reported as outliers, whereas S2= s - s^ of cardinaUty /ij = " ~ "i represents the outiier units ofs. The (2.7) reduction factor r wiU typically satisfy r <.d^,yk£ jj- For ^L = E E , ^wK«t)(>^/^/)' example, consider the estimator (3.1) with qi, = r,^ = where A^, = TC^, - TI^TI,, A^^, = A^/TC^, and e^ are given by B,. = dJi^^ + r{l - /ji), where /^, is the variable indicating affiliation to 5,. In this way, constants q,^ and r^^ are (1.4). See Samdal etal, (1989) and Samdal etal, (1992, reduced for units of ^2 so as to reflect the fad p. 234). It can be shown that the asymptotic bias of a QR are extreme. The estimator (3.1) becomes estimator is given under general conditions by However, proposition 1 is of some use to understand the type of conditions needed to reach the result described in our situation. Since (1.2) shows the same asymptotic behaviour as quantity T^B + L^jt^i- where Ei^=y^-xlB^ and B^ = ('Lu'^k^k^k^k)'^Lu\^k^kyk' ^^^^ w°"l'^ suggest as vanance estimator V^vQR-^P=E^(V.-!)£,. Then, a possible bias estimator is b = Yjs'^k^Vk ~ ^)^k' which can be used in conjunction with formula (2.7) to build an estimator of the mean square ertor for a QR and RQR estimator, using proposition 1. RQR estimators make it possible to obtain calibration estimators with constrained weights. Given set q^, and r^^, it is sufficient to resolve problem (2.3). In the sections which follow, the RQR class is applied within a context of robustness. We will show how to direct the selection of constants 9^ and r^, chosen in practice using sample s. 3. 
BUILDING ROBUST AND CALIBRATED ESTIMATORS 3.1 Methods Based on Weight Reduction and Value Modiflcation •>.QR Cs(P)('Esi hYk ^ ' ' E . 2 >-*)• In the case of simple random sampling, cf^ = NIn and we obtain TyQR-CSB)f^g, where f g = NlnY,,^ Yk "^ ''Sj2>'t ^s the Bershad (1960) estimator discussed in Lee (1995). Other methods based on weight reduction have been discussed in Lee (1995), who also discussed the choice of r. One disadvantage of methods based on weight reduction is that the analyst must identify the outlier units. Methods based on value modification avoid this difficulty by providing gradual weight reduction for units that are more extreme. We consider a case of simple random sampling. We assume m{y^\ t,a,b)=b + {a -b) min (1, tly^). (3.2) Lee (1995) discussed various propositions based on the Thus, this function assigns a starting weight of value a for weight reduction method for simple random sampling. the y^ < r, and graduaUy reduces this to afinalweight b, as y^^ Once outlier observations have been detected, these becomes extreme. Value t is called the threshold. The methods consist in reducing the weight of extreme obserconstants a, b and t are chosen by the statistician. Several vations. These methods are to be preferted to those which values for a and b have been considered in the literature. eliminate doubtful observations entirely, since all the Thus, instead of assigning a fixed reduction factor to the observations in the sample are legitimate, as was discussed units of ^2, we select 9;^ = •';fc = ^^ = '"()'*' ^' ^ ^ " JNIn), by Lee CM/., (1992). where/is a constant between 0 and 1. The estimator (3.1) With respect to caUbration estimators, we begin by becomes considering the situation in which there is no auxiliary information available and the only constraint is Y^s'^k ~ ^• TyQK = c,mY..y/kyk This case will guide our path. Consider the QR estimator with qi^ = ff^. For the sake of our discussion, we consider = C,{W)f^^. 
constants r^, known and fixed. The weights minimizing ~ (2.2) subject to Ys^k "^ are w^ = C^C'')'"^. where The estimator f ^ has been discussed in Gross, Taylor and Cfr)= NlYr^, so that TyQR becomes Lloyd-Smith (1986) as well as in Chambers and Kokic (1993), who called it the winsorized estimator. This is a (3.1) 'PyQR = Cs(r)Ys'-kykspecial case of the approach used by Chambers (1982, 1986). When / = 0, the estimator (3.1) becomes Whenever an observation is extreme, it might represent few T'.QR = C^{W,)f^„, witii q, = r, = W„ = m{y,; t, NIn, 0), units like itself within the population, and its weight should 47 Survey Methodology, June 1999 where 7,^, = Af/"(Ej,yt + "2')' denoting the part of s containing units that satisfy y^^ < t. The estimator 7"«,, has been discussed in Lee (1995), as well as in Gross et al, (1986), who called it the type I winsorized estimator. When f=nlN, Gross et al, (1986) called it the type II winsorized estimator. It has also been discussed in Brace (1991). In a design Tips, Dalen (1987) inserted the design by assuming D^ = /"(y^^; it/, dj^, I). Thus, if k and / are two extreme observations such that y;^ = y^, then the observation whose sampling weight is largest wiU have a higher weight D^. Selecting r^ = ^^ = D^ makes it possible to obtain essentially the Dalen estimator, ^'Q^ = C^(D)^jDj^yj^. The estimator T [) = Jis^kyk ^^ ^^^^ studied for example inTambay(198l). Table 3.1 Estimator (3.1) Based on Weight Reduction and Value Modification Estimator Value of q. = r. Bershad B, = d,I,,-r{l-I,,) Winsorised Wi^ = Winsorised, type 1 W,^ = m(y^;t,Nln,0) Winsorised, type 11 W,„=m(y,;t,Nln,l) Dalen ^* = '"(>'*;'^*''''*'l) m{yi^;t,Nln,fNln) Note: mCv^; t, a, b) = b + {a - b) min(l, f/y^). The approach used in this section suggests that we may occasionally seek estimators whose weights are close to r^ rather than the sampling weights c?^. 
The constants r_k will themselves be chosen close to d_k for the proper units, but will be reduced once a unit is declared extreme. The QR estimators allow the weight reduction and value modification methods to be unified. Methods based on value modification help us choose weights that are adapted to the specific sample s chosen. As was noted in Chambers and Kokic (1993), this is not surprising, since the problem of outliers occurs after the selection of the sample s. We must use the sample at our disposal to overcome the problem. These methods are generalized in the following section using auxiliary information.

3.2 Estimators of the Total Based on Robust Statistics

One of the first attempts to obtain robust alternatives to population totals using auxiliary information can be found in Chambers (1982, 1986), who proposed a robust ratio estimator based on a BLUP estimator decomposition. One recent extension of the work carried out by Chambers can be found in Welsh and Ronchetti (1998). Gwet and Rivest (1992) also proposed a robust version of the ratio estimator, using a design-based approach in simple random sampling. Rivest and Rouillard (1991) carried out a comparative study of several robust estimators, and examined several estimators of the mean square error. For designs with unequal probabilities, Hulliger (1995) considered robustifying the HT estimator when the inclusion probabilities are obtained using auxiliary information. Gwet and Rivest (1992) and Hulliger (1995) considered a version of the influence function for finite populations, emphasizing the need for procedures having good properties of local robustness and the use of estimators having bounded influence functions. Influence functions were discussed generally in Hampel et al. (1986).

The following sections will deal with building robust estimators having constrained weights. The building of such estimators is based on the following steps:

- Identifying the constants q_k and r_k; this provides a QR estimator.
- Solving the problem (2.3) so as to provide an RQR estimator.

In terms of robustness, the coefficients q_k are selected such that B̂ is a robust estimator. Thus, the first part of the QR estimator, T_x' B̂, provides a good predicted value for the entire population. The second part of the QR estimator, Σ_s r_k e_k, corrects the first part for the y_k observed in the sample. The constants r_k ensure that with this correction, the outliers in the sample will not return with full weight.

3.2.1 Choice of q_k Based on a GM-Estimator

Consider the estimator (1.2) in which B̂ is replaced by a robust estimator of a regression coefficient. Such estimators have been discussed, for example, in Huber (1981) and Hampel et al. (1986). We thus obtain

T̂_yQR = T_x' B̂_g + Σ_s r_k (y_k − x_k' B̂_g).    (3.3)

The estimator (3.3) does not have the form of QR estimators unless B̂ assumes a weighted form. This is the case if B̂ is a GM-estimator defined by the equation

Σ_s d_k h_k x_k ψ((y_k − x_k' B)/(σ h_k^a f_k)) / f_k = 0,    (3.4)

since the solution to (3.4) can be expressed as

B̂_g = (Σ_s d_k h_k^{1−a} u_k x_k x_k' / c_k)^{−1} Σ_s d_k h_k^{1−a} u_k x_k y_k / c_k,

where

u_k = ψ((y_k − x_k' B̂_g)/(σ h_k^a f_k)) / ((y_k − x_k' B̂_g)/(σ h_k^a f_k)).

The properties of GM-estimators have been discussed in Simpson and Chang (1997). To simplify our discussion, σ is assumed to be known, and the role of the c_k is the same as in the case of the GREG estimator. The function ψ is determined by the analyst. A current example would be the Huber function

ψ_Hub(x; c) = c if x > c;  x if |x| ≤ c;  −c if x < −c.    (3.5)

A value of c around 2 is often used in calculating GM-estimators. See, for example, Hampel et al. (1986), Gwet and Rivest (1992) and Hulliger (1995). The choice of h_k makes it possible to limit the influence of auxiliary information that is too extreme. The constant a = 0 leads to the Mallows choice, whereas a = 1 makes it possible to obtain the Schweppe version. The Schweppe version is sometimes preferred. See Coakley and Hettmansperger (1993) and Hampel et al. (1986, p. 322).
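The Huber function (3.5) and the derived weight u_k = ψ(z)/z can be sketched as follows; this is an illustrative fragment with my own function names, and u(0) is set to 1 by continuity.

```python
def psi_huber(x, c):
    """Huber function of equation (3.5): identity on [-c, c],
    clipped to +/-c outside."""
    return max(-c, min(c, x))

def u_weight(z, c):
    """Weight u = psi(z)/z used to express the GM-estimator in weighted
    form: 1 for small standardized residuals, c/|z| for large ones."""
    if z == 0.0:
        return 1.0  # limit of psi(z)/z as z -> 0
    return psi_huber(z, c) / z
```

With the often-used value c = 2, for example, a unit with a standardized residual of 4 enters the estimating equations with half weight.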
When there is minimal auxiliary information, i.e., when we only have available a real variable x_k, ∀ k ∈ U, a possible choice for the function h_k is

h_k = min(1, t / (x_k / med(x_k))),    (3.6)

where med(x_k) denotes the median of the x_k. For a πps design, a modification of h_k following Dalén (1987), so as to take the various sampling weights into consideration, would perhaps be desirable. The constant t must be specified by the statistician. A value of t around 1.5 is found in the applications. See, for example, Rivest and Rouillard (1991), who also provide other choices for the functions h_k.

Writing B̂_g as a weighted estimator makes it possible to write the estimator (3.3) as a QR estimator with

(q_k, r_k) = (d_k h_k^{1−a} u_k / c_k, r_k).

The choice of the constants r_k is discussed in section 3.2.3.

3.2.2 Choice of q_k Based on a High Breakdown Point Estimator

The choice of a GM-estimator is only a first step towards obtaining a very robust estimator of the total. In fact, although the influence function of GM-estimators is bounded, the fact remains that such estimators do not have a high breakdown point, which usually diminishes with the dimension of the auxiliary information (Rousseeuw and Leroy 1987, p. 13). This section will explain how to build robust calibration estimators based on high breakdown point estimators. As such estimators do not usually assume a weighted form, we will consider reweighting them. This will allow us to obtain, as in the previous section, the constants q_k needed to compute the RQR estimator. Specifically, the following weights ů_k are considered:

ů_k = ψ((y_k − x_k' B̂_0)/(σ h_k f_k)) / ((y_k − x_k' B̂_0)/(σ h_k f_k)),    (3.7)

where B̂_0 is an equivariant estimator with a high breakdown point meeting certain regularity conditions. The reweighted estimator is

B̂_r = (Σ_s d_k h_k ů_k x_k x_k' / c_k)^{−1} Σ_s d_k h_k ů_k x_k y_k / c_k.    (3.8)

The asymptotic properties of this type of estimator have been studied in Simpson and Chang (1997). The estimator B̂_0 that is considered is the one-step GM-estimator of Coakley and Hettmansperger (1993), denoted B̂_CH. This estimator has a high breakdown point. It is obtained as the first iteration of the Newton formula in equation (3.4), where the Schweppe version is used, assuming a = 1. Other robust estimators could have been chosen. However, the efficiency and robustness properties of the Coakley and Hettmansperger (1993) estimator make it a good choice. Thus, the proposed constant q_k is

q_k = d_k h_k ů_k / c_k.

3.2.3 Choice of r_k

Once the constants q_k have been determined, the constants r_k must be selected. If r_k = d_k, then under general conditions the QR estimator is an ADU estimator. However, such a choice of r_k yields an estimator that is sensitive to outliers. Alternately, choosing r_k = 0 provides a robust estimator that might be very biased, as was emphasized in Gwet and Rivest (1992, p. 1180). Lee (1991) suggested choosing r_k = θ d_k, where θ ∈ [0, 1]. Under general conditions, the asymptotic bias becomes (θ − 1) Σ_U E_k, where the E_k represent the residuals obtained by adjusting a robust estimator for the entire population. Choosing θ makes it possible to control the estimator bias.

The discussion in section 3 leads us to suggest constants r_k that are close to the d_k for good units, and reduced gradually for doubtful observations. We suggest choosing

r_k = d_k u_k*,    (3.9)

where

u_k* = ψ*((y_k − x_k' B̂_r)/(σ f_k)) / ((y_k − x_k' B̂_r)/(σ f_k)).

The function ψ* which we will be considering is a modification of the Huber function:

ψ*(x) = x if |x| ≤ a;  a sign(x) if a < |x| ≤ a/b;  bx if |x| > a/b.    (3.10)

We choose a = 9, b = 1/4. The reason for this modification is that we do not want the outliers comprising large residuals or extreme auxiliary information to have weights that are too reduced.
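The modified Huber function (3.10) and the resulting default weights (3.9) can be sketched as follows, with the suggested a = 9 and b = 1/4; the code and its names are an illustration of mine, not the paper's implementation.

```python
def psi_star(x, a=9.0, b=0.25):
    """Modified Huber function of (3.10): identity up to a, constant
    a*sign(x) for a < |x| <= a/b, then linear with slope b beyond a/b."""
    ax = abs(x)
    if ax <= a:
        return x
    if ax <= a / b:
        return a if x > 0 else -a
    return b * x

def r_default(z, d, a=9.0, b=0.25):
    """Default weight r_k = d_k * psi_star(z)/z of (3.9): the sampling
    weight d_k is kept in full for |z| <= a and reduced gradually,
    never below b*d_k."""
    if z == 0.0:
        return d
    return d * psi_star(z, a, b) / z
```

For a standardized residual with |z| ≤ 9 the unit keeps its full sampling weight; for |z| ≥ 36 it keeps exactly one quarter of it, matching the behaviour described in the text.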
In this way, the sampling weight is fully maintained when the argument of ψ* is between −9 and 9, and is reduced gradually to one quarter. If the weight of large residuals is reduced too much, then the bias becomes too great, and this led to the choice of ψ*. The choice of the constants r_k has been made empirically, and seems to work well in practice. Thus, we will consider the choice of constants q_k and r_k following

(q_k, r_k) = (d_k h_k ů_k / c_k, d_k u_k*).    (3.11)

We suggest a generalization of the Lee (1991) proposition, since instead of considering r_k = θ d_k, where θ is fixed, r_k = d_k u_k* will adapt automatically (or adaptively) to the sample. Having this choice of constants q_k and r_k at our disposal, and with the usual mean square metric, we obtain a QR estimator, but it is subject to negative weighting problems. However, with the constants (3.11), we can consider the restricted mean square metric, solve the problem (2.3) and obtain a robust estimator meeting the CEs and the CARVs.

Proposition 1 gives possible conditions for the asymptotic behaviour of the resulting RQR estimator as compared to the QR estimator free of weight constraints. However, since the constants (3.11) depend on s in a complex way, there can be no automatic conclusion about asymptotic equivalence. Nevertheless, the simulation study in section 4 seems to suggest a very comparable behaviour for the estimator with and without constraints on the weights, with respect to the Monte Carlo mean square error. Thus, empirical evidence shows that if the q_k and r_k are chosen in such a way that the estimator without constraints on the weights is robust, then the version with constraints on the weights will also be robust.

Finally, the following is a summary of the steps in the proposed method used to obtain a robust RQR estimator.

1. Choice of the constants q_k and r_k. We suggest the constants found in equation (3.11). For this step, it is necessary to compute B̂_CH.
2. Choice of metric. If need be, choice of the constants L and U. These constants are chosen such that L ≤ r_k ≤ U, ∀ k ∈ s.
3. Solution, using Newton's method, of equation (2.5).
4. Set w_k = h(x_k' λ_s; q_k, r_k) for λ_s the solution to step 3.
5. Set T̂_RQR = Σ_s w_k y_k, which is the proposed RQR estimator.

The procedure requires a certain number of constants. The constants a, t and c are found in the calculation of q_k and r_k. The choice of these values is nevertheless justified using robustness theory, which helps guide the practitioner. Thus, the value for c in the Huber function can be obtained by taking into account efficiency concerns under normal errors. See Hampel et al. (1986, p. 333) and Gwet and Rivest (1992). The constants a and b are also involved; they are more directly linked to the proposed estimators. The constant b represents the maximum weight reduction that can be allowed when specifying the default weights r_k, and for this reason there is a link with the suggestion made by Lee (1991). The constant which it is most important to specify is possibly the value of a. We suggest here a = 9. However, in our simulations, a value of a between 6 and 12 yielded relatively comparable results. The choice of the limits L and U rests on cosmetic considerations, so that the weights may be limited to one interval. This last consideration is perhaps secondary for the practitioner. As a result, it would seem that the most important aspect is to choose a value of r_k that is close to d_k for the proper values, then reduced as an observation is deemed extreme, and that is the goal which has guided our choice of r_k in this section. Nevertheless, it would be useful to make a choice of r_k that satisfies a certain optimality criterion.

3.3 Chambers Model-Assisted Estimator

Another approach is based on a decomposition proposed by Chambers (1982, 1986), which we now apply to QR estimators. Note that a QR estimator can always be written in the form

T̂_yQR = Σ_s r_k y_k + (T_x − T̂_xr)' B̂ + Σ_s z_k (y_k − x_k' B̂),

where z_k = (T_x − T̂_xr)' (Σ_s q_l x_l x_l' / c_l)^{−1} q_k x_k / c_k, T̂_xr = Σ_s r_k x_k, and B̂ is arbitrary. Chambers (1986) had considered the specific case (q_k, r_k) = (1/c_k, 1) for the ratio estimator. In order to limit the influence of outlier units, Chambers proposed

T̂_CHAM = Σ_s r_k y_k + (T_x − T̂_xr)' B̂ + Σ_s z_k ψ(y_k − x_k' B̂).    (3.12)

The function ψ helps limit the influence of large residuals. The choice for B̂ is a robust estimator, e.g., B̂_g. One function ψ considered in Chambers (1986) was

ψ(t) = t exp(−0.25 (|t| − 6)²).    (3.13)

It is interesting to note that (3.12) can be written as

T̂_CHAM = T_x' B̂ + Σ_s (r_k + (d_k g_k − r_k) λ_k) e_k(B̂),

where e_k(B̂) = y_k − x_k' B̂, g_k is defined in formula (2.1) calculated using q_k and r_k, and

λ_k = ψ(e_k(B̂)) / e_k(B̂).    (3.14)

Thus, the residuals e_k(B̂) are weighted using a relation referring to formula (3.2). If λ_k = 1, then d_k g_k is applied to the residuals e_k(B̂), and it is easy to verify that we have the GREG estimator. Alternately, if λ_k = 0, we obtain (3.3) if we assume B̂ = B̂_g. If in (3.12) we assume (q_k, r_k) = (1/c_k, 1) and B̂ = B̂_g, then the Chambers estimator represents a compromise between the BLUP and a robust estimator based on a GM-estimator. Note that formally (3.12) is a QR estimator with

(q_k, r_k') = (q_k, r_k + (d_k g_k − r_k) λ_k).

However, since r_k' is not necessarily positive, it is not always possible to undertake a change of metric in this case.

4. EMPIRICAL STUDY

To study the performance of robust calibration estimators, we carried out a Monte Carlo simulation study. We considered four populations comprising data from readily available works on sampling theory. For each population, K = 2000 samples were drawn using simple random sampling for various sample sizes. Our main objective was to determine whether it is possible to obtain estimators having good empirical properties (bias, mean square error) while satisfying the CEs and the CARVs. Note that all the programs were written in S-PLUS (Statistical Sciences 1991) and are available from the author.
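The paper's programs were written in S-PLUS; purely as an illustration of steps 3 and 4 of the procedure in section 3.2.3, the calibration step might be sketched as below. The function name is mine, and the form w_k = clip(r_k + q_k x_k'λ, L, U) for the restricted mean square metric is an assumption on my part, consistent with the notation h(u; q, r) of Appendix A.

```python
import numpy as np

def restricted_calibration_weights(X, Tx, q, r, L, U, n_iter=50, tol=1e-8):
    """Find lambda such that the calibrated weights
    w_k = clip(r_k + q_k * x_k'lambda, L, U) satisfy the calibration
    equation sum_s w_k x_k = Tx, by Newton iterations on lambda."""
    lam = np.zeros(X.shape[1])
    for _ in range(n_iter):
        u = r + q * (X @ lam)                  # unclipped weights
        w = np.clip(u, L, U)
        g = X.T @ w - Tx                       # calibration residual
        if np.max(np.abs(g)) < tol:
            break
        active = (u > L) & (u < U)             # units not at a bound
        J = X.T @ ((q * active)[:, None] * X)  # Jacobian of g in lambda
        lam -= np.linalg.solve(J, g)
    return np.clip(r + q * (X @ lam), L, U)
```

Step 5 would then compute the RQR estimate as the weighted sum of the y_k with the returned weights.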
4.1 Populations Under Study

The population graphs can be found in Figure 4.1. The first population, comprising 51 units, can be found in Mosteller and Tukey (1977, p. 560). It consists of the U.S. population in 1960 and in 1970 for each of the 50 states and the federal District of Columbia. It is called POPUSA. Looking at the scattergram of the 1970 population in terms of the 1960 population, we notice that all the units seem to be on the same straight line, with some good leverage points. An example of a good leverage point is the circled point in this population.

The second population, with 34 units, can be found in Singh and Chaudhary (1986, p. 177). It deals with the area of fields sown in 1971 and in 1974. This population is called AREA. There is a bad leverage point (see the circled point) in this population, since the point (4170, 99) does not respect the linear trend of the majority of the units. Samples of size 10 and 15 are drawn from POPUSA and AREA.

The third population, i.e., the MU284 population of Särndal et al. (1992), comprises the 284 municipalities of Sweden. We considered the variables x = S82, the total number of seats in the municipal council, and y = P85, the population of Sweden in 1985. There are vertical outliers (e.g., the circled point) and one bad leverage point. Finally, we considered the population MU281, made up of MU284 from which the three largest municipalities were excluded. The variables considered were x = REV84, representing the values of landed property based on the 1984 assessment, and y = RMT85, representing municipal tax revenues in 1985. The unit of measurement was one million kronor for both variables. This population seems to have several bad leverage points. Samples of size n = 30 and n = 60 were drawn from MU284 and MU281.

[Figure 4.1. The four populations under study. Scatterplots; the axis labels recoverable from the source are: population (in thousands), 1960 (POPUSA); area under wheat (in acres), 1971 (AREA); total number of seats in municipal council (MU284); real estate values according to the 1984 assessment (in millions of kronor) (MU281).]

Table 4.1 contains the totals for the various populations.

Table 4.1
Totals for Various Populations and Totals Known From Auxiliary Information

Population   T_x       T_y       N
POPUSA       179,972   203,923   51
AREA         29,118    6,781     34
MU284        13,500    8,339     284
MU281        757,246   53,124    281

4.2 Description of the Estimators

The two basic estimators were the GREG estimator and the estimator obtained by considering case No. 7 in Deville and Särndal (1992), i.e., a GREG estimator with restricted weights. These estimators were denoted GREG/U and GREG/R respectively. We selected c_k = 1 for the populations POPUSA and AREA, and chose c_k = x_k for the populations MU284 and MU281. Our choice for the c_k was motivated by the relationship between these constants and the heteroscedasticity of the superpopulation model.

Of the robust estimators, we studied the Chambers (1986) estimator, denoted CHAM and based on B̂_CH, by considering

(q_k, r_k) = (d_k ů_k(B̂_CH)/c_k, 1 + (d_k g_k − 1) λ_k(B̂_CH)),

where in the formula (2.1) (q_k, r_k) = (1/c_k, 1). The constants ů_k(B̂_CH) were obtained from formula (3.7). The selection a = 1 was used throughout the simulation. Huber's function ψ was used with the constant c = 1.345 for B̂_CH. The functions h_k are those given by formula (3.6), where we selected t = 1.46. The function λ_k is defined by equation (3.14), with the function ψ given by equation (3.13). The scale was estimated as in Coakley and Hettmansperger (1993).

We also considered the model-assisted BLUP estimator in which the generalized least squares estimator was replaced by the estimator B̂_CH, which we called MODEL. Moreover, we considered the Lee (1991) estimator on the basis of B̂_CH, where r_k = 0.25 d_k, using the mean square metric. We also studied an extension of the Lee (1991) estimator by considering the limited mean square metric.
These estimators were denoted by LEE25/U and LEE25/R respectively. Finally, we considered the new method of section 3.2.3, selecting (q_k, r_k) as given by equation (3.11), in accordance with the mean square metric and the limited mean square metric. These estimators were denoted by QRROB/U and QRROB/R respectively. The choice of the function ψ* was given by formula (3.10).

Table 4.2
Monte Carlo Results for Sampling From the POPUSA Population

Estimators  VARM   MSEM   CVM   BRM    MIN     MAX    CARV¹   CONV
n = 10
GREG/U      34.90  34.92  2.90  -0.07  -6.24   26.75   86.7    -
GREG/R      35.29  35.30  2.91  -0.04   0.20   32.00  100.0   98.4
CHAM        32.43  33.75  2.85  -0.56  -19.61  40.96   84.0    -
MODEL       27.66  30.69  2.72  -0.85  -19.71  40.86   82.8    -
LEE25/U     27.48  30.07  2.69  -0.79  -19.38  39.64   83.2    -
LEE25/R     28.67  30.90  2.73  -0.73   0.20   32.00  100.0   98.4
QRROB/U     27.40  28.40  2.61  -0.49  -15.68  40.10   83.2    -
QRROB/R     28.33  29.18  2.65  -0.45   0.20   32.00  100.0   98.4
n = 15
GREG/U      21.90  21.95  2.30  -0.10  -3.13   15.32   94.7    -
GREG/R      22.12  22.15  2.31  -0.09   0.20   16.00  100.0   99.5
CHAM        18.11  20.14  2.20  -0.70  -5.79   16.44   92.4    -
MODEL       15.43  19.03  2.14  -0.93  -6.09   16.92   91.0    -
LEE25/U     15.44  19.54  2.17  -0.99  -6.19   17.06   90.8    -
LEE25/R     15.72  19.68  2.18  -0.98   0.20   16.00  100.0   99.5
QRROB/U     14.68  16.44  1.99  -0.65  -4.48   16.41   90.9    -
QRROB/R     14.85  16.56  2.00  -0.64   0.20   16.00  100.0   99.5

¹ The limits for the CARVs are [0.20, 32] for n = 10 and [0.20, 16] for n = 15.

4.3 Frequency Measurements

The eight estimators of section 4.2 were calculated for each sample. The results can be found in Tables 4.2, 4.3, 4.4 and 4.5. Since one asset of the new methods is the CARVs, statistics were calculated on the weights. Columns MIN and MAX in the tables of results contain the minimum and maximum values of the weights calculated during the simulation for each estimator. Also shown, in the CARV column, is the percentage of samples for which the weights are within the CARVs. We also considered the percentage of samples for which the limited estimators converged, given in the CONV column. The intervals [L, U] used for the limited estimators are specified in the different tables. In all cases, the various statistics were calculated using the samples for which all the estimators were convergent.

Another significant feature is related to the bias and efficiency of the proposed methods. Let T̂ denote an estimator of the total T_y, and let T̂_i be the estimator of the total calculated using sample i, i = 1, ..., K. The relative Monte Carlo bias BR_M, the mean value E_M(T̂) and the variance V_M are given by the usual formulas, i.e.,

BR_M = 100 (E_M(T̂) − T_y)/T_y,  E_M(T̂) = (1/K) Σ_{i=1}^K T̂_i,  V_M = (1/K) Σ_{i=1}^K (T̂_i − E_M(T̂))².

Our main criterion for efficiency will be the Monte Carlo mean square error, defined by MSE_M = (1/K) Σ_{i=1}^K (T̂_i − T_y)². The coefficients of variation CV_M are calculated in accordance with √(MSE_M)/T_y. The variance and mean square error are expressed in millions. The coefficient of variation, the relative bias, the CARVs and the convergence rates of the limited versions are expressed as percentages.

4.4 Discussion

The POPUSA population had no outliers that did not satisfy the linear model. During sampling, the coefficients of variation of the estimators were small, which could be expected given the trend of the population. The MSE_M and VAR_M columns are very similar, indicating that bias is not a problem for this population. All relative biases were less than 1%. The QRROB/U estimator provided a reduction in variance as compared to GREG/U that exceeded 21% for n = 10 and 30% for n = 15.

The size of the AREA population was small. This population had a bad leverage point, leading to very high empirical relative bias for all the estimators. The GREG/U estimator had a relative bias of more than 7% in spite of a 44% sampling fraction for this population. The robust estimators had the most significant bias, though it was relatively comparable to the bias of the GREG/U estimator. The most significant reduction in variance was achieved for the QRROB/U estimator, but at the cost of a relative bias of about 10%.

Population MU284 had a vertical outlier and bad leverage points. The robust estimators reduced the variance radically, since they were not affected by the three extreme units in y which were clearly moving away from the linear trend. The CHAM, QRROB/R and QRROB/U estimators were more than four times less variable than the GREG/U estimator. However, this led to a much higher negative bias. All the robust estimators were severely biased. The MODEL estimator showed a negative bias of more than 13%, whereas QRROB/U had a negative bias of the order of 11%. As for QRROB, a better choice of constants in the function ψ* might help reduce a larger part of the bias at the cost of a lower variance reduction. Increasing the sample size to n = 60 made it possible to reduce the bias of the CHAM and QRROB estimators below 10%, but the other robust estimators remained more biased.

Population MU281 contained a fairly large number of bad leverage points. The variance dominated the MSE for this population. The LEE25 estimator was the least variable, with a reduction of more than 35% as compared to GREG/U. However, although θ = 0.25 functions well for this population, our study shows that it is not always the best choice.

Note that all the robust estimators were more efficient than the GREG or its limited version. As was confirmed by the results of Deville and Särndal (1992), the limited version of the GREG estimator showed essentially the same behaviour as the GREG in terms of both bias and Monte Carlo variance for each population. Of all the estimators that were considered, GREG/U and GREG/R were the least biased. The robust versions all exhibited greater bias. However, this is more than offset by the reduction in variance, so that the efficiency of the robust estimators is always greater than that of the GREG/U or GREG/R estimators.

Concerning the constraints on the weights, it will be noted that the GREG/U, CHAM, MODEL, LEE25/U and QRROB/U estimators are all subject to problems of negative weighting, as can be seen in the MIN column. This problem is avoided with the limited estimators. The CARV column shows that the constraints were not met relatively frequently, depending on population and sample size, varying between 5% and 60%. The general behaviour of the two limited robust estimators was comparable to that of their non-limited versions. Moreover, QRROB/R, in addition to meeting the CARVs, provided interesting properties of efficiency as compared to the other robust estimators. The limited versions were not as prone to convergence problems when the sample sizes were greater. Note that we had to use wider bands in the case of POPUSA in order to obtain satisfactory convergence rates.

Table 4.3
Monte Carlo Results for Sampling From the AREA Population

Estimators  VARM   MSEM   CVM    BRM    MIN    MAX    CARV¹   CONV
n = 10
GREG/U      1.334  1.700  19.23   8.92  -3.35  14.94   86.6    -
GREG/R      1.295  1.629  18.82   8.53   0.20  14.00  100.0   99.0
CHAM        1.187  1.541  18.30   8.77  -4.09  14.90   87.2    -
MODEL       1.291  1.580  18.54   7.93  -5.23  16.75   86.8    -
LEE25/U     1.279  1.593  18.61   8.26  -5.28  16.89   86.6    -
LEE25/R     1.284  1.596  18.63   8.24   0.20  14.00  100.0   99.0
QRROB/U     1.026  1.440  17.70   9.50  -4.74  15.38   87.6    -
QRROB/R     1.028  1.437  17.68   9.43   0.20  14.00  100.0   99.0
n = 15
GREG/U      0.940  1.178  16.00   7.18  -1.40   7.03   93.0    -
GREG/R      0.928  1.154  15.85   7.01   0.20   6.00  100.0   99.8
CHAM        0.708  0.989  14.67   7.82  -1.52   7.92   93.7    -
MODEL       0.757  0.997  14.73   7.22  -1.66   8.39   93.1    -
LEE25/U     0.672  1.059  15.18   9.18  -1.68   9.40   92.0    -
LEE25/R     0.671  1.056  15.15   9.15   0.20   6.00  100.0   99.8
QRROB/U     0.485  0.990  14.68  10.48  -1.59   8.90   93.9    -
QRROB/R     0.485  0.986  14.64  10.44   0.20   6.00  100.0   99.8

¹ The limits for the CARVs are [0.20, 14] for n = 10 and [0.20, 6] for n = 15.
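The Monte Carlo measures defined in section 4.3 can be computed directly from the K simulated estimates; the following small sketch (with my own function name) shows the computation:

```python
import math

def monte_carlo_measures(estimates, T_true):
    """Relative bias BR_M (%), variance V_M, mean square error MSE_M
    and coefficient of variation CV_M (%), as defined in section 4.3."""
    K = len(estimates)
    mean = sum(estimates) / K
    br = 100.0 * (mean - T_true) / T_true
    var = sum((t - mean) ** 2 for t in estimates) / K
    mse = sum((t - T_true) ** 2 for t in estimates) / K
    cv = 100.0 * math.sqrt(mse) / T_true
    return br, var, mse, cv
```

Note the decomposition MSE_M = V_M + (BR_M T_y / 100)², which is how the bias contribution in the tables can be checked.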
Table 4.4
Monte Carlo Results for Sampling From the MU284 Population

Estimators  VARM   MSEM   CVM    BRM     MIN     MAX    CARV¹   CONV
n = 30
GREG/U      2.833  2.925  20.51   -3.64   -6.83  23.90   89.8    -
GREG/R      2.813  2.910  20.46   -3.73    0.20  16.00  100.0   99.2
CHAM        0.645  1.639  15.35  -11.95  -11.80  31.26   77.0    -
MODEL       0.709  2.037  17.11  -13.82  -12.06  31.91   68.4    -
LEE25/U     0.887  1.877  16.43  -11.93  -11.06  30.93   73.5    -
LEE25/R     0.871  1.847  16.30  -11.85    0.20  26.00  100.0   99.2
QRROB/U     0.719  1.532  14.84  -10.81   -9.46  25.84   86.5    -
QRROB/R     0.720  1.525  14.81  -10.76    0.20  16.00  100.0   99.2
n = 60
GREG/U      1.473  1.489  14.63   -1.49   -1.19  10.03   90.1    -
GREG/R      1.467  1.484  14.61   -1.57    0.20   7.00  100.0   99.7
CHAM        0.357  0.990  11.93   -9.54   -2.53  15.59   69.8    -
MODEL       0.380  1.255  13.43  -11.22   -4.93  14.52   58.1    -
LEE25/U     0.403  1.201  13.14  -10.72   -4.80  14.20   60.3    -
LEE25/R     0.396  1.203  13.16  -10.78    0.20   7.00  100.0   99.7
QRROB/U     0.308  0.976  11.85   -9.80   -2.36  10.99   86.1    -
QRROB/R     0.308  0.979  11.87   -9.82    0.20   7.00  100.0   99.7

¹ The limits for the CARVs are [0.20, 16] for n = 30 and [0.20, 7] for n = 60.

Table 4.5
Monte Carlo Results for Sampling From the MU281 Population

Estimators  VARM   MSEM   CVM   BRM    MIN     MAX    CARV¹   CONV
n = 30
GREG/U      17.33  17.35  7.84  -0.26  -38.97  34.56   86.0    -
GREG/R      17.40  17.41  7.86  -0.24    0.20  25.00  100.0   99.8
CHAM        13.23  13.26  6.86  -0.33  -47.09  39.08   56.9    -
MODEL       11.30  11.91  6.50   1.47  -66.22  41.43   47.9    -
LEE25/U     11.21  11.60  6.41   1.17  -59.75  37.03   53.3    -
LEE25/R     11.26  11.73  6.45   1.29    0.20  25.00  100.0   99.8
QRROB/U     12.92  13.29  6.86   1.15  -54.14  39.73   70.8    -
QRROB/R     12.94  13.34  6.88   1.20    0.20  25.00  100.0   99.8
n = 60
GREG/U       7.57   7.57  5.18  -0.10  -12.77  15.34   86.4    -
GREG/R       7.58   7.58  5.18  -0.09    0.20   9.00  100.0   99.9
CHAM         5.85   5.90  4.57  -0.43  -22.97  11.49   51.4    -
MODEL        4.53   5.23  4.30   1.57  -24.02  14.58   38.7    -
LEE25/U      4.55   5.18  4.28   1.49  -23.74  14.41   41.2    -
LEE25/R      4.50   5.21  4.30   1.58    0.20   9.00  100.0   99.9
QRROB/U      5.40   6.16  4.67   1.64  -21.08  21.07   68.6    -
QRROB/R      5.39   6.17  4.67   1.66    0.20   9.00  100.0   99.9

¹ The limits for the CARVs are [0.20, 25] for n = 30 and [0.20, 9] for n = 60.

5. CONCLUSION

The goal of this paper has been to introduce calibration estimators having good properties of robustness. Traditional calibration estimators are easy to use, since it is sufficient to have a set of starting weights, usually the sampling weights d_k, which are transformed into calibrated weights. The steps used in this paper have been the same, i.e., the robust default weights r_k have been transformed into calibrated weights, and the constants q_k have been chosen such that B̂ is a robust estimator. The proposed choice of r_k is given by formula (3.9), with a = 9, b = 1/4. It remains to develop a theory for the optimal choice of r_k. The suggestion is made, for applications, to vary the constant a, between say 6 and 12, in order to determine the influence of this constant on the estimation. The limits L and U can be used to limit the weights, e.g., to make them all positive. We suggest the general use of L = 0.2, U = kN/n, where k is about 3.
Note that robust calibration estimators are not meant to replace the GREG estimator, but to be used in conjunction with it. Thus, if the robust estimator and the GREG estimator are very different, a more in-depth analysis might help determine the reason. The proposed estimators could be useful as diagnostic tools. It would be interesting to pursue the empirical studies of section 4 by examining, for example, the effect of the sampling design on the proposed procedures. Another important area of development is the estimation of variance.

Multipurpose surveys are yet another area of interest. In fact, in applications there is rarely a single variable of interest, and methodologists would like to use a single set of weights for all the variables of interest. In terms of robustness, a solution has been proposed in the conclusion of the paper by Gwet and Rivest (1992), where robust weights were calculated for each variable of interest y^(i), i = 1, ..., I. For each unit, the final weight corresponds to the minimum among the weights obtained. Alternately, to obtain robust and calibrated estimators, we could calculate robust default weights for each variable of interest, providing a set of r_k(y^(i)), and set r_k = min r_k(y^(i)), where the minimum is over i = 1, ..., I. These weights could then be transformed into calibrated weights. This procedure should be assessed in greater detail.

ACKNOWLEDGEMENTS

I wish to thank Carl-Erik Särndal for introducing me to sampling theory and for suggesting that I consider the problem of outliers in sampling theory. I also thank Roch Roy and Christian Léger for helping me during various stages of development of this paper. My sincere thanks go to the Associate Editor, the Assistant Editor and two referees for comments which led to significant improvements in both content and layout.
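The minimum-weight rule suggested above for multipurpose surveys takes one set of robust default weights per variable of interest and combines them by a unit-level minimum before calibrating. A minimal sketch (the function name is mine):

```python
def combined_default_weights(r_by_variable):
    """Given robust default weights r_k(y^(i)) computed separately for
    each variable of interest i = 1, ..., I (one list per variable, all
    over the same sample), return the single set r_k = min_i r_k(y^(i))."""
    return [min(r_k) for r_k in zip(*r_by_variable)]
```

The combined weights would then be calibrated once, as in section 3, so that a single set of final weights serves all the variables of interest.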
55 Survey Methodology, June 1999

APPENDIX A
PROOF OF PROPOSITION 1

Let $\Delta(u; q, r) = r + qu - h(u; q, r)$ and let $\hat{\lambda}$ be a solution to equation (2.5). We assume the following conditions:

C1: $N^{-1} \sum_{k \in s} d_k \Delta(\hat{\lambda}' \mathbf{x}_k; q_k, r_k) y_k = O_p(n^{-1/2})$;

C2: $N^{-1} \sum_{k \in s} d_k \Delta(\hat{\lambda}' \mathbf{x}_k; q_k, r_k) \mathbf{x}_k = O_p(n^{-1/2})$;

C3: $N^{-1} \sum_{k \in s} g_k y_k (1 - u_k) = o_p(n^{-1/2})$.

By the definition of $\Delta$,

$$\hat{t}_{QR} - \hat{t}_{RQR} = N^{-1} \sum_{k \in s} (r_k + q_k \hat{\lambda}' \mathbf{x}_k) y_k - N^{-1} \sum_{k \in s} h(\hat{\lambda}' \mathbf{x}_k; q_k, r_k) y_k = N^{-1} \sum_{k \in s} \Delta(\hat{\lambda}' \mathbf{x}_k; q_k, r_k) y_k.$$

Note that $\sum_{k \in s} (r_k + q_k \hat{\lambda}_1' \mathbf{x}_k) \mathbf{x}_k = \mathbf{t}_x$, where $\hat{\lambda}_1 = -(\sum_{k \in s} q_k \mathbf{x}_k \mathbf{x}_k')^{-1} (\hat{\mathbf{t}}_{x,r} - \mathbf{t}_x)$, and also that $\sum_{k \in s} h(\hat{\lambda}' \mathbf{x}_k; q_k, r_k) \mathbf{x}_k = \mathbf{t}_x$. Thus, using C2, we find that $\hat{\lambda} - \hat{\lambda}_1 = o_p(n^{-1/2})$, and therefore, using C1, it is easily shown that $\hat{t}_{QR} - \hat{t}_{RQR} = O_p(n^{-1/2})$, which proves the proposition.

APPENDIX B
LIST OF ABBREVIATIONS

ADU: Asymptotically Design Unbiased.
BLUP: Best Linear Unbiased Predictor (Royall 1970).
CARV: Constraints applicable to the range of values for the weights $w_k$, by requiring for example that all the $w_k \in [L, U]$.
CE: Calibration equations, $\sum_{k \in s} w_k \mathbf{x}_k = \mathbf{t}_x$, where $\mathbf{t}_x = \sum_{k \in U} \mathbf{x}_k$.
CH: Robust estimator proposed by Coakley and Hettmansperger (1993), a one-step GM-estimator that is robust and efficient.
CHAM: Robust Chambers (1982, 1986) estimator.
GM: Generalized M-estimators, derived from robustness theory (see for example Hampel et al. 1986).
GREG: Generalized regression estimator proposed by Cassel et al. (1976).
HT: Horvitz-Thompson estimator $\sum_{k \in s} d_k y_k$, where $d_k = 1/\pi_k$.
QR: Wright (1983) estimators, in the form $\mathbf{t}_x' \hat{B} + \sum_{k \in s} r_k e_k$.
RQR: Generalization of the Wright (1983) estimators, obtained using a general metric as well as constraints on the weights.

APPENDIX C
LIST OF THE PRINCIPAL CONSTANTS

$\pi_k, \pi_{kl}$: Inclusion probabilities of first and second order, respectively.
$q_k, r_k$: Quantities defining an estimator in the QR form; the $q_k$ are used to build the regression coefficients involved in the first part, $\mathbf{t}_x' \hat{B}$, and the $r_k$ are used for the second part, $\sum_{k \in s} r_k e_k$.
$u_k$: Quantity used to reduce the influence of outlier auxiliary information in $\hat{B}$.
$\omega_k$: Weights used to build $\hat{B}$ in a robust way.
$v_k$: Weights used to introduce a robust correction factor.
$w_k$: Calibrated weight attributed to $y_k$, to form $\sum_{k \in s} w_k y_k$.
$c_k$: Factor capable of accounting for heteroscedasticity problems.
$d_k$: Sampling weights, $d_k = 1/\pi_k$.
$g_k$: g-weight, defined by $w_k / d_k$.

Duchesne: Robust Calibration Estimators 56

REFERENCES

BERSHAD, M.A. (1960). Some Observations on Outliers. Unpublished memorandum, Statistical Research Division, U.S. Bureau of the Census.

BOLFARINE, H., and ZACKS, S. (1992). Prediction Theory for Finite Populations. New York: Springer-Verlag.

BREWER, K.R.W. (1994). Survey sampling inference: Some past perspectives and present prospects. Pakistan Journal of Statistics, 10, 213-233.

BRUCE, A.G. (1991). Robust Estimation and Diagnostics for Repeated Sample Surveys. Mathematical Statistics Working Paper 1991/1, Statistics New Zealand.

CASSEL, C.M., SÄRNDAL, C.-E., and WRETMAN, J.H. (1976). Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika, 63, 615-620.

CHAMBERS, R.L. (1982). Robust Finite Population Estimation. Ph.D. thesis, Johns Hopkins University, Dept. of Biostatistics.

CHAMBERS, R.L. (1986). Outlier robust finite population estimation. Journal of the American Statistical Association, 81, 1063-1069.

CHAMBERS, R.L., and KOKIC, P.N. (1993). An integrated approach for the treatment of outliers in sub-annual surveys. Proceedings of the 49th Session, International Statistical Institute.

COAKLEY, C.W., and HETTMANSPERGER, T.P. (1993). A bounded influence, high breakdown, efficient regression estimator. Journal of the American Statistical Association, 88, 872-880.

DALÉN, J. (1987). Practical Estimators of a Population Total Which Reduce the Impact of Large Observations. R&D Report, Statistics Sweden.

DEVILLE, J.-C., and SÄRNDAL, C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87, 376-382.

DONOHO, D.L., and HUBER, P.J. (1983). The notion of breakdown point. In A Festschrift for Erich Lehmann, (Eds. P.J. Bickel, K.A. Doksum and J.L. Hodges). Belmont, CA: Wadsworth.

FULLER, W.A., LOUGHIN, M.M., and BAKER, H.D. (1994). Regression weighting in the presence of nonresponse with application to the 1987-1988 Nationwide Food Consumption Survey. Survey Methodology, 20, 75-85.

GAMBINO, J. (1987). Dealing With Outliers: A Look at Some Methods Used at Statistics Canada. Technical report, Business Survey Division, Statistics Canada.

GROSS, W.F., BODE, G., TAYLOR, J.M., and LLOYD-SMITH, C.W. (1986). Some finite population estimators which reduce the contribution of outliers. Proceedings of the Pacific Statistical Congress, Auckland, New Zealand, 20-24 May 1985.

GWET, J.-P., and RIVEST, L.-P. (1992). Outlier resistant alternatives to the ratio estimator. Journal of the American Statistical Association, 87, 1174-1182.

HAMPEL, F.R., RONCHETTI, E.M., ROUSSEEUW, P.J., and STAHEL, W.A. (1986). Robust Statistics: The Approach Based on Influence Functions. New York: Wiley.

HIDIROGLOU, M.A., and SRINATH, K.P. (1981). Some estimators of the population total from simple random samples containing large units. Journal of the American Statistical Association, 76, 690-695.

HUBER, P.J. (1981). Robust Statistics. New York: Wiley.

HULLIGER, B. (1995). Outlier robust Horvitz-Thompson estimators. Survey Methodology, 21, 79-87.

KISH, L. (1965). Survey Sampling. New York: Wiley.

LEE, H. (1991). Model-based estimators that are robust to outliers. Proceedings of the 1991 Annual Research Conference, U.S. Bureau of the Census.

LEE, H. (1995). Outliers in business surveys. In Business Survey Methods, (Eds. B.G. Cox, D.A. Binder, B.N. Chinnappa, A. Christianson, M.J. Colledge and P.S. Kott). New York: Wiley.

LEE, H., GHANGURDE, P.D., MACH, L., and YUNG, W. (1992). Outliers in Sample Surveys. Methodology Branch Working Paper BSMD-92-008E, Statistics Canada.

MOSTELLER, F., and TUKEY, J.W. (1977). Data Analysis and Regression, A Second Course in Statistics. Reading, MA: Addison-Wesley.

RIVEST, L.-P., and ROUILLARD, E. (1991). M-estimators and outlier resistant alternatives to the ratio estimator. Proceedings: Symposium 90, Measurement and Improvement of Data Quality, Statistics Canada, 245-257.

ROUSSEEUW, P.J., and LEROY, A.M. (1987). Robust Regression and Outlier Detection. New York: Wiley.

ROYALL, R.M. (1970). On finite population sampling under certain linear regression models. Biometrika, 57, 377-387.

SÄRNDAL, C.-E. (1996). Efficient estimators with simple variance in unequal probability sampling. Journal of the American Statistical Association, 91, 1289-1300.

SÄRNDAL, C.-E., SWENSSON, B., and WRETMAN, J.H. (1989). The weighted residual technique for estimating the variance of the general regression estimator of the finite population total. Biometrika, 76, 527-537.

SÄRNDAL, C.-E., SWENSSON, B., and WRETMAN, J.H. (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.

SIMPSON, D.G., and CHANG, Y.-C.I. (1997). Reweighted approximate GM-estimators: asymptotics and residual-based graphics. Journal of Statistical Planning and Inference, 57, 273-293.

SINGH, A.C., and MOHL, C.A. (1996). Understanding calibration estimators in survey sampling. Survey Methodology, 22, 107-115.

SINGH, D., and CHAUDHARY, F.S. (1986). Theory and Analysis of Sample Survey Designs. New York: Wiley.

STATISTICAL SCIENCES, INC. (1991). S-PLUS Reference Manual. Seattle: Statistical Sciences, Inc.

STUKEL, D.M., HIDIROGLOU, M.A., and SÄRNDAL, C.-E. (1996). Variance estimation for calibration estimators: A comparison of jackknifing versus Taylor linearization. Survey Methodology, 22, 117-125.

TAMBAY, J.-L. (1988). An integrated approach for the treatment of outliers in sub-annual surveys. Proceedings of the Section on Survey Research Methods, American Statistical Association, 229-234.

WELSH, A.H., and RONCHETTI, E. (1998). Bias-calibrated estimation from sample surveys containing outliers. Journal of the Royal Statistical Society, Series B, 60, 413-428.

WRIGHT, R.L. (1983).
Finite population sampling with multivariate auxiliary information. Journal of the American Statistical Association, 78, 879-884.

57 Survey Methodology, June 1999, Vol. 25, No. 1, pp. 57-66, Statistics Canada

Estimation in Surveys Using Conditional Inclusion Probabilities: Complex Design

YVES TILLÉ

ABSTRACT

This paper investigates a repeated sampling approach that takes auxiliary information into account in order to improve the precision of estimators. The objective is to build an estimator with a small conditional bias by weighting the observed values by the inverses of the conditional inclusion probabilities. A general approximation is proposed for the case when the auxiliary statistic is a vector of Horvitz-Thompson estimators. This approximation is quite close to the optimal estimator discussed by Fuller and Isaki (1981), Montanari (1987, 1997), Deville (1992) and Rao (1994, 1997). Next, the optimal estimator is applied to a stratified sampling design, and it is shown that the optimal estimator can be viewed as a generalised regression estimator for which the stratification indicator variables are also used at the estimation stage. Finally, the field of application of this estimator is discussed in the general context of the use of auxiliary information.

KEY WORDS: Conditional estimation; Weighted observation; Generalised regression estimator; Complex survey.

1. INTRODUCTION

At the estimation stage, practitioners of survey sampling often have auxiliary information available. This information can be the knowledge of a set of population means or totals. Sometimes, the available information is detailed, for instance when the values taken by a variable on all the units of the population are known. This information can be used to improve the precision of the estimators. Our aim is to deal with the use of auxiliary information based on a conditional principle. Conditional inference has been studied extensively in the survey sampling literature.
Indeed, the optimal estimator was discussed by Fuller and Isaki (1981), Montanari (1987, 1997), Deville (1992) and Rao (1994, 1997). The conditional properties of the post-stratified estimators have been studied by Casady and Valliant (1993). In an earlier paper (Tillé 1998), a general technique that allows one to build a mean or total estimator with a small conditional bias was proposed for simple random sampling. This technique is based on the use of conditional inclusion probabilities and allows one to take auxiliary information into account without any reference to a superpopulation model. In this paper, the use of conditional inclusion probabilities is generalised to any sampling design. It is shown that this technique allows one to construct an estimator very similar to the optimal estimator discussed by Montanari (1987), Deville (1992) and Rao (1994). This family of estimators provides a valid conditional inference and can also be viewed as the optimal linear estimator. Next, these estimators are applied in the stratification case and are compared to the GREG-estimator. The GREG-estimator is generally conditionally biased. Nevertheless, it is shown that, in stratification, the optimal estimator is a particular case of the GREG-estimator. Indeed, when the stratification variables are re-used as auxiliary variables in the GREG-estimator, it is equal to the optimal estimator. Next, a set of simulations is given that shows the interest of the optimal estimator in stratification. The gain in precision can be substantial when the stratification variables are strongly correlated with the interest variable. Finally, we discuss the general estimation problem in survey sampling, which can be viewed as a third-order problem in which three sets of variables interact: the planning variables, the calibration variables and the interest variables. The paper is organised as follows. In section 2, the notation is defined. In section 3, the problem of conditional inference is presented.
In section 4, an approximation of the SCW-estimator is given for complex designs under technical hypotheses. These hypotheses are discussed in section 5. In section 6, the optimal estimator and the SCW-estimator are compared to the generalised regression (GREG) estimator in the stratification framework. It is shown that the optimal estimator can be viewed as a GREG-estimator for which the stratification indicator variables are also used a posteriori. Next, a set of simulations is presented in section 7 in order to compare the discussed estimators. Finally, the problem of interaction between the design and the auxiliary variables is discussed in section 8.

Yves Tillé, CREST-ENSAI, École Nationale de la Statistique et de l'Analyse de l'Information, rue Blaise Pascal, Campus de Ker Lann, 35170 Bruz, France, e-mail: tille@ensai.fr.

2. PROBLEM AND NOTATION

Consider a finite population $U = \{1, \dots, k, \dots, N\}$ and suppose that a random sample $S$ is drawn without replacement from this population following a sampling design $p(\cdot)$. The probability of selecting the sample $s$ is $\Pr(S = s) = p(s)$, for all $s \subset U$. The indicator variable $I_k$ takes the value 1 if unit $k$ is in the sample and 0 otherwise, for all $k \in U$. The inclusion probability of unit $k$ is $\pi_k = E(I_k)$, where $E(\cdot)$ denotes the expectation with respect to the sampling design. The joint inclusion probability for units $k$ and $l$ is $\pi_{kl} = E(I_k I_l)$. Let $y_k$ denote the value of the variable $y$ for the $k$-th unit of the population. The aim is to estimate the population mean of $y$:

$$\bar{y} = \frac{1}{N} \sum_{k \in U} y_k.$$

If $\pi_k > 0$ for all $k \in U$, the Horvitz-Thompson (1952) estimator

$$\bar{y}_{\pi} = \frac{1}{N} \sum_{k \in S} \frac{y_k}{\pi_k}$$

provides an unbiased estimator of $\bar{y}$.

Let $T$ be a statistic. The objective is to estimate $\bar{y}$ with a conditional bias as small as possible with respect to the statistic $T$. Define the first-order conditional inclusion probabilities $\pi_{k|T} = E(I_k \mid T)$ for all $k \in U$, and the conditional joint inclusion probabilities $\pi_{kl|T} = E(I_k I_l \mid T)$ for all $k, l \in U$, $k \neq l$. The simple conditionally weighted (SCW) estimator is defined by

$$\bar{y}_{SCW} = \frac{1}{N} \sum_{k \in S} \frac{y_k}{\pi_{k|T}}. \qquad (2.1)$$

This estimator is not exactly conditionally unbiased. Indeed, a conditionally unbiased estimator exists if and only if $\pi_{k|T} > 0$ for all $k \in U$. For this reason, it is useful to enlarge the definition of conditional unbiasedness: an estimator is said to be virtually conditionally unbiased (VCU) if its conditional bias depends only on the units having null conditional inclusion probabilities. The SCW-estimator is VCU; indeed,

$$B(\bar{y}_{SCW} \mid T) = E(\bar{y}_{SCW} \mid T) - \bar{y} = -\frac{1}{N} \sum_{\{k \in U :\, \pi_{k|T} = 0\}} y_k.$$

This estimator generalises some classic results (see Tillé 1998) like post-stratification. Moreover, it allows us to build an original estimator for a contingency table when the population marginal totals are known. Unfortunately, the computation of the $\pi_{k|T}$ becomes very difficult in complex sampling designs. A general approximation for the SCW-estimator will however be given when a vector of Horvitz-Thompson estimators is used as the auxiliary statistic.

3. USE OF A COMPLEX AUXILIARY STATISTIC

Suppose that the auxiliary information is represented by the vector $\mathbf{x}_k = (x_{k1}, \dots, x_{kj}, \dots, x_{kJ})'$ of values taken by the $J$ auxiliary variables on the $k$-th unit of $U$. In a first step, it is supposed that the $\mathbf{x}_k$ are known for each unit of the population. Later, the more restrictive case will be considered where only a function of the $\mathbf{x}_k$, such as $\bar{\mathbf{x}} = N^{-1} \sum_{k \in U} \mathbf{x}_k$, is known. Consider the Horvitz-Thompson estimator of $\bar{\mathbf{x}}$ given by

$$\bar{\mathbf{x}}_{\pi} = \frac{1}{N} \sum_{k \in S} \frac{\mathbf{x}_k}{\pi_k}.$$

If $\pi_k > 0$ for all $k \in U$, $\bar{\mathbf{x}}_{\pi}$ is an unbiased estimator of $\bar{\mathbf{x}}$:

$$E(\bar{\mathbf{x}}_{\pi}) = \bar{\mathbf{x}}. \qquad (3.2)$$

The variance of $\bar{\mathbf{x}}_{\pi}$ is given by

$$\Sigma = \operatorname{Var}(\bar{\mathbf{x}}_{\pi}) = \frac{1}{N^2} \sum_{l \in U} \mathbf{x}_l \mathbf{x}_l' \, \frac{1 - \pi_l}{\pi_l} + \frac{1}{N^2} \sum_{l \in U} \sum_{\substack{m \in U \\ m \neq l}} \frac{\mathbf{x}_l \mathbf{x}_m'}{\pi_l \pi_m} (\pi_{lm} - \pi_l \pi_m). \qquad (3.3)$$

Suppose now that the vector $(\bar{y}_{\pi} \; \bar{\mathbf{x}}_{\pi}')'$ has a multinormal distribution. Under this hypothesis, a conditionally unbiased estimator can be derived (see for instance Deville 1992). First, the conditional bias is computed:

$$B(\bar{y}_{\pi} \mid \bar{\mathbf{x}}_{\pi}) = E(\bar{y}_{\pi} \mid \bar{\mathbf{x}}_{\pi}) - \bar{y} = (\bar{\mathbf{x}}_{\pi} - \bar{\mathbf{x}})' \operatorname{Var}(\bar{\mathbf{x}}_{\pi})^{-1} \operatorname{Cov}(\bar{\mathbf{x}}_{\pi}, \bar{y}_{\pi}).$$
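The exact design-unbiasedness of the Horvitz-Thompson estimator can be checked by brute force on a toy population, enumerating every simple random sample without replacement; the y-values below are arbitrary, and exact rational arithmetic makes the check deterministic.

```python
import itertools
from fractions import Fraction

# Small population: y-values for N = 6 units.
y = [3, 7, 4, 9, 1, 6]
N, n = len(y), 3
ybar = Fraction(sum(y), N)  # population mean (= 5)

# Under SRSWOR every size-n sample has probability 1/C(N, n)
# and every unit has inclusion probability pi_k = n/N.
pi = Fraction(n, N)
samples = list(itertools.combinations(range(N), n))

def ht_mean(s):
    """Horvitz-Thompson estimator of the mean for one sample s."""
    return Fraction(1, N) * sum(Fraction(y[k]) / pi for k in s)

# Design expectation of the HT estimator: average over all samples.
expectation = sum(ht_mean(s) for s in samples) / len(samples)
assert expectation == ybar  # exactly unbiased
```

Under equal probabilities the HT mean reduces to the sample mean, so the check also illustrates why the estimator is unbiased in this special case.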
In a first step, it is If an estimator of B{yJXj^) is available, the HorvitzThompson estimator can be cortected in the foUowing way: yc=y.-P(yn\K) = ^« + (^ - \) Var (xj-i Cov(x„, y„). This estimator is related to the optimal Unear estimator discussed by Fuller and Isaki (1981), Montanari (1987) and Rao (1994). Indeed, Montanari showed that the best estimator in the sense of the smallest mean square ertor (MSE)ofthefomi 3'p=)'n + (^-\P (3.4) occurs when p takes the value: Popr =S-'Cov(i„,?„). The optimal linear estimator presented by Montanari leads thus to a very similar result to the conditional approach, although Montanari did not start with a conditional point of view. In Montanari's approach, the optimal estimator is found in a class of linear estimators defined by (3.4) without any reference to conditional properties. Nevertheless, Rao (1994) has pointed out that this estimator leads to valid conditional inference. The general problem Survey Methodology, June 1999 59 of the optimal estimator is that PQ^J. is not known and must thus be estimated. By estimating POPT> the optimal properties of the estimator are lost. ^ In order to estimate PQPT' (or ^(ynlx^)) two cases can be distinguished. In the first one, the values taken by the auxiUary variable on all the units of the population are known. In this case, Y. is thus known and PQPP is thus estimated by (E is supposed non singular) PoFr = ^"'Cov2(x„,y„). By estimating Popp. another asymptotically optimal (AOPT) estimator can be given: 3 ' A 0 F n " 3 ^ n + { ' ^ n - x ) ' P : OPT- N^ keu \ ^3', r^EE A^'' keU leU ( % - '^t'^/) Tt^t, -Y (X|,-x)y, N keU can be unbiasedly estimated by (3.8) The difference between the AOPTl and A0PT2 estimator is die way we estimate Cov (x^, y^^) and £ . However, the AOPTI-estimator needs more complete auxiliary information. The generalised regression (GREG) estimator defined by Cassel, Samdal and Wretman (1976), Wright (1983), Samdal, Swensson and Wretman (1992, p. 
225) is also an estimator of the linear class given by expression (3.4). For the GREG-estimator P is defined by -1 (3.5) C«^>K'^n) =NT;E PGREG , keS ^k^k keU E t^t keU and can be estimated by where ""l^kl ^\k-E[K\k^s) N= , ]-Y ,, 71,71, leU k I By using (3.5), afirstasymptotically optimal estimator can be constructed kovri =y.-(^- ^l*-x K)^''jX ^ ^ ^ r N keS (3-7) ''k'^k -'GREG keS kes n^c^ where quantities c^>0,^e [/, are weights defined for all tiie population units. TTie GREG-estimator does not have good conditional properties. It is generally conditionally biased (Rao 1994). {l-n,)^ 4. APPROXIMATION OF THE SCW-ESTIMATOR Another way to construct a conditionally unbiased estimator is tofindan approximation ofthe SCW-estimator given in (2.1). Indeed this estimator has good unbiasedness properties because it is VCU. If x^^ is used as an auxiliary statistic, we shall seek an approximation of n{ \\T 1 Jt^C^ \ In the second case, only the population mean x is known, E must thus be estimated and Cov(x^, y^) can not be estimated using (3.5). Montanari proposes to estimate E and Cov(Xjj, y^j) by the classic Horvitz-Thompson estimator: ^--X N' kes "iX, (3.6) n.N k E(h\U V - V - X,X, 7t,,-7l,7t, N^ kes teS T^k^l l*k ••kl If the random vector x^^ takes for instance the value z, we get by Bayes's theorem that and E(l,\i^ N^ kes nl N^ keS leS TC^Tl, 7t^, = z)=Pr(keS\x^=z)=K, Pr{i^=z\k€S) Pr(x„ = z) : In order to compute the conditional inclusion probabilities, it is thus necessary to know the probability distribution of Xjj unconditionally and conditionally on the presence of each unit in the sample. Except for some particular case, this probability distribution is very complex; for this reason an approximation will be constructed. 60 Tille: Estimation in Surveys Using Conditional Inclusion Probabilities It is possible to derive the means and variances of x^ unconditionally and conditionally on the presence of each unit in the sample. 
Indeed, $E(\bar{\mathbf{x}}_{\pi})$ is given in (3.2), $\operatorname{Var}(\bar{\mathbf{x}}_{\pi})$ in (3.3), $E(\bar{\mathbf{x}}_{\pi} \mid k \in S)$ in (3.6), and

$$\Sigma_k = \operatorname{Var}(\bar{\mathbf{x}}_{\pi} \mid k \in S) = \frac{1}{N^2} \sum_{l \in U} \mathbf{x}_l \mathbf{x}_l' \, \frac{\pi_{kl}}{\pi_k \pi_l^2} + \frac{1}{N^2} \sum_{l \in U} \sum_{\substack{m \in U \\ m \neq l}} \frac{\mathbf{x}_l \mathbf{x}_m'}{\pi_l \pi_m} \, \frac{\pi_{klm}}{\pi_k} - \bar{\mathbf{x}}_k \bar{\mathbf{x}}_k', \qquad (4.9)$$

where $\pi_{klm}$ is the third-order inclusion probability. Matrices $\Sigma$ and $\Sigma_k$ are assumed to be non-singular. As the probability distribution of $\bar{\mathbf{x}}_{\pi}$ is generally unknown, the following three assumptions will be used to construct an approximation of the conditional inclusion probabilities.

(i) If the sample size $n$ is large, $\bar{\mathbf{x}}_{\pi}$ has a multivariate normal distribution, unconditionally and conditionally on the presence of each unit in the sample.

(ii) $\mathbf{R}_k^{-1} - \mathbf{R}^{-1} = O_{J \times J}(n^{-1})$ for all $k \in U$, where $\mathbf{R} = \mathbf{V}^{-1/2} \Sigma \mathbf{V}^{-1/2}$, $\mathbf{R}_k = \mathbf{V}^{-1/2} \Sigma_k \mathbf{V}^{-1/2}$, $\mathbf{V}$ denotes a $J \times J$ diagonal matrix having the elements of the diagonal of $\Sigma$ on its diagonal, and $O_{J \times J}(n^{-\alpha})$ denotes a matrix of quantities that, when multiplied by $n^{\alpha}$, remain bounded as $n \to \infty$.

(iii) $\gamma_k = \mathbf{V}^{-1/2}(\bar{\mathbf{x}}_k - \bar{\mathbf{x}}) = O_J(n^{-1/2})$, where $O_J(n^{-\alpha})$ denotes a vector of quantities that, when multiplied by $n^{\alpha}$, remain bounded as $n \to \infty$.

These three hypotheses bear on the sample size. It is thus supposed that when $n$ increases, $N$ increases at least as quickly as $n$. Nevertheless, no hypothesis is made on $f = n/N$.

Assuming that these hypotheses are verified, the following result gives an approximation of the SCW-estimator.

Result 1: Assuming (i), (ii) and (iii), and if the auxiliary statistic used is $\bar{\mathbf{x}}_{\pi}$, then

$$\bar{y}_{SCW} = \bar{y}_{\pi} + (\bar{\mathbf{x}} - \bar{\mathbf{x}}_{\pi})' \Sigma^{-1} \frac{1}{N} \sum_{k \in S} \frac{(\bar{\mathbf{x}}_k - \bar{\mathbf{x}}) y_k}{\pi_k} + O_p(n^{-1}), \qquad (4.10)$$

where $O_p(n^{-1})$ is a quantity that, when multiplied by $n$, remains bounded in probability. The proof of Result 1 is given in the appendix.

5. DISCUSSION ABOUT THE HYPOTHESES

These three hypotheses are verified for simple random sampling without replacement when only one auxiliary variable is available. Indeed, in this case, we have $J = 1$, $\mathbf{x}_k = x_k$, $\bar{\mathbf{x}} = \bar{x}$, $\bar{\mathbf{x}}_k = \bar{x}_k$ and $\bar{\mathbf{x}}_{\pi} = \bar{x}_{\pi}$, with

$$\pi_k = \frac{n}{N}, \qquad \pi_{kl} = \frac{n}{N} \, \frac{n-1}{N-1}, \qquad \pi_{klm} = \frac{n}{N} \, \frac{n-1}{N-1} \, \frac{n-2}{N-2}.$$

By (3.6),

$$\bar{x}_k = \bar{x} + \frac{N-n}{n(N-1)} (x_k - \bar{x}); \qquad (5.11)$$

by (3.3),

$$\operatorname{Var}(\bar{x}_{\pi}) = \frac{N-n}{N-1} \, \frac{\sigma_x^2}{n}, \quad \text{where } \sigma_x^2 = \frac{1}{N} \sum_{k \in U} (x_k - \bar{x})^2; \qquad (5.12)$$

and by (4.9),

$$\operatorname{Var}(\bar{x}_{\pi} \mid k \in S) = \frac{N(n-1)(N-n)}{(N-2)(N-1)n^2} \left[\sigma_x^2 - \frac{(x_k - \bar{x})^2}{N-1}\right]. \qquad (5.13)$$

Now, consider the three hypotheses for this particular case.

- Hypothesis (i) was proved by Madow (1948) under some conditions.

- For hypothesis (ii), by (5.12) and (5.13), we get

$$\frac{\operatorname{Var}(\bar{x}_{\pi} \mid k \in S)}{\operatorname{Var}(\bar{x}_{\pi})} = \frac{N(n-1)}{(N-2)n} \left[1 - \frac{(x_k - \bar{x})^2}{(N-1)\sigma_x^2}\right] = 1 + O(n^{-1}).$$

- Hypothesis (iii) becomes $\gamma_k = (\bar{x}_k - \bar{x}) / \{\operatorname{Var}(\bar{x}_{\pi})\}^{1/2} = O(n^{-1/2})$; by (5.11) and (5.12),

$$\gamma_k = \left(\frac{N-n}{n(N-1)}\right)^{1/2} \frac{x_k - \bar{x}}{\sigma_x} = O(n^{-1/2}),$$

provided that $|x_k - \bar{x}| / \sigma_x$ remains bounded.

In simple random sampling, these hypotheses can be better interpreted. Hypothesis (i) is the classic assumption of normality that was also needed for the construction of the optimal estimator. In simple random sampling, it is easy to verify that hypothesis (iii) implies hypothesis (ii). Both technical hypotheses simply imply that a particular unit cannot take a $|x_k - \bar{x}|$ value much larger than the other ones. The three hypotheses are thus valid under simple random sampling when only one variable is available. This result can also be extended to stratified sampling when the number of strata is fixed and the sample size within each stratum is large. In cluster sampling, when the number of clusters is large and the clusters are selected with a simple random sampling design, these hypotheses are still applicable. Hypothesis (i) was also partially shown by Rosén (1972) for sampling with unequal probabilities; actually, Rosén's proof is restricted to a rejective sampling design. The proposed hypotheses are generally less restrictive than a superpopulation model. Indeed, a superpopulation model is a set of hypotheses on the interest variables, while the three hypotheses presented only affect the auxiliary variables.
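The SRSWOR inclusion probabilities quoted above can be verified by brute-force enumeration on a toy population; the sizes $N = 7$ and $n = 4$ below are arbitrary.

```python
import itertools
from fractions import Fraction

# Under SRSWOR of size n from N, the inclusion probabilities are
# pi_k = n/N, pi_kl = n(n-1)/(N(N-1)), pi_klm = n(n-1)(n-2)/(N(N-1)(N-2)).
N, n = 7, 4
samples = list(itertools.combinations(range(N), n))
p = Fraction(1, len(samples))  # each sample equally likely

def incl_prob(units):
    """Probability that all the given units appear together in the sample."""
    return sum(p for s in samples if set(units) <= set(s))

pi_k = incl_prob([0])
pi_kl = incl_prob([0, 1])
pi_klm = incl_prob([0, 1, 2])

assert pi_k == Fraction(n, N)
assert pi_kl == Fraction(n * (n - 1), N * (N - 1))
assert pi_klm == Fraction(n * (n - 1) * (n - 2), N * (N - 1) * (N - 2))
```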
In a superpopulation model, the relation between the interest variable and the auxiliary variables is the most extensive contribution of the model. In the conditional approach, no hypothesis is made on the interest variable. Even if the hypotheses presented are debatable, it is thus clear that a superpopulation model is a set of hypotheses much more restrictive than those used in the conditional approach.

6. APPLICATION TO STRATIFIED SAMPLING

6.1 The Problem

In stratification, auxiliary information is used a priori to improve the estimation. In this case, three sets of variables interact: the stratification variables, the auxiliary variables used a posteriori and the interest variable. Suppose that the population is partitioned into $H$ strata $U_h$, $h = 1, \dots, H$, of sizes $N_h$, $h = 1, \dots, H$. The population means within the strata are denoted $\bar{y}_h = N_h^{-1} \sum_{k \in U_h} y_k$ and $\bar{\mathbf{x}}_h = N_h^{-1} \sum_{k \in U_h} \mathbf{x}_k$. A simple random sample $S_h$ of fixed size $n_h$ ($\sum_{h=1}^{H} n_h = n$) is selected without replacement, independently in each stratum. From the general theory of stratification (see for instance Särndal, Swensson and Wretman 1992, p. 100), we get

$$\bar{y}_{\pi} = \frac{1}{N} \sum_{h=1}^{H} N_h \hat{\bar{y}}_h \quad \text{and} \quad \bar{\mathbf{x}}_{\pi} = \frac{1}{N} \sum_{h=1}^{H} N_h \hat{\bar{\mathbf{x}}}_h,$$

where $\hat{\bar{y}}_h = n_h^{-1} \sum_{k \in S_h} y_k$ and $\hat{\bar{\mathbf{x}}}_h = n_h^{-1} \sum_{k \in S_h} \mathbf{x}_k$. Moreover, we have that

$$\operatorname{Cov}(\bar{\mathbf{x}}_{\pi}, \bar{y}_{\pi}) = \frac{1}{N^2} \sum_{h=1}^{H} N_h^2 \, \frac{1 - f_h}{n_h} \, \frac{1}{N_h - 1} \sum_{k \in U_h} (\mathbf{x}_k - \bar{\mathbf{x}}_h)(y_k - \bar{y}_h)$$

and

$$\Sigma = \frac{1}{N^2} \sum_{h=1}^{H} N_h^2 \, \frac{1 - f_h}{n_h} \, \frac{1}{N_h - 1} \sum_{k \in U_h} (\mathbf{x}_k - \bar{\mathbf{x}}_h)(\mathbf{x}_k - \bar{\mathbf{x}}_h)',$$

where $f_h = n_h / N_h$, $h = 1, \dots, H$.

6.2 AOPT1-Estimator

If $k \in U_h$, by extending expression (3.6) to stratified sampling, we get

$$\bar{\mathbf{x}}_k = E(\bar{\mathbf{x}}_{\pi} \mid k \in S) = \bar{\mathbf{x}} + \frac{N_h^2 (1 - f_h)}{N n_h (N_h - 1)} (\mathbf{x}_k - \bar{\mathbf{x}}_h)$$

and

$$\frac{1}{N} \sum_{k \in S} \frac{(\bar{\mathbf{x}}_k - \bar{\mathbf{x}}) y_k}{\pi_k} = \frac{1}{N^2} \sum_{h=1}^{H} \frac{N_h^3 (1 - f_h)}{N_h - 1} \, \frac{1}{n_h^2} \sum_{k \in S_h} (\mathbf{x}_k - \bar{\mathbf{x}}_h) y_k.$$

From (3.7), the AOPT1-estimator can be derived as

$$\bar{y}_{AOPT1} = \bar{y}_{\pi} + (\bar{\mathbf{x}} - \bar{\mathbf{x}}_{\pi})' \left[\sum_{h=1}^{H} \frac{N_h^2 (1 - f_h)}{n_h} \, \frac{1}{N_h - 1} \sum_{k \in U_h} (\mathbf{x}_k - \bar{\mathbf{x}}_h)(\mathbf{x}_k - \bar{\mathbf{x}}_h)'\right]^{-1} \sum_{h=1}^{H} \frac{N_h^3 (1 - f_h)}{N_h - 1} \, \frac{1}{n_h^2} \sum_{k \in S_h} (\mathbf{x}_k - \bar{\mathbf{x}}_h) y_k.$$

The use of this estimator requires the knowledge of very substantial auxiliary information: the population means $\bar{\mathbf{x}}_h$ of the auxiliary variables must be known for each stratum, as well as the stratum sizes $N_h$.
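The stratified estimator $\bar{y}_{\pi} = N^{-1} \sum_h N_h \hat{\bar{y}}_h$ quoted above can be sketched directly; the sample values and stratum sizes below are hypothetical.

```python
import numpy as np

def stratified_ht_mean(y_by_stratum, N_h):
    """Stratified Horvitz-Thompson estimator of the population mean:
    (1/N) * sum_h N_h * (sample mean within stratum h).
    y_by_stratum: list of per-stratum sample value lists; N_h: stratum sizes."""
    N = sum(N_h)
    return sum(Nh * np.mean(ys) for ys, Nh in zip(y_by_stratum, N_h)) / N

# Two strata of sizes 10 and 5; per-stratum sample means are 3 and 10,
# so the estimate is (10*3 + 5*10) / 15 = 16/3.
est = stratified_ht_mean([[2.0, 4.0], [10.0]], [10, 5])
```

Note that for the constant variable $y_k = 1$ the estimator returns exactly 1, i.e., the stratified design is automatically calibrated on $N$ and on the stratum sizes, which is the property discussed for the AOPT-estimators below.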
Moreover, the values taken by the auxiliary variables must be known for each unit of $U$. However, $\bar{y}_{AOPT1}$ has an important drawback in stratification: it is not calibrated on the stratum sizes $N_h$, i.e., when the objective consists in estimating the stratum sizes $N_h$, generally $\hat{N}_{h,AOPT1} \neq N_h$. This drawback can easily be overcome by centring the interest variable within the strata; we thus get

$$\bar{y}_{AOPT1c} = \bar{y}_{\pi} + (\bar{\mathbf{x}} - \bar{\mathbf{x}}_{\pi})' \left[\sum_{h=1}^{H} \frac{N_h^2 (1 - f_h)}{n_h} \, \frac{1}{N_h - 1} \sum_{k \in U_h} (\mathbf{x}_k - \bar{\mathbf{x}}_h)(\mathbf{x}_k - \bar{\mathbf{x}}_h)'\right]^{-1} \sum_{h=1}^{H} \frac{N_h^3 (1 - f_h)}{N_h - 1} \, \frac{1}{n_h^2} \sum_{k \in S_h} (\mathbf{x}_k - \bar{\mathbf{x}}_h)(y_k - \hat{\bar{y}}_h).$$

6.3 AOPT2-Estimator

The AOPT2-estimator can also be used in stratification. In this case, from (3.8), we get

$$\bar{y}_{AOPT2} = \bar{y}_{\pi} + (\bar{\mathbf{x}} - \bar{\mathbf{x}}_{\pi})' \left[\sum_{h=1}^{H} \frac{N_h^2 (1 - f_h)}{n_h} \, \frac{1}{n_h - 1} \sum_{k \in S_h} (\mathbf{x}_k - \hat{\bar{\mathbf{x}}}_h)(\mathbf{x}_k - \hat{\bar{\mathbf{x}}}_h)'\right]^{-1} \sum_{h=1}^{H} \frac{N_h^2 (1 - f_h)}{n_h} \, \frac{1}{n_h - 1} \sum_{k \in S_h} (\mathbf{x}_k - \hat{\bar{\mathbf{x}}}_h)(y_k - \hat{\bar{y}}_h).$$

The AOPT2-estimator only needs the knowledge of the population mean vector $\bar{\mathbf{x}}$ and of the stratum sizes $N_h$. It has however a drawback: the $\bar{\mathbf{x}}_h$ are estimated, and thus $J \times H$ degrees of freedom are lost. If the number of strata is large, this loss of degrees of freedom could increase the instability of this estimator when $J \times H$ is large.

6.4 GREG-Estimator

The GREG-estimator does not take the joint inclusion probabilities into account. In stratified sampling, it is given by

$$\bar{y}_{GREG} = \bar{y}_{\pi} + (\bar{\mathbf{x}} - \bar{\mathbf{x}}_{\pi})' \left[\sum_{h=1}^{H} \frac{N_h}{n_h} \sum_{k \in S_h} \frac{\mathbf{x}_k \mathbf{x}_k'}{c_k}\right]^{-1} \sum_{h=1}^{H} \frac{N_h}{n_h} \sum_{k \in S_h} \frac{\mathbf{x}_k y_k}{c_k}.$$

Although this estimator is more stable, it is conditionally biased. Moreover, if we want to estimate the stratum sizes $N_h$ by the GREG-estimator, we do not find exactly $N_h$. Indeed, since in stratified sampling $\hat{N}_{h,\pi} = N_h$, if $y_k = 1$ when $k \in U_h$ and $y_k = 0$ when $k \notin U_h$, then

$$\hat{N}_{h,GREG} = N_h + N (\bar{\mathbf{x}} - \bar{\mathbf{x}}_{\pi})' \left[\sum_{h'=1}^{H} \frac{N_{h'}}{n_{h'}} \sum_{k \in S_{h'}} \frac{\mathbf{x}_k \mathbf{x}_k'}{c_k}\right]^{-1} \frac{N_h}{n_h} \sum_{k \in S_h} \frac{\mathbf{x}_k}{c_k}. \qquad (6.14)$$

Expression (6.14) shows that generally $\hat{N}_{h,GREG} \neq N_h$. Thus, the GREG-estimator destroys the stratification effect because it does not take the stratification into account. Indeed, the stratification is represented by the joint inclusion probabilities.
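The GREG-estimator can equivalently be computed through calibrated weights (the weighted form of Deville and Särndal 1992). A minimal sketch with hypothetical data, checking the defining property that the weights reproduce the known auxiliary totals exactly:

```python
import numpy as np

def greg_weights(X_s, d, c, tx_pop):
    """GREG calibration weights w_k = d_k * (1 + (t_x - t_x_hat)' T^{-1} x_k / c_k),
    with T = sum_s d_k x_k x_k' / c_k.
    X_s: (n, J) sample auxiliaries; d: design weights 1/pi_k;
    c: variance factors c_k; tx_pop: vector of known population totals of x."""
    T = (X_s * (d / c)[:, None]).T @ X_s
    lam = np.linalg.solve(T, tx_pop - X_s.T @ d)
    return d * (1.0 + (X_s @ lam) / c)

# Hypothetical sample of 4 units with constant design weights d_k = 2.5.
X_s = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0], [1.0, 7.0]])
d = np.array([2.5, 2.5, 2.5, 2.5])
c = np.ones(4)
tx_pop = np.array([10.0, 40.0])   # assumed known totals of (1, x)

w = greg_weights(X_s, d, c, tx_pop)
# The GREG weights reproduce the auxiliary totals exactly (calibration).
assert np.allclose(X_s.T @ w, tx_pop)
```

The same algebra explains (6.14): the weights are calibrated on whatever variables sit in the auxiliary vector, so the stratum sizes are reproduced only if the stratum indicators are included among those variables.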
In the GREG-estimator, only the first-order inclusion probabilities are used. On the other hand, it is easy to verify that the AOPT1- and AOPT2-estimators of $N_h$ are exactly equal to $N_h$: the AOPT-estimators are thus calibrated on the $N_h$. We propose to use the GREG-estimator in the following cases: when the sample size is small, or when the number of strata is large and the stratification gives poor auxiliary information on the interest variable. Indeed, in these cases, the loss of precision due to the loss of degrees of freedom will be more important than the precision benefit due to the optimality of the estimator. An interesting analysis of the benefit due to the optimal estimator is also given in Montanari (1998).

6.5 GREG-Estimator With Use of the Stratification Variables

A variant of the GREG-estimator consists in re-using the stratification variables at the estimation stage. Consider the column vector

$$\mathbf{w}_k = (z_{k1}, \dots, z_{kh}, \dots, z_{kH}, \mathbf{x}_k')',$$

where $z_{kh} = 1$ if $k \in U_h$ and 0 if not. This vector is thus composed of the values taken by the indicator variables of the presence of unit $k$ in the $H$ strata and of the values taken by the $\mathbf{x}$-auxiliary variables. Now, if $\bar{\mathbf{w}}$ denotes the population mean of the vectors $\mathbf{w}_k$ and $\bar{\mathbf{w}}_{\pi}$ its Horvitz-Thompson estimator, the GREG-estimator using the auxiliary information $\mathbf{w}$ is given by
= Cf^,k€Ui^) the GREG-estimator can be written The populations are generated by means of the following models: T*,:Xj^ = aj^,y^ = e^.^et/, (total independence), '^r^k-^k'Yk-^-'^k'^ ^k'^^(/, (dependencebetweenXand y), 5*3: x^ = a^,y^^ = Xj^ + 2h{k) + ei^,keU, (dependencebetween x, y and the strata), J*^: Jc^ = a^.y^^ = exp(10 + 2x^ (+ I0h{k)-^e^,ks [/, non-linearity and dependence between X, y and the strata), T^ x^ = a^,y^ = exp(e^ + 3xj^) + 3h{k), k€U, (non-linearity and dependence between x,y and the strata), 7*^: x^ = a^, y^ = 3h{k) + e^^, /: 6 (/, (strong dependence between y and the strata), J*.j:Xj^ = aj^,y^ = 50h{k) + ei^,keU, (very strong dependence between y and the strata), where a^^ and e^^ are independent normal variable with mean equal to 0 and variance equal to 1, and h{k) is the number of the stratum of unit k. Results of the simulations is given in Table 1. ^GREGW Table 1 Results of 10,000 Simulations -I i-(x-x,)' E — E A=l (x,-x,)(x,-x,)' n^C^keS, " N ^ x E — ^ E •P, ^ (Xt-x^)Cy^-y^). (6.16) T, •P. v. y. y. y. M, 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 A = l n^Ci^kes, A proof of Result 2 is given in the Appendix. Note that expression (6.16) is equal to the AOPT2-estimator when c.=C^ " n, I - lln. ^, ^A M2 1.0070 0.0906 0.5180 0.9261 0.9263 1.1047 38.5104 M^ 1.0069 0.0906 0.4835 0.9277 0.9269 1.0015 1.0123 M^ 1.0060 0.0936 0.4850 0.9257 0.9239 1.0006 1.0111 1-/A for /i = 1,..., H, where C>0 isaconstant. Whenthe/^ are small and the «^ are large and proportional to the A^^ both estimators are equivalent. This result shows that with the conditional approach, the fact that the sampling design is stratified is automatically taken into account in the estimation method. The GREG-estimator does not take into account the stratification effect and thus it is necessary to reintroduce the stratification variables at the estimation stage so as not to lose the stratification effect. 
Table 1 shows that the GREG-estimator provides a good estimation when the stratification variables are not cortelated to the interest variable. Nevertheless, the more is the dependence between the stratification variable and the interest variable, the more is the gain of precision of >'AotTic ^^^ ^AOPT2- ^ ^ '°^^ of degrees of freedom ofthe optimal estimator does not seem to affect the precision for this sample size. Moreover, the gain obtained by the knowledge of the population stratum is not significant for this sample size. For aU these cases, the optimal estimator is thus clearly preferable to the GREG-estimator. 7. SIMULATIONS 8. A THIRD-ORDER PROBLEM A set of simulations was carried out in order to compare the four following estimators: y„,>'Aopric, >'AOFn, y^^Q.The population is made up of 4 strata of 250 units (N= 1,000). A stratified sampling design is applied with proportional allocation. For each simulation, 10,000 samples of size n =100 are selected and the foUowing ratios has been estimated: The complexity of determining the conditional weights is not a specific problem of the SCW-estimator. It is due to the general problem of estimation with auxiUary information used a posteriori when an auxiliary variable is already used a priori in the sampling design. This problem can be presented as a third-order interaction problem among M, =MSE(y„)/MSE(y„)=l, - the interest variables; M2=MSE(|G^O)/MSE(^„), - the sampling design and thus the auxiliary variables used a priori; - the auxiliary variables used a posteriori. M3 = MSE(^A0Pric)/MSE(y„), M,=MSE(y^oj^/MSE(yJ. 64 Tille: Estimation in Surveys Using Conditional Inclusion Probabilities Indeed, the use of auxiliary information at the estimation stage leads to the following problem: how do these auxiliary variables used a posteriori interact with the interest variable through a given sampling design? 
The problem being complex, we have to take into account the relationships between each set of variables above as well as the third-order interactions among these three sets of variables. It is very difficult to find a really operational estimator which uses the three second-order interactions and the tiiird-order interaction. For this reason, one can attempt to simplify the problem. The neutraUsation of one of the aspects of this problem significantly simplifies the research of an estimator. Most of the possible simplifications have already been studied. We can cite some of these: - - - - - If no auxiliary information is used a posteriori (except the population size N ) v/e can only construct the Horvitz-Thompson estimator or Hajek's ratio (1971). Searching general solutions using auxiliary information for simple random sampling does not pose major problems. In this case, no auxiUary information is used a priori. Using a superpopulation model allows one to fix a relation existing between the interest variable and the auxiliary variables used a posteriori. In this case, it is possible to determine the optimal estimator (under the model). For the GREG-estimator and also for the calibration methods see Deville and Samdal 1992), in the designbased inference framework,only thefirst-orderprobabilities are retained from the sampling design. A simple random sampling is thus treated in the same way as a stratified design for which the first-order inclusion probabilities are all equal. For this reason, a regression estimator applied to a stratified design generally destroys the calibration on the stratum frequencies given by the a priori stratification. In this case, the simpUfication arises because aU the contiibutions of the auxiliary variables used a priori to the sampling design can be described only by the first-order probabilities. 
- Finally, for the optimal linear estimator, it is implicitly supposed that the dependence between the Horvitz-Thompson estimators of the variables x and y is linear. Obviously, this estimator neglects the non-linear dependence between the estimators. Nevertheless, it takes into account the joint inclusion probabilities. When the sampling design is stratified, the estimator remains calibrated on the population stratum frequencies.

The CW-estimator takes into account this third-order interaction. Moreover, in this case, auxiliary information does not necessarily intervene in a linear way. The weights depend on both the sampling design and the auxiliary statistic. These weights, applied to the values taken by the interest variable, take into account all the interactions between the three variable groups.

The methods using conditional inclusion probabilities are interesting for different reasons: they give a general framework allowing one to search for and conceive estimators using auxiliary information without reference to a superpopulation model, and they lead to valid conditional inference. They bring into prominence all the complexity of the estimation problem with auxiliary information. According to the known auxiliary information, we can find either known results (as, for example, post-stratification) or very complex and not really operational estimators. However, a first approximation leads to a known result, i.e., the optimal linear estimator.

ACKNOWLEDGEMENTS

The author is grateful to two anonymous referees and an Associate Editor for constructive suggestions, and to Professor Carl Särndal for interesting comments on a previous version of this paper.

APPENDIX: PROOF OF RESULTS 1 AND 2

Lemma 1 will be used in the proof of Result 1.

Lemma 1. If $R_k^{-1} - R^{-1} = O_{J \times J}(n^{-1})$, then $|R_k|^{-1}|R| = 1 + O(n^{-1})$, where $R$ and $R_k$ are defined as in hypothesis (ii).

Proof. $[R_k^{-1} - R^{-1}]R = O_{J \times J}(n^{-1})R$, and thus

$$|R_k^{-1}R| = |I + O_{J \times J}(n^{-1})R| = [1 + O(n^{-1})]^J + O(n^{-1}) = 1 + O(n^{-1}),$$

where $I$ is a $J \times J$ identity matrix.
Thus

$$\frac{|R|}{|R_k|} = \frac{1 + O(n^{-1})}{1 + O(n^{-1})} = 1 + O(n^{-1}).$$

Note that Lemma 1 is a consequence of hypothesis (ii).

Proof of Result 1. For all $k \in U$, hypothesis (i) gives

$$\pi_k(\bar{x}_n) = d_k\,\frac{f_k(\bar{x}_n)}{f(\bar{x}_n)},$$

where $f$ (resp. $f_k$) is the density function of a multivariate normal variable with mean $\bar{x}$ (resp. $\bar{x}_k$) and variance-covariance matrix $\Sigma$ (resp. $\Sigma_k$), that is,

$$f(\bar{x}_n) \propto |\Sigma|^{-1/2}\exp -\frac{1}{2}(\bar{x}_n - \bar{x})'\Sigma^{-1}(\bar{x}_n - \bar{x}), \qquad f_k(\bar{x}_n) \propto |\Sigma_k|^{-1/2}\exp -\frac{1}{2}(\bar{x}_n - \bar{x}_k)'\Sigma_k^{-1}(\bar{x}_n - \bar{x}_k). \quad (8.17)$$

If we also note $R = V^{-1/2}\Sigma V^{-1/2}$, $R_k = V^{-1/2}\Sigma_k V^{-1/2}$, $\bar{x}_n^* = V^{-1/2}(\bar{x}_n - \bar{x})$, $\gamma_k = V^{-1/2}(\bar{x}_k - \bar{x})$ and

$$c_k = d_k\,\frac{|R|^{1/2}}{|R_k|^{1/2}}\exp -\frac{1}{2}\bar{x}_n^{*\prime}(R_k^{-1} - R^{-1})\bar{x}_n^*, \quad (8.18)$$

we get

$$\pi_k(\bar{x}_n) = c_k\exp -\frac{1}{2}(\bar{x}_n^* - \gamma_k)'R_k^{-1}(\bar{x}_n^* - \gamma_k)\exp \frac{1}{2}\bar{x}_n^{*\prime}R^{-1}\bar{x}_n^* \cdot \frac{|R|^{1/2}}{|R_k|^{1/2}}\Big/\frac{|R|^{1/2}}{|R_k|^{1/2}} = c_k\exp -\frac{1}{2}\gamma_k'R_k^{-1}(\gamma_k - 2\bar{x}_n^*). \quad (8.19)$$

By using a Taylor development for the vector $\gamma_k$ in (8.19), we get

$$\pi_k(\bar{x}_n) = c_k(1 - \gamma_k'R_k^{-1}\bar{x}_n^*) + R(\gamma_k), \quad (8.20)$$

where

$$R(\gamma_k) = \frac{c_k}{2}\exp\Big\{-\frac{1}{2}\tilde{\gamma}_k'R_k^{-1}(\tilde{\gamma}_k - 2\bar{x}_n^*)\Big\}\,\gamma_k'\big\{[R_k^{-1}(\tilde{\gamma}_k - \bar{x}_n^*)][R_k^{-1}(\tilde{\gamma}_k - \bar{x}_n^*)]' - R_k^{-1}\big\}\gamma_k$$

and $\tilde{\gamma}_k$ is a vector whose elements lie between the corresponding elements of $\gamma_k$ and 0. By hypothesis (ii), Lemma 1 and (8.18), we get

$$c_k = d_k[1 + O(n^{-1})]\exp\Big\{\frac{1}{2}\bar{x}_n^{*\prime}O_{J\times J}(n^{-1})\bar{x}_n^*\Big\} = d_k[1 - O_P(n^{-1})]. \quad (8.21)$$

By (8.20) and (8.21), we get

$$\pi_k(\bar{x}_n) = d_k[1 - O_P(n^{-1})]\{1 - \gamma_k'R_k^{-1}\bar{x}_n^* + O_P(n^{-1})\} = d_k\{1 - (\bar{x} - \bar{x}_n)'\Sigma^{-1}(\bar{x}_k - \bar{x}) + O_P(n^{-1})\}.$$

Finally, we get

$$\hat{y} = \sum_{k\in S}\frac{y_k}{\pi_k(\bar{x}_n)} = \sum_{k\in S}\frac{y_k}{d_k} + (\bar{x} - \bar{x}_n)'\Sigma^{-1}\sum_{k\in S}\frac{(\bar{x}_k - \bar{x})y_k}{d_k} + O_P(n^{-1}).$$

Proof of Result 2. In Särndal (1980), we see that the GREG-estimator presented in (6.15) can also be written

$$\hat{y}_{GREG} = \hat{y}_\pi + (1_N'W_N - 1_n'\Pi_n^{-1}W_n)(W_n'C_n^{-1}\Pi_n^{-1}W_n)^{-1}W_n'C_n^{-1}\Pi_n^{-1}y_n,$$

where $1_N$ (resp. $1_n$) is a column vector composed of $N$ (resp. $n$) ones, $\Pi_N$ (resp. $\Pi_n$) is a diagonal matrix having the inclusion probabilities of the population (resp. sample) units on its diagonal, $C_n$ is a diagonal matrix having the $c_k$ of the sample units on its diagonal, and $y_n$ is a column vector composed of the values taken by the interest variable $y$ in the sample.
By hypothesis (iii), we directly get $R(\tilde{\gamma}_k) = o(n^{-1})$. On the other hand, $W_N = (Z_N\;\,X_N)$, where $Z_N$ is the matrix of the stratum indicators and $X_N$ the matrix of the auxiliary variables, and $W_n$ is an $n \times (H + J)$ matrix composed of the $n$ rows of $W_N$ corresponding to the units selected in the sample. The matrix to invert can be partitioned into four parts:

$$W_n'C_n^{-1}\Pi_n^{-1}W_n = \begin{pmatrix} A & D \\ D' & B \end{pmatrix},$$

where $A$ is an $H \times H$ diagonal matrix having $N_h/c_h$, $h = 1, \ldots, H$, on its diagonal, $B = \sum_{k\in S} x_kx_k'/(\pi_kc_k)$, and $D = (N_1\bar{x}_1/c_1, \ldots, N_H\bar{x}_H/c_H)'$. By using the technique of matrix inversion by partition, we get

$$(W_n'C_n^{-1}\Pi_n^{-1}W_n)^{-1} = \begin{pmatrix} (A - DB^{-1}D')^{-1} & -A^{-1}DQ \\ -QD'A^{-1} & Q \end{pmatrix},$$

where $Q = (B - D'A^{-1}D)^{-1}$. Since

$$1_N'W_N - 1_n'\Pi_n^{-1}W_n = \big(0_H'\,,\; N(\bar{x} - \bar{x}_n)'\big),$$

where $0_H$ is a column vector composed of $H$ zeros, we get

$$(1_N'W_N - 1_n'\Pi_n^{-1}W_n)(W_n'C_n^{-1}\Pi_n^{-1}W_n)^{-1} = N(\bar{x} - \bar{x}_n)'Q\,[-D'A^{-1}\;\; I_{(J\times J)}], \quad (8.22)$$

where $I_{(J\times J)}$ is a $J \times J$ identity matrix and

$$[-D'A^{-1}\;\; I_{(J\times J)}] = [-\bar{x}_1 \cdots -\bar{x}_H\;\; I_{(J\times J)}].$$

Since

$$W_n'C_n^{-1}\Pi_n^{-1}y_n = \Big(\frac{1}{c_1}\sum_{k\in S_1}\frac{y_k}{\pi_k},\; \ldots,\; \frac{1}{c_H}\sum_{k\in S_H}\frac{y_k}{\pi_k},\; \sum_{k\in S}\frac{x_k'y_k}{\pi_kc_k}\Big)', \quad (8.23)$$

we get Result 2 by multiplication of (8.22) and (8.23).

REFERENCES

CASADY, R.J., and VALLIANT, R. (1993). Conditional properties of post-stratified estimators under normal theory. Survey Methodology, 19, 183-192.

CASSEL, C.-M., SÄRNDAL, C.-E., and WRETMAN, J.H. (1976). Some results on generalized difference estimation and generalized regression estimation for finite population. Biometrika, 63, 615-620.

DEVILLE, J.-C. (1992). Constrained samples, conditional inference, weighting: three aspects of the utilisation of auxiliary information. Proceedings of the Workshop: Auxiliary Information in Surveys, Örebro.

DEVILLE, J.-C., and SÄRNDAL, C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87, 376-382.

FULLER, W.A., and ISAKI, C.T. (1981). Survey design under superpopulation models. In Current Topics in Survey Sampling, (Eds. D. Krewski, R. Platek, J.N.K. Rao and M.P. Singh). New York: Academic Press, 196-226.

HÁJEK, J. (1971). Comment on an essay of D. Basu. In Foundations of Statistical Inference, (Eds. V.P. Godambe and D.A. Sprott). Toronto: Holt, Rinehart and Winston, 236.

HORVITZ, D.G., and THOMPSON, D.J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663-685.

MADOW, W.G. (1948). On the limiting distributions of estimates based on samples from finite universes. Annals of Mathematical Statistics, 19, 535-545.

MONTANARI, G.E. (1987). Post sampling efficient QR-prediction in large sample survey. International Statistical Review, 55, 191-202.

MONTANARI, G.E. (1997). On conditional properties of finite population mean estimators. Proceedings of the 51st Session of the International Statistical Institute, Contributed paper, 351-352.

MONTANARI, G.E. (1998). On regression estimation of finite population means. Survey Methodology, 24, 69-77.

RAO, J.N.K. (1994). Estimating totals and distribution functions using auxiliary information at the estimation stage. Journal of Official Statistics, 10, 153-165.

RAO, J.N.K. (1997). Developments in sample survey theory: an appraisal. Canadian Journal of Statistics, 25, 1-21.

ROSÉN, B. (1972). Asymptotic theory for successive sampling with varying probabilities without replacement. Annals of Mathematical Statistics, 43, 373-397.

SÄRNDAL, C.-E. (1980). On π-inverse weighting versus best linear unbiased weighting in probability sampling. Biometrika, 67, 639-650.

SÄRNDAL, C.-E. (1996). Efficient estimators with simple variance in unequal probability sampling. Journal of the American Statistical Association, 91, 1289-1300.

SÄRNDAL, C.-E., SWENSSON, B., and WRETMAN, J. (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.

TILLÉ, Y. (1994). Utilisation d'information auxiliaire en théorie des sondages sans référence à un modèle. Ph.D. Thesis, Université Libre de Bruxelles, Institut de Statistique.

TILLÉ, Y. (1998). Estimation in surveys using conditional inclusion probabilities: simple random sampling.
International Statistical Review, 66, 303-322.

WRIGHT, R.L. (1983). Finite population sampling with multivariate auxiliary information. Journal of the American Statistical Association, 78, 879-883.

Survey Methodology, June 1999, Vol. 25, No. 1, pp. 67-72, Statistics Canada

On Robust Small Area Estimation Using a Simple Random Effects Model

N.G.N. PRASAD and J.N.K. RAO

ABSTRACT

Robust small area estimation is studied under a simple random effects model consisting of a basic (or fixed effects) model and a linking model that treats the fixed effects as realizations of a random variable. Under this model a model-assisted estimator of a small area mean is obtained. This estimator depends on the survey weights and remains design-consistent. A model-based estimator of its mean squared error (MSE) is also obtained. Simulation results suggest that the proposed estimator and Kott's (1989) model-assisted estimator are equally efficient, and that the proposed MSE estimator is often much more stable than Kott's MSE estimator, even under moderate deviations of the linking model. The method is also extended to nested error regression models.

KEY WORDS: Design consistent; Linking model; Mean squared error; Survey weights.

1. INTRODUCTION

Unit-level random effects models are often used in small area estimation to obtain efficient model-based estimators of small area means. Such estimators typically do not make use of the survey weights (e.g., Ghosh and Meeden 1986; Battese, Harter and Fuller 1988; Prasad and Rao 1990). As a result, the estimators are not design-consistent unless the sampling design is self-weighting within areas. We refer the reader to Ghosh and Rao (1994) for an appraisal of small area estimation methods. Kott (1989) advocated the use of design-consistent model-based estimators (i.e., model-assisted estimators) because such estimators provide protection against model failure as the small area sample size increases.
He derived a design-consistent estimator of a small area mean under a simple random effects model. This model has two components: the basic (or fixed effects) model and the linking model. The basic model is given by

$$y_{ij} = \theta_i + e_{ij}, \quad j = 1, 2, \ldots, N_i;\; i = 1, 2, \ldots, m, \quad (1)$$

where the $y_{ij}$ are the population values and the $e_{ij}$ are uncorrelated random errors with mean zero and variance $\sigma_i^2$ for each small area $i\,(= 1, 2, \ldots, m)$. For simplicity, we take $\theta_i$ as the small area mean $\bar{Y}_i = \sum_j y_{ij}/N_i$, where $N_i$ is the number of population units in the $i$-th area. Note that $\bar{Y}_i = \theta_i + \bar{E}_i$ and $\bar{E}_i = \sum_j e_{ij}/N_i \approx 0$ if $N_i$ is large.

Assuming that the model (1) also holds for the sample $\{y_{ij},\ j = 1, 2, \ldots, n_i;\ i = 1, 2, \ldots, m\}$ and combining the sample model with the linking model, Kott (1989) obtained the familiar unit-level random effects model

$$y_{ij} = \mu + v_i + e_{ij}, \quad j = 1, 2, \ldots, n_i;\; i = 1, 2, \ldots, m, \quad (3)$$

also called the components-of-variance model. It is customary to assume equal variances $\sigma_i^2 = \sigma^2$, although the case of random error variances has also been studied (Kleffe and Rao 1992; Arora and Lahiri 1997).

Assuming $\sigma_i^2 = \sigma^2$, Kott (1989) derived an efficient estimator $\hat{\theta}_{iK}$ of $\theta_i$ which is both model-unbiased under (3) and design-consistent. He also proposed an estimator of its mean squared error (MSE) which is model-unbiased under the basic model (1) as well as design-consistent. But this MSE estimator can be quite unstable and can even take negative values, as noted by Kott (1989) in his empirical example. Kott (1989) used his MSE estimators mainly to compare the overall reduction in MSE from using $\hat{\theta}_{iK}$ in place of a direct design-based estimator $\bar{y}_{iw}$ given by (4) below. He remarked that more stable MSE estimators are needed. The main purpose of this paper is to obtain a pseudo empirical best linear unbiased prediction (EBLUP) estimator of $\theta_i$ which depends on the survey weights and is design-consistent (section 2). A stable model-based MSE estimator
is also obtained (section 3). Results of a simulation study in section 4 show that the proposed MSE estimator is often much more stable than the MSE estimator of Kott, as measured by their coefficients of variation, even under moderate deviations of the linking model (2). Results under the simple model (3) are also extended to a nested error regression model (section 5).

The linking model assumes that $\theta_i$ is a realization of a random variable satisfying the model

$$\theta_i = \mu + v_i, \quad (2)$$

where the $v_i$ are uncorrelated random variables with mean zero and variance $\sigma_v^2$. Further, $\{v_i\}$ and $\{e_{ij}\}$ are assumed to be uncorrelated.

N.G.N. Prasad, Department of Mathematical Sciences, University of Alberta, Edmonton, Alberta, T6G 2G1; J.N.K. Rao, Department of Mathematics and Statistics, Carleton University, Ottawa, Ontario, K1S 5B6.

2. PSEUDO-EBLUP ESTIMATOR

Suppose $w_{ij}$ denotes the basic design weight attached to the $j$-th sample unit ($j = 1, 2, \ldots, n_i$) in the $i$-th area ($i = 1, 2, \ldots, m$). A direct design-based estimator of $\theta_i$ is then given by the ratio estimator

$$\bar{y}_{iw} = \sum_j \tilde{w}_{ij}y_{ij}, \quad (4)$$

where $\tilde{w}_{ij} = w_{ij}/\sum_j w_{ij}$. The direct estimator $\bar{y}_{iw}$ is design-consistent but fails to borrow strength from the other areas. To get a more efficient estimator, we consider the following reduced model obtained from the combined model (3) with $\sigma_i^2 = \sigma^2$:

$$\bar{y}_{iw} = \mu + v_i + \bar{e}_{iw}, \quad (5)$$

where the $\bar{e}_{iw}$ are uncorrelated random variables with mean zero and variance $\delta_i^2 = \sigma^2\sum_j \tilde{w}_{ij}^2$. The reduced model (5) is an area-level model similar to the well-known Fay-Herriot model (Fay and Herriot 1979).

The estimator $\hat{\theta}_i$ obtained below in (8) will be referred to as the pseudo-EBLUP estimator. We use standard estimators of $\sigma_v^2$ and $\sigma^2$ based on the within-area sum of squares

$$Q_w = \sum_i\sum_j (y_{ij} - \bar{y}_i)^2$$

and the between-area sum of squares

$$Q_b = \sum_i n_i(\bar{y}_i - \bar{y})^2,$$

where $\bar{y} = \sum_i n_i\bar{y}_i/\sum_i n_i$ is the overall sample mean. We have $\hat{\sigma}^2 = Q_w/(\sum_i n_i - m)$ and $\hat{\sigma}_v^2 = \max(\tilde{\sigma}_v^2, 0)$, where

$$\tilde{\sigma}_v^2 = [Q_b - (m - 1)\hat{\sigma}^2]/n_*, \qquad n_* = \sum_i n_i - \sum_i n_i^2\Big/\sum_i n_i.$$
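A minimal numerical sketch of these variance-component estimators in the balanced case ($n_i = n$): the model parameters below are invented for illustration, and the data are generated from model (3).

```python
import numpy as np

rng = np.random.default_rng(2)

# Generate data from model (3): y_ij = mu + v_i + e_ij (assumed parameter values).
m, n = 30, 20
mu, sig_v, sig_e = 50.0, 2.0, 5.0
v = rng.normal(0.0, sig_v, m)
y = mu + v[:, None] + rng.normal(0.0, sig_e, (m, n))   # m x n, balanced

ybar_i = y.mean(axis=1)
ybar = y.mean()
n_tot = m * n

Q_w = ((y - ybar_i[:, None]) ** 2).sum()     # within-area sum of squares
Q_b = (n * (ybar_i - ybar) ** 2).sum()       # between-area sum of squares

sig2_hat = Q_w / (n_tot - m)                 # estimates sigma^2 = 25
n_star = n_tot - m * n**2 / n_tot            # = sum n_i - sum n_i^2 / sum n_i
sig2_v_hat = max((Q_b - (m - 1) * sig2_hat) / n_star, 0.0)   # estimates sigma_v^2 = 4

print(round(sig2_hat, 2), round(sig2_v_hat, 2))
```

Both estimates land close to the generating values ($\sigma^2 = 25$, $\sigma_v^2 = 4$), and the truncation at zero mirrors the definition $\hat{\sigma}_v^2 = \max(\tilde{\sigma}_v^2, 0)$.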
It now follows from standard best linear unbiased prediction (BLUP) theory (e.g., Prasad and Rao 1990) that the BLUP estimator of $\theta_i = \mu + v_i$ under the reduced model (5) is given by

$$\tilde{\theta}_i = \hat{\mu}_w + \tilde{v}_i \quad (6)$$

with

$$\hat{\mu}_w = \sum_i \gamma_{iw}\bar{y}_{iw}\Big/\sum_i \gamma_{iw}, \qquad \tilde{v}_i = \gamma_{iw}(\bar{y}_{iw} - \hat{\mu}_w)$$

and $\gamma_{iw} = \sigma_v^2/(\sigma_v^2 + \delta_i^2)$. Note that $\tilde{\theta}_i$ is different from the BLUP estimator under the full model (3). We therefore denote $\tilde{\theta}_i$ as a pseudo-BLUP estimator. The estimator (6) may also be written as a convex combination of the direct estimator $\bar{y}_{iw}$ and $\hat{\mu}_w$:

$$\tilde{\theta}_i = \gamma_{iw}\bar{y}_{iw} + (1 - \gamma_{iw})\hat{\mu}_w.$$

The estimator $\tilde{\theta}_i$ depends on the parameters $\sigma_v^2$ and $\sigma^2$, which are generally unknown in practice. We therefore replace $\sigma_v^2$ and $\sigma^2$ by model-consistent estimators $\hat{\sigma}_v^2$ and $\hat{\sigma}^2$ under the original unit-level model (3) to obtain the estimator

$$\hat{\theta}_i = \hat{\gamma}_{iw}\bar{y}_{iw} + (1 - \hat{\gamma}_{iw})\hat{\mu}_w. \quad (8)$$

It may be noted that $\sigma_v^2$ and $\sigma^2$ are either not estimable or poorly estimated from the reduced model (5) alone, due to identifiability problems. Following Kackar and Harville (1984), it can be shown that the pseudo-EBLUP estimator $\hat{\theta}_i$ is model-unbiased for $\theta_i$ under the original model (3) for symmetrically distributed errors $\{v_i\}$ and $\{e_{ij}\}$, not necessarily normal. It is also design-consistent, assuming that $n_i\sum_j \tilde{w}_{ij}^2$ is bounded as $n_i$ increases, because $\hat{\gamma}_{iw}$ converges in probability to 1 as $n_i \to \infty$ regardless of the validity of the model (3), assuming $\hat{\sigma}_v^2$ and $\hat{\sigma}^2$ converge in probability to some values, say, $\sigma_v^{*2}$ and $\sigma^{*2}$.

Kott's (1989) model-based estimator of $\theta_i$ is obtained by taking a weighted combination of $\bar{y}_{iw}$ and $\sum_{l\neq i} c_l^{(i)}\bar{y}_{lw}$, that is,

$$t_i(\alpha_i, c^{(i)}) = (1 - \alpha_i)\bar{y}_{iw} + \alpha_i\sum_{l\neq i} c_l^{(i)}\bar{y}_{lw}, \quad (7)$$

and then minimizing the model mean squared error (MSE) of $t_i(\alpha_i, c^{(i)})$ with respect to $\alpha_i$ and $c^{(i)}$, subject to the model-unbiasedness condition $\sum_{l\neq i} c_l^{(i)} = 1$. This leads to $\hat{\theta}_{iK} = t_i(\hat{\alpha}_i, \hat{a}^{(i)})$.
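The shrinkage form of the pseudo-EBLUP in equation (8) above can be illustrated numerically; the variance components, the overall mean estimate and the normalized weights below are assumed values, not taken from the paper.

```python
# Shrinkage form of the pseudo-EBLUP, equation (8): assumed values throughout.
sig2_v, sig2_e = 4.0, 25.0
mu_hat = 50.0                     # stand-in for the weighted overall mean mu_hat_w

def pseudo_eblup(ybar_iw, wtilde):
    """theta_i = gamma_iw * ybar_iw + (1 - gamma_iw) * mu_hat."""
    delta2 = sig2_e * sum(w * w for w in wtilde)  # delta_i^2 = sigma^2 * sum_j wtilde_ij^2
    gamma = sig2_v / (sig2_v + delta2)
    return gamma * ybar_iw + (1.0 - gamma) * mu_hat

# Self-weighting areas with n_i = 20 (wtilde_ij = 1/20) vs n_i = 5 (wtilde_ij = 1/5):
big = pseudo_eblup(56.0, [1 / 20] * 20)    # delta2 = 25/20 = 1.25
small = pseudo_eblup(56.0, [1 / 5] * 5)    # delta2 = 25/5  = 5.0

print(round(big, 2), round(small, 2))      # -> 54.57 52.67
```

The area with the smaller sample is shrunk further toward $\hat{\mu}_w$, exactly the borrowing of strength that the direct estimator (4) lacks.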
Here $\hat{\alpha}_i$ and the $\hat{a}_l^{(i)}$ are the minimizing values evaluated at $(\hat{\sigma}_v^2, \hat{\sigma}^2)$, with

$$\hat{a}_l^{(i)} = \big[(\hat{\delta}_l^2/\hat{\sigma}_v^2) + 1\big]^{-1}\Big/\sum_{h\neq i}\big[(\hat{\delta}_h^2/\hat{\sigma}_v^2) + 1\big]^{-1}. \quad (9)$$

The estimator $\hat{\theta}_{iK}$ is also model-unbiased and design-consistent. In a previous version of this paper, we proposed an estimator similar to (9). It uses the best estimator of $\mu$ under the unit-level model, based on the unweighted means $\bar{y}_i$, rather than $\hat{\mu}_w$, the best estimator of $\mu$ under the reduced model (5), based on the survey-weighted means $\bar{y}_{iw}$.

3. ESTIMATORS OF MSE

It is straightforward to derive the MSE of the pseudo-BLUP estimator $\tilde{\theta}_i$ under the unit-level model (3). We have

$$\mathrm{MSE}(\tilde{\theta}_i) = E(\tilde{\theta}_i - \theta_i)^2 = g_{1i}(\sigma_v^2, \sigma^2) + g_{2i}(\sigma_v^2, \sigma^2) \quad (10)$$

with

$$g_{1i}(\sigma_v^2, \sigma^2) = \gamma_{iw}\delta_i^2 \qquad \text{and} \qquad g_{2i}(\sigma_v^2, \sigma^2) = \sigma_v^2(1 - \gamma_{iw})^2\Big/\sum_l \gamma_{lw}.$$

The leading term, $g_{1i}(\sigma_v^2, \sigma^2)$, is of order $O(1)$, while the second term, $g_{2i}(\sigma_v^2, \sigma^2)$, due to estimation of $\mu$, is of order $O(m^{-1})$ for large $m$. A naive MSE estimator of the pseudo-EBLUP estimator $\hat{\theta}_i$ is obtained by estimating $\mathrm{MSE}(\tilde{\theta}_i)$ given by (10):

$$\mathrm{mse}_N(\hat{\theta}_i) = g_{1i}(\hat{\sigma}_v^2, \hat{\sigma}^2) + g_{2i}(\hat{\sigma}_v^2, \hat{\sigma}^2). \quad (11)$$

But (11) could lead to significant underestimation of $\mathrm{MSE}(\hat{\theta}_i)$ because it ignores the uncertainty associated with $\hat{\sigma}_v^2$ and $\hat{\sigma}^2$. Note that

$$\mathrm{MSE}(\hat{\theta}_i) = \mathrm{MSE}(\tilde{\theta}_i) + E(\hat{\theta}_i - \tilde{\theta}_i)^2 \quad (12)$$

under normality of the errors $\{v_i\}$ and $\{e_{ij}\}$, so that $\mathrm{MSE}(\tilde{\theta}_i)$ is always smaller than $\mathrm{MSE}(\hat{\theta}_i)$; see Kackar and Harville (1984). The variances and covariances of $\hat{\sigma}_v^2$ and $\hat{\sigma}^2$ are given in Appendix 1. It can be shown that $g_{1i}(\hat{\sigma}_v^2, \hat{\sigma}^2) + g_{3i}(\hat{\sigma}_v^2, \hat{\sigma}^2)$ is approximately unbiased for $g_{1i}(\sigma_v^2, \sigma^2)$, in the sense that its bias is of lower order than $m^{-1}$ (see Appendix 2). Similarly, $g_{2i}(\hat{\sigma}_v^2, \hat{\sigma}^2)$ and $g_{3i}(\hat{\sigma}_v^2, \hat{\sigma}^2)$ are approximately unbiased for $g_{2i}(\sigma_v^2, \sigma^2)$ and $g_{3i}(\sigma_v^2, \sigma^2)$, respectively. It now follows that an approximately model-unbiased estimator of $\mathrm{MSE}(\hat{\theta}_i)$ is given by

$$\mathrm{mse}(\hat{\theta}_i) = g_{1i}(\hat{\sigma}_v^2, \hat{\sigma}^2) + g_{2i}(\hat{\sigma}_v^2, \hat{\sigma}^2) + 2g_{3i}(\hat{\sigma}_v^2, \hat{\sigma}^2). \quad (15)$$

For the estimator $\hat{\theta}_{iK}$ given by (9), Kott (1989) proposed the MSE estimator given in (16) below. To get a "correct" estimator of $\mathrm{MSE}(\hat{\theta}_i)$, we first approximate the second-order term $E(\hat{\theta}_i - \tilde{\theta}_i)^2$ in (12) for large $m$, assuming that $\{v_i\}$ and $\{e_{ij}\}$ are normally distributed.
Following Prasad and Rao (1990), we have

$$E(\hat{\theta}_i - \tilde{\theta}_i)^2 \approx g_{3i}(\sigma_v^2, \sigma^2), \quad (13)$$

where the neglected terms are of lower order than $m^{-1}$, and

$$g_{3i}(\sigma_v^2, \sigma^2) = \gamma_{iw}(1 - \gamma_{iw})^2\sigma_v^{-2}\big\{V(\hat{\sigma}_v^2) - 2(\sigma_v^2/\sigma^2)\mathrm{Cov}(\hat{\sigma}_v^2, \hat{\sigma}^2) + (\sigma_v^2/\sigma^2)^2V(\hat{\sigma}^2)\big\}; \quad (14)$$

see Appendix 1. Kott's (1989) MSE estimator is

$$\mathrm{mse}(\hat{\theta}_{iK}) = (1 - 2\hat{d}_i)v^*(\bar{y}_{iw}) + \hat{d}_i^2\sum_{l\neq i}\big(\hat{a}_l^{(i)}\big)^2v^*(\bar{y}_{lw}), \quad (16)$$

where $v^*(\bar{y}_{iw})$ is both a design-consistent estimator of the design-MSE of $\bar{y}_{iw}$ and a model-unbiased estimator of the model-variance of $\bar{y}_{iw}$ under the basic model (1). Since $\hat{d}_i$ converges in probability to zero as $n_i \to \infty$, it follows from (16) that $\mathrm{mse}(\hat{\theta}_{iK})$ is also both design-consistent and model-unbiased assuming only the basic model (1). However, $\mathrm{mse}(\hat{\theta}_{iK})$ is unstable and can even take negative values when $\hat{d}_i$ exceeds 0.5, as noted by Kott (1989). Note that our MSE estimator $\mathrm{mse}(\hat{\theta}_i)$ is based on the full model (3), obtained by combining the basic model (1) with the linking model (2). However, our simulation results in section 4 show that it may perform well even under moderate deviations from the linking model.

4. SIMULATION STUDY

We conducted a limited simulation study to evaluate the performance of the proposed estimator $\hat{\theta}_i$, given by (8), and its estimator of MSE, given by (15), relative to Kott's estimator $\hat{\theta}_{iK}$, given by (9), and its estimator of MSE, given by (16). We studied the performances under two different approaches: (i) for each simulation run, a finite population of $m = 30$ small areas with $N_i = 200$ population units in each area is generated from the assumed unit-level model, and then a PPS (probability proportional to size) sample within each small area is drawn independently, using $n_i = 20$; (ii) a fixed finite population is first generated from the assumed unit-level model, and then for each simulation run a PPS sample within each small area is drawn independently, employing the fixed finite population.
Approach (i) refers to both the design and the linking model, whereas approach (ii) is design-based in the sense that it refers only to the design. The errors $\{v_i\}$ and $\{e_{ij}\}$ are assumed to be normally distributed in generating the finite populations $\{y_{ij},\ i = 1, 2, \ldots, 30;\ j = 1, 2, \ldots, 200\}$. We considered two cases: (1) the linking model (2) is true with $\mu = 50$; (2) the linking model is violated by letting $\mu$ vary across areas: $\mu_i = 50,\ i = 1, 2, \ldots, 10$; $\mu_i = 55,\ i = 11, 12, \ldots, 20$; $\mu_i = 60,\ i = 21, 22, \ldots, 30$.

To implement PPS sampling within each area, size measures $z_{ij}$ ($i = 1, 2, \ldots, 30$; $j = 1, 2, \ldots, 200$) were generated from an exponential distribution with mean 200. Using these $z$-values, we computed selection probabilities $p_{ij} = z_{ij}/\sum_j z_{ij}$ for each area $i$ and then used them to select PPS with-replacement samples of sizes $n_i = n$, taking $n = 20$, and the associated sample values $\{y_{ij}\}$ were observed. The basic design weights are given by $w_{ij} = n^{-1}p_{ij}^{-1}$, so that $\tilde{w}_{ij} = p_{ij}^{-1}/\sum_j p_{ij}^{-1}$. Using these weights and the associated sample values $y_{ij}$, we computed the estimates $\hat{\theta}_i$ and $\hat{\theta}_{iK}$ and associated estimates of MSE, and also the ratio estimate $\bar{y}_{iw}$, for each simulation run; the formula for $v^*(\bar{y}_{iw})$ under PPS sampling is given in Appendix 3. This process was repeated $R = 10{,}000$ times to get from each run $r\,(= 1, 2, \ldots, R)$ the estimates $\hat{\theta}_i(r)$ and $\hat{\theta}_{iK}(r)$, the associated MSE estimates $\mathrm{mse}_r(\hat{\theta}_i(r))$ and $\mathrm{mse}_r(\hat{\theta}_{iK}(r))$, and the direct estimate $\bar{y}_{iw}(r)$. Using these values, the empirical relative efficiencies (RE) of $\hat{\theta}_i$ and $\hat{\theta}_{iK}$ over $\bar{y}_{iw}$ were computed as

$$\mathrm{RE}(\hat{\theta}_i) = \mathrm{MSE}_\cdot(\bar{y}_{iw})/\mathrm{MSE}_\cdot(\hat{\theta}_i) \qquad \text{and} \qquad \mathrm{RE}(\hat{\theta}_{iK}) = \mathrm{MSE}_\cdot(\bar{y}_{iw})/\mathrm{MSE}_\cdot(\hat{\theta}_{iK}).$$

Table 1 reports summary measures of the values of percent RE, |RB| and CV for cases (1) and (2) under approach (i). Summary measures under approach (ii) are reported in Table 2. The summary measures considered are the mean and the median (med) over the small areas $i = 1, 2, \ldots, 30$.
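The within-area PPS-with-replacement sampling and weighting step described above can be sketched for a single area; the relation between $y$ and the size measure is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# One area: N_i = 200 population units, PPS-with-replacement sample of n = 20.
N_i, n = 200, 20
z = rng.exponential(200.0, N_i)              # size measures, exponential with mean 200
p = z / z.sum()                              # selection probabilities p_ij
y = 50.0 + 0.02 * z + rng.normal(0, 5, N_i)  # y loosely related to size (assumed)

s = rng.choice(N_i, size=n, replace=True, p=p)   # PPS with replacement
w = 1.0 / (n * p[s])                         # basic design weights w_ij = 1/(n p_ij)
w_tilde = w / w.sum()                        # normalized weights wtilde_ij

ybar_iw = (w_tilde * y[s]).sum()             # direct ratio estimator (4) of the area mean
print(round(ybar_iw, 1))
```

The normalized weights sum to one by construction, so $\bar{y}_{iw}$ is a convex combination of the sampled $y$-values.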
Here $\mathrm{MSE}_\cdot$ denotes the MSE over the $R = 10{,}000$ runs; for example, $\mathrm{MSE}_\cdot(\hat{\theta}_i) = \sum_r[\hat{\theta}_i(r) - \bar{Y}_i(r)]^2/R$, where $\bar{Y}_i(r)$ is the $i$-th area population mean for the $r$-th run. Note that $\bar{Y}_i(r)$ remains the same over the runs $r$ under the design-based approach, because the finite population is fixed over the simulation runs. Similarly, the relative biases of the MSE estimators were computed as

$$\mathrm{RB}[\mathrm{mse}(\hat{\theta}_i)] = [\mathrm{MSE}_\cdot(\hat{\theta}_i) - E_\cdot\,\mathrm{mse}(\hat{\theta}_i)]/\mathrm{MSE}_\cdot(\hat{\theta}_i)$$

and

$$\mathrm{RB}[\mathrm{mse}(\hat{\theta}_{iK})] = [\mathrm{MSE}_\cdot(\hat{\theta}_{iK}) - E_\cdot\,\mathrm{mse}(\hat{\theta}_{iK})]/\mathrm{MSE}_\cdot(\hat{\theta}_{iK}),$$

where $E_\cdot$ denotes the expectation over the $R = 10{,}000$ runs; for example, $E_\cdot\,\mathrm{mse}(\hat{\theta}_i) = \sum_r \mathrm{mse}(\hat{\theta}_i(r))/R$.

Table 1
Relative Efficiency (RE) of Estimators, Absolute Relative Bias (|RB|) and Coefficient of Variation (CV) of MSE Estimators (σ = 5.0, n = 20): Approach (i)

                  RE%                |RB|%                      CV%
 σ_v          θ̂_iK    θ̂_i     mse(θ̂_iK)  mse(θ̂_i)    mse(θ̂_iK)  mse(θ̂_i)
Case 1
 1   Mean      190     177       15.3        3.5          148         25
     Med       190     182       14.8        2.6          148         25
 2   Mean      126     123        5.1        3.2           48          8
     Med       127     124        5.6        2.9           48          8
 3   Mean      113     111        3.5        2.7           35          6
     Med       112     111        3.2        3.0           35          6
Case 2
 1   Mean      108     103       10.4        7.9           39          6
     Med       108     104       11.1        7.7           38          5
 2   Mean      108     104       13.3        8.9           39          6
     Med       108     104       13.6        7.9           37          6
 3   Mean      104     103       11.5        7.2           37          5
     Med       105     105       13.1        8.0           36          6

Case 1: μ_i = 50, i = 1, 2, ..., 30. Case 2: μ_i = 50, i = 1, 2, ..., 10; μ_i = 55, i = 11, 12, ..., 20; μ_i = 60, i = 21, 22, ..., 30.

It is clear from Tables 1 and 2 that $\hat{\theta}_{iK}$ and $\hat{\theta}_i$ perform similarly with respect to RE, which decreases as $\sigma_v^2/\sigma^2$ increases. Under approach (ii), RE is large for both Cases 1 and 2 when $\sigma_v^2/\sigma^2 \le 0.4$, whereas it decreases significantly under approach (i) if the linking model is violated (Case 2); the direct estimator $\bar{y}_{iw}$ is quite unstable under approach (ii). Turning to the performance of the MSE estimators under approach (i), Table 1 shows that |RB| of $\mathrm{mse}(\hat{\theta}_i)$ is negligible (<4%) when the linking model holds (Case 1) and that it is small (<10%) even when the linking model is violated, although it increases. The estimator $\mathrm{mse}(\hat{\theta}_{iK})$ has a larger |RB|, but it is less than 15%. The CV of $\mathrm{mse}(\hat{\theta}_i)$ is much smaller than the CV of $\mathrm{mse}(\hat{\theta}_{iK})$ for both Cases 1 and 2.
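The empirical |RB| and CV summaries can be computed directly from the simulation output; the stand-in arrays below replace the actual per-run errors and MSE estimates, which are not reproduced here.

```python
import numpy as np

# Empirical relative bias and CV of an MSE estimator over R simulation runs,
# computed as in section 4. The inputs are invented stand-ins: in the study,
# err would hold theta_hat(r) - Ybar(r) and mse_est the per-run mse values.
rng = np.random.default_rng(4)
R = 10_000
err = rng.normal(0.0, 1.0, R)          # stand-in estimation errors
mse_est = rng.gamma(4.0, 0.25, R)      # stand-in MSE estimates (mean 1)

MSE = (err ** 2).mean()                               # MSE_. over the R runs
RB = (MSE - mse_est.mean()) / MSE                     # relative bias of the MSE estimator
CV = np.sqrt(((mse_est - MSE) ** 2).mean()) / MSE     # coefficient of variation

print(round(abs(RB), 3), round(CV, 3))
```

With these stand-ins the estimator is unbiased by construction (RB near zero) and its CV reflects the spread of the per-run MSE estimates around the empirical MSE.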
Finally, the empirical coefficients of variation (CV) of the MSE estimators were computed as

$$\mathrm{CV}[\mathrm{mse}(\hat{\theta}_i)] = \{\mathrm{MSE}_\cdot[\mathrm{mse}(\hat{\theta}_i)]\}^{1/2}/\mathrm{MSE}_\cdot(\hat{\theta}_i)$$

and

$$\mathrm{CV}[\mathrm{mse}(\hat{\theta}_{iK})] = \{\mathrm{MSE}_\cdot[\mathrm{mse}(\hat{\theta}_{iK})]\}^{1/2}/\mathrm{MSE}_\cdot(\hat{\theta}_{iK}),$$

where $\mathrm{MSE}_\cdot[\mathrm{mse}(\hat{\theta}_i)] = \sum_r[\mathrm{mse}(\hat{\theta}_i(r)) - \mathrm{MSE}_\cdot(\hat{\theta}_i)]^2/R$, with a similar expression for $\mathrm{MSE}_\cdot[\mathrm{mse}(\hat{\theta}_{iK})]$.

For example, when the model holds (Case 1), the median CV is 25% for $\mathrm{mse}(\hat{\theta}_i)$ compared to 148% for $\mathrm{mse}(\hat{\theta}_{iK})$ when $\sigma_v = 1$; the median CV decreases to 8% for $\mathrm{mse}(\hat{\theta}_i)$ compared to 48% for $\mathrm{mse}(\hat{\theta}_{iK})$ when $\sigma_v = 2$. This pattern is retained when the model is violated (Case 2). It may be noted that the probability of $\mathrm{mse}(\hat{\theta}_{iK})$ taking a negative value is quite large (>0.3) when $\sigma_v^2/\sigma^2 \le 0.4$.

Under approach (ii), Table 2 shows that |RB| of $\mathrm{mse}(\hat{\theta}_i)$ is larger than the value under approach (i) and ranges from 15% to 25%. On the other hand, |RB| of $\mathrm{mse}(\hat{\theta}_{iK})$ is smaller and ranges from 4% to 15%. The CV of $\mathrm{mse}(\hat{\theta}_{iK})$, however, is much larger than under approach (i). For example, the median CV for Case 1 is 295% compared to 38% for $\mathrm{mse}(\hat{\theta}_i)$ when $\sigma_v = 1$, which decreases to 122% compared to 23% when $\sigma_v = 2$. A similar pattern holds for Case 2, where the fixed finite population is generated from the model with varying means.

For the nested error regression model of section 5, the pseudo-EBLUP of $\theta_i = \bar{X}_i'\beta + v_i$ is given by

$$\hat{\theta}_i = \hat{\gamma}_{iw}\bar{y}_{iw} + (1 - \hat{\gamma}_{iw})\bar{X}_i'\hat{\beta}_w, \quad (19)$$

where

$$\hat{\beta}_w = \Big[\sum_i \gamma_{iw}\bar{x}_{iw}\bar{x}_{iw}'\Big]^{-1}\Big[\sum_i \gamma_{iw}\bar{x}_{iw}\bar{y}_{iw}\Big].$$

Table 2
Relative Efficiency (RE) of Estimators, Absolute Relative Bias (|RB|) and Coefficient of Variation (CV) of MSE Estimators (σ = 5.0, n = 20): Approach (ii)
                  RE%                |RB|%                      CV%
 σ_v          θ̂_iK    θ̂_i     mse(θ̂_iK)  mse(θ̂_i)    mse(θ̂_iK)  mse(θ̂_i)
Case 1
 1   Mean      283     281       14.2       25.4          289         39
     Med       275     279       15.0       24.7          295         38
 2   Mean      180     182        7.3       19.2          115         24
     Med       177     181        6.9       18.7          122         23
 3   Mean      129     129        4.8       13.9           68         24
     Med       129     128        4.2       14.8           65         24
Case 2
 1   Mean      278     276       15.7       26.8          291         41
     Med       271     275       16.6       26.2          297         40
 2   Mean      175     177        8.8       20.7          117         26
     Med       173     177        8.5       20.3          124         25
 3   Mean      124     124        6.3       16.2           70         25
     Med       125     124        6.8       15.5           67         26

Case 1: μ_i = 50, i = 1, 2, ..., 30. Case 2: μ_i = 50, i = 1, 2, ..., 10; μ_i = 55, i = 11, 12, ..., 20; μ_i = 60, i = 21, 22, ..., 30.

To reduce |RB| of $\mathrm{mse}(\hat{\theta}_i)$ under approach (ii), one could combine it with $\mathrm{mse}(\hat{\theta}_{iK})$ by taking a weighted average, but it appears difficult to choose the appropriate weights. The weighted average will be more stable than $\mathrm{mse}(\hat{\theta}_{iK})$.

5. NESTED ERROR REGRESSION MODEL

The results in sections 2 and 3 can be extended to nested error regression models

$$y_{ij} = x_{ij}'\beta + v_i + e_{ij}, \quad j = 1, 2, \ldots, n_i;\; i = 1, 2, \ldots, m, \quad (17)$$

where $x_{ij}$ is a $p$-vector of auxiliary variables with known population mean $\bar{X}_i$ and related to $y_{ij}$, and $\beta$ is the $p$-vector of regression coefficients. The reduced model is given by

$$\bar{y}_{iw} = \bar{x}_{iw}'\beta + v_i + \bar{e}_{iw}, \quad (18)$$

with $\bar{x}_{iw}' = \sum_j \tilde{w}_{ij}x_{ij}'$. Model-consistent estimates $\hat{\sigma}_v^2$ and $\hat{\sigma}^2$ are obtained from the unit-level model (17), employing either the method of fitting constants (Prasad and Rao 1990) or REML (restricted maximum likelihood) estimation (Datta and Lahiri 1997).

An approximate model-unbiased estimator of $\mathrm{MSE}(\hat{\theta}_i)$ is given by (15) with

$$g_{1i}(\sigma_v^2, \sigma^2) = (1 - \gamma_{iw})\sigma_v^2$$

as before,

$$g_{2i}(\sigma_v^2, \sigma^2) = \sigma_v^2(\bar{X}_i - \gamma_{iw}\bar{x}_{iw})'\Big(\sum_l \gamma_{lw}\bar{x}_{lw}\bar{x}_{lw}'\Big)^{-1}(\bar{X}_i - \gamma_{iw}\bar{x}_{iw}),$$

and $g_{3i}(\hat{\sigma}_v^2, \hat{\sigma}^2)$, obtained from (14), involves the estimated variances and covariances of $\hat{\sigma}_v^2$ and $\hat{\sigma}^2$. The latter can be obtained from Prasad and Rao (1990) for the method of fitting constants and from Datta and Lahiri (1997) for REML.

6. CONCLUSION

We have proposed a model-assisted estimator of a small area mean under a simple unit-level random effects model. This estimator depends on the survey weights and is design-consistent. We have also obtained a model-based MSE estimator. Results of our simulation study have shown that the proposed MSE estimator performs well, even under moderate deviations of the linking model. The proposed approach is also extended to a nested error regression model.

ACKNOWLEDGEMENTS
This work was supported by research grants from the Natural Sciences and Engineering Research Council of Canada. We are thankful to the Associate Editor and the referee for constructive comments and suggestions.

APPENDIX 1

Proof of (13): From general results (Prasad and Rao 1990) we have

$$E(\hat{\theta}_i - \tilde{\theta}_i)^2 \approx \mathrm{tr}\big[A_i(\sigma_v^2, \sigma^2)B_i(\sigma_v^2, \sigma^2)\big],$$

where $B_i(\sigma_v^2, \sigma^2)$ is the $2 \times 2$ covariance matrix of $\hat{\sigma}_v^2$ and $\hat{\sigma}^2$, and $A_i(\sigma_v^2, \sigma^2)$ is the $2 \times 2$ covariance matrix of $(\partial\tilde{\theta}_i/\partial\sigma_v^2,\; \partial\tilde{\theta}_i/\partial\sigma^2)'$. Now, noting that

$$\frac{\partial\tilde{\theta}_i}{\partial\sigma_v^2} = (\bar{y}_{iw} - \hat{\mu}_w)\frac{\gamma_{iw}(1 - \gamma_{iw})}{\sigma_v^2}, \qquad \frac{\partial\tilde{\theta}_i}{\partial\sigma^2} = -(\bar{y}_{iw} - \hat{\mu}_w)\frac{\gamma_{iw}(1 - \gamma_{iw})}{\sigma^2}$$

and $V(\bar{y}_{iw}) = \sigma_v^2 + \delta_i^2 = \sigma_v^2/\gamma_{iw}$, we get

$$A_i(\sigma_v^2, \sigma^2) = \frac{\gamma_{iw}(1 - \gamma_{iw})^2}{\sigma_v^2}\begin{pmatrix} 1 & -\sigma_v^2/\sigma^2 \\ -\sigma_v^2/\sigma^2 & (\sigma_v^2/\sigma^2)^2 \end{pmatrix},$$

and hence the result (14).

Covariance matrix of $\hat{\sigma}_v^2$ and $\hat{\sigma}^2$: Under normality, we have

$$V(\hat{\sigma}^2) = 2\sigma^4\Big/\Big(\sum_i n_i - m\Big),$$

$$V(\hat{\sigma}_v^2) = 2n_*^{-2}\Big\{\sigma^4(m - 1)\Big(\sum_i n_i - 1\Big)\Big(\sum_i n_i - m\Big)^{-1} + 2n_*\sigma^2\sigma_v^2 + n_{**}\sigma_v^4\Big\}$$

and

$$\mathrm{Cov}(\hat{\sigma}^2, \hat{\sigma}_v^2) = -(m - 1)n_*^{-1}V(\hat{\sigma}^2),$$

where $n_{**} = \sum_i n_i^2 - 2\sum_i n_i^3\big/\sum_i n_i + \big(\sum_i n_i^2\big)^2\big/\big(\sum_i n_i\big)^2$; see Searle, Casella and McCulloch (1992, p. 428).

APPENDIX 2

Proof of $E[g_{1i}(\hat{\sigma}_v^2, \hat{\sigma}^2) + g_{3i}(\hat{\sigma}_v^2, \hat{\sigma}^2)] \approx g_{1i}(\sigma_v^2, \sigma^2)$: By a Taylor expansion of $g_{1i}(\hat{\sigma}_v^2, \hat{\sigma}^2)$ around $(\sigma_v^2, \sigma^2)$ to second order, and noting that $E(\hat{\sigma}^2 - \sigma^2) = 0$ and $E(\hat{\sigma}_v^2 - \sigma_v^2) = 0$, we get

$$E\big[g_{1i}(\hat{\sigma}_v^2, \hat{\sigma}^2) - g_{1i}(\sigma_v^2, \sigma^2)\big] \approx \frac{1}{2}\mathrm{tr}\big[D_i(\sigma_v^2, \sigma^2)B_i(\sigma_v^2, \sigma^2)\big],$$

where $D_i(\sigma_v^2, \sigma^2)$ is the $2 \times 2$ matrix of second-order derivatives of $g_{1i}(\sigma_v^2, \sigma^2)$ with respect to $\sigma_v^2$ and $\sigma^2$. It is easy to verify that

$$\frac{1}{2}\mathrm{tr}\big[D_i(\sigma_v^2, \sigma^2)B_i(\sigma_v^2, \sigma^2)\big] = -g_{3i}(\sigma_v^2, \sigma^2).$$

Now, noting that $E[g_{3i}(\hat{\sigma}_v^2, \hat{\sigma}^2)] \approx g_{3i}(\sigma_v^2, \sigma^2)$, we get the desired result.

APPENDIX 3

The design-based estimator of variance of $\bar{y}_{iw}$ under PPS sampling is given by

$$v(\bar{y}_{iw}) = \sum_j \tilde{w}_{ij}^2(y_{ij} - \bar{y}_{iw})^2.$$

Kott's (1989) model-assisted variance estimator is

$$v^*(\bar{y}_{iw}) = \{V(\bar{y}_{iw})/E\,v(\bar{y}_{iw})\}\,v(\bar{y}_{iw}),$$

with

$$V(\bar{y}_{iw}) = \sigma^2\sum_j \tilde{w}_{ij}^2 \qquad \text{and} \qquad E\,v(\bar{y}_{iw}) = \sigma^2\sum_j \tilde{w}_{ij}^2\Big[1 - 2\tilde{w}_{ij} + \sum_l \tilde{w}_{il}^2\Big],$$

where $E$ and $V$ denote expectation and variance with respect to the basic model (1).

REFERENCES

ARORA, V., and LAHIRI, P. (1997). On the superiority of the Bayesian method over the BLUP in small area estimation problems. Statistica Sinica, 7, 1053-1063.

BATTESE, G.E., HARTER, R., and FULLER, W.A. (1988). An error component model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association, 83, 28-36.

DATTA, G.S., and LAHIRI, P. (1997). A Unified Measure of Uncertainty of Estimated Best Linear Unbiased Predictor in Small-Area Estimation Problems. Technical Report, University of Nebraska-Lincoln.

FAY, R.E., and HERRIOT, R.A. (1979). Estimates of income for small places: an application of James-Stein procedures to census data. Journal of the American Statistical Association, 74, 269-277.

GHOSH, M., and MEEDEN, G. (1986). Empirical Bayes estimation in finite population sampling. Journal of the American Statistical Association, 81, 1058-1069.

GHOSH, M., and RAO, J.N.K. (1994). Small area estimation: an appraisal. Statistical Science, 9, 55-93.

KACKAR, R.N., and HARVILLE, D.A. (1984). Approximations for standard errors of estimators for fixed and random effects models. Journal of the American Statistical Association, 79, 853-862.

KLEFFE, J., and RAO, J.N.K. (1992). Estimation of mean square error of empirical best linear unbiased predictors under a random error variance linear model. Journal of Multivariate Analysis, 43, 1-15.

KOTT, P. (1989). Robust small domain estimation using random effects modelling. Survey Methodology, 15, 3-12.

PRASAD, N.G.N., and RAO, J.N.K. (1990). The estimation of mean squared errors of small-area estimators. Journal of the American Statistical Association, 85, 163-171.

SEARLE, S.R., CASELLA, G., and McCULLOCH, C.E. (1992). Variance Components. New York: John Wiley and Sons.

Survey Methodology, June 1999, Vol. 25, No. 1, pp.
73-80, Statistics Canada

Small Area Estimation Using Multilevel Models

FERNANDO A.S. MOURA and DAVID HOLT

ABSTRACT

In this paper a general multilevel model framework is used to provide estimates for small areas using survey data. This class of models allows for variation between areas because of: (i) differences in the distributions of unit-level variables between areas; (ii) differences in the distributions of area-level variables between areas; and (iii) area-specific components of variance which make provision for additional local variation that cannot be explained by unit-level or area-level covariates. Small area estimators are derived for this multilevel model formulation, and an approximation to the mean square error (MSE) of each small area estimate for this general class of mixed models is provided, together with an estimator of this MSE. Both the approximation to the MSE and the estimator of MSE take into account three sources of variation: (i) the prediction MSE assuming that both the fixed and components-of-variance terms in the multilevel model are known; (ii) the additional component due to the fact that the fixed coefficients must be estimated; and (iii) the further component due to the fact that the components of variance in the model must be estimated. The proposed methods are evaluated using a large data set as a basis for numerical investigation. The results confirm that the extra components of variance contained in multilevel models, as well as small area covariates, can improve small area estimates, and that the MSE approximation and estimator are satisfactory.

KEY WORDS: Small area estimation; Mixed models; Multilevel models; EBLUE.

1. INTRODUCTION

The need for small area (and small domain) estimates from survey data has long been recognized.
The difficulty with the production of such estimates is that for most, if not all, small areas, the sample size achieved by a survey designed for national purposes is too small for direct estimates to be made with acceptable precision. Early attempts to tackle this problem using methods such as synthetic estimation (Gonzalez 1973) involved the use of auxiliary information and the pooling of information across small areas. An excellent review and bibliography are given by Ghosh and Rao (1994). Empirical studies show that such methods made too little provision for local variation, and consequently the resulting small area estimates were shrunk too far towards a predicted mean. More recent approaches (e.g., Battese and Fuller 1981 and Battese, Harter and Fuller 1988) use some components of variance model, or equivalent, to provide for local variation. Empirical studies show the superiority of this approach (e.g., Prasad and Rao 1990). This paper proposes a general multilevel model framework for small area estimation. This involves the potential to use auxiliary information at both the unit and small area level. In addition, any of the regression parameters, rather than just the intercept as proposed by Battese and Fuller (1981), may be treated as varying randomly between small areas. The local variation is provided for by using differences between the means of unit level auxiliary variables, the small area level variables, and the various components of variance which allow variation between areas. For this general model, the small area predictor is obtained. In addition, an approximation to the mean square error (MSE) of each separate small area prediction and an estimator of this MSE are developed. The numerical study, based on a large data set from Brazil, shows that such models may be useful for predicting small area estimates.
The robustness of the approach to misspecification of the variance-covariance matrix of the small area random effects and to misspecification of small area covariates is also investigated. Further numerical results demonstrate the success of the MSE approximation and its estimator.

2. THE MULTILEVEL MODEL FRAMEWORK

2.1 Introduction

We consider the following multilevel model for predicting the small area means:

$Y_i = X_i \beta_i + e_i, \qquad \beta_i = Z_i \gamma + v_i, \qquad i = 1, \dots, m \qquad (2.1)$

where $Y_i$ is the vector of length $n_i$ for the characteristic of interest for the sample units in the $i$-th small area; $X_i$ is the matrix of explanatory variables at sample unit level; $Z_i$ is the design matrix of small area variables; $\gamma$ is the vector of length $q$ of fixed coefficients; and $v_i = (v_{i0}, \dots, v_{ip})^T$ is the vector of length $(p+1)$ of random effects for the $i$-th small area. We assume the following about the distribution of the random vectors: (a) the $v_i$ are independent between small areas and have a joint distribution within each small area with $E(v_i) = 0$ and $V(v_i) = \Omega$; (b) the $e_i$'s and $v_i$'s are independent and $V(e_i) = \sigma^2 I$. For the whole population, (2.1) applies with $n_i$ replaced by $N_i$, the small area population sizes. The set of $m$ equations in (2.1) can be concisely written by stacking them as

$Y = XZ\gamma + Xv + E. \qquad (2.2)$

It is worth noting that the random intercept model (see Section 2.3) can be regarded as a special case of the model (2.1) in which $Z_i$ is equal to the identity matrix for each small area and $\Omega$ has all terms constrained to be zero except the one corresponding to the variance of the intercept term. Other intermediate models exist, for instance when $\Omega$ is diagonal, so that the small area regression coefficients are random but uncorrelated between covariates.

Holt (in Ghosh and Rao 1994, page 82) observes that the advantage of the model (2.1) over other competitors is that it effectively integrates the use of unit level and area level covariates into a single model. Besides, the use of extra random effects for the regression coefficients gives greater flexibility in situations where it is not appropriate to assume that the same slope coefficients apply for all small areas.

[Fernando A.S. Moura, Instituto de Matemática, UFRJ, Rio de Janeiro, Brazil, CP: 68530, CEP: 21941-590, e-mail: fmoura@dme.ufrj.br; David Holt, Office for National Statistics, 1 Drummond Gate, London, SW1P 2QQ, e-mail: tholt@ons.go.uk.]

2.2 Fixed and Component of Variance Parameter Estimates

The fixed and components of variance parameters in the model (2.1) are $\gamma$ and $\theta = ([\mathrm{vech}(\Omega)]^T, \sigma^2)^T$ respectively. Various methods for estimating these model parameters in the case of a general mixed linear model are available. Most of them, based on iterative algorithms, lead to the maximum likelihood estimator (MLE) or the restricted maximum likelihood estimator (RMLE) under certain regularity conditions. Goldstein (1986) shows how consistent estimators can be obtained by applying iterative generalised least squares (IGLS) procedures; he also proved their equivalence to maximum likelihood estimation under normality. Later, Goldstein (1989) proposed a slight modification of his algorithm, restricted iterative generalised least squares (RIGLS), which is equivalent to RMLE under normality. Unlike the IGLS estimates, the RIGLS estimation procedure provides unbiased estimates of the component of variance parameters by taking into account the loss in degrees of freedom resulting from estimating the fixed parameters. This work is confined to the RIGLS approach as in Goldstein (1989). The RIGLS procedure is described in detail in Appendix A.

2.3 The Estimator of the Small Area Mean

Assuming the model (2.1) and considering that the population size $N_i$ in the $i$-th small area is large, we can write the mean for the $i$-th small area as

$\mu_i = \bar X_i^T Z_i \gamma + \bar X_i^T v_i \qquad (2.3)$

where $\bar X_i$ is the $(p+1)$ population mean vector for the $i$-th small area. An estimator of $\mu_i$ may be obtained by plugging the RIGLS estimators of $\gamma$ and $\theta$ into the respective terms of equation (2.3), where the predictor of the $i$-th small area random effect $v_i$ is given by

$\hat v_i = \hat\Omega X_i^T \hat V_i^{-1} (Y_i - X_i Z_i \hat\gamma), \qquad \hat V_i^{-1} = \hat\sigma^{-2}\big[I - \hat\sigma^{-2} X_i \hat\Omega \hat G_i^{-1} X_i^T\big], \qquad \hat G_i = I + \hat\sigma^{-2} X_i^T X_i \hat\Omega.$

This estimator of $\mu_i$ is known as the Empirical Best Linear Unbiased Estimator (EBLUE):

$\hat\mu_i = \bar X_i^T Z_i \hat\gamma + \bar X_i^T \hat v_i. \qquad (2.4)$

Battese et al. (1981, 1988) propose and apply a random intercept model to provide small area estimates. In this case, the Empirical Best Linear Unbiased Estimator is $\hat\mu_{i(\mathrm{RI})} = \bar X_i^T \hat\beta + \hat v_{i0}$. We use the label (RI) to imply a random intercept model, since only the intercept of each small area is random while the other components of $\beta$ remain fixed.

2.4 Approximation to the Mean Square Error (MSE)

Kackar and Harville (1984) show that, if $\hat\theta$ is a translation invariant estimator of $\theta$ and the random terms are normally distributed, the mean square error of a predictor of a linear combination of a fixed and random effect can be decomposed into two terms: the first is due to the variability in estimating the fixed parameters when the components of variance are known; the second comes from estimating the components of variance. Since under normality the RIGLS estimator is equivalent to the RMLE, and the RMLE is translation invariant, Kackar and Harville's (1984) results can be applied to the small area means estimators $\hat\mu_i$, $i = 1, \dots, m$:

$\mathrm{MSE}(\hat\mu_i) = E[\hat\mu_i - \mu_i]^2 = E[\tilde\mu_i - \mu_i]^2 + E[\hat\mu_i - \tilde\mu_i]^2 \qquad (2.5)$

where $\tilde\mu_i$ is the BLUE of $\mu_i$. The first term of (2.5), that is $\mathrm{MSE}(\tilde\mu_i)$, can be obtained by direct calculation as

$\mathrm{MSE}(\tilde\mu_i) = \bar X_i^T (G_i^{-1})^T \Omega \bar X_i + \sigma^2\, \bar X_i^T (G_i^{-1})^T Z_i \Big[\sum_{i=1}^m Z_i^T G_i^{-1} X_i^T X_i Z_i\Big]^{-1} Z_i^T G_i^{-1} \bar X_i \qquad (2.6)$

where $G_i = I + \sigma^{-2} X_i^T X_i \Omega$. Kackar and Harville (1984) point out that the second term of (2.5) is not tractable, except for special cases, and propose an approximation to it. Prasad and Rao (1990) propose an approximation to this second term and work out the details of their approximation for three particular cases: the random intercept model, the random regression coefficient model and the Fay-Herriot model. They also give some regularity conditions for their approximation to be of the second order, and prove that their MSE approximation for the Fay-Herriot model is of the second order. Nevertheless, it seems to be more difficult to give general conditions for more complex models such as model (2.1). Applying Prasad and Rao's approach, an approximation to the second term of (2.5) is developed in Appendix B. It is worth noting that the resulting MSE approximation of $\hat\mu_i$ can be decomposed into three terms:

$\mathrm{MSE}(\hat\mu_i) \approx T_1 + T_2 + T_3 \qquad (2.7)$

where $T_1$ and $T_2$ are respectively the first and the second term of equation (2.6) and $T_3$ is described in Appendix B. The term $T_1$ is the variability of $\tilde\mu_i$ when all parameters are known, the second term $T_2$ is due to estimating the fixed effects, and the third term $T_3$ comes from estimating the components of variance.

When sampling fractions are not negligible, estimators of the small area means can be built in the spirit of the finite population approach by predicting specifically for the non-sampled units:

$\hat\mu_i^F = f_i \bar y_i + (\bar X_i - f_i \bar x_i)^T (Z_i \hat\gamma + \hat v_i) \qquad (2.8)$

where the superscript $F$ indicates that a correction for the finite population sampling fraction $f_i$ was used, and $\bar x_i$ is the $(p+1)$ vector of sample means. The $\mathrm{MSE}(\hat\mu_i^F)$ can be obtained by noting that

$\hat\mu_i^F - \mu_i^F = (1 - f_i)\big[\bar X_i^{F\,T}\big(Z_i(\hat\gamma - \gamma) + \hat v_i - v_i\big) - \bar e_i^F\big]$

where $\bar X_i^F = (1 - f_i)^{-1}(\bar X_i - f_i \bar x_i)$ and $\bar e_i^F$ is the mean of the $e_{ij}$ for the non-sampled units in the $i$-th small area. Therefore

$\mathrm{MSE}(\hat\mu_i^F) = (1 - f_i)^2\, \mathrm{MSE}^*(\hat\mu_i) + N_i^{-1}(1 - f_i)\,\sigma^2 \qquad (2.9)$

where $\mathrm{MSE}^*(\hat\mu_i)$ is the expression (2.7) with $\bar X_i$ replaced by $\bar X_i^F$.

2.5 Estimation of Mean Square Error

It is common practice to estimate the MSE of a linear combination of the fixed and random effects in a mixed model such as (2.1) by substituting estimates of the components of variance into the expression for the MSE. This estimator ignores the contribution to the MSE due to estimating the components of variance parameters. Several studies (see for example Singh, Stukel and Pfeffermann 1998 or Harville and Jeske 1992) argue that this procedure tends to underestimate the MSE. Prasad and Rao (1990) reported a simulation study which showed that the use of this "naive" estimator leads to severe downwards bias. They also showed for the Fay-Herriot model (a special case of the model (2.1)), using "truncated Henderson" estimates of the variance components, that

$E(\hat T_1) = T_1 - T_3 + o(m^{-1}); \quad E(\hat T_2) = T_2 + o(m^{-1}); \quad E(\hat T_3) = T_3 + o(m^{-1}).$

Harville and Jeske (1992) establish some conditions for the unbiasedness of Prasad and Rao's mean square error estimator. However, considering the more general model (2.1), again it seems more difficult to give general conditions under which the order of the bias of Prasad and Rao's estimator is $o(m^{-1})$, especially if iterative procedures such as RIGLS are used to obtain the parameter estimates. Nevertheless, motivated by the simulation study summarised in Section 3.4 and an extensive simulation study described in Moura (1994), we propose to use an estimator similar to Prasad and Rao's for $\mathrm{MSE}(\hat\mu_i)$:

$\widehat{\mathrm{MSE}}(\hat\mu_i) = \hat T_1 + \hat T_2 + 2\hat T_3 \qquad (2.10)$

where the $\hat T_j$ are obtained from (2.7) by replacing $\sigma^2$ and $\Omega$ by their respective RIGLS estimators. From equation (2.9) we can also obtain an estimator of $\mathrm{MSE}(\hat\mu_i^F)$ as follows:

$\widehat{\mathrm{MSE}}(\hat\mu_i^F) = (1 - f_i)^2\, \widehat{\mathrm{MSE}}{}^*(\hat\mu_i) + N_i^{-1}(1 - f_i)\,\hat\sigma^2 \qquad (2.11)$

where $\widehat{\mathrm{MSE}}{}^*(\hat\mu_i)$ is the expression (2.10) with $\bar X_i$ replaced by $\bar X_i^F$.

3. A MODEL-BASED NUMERICAL INVESTIGATION

3.1 Comparison of the Estimators

In order to investigate the properties of alternative estimators, data were used on 38,740 households in the enumeration districts of one county in Brazil. The Head of Household's income was treated as the dependent variable.
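The plug-in predictor of Section 2.3 can be sketched numerically. The following Python fragment is a minimal illustration with simulated inputs, not the authors' code; the dimensions, parameter values and the use of a sample mean in place of the population mean vector are all invented for the example. It first checks the inverse form $\hat V_i^{-1} = \hat\sigma^{-2}[I - \hat\sigma^{-2} X_i \hat\Omega \hat G_i^{-1} X_i^T]$ against a direct inversion of $V_i = \sigma^2 I + X_i \Omega X_i^T$, then forms the predictor $\hat v_i$ and the EBLUE of the area mean:

```python
import numpy as np

rng = np.random.default_rng(42)
n_i, p1 = 8, 3                      # n_i units in area i; p + 1 = 3 coefficients
X = rng.normal(size=(n_i, p1))      # unit-level explanatory variables X_i
Omega = np.diag([1.4, 0.2, 0.9])    # hypothetical variance components Omega
sigma2 = 4.0                        # hypothetical unit-level variance sigma^2

# V_i = sigma^2 I + X_i Omega X_i', inverted directly and via
# G_i = I + sigma^-2 X_i' X_i Omega (the form used in the paper)
V = sigma2 * np.eye(n_i) + X @ Omega @ X.T
G = np.eye(p1) + (X.T @ X @ Omega) / sigma2
V_inv = (np.eye(n_i) - X @ Omega @ np.linalg.inv(G) @ X.T / sigma2) / sigma2
assert np.allclose(V_inv, np.linalg.inv(V))   # the two inverses agree

# predictor of the area random effect and the plug-in estimate of the area mean
Z = np.eye(p1)                       # Z_i = I gives the random-coefficient case
gamma = np.array([8.5, 1.2, 2.6])    # illustrative fixed coefficients
v_true = rng.multivariate_normal(np.zeros(p1), Omega)
Y = X @ (Z @ gamma + v_true) + rng.normal(scale=np.sqrt(sigma2), size=n_i)

v_hat = Omega @ X.T @ V_inv @ (Y - X @ Z @ gamma)
Xbar = X.mean(axis=0)                # stand-in for the population mean vector
mu_hat = Xbar @ (Z @ gamma + v_hat)
```

In practice $\gamma$, $\Omega$ and $\sigma^2$ would themselves be replaced by their RIGLS estimates; here they are treated as known purely to keep the sketch short.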
Two unit level independent variables were identified: the educational attainment of the Head of Household (ordinal scale of 0-5) and the number of rooms in the household (1-11+). The assumed model is

$Y_{ij} = \beta_{i0} + \beta_{i1} X_{1ij} + \beta_{i2} X_{2ij} + e_{ij}, \qquad i = 1, \dots, m; \; j = 1, \dots, N_i \qquad (3.1)$

$\beta_{i0} = \gamma_{00} + v_{i0}; \qquad \beta_{i1} = \gamma_{10} + v_{i1}; \qquad \beta_{i2} = \gamma_{20} + v_{i2}$

where $X_1$ and $X_2$ respectively represent the number of rooms and the educational attainment of the head of the household (centred about their respective population means). The parameter values for the fitted model and their respective standard errors are

$\gamma_{00} = 8.456\,(0.108)$; $\gamma_{10} = 1.223\,(0.046)$; $\gamma_{20} = 2.596\,(0.086)$;
$\omega_{00} = 1.385\,(0.194)$; $\omega_{01} = 0.354\,(0.66)$; $\omega_{02} = 0.492\,(0.117)$;
$\omega_{11} = 0.234\,(0.35)$; $\omega_{12} = 0.333\,(0.054)$; $\omega_{22} = 0.926\,(0.124)$;
$\sigma^2 = 47.74\,(0.345)$.

To carry out numerical investigations within the model-based framework, a simulation was carried out keeping the enumeration district identifiers and the values of the two explanatory variables (X) fixed. Initially the area population means $\bar X_{1i}$ and $\bar X_{2i}$ were calculated for the whole data set, and a randomly selected subsample of 10% of records from each small area was identified. This same subset was retained throughout the simulations (the simulation subset). The data generation for the simulations was carried out in two stages, using a data generation model which was the General Model (G), the Diagonal Model (D) or the Random Intercept Model (RI) as appropriate. In the first case the parameter values were taken from the estimates mentioned earlier; in the second case the off-diagonal terms were set to zero; in the third case only $\omega_{00} = 1.385$ was non-zero. The first stage of the data simulation process was to generate the level 2 random terms (that is, the non-zero elements of $v_{i0}$, $v_{i1}$ and $v_{i2}$) depending on the choice of the data generation model.
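The first stage of the data generation just described can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the covariance entries are the point estimates quoted above, the number of areas is invented, and the function name is ours.

```python
import numpy as np

# point estimates of Omega from the fitted model (standard errors omitted)
OMEGA = np.array([[1.385, 0.354, 0.492],
                  [0.354, 0.234, 0.333],
                  [0.492, 0.333, 0.926]])

def gen_area_effects(model, m, rng):
    """Stage 1: draw the level-2 random terms v_i for m areas under G, D or RI."""
    if model == "G":                       # General: full covariance matrix
        cov = OMEGA
    elif model == "D":                     # Diagonal: off-diagonal terms zeroed
        cov = np.diag(np.diag(OMEGA))
    elif model == "RI":                    # Random Intercept: only omega_00 non-zero
        cov = np.zeros_like(OMEGA)
        cov[0, 0] = OMEGA[0, 0]
    else:
        raise ValueError(model)
    return rng.multivariate_normal(np.zeros(3), cov, size=m)

rng = np.random.default_rng(1)
v = gen_area_effects("RI", m=140, rng=rng)
# under RI only the intercept term varies between areas; the slope terms are zero
```

Stage 2 would then build $Y_{ij} = (\gamma_{00} + v_{i0}) + (\gamma_{10} + v_{i1})X_{1ij} + (\gamma_{20} + v_{i2})X_{2ij} + e_{ij}$ on the fixed simulation subset, exactly as the text describes.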
These random terms were Normally distributed (jointly Normal in the case of the General Data Generation Model and the Diagonal Data Generation Model). At this stage the expected value of the mean for the $i$-th area, conditional on the area level random effects generated by the model $m_1$ = G, D, RI in the $r$-th simulation, could be obtained:

$\mu_i^{(r)} = \beta_{i0}^{(r)} + \beta_{i1}^{(r)} \bar X_{1i} + \beta_{i2}^{(r)} \bar X_{2i}.$

At the second stage of the data simulation process, unit values ($Y_{ij}$) were created for each of the data generation models. Having generated the data for the simulation subset under one of the data generation models, all three of the estimation models (G, D and RI) could be fitted to the simulated data to obtain parameter estimates and predictors for the small area means. For each data generation model $m_1$ = G, D, RI the whole simulation process was repeated $R = 5{,}000$ times, to yield a set of small area means $\mu_{i,m_1}^{(r)}$ and predicted means $\hat\mu_{i,m_2,m_1}^{(r)}$, $r = 1, \dots, R$, for each small area $i$, $i = 1, \dots, m$, and for the three estimation models $m_2$ = G, D, RI. For each small area, and for data generated under model $m_1$ = G, D, RI, the Mean Square Error (MSE) of the prediction process for each estimation model $m_2$ may be defined as

$\mathrm{MSE}[\hat\mu_{i,m_2,m_1}] = R^{-1} \sum_{r=1}^{R} \big(\hat\mu_{i,m_2,m_1}^{(r)} - \mu_{i,m_1}^{(r)}\big)^2$

and the absolute relative error (ARE) by

$\mathrm{ARE}[\hat\mu_{i,m_2,m_1}] = R^{-1} \sum_{r=1}^{R} \big|\hat\mu_{i,m_2,m_1}^{(r)} - \mu_{i,m_1}^{(r)}\big| \big/ \mu_{i,m_1}^{(r)}.$

For comparative purposes we contrast the properties of each estimator with those of the estimator which is the same as the data generation model. Hence we define the Ratio of Mean Square Error (RMSE):

$\mathrm{RMSE}_{m_2,m_1} = \Big[\sum_i \mathrm{MSE}[\hat\mu_{i,m_2,m_1}] \Big/ \sum_i \mathrm{MSE}[\hat\mu_{i,m_1,m_1}]\Big] \times 100$

and the Ratio of Absolute Relative Error (RARE):

$\mathrm{RARE}_{m_2,m_1} = \Big[\sum_i \mathrm{ARE}[\hat\mu_{i,m_2,m_1}] \Big/ \sum_i \mathrm{ARE}[\hat\mu_{i,m_1,m_1}]\Big] \times 100.$

It will be seen that when the data are generated from a simpler model (e.g., RI) the more complex estimation procedures do not suffer any appreciable worsening of efficiency or bias. On the other hand, when the data are generated from a more complex model the simpler estimators have inferior properties. However, the difference between the Diagonal and General estimators is much less than between these and the Random Intercept estimator. From Table 1 one would conclude that it is worth introducing additional random coefficients of some kind, beyond the simple Random Intercept model assumptions, but not necessarily the full General Model.

Table 1
Ratios of Mean Square Errors and Ratios of Absolute Relative Errors (in parentheses) for the Three Estimators and Three Data Generation Models

                      Data Generation Model
Estimator           | G             | D             | RI
General (G)         | 100.0 (100.0) | 101.2 (100.6) | 101.8 (100.9)
Diagonal (D)        | 108.8 (82.6)  | 100.0 (100.0) | 100.2 (100.1)
R. Intercept (RI)   | 131.9 (176.9) | 109.1 (105.6) | 100.0 (100.0)

The summary measures in Table 1 are average properties over all small areas. A careful analysis of the MSE performance of the estimators for each small area shows that there is a modest increase in the MSE for the Diagonal Estimator compared to the General Estimator for all areas, whereas for the Random Intercept estimator a relatively small number of areas exhibit a substantial increase in MSE. A similar pattern occurs between the Diagonal and Random Intercept estimators when the Diagonal Data Generation Model is used.

3.2 Introducing a Small-Area Level Covariate

In this section an attempt is made to investigate the impact on small area estimates of introducing an area covariate Z. Unfortunately, for the data set used it was not possible to identify a single contextual area level covariate which had a substantial effect on the multilevel models. Nevertheless, the number of cars per household in each small area was a useful covariate for the individual level random slope coefficients for "Room" and "Edu", but not for the random intercept term. This was observed after some preliminary model fit analysis on the real data. Although the "number of cars" was the best small area level covariate found to explain between area variation, it was not as powerful at the individual level as "Room" and "Edu", the individual level covariates chosen. The model above with the small area covariate Z can be written as

$Y_{ij} = \beta_{i0} + \beta_{i1} X_{1ij} + \beta_{i2} X_{2ij} + e_{ij}, \qquad i = 1, \dots, m; \; j = 1, \dots, N_i \qquad (3.2)$

$\beta_{i0} = \gamma_{00} + v_{i0}; \qquad \beta_{i1} = \gamma_{10} + \gamma_{11} Z_i + v_{i1}; \qquad \beta_{i2} = \gamma_{20} + \gamma_{22} Z_i + v_{i2}.$

The small area random effects were assumed uncorrelated in order to avoid convergence failure in the simulation study. Table 2 reports the parameter estimates and their respective standard errors obtained by fitting the Diagonal Model with the Z covariate (3.2) and without the Z covariate (2.1). It is worth noting the significant reduction of all the components of variance estimates, except $\omega_{00}$ and $\sigma^2$, after introducing the explanatory area covariate Z.

Table 2
Parameter Estimates and Standard Errors for the Diagonal Model with and without the Area Level Covariate: Demographic Data

Parameter    | Diagonal Model with Z | Diagonal Model
$\gamma_{00}$ | 8.442 (0.112)        | 8.688 (0.136)
$\gamma_{10}$ | 0.451 (0.179)        | 1.321 (0.085)
$\gamma_{20}$ | 0.744 (0.272)        | 2.636 (0.134)
$\gamma_{11}$ | 3.779 (0.507)        | -
$\gamma_{22}$ | 1.659 (0.323)        | -
$\omega_{00}$ | 0.745 (0.308)        | 0.637 (0.303)
$\omega_{11}$ | 0.237 (0.083)        | 0.471 (0.116)
$\omega_{22}$ | 0.700 (0.197)        | 1.472 (0.295)
$\sigma^2$    | 44.00 (1.05)         | 44.01 (1.05)

In order to investigate the effect of misspecification of the Z variable, the model based simulation procedure described in Section 3.1 was applied to the two models above, where the data generation was done according to the parameters presented in Table 2. Table 3 summarises the simulation results. It is worth noting that in both cases there is a significant loss of efficiency by using an unsuitable estimator.

Table 3
Ratios of Mean Square Errors and Ratios of Absolute Relative Errors (in parentheses) for the Diagonal and the Diagonal with Z Estimators Under the Two Respective Data Generation Models

                      Data Generation Model
Estimator       | Diagonal      | Diagonal with Z
Diagonal        | 100.0 (100.0) | 110.3 (125.4)
Diagonal with Z | 126.2 (107.5) | 100.0 (100.0)

It can also be seen from an individual analysis of MSE for each small area that a considerable gain in efficiency is achieved with the introduction of a small area covariate Z over the diagonal model. For many small areas the MSE of the Diagonal with Z is significantly less than the MSE of the corresponding estimator without Z. Even for those few areas in which the MSE of the Diagonal with Z is unchanged or even slightly increased by the introduction of Z, the difference is not appreciable.

3.3 Comparisons with the Regression Estimator

One essential advantage of multilevel models over regression models is that they recognize that groups (here the small areas) share common features; the areas are not completely independent, as could be assumed, for example, by using a separate linear regression model for each small area. Nevertheless, the relatively small intraclass correlation observed for the data set used, plus the fact that each small area has on average 28 units, could make one think that in this case the use of the multilevel model would not result in great improvement in the small area estimators. However, it is gratifying to know that even in these circumstances the multilevel model small area estimator performs on average better than the synthetic separate regression estimator, under either the multilevel model or even under the regression model. Table 4 illustrates this finding. The multilevel data generation model used was the General one with the parameters given in Section 3.1. The parameters used in the data generation regression model were obtained by fitting a separate regression for each small area.
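The comparison measures used throughout this section (RMSE and RARE, as defined in Section 3.1) can be computed from the simulated quantities as in this sketch; the array shapes, names and toy inputs are our assumptions, not the paper's code:

```python
import numpy as np

def rmse_rare(mu_hat_m2, mu_hat_m1, mu_true):
    """Ratio of summed per-area MSEs and of summed per-area AREs, each x 100.

    All arrays are (R simulations) x (m areas): mu_hat_m2 are predictions from
    the estimator under study, mu_hat_m1 from the estimator matching the data
    generation model, and mu_true the realised area means mu_i^(r).
    """
    mse = lambda est: np.mean((est - mu_true) ** 2, axis=0)           # per-area MSE
    are = lambda est: np.mean(np.abs(est - mu_true) / mu_true, axis=0)
    return (100 * mse(mu_hat_m2).sum() / mse(mu_hat_m1).sum(),
            100 * are(mu_hat_m2).sum() / are(mu_hat_m1).sum())

rng = np.random.default_rng(0)
mu = 8.0 + rng.random((5000, 140))           # toy realised area means
match = mu + rng.normal(0, 1.0, mu.shape)    # estimator matching the generator
other = mu + rng.normal(0, 1.2, mu.shape)    # a noisier competing estimator
rmse, rare = rmse_rare(other, match, mu)
```

By construction the matching estimator compared with itself gives exactly (100.0, 100.0), and the noisier competitor gives ratios above 100, mirroring how the table entries in this section are read.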
It can be seen from Table 4 that the Separate Regression estimator, which does not exploit what the small areas have in common through small area random effects, shows a substantial loss of efficiency when compared with the General estimator.

Table 4
Ratios of Mean Square Errors and Ratios of Absolute Relative Errors (in parentheses) for the General and the Separate Regression Estimators Under the Two Respective Data Generation Models

                      Data Generation Model
Estimator           | General       | Separate Regression
General             | 100.0 (100.0) | 88.1 (83.1)
Separate Regression | 247.6 (154.7) | 100.0 (100.0)

Figure 1 illustrates this fact by showing a plot of the ratio of mean square error between the General estimator and the Separate Regression estimator for each small area. To demonstrate the effect of the small area sample size on the efficiency, the ratio of the MSEs is plotted against the sample size for each small area. It is clear from Figure 1 that the gain in efficiency tends to decrease as the sample size increases.

[Figure 1. Model-based efficiencies of the general estimator compared with the separate regression estimator for each small area. Scatter plot of the ratio of MSEs (vertical axis, 0 to 12) against small area sample size (horizontal axis, 0 to 60).]

3.4 An Evaluation of the MSE Approximation and the MSE Estimator

From the simulation results we may investigate the properties of the MSE approximation (2.7). If we consider the General estimator when the General Data Generation model is used, the MSE approximation appears to be very good. The average underestimation of the MSE approximation was 0.31% of the MSE value, with a range from a largest underestimate of 5.4% of the MSE value through to a largest overestimate of 4.8% of the MSE value. For the situation considered here, $T_1$ contributed on average 94.6% of the total variation and $T_2$ a further 4.3%. Given the large component of variance due to $\sigma^2$, these results are not unexpected. For individual areas the component $T_1$ varied between 87.4% and 99.1% of the total, and $T_2$ varied between 0.7% and 10.5% of the total. The component $T_3$ never contributed more than 2.2% of the total MSE for any area.

We also investigated the performance of the MSE estimator represented by equation (2.10) against the "naive" estimator of the MSE, which does not consider the last term of (2.10). The average Root Mean Squared Error of the proposed MSE estimator is 17.5%, ranging from 4.7% to 32.3%, while for the naive estimator the average is 20.9%, ranging from 5.2% to 47.5%. The MSE estimator is on average unbiased, while the naive MSE estimator underestimates the MSE on average by 9.1%, its relative bias ranging from -23.5% to -0.9%. Our results agree with others, see Singh, Stukel and Pfeffermann (1998) and Prasad and Rao (1990), which show that the naive estimator can exhibit severe bias.

4. DISCUSSION

Prasad and Rao (1990) and Battese et al. (1981, 1988) have demonstrated that models which include small area specific components of variance can provide greatly improved small area estimators. Some of the numerical results in this paper show that, within the model-based simulation framework, even better estimators can be obtained by allowing the small area slopes as well as the intercept to be random. The overall conclusions from this investigation for this set of parameter values are that: a component of variance model more complex than the Random Intercept estimator is beneficial; overspecification of the model (e.g., using the General estimator with data generated under the Random Intercept Model) does not lead to serious loss of efficiency; the use of small area covariates can also improve the small area estimates; and the use of multilevel models should be preferred to the Separate Regression Model. The simulation study confirms that the MSE approximation appears to be precise and that the MSE estimation is approximately unbiased, reflecting the variation in MSE between areas, but further theoretical investigation of the exact order of the approximation should be done. Clearly, model fitting and diagnostics are crucial. If we apply a general mixed model in circumstances where it is only a poor fit to the data, then the results may be disappointing. Considerably more investigation is needed to understand what characteristics of specific small areas are likely to provide efficiency gains if general mixed models are used rather than simpler models.

ACKNOWLEDGEMENTS

We would like to thank the referees and the Editor for their helpful comments on the earlier version of this paper.

APPENDIX A: RESTRICTED ITERATIVE GENERALISED LEAST SQUARES PROCEDURE

The generalised least squares estimator of $\gamma$ in the model (2.1) is given by

$\hat\gamma = (Z^T X^T V^{-1} X Z)^{-1}(Z^T X^T V^{-1} Y) = \Big[\sum_{i=1}^m Z_i^T X_i^T V_i^{-1} X_i Z_i\Big]^{-1}\Big[\sum_{i=1}^m Z_i^T X_i^T V_i^{-1} Y_i\Big] \qquad (\mathrm{A.1})$

where $V = \mathrm{Diag}(V_1, \dots, V_m)$ and $V_i = \sigma^2 I + X_i \Omega X_i^T$ is the covariance matrix of $Y_i$, $i = 1, \dots, m$. However, $V$ is assumed to be a function of unknown parameters, thus $\gamma$ cannot be estimated using (A.1). On the other hand, if $\gamma$ is known then

$Y^* = \mathrm{vech}\big[(Y - XZ\gamma)(Y - XZ\gamma)^T\big] \qquad (\mathrm{A.2})$

is an unbiased estimator of $\mathrm{vech}(V)$. Furthermore, $\mathrm{vech}(V)$ is a linear function of $\theta$. Then we can consider the following linear model:

$Y^* = F\theta + \xi \qquad (\mathrm{A.3})$

where $F = \partial\mathrm{vech}(V)/\partial\theta$ and $\xi$ is a random variable with mean $0 = (0, \dots, 0)^T$ and covariance $V_\xi = 2\varphi_n(V \otimes V)\varphi_n^T$. The matrix $\varphi_n$ is the linear transformation of $\mathrm{vec}(A)$ into $\mathrm{vech}(A)$, where $A$ is any $n \times n$ matrix such that $\mathrm{vech}(A) = \varphi_n \mathrm{vec}(A)$; see Fuller (1987) for further details. Then, assuming that $F$ has full rank and $V_\xi$ is known and non-singular, it may be shown that the generalised least squares estimator of $\theta$ is given by

$\hat\theta = \tfrac{1}{2}\,\mathrm{cov}(\hat\theta)\Big(\frac{\partial\mathrm{vec}(V)^T}{\partial\theta}\Big)\big[V^{-1} \otimes V^{-1}\big]\mathrm{vec}(\tilde Y \tilde Y^T) \qquad (\mathrm{A.4})$

where

$\mathrm{cov}(\hat\theta) = 2\Big[\Big(\frac{\partial\mathrm{vec}(V)^T}{\partial\theta}\Big)\big[V^{-1} \otimes V^{-1}\big]\Big(\frac{\partial\mathrm{vec}(V)}{\partial\theta^T}\Big)\Big]^{-1}$

and $\tilde Y = Y - XZ\hat\gamma$. Note that $\hat\theta$ depends on $\theta$ and $\gamma$, so both may be iteratively estimated. The IGLS procedure starts with an initial estimate of $V$ (that is, initial values of $\theta$), which produces an estimate of $\gamma$. Replacing the initial estimate of $V$ together with the estimate of $\gamma$ in (A.4) then provides an improved estimate of $\theta$. In most cases convergence is achieved after a few iterations between equations (A.1) and (A.4), although it is not always guaranteed.

The RIGLS approach is based on the fact that if $\gamma$ is estimated by generalised least squares with $V$ known, then

$E\big[(Y - XZ\hat\gamma)(Y - XZ\hat\gamma)^T\big] = V - XZ(Z^T X^T V^{-1} X Z)^{-1} Z^T X^T.$

The equation above suggests that we use

$(Y - XZ\hat\gamma)(Y - XZ\hat\gamma)^T + XZ(Z^T X^T V^{-1} X Z)^{-1} Z^T X^T \qquad (\mathrm{A.5})$

instead of $(Y - XZ\hat\gamma)(Y - XZ\hat\gamma)^T$ at each iteration cycle described above, in order to obtain an approximately unbiased estimator of $V$ and consequently of $\theta$. As pointed out by Goldstein (1986, 1989), if we start with a consistent estimate of $\gamma$, say the ordinary least squares estimator, then the final estimates will be consistent providing finite fourth moments exist. It is worth noting that it is possible for the above procedure to yield negative estimates of variances. This problem can be avoided by imposing constraints at each iteration. For further details on this issue see Goldstein (1986).

APPENDIX B: AN APPROXIMATION TO $E[\hat\mu_i - \tilde\mu_i]^2$

Prasad and Rao (1990), based on Kackar and Harville (1984), developed a second order approximation to the second term of (2.5) under some regularity conditions:

$E[\hat\mu_i - \tilde\mu_i]^2 \approx \mathrm{tr}\Big[\Big(\frac{\partial d_i^T}{\partial\theta}\Big)V\Big(\frac{\partial d_i^T}{\partial\theta}\Big)^T E\big\{(\hat\theta - \theta)(\hat\theta - \theta)^T\big\}\Big] \qquad (\mathrm{B.1})$

where, for the model (2.1), $d_i^T = \bar X_i^T K_i (I \otimes \Omega) X^T V^{-1}$, $K_i = [0, \dots, I, \dots, 0]$ is the $(p+1) \times (p+1)m$ matrix with the identity matrix $I$ of order $p+1$ in the $i$-th position and $0$ as the null matrix of order $p+1$, and $\hat\theta$ is any translation-invariant estimator of $\theta = (\theta_1, \dots, \theta_s)$, where $\theta_s = \sigma^2$ and $\theta_k$, $k = 1, \dots, s-1$, are the distinct elements of $\Omega$.

Goldstein (1989) proves that under normality of the random terms of model (2.1), the RIGLS estimator of $\theta$ is equivalent to the Restricted Maximum Likelihood Estimator (RMLE), which is translation invariant. Let us approximate $E[(\hat\theta - \theta)(\hat\theta - \theta)^T]$ by the asymptotic covariance matrix of the RMLE estimator, $B$. The $jk$-th element of $B^{-1}$ is given by (see Harville 1977)

$b_{jk}^{-} = \frac{1}{2}\sum_{i=1}^m \mathrm{tr}\Big[P_i \frac{\partial V_i}{\partial\theta_j} P_i \frac{\partial V_i}{\partial\theta_k}\Big], \qquad j, k = 1, \dots, s,$

where

$P_i = V_i^{-1} - V_i^{-1} X_i Z_i \Big(\sum_{l=1}^m Z_l^T X_l^T V_l^{-1} X_l Z_l\Big)^{-1} Z_i^T X_i^T V_i^{-1}.$

Let $b_{jk}$ be the $jk$-th element of $B$. After some matrix algebra, it can be shown that

$T_3 = \bar X_i^T (G_i^{-1})^T \Big[\sum_{j=1}^{s-1}\sum_{k=1}^{s-1} b_{jk} A_j C_i A_k\Big] G_i^{-1} \bar X_i - 2\,\bar X_i^T (G_i^{-1})^T \Big[\sum_{j=1}^{s-1} b_{js} A_j\Big] R_i \Omega \bar X_i + b_{ss}\,\bar X_i^T \Omega S_i \Omega \bar X_i \qquad (\mathrm{B.2})$

where $C_i = \sigma^{-2} G_i^{-1} X_i^T X_i$; $R_i = \sigma^{-4} G_i^{-2} X_i^T X_i$; $S_i = \sigma^{-6} G_i^{-3} X_i^T X_i$; and $A_k = \partial\Omega/\partial\theta_k$, $k = 1, \dots, s-1$, is the derivative matrix of $\Omega$ with respect to $\theta_k$.

REFERENCES

BATTESE, G.E., and FULLER, W.A. (1981). Prediction of county crop areas using survey and satellite data. Proceedings of the Section on Survey Research Methods, American Statistical Association, 500-505.

BATTESE, G.E., HARTER, R.M., and FULLER, W.A. (1988). An error components model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association, 83, 28-36.

FULLER, W.A. (1987). Measurement Error Models. Chichester: John Wiley.

GHOSH, M., and RAO, J.N.K. (1994). Small area estimation: an appraisal. Statistical Science, 9, 55-93.

GOLDSTEIN, H. (1986). Multilevel mixed linear model analysis using iterative generalised least squares estimation. Biometrika, 73, 43-56.

GOLDSTEIN, H. (1989). Restricted unbiased iterative generalised least squares estimation. Biometrika, 76, 622-623.

GONZALEZ, M.E. (1973). Use and evaluation of synthetic estimates. Proceedings of the Social Statistics Section, American Statistical Association, 33-36.

HARVILLE, D.A. (1977). Maximum likelihood approaches to variance component estimation and related problems. Journal of the American Statistical Association, 72, 320-340.

HARVILLE, D.A., and JESKE, D.R. (1992). Mean squared error of estimation or prediction under a general linear model. Journal of the American Statistical Association, 87, 724-731.

HENDERSON, C.R. (1975). Best linear unbiased estimation and prediction under a selection model. Biometrics, 31, 423-447.

HOLT, D., and MOURA, F. (1993a). Mixed models for making small area estimates. In: Small Area Statistics and Survey Design, (G. Kalton, J. Kordos, and R. Platek, Eds.) 1, 221-231. Warsaw: Central Statistical Office.

HOLT, D., and MOURA, F. (1993b). Small area estimation using multilevel models. Proceedings of the Section on Survey Research Methods, American Statistical Association, 21-31.

KACKAR, R.N., and HARVILLE, D.A. (1984). Approximations for standard errors of estimators of fixed and random effects in mixed linear models. Journal of the American Statistical Association, 79, 853-862.

LONGFORD, N. (1987). A fast scoring algorithm for maximum likelihood estimation in unbalanced mixed models with nested effects. Biometrika, 74, 817-827.

MOURA, F.A.S. (1994). Small Area Estimation Using Multilevel Models. Unpublished Ph.D. thesis, University of Southampton.

PRASAD, N.G.N., and RAO, J.N.K. (1990). The estimation of the mean squared error of small-area estimators. Journal of the American Statistical Association, 85, 163-171.

SINGH, A.C., STUKEL, D., and PFEFFERMANN, D. (1998). Bayesian versus frequentist measures of error in small area estimation. Journal of the Royal Statistical Society, Series B, 377-396.

Survey Methodology, June 1999, Vol. 25, No. 1, pp.
81-86, Statistics Canada

Composite Estimation of Drug Prevalences for Sub-State Areas

MANAS CHATTOPADHYAY, PARTHA LAHIRI, MICHAEL LARSEN and JOHN REIMNITZ

ABSTRACT

The Gallup Organization has been conducting household surveys to study state-wide prevalences of alcohol and drug (e.g., cocaine, marijuana) use. Traditional design-based survey estimates of use and dependence for counties and select demographic groups have unacceptably large standard errors because sample sizes in sub-state groups are too small. Synthetic estimation incorporates demographic information and social indicators in estimates of prevalence through an implicit regression model. Synthetic estimates tend to have smaller variances than design-based estimates, but can be very homogeneous across counties when auxiliary variables are homogeneous. Composite estimates for small areas are weighted averages of design-based survey estimates and synthetic estimates. A second problem, generally not encountered at the state level but present for sub-state areas and groups, concerns estimating standard errors of estimated prevalences that are close to zero. This difficulty affects not only telephone household survey estimates, but also composite estimates. A hierarchical model is proposed to address this problem. Empirical Bayes composite estimators of prevalences, which incorporate survey weights, and jackknife estimators of their mean squared errors are presented and illustrated.

KEY WORDS: Alcohol abuse; Drug abuse; Empirical Bayes; Jackknife; Mean squared error; Small area estimation; Synthetic estimation.

1. INTRODUCTION

The Gallup Organization has been conducting a series of household surveys for different states to study state-wide prevalences of the use of alcohol and drugs (e.g., cocaine, marijuana) among civilian, non-institutionalized adults and adolescents.
The common goal of these surveys is to estimate the use and dependence prevalences for alcohol and drugs and, on that basis, to project the treatment needs of dependent users. For planning and resource allocation, states need precise estimates of prevalences for certain subgroups of the target population. For example, it is of interest to estimate prevalences for sub-state planning regions and counties in demographic subpopulations (e.g., older white males). Traditional design-based procedures to estimate use and dependence for subpopulations have two drawbacks. First, if the traditional design-based survey estimate for a subgroup is positive, but the sample size is small, then the corresponding standard error is unacceptably large. Second, since the problem is to estimate the proportion of a rare event, it is possible that the design-based procedure produces an estimate of zero; standard error estimation formulas for a particular subgroup, if applied, would then give a false impression of the true underlying variability. To improve on the traditional design-based estimators, one can use certain supplementary information usually available from administrative records in conjunction with the telephone survey data. This generally is done by using either implicit or explicit models that "borrow strength", that is, incorporate additional information that relates the various groups, counties, and planning regions to one another. The method proposed here combines information across counties in order to deal with the problem of zero estimates in some counties. It is derived from a model that bounds the proportions away from 1, which is reasonable in an application with proportions expected to be very small, and estimates parameters using empirical Bayes methods. The procedure also incorporates the survey sampling weights in estimation. For a detailed account of small-area estimation methods, see Ghosh and Rao (1994).
Other recent references can be found in Farrell, MacGibbon and Tomberlin (1997) and Malec, Sedransk, Moriarity and Leclere (1997). Farrell et al. (1997) propose estimating small-area proportions with empirical Bayes procedures. They model the proportions via a logistic regression that relates expected proportions to respondent variables and includes random effects for the small areas. Malec et al. (1997) use hierarchical Bayes models. They use logistic regression models to relate individual characteristics to probabilities of an outcome and then use a linear regression model to relate coefficients across small areas. Most existing methods, including those of Farrell et al. (1997) and Malec et al. (1997), do not directly use survey sampling weights in estimation. The survey design used by Gallup is described in section 2. In section 3, notations used in the paper are introduced. A direct design-based estimator and two synthetic estimators are presented in section 4. In section 5, several composite estimators of prevalences of alcohol and drug use and dependence are given. In this section, certain empirical Bayes estimators and jackknife estimators of their mean squared errors (MSE) are proposed. In section 6, estimators presented in sections 4 and 5 are applied to a data set from a particular state. The focus of the analysis in this study is taken to be county level estimates. Sample size planning considerations originally were concerned with larger sub-state planning areas.

Manas Chattopadhyay, The Gallup Organization; Partha Lahiri, University of Nebraska-Lincoln; Michael Larsen, Harvard University, Department of Statistics, Science Center, One Oxford Street, Cambridge, MA 02138, U.S.A.; John Reimnitz, The Gallup Organization.

82 Chattopadhyay, Lahiri, Larsen and Reimnitz: Composite Estimation of Drug Prevalences

2.
SURVEY

For sampling purposes, the state is divided into a few planning regions and samples are collected independently for each planning region using the truncated stratified random digit dialing (RDD) method of Casady and Lepkowski (1993). This design stratifies the Bellcore (BCR) frame into two strata: a high-density stratum consisting of 100-banks with one or more listed residential numbers, and a low-density stratum consisting of all the remaining numbers in the BCR frame. About 52 percent of the numbers in the high-density stratum are estimated to be working residential numbers, whereas in the low-density stratum the corresponding percentage is only about 2 percent. The Casady-Lepkowski procedure exploits the significant difference in the cost of sampling between the two strata by optimally determining the sample size in each stratum. In the truncated version of the procedure, sampling is done only from the high-density stratum. Sample size in the original study was determined in order to estimate statewide prevalence with a desired degree of accuracy. Sample sizes were allocated to the planning regions using an optimal allocation scheme. Data on drug treatment admissions for the adult population in each county were used to compute the index prevalence (rate of admissions) percent in every planning region. These indices were then used to calculate the optimum sample size for each planning region. As a result of optimal allocation, relatively larger sample sizes were allocated to planning regions with higher index prevalences. The optimal allocation also minimizes the variance of the estimators. Gallup also oversampled the 18-45 age group by planning region, because it is the age group with relatively higher rates of illicit drug use. Due to optimal allocation (which may be disproportional), the age oversampling and the complex design, weighting was needed to compute estimates from the sample data.
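As a rough illustration of this kind of optimal allocation, a Neyman-type rule for estimating a proportion can be sketched as follows. The region sizes, index prevalences, and total sample size below are hypothetical, and the paper's exact allocation formula is not reproduced here; this is only a sketch of the general idea that regions with higher index prevalences receive larger samples.

```python
import math

def neyman_allocation(n_total, pop_sizes, index_prev):
    """Allocate n_total across planning regions in proportion to
    N_h * sqrt(p_h * (1 - p_h)) -- the Neyman-optimal rule for a
    proportion when per-unit costs are equal across regions."""
    scores = [N * math.sqrt(p * (1 - p)) for N, p in zip(pop_sizes, index_prev)]
    total = sum(scores)
    return [round(n_total * s / total) for s in scores]

# Hypothetical adult populations and index prevalences (admission rates)
# for four planning regions; a total sample of 2,000 is assumed.
pop = [200_000, 150_000, 400_000, 250_000]
prev = [0.02, 0.05, 0.01, 0.03]
alloc = neyman_allocation(2000, pop, prev)
```

Rounding can make the allocations sum to slightly more or less than the target, so a final adjustment step would be needed in practice.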
The necessary weights, commonly known as sampling weights, were computed using current estimates of the population based on census data. Due to budgetary constraints, it is not possible to increase sample size for all sub-state regions and groups in order to achieve the desired accuracy. To estimate alcohol and drug prevalences, we consider empirical Bayes procedures (see Efron and Morris 1973, Fay and Herriot 1979, Ghosh and Lahiri 1987, among others) to improve on usual design-based estimates of drug prevalences by taking advantage of demographic measurements and social indicator data. Other variables that possibly are related to use and dependence prevalence by county and that are available from the Census include the percent of population that is over 65, under 30, white, male, married, and renters. Local governments can provide data by county on social indicators, such as DUI (Driving Under the Influence) rate, mortality rate, per capita liquor licenses, and drug and alcohol treatment admission rates. The more closely auxiliary variables relate to use and dependence prevalence, the more likely it is that methods that "borrow strength" across areas and groups, such as the empirical Bayes methods presented here, can be employed to meet the desired accuracy levels for sub-state areas.

3. NOTATIONS

Let $n_i$ be the sample size allocated to the $i$-th planning region, $i = 1, \ldots, I$ ($n = \sum_i n_i$). Samples are drawn independently in each planning region using RDD telephone surveys. After the sample is observed, suppose each region is post-stratified into $K$ demographic groups. These groups are formed by cross-classifying gender (Male, Female) and age (18-24, 25-44, 45-64, 65+), resulting in $K = 2 \times 4 = 8$ groups. Suppose there are $J_i$ counties in the $i$-th planning region ($i = 1, \ldots, I$) and $n_{ijk}$ observations within the $k$-th demographic group in the $j$-th county belonging to the $i$-th planning region ($i = 1, \ldots, I$; $j = 1, \ldots, J_i$; $k = 1, \ldots, K$).
Since typically $n_{ijk}$ is small, there is a good chance that some of the $K$ demographic groups are not represented in a particular county. Let $S_{ij}$ be the set of demographic groups in the $j$-th county within the $i$-th stratum ($i = 1, \ldots, I$; $j = 1, \ldots, J_i$) for which individuals have completed surveys. Let $y_{ijkl}$ be the $l$-th observation (0 or 1) for the $k$-th demographic group in the $j$-th county belonging to the $i$-th planning area ($i = 1, \ldots, I$; $j = 1, \ldots, J_i$; $k \in S_{ij}$; $l = 1, \ldots, n_{ijk}$). Let $w_{ijkl}$ be the corresponding sampling weight available from the survey. The goal is to estimate $\pi_{ij}$, the true prevalence of substance use or dependence for the $j$-th county within the $i$-th planning area ($i = 1, \ldots, I$; $j = 1, \ldots, J_i$).

4. DIRECT SURVEY ESTIMATOR AND SYNTHETIC ESTIMATORS

The direct sample survey estimator of $\pi_{ij}$ is given by

$$\hat\pi_{ij}^D = \frac{\sum_{k \in S_{ij}} \sum_{l=1}^{n_{ijk}} w_{ijkl}\, y_{ijkl}}{\sum_{k \in S_{ij}} \sum_{l=1}^{n_{ijk}} w_{ijkl}}.$$

The sample size available from a county could be very small (sometimes as small as 3 or 4). Thus, the estimator is highly unreliable. Other direct survey estimators are defined similarly. For example, the direct survey estimator of $\pi_{ik}$, the true prevalence in the $k$-th demographic group in the $i$-th planning region, is

$$\hat\pi_{ik}^D = \frac{\sum_{j: k \in S_{ij}} \sum_{l=1}^{n_{ijk}} w_{ijkl}\, y_{ijkl}}{\sum_{j: k \in S_{ij}} \sum_{l=1}^{n_{ijk}} w_{ijkl}},$$

where the notation $j: k \in S_{ij}$ means that the summation is over counties $j$ in which demographic group $k$ is observed. Additional problems arise when estimating the proportion of rare events. It is quite likely that all observations in a county may be zero, resulting in a zero estimate for a county. If usual estimates of standard error were applied, an estimated zero standard error of the estimate would give a false impression of the uncertainty of the estimate. Thus, it is very important to improve on the direct survey estimator. Synthetic estimators borrow strength from related counties through implicit modeling of supplementary data from the U.S. Census Bureau along with the telephone survey data.
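The direct estimator above is a weighted ratio: the weight share of positive responses among all responses. A minimal sketch, with hypothetical sampling weights and 0/1 dependence indicators:

```python
def direct_estimate(weights, responses):
    """Weighted direct survey estimator of a prevalence: the weight
    share of positive (y = 1) responses among all sampled responses."""
    total = sum(weights)
    if total == 0:
        return float("nan")
    return sum(w for w, y in zip(weights, responses) if y == 1) / total

# Hypothetical county sample: five respondents with sampling weights
# and 0/1 alcohol-dependence indicators.
w = [120.0, 80.0, 100.0, 150.0, 90.0]
y = [0, 1, 0, 0, 1]
pi_hat = direct_estimate(w, y)  # (80 + 90) / 540
```

Note that when every response in a county is zero, the estimator returns exactly zero, which is the degenerate case the paper is concerned with.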
A synthetic estimator, which has been used in the past to estimate alcohol prevalence at the county level, is given by

$$\hat\pi_{ij}^{S1} = \sum_{k=1}^{K} a_{ijk}\, \hat\pi_k,$$

where $\hat\pi_k$ is the statewide direct survey estimator of the prevalence of alcohol for the $k$-th demographic group and $a_{ijk}$ is the proportion of individuals belonging to the $k$-th demographic group in the $j$-th county within the $i$-th planning area ($i = 1, \ldots, I$; $j = 1, \ldots, J_i$; $k = 1, \ldots, K$). The value $a_{ijk}$ is available from current census estimates. For the household survey reported in this paper, the $a_{ijk}$ values were obtained from database vendors like Claritas Data Services of Ithaca, New York. Based on the latest available census data, the $a_{ijk}$ values are typically estimated using projection models. In practice, therefore, the $a_{ijk}$ values are not true proportions but are current census estimates of reasonable precision. Outdated or inaccurate $a_{ijk}$ values cause the estimators using them to be biased. If population projections are used to calculate post-stratification weighting adjustments in the survey, the direct survey estimator also suffers from this source of bias. It is beyond the scope of this paper to study the impact of alternate population projections. In proposing $\hat\pi_{ij}^{S1}$, it is implicitly assumed that the prevalences for alcohol and drug use for the $k$-th group in all the counties are the same (or nearly the same). A less restrictive synthetic estimator of the prevalence of alcohol and drug use is given by

$$\hat\pi_{ij}^{S2} = \sum_{k=1}^{K} a_{ijk}\, \hat\pi_{ik},$$

where $\hat\pi_{ik}$ is a direct survey estimator of $\pi_{ik}$, the prevalence of alcohol or drug use for the $k$-th demographic group in the $i$-th planning region. It is implicitly assumed that the prevalences in the $k$-th group are the same (or nearly the same) for all the counties in a planning region. This assumption is more "regional", or less restrictive, than the one made in proposing $\hat\pi_{ij}^{S1}$.
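A synthetic estimate is just a census-share-weighted average of group-level prevalence estimates. A minimal sketch with hypothetical shares and statewide group prevalences (the S1 variant; S2 would substitute region-level group estimates):

```python
def synthetic_estimate(shares, group_prev):
    """Synthetic estimator: census share of each demographic group in
    the county times a group-level direct prevalence estimate."""
    assert abs(sum(shares) - 1.0) < 1e-9  # the a_ijk shares sum to one
    return sum(a * p for a, p in zip(shares, group_prev))

# Hypothetical census shares for the K = 8 gender-by-age groups in one
# county, and hypothetical statewide group prevalences.
a_county = [0.10, 0.15, 0.15, 0.10, 0.12, 0.13, 0.15, 0.10]
pi_state = [0.06, 0.04, 0.02, 0.01, 0.03, 0.02, 0.01, 0.005]
s1 = synthetic_estimate(a_county, pi_state)
```

Because the group prevalences are shared by all counties, two counties with similar census shares get nearly identical synthetic estimates, which is exactly the homogeneity the paper notes.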
A similar direct survey estimator $\hat\pi_{ijk}^D$ for the $k$-th demographic group within a specific county $j$ in region $i$ may be defined by restricting the sample to county $j$ only. As compared to $\hat\pi_{ijk}^D$, the estimator $\hat\pi_{ik}^D$ will have relatively lower variance, although it may have some bias since it does not distinguish the counties. $\hat\pi_{ijk}^D$, on the other hand, may be based on a very small sample size and hence may be significantly less reliable in terms of its variability. The above synthetic estimators achieve reductions in variances at the cost of increasing bias. The synthetic estimators distinguish counties only through an indirect variable $a_{ijk}$ obtained from the census, whereas the direct estimator treats each county separately.

5. COMPOSITE ESTIMATORS OF $\pi_{ij}$ USING TELEPHONE SURVEY AND CENSUS DATA

A compromise between a direct survey estimator and a synthetic estimator is a composite estimator. A number of different composite estimators are proposed here based on the following identity:

$$\pi_{ij} = \sum_{k \in S_{ij}} a_{ijk}\, \pi_{ijk} + \sum_{k \notin S_{ij}} a_{ijk}\, \pi_{ijk},$$

where $\pi_{ijk}$ is the prevalence of alcohol and drug use and $a_{ijk}$, as defined above, is the proportion of individuals belonging to the $k$-th demographic group in the $j$-th county within the $i$-th planning area ($i = 1, \ldots, I$; $j = 1, \ldots, J_i$; $k = 1, \ldots, K$). A simple composite estimator of $\pi_{ij}$ is obtained when, for $k \in S_{ij}$, $\pi_{ijk}$ is estimated by $\hat\pi_{ijk}^D$, the direct survey estimator of $\pi_{ijk}$, and for $k \notin S_{ij}$, $\pi_{ijk}$ is estimated by $\hat\pi_{ik}^D$. The estimator is then given by

$$\hat\pi_{ij}^C = \sum_{k \in S_{ij}} a_{ijk}\, \hat\pi_{ijk}^D + \sum_{k \notin S_{ij}} a_{ijk}\, \hat\pi_{ik}^D.$$

In the above formula, $\hat\pi_{ijk}^D$ ($k \in S_{ij}$) is estimated using a small sample and thus there is the possibility of improving on $\hat\pi_{ijk}^D$ (and, hence, on $\hat\pi_{ij}^C$) by borrowing strength from relevant resources. To this end, an empirical Bayes estimate of $\pi_{ij}$ is proposed based on the following model.

Model

1.
Given the $\pi_{ijk}$'s, the $y_{ijkl}$'s are uncorrelated with one another, with $E(y_{ijkl} \mid \pi_{ijk}) = \pi_{ijk}$ and $\mathrm{Var}(y_{ijkl} \mid \pi_{ijk}) = \pi_{ijk}(1 - \pi_{ijk})$ for $i = 1, \ldots, I$; $j = 1, \ldots, J_i$; $k = 1, \ldots, K$; $l = 1, \ldots, n_{ijk}$.

2. The $\pi_{ijk}$'s are uncorrelated with $E(\pi_{ijk}) = \mu_{ik}$ and $\mathrm{Var}(\pi_{ijk}) = d\,\mu_{ik}^2$ ($i = 1, \ldots, I$; $j = 1, \ldots, J_i$; $k = 1, \ldots, K$). If $\pi_{ijk} \sim \mathrm{Uniform}(0, 2\mu_{ik})$, then in statement (2) $d = 1/3$.

Thus, unlike the implicit assumption made in the synthetic estimator $\hat\pi_{ij}^{S2}$ (i.e., $\pi_{ijk} = \mu_{ik}$), some variability of proportions across counties within a region for a particular demographic group is allowed. The first assumption of the model implies that, given the $\pi_{ijk}$'s, the $\hat\pi_{ijk}^D$'s are uncorrelated with one another, with $E(\hat\pi_{ijk}^D \mid \pi_{ijk}) = \pi_{ijk}$. Let $c_{ijk} = \sum_{l=1}^{n_{ijk}} w_{ijkl}^2 / (\sum_{l=1}^{n_{ijk}} w_{ijkl})^2$ for $i = 1, \ldots, I$; $j = 1, \ldots, J_i$; $k = 1, \ldots, K$. The linear Bayes estimator of $\pi_{ij}$, under the model and squared error loss function, is given by

$$\hat\pi_{ij}^B = \sum_{k \in S_{ij}} a_{ijk} \left( B_{ijk}\, \hat\pi_{ijk}^D + (1 - B_{ijk})\, \mu_{ik} \right) + \sum_{k \notin S_{ij}} a_{ijk}\, \mu_{ik},$$

where $B_{ijk} = d\,\mu_{ik}^2 / \left( d\,\mu_{ik}^2 + c_{ijk} \left( \mu_{ik} - (d+1)\,\mu_{ik}^2 \right) \right)$. The weight, or shrinkage factor, $B_{ijk}$ is the ratio of the variance of $\pi_{ijk}$ under the model to the (unconditional) variance of $\hat\pi_{ijk}^D$. Since the Bayes estimator involves the unknown parameter $\mu_{ik}$, it cannot be used in practice. The following empirical Bayes estimator of $\pi_{ij}$ is obtained when $\mu_{ik}$ is replaced by an estimator $\hat\mu_{ik}$:

$$\hat\pi_{ij}^{EB} = \sum_{k \in S_{ij}} a_{ijk} \left( \hat B_{ijk}\, \hat\pi_{ijk}^D + (1 - \hat B_{ijk})\, \hat\mu_{ik} \right) + \sum_{k \notin S_{ij}} a_{ijk}\, \hat\mu_{ik},$$

where $\hat B_{ijk} = d\,\hat\mu_{ik}^2 / \left( d\,\hat\mu_{ik}^2 + c_{ijk} \left( \hat\mu_{ik} - (d+1)\,\hat\mu_{ik}^2 \right) \right)$. The estimator of $\mu_{ik}$ is taken to be $\hat\mu_{ik} = \hat\pi_{ik}^D$, i.e.,

$$\hat\mu_{ik} = \sum_{j: k \in S_{ij}} \sum_{l=1}^{n_{ijk}} w_{ijkl}\, y_{ijkl} \Big/ \sum_{j: k \in S_{ij}} \sum_{l=1}^{n_{ijk}} w_{ijkl},$$

with $\hat\mu_{ik(-u)}$ defined in the same way with the $u$-th county deleted. Since $\mathrm{MSE}(\hat\pi_{ij}^{EB}) = \mathrm{MSE}(\hat\pi_{ij}^B) + E(\hat\pi_{ij}^{EB} - \hat\pi_{ij}^B)^2$ contains the unknown parameter $\mu_{ik}$, it must itself be estimated. The first term can be estimated by the jackknife estimator

$$\mathrm{mse}_J(\hat\pi_{ij}^B) = \mathrm{mse}(\hat\pi_{ij}^B) - \frac{J_i - 1}{J_i} \sum_{u=1}^{J_i} \left( \mathrm{mse}_{(-u)}(\hat\pi_{ij}^B) - \mathrm{mse}(\hat\pi_{ij}^B) \right),$$

where

$$\mathrm{mse}(\hat\pi_{ij}^B) = d \left( \sum_{k \in S_{ij}} a_{ijk}^2 (1 - \hat B_{ijk})\, \hat\mu_{ik}^2 + \sum_{k \notin S_{ij}} a_{ijk}^2\, \hat\mu_{ik}^2 \right)$$

and

$$\mathrm{mse}_{(-u)}(\hat\pi_{ij}^B) = d \left( \sum_{k \in S_{ij}} a_{ijk}^2 (1 - \hat B_{ijk(-u)})\, \hat\mu_{ik(-u)}^2 + \sum_{k \notin S_{ij}} a_{ijk}^2\, \hat\mu_{ik(-u)}^2 \right),$$

with $\hat B_{ijk(-u)} = d\,\hat\mu_{ik(-u)}^2 / \left( d\,\hat\mu_{ik(-u)}^2 + c_{ijk} \left( \hat\mu_{ik(-u)} - (d+1)\,\hat\mu_{ik(-u)}^2 \right) \right)$.
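A small numerical sketch of this shrinkage composite may help. The shrinkage formula below is reconstructed from a badly garbled derivation in this scan, so treat its exact form as an assumption; all group shares, prevalences, and effective-weight ratios are hypothetical.

```python
def shrinkage_factor(mu, c, d=1/3):
    """Shrinkage weight B = d*mu^2 / (d*mu^2 + c*(mu - (d + 1)*mu^2)),
    where c = sum(w^2) / (sum(w))^2 for the cell. Reconstructed from a
    garbled derivation; the exact form is an assumption."""
    return d * mu ** 2 / (d * mu ** 2 + c * (mu - (d + 1) * mu ** 2))

def eb_composite(shares, in_sample, pi_direct, mu, c, d=1/3):
    """Empirical Bayes composite county estimate: observed cells get a
    shrinkage mix of the direct cell estimate and the regional group
    mean; unobserved cells fall back to the regional group mean."""
    est = 0.0
    for k, a in enumerate(shares):
        if in_sample[k]:
            B = shrinkage_factor(mu[k], c[k], d)
            est += a * (B * pi_direct[k] + (1.0 - B) * mu[k])
        else:
            est += a * mu[k]
    return est

# Hypothetical county with two groups, only the first observed.
est = eb_composite(shares=[0.6, 0.4], in_sample=[True, False],
                   pi_direct=[0.05, 0.0], mu=[0.02, 0.03], c=[0.2, 0.2])
```

As the cell sample grows, $c$ shrinks toward zero and the shrinkage weight approaches one, so the composite reduces to the direct estimate; tiny cells are pulled toward the regional mean instead.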
Mean Square Errors

The mean squared error (MSE) of the Bayes estimator $\hat\pi_{ij}^B$ is defined as $\mathrm{MSE}(\hat\pi_{ij}^B) = E(\hat\pi_{ij}^B - \pi_{ij})^2$, where the (unconditional) expectation is taken with respect to the model. It can be checked that

$$\mathrm{MSE}(\hat\pi_{ij}^B) = \mathrm{Var}(\pi_{ij}) + \mathrm{Var}(\hat\pi_{ij}^B) - 2\,\mathrm{Cov}(\pi_{ij}, \hat\pi_{ij}^B) = \mathrm{Var}(\pi_{ij}) - \mathrm{Var}(\hat\pi_{ij}^B) = d \left( \sum_{k \in S_{ij}} a_{ijk}^2 (1 - B_{ijk})\, \mu_{ik}^2 + \sum_{k \notin S_{ij}} a_{ijk}^2\, \mu_{ik}^2 \right).$$

It is customary to take $\mathrm{MSE}(\hat\pi_{ij}^B)$ as the MSE of the empirical Bayes estimator $\hat\pi_{ij}^{EB}$. However, $\mathrm{MSE}(\hat\pi_{ij}^B)$ will underestimate the MSE of $\hat\pi_{ij}^{EB}$ since it does not incorporate the uncertainty due to the estimation of the parameter $\mu_{ik}$. See Prasad and Rao (1990) and Lahiri and Rao (1995) in this context. Using a standard Bayesian argument, it can be shown that

$$\mathrm{MSE}(\hat\pi_{ij}^{EB}) = \mathrm{MSE}(\hat\pi_{ij}^B) + E(\hat\pi_{ij}^{EB} - \hat\pi_{ij}^B)^2.$$

See Jiang, Lahiri, and Wan (1998) for comment on these estimators. The second term $E(\hat\pi_{ij}^{EB} - \hat\pi_{ij}^B)^2$ can be estimated with the following jackknife estimator:

$$\hat E_J(\hat\pi_{ij}^{EB} - \hat\pi_{ij}^B)^2 = \frac{J_i - 1}{J_i} \sum_{u=1}^{J_i} \left( \hat\pi_{ij(-u)}^{EB} - \hat\pi_{ij}^{EB} \right)^2,$$

where

$$\hat\pi_{ij(-u)}^{EB} = \sum_{k \in S_{ij}} a_{ijk} \left( \hat B_{ijk(-u)}\, \hat\pi_{ijk}^D + (1 - \hat B_{ijk(-u)})\, \hat\mu_{ik(-u)} \right) + \sum_{k \notin S_{ij}} a_{ijk}\, \hat\mu_{ik(-u)}.$$

Thus $\mathrm{MSE}(\hat\pi_{ij}^{EB})$ is estimated by

$$\mathrm{mse}(\hat\pi_{ij}^{EB}) = \mathrm{mse}_J(\hat\pi_{ij}^B) + \hat E_J(\hat\pi_{ij}^{EB} - \hat\pi_{ij}^B)^2.$$

Jackknife methods are reviewed in the recent text by Shao and Tu (1995).

6. AN EXAMPLE

In this study, the primary objective is to provide information about treatment need. Anyone who meets the criteria for lifetime dependence or abuse, as defined by the National Technical Center's DSM-III-R criteria, is considered a member of the group of respondents who may have needed treatment during the last year. Several indicator variables were created in the dataset to identify respondents with a diagnosis for substance dependence or abuse for alcohol or drugs. For the purpose of numerical calculations, these indicator variables with 0 and 1 as possible values were treated as response variables ($y_{ijkl}$). In order to save space, results are presented for the outcome variable Alcohol Dependence only. Results on other response variables can be obtained from the authors. In order to preserve confidentiality, results for only 40 counties, identified as counties 1 through 40, are reported.

Table 1 contains five different estimates of prevalence for alcohol dependence. In general, the direct estimates are highly variable and are often zero. The first synthetic estimator (S1) is the most stable, producing no zero estimates and estimates with little variability. The second synthetic estimator (S2) is similar to S1, but not as restrictive. The first synthetic estimates are very homogeneous, while the second synthetic estimates are homogeneous within the four planning areas. The estimates produced by the composite estimator are more variable than the other estimates. The empirical Bayes estimator produces estimates very similar to those of S2.

Table 1
Five Estimators of Alcohol Dependence Prevalence Expressed as Percents for Forty Counties. Estimated Standard Errors for Direct (Est. s.e.) and Square Root of Estimated Mean Square Error for Empirical Bayes (sqrt(Est. mse)) Estimates in Parentheses, Also as Percents
[The county-by-county entries for counties 1 through 40 (direct estimate with Est. s.e., sample size, number of groups observed in county, synthetic 1, synthetic 2, composite, and empirical Bayes with sqrt(Est. mse)) are garbled in this scan and are not reproduced here.]

In the model leading
to the empirical Bayes estimator, $d$ was chosen to be one third. Table 1 also displays the estimated standard errors of the direct estimates and the square root of the estimated mean squared errors (see section 5) of the empirical Bayes estimates. The standard errors of the direct estimates, which are calculated as $\{\hat\pi_{ij}^D (1 - \hat\pi_{ij}^D)/n_{ij}\}^{1/2}$, are often (incorrectly) estimated to be zero and are quite variable. The square roots of the estimated MSE of the empirical Bayes estimates are relatively stable and always below .025. Table 2 summarizes the alcohol dependence estimates in the previous table for all counties in the state. The means of the synthetic and composite estimates are higher than the mean of the direct estimates, because there are fewer zero estimates and the means in the summary tables are unweighted.

Table 2
Summary of Five Estimators of Alcohol Dependence Prevalence for All Counties. Results Expressed as Percents.

Estimator        minimum  1st quartile  median  3rd quartile  maximum  mean  standard deviation
Direct               0.0           0.0     2.2           4.3     10.5   2.8                 3.2
Synthetic 1          3.0           3.3     3.4           3.5      4.3   3.5                 0.2
Synthetic 2          1.5           1.8     3.0           4.4      8.0   3.2                 1.7
Composite            0.0           0.4     1.7           4.6     17.5   3.7                 4.8
Empirical Bayes      1.5           1.8     2.8           4.2      8.8   3.2                 4.8

7. CONCLUSION

We have proposed simple empirical Bayes estimators to estimate county level prevalences. Empirical Bayes estimators are found to be very effective when sample sizes for the counties are small and when prevalences are extremely small. We have introduced a measure of uncertainty of the proposed empirical Bayes estimator based on the jackknife method. The proposed measure incorporates additional sources of variability due to the estimation of various model parameters. In the model presented in this paper, we have implicitly assumed that the selection probabilities are unrelated to $y_{ijkl}$. In the household study reported in this paper, the selection probabilities were unequal and depended on several factors, like the number of telephone lines and the number of adult members in the household. None of these variables were related to $y_{ijkl}$. The sample allocation to different regions, however, was done based on the number of "treatment admissions" in each region. Hence, the selection probabilities might be indirectly related to $y_{ijkl}$. In this paper, we have not addressed the issue of sample selection bias, which can be handled appropriately by following procedures discussed in Pfeffermann (1993). In this paper, we have not considered the use of auxiliary variables in the model to relate small areas to one another and to facilitate improved estimation. The use of available auxiliary data from the U.S. Census and other administrative records may be a sensible use of resources that can be used to improve planning for treatment of drug and alcohol abuse and dependence. We plan to do further work in this area with an actual example in a future paper.
ACKNOWLEDGEMENTS

We wish to thank an anonymous referee who had many useful suggestions on improving our paper and the Gallup Organization for partial support. Additionally, Partha Lahiri wishes to acknowledge partial support from U.S. National Science Foundation Grant SBR-9705574.

REFERENCES

CASADY, R.J., and LEPKOWSKI, J.M. (1993). Stratified telephone survey designs. Survey Methodology, 19, 103-113.

EFRON, B., and MORRIS, C. (1973). Stein's estimation rule and its competitors - an empirical Bayes approach. Journal of the American Statistical Association, 68, 117-130.

FARRELL, P.J., MacGIBBON, B., and TOMBERLIN, T.J. (1997). Empirical Bayes estimators of small area proportions in multistage designs. Statistica Sinica, 7, 1065-1083.

FAY, R., and HERRIOT, R. (1979). Estimates of income for small places: an application of James-Stein procedures to census data. Journal of the American Statistical Association, 74, 269-277.

GHOSH, M., and LAHIRI, P. (1987). Robust empirical Bayes estimation of means from stratified samples. Journal of the American Statistical Association, 82, 1153-1162.

GHOSH, M., and RAO, J.N.K. (1994). Small area estimation: an appraisal. Statistical Science, 9, 55-93.

JIANG, J., LAHIRI, P., and WAN, S. (1998). Jackknifing Mean Squared Error of Empirical Best Predictor. Unpublished manuscript.

LAHIRI, P., and RAO, J.N.K. (1995). Robust estimation of mean squared error of small area estimators. Journal of the American Statistical Association, 90, 758-766.

MALEC, D., SEDRANSK, J., MORIARITY, C.L., and LECLERE, F.B. (1997). Small area inference for binary variables in the National Health Interview Survey. Journal of the American Statistical Association, 92, 815-826.

PFEFFERMANN, D. (1993). The role of sampling weights when modeling survey data. International Statistical Review, 61, 317-337.

PRASAD, N.G.N., and RAO, J.N.K. (1990). The estimation of mean squared errors of small area estimators. Journal of the American Statistical Association, 85, 163-171.
SHAO, J., and TU, D. (1995). The Jackknife and the Bootstrap. New York: Springer-Verlag.

Survey Methodology, June 1999 Vol. 25, No. 1, pp. 87-98, Statistics Canada

Some Issues in the Estimation of Income Dynamics

SUSANA RUBIN BLEUER and MILORAD KOVACEVIC

ABSTRACT

Two design-based estimators of gross flows and transition rates are considered. One makes use of the cross-sectional samples for the estimation of the income class boundaries at each time period and the longitudinal sample for the estimation of counts of units in the longitudinal population (longitudinal counts); this is the mixed estimator. The other is based entirely on the longitudinal sample, both for the estimation of the class boundaries and for the longitudinal counts; this is the longitudinal estimator. We compare the two estimators in the presence of large attrition rates by means of a simulation. We find that under a less than perfect model of compensation for attrition, the mixed estimator is usually more sensitive to model bias than the longitudinal estimator. Furthermore, we find that for the mixed estimator, the magnitude of this bias overshadows the small gain in precision when compared to the longitudinal estimator. The results are illustrated with data from the Survey of Labour and Income Dynamics and the Longitudinal Administrative Database of Statistics Canada.

KEY WORDS: Attrition; Gross flows; Transition rates; Longitudinal weighting; Cross-sectional weighting; Bootstrap variance estimator.

1. INTRODUCTION

Gross flows are counts of transitions from one time point to the other between a number of states for individuals in a population. Related parameters are longitudinal proportions and transition rates. Longitudinal proportions are relative gross flows, while transition rates are relative gross flows conditional on the initial transition state.
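These definitions can be sketched directly. A minimal illustration of weighted gross flows and transition rates for two income states, with hypothetical longitudinal weights (estimation of the income class boundaries, discussed below, is ignored here):

```python
def gross_flows(weights, state0, state1, n_states=2):
    """Weighted gross-flow matrix N[r][s]: estimated count of
    longitudinal units in state r at time 0 and state s at time 1."""
    flows = [[0.0] * n_states for _ in range(n_states)]
    for w, r, s in zip(weights, state0, state1):
        flows[r][s] += w
    return flows

def transition_rates(flows):
    """Transition rates: each gross flow divided by its row total,
    i.e., conditioned on the initial state."""
    rates = []
    for row in flows:
        total = sum(row)
        rates.append([f / total if total else 0.0 for f in row])
    return rates

# Hypothetical longitudinal sample: state 0 = low income, 1 = not low.
w = [100.0, 50.0, 150.0, 200.0, 100.0]
s0 = [0, 0, 1, 1, 0]
s1 = [0, 1, 1, 0, 0]
F = gross_flows(w, s0, s1)
T = transition_rates(F)
```

Longitudinal proportions would instead divide each flow by the grand total of the matrix rather than by its row total.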
Estimates of these parameters for transitions between different income classes are required in studies of income dynamics and can be obtained from longitudinal surveys. The boundaries of the income classes often have to be estimated from the survey as well. An example is the low income measure defined as half of the median income, where income is adjusted for family size. Thus, in this case, estimators of counts of transitions to and from the "low income state" require the estimation of the income medians at the time period of interest. The income class boundaries usually refer to the respective cross-sectional populations and have to be estimated from the cross-sectional samples to obtain unbiased estimators. If the change in population from one wave to the other (that is, the number of "births" and "deaths") is negligible, a longitudinal sample may represent the respective populations at both time points, and we may estimate the income class boundaries from the longitudinal sample. Otherwise, estimation of income class boundaries from the longitudinal sample may yield biased estimates. By "deaths" we mean real deaths and/or emigration; similarly, "births" means real births and/or immigration. Two design-based approaches are considered for estimation of longitudinal parameters involving two waves. One approach is based on the cross-sectional samples for the estimation of the class boundaries at each time period and on the longitudinal sample for the estimation of counts of units in the longitudinal population (longitudinal counts). This results in an estimator that we term the mixed estimator. The other approach uses an estimator based on the longitudinal sample for both the class boundaries and the longitudinal counts, and we call it the longitudinal estimator. The main objective of this study is to compare the two approaches in terms of their performance under different attrition adjustment models.
In order to make the comparison we address two related issues: the impact of attrition on the considered estimators and the estimation of their variance. Attrition refers to the type of non-response that occurs from a certain wave on, until the end of the period of observation. The real issue with attrition is that non-respondents accumulate over time, and the longer the study lasts, the greater is the non-response. In some surveys, like SIPP (the Survey of Income and Program Participation), attrition reached 20% by the time of the third wave (Rizzo, Kalton, and Brick 1996). Even if extra care is taken in the development of adjustments to compensate for the missing data, the resulting estimators may still be sensitive to a less-than-perfect model of compensation. We investigate empirically the sensitivity to attrition of the estimators considered. Variance estimation is also an issue because the parameters of interest are non-linear functions of the observations and are dependent on the income class boundaries. The problem of variance estimation of low income proportions and other measures of income inequality from complex cross-sectional samples was studied by Shao and Rao (1993), Binder and Kovacevic (1995) and Kovacevic and Yung (1997), among others. In the longitudinal situation, changes in the population over time imply the need to combine different samples and different systems of weights, which complicates variance estimation.

Susana Rubin Bleuer, Business Survey Methods Division, Statistics Canada, Ottawa, Ontario, K1A 0T6, e-mail: rubisus@statcan.ca; Milorad S. Kovacevic, Social Survey Methods Division, Statistics Canada, Ottawa, Ontario, K1A 0T6, e-mail: kovamil@statcan.ca.

88 Rubin Bleuer and Kovacevic: Some Issues in the Estimation of Income Dynamics

The ultimate units
In this study, we develop an appropriate bootstrap variance estimator for estimators of income dynamics and for the complex design used in the example. The data used for illustration come from Statistics Canada's Survey of Labour and Income Dynamics (SLID) and Longitudinal Administrative Database (LAD); both are sources with quite accurate longitudinal income data obtained from income tax returns. At the time of the study, SLID had response data from only two waves and the attrition rate was about 10%. Section 2 outlines the general assumptions for the population under study and for the design. In section 3 we deal with the issue of estimation of the longitudinal low income proportion and the impact of attrition on it, by means of a small simulation study performed on an artificial data set created assuming the log-normal distribution. Section 4 deals with bootstrap variance estimation for longitudinal complex surveys. A more extensive simulation of various attrition models, using different adjustment methods and data from a complex design, is described in section 5, and the results are presented and discussed in section 6. The parameters considered in this paper refer to the longitudinal population U_L, and therefore units that "die" from one wave to another are out of scope (Tambay, Schiopu-Kratina, Mayda, Stukel and Nation 1997). Large scale surveys often employ stratified multistage designs with a large number of strata and relatively few clusters or primary sampling units (PSU's) sampled within each stratum. The selected PSU's are subsampled in one or more stages until the ultimate units are obtained. Here we assume that the number of strata and clusters within strata does not change from one wave to the other.
2. LONGITUDINAL POPULATION, SAMPLE, AND WEIGHTING

Let U_0 represent the population at time 0 and U_1 represent the population at time 1. In this study we only consider parameters involving two periods of time, and therefore the longitudinal population is defined in terms of two waves by U_L, where U_L = U_0 ∩ U_1. "Deaths" and "births" from one time period to the next cause a change in the population. If we denote by U_d the set of individuals who belong to the population U_0 at time t = 0 and do not belong to the population U_1 at t = 1 due to "death", and by U_b the set of "births" from time 0 to 1, then the longitudinal population can be expressed as

$$U_L = U_0 \setminus U_d = U_1 \setminus U_b.$$

Similarly, we denote by s_0 a representative sample of U_0, by s_1 a representative sample of U_1, and by s_d and s_b the respective subsamples of individuals in s_0 who "died" between t = 0 and t = 1, and of individuals in s_1 "born" between t = 0 and t = 1. Hence, the longitudinal sample, representing U_L, is defined by

$$s_L = s_0 \cap s_1 = s_0 \setminus s_d = s_1 \setminus s_b.$$

Non-respondents to the initial wave at t = 0 exist, but they are relatively few compared to non-respondents in later waves. For the sake of simplicity, assume that s_0 is the sample without the initial non-response and with the associated weights already adjusted for it. Attrition from wave 0 to wave 1 will be represented by a subset of individuals in s_0 denoted by s_a. Hence, the longitudinal sample affected by attrition can be expressed by

$$s_A = s_0 \setminus (s_d \cup s_a).$$

Note that for some parameters of interest s_a should remain in the longitudinal sample for weighting purposes.

We assume that the cross-sectional samples s_t consist of n_h PSU's sampled with replacement within stratum h, and m_hi units sampled within the i-th PSU in stratum h, for t = 0, 1, h = 1, ..., H and i = 1, ..., n_h. Let {w^t_hij}, j = 1, 2, ..., m_hi, be the set of survey weights corresponding to the cross-sectional sample s_t. We assume that the survey weights provide approximately unbiased estimators of population totals, so that

$$E_p\Bigl(\sum_{s_t} w^t_{hij}\Bigr) \approx N^t,$$

where N^t is the size of U_t for t = 0, 1, and E_p is the expectation with respect to the design p(s). When the set s_a of attritors is large, the original weights have to be adjusted to account for the missing units, and the adjusted weights $\tilde w_{hij}$ should add up, on average, to the size N_L of the longitudinal population:

$$E_p E_m\Bigl(\sum_{s_A} \tilde w_{hij}\Bigr) \approx N_L.$$

Here the expectation E_m is taken with respect to the model m assumed for the probability of response.

Examples:

1. In the Survey of Labour and Income Dynamics (SLID), every wave has an added component that consists of "cohabitants", i.e., individuals who live in the households of the longitudinal individuals (Lavallee and Hunter 1992). SLID has a stratified two-stage design with approximately H = 400 strata at each wave. The number of clusters within stratum h may change if there is growth in it. The number of sampled clusters is usually 2 or 3. When a new panel is selected or an old one is replaced from time t = 0 to time t = 1, then the number of sample clusters per stratum may vary.

2. The Longitudinal Administrative Database (LAD) of Statistics Canada is a longitudinal sample obtained from administrative data files and is a representative sample of the income-tax-filing population at any year. The LAD is a collection of many panels, since a panel is "born" at each wave (year). Here non-response is approximately 5% of the cross-sectional sample every year. Longitudinal administrative samples, like LAD, do not have attrition, but are subject to wave non-response caused usually by late filing (Rubin Bleuer 1996). The design for LAD is non-stratified and single stage. We use LAD as a base for a simulation of attrition because its data are representative of the Canadian income population at every wave.
3. ESTIMATION UNDER ATTRITION

Without loss of generality (wlog), in the following we define and explain the two estimation methods in terms of longitudinal low income proportions. In section 6, results on the impact of attrition are given for other two-wave parameters like gross flows and transition rates. We also assume a negligible amount of "births" and "deaths" in the finite population, relative to the attrition rate. In fact, we assume that U_0 = U_1 and that, though the units are the same at both time points, the incomes attached to the units can vary.

Let y^t_hij be the value of the characteristic of interest (family income adjusted for family size) for the j-th ultimate unit in the i-th PSU of stratum h, j = 1, ..., M_hi, i = 1, ..., N_h, h = 1, ..., H, and t = 0, 1. Then the longitudinal proportion of individuals with income less than or equal to x at t = 0 and income less than or equal to y at t = 1 is given by

$$F(x, y) = \frac{1}{N_L} \sum_{h=1}^{H} \sum_{i=1}^{N_h} \sum_{j=1}^{M_{hi}} I(y^0_{hij} \le x)\, I(y^1_{hij} \le y), \qquad (3.1)$$

where

$$N_L = \sum_{h=1}^{H} \sum_{i=1}^{N_h} M_{hi}$$

coincides with the size of the original population U_0, and I is the indicator function of the incomes smaller than or equal to x and y respectively. F(x, y) is the bivariate distribution function of incomes at times 0 and 1.

Let us now denote by M_0/2 half the median income at time t = 0, and by M_1/2 half the median income at t = 1. Then the longitudinal low income proportion is defined by

$$\theta = F(M_0/2, M_1/2). \qquad (3.2)$$

Under complete response and U_0 = U_1, θ is the bivariate version of the cross-sectional low income proportion, which was studied, among others, by Shao and Rao (1993). Under a framework for the development of asymptotic theory in the design space, and under certain regularity conditions on the design and the income distributions, Shao and Rao proved that the estimator of the cross-sectional low income proportion F_t(M_t/2) is consistent (as the number of PSU's, N_PSU, approaches infinity) for general stratified multistage designs where the PSU's are selected with replacement. The framework assumes: (i) the existence of a sequence of finite populations with either an increasing number N_PSU of PSU's, or an increasing number of independent units if the population is not clustered, and (ii) the existence of a corresponding sequence of probability designs with the first stage sample size n_PSU increasing to infinity as N_PSU → ∞. This result is easily extended to the longitudinal situation for the estimator

$$\tilde\theta = \tilde F(\tilde M_0/2, \tilde M_1/2) \qquad (3.3)$$

under the assumptions of no change in the population from t = 0 to t = 1 and of no attrition.

Let $\hat M_t$ denote the estimator of the median income at time t based on the cross-sectional sample s_t and corresponding cross-sectional weights {w^t_hij}, for t = 0, 1:

$$\hat M_t = \inf\{y^t_{hij} \in s_t : \hat F_t(y^t_{hij}) \ge 1/2\},$$

where $\hat F_t(y) = \sum_{s_t} w^t_{hij}\, I(y^t_{hij} \le y) / \sum_{s_t} w^t_{hij}$, and let $\tilde M_t$ denote the estimator of the median income at time t based on the longitudinal sample s_L and the longitudinal weights {$\tilde w_{hij}$}:

$$\tilde M_t = \inf\{y^t_{hij} \in s_L : \tilde F_t(y^t_{hij}) \ge 1/2\},$$

where $\tilde F_t(y) = \sum_{s_L} \tilde w_{hij}\, I(y^t_{hij} \le y) / \sum_{s_L} \tilde w_{hij}$. Since the two populations coincide, the population medians targeted by the two sets of estimators are the same. Then there are two possible ways to estimate the longitudinal parameter (3.2):

$$\hat\theta_{mixed} = \sum_{s_L} \tilde w_{hij}\, I(y^0_{hij} \le \hat M_0/2)\, I(y^1_{hij} \le \tilde M_1/2) \Big/ \sum_{s_L} \tilde w_{hij} \qquad (3.4)$$

and

$$\hat\theta_{long} = \sum_{s_L} \tilde w_{hij}\, I(y^0_{hij} \le \tilde M_0/2)\, I(y^1_{hij} \le \tilde M_1/2) \Big/ \sum_{s_L} \tilde w_{hij}. \qquad (3.5)$$

The first estimator is termed "mixed" because it combines longitudinal and cross-sectional samples. The second is based only on the longitudinal sample. Note that when there are no "births" or "deaths" from one wave to the next, the median at t = 1 can only be estimated from the longitudinal sample, and thus we use $\tilde M_1$ in the definition of the mixed estimator.

Under attrition, most of the missing data may correspond to individuals who are different from the rest of the population, and failure to account for this may result in biased estimates. Hence, weights are adjusted to compensate for the missing information according to a model. The estimates will become more sensitive to model misspecification as attrition increases. Thus, estimators that are robust to the choice of the model for non-response adjustments are desirable.
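As a concrete illustration of the weighted-median and transition-proportion calculations in (3.4) and (3.5), the following sketch computes the two estimators on toy data. The sample sizes, weights and income-generating mechanism are invented for the example; the designs used in the paper are far more complex.

```python
import numpy as np

def weighted_median(y, w):
    # smallest y whose weighted ECDF reaches 1/2 (the inf definition used above)
    order = np.argsort(y)
    y, w = y[order], w[order]
    cdf = np.cumsum(w) / w.sum()
    return y[np.searchsorted(cdf, 0.5)]

def transition_proportion(y0, y1, w, b0, b1):
    # weighted proportion of units below b0 at wave 0 and below b1 at wave 1
    flag = (y0 <= b0) & (y1 <= b1)
    return np.sum(w * flag) / np.sum(w)

rng = np.random.default_rng(1)
n_cross, n_long = 1000, 800                       # invented sample sizes
y0_cross = rng.lognormal(10.3, 0.8, n_cross)      # wave-0 cross-sectional sample
w_cross = np.full(n_cross, 50.0)                  # invented survey weights
y0_long = y0_cross[:n_long]                       # units also observed at wave 1
w_long = np.full(n_long, 62.5)                    # invented longitudinal weights
y1_long = y0_long * rng.lognormal(0.0, 0.3, n_long)

m0_hat = weighted_median(y0_cross, w_cross)   # cross-sectional median, t = 0
m0_til = weighted_median(y0_long, w_long)     # longitudinal median, t = 0
m1_til = weighted_median(y1_long, w_long)     # t = 1: longitudinal sample only

theta_mixed = transition_proportion(y0_long, y1_long, w_long, m0_hat / 2, m1_til / 2)
theta_long = transition_proportion(y0_long, y1_long, w_long, m0_til / 2, m1_til / 2)
print(theta_mixed, theta_long)
```

The only difference between the two calls is the wave-0 boundary: the mixed estimator takes it from the full cross-sectional sample, the longitudinal one from the common units.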
In order to compare estimators (3.4) and (3.5) regarding their robustness against incorrect non-response adjustments, we made a simple simulation study to empirically estimate the expected (with respect to the design and the attrition model) values of $\hat\theta_{mixed}$ and $\hat\theta_{long}$ when the adjustment model was both the correct one and an incorrect one. As we already pointed out, if there is no change in the population from one wave to the next, and thus no new sample is selected in the second wave to represent the change, the estimation of the median in the second wave can only be based on the longitudinal sample. Thus the only difference between the two estimators lies in the estimation of the low income measure in the first wave. Hence, wlog, we consider for our simulation the parameter θ = F(M_0/2, ∞). In that case, θ coincides with the cross-sectional low income proportion, and the estimator of θ under complete response, $\hat\theta_{cross} = \hat F_0(\hat M_0/2)$, is consistent (and thus asymptotically unbiased) as N_PSU tends to infinity.

The simulation study, described in detail in Appendix A, consisted in simulating 1,000 samples of size 1,000 from a log-normal income population similar to the Canadian income population. We first selected a simple random sample without replacement (SRSWOR) from a large finite population of incomes and then simulated attrition in that sample. Here we call a model of attrition missing at random (MAR) if the probability of non-response in the second wave is constant within response classes, and missing completely at random (MCAR) if the probability of non-response in the second wave is constant in the whole population. The attrition was simulated following a missing at random model where the non-response was induced in a low income class. The boundary of the low income class was the first quintile of the finite population, known a priori.
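The simulation just described (detailed in Appendix A) can be sketched as follows. This is a scaled-down illustration, not the authors' code: 200 replicates instead of 1,000, and since the incorrect MCAR adjustment gives every remaining unit the same weight under SRS, unweighted proportions over the attrited sample stand in for the MCAR-adjusted estimates.

```python
import numpy as np

rng = np.random.default_rng(42)
# Appendix A parameters: X ~ exp{N(10.3, 0.64)}, a lognormal income population
mu, sigma = 10.3, np.sqrt(0.64)
pop = rng.lognormal(mu, sigma, 50_000)            # finite "population"
q1 = np.quantile(pop, 0.20)                       # known low-income boundary (first quintile)
theta = np.mean(pop <= np.median(pop) / 2)        # target: low income proportion

est_mixed, est_long = [], []
for _ in range(200):                              # 1,000 replicates in the paper
    s0 = rng.choice(pop, 1000, replace=False)     # wave-0 SRSWOR sample
    low = s0 <= q1
    drop = low & (rng.random(1000) < 0.5)         # MAR attrition: 50% of low income
    sL = s0[~drop]                                # attrited longitudinal sample
    m_hat = np.median(s0)                         # median from full wave-0 sample
    m_til = np.median(sL)                         # median from attrited sample (biased up)
    est_mixed.append(np.mean(sL <= m_hat / 2))    # mixed-style estimate, MCAR weights
    est_long.append(np.mean(sL <= m_til / 2))     # longitudinal-style estimate

print(theta, np.mean(est_mixed), np.mean(est_long))
```

Because attrition removes only low incomes, the attrited-sample median is pulled upward, and the two estimates diverge in the direction the paper reports: both fall below the target, with the mixed-style estimate further away.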
For every sample, we calculated $\hat\theta_{mixed}$, $\hat\theta_{long}$ and $\hat\theta_{cross}$ with adjustments under both the correct (MAR) and a MCAR attrition model. The arithmetic mean of the estimates approximates the double expectation (with respect to the model and the design) of the first two estimators, and approximates the expectation of $\hat\theta_{cross}$ with respect to the design. This last expectation approximates, in turn, the parameter θ, since $\hat\theta_{cross}$ is asymptotically unbiased as n_PSU → ∞. When the weight adjustments are calculated under the MCAR (incorrect) attrition model, the following relationship is empirically found:

$$|E_p E_m(\hat\theta_{mixed}) - \theta| > |E_p E_m(\hat\theta_{long}) - \theta|,$$

where m refers to the simulated attrition model and p refers to the design under SRSWOR. Note that attrition of low income individuals will always bias upwards the estimator of the median. The somewhat surprising result is that the estimator which utilises less information is, on average, nearer the true parameter, meaning that more information, if it is not used well, does not improve the estimator. Similarly, when θ_c is the proportion of incomes higher than an income category boundary estimated from the sample, and attrition is heavier in the lower income categories, we will always have the inequality $\hat\theta_{long,c} < \hat\theta_{mixed,c}$; and, as with the low income proportion θ, the estimator of θ_c using less information is, on average, nearer the truth. The description of the simulation and the numerical results are in Appendix A.

The question now is whether the bias caused by model misspecification is larger than the increase in variance caused by the attrition. In sections 5 and 6 we tackle this issue by simulating attrition on data from SLID and LAD, calculating $\tilde\theta$, given by (3.3), $\hat\theta_{mixed}$ and $\hat\theta_{long}$, and calculating the design variance of the estimators as well.

4. BOOTSTRAP VARIANCE ESTIMATION FOR LONGITUDINAL SAMPLES

In order to compare the two approaches to estimation we need to study them in terms of variance and bias under different attrition situations. The estimators $\hat\theta_{mixed}$ and $\hat\theta_{long}$ defined in section 3 are nonlinear functions of the observations; in addition, the income data come from complex surveys. The variances of these estimators cannot be expressed in simple terms, and we have to rely on approximate variance estimation techniques. We seek a method that is easy to apply to many different complex parameters and under different designs. We would like to evaluate the two estimation approaches for any parameter, using the same criteria and a consistent method of variance estimation. We concentrate on developing a bootstrap variance estimator that can be applied to a stratified multistage longitudinal design. It is important to emphasize that only the primary sampling units are resampled, not the units within them.

Kovacevic and Yung (1997) compared several resampling methods and the Taylor linearization method for variance estimation of cross-sectional estimators of income inequality under a complex survey design. They found, by means of a simulation study, that the best method (in terms of relative bias, coverage properties, stability, robustness against assumptions, etc.) is the Taylor linearization method via the estimating equation approach, and that the next best is the bootstrap method.

In the calculation of the number of individuals in one income class at time 0 and another income class at time 1, the units in the longitudinal sample are involved, and the bootstrap sampling scheme must ensure the selection of units in s_L. However, if we were confined only to the resampling of units in s_L, we would not allow enough variability for the consistent estimation of the variance of the cross-sectional quantile estimators $\hat M_0$ and $\hat M_1$. Therefore, the bootstrap sample should contain as well elements from s_0\s_L, s_L and s_1\s_L at each iteration.

We assume a stratified two-stage design, and we assume that the primary sampling units (PSU's) are exactly the same at both t = 0 and t = 1, that is, there are no "deaths" or "births" of PSU's from one wave to the next. This is the case for the first and second waves of SLID. The "births" and cohabitants that appear in the second wave live in dwellings with individuals who were selected in the first wave. Every unit u in s_0 or s_1 is assigned to a PSU that was selected in the first wave:

- if u ∈ s_d, then we assign u to the PSU corresponding to its original dwelling at time t = 0;
- if u ∈ s_0\s_1, then we assign u to the PSU it belonged to at t = 0;
- if u ∈ s_1\s_0 and u lives with v ∈ s_0 ∩ s_1, then we assign u to the PSU of v.

In this way we reduce the problem to a cross-sectional situation. Suppose that the original weights of an individual are w^0_hij, w^1_hij and $\tilde w_{hij}$. Then we perform the following steps:

1. We select a simple random sample with replacement (SRSWR) of PSU's of size n_h − 1, independently in each stratum. The union of such samples is denoted by s_boot. It contains a subsample s_{0,boot} of units from s_0 that are not in s_1, a subsample s_{1,boot} of units from s_1 that are not in s_0, and a subsample s_{L,boot} of units that are in both s_0 and s_1.

2. Let m_hi be the number of times the hi-th PSU is selected; the bootstrap modifications of the weights are

$$w^{t*}_{hij} = \frac{n_h}{n_h - 1}\, m_{hi}\, w^t_{hij}, \quad t = 0, 1, \qquad \tilde w^{*}_{hij} = \frac{n_h}{n_h - 1}\, m_{hi}\, \tilde w_{hij}.$$

The estimate $\hat\theta^*_{mixed}$ computed from the bootstrap sample s_boot is

$$\hat\theta^*_{mixed} = \sum_{s_{L,boot}} \tilde w^*_{hij}\, I(y^0_{hij} \le M^*_0/2)\, I(y^1_{hij} \le M^*_1/2) \Big/ \sum_{s_{L,boot}} \tilde w^*_{hij},$$

where

$$M^*_t = \inf\{y^t_{hij} \in s_{t,boot} \cup s_{L,boot} : F^*_t(y^t_{hij}) \ge 1/2\}$$

and

$$F^*_t(y) = \sum w^{t*}_{hij}\, I(y^t_{hij} \le y) \Big/ \sum w^{t*}_{hij}, \quad t = 0, 1.$$

The estimate $\hat\theta^*_{long}$ is computed in the same way, except that the medians $\tilde M^*_t$ are estimated from the longitudinal sample s_{L,boot}.

3. Repeat steps 1 and 2 a large number of times, say B. A Monte Carlo estimate of the variance is obtained as

$$v_B(\hat\theta_{mixed}) = \frac{1}{B} \sum_{b=1}^{B} (\hat\theta^*_{mixed,b} - \bar\theta^*_{mixed})^2, \qquad v_B(\hat\theta_{long}) = \frac{1}{B} \sum_{b=1}^{B} (\hat\theta^*_{long,b} - \bar\theta^*_{long})^2,$$

where $\bar\theta^*_{mixed} = \frac{1}{B}\sum_b \hat\theta^*_{mixed,b}$ and $\bar\theta^*_{long} = \frac{1}{B}\sum_b \hat\theta^*_{long,b}$.

By resampling the original PSU's, we reduce the problem of variance estimation in a longitudinal survey to that of a cross-sectional framework. This is an extension of the bootstrap variance estimator developed by Rao and Wu (1988), and later Kovacevic and Yung (1997), for variance estimation of cross-sectional income inequality measures from a stratified multistage sample survey.

In order to accommodate attrition, we look at the original data set as a set of longitudinal records. Then attrition can be viewed as item non-response, and accordingly the weight adjustment for attrition can be considered as ratio imputation (Hajek-ratio). The adjusted weights are obtained from the original weights by multiplying by an attrition adjustment factor. The adjustment factors are the inverses of the response probabilities (assumed different for each response class). These probabilities are estimated from the original data set. The process of estimation of the response probabilities and subsequent adjustment is imitated in the bootstrap resampling: for each bootstrap sample s_boot the adjustments are recalculated, to produce new bootstrap weights.

Indeed, recalling that s_L = s_A ∪ s_a, let us denote by $\tilde y^{(1)}_{hij}$ (hij ∈ s_a) the ratio-imputed value of wave one information, based on the observed data in the longitudinal sample, and by $\tilde y^{(0)}_{hij}$ (hij ∈ s_a) the ratio-imputed value of wave zero information, based on the longitudinal sample (s_A). Note that the values y^0_hij (hij ∈ s_a) are not missing, but we need $\tilde y^{(0)}_{hij}$ to represent the weight adjustment in the longitudinal sample.
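The PSU-level resampling and rescaling described in this section can be sketched as follows. This is a toy illustration under simplifying assumptions (a handful of strata, equal weights, the cross-sectional low income proportion as the statistic, and the quantile re-estimated inside every replicate); the stratum/PSU structure is invented for the example.

```python
import numpy as np

def wmedian(y, w):
    # smallest value whose weighted ECDF reaches 1/2
    o = np.argsort(y)
    cdf = np.cumsum(w[o]) / w.sum()
    return y[o][np.searchsorted(cdf, 0.5)]

def low_income_prop(y, w):
    # weighted proportion below half the (re-estimated) median
    return np.sum(w * (y <= wmedian(y, w) / 2)) / np.sum(w)

def rao_wu_bootstrap_var(strata, B=300, seed=0):
    """strata: list of strata; each stratum is a list of (y, w) arrays, one per PSU.
    Resample n_h - 1 PSUs SRSWR per stratum and rescale weights by
    m_hi * n_h / (n_h - 1), in the spirit of Rao and Wu (1988)."""
    rng = np.random.default_rng(seed)
    reps = []
    for _ in range(B):
        ys, ws = [], []
        for stratum in strata:
            n_h = len(stratum)
            # m_hi: number of times each PSU is selected among n_h - 1 draws
            m = np.bincount(rng.integers(0, n_h, n_h - 1), minlength=n_h)
            for m_hi, (y, w) in zip(m, stratum):
                if m_hi:
                    ys.append(y)
                    ws.append(w * m_hi * n_h / (n_h - 1))
        reps.append(low_income_prop(np.concatenate(ys), np.concatenate(ws)))
    reps = np.asarray(reps)
    return np.mean((reps - reps.mean()) ** 2)

# toy design: 4 strata, 3 PSUs each, 50 units per PSU, equal weights
rng = np.random.default_rng(7)
strata = [[(rng.lognormal(10.3, 0.8, 50), np.full(50, 100.0)) for _ in range(3)]
          for _ in range(4)]
v = rao_wu_bootstrap_var(strata, B=300)
print(v)
```

The essential points mirrored here are that only PSUs are resampled, the weights carry the n_h/(n_h − 1) rescaling, and the nonlinear statistic (including its quantile boundary) is recomputed from scratch on every bootstrap replicate.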
The estimation of θ with weight adjustments for attrition is equivalent to estimation on a "complete" imputed data set weighted by $\tilde w_{hij}$ (hij ∈ s_L). The set

$$D_{mixed} = \{(y^0_{hij}, \tilde y^{(1)}_{hij}),\ hij \in s_a;\ (y^0_{hij}, y^1_{hij}),\ hij \in s_A\}$$

is used to calculate $\hat\theta_{mixed}$, and the set

$$D_{long} = \{(\tilde y^{(0)}_{hij}, \tilde y^{(1)}_{hij}),\ hij \in s_a;\ (y^0_{hij}, y^1_{hij}),\ hij \in s_A\}$$

is used to calculate $\hat\theta_{long}$.

We noted above that adjustment due to attrition is equivalent to ratio imputation for item non-response. Hence the variance estimator proposed here has the same properties as the cross-sectional variance estimators for imputed survey data: consistency now follows from the consistency of the bootstrap variance for imputed survey data (Shao and Sitter 1996), and good coverage properties and small relative bias are documented by Kovacevic and Yung (1997). In the case of a small number of PSU's per stratum (SLID has two or three PSU's per stratum), Kovar, Rao and Wu (1988) showed empirically that the bootstrap variance estimate overestimates the true variance by no more than 10%.

5. EMPIRICAL STUDY

In order to compare the two approaches to estimation, we now consider two real longitudinal surveys, SLID and LAD, and simulate different attrition situations. The study begins with a sample of complete respondents in two waves. The response pattern of the individuals in this sample can be presented as

RESPONDENTS                                      t=0   t=1
Responded in both waves                           X     X
Responded in the first wave, not in the second    X     0

Here X means that the response is available for the individual at the corresponding wave, and 0 means the opposite. We omitted "births" and "deaths" from both SLID and LAD. The respondents in the first wave are divided into two response classes, the low income class and the rest. The boundary is given by $\hat M_0/2$, where $\hat M_0$ is the estimate of the median income at time t = 0, based on all respondents in the first wave. The size of the SLID sample (Ontario) is approximately 10,000, of which 2,000 were low income individuals. We simulate three different attrition situations. In all of them we select a subsample of the complete respondents with pattern XX and convert them into the pattern X0:

1) 10% attrition. We select a subsample at random of individuals with low income {y^0_hij ≤ $\hat M_0$/2} at time t = 0 and make them non-respondents. In order to have 10% of the overall sample missing at time t = 1, we convert 50% of low income individuals into non-respondents in the second wave.

2) 20% attrition. We select 70% of individuals from {y^0_hij ≤ $\hat M_0$/2} and 7.5% from {y^0_hij > $\hat M_0$/2} at random and convert them into non-respondents at the second wave. This results in 20% overall attrition for SLID.

3) 30% attrition. We select 80% of individuals from {y^0_hij ≤ $\hat M_0$/2} and 17.5% from {y^0_hij > $\hat M_0$/2} at random and convert them into non-respondents at the second wave. This results in 30% overall attrition for SLID.

We then consider two different adjustment models for each situation:

Model 1: The non-respondents are missing completely at random (MCAR). This is the worst possible model that we might use, given that attrition is usually experienced by the group of low income individuals.

Model 2: The non-respondents are missing at random from the low-income class. We allow for a small increase of the upper boundary for the low-income class, so that the response classes are defined as {y^0_hij ≤ $\hat M_0$/2 + $\hat M_0$/10} and {y^0_hij > $\hat M_0$/2 + $\hat M_0$/10}. We regard this model as one of the best possible models under our setup, since it recognizes the response classes as separated by low income boundaries.

We chose these two models of adjustment because they represent the two extremes. In practice, we may only be able to choose a model between these two.

Let l_t denote the low income measure defined as half the median: l_t = M_t/2, t = 0, 1. Several longitudinal parameters were studied.
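The weighting-class (inverse response probability) adjustment used with Model 2 style response classes can be sketched as follows. The sample size, weights and response rates are invented for the illustration, and `class_adjusted_weights` is a hypothetical helper, not from the paper.

```python
import numpy as np

def class_adjusted_weights(w, responded, low_income):
    """Within each response class, inflate respondents' weights by the inverse
    of the estimated (weighted) response rate; nonrespondents get weight 0."""
    w_adj = np.where(responded, w, 0.0)
    for cls in (low_income, ~low_income):
        rate = np.sum(w[cls & responded]) / np.sum(w[cls])
        w_adj[cls & responded] /= rate
    return w_adj

rng = np.random.default_rng(3)
n = 10_000
y0 = rng.lognormal(10.3, 0.8, n)                  # invented wave-0 incomes
w = np.full(n, 40.0)                              # invented survey weights
low = y0 <= np.median(y0) / 2                     # low income response class
# the 20% scenario above: 70% of low income and 7.5% of the rest drop out
responded = np.where(low, rng.random(n) > 0.70, rng.random(n) > 0.075)
w_adj = class_adjusted_weights(w, responded, low)
print(w.sum(), w_adj.sum())   # the adjusted weights restore the weighted total
```

Adjusting within the correct response classes preserves each class's weighted total; adjusting with a single MCAR factor instead would shift weight from the heavily attrited low income class to the rest, which is exactly the bias mechanism examined in section 6.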
We define some of them in Table 5.1. The values of the estimates are presented in Tables B.1 to B.3. The standard errors were obtained using the bootstrap method described in section 4, assuming that the corresponding adjusted weights are known a priori and do not change for each bootstrap sample.

Table 5.1 Some Longitudinal Parameters Evaluated in the Empirical Study

Proportions
P(y_0 ≤ l_0, y_1 ≤ l_1)    The proportion of individuals with income (adjusted family income) below l_t at t = 0, 1.
P(y_0 ≤ l_0, y_1 > l'_1)   The proportion of individuals with income below l_0 at t = 0 and above l'_1 = (11/10) l_1 at t = 1; the factor 11/10 is used to be able to detect true transitions from one state to the other.
P(y_0 > l'_0, y_1 ≤ l_1)   The proportion of individuals with income above l'_0 = (11/10) l_0 at t = 0 and below l_1 at t = 1.
P(y_0 > l'_0, y_1 > l'_1)  The proportion of individuals with income above l'_0 and l'_1 at times 0 and 1 respectively.

Conditional Rates
P(y_1 ≤ l_1 | y_0 ≤ l_0)   The probability of an individual having low income at the second wave (second year), given that he or she had low income in the first wave.
P(y_1 > l'_1 | y_0 ≤ l_0)  The probability of not having low income at the second wave, given that the individual had low income at the first wave.

6. RESULTS AND DISCUSSION

The empirical study shows that attrition does affect estimates adversely, but different outcomes result depending on whether the parameter of interest is cross-sectional or longitudinal. In the estimation of later-wave cross-sectional parameters, estimators based on the actual longitudinal sample are more biased than estimators based on the cross-sectional sample, whether the model of adjustment is sound or not (see, for example, the estimates of the median at t = 0 in Tables B.1-B.3). However, in the estimation of longitudinal parameters, longitudinal estimators (based entirely on the longitudinal sample) are less biased than mixed estimators (based on the three samples).

Tables 6.1a and 6.1b present gross flows estimated from SLID and LAD data, respectively. The estimates are calculated with the complete data set and after 20% non-response is simulated in the second wave. For the complete data set the longitudinal and the mixed estimators coincide (this is the no-attrition situation). As explained in the previous section, 20% non-response was simulated by eliminating 70% of the responses from individuals who were low income and 7.5% of individuals with income higher than l_0 in the first wave. The adjustment for non-response was done assuming that the individuals were missing completely at random. The applied adjustment model means that the original survey weights were adjusted with a factor of 1.25 (representing 20% attrition) across the sample, whereas the correct adjustment should have been with a factor of 3.33 (representing 70% attrition) in the domain of individuals who were low income in the first wave (1993), and with a factor of 1.08 (representing 7.5% attrition) in the domain of individuals who had an income higher than the LIM in 1993. Thus, by adjusting with an incorrect model, we incur a much greater error in the estimation of one domain (y_0 ≤ M_0/2) than in the estimation of the other (y_0 > M_0/2). We see from these tables that both the mixed and longitudinal estimates seriously underestimate the parameter of interest in the first column and overestimate it in the second column, assuming that estimation based on the complete data set results in reasonable and acceptably good estimates. It is obvious that the mixed estimates are more affected by the wrong adjustment, verifying the inequality stated in section 3.

Table 6.1 Gross Flows Estimated From SLID and LAD, 20% Attrition (70% of the Low Income Missing)
a. SLID, Ontario

                                   1993: y_0 ≤ M_0/2   1993: y_0 > 1.1 M_0/2
1994: y_1 ≤ M_1/2   No attrition        1,602,000             113,000
                    Mixed                 425,000             152,000
                    Longitudinal          710,000             125,700
1994: y_1 > 1.1 M_1/2  No attrition        70,000           8,080,000
                    Mixed                  15,000           8,975,000
                    Longitudinal           30,000           8,870,000

b. LAD, Sub-Area from Toronto

                                   1991: y_0 ≤ M_0/2   1991: y_0 > 1.1 M_0/2
1992: y_1 ≤ M_1/2   No attrition            2,700                 640
                    Mixed                   1,100                 800
                    Longitudinal            1,500                 740
1992: y_1 > 1.1 M_1/2  No attrition           580              10,420
                    Mixed                     190              12,150
                    Longitudinal              380              11,650

Tables B.1 to B.3, given in Appendix B, show the results for SLID at three different attrition levels: 10%, 20% and 30%. For each parameter the estimates and their corresponding bootstrap standard errors were calculated using both the longitudinal and mixed estimators. First, we calculated them for the ideal longitudinal "no attrition" sample, and then for the reduced sample adjusted under the two non-response models described in section 5. We provide the estimate of the model bias as the difference Δ between the estimate obtained under the model and the "no attrition" estimate.

The numbers in Tables B.1 to B.3 repeat the same pattern that was shown for gross flows in Tables 6.1a and 6.1b, i.e., the estimates obtained using the longitudinal estimators are "less sensitive" to the choice of the non-response adjustment model. At the same time, there is almost no difference between the corresponding standard errors (given in parentheses). Overall, we found that the estimates of the standard errors of $\hat\theta_{mixed}$ are slightly smaller than those for $\hat\theta_{long}$, and that this negligible difference in favour of $\hat\theta_{mixed}$ is not enough to compensate for the larger bias affecting $\hat\theta_{mixed}$, induced by the wrong adjustment for non-response.

There is no difference between $\hat\theta_{mixed}$ and $\hat\theta_{long}$ when the second (best) adjustment for non-response is used. This is true for most parameters except for the conditional rates: for example, in Table B.1 (10% attrition) the empirical bias of the mixed estimate of the conditional rate of remaining low income in 1994 was found statistically significant, whereas the empirical biases of the longitudinal estimates of the two conditional rates were found non-significant (see Appendix C). Of course, the "perfect" model for adjustment (not shown here) yields exactly the same numbers from the two estimation methods, and $\hat\theta_{mixed}$ is approximately equal to $\hat\theta_{long}$ for any parameter considered, but their variances differ.

We introduce a single "sensitivity measure" that combines information on the sampling standard error and the model bias caused by the applied attrition adjustment model:

$$S(\hat\theta^A_{mixed}) = \frac{(\hat\theta^A_{mixed} - \hat\theta_0)^2 + \mathrm{s.e.}^2(\hat\theta^A_{mixed})}{\mathrm{s.e.}^2(\hat\theta_0)}. \qquad (6.1)$$

Here, $\hat\theta_0$ and $\hat\theta^A$ (A = MCAR or MAR) denote estimates obtained under "no attrition" and under an attrition adjustment model, respectively, and s.e.(·) stands for the standard error due to sampling. Similarly, we define $S(\hat\theta^A_{long})$. If an attrition adjustment does not change by much the value of an estimator and its standard error (compared with the estimate obtained using another attrition adjustment), we say that the estimator is relatively insensitive to the applied attrition model. The ratios of the sensitivity measures of the two applied adjustment models are defined by

$$\mathrm{ratio}_{mixed} = S(\hat\theta^{MCAR}_{mixed}) / S(\hat\theta^{MAR}_{mixed}) \quad \text{and} \quad \mathrm{ratio}_{long} = S(\hat\theta^{MCAR}_{long}) / S(\hat\theta^{MAR}_{long}). \qquad (6.2)$$

Values of the ratio for different attrition scenarios are presented in Charts B.1-B.3 in Appendix B. From Charts B.1-B.3 it is evident that the ratio of the sensitivity measures (6.2) is systematically lower for the longitudinal estimator. This further means that the sensitivity measures of the longitudinal estimator under the applied adjustment models are more alike than those of the mixed estimator. The longitudinal estimator seems to be more insensitive to the applied adjustment models. We refer to this as "robustness". Regarding the simulated attrition rates, the charts show that as the attrition rate increases, the sensitivity measures ratio approaches 1. This means that both models of adjustment perform alike for higher attrition rates, and then the choice of the estimators becomes more important than the model used in the adjustments.

We summarize our findings as follows:

1) In the estimation of later wave cross-sectional parameters, estimators based on the actual longitudinal sample are more biased than estimators based on the cross-sectional sample.

2) In the estimation of longitudinal parameters, both the mixed and longitudinal estimators are considerably biased if the wrong model of attrition adjustment is used.

3) The longitudinal estimator is more robust against an inappropriate adjustment for attrition than the mixed estimator. Under the perfect adjustment model, these two estimators perform alike.

4) In general, the sampling variance of the mixed estimator is smaller than the variance of the longitudinal estimator. This relationship remains steady over different attrition rates and different adjustment models.

5) For the mixed estimators, the magnitude of the bias coming from inappropriate non-response adjustments overshadows the small gain in precision when compared to the longitudinal estimates.

6) Different models of adjustment perform alike for higher attrition rates. In this case the choice of the estimator is more important than the efforts at model improvement.

ACKNOWLEDGEMENTS

The authors would like to thank the associate editor and the referee for their helpful comments and suggestions.

APPENDIX A

Description of Simulation for Section 3

The simulation consisted of the following steps.

1. Let X be a log-normal random variable, X ~ exp{N(μ = 10.3, σ² = 0.64)}. These parameters correspond to a median similar to that of the SLID estimate for 1992, and a spread similar to that of the Canadian population. The low income boundary was set to the first quintile of the income population. The first quintile q_1 was estimated from a simulated sample of size 50,000.
The value obtained was q̂₁ = 14,901.

2. From this infinite population, 1,000 independent random samples of size 1,000 were selected.

3. To simulate attrition, from each sample, 50% of the units with income below q₁ were selected at random and dropped from the sample for the calculations pertinent to the second wave. Thus 10% attrition from the low income class was simulated (MAR model).

4. For each sample i, the low income proportion estimators θ̂_mixed(i) and θ̂_long(i) were estimated with weights adjusted under both the correct attrition model (MAR) and under the incorrect assumption of units missing completely at random (MCAR).

5. Also, for each sample i, we calculated the cross-sectional estimator under complete response, θ̂_cross(i), which is entirely based on the sample before attrition.

For each type of adjustment, the expectation with respect to design and the attrition model of θ̂_mixed, θ̂_long and θ̂_cross was estimated by their respective arithmetic means over 1,000 samples with the incorrect adjustment, and over 539 samples for the correct adjustment.

Result of the Simulation

The next two tables show that under the incorrect specification, the longitudinal estimator has less bias than the mixed estimator, and under the correct specification for the adjustment, both estimators are unbiased (with respect to the model and the design). The arithmetic mean of θ̂_cross over the 1,000 different values was 0.193, and the standard deviation of the 1,000 values was 0.012. This implies that for a SRSWOR sample of size 1,000, the estimate θ̂_cross is quite stable and its expectation can be used as a surrogate for θ. The expected values of θ̂_mixed and θ̂_long are estimated by 0.109 and 0.145, with standard deviations of 0.011 and 0.013 respectively.

Expected Values Under Mis-Specification of the Adjustment Model

Estimator   Expected value   Standard deviation   Number of samples
mixed           0.109             0.011                1,000
long            0.145             0.013                1,000
cross           0.193             0.012                1,000

Expected Values Under the Correct Adjustment of the Attrition Model

Estimator   Expected value   Standard deviation   Number of samples
mixed           0.193             0.016                  539
long            0.193             0.014                  539
cross           0.194             0.012                  539

We see from the table above that both estimators approximate the true value of the parameter if the adjustment model is correct.

APPENDIX B

Table B.1
Estimates of Different Population Characteristics and Their Standard Errors Obtained From the Complete Data Set and Under 10% Attrition in the Second Wave

Parameter                    Est.   No attrition       MCAR: θ̂ (s.e.), Δ            MAR: θ̂ (s.e.), Δ
                                    θ̂ (s.e.)
Quantiles
M0 (median, t=0)              M     29,300 (1,000)     29,300 (1,000)   0           29,300 (1,100)   0
                              L     29,300 (1,000)     30,900 (1,000)   -1,600      29,300 (1,200)   0
M1 (median, t=1)              M     28,600 (1,100)     30,400 (900)     -1,800      28,600 (1,100)   0
                              L     28,600 (1,000)     30,400 (1,000)   -1,800      28,600 (1,200)   0
Proportions
p(y(0)≤L0, y(1)≤L1)           M     0.156 (0.010)      0.092 (0.008)    0.064       0.124 (0.009)    0.032
                              L     0.156 (0.010)      0.111 (0.009)    0.045       0.124 (0.008)    0.032
p(y(0)≤L0, y(1)>L1)           M     0.007 (0.002)      0.003 (0.001)    0.004       0.005 (0.002)    0.002
                              L     0.007 (0.002)      0.004 (0.001)    0.003       0.005 (0.002)    0.002
p(y(0)>L0, y(1)≤L1)           M     0.011 (0.002)      0.013 (0.003)    -0.002      0.012 (0.002)    -0.001
                              L     0.011 (0.002)      0.011 (0.003)    0.000       0.012 (0.002)    -0.001
p(y(0)>L0, y(1)>L1)           M     0.790 (0.040)      0.840 (0.040)    -0.050      0.804 (0.042)    -0.005
                              L     0.790 (0.040)      0.831 (0.042)    -0.041      0.804 (0.043)    -0.005
Conditional rates
p(y(1)≤L1 | y(0)≤L0)          M     0.923 (0.023)      0.546 (0.025)    0.377       0.734 (0.033)    0.189
                              L     0.923 (0.023)      0.926 (0.030)    -0.003      0.921 (0.031)    0.002
p(y(1)>L1 | y(0)≤L0)          M     0.040 (0.010)      0.018 (0.006)    0.022       0.030 (0.009)    0.010
                              L     0.040 (0.010)      0.036 (0.010)    0.004       0.038 (0.012)    0.002

L1' = (11/10)L1; L1' is used to identify true transitions from one wave to the next. M denotes mixed (θ̂_mixed) and L denotes longitudinal (θ̂_long) estimates. θ̂ is the estimate, s.e. denotes the standard error of the estimate, and Δ is the difference between the corresponding estimates obtained using the attrition adjustment model and assuming no attrition.

96 Rubin Bleuer and Kovacevic: Some Issues in the Estimation of Income Dynamics

Table B.2
Estimates of Different Population Characteristics and Their Standard Errors Obtained From the Complete Data Set and Under 20% Attrition in the Second Wave

Parameter                    Est.   No attrition       MCAR: θ̂ (s.e.), Δ            MAR: θ̂ (s.e.), Δ
                                    θ̂ (s.e.)
Quantiles
M0 (median, t=0)              M     29,300 (1,000)     29,300 (1,000)   0           29,300 (1,000)   0
                              L     29,300 (1,000)     31,800 (1,100)   -2,500      29,300 (1,000)   0
M1 (median, t=1)              M     28,600 (1,100)     31,100 (800)     -2,500      28,600 (1,000)   0
                              L     28,600 (1,000)     31,100 (900)     -2,500      28,600 (1,000)   0
Proportions
p(y(0)≤L0, y(1)≤L1)           M     0.156 (0.010)      0.055 (0.008)    0.101       0.096 (0.012)    0.060
                              L     0.156 (0.010)      0.080 (0.006)    0.076       0.096 (0.013)    0.060
p(y(0)≤L0, y(1)>L1)           M     0.007 (0.002)      0.001 (0.001)    0.006       0.004 (0.002)    0.003
                              L     0.007 (0.002)      0.003 (0.001)    0.004       0.004 (0.002)    0.003
p(y(0)>L0, y(1)≤L1)           M     0.011 (0.002)      0.014 (0.003)    -0.003      0.012 (0.002)    -0.001
                              L     0.011 (0.002)      0.013 (0.002)    -0.002      0.012 (0.002)    -0.001
p(y(0)>L0, y(1)>L1)           M     0.790 (0.040)      0.864 (0.041)    -0.074      0.820 (0.049)    -0.030
                              L     0.790 (0.040)      0.855 (0.042)    -0.065      0.820 (0.048)    -0.030
Conditional rates
p(y(1)≤L1 | y(0)≤L0)          M     0.923 (0.023)      0.323 (0.051)    0.600       0.570 (0.057)    0.353
                              L     0.923 (0.023)      0.914 (0.040)    0.009       0.928 (0.058)    -0.005
p(y(1)>L1 | y(0)≤L0)          M     0.040 (0.010)      0.007 (0.005)    0.033       0.026 (0.010)    0.014
                              L     0.040 (0.010)      0.030 (0.012)    0.010       0.042 (0.014)    -0.002

Table B.3
Estimates of Different Population Characteristics and Their Standard Errors Obtained From the Complete Data Set and Under 30% Attrition in the Second Wave
Parameter                    Est.   No attrition       MCAR: θ̂ (s.e.), Δ            MAR: θ̂ (s.e.), Δ
                                    θ̂ (s.e.)
Quantiles
M0 (median, t=0)              M     29,300 (1,000)     29,300 (1,000)   0           29,300 (1,000)   0
                              L     29,300 (1,000)     32,000 (900)     -2,700      29,300 (1,000)   0
M1 (median, t=1)              M     28,600 (1,100)     31,200 (800)     -2,600      28,600 (1,000)   0
                              L     28,600 (1,000)     31,300 (900)     -2,700      28,600 (1,100)   0
Proportions
p(y(0)≤L0, y(1)≤L1)           M     0.156 (0.01)       0.04 (0.011)     0.116       0.080 (0.014)    0.076
                              L     0.156 (0.01)       0.07 (0.010)     0.086       0.080 (0.016)    0.076
p(y(0)≤L0, y(1)>L1)           M     0.007 (0.002)      0.001 (0.001)    0.006       0.004 (0.002)    0.003
                              L     0.007 (0.002)      0.003 (0.001)    0.004       0.004 (0.002)    0.003
p(y(0)>L0, y(1)≤L1)           M     0.011 (0.002)      0.015 (0.003)    -0.005      0.013 (0.002)    -0.002
                              L     0.011 (0.002)      0.012 (0.003)    -0.001      0.013 (0.002)    -0.002
p(y(0)>L0, y(1)>L1)           M     0.790 (0.04)       0.874 (0.037)    -0.084      0.829 (0.045)    -0.039
                              L     0.790 (0.04)       0.864 (0.038)    -0.074      0.828 (0.045)    -0.038
Conditional rates
p(y(1)≤L1 | y(0)≤L0)          M     0.923 (0.023)      0.245 (0.049)    0.678       0.465 (0.071)    0.458
                              L     0.923 (0.023)      0.885 (0.051)    0.038       0.930 (0.099)    -0.007
p(y(1)>L1 | y(0)≤L0)          M     0.040 (0.010)      0.008 (0.006)    0.032       0.022 (0.012)    0.018
                              L     0.040 (0.010)      0.037 (0.016)    0.003       0.044 (0.017)    -0.004

Charts B
Sensitivity Measures Ratio of the Mixed and Longitudinal Estimators

[Bar charts, not reproducible here: for each of the parameters p(y(0)≤L0, y(1)≤L1), p(y(0)≤L0, y(1)>L1), p(y(0)>L0, y(1)≤L1), p(y(0)>L0, y(1)>L1), p(y(1)≤L1 | y(0)≤L0) and p(y(1)>L1 | y(0)≤L0), each chart plots the sensitivity measures ratio (MCAR/MAR) of the mixed and of the longitudinal estimator.]

Chart B.1: 10% attrition rate in the second wave.
Chart B.2: 20% attrition rate in the second wave.
Chart B.3: 30% attrition rate in the second wave.

APPENDIX C

We found the empirical bias of the mixed estimates (under a MAR adjustment model) of the conditional rate of remaining low income in 1994, given that the individual was low income in 1993, statistically significant when performing a conservative test of the form (θ̂_0 − θ̂_mixed) / √(var(θ̂_mixed) + var(θ̂_0)). We found the empirical bias of the mixed estimates of the conditional rate of having an income higher than the LIM in 1994, given that the individual was low income in 1993 (under a MAR adjustment model), non-significant when performing a "radical" test of the form (θ̂_0 − θ̂_mixed) / s.e.(θ̂_mixed). Similarly, we found the empirical biases of the longitudinal estimates (under a MAR adjustment model) of both conditional rates non-significant when performing the same type of "radical" test as above, i.e., when assuming that the estimate under no attrition is non-stochastic. These results hold for 10%, 20% and 30% attrition rates.

REFERENCES

BINDER, D., and KOVACEVIC, M.S. (1995). Estimating some measures of income inequality from survey data: An application of the estimating equation approach. Survey Methodology, 21, 137-145.

KOVACEVIC, M.S., and YUNG, W. (1997). Variance estimation for measures of income inequality and polarization - An empirical study. Survey Methodology, 23, 41-52.

KOVAR, J.G., RAO, J.N.K., and WU, C.F.J. (1988). Bootstrap and other methods to measure errors in survey estimates. The Canadian Journal of Statistics, 16, 25-45.

LAVALLEE, P., and HUNTER, L. (1992). Weighting for the survey of labour and income dynamics.
In Proceedings: Symposium 92, Design and Analysis of Longitudinal Surveys, Statistics Canada.

LEPKOWSKI, J.M. (1989). Non-response adjustments for wave nonresponse. In Panel Surveys, (Eds. D. Kasprzyk, et al.). New York: John Wiley and Sons.

RAO, J.N.K., and WU, C.F.J. (1988). Resampling inference with complex survey data. Journal of the American Statistical Association, 83, 231-241.

RIZZO, L., KALTON, G., and BRICK, J.M. (1996). A comparison of some weighting adjustment methods for panel nonresponse. Survey Methodology, 22, 43-53.

RUBIN BLEUER, S. (1996). Gross flows estimation from longitudinal administrative data with missing waves. Proceedings of the Section on Survey Research Methods, American Statistical Association, II, 681-686.

SHAO, J., and RAO, J.N.K. (1993). Standard errors for low income proportions estimated from stratified multi-stage samples. Sankhya, B, 55, 393-414.

SHAO, J., and SITTER, R. (1996). Bootstrap for imputed data. Journal of the American Statistical Association, 91, 435, 1278-1288.

TAMBAY, J.-L., SCHIOPU-KRATINA, I., MAYDA, J., STUKEL, D., and NADON, S. (1998). Treatment of nonresponse in cycle two of the National Population Health Survey. Survey Methodology, 24, 147-156.

Survey Methodology, June 1999
Vol. 25, No. 1, pp. 99-103
Statistics Canada

Utilising Longitudinally Linked Data from the British Labour Force Survey

PAM F. TATE

ABSTRACT

The British Labour Force Survey (LFS) uses a rotating sample design, with each sample household retained for five consecutive quarters. Linking together the information on the same persons across quarters produces a potentially very rich source of longitudinal data. There are however serious risks of distortion in the results from such longitudinal linking, mainly arising from sample attrition, and from response errors, which can produce spurious flows between economic activity states.
This paper describes the initial results of investigations by the Office for National Statistics (ONS) into the nature and extent of the problems.

KEY WORDS: Longitudinal data; Labour Force Survey; Economic activity; Attrition bias; Response error.

1. INTRODUCTION

The British Labour Force Survey (LFS) is a household survey, gathering information on a wide range of labour force characteristics and related topics. Since 1992 it has been conducted on a quarterly basis, with each sample household retained for five consecutive quarters, and a fifth of the sample replaced each quarter. The survey is designed to produce cross-sectional data, but in recent years it has been recognised that linking together data on each individual across quarters could produce a rich source of longitudinal data, the uses of which include estimation of labour force gross flows.

The process of linking information on the same individual from different quarters in the LFS is relatively straightforward. However, there are methodological problems which pose serious risks of distortion in the results from this new, hitherto untested use of LFS data. Similar problems have been identified in other countries' labour force surveys, but there are as yet no generally accepted methods of dealing with them. The Office for National Statistics (ONS) has therefore undertaken a programme of work to address this issue. This paper describes the results so far of investigations into the nature and extent of the problems, and the proposed methods of dealing with them. The issues fall into two main groups: biases arising from sample attrition and related factors; and biases arising from response errors, particularly their effects in producing spurious flows between economic activity states. These are considered in turn.

2.
SAMPLE ATTRITION AND ITS BIASING EFFECTS

Some sample members are lost at the initial stage, because of nonresponse in the first interview, either because it has not been possible for them to be contacted during the narrow time window available, or because they have refused to be interviewed. After that, further sample members are lost from each successive quarterly interview round, either because they have moved house (the basic sampling unit for this survey being the dwelling), or because it proves impossible to contact them or they refuse to continue. All these groups of people are, in different ways, atypical of the population as a whole, so their loss from the sample can introduce biases.

Some of these biases are compensated for in the course of applying the normal LFS weighting procedure, which produces population level estimates which are consistent with census-based control totals by sex, age group and region. This process will compensate for biases arising, at all stages of the survey, from differential attrition by sex, age and region. However, biases in other characteristics which are not themselves used in the weighting procedure will not be compensated for (and may even be increased) in that process, except when they are related to age, sex or region, in such a way that the bias is caused entirely by the under- or over-representation of particular age, sex or region categories.

Work on this subject therefore looked first at what characteristics are more or less represented in the LFS sample than the whole population, and in different waves of the LFS sample. (Each quarter, the sample is made up of five waves, the people in the first wave having their first interview, those in the second wave their second interview, and so on.) It then examined whether and to what extent these characteristics are related to each other, and whether it is possible to define a set of variables which characterise those people who are likely to be under-represented.

3.
CHARACTERISTICS OF NON-RESPONDENTS

Analysis of the proportions which could not be linked to the next quarter, by wave, for key demographic and economic variables (Table 1 gives an illustration for broad age/sex groups), showed that, consistently across all waves, there is a greater propensity to be under-represented for young people aged 18 to 29 (and especially 18 to 24), single people, those living in London, people in rented accommodation (especially privately rented), the unemployed, and those in temporary employment. Most of these characteristics have also been found (by Foster (1994) in a study which linked data from the 1991 Census with non-responding LFS sample households) to be associated with high non-response at the first interview, particularly young adults, single people, one person households, and those living in London.

Pam F. Tate, Office for National Statistics, Room RG/11, 1 Drummond Gate, London SW1V 2QQ, United Kingdom.

Table 1
Percentage of Unlinked Cases by Sex and Age Group by Wave

Unlinked percentage
AGE & SEX       Wave 1   Wave 4
All persons       8.6      4.8
Male              8.6      4.9
  15-17           6.8      4.9
  18-29          13.8     10.2
  30-44           6.8      4.1
  45-64           7.2      2.5
Female            8.7      4.7
  15-17           5.9      3.0
  18-29          13.9     10.9
  30-44           6.3      2.8
  45-59           7.5      2.2

Note: More detailed analyses are available from the author.

Several of the characteristics of those who are lost to the sample appear likely to be related, and this was investigated in the first instance using logistic regression. The variables identified as being independently associated with whether the cases were lost from the sample were found to be largely consistent for the four waves. In each case they included age group, marital status, tenure (i.e., whether the accommodation was owned, rented from a private landlord, or rented from a local authority or housing association), qualification level, and a combined economic activity variable incorporating broad economic activity (employed, unemployed or inactive), and, for the employed, employment status, part-time/full-time and temporary/permanent. Region was found to be independently associated in only two of the four waves, and sex in none.

For the five variables consistently appearing for all waves, there was a good degree of consistency concerning which categories were associated with sample attrition. Table 2 gives the multiplying factors for the odds ratio for all categories with a consistent association with increasing attrition. Being in the younger age groups, between 18 and 29 (and especially 18 to 24), has a particularly strong effect, as does being in privately rented accommodation. Being single (i.e., never married and not cohabiting) has a moderate association. There are no consistent associations with particular categories of economic activity or qualification level, except for a slight one with full-time, temporary employees. The effect of region is not consistent even for the two waves in which it appears.

Table 2
Multiplying Factors for Odds Ratios: Categories Associated with High Attrition

Variable & category               Wave 1   Wave 2   Wave 3   Wave 4
Age group: 18-19                   1.89     2.56     2.86     1.92
Age group: 20-24                   1.79     2.08     2.10     2.83
Age group: 25-29                   1.17     1.30     1.44     1.55
Tenure: privately rented           2.12     1.52     1.86     2.29
Marital status: single             1.25    (1.12)    1.27     1.49
Employee, full-time, temporary    (1.12)   (1.36)   (1.13)    1.75

Note: ( ) indicates coefficient is not significant at 5% level.

The logistic regression analysis performed did not allow for interactions between the variables, and to investigate this possibility a further analysis was performed, using the CHAID module of SPSS to produce a segmentation of the data set into groups which have as great a variation as possible with respect to the proportion of unlinked cases. The results of this were however very similar to those of the logistic regression analysis.
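The "multiplying factor for the odds ratio" reported in Table 2 has a simple interpretation: with dummy coding, the exponential of a fitted logistic regression coefficient is the factor by which the odds of being unlinked are multiplied for that category relative to the reference category. A minimal sketch with hypothetical counts (these numbers are illustrative, not the survey's data):

```python
from math import log

def odds_ratio(unlinked_cat, linked_cat, unlinked_ref, linked_ref):
    """Odds of non-linkage in a category relative to a reference category."""
    return (unlinked_cat / linked_cat) / (unlinked_ref / linked_ref)

# Hypothetical counts of unlinked/linked cases: a young age group
# versus a 30-44 reference group.
or_young = odds_ratio(138, 862, 68, 932)
print(round(or_young, 2))        # 2.19

# In a logistic model of "unlinked" on category dummies, the fitted
# coefficient for this category would equal log(or_young):
print(round(log(or_young), 2))   # 0.79
```

For a single categorical predictor this table-based odds ratio and the logistic regression estimate coincide; the multivariate analysis in the text additionally adjusts each factor for the others.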
Overall, the main characteristics independently associated with a high proportion of sample loss were the younger adult age groups (18 to 29, especially 18 to 24) and living in privately rented accommodation, with some relatively minor additional effects of being in temporary employment for the youngest age groups. Separate analyses of the characteristics of those sample members who had been lost through moving away, and those lost through non-contact (or, more rarely, refusal) produced similar results.

4. COMPENSATING FOR ATTRITION BIAS

The analysis so far has been directed at the biasing effect of sample attrition on the cross-sectional characteristics of the longitudinal sample, and has identified the characteristics independently associated with greater nonresponse. A possible approach to compensating for the bias arising from this is to incorporate tenure as well as age into the weighting procedure for the longitudinal data. This is being explored, using a calibration approach with CALMAR software, and including prior weights derived from the work described above to compensate for differential nonresponse by tenure.

However, there may be a problem which would limit the effectiveness of this approach. The propensity to not respond may be directly dependent on the unobserved labour force status of the individual, and possibly independently of their observed characteristics. Nonresponse of this kind is an example of non-ignorable nonresponse (Rubin 1976), and its presence would imply that estimates of the important measures of labour force gross flows would be biased even after the application of a weighting process of the type being explored. There are two indirect approaches which give some indication of whether non-ignorable nonresponse might be a problem for longitudinal LFS data.
One is to investigate whether the proportion of the gross flows in the sample which are transitions between different economic activity states systematically decreases (or increases) from wave 1-2 to wave 4-5; if so, this would suggest that people changing from one state to another are more (or less) likely to be nonrespondents than those in a stable state. However, Table 3 shows that there is no consistent systematic pattern across waves - though this does not exclude the possibility of other patterns of differential nonresponse by labour force flows category.

Table 3
Percentage of Transitions Between Different Economic Activity States by Wave for Pairs of Adjacent Quarters

Data set                          Wave 1-2   Wave 2-3   Wave 3-4   Wave 4-5
Summer/autumn 1995                   8.0        7.3        7.3        7.1
Autumn 1995/winter 1995-1996         7.2        6.7        6.5        6.5
Summer/autumn 1996                   7.6        7.0        7.3        7.5
Autumn 1996/winter 1996-1997         6.8        6.5        6.2        6.5

Another possibility is that people moving addresses (and thereby lost to the LFS sample) may have a different pattern of labour force flows than the rest of the population. We do not have any information on the people who have moved away, but we do know something about the people who have moved into the sample addresses from elsewhere. These movers-in can reasonably be taken to represent the movers-out, since they are equally samples from the same population of movers (ignoring the possible effects of the small proportion of international moves). Table 4 shows the distribution of the linked sample (all adults whose records were able to be matched) and of the identifiable movers-in for a pair of adjacent quarters in 1995. (It should however be noted that the flows categories are not strictly comparable, since the previous economic activity state for the movers-in is obtained by retrospective reporting.)
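The linking and gross-flows tabulation underlying Tables 3 and 4 can be sketched as follows; the person identifiers and states are hypothetical, with E = in employment, U = ILO unemployed, N = economically inactive, as in the note to Table 4:

```python
from collections import Counter

# Hypothetical linked records: person id -> economic activity state
# reported in each of two adjacent quarters.
quarter1 = {"p1": "E", "p2": "E", "p3": "U", "p4": "N", "p5": "N"}
quarter2 = {"p1": "E", "p2": "U", "p3": "E", "p4": "N", "p6": "E"}

# Only persons present in both quarters can be linked; p5 is lost to
# attrition and p6 is a mover-in with no first-quarter record.
linked = sorted(set(quarter1) & set(quarter2))

flows = Counter(quarter1[p] + quarter2[p] for p in linked)   # cells EE, EU, ...
transitions = sum(n for cell, n in flows.items() if cell[0] != cell[1])
pct_transitions = 100 * transitions / len(linked)

print(flows)            # e.g. Counter({'EE': 1, 'EU': 1, 'UE': 1, 'NN': 1})
print(pct_transitions)  # 50.0
```

In the LFS application the same tabulation is run on weighted records, and the "% transitions" figure corresponds to the off-diagonal share of the flows table.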
It is clear that the sample of movers does differ, with a lower proportion in stable inactivity, and a higher proportion in all the other flows categories, and in particular a greater proportion of people changing their economic activity state; but the movers make up such a small proportion overall that the effect on the whole sample is negligible.

Table 4
Gross Flows for Movers-in Compared with Linked Sample

Activity states    Linked sample (%)   Movers-in (%)   Linked + movers (%)
EE                       55.1              56.9               55.1
EU                        0.8               1.5                0.8
EN                        1.1               1.6                1.1
UE                        1.0               1.7                1.0
UU                        2.9               6.5                3.0
UN                        0.7               1.1                0.7
NE                        1.2               2.5                1.3
NU                        1.0               2.3                1.1
NN                       36.2              26.0               35.9
All transitions           5.9              10.6                6.0
TOTAL (no.)            80,664             1,790             82,454

Note: E represents in employment, U represents ILO unemployed, N represents economically inactive; hence EE represents in employment at both quarters, EU represents in employment then ILO unemployed, etc.

These indirect approaches do not indicate any very strong effect of non-ignorable nonresponse, but they do not rule it out. This possibility is therefore being investigated by work involving the modelling of nonresponse in the LFS.

5. RESPONSE ERROR AND ITS BIASING EFFECTS

All surveys in general, and household surveys in particular, are subject to response error, when the information given by the respondent is not an accurate reflection of the actuality. This may occur for a variety of reasons - the respondent may misunderstand the question; the interviewer may misunderstand or misrecord the response; the respondent may not know or remember the correct answer; or the respondent may knowingly give an incorrect answer for reasons of embarrassment, prestige, fear of breach of confidentiality or a wish to give the "expected" answer.

In the field of labour force surveys it has generally been found (for an overview see Lemaitre 1994) that, for cross-sectional data, there is no particular tendency for the errors to be systematic, so that on average they tend to cancel out.
However, for longitudinal data produced by linking together data collected on the same person at different points in time, this cancellation may not occur. In particular, this is likely to be the case for data on gross flows between economic activity states. The numbers of people who move from one state (in employment, unemployed, economically inactive) to another during the relatively short period usually considered (a month, a quarter, or perhaps a year) are small compared with the numbers of people who remain in the same state. A response error at one point of time is much more likely to lead to an apparent change of state when the true situation is one of stability, than the reverse. Thus response errors are likely to have a very disproportionate effect in upwardly biasing flows between reported states. In the LFS, they may arise from the use of proxy respondents, where one person answers questions on behalf of someone else in the same household; and from respondent errors. We will consider these in turn.

6. THE EFFECT OF PROXY RESPONDENTS

To investigate the effect of proxy respondents, we need to look at the distribution of activity states at the two quarters according to whether the first quarter's interview was in person or by proxy, and whether the second quarter's interview was in person or by proxy. Very young adults under 20 are both exceptionally likely to be represented by proxies and also likely to be particularly volatile in terms of their economic activity category, and so may distort any relationship between these two factors. Table 5 therefore shows the distribution of activity states at the two quarters, for men aged 20 to 64 and women aged 20 to 59. There is a higher proportion of transitions for personal followed by proxy interviews than for personal at both quarters, but proxy followed by personal interviews show only a very slightly higher proportion than personal at both quarters.
Thus switching between proxy and personal interviews does not show a consistently greater proportion of transitions. Cases with both interviews by proxy have the lowest proportion of transitions of all, and the inclusion of these brings the overall proportion to a level consistent with that for personal interviews at both quarters. Thus there do appear to be differences between the various combinations of interview types, which merit further investigation, but in the LFS the use of proxy respondents does not of itself produce an exaggerated estimate of gross flows.

Table 5
Percentage of Transitions by Interview Type

                       Men (20-64)               Women (20-59)
Interview type       Sample no.  % trans.     Sample no.  % trans.
Personal/personal      14,527      5.2          19,582      7.3
Personal/proxy          2,044      7.0           1,597      8.3
Proxy/personal          2,214      5.4           1,632      7.6
Proxy/proxy             8,602      4.7           4,206      5.4
All                    27,387      5.2          27,017      7.1

7. RESPONDENT ERRORS

By their nature, respondent errors are impossible to identify directly (except perhaps by re-interview, and even then there may be doubt about what is the correct answer). It is however sometimes possible to identify internal inconsistencies in the survey data, which may indicate response error. In the LFS, respondents who are in employment, and respondents who are unemployed, are asked how long they have been in that state. If the period is greater than three months, but they stated in the previous quarter that they were in a different state, there is an inconsistency which may indicate a false transition between economic activity states.

Table 6 shows the percentage of inconsistencies for various kinds of transitions - these are high throughout. Transitions from economic inactivity produce the highest percentages, especially when the transition is into unemployment. (There are no large or consistent differences between the different subcategories of the inactive.)
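The consistency screen just described (a reported spell longer than three months in the current state, but a different state reported in the previous quarter) can be sketched as follows; the record layout and example values are hypothetical:

```python
def inconsistent(prev_state, curr_state, months_in_curr_state):
    """Flag an apparent transition contradicted by the reported duration.

    The duration question is asked only of the employed ('E') and the
    unemployed ('U'); a spell of more than three months in the current
    state is inconsistent with a different state in the previous quarter.
    """
    return (curr_state in ("E", "U")
            and months_in_curr_state > 3
            and prev_state != curr_state)

records = [
    ("N", "U", 7),   # inactive -> unemployed, but "unemployed 7 months": flagged
    ("E", "E", 24),  # stable employment: consistent
    ("U", "E", 2),   # genuine recent transition: consistent
]
flags = [inconsistent(*r) for r in records]
print(flags)  # [True, False, False]
```

Flag rates computed this way over the linked file, broken down by transition type, are what a table like Table 6 summarises.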
Separating those in employment into part-time and full-time shows that there is a consistent pattern of a greater proportion of inconsistencies for part-time employment, and similar but less pronounced results were found for the self-employed.

Table 6
Percentage of Inconsistencies by Transition Type

Transition type                All (%)   Full-time (%)   Part-time (%)
Unempl. to Employment            8.7         7.8             12.2
Inactive to Employment          26.2        18.1             30.4
Employment to Unemployment      18.7        14.7             23.3
Inactive to Unemployment        49.5
All                             23.9

It is possible that the inconsistencies may have arisen through errors in the reported length of time in the economic activity state at the second quarter, rather than in the initial state at the first quarter. The distribution of the length of time does not however show heaping at around four to five months (as would be expected in the case of errors in the duration data). Also, the duration data reported in consecutive quarters for people in a stable state were found to be very consistent. These findings tend to suggest, though the evidence is indirect and by no means conclusive, that the errors are more likely to be in the reporting of economic activity at one or other of the interviews. This is not the only possibility - for example, it may be that some respondents have correct transition data, but incorrect duration data through using an interpretation of their past economic activity which is not consistent with the standard definitions applied to the reporting of their current state - but the findings so far suggest that it is likely to be the most common.

Some light on which of the inconsistent categories is correct may emerge by looking at the pattern of responses over three interviews. Table 7 shows the proportions of each group of inconsistent transitions from one quarter to the next which are followed by each economic activity category in the third quarter.
(All relevant waves are combined in order to obtain reasonable sample sizes.) It is clear that of the transitions into employment in the second quarter, the great majority remain in that category in the third quarter. The transitions into unemployment show a much more mixed pattern, with a little over half remaining in unemployment, but a substantial group of about 30 to 40 per cent reverting to the state reported in the first quarter. It is noteworthy that scarcely any of the transitions from the second to the third quarter for this group were found to have a repeated inconsistency between the transition and the reported duration data. The results so far suggest that, in the case of an inconsistent transition into employment, that is likely to be the correct state, but more investigation is needed to achieve further clarification.

8. ADJUSTING FOR RESPONSE ERROR BIAS

It is clear from the above that there is likely to be a substantial level of response error affecting the raw data on gross flows. Work on adjusting for such errors has so far been largely confined to the USA and Canada. A review of three methods proposed for USA data is given by Flaim and Hogue (1985), and a later proposal for Canadian data is given in Singh and Rao (1995), but to date, to the author's knowledge, no official adjusted gross flows data are being published, though several countries are publishing unadjusted data while drawing attention to their limitations. The adjustment methods so far proposed all rely on assumptions about the nature of the errors which seem unlikely to be met in practice - either full independence of the classification errors or very limited departures from that assumption. (See Lemaitre 1994 for a review of problems with these adjustment methods.) It seems worthwhile to explore different routes to the development of methods of adjustment or compensation for response error bias.
As a first stage, work is continuing on the investigation of the characteristics and circumstances of cases of inconsistency, and of other possible ways of identifying false transitions. It is also proposed to investigate the circumstances of people giving inconsistent responses of the kind analysed above, by means of more detailed follow-up interviews. This should provide better indications of the extent to which the inconsistencies do represent response error, and may provide results useful both for reducing, and for adjusting for, response error. Both these strands will provide inputs to a third element of the forward programme, in which it is proposed to develop models of classification error in reporting economic activity.

Table 7
Percentages of Inconsistent Transitions by Economic Activity at Following Quarter

                             Total          Activity state in next quarter
Transition type              inconsistent   Employed (%)   Unempl. (%)   Inactive (%)
Unempl. to Employment             60             90             7             3
Inactive to Employment           159             79             4            17
Employment to Unempl.             87             39            53             8
Inactive to Unempl.              229             17            55            28

ACKNOWLEDGEMENTS

The author wishes to thank the editor and two referees for their helpful comments.

REFERENCES

FLAIM, P.O., and HOGUE, C.R. (1985). Measuring labor force flows: a special conference examines the problems. Monthly Labor Review, July 1985, U.S. Bureau of Labor Statistics.

FOSTER, K. (1994). The Labour Force Survey - Report of the 1991 Census-linked Study of Survey Nonrespondents. Office of Population Censuses and Surveys.

LEMAITRE, G. (1994). Data on Labour Force Dynamics from Labour Force Surveys. Organisation for Economic Co-operation and Development.

RUBIN, D.B. (1976). Inference and missing data. Biometrika, 63, 581-592.

SINGH, A.C., and RAO, J.N.K. (1995). On the adjustment of gross flow estimates for classification error with application to data from the Canadian Labour Force Survey. Journal of the American Statistical Association, 90, 478-488.

Survey Methodology, June 1999, Vol. 25, No. 1, pp.
105-106, Statistics Canada

A Model Based Justification of Kish's Formula for Design Effects for Weighting and Clustering

SIEGFRIED GABLER, SABINE HAEDER and PARTHA LAHIRI

ABSTRACT

In this short note, we demonstrate that the well-known formula for the design effect intuitively proposed by Kish has a model-based justification. The formula can be interpreted as a conservative value for the actual design effect.

KEY WORDS: Cluster size; Intraclass correlation coefficient; Selection probabilities.

1. INTRODUCTION

We consider multistage, clustered sample designs where each observation belongs to a weighting class. For example, the clusters are blocks which are selected proportional to the number of their households. Within each block the same number of households is selected with equal probabilities. A randomly chosen person of the household is interviewed. Then the household sizes determine the weighting classes.

Kish (1987) proposed the following formula for determining the design effect, in order to incorporate the effects due to both the weighting needed to counter unequal selection probabilities and the clustered selection:

$$\mathrm{deff}_{\text{Kish}} = m \, \frac{\sum_{i=1}^{I} w_i^2 m_i}{\big(\sum_{i=1}^{I} w_i m_i\big)^2} \, \big[1 + (\bar{b} - 1)\rho\big],$$

where $m_i$ and $w_i$ denote the number of observations and the weight attached to the $i$-th weighting class ($i = 1, \dots, I$), $m = \sum_{i=1}^{I} m_i$ is the total sample size, $\bar{b}$ is the average cluster size and $\rho$ is the intraclass correlation coefficient.
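Kish's formula translates directly into code. The sketch below is a minimal illustration with made-up class sizes and weights; kish_deff is our own helper name, not an established library function.

```python
# Minimal sketch of Kish's design-effect formula:
# deff_Kish = m * sum(w_i^2 * m_i) / (sum(w_i * m_i))^2 * (1 + (b_bar - 1) * rho)


def kish_deff(m_i, w_i, b_bar, rho):
    """m_i: observations per weighting class; w_i: class weights;
    b_bar: average cluster size; rho: intraclass correlation."""
    m = sum(m_i)
    weighting_loss = (m * sum(w * w * n for w, n in zip(w_i, m_i))
                      / sum(w * n for w, n in zip(w_i, m_i)) ** 2)
    clustering = 1.0 + (b_bar - 1.0) * rho
    return weighting_loss * clustering


# Equal weights and average cluster size 1 (no clustering) give deff = 1.
print(kish_deff([100, 100], [1.0, 1.0], b_bar=1, rho=0.05))  # 1.0
# Unequal weights and clustering both inflate the design effect.
print(kish_deff([120, 80], [1.0, 2.0], b_bar=10, rho=0.05))
```

The first factor is the loss from unequal weighting alone; the second is the familiar clustering inflation.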
Kish's formula is very intuitive and novel, though he noted that his "treatment may be incomplete and imperfect." The formula is now used by many survey samplers; in fact, it will be used in the sample size determination for the European Social Surveys to be conducted by its member countries. The purpose of this note is to provide a model-based justification for using Kish's formula.

2. A MODEL BASED JUSTIFICATION OF KISH'S FORMULA

Let $m_{ic}$ be the number of observations in the $c$-th sampled cluster belonging to the $i$-th weighting class ($i = 1, \dots, I$; $c = 1, \dots, C$). Then $m_i = \sum_{c=1}^{C} m_{ic}$, the number of observations in the $i$-th weighting class. Let $b_c = \sum_{i=1}^{I} m_{ic}$, the number of observations in the $c$-th cluster, so that $\bar{b} = C^{-1} \sum_{c=1}^{C} b_c$. Let $y_{cj}$ and $w_{cj}$ be the observation and the weight for the $j$-th sampling unit in the $c$-th cluster ($c = 1, \dots, C$; $j = 1, \dots, b_c$). The usual design-based estimator of the population mean is

$$\bar{y}_w = \frac{\sum_{c=1}^{C} \sum_{j=1}^{b_c} w_{cj} y_{cj}}{\sum_{c=1}^{C} \sum_{j=1}^{b_c} w_{cj}}.$$

To justify Kish's formula, we assume the following model:

$$\mathrm{Var}(y_{cj}) = \sigma^2 \ (c = 1, \dots, C;\ j = 1, \dots, b_c), \qquad \mathrm{Cov}(y_{cj}, y_{c'j'}) = \begin{cases} \rho\sigma^2 & c = c',\ j \neq j', \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

The above model is appropriate for accounting for the cluster effect and was used earlier by others (see, e.g., Skinner, Holt and Smith 1989). We then define the design effect as $\mathrm{deff} = \mathrm{Var}_1(\bar{y}_w)/\mathrm{Var}_2(\bar{y})$, where $\mathrm{Var}_1(\bar{y}_w)$ is the variance of $\bar{y}_w$ under model (1) and $\mathrm{Var}_2(\bar{y})$ is the variance of the overall sample mean $\bar{y} = \sum_{c=1}^{C} \sum_{j=1}^{b_c} y_{cj}/m$, computed under the following model:

$$\mathrm{Var}(y_{cj}) = \sigma^2 \ (c = 1, \dots, C;\ j = 1, \dots, b_c), \qquad \mathrm{Cov}(y_{cj}, y_{c'j'}) = 0 \ \text{for all } (c,j) \neq (c',j'). \tag{2}$$

Note that model (2) is appropriate under simple random sampling and provides the usual formula $\sigma^2/m$ for $\mathrm{Var}_2(\bar{y})$.

Siegfried Gabler and Sabine Haeder, ZUMA, B 2,1, D-68159 Mannheim; Partha Lahiri, University of Nebraska-Lincoln, NE 68588-0323.

Now, turning our attention to $\mathrm{Var}_1(\bar{y}_w)$, note first that the weight $w_{cj}$ equals $w_i$ whenever the $j$-th unit in the $c$-th cluster belongs to the $i$-th weighting class, so that $\sum_{c=1}^{C} \sum_{j=1}^{b_c} w_{cj} = \sum_{i=1}^{I} w_i m_i$, and, under model (1),

$$\mathrm{Var}\Big(\sum_{j=1}^{b_c} w_{cj} y_{cj}\Big) = \sum_{j=1}^{b_c} w_{cj}^2 \,\mathrm{Var}(y_{cj}) + \sum_{j \neq j'} w_{cj} w_{cj'}\,\mathrm{Cov}(y_{cj}, y_{cj'}) = \sigma^2 \Big[(1-\rho)\sum_{i=1}^{I} w_i^2 m_{ic} + \rho\Big(\sum_{i=1}^{I} w_i m_{ic}\Big)^2\Big],$$

since $\sum_{j=1}^{b_c} w_{cj}^2 = \sum_{i=1}^{I} w_i^2 m_{ic}$ and $\sum_{j \neq j'} w_{cj} w_{cj'} = \big(\sum_{i=1}^{I} w_i m_{ic}\big)^2 - \sum_{i=1}^{I} w_i^2 m_{ic}$. Summing over clusters, dividing by $\big(\sum_{i=1}^{I} w_i m_i\big)^2$ and then by $\mathrm{Var}_2(\bar{y}) = \sigma^2/m$ gives

$$\mathrm{deff} = m\,\frac{(1-\rho)\sum_{i=1}^{I} w_i^2 m_i + \rho \sum_{c=1}^{C}\big(\sum_{i=1}^{I} w_i m_{ic}\big)^2}{\big(\sum_{i=1}^{I} w_i m_i\big)^2} = m\,\frac{\sum_{i=1}^{I} w_i^2 m_i}{\big(\sum_{i=1}^{I} w_i m_i\big)^2}\,\big[1 + (b^* - 1)\rho\big], \tag{3}$$

where $b^* = \sum_{c=1}^{C}\big(\sum_{i=1}^{I} w_i m_{ic}\big)^2 \big/ \sum_{i=1}^{I} w_i^2 m_i$.
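As a numerical sanity check on this representation (not part of the paper), the sketch below evaluates the exact model-based design effect and the Kish-style upper bound obtained by replacing b* with the weighted average cluster size (sum over c of b_c * sum_i w_i^2 m_ic) / (sum_i w_i^2 m_i); the class-by-cluster counts, weights, and variable names are all illustrative.

```python
# Numerical check that the exact model-based design effect never exceeds
# the Kish-style bound with the weighted average cluster size b_tilde.
# Counts m_ic are randomly generated; this is an illustration only.
import random

random.seed(1)
I, C = 4, 6                                    # weighting classes, clusters
w = [1.0, 1.5, 2.0, 3.0]                       # class weights w_i
m_ic = [[random.randint(0, 5) for _ in range(C)] for _ in range(I)]
m_i = [sum(row) for row in m_ic]               # observations per class
b_c = [sum(m_ic[i][c] for i in range(I)) for c in range(C)]  # cluster sizes
rho = 0.1
m = sum(m_i)

sw2m = sum(w[i] ** 2 * m_i[i] for i in range(I))   # sum of w_i^2 m_i
swm = sum(w[i] * m_i[i] for i in range(I))         # sum of w_i m_i

# Exact design effect, equation (3)
cluster_term = sum(sum(w[i] * m_ic[i][c] for i in range(I)) ** 2
                   for c in range(C))
deff = m * ((1 - rho) * sw2m + rho * cluster_term) / swm ** 2

# Kish-style bound with the weighted average cluster size b_tilde
b_tilde = sum(b_c[c] * sum(w[i] ** 2 * m_ic[i][c] for i in range(I))
              for c in range(C)) / sw2m
bound = m * sw2m / swm ** 2 * (1 + (b_tilde - 1) * rho)

print(deff <= bound + 1e-12)  # True
```

Repeating this with other seeds and weight vectors never violates the inequality, as the Cauchy-Schwarz argument guarantees.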
Using the Cauchy-Schwarz inequality, we get

$$\Big(\sum_{i=1}^{I} w_i m_{ic}\Big)^2 = \Big(\sum_{i=1}^{I} (w_i \sqrt{m_{ic}})\sqrt{m_{ic}}\Big)^2 \leq \Big(\sum_{i=1}^{I} w_i^2 m_{ic}\Big)\Big(\sum_{i=1}^{I} m_{ic}\Big) = b_c \sum_{i=1}^{I} w_i^2 m_{ic}. \tag{4}$$

Let

$$\tilde{b} = \frac{\sum_{c=1}^{C} b_c \sum_{i=1}^{I} w_i^2 m_{ic}}{\sum_{i=1}^{I} w_i^2 m_i}, \tag{5}$$

say. Thus (4) and (5) yield

$$\mathrm{deff} \leq m\,\frac{\sum_{i=1}^{I} w_i^2 m_i}{\big(\sum_{i=1}^{I} w_i m_i\big)^2}\,\big[1 + (\tilde{b} - 1)\rho\big]. \tag{6}$$

Note that $\tilde{b}$ can be interpreted as an average (weighted) cluster size. If $\tilde{b}$ is equal to $\bar{b}$, e.g., if all $b_c$ are equal, the upper bound of deff is simply Kish's formula. Thus Kish's formula serves as a conservative value for the actual design effect.

ACKNOWLEDGEMENTS

The authors are thankful to the editor and the referees for their remarks, which led to an improvement of the paper. The work was completed while the last author was a Guest Professor at ZUMA, the Center for Survey Research and Methodology, Mannheim, Germany.

REFERENCES

KISH, L. (1987). Weighting in Deft². The Survey Statistician, June 1987.

SKINNER, C.J., HOLT, D., and SMITH, T.M.F. (Eds.) (1989). Analysis of Complex Surveys. Chichester: Wiley.

JOURNAL OF OFFICIAL STATISTICS
An International Review Published by Statistics Sweden

JOS is a scholarly quarterly that specializes in statistical methodology and applications. Survey methodology and other issues pertinent to the production of statistics at national offices and other statistical organizations are emphasized.
All manuscripts are rigorously reviewed by independent referees and members of the Editorial Board.

Contents, Volume 14, Number 4, 1998

Introduction to the Special Issue: Disclosure Limitation Methods for Protecting the Confidentiality of Statistical Data - Stephen E. Fienberg and Leon C.R.J. Willenborg
A Database System Prototype for Remote Access to Information Based on Confidential Data - Sallie Keller-McNulty and Elizabeth A. Unger
Estimating the Re-identification Risk Per Record in Microdata - C.J. Skinner and D.J. Holmes
A Bayesian Species-Sampling-Inspired Approach to the Uniques Problem in Microdata Disclosure Risk Assessment - Stephen M. Samuels
Confidentiality, Uniqueness, and Disclosure Limitation for Categorical Data - Stephen E. Fienberg and Udi E. Makov
Synthetic and Combined Estimators in Statistical Disclosure Control - Jeroen Pannekoek and Ton de Waal
Balancing Disclosure Risk Against the Loss of Nonpublication - Alan M. Zaslavsky and Nicolas J. Norton
Optimal Local Suppression in Microdata - A.G. de Waal and L.C.R.J. Willenborg
Models and Methods for the Microdata Protection Problem - C.A.J. Hurkens and S.R. Tiourine
Masking Microdata Using Micro-Aggregation - D. Defays and M.N. Anwar
Post Randomisation for Statistical Disclosure Control: Theory and Implementation - J.M. Gouweleeuw, P. Kooiman, L.C.R.J. Willenborg and P.-P. de Wolf
Comment - G. Sande
Disclosure Limitation Using Perturbation and Related Methods for Categorical Data - Stephen E. Fienberg, Udi E. Makov, and Russell J. Steel
Comment - Peter Kooiman
Rejoinder - Stephen E. Fienberg, Udi E. Makov, and Russell J. Steel
Comparison of Systems Implementing Automated Cell Suppression for Economic Statistics - N. Kirkendall and G.
Sande
Using Noise for Disclosure Limitation of Establishment Tabular Data - Timothy Evans, Laura Zayatz, and John Slanta
Experiments with Controlled Rounding for Statistical Disclosure Control in Tabular Data with Linear Constraints - Matteo Fischetti and Juan-José Salazar-González
Editorial Collaborators
Index to Volume 14, 1998

All inquiries about submissions and subscriptions should be directed to the Chief Editor: Lars Lyberg, R&D Department, Statistics Sweden, Box 24 300, S-104 51 Stockholm, Sweden.

The Canadian Journal of Statistics / La Revue Canadienne de Statistique
CONTENTS / TABLE DES MATIÈRES
Volume 27, No. 1, March/mars 1999

Christian GENEST - Editor's report / Rapport du rédacteur en chef
Mousumi BANERJEE, Michelle CAPOZZOLI, Laura MCSWEENEY and Debajyoti SINHA - Beyond kappa: A review of interrater agreement measures
L. MANCHESTER, C.A. FIELD and A. MCDOUGALL - Regression for overdetermined systems: A fisheries example
E. MONGA and S. TARDIF - Asymptotic optimality of a class of rank tests for replicated Latin square designs
Guohua PAN and Winson TAAM - Distribution-free subset selection for incompletely ranked data
Gary SNEDDON - Smoothing in an underdetermined linear model with random explanatory variables
John R. COLLINS - Robust M-estimators of scale: Minimax bias versus maximal variance
Dongchu SUN and Keying YE - Reference priors for a product of normal means when variances are unknown
Sonia PETRONE - Bayesian density estimation using Bernstein polynomials
M. PENSKY and R.S. SINGH - Empirical Bayes estimation of reliability characteristics for an exponential family
M.C. FINOCCHIARO and D. SACCHETTI - The variance function of the natural exponential family generated by a measure on n integers
John E. KOLASSA - Confidence intervals for parameters lying in a random polygon
Nabendu PAL and Wooi K.
LIM - Second order properties of intraclass correlation estimators for a symmetric normal distribution
Andrew HEARD and Tim SWARTZ - Extended voting measures
Michael BARON - Convergence rates of change-point estimators and tail probabilities of the first-passage-time process
Célestin C. KOKONENDJI - Le problème d'Anscombe pour les lois binomiales négatives généralisées
Helge BLAKER - A class of shrinkage estimators in linear regression
Acknowledgement of referees' services / Remerciements aux membres des jurys
Forthcoming papers / Articles à paraître

The Canadian Journal of Statistics / La Revue Canadienne de Statistique
CONTENTS / TABLE DES MATIÈRES
Volume 27, No. 2, June/juin 1999

Andrey FEUERVERGER, John ROBINSON and Augustine WONG - On the relative accuracy of certain bootstrap procedures
Biao ZHANG - Bootstrapping with auxiliary information
Steven N. MACEACHERN, Merlise CLYDE and Jun S. LIU - Sequential importance sampling for nonparametric Bayes models: The next generation
George D. PAPANDONATOS and Seymour GEISSER - Bayesian interim analysis of lifetime data
Paul DAMIEN and Stephen WALKER - A full Bayesian analysis of circular data using the von Mises distribution
Craig A. COOLEY and Steven N. MACEACHERN - Prior elicitation in the classification problem
Schultz CHAN and Malay GHOSH - A geometric optimality of Cox's partial likelihood
Y. LEE and John A. NELDER - The robustness of the quasilikelihood estimator
Hui CHEN and Joseph P. ROMANO - An invariance principle for triangular arrays of dependent variables with application to autocovariance estimation
Kanchan MUKHERJEE - The asymptotic behavior of a class of L-estimators under long-range dependence
Ross H. TAPLIN - Robust F-tests for linear models
Ping ZHANG - The optimal prediction of cross-sectional proportions in categorical panel-data analysis
Dan NETTLETON - Order-restricted hypothesis testing in a variation of the normal mixture model
K. KRISHNAMOORTHY and Maruthy K.
PANNALA - Confidence estimation of a normal mean vector with incomplete data
Marten H. WEGKAMP - Quasi-universal bandwidth selection for kernel density estimators
Christian GENEST - Probability and statistics: A tale of two worlds?
Forthcoming papers / Articles à paraître

Journal of the Royal Statistical Society, Series D (The Statistician)
Edited by N.R.J. Fieller and U.T. Moorthy

Covering a broad range of topics of interest to professional statisticians, The Statistician includes applied papers on education, business, sport, industry and agriculture, statistical computing, and professional affairs, as well as obituaries of eminent statisticians.

Recent and forthcoming highlights:
Some Statistical Heresies (with Discussion) - J.K. Lindsey
Can Takeover Targets be Identified by Statistical Techniques? Some UK Evidence - P. Barnes
A Fractional Factorial Design for Benchmark Testing of a Bayesian Method for Multilocation Audits - V. Barnett and J. Haworth
Using Maximum Entropy to Double One's Expected Winnings in the UK National Lottery - S.J. Cox, G.J. Daniell and D.A. Nicole
Demonstrating the Durbin-Watson Statistic - R. Champion, C.T. Lenard and T.M. Mills
Tiers, Structure Formulae and the Analysis of Complicated Experiments - C.J. Brien and R.W. Payne

Journal of the Royal Statistical Society: Series D (The Statistician), ISSN 0039-0526. Published in March, June, September and December. Subscription rates, Vol. 48/1999: Europe £80, N. America $149, Rest of World £90.

To subscribe to The Statistician please use the order form on the Blackwell website: http://www.blackwellpublishers.co.uk, send an email to jnlinfo@blackwellpublishers.co.uk, or contact either of the following: Blackwell Publishers Journals, PO Box 805, 108 Cowley Road, Oxford OX4 1FH, UK. Tel: +44 (0)1865 244083, fax +44 (0)1865 381381. Journals Marketing (RSSD), Blackwell Publishers, 350 Main Street, Malden, MA 02148, USA. Tel.
+1 (781) 388 8200, fax +1 (781) 388 8210.

For further information or to request a sample copy please visit our website: http://www.blackwellpublishers.co.uk

GUIDELINES FOR MANUSCRIPTS

Before having a manuscript typed for submission, please examine a recent issue of Survey Methodology (Vol. 19, No. 1 and onward) as a guide and note particularly the points below. Accepted articles must be submitted in machine-readable form, preferably in WordPerfect. Other word processors are acceptable, but these also require paper copies for formulas and figures.

1. Layout

1.1 Manuscripts should be typed on white bond paper of standard size (8½ × 11 inch), one side only, entirely double spaced with margins of at least 1½ inches on all sides.
1.2 The manuscripts should be divided into numbered sections with suitable verbal titles.
1.3 The name and address of each author should be given as a footnote on the first page of the manuscript.
1.4 Acknowledgements should appear at the end of the text.
1.5 Any appendix should be placed after the acknowledgements but before the list of references.

2. Abstract

The manuscript should begin with an abstract consisting of one paragraph followed by three to six key words. Avoid mathematical expressions in the abstract.

3. Style

3.1 Avoid footnotes, abbreviations, and acronyms.
3.2 Mathematical symbols will be italicized unless specified otherwise, except for functional symbols such as "exp(·)" and "log(·)", etc.
3.3 Short formulae should be left in the text, but everything in the text should fit in single spacing. Long and important equations should be separated from the text and numbered consecutively with arabic numerals on the right if they are to be referred to later.
3.4 Write fractions in the text using a solidus.
3.5 Distinguish between ambiguous characters (e.g., w, ω; o, O, 0; l, 1).
3.6 Italics are used for emphasis. Indicate italics by underlining on the manuscript.

4. Figures and Tables

4.1
All figures and tables should be numbered consecutively with arabic numerals, with titles which are as nearly self-explanatory as possible, at the bottom for figures and at the top for tables.
4.2 They should be put on separate pages with an indication of their appropriate placement in the text. (Normally they should appear near where they are first referred to.)

5. References

5.1 References in the text should be cited with authors' names and the date of publication. If part of a reference is cited, indicate this after the reference, e.g., Cochran (1977, p. 164).
5.2 The list of references at the end of the manuscript should be arranged alphabetically and, for the same author, chronologically. Distinguish publications of the same author in the same year by attaching a, b, c to the year of publication. Journal titles should not be abbreviated. Follow the same format used in recent issues.
