SAS Global Forum 2013 Statistics and Data Analysis Paper 432-2013 Current Directions in SAS/STAT® Software Development Maura Stokes SAS Institute Inc. Abstract Recent years brought you SAS/STAT® releases in rapid succession, and another release is targeted for 2013. Which new software features will make a difference in your work? What new statistical trends should you know about? This paper describes recent areas of development focus, such as Bayesian analysis, missing data analysis, postfitting inference, quantile modeling, finite mixture models, specialized survival analysis, and structural equation modeling. This paper introduces you to the concepts and illustrates them with practical examples. More Frequent Releases of SAS/STAT Software In previous years, SAS/STAT software was updated only when Base SAS® software was released, but SAS/STAT is now released independently of Base SAS along with other SAS analytical products. This means these products can be released when enhancements are ready, and the goal is to update SAS/STAT every 12 to 18 months. To mark this newfound independence, the release numbering scheme for SAS analytical products has also changed; the current production release is SAS/STAT 12.1. With the new release paradigm, SAS provides you with new statistical techniques faster. However, it may be difficult to keep track of all of the new development. This paper provides a round-up of the current directions in SAS/STAT development, with emphasis on the areas that have received the most attention. Developing SAS/STAT Software SAS/STAT software has a long and rich legacy. Originally, statistical functionality was included in Base SAS for mainframes, but SAS/STAT became its own product in 1976. Originally written in the PL/I and Fortran programming languages, it was rewritten in the C language and made portable so it could run on a variety of platforms (multivendor architecture) including PCs. The Output Delivery System (ODS) was then incorporated, giving you complete control over the format of the tabular results and making reporting quite flexible. More recently, the development of ODS Statistical Graphics has resulted in the integration of graphics into the statistical analyses. What hasn’t changed over the years has been the goal of providing rich, up-to-date statistical techniques to SAS customers. SAS/STAT 12.1 includes over 75 procedures, soon to expand again with the 12.3 and 13.1 releases. New development directions are determined in a number of ways: customer input, company directives, the appearance of important new methodologies, and the drive to constantly refine and update existing software. Technical support is one channel from customers to development which fosters greater understanding of customer use and often leads to software enhancements. Other channels include professional contacts and customer meetings. Often, statistical development directions are influenced by overall SAS company directives, such as the current initiative to excel in high-performance computing. Of course, statistical developers, specialized PhD statisticians with computing backgrounds, keep up with current methodology in the usual ways: journals, professional conferences, expert contacts, and workshops. Developers attend statistical conferences and SAS user groups, both sources of feedback and suggestions. When R&D engages in a new area, it often begins with an invited seminar by a renowned expert and then sustains the contact for feedback as development proceeds. SAS development focuses on statistical methods that work in practice. SAS customers deal with real-life data, with its size and pitfalls, and the promising methodology that looked great for a few well-designed test cases often needs more work before it’s ready for the real world. An important objective is consistency in syntax so that customers can build on what they already know. The goal is always clear and consistent 1 SAS Global Forum 2013 Statistics and Data Analysis output, including graphics, and a major milestone for each new project is clearing a peer review of tabular and graphical content and layout. The documentation is written by the developers and undergoes substantial technical review and editing before it is finalized. Numerical accuracy and computational performance are essential. Numerical analysts develop the mathematical routines that are incorporated into the statistical procedures. Existing procedures are continously evaluated to see if they can benefit from cutting-edge algorithms. SAS avoids changing default methods because the software is used so often in production jobs, but new methods are added regularly and the documentation states when a newer method has become the predominant practice. The following sections outline some recent development directions, including model building, Bayesian analysis, linear model enrichment, analyzing time-to-event data, missing data methods, complex data, and high-performance computing. Model Building Model building has only grown more important in recent times as statisticians and data analysts face the world of Big Data and its inherent increases in data dimensionality. The value of predictive analytics depends on finding good models, so model selection is paramount. SAS/STAT model building took a step forward some years ago with the introduction of the GLMSELECT procedure for linear models selection. PROC GLMSELECT provides modern methods such as LARS and LASSO, and it also includes a rich array of selection criteria and numerous graphical displays. More recently, model building in SAS/STAT has expanded to include model selection for quantile regression and generalized linear models. New QUANTSELECT Procedure Quantile regression is a distribution-free method that determines a linear predictor for the conditional quantile. It is especially useful with data that are heterogeneous such that the tails and central location of the conditional distributions vary with the covariates. The QUANTREG procedure provides quantile regression in SAS/STAT software. Beginning with SAS/STAT 12.1, you can also perform model selection for quantile regression with the new QUANTSELECT procedure. This procedure provides capabilities similar to those offered by the GLMSELECT procedure, including: • forward, backward, stepwise, and LASSO selection methods • variable selection criteria: AIC, SBC, AICC, and so on • variable selection for both quantiles and the quantile process • the EFFECT statement for constructed model effects (splines) PROC QUANTSELECT is multithreaded so that it can take advantage of multiple processors. It is very efficient and can handle hundreds of variables and thousands of observations. After you have selected a model with the QUANTSELECT procedure, you can proceed to use the QUANTREG procedure for final model analysis. The following example illustrates the use of the QUANTSELECT procedure with baseball data from the 1986 season. The goal is to predict player salary. You can request model selection for any number of quantiles, and if you do so, you will find that different models are selected. If you are interested only in the model for those players making the most money, you can base the model on the 90th quantile, which is the analysis performed here. The following statements input the baseball data: data baseball; length name $ 18; length team $ 12; input name $ 1-18 nAtBat nHits nHome nRuns nRBI nBB yrMajor crAtBat crHits crHome crRuns crRbi crBB league $ division $ team $ position $ nOuts nAssts nError salary; 2 SAS Global Forum 2013 Statistics and Data Analysis datalines; Allanson, Andy 293 66 1 30 1 293 66 1 30 29 14 American East Cleveland C 446 33 20 . Ashby, Alan 315 81 7 24 14 3449 835 69 321 414 375 National West Houston C 632 43 10 475 ..... ..... 29 14 38 39 The following statements invoke the QUANTSELECT procedure. The variable SALARY is the response variable, and a number of explanatory variables are available for selection. The adaptive LASSO method is used for model selection, with AIC as the stopping criterion. The plot requested is the coefficient panel. proc quantselect data=baseball plots=(coef); class league division; model Salary = nAtBat nHits nHome nRuns nRBI nBB yrMajor crAtBat crHits crHome crRuns crRbi crBB league division nOuts nAssts nError / selection=lasso (adaptive stop=aic) quantile=.9; run; Figure 1 displays the selection summary information. You can see the values of AIC and AICC change as variables are added to the model. The optimal value of AIC is 2099.23 at the fifth step, which corresponds to a model with four variables: number of hits, career hits, career home runs, and career RBIs. These explanatory variables are the main factors in determining salary for the 90th percentile. Figure 1 Selection Summary Quantile Level = 0.9 Selection Summary Step Effect Entered Number Effects In AIC AICC SBC Adjusted R1 0 Intercept 1 2436.7289 2436.7442 2440.3011 0.0000 1 crHits 2 2197.4349 2197.4811 2204.5792 0.3655 2 crRbi 3 2183.6148 2183.7075 2194.3313 0.3819 3 nHits 4 2113.2757 2113.4308 2127.5643 0.4593 4 crHome 5 2099.2203* 2099.4538* 2117.0811* 0.4735* * Optimal Value Of Criterion Figure 2 displays the coefficient panel, which shows the progression of the standardized coefficients and the SBC throughout the selection process. 3 SAS Global Forum 2013 Statistics and Data Analysis Figure 2 Coefficient Panel Figure 3 contains the parameter estimates and their standardized versions. Figure 3 Parameter Estimates Parameter Estimates DF Estimate Standardized Estimate Intercept 1 -100.381451 0 nHits 1 3.895310 0.379872 crHits 1 0.660121 0.949536 crHome 1 3.340492 0.637123 crRbi 1 -0.453248 -0.329973 Parameter Additional inference can be performed by using this model in the QUANTREG procedure. New HPGENSELECT Procedure The GENMOD procedure fits generalized linear models in SAS; these models include normal regression, logistic regression, Poisson regression, and other analyses where you specify a link function and a distribution that belongs to the exponential family. The new HPGENSELECT procedure, available with SAS/STAT 12.3 (which runs on Base 9.4), performs model selection for generalized linear models (GLMs). This means that you can now perform model selection for analyses such as Poisson regression, negative binomial regression, and any other GLM. Designed for the distributed computing of SAS® High-Performance Statistics, PROC HPGENSELECT also works in single-machine mode. It provides forward, backward, and stepwise selection (LASSO-type methods are still a research topic), and includes the AIC, SBC, and AICC selection criteria. It does not produce graphs with this version, but they will surface in a future release. Recall the baseball data. You can treat the number of home runs hit during the year as counts that follow the Poisson distribution, and thus you can employ Poisson regression to model these counts. The following statements illustrate how you would request model selection for Poisson regression with the HPGENSELECT procedure. 4 SAS Global Forum 2013 Statistics and Data Analysis proc hpgenselect data=baseball; class league division; model nHome = nAtBat nHits nRuns nRBI nBB yrMajor crAtBat crHits crHome crRuns crRbi crBB league division nOuts nAssts nError / distribution=poisson link=log; selection method=forward details=all; run; You specify exactly the same MODEL statement as you would specify with the GENMOD procedure. You specify the selection method with the SELECTION statement, a new statement used by the high-performance procedures. Forward stepwise selection is specified with the METHOD=FORWARD option. Figure 4 displays information about the execution mode. Two threads were employed on a single machine. Figure 4 Performance Information Performance Information Execution Mode Single-Machine Number of Threads 2 Effects are added to the model if they produce improvement as judged by comparing the p-value of a score test to the entry significance level (SLE), which is 0.05 by default. The forward selection ends when no additional effect meets this criterion. Figure 5 provides the final effects that entered the model and the details of effect selection. 5 SAS Global Forum 2013 Statistics and Data Analysis Figure 5 Selection Details Selection Details Step Description Effects In Model 0 Initial Model 1 1 nRBI entered 2 1595.3565 2 nAssts entered 3 3 nHits entered 4 Chi-Square Pr > ChiSq -2 LogL AIC AICC 3419.513 3421.513 3421.525 <.0001 1994.778 1998.778 1998.815 48.6818 <.0001 1943.370 1949.370 1949.445 4 20.2656 <.0001 1922.611 1930.611 1930.737 nRuns entered 5 43.3903 <.0001 1880.196 1890.196 1890.386 5 crHome entered 6 6.3802 0.0115 1873.922 1885.922 1886.189 6 crRbi entered 7 27.3765 <.0001 1845.695 1859.695 1860.052 7 crAtBat entered 8 6.9953 0.0082 1838.769 1854.769 1855.229 8 crRuns entered 9 13.0622 0.0003 1825.626 1843.626 1844.203 9 crHits entered 10 15.2906 <.0001 1810.307 1830.307 1831.014 10 crBB entered 11 4.7299 0.0296 1805.576 1827.576 1828.428 Selection Details Step BIC 0 3425.287 1 2006.327 2 1960.693 3 1945.709 4 1909.069 5 1908.569 6 1886.117 7 1884.965 8 1877.597 9 1868.052 10 1869.096 Figure 6 contains fit statistics for the selected model. Note that the values of the 2 log likelihood and the various information criteria AIC, BIC, and AICC are smaller than the corresponding values for the full model (Initial Model) in the Selection Details table, which indicates that the more parsimonious model provides a better fit. However, note that the value of Pearson chi-square divided by its degrees of freedom is 1.60. This is an indication of overdispersion, which suggests that Poission regression is not the best technique to apply to these data. 6 SAS Global Forum 2013 Statistics and Data Analysis Figure 6 Fit Statistics Fit Statistics -2 Log Likelihood 1805.57637 AIC (smaller is better) 1827.57637 AICC (smaller is better) 1828.42799 BIC (smaller is better) 1869.09644 Pearson Chi-Square 496.91515 Pearson Chi-Square/DF 1.59780 Negative binomial regression is often an alternative in this situation; it allows the variance to be larger than the mean, unlike the assumption of equivalence in Poisson regression. The HPGENSELECT procedure also provides negative binomial regression, and that is requested with the DIST=NB option: proc hpgenselect data=baseball; class league division; model nHome = nAtBat nHits nRuns nRBI nBB yrMajor crAtBat crHits crHome crRuns crRbi crBB league division nOuts nAssts nError / dist=nb link=log; selection method=forward details=all; run; Figure 7 lists the seven effects that entered the model for negative binomial regression. Figure 7 Selection Summary Selection Summary Step Effect Entered Number Effects In p Value 0 Intercept 1 . 1 nRBI 2 <.0001 2 nAssts 3 <.0001 3 nHits 4 0.0019 4 nRuns 5 <.0001 5 crHome 6 0.0203 6 crRbi 7 <.0001 7 yrMajor 8 0.0464 Figure 8 displays the corresponding fit statistics for the final model; these values are somewhat smaller than the fit statistics reported for the final Poisson regression model. Note, however, that this is a different analysis and these measures cannot be directly compared. The Pearson chi-square/degrees of freedom ratio takes the value 1.01 for this analysis, which indicates no evidence of overdispersion. Thus, this model is deemed to be satisfactory. 7 SAS Global Forum 2013 Statistics and Data Analysis Figure 8 Fit Statistics Fit Statistics -2 Log Likelihood 1785.67309 AIC (smaller is better) 1803.67309 AICC (smaller is better) 1804.25001 BIC (smaller is better) 1837.64405 Pearson Chi-Square 318.66241 Pearson Chi-Square/DF 1.01485 Figure 9 contains the parameter estimates for this model. Figure 9 Parameter Estimates Parameter Estimates DF Estimate Standard Error Chi-Square Pr > ChiSq Intercept 1 1.110918 0.093513 141.1309 <.0001 nHits 1 -0.004750 0.001497 10.0700 0.0015 nRuns 1 0.007781 0.002272 11.7265 0.0006 nRBI 1 0.024132 0.001756 188.8612 <.0001 yrMajor 1 0.024209 0.012201 3.9369 0.0472 crHome 1 0.004504 0.000875 26.4625 <.0001 crRbi 1 -0.001383 0.000326 17.9730 <.0001 nAssts 1 -0.000636 0.000191 11.1047 0.0009 Dispersion 1 0.066758 0.014310 . . Parameter Bayesian Analysis While the founding principle of Bayesian analysis goes back to the theorem developed by Sir Thomas Bayes in 1763, modern Bayesian analysis didn’t catch fire until the computing advances of the 20th century. Bayesian analysis is based on the idea that inferences are based on measures of uncertainty through probability distributions, and prior knowledge can impact the inferences. In Bayesian analysis, parameters are random and you must estimate their posterior distributions. Summary statistics produced are posterior modes and credible intervals as opposed to the point estimates and confidence intervals in frequentist statistical analysis. The Bayesian framework provides a straightforward framework for addressing scientific questions. For example, you can estimate the probability of an interval containing a value, versus the in-the-long run definition of a confidence interval which confuses so many clients. Thus, the framework is attractive for all types of situations, not just those in which you have prior information. While every Bayesian analysis incorporates a prior distribution, you can have noninformative prior distributions which do not influence the likelihood but do allow you to perform the Bayesian analysis. While there are analytic solutions for a few analyses, usually a closed form does not exist and simulation methods are required, such as Markov chain Monte Carlo methods. SAS provides two avenues to Bayesian analysis: built-in Bayesian analysis in certain modeling procedures and the MCMC procedure for general-purpose modeling. Adding the BAYES statement generates Bayesian 8 SAS Global Forum 2013 Statistics and Data Analysis analyses without the need to program priors and likelihoods for the GENMOD, PHREG, LIFEREG, and FMM procedures. Thus, you can obtain Bayesian results for: • standard regression • Poisson regression • logistic regression • loglinear models • accelerated failure time models • Cox proportional models • piecewise exponential models • frailty models • finite mixture models These procedures are ideal for users beginning to use Bayesian methods and will suffice for many analysis objectives. The MCMC procedure is a general-purpose procedure for fitting Bayesian models. It uses a variety of sampling algorithms to draw from the posterior distribution. It produces the same convergence diagnostics and posterior summaries that you would find by using the BAYES statement in the modeling procedures. However, the MCMC procedure allows any likelihood, prior, or hyperprior that can be programmed with the SAS language. It supports multivariate distributions as well. If you are familiar with the NLMIXED procedure, you are familiar with the type of programming statements that the MCMC procedure requires. The RANDOM statement in the MCMC procedure facilitates the specification of random effects in linear or nonlinear models. You can build nested or nonnested hierarchical models to artbitrary depth. Using the RANDOM statement can result in reduced simulation time and improved convergence for models that have a large number of subjects. The MCMC procedure also handles missing data for both responses and covariates. See papers by Fang Chen in the online conference proceedings for additional information. Richer Linear Models Hidden in the SAS/STAT 9.22 release (which had stealth qualities in its own right) is the new PLM procedure. This new procedure was the end result of a re-architecturing effort which put all of the post-fitting analysis statements–CONTRAST, LSMEANS, LSMESTIMATES,ESTIMATE–into a common framework. This meant that: • 30 postfitting statements added to existing procedures • faster new procedure development • more efficient maintenance • new features get to all relevant procedures faster In addition, the PLM procedure performs postfitting inference with model fit information saved from a number of SAS/STAT modeling procedures. These procedures are equipped with the new STORE statement, which saves model information as a SAS item store. An item store is a special SAS binary file that is used to store and re-store information that has a hierarchical structure. Ten SAS/STAT procedures now provide the STORE statement: GENMOD, GLIMMIX, GLM, LOGISTIC, MIXED, ORTHOREG, PHREG, SURVEYLOGISTIC, SURVEYPHREG, and SURVEYREG. The PLM procedure takes these item stores as input and performs tasks such as testing hypotheses, producing effect plots, and scoring a new data set. These tasks are specified through the usual complement of postfitting statements such as the TEST, LSMEANS, and new EFFECTPLOT and SCORE statements. 9 SAS Global Forum 2013 Statistics and Data Analysis Any procedure that offers the STORE statement can produce the item stores that are necessary for postfitting processing with the PLM procedure. This allows you to perform additional postfitting inference at a later time without having to refit your model, which is especially convenient for those models that are computationally expensive. In addition, with growing concerns for data confidentiality, storing and using intermediate results for remaining analyses might become a requirement in some organizations. The EFFECT statement provides a convenient way to add additional modeling terms to your linear model. These effects can include classification beyond simple grouping (multimember and lag effects), continuous modeling beyond simple polynomials (polynomial and spline effects), and general terms that you define yourself (collection effects). These constructed effects are viewed as any other model effect, which means you can use them in any postfitting analysis that is based on your model. The EFFECT statement also works well together with the new EFFECTPLOT statement, which provides a way to visualize the impact of the effects on the response variable. The EFFECT statement is available with a number of SAS/STAT linear modeling procedures. Analyzing Time-To-Event Data Data that measure time until an event, also known as lifetime, failure time, or survival data, occur frequently. They are often seen in clinical trials for medical treatments, such as survival time for heart transplant patients, but they also occur in many other settings; consider the lifetime of pedometers or the time until a new real estate agent makes his first sale. Time-to-event data require special attention. Not only are you measuring the failure time as a dependent variable, and possibly some covariates so that you can form a statistical model, but you often have to deal with censoring as well. Censoring occurs when the event of interest hasn’t occurred by the time data collection ends. It may happen because patients in a study withdraw, or drop out, before the study concludes. Or the experiment may simply be terminated before the event has occurred for some experimental units. In any event, you only know the lower bound of the failure time; and the observations are right-censored. You can have data that are left-censored, or only known to be smaller than a given value, and you can also have interval-censored data, or failure times that are only known to fall within a certain interval. You have to take censoring into account because, for example, in the clinical trial, the longer-lived individuals are more likely to be right-censored. SAS/STAT has provided tools for the analysis of time-to-event data for years. They include the LIFETEST procedure for estimating the survivor function and comparing the survival curves of various groups. In addition, the LIFEREG procedure provides parametric regression methods for modeling the distribution of survival times with a set of covariates, and the PHREG procedure provides proportional hazards regression (Cox regression). With recent releases, the survival analysis tools have been extended in a number of ways, including • SURVEYPHREG procedure provides Cox regression for data collected from a complex survey • Bayesian methods are available with the BAYES statement in the PHREG and LIFEREG procedures • QUANTLIFE procedure provides quantile regression for right-censored data • macro for interval censoring • piecewise exponential regression available via PROC PHREG • frailty analysis available via PROC PHREG The SURVEYPHREG procedure is useful for data collected in large government surveys; for example, you might use Cox regression to model the time until the onset of depression in a national mental health care survey with a number of socioeconomic covariates. Bayesian methods are widely used in survival data analysis, and the BAYES statement had been added to the PHREG and LIFEREG procedures so that you can now perform Bayesian analysis for Cox regression and accelerated lifetime models. Note that the PIECEWISE= option has been added to the BAYES statement in the PHREG procedure so that you can now perform piecewise exponential modeling easily with SAS (both the Bayesian analysis and the maximum likelihood analysis are produced). 10 SAS Global Forum 2013 Statistics and Data Analysis When experimental units are clustered, the failure times of those units within a cluster tend to be correlated. You need to account for the within-cluster correlation, and one way of doing that is the shared frailty model, in which the cluster effects are incorporated in the model as normally distributed random variables. Stokes, Chen, and So (2011) describe the new PHREG functionality to fit shared frailty models via the specification of a RANDOM statement in the SAS/STAT 9.3 release. The penalized partial likelihood approach is used, and that first implementation assumed that the frailties were distributed as lognormal. With SAS/STAT 12.1, the frailties can also be assumed to be distributed as gamma. PROC PHREG also provides Bayesian frailty analysis. Competing risks develop when subjects are exposed to more than one cause of failure: for example, the cause of death in a bone marrow transplant could be relapse, death during remission, or death due to another cause. In that case, the cumulative incidence function is more appropriate than the standard Kaplan-Meier method of survival analysis. The SAS macro %CIF implements nonparametric methods for implementing this method and also provides Gray’s method for testing differences between these functions in multiple groups. See Lin et al. (2012) for more information about this method and this macro. Quantile regression provides an alternative and flexible technique for the analysis of survival data. You can apply this technique to right-censored responses, which allows you to explore the covariate effects on the quantiles of interest. Two approaches are implemented: one is based on the idea of the Kaplan-Meier estimator, and the other is based on the Nelson-Aalen estimator of the cumulative hazard function. The new QUANTLIFE procedure provides interior point algorithms for estimation, Wald tests for the parameter estimates, survival plots, conditional quantile plots, and quantile process plots. It also supports the EFFECT statement so that it can fit regression quantile spline curves, and it is multithreaded to take advantage of multiple processors when they are available. Consider a study of primary biliary cirrhosis disease discussed in Lin, Wei, and Ying (1993). Prognostic factors studied included age, edema, bilirubin, albumin, and prothrombin. Researchers followed 418 patients between 1974 and 1984. The patients had a median follow-up time of 4.74 years and a censoring rate of 61.5%. The following SAS statements create the SAS data set PBC: data pbc; input Time Status Age Albumin Bilirubin Edema Protime @@; label Time="Follow-up Time in Days"; logAlbumin =log(Albumin); logBilirubin =log(Bilirubin); logProtime =log(Protime); datalines; 400 1 58.7652 2.60 14.5 1.0 12.2 4500 0 56.4463 4.14 1.1 0.0 1012 1 70.0726 3.48 1.4 0.5 12.0 1925 1 54.7406 2.54 1.8 0.5 1504 0 38.1054 3.53 3.4 0.0 10.9 2503 1 66.2587 3.98 0.8 0.0 1832 0 55.5346 4.09 1.0 0.0 9.7 2466 1 53.0568 4.00 0.3 0.0 2400 1 42.5079 3.08 3.2 0.0 11.0 51 1 70.5599 2.74 12.6 1.0 3762 1 53.7139 4.16 1.4 0.0 12.0 304 1 59.1376 3.52 3.6 0.0 ... ... 10.6 10.3 11.0 11.0 11.5 13.6 The syntax for the MODEL statement for the QUANTLIFE procedure is similar to that used in other SAS survival procedures. The LOG option requests that the log response values be analyzed, the METHOD=NA option specifies the Nelson-Aalen method, and the PLOT=(QUANTPLOT SURVIVAL QUANTILE) option requests the estimated parameter by quantiles plot, the survival plot, and the predicted quantiles plot. The QUANTILE=(.1 .4 .5 .85) option requests that those quantiles be modeled. ods graphics on; proc quantlife data=pbc LOG method=na plot=(quantplot survival quantile) seed=1268; model Time*Status(0)=logBilirubin logProtime logAlbumin Age Edema /quantile=(.1 .4 .5 .85); run; ods graphics off; Figure 10 contains the parameter estimates. Each of the requested quantiles has its own set of parameter estimates. The confidence limits are computed by resampling methods. 11 SAS Global Forum 2013 Statistics and Data Analysis Figure 10 Parameter Estimates Parameter Estimates Quantile 0.1000 0.4000 0.5000 0.8500 DF Estimate Standard Error Intercept 1 14.8030 4.0967 6.7736 logBilirubin 1 -0.4488 0.1485 logProtime 1 -3.6378 logAlbumin 1 Age Parameter 95% Confidence Limits t Value Pr > |t| 22.8325 3.61 0.0003 -0.7398 -0.1578 -3.02 0.0027 1.4560 -6.4915 -0.7841 -2.50 0.0129 1.9286 0.9756 0.0165 3.8408 1.98 0.0487 1 -0.0244 0.0107 -0.0455 -0.00334 -2.27 0.0237 Edema 1 -1.0712 0.6688 -2.3820 0.2396 -1.60 0.1100 Intercept 1 13.4716 3.0874 7.4204 19.5228 4.36 <.0001 logBilirubin 1 -0.6047 0.0846 -0.7705 -0.4389 -7.15 <.0001 logProtime 1 -2.1632 1.1726 -4.4615 0.1351 -1.84 0.0658 logAlbumin 1 0.9819 0.7191 -0.4274 2.3912 1.37 0.1728 Age 1 -0.0255 0.00681 -0.0389 -0.0122 -3.74 0.0002 Edema 1 -1.0589 0.3104 -1.6672 -0.4506 -3.41 0.0007 Intercept 1 10.9205 2.8047 5.4235 16.4175 3.89 0.0001 logBilirubin 1 -0.5315 0.0904 -0.7087 -0.3543 -5.88 <.0001 logProtime 1 -1.2222 1.2142 -3.6020 1.1577 -1.01 0.3148 logAlbumin 1 1.5700 0.6284 0.3383 2.8016 2.50 0.0129 Age 1 -0.0318 0.00883 -0.0491 -0.0145 -3.60 0.0004 Edema 1 -0.7316 0.3743 -1.4653 0.00202 -1.95 0.0513 Intercept 1 11.0778 . . . . . logBilirubin 1 -0.5615 . . . . . logProtime 1 -1.2711 . . . . . logAlbumin 1 1.4471 . . . . . Age 1 -0.0144 . . . . . Edema 1 -0.4339 . . . . . This behavior of the covariate coefficients is illustrated in the plot in Figure 11. This is a scatter plot of the estimated regression parameter against the quantiles. In the plot for logPROTIME, the parameter estimate grows smaller from its value of –3.6456 for the 0.1 quantile and levels off around –1.0 for the 0.5 and higher quantiles. 12 SAS Global Forum 2013 Statistics and Data Analysis Figure 11 Estimated Parameter by Quantiles Plot See Lin and Rodriguez (2013) for more information about the QUANTLIFE procedure. Missing Data Methods While most of the data sets analyzed in classroom settings (or displayed in SAS documentation) are chock full of data, practicing statisticians rarely find this situation in the wild. Standard practice in software packages is to analyze complete case data by default, also known as casewise deletion, and SAS follows this practice. However, unless the missing observations magically reflect a random sample from the complete observations, the resulting inference is almost sure to be biased. Thus, managing missing data is an important aspect of statistical analysis. One strategy for handling missing data is to impute the missing value, or substitute a value, and then analyze the data as if they were complete. Single imputation substitutes a single value, but the resulting results are biased toward zero since they don’t reflect the uncertainty about the predictions of the unknown missing values (Rubin 1987). Multiple imputation does incorporate that uncertainty by replacing each missing value with a set of plausible values. You generate m complete data sets with m replacement values, and then you combine the results. This produces unbiased results. The MI procedure in SAS/STAT produces multiply imputed data sets for incomplete multivariate data sets. The imputation method depends on the pattern of missingness and the type of the imputed variable. You then use standard SAS procedures to analyze the m imputed data sets, and you use the MIANALYZE procedure to combine the results and generate valid statistical inference. The pattern of missing data determines the imputation method. There are many choices when you have monotone missing data: for missing continuous variables, you can use a regression method, predictive mean matching, or a propensity scoring method. For categorical missing variables, you can apply a logistic regression method or a discriminant function method. When you have arbitrary missing data patterns, you can use an MCMC method that assumes multivariate normality or a fully functional specification method (FCS) that assumes the existence of a joint distribution for all variables. The FCS method is a recent addition to the MI procedure, and it offers additional flexibility. SAS/STAT offers other techniques for managing missing data. The CALIS procedure for fitting structural equations model now provides full information maximum likelihood (FIML), which is an estimation method that uses information from both the incomplete and the complete observations. Besides FIML estimation, PROC CALIS also provides features for analyzing data coverage and missing patterns. Since structural equations modeling includes measurement error models, regression models, and factor analyses as subset 13 SAS Global Forum 2013 Statistics and Data Analysis analyses, the FIML method it includes has a wide range of applications. See Yung and Zhang (2011) for more information. One of the benefits of the Bayesian approach is that it can handle missing data in a straightforward manner. It treats the missing values as unknown parameters and estimates their posterior distributions. It is a model-based solution, and the additional parameters don’t add that much additional complexity to the analysis; they are simply sampled sequentially in the MCMC simulation. This approach does take into account the uncertainty about the missing values so you can estimate the posterior marginal distributions of the parameters of interest conditional on the observed and partially observed data. The MCMC procedure automatically samples all missing values and incorporates them in the Markov chain for the parameters. You can use PROC MCMC to handle various types of missing data, including data that are missing at random (MAR) and missing not at random (MNAR). PROC MCMC can also perform joint modeling of missing responses and covariates. See Chen (2013) for more information. Managing missing data continues to be an important area of research and application, and providing additional techniques is a major focus of current SAS/STAT research and development. High-Performance Computing SAS/STAT software has taken advantage of multithreading algorithms to improve performance in several procedures, including the REG, GLM, LOESS, and GLMSELECT procedures, for years. See Cohen (2002) for information on how this works and how the data configuration affects the resulting computing performance. Recent new procedures in SAS/STAT come equipped with multithreading if it would benefit their performance. These procedures include the FMM, QUANTSELECT, QUANTLIFE, and ADAPTIVEREG procedures. Most recently, SAS has focused on meeting the challenges of Big Data with more attention to highperformance computing. New software products designed specifically for distributed computing include the high-performance analytical products, which operate on data stored in databases such as Teradata, Greenplum, and Hadoop and uses multiple parallel processing techniques across a grid of servers. New statistical procedures were designed for this software, including those that perform logistic regression, linear regression, mixed models, and model selection for both linear models and generalized linear models. Beginning with Base SAS 9.4/SAS/STAT 12.3, released in the summer of 2013, the SAS/STAT product includes the high-performance procedures for use in single-machine mode. These procedures were designed to provide specific functionality required for Big Data analysis in a distributed environment, such as predictive modeling. Not all features of the traditional procedure are included or are relevant. The high-performance procedures are evolving and additional functionality such as graphics and BY-group processing will surface in later releases. However, these procedures may provide benefit for the typical SAS/STAT user: • New functionality is included in some of these procedures. For example, model selection for generalized linear models is available with the HPGENSELECT procedure. • Depending on the characteristics of the data and the complexity of the analysis, users may find performance gains in single-machine mode with these procedures. However, note that if you compare a high-performance procedure with a traditional procedure that is multithreaded (for example, PROC HPREG compared to PROC REG), you are unlikely to see performance differences in the singlemachine mode of execution. • Users with Big Data who would benefit from using the high-performance analytics products in a distributed environment can exercise the procedures in single-machine mode and assess their functionality. When the customer needs to process Big Data (more observations and larger numbers of variables), she can license the high-performance analytics products and execute the same procedures in a distributed, in-memory environment. The high-performance procedures are in the first stages of development. While production software, they will be enhanced as time goes on and will include some additional features, such as ODS graphics and BY-group processing. These procedures are documented in SAS/STAT 12.3 User’s Guide: High-Performance Procedures. See Cohen (2013) for more detail. 14 SAS Global Forum 2013 Statistics and Data Analysis Complex Data Data today are increasingly complex, or perhaps that complexity is being acknowledged because there are more sophisticated methods available to analyze them. A major growth area for SAS/STAT has always been statistical techniques that address data complexity. Recently, additional methods for handling complex data include quantile regression, finite mixture models, and adaptive regression splines. As previously discussed, quantile regression extends the basic regression model to the relationship between the conditional quantiles of a response variable with one or more covariates. It makes no distributional assumptions about the error term, and so it offers model robustness. It is a semiparametric method that can provide a more complete picture of your data based on these conditional distributions. Linear programming algorithms are used to produce the quantile regression estimates. See Koenker (2005) for further detail. SAS/STAT provides the QUANTREG procedures for quantile regression analysis, the QUANTSELECT procedure for model selection for quantile regression, and the QUANTLIFE procedure for right-censored data. New ADAPTIVEREG Procedure SAS/STAT software provides various tools for nonparametric regression, including the LOESS, TPSPLINE, and GAM procedures. Typical nonparametric regression methods involve a large number of parameters to capture nonlinear trends, so the model space is fairly large. The sparsity of data in high dimensions is another issue, often resulting in slow convergence or even failure for many nonparametric regression methods. The LOESS and TPSPLINE procedures are limited to problems in low dimensions. The GAM procedure fits generalized additive models with the assumption of additivity. It can handle data sets, but the computation time for its local scoring algorithm (Hastie and Tibshirani 1990) to converge increases quickly with the size of the data set. The new ADAPTIVEREG procedure provides a nonparametric modeling approach for high-dimensional data. PROC ADAPTIVEREG fits multivariate adaptive regression splines as introduced by Friedman (1991b). The method is a nonparametric regression technique that combines both regression splines and model selection methods. It does not assume parametric model forms, and it does not require knot values for constructing regression spline terms. Instead, it constructs spline basis functions in an adaptive way by automatically selecting appropriate knot values for different variables; it performs model reduction by applying model selection techniques. Thus, the ADAPTIVEREG procedure is both a nonparametric regression procedure and a predictive modeling procedure. The ADAPTIVEREG procedure: • supports classification variables with different ordering options • enables you to force effects into the final model or restrict variables in linear forms • supports options for fast forward selection • supports partitioning of data into training, validation, and testing roles • provides leave-one-out and k-fold cross validation • produces graphical representations of the selection process, model fit, functional components and fit diagnostics For more detail, see Kuhfeld and Cai (2013). The following example illustrates the use of the ADAPTIVEREG procedure. Researchers collected data on city-cycle fuel efficiency and automobile characteristics for 361 vehicle models manufactured from 1970 to 1982. The data can be downloaded from the UCI Machine Learning Repository (Asuncion and Newman 2007). The following DATA step creates the data set AUTOMPG: 15 SAS Global Forum 2013 Statistics and Data Analysis title 'Automobile MPG study'; data autompg; input mpg cylinders displacement horsepower weight acceleration year origin name $35.; datalines; 18.0 8 307.0 130.0 3504 12.0 70 1 chevrolet chevelle malibu 15.0 8 350.0 165.0 3693 11.5 70 1 buick skylark 320 18.0 8 318.0 150.0 3436 11.0 70 1 plymouth satellite 16.0 8 304.0 150.0 3433 12.0 70 1 amc rebel sst 17.0 8 302.0 140.0 3449 10.5 70 1 ford torino ... ... ; There are nine variables in the data set. The response variable MPG is city-cycle mileage per gallon (mpg). Seven predictor variables (number of cylinders, displacement, weight, acceleration, horsepower, year and origin) are created. The variables for number of cylinders, year, and origin are categorical. The dependency of vehicle fuel efficiency on these factors might be nonlinear. Dependency structures within the predictors might also mean that some of the predictors are redundant. For example, a model with more cylinders is likely to have more horsepower. The object of this analysis is to explore the nonlinear dependency structure and to find a parsimonious model that does not overfit the data. A more parsimonious model has better predictive ability. The following PROC ADAPTIVEREG statements fit an additive model with linear spline terms of continuous predictors. The variable transformations and the model selection based on the transformed terms are performed in an adaptive and automatic way. If the ADDITIVE option is not supplied, PROC ADAPTIVEREG will fit a model with both main effects and two-way interaction terms. ods graphics on; proc adaptivereg data=autompg plots=all; class cylinders year origin; model mpg = cylinders displacement horsepower weight acceleration year origin / additive; run; ods graphics off; Figure 12 displays information on how the bases are constructed. 16 SAS Global Forum 2013 Statistics and Data Analysis Figure 12 Bases Automobile MPG Study Basis Information Name Transformation Basis0 None Basis1 Basis0*MAX(Weight - Basis2 Basis0*MAX( Basis3 Basis0*NOT(MISSING(HorsePower)) Basis4 Basis0*MISSING(HorsePower) Basis5 Basis3*MAX(HorsePower - Basis6 Basis3*MAX( Basis7 Basis3*(Year = 80 OR Year = 82 OR Year = 81 OR Year = 79 OR Year = 78 OR Year = 77 OR Year = 73) Basis8 Basis3*NOT(Year = 80 OR Year = 82 OR Year = 81 OR Year = 79 OR Year = 78 OR Year = 77 OR Year = 73) Basis9 Basis0*MAX(Acceleration - Basis10 Basis0*MAX( Basis11 Basis0*(Cylinders = 3 OR Cylinders = 6) Basis12 Basis0*NOT(Cylinders = 3 OR Cylinders = 6) Basis13 Basis4*(Origin = 1) Basis14 Basis4*NOT(Origin = 1) Basis15 Basis0*(Origin = 3) Basis16 Basis0*NOT(Origin = 3) Basis17 Basis0*(Cylinders = 6) Basis18 Basis0*NOT(Cylinders = 6) Basis19 Basis0*(Year = 73 OR Year = 80 OR Year = 82 OR Year = 81 OR Year = 79) Basis20 Basis0*NOT(Year = 73 OR Year = 80 OR Year = 82 OR Year = 81 OR Year = 79) 3139,0) 3139 - Weight,0) 158,0) 158 - HorsePower,0) 21,0) 21 - Acceleration,0) The “Parameter Estimates” table in Figure 13 displays parameter estimates for constructed basis functions in addition to each function’s construction component. For example, BASIS1 has an estimate of –0.003242. It is constructed from a parent basis function BASIS0 (intercept) and a linear spline function of WEIGHT with a single knot placed at 3139. BASIS3 is constructed from a parent basis function BASIS0 and an indicator function of YEAR. The indicator is set to 1 when a class level of YEAR falls into the subset of levels listed in the “Levels” column and set to 0 otherwise. 17 SAS Global Forum 2013 Statistics and Data Analysis Figure 13 Parameter Estimates Regression Spline Model after Backward Selection Name Coefficient Parent Variable Knot Levels Basis0 29.4394 Intercept Basis2 0.004412 Basis0 Weight Basis3 -21.2899 Basis0 HorsePower . Basis6 0.1534 Basis3 HorsePower 158.00 Basis7 2.3920 Basis3 Year Basis9 1.6658 Basis0 Acceleration 21.0000 Basis10 0.4672 Basis0 Acceleration 21.0000 Basis11 -8.1766 Basis0 Cylinders 03 Basis13 -10.0976 Basis4 Origin 0 Basis15 2.1354 Basis0 Origin 2 Basis17 6.7675 Basis0 Cylinders 3 Basis19 1.4987 Basis0 Year 3 10 12 11 9 3139.00 10 12 11 9 8 7 3 During the model construction and selection process, some basis function terms are removed. Variable importance is another criterion that focuses on the contribution of each individual. Variable importance is defined to be the square root of the GCV value of a submodel with all basis functions that involve a removed variable, minus the square root of the GCV value of the selected model, then scaled to have the largest importance value of 100. Figure 14 lists importance values for four variables that comprise the selected model. Similar to the ANOVA decomposition results, WEIGHT and YEAR are two dominant factors that determine vehicle mpg values, while DISPLACEMENT and ACCELERATION are less important. Figure 14 Variable Importance Variable Importance Number of Bases Importance HorsePower 1 100.00 Year 2 85.46 Weight 1 21.10 Cylinders 2 19.08 Origin 2 18.67 Acceleration 2 16.38 Variable The component panel in Figure 15 displays the fitted functional components against their forming variables. When a vehicle model’s displacement is less than 85, its mpg value increases with its displacement. The displacement does not matter much once it exceeds 85. The shape of the functional component strongly suggests a logarithm transformation. The component of WEIGHT shows that vehicle weight has negative impact on its mpg value. The trend suggests a possible reciprocal transformation. When a model’s acceleration value is larger than 20.7, it affects the mpg value in a positive manner. It does not matter much if it is less than 20.7. Although YEAR is treated as a classification variable, its values are ordinal. The general 18 SAS Global Forum 2013 Statistics and Data Analysis trend is quite clear: more recent models tend to have higher mpg values. Automobile companies apparently paid more attention to improving vehicle fuel efficiency after 1976. Figure 15 Component Panel Finite Mixture Models Another type of complexity occurs when data can be viewed as coming from a mixture of different distributions. Finite mixture models enable you to fit statistical models to data when the distribution of the response is a finite mixture of univariate distributions. These models are useful for applications such as estimating multimodal or heavy-tailed densities, fitting zero-inflated or hurdle models to count data with excess zeros, modeling overdispersed data, and fitting regression models with complex error distributions. Many wellknown statistical models for dealing with overdispersed data are members of the finite mixture model family (for example, zero-inflated Poisson models and zero-inflated negative binomial models.) PROC FMM performs maximum likelihood estimation for all models, and it provides Markov chain Monte Carlo estimation for many models, including zero-inflated Poisson models. The procedure includes many built-in link and distribution functions, including the beta, shifted, Weibull, beta-binomial, and generalized Poisson distributions, as well as standard members of the exponential family of distributions. In addition, several specialized built-in mixture models are provided, such as the binomial cluster model (Morel and Nagaraj, 1993). The results of a finite mixture models analysis is displayed in Figure 16. The FMM procedure was used to fit a three-component model–two normal components and a Weibull component– to log feeding time for cattle. The plot shows the observed and estimated distributions for the response. 19 SAS Global Forum 2013 Statistics and Data Analysis Figure 16 Density Plot Summary Of course, recent SAS/STAT releases include numerous other updates. The STDRATE procedure computes direct and indirect standardized rates and proportions, measures key in epidemiology. The survey data analysis procedures continue to be a major development focus, with post-stratification estimation now available with the SURVEYMEANS procedure and Poisson sampling included in PROC SURVEYSELECT. Other new features include: • WEIGHT statement in PROC LIFETEST • partial R-square for relative importance of parameters in PROC LOGISTIC • Miettinen-Nurminen confidence limits for the difference of proportions in PROC FREQ • group sequential design with nonbinding acceptance boundary in the SEQDESIGN and SEQTEST procedures • REF= option added to the CLASS statement for GLM, MIXED, GLIMMIX, and ORTHOREG procedures SAS/STAT software provides its customers with a rich array of current statistical techniques. Released every 12-18 months, it provides modern statistical methodology via software that is geared towards user expectations and shaped for today’s data. For Further Information A good place to start for further information is the “What’s New in SAS/STAT 12.1” chapter in the online documentation. In addition, the Statistics and Operations Focus Area includes substantial information about the statistical products, and you can find it at support.sas.com/statistics/. The quarterly e-newsletter for that site is available on its home page. And of course, complete information is available in the online documentation located here: support.sas.com/documentation/onlinedoc/stat/. 20 SAS Global Forum 2013 Statistics and Data Analysis References Asuncion, A. and Newman, D. J. (2007), “UCI Machine Learning Repository,” Available at archive.ics.uci.edu/ml/. http:// Bellman, R. E. (1961), Adaptive Control Processes, Princeton University Press. Breiman, L., Friedman, J., Olshen, R. A., and Stone, C. J. (1984), Classification and Regression Trees, Wadsworth. Buja, A., Duffy, D., Hastie, T., and Tibshirani, R. (1991), “Discussion: Multivariate Adaptive Regression Splines,” The Annals of Statistics, 19, 93–99. Chen, F. (2009), “Bayesian Modeling Using the MCMC Procedure,” in Proceedings of the SAS Global Forum 2008 Conference, Cary NC: SAS Institute Inc. Available at http://support.sas.com/resources/ papers/proceedings09/257-2009.pdf. Chen, F. (2011), “The RANDOM Statement and More: Moving on with PROC MCMC,” in Proceedings of the SAS Global Forum 2011 Conference, Cary NC: SAS Institute Inc. Available at http://support.sas. com/resources/papers/proceedings11/334-2011.pdf. Collier Books (1987), “The 1978 Baseball Encyclopedia Update,” New York: Macmillan. Cohen, R. (2002),“SAS® Meets Big Iron: High Performance Computing in SAS Analytic Procedures,” in Proceedings of the SAS Users Group International Conference, Cary NC: SAS Institute Inc. Cohen, R. and Rodriguez, R. (2013) “High Performance Statistical Modeling,” Available at http://support. sas.com/statistics/papers/ Derr, R. (2013) “Ordinal Response Modeling with the LOGISTIC Procedure,” Proceedings of the SAS Global Forum 2013 Conference, Cary, NC: SAS Institute Inc. Available at http:/support.sas.com/ resources/papers/proceedings13/446-2013.pdf. Friedman, J. (1991a), “Estimating Functions of Mixed Ordinal and Categorical Variables Using Adaptive Splines,” Technical report, Stanford University. Friedman, J. (1991b), “Multivariate Adaptive Regression Splines,” The Annals of Statistics, 19, 1–141. Friedman, J. (1993), “Fast MARS,” Technical report, Stanford University. Florida Department of Health, “Florida Vital Statistics Annual Report 2000.” Available at http://www. flpublichealth.com/VSBOOK/pdf/2000/Population.pdf. Accessed February 2012. Gamerman, D. (1997), “Sampling from the Posterior Distribution in Generalized Linear Mixed Models,” Statistics and Computing, 7, 57–68. Gibbs, P., Tobias, R., Kiernan, K., and Tao, J. 2013) “Having an EFFECT: More General Linear Modeling and Analysis with the New EFFECT Statement in SAS/STAT® Software,” Proceedings of the SAS Global Forum 2013 Conference, Cary, NC: SAS Institute Inc. Available at http:/support.sas.com/resources/ papers/proceedings13/437-2013.pdf. Hastie, T. J. and Tibshirani, R. J. (1990), Generalized Additive Models, New York: Chapman & Hall. Koenker, R. (2005), Quantile Regression, New York: Cambridge University Press. Kuhfeld, W., and Cai, W. (2013) “Introducing the New ADAPTIVEREG Procedure for Adaptive Regression,” Proceedings of the SAS Global Forum 2013 Conference, Cary, NC: SAS Institute Inc. Available at http: /support.sas.com/resources/papers/proceedings13/457-2013.pdf. Lin, G., So, Y., and Johnston, G. (2012) “Analyzing Survival Data with Competing Risks Using SAS Software,” Proceedings of the SAS Global Forum 2012 Conference, Cary, NC: SAS Institute Inc. Available at http:/support.sas.com/resources/papers/proceedings11/344-2012.pdf. Lin, G. (2013) “Using the QUANTLIFE Procedure for Survival Analysis,” Proceedings of the SAS Global Forum 2013 Conference, Cary, NC: SAS Institute Inc. Available at http:/support.sas.com/resources/ 21 SAS Global Forum 2013 Statistics and Data Analysis papers/proceedings11/421-2013.pdf. Lin, D. Y., Wei, L. J., and Ying, Z. (1993), “Checking the Cox Model with Cumulative Sums of Martingale-Based Residuals,” Biometrika, 80, 557–572. Little, R.J.A. and Rubin, D.B. (2002), Statistical Analysis with Missing Data, Second Edition, New York: John Wiley & Sons. Morel, J. G., and Nagaraj, N. K. (1993), “A Finite Mixture Distribution for Modelling Multinomial Extra Variation,”Biometrika, 80, 363–371. Peng L. and Huang Y. (2008), “Survival Analysis with Quantile Regression Models,” Journal of the American Statistical Association, 103, 637–649 Portnoy S. (2003). “Censored Quantile Regression,”" Journal of American Statistical Association, 98, 1001–1012. Silvapulle, M. J. and Sen, P. K. (2004), Constrained Statistical Inference: Order, Inequality, and Shape Constraints, New York: John Wiley & Sons. Stokes, M., Rodriguez, R. and Cohen, R. (2010), “SAS/STAT 9.22: The Next Generation,” in Proceedings of the SAS Global Forum 2011 Conference, Cary NC: SAS Institute Inc. Available at http://support. sas.com/resources/papers/proceedings10/264-2010.pdf. Stokes, M., Chen, F., and So, Y. (2011), “On Deck: SAS/STAT 9.3,” Proceedings of the SAS Global Forum 2011 Conference, Cary, NC: SAS Institute Inc. Available at http:/support.sas.com/resources/ papers/proceedings11/331-2011.pdf. Stokes, M., (2012), “Look Out: After SAS/STAT® 9.3 Comes SAS/STAT 12.1!,” Proceedings of the SAS Global Forum 2012 Conference, Cary, NC: SAS Institute Inc. Available at http:/support.sas.com/ resources/papers/proceedings12/313-2012.pdf. Yang, Y. (2013) “Computing Direct and Indirect Standardized Rates and Risks with the STDRATE Procedure,” Proceedings of the SAS Global Forum 2013 Conference, Cary, NC: SAS Institute Inc. Available at http: /support.sas.com/resources/papers/proceedings11/423-2013.pdf. U.S. Bureau of Census (2011), “Age and Sex Composition: 2010.” Available at http://www.census. gov/prod/cen2010/briefs/c2010br-03.pdf. Accessed February 2012. Acknowledgments The authors are grateful to Fang Chen for his contributions to the manuscript. Contact Information Your comments and questions are valued and encouraged. Contact the author: Maura Stokes SAS Institute Inc. SAS Campus Drive Cary, NC 27513 [email protected] Version 1.0 SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. 22

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Download PDF

advertisement