Recent Advances in Categorical Data Analysis Maura E. Stokes SAS Institute Inc.
Statistics, Data Analysis, and Modeling Paper 268 Recent Advances in Categorical Data Analysis Maura E. Stokes SAS Institute Inc. Cary, North Carolina, USA Abstract The last fifteen years have brought many changes to the practice of categorical data analysis. This paper reviews some of the major changes and shifts of emphasis and discusses several examples using SAS software procedures. Topics include the use of exact methods, Generalized Estimating Equations, conditional logistic regression, and current uses of weighted least squares modeling. Applications provide illustrations for many topics. This paper describes software currently available in the SAS System and indicates the areas where users can expect to find strategies implemented in the next few releases. The references include some pertinent methodology and review papers. Introduction Fifteen years ago, the SUGI paper describing the categorical data analysis enhancements in the upcoming Version 5 release described the new CATMOD procedure that would replace PROC FUNCAT and also described how the FREQ procedure would now include the Mantel-Haenszel nonparametric statistics then present in PROC TFREQ. The paper then continued with a discussion directed at weighted least squares methods for statistical modeling and MantelHaenszel methods for testing association in a contingency table or sets of contingency tables. modeling repeated measurements data that can include continuous covariates, missing data, and timedependent covariates. Highly stratified data can be handled with conditional logistic methods, which also provide a way of analyzing crossover data. When asymptotic assumptions are not appropriate due to sparse data or small sample sizes, exact methods can provide a way to produce appropriate p-values for tests, valid confidence limits for odds ratios, and parameter estimates and standard errors in logistic regression. Weighted least squares, a very important strategy in the 70s and 80s, still provides a useful way of modeling functions of interest such as rank measures of association and incidence densities. What follows are descriptions of these newer strategies and illustrations of their application. Exact p-Values Since that time, new developments in categorical data analysis and new computing strategies have changed the typical strategies employed by the data analyst facing categorical data. Once safely armed with her CATMOD documentation and PROC FREQ, then patiently learning the difference between the supplemental procedure PROC LOGIST contributed by Frank Harrell and PROC LOGISTIC, the analyst now contends with a number of different procedures such as LOGISTIC, GENMOD, and PHREG, and keeps a scorecard of the growing number of exact tests available in the FREQ and NPAR1WAY procedures. Exact p-values provide an alternative strategy when data are sparse, skewed, or unbalanced so that the assumptions required for standard asymptotic tests are violated. Advances in computer performance and developments in network algorithms over the last decade have made exact p-values accessible for a number of statistical tests. In Release 6.11, exact p-values were added for the simple linear rank statistics produced by the NPAR1WAY procedure. In Release 6.12, exact p-values are produced for many of the statistics computed by the FREQ procedure. You are now able to request exact p-values for the following chi-square statistics: Pearson’s chisquare, likelihood-ratio chi-square, Mantel-Haenszel chi-square, Fisher’s exact test and r by c exact test, Jonckheere-Terpstra test, and McNemar’s test. In addition, you can also obtain exact p-values for hypothesis tests that the following statistics are equal to 0: Pearson correlation coefficient, Spearman correlation coefficient, simple kappa statistic, and weighted kappa statistic. Exact confidence bounds are also available for the odds ratios produced for 2 by 2 tables. Mantel-Haenszel strategies are now employed in the analysis of repeated measurements and crossover studies. GEE methods provide a convenient way of In Version 7, a test for the binomial proportion is available along with an exact p-value, and exact pvalues are available for the Pearson chi-square statis- Statistics, Data Analysis, and Modeling tic for two-way tables and the goodness-of-fit statistic for one-way tables. In addition, the Monte Carlo method of computing exact p-values is included in the NPAR1WAY procedure and will be included in the FREQ procedure in Version 8. The following example illustrates the use of the new EXACT statement to produce an exact p-value for the simple kappa statistic. Researchers studied two scoring systems for evaluating fitness in fifth grade students. Forty-three students were classified into one of four fitness categories. Interest lies in determining whether there is agreement between the two scoring systems, which can be assessed by testing whether the kappa coefficient is equal to 0. data fitness; input score1 $ score2 $ count; datalines; poor poor 5 average average 4 good good 4 superior superior 3 poor average 3 average poor 1 average good 6 good average 5 good superior 1 superior average 10 superior good 1 ; TABLE OF SCORE_1 BY SCORE_2 SCORE_1 SCORE_2 Frequency| Percent | Row Pct | Col Pct |average |good |poor |superior| Total ---------+--------+--------+--------+--------+ average | 4 | 6 | 1 | 0 | 11 | 9.30 | 13.95 | 2.33 | 0.00 | 25.58 | 36.36 | 54.55 | 9.09 | 0.00 | | 18.18 | 54.55 | 16.67 | 0.00 | ---------+--------+--------+--------+--------+ good | 5 | 4 | 0 | 1 | 10 | 11.63 | 9.30 | 0.00 | 2.33 | 23.26 | 50.00 | 40.00 | 0.00 | 10.00 | | 22.73 | 36.36 | 0.00 | 25.00 | ---------+--------+--------+--------+--------+ poor | 3 | 0 | 5 | 0 | 8 | 6.98 | 0.00 | 11.63 | 0.00 | 18.60 | 37.50 | 0.00 | 62.50 | 0.00 | | 13.64 | 0.00 | 83.33 | 0.00 | ---------+--------+--------+--------+--------+ superior | 10 | 1 | 0 | 3 | 14 | 23.26 | 2.33 | 0.00 | 6.98 | 32.56 | 71.43 | 7.14 | 0.00 | 21.43 | | 45.45 | 9.09 | 0.00 | 75.00 | ---------+--------+--------+--------+--------+ Total 22 11 6 4 43 51.16 25.58 13.95 9.30 100.00 Figure 1. Exact Test for Simple Kappa The resulting exact p-value for the hypothesis that the simple kappa statistic is equal to 0 is p=0.055, which may be considered to have marginal significance at best. Note the value p=0.038 for the asymptotic test. Using exact p-values for this analysis leads to a very different conclusion than using the asymptotic test. STATISTICS FOR TABLE OF SCORE_1 BY SCORE_2 Statistic = 11.091 To request the exact p-value for the kappa statistic, you specify the keyword KAPPA in the EXACT statement. The AGREE option in the MODEL statement requests the measures of agreement. proc freq; weight count; tables score1 * score2 / agree; exact kappa; run; Kappa = 0.167 Prob = 0.086 Simple Kappa Coefficient -----------------------95% Confidence Bounds ASE = 0.102 -0.032 0.366 Asymptotic P-Values (Right-sided) = 0.019 (Two-sided) = 0.038 Exact P-Values (Right-sided) = 0.034 (Two-sided) = 0.055 Weighted Kappa Coefficient -------------------------95% Confidence Bounds Kappa = 0.100 ASE = 0.100 -0.096 0.297 Sample Size = 43 Figure 2. The following figure displays the contingency table form of the data. Note the number of zero cells, which makes the use of the asymptotic test questionable. Test of Symmetry ---------------DF = 6 Exact Test for Simple Kappa Generalized Estimating Equations Weighted least squares (WLS) modeling of repeated categorical data was described by Landis et al. (1977) and provides a large sample asymptotic method that works nicely for data that have adequate sample size, a small number of response points, a small number of categorical explanatory variables measured at the subject level, and no missing data. The CATMOD procedure introduced the REPEATED statement to provide this analysis. While this method is still useful for data that meet these conditions, most data that are collected with clustered or repeated responses do not. Longitudinal data are usually plagued with missing responses, Statistics, Data Analysis, and Modeling explanatory variables include continuous variables as well as categorical, and there is often interest in timedependent covariates such as blood pressure in a clinical trial. In addition, as you start to increase the number of explanatory variables, you often don’t meet the asymptotic requirements for the WLS approach. Generalized Estimating Equations (GEE) provides a nonlikelihood based approach to modeling repeated or clustered data that applies to a broader set of data situations that are frequently encountered. It handles missing data, continous explantory variables, and time-dependent explanatory variables. While responses can be either continuous or categorical, it is especially useful for data that are binary or discrete counts. An extension of the generalized linear model (GLM) first suggested by Nelder and Wedderburn (1972), the GEE approach was outlined in work by Zeger and Liang (1986) and Liang and Zeger (1986) that describe a quasi-likelihood approach for modeling correlated responses. Besides using the linear predictor set-up of the GLM, you model the covariance matrix of the responses. The GEE approach produces population-averaged estimates. With quasilikelihood, you can pursue statistical models by making assumptions about the link function and the relationship between the first two moments, but without specifying the complete distribution of the response. Say that Yij (j = 1; : : : ; ni ; i = 1; : : : ; K ) represent the jth measurement on the ith subject. There are ni meaPK surements on subject i and i=1 ni total measurements. The generalized estimating equation for estimating is an extension of the GLM estimating equation: K X @ 0 =1 i ,1 @ V (Y , ())= 0 i i is the corresponding vector of means = 1 ; : : : ; ]0 and V is an estimate of the covariance i Y. i ini matrix of Y The covariance matrix of i V = A R ()A i i 1 2 i i is modeled as 1 2 A where i is an ni ni diagonal matrix with v (ij ) as the jth diagonal element. The working correlation matrix r ij = yp , v( ) ij ij ij There are many choices for the working correlation matrix. The independent working correlation matrix includes 1s on the diagonals and 0s on the off-diagonals. Other choices are the unstructured, exchangeable (compound symmetry), autoregressive(1), and m-dependent. Finding the GEE solution requires these steps: Relate the marginal response ij =E(Yij ) to with a link function. For example, the logit. ij Specify the variance function. Choose a working correlation matrix R (). i Compute an initial estimate of , for example with an ordinary generalized linear model assuming independence. Compute the working correlation matrix R. i Compute an estimate of the covariance matrix: V = A R^ ()A i x0 i 1 2 i i 1 2 Update : +1 = , r r @ 0 V,1 @ ,1 X @ 0 V,1 (Y , ) @ =1 @ =1 @ X K K i i i i i i i Compute residuals and update V. i Iterate until convergence. i where [ using the current value of the parameter vector to compute appropriate functions of the Pearson residual. R () is estimated as i The GEE parameter estimates have many important properties. They are the generalized linear model estimating equations when you have one measurement per cluster. They are the maximum likelihood score equations for multivariate Gaussian data. And, most importantly, the GEE parameter estimates are consistent as the number of clusters becomes large, even if you have misspecified the working correlation matrix, as long as the mean model is correct. ^ ) is given by The model-based estimator of Cov( ^) = CovM ( I0,1 where I0 = K X @ 0 ,1 @ =1 @ i V @ i Statistics, Data Analysis, and Modeling This is a consistent estimator if the model and working correlation matrix are correctly specified. ^ ) is given The empirical, or robust, estimator of Cov ( by M = I0,1I1 I,0 1 where I1 = K X @ 0 V =1 @ i ,1 Cov(Y)V,1 @ @ ^ ) as the numThis is a consistent estimator of Cov( ber of clusters become large, even if the working correlation matrix is not specified correctly. The GEE approach produces a marginal model. It models a known function of the marginal expectation of the dependent variable as a linear function of explanatory variables. The resulting parameter estimates are population-averaged. GEE relies on independence across subjects to consistently estimate the variance. Compare this to a mixed model where you are estimating subject-specific parameter estimates, but you are heavily leveraging the correlation assumption. Exercise Study One of the applications of GEE methods is to crossover studies. In a crossover study, subjects have their response measured at different periods under different conditions. In a classic two period, two treatment crossover study, some subjects get a Treatment A during Period 1 and Treatment B during Period 2, while other subjects get the sequence Treatment B during Period 1 and Treatment A during Period 2. Usually, there is some sort of washout period so that whatever effects of the treatment during the first period have washed out before the second period begins. Also, the nature of the response is such that the subject is able to have the indicated response a few times within a relatively short period of time. In a crossover study, the subject acts as his own control. More complicated designs, including sequences that can draw from more than two possible treatments (for example, A, B, and Placebo) and additional periods, can provide a better framework for estimating the treatment effect. In this example, GEE methods are used to analyze a three-period crossover study in which patients with a chronic respiratory condition are exposed to different levels of air pollution while exercising and measured for respiratory distress on a four point ordinal scale, ranging from 0 for none to 3 for severe. A dichotomous baseline distress measurement was taken at the beginning of the study. Six sequences were studied: HML, HLM, MHL, MLH, LHM, and LMH, where ’H’ means High, ’M’ means medium, and ’L’ means low. In this analysis, the subject is the cluster and there may be a maximum of three response corresponding to the three periods. Missing responses occurred at each of the three periods. Interest lies in determining whether there was a pollution effect, baseline effect, period, and carryover effects. The following DATA step inputs the exercise data. There is one observation per subject per period. The variable Sequence contains the sequence information, for example, observations with the value ’HML’ received the sequence High in the first period, Medium in the second period, and Low in the third period. The indicator variables High and Medium take the value ‘1’ is the exposure is High or Medium, respectively, for that period. ID is the subject ID within sequence group, Period1 and Period2 are indicator variables for whether the observation is from Period 1 or Period 2, and CarryHigh and CarryMedium are indicator variables for whether the previous period was High exposure or Medium exposure. The variable Baseline takes the value ‘1’ for respiratory distress at the beginning of the study. data Exercise; input Sequence $ ID $ Period1 Period2 High Medium Baseline Response CarryHigh CarryMedium @@; strata=sequence||id; DichotResponse= (Response >0); datalines; HML 1 1 0 1 0 0 3 0 0 HML 1 0 1 0 1 0 1 1 0 HML 1 0 0 0 0 0 0 0 1 HML 2 1 0 1 0 0 3 0 0 HML 2 0 1 0 1 0 2 1 0 HML 2 0 0 0 0 0 0 0 1 HML 3 1 0 1 0 1 3 0 0 HML 3 0 1 0 1 0 2 1 0 HML 3 0 0 0 0 0 . 0 1 HML 4 1 0 1 0 0 2 0 0 HML 4 0 1 0 1 0 0 1 0 HML 4 0 0 0 0 0 2 0 1 ... The following statements produce a listing of the number of subjects in each of the sequences. proc freq; tables Sequence Response; run; The FREQ Procedure Cumulative Cumulative Sequence Frequency Percent Frequency Percent ------------------------------------------------------HLM 72 16.00 72 16.00 HML 78 17.33 150 33.33 LHM 72 16.00 222 49.33 LMH 72 16.00 294 65.33 MHL 60 13.33 354 78.67 MLH 96 21.33 450 100.00 Figure 3. Frequencies of Exercise Sequences Statistics, Data Analysis, and Modeling The GEE analysis is performed with the GEE facility in the GENMOD procedure. This has been made much more comprehensive in Version 7 with the inclusion of Type III tests, the CONTRAST, ESTIMATE, and LSMEANS statement, and the capability of handling the ordinal response with the proportional odds model. PROC GENMOD also now provides the alternating logistic regression method for binary data. The following statements request the analysis. The crossclassification of the variables Sequence and Id uniquely identify each cluster (subject), so that effect is specified with the SUBJECT= option in the REPEATED statement. The model consisting of all the main effects is specified, and the proportional odds model is requested with the DIST=MULTINOMIAL and LINK=CLOGIT specifications. proc genmod; class Id Sequence; model Response = Period1 Period2 High Medium Baseline CarryHigh CarryMedium / dist=multinomial link=clogit; repeated subject= Sequence*Id /type=ind corrw; contrast ’Carryover Effect’ CarryHigh 1, CarryMedium 1; contrast ’Period Effect’ period1 1, period2 1 ; run; In order to assess joint effects for Period and Carryover, two sets of two-row contrasts are specified. Figure 4 tells you that the link function and distribution have been specified correctly and that there are 406 total period measurements. Missing values for the response occurs 44 times. Model Information Data Set Distribution Link Function Dependent Variable Observations Used Missing Values WORK.EXERCISE Multinomial Cumulative Logit Response 406 44 Figure 6 displays the information for the GEE analysis. The data includes 150 clusters, for which 37 have missing values. Cluster size ranges from 1 (only one period measured) to 3 (all periods represented). GEE Model Information Correlation Structure Subject Effect Number of Clusters Clusters With Missing Values Correlation Matrix Dimension Maximum Cluster Size Minimum Cluster Size Figure 6. Analysis Of GEE Parameter Estimates Empirical Standard Error Estimates Parameter Estimate Standard Error Intercept1 Intercept2 Intercept3 Period1 Period2 High Medium Baseline CarryHigh CarryMedium -0.8959 0.9478 3.2798 0.2609 -0.0287 -3.1225 -0.4693 0.4932 0.3721 0.4265 0.3074 0.3290 0.3524 0.2973 0.2380 0.3032 0.2641 0.3708 0.3041 0.2968 Figure 7. 95% Confidence Limits Lower Upper -1.4983 0.3030 2.5891 -0.3219 -0.4953 -3.7167 -0.9869 -0.2335 -0.2240 -0.1551 Z Pr > |Z| -0.2934 -2.91 1.5926 2.88 3.9705 9.31 0.8436 0.88 0.4378 -0.12 -2.5283 -10.30 0.0484 -1.78 1.2199 1.33 0.9682 1.22 1.0081 1.44 0.0036 0.0040 <.0001 0.3803 0.9040 <.0001 0.0756 0.1835 0.2211 0.1507 Parameter Estimates The contrast results provide the 2 degree of freedom tests for both the Carryover and Period effects. CONTRAST Statement Results for GEE Analysis Contrast Carryover Effect Period Effect DF ChiSquare Pr > ChiSq 2 2 2.55 1.03 0.2799 0.5976 Type Score Score Results of Contrasts GLM Information Figure 5 displays the internal ordering of responses values, which is from 0 to 3, for no distress to severe distress. Response Profile Ordered Level 1 2 3 4 Figure 5. GEE Information Figure 7 contains the parameter estimates. Neither the Carryover nor Period effects appear to be influential. The Medium exposure appears to be marginally influential with a parameter estimate of ,0:4693 and a p-value of 0.0756; the High exposure appears to be very significant with a parameter estimate of ,3:1225 and a p-value of less than 0.0001. Figure 8. Figure 4. Independent ID*Sequence (150 levels) 150 37 3 3 1 Ordered Value 0 1 2 3 With both p-values greater than 0.25, these joint tests are non-significant. There appears to be neither Carryover nor Period effects for these data. Note that the default test for the CONTRAST statement used for the GEE analysis is a score test; you can also request a Wald statistic. The score statistics are generally more suitable for smaller sample sizes. Count 87 130 127 62 Ordered Values The reduced model was fit with the following MODEL statement. Baseline was retained as a covariate. model response = high medium baseline / dist=multinomial link=clogit; Statistics, Data Analysis, and Modeling Figure 9 contains the parameter estimates for the final model. Analysis Of GEE Parameter Estimates Empirical Standard Error Estimates Parameter Intercept1 Intercept2 Intercept3 High Medium Baseline Figure 9. Estimate -0.5404 1.2933 3.6171 -3.2523 -0.6204 0.5006 Standard Error 0.1740 0.1949 0.2586 0.3057 0.2293 0.3381 95% Confidence Limits Lower Upper -0.8814 0.9114 3.1102 -3.8513 -1.0698 -0.1620 -0.1994 1.6752 4.1239 -2.6532 -0.1711 1.1631 Z Pr > |Z| -3.11 6.64 13.99 -10.64 -2.71 1.48 0.0019 <.0001 <.0001 <.0001 0.0068 0.1387 Reduced Model Results Note that goodness-of-fit statistics are still being researched for GEE analyses. See work by Barnhart and Williamson (1998) and Preisser and Quaqish (1996) for some recent discussions of goodness of fit and deletion diagnostics. Given adequate sample size, it may be advantageous to assess model adequacy by including additional terms in the model such as pairwise and possibly higher interactions and then performing a joint test on those effects. If the test is non-significant, it leads credence to the corresponding reduced model under consideration. Conditional Logistic Regression Conditional logistic regression has long been used in epidemiology where a retrospective study matched subjects, or cases, with an event of interest with a similar subject, or control, who didn’t have the event. You determine whether the case and control had the risk factors being investigated, and, by using a conditional likelihood, you can predict the event given the explanatory variables. You set up the probabilities for having the exposure given the event and then apply Bayes’ theorem to determine a relevant conditional probability. More recently, conditional logistic regression has also been applied in the situation of highly stratified data and crossover studies. When you have highly stratified data, you may have a small number of subjects pers stratum, and thus you have a small number of subjects relative to the number of parameters you are estimating because you will need to estimate stratification effects. Sample size requirements for the usual maximum likelihood approach to unconditional logistic regression may not be met. You have a similiar situation with crossover studies, in which subjects are acting as their own controls. Highly Stratified Data Stokes et al. (1995) include an example of a clinical trial in which researchers studied the effects of a new treatment for a skin condition. A pair of patients participated from each of 79 clinics. One person received the treatment and another person received the placebo. Age, sex, and an initial score for the skin condition (ranging from 1 to 4 for mild to severe) were recorded. The response was whether the skin condition improved. Note that because there are only two observations per clinic, it would not be possible to estimate properly a clinic effect. Generally speaking, you would want to have at least five observations per clinic in order for that type of estimation. However, by conditioning away the clinic effects as nuisance parameters, you can perform a logistic regression that results in far fewer parameters. In Stokes et al., this analysis is performed by recognizing that in the case of pairs within strata, you can create a response that is a within-stratum difference and analyze those differences with the LOGISTIC procedure. However, it’s more straightforward to use the PHREG procedure to perform this analysis. While designed for proportional hazards regression analysis, through computational equivalences the procedure can also be used for conditional logistic regression. The data have the following form, where each line consists of two observations. Indicator variables are created for various interactions and to make treatment into a numerical variable. (A CLASS statement is on the list for future PROC PHREG work). data trial; input center treat $ sex $ age improve initial @@; /* compute model terms for each observation */ trt=treat=(’t’); i_sex=(sex=’m’); i_trt=(treat=’t’); trtsex=i_sex*i_trt; trtinit=i_trt*initial; trtage=i_trt*age; isexage=i_sex*age; isexinit=i_sex*initial;iageinit=age*initial; cards; 1 t f 27 0 1 1 p f 32 0 2 2 t f 41 1 3 2 p f 47 0 1 3 t m 19 1 4 3 p m 31 0 4 4 t m 55 1 1 4 p m 24 1 3 . . . The following statements request the conditional logistic regression. The variable center is the stratification variable. The TIES=DISCRETE option is required. The first four variables in the MODEL statement are automatically included in the model, and then the procedure produces a score statistic for the joint inclusion of the remaining variables. This serves as a goodness-of-fit check. Statistics, Data Analysis, and Modeling proc phreg data=trial; strata center; model improve = trt initial age i_sex isexage isexinit iageinit trtsex trtinit trtage / ties=discrete selection=forward include=4 details; run; Figure 10 displays the parameter estimates for this analysis. Analysis of Maximum Likelihood Estimates Variable DF Parameter Standard Wald Hazard Estimate Error Chi-Square Pr > ChiSq Ratio trt initial age i_sex -0.70244 -1.09148 -0.02483 -0.53115 1 1 1 1 0.36009 0.33508 0.02243 0.55451 3.8052 10.6104 1.2252 0.9175 0.0511 0.0011 0.2683 0.3381 0.495 0.336 0.975 0.588 Figure 10. Parameter Estimates for Clinical Trial Data Figure 11 and Figure 12 contain information about entering variables. Analysis of Variables Not in the Model Variable isexage isexinit iageinit trtsex trtinit trtage Score Chi-Square Pr > ChiSq 0.6593 0.1775 2.9194 0.2681 0.0121 0.4336 0.4168 0.6736 0.0875 0.6046 0.9125 0.5102 Figure 11. Analysis of Entering Variables Residual Chi-Square Test Chi-Square 4.7211 DF 6 Pr > ChiSq 0.5800 Figure 12. Score Test for Other Variables First, consider the statistics displayed in Figure 11 and Figure 12. The residual score statistic value of 4.7211 with 6 df (p-value= 0.5800) indicates that the additional terms are of little consequence. Age and sex appear not to be influential, and you may choose to keep them in the model as covariates or further perform model reduction by fitting the model without them. Note that since the alphanumeric ordering of the response values is (0,1), the model is based on predicting the probability of no improvement. For dichotomous logistic regression, you can simply switch the signs of the parameter estimates to obtain the estimates for the model based on the probability of improvement. Thus, those with treatment have an odds e:70244 times higher of improving than those 2 =:019 patients receiving the placebo. This is true even after initial grade is adjusted for in the model. And, the odds of improvement are e1:0918 = 2:980 times higher per unit increase in initial grade. Note that the conditional logistic analysis has also taken into account any effect of clinic. Crossover Data Conditional logistic regression also provides a useful analysis strategy for the crossover design. When you apply conditional logistic regression in this setting, you are creating a strata for each subject. You are conditioning out subject to subject variability and focusing on intrasubject information. Thus, you can often perform analyses that would not be possible with population-averaging methods due to small sample size. Note that the odds ratios resulting from the conditioning approach apply to subjects individually instead of to subjects on average. This may be an important consideration depending on your analysis objectives. For example, in a study whose objective is to produce a model that can be used for patient protocol prediction, the conditional logistic model may be more appropriate. The exercise data above are reanalyzed with the conditional logistic model, using the PHREG procedure. For this analysis, the response is dichotomized into 1, for severe response, and 0, for other responses. In addition, a new variable Strata has been defined, which is a unique identifier for each subject based on a combination of Sequence and Id. The STRATA statement defines the strata; note that the specification TIES=DISCRETE is required in order to produce the correct estimates. You use the TEST statement to specify tests concerning the parameter estimates: here, joint tests for both the Carryover and Period effects are requested. proc phreg data=Exercise; strata Strata; model dichotresponse = period1 period2 high medium baseline CarryHigh CarryMedium / ties=discrete; Carryover: test CarryHigh=CarryMedium=0; Period: test Period1=period2=0; run; Results of the parameter estimation are displayed in Figure 13 and are similar to those obtained in the GEE analysis. High and Medium exposures are influential, and it doesn’t appear that there are Carryover or Period effects. Statistics, Data Analysis, and Modeling interest; another is incidence densities. Analysis of Maximum Likelihood Estimates Variable Period1 Period2 High Medium Baseline CarryHigh CarryMedum DF 1 1 1 1 1 1 1 Parameter Estimate -0.15440 -0.26202 -1.70458 -0.68624 0.65094 0.40777 0.28252 Standard Wald Hazard Error Chi-Square Pr > ChiSq Ratio 0.44683 0.31225 0.35000 0.35899 0.52766 0.44493 0.53024 0.1194 0.7041 23.7198 3.6542 1.5219 0.8400 0.2839 0.7297 0.4014 <.0001 0.0559 0.2173 0.3594 0.5942 0.857 0.769 0.182 0.503 1.917 1.503 1.326 Figure 13. Parameter Estimates for Exercise Data The results for the joint tests for Carryover and Period are displayed in Figure 14. The Wald statistics indicate that neither effect is important. A reduced model seemed to fit the data adequately. Linear Hypotheses Testing Results Label CARRYOVER PERIOD Wald Chi-Square DF Pr > ChiSq 0.8723 0.7083 0.6465 0.7018 2 2 Figure 14. Test Results for Carryover and Period Exact Logistic Regression Sometimes, sample sizes are simply not appropriate for the usual logistic regression or conditional logistic regression strategies to be appropriate. In those cases, exact logistic regression provides a means of producing regression estimates and standard error that are statistically valid. You use conditioning principles similar to those used in conditioning on observed margins of contingency tables to obtain exact tests. You eliminate nuisance parameters by conditioning on the observed values of their sufficient statistics and then use the exact permutational distribution of the sufficient statistics for the parameters of interest. You can apply conditional inference to both the unstratified and stratified logistc regression settings. See Mehta and Patel (1995). LogXact software, Cytel Software Corporation (1993), provides this capability currently. Weighted Least Squares Modeling of Categorical Response Functions While weighted least squares may not longer be the workhorse of categorical data modeling, it still plays an important role in providing a strategy for modeling the variation among various functions that can be computed for categorical data. Essentially you are applying noniterative generalized least squares to response functions that are of interest and using an observed covariance matrix as the weights. If you have adequate sample sizes, the response functions have a approximate multivariate normal distribution and you can carry out hypothesis tests concerning linear combinations of them. Rank meaures of association are one type of response function that may be of Modeling Rank Measures of Association Statistics Many studies include outcomes that are ordinal in nature. When the treatment is also ordinal, rank measures of correlation can be modeled using WLS methods to investigate various treatment effects and interactions; such an analysis can often complement statistical models such as the proportional odds model. Refer to Carr et al (1989) for an example of such an analysis applied to Goodman-Kruskal rank correlation coefficients, also known as gamma coefficients. The Mann-Whitney rank measure of association statistics are also useful statistics for assessing the association between an explanatory variable and an ordinal outcome. Consider the data from a randomized clinical trial of chronic pain (Stokes et al, 1995). Investigators compared a new treatment with a placebo and assessed the response for a particular condition. Patients were obtained from two investigators whose design included stratification relative to four diagnostic classes. Table 1 displays these data. Table 1. Chronic Pain Data Diagnostic Class I I I I II II II II III III III III IV IV IV IV Researcher A A B B A A B B A A B B A A B B Treatment Active Placebo Active Placebo Active Placebo Active Placebo Active Placebo Active Placebo Active Placebo Active Placebo P 3 7 1 5 1 1 0 3 2 5 2 2 8 5 1 3 Patient Status F M G 2 2 1 0 1 1 6 1 5 4 2 3 0 1 2 1 0 1 1 1 1 1 1 5 0 3 3 0 0 8 4 1 10 5 1 4 1 3 4 0 3 3 5 2 3 4 3 4 E 0 1 3 3 2 1 6 0 2 1 3 2 0 0 1 2 You may be interested in computing the MannWhitney rank measure of association as a way of assessing the extent to which patients with active treatments are more likely to have better response status than those with placebo. You may then be interested in seeing whether diagnostic status and investigator influence this association through model-fitting. You can perform such modeling by first computing the Mann-Whitney statistics and their standard errors and then using these estimates as input to the CATMOD procedure to perform modeling. You can compute the Mann-Whitney measures as functions of the Somer’s D measures, which are produced by the FREQ procedure. U i = fSomer’s D C|R + 1g and S 2 i = SE 2 Statistics, Data Analysis, and Modeling Si is the standard error of statistic. U, i the Mann-Whitney The following statements produce measures of association for the eight 2 4 tables formed for the combination of investigator and treatment. data cpain; input dstatus $ invest $ treat $ status $ count @@; datalines; I A active poor 3 I A active fair 2 I A active moderate 2 I A active good 1 I A active excel 0 I A placebo poor 7 I A placebo fair 0 I A placebo moderate 1 I A placebo good 1 I A placebo excel 1 I B active poor 1 I B active fair 6 I B active moderate 1 I B active good 5 I B active excel 3 I B placebo poor 5 I B placebo fair 4 I B placebo moderate 2 I B placebo good 3 I B placebo excel 3 II A active poor 1 II A active fair 0 II A active moderate 1 II A active good 2 II A active excel 2 ... proc freq; weight count; tables dstatus*invest*treat*status/ measures; run; Figure 15 displays the table for Diagnostic Status I and Investigator A. Figure 16 displays the measures of association for that table. The FREQ Procedure Table 1 of treat by status Controlling for dstatus=I invest=A treat status Frequency| Row Pct |poor |fair |moderate|good | excel | Total ---------|--------|--------|--------|--------|--------|-------active | 3 | 2 | 2 | 1 | 0 | 8 ---------|--------|--------|--------|--------|--------|-------placebo | 7 | 0 | 1 | 1 | 1 | 10 | 70.00 | 0.00 | 10.00 | 10.00 | 10.00 | ---------|--------|--------|--------|--------|--------|-------Total | 10| 2| 3| 2 | 1| 18 Figure 15. Frequency Table Statistics for Table 1 of treat by status Controlling for dstatus=I invest=A Statistic Value ASE -----------------------------------------------------Gamma -0.2857 0.3515 Kendall’s Tau-b -0.1763 0.2253 Stuart’s Tau-c -0.1975 0.2485 Somers’ D C|R Somers’ D R|C -0.2000 -0.1553 0.2514 0.2026 Pearson Correlation Spearman Correlation -0.0866 -0.1900 0.2331 0.2416 Figure 16. Measures of Association Table 2 displays the calculated values. Table 2. Mann Whitney Statistics Diagnostic Class I I II II III III IV IV Researcher A B A B A B A B Somer’s .2000 .2002 .2083 .6778 .0260 .1893 .0000 ,:0156 ASE .3515 .1915 .3622 .1834 .2271 .1923 .2007 .2116 Ui .6000 .6001 .6042 .8389 .5130 .5947 .5000 .4922 Si .1758 .0958 .1811 .0917 .1136 .0962 .1004 .1058 You compute the covariances and then create a data set that contains the estimates and the covariance matrix. The following DATA step creates the MannWhitney data set. data MannWhitney; input b1-b8 _type_ $ datalines; .6000 .6011 .6042 .03091 .0000 .0000 .0000 .00918 .0000 .0000 .0000 .3280 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0000 ; _name_ $8.; .8389 .0000 .0000 .0000 .0084 .0000 .0000 .0000 .0000 .5130 .0000 .0000 .0000 .0000 .0129 .0000 .0000 .0000 .5947 .0000 .0000 .0000 .0000 .0000 .0093 .0000 .0000 .5000 .0000 .0000 .0000 .0000 .0000 .0000 .0101 .0000 .4922 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0112 parms cov b1 cov b2 cov b3 cov b4 cov b5 cov b6 cov b7 cov b8 This data set is then input into the CATMOD procedure. Thus, instead of generating functions from an underlying contingency table, the CATMOD procedure does modeling directly on the input functions using the input covariance matrix as the weights. You define the profiles for each function with the PROFILE option in the FACTORS statement. You also define your factors, along with the number of levels for each, and describe the effects you want to include in your model with the – RESPONSE– option. proc catmod data=MannWhitney; response read b1-b8; factors diagnos $ 4 , invest $ 2 / _response_ = diagno invest profile = (I A, I B, II A, II B, III A, III B, IV A, IV B); model _f_ = _response_ / cov; run; The ANOVA table results follow. The residual Wald test is a test of the diagnostic class and investigator interaction, which is non-significant with a p-value of 0.78. Neither diagnostic class nor investigator appear to explain significant variation, with diagnostic class appearing to be modestly influential with a p-value of 0.093. Statistics, Data Analysis, and Modeling Analysis of Variance Source DF Chi-Square Pr > ChiSq -------------------------------------------Intercept 1 213.14 <.0001 diagnos 3 6.42 0.0930 invest 1 0.58 0.4469 Residual 3 1.06 0.7862 Figure 17. Main Effects Output By submitting another MODEL statement like the following model _f_ = / cov; run; You can obtain a test of the hypothesis that the measures have the same value for each diagnostic class and investigator combination. This is the seven degree test that is labeled ‘residual.’ Analysis of Variance Source DF Chi-Square Pr > ChiSq -------------------------------------------Model|Mean 7 9.41 0.2247 Residual 0 . . Figure 18. Intercept-Only Model You can’t reject this hypothesis with these data. Modeling Incidence Densities Incidence densities are defined as the ratio of the number of episodes of a disease or illness to the total person-time at risk. These measures are often of interest in epidemiological work. Lavange et al (1994) studied the effect of passive smoking exposure on the incidence of lower respiratory illness in young children. Measurements were made over time. The investigators used WLS to model the incidence densities; because the study was a complex survey design, the covariance structure was determined with sample survey methods. Interest was in comparing the marginal rates of lower respiratory disease between exposed and unexposed groups. The ratio estimates and their covariances were obtained with SUDAAN software, and then the response functions and covariances wered used as input to the CATMOD procedure for WLS modeling. The exposed group had a significantly higher rate of illness. Summary This paper describes recent enhancements in the area of categorical data analysis and discusses several applications of the recent methodology. Exact methods and quasi-likelihood methods provide ways to analyze data that previously had many data analysis limitations. The SAS System includes software for many of these newer methodologies and should contain additional features that implement the recent methodological advances in the next several years. References Barnhart, H. and Williamson, J (1998). Goodness-ofFit Tests for GEE Modeling with Binary Responses, Biometrics, 54, 720–729. Cytel Software Corporation (1993), LogXact: Software for Exact Logistic Regression, Cytel Software Corporation, Cambridge, MA. Carr, G. J., Hafner, K. B., and Koch, G. G., (1989), “Analysis of rank measures of association for ordinal data from longitudinal studies”, Journal of the American Statistical Association, 84, 797–804. Diggle, P.J., Liang, K.-Y. and Zeger, S.L. (1994). Analysis of Longitudinal Data, Oxford: Oxford Science Liang, K.-Y. and Zeger, S.L. (1986). Longitudinal Data Analysis Using Generalized Linear Models, Biometrika, 13–22 LaVange, L. M., Keyes, L. L., Koch, G. G., and Margolis, P. A. (1994). Application of sample survey methods for modelling ratios to incidence densities, Statistics in Medicine, 13, 343–355. Koch, G. G., Landis, J. R., Freeman, J. L., Freeman, D. H., and Lehnen, R. G. (1977). A general methodology for the analysis of experiments with repeated measurement of categorical data, Biometrics, 33, 133–158. Mehta, C. R. and Patel, N. R. (1995), “Exact logistic regression: theory and examples”, Statistics in Medicine, 14, 2143–1260 Nelder, J.A., and Wedderburn, R.W.M. (1972), Generalized Linear Models, Journal of the Royal Statistical Society A, 135, 370–384. Preisser, J. S., and Quaqish, B. F. (1996), Deletion diagnostics for generalised estimating equations, Biometrika, 83, 3, 551–562 Preisser, J. S., and Koch, G. G., (1997), “Categorical Data Analysis in Public Health”, Annual Review of Public Health, 18, 51–82 Stokes, M. E., Davis, C. S., and Koch, G. G (1995). Categorical Data Analysis Using the SAS System, Cary: SAS Institute, Inc. Zeger, S.L. and Liang, K.-Y. (1986). Longitudinal Data Analysis for Discrete and Continuous Outcomes, Biometrics, 42 121–130 Statistics, Data Analysis, and Modeling Author Maura E. Stokes, SAS Institute Inc., SAS Campus Drive, Cary, NC 27513. FAX (919) 677-4444 Email [email protected] SAS is a registered trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies. Version 1.0
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project