UNIVERSITY OF PRETORIA

THE VARIABLE SELECTION PROBLEM AND THE APPLICATION OF THE ROC CURVE FOR BINARY OUTCOME VARIABLES

James M Matshego

Prepared in partial fulfilment of the requirements for the degree of Master of Science in Applied Statistics

Supervisor: Prof H T Groeneveld
External Examiner: Prof A J van der Merwe (U. O. F)

2006

DECLARATION

I, James Moeng Matshego, hereby testify that the work presented in this study is my own original work and that all the resources used have been indicated and reflected by means of complete references. I further declare that the dissertation that I hereby submit for the degree in Applied Statistics at the University of Pretoria has not previously been submitted by me for degree purposes at any other university.

Signed.........................................

Acknowledgements

I sincerely thank
o My supervisor, Prof H T Groeneveld, for his encouragement, guidance and support.
o The Department of Statistics for having been patient with me.
o Mr E Sibanda from Research and Development at TUT for making available the data set used in this study.
o My family for having afforded me time to study.
o Prof A J van der Merwe for his valuable comments and advice.

TABLE OF CONTENTS

LIST OF TABLES
ABSTRACT
CHAPTER 1 ORIENTATION
1.1 INTRODUCTION
1.2 VARIABLE SELECTION
1.3 SCOPE OF THIS WORK
CHAPTER 2 SELECTION PROCEDURES FOR CONTINUOUS OUTCOME VARIABLES
2.1 VARIABLE SELECTION IN LINEAR REGRESSION
  2.1.1 Forward selection
  2.1.2 Backward selection
  2.1.3 Conventional Stepwise selection or Efroymson's Algorithm
    2.1.3.1 Criterion for addition
    2.1.3.2 Criterion for deletion
    2.1.3.3 Convergence of the Algorithm
  2.1.4 Press
  2.1.5 Principal component regression
  2.1.6 Latent root regression
  2.1.7 Branch-and-bound Technique
  2.1.8 Variable selection via elastic net
    2.1.8.1 Naïve elastic net
    2.1.8.2 Elastic net
  2.1.9 Generating all subsets
2.2 VARIABLE SELECTION IN THE COX REGRESSION MODEL
  2.2.1 Purposeful selection of variables
  2.2.2 The Lasso Method: Tibshirani (1997)
  2.2.3 Variable selection for time series data
2.4 HYPOTHESIS TESTING
  2.4.1 Lack-of-fit test
  2.4.2 The coefficient of determination, $R^2$
  2.4.3 Minimum adequate sets
2.5 COMPARISON OF MODELS: SOLUTION CRITERIA
  2.5.1 Akaike's information criterion (AIC) and the Bayes information criterion (BIC)
  2.5.2 $C_p$-Statistics ($C_r$-Criterion)
  2.5.3 The $S_p$-Statistics ($S_r$-Criterion)
  2.5.4 RMS, $R^2$ and adjusted $R^2$ Statistics
    2.5.4.1 The Residual Mean Square
    2.5.4.2 The Squared Multiple Correlation Coefficients (SMCC)
    2.5.4.3 The adjusted $R^2$ or Fisher's A-Statistics
CHAPTER 3 THE LOGISTIC MODEL AND VARIABLE SELECTION FOR A BINARY OUTCOME VARIABLE
3.1 BINARY DATA
3.2 LOGISTIC REGRESSION
  3.2.1 Assumptions
  3.2.2 The multiple linear logistic regression model
3.3 PARAMETER ESTIMATION
  3.3.1 Maximum likelihood estimation
  3.3.2 The Newton-Raphson method
3.4 ODDS AND ODDS RATIOS
3.5 INTERPRETATION OF COEFFICIENTS
  3.5.1 Dichotomous predictor variables
  3.5.2 Polytomous predictor variables
  3.5.3 One continuous predictor variable
  3.5.4 Multivariable case
  3.5.5 One dichotomous and one continuous and their interaction
3.6 TESTING FOR THE SIGNIFICANCE OF THE MODEL
  3.6.1 The likelihood ratio test
  3.6.2 Wald Test Statistics
  3.6.3 Using deviations to compare likelihoods
3.7 INTERACTION AND CONFOUNDING
3.8 VARIABLE SELECTION IN LOGISTIC REGRESSION
  3.8.1 Purposeful selection of variables
    3.8.1.1 Screening of variables
    3.8.1.2 Scale of continuous variables
    3.8.1.3 Inclusion of interactions
  3.8.2 Stepwise forward selection
  3.8.3 Stepwise backward selection
  3.8.4 Stepwise selection (forward and backward)
  3.8.5 Best subset selection
  3.8.6 General
CHAPTER 4 THE RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE
4.1 BACKGROUND
4.2 DEFINITION OF ROC CURVE
4.3 DIAGNOSTIC TEST INTERPRETATION
  4.3.1 2X2 Table or Contingency Matrix
  4.3.2 Basic concepts
    4.3.2.1 Sensitivity
    4.3.2.2 Specificity
    4.3.2.3 Pre-test probability
    4.3.2.4 Predictive value of a positive test
    4.3.2.5 Predictive value of a negative test
4.4 ROC REGRESSION MODEL
4.5 AREA UNDER THE ROC CURVE (AUC)
  4.5.1 Interpretation of the area
  4.5.2 Comparison of tests
  4.5.3 Advantages and disadvantages of the ROC curve
CHAPTER 5 MODEL BUILDING USING REAL DATA
5.1 PURPOSEFUL SELECTION OF VARIABLES PROCEDURE
5.2 OTHER LOGISTIC REGRESSION PROCEDURES
5.3 INVESTIGATION OF THE AUC AS A SELECTION TOOL
5.4 THE AUC AND THE STEPWISE SELECTION PROCEDURES
CHAPTER 6 DISCUSSION AND CONCLUSION
APPENDICES
REFERENCES

LIST OF TABLES

Table 3.1: An example of coding of a design variable Race coded at three levels
Table 3.2: An example showing coefficients that will be obtained when fitting the model using design variables in Table 3.1
Table 4.1: An example of a contingency table
Table 5.1: Codes of variables used in the data set for study of factors associated with Success of first year students at TUT from 1999 to 2002
Table 5.2: Indicator variables for the variable Faculty
Table 5.3: Univariable logistic regression models
Table 5.4: Multivariable model containing variables identified in the univariable Analysis
Table 5.5: Results of quartile analysis of the variable Agregate from the multivariable Model containing variables shown in Table 5.6
Table 5.6: Preliminary main effects model
Table 5.7: Multivariable model with dichotomous variable Agregate_
Table 5.8: A model containing interactions which were significant when added one by one to the main effects model
Table 5.9: Final model with interactions
Table 5.10: Contingency matrix for a model in Table 5.9
Table 5.11: Odds ratios and association of predicted probabilities and observed Responses for the final model
Table 5.12: Contingency matrix for the model in Table 34
Table 5.13: Comparison of the AUC and the Stepwise procedures
Table 14: Univariable Analysis of the variable Age
Table 15: Univariable Analysis of the variable Agregate
Table 16: Analysis of categorical variables
Table 17: The results of Forward selection procedure
Table 18: The results of the Backward selection procedure
Table 19: The results of the Stepwise selection procedure
Table 20: The results of the Stepwise procedure with interactions included
Table 21: The results of Best subset selection procedure using Score criterion
Table 22: The results of Best subset selection procedure using $C_p$ criterion
Table 23: The results of step 1 of the AUC procedure
Table 24: The results of step 2 of the AUC procedure
Table 25: The results of step 3 of the AUC procedure
Table 26: The results of step 4 of the AUC procedure
Table 27: The results of step 5 of the AUC procedure
Table 28: The results of step 6 of the AUC procedure
Table 29: The results of step 7 of the AUC procedure
Table 30: The results of step 8 of the AUC procedure
Table 31: The results of step 9 of the AUC procedure
Table 32: The results of step 10 of the AUC procedure
Table 33: The results of step 11 of the AUC procedure
Table 34: The results of step 12 of the AUC procedure
Table 35: The results of step 13 of the AUC procedure
Table 33: The results of step 14 of the AUC procedure

ABSTRACT

Variable selection refers to the problem of selecting input variables that are most predictive of a given outcome. Variable selection problems are found in all machine learning tasks, supervised or unsupervised, classification, regression, time series prediction, two-class or multi-class, posing various levels of challenges. Variable selection problems are related to the problems of input dimensionality reduction and of parameter planning. They have practical and theoretical challenges of their own. From the practical point of view, eliminating variables may reduce the cost of producing the outcome and increase its speed, while space dimensionality reduction does not address these problems. Theoretical challenges include estimating with what confidence one can state that a variable is relevant to the concept when it is useful to the outcome, and providing a theoretical understanding of the stability of selected variable subsets.
As the probability cut-point increases in value, it becomes more likely that an observation is classified as a non-event by the selected variables. The mathematical statement of the problem is not widely agreed upon and may depend on the application. One typically distinguishes:
i) The problem of discovering all the variables relevant to the outcome variable and determining how relevant they are and how they are related to each other.
ii) The problem of finding a minimum subset of variables that is useful to the outcome variable.

Logistic regression is an increasingly popular statistical technique used to model the probability of a discrete binary outcome. Logistic regression applies maximum likelihood estimation after transforming the outcome variable into a logit variable. In this way, logistic regression estimates the probability of a certain event. When properly applied, logistic regression analyses yield a powerful insight into which variables are more or less likely to predict the event outcome in a population of interest. These models also show the extent to which changes in the values of a variable may increase or decrease the predicted probability of the event outcome. Variable selection, in all its facets, is similarly important in logistic regression.

The receiver operating characteristic (ROC) curve is a graphic display that gives a measure of the predictive accuracy of a logistic regression model. It is a measure of classification performance; the area under the ROC curve (AUC) is a scalar measure gauging one facet of performance. Another measure of the predictive accuracy of a logistic regression model is a classification table. It uses the model to classify observations as events if their estimated probability is greater than or equal to a given probability cut-point; otherwise they are classified as non-events. This technique, as it appears in the literature, is also studied in this thesis.
In this thesis the issue of variable selection, both for continuous and binary outcome variables, is investigated as it appears in the statistical literature. It is clear that this topic has been widely researched and still remains a feature of modern research. The last word certainly has not been spoken.

CHAPTER 1

ORIENTATION

1.1 INTRODUCTION

The problem of variable selection is one of the most pervasive problems in statistical modelling. As stated by Guyon and Elisseeff (2002), variable selection problems are found in all machine learning tasks, whether supervised or unsupervised, classification, regression or time series prediction, and pose challenges. Owing to the current availability of high-speed computers, this problem has received enormous attention in the recent statistical literature. Often referred to as the problem of subset selection, it arises when one wants to model the relationship between a variable of interest and a subset of potential explanatory variables or predictors, but there is uncertainty about which subset to use.

A common situation is that in which the explanatory or predictor variables, denoted by $X$ ($n \times p$) and measured at one time, are used to predict a variable of interest or response variable, denoted by $Y$ ($1 \times n$), at some future time. Unless the true form of the relationship between the $X$ and $Y$ variables is known, the data must be used both to select the variables and to calibrate the relationship so that it is representative of the conditions in which it will be used for prediction. In prediction we are usually looking for a small subset of variables which gives adequate prediction accuracy for a reasonable cost of measurement.
On the other hand, in trying to understand the effect of one variable on another, particularly when the only data available are observational or survey data rather than experimental data, it may be desirable to include many potential variables which are either known or believed to have an effect (Miller (1990)).

The problem of selecting a subset of predictor variables is usually described in an idealised setting. That is, it assumes that (a) all predictors are available for inclusion in or exclusion from the model, though this is not always the situation in practice (in many cases, the original set of measured variables will be augmented with other variables constructed from them, such as a product of two variables), and (b) a 'good' data set is available on which to base the conclusions. If these assumptions fail, a detailed subset selection analysis may be a futile exercise.

The rationale for minimising the number of variables in the model is that the resultant model is more likely to be numerically stable and is more easily generalised. The more variables included in a model, the greater the estimated standard errors become, and the more dependent the model becomes on the observed data (Hosmer and Lemeshow (1989)).

1.2 VARIABLE SELECTION

It will be assumed that there are $n \ge p + 1$ observations on a matrix of predictor variables, $X = (x_1, \ldots, x_p)$, and a scalar response, $y$, such that the $j$th response, $j = 1, \ldots, n$, is determined by

\[ y_j = \beta_0 + \sum_{i=1}^{p} \beta_i x_{ij} + \xi_j \tag{1.1} \]

The residuals $\xi_j$ are assumed to be independently and identically distributed, usually normal, with mean zero and unknown variance $\sigma^2$. (The predictors $x_{ij}$ are frequently taken to be specified design variables, but in many cases it is more appropriate to consider them as random variables and assume a joint distribution on $y$ and $x$, say, multivariate normal.)
Implicit in these assumptions is the assumption that the variables $x_1, \ldots, x_p$ include all relevant variables, though extraneous variables may be included. The model (1.1) is frequently expressed in matrix notation as

\[ Y = X\beta + \varepsilon \tag{1.2} \]

where $Y$ is the $n$-vector of observed responses, $X$ is the design matrix of dimension $n \times (p + 1)$ as implied by (1.1), assumed to have rank $p + 1$, and $\beta$ is the $(p + 1)$-vector of unknown regression coefficients.

The variable selection problem is most familiar in the linear regression context, where attention is restricted to normal linear models. Letting $\gamma$ index the subsets of $x_1, \ldots, x_p$ and $q_\gamma$ be the size of the $\gamma$th subset, the problem is to select and fit a model of the form

\[ Y = X_\gamma \beta_\gamma + \varepsilon \tag{1.3} \]

where $X_\gamma$ is an $n \times q_\gamma$ matrix whose columns correspond to the $\gamma$th subset, $\beta_\gamma$ is a $q_\gamma \times 1$ vector of regression coefficients and $\varepsilon \sim N(0, \sigma^2 I)$. More generally, the variable selection problem is a special case of the model selection problem, where each model under consideration corresponds to a distinct subset of $x_1, \ldots, x_p$.

The fundamental developments in variable selection seem to have occurred either directly in the context of the linear model (1.3) or in the context of general model selection frameworks. Historically, the focus began with the linear model in the 1960s, when the first wave of important developments occurred and computing was expensive (George (2000)).

1.3 SCOPE OF THIS WORK

This manuscript consists of six chapters. In Chapter 2, methods and procedures for selecting variables in respect of continuous outcome variables for different regressions are described. In addition, statistics for the comparison of models are discussed. Chapter 3 introduces and defines the logistic regression model, a model for a binary outcome variable. Various selection procedures for this model are also discussed. The Receiver Operating Characteristic (ROC) curve, a curve representing a diagnostic test with binary outcome, is presented in Chapter 4.
Chapter 5 covers a model building exercise. All selection procedures discussed with regard to a binary outcome variable are applied to an available data set. We also look into the possibility of using the area under the ROC curve as a variable selection criterion by testing it on the same data set used for the other procedures. Chapter 6 wraps up this study with discussions and conclusions.

CHAPTER 2

SELECTION PROCEDURES FOR CONTINUOUS OUTCOME VARIABLES

This chapter looks at the problem of finding one or more subsets of variables which give models that fit a set of data fairly well. However, there is no unique statistical procedure or technique for selecting the best regression equation. If there are $p$ potential independent variables, there are $2^p$ possible equations to be considered.

According to Miller (1984), reasons for using only some of the possible predictor variables include:
I. to estimate or predict at lower cost by reducing the number of variables on which the predictions are based.
II. to predict accurately by eliminating uninformative variables.
III. to describe a multivariate data set parsimoniously.
IV. to estimate regression coefficients with small standard errors (particularly when some of the predictors are highly correlated with others).

2.1 VARIABLE SELECTION IN LINEAR REGRESSION

In linear regression an F-test is used, since errors are assumed to be normally distributed (Hosmer and Lemeshow (1989)).

2.1.1 Forward Selection

Hocking (1976) suggests a technique that starts with no variable in the equation and adds one variable at a time until either all variables are in or a stopping criterion is satisfied. The variable considered for inclusion at any step is the one yielding the largest single degree of freedom (d.f.) F-ratio among those eligible for inclusion.
That is, variable $i$ is added to the $r$-term equation if

\[ F_i = \max_i \frac{RSS_r - RSS_{r+i}}{\hat{\sigma}^2_{r+i}} > F_{in} \]

where $RSS_r$ and $RSS_{r+i}$ are the residual sums of squares for the $r$-term and $(r + i)$-term models and $r$ is the number of terms which are retained in the final equation. The subscript $(r + i)$ refers to quantities computed when variable $i$ is adjoined to the current $r$-term equation.

Beale (1970) describes a method that requires the least amount of computation. In this method, all results are obtained as a by-product of solving the problem with all variables selected: if there are $p$ regression variables, the covariance matrix of these variables is inverted by pivoting on each of the $p$ diagonal elements in turn, and after each pivot step the results for the regression on those variables whose diagonal elements have already been chosen as pivots can be read off. The method assumes that there are no linear dependencies among the independent variables: if a diagonal element is less than some tolerance times its original value, pivoting on it is not done, where the tolerance is normally $10^{-3}$ in single-precision code or $10^{-7}$ in double-precision code. Theoretically this approach has a weakness when independent variables are correlated. Two (or more) variables may be individually useless but may together give a very good fit.

Draper and Smith (1981) use the partial correlation coefficient as a measure of the importance of variables not yet in the equation. Assume that $Z_1, Z_2, \ldots, Z_k$, which are all functions of one or more of the $X$'s, represent the complete set of variables from which the equation is to be chosen, and that this set includes any functions, such as squares, cross products, logarithms, inverses and powers, thought to be desirable and necessary. The procedure starts by selecting the $Z$ most correlated with $Y$. Suppose this $Z$ is $Z_1$; the first-order linear regression equation is then found to be $\hat{Y} = f(Z_1)$.
We check the significance of the variable and, if it is not significant, we quit and the model $Y = \bar{Y}$ is adopted as best; otherwise we search for the second predictor variable to enter the regression. The partial correlation coefficients with $Y$ of all predictors not in the regression at this stage, namely $Z_j$, $j \neq 1$, are examined. In other words, $Y$ and $Z_j$ are both adjusted for their straight-line relationships with $Z_1$, and the correlation between these adjusted values is calculated for all $j \neq 1$. The $Z_j$ with the highest partial correlation coefficient with $Y$ is now selected; say it is $Z_2$. So the second regression equation $\hat{Y} = f(Z_1, Z_2)$ is fitted. The overall regression is checked for significance, with the improvement in the $R^2$ value noted, and the partial $F$-values for both variables now in the equation are examined. The smaller of these two partial $F$'s is then compared with an appropriate $F$ percentage point, and the corresponding predictor variable is retained in the equation or rejected according to whether the test is significant or not.

The testing of the "least useful predictor currently in the equation" is done at every stage of the procedure. Thus a predictor that may have been the best entry candidate at an earlier stage may, at a later stage, become redundant as a result of the relationship between it and other variables now in the regression. Such a variable is removed from the model when it tests non-significant, and the appropriate fitted regression equation is then computed for all the variables remaining in the model. Eventually, when no variable in the current equation can be removed and the next best candidate variable cannot hold its place in the equation, the process stops. As each variable is entered into the regression, its effect on $R^2$ is noted. However, the correct choice of the $\alpha$-levels is necessary to avoid a cycling effect.

Miller (1990) suggests a method that finds a subset of $r < p$ variables $X_{(1)}, X_{(2)}, \ldots,$
$X_{(r)}$ from a set of variables $X_1, X_2, \ldots, X_p$ which minimises or gives a suitably small value for

\[ S = \sum_{i=1}^{n} (y_i - b_j x_{ij})^2 . \]

Since the value of $b_j$ is given by

\[ b_j = \frac{\sum_{i=1}^{n} x_{ij} y_i}{\sum_{i=1}^{n} x_{ij}^2} \]

it follows that

\[ S = \sum_{i=1}^{n} y_i^2 - \frac{\left( \sum_{i=1}^{n} x_{ij} y_i \right)^2}{\sum_{i=1}^{n} x_{ij}^2} . \tag{2.1.1} \]

If we let the first variable selected be denoted by $X_{(1)}$, this variable is then forced into all further subsets. The residuals $Y - X_{(1)} b_{(1)}$ are orthogonal to $X_{(1)}$, and to reduce the sum of squares by adding further variables, the space orthogonal to $X_{(1)}$ must be searched. From each variable $X_j$, other than the one already selected, we could form $X_{j.(1)} = X_j - b_{j.(1)} X_{(1)}$, where $b_{j.(1)}$ is the least squares regression coefficient of $X_j$ upon $X_{(1)}$. The variable chosen next is the one which minimises (2.1.1) when $Y$ is replaced with $Y - X_{(1)} b_{(1)}$ and $X_j$ is replaced with $X_{j.(1)}$. The variables $X_{(1)}, X_{(2)}, \ldots, X_{(r)}$ are progressively added to the prediction equation, each variable being chosen because it minimises the residual sum of squares when added to those already selected.

2.1.2 Backward elimination

The backward elimination method is more economical than the "all regressions" method in the sense that it tries to examine only the "best" regression containing a certain number of variables (Draper and Smith (1981)). We start with all $p$ variables, including a constant if there is one, in the selected set. Thus, a regression equation containing all variables is computed and variables are eliminated one at a time. At any step, the variable with the smallest F-ratio as computed from the current regression is eliminated if this F-ratio does not exceed a specified value. That is, variable $i$ is deleted from the $p$-term equation if

\[ F_i = \min_i \frac{RSS_{p-i} - RSS_p}{\hat{\sigma}_p^2} < F_{out} . \]

Here $RSS_{p-i}$ denotes the residual sum of squares obtained when variable $i$ is deleted from the current $p$-term equation, and $RSS_p$ is the residual sum of squares for the $p$-term equation.
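The elimination rule above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the procedure used in this study: the helper `rss`, the function `backward_elimination`, the default threshold `f_out = 4.0` and the small factorial data set are all hypothetical, and $\hat{\sigma}_p^2$ is assumed to be $RSS_p/(n - p - 1)$ for a model with a constant.

```python
from itertools import product


def rss(X, y, cols):
    """Residual sum of squares of y regressed on a constant plus columns `cols` of X."""
    n = len(y)
    A = [[1.0] + [X[i][c] for c in cols] for i in range(n)]  # design matrix
    m = len(A[0])
    # Augmented normal equations [A'A | A'y]
    M = [[sum(A[i][a] * A[i][b] for i in range(n)) for b in range(m)]
         + [sum(A[i][a] * y[i] for i in range(n))] for a in range(m)]
    # Gauss-Jordan elimination with partial pivoting
    for k in range(m):
        piv = max(range(k, m), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(m):
            if r != k:
                f = M[r][k] / M[k][k]
                M[r] = [u - f * v for u, v in zip(M[r], M[k])]
    coef = [M[k][m] / M[k][k] for k in range(m)]
    fitted = [sum(c * a for c, a in zip(coef, row)) for row in A]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def backward_elimination(X, y, f_out=4.0):
    """Drop the variable with the smallest F-ratio while that ratio is below f_out."""
    n = len(y)
    selected = list(range(len(X[0])))
    while selected:
        rss_p = rss(X, y, selected)
        sigma2 = rss_p / (n - len(selected) - 1)  # assumed estimate of sigma^2
        f_ratio = {j: (rss(X, y, [c for c in selected if c != j]) - rss_p) / sigma2
                   for j in selected}
        weakest = min(f_ratio, key=f_ratio.get)
        if f_ratio[weakest] >= f_out:
            break  # every remaining variable is worth keeping
        selected.remove(weakest)
    return selected


# Hypothetical data: a 2^3 factorial design in which only x0 and x1 drive y.
X = [list(map(float, row)) for row in product([-1, 1], repeat=3)]
y = [2 + 3 * x0 - 2 * x1 + 0.1 * x0 * x1 * x2 for x0, x1, x2 in X]

print(backward_elimination(X, y))  # indices of the retained variables
```

In this constructed example the third column has no linear effect on the response, so it is the first candidate for deletion, while the two informative columns have large F-ratios and are retained.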
Draper and Smith (1981) proposed a method with the following steps, applied to the regression equation with all variables:
1. The partial $F$-value, which is associated with the test of $H_0: \beta = 0$ versus $H_1: \beta \neq 0$ for any particular regression coefficient, is calculated for every predictor variable, treated as though it were the last to enter the regression equation.
2. The lowest partial $F$-value, say $F_L$, is compared with the $F$-value for a pre-selected significance level, say $F_0$. If $F_L < F_0$, the variable which gave rise to $F_L$ is removed, the regression equation is recalculated with the remaining variables and step 1 is repeated. If $F_L > F_0$, the regression equation is adopted as calculated.

A rather simpler approach by Miller (1990) uses the residual sum of squares. If $RSS_p$ is the residual sum of squares for the regression with all $p$ variables, a variable is chosen for deletion if it yields the smallest value of $RSS_{p-1}$ after deletion. Then, from the remaining $p - 1$ variables, the variable which yields the smallest $RSS_{p-2}$ is deleted. The process continues until one variable is left or a stopping criterion is satisfied.

According to Mantel (1970), the advantageous property of the backward elimination procedure is that it drops regressor variables, or sets of regressor variables, only when one can afford to discard them without seriously impairing the goodness of fit. Thus many variables can be discarded without abruptly worsening the regression. On the other hand, backward elimination is usually not feasible when there are more variables than observations. It also requires far more computation than forward selection.

2.1.3 Conventional Stepwise or Efroymson's Algorithm

This is a variation on forward selection. After each variable (except the first one) is added to the set of selected variables, a test is made to ascertain whether any of the previously selected variables can be deleted without appreciably increasing the residual sum of squares.
This algorithm incorporates criteria for the addition and deletion of variables.

2.1.3.1 Criterion for addition

If $RSS_r$ denotes the residual sum of squares with $r$ variables and a constant in the model, and the smallest RSS which can be obtained by adding another variable to the present set is $RSS_{r+1}$, the ratio

\[ R = \frac{RSS_r - RSS_{r+1}}{RSS_{r+1}/(n - r - 2)} \tag{2.1.3.1} \]

is calculated and its value is compared with an 'F-to-enter' value, say $F_e$. If $R$ is greater than $F_e$, the variable is added to the selected set.

2.1.3.2 Criterion for deletion

If $RSS_{r-1}$ is the smallest RSS which can be obtained after deleting any variable from the previously selected variables, the ratio

\[ R = \frac{RSS_{r-1} - RSS_r}{RSS_r/(n - r - 1)} \tag{2.1.3.2} \]

is calculated and its value compared with an 'F-to-delete (or drop)' value, say $F_d$. If $R$ is less than $F_d$, the variable is deleted from the selected set.

2.1.3.3 Convergence of the Algorithm

From (2.1.3.1) it follows that when the criterion for adding a variable is satisfied we have

\[ RSS_{r+1} \le \frac{RSS_r}{1 + F_e/(n - r - 2)} \]

and from (2.1.3.2), when the criterion for deletion of a variable is satisfied, we have

\[ RSS_r \le RSS_{r+1} \left\{ 1 + \frac{F_d}{n - r - 2} \right\} . \]

Consequently, when an addition is followed by a deletion, the new RSS, say $RSS_r^*$, is such that

\[ RSS_r^* \le RSS_r \times \frac{1 + F_d/(n - r - 2)}{1 + F_e/(n - r - 2)} \tag{2.1.3.3} \]

The procedure stops when no further additions and deletions satisfying the criteria are possible. Since each $RSS_r$ is bounded below by the smallest RSS for any subset of $r$ variables, convergence is guaranteed by ensuring that the RSS is reduced each time a new subset of $r$ variables is found. From (2.1.3.3) it follows that a sufficient condition for convergence is that $F_d < F_e$. However, there is no guarantee that this algorithm will find the best fitting subsets, though it often performs better than forward selection when some of the predictors are highly correlated.
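The addition and deletion criteria (2.1.3.1) and (2.1.3.2) can be combined into a short Python sketch. It is offered as an illustration under stated assumptions rather than a reference implementation: the names `rss` and `efroymson`, the default thresholds `f_enter = 4.0` and `f_delete = 3.9`, and the small factorial data set are hypothetical choices made here, and the guard at the top of the function simply enforces the sufficient condition $F_d < F_e$ derived above.

```python
from itertools import product


def rss(X, y, cols):
    """Residual sum of squares of y regressed on a constant plus columns `cols` of X."""
    n = len(y)
    A = [[1.0] + [X[i][c] for c in cols] for i in range(n)]  # design matrix
    m = len(A[0])
    # Augmented normal equations [A'A | A'y], solved by Gauss-Jordan elimination
    M = [[sum(A[i][a] * A[i][b] for i in range(n)) for b in range(m)]
         + [sum(A[i][a] * y[i] for i in range(n))] for a in range(m)]
    for k in range(m):
        piv = max(range(k, m), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(m):
            if r != k:
                f = M[r][k] / M[k][k]
                M[r] = [u - f * v for u, v in zip(M[r], M[k])]
    coef = [M[k][m] / M[k][k] for k in range(m)]
    fitted = [sum(c * a for c, a in zip(coef, row)) for row in A]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def efroymson(X, y, f_enter=4.0, f_delete=3.9):
    """Stepwise selection using the ratios (2.1.3.1) and (2.1.3.2)."""
    if not f_delete < f_enter:
        raise ValueError("convergence requires f_delete < f_enter")
    n, p = len(y), len(X[0])
    selected = []
    while True:
        r = len(selected)
        candidates = [j for j in range(p) if j not in selected]
        if not candidates or n - r - 2 <= 0:
            break
        # Addition step: best single variable to add, criterion (2.1.3.1)
        rss_r = rss(X, y, selected)
        trial = {j: rss(X, y, selected + [j]) for j in candidates}
        best = min(trial, key=trial.get)
        ratio = (rss_r - trial[best]) / (trial[best] / (n - r - 2))
        if ratio <= f_enter:
            break
        selected.append(best)
        # Deletion step: criterion (2.1.3.2), not applied after the first addition
        while len(selected) > 1:
            r = len(selected)
            rss_r = rss(X, y, selected)
            trial = {j: rss(X, y, [c for c in selected if c != j]) for j in selected}
            weakest = min(trial, key=trial.get)
            ratio = (trial[weakest] - rss_r) / (rss_r / (n - r - 1))
            if ratio >= f_delete:
                break
            selected.remove(weakest)
    return selected


# Hypothetical data: a 2^3 factorial design in which only x0 and x1 drive y.
X = [list(map(float, row)) for row in product([-1, 1], repeat=3)]
y = [2 + 3 * x0 - 2 * x1 + 0.1 * x0 * x1 * x2 for x0, x1, x2 in X]

print(efroymson(X, y))  # indices of the selected variables, in order of entry
```

A variable that has just been added cannot be deleted at once, because its deletion ratio equals its addition ratio, which already exceeded $F_e > F_d$; this is the mechanism behind the convergence argument.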
2.1.4 Press

According to Draper and Smith (1981), the PRESS selection procedure, proposed by D. M. Allen in Technical Report No. 23, Department of Statistics, University of Kentucky, 1971, is a combination of all possible regressions, residual analysis and validation techniques. If $r$ is the number of parameters, including $\beta_0$, in a regression equation and there are $n$ observations in all, the basic calculations entail:
1. Deleting the first observation on the response and predictor variables.
2. Fitting all possible regressions to the remaining $n - 1$ data points.
3. Using each fitted model to predict $Y_1$ by $\hat{Y}_{1r}$ (say) and so obtaining a predictive discrepancy $(Y_1 - \hat{Y}_{1r})$ for all the possible regression models.
4. Repeating steps 1, 2 and 3, but deleting the second observation to give $(Y_2 - \hat{Y}_{2r})$ values, the third to give $(Y_3 - \hat{Y}_{3r})$ values, and so on, up to $n$ deletions.
5. Calculating the predictive discrepancy sum of squares $\sum_{i=1}^{n} (Y_i - \hat{Y}_{ir})^2$ for each subset regression model.
6. Choosing the "best" subset regression. This will have a comparatively small predictive sum of squares but will not involve many predictors.

2.1.5 Principal Component Regression

This is a procedure which analyses the correlation structure in some detail and was first proposed by Harold Hotelling (Draper and Smith (1981)). Let $Z$ represent the appropriately centred and scaled $X$ matrix. Then the correlation matrix is $Z'Z$, and the eigenvalues of this correlation matrix are the $k$ solutions $\lambda_1, \lambda_2, \ldots, \lambda_k$ of the determinantal equation

\[ |Z'Z - \lambda I| = 0 \tag{2.1.5.1} \]

for the model with all possible predictors $Z_1, Z_2, \ldots, Z_k$. By making each new variable column

\[ z_{ji} = \frac{Z_{ji} - \bar{Z}_j}{S_{jj}^{1/2}} \tag{2.1.5.2} \]

where

\[ n\bar{Z}_j = \sum_{i=1}^{n} Z_{ji}, \qquad S_{jj} = \sum_{i=1}^{n} (Z_{ji} - \bar{Z}_j)^2 \tag{2.1.5.3} \]

with zero mean and unit sum of squares, we have orthogonalised out a new $\beta_0$ term and cast the predictors into 'correlation form'. The rank of the non-singular correlation matrix is $k = r - 1$.
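The scaling in (2.1.5.2) and (2.1.5.3) can be checked numerically with a few lines of Python. The sketch below is illustrative only: the function names `correlation_form` and `gram` and the small data matrix are hypothetical. It relies on the standard fact that the sum of the eigenvalues of a matrix equals its trace, so the trace of $Z'Z$ gives the total of the $\lambda_j$ without computing any eigenvalue.

```python
import math


def correlation_form(X):
    """Centre each column and scale it to unit sum of squares, as in (2.1.5.2)."""
    n, k = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(k)]
    s = [sum((row[j] - means[j]) ** 2 for row in X) for j in range(k)]  # S_jj
    return [[(row[j] - means[j]) / math.sqrt(s[j]) for j in range(k)] for row in X]


def gram(Z):
    """Return Z'Z, which is the correlation matrix when Z is in correlation form."""
    k = len(Z[0])
    return [[sum(row[a] * row[b] for row in Z) for b in range(k)] for a in range(k)]


# Hypothetical predictor matrix with n = 5 observations and k = 3 predictors.
X = [[1.0, 2.0, 0.5],
     [2.0, 1.0, 1.5],
     [3.0, 4.0, 1.0],
     [4.0, 3.0, 3.0],
     [5.0, 6.0, 2.5]]

Z = correlation_form(X)
R = gram(Z)
trace = sum(R[j][j] for j in range(len(R)))

print([round(R[j][j], 6) for j in range(len(R))])  # each diagonal entry is 1
print(round(trace, 6))                              # equals k, the number of predictors
```

Because every diagonal element of the correlation matrix is one, the trace, and hence the sum of the eigenvalues, equals the number of predictors $k$.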
The total of all the sums of squares of the $Z_j$ is clearly $k$ (Draper and Smith (1981)). We call this the total variance of the $Z$'s. For each eigenvalue $\lambda_j$ there is an eigenvector $\gamma_j$ satisfying

\[ (Z'Z - \lambda_j I)\gamma_j = 0 \tag{2.1.5.4} \]

with $\gamma_j' \gamma_j = 1$. The vectors $\gamma_j$ are used to re-express the $Z$'s in terms of principal components $W$, in the form

\[ W_j = \gamma_{1j} z_1 + \gamma_{2j} z_2 + \cdots + \gamma_{kj} z_k \tag{2.1.5.5} \]

and the sum of squares of the new $W_j$ column, with elements $W_{ji}$, $i = 1, 2, \ldots, n$, is $\lambda_j$; that is, $W_j$ picks up an amount $\lambda_j$ of the total variance. We note that $\sum_j \lambda_j = k$ and $\sum_j \sum_i W_{ji}^2 = k$. The $W_j$ corresponding to the largest $\lambda_j$ value is called the first principal component and accounts for the largest proportion of the variation in the standardised data set. The remaining $W_j$'s explain smaller and smaller proportions until all the variation is explained, i.e. $\sum_{j=1}^{k} \lambda_j = k$. Not all the $W_j$'s are used; a selection procedure of some sort is applied, although there is no universally agreed-upon procedure.

2.1.6 Latent Root Regression

This is an extension of principal component regression, developed by Webster and his co-workers, for examining alternative predictive equations and eliminating predictor variables (Draper and Smith (1981)). The data matrix of the centred and scaled predictor variables is augmented with the centred and scaled response variable to provide $Z^* = (y, Z)$, where $Z$ is the centred and scaled '$X$ matrix' and $y = (Y - \mathbf{1}\bar{Y})/S_{YY}^{1/2}$, where $\mathbf{1}$ is an $n \times 1$ vector of 1's and $S_{YY} = \sum (Y_i - \bar{Y})^2$. It follows that $Z^{*\prime} Z^*$ is the augmented correlation matrix. The eigenvalues and their corresponding eigenvectors are calculated, and the first element of each eigenvector is used as a measure of the predictability of the response by that eigenvector. The larger the first element of the eigenvector, the more useful that eigenvector is in predicting the response variable, and vice versa.
The presence of small eigenvalues indicates potential linear dependence among the predictor variables. Eigenvectors whose eigenvalues and corresponding first elements are both small are dropped and a modified least squares estimation equation is obtained. The backward elimination procedure is then employed to remove predictor variables from the equation. The vector of coefficients of a modified least squares (MLS) equation is given by:

\[ b^* = \begin{bmatrix} b_1^* \\ b_2^* \\ \vdots \\ b_k^* \end{bmatrix} = c \sum_j^{*} \gamma_{0j} \lambda_j^{-1} \begin{bmatrix} \gamma_{1j} \\ \gamma_{2j} \\ \vdots \\ \gamma_{kj} \end{bmatrix} \tag{2.1.6.1} \]

where

\[ c = -\Big\{ \sum_j^{*} \gamma_{0j}^2 \lambda_j^{-1} \Big\}^{-1} \Big\{ \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \Big\}^{1/2} \tag{2.1.6.2} \]

and $\sum^{*}$ denotes a summation over only those values of $j$ whose vectors have been retained. Also $b_0^* = \bar{Y}$ for the model. The residual sum of squares for any modified least squares (MLS) equation can be written as

\[ RSS = \Big\{ \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \Big\} \Big\{ \sum_j^{*} \gamma_{0j}^2 \lambda_j^{-1} \Big\}^{-1} = -c \Big\{ \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \Big\}^{1/2} \tag{2.1.6.3} \]

The residual sum of squares that results from the deletion of $X_l$, $l = 1, 2, \ldots, k$, from the MLS equation can be evaluated as

\[ \Big\{ \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \Big\} \Big\{ t_{00} - \frac{t_{l0}^2}{t_{ll}} \Big\}^{-1} \tag{2.1.6.4} \]

where

\[ t_{rq} = \sum_j^{*} \frac{\gamma_{rj}\gamma_{qj}}{\lambda_j} \tag{2.1.6.5} \]

The main advantage of this method is that by removing the effect of the non-predictive near singularities, the true influences of the independent variables on the dependent variable are more clearly represented.

2.1.7 Branch-and-bound Techniques

Suppose that we are looking for the subset of $r$ variables out of $p$ variables which yields the smallest RSS. We begin by dividing all possible subsets into two branches, those which contain $X_1$ and those which do not. Within each branch we can have sub-branches including and excluding variable $X_2$, etc. Suppose at some stage we have found a subset of $r$ variables containing $X_1$ or $X_2$ or both, giving RSS = 100, say. Suppose we are about to start examining the sub-branch which excludes both $X_1$ and $X_2$.
A lower bound on the smallest RSS which can be obtained from this sub-branch is the RSS for all of the $p-2$ remaining variables. If this is, say, 108, then no subset of $r$ variables from this sub-branch can do better than this, and since we have already found a smaller RSS, the whole sub-branch can be skipped. The technique is useful when there are 'dominant' variables which good-fitting subsets must include. It is of no value when there are more variables than observations, as the lower bounds are nearly always zero.

2.1.8 Variable Selection via the Elastic net

According to Zou and Hastie (2005), the elastic net encourages a grouping effect where strongly correlated predictors tend to be in or out of the model together. It is particularly useful when the number of predictors ($p$) is much bigger than the number of observations ($n$).

2.1.8.1 Naive Elastic net

Let $y = (y_1, \ldots, y_n)'$ be the response and $X = (x_1 | \cdots | x_p)$ the model matrix, where $x_j = (x_{1j}, \ldots, x_{nj})'$, $j = 1, \ldots, p$, are the predictors. After a location and scale transformation, we can assume that the response is centred and the predictors are standardised, and hence

\[ \sum_{i=1}^{n} y_i = 0, \qquad \sum_{i=1}^{n} x_{ij} = 0 \quad \text{and} \quad \sum_{i=1}^{n} x_{ij}^2 = 1, \quad \text{for } j = 1, \ldots, p \tag{2.1.8.1.1} \]

For any fixed non-negative $\lambda_1$ and $\lambda_2$, we define the naive elastic net criterion as:

\[ L(\lambda_1, \lambda_2, \beta) = |y - X\beta|^2 + \lambda_2 |\beta|^2 + \lambda_1 |\beta|_1 \tag{2.1.8.1.2} \]

where

\[ |\beta|^2 = \sum_{j=1}^{p} \beta_j^2, \qquad |\beta|_1 = \sum_{j=1}^{p} |\beta_j| . \]

The naive elastic net estimator $\hat\beta$ is the minimiser of (2.1.8.1.2):

\[ \hat\beta = \arg\min_{\beta} \{ L(\lambda_1, \lambda_2, \beta) \}. \]

Let

\[ \alpha = \frac{\lambda_2}{\lambda_1 + \lambda_2}; \tag{2.1.8.1.3} \]

then solving for $\hat\beta$ in (2.1.8.1.2) is equivalent to the optimisation problem

\[ \hat\beta = \arg\min_{\beta} |y - X\beta|^2 \quad \text{subject to} \quad (1-\alpha)|\beta|_1 + \alpha|\beta|^2 \le t \text{ for some } t. \]

The function $(1-\alpha)|\beta|_1 + \alpha|\beta|^2$ is the elastic net penalty. In this discussion we consider the case where $\alpha < 1$. For all $\alpha \in [0, 1)$, the elastic net penalty function is singular (without first derivative) at 0, and it is strictly convex for all $\alpha > 0$.

Lemma 1.
Given data set $(y, X)$ and $(\lambda_1, \lambda_2)$, define an artificial data set $(y^*, X^*)$ by

\[ X^*_{(n+p) \times p} = (1 + \lambda_2)^{-1/2} \begin{pmatrix} X \\ \sqrt{\lambda_2}\, I \end{pmatrix}, \qquad y^*_{(n+p)} = \begin{pmatrix} y \\ 0 \end{pmatrix}. \]

Let $\gamma = \lambda_1 / \sqrt{1 + \lambda_2}$ and $\beta^* = \sqrt{1 + \lambda_2}\, \beta$. Then the naive elastic net criterion can be written as

\[ L(\gamma, \beta) = L(\gamma, \beta^*) = |y^* - X^*\beta^*|^2 + \gamma |\beta^*|_1 . \]

Let

\[ \hat\beta^* = \arg\min_{\beta^*} L(\gamma, \beta^*); \]

then

\[ \hat\beta = \frac{1}{\sqrt{1 + \lambda_2}}\, \hat\beta^* . \]

2.1.8.2 Elastic net

Zou and Hastie (2005) point out that empirical evidence shows that the naive elastic net does not perform satisfactorily unless it is very close to the lasso method discussed in section (2.2.2). This is why it is called naive. The elastic net improves the prediction performance of the naive elastic net. Given $(y, X)$ and penalty parameters $(\lambda_1, \lambda_2)$, and letting $(y^*, X^*)$ be the artificial data, the naive elastic net solves a lasso-type problem

\[ \hat\beta^* = \arg\min_{\beta^*} |y^* - X^*\beta^*|^2 + \frac{\lambda_1}{\sqrt{1 + \lambda_2}} |\beta^*|_1 \tag{2.1.8.2.1} \]

The elastic net (corrected) estimates $\hat\beta$ are defined by

\[ \hat\beta(\text{elastic net}) = \sqrt{1 + \lambda_2}\, \hat\beta^* \tag{2.1.8.2.2} \]

We recall that $\hat\beta(\text{naive elastic net}) = \{1/\sqrt{1 + \lambda_2}\}\, \hat\beta^*$; thus

\[ \hat\beta(\text{elastic net}) = (1 + \lambda_2)\, \hat\beta(\text{naive elastic net}). \tag{2.1.8.2.3} \]

Hence the elastic net coefficient is a rescaled naive elastic net coefficient.

An algorithm called LARS-EN (Zou and Hastie (2005)) is recommended to solve the elastic net efficiently. Algorithm LARS-EN sequentially updates the elastic net fits. In the $p > n$ case, such as with microarray data, it is not necessary to run the algorithm to the end. Real data and simulated computational experiments show that the optimal results are achieved at an early stage of algorithm LARS-EN. If we stop the algorithm after $m$ steps, then it requires $O(m^3 + pm^2)$ operations.

2.1.9 Generating all Subsets

It is feasible to generate all subsets of variables if the number of predictor variables is not too large, say less than 20, and if only the RSS is calculated for each subset.
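The exhaustive search of Section 2.1.9 can be sketched in a few lines. The following is only a minimal illustration, not the procedure used in this study: the helper names (`solve`, `rss`, `all_subsets_rss`) and the small data set are hypothetical, a constant term is assumed in every model, and the normal equations are solved by plain Gaussian elimination, which is adequate only for small, well-conditioned problems.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def rss(rows, y, subset):
    """RSS of the least squares fit of y on a constant plus the chosen columns."""
    design = [[1.0] + [row[j] for j in subset] for row in rows]
    k = len(design[0])
    xtx = [[sum(d[u] * d[v] for d in design) for v in range(k)] for u in range(k)]
    xty = [sum(d[u] * yi for d, yi in zip(design, y)) for u in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bj * dj for bj, dj in zip(beta, d))) ** 2
               for d, yi in zip(design, y))


def all_subsets_rss(rows, y):
    """Map every non-empty subset of column indices to its RSS (2**p - 1 fits)."""
    p = len(rows[0])
    return {s: rss(rows, y, s)
            for size in range(1, p + 1)
            for s in combinations(range(p), size)}


# Hypothetical data: y depends exactly on column 0; columns 1 and 2 are irrelevant.
rows = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [2.0, 1.0, 1.0],
        [3.0, 0.0, 0.0], [4.0, 1.0, 0.0]]
y = [1.0 + 2.0 * row[0] for row in rows]

table = all_subsets_rss(rows, y)
for subset, value in sorted(table.items(), key=lambda item: item[1]):
    print(subset, round(value, 6))
```

With $p$ predictors the table holds $2^p - 1$ entries, so each additional variable roughly doubles the number of fits.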
When the complete search has been carried out, a small number of the more promising subsets can be examined in more detail.

The disadvantage of generating all subsets is cost. The computational cost roughly doubles with each additional variable. Hence the availability of high-speed computing becomes imperative for this rather cumbersome procedure.

2.2 VARIABLE SELECTION IN THE COX REGRESSION MODEL

The Cox regression model, or proportional hazards model, for survival data assumes that

\[ h(t, x, \beta) = h_0(t) \exp\Big( \sum_j x_j \beta_j \Big) \tag{2.2.1} \]

where $h(t, x, \beta)$ is the hazard at time $t$ given predictor values $x = (x_1, \ldots, x_p)$ and $h_0(t)$ is an arbitrary baseline hazard function. We usually estimate the parameter $\beta = (\beta_1, \ldots, \beta_p)'$ in the proportional hazards model without specifying $h_0(t)$, through maximisation of the partial likelihood:

\[ L(\beta) = \prod_{r \in D} \frac{\exp(\beta' x_{j_r})}{\sum_{j \in R_r} \exp(\beta' x_j)} \tag{2.2.2} \]

where $D$ is the set of indices of the failure times, $R_r$ is the set of individuals at risk at the $r$th failure time and $j_r$ is the index of the individual failing at that time.

Performing a proportional hazards regression analysis requires a number of critical decisions. When selecting a subset of covariates, we must consider issues such as clinical importance and adjustment for confounding, as well as statistical significance. Once a subset is selected, we must determine whether the model is 'linear' in the continuous covariates and, if not, what transformations are suggested by the data and clinical considerations. Another important decision is the question of which interactions, if any, are to be included in the model.

Regardless of which method is used for covariate selection, any survival analysis should begin with a thorough bivariate analysis of the association between survival time and all important covariates. For categorical covariates the logrank test is employed; continuous covariates are first grouped by quartiles so that they become nominal and the logrank test can be applied.

Stepwise methods for Cox regression are similar to those that will be discussed for logistic regression in Chapter 3 and hence will not be considered in this section.
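As a small illustration of the partial likelihood (2.2.2) introduced earlier in this section, the sketch below evaluates its logarithm for a given $\beta$. It is a minimal example under stated assumptions: there are no tied failure times, the function name `log_partial_likelihood` and the four-subject data set are hypothetical, and no maximisation is attempted.

```python
import math


def log_partial_likelihood(beta, times, events, covariates):
    """Log of the partial likelihood (2.2.2), assuming no tied failure times."""
    # Linear predictor beta'x for every subject.
    scores = [sum(b * x for b, x in zip(beta, row)) for row in covariates]
    total = 0.0
    for r, (t_r, failed) in enumerate(zip(times, events)):
        if not failed:
            continue  # censored subjects contribute only through the risk sets
        # Risk set R_r: everyone still under observation at time t_r.
        risk = [math.exp(scores[j]) for j, t_j in enumerate(times) if t_j >= t_r]
        total += scores[r] - math.log(sum(risk))
    return total


# Hypothetical data: four subjects, one binary covariate, third subject censored.
times = [2.0, 3.0, 5.0, 7.0]
events = [1, 1, 0, 1]
covariates = [[1.0], [0.0], [1.0], [0.0]]

ll_zero = log_partial_likelihood([0.0], times, events, covariates)
ll_half = log_partial_likelihood([0.5], times, events, covariates)
print(ll_zero, ll_half)
```

At $\beta = 0$ every subject in a risk set has the same weight, so the log partial likelihood reduces to minus the sum of the logarithms of the risk set sizes, which gives a convenient check of the implementation.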
2.2.1 Purposeful Selection of variables

This is a method that is completely controlled by the data analyst. It begins with a multivariable model containing all variables significant in the bivariate analysis at the 20-25 percent level, as well as any other variables not selected by this criterion but judged to be of clinical importance. The use of this level of significance should lead to the inclusion of any variable that has the potential to be either an important confounder or statistically significant in the preliminary multivariable model.

Following the fit of the initial multivariable model, we use the P-values from the Wald tests of the individual coefficients to identify covariates that might be deleted from the model. The P-value of the partial likelihood ratio test should confirm that the deleted covariate is not significant. After fitting the reduced model, we assess whether or not removal of the covariate has produced an "important" change in the coefficients of the variables remaining in the model. We use a value of about 20 percent as an indicator of an important change in the coefficients. If the variable excluded is an important confounder, it should be added back into the model. It is also recommended that any variable excluded from the initial multivariable model be added back into the model to confirm that it is neither statistically significant nor an important confounder.

The next step is to examine the scale of the continuous covariates in the preliminary main effects model. There are methods that can be employed to assess whether the effect of the covariate is linear in the log hazard and, if not, which transformation is linear in the log hazard. One of the methods involves replacing the continuous covariate with design variables based on quartiles or other purposeful cut-points that may have been used in the bivariate analysis.
The estimated coefficients for the design variables are plotted against the midpoints of the groups and, at the midpoint of the first group, a point is plotted at zero. If the correct scale is linear in the log hazard, then the polygon connecting the points should be nearly a straight line. If there is a substantial departure from the linear trend, its form may be used to suggest a transformation of the covariate. The quartile method does not require any special software. However, it is not powerful enough to detect subtle, but often important, deviations from a linear trend.

Another approach is the method of fractional polynomials, which we shall not discuss in this study. The only software that has fully implemented this method is STATA (Hosmer & Lemeshow (1998)).

In the final step we determine whether interactions are needed in the model. Special considerations may dictate the inclusion of certain interaction terms irrespective of whether the coefficients are statistically significant or not. In most settings there will be insufficient clinical theory to justify automatic inclusion of interactions. Biologically plausible interactions are formed, and those that are individually significant at the 5 percent level are included simultaneously in the main effects model. The inclusion of nonsignificant interactions will increase standard error estimates, resulting in wide confidence intervals. The inclusion of an interaction term will change the coefficients of the relevant main effects. When there is a statistically significant interaction, we include the corresponding main effect terms in the model regardless of their statistical significance.

2.2.2 The Lasso Method (Tibshirani (1997))

We denote the log partial likelihood by $\ell(\beta) = \log L(\beta)$, and assume that the $x_{ij}$ are standardised so that $\sum_i x_{ij}/N = 0$ and $\sum_i x_{ij}^2/N = 1$. We estimate $\beta$ via the criterion

\[ \hat\beta = \arg\max \ell(\beta), \quad \text{subject to } \sum |\beta_j| \le s \tag{2.2.2.1} \]

where $s > 0$ is a user-specified parameter.
Suppose $\hat\beta^0_j$ are the maximisers of the partial likelihood (2.2.2). Then if $s \ge \sum |\hat\beta^0_j|$, the solutions to (2.2.2.1) are the usual partial likelihood estimates. If $s < \sum |\hat\beta^0_j|$, the solutions to (2.2.2.1) are shrunken towards zero. An attractive feature of the particular constraint $\sum |\beta_j| \le s$ is that quite often some of the solution coefficients are exactly zero, and hence this makes for a more interpretable final model.

The strategy for solving (2.2.2.1) is to express the usual Newton-Raphson update as an iterative reweighted least squares (IRLS) step, and then replace the weighted least squares step by a constrained weighted least squares procedure. If $X$ denotes the design matrix of regressor variables and $\eta = X\beta$, define $u = \partial\ell/\partial\eta$, $A = -\partial^2\ell/\partial\eta\,\partial\eta'$ and $z = \eta + A^{-1}u$. Then a one-term Taylor series expansion for $\ell(\beta)$ has the form

\[ (z - \eta)' A (z - \eta) \tag{2.2.2.2} \]

Hence to solve the original problem (2.2.2.1), we use the following procedure:

i) Fix $s$ and initialise $\hat\beta = 0$.
ii) Compute $\eta$, $u$, $A$ and $z$ based on the current value of $\hat\beta$.
iii) Minimise $(z - X\beta)' A (z - X\beta)$ subject to $\sum |\beta_j| \le s$.
iv) Repeat steps ii) and iii) until $\hat\beta$ does not change.

Since $A$ is a full matrix, it requires computation of $O(N^2)$ elements. However, this difficulty can be avoided by replacing $A$ with a diagonal matrix $D$ that has the same diagonal elements as $A$.

If the log partial likelihood is bounded in $\beta$ for the given data set, then for fixed $s$ a solution to (2.2.2.1) exists since the region $\sum |\beta_j| \le s$ is compact. But the solution may not be unique.

In some situations it is desirable to have an automatic method for choosing the parameter $s$ based on the data. Tibshirani's proposal is to minimise an approximate Generalised Cross Validation (GCV) statistic. We write the constraint $\sum |\beta_j| \le s$ as $\sum \beta_j^2/|\beta_j| \le s$. This latter constraint is equivalent to adding a Lagrangian penalty $\lambda \sum \beta_j^2/|\beta_j|$ to the log partial likelihood, with $\lambda \ge 0$ depending on $s$.
We may write the constrained solution $\tilde\beta$ of step iii) in the form

\[ \tilde\beta = (X'DX + \lambda W)^{-1} X'Dz \tag{2.2.2.3} \]

where $W = \mathrm{diag}(W_j)$, with $W_j = 1/|\tilde\beta_j|$ if $|\tilde\beta_j| > 0$ and 0 otherwise. Therefore we may approximate the number of effective parameters in the constrained fit $\tilde\beta$ by

\[ p(s) = \mathrm{tr}\big[ X (X'DX + \lambda W)^{-1} X'D \big]. \]

Letting $\ell_s$ be the log partial likelihood for the constrained fit with constraint $s$, we construct the GCV-style statistic

\[ GCV(s) = \frac{1}{N} \cdot \frac{-\ell_s}{N[1 - p(s)/N]^2}. \]

The GCV criterion inflates the negative log partial likelihood by a factor that involves $p(s)$, the effective number of parameters; larger values of $p(s)$ cause more inflation of the negative log partial likelihood.

The simulation study by Tibshirani revealed that the lasso clearly outperforms stepwise selection and picked the correct number of zero coefficients. It is less variable than the stepwise approach and still yields interpretable models.

2.3 VARIABLE SELECTION FOR TIME SERIES DATA

Marriot and Pettitt (1997) proposed a model that takes the form:

\[ Y_t = \text{Filter} + \text{Covariates} + \text{noise} \]

where the filter is a "time series filter" designed to capture stochastic and deterministic trends and seasonality, and also to correct for possibly autocorrelated noise terms. We simply seek to remove the "time series behaviour" from the dependent variable to prevent it from hiding the effects that any exogenous explanatory variable or covariate might have. The trend component consists of a lagged dependent variable and a linear time trend, and the seasonal component is also a lagged dependent variable. The proposed time series filter is given by

\[ \text{Filter} = \alpha + \beta t + \nu Y_{t-1} + \partial Y_{t-s} + \sum_{i=1}^{p} \phi_i \Delta Y_{t-i} \]

where $T$ observations are available, $\Delta$ is the difference operator, $\Delta Y_t = Y_t - Y_{t-1}$, and $s$ is the period of the seasonality.
The exogenous explanatory variables or covariates are given as

\[ \text{Covariates} = \sum_{i=1}^{k} \psi_i X_{t,i} \]

where $X_{t,1}, \ldots, X_{t,k}$ are observations on the covariates, and the complete model for the observed data is

\[ Y_t = \alpha + \beta t + \nu Y_{t-1} + \partial Y_{t-s} + \sum_{i=1}^{p} \phi_i \Delta Y_{t-i} + \sum_{i=1}^{k} \psi_i X_{t,i} + \varepsilon_t, \qquad t = 1, 2, \ldots, T \]

where $\varepsilon_t \sim \text{iid } N(0, \sigma^2)$. The model is given in vector form as

\[ Y = Z\theta + \varepsilon \tag{2.3.2} \]

where $Z = (F, X)$, the columns of $F$ and $X$ being sample values of the filter and covariates respectively, and $\theta' = (\alpha, \beta, \nu, \partial, \phi_1, \ldots, \phi_p, \psi_1, \ldots, \psi_k)$.

From (2.3.2), Marriot and Pettitt (1997) point out that Zellner (1971) shows that, using a non-informative joint prior for the parameters and writing $D$ to represent the past history of both $Y_T$ and $X_{T,i}$, the marginal posterior density for $\theta$ is:

\[ f(\theta \mid D) \propto \{ \nu s^2 + (\theta - \hat\theta)' Z'Z (\theta - \hat\theta) \}^{-T/2} \]

where $\nu = T - p - k - 4$,

\[ s^2 = \frac{(Y - Z\hat\theta)'(Y - Z\hat\theta)}{\nu} \quad \text{and} \quad \hat\theta = (Z'Z)^{-1} Z'Y. \]

This is a multivariate t-density. The marginal posterior density for $\sigma$ is

\[ f(\sigma \mid D) \propto \frac{1}{\sigma^{\nu+1}} \exp\Big( -\frac{\nu s^2}{2\sigma^2} \Big) \]

which is an inverse gamma type distribution. The predictive density for a future value $Y_F$ with covariate row $\tilde z$ is

\[ f(Y_F \mid D, \tilde z) = \int f(Y_F \mid \theta, \sigma, \tilde z)\, f(\theta, \sigma \mid D)\, d\theta\, d\sigma \propto \{ \nu + (Y_F - \tilde z\hat\theta)' H (Y_F - \tilde z\hat\theta) \}^{-(\nu+1)/2} \]

where

\[ H = \frac{1}{s^2} \{ 1 - \tilde z (Z'Z + \tilde z'\tilde z)^{-1} \tilde z' \}, \]

which is a t-density. The mean and variance of $Y_F$ are

\[ E[Y_F \mid D, \tilde z] = \tilde z (Z'Z)^{-1} Z'Y \]

and

\[ E[(Y_F - E[Y_F \mid D, \tilde z])^2 \mid D, \tilde z] = \frac{\nu}{\nu - 2}\, s^2 \{ 1 + \tilde z (Z'Z)^{-1} \tilde z' \}. \]

If we delete the $i$th row of the $Z$-matrix to get $Z_{-i}$, the complete Bayesian analysis using $Z_{-i}$ in place of $Z$ is undertaken to obtain the posterior densities. The deleted row $z_i$ is used to obtain the predictive density for the observed $Y$ value, $Y_i$. The predictive mean $E[Y_i \mid D, z_i]$ and standard deviation $S[Y_i \mid D, z_i]$ are then used in the construction of diagnostic plots.
The plots are designed to help answer the question of whether or not an exogenous explanatory variable makes a significant additional contribution to the model, where we consider an additional contribution to be significant if it appears to improve the predictive power of the model. The order of including explanatory variables is given by backward elimination, the variable corresponding to the smallest value of

\[ \left[ \frac{E[\psi_i \mid D]}{S[\psi_i \mid D]} \right] \]

being removed at each step.

We plot the absolute value of the deviation (AD) of the observation from the predictive mean, $|Y_i - E[Y_i \mid D, z_i]|$, against the predictive standard deviation (SD), $\sqrt{\mathrm{var}[Y_i \mid D, z_i]}$, for each model. We then plot the convex hull of the scatter. For a clearer picture of the data, all points on the convex hull are 'peeled' away and the set of points that form the convex hull of the remaining scatter is identified. The process is repeated until the central 50% of the scatter is reached, and the convex hull of the central 50% is then superimposed on the picture. Plots arising from different models are superimposed, suppressing the original scatter, and the resulting pictures make the relative performance of competing models easy to assess. The better model is the one that combines low predictive dispersion with few extreme values; graphically, the plot of its convex hull is closest to the origin.

If a graphical choice of model is not clear cut, the sample means of the absolute deviations, MAD, and of the standard deviations, MSD, are used to select the optimal model. The sum of these two, MAD + MSD, provides a simple but useful numerical summary of the absolute deviation-standard deviation (ADSP) plot.

2.4 HYPOTHESIS TESTING

Suppose that by some method we have already selected $r$ variables, where $r$ may be zero, out of the $p$ variables available to include in our predictor subset.
If the remaining variables contain no further information which is useful for predicting the response variable, then we should certainly not make any further selection. But we need to know whether the remaining variables contain further information or not. The following hypothesis can be tested:

\[ H_0: \beta_{r+1} = \beta_{r+2} = \cdots = \beta_p = 0 \]

where these $\beta$'s are the regression coefficients of the variables which have not been selected.

2.4.1 The lack-of-fit Test

If we have $n$ observations and have fitted a linear model containing $r$ out of $p$ variables plus a constant, then the difference in RSS between fitting the $r$ variables and fitting all the $p$ variables, $RSS_r - RSS_p$, can be compared with $RSS_p$, giving the lack-of-fit statistic:

\[ \text{Lack of fit } F = \frac{(RSS_r - RSS_p)/(p - r)}{RSS_p/(n - p - 1)} \tag{2.4.1.1} \]

If the usual conditions of independence, constant variance and normality are satisfied, then the lack-of-fit statistic is sampled from an F-distribution with $(p-r)$ and $(n-p-1)$ degrees of freedom.

2.4.2 The Coefficient of Determination, R2

According to Miller (1990), the distribution of $R^2$ for a random subset of $r$ variables, when the Y-variable is uncorrelated with the X-variables, is a beta distribution with

\[ \text{prob}(R^2 < z) = \frac{1}{B(a, b)} \int_0^z t^{a-1} (1 - t)^{b-1}\, dt \]

where $a = r/2$ and $b = (n - r - 1)/2$ if a constant has been included in the model but not counted in the $r$ variables.
Using the beta distribution and fitting constants to their tables, as Miller (1990) points out, Rencher and Pun obtained the following formula for the upper $100(1-\gamma)\%$ point of the distribution of the maximum $R^2$ using Efroymson's algorithm:

\[ R_\gamma^2 = \left[ 1 + \frac{\log_e \gamma}{(\log_e N)^{1.8 N^{0.4}}} \right] F^{-1}(\gamma) \tag{2.4.2.1} \]

where $N = {}^pC_r$ and $F^{-1}(\gamma)$ is the value of $z$ such that $\text{prob}(R^2 < z) = \gamma$. Values of $F^{-1}(\gamma)$ can be obtained from tables of the incomplete beta function or from tables of the F-distribution. Writing $Reg_r$ to denote the regression sum of squares on $r$ variables, we have

\[ R^2 = \frac{Reg_r}{Reg_r + RSS_r}. \]

Write

\[ F = \frac{Reg_r / r}{RSS_r / (n - r - 1)} \]

as the usual variance ratio for testing the significance of the subset of $r$ variables if it had been chosen a priori; then

\[ R^2 = \frac{r}{r + (n - r - 1)/F}. \tag{2.4.2.2} \]

Thus the value of $R^2$ such that $\text{prob}(R^2 < z) = \gamma$ can be calculated from the corresponding value of $F$ with $r$ and $(n-r-1)$ degrees of freedom for the numerator and denominator respectively. Since the reciprocal of a variance ratio also has an F-distribution, but with the degrees of freedom interchanged, we use the tables with $(n-r-1)$ and $r$ degrees of freedom for the numerator and denominator respectively, with upper tail area $\gamma$, and then take the reciprocal of the F-value read from the tables. The upper limit of $R^2$ is then obtained by substitution in (2.4.2.2) and finally into (2.4.2.1).

2.4.3 Minimum Adequate Sets

Miller (1990) points out that Aitkin advances the following argument: if we decide a priori on the comparison of subset $X_2$ with the full model containing all the variables in $X$, then we should use the likelihood-ratio test, which gives the variance ratio statistic:

\[ F = \frac{(RSS_r - RSS_p)/(p - r)}{RSS_p/(n - p)} \tag{2.4.3.1} \]

where the counts of variables ($r$ and $p$) include one degree of freedom for a constant if it is included in the models.
Under the null hypothesis that none of the $(p-r)$ variables excluded from $X_2$ is in the 'true' model, this quantity is distributed as $F(p-r, n-p)$, subject to the assumptions of independence, normality and homoscedasticity of the residuals from the model. Aitkin then considers the statistic:

\[ U(X_2) = (p - r) F \tag{2.4.3.2} \]

The maximum value of $U$ for all possible subsets including a constant is then

\[ U_{\max} = \frac{RSS_1 - RSS_p}{RSS_p/(n - p)} \]

where $RSS_1$ is the sum of squares of $Y$ about the mean. A simultaneous $100\alpha\%$ test for all the hypotheses $\beta_2 = 0$ for all subsets $X_2$ is obtained by testing that:

\[ U(X_2) < (p - 1) F(\alpha, p - 1, n - p). \tag{2.4.3.3} \]

Subsets which satisfy (2.4.3.3), and which are such that the condition fails if any variable is removed from the subset, are referred to as 'minimal adequate sets'.

2.5 COMPARISON OF MODELS: SOLUTION CRITERIA

Once a manageable set of models is reached, criteria are needed to decide on an appropriate subset among the contending subsets. The accuracy of any model is measured by a discrepancy, a measure of lack of fit of the model at hand. The model which minimises the expected discrepancy is selected as the 'best' model.

The overall discrepancy consists of two components: the discrepancy due to approximation (bias) and the discrepancy due to estimation (variance). The discrepancy due to approximation decreases as the number of parameters increases; the discrepancy due to estimation increases as the number of parameters increases. A consistent estimator of the expected discrepancy is called a criterion and is used for model selection.

2.5.1 Akaike's Information Criterion (AIC) and the Bayes Information Criterion (BIC)

According to George (2000), these two criteria are among the most popular, although they are motivated from very different viewpoints. Letting $\hat{l}_\gamma$ denote the maximum log likelihood of the $\gamma$th model, AIC selects the model which maximises

\[ A = \hat{l}_\gamma - q_\gamma \tag{2.5.1.1} \]

where $q_\gamma$ is defined in paragraph (1.2) of Chapter 1.
Miller (1990) points out that the AIC has often been used as the stopping rule for selecting ARIMA (auto-regressive, integrated, moving average) models, where selection is not only between models with different numbers of parameters but also between many models of the same size. He further suggests that the AIC, with various modifications, can be applied in situations in which normality is not assumed. The BIC selects the model which maximises

\[ B = \hat{l}_\gamma - \frac{1}{2} (\log n)\, q_\gamma . \]

George (2000) cites Haughton as showing that BIC is consistent when the true model is fixed, and Shibata as showing that AIC is consistent if the dimensionality of the true model increases with $n$, the number of observations (at an appropriate rate).

2.5.2 Cp-Statistics (Cr-Criterion)

According to Hocking & Leslie (1967), C. L. Mallows suggests that the standardised total squared error be used as a criterion, and he developed an estimate $C_p$ of this quantity given by:

\[ C_p = \frac{RSS_r}{\hat\sigma^2} - (n - 2r), \tag{2.5.2.1} \]

where $r$ is the number of variables in the regression, $RSS_r$ is as defined in (2.1.3.1) and $\hat\sigma^2$ is an estimate of $\sigma^2$. Now, if an equation with $r$ parameters is adequate, that is, does not suffer from lack of fit, then $E(RSS_r) = (n - r)\sigma^2$, so that

\[ E(C_p) \approx \frac{(n - r)\sigma^2}{\sigma^2} - (n - 2r) \approx r \tag{2.5.2.2} \]

for an adequate model. It follows that a plot of $C_p$ versus $r$ will show up the 'adequate models' as points fairly close to the line $C_p = r$. Thus subsets with small $C_p$ and $C_p$ close to $r$ will be considered to be good.

Certainly, of the $\binom{p}{r}$ possible regressions of size $r$, only a few will be considered to be good. We are interested in that subset of size $r$ for which the residual sum of squares, and thus the $C_p$, is minimal.

Hocking & Leslie (1967) further describe a method that allows the subset of size $r$ to be identified after having compared the residual sums of squares for only a small fraction of the possible $\binom{p}{r}$ subsets. This computation will mostly yield those regressions with small $C_p$.
Reference is made to the $k = p - r$ variables which are to be removed from the regression rather than to the variables which are to be retained. Reference shall also be made to the "reduction in regression sum of squares" due to removing a set of $k$ variables. Now the set of $k$ variables for which this reduction is minimum determines that set of $r$ variables to be retained for which the residual sum of squares is minimum. If $\hat\sigma^2$ is determined by the residual mean square for the complete regression, and $Red_r$ denotes the reduction, the $C_p$ statistic can also be computed from this reduction:

\[ C_p = \frac{Red_r}{\hat\sigma^2} + (2r - p) \tag{2.5.2.3} \]

If a single variable, say the $i$th, is removed from the regression, the reduction is given by $\hat\sigma^2 t_i^2$, where

\[ t_i^2 = \frac{b_i^2}{\hat\sigma_{b_i}^2} \tag{2.5.2.4} \]

is the square of the usual t-statistic associated with the $i$th regression coefficient. The $b_i$ are defined by $b_i = (D_i' X'X D_i)^{-1} D_i' X'Y$. Let

\[ \theta_i = \hat\sigma^2 t_i^2 \tag{2.5.2.5} \]

be the reduction due to eliminating the $i$th variable, where $i = 1, \ldots, p$.

First, we compute the full regression by solving the normal equations:

\[ X'X\beta = X'Y \tag{2.5.2.6} \]

and then evaluate the $p$ univariate reductions, $\theta_i$. We assume that the variables are labelled according to the magnitude of the $\theta_i$, that is

\[ \theta_1 \le \theta_2 \le \cdots \le \theta_p . \tag{2.5.2.7} \]

With this labelling, the subset of size $p-1$ with minimum residual sum of squares is obtained by deleting the first variable.

This approach is based on the fundamental property of quadratic forms which states that if the reduction in the regression sum of squares due to eliminating any set of variables for which the maximum subscript is $j$ is not greater than $\theta_{j+1}$, then no subset including any variable with subscript greater than $j$ can result in a smaller reduction.

We now describe a sequential method consisting of at most $r+1$ stages for each value of $r = 1, 2, \ldots, p-2$. The first stage consists of computing the reduction due to eliminating variables $1, 2, \ldots, k$ for $k = p - r$ under the labelling indicated in expression (2.5.2.7).
If this reduction does not exceed $\theta_{k+1}$, then, according to the above property, the process is terminated and the regression consisting of the $r$ variables $k+1, \ldots, p$ is taken to be the 'best' subset of size $r$ in the sense of minimum residual sum of squares.

If the reduction computed in the first stage exceeds $\theta_{k+1}$, then no decision can be made; we proceed to the second stage, and variable $k+1$ is included among the candidates for elimination. The $\binom{k}{1}$ reductions due to eliminating any set of $k$ variables selected from the first $k+1$ but containing the $(k+1)$st variable are then computed. If the smallest of the $1 + \binom{k}{1}$ reductions computed to this point does not exceed $\theta_{k+2}$, the process terminates and the corresponding subset is 'best'. If not, no decision is taken at this second stage and we proceed to the third stage.

In the third stage the reductions are computed for all subsets of size $k$ selected from the first $k+2$ variates which contain variable $k+2$, a total of $\binom{k+1}{2}$ computations. The minimum of the $1 + \binom{k}{1} + \binom{k+1}{2}$ reductions from the first three stages is now compared with $\theta_{k+3}$, and the iteration either terminates or continues to the next stage.

In general, at any stage, say the $q$th, a total of $\binom{k+q-2}{q-1}$ reductions must be computed and checked to see if the 'best' subset can be identified. At this stage the largest subscript on any variable being considered is $k+q-1$, and hence the search can be terminated if the minimum of the

\[ \sum_{j=1}^{q} \binom{k+j-2}{j-1} \]

reductions computed in the first $q$ stages does not exceed $\theta_{k+q}$; the corresponding subset is then 'best'. If not, we proceed to stage $q+1$, where subsets of size $k$ containing variable $k+q$ are considered.

However, it has been observed that it rarely happens that all $r+1$ stages are completed, except for very small values of $r$.
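The criterion (2.5.2.1) that motivates this search is itself a one-line calculation once the residual sums of squares are available. The sketch below is a worked illustration only: the function name `mallows_cp` and all numerical values are hypothetical, $r$ and $p$ are assumed to count parameters including the constant, and $\hat\sigma^2$ is assumed to be the residual mean square of the full equation.

```python
def mallows_cp(rss_r, r, n, sigma2_hat):
    """Cp = RSS_r / sigma2_hat - (n - 2r), as in (2.5.2.1)."""
    return rss_r / sigma2_hat - (n - 2 * r)


# Hypothetical values: n observations, p parameters in the full equation.
n = 20
p = 5
rss_full = 30.0
sigma2_hat = rss_full / (n - p)  # residual mean square of the full equation

# Hypothetical minimum RSS found for each subset size r.
candidates = {2: 75.0, 3: 36.0, 4: 31.0, 5: rss_full}
cp = {r: mallows_cp(value, r, n, sigma2_hat) for r, value in candidates.items()}

for r, value in cp.items():
    print(r, value, "close to r" if abs(value - r) <= 1.0 else "far from r")
```

With this choice of $\hat\sigma^2$ the full equation always gives $C_p = p$, so the comparison of interest is between the smaller subsets and the line $C_p = r$.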
2.5.3 The Sp-Statistics (Sr-Statistics)

According to Thomson (1978), this method is regarded as being amongst the most suitable for variable selection in multivariable regression analysis where the dependent variable $y$ and the $p$ independent variables have a $(p+1)$-dimensional normal distribution. The criterion used minimises the expected squared distance between the true and predicted values of the dependent variable $y$. The value of $y$, conditionally given some predictor set $xD_r$, $r \le p$, may be expressed as follows:

\[ y = \beta_0 + (xD_r - \bar{X}_r)\beta_r + \varepsilon_r \tag{2.5.3.1} \]

where $\bar{X}_r$ is the $(1 \times r)$ vector of means obtained from a regression sample for the $r$ variables being used and $\varepsilon_r \sim N(0; \sigma_r^2)$. For some particular predictor set $x$, a future value of $y$ is predicted by $\hat{y}_r$:

\[ \hat{y}_r = b_0 + (xD_r - \bar{X}_r) b_r \tag{2.5.3.2} \]

where

\[ b_r = [D_r' (X - 1_n\bar{X})' (X - 1_n\bar{X}) D_r]^{-1} D_r' (X - 1_n\bar{X})' Y \]

and $n$ is the regression sample size. The method involves calculating the statistic:

\[ S_p = \frac{MSE_r}{n - r - 2} \tag{2.5.3.3} \]

or

\[ S_p = \frac{RED_r + SSE_p}{(n - r)(n - r - 2)} \tag{2.5.3.4} \]

for subsets of the independent variables, where $RED_r$ is the reduction in the regression sum of squares between the full $p$-variable regression and the $r$-variable regression, $r = 1, 2, \ldots, p$, and $SSE_p$ is the error sum of squares. Equation (2.5.3.4) as opposed to (2.5.2.3) provides an efficient computational procedure for the use of this statistic. The subset of variables chosen is the one which yields the smallest value of $S_p$. However, if the independent variables cannot be regarded as randomly and normally distributed, the use of $C_p$ is suggested.

2.5.4 RMS, R2 and Adjusted R2 Statistics

These are common criterion functions which are simple functions of the residual sum of squares for the $r$-term equation, denoted by $RSS_r$.

2.5.4.1 The Residual Mean Square

The residual mean square is given by:

\[ RMS_r = \frac{RSS_r}{n - r} \tag{2.5.4.1} \]

Hocking (1976) points out that many statisticians voice a preference for the residual mean square, $RMS_r$, as a criterion function.
$\mathrm{RMS}_r$ is plotted against r and the choice of r is based on:
I. The minimum $\mathrm{RMS}_r$.
II. The value of r such that $\mathrm{RMS}_r$ equals the RMS for the full equation, or
III. The value of r such that the locus of the smallest $\mathrm{RMS}_r$ turns sharply upwards.

2.5.4.2 The Squared Multiple Correlation Coefficients (SMCC)

The SMCC is given by:

\[ R_r^2 = 1 - \frac{\mathrm{RSS}_r}{\mathrm{TSS}} \tag{2.5.4.2.1} \]

The plot of $R_r^2$ versus r may yield a locus of the maximum $R_r^2$ which remains quite flat as r is decreased and then turns sharply down. The value of r at which this ‘knee’ in the $R_r^2$ plot occurs is frequently used to indicate the number of terms in the model. Since $R^2$ expresses the residual sum of squares relative to the total sum of squares, it would appear to be a reasonable measure of model adequacy. The relation of $R^2$ to $C_p$ is given by

\[ C_p = (n-t-1)\,\frac{1-R_r^2}{1-R^2} + 2p - n \tag{2.5.4.2.2} \]

It follows from this relation that, while the $R_r^2$ plot may be quite flat for a given range of r, the coefficient (n-t-1) can magnify small differences, causing $C_p$ to increase dramatically as r is decreased. Therefore, the $R_r^2$ criterion may suggest the deletion of more variables than the minimum $C_p$ criterion. Simulation studies by some authors, as described by Hocking (1976), indicate that essential variables may be deleted using the $R_r^2$ criterion. Also, lacking a precise definition of the knee, the qualitative inspection of the $R_r^2$ plot is dependent on the scale.

2.5.4.3 The Adjusted R2 or Fisher’s A-statistics

The adjusted $R^2$-statistic (adjusted for degrees of freedom) is usually defined as:

\[ \bar{R}_r^2 = 1 - (1 - R_r^2)\,\frac{n-1}{n-r} \tag{2.5.4.3} \]

as an alternative to $R^2$. Some users recommend the adjusted squared multiple correlation coefficient $\bar{R}^2$ and suggest using the value of r for which $\bar{R}_r^2$ is maximum. Substituting the definitions of $R_r^2$ in (2.5.4.2.1) and of $\mathrm{RMS}_r$ in (2.5.4.1), the adjusted $R^2$-statistic is given by:

\[ \bar{R}_r^2 = 1 - (n-1)\,\frac{\mathrm{RMS}_r}{\mathrm{TSS}}. \]

Maximising $\bar{R}_r^2$ is therefore exactly equivalent to minimising $\mathrm{RMS}_r$.
There appears to be no advantage in using $\bar{R}_r^2$ over $\mathrm{RMS}_r$ in view of the above relation.

Chapter 3
THE LOGISTIC MODEL AND VARIABLE SELECTION FOR A BINARY OUTCOME VARIABLE

Having discussed variable selection procedures for continuous outcome variables in Chapter 2, we now consider situations where the response variable is a categorical random variable attaining only two possible outcomes. First a model and the estimation of its parameters are discussed in detail. Then variable selection for this model is presented. The discussion that follows makes use of the following references: Czepiel, S; Guyon, I and Elisseeff, A (2002); Joubert, G (1994); Hosmer, D W and Lemeshow, S (1989); Larson, P V (2001); Menard, S (2001).

3.1 BINARY DATA

When the response variable is dichotomous, it is convenient to denote one of the outcomes as ‘success’ and the other as ‘failure’. For example, if a patient is cured of a disease, the response is ‘success’; if not, the response is ‘failure’. If a mouse dies from toxic exposure, the response is ‘success’; if not (i.e. if it survives), the response is ‘failure’.

It is standard to let the response variable Z be the binary variable which attains the value 1 if the outcome is ‘success’ and 0 if the outcome is ‘failure’. Let π = P(Z=1) so that P(Z=0) = 1 - π; then Z ~ B(1, π).

Suppose that data on p predictor variables $x_1,\dots,x_p$ are available for each patient or mouse. The objective is to investigate the relationship between π and the predictor variables. In a regression situation, each response variable is associated with a given set of values of the explanatory variables $x_1,\dots,x_p$.
For example, whether or not a patient is cured of a disease may depend on the particular medical treatment the patient is given, the patient’s general state of health, age, gender, etc.; whether or not an item in a manufacturing process passes the quality control may depend on various conditions regarding the production process, such as temperature, quality of raw material, time since last service of the machinery, etc.

It is often possible to group the observations in such a way that all observations within a group have the same values of the predictor variables. For instance, we may group the patients in the disease example according to type of medical treatment, gender, age group, etc., such that there are several patients in each group. When the data can be grouped it is easier to record the number of successes and failures for each group, rather than recording a long series of 0s and 1s.

Example 3.1 (Larsen 2005)
The link between the use of oral contraceptives and the incidence of myocardial infarction was investigated. The table below gives the number of women in the study using the contraceptive pill who suffered a myocardial infarction, and the number using the pill who did not suffer a myocardial infarction. The corresponding numbers for women not using the pill are also given.

                  Infarction
                  Yes     No
    Pill   Yes     23     34
           No      35    132

3.2 LOGISTIC REGRESSION

Binomial logistic regression is a form of regression which is used when the response variable is a dichotomy and the predictor variables are of any type (i.e. discrete or continuous). It can be used to predict a response variable on the basis of values of predictors and to determine the percentage of variance in the response variable explained by the predictors; to rank the relative importance of predictors; to assess interaction effects; and to understand the impact of covariate control variables.
Logistic regression has proven to be one of the most versatile techniques in the class of generalised linear models (Czepiel, S).

Whereas linear regression models equate the expected value of the dependent variable to a linear combination of predictor variables and their corresponding parameters, generalised linear models equate the linear combination to some function of the probability of a given outcome on the dependent variable. In logistic regression, that function is the logit transform: the natural logarithm of the odds that some event will occur. In linear regression, parameters are estimated using the method of least squares, by minimising the sum of squared deviations of predicted values from observed values. However, logistic regression is not capable of producing minimum variance unbiased (minvu) estimators of the actual parameters. In place of the minvu estimators, maximum likelihood estimation is used to solve for the parameters.

3.2.1 Assumptions

Logistic regression is popular in part because it enables the researcher to overcome many of the restrictive assumptions of ordinary least squares (OLS) regression:
i) Logistic regression does not require linear relationships between the predictors and the response variable, but assumes a linear relationship between the predictors and the logit of the response variable.
ii) The response need not be normally distributed (we need to assume its distribution is within the range of the exponential family of distributions, such as normal, Poisson, binomial, gamma).
iii) The response variable need not be homoscedastic for each combination of levels of the predictors; that is, there is no homogeneity of variance assumption.
iv) Normally distributed errors are not assumed. However, errors are assumed to be independent.
v) Logistic regression does not require that the predictors be measured on an interval scale.
vi) Logistic regression does not require the dependent variable to be unbounded.
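The logit transform referred to above, and its inverse, can be written in a few lines. This is a minimal sketch only; the function names `logit` and `inv_logit` are chosen for illustration and do not come from the sources cited.

```python
import math

def logit(p):
    """Natural logarithm of the odds p / (1 - p), for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse of the logit: maps any real eta back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# A probability of 0.5 corresponds to odds of 1 and a logit of 0.
print(logit(0.5), inv_logit(0.0))
```

The logit is unbounded while the probability stays in (0, 1), which is why the linear relationship in assumption i) is placed on the logit scale rather than on the probability itself.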
3.2.2 The Multiple Linear Logistic Regression Model

Let Z be a dichotomous random variable (with outcomes termed ‘success’ and ‘failure’) denoting the outcome of some experiment, and let $X = (x_1,\dots,x_p)$ be a collection of predictor variables. Given a data set with a total sample size of M, where each observation is independent of all the others, Z can be considered as a column vector of M binomial random variables $Z_i$. The data are aggregated such that each row represents one distinct combination of values of the predictor variables. The rows are often referred to as ‘populations’. Let N represent the total number of populations and let n be a column vector with elements $n_i$ representing the number of observations in population i, for i = 1 to N, where $\sum_{i=1}^{N} n_i = M$, the total sample size.

Let Y be a column vector of length N where each element $Y_i$ is a random variable representing the number of successes of Z for population i. Let the column vector y contain elements $y_i$ representing the observed counts of the number of successes for each population. Let π be a column vector, also of length N, with elements $\pi_i = P(Z_i = 1 \mid i)$, i.e. the probability of success for any given observation in the ith population.

The linear component of the model contains the design matrix and the vector of parameters to be estimated. The design matrix of predictor variables, X, is composed of N rows and p+1 columns, where p is the number of predictor variables specified in the model. For each row of the design matrix, the first element is $x_{i0} = 1$ for all i. This is the intercept. The parameter vector, β, is a column vector of length p+1. There is one parameter corresponding to each of the p columns of predictor variable settings in X, plus one, $\beta_0$, for the intercept.
The logistic regression model equates the logit transform, the log-odds of the probability of a success, to the linear component:

\[ \mathrm{logit}(\pi_i) = \log\left(\frac{\pi_i}{1-\pi_i}\right) = \sum_{k=0}^{p} x_{ik}\beta_k = \beta_0 x_{i0} + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}, \qquad i = 1,2,\dots,N \tag{3.2.2.1} \]

If some of the independent variables are discrete (nominal scaled variables such as race, sex, treatment group, and so forth), it is inappropriate to include them in the model as if they were interval scaled. The numbers used to represent the various levels are simply identifiers and have no numeric significance. The method of choice is to use a collection of design variables (or dummy variables). For example, if one of the predictor variables is race, coded as “white”, “black” or “other”, then two design variables are necessary. Table 3.1 illustrates the coding of the design variables D1 and D2.

             Design Variable
    RACE      D1     D2
    White      0      0
    Black      1      0
    Other      0      1

Table 3.1. An example of the coding of the design variables for race coded at three levels.

(In general, if a nominal scaled variable has k possible values, then k-1 design variables are needed.)

3.3 PARAMETER ESTIMATION

The goal of logistic regression is to estimate the p+1 unknown parameters in equation (3.2.2.1). This is done with maximum likelihood estimation, which entails finding the set of parameters for which the probability of the observed data is greatest.

3.3.1 Maximum likelihood Estimation

The maximum likelihood estimating equation is derived from the probability distribution of the dependent variable. Since each $y_i$ represents a binomial count in the ith population, the joint density function of Y is:

\[ f(y \mid \beta) = \prod_{i=1}^{N} \frac{n_i!}{y_i!\,(n_i - y_i)!}\, \pi_i^{y_i} (1-\pi_i)^{n_i - y_i} \tag{3.3.1.1} \]

For each population, there are $\binom{n_i}{y_i}$ different ways to arrange $y_i$ successes among $n_i$ trials. Since the probability of a success for any one of the $n_i$ trials is $\pi_i$, the probability of $y_i$ successes is $\pi_i^{y_i}$.
Likewise, the probability of $n_i - y_i$ failures is $(1-\pi_i)^{n_i - y_i}$.

The joint probability function in equation (3.3.1.1) expresses the values of y as a function of known, fixed values for β. The likelihood function has the same form as the probability function, except that the roles of the arguments are reversed: the likelihood function expresses the values of β in terms of the known values for y. Thus,

\[ L(\beta \mid y) = \prod_{i=1}^{N} \frac{n_i!}{y_i!\,(n_i - y_i)!}\, \pi_i^{y_i} (1-\pi_i)^{n_i - y_i} \tag{3.3.1.2} \]

The maximum likelihood estimates are the values for β that maximise the likelihood function in equation (3.3.1.2). The critical points of a function (maxima and minima) occur where the first derivative equals 0. Attempting to take the derivative of equation (3.3.1.2) with respect to β is a difficult task due to the complexity of the multiplicative terms. However, the likelihood equation can be considerably simplified. We ignore the factorial terms since they do not contain $\pi_i$ and their exclusion leads to the same maximising values. After rearranging the remaining terms of equation (3.3.1.2) we obtain:

\[ L(\beta \mid y) \propto \prod_{i=1}^{N} \left(\frac{\pi_i}{1-\pi_i}\right)^{y_i} (1-\pi_i)^{n_i} \tag{3.3.1.3} \]

Exponentiating both sides of (3.2.2.1) gives

\[ \frac{\pi_i}{1-\pi_i} = e^{\sum_{k=0}^{p} x_{ik}\beta_k} \tag{3.3.1.4} \]

which after solving for $\pi_i$ becomes

\[ \pi_i = \frac{e^{\sum_{k=0}^{p} x_{ik}\beta_k}}{1 + e^{\sum_{k=0}^{p} x_{ik}\beta_k}} \tag{3.3.1.5} \]

Substituting equation (3.3.1.4) into the first factor and equation (3.3.1.5) into the second factor, equation (3.3.1.3) becomes:

\[ L(\beta \mid y) \propto \prod_{i=1}^{N} \left(e^{\sum_{k=0}^{p} x_{ik}\beta_k}\right)^{y_i} \left(1 - \frac{e^{\sum_{k=0}^{p} x_{ik}\beta_k}}{1 + e^{\sum_{k=0}^{p} x_{ik}\beta_k}}\right)^{n_i} \tag{3.3.1.6} \]

which can be written as:

\[ L(\beta \mid y) \propto \prod_{i=1}^{N} \left(e^{y_i \sum_{k=0}^{p} x_{ik}\beta_k}\right) \left(1 + e^{\sum_{k=0}^{p} x_{ik}\beta_k}\right)^{-n_i} \tag{3.3.1.7} \]

This is the kernel of the likelihood to maximise.
We simplify by taking its log, and equation (3.3.1.7) becomes:

\[ \lambda(\beta) = \sum_{i=1}^{N} \left[ y_i \left(\sum_{k=0}^{p} x_{ik}\beta_k\right) - n_i \log\left(1 + e^{\sum_{k=0}^{p} x_{ik}\beta_k}\right) \right] \tag{3.3.1.8} \]

We now find the critical points of the log-likelihood function by differentiating it with respect to each $\beta_k$ and obtain:

\[ \frac{\partial \lambda(\beta)}{\partial \beta_k} = \sum_{i=1}^{N} \left( y_i x_{ik} - n_i \pi_i x_{ik} \right) \tag{3.3.1.9} \]

The critical point will be a maximum if the matrix of second partial derivatives is negative definite; a necessary condition for this is that every element on the diagonal of the matrix is less than zero. The matrix is formed by differentiating each of the p+1 equations in (3.3.1.9) a second time with respect to each element of β, denoted here by $\beta_{k'}$. The general form of the matrix of second partial derivatives is

\[ \begin{aligned} \frac{\partial^2 \lambda(\beta)}{\partial \beta_k\,\partial \beta_{k'}} &= \frac{\partial}{\partial \beta_{k'}} \sum_{i=1}^{N} \left( y_i x_{ik} - n_i x_{ik}\pi_i \right) \\ &= \frac{\partial}{\partial \beta_{k'}} \sum_{i=1}^{N} -n_i x_{ik}\pi_i \\ &= -\sum_{i=1}^{N} n_i x_{ik}\, \frac{\partial}{\partial \beta_{k'}} \left( \frac{e^{\sum_{k=0}^{p} x_{ik}\beta_k}}{1 + e^{\sum_{k=0}^{p} x_{ik}\beta_k}} \right) \\ &= -\sum_{i=1}^{N} n_i x_{ik}\,\pi_i (1-\pi_i)\, x_{ik'} \end{aligned} \tag{3.3.1.10} \]

Thus the critical point will be a maximum: by (3.3.1.10) every diagonal element (k' = k) is a sum of non-positive terms, and the matrix as a whole is negative definite provided the design matrix X has full column rank.

3.3.2 The Newton-Raphson Method

Setting the equations in (3.3.1.9) equal to zero results in a system of p+1 nonlinear equations, each with p+1 unknown variables. The solution to the system is the vector $\hat{\beta}$. However, solving a system of nonlinear equations is not easy, since the solution cannot be derived algebraically as it can in the case of linear equations. The solution must be found using an iterative process. The most popular method for solving systems of nonlinear equations is Newton’s method, also known as the Newton-Raphson method.

It is more convenient to use matrix notation to express each step of the Newton-Raphson method. Let $\lambda'(\beta)$ denote the vector of first derivatives in (3.3.1.9) and $\lambda''(\beta)$ the matrix of second derivatives in (3.3.1.10), whose (k, k') element is $-\sum_{i=1}^{N} n_i x_{ik}\,\pi_i(1-\pi_i)\,x_{ik'}$.

Let $\beta^{(0)}$ represent the vector of initial approximations for each $\beta_k$; then the first step of Newton-Raphson can be expressed as:

\[ \beta^{(1)} = \beta^{(0)} + \left[-\lambda''(\beta^{(0)})\right]^{-1} \lambda'(\beta^{(0)}) \tag{3.3.2.1} \]

Let μ be a column vector of length N with elements $\mu_i = n_i\pi_i$. Each element of μ can be expressed as $\mu_i = E(y_i)$, the expected value of $y_i$. Using matrix multiplication, we can show that:

\[ \lambda'(\beta) = X'(y - \mu) \tag{3.3.2.2} \]

is a column vector of length p+1 whose elements are $\partial\lambda(\beta)/\partial\beta_k$, as derived in equation (3.3.1.9). Let W be a square matrix of order N, with elements $n_i\pi_i(1-\pi_i)$ on the diagonal and zeros everywhere else. Again using matrix multiplication, we can verify that

\[ \lambda''(\beta) = -X'WX \tag{3.3.2.3} \]

is a (p+1) × (p+1) square matrix with elements $\partial^2\lambda(\beta)/\partial\beta_k\,\partial\beta_{k'}$. Now equation (3.3.2.1) can be written as

\[ \beta^{(1)} = \beta^{(0)} + [X'WX]^{-1} X'(y - \mu) \tag{3.3.2.4} \]

We continue to apply equation (3.3.2.4) until there is essentially no change between the elements of β from one iteration to the next. At this point the maximum likelihood estimates are said to have converged, and the matrix $[X'WX]^{-1}$ in (3.3.2.4), evaluated at the converged estimates, holds the estimated variance-covariance matrix of the estimates.

3.4 ODDS AND ODDS RATIO

The odds of some event happening (e.g. the event Y = 1) is defined as the ratio of the probability that the event will occur to the probability that the event will not occur. That is, the odds of the event E is given by

\[ \mathrm{Odds}(E) = \frac{P(E)}{P(\text{not } E)} = \frac{P(E)}{1 - P(E)} \]

Example 3.1 (continued)
An estimate of the probability of having a myocardial infarction for women in the study using the pill is given by $P(E_{pill}) = 23/57 = 0.4035$. Hence the odds, amongst these women, of having a myocardial infarction when using the pill is given by

\[ \mathrm{Odds}(E_{pill}) = \frac{0.4035}{1 - 0.4035} = 0.6764. \]

That is, for women using the pill, the probability of having a myocardial infarction is around two-thirds of the probability of not having a myocardial infarction.
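The odds arithmetic of Example 3.1 can be checked directly from the counts in the table of Example 3.1. The sketch below is illustrative only; the helper `odds` is a hypothetical name, and it also evaluates the non-pill group and the odds ratio that the example goes on to discuss.

```python
def odds(events, non_events):
    """Odds of an event: P(E) / (1 - P(E)), which reduces to events / non_events."""
    return events / non_events

# Counts taken from the table in Example 3.1.
odds_pill = odds(23, 34)       # pill users: 23 infarctions, 34 without
odds_no_pill = odds(35, 132)   # non-users: 35 infarctions, 132 without
odds_ratio = odds_pill / odds_no_pill

print(round(odds_pill, 4), round(odds_no_pill, 4), round(odds_ratio, 4))
```

Working from the unrounded counts gives an odds ratio of about 2.551; dividing the two odds after rounding them to four decimals, as is done in the text, gives 2.5505.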
Similarly, for women who are not using the pill, an estimate of the probability of having a myocardial infarction is given by $P(E_{no\text{-}pill}) = 35/167 = 0.2096$. The odds of having a myocardial infarction when not using the pill is given by

\[ \mathrm{Odds}(E_{no\text{-}pill}) = \frac{0.2096}{1 - 0.2096} = 0.2652. \]

Thus the odds are around 1 to 4 that a woman in the study not using the pill will have a myocardial infarction.

The odds ratio $R_{A,B}$ that compares the odds of events $E_A$ and $E_B$ (that is, event E occurring in group A and in group B, respectively) is defined as the ratio between the two odds; that is,

\[ R_{A,B} = \frac{\mathrm{odds}(E_A)}{\mathrm{odds}(E_B)} = \frac{P(E_A)/(1 - P(E_A))}{P(E_B)/(1 - P(E_B))}. \]

Example 3.1 (continued)
The odds ratio comparing the odds of having a myocardial infarction for women using the pill with the odds of having a myocardial infarction for women not using the pill is given by

\[ R_{pill,\,no\text{-}pill} = \frac{\mathrm{odds}(E_{pill})}{\mathrm{odds}(E_{no\text{-}pill})} = 0.6764/0.2652 = 2.5505. \]

That is, the odds of having a myocardial infarction are 2.55 times as high for women using the pill as for women not using the pill.

In particular, if an odds ratio is equal to one, the odds are the same for the two groups.

3.5 INTERPRETATION OF COEFFICIENTS

The interpretation of any fitted model requires that we be able to draw practical inferences from the estimated coefficients in the model. The estimated coefficients must be able to answer the questions that motivated the study. Interpretation involves determining the functional relationship between the response variable and the predictor variable, and appropriately defining the unit of change for the predictor variable.

3.5.1 Dichotomous Predictor Variables

The link function is the logit transformation g(x) = ln{π(x)/[1 - π(x)]} = $\beta_0 + \beta_1 x$ for one predictor variable x.
We assume that x is coded either as 1 or 0. The log odds ratio (that is, the logarithm of the odds ratio) comparing the probability of success when the predictor variable has the value x = 1 with the probability of success when the predictor variable has the value x = 0 is given by

\[ \ln(\psi) = \ln\{\pi(1)/[1-\pi(1)]\} - \ln\{\pi(0)/[1-\pi(0)]\} \]

where

\[ \psi = \frac{\pi(1)/(1-\pi(1))}{\pi(0)/(1-\pi(0))} = \frac{e^{g(1)}}{e^{g(0)}}. \]

Now

\[ \ln(\psi) = g(1) - g(0) = \beta_0 + \beta_1 \cdot 1 - (\beta_0 + \beta_1 \cdot 0) = \beta_1 . \]

It follows that the odds ratio is given by $\psi = e^{\beta_1}$.

In general, the estimate of the log odds ratio for any predictor variable at two different levels, say x = a versus x = b, is given by

\[ \ln[\hat{\psi}(a,b)] = \hat{g}(x=a) - \hat{g}(x=b) = (\hat{\beta}_0 + \hat{\beta}_1 \times a) - (\hat{\beta}_0 + \hat{\beta}_1 \times b) = \hat{\beta}_1 \times (a-b) \tag{3.4.1.1} \]

and the estimated odds ratio is

\[ \hat{\psi}(a,b) = \exp[\hat{\beta}_1 \times (a-b)] \tag{3.4.1.2} \]

where

\[ \hat{\psi}(a,b) = \frac{\hat{\pi}(x=a)/(1-\hat{\pi}(x=a))}{\hat{\pi}(x=b)/(1-\hat{\pi}(x=b))} \]

is used to represent the odds ratio in equations (3.4.1.1) and (3.4.1.2).

The end points of the confidence interval for the odds ratio given in equation (3.4.1.2) are

\[ \exp\left[\hat{\beta}_1 (a-b) \pm z_{1-\alpha/2}\, |a-b| \times \widehat{SE}(\hat{\beta}_1)\right] \]

3.5.2 Polytomous Predictor Variables

In paragraph 3.2.2 we mentioned that if a nominal scaled variable has more than two levels, say k levels, we must model the variable using a collection of k-1 design variables, as illustrated in Table 3.1. With this method we choose one level of the variable to be the reference level, usually the 0 level, against which all other levels are compared. We fit the model using design variables to obtain coefficients equal in number to the number of design variables.
Fitting the model using Table 3.1 will give the following results with regard to the coefficients (here the category ‘white’ is used as the reference category):

    Variable     Estimated Coefficient
    Black        $\hat{\beta}_{11}$
    Other        $\hat{\beta}_{12}$

Table 3.2. An example showing the coefficients that will be obtained when fitting the model using the design variables in Table 3.1.

Comparing blacks with whites we obtain

\[ \ln[\hat{\psi}(black, white)] = \hat{g}(black) - \hat{g}(white) = \hat{\beta}_0 + \hat{\beta}_{11} \times (D_1 = 1) + \hat{\beta}_{12} \times (D_2 = 0) - \left(\hat{\beta}_0 + \hat{\beta}_{11} \times (D_1 = 0) + \hat{\beta}_{12} \times (D_2 = 0)\right) = \hat{\beta}_{11} \]

Similarly, comparing others with whites we obtain:

\[ \ln[\hat{\psi}(other, white)] = \hat{\beta}_{12} \]

Thus the odds ratio of any level relative to the reference level is the exponential of the coefficient of that level. If the comparison is not with the reference level, the odds ratio is the exponential of the difference between the coefficients in question.

The limits of a 100(1-α) percent CI for the coefficient are

\[ \hat{\beta}_{ij} \pm z_{1-\alpha/2} \times \widehat{SE}(\hat{\beta}_{ij}) \]

and the corresponding limits for the odds ratio are

\[ \exp\left[\hat{\beta}_{ij} \pm z_{1-\alpha/2} \times \widehat{SE}(\hat{\beta}_{ij})\right]. \]

3.5.3 One Continuous Predictor Variable

We assume that the logit is linear in the continuous predictor x, so that the equation of the logit is $g(x) = \beta_0 + \beta_1 x$. The log odds ratio for a change of c units in x is obtained from the logit difference $g(x+c) - g(x) = c\beta_1$, and the associated odds ratio is obtained by exponentiating this logit difference, $\psi(c) = \psi(x+c, x) = \exp(c\beta_1)$. An estimate may be obtained by replacing $\beta_1$ with its maximum likelihood estimate $\hat{\beta}_1$. The end points of the 100(1-α) percent CI estimate of $\psi(c)$ are

\[ \exp\left[c\hat{\beta}_1 \pm z_{1-\alpha/2}\, c\, \widehat{SE}(\hat{\beta}_1)\right] \]

3.5.4 Multivariable Case

We now face the situation in which the model contains two predictor variables, one dichotomous, say $x_1$, coded 0 and 1, and one continuous, $x_2$, with primary interest focused on the effect of the dichotomous variable. The equation of the logit will then be $g(x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$. If $x_1$ changes from 0 to 1 with $x_2 = a$ held constant, then the log odds ratio is:

\[ \ln(\psi) = g(x_1 = 1, x_2 = a) - g(x_1 = 0, x_2 = a) = \beta_0 + \beta_1 \cdot 1 + \beta_2 \cdot a - (\beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot a) = \beta_1 \]

and the odds ratio is $\psi = e^{\beta_1}$. Similarly, holding $x_1$ constant when $x_2$ changes from x to x + c, the odds ratio is $\psi = e^{c\beta_2}$. Confidence intervals are calculated as before.

3.5.5 One Dichotomous and one Continuous and their Interaction

If the primary interest is focused on the effect of the dichotomous variable $x_1$, coded 0 and 1, and $x_2$ is the continuous covariate, then the equation of the logit with interaction is $g(x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$. If $x_1$ changes from 0 to 1 and $x_2 = a$, the log odds ratio is

\[ \ln(\psi) = g(x_1 = 1, x_2 = a) - g(x_1 = 0, x_2 = a) = \beta_0 + \beta_1 \cdot 1 + \beta_2 \cdot a + \beta_3 \cdot a - (\beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot a + \beta_3 \cdot 0 \cdot a) = \beta_1 + \beta_3 \cdot a \]

The odds ratio is thus $\psi = e^{\beta_1 + a\beta_3}$, which depends on the value a of the covariate and not only on the variable of interest. The 100(1-α) percent CI for the odds ratio is

\[ \exp\left[\hat{\beta}_1 + \hat{\beta}_3 \cdot a \pm z_{1-\alpha/2}\, \widehat{SE}(\hat{\beta}_1 + \hat{\beta}_3 \cdot a)\right] \]

where

\[ \widehat{SE}(\hat{\beta}_1 + \hat{\beta}_3 \cdot a) = \sqrt{\widehat{\mathrm{var}}(\hat{\beta}_1) + a^2\, \widehat{\mathrm{var}}(\hat{\beta}_3) + 2a\, \widehat{\mathrm{Cov}}(\hat{\beta}_1, \hat{\beta}_3)} \]

3.6 TESTING FOR THE SIGNIFICANCE OF THE MODEL

3.6.1 The Likelihood Ratio Test

After fitting a particular multiple logistic regression model, we assess the model. We begin by assessing the significance of the p regression coefficients in the model. A likelihood ratio test for the overall significance of the p coefficients for the predictor variables in the model is performed. This test is based on the statistic

\[ G = 2[L_p(\hat{\beta}) - L_p(0)] \]

where $L_p(\hat{\beta})$ is the log-likelihood of the fitted model and $L_p(0)$ is the log-likelihood when the p coefficients for the predictors are set to zero. Under the null hypothesis that the coefficients for the predictors in the model are all equal to zero, the distribution of G will be chi-square with p degrees of freedom. The exceedance probability value (P-value) for the test is $P = \Pr[\chi^2(p) > G]$. Rejection of the null hypothesis leads to the conclusion that at least one, and perhaps all, of the p coefficients are significantly different from zero.
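The likelihood ratio statistic G can be illustrated on the grouped data of Example 3.1, with the model fitted by the Newton-Raphson update (3.3.2.4). The following is a minimal sketch under stated assumptions, not a production implementation: the function names are hypothetical, the fit starts from β = 0, the log-likelihood is the kernel (3.3.1.8), and the P-value uses the closed form for one degree of freedom, $\Pr[\chi^2(1) > G] = \mathrm{erfc}(\sqrt{G/2})$.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for j in range(c, m + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * m
    for r in reversed(range(m)):
        x[r] = (M[r][m] - sum(M[r][j] * x[j] for j in range(r + 1, m))) / M[r][r]
    return x

def linear_predictor(X, beta):
    """Row-wise sum_k x_ik * beta_k."""
    return [sum(xk * bk for xk, bk in zip(row, beta)) for row in X]

def fit_logistic(X, y, n, tol=1e-10, max_iter=50):
    """Newton-Raphson for grouped binomial data: beta += (X'WX)^-1 X'(y - mu)."""
    P = len(X[0])
    beta = [0.0] * P
    for _ in range(max_iter):
        pi = [1.0 / (1.0 + math.exp(-e)) for e in linear_predictor(X, beta)]
        # Score vector X'(y - mu), with mu_i = n_i * pi_i.
        score = [sum(row[k] * (yi - ni * p) for row, yi, ni, p in zip(X, y, n, pi))
                 for k in range(P)]
        # Information matrix X'WX, with W = diag(n_i * pi_i * (1 - pi_i)).
        info = [[sum(row[a] * ni * p * (1.0 - p) * row[c] for row, ni, p in zip(X, n, pi))
                 for c in range(P)] for a in range(P)]
        step = solve(info, score)
        beta = [bk + sk for bk, sk in zip(beta, step)]
        if max(abs(sk) for sk in step) < tol:
            break
    return beta

def log_likelihood(X, y, n, beta):
    """Kernel of the log-likelihood, equation (3.3.1.8)."""
    eta = linear_predictor(X, beta)
    return sum(yi * e - ni * math.log1p(math.exp(e)) for yi, ni, e in zip(y, n, eta))

# Example 3.1: row 0 = pill users, row 1 = non-users.
y = [23, 35]                        # women with a myocardial infarction
n = [57, 167]                       # women in each group
X_full = [[1.0, 1.0], [1.0, 0.0]]   # intercept and pill indicator
X_null = [[1.0], [1.0]]             # intercept only

beta_full = fit_logistic(X_full, y, n)
beta_null = fit_logistic(X_null, y, n)
G = 2.0 * (log_likelihood(X_full, y, n, beta_full)
           - log_likelihood(X_null, y, n, beta_null))
p_value = math.erfc(math.sqrt(G / 2.0))   # valid for 1 degree of freedom only
print(beta_full, G, p_value)
```

Because the fitted model has one parameter per group, the exponential of the coefficient of the pill indicator reproduces the odds ratio computed from the counts of Example 3.1, which gives an independent check on the fit.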
3.6.2 Wald Test Statistics

Before we conclude that all of the coefficients are nonzero, we may wish to look at the univariate Wald test statistics:

\[ W_j = \frac{\hat{\beta}_j}{\widehat{SE}(\hat{\beta}_j)}. \]

This test is commonly used to test the significance of the individual logistic regression coefficients for each predictor variable (that is, to test the null hypothesis in logistic regression that a particular logit (effect) coefficient is zero). It is the ratio of the logit coefficient to its standard error, and under the said null hypothesis its distribution is approximated by the standard normal distribution.

3.6.3 Using Deviances to Compare Likelihoods

Suppose that model one has t parameters while model two is a subset of model one with only r of the t parameters, so that r < t. Model one will have a log-likelihood at least as large as that of model two. For large sample sizes, the difference between these two log-likelihoods, when multiplied by two, will behave like the chi-square distribution with t-r degrees of freedom. This fact can be used to test the null hypothesis that the t-r parameters that are not in model two are zero. The difference, denoted by D, is calculated using results from statistical packages as follows:

\[ D = -2[\log L(\text{model 2}) - \log L(\text{model 1})] = -2\log L(\text{model 2}) - (-2\log L(\text{model 1})), \]

and $D \sim \chi^2(t-r)$ when the sample size is large.

3.7 INTERACTION AND CONFOUNDING

The term confounding is used by epidemiologists to describe a covariate that is associated with both the outcome variable of interest and a primary predictor variable or risk factor. When both associations are present, the relationship between the risk factor and the outcome variable is said to be confounded.

Consider a model containing a dichotomous risk factor variable and a continuous covariate. If the association between the covariate and the outcome variable is the same within each level of the risk factor, there is no interaction between the covariate and the risk factor.
Graphically, the absence of interaction yields a model with two parallel lines of outcome variable on covariate, one for each level of the risk factor variable. In general, the absence of interaction is characterised by a model that contains no product terms involving two or more variables.

When interaction is present, the association between the risk factor and the outcome variable differs, or depends in some way on the level of the covariate. That is, the covariate modifies the effect of the risk factor. The term ‘effect modifier’ is used by epidemiologists to describe a variable that interacts with a risk factor.

Determining whether a covariate is an effect modifier and/or a confounder involves several issues. Determining effect modification status involves the parametric structure of the logit, while determination of confounder status involves two things. First, the covariate must be associated with the outcome variable. This implies the logit must have a nonzero slope in the covariate. Second, the covariate must be associated with the risk factor.

In practice, the confounder status of a covariate is ascertained by comparing the estimated coefficient for the risk factor variable from models containing and not containing the covariate. Any “biologically important” change in the estimated coefficient for the risk factor would dictate that the covariate is a confounder and should be included in the model, regardless of the statistical significance of the estimated coefficient for the covariate. On the other hand, we believe that a covariate is an effect modifier only when the interaction term added to the model is biologically meaningful and statistically significant. When a covariate is an effect modifier, its status as a confounder is of secondary importance, and the estimate of the effect of the risk factor depends on the specific value of the covariate.
3.8 VARIABLE SELECTION FOR LOGISTIC REGRESSION

According to Hosmer and Lemeshow (1989), in logistic regression the errors are assumed to follow a binomial distribution and the significance of a variable is assessed via the likelihood ratio chi-square. At any step in the procedure the most important variable in statistical terms will be the one that produces the greatest change in the log-likelihood relative to the model not containing the variable.

3.8.1 Purposeful Selection of Variables

3.8.1.1 Screening of Variables

This method is similar to the one discussed in section (2.2.1) under the proportional hazards regression model. This method is also analyst driven. Hosmer and Lemeshow (1989) suggest that the selection process should begin with a careful univariate analysis of each variable. For nominal, ordinal, and continuous predictor variables with few integer values, it is suggested that this be done with a contingency table of outcome (y = 0, 1) versus the k levels of the predictor variable. The likelihood ratio chi-square test with k-1 degrees of freedom is exactly equal to the value of the likelihood ratio test for the significance of the coefficients for the k-1 design variables in a univariate logistic regression model that contains that single predictor variable.

Particular attention should be paid to any contingency table with a zero cell. Strategies for handling zero cells include: collapsing the categories of the predictor variable in some sensible way to eliminate the zero cells; eliminating the categories completely; or, if the variable is ordinally scaled, modelling the variable as if it is continuous.
For continuous predictor variables the most desirable univariate analysis involves fitting a univariate logistic regression with each predictor to obtain the estimated coefficient, the estimated standard error, the likelihood ratio test for the significance of the coefficient, and the univariate Wald statistic.

The completion of the univariate analyses is followed by the selection of variables for the multivariable analysis. Any variable whose univariate test has a P-value < 0.25 should be considered as a candidate for the multivariable model, along with all other variables of known biological importance.

The univariate approach has the disadvantage of excluding predictor variables which can collectively be important predictors of outcome whilst individually being weakly linked with the outcome. This problem can be overcome by choosing a significance level large enough to allow the suspect variables to be included.

After fitting the multivariable model, the importance of each variable included in the model should be verified. This should include (a) an examination of the Wald statistic for each variable and (b) a comparison of each estimated regression coefficient with the coefficient from the univariate model containing only that specific variable. Variables that do not contribute to the model based on these criteria should be eliminated and a new model should be fitted. Comparison of models is done through the likelihood ratio test. Also, the estimated coefficients for any remaining variables should be compared to those of the full model. A marked change in magnitude would imply that one or more of the excluded variables were important in the sense of providing a necessary adjustment of the effect of the variables that remained in the model. This process is repeated until it appears that all of the important variables are included in the model and those excluded are either biologically or statistically unimportant.
3.8.1.2 Scale of Continuous Predictors

For continuous scaled predictor variables we must check the assumption of linearity in the logit. Since the concept of scale selection is the same for multivariable models, we describe this approach using the univariable model.

One method to ascertain linearity is to plot the fitted line on the scatter-plot of the logit versus the predictor variable and look for any obvious systematic deviations from the line. A modification of this approach is to break the range of the predictor variable into groups and, for each group, to plot the average value of the outcome versus the group midpoint. In logistic regression this approach requires that we transform the vertical axis to the logit. Thus we would plot, for each group, the logit of the group mean versus the midpoint of the group. The plot is examined with respect to the shape of the resulting “curve”.

An alternative approach to scale identification in logistic regression is to adapt the Box-Tidwell transformation used in linear regression. According to Hosmer and Lemeshow (1989), the use of this transformation in logistic regression has been examined by Guerrero and Johnson (1982). This approach adds a term of the form x ln(x) to the model. If the coefficient for this variable is significant, we have evidence for non-linearity in the logit. This procedure, however, has low power in detecting small departures from linearity.

3.8.1.3 Inclusion of Interactions

Once continuous variables are on the correct scale, we begin to check for interactions in the model. An interaction between two variables implies that the effect of one of the variables is not constant over the levels of the other. For example, an interaction between sex and age would imply that the regression coefficient for age is different for males and females.
The need to include interaction terms in a model is assessed by first creating the appropriate product of the variables in question and then using a likelihood ratio test to assess their significance, that is, their contribution to the model (see paragraph 3.5.3). In general, for an interaction term to alter both the point and interval estimates, the estimated coefficient must attain at least a moderate level of statistical significance. The final decision as to whether an interaction term should be included in a model should be based on statistical as well as practical considerations.

3.8.2 Stepwise Forward Selection

This procedure starts by fitting only the intercept term. Then, for each of the possible predictor variables, a univariate logistic regression containing the intercept and that predictor (say $x_j$) is fitted. The log-likelihood of the intercept-only model ($L_0$) is compared with the log-likelihood of each of the univariate models ($L_j$) by means of the likelihood ratio test statistic $G_j = 2(L_j - L_0)$. Its P-value is determined by $P = \Pr(\chi^2(\nu) > G_j)$, where $\nu = 1$ if $x_j$ is continuous and $\nu = k-1$ if $x_j$ has $k$ categories. The most important predictor variable is the one with the minimum P-value and this variable, denoted by $x_e$, is entered into the model. The subscript "e" indicates that the variable is a candidate for entry.

The choice of the significance ("alpha") level used to judge the importance of variables is a crucial aspect. Let $\alpha_E$ denote our choice, where the "E" stands for entry; this choice of $\alpha_E$ will determine how many variables will eventually be included in the model. Choosing a value for $\alpha_E$ in the range 0.15 to 0.2 is highly recommended. Moreover, using $\alpha_E$ in this range will provide assurance that the procedure selects variables whose coefficients are different from zero (Hosmer and Lemeshow (1989)).
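The entry test described above can be sketched in a few lines of Python. This is only a minimal illustration: the function names, the log-likelihood values and the choice of 0.15 for the entry level are assumptions made for the example, and the chi-square tail probability is computed from its closed form for integer degrees of freedom rather than taken from a statistical package.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability Pr(chi2(df) > x) for integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum_{k=0}^{df/2-1} y^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= y / k
            total += term
        return math.exp(-y) * total
    # Odd df: start from the df = 1 case and add one term per extra 2 df.
    total = math.erfc(math.sqrt(y))
    term = math.sqrt(y) * math.exp(-y) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= y / (k + 0.5)
    return total

def lr_entry_test(loglik_0, loglik_j, df):
    """Return (G_j, P-value) for G_j = 2 (L_j - L_0)."""
    g = 2.0 * (loglik_j - loglik_0)
    return g, chi2_sf(g, df)

# Hypothetical log-likelihoods, for illustration only.
ALPHA_E = 0.15
g_j, p_j = lr_entry_test(loglik_0=-100.0, loglik_j=-97.5, df=1)
enter = p_j < ALPHA_E   # candidate enters if its P-value is below alpha_E
```

In forward selection this test would be repeated for every candidate variable, and the candidate with the smallest P-value would be the one considered for entry.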
After the variable $x_e$ has been entered, the next step is to determine whether any of the remaining p-1 variables are important once $x_e$ is in the model. This is done by fitting the p-1 logistic regression models containing $x_e$ and $x_j$, j = 1, 2, 3, ..., p and j ≠ e. The log-likelihoods of these models are compared with that of the model containing the intercept and $x_e$. The variable with the smallest P-value at this step is entered, and the algorithm continues provided that this P-value < $\alpha_E$; otherwise it stops.

3.8.3 Stepwise Backward Selection

The process starts with a full model containing all variables. In the first step the log-likelihood of the model containing all variables ($L_f$) is compared with that of each model containing p-1 variables, that is, with $x_j$ removed (denoted by $L_{-j}$), by using the likelihood ratio test statistic $G_{-j} = 2(L_f - L_{-j})$. To ascertain which variable should be deleted from the model, we select the variable which, when removed, gives the maximum P-value. We denote the minimal level of continued contribution to the model by $\alpha_R$, where "R" stands for remove. The value we choose for $\alpha_R$ must exceed the value for $\alpha_E$, to avoid the possibility of having to enter and remove the same variable at successive steps. In the next step the log-likelihood of the model excluding the variable removed at the previous step is compared with those of all the models with one of the remaining variables removed. If the largest P-value exceeds $\alpha_R$, the corresponding variable is removed. Generally the choice of $\alpha_R$ is 0.2 or 0.25. However, important variables can be forced to remain in the model. The algorithm stops when all the variables remaining in the model have P-values that are less than $\alpha_R$.

3.8.4 Stepwise Selection (Forward and backward)

This is a combination of the forward and backward selection procedures discussed above. It is based on a statistical algorithm that allows moves in either direction, dropping or adding variables at various steps based on the 'importance' of variables.
The 'importance' of a variable refers to the statistical significance of its coefficient. Since in logistic regression the errors are assumed to follow a binomial distribution, significance is assessed via the likelihood ratio chi-square test. Thus at any step in the procedure the most important variable will be the one that results in the largest likelihood ratio statistic, G. Since the magnitude of G depends on its degrees of freedom, any procedure based on the likelihood ratio test statistic G must account for possible differences in the degrees of freedom of variables. This is achieved by assessing significance through the P-value for G.

3.8.5 Best Subset Selection

This is an alternative to stepwise selection. This model building approach has long been available in linear regression. Typical software implementing this method for linear regression will identify a specified number of 'best' models containing one, two, three variables, and so on, up to the single model containing all p variables. According to Hosmer and Lemeshow (1989), we may use any best subsets linear regression program to execute the computations for best subsets logistic regression. The subsets of variables selected for the 'best' models depend on the criterion for 'best'. In logistic regression the Score and the $C_p$ criteria are preferred. A model with a high score value will be preferred to a model with a smaller score value, whereas a model with a small $C_p$ value or with $C_p \approx r$ will be preferred, where r is the number of predictor variables in the model. It is important to note that variables suggested by the best subset strategy should not be accepted without considerable critical evaluation.

Although several selection procedures were discussed in Chapter 2, only a few of them have been discussed in this chapter and the others have been left out.
The reason is that such procedures do not apply to logistic regression.

3.8.6 General

From the information in this chapter, it is clear that selection methods for binary outcome variables are lacking. For this reason, we will briefly evaluate a new method, based on the ROC curve, in Chapter 5. We will first discuss the concept of an ROC curve in Chapter 4.

CHAPTER 4

THE RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE

4.1 BACKGROUND

We discuss ROC curves in a separate chapter because we will endeavour (in Chapter 5) to utilise these curves as an additional model (or variable) selection method. Specifically, the area under the curve (AUC) will be evaluated as a selection criterion. The AUC is discussed in section 4.5.

Researchers and analysts allocate a great deal of effort to the development of prediction models to support decision making. However, too often insufficient attention is given to the tools used to evaluate the models in question. The issue is that accurate prediction models may be measured inappropriately, given the information available regarding the classification error rate and the context of application. In the end, poor decisions are made because the wrong models are selected using an inappropriate evaluation method.

In the context of consumer risk prediction, understanding how to evaluate models which predict potential customers to be 'good' or 'bad' credit risks is critical to Customer Relationship Management (CRM). Since the dependent variable of concern is categorical, the issue is one of binary classification. For a binary classification problem (i.e. prediction of 'good' versus 'bad'), logit analysis utilises a linear combination of the predictor variables and transforms the result to lie between 0 and 1, so that it can be equated to a probability. One method of evaluation, which enables a comprehensive analysis of all possible error severities, is the Receiver Operating Characteristic (ROC) curve.
According to Morrison & Michelle (2005), ROC curves were developed in the field of statistical decision theory, and were later used in the field of signal detection during World War II. ROC curves enabled radar operators to distinguish between an enemy target, a friendly ship, or noise. They further point out that ROC curves assess the value of diagnostic tests by providing a standard measure of the ability of the test to correctly classify subjects. Mention is made of Metz (1978), who states that the biomedical field uses ROC curves extensively to assess the efficacy of diagnostic tests in discriminating between healthy and diseased individuals. ROC curves have since been used in fields ranging from electrical engineering and weather prediction to psychology, and are used almost everywhere in the literature on medical testing to determine the effectiveness of medications (Nargundkar and Priestly (2003)).

4.2 DEFINITION OF AN ROC CURVE

Consider diagnostic tests with dichotomous outcomes, with positive outcomes suggesting the presence of disease. For dichotomous tests, there are two potential types of error. A false-positive error happens when a non-diseased individual has a positive test result. On the other hand, a false-negative error happens when a diseased individual has a negative test result. The rates of occurrence of these errors, termed the false-positive and false-negative rates, together constitute the operating characteristics of the dichotomous diagnostic test.

These notions can be generalised to non-binary tests in the following way. Let D be a binary (0/1) indicator of disease status, with D = 1 for diseased subjects. Let Y denote the test result, with the convention that larger values of Y are more indicative of disease, so that the test is regarded as positive when Y ≥ C for some threshold value C. Now 1 minus the false-negative rate (the true positive rate) and 1 minus the true negative rate (the false-positive rate) associated with this decision criterion can be written as Pr(Y ≥ C | D = 1) and Pr(Y ≥ C | D = 0), respectively.
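The two rates just defined can be estimated from a sample by simple counting. The sketch below is a minimal illustration under stated assumptions: the function names and the six test results are hypothetical, and every distinct observed value of Y is used in turn as the threshold C.

```python
def rates_at_threshold(y, d, c):
    """Return (TPR, FPR) when the test is called positive if Y >= c."""
    diseased = [yi for yi, di in zip(y, d) if di == 1]
    healthy = [yi for yi, di in zip(y, d) if di == 0]
    tpr = sum(yi >= c for yi in diseased) / len(diseased)  # Pr(Y >= C | D = 1)
    fpr = sum(yi >= c for yi in healthy) / len(healthy)    # Pr(Y >= C | D = 0)
    return tpr, fpr

def empirical_roc(y, d):
    """List of (FPR, TPR) points as the threshold moves from high to low."""
    points = [(0.0, 0.0)]  # threshold above every observed value
    for c in sorted(set(y), reverse=True):
        tpr, fpr = rates_at_threshold(y, d, c)
        points.append((fpr, tpr))
    return points

# Hypothetical test results (y) and disease indicators (d).
y = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
d = [1, 1, 0, 1, 0, 0]
roc_points = empirical_roc(y, d)
```

Lowering the threshold can only add positive calls, so both coordinates are non-decreasing along the list of points.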
An ROC curve is a plot of the true positive rate versus 1 minus the true negative rate across all possible threshold values, C. When Y is continuous, a clear and brief way of writing the ROC curve is

$ROC(t) = F_D\{F_{\bar{D}}^{-1}(t)\}, \quad t \in (0,1),$

where $F_D$ and $F_{\bar{D}}$ are the survivor functions of Y in the diseased and non-diseased populations, respectively, and where t is the false positive rate, which varies from 0 to 1 as the corresponding implicit threshold value, C, varies from ∞ to -∞. When Y is discrete the ROC curve can also be written in the form $F_D\{F_{\bar{D}}^{-1}(t)\}$, but the domain of ROC(t) is restricted to the range of $F_{\bar{D}}(\cdot)$, that is, the set of all possible false positive rates associated with the test. By definition, the ROC curve is a monotone increasing function from (0,0) to (1,1).

4.3 DIAGNOSTIC TEST INTERPRETATION

The basic idea of diagnostic test interpretation is to calculate, for example, the probability that a patient has the disease under consideration given a certain test result. A 2 by 2 table is employed in this regard (see Table 4.1).

4.3.1 2 X 2 Table or Contingency Matrix

                  Disease Present         Disease Absent
  Test Positive   True Positives (TP)     False Positives (FP)     Total Positive
  Test Negative   False Negatives (FN)    True Negatives (TN)      Total Negative
                  Total with Disease      Total without Disease    Grand Total

Table 4.1 An example of a Contingency Table

4.3.2 Basic Concepts

In this discussion we refer back to Table 4.1.

4.3.2.1 Sensitivity

Sensitivity is the proportion of patients with disease whose tests are positive.

P(T+|D+) = TP/(TP+FN)

High sensitivity is important when:
• The disease is serious and should not be missed.
• The disease is treatable.
• FP results do not lead to serious physical, psychological or economic trauma to the patient.

4.3.2.2 Specificity

Specificity is the proportion of patients without disease whose tests are negative.

P(T-|D-) = TN/(TN+FP)

High specificity is needed when:
• The disease is serious.
• The disease is not treatable or curable.
• FP results can lead to serious physical, psychological or economic trauma to the patient.

4.3.2.3 Pre-test Probability

Pre-test probability is the prevalence of the disease in the population. It is also called efficiency of the test.

P(D+) = (TP+FN)/(TP+FP+TN+FN)

Higher efficiency is needed when:
• The disease is serious.
• The disease is curable.
• FP and FN are essentially equally serious damages.

4.3.2.4 Predictive Value of a Positive Test

The predictive value of a positive test is the proportion of patients with positive tests who do have the disease.

P(D+|T+) = TP/(TP+FP)

This value:
• Is the same thing as the post-test probability of disease given a positive test.
• Measures how well the test rules in disease.

4.3.2.5 Predictive Value of a Negative Test

The predictive value of a negative test is the proportion of patients with negative tests who do not have the disease.

P(D-|T-) = TN/(TN+FN)

This value measures how well the test rules out the disease.

4.4 ROC REGRESSION MODEL

Let the false positive rate be denoted by t and let $\tau$ denote the set of possible values for t, namely the range of $F_{\bar{D}}$, which is a subset of [0, 1]. Let Z denote some factors which potentially influence test accuracy and let X be a corresponding vector of covariates. For example, if Z is a categorical variable, X might be the associated vector of dummy variables. The covariate vector X is a function of the factors Z. We write the ROC curve associated with Z as $ROC_z(t)$ and model it as

$ROC_z(t) = g\{\alpha_0(t), \beta X\}, \quad t \in \tau_z,$

where $\alpha_0(t)$ is a univariate baseline function of t, $\beta X$ is a linear predictor which characterises the effect of the covariates X on the ROC curve, g is a known function and $\tau_z$ denotes the domain of the ROC function associated with Z. In general the covariate vector X may include interactions between factors in Z and t, in which case we write the covariate vector as X(t).
Since the ROC curve is a monotone increasing function by definition, g and $\alpha_0$ must be chosen such that monotonicity in $ROC_z$ is ensured.

4.5 AREA UNDER THE ROC CURVE (AUC)

4.5.1 Interpretation of the Area

The area under the ROC curve is commonly used as a summary measure of diagnostic accuracy. It takes values from 0.5 to 1.0. The AUC statistic can be interpreted as the probability that the test result from a randomly chosen diseased individual is more indicative of disease than that from a randomly chosen non-diseased individual, or as a measure of a model's ability to discriminate between those who experience the outcome of interest and those who do not:

$AUC = P(X_i \ge X_j \mid D_i = 1, D_j = 0).$

An ROC curve summarises the possible set of 2 X 2 matrices that result when the cut-off value is varied continuously from its highest possible value down to its smallest possible value. An area of 1 represents perfect discrimination. The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the discrimination. On the other hand, an area of 0.5 represents worthless discrimination. The closer the curve comes to the 45 degree diagonal of the ROC space, the less accurate the test. An area of
• 0.90 - 1.00 = excellent discrimination
• 0.80 - 0.90 = good discrimination
• 0.70 - 0.80 = fair discrimination
• 0.60 - 0.70 = poor discrimination
• 0.50 - 0.60 = fail, i.e. no discrimination

However, in practice it is extremely unusual to observe areas under the curve greater than 0.9.

4.5.2 Comparison of Tests

When results from multiple tests have been obtained, the ROC plots can be graphed together. The relative positions of the plots indicate the relative accuracies of the tests. A plot lying above and to the left of another plot indicates greater observed accuracy. If the curves for two tests cross, a meaningful difference between the tests over the range of interest might not be picked up by the AUCs.
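The probabilistic interpretation of the AUC given in section 4.5.1 can be checked numerically by comparing every diseased subject with every non-diseased subject. The following is a minimal sketch; the function name and the data are hypothetical, and counting a tied pair as one half is a common convention assumed here rather than something stated in the text.

```python
def auc_by_pairs(y, d):
    """Estimate the AUC as the proportion of (diseased, non-diseased) pairs
    in which the diseased subject has the larger test result.
    Tied pairs are counted as one half (an assumed convention)."""
    diseased = [yi for yi, di in zip(y, d) if di == 1]
    healthy = [yi for yi, di in zip(y, d) if di == 0]
    wins = 0
    ties = 0
    for yd in diseased:
        for yh in healthy:
            if yd > yh:
                wins += 1
            elif yd == yh:
                ties += 1
    return (wins + 0.5 * ties) / (len(diseased) * len(healthy))

# Hypothetical test results (y) and disease indicators (d).
y = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
d = [1, 1, 0, 1, 0, 0]
area = auc_by_pairs(y, d)   # 8 of the 9 pairs are correctly ordered
```

With these six subjects, only the pair (0.6 diseased, 0.7 non-diseased) is ordered the wrong way, so the estimated area is 8/9.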
If we have two curves of similar area and we wish to decide whether the two curves differ significantly, we can use bivariate statistical analysis. Where we have different areas derived from two tests applied to different sets of cases, it is appropriate to calculate the standard error of the difference between the two areas, thus:

$SE(A_1 - A_2) = \sqrt{SE_{A_1}^2 + SE_{A_2}^2}$

This approach is not appropriate where the two tests are applied to the same set of patients. Hanley and McNeil (1982) show that in these circumstances the correct formula is:

$SE(A_1 - A_2) = \sqrt{SE_{A_1}^2 + SE_{A_2}^2 - 2r\,SE_{A_1} SE_{A_2}}$

where r is the quantity that represents the correlation induced between the two areas by the study of the same set of cases. Once we have the standard error of the difference in areas, we can then calculate the statistic:

$Z = (A_1 - A_2) / SE(A_1 - A_2)$

If the absolute value of Z is above a critical level, then we accept that the two areas are different. Commonly this critical value is set at 1.96, and we then have a 0.05 chance of making a type I error in rejecting the hypothesis that the two curves are similar.

Assuming we have two tests $T_1$ and $T_2$ that classify our cases as either normal (n) or abnormal (a), and we have already calculated the AUC for each test, r is calculated as follows:
1. Look at (n), the non-diseased patients. We find how the two tests correlate for these patients and obtain a value $r_n$ for this correlation.
2. Similarly we derive $r_a$, the correlation between the two tests for the abnormal patients.
3. Average $r_n$ and $r_a$.
4. Average the areas $A_1$ and $A_2$ by calculating $(A_1 + A_2)/2$.
5. Look up the value of r in Hanley and McNeil's Table I (Hanley and McNeil (1982)), given the average of $r_n$ and $r_a$ and the average of the areas.

4.5.3 Advantages and Disadvantages of ROC

The ROC plot is simple, graphical and easily appreciated visually. It is a comprehensive representation of pure accuracy, i.e. discriminating ability, over the entire range of a test.
It provides a direct visual comparison between tests on a common scale and it requires no grouping or binning of data. With appropriate software, ROC plotting is quite readily done.

On the other hand, actual decision thresholds are usually not displayed in the plot. The number of subjects is also not shown on the display and, as the sample size decreases, the ROC plot tends to become increasingly jagged and bumpy. However, even with a large number of subjects, the plot may be bumpy.

CHAPTER 5

MODEL BUILDING USING REAL DATA

In this chapter we look at the application of the procedures and methods outlined in Chapters 3 and 4 with regard to the selection of variables. Some of the criteria discussed in Chapter 2, such as the Akaike Information Criterion, may come into play, since they are also applicable to logistic regression and, needless to say, to Cox regression as well.

The data set to be used was developed for a study of factors associated with the success of first year students at the Tshwane University of Technology (TUT) from the year 1999 to 2002. Information on 18047 students was obtained. Table 5.1 describes the response, the predictor variables and their codes.
  Variable    Description and code
  Pass        pass=1, fail=0
  Campuss     main campus=1, satellite campus=2
  Genderr     female=1, male=2
  Agregate    aggregate mark for all subjects in matric exam for an individual student
  Maritall    marital status (single=1, married=2)
  Finaidd     financial aid (aided=1, not aided=2)
  Age         student age at first registration
  English     performance in English in matric exam (good=1, not good=2)
  Race        (white=1, coloured=2, Asian=3 and black=4)
  Faculty     (Engineering=1, Commerce=2, Social Science=3, Arts=4, Natural Science=5, Agricultural Science=6 and Health=7)

Table 5.1 Code Sheet of the Variables used in the Data set for the Study of Factors Associated with Success of First Year Students at TUT from 1999 to 2002

5.1 PURPOSEFUL SELECTION OF VARIABLES

We begin with a univariable description of all predictors; the categorical and continuous variables are shown in Tables 14 and 15 of the appendix respectively. The univariable analysis does not reveal any variable for which there are illegal values. All binary variables are coded as 1; 2. Race and Faculty are the only non-binary categorical variables. We create indicator variables for the Faculty variable as shown in Table 5.2:

  Faculty  Label        faculty_2  faculty_3  faculty_4  faculty_5  faculty_6  faculty_7
  1        Engineering  0          0          0          0          0          0
  2        Commerce     1          0          0          0          0          0
  3        Social Sci   0          1          0          0          0          0
  4        Arts         0          0          1          0          0          0
  5        Natural Sci  0          0          0          1          0          0
  6        Agric Sci    0          0          0          0          1          0
  7        Health       0          0          0          0          0          1

Table 5.2 Indicator Variables for the Variable Faculty.

Since the numbers of Indians and Coloureds were quite small, each less than 2% of the total, a dichotomous variable Brace (black race for blacks) was created. Brace takes the value 1 if race is black and the value 0 for other races (White, Coloured and Indian). The dependent variable was the logit, logit(π) = log(π/(1-π)), where π is the probability that a student passed. Univariable logistic regressions were fitted to the data and the results are given in Table 5.3.
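The coding in Table 5.2, the variable Brace and the logit transformation can be written down compactly. The sketch below is illustrative only: the function names are hypothetical, and the codes are those listed in Tables 5.1 and 5.2 (Engineering, code 1, is the reference faculty, and black is race code 4).

```python
import math

def faculty_indicators(faculty):
    """Indicator variables faculty_2 ... faculty_7 for a faculty code 1..7.
    Engineering (code 1) is the reference group and gets all zeros."""
    if faculty not in range(1, 8):
        raise ValueError("faculty code must be between 1 and 7")
    return {f"faculty_{k}": int(faculty == k) for k in range(2, 8)}

def brace(race):
    """1 if race is black (code 4 in Table 5.1), 0 for the other codes."""
    return 1 if race == 4 else 0

def logit(p):
    """log(p / (1 - p)) for a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

arts = faculty_indicators(4)        # only faculty_4 equals 1
engineering = faculty_indicators(1) # reference group: every indicator is 0
```

Six indicators are enough for seven faculties because the reference faculty is identified by all indicators being zero.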
  Predictor   Estimated     Estimated        Estimated    95% CI           Wald Test
  Variable    Coefficient   Standard Error   Odds ratio                    P-value
  Age         -0.0552       0.00726          0.759        (0.707,0.815)    <0.0001
  Agregate     0.00287      0.000078         1.267        (0.673,1.287)    <0.0001
  Campuss     -0.1645       0.0172           0.720        (0.673,0.770)    <0.0001
  Maritall     0.2043       0.0762           1.505        (1.116,2.028)    0.0073
  Finaidd      0.3483       0.0264           2.007        (1.810,2.225)    <0.0001
  Genderr      0.1662       0.0167           1.394        (1.306,1.489)    <0.0001
  English      0.3890       0.0199           2.177        (2.014,2.353)    <0.0001
  Faculty_2    0.2447       0.0586           1.277        (1.139,1.433)    <0.0001
  Faculty_3    0.5835       0.0620           1.792        (1.587,2.024)    <0.0001
  Faculty_4    1.8045       0.0757           6.077        (5.239,7.048)    <0.0001
  Faculty_5    0.7191       0.0744           2.053        (1.774,2.375)    <0.0001
  Faculty_6   -0.0894       0.0866           0.914        (0.772,1.084)    0.3020
  Faculty_7    1.2743       0.0288           3.576        (3.040,4.207)    <0.0001
  Brace       -0.9388       0.0344           0.391        (0.366,0.418)    <0.0001

Table 5.3 Univariable Logistic Regression Models

For the variables Age and Agregate in Table 5.3 the odds ratios are for an increase of 5 years and 100 marks respectively. A change of 1 mark or 1 year would not be meaningful. With the exception of the variables Faculty_6 and Agregate, there is evidence that each of the variables has some association with the outcome variable, pass. This is based on the observation that the confidence interval estimates do not contain 1. Furthermore, all variables except Faculty_6 are significant with P-value ≤ 0.25 for the Wald test.

Based on the univariable results, we now begin the multivariable model including all variables besides Faculty_6, which is not significant. The model is shown in Table 5.4. The Wald statistic is now used to delete, one by one, the variables that do not appear to be significant at the 0.05 level, starting with the least significant one.
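Several entries of Table 5.3 can be reproduced from the estimated coefficients and standard errors alone. The sketch below is a hedged illustration: the function names are hypothetical, the two-sided Wald P-value is taken from the standard normal distribution, and 1.96 is assumed as the multiplier for the 95% confidence interval.

```python
import math

def wald_p_value(coef, se):
    """Two-sided P-value for the Wald statistic z = coef / se."""
    z = coef / se
    return math.erfc(abs(z) / math.sqrt(2.0))

def odds_ratio_ci(coef, se, units=1.0, z_crit=1.96):
    """Odds ratio and 95% CI for a change of `units` in the predictor."""
    point = math.exp(units * coef)
    lower = math.exp(units * (coef - z_crit * se))
    upper = math.exp(units * (coef + z_crit * se))
    return point, lower, upper

# Coefficients and standard errors as reported in Table 5.3.
p_maritall = wald_p_value(0.2043, 0.0762)      # reported as 0.0073
p_faculty_6 = wald_p_value(-0.0894, 0.0866)    # reported as 0.3020
or_age = odds_ratio_ci(-0.0552, 0.00726, units=5)  # per 5 years of age
or_faculty_2 = odds_ratio_ci(0.2447, 0.0586)

candidate = p_faculty_6 < 0.25   # screening rule for the multivariable model
```

With these inputs the computed values agree with the reported ones to the precision shown in the table, and Faculty_6 fails the 0.25 screening rule.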
  Criterion    Intercept Only    Intercept and Covariates
  AIC          21460.178         19424.526
  SC           21467.979         19533.737
  -2 Log L     21458.178         19396.526

  Testing Global Null Hypothesis: BETA=0
  Test                Chi-Square    DF    Pr > ChiSq
  Likelihood Ratio    2061.6516     13    <.0001
  Score               2085.0164     13    <.0001
  Wald                1798.0246     13    <.0001

  Analysis of Maximum Likelihood Estimates
  Parameter    DF    Estimate    Std Error    Wald Chi-Square    Pr > ChiSq
  Intercept    1     -2.5845     0.2700       91.6291            <.0001
  age          1     -0.0101     0.00855      1.3903             0.2384
  agregate     1      0.00162    0.000094     298.6322           <.0001
  Campuss      1      0.0662     0.0255       6.7383             0.0094
  maritall     1      0.0637     0.0946       0.4535             0.5007
  finaidd      1      0.3634     0.0282       165.6272           <.0001
  genderr      1      0.1001     0.0186       29.0649            <.0001
  english      1      0.0778     0.0237       10.7918            0.0010
  faculty_2    1      0.6209     0.0564       121.0806           <.0001
  faculty_3    1      0.5730     0.0581       97.1171            <.0001
  faculty_4    1      1.6185     0.0878       340.1683           <.0001
  faculty_5    1      0.7232     0.0862       70.4490            <.0001
  faculty_7    1      1.1871     0.0815       211.9119           <.0001
  Brace        1     -0.5585     0.0437       163.3471           <.0001

  Odds Ratio Estimates
  Effect             Point Estimate    95% Wald Confidence Limits
  age                0.990             0.974    1.007
  agregate           1.267             0.673    1.287
  Campuss 1 vs 2     1.142             1.033    1.262
  maritall 1 vs 2    1.136             0.784    1.646
  finaidd 1 vs 2     2.068             1.852    2.311
  genderr 1 vs 2     1.222             1.136    1.314
  english 1 vs 2     1.168             1.065    1.282
  faculty_2          1.861             1.666    2.078
  faculty_3          1.774             1.583    1.988
  faculty_4          5.046             4.248    5.992
  faculty_5          2.061             1.741    2.440
  faculty_7          3.278             2.793    3.846
  Brace              0.572             0.525    0.623

Table 5.4 Multivariable Model Containing Variables Identified in the Univariable Analysis.

The model at the end of the process of removing non-significant variables is shown in Table 5.6. At this point, we allow each of the variables not in the model the opportunity to re-enter the model, one by one. As each variable enters the model, we evaluate its statistical significance using the Wald test and also ascertain whether or not the variable is a confounder of other variables in the model by calculating the extent of the change in the coefficients of the variables in the model.
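The AIC and SC values reported in Table 5.4 can be reproduced from the -2 Log L values. The sketch below assumes the usual definitions, AIC = -2 Log L + 2k and SC = -2 Log L + k ln(n), where k is the number of estimated parameters (including the intercept) and n = 18047 is the number of students; the function names are hypothetical.

```python
import math

N_STUDENTS = 18047  # sample size given in the text

def aic(neg2_log_l, k):
    """Akaike Information Criterion: -2 Log L + 2k."""
    return neg2_log_l + 2 * k

def sc(neg2_log_l, k, n=N_STUDENTS):
    """Schwarz Criterion: -2 Log L + k ln(n)."""
    return neg2_log_l + k * math.log(n)

# Intercept-only model: one parameter.
aic_null = aic(21458.178, k=1)
sc_null = sc(21458.178, k=1)

# Model of Table 5.4: intercept plus 13 covariates.
aic_full = aic(19396.526, k=14)
sc_full = sc(19396.526, k=14)

# Likelihood ratio chi-square for the global null hypothesis.
g_global = 21458.178 - 19396.526
```

The computed values agree with the table to within rounding, and the difference of the two -2 Log L values reproduces the reported likelihood ratio chi-square on 13 degrees of freedom.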
There is no significant change in the coefficients of the other variables when Faculty_6 re-enters the model and, according to the Wald test, the variable is not statistically significant. The same argument holds for the variables Maritall and Age when they re-enter the model. Therefore, the preliminary main-effects model is as given in Table 5.6.

Before proceeding to determine interactions we need to examine the variables that have been modelled as continuous to obtain the correct scale in the logit. In this case the variable we need to check is Agregate. We start by determining the quartiles of the distribution of Agregate from Table 14 of appendix 1 and create three design variables, using the lowest quartile as the reference group. The results of the quartile analysis are shown in Table 5.5.

  Quartile    Midpoint    Coefficient    95% CI for Odds Ratios
  1           775         0
  2           955         0.2898         (1.208,1.478)
  3           1137        1.0672         (2.516,3.359)
  4           1680        0.9989         (2.407,3.063)

Table 5.5 Results of Quartile Analyses of the Variable Agregate from the Multivariable Model Containing Variables shown in Table 5.6

  Criterion    Intercept Only    Intercept and Covariates
  AIC          21460.178         19423.992
  SC           21467.979         19517.601
  -2 Log L     21458.178         19399.992

  Testing Global Null Hypothesis: BETA=0
  Test                Chi-Square    DF    Pr > ChiSq
  Likelihood Ratio    2058.1862     11    <.0001
  Score               2083.3017     11    <.0001
  Wald                1796.8154     11    <.0001

  Analysis of Maximum Likelihood Estimates
  Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
  Intercept    1     -2.7409     0.1250            480.9214           <.0001
  agregate     1      0.00163    0.000093          304.3700           <.0001
  Campuss      1      0.0730     0.0252            8.3651             0.0038
  finaidd      1      0.3637     0.0282            165.9746           <.0001
  genderr      1      0.1029     0.0184            31.1306            <.0001
  english      1      0.0837     0.0234            12.8165            0.0003
  faculty_2    1      0.6197     0.0564            120.6196           <.0001
  faculty_3    1      0.5729     0.0582            97.0702            <.0001
  faculty_4    1      1.6244     0.0877            342.9895           <.0001
  faculty_5    1      0.7316     0.0861            72.2871            <.0001
  faculty_7    1      1.1867     0.0815            211.7846           <.0001
  Brace        1     -0.5567     0.0437            162.5179           <.0001

  Odds Ratio Estimates
  Effect            Point Estimate    95% Wald Confidence Limits
  agregate          1.267             0.673    1.287
  Campuss 1 vs 2    1.157             1.048    1.277
  finaidd 1 vs 2    2.070             1.853    2.312
  genderr 1 vs 2    1.229             1.143    1.321
  english 1 vs 2    1.182             1.079    1.296
  faculty_2         1.858             1.664    2.076
  faculty_3         1.773             1.582    1.988
  faculty_4         5.075             4.274    6.027
  faculty_5         2.078             1.756    2.460
  faculty_7         3.276             2.792    3.844
  Brace             0.573             0.526    0.624

  Association of Predicted Probabilities and Observed Responses
  Percent Concordant    69.8        Somers' D    0.405
  Percent Discordant    29.3        Gamma        0.409
  Percent Tied          0.9         Tau-a        0.164
  Pairs                 65896012    c            0.703

  Adjusted Odds Ratios
  Effect      Unit      Estimate
  agregate    100.0     1.177
  agregate    -100.0    0.850

Table 5.6 Preliminary Main Effects Model

[Figure 5.1 Plot of quartile midpoints against coefficients. Horizontal axis: agreg (700 to 1700); vertical axis: coefficient (0.0 to 1.4).]

The results of plotting the quartile midpoints against the coefficients are shown in Figure 5.1. The plot of the coefficients supports an assumption of non-linearity in the logit. Addition of the variable [Agregate*ln(Agregate)] to the model containing Agregate as a continuous variable yields a significant coefficient for the variable [Agregate*ln(Agregate)]. This confirms that Agregate is not linear in the logit. From Table 5.5, the coefficients for the third and fourth quartiles are similar in magnitude and their confidence intervals have a great deal of overlap. These observations, which are also supported by Figure 5.1, suggest the creation of a dichotomous variable taking on the value 1 if Agregate is in the third or fourth quartile and the value zero otherwise. The results of including the dichotomous variable Agregate_ in the multivariable model are shown in Table 5.7.
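The two derived variables used in this step, the Box-Tidwell term and the dichotomous variable Agregate_, are simple transformations of Agregate. The sketch below is illustrative only: the function names are hypothetical, and the cut point separating the second and third quartiles is not reported in the text, so the value used here is a placeholder and not the one used in the study.

```python
import math

# Placeholder only: the actual median of Agregate is not given in the text.
HYPOTHETICAL_MEDIAN = 1050.0

def box_tidwell_term(agregate):
    """The term Agregate * ln(Agregate) added to test linearity in the logit."""
    if agregate <= 0:
        raise ValueError("Agregate must be positive to take its logarithm")
    return agregate * math.log(agregate)

def agregate_dichotomous(agregate, cut_point=HYPOTHETICAL_MEDIAN):
    """1 if Agregate lies in the third or fourth quartile, 0 otherwise."""
    return 1 if agregate >= cut_point else 0

# Quartile midpoints from Table 5.5.
midpoints = [775, 955, 1137, 1680]
coded = [agregate_dichotomous(m) for m in midpoints]
```

With any cut point lying between the second and third quartile midpoints, the four midpoints of Table 5.5 are coded 0, 0, 1, 1.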
  Criterion    Intercept Only    Intercept and Covariates
  AIC          21460.178         19676.792
  SC           21467.979         19770.401
  -2 Log L     21458.178         19652.792

  Testing Global Null Hypothesis: BETA=0
  Test                Chi-Square    DF    Pr > ChiSq
  Likelihood Ratio    1805.3857     11    <.0001
  Score               1831.4242     11    <.0001
  Wald                1618.6281     11    <.0001

  Analysis of Maximum Likelihood Estimates
  Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
  Intercept    1     -1.0521     0.0663            251.6263           <.0001
  english      1      0.1664     0.0229            52.6993            <.0001
  finaidd      1      0.3990     0.0279            204.5030           <.0001
  Campuss      1      0.0783     0.0251            9.7167             0.0018
  genderr      1      0.1162     0.0183            40.3096            <.0001
  faculty_2    1      0.5567     0.0558            99.5148            <.0001
  faculty_3    1      0.5393     0.0576            87.7013            <.0001
  faculty_4    1      1.6151     0.0868            345.9076           <.0001
  faculty_5    1      0.7533     0.0855            77.5690            <.0001
  faculty_7    1      1.1212     0.0806            193.6399           <.0001
  Brace        1     -0.7023     0.0423            275.3671           <.0001
  agregate_    1      0.2966     0.0393            57.0404            <.0001

  Odds Ratio Estimates
  Effect            Point Estimate    95% Wald Confidence Limits
  english 1 vs 2    1.395             1.275    1.526
  finaidd 1 vs 2    2.221             1.991    2.478
  Campuss 1 vs 2    1.169             1.060    1.290
  genderr 1 vs 2    1.262             1.174    1.355
  faculty_2         1.745             1.564    1.947
  faculty_3         1.715             1.532    1.920
  faculty_4         5.028             4.241    5.961
  faculty_5         2.124             1.796    2.512
  faculty_7         3.069             2.620    3.593
  Brace             0.495             0.456    0.538
  agregate_         1.345             1.246    1.453

Table 5.7 Multivariable Model With Dichotomous Variable Agregate_.

We now form all possible two-way interactions using the variables in Table 5.7.

  engagr=english*agregate_
  engfin=english*finaidd
  amfin=campuss*finaidd
  finfac2=finaidd*faculty_2
  finfac3=finaidd*faculty_3
  finfac4=finaidd*faculty_4
  finfac5=finaidd*faculty_5
  finfac7=finaidd*faculty_7
  racfin=brace*finaidd
  agrbrac=agregate*brace
  engbrac=english*brace

The interaction terms are added one by one to the model containing the main effects. Table 5.8 shows those interactions that were significant when added one by one to the main effects model. Interactions which are not significant will be excluded from the model.
The final model, containing the main effects and the significant interactions, is shown in Table 5.9. It should be noted that when there is a statistically significant interaction, we include the corresponding main effects in the model regardless of their statistical significance.

From Table 5.10, we see that (12308+1046) = 13354, or 74%, of the 18047 observations in our data are correctly classified by the logistic regression model in Table 5.9. Of the 5083 observed passes, 1046 or 20.6% are correctly classified as predicted passes. The remaining 4037 of these observations are incorrectly classified as predicted fails; they are called false-negatives. Only 656 of the observed fails are incorrectly classified as predicted passes. These observations are called false-positives.

The c statistic in Table 5.11 gives the area under the ROC curve (the AUC) in Figure 5.2. This c-value is 0.694 and indicates that the model has low predictive accuracy. However, low predictive accuracy does not imply that the model does not fit.
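The percentages quoted above, together with the measures defined in section 4.3.2, follow directly from the four cell counts of Table 5.10. The sketch below is a minimal illustration in which a pass is treated as the "positive" outcome; the function name and dictionary keys are hypothetical.

```python
def classification_summary(tn, fp, fn, tp):
    """Summary measures of a 2 x 2 classification table."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,   # proportion correctly classified
        "sensitivity": tp / (tp + fn),   # P(T+ | D+)
        "specificity": tn / (tn + fp),   # P(T- | D-)
        "ppv": tp / (tp + fp),           # P(D+ | T+)
        "npv": tn / (tn + fn),           # P(D- | T-)
    }

# Cell counts from Table 5.10 (actual fail/pass by predicted fail/pass).
summary = classification_summary(tn=12308, fp=656, fn=4037, tp=1046)
```

With these counts the sensitivity is about 0.206, which is the 20.6% of observed passes that are predicted as passes, while the specificity is about 0.949 and the overall proportion correctly classified is about 0.740.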
  Criterion    Intercept Only    Intercept and Covariates
  AIC          21460.178         19635.958
  SC           21467.979         19791.972
  -2 Log L     21458.178         19595.958

  Analysis of Maximum Likelihood Estimates
  Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
  Intercept    1     -0.8944     0.3361            7.0820             0.0078
  Campuss      1      0.0806     0.0252            10.2242            0.0014
  genderr      1      0.1099     0.0184            35.7304            <.0001
  finaidd      1      0.2188     0.0922            5.6253             0.0177
  english      1      0.1100     0.1428            0.5936             0.4410
  faculty_2    1      0.8804     0.2504            12.3660            0.0004
  faculty_3    1      0.5433     0.0577            88.6276            <.0001
  faculty_4    1      0.4241     0.4249            0.9962             0.3182
  faculty_5    1     -0.0852     0.3634            0.0550             0.8146
  faculty_7    1      1.1313     0.0808            195.8363           <.0001
  Brace        1     -0.5948     0.3306            3.2371             0.0720
  agregate_    1      0.6119     0.1228            24.8345            <.0001
  engagr       1     -0.1501     0.0936            2.5717             0.1088
  engfin       1     -0.1932     0.1293            2.2326             0.1351
  finfac4      1      0.6224     0.2182            8.1332             0.0043
  finfac2      1     -0.1775     0.1301            1.8627             0.1723
  finfac5      1      0.4584     0.1878            5.9573             0.0147
  racfin       1     -0.1765     0.1473            1.4345             0.2310
  agrbrac      1     -0.2025     0.0875            5.3520             0.0207
  engbrac      1      0.3499     0.1325            6.9747             0.0083

Table 5.8 A model containing Interactions which were Significant when Added One by One to the Main Effects Model.
Criterion    Intercept Only    Intercept and Covariates
AIC          21460.178         19636.828
SC           21467.979         19769.441
-2 Log L     21458.178         19602.828

Analysis of Maximum Likelihood Estimates

Parameter      DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept      1    -1.3260    0.0944           197.1637          <.0001
Campuss   1    1     0.0802    0.0252            10.1358          0.0015
genderr   1    1     0.1097    0.0184            35.6720          <.0001
finaidd   1    1     0.3905    0.0441            78.5287          <.0001
english   1    1     0.3305    0.0609            29.4292          <.0001
faculty_2      1     1.0687    0.2342            20.8300          <.0001
faculty_3      1     0.5447    0.0577            89.1103          <.0001
faculty_4      1     0.3639    0.4263             0.7287          0.3933
faculty_5      1    -0.0899    0.3664             0.0602          0.8061
faculty_7      1     1.1313    0.0808           196.0118          <.0001
Brace          1    -0.9373    0.1737            29.1054          <.0001
agregate_      1     0.4453    0.0688            41.8797          <.0001
finfac4        1     0.6557    0.2187             8.9884          0.0027
finfac2        1    -0.2770    0.1212             5.2212          0.0223
finfac5        1     0.4603    0.1892             5.9179          0.0150
agrbrac        1    -0.2363    0.0838             7.9462          0.0048
engbrac        1     0.3694    0.1315             7.8870          0.0050

Table 5.9 Final Model with Interactions

                         Predicted by Model
Actual Classification    0        1       Total
0                        12308    656     12964
1                        4037     1046    5083
Total                    16345    1702    18047

Table 5.10 Contingency Matrix for model in Table 5.9

Odds Ratio Estimates

Effect              Point Estimate   95% Wald Confidence Limits
Campuss 1 vs 2      1.174            1.064    1.296
genderr 1 vs 2      1.245            1.159    1.338
finaidd 1 vs 2      2.183            1.837    2.595
english 1 vs 2      1.937            1.525    2.459
faculty_2           2.912            1.840    4.607
faculty_3           1.724            1.540    1.930
faculty_4           1.439            0.624    3.318
faculty_5           0.914            0.446    1.874
faculty_7           3.100            2.646    3.632
Brace               0.392            0.279    0.551
agregate_           1.561            1.364    1.786
finfac4             1.926            1.255    2.958
finfac2             0.758            0.598    0.961
finfac5             1.585            1.094    2.296
agrbrac             0.790            0.670    0.931
engbrac             1.447            1.118    1.872

Association of Predicted Probabilities and Observed Responses

Percent Concordant    68.6        Somers' D    0.388
Percent Discordant    29.8        Gamma        0.394
Percent Tied           1.7        Tau-a        0.157
Pairs             65896012        c            0.694

Table 5.11 Odds Ratios and Association of Predicted Probabilities and Observed Responses for the Final Model in Table 5.9

5.2 OTHER LOGISTIC REGRESSION SELECTION
PROCEDURES

The results of applying the Forward, Backward, Stepwise and Best Subset selection procedures are given in Appendices 2, 3, 4 and 5 respectively. All the stepwise procedures except Forward Selection produced eleven-variable models. Forward Selection included two additional variables, Age and Faculty_6, which are non-significant at the 5% significance level according to the Wald test. These variables satisfied the entry level of P = 0.25 but could not leave the model, since the Forward procedure makes no provision for removing non-significant variables once they have entered.

The Best Subset procedure using the Cp criterion pointed to a model with twelve variables from the two ‘best’ models requested in the procedure. For the Best Subset procedure using the Score criterion, we likewise requested the two ‘best’ models of each size (i.e. from a model containing one variable to a model with 13 variables). From the two ‘best’ models with twelve variables, the Score criterion selected the same model as the Cp criterion.

The Purposeful Selection procedure, like the Backward and Stepwise procedures, produced a model with eleven variables. However, Purposeful Selection called for the variable Agregate to enter the model as a binary variable, following the analysis of the scale of this continuous variable.

5.3 INVESTIGATION OF THE AUC AS A SELECTION TOOL

An attempt is now made to establish whether the area under the ROC curve (AUC) can be used as a tool for selecting variables; in other words, whether a model can be built by including variables that increase the AUC as they enter the model. A variable stays in the model provided it is significant according to the Wald test. As in Forward stepwise selection, variables enter the model one at a time. The process starts by building one-variable models and recording the AUC and the P-values, as shown in Table 23. The one-variable model with the highest AUC provides the first variable to enter the model.
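A single step of this procedure, adding each candidate variable in turn and keeping the one that gives the highest AUC, can be sketched in Python as follows (the thesis itself uses SAS). This is only an illustration: `best_additions` and `table_23_auc` are hypothetical names, and `table_23_auc` does not fit any model; it simply looks up the one-variable AUC values reported in Table 23 (Appendix 6) in place of a real logistic regression fit.

```python
def best_additions(current, candidates, auc_of):
    """Try adding each candidate to `current`; return the best AUC and who achieved it.

    `auc_of` is any callable mapping a tuple of variable names to the AUC of
    the model containing those variables.
    """
    trials = {}
    for variable in candidates:
        if variable in current:
            continue
        trials[variable] = auc_of(tuple(current) + (variable,))
    top = max(trials.values())
    tied = sorted(v for v, a in trials.items() if a == top)
    return top, tied


# One-variable AUC values as reported in Table 23 (step 1 of the AUC procedure).
STEP1_AUC = {
    "age": 0.532, "agregate": 0.637, "campuss": 0.537, "genderr": 0.541,
    "marital": 0.503, "finaidd": 0.532, "english": 0.576, "brace": 0.608,
    "faculty_2": 0.545, "faculty_3": 0.508, "faculty_4": 0.555,
    "faculty_5": 0.508, "faculty_6": 0.520, "faculty_7": 0.523,
}


def table_23_auc(variables):
    """Hypothetical stand-in for model fitting: valid for one-variable models only."""
    (only,) = variables
    return STEP1_AUC[only]


top_auc, first_variables = best_additions((), STEP1_AUC, table_23_auc)
print(top_auc, first_variables)
```

The function returns every variable that attains the maximum rather than picking one, so that the caller can decide how ties are resolved.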
In the next step, each of the remaining variables is added in turn to the model, and only the two-variable models with an AUC greater than the highest AUC obtained in the first step are considered. In the third step, the two-variable model with the highest AUC becomes the basis for a three-variable model, and again only models with an AUC higher than the largest obtained in the previous step are considered. At any step, if more than one model attains the same maximum AUC, the model that forms the basis for the next step is selected using the AIC. The process continues in this way until the AUC does not increase further even when the number of variables in the model increases. However, only variables that are significant according to the Wald test are allowed to stay in the model.

Tables 23 to 36 give the results of applying the above procedure to our data set. We note that in the last two steps (Tables 35 and 36) there are non-significant variables. The final model, given in Table 34, has eleven variables and is the same eleven-variable model obtained previously using the Purposeful, Backward and Stepwise selection procedures. The ROC curve for the model in Table 34 is given in Figure 5.2. The area under this curve is 0.703, as shown in that table. This value indicates fair discrimination (predictive accuracy) by the model.

From Table 5.12 we see that (12240 + 1191) = 13431, or 74%, of the observations in our data are correctly classified by the logistic regression model in Table 34. Of the 5083 observed passes, 1191 (23%) are correctly classified as predicted passes, while 3892 (77%) are incorrectly classified as predicted fails (false negatives). Only 724 (5.6%) of the observed fails are incorrectly classified as predicted passes (false positives).
[Figure: ROC curve, sensitivity plotted against 1 - specificity]

Figure 5.2 ROC curve for the model obtained using AUC procedure.

                         Predicted by Model
Actual Classification    0        1       Total
0                        12240    724     12964
1                        3892     1191    5083
Total                    16132    1915    18047

Table 5.12 Contingency Matrix for the Model in Table 34

5.4 THE AUC AND THE STEPWISE SELECTION PROCEDURES

These two selection procedures produced similar models. Both procedures involve ‘picking’ and ‘dropping’ variables, and we now investigate the order in which variables enter and leave the models. The comparison is shown in Table 5.13.

Stepwise Procedure                                  AUC Procedure
Step  Variable Entered/Removed  Wald P-value        Step  Variable Entered/Removed     Wald P-value  AUC
1     Agregate                  0.0001              1     Agregate                     0.0001        0.637
2     Faculty_4                 0.0001              2     Brace                        0.0001        0.656
3     Faculty_7                 0.0001              3     Finaidd                      0.0001        0.671
4     Finaidd                   0.0001              4     Faculty_4                    0.0001        0.681
5     Brace                     0.0001              5     Faculty_7                    0.0001        0.687
6     Genderr                   0.0001              6     Genderr                      0.0001        0.690
7     Faculty_6                 0.0001              7     Faculty_6                    0.0001        0.694
8     Faculty_2                 0.0001              8     Faculty_2                    0.0001        0.695
9     Faculty_3                 0.0001              9     Faculty_3                    0.0058        0.697
10    Faculty_5                 0.0001              10    Faculty_5                    0.0001        0.701
11    Faculty_6 Removed         0.6888              10    Faculty_6 Removed            0.6888        0.702
12    English                   0.0002              11    English                      0.0002        0.703
13    Campuss                   0.0038              12    Campuss                      0.0038        0.703
14    Age                       0.862               13    Age Entered and Removed      0.0864        0.703
15    Age Removed               0.864               14    Maritall Entered & Removed   0.1584        0.703

Table 5.13 Comparison of the Stepwise and the AUC procedures

From Table 5.13, both procedures have Agregate as the first variable to enter the model. In steps 2 to 5 the same four variables entered the model, though not in the same sequence. From step 6 to the end, the two procedures yielded almost the same results, except that the Stepwise procedure did not consider the variable Maritall for entry into the model. The example used is perhaps not ideal for investigating the ROC curve as a variable selection technique.
Here we have many potential variables to select from, and each of them makes only a small contribution to the predicted probabilities. However, almost all of these contributions are statistically significant because of the huge sample size. Judging by the AUCs, the increase in AUC from Table 32 to Table 36 (Appendix 6) is only 0.2%, and from Table 29 to Table 36 only 0.8%. These are small increases, and one may as well decide to use the model of Table 34 as the final model. It is clear that much more research on the use of the AUC is needed.

CHAPTER 6

DISCUSSION AND CONCLUSION

The purpose of this study was to explore methods and procedures used to select predictor variables for binary response variables. As a point of departure, selection procedures for a continuous response variable were also discussed in order to illuminate the whole question of variable selection.

We have seen that selection procedures for binary responses and continuous dependent variables are basically the same; for example, the methods used in logistic regression are very similar to those used for the Cox regression model. For both regressions, the ‘Purposeful Selection of Variables’ emerges as the most interesting and recommended procedure for selecting variables, since the method is completely controlled by the analyst. The stepwise and the best subset procedures are statistical algorithms which, to some extent, do the selection automatically.

In situations where the number of variables is not large, Purposeful selection is recommended as the sole tool for selection. When the number of variables is too large, it can be coupled with Stepwise selection, in which case Stepwise selection reduces the number of predictor variables to a reasonable number before Purposeful selection is used.
Another advantage of Purposeful selection is that variables which are scientifically relevant, or known to interact with other variables, can be included regardless of their statistical significance. Thus the analyst, not the computer, becomes responsible for the review and evaluation of the model.

The results of a fitted logistic regression model can be summarised intuitively via classification tables. In this regard, the logistic regression model is a diagnostic test and the classification table measures its prediction accuracy. However, this measure is statistically insensitive. On the other hand, the area under the ROC curve, another measure of predictive accuracy, is not an extremely sensitive measure for comparing two models.

It is important to note that high predictive accuracy does not necessarily provide evidence that the model fits well. Conversely, we may have a situation where the logistic regression model is in fact the correct model, and thus fits the data, but classification or discrimination is poor. These measures should, therefore, supplement more rigorous methods of assessing fit.

The results in Tables 5.7, 5.13 and 34 suggest that, to some extent, the AUC can be used as a criterion for variable selection, with the P-value of the Wald test used to remove insignificant variables, perhaps even as an alternative to the Purposeful and Stepwise selection procedures. However, further research is required to investigate this approach, especially for highly correlated variables. It is further recommended that the data set used to fit the model should not be used to test its predictive accuracy, otherwise the results become biased. A new set of observations should be used to avoid this bias, and the method called jack-knifing should be applied.
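The recommendation to assess predictive accuracy on observations not used for fitting can be illustrated with a minimal Python sketch (the thesis itself uses SAS). Jack-knifing is read here as leave-one-out: each observation is scored by a model fitted to all the other observations, and the AUC is then computed from those held-out scores. Everything in the sketch is hypothetical, including the toy data and `fit_class_means`, a deliberately simple one-variable scoring rule that stands in for a logistic regression fit.

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def leave_one_out_scores(rows, labels, fit):
    """Score each observation with a model fitted on all the other observations."""
    scores = []
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        predict = fit(train_rows, train_labels)  # fit returns a scoring function
        scores.append(predict(rows[i]))
    return scores


def fit_class_means(rows, labels):
    """Toy stand-in for a fitted model: score by closeness to the class-1 mean."""
    ones = [x for x, y in zip(rows, labels) if y == 1]
    zeros = [x for x, y in zip(rows, labels) if y == 0]
    mean1 = sum(ones) / len(ones)
    mean0 = sum(zeros) / len(zeros)
    return lambda x: (x - mean0) ** 2 - (x - mean1) ** 2


# Hypothetical toy data: one predictor and a binary outcome.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]

in_sample = auc([fit_class_means(xs, ys)(x) for x in xs], ys)
held_out = auc(leave_one_out_scores(xs, ys, fit_class_means), ys)
print(f"in-sample AUC {in_sample:.3f}, leave-one-out AUC {held_out:.3f}")
```

With 18047 observations, refitting a logistic regression once per observation would be slow; a single hold-out sample or a k-fold split follows the same pattern with fewer refits.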
The following are some of the major challenges for evaluating diagnostic tests and for applying ROC methodology in particular:

(1) Status, for example disease status, is often not a fixed entity, but rather may evolve over time. How can the time aspect be incorporated sensibly into ROC analysis?

(2) The statistical literature on diagnostic testing assumes that the test result is a simple numeric value. However, test results may be much more complicated, involving several components. Do ROC curves and the AUC have a role to play in determining how to combine different sources of information to optimise diagnostic accuracy?

The very brief investigation into the use of ROC curves and the AUC in this thesis by no means yields definitive answers to the question: how effective is the ROC curve as a tool for subset selection? Much more research is needed.

Finally, as the information revolution brings us larger data sets with more and more variables, the demand for variable selection will strengthen, and variable selection will continue to be a basic strategy for data analysis. New problems will also appear as demand increases for data mining of massive data sets.
82 APPENDIX 1A The UNIVARIATE Procedure Variable: age (age) Moments N Mean Std Deviation Skewness Uncorrected SS Coeff Variation 18047 20.0791821 2.72214158 4.36262118 7409795 13.5570342 Sum Weights Sum Observations Variance Kurtosis Corrected SS Std Error Mean 18047 362369 7.41005479 28.9348552 133721.849 0.02026321 Basic Statistical Measures Location Variability Mean 20.07918 Median 19.00000 Mode 19.00000 Interquartile Range Std Deviation Variance Range 2.00000 2.72214 7.41005 38.00000 Tests for Location: Mu0=0 Test -Statistic- Student's t Sign Signed Rank t M S -----p Value------ 990.9182 9023.5 81428064 Pr > |t| Pr >= |M| Pr >= |S| <.0001 <.0001 <.0001 Quantiles (Definition 5) Quantile 100% Max 99% 95% 90% 75% Q3 50% Median 25% Q1 10% 5% 1% 0% Min Estimate 54 33 24 22 21 19 19 18 18 17 16 Extreme Observations ----Lowest---- ----Highest--- Value Obs Value Obs 16 16 16 16 16 17517 17497 11294 10238 9455 51 51 52 52 54 2298 11190 7516 13182 3372 Table 14 Univariate Analysis of the Variable Age 83 The UNIVARIATE Procedure Variable: agregate (agregate) Moments N Mean Std Deviation Skewness Uncorrected SS Coeff Variation 18047 1056.67779 218.254459 0.5336167 2.10103E10 20.6547788 Sum Weights Sum Observations Variance Kurtosis Corrected SS Std Error Mean 18047 19069864 47635.0089 -0.1263512 859621371 1.624653 Basic Statistical Measures Location Mean 1056.678 Median 1075.000 Mode 1075.000 Interquartile Range Variability Std Deviation Variance Range 365.00000 218.25446 47635 1440 Tests for Location: Mu0=0 Test -Statistic- Student's t Sign Signed Rank t M S -----p Value------ 650.4021 9023.5 81428064 Pr > |t| Pr >= |M| Pr >= |S| <.0001 <.0001 <.0001 Quantiles (Definition 5) Quantile Estimate 100% Max 99% 95% 90% 75% Q3 50% Median 25% Q1 10% 5% 1% 0% Min Variable: agregate 2160 1612 1440 1320 1200 1075 835 835 720 720 720 (agregate) Extreme Observations ----Lowest---Value 720 720 720 720 720 ----Highest--- Obs Value Obs 18044 18041 18038 18032 18022 1705 1715 1750 1750 
2160 12262 6901 2105 8313 9123 Table 15 Univariate Analysis of the Variable Agregate 84 APPENDIX 1B Faculty Faculty 2 3 1 5 6 4 7 Frequency 6586 3771 2506 1540 1390 1313 941 Percent 36.49 20.90 13.89 8.53 7.70 7.28 5.21 Cumulative Cumulative Frequency Percent 6586 36.49 10357 57.39 12863 71.28 14403 79.81 15793 87.51 17106 94.79 18047 100.00 Race Race 4 1 3 2 Frequency 12105 5334 341 267 Percent 67.07 29.56 1.89 1.48 Cumulative Frequency 12105 17439 17780 18047 Cumulative Percent 67.07 96.63 98.52 100.00 Campuss Campuss 1 2 Frequency 12004 6043 Percent 66.52 33.48 Cumulative Frequency 12004 18047 Cumulative Percent 66.52 100.00 english english 1 2 Frequency 12520 5527 Percent 69.37 30.63 Cumulative Frequency 12520 18047 Cumulative Percent 69.37 100.00 Cumulative Frequency 9207 18047 Cumulative Percent 51.02 100.00 genderr genderr 1 2 Frequency 9207 8840 Percent 51.02 48.98 maritall maritall 1 2 Frequency 17782 265 Percent 98.53 1.47 Cumulative Frequency 17782 18047 Cumulative Percent 98.53 100.00 finaidd finaidd 2 1 Frequency 16391 1656 Percent 90.82 9.18 Cumulative Frequency Cumulative Percent 16391 18047 90.82 100.00 Table 16 Analysis of Categorical Variables 85 APPENDIX 2 Criterion Intercept Only AIC SC -2 Log L 21460.178 21467.979 21458.178 Intercept and Covariates 19424.820 19534.030 19396.820 Testing Global Null Hypothesis: BETA=0 Chi-Square DF Pr > ChiSq Test Likelihood Ratio Score Wald Parameter DF Intercept faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 faculty_7 age agregate Campuss genderr finaidd english Brace 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2061.3579 2085.9704 1798.6426 <.0001 <.0001 <.0001 Analysis of Maximum Likelihood Estimates Standard Wald Estimate Error Chi-Square Pr > ChiSq -2.4774 0.6342 0.5862 1.6319 0.7349 0.0370 1.2008 -0.0130 0.00162 0.0657 0.0987 0.3636 0.0779 -0.5574 0.2060 0.0653 0.0660 0.0922 0.0908 0.0909 0.0877 0.00750 0.000094 0.0256 0.0187 0.0282 0.0237 0.0437 Odds Ratio Estimates Point Estimate Effect faculty_2 faculty_3 
faculty_4 faculty_5 faculty_6 faculty_7 age agregate Campuss genderr finaidd english Brace 13 13 13 1 1 1 1 vs vs vs vs 2 2 2 2 1.886 1.797 5.113 2.085 1.038 3.323 0.987 1.002 1.140 1.218 2.069 1.169 0.573 144.5645 94.3829 78.8456 312.9702 65.4564 0.1654 187.6919 2.9863 298.0360 6.5952 27.8821 165.7910 10.8080 162.8453 <.0001 <.0001 <.0001 <.0001 <.0001 0.6842 <.0001 0.0840 <.0001 0.0102 <.0001 <.0001 0.0010 <.0001 95% Wald Confidence Limits 1.659 1.579 4.268 1.745 0.868 2.798 0.973 1.001 1.032 1.132 1.852 1.065 0.526 2.143 2.045 6.127 2.492 1.240 3.946 1.002 1.002 1.261 1.311 2.311 1.282 0.624 Association of Predicted Probabilities and Observed Responses Percent Concordant Percent Discordant Percent Tied Pairs 70.0 29.5 0.5 65896012 Somers' D Gamma Tau-a c Adjusted Odds Ratios Effect Unit Estimate age age agregate agregate 0.937 1.067 1.176 0.851 5.0000 -5.0000 100.0 -100.0 0.405 0.408 0.164 0.703 Table 17 The Results of Forward Selection Procedure 86 APPENDIX 3 Intercept Only 21460.178 21467.979 21458.178 Criterion AIC SC -2 Log L Intercept and Covariates 19423.992 19517.601 19399.992 Testing Global Null Hypothesis: BETA=0 Test Likelihood Ratio Score Wald Chi-Square DF Pr > ChiSq 2058.1862 2083.3017 1796.8154 11 11 11 <.0001 <.0001 <.0001 Analysis of Maximum Likelihood Estimates Parameter DF Estimate Error Standard Chi-Square Wald Pr > ChiSq Intercept faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 agregate Campuss genderr finaidd english brace 1 1 1 1 1 1 1 1 1 1 1 1 -2.7409 0.6197 0.5729 1.6244 0.7316 1.1867 0.00163 0.0730 0.1029 0.3637 0.0837 -0.5567 0.1250 0.0564 0.0582 0.0877 0.0861 0.0815 0.000093 0.0252 0.0184 0.0282 0.0234 0.0437 480.9214 120.6196 97.0702 342.9895 72.2871 211.7846 304.3700 8.3651 31.1306 165.9746 12.8165 162.5179 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 0.0038 <.0001 <.0001 0.0003 <.0001 1 1 1 1 Odds Ratio Estimates Effect faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 agregate Campuss genderr finaidd english Brace 1 1 1 1 vs 
vs vs vs 2 2 2 2 Point Estimate 1.858 1.773 5.075 2.078 3.276 1.002 1.157 1.229 2.070 1.182 0.573 95% Wald Confidence Limits 1.664 2.076 1.582 1.988 4.274 6.027 1.756 2.460 2.792 3.844 1.001 1.002 1.048 1.277 1.143 1.321 1.853 2.312 1.079 1.296 0.526 0.624 Association of Predicted Probabilities and Observed Responses Percent Concordant Percent Discordant Percent Tied Pairs Adjusted Odds Ratios Effect Unit agregate 100.0 agregate -100.0 69.8 29.3 0.9 65896012 Somers' D Gamma Tau-a c 0.405 0.409 0.164 0.703 Estimate 1.177 0.850 Table 18 The Results of The Backward Selection Procedure 87 APPENDIX 4A Criterion AIC SC -2 Log L Intercept Only 21460.178 21467.979 21458.178 Intercepts and Covariates 19423.992 19517.601 19399.992 Testing Global Null Hypothesis: BETA=0 Test Likelihood Ratio Score Wald Chi-Square 2058.1862 2083.3017 1796.8154 DF 11 11 11 Pr > ChiSq <.0001 <.0001 <.0001 Analysis of Maximum Likelihood Estimates Parameter Intercept faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 agregate Campuss genderr finaidd english Brace DF 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Estimate -2.7409 0.6197 0.5729 1.6244 0.7316 1.1867 0.00163 0.0730 0.1029 0.3637 0.0837 -0.5567 standard wald Error Chi-Square 0.1250 480.9214 0.0564 120.6196 0.0582 97.0702 0.0877 342.9895 0.0861 72.2871 0.0815 211.7846 0.000093 304.3700 0.0252 8.3651 0.0184 31.1306 0.0282 165.9746 0.0234 12.8165 0.0437 162.5179 Pr > ChiSq <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 0.0038 <.0001 <.0001 0.0003 <.0001 Odds Ratio Estimates Point Estimate Effect faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 agregate Campuss genderr finaidd english Brace 1 1 1 1 vs vs vs vs 95% Wald Confidence Limits 1.858 1.773 5.075 2.078 3.276 1.002 1.157 1.229 2.070 1.182 0.573 2 2 2 2 1.664 1.582 4.274 1.756 2.792 1.001 1.048 1.143 1.853 1.079 0.526 2.076 1.988 6.027 2.460 3.844 1.002 1.277 1.321 2.312 1.296 0.624 Association of Predicted Probabilities and Observed Responses Percent Concordant Percent Discordant Percent Tied 
Pairs 69.8 29.3 0.9 65896012 Adjusted Odds Ratios Effect Unit agregate 100.0 agregate -100.0 Somers' D Gamma Tau-a c 0.405 0.409 0.164 0.703 Estimate 1.177 0.850 Table 17 Results of The Stepwise Selection Procedure 88 APPENDIX4B Criterion Intercept Only Intercept and Covariates AIC SC -2 Log L 21460.178 21467.979 21458.178 19604.472 19744.885 19568.472 Testing Global Null Hypothesis: BETA=0 Test Chi-Square DF Pr > ChiSq 1889.7059 1928.0421 1700.7783 17 17 17 <.0001 <.0001 <.0001 Likelihood Ratio Score Wald Analysis of Maximum Likelihood Estimates Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq Intercept faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 agregate_ Campuss genderr finaidd english Brace camfin finfac2 finfac4 finfac5 agrbrac engbrac 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.8217 1.0533 0.5549 -1.0342 -1.4882 1.1323 0.4487 -0.7741 0.1077 -0.1440 0.3295 -0.9424 -0.9071 -0.2644 1.4005 1.2059 -0.2351 0.3704 0.3777 0.2352 0.0578 0.4876 0.4363 0.0809 0.0688 0.1477 0.0184 0.1016 0.0610 0.1739 0.1547 0.1217 0.2526 0.2277 0.0839 0.1317 4.7341 20.0490 92.2633 4.4991 11.6360 196.1369 42.5332 27.4734 34.2445 2.0090 29.1972 29.3771 34.3832 4.7179 30.7308 28.0501 7.8607 7.9155 0.0296 <.0001 <.0001 0.0339 0.0006 <.0001 <.0001 <.0001 <.0001 0.1564 <.0001 <.0001 <.0001 0.0298 <.0001 <.0001 0.0051 0.0049 1 1 1 1 Odds Ratio Estimates Point Estimate Effect faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 agregate_ Campuss genderr finaidd english Brace camfin finfac2 finfac4 finfac5 agrbrac engbrac 1 1 1 1 vs vs vs vs 2 2 2 2 2.867 1.742 0.356 0.226 3.103 1.566 0.213 1.240 0.750 1.933 0.390 0.404 0.768 4.057 3.340 0.790 1.448 95% Wald Confidence Limits 1.808 1.555 0.137 0.096 2.648 1.369 0.119 1.154 0.503 1.522 0.277 0.298 0.605 2.473 2.137 0.671 1.119 4.547 1.951 0.924 0.531 3.636 1.792 0.379 1.333 1.117 2.455 0.548 0.547 0.975 6.657 5.218 0.932 1.875 Association of Predicted Probabilities and Observed Responses Percent Concordant Percent Discordant Percent 
Tied Pairs 68.7 29.5 1.9 65896012 Somers' D Gamma Tau-a c 0.392 0.399 0.159 0.696 Table 20 The Results of The Stepwise Procedure with Interactions included. 89 APPENDIX 5 Number of Variables 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 Score Chi-Square Variables included in the model 976.4420 agregate 768.8932 Brace 1431.5177 faculty_4 agregate 1256.2088 agregate Brace 1597.4285 faculty_4 faculty_7 agregate 1580.0377 faculty_4 agregate Brace 1749.6162 faculty_4 agregate finaidd1 Brace 1734.2736 faculty_4 faculty_7 agregate finaidd1 1869.5660 faculty_4 faculty_7 agregate finaidd1 Brace 1838.9862 faculty_4 agregate genderr1 finaidd1 Brace 1950.3123 faculty_4 faculty_7 agregate genderr1 finaidd1 Brace 1905.0736 faculty_2 faculty_4 faculty_7 agregate finaidd1 Brace 1976.8571 faculty_4 faculty_6 faculty_7 agregate genderr1 finaidd1 Brace 1976.3923 faculty_2 faculty_4 faculty_7 agregate genderr1 finaidd1 Bra 2036.9823 faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 agregate finaidd1 Brace 2017.8803 faculty_2 faculty_3 faculty_4 faculty_7 agregate genderr1 finaidd1 Brace 2071.1645 faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 agregate genderr1 finaidd1 Brace 2044.7677 faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 agregate finaidd1 english1 Brace 2077.5397 faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 agregate genderr1 finaidd1 English Brace 2077.2280 faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 agregate Campuss1 genderr1 finaidd1 Brace 2083.3017 faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 agregate Campuss1 genderr1 finaidd1 english1 Brace 2080.0984 faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 age agregate genderr1 finaidd1 english1 Brace 2084.8214 faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 age agregate Campuss1 genderr1 finaidd1 english1 Brace 2084.3485 faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 agregate Campuss1 genderr1 marital1 finaidd1 english1 Brace 2085.9704 faculty_2 faculty_3 faculty_4 faculty_5 
faculty_6 faculty_7 age agregate campuss1 genderr1 finaidd1 english1 Brace 2085.4222 faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 faculty_7 agregate Campuss1 genderr1 maritall finaidd1 english1 Brace 2086.1659 faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 faculty_7 age agregate campuss1 genderr1 maritall finaidd1 english1 Brace Table 21 The Results of Best Subset Selection Procedure using Score Criterion. C(p) Selection Method Number of Observations Read Number of Observations Used 18047 18047 Weight: v Number in Model C(p) R-Square Variables in Model 12 11.6129 0.0902 faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 age agregate Campuss genderr finaidd English Brace 11 12.4942 0.0900 faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 Campuss agregate gender finaidd English Brace Table 22 The Results of Best Subset Selection Procedure using Cp Criterion. 90 APPENDIX 6 Variable age agregate campuss genderr marital finaidd english brace faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 faculty_7 p-value <0.0001 <0.0001 <0.0001 <0.0001 0.0073 <0.0001 <0.0001 <0.0001 <0.0001 0.0148 <0.0001 <0.0001 <0.0001 <0.0001 AUC 0.532 0.637 * 0.537 0.541 0.503 0.532 0.576 0.608 0.545 0.508 0.555 0.508 0.520 0.523 Table 23 Step1 of the AUC procedure Variables : agregate p-value : <0.0001 AUC : 0.637 Variables : agregate p-value : <0.0001 AUC : 0.637 Variable : agregate p-value : <0.0001 AUC : 0.638 Variable : agregate p-value : <0.0001 AUC : 0.642 Variable : agregate p-value : <0.0001 AUC : 0.643 Variable : agregate p-value : <0.0001 AUC : 0643 Variable : agregate p-value : <0.0001 AUC : 0.647 Variable : agregate p-value : <0.0001 AUC : 0.648 Variable : agregate p-value : <0.0001 AUC : 0.655 Variable : aggregate p-value : <0.0001 AUC : 0.656 * marital <0.0001 faculty_5 0.4389 faculty_3 0.5985 gender <0.0001 campuss <0.0001 faculty_6 <0.0001 english <0.0001 finaidd <0.0001 faculty_4 <0.0001 brace <0.0001 Table 24 Step 2 of the AUC procedure 91 Variables : agregate brace 
faculty_2 p-value : <0.0001 <0.0001 <0.0132 AUC : 0.658 Variables : agregate brace english p-value : <0.0001 <0.0001 <0.0001 AUC : 0.658 Variable : agregate brace campuss p-value : <0.0001 <0.0001 <0.0001 AUC : 0.660 Variable : agregate brace faculty_7 p-value : <0.0001 <0.0001 <0.0001 AUC : 0.660 Variable : agregate brace genderr p-value : <0.0001 <0.0001 <0.0001 AUC : 0.662 Variable : agregate brace faculty_6 p-value : <0.0001 <0.0001 <0.0001 AUC : 0.663 Variable : agregate brace faculty_4 p-value : <0.0001 <0.0001 <0.0001 AUC : 0.666 Variable : agregate brace finaidd p-value : <0.0001 <0.0001 <0.0001 AUC : 0.671* Table 25 Step 3 of the AUC procedure Variables: agregate p-value : <0.0001 AUC : 0.673 Variables: agregate p-value : <0.0001 AUC : 0.674 Variables: agregate p-value : <0.0001 AUC : 0.672 Variables: agregate p-value : <0.0001 AUC : 0.677 Variables: agregate p-value : <0.0001 AUC : 0.677 Variables: agregate p-value : <0.0001 AUC : 0.681* brace finaidd faculty_2 <0.0001 <0.0001 <0.0305 brace finaidd campuss <0.0001 <0.0001 <0.0001 brace finaidd english <0.0001 <0.0001 <0.0001 brace finaidd genderr <0.0001 <0.0001 <0.0001 brace finaidd faculty_6 <0.0001 <0.0001 <0.0001 brace finaidd faculty_4 <0.0001 <0.0001 <0.0001 Table 26 Step 4 of the AUC Procedure 92 Variables: agregate p-value : <0.0001 AUC : 0.682 Variables: agregate p-value : <0.0001 AUC : 0.683 Variables: agregate p-value : <0.0001 AUC : 0.683 Variables: agregate p-value : <0.0001 AUC : 0.685 Variables: agregate p-value : <0.0001 AUC : 0.686 Variables: agregate p-value : <0.0001 AUC : 0.687* brace finaidd faculty_4 faculty_3 <0.0001 <0.0001 <0.0001 0.0226 brace finaidd faculty_4 english <0.0001 <0.0001 <0.0001 <0.0001 brace finaidd faculty_4 faculty_2 <0.0001 <0.0001 <0.0001 <0.0001 brace finaidd faculty_4 faculty_6 <0.0001 <0.0001 <0.0001 <0.0001 brace finaidd faculty_4 genderr <0.0001 <0.0001 <0.0001 <0.0001 brace finaidd faculty_4 faculty_7 <0.0001 <0.0001 <0.0001 <0.0001 Table 27 Step 5 of the 
AUC procedure Variables: agregate p-value : <0.0001 AUC : 0.688 Variables: agregate p-value : <0.0001 AUC : 0.688 Variables: agregate p-value : <0.0001 AUC : 0.688 Variables: agregate p-value : <0.0001 AUC : 0.690* Variables: agregate p-value : <0.0001 AUC : 0.690* brace finaidd faculty_4 faculty_7 faculty_5 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 brace finaidd faculty_4 faculty_7 faculty_3 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 brace finaidd faculty_4 faculty_7 english <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 brace finaidd faculty_4 faculty_7 faculty_2 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 brace finaidd faculty_4 faculty_7 genderr <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 Table 28 Step 6 of the AUC procedure 93 Variables: agregate p-value : <0.0001 AUC : 0.691 Variables: agregate p-value : <0.0001 AUC : 0.692 Variables: agregate p-value : <0.0001 AUC : 0.692 Variables: agregate p-value : <0.0001 AUC : 0.693 Variables: agregate p-value : <0.0001 AUC : 0.694* brace finaidd faculty_4 faculty_7 genderr faculty_5 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 0.0065 brace finaidd faculty_4 faculty_7 genderr faculty_3 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 0.0002 brace finaidd faculty_4 faculty_7 genderr english <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 0.0003 brace finaidd faculty_4 faculty_7 genderr faculty_2 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 brace finaidd faculty_4 faculty_7 genderr faculty_6 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 Table 29 Step 7 of the AUC procedure Variables: agregate p-value : <0.0001 AUC : 0.695* Variables: agregate p-value : <0.0001 AUC : 0.695* brace finaidd faculty_4 faculty_7 genderr faculty_6 english <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 0.0010 brace finaidd faculty_4 faculty_7 genderr faculty_6 faculty_2 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 Table 30 Step 8 of the AUC procedure Variables: agregate p-value : <0.0001 AUC : 0.696 Variables: agregate p-value : <0.0001 AUC : 0.697* brace finaidd 
faculty_4 faculty_7 genderr faculty_6 faculty_2 english <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 brace finaidd faculty_4 faculty_7 genderr faculty_6 faculty_2 faculty_3 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 0.0058 Table 31 Step 9 of the AUC procedure Variables : agregate brace finaidd p-value : <0.0001 <0.0001 <0.0001 AUC : 0.698 Variables : agregate brace finaidd p-value : <0.0001 <0.0001 <0.0001 AUC : 0.701* Variables : agregate brace finaidd p-value : <0.0001 <0.0001 <0.0001 AUC : 0.701* faculty_4 faculty_7 genderr faculty_6 faculty_2 faculty_3 english <0.0001 <0.0001 <0.0001 0.0091 <0.0001 <0.0001 0.0002 faculty_4 faculty_7 genderr faculty_6 faculty_2 faculty_3 faculty_5 <0.0001 <0.0001 <0.0001 0.6888 <0.0001 <0.0001 0.0001 faculty_4 faculty_7 genderr faculty_2 faculty_3 faculty_5 <0.0001 <0.0001 <0.0001 < 0.0001 <0.0001 0.0001 Table 32 Step 10 of the AUC procedure 94 Variables: agregate brace finaidd faculty_4 faculty_7 genderr faculty_2 faculty_3 faculty_5 english p-value : <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 0.0002 AUC : 0.702 * Table 33 Step 11 of the AUC procedure Variables: agregate brace finaidd faculty_4 faculty_7 genderr faculty_2 faculty_3 faculty_5 english campuss p-value : <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 < 0.0001 0.0038 AUC : 0.703* Table 34 Step 12 of the AUC procedure Variables: agregate brace finaidd faculty_4 faculty_7 genderr faculty_2 faculty_3 faculty_5 english campuss age p-value : <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 0.0090 <0.0001 0.0864 AUC : 0.703* Table 35 Step 13 of the AUC procedure Variables: agregate brace finaidd faculty_4 faculty_7 genderr faculty_2 faculty_3 faculty_5 english campuss maritall p-value : <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 0.0005 <0.0060 0.1584 AUC : 0.703* Table 36 Step 14 of the AUC procedure 95 APPENDIX 7 SAS PROGRAMME data 
jimmy;
    set sasuser.osiame;
proc freq order=freq;
    tables faculty race campuss english genderr maritall finaidd;
run;

data jimmy2;
    set sasuser.osiame;
proc univariate;
    var age agregate;
    title;
run;

* Create the binary outcome and the dummy variables for faculty and race;
data joseph1;
    set sasuser.osiame;
    if score = '1' then pass = 1;
    else if score = '2' then pass = 0;
    if faculty = '2' then faculty_2 = 1; else faculty_2 = 0;
    if faculty = '3' then faculty_3 = 1; else faculty_3 = 0;
    if faculty = '4' then faculty_4 = 1; else faculty_4 = 0;
    if faculty = '5' then faculty_5 = 1; else faculty_5 = 0;
    if faculty = '6' then faculty_6 = 1; else faculty_6 = 0;
    if faculty = '7' then faculty_7 = 1; else faculty_7 = 0;
    if race > 3 then brace = 1; else brace = 0;
    keep faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 faculty_7 pass age
         agregate campuss genderr maritall finaidd english brace;
run;

* Univariate logistic regressions, one covariate at a time;
proc logistic descending;
    class campuss maritall finaidd genderr english;
    model pass = age;
    units age = 5 -5;
run;
proc logistic descending;
    class campuss maritall finaidd genderr english;
    model pass = agregate;
    units agregate = 100 -100;
run;
proc logistic descending;
    class campuss maritall finaidd genderr english;
    model pass = campuss;
run;
proc logistic descending;
    class campuss maritall finaidd genderr english;
    model pass = maritall;
run;
proc logistic descending;
    class campuss maritall finaidd genderr english;
    model pass = finaidd;
run;
proc logistic descending;
    class campuss maritall finaidd genderr english;
    model pass = genderr;
run;
proc logistic descending;
    class campuss maritall finaidd genderr english;
    model pass = english;
run;
proc logistic descending;
    class campuss maritall finaidd genderr english;
    model pass = faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 faculty_7;
run;

* The model without the variable faculty_6, which was insignificant in univariate logistic regression;
proc logistic descending;
    class campuss maritall finaidd genderr english;
    model pass = age agregate campuss maritall finaidd genderr english
                 faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 brace;
    units age = 5 -5 agregate = 100 -100;
run;

* The model without the variables faculty_6 and maritall;
proc logistic descending;
    class campuss maritall finaidd genderr english;
    model pass = age agregate campuss finaidd genderr english
                 faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 brace;
    units age = 5 -5 agregate = 100 -100;
run;

* The model without the variables faculty_6, maritall and age;
proc logistic descending;
    class campuss maritall finaidd genderr english;
    model pass = agregate campuss finaidd genderr english
                 faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 brace;
    units agregate = 100 -100;
run;

* Variable faculty_6 re-enters the model;
proc logistic descending;
    class campuss maritall finaidd genderr english;
    model pass = agregate campuss finaidd genderr english
                 faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 brace faculty_6;
    units agregate = 100 -100;
run;

* Variable maritall re-enters the model;
proc logistic descending;
    class campuss maritall finaidd genderr english;
    model pass = agregate campuss finaidd genderr english
                 faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 brace maritall;
    units agregate = 100 -100;
run;

* The variable age re-enters the model;
proc logistic descending;
    class campuss maritall finaidd genderr english;
    model pass = agregate campuss finaidd genderr english
                 faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 brace faculty_6 age;
    units age = 5 -5 agregate = 100 -100;
run;

* Variables that give the preliminary main effects model;
proc logistic descending;
    class campuss maritall finaidd genderr english;
    model pass = agregate campuss finaidd genderr english
                 faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 brace;
    units agregate = 100 -100;
run;

* Examining the scale of the continuous covariate;
* The variable agregate is analysed using quartiles;
data joseph3;
    set joseph1;
    if 720 <= agregate <= 835 then agregroup = 1;
    else if 835 < agregate <= 1075 then agregroup = 2;
    else if 1075 < agregate <= 1200 then agregroup = 3;
    else if agregate > 1200 then agregroup = 4;
    * agregroup is numeric, so it is compared with numeric constants;
    if agregroup = 2 then agre_2 = 1; else agre_2 = 0;
    if agregroup = 3 then agre_3 = 1; else agre_3 = 0;
    if agregroup = 4 then agre_4 = 1; else agre_4 = 0;
run;
proc logistic descending;
    class campuss maritall finaidd genderr english;
    model pass = agre_2 agre_3 agre_4 campuss finaidd genderr english
                 faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 brace;
run;

* Plot the estimated quartile coefficients against the group midpoints;
data midpoints;
    input agreg coef;
    cards;
775.5 0
955 .2898
1137.5 1.0672
1680 .9989
;
run;
goptions reset=all;
symbol c=blue v=dot h=.8 i=j;
axis order=(0 to 1.5 by .2) label=(a=90 'logit');
proc gplot data=midpoints;
    plot coef*agreg / vaxis=axis;
run;
quit;

data joseph6;
    set joseph3;
proc chart;
    vbar agregate / midpoints=100 to 2200 by 100 group=pass;
run;

* Add the term agregate*log(agregate) to check linearity in the logit;
data scale;
    set joseph3;
    exlinex = agregate*log(agregate);
run;
proc logistic descending;
    class campuss maritall finaidd genderr english;
    model pass = agregate exlinex campuss finaidd genderr english
                 faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 brace;
run;

* The variable agregate is dichotomised;
data joseph4;
    set joseph3;
    if agregate >= 1075 then agregate_ = 1; else agregate_ = 0;
run;

* Fitting the dichotomous variable agregate_;
proc logistic descending;
    class english finaidd campuss genderr;
    model pass = english finaidd campuss genderr
                 faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 brace agregate_;
run;

* Create the candidate interaction terms;
data interaction;
    set joseph4;
    engagr  = english*agregate_;
    engfin  = english*finaidd;
    camfin  = campuss*finaidd;
    finfac2 = finaidd*faculty_2;
    finfac3 = finaidd*faculty_3;
    finfac4 = finaidd*faculty_4;
    finfac5 = finaidd*faculty_5;
    finfac7 = finaidd*faculty_7;
    racfin  = brace*finaidd;
    agrbrac = agregate_*brace;
    engbrac = english*brace;
run;

* Each interaction is added to the main effects model one at a time;
proc logistic data=interaction;
    class campuss genderr finaidd english;
    model pass = campuss genderr finaidd english faculty_2 faculty_3
                 faculty_4 faculty_5 faculty_7 brace agregate_ engagr;
run;
proc logistic data=interaction;
    class campuss genderr finaidd english;
    model pass = campuss genderr finaidd english faculty_2 faculty_3
                 faculty_4 faculty_5 faculty_7 brace agregate_ engfin;
run;
proc logistic data=interaction;
    class campuss genderr finaidd english;
    model pass = campuss genderr finaidd english faculty_2 faculty_3
                 faculty_4 faculty_5 faculty_7 brace agregate_ camfin;
run;
proc logistic data=interaction;
    class campuss genderr finaidd english;
    model pass = campuss genderr finaidd english faculty_2 faculty_3
                 faculty_4 faculty_5 faculty_7 brace agregate_ finfac2;
run;
proc logistic data=interaction;
    class campuss genderr finaidd english;
    model pass = campuss genderr finaidd english faculty_2 faculty_3
                 faculty_4 faculty_5 faculty_7 brace agregate_ finfac3;
run;
proc logistic data=interaction;
    class campuss genderr finaidd english;
    model pass = campuss genderr finaidd english faculty_2 faculty_3
                 faculty_4 faculty_5 faculty_7 brace agregate_ finfac4;
run;
proc logistic data=interaction;
    class campuss genderr finaidd english;
    model pass = campuss genderr finaidd english faculty_2 faculty_3
                 faculty_4 faculty_5 faculty_7 brace agregate_ finfac7;
run;
proc logistic data=interaction;
    class campuss genderr finaidd english;
    model pass = campuss genderr finaidd english faculty_2 faculty_3
                 faculty_4 faculty_5 faculty_7 brace agregate_ finfac5;
run;
proc logistic data=interaction;
    class campuss genderr finaidd english;
    model pass = campuss genderr finaidd english faculty_2 faculty_3
                 faculty_4 faculty_5 faculty_7 brace agregate_ racfin;
run;
proc logistic data=interaction;
    class campuss genderr finaidd english;
    model pass = campuss genderr finaidd english faculty_2 faculty_3
                 faculty_4 faculty_5 faculty_7 brace agregate_ agrbrac;
run;
proc logistic data=interaction;
    class campuss genderr finaidd english;
    model pass = campuss genderr finaidd english faculty_2 faculty_3
                 faculty_4 faculty_5 faculty_7 brace agregate_ engbrac;
run;

* Main effects with the retained interactions, then successive removals;
proc logistic data=interaction descending;
    class campuss genderr finaidd english;
    model pass = campuss genderr finaidd english faculty_2 faculty_3
                 faculty_4 faculty_5 faculty_7 brace agregate_
                 engagr engfin finfac4 finfac2 finfac5 racfin agrbrac engbrac;
run;
proc logistic data=interaction;
    class campuss genderr finaidd english;
    model pass = campuss genderr finaidd english faculty_2 faculty_3
                 faculty_4 faculty_5 faculty_7 brace agregate_
                 engagr engfin finfac4 finfac2 finfac5 racfin agrbrac engbrac;
run;
proc logistic data=interaction;
    class campuss genderr finaidd english;
    model pass = campuss genderr finaidd english faculty_2 faculty_3
                 faculty_4 faculty_5 faculty_7 brace agregate_
                 engagr engfin finfac4 finfac2 finfac5 agrbrac engbrac;
run;
proc logistic data=interaction;
    class campuss genderr finaidd english;
    model pass = campuss genderr finaidd english faculty_2 faculty_3
                 faculty_4 faculty_5 faculty_7 brace agregate_
                 engfin finfac4 finfac2 finfac5 agrbrac engbrac;
run;

* Predicted probabilities and the classification table at a cut-off of 0.5;
proc logistic data=interaction descending noprint;
    class campuss genderr finaidd english;
    model pass = campuss genderr finaidd english faculty_2 faculty_3
                 faculty_4 faculty_5 faculty_7 brace agregate_
                 finfac4 finfac2 finfac5 agrbrac engbrac;
    output out=probability predicted=phat;
run;
data probability1;
    set probability;
    predicts = (phat >= .5);
run;
proc freq data=probability1;
    tables pass*predicts / norow nocol nopercent;
run;

* ROC curve for the final model;
proc logistic data=interaction descending;
    class campuss genderr finaidd english;
    model pass = campuss genderr finaidd english faculty_2 faculty_3
                 faculty_4 faculty_5 faculty_7 brace agregate_
                 finfac4 finfac2 finfac5 agrbrac engbrac / outroc=rocl;
run;
goptions cback=white colors=(blue) border;
axis1 length=2.5in;
axis2 order=(0 to 1 by .1) length=2.5in;
proc gplot data=rocl;
    symbol1 i=join v=none;
    title 'First Year TUT Students Success ROC Curve';
    plot _sensit_*_1mspec_ / haxis=axis1 vaxis=axis2;
run;
quit;

* Forward selection procedure;
data foward;
    set joseph1;
proc logistic descending;
    class campuss genderr finaidd english;
    model pass = faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 faculty_7
                 age agregate campuss genderr maritall finaidd english brace
                 / selection=forward slentry=.25 details;
    units age = 5 -5 agregate = 100 -100;
run;

* Backward selection procedure;
proc logistic descending;
    class campuss genderr finaidd english;
    model pass = faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 faculty_7
                 age agregate campuss genderr maritall finaidd english brace
                 / selection=backward details slstay=.05;
    units age = 5 -5 agregate = 100 -100;
run;

* Stepwise selection procedure;
proc logistic descending;
    class campuss genderr finaidd english;
    model pass = faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 faculty_7
                 agregate campuss genderr maritall age finaidd english brace
                 / selection=stepwise slentry=.25;
    units agregate = 100 -100;
run;

* Stepwise selection procedure used to select interactions;
* data=interaction is specified because the interaction terms exist only in that data set;
* include=11 forces the first 11 (main effect) variables into every model;
proc logistic data=interaction descending;
    class campuss genderr finaidd english;
    model pass = faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 agregate_
                 campuss genderr finaidd english brace
                 engagr engfin camfin finfac2 finfac3 finfac4 finfac5 finfac7
                 racfin agrbrac engbrac
                 / selection=stepwise slentry=.25 include=11;
run;

* Best subset selection procedure using the score criterion;
proc logistic descending;
    class campuss genderr finaidd english;
    model pass = faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 age agregate
                 campuss genderr maritall finaidd english faculty_7 brace
                 / selection=score best=2;
    units age = 5 -5 agregate = 100 -100;
run;

* Best subset procedure using the Cp criterion;
proc logistic descending;
    class campuss genderr finaidd english;
    model pass = faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 age agregate
                 campuss genderr maritall finaidd english faculty_7 brace;
    output out=best2 prob=pihat;
run;
* Pseudo-response z and weight v for the weighted least squares step;
data best3;
    set best2;
    z = log(pihat/(1-pihat)) + ((pass-pihat)/(pihat*(1-pihat)));
    v = pihat*(1-pihat);
run;
proc reg;
    model z = faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 faculty_7 age
              agregate campuss genderr maritall finaidd english brace
              / selection=cp best=3;
    weight v;
run;
quit;

References

Beale, E M L
(1970). Note on procedures for variable selection in multiple regression. Technometrics, 12, 909-14.
Bergerud, W A (1996). Introduction to Regression Models: With worked forestry examples. Biom. Inf. Hand. Res. Br., B.C. Min. For., Victoria, B.C. Work. Pap. 26/1996.
Cody, R P and Smith, J K (1997). Applied Statistics and the SAS Programming Language. London. Prentice Hall.
Cook, E D (2001). Solutions Manual to Accompany Applied Logistic Regression 2nd Edition by Hosmer, D W and Lemeshow, S.
Czepiel, S. http://www.czep.net/contact.html.
Dallal, G E (2001). Logistic regression. http://www.tufts.edu/~gdallal/logistic.htm
Delwiche, D L and Slaughter, S J (1995). A Primer. Cary, NC: SAS Institute Inc.
Draper, N R and Smith, H (1981). Applied Regression Analysis, Second Edition. New York. Wiley.
Ertaş, G. Evaluation of Diagnostic Test Accuracy by Receiver Operation Characteristic (ROC) Analysis. Boğaziçi University, Biomedical Engineering Institute, 80815, Bebek, İstanbul, e-mail: [email protected]
George, E I (2000). The variable selection problem. Journal of the American Statistical Association, vol 95, No 452, Vignettes.
Gorman, J W and Toman, R J (1966). Selection of variables for fitting equations to data. Technometrics, 12, No. 1.
Guyon, I and Elisseeff, A (2002). Special Issue on Variable and Feature Selection. Journal of Machine Learning Research.
Hanley, J A and McNeil, B J (1982). The meaning and use of the Area under a Receiver Operating Characteristic (ROC) curve. Radiology, 143, 29-36.
Hocking, R R and Leslie, R N (1967). Selection of best subset in Regression Analysis. Technometrics, 9, 531-540.
Hocking, R R (1972). Criteria for Selection of a subset Regression: Which one should be used? Technometrics, 14, No. 4.
Hocking, R R (1976). The Analysis and Selection of Variables in linear regression. Biometrics, 32, 1-49.
Hosmer, D W and Lemeshow, S (1989). Applied Logistic Regression. New York. Wiley and Sons.
Hosmer, D W and Lemeshow, S (1998). Applied Survival Analysis: Regression Modeling of Time to Event Data. New York. Wiley and Sons.
Hosmer, D W and Lemeshow, S (2002). Applied Logistic Regression 2nd Edition. New York. Wiley and Sons.
[email protected] (2001). The magnificent ROC. Google's cache of http://www.anaestethetist.com/mnm/stats/roc/
Joubert, G (1994). Variable Selection in Logistic Regression, with Special Application to Medical data.
Karp, A H. Using logistic regression to predict customer retention. http://www.Sierrainformation.com
Larsen, P V (2001). Module 14: Logistic regression. http://www.statmaster.sdu.dk/courses/st111/module14/.
Mallows, C L (1973). More comments on Cp. Technometrics, 15, 661-676.
Mantel, N (1970). Why stepwise selection in multiple regression. Technometrics, 12, 621-25.
Marzban, C (2004). A comment on the ROC Curve measures. http://www.nhn.ou.edu/marzban.
Marriott, J M and Pettitt, A N (1997). Graphical Techniques for selecting explanatory variables for the time series data. Journal of Applied Statistics, 46, 253-264.
McClish, D K (1989). Analysing a portion of the ROC curve. Medical Decision Making, 9, 190-195.
McCullagh, P and Nelder, J A (1989). Applied Regression Analysis. New York. Wiley and Sons.
Menard, S (2001). Applied Logistic Regression Analysis. Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-106. Thousand Oaks, CA: Sage.
Metz, C E, Herman, B A and Shen, J (1998). Maximum likelihood estimation of ROC from continuously distributed data. Statistics in Medicine, 17, 1033-1053.
Miller, A J (1984). Selection of subsets and regression variables. Journal of the Royal Statistical Society, A, 147, 389-425.
Miller, A J (1990). Subset Selection in Regression. London. Chapman and Hall.
Morrison, Ann Michelle (2005). Receiver Operating Characteristic (ROC) curve Analysis of Antecedent Rainfall and the Alewife/Mystic River Receiving water. Water Resource Authority, Report ENQUAD 2005-23. 26 p.
Nargundkar, S and Priestly, J L (2003). Assessment of Evaluation Methods for Prediction and Classification of Consumer Risk in the Credit Industry. Federal Reserve System Report. http://www.federalreserve.gov/rnd.htm.
Pepe, M S (1997). A regression modelling framework for receiver operating characteristic curves in the medical diagnostic testing. Biometrika, 84/3, 595-608.
Raftery et al. Statistics in the 21st century, Monographs on Statistics and Applied 93, 60- . London. Chapman and Hall/CRC.
Tosteson, A and Begg, C B (1988). A General Regression Methodology for ROC Curve Estimation. Medical Decision Making, 8, 204-15.
Thomson, M L (1978). Selection of Variables in multiple regression: part II. Chosen Procedures, Computations and Examples. Internal Statistics Review, 46, 129-146.
Tibshirani, R (1997). The Lasso Method for variable selection in the Cox model. Statistics in Medicine, 16, 385-395.
Walters, S J (2001). What is a Cox Model? http://www.evidence-ased.medicine.co.uk.
Zou, H and Hastie, T (2005). Regularisation and Variable Selection via elastic net. Journal of the Royal Statistics Society, 67, 301-320.
Zweig, M H and Campell, G (1993). Receiver-Operating Characteristic (ROC). A Fundamental Evaluation Tool in Clinical Medicine. Clinical Chemistry, 39/4, 561-577.
