# UNIVERSITY OF PRETORIA
THE VARIABLE SELECTION PROBLEM AND
THE APPLICATION OF THE ROC CURVE FOR
BINARY OUTCOME VARIABLES
James M Matshego
Prepared in partial fulfilment of the requirements for the degree of
Master of Science
in
Applied Statistics
Supervisor:
Prof H T Groeneveld
External Examiner: Prof A J van der Merwe (U. O. F)
2006
DECLARATION
I, James Moeng Matshego, hereby testify that the work presented in this study is my own original
work and that all the resources used have been indicated and reflected by means of complete
references. I further hereby declare that the dissertation that I hereby submit for the degree in
Applied Statistics at the University of Pretoria has not previously been submitted by me for degree
purposes at any other university.
Signed.........................................
Acknowledgements
I sincerely thank
o My supervisor Prof H T Groeneveld for his encouragement, guidance and support.
o Department of Statistics for having been patient with me.
o Mr E Sibanda from Research and Development at TUT for availing the data set
used in this study.
o My family for having afforded me time to study.
o Prof A J van der Merwe for his valuable comments and advice.
LIST OF TABLES
ABSTRACT
CHAPTER 1
ORIENTATION
1.1 INTRODUCTION
1.2 VARIABLE SELECTION
1.3 SCOPE OF THIS WORK
CHAPTER 2
SELECTION PROCEDURES FOR CONTINUOUS OUTCOME VARIABLES
2.1 VARIABLE SELECTION IN LINEAR REGRESSION
2.1.1 Forward selection
2.1.2 Backward selection
2.1.3 Conventional Stepwise selection or Efroymson’s Algorithm
2.1.3.2 Criterion for deletion
2.1.3.3 Convergence of the Algorithm
2.1.4 Press
2.1.5 Principal component regression
2.1.6 Latent root regression
2.1.7 Branch-and-bound Technique
2.1.8 Variable selection via elastic net
2.1.8.1 Naïve elastic net
2.1.8.2 Elastic net
2.1.9 Generating all subsets
2.2 VARIABLE SELECTION IN THE COX REGRESSION MODEL
2.2.1 Purposeful selection of variables
2.2.2 The Lasso Method: Tibshirani (1997)
2.2.3 Variable selection for time series data
2.4 HYPOTHESIS TESTING
2.4.1 Lack-of-fit test
2.4.2 The coefficient of determination, $R^2$
2.5 COMPARISON OF MODELS: SOLUTION CRITERIA
2.5.1 Akaike’s information criterion (AIC) and the Bayes information criterion (BIC)
2.5.2 $C_p$-Statistics ($C_r$-Criterion)
2.5.3 The $S_p$-Statistics ($S_r$-Criterion)
2.5.4 RMS, $R^2$ and adjusted $R^2$-Statistics
2.5.4.1 The Residual Mean Square
2.5.4.2 The Squared Multiple Correlation Coefficients (SMCC)
2.5.4.3 The adjusted $R^2$ or Fisher’s A-Statistics
CHAPTER 3
THE LOGISTIC MODEL AND VARIABLE SELECTION FOR A BINARY OUTCOME VARIABLE
3.1 BINARY DATA
3.2 LOGISTIC REGRESSION
3.2.1 Assumptions
3.2.2 The multiple linear logistic regression model
3.3 PARAMETER ESTIMATION
3.3.1 Maximum likelihood estimation
3.3.2 The Newton-Raphson method
3.4 ODDS AND ODDS RATIOS
3.5 INTERPRETATION OF COEFFICIENTS
3.5.1 Dichotomous predictor variables
3.5.2 Polytomous predictor variables
3.5.3 One continuous predictor variable
3.5.4 Multivariable case
3.5.5 One dichotomous and one continuous and their interaction
3.6 TESTING FOR THE SIGNIFICANCE OF THE MODEL
3.6.1 The likelihood ratio test
3.6.2 Wald Test Statistics
3.6.3 Using deviations to compare likelihoods
3.7 INTERACTION AND CONFOUNDING
3.8 VARIABLE SELECTION IN LOGISTIC REGRESSION
3.8.1 Purposeful selection of variables
3.8.1.1 Screening of variables
3.8.1.2 Scale of continuous variables
3.8.1.3 Inclusion of interactions
3.8.2 Stepwise forward selection
3.8.3 Stepwise backward selection
3.8.4 Stepwise selection (forward and backward)
3.8.5 Best subset selection
3.8.6 General
CHAPTER 4
THE RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE
4.1 BACKGROUND
4.2 DEFINITION OF ROC CURVE
4.3 DIAGNOSTIC TEST INTERPRETATION
4.3.1 2X2 Table or Contingency Matrix
4.3.2 Basic concepts
4.3.2.1 Sensitivity
4.3.2.2 Specificity
4.3.2.3 Pre-test probability
4.3.2.4 Predictive value of a positive test
4.3.2.5 Predictive value of a negative test
4.4 ROC REGRESSION MODEL
4.5 AREA UNDER THE ROC CURVE (AUC)
4.5.1 Interpretation of the area
4.5.2 Comparison of tests
CHAPTER 5
MODEL BUILDING USING REAL DATA
5.1 PURPOSEFUL SELECTION OF VARIABLES PROCEDURE
5.2 OTHER LOGISTIC REGRESSION PROCEDURES
5.3 INVESTIGATION OF THE AUC AS A SELECTION TOOL
5.4 THE AUC AND THE STEPWISE SELECTION PROCEDURES
CHAPTER 6
DISCUSSION AND CONCLUSION
APPENDICES
REFERENCE
LIST OF TABLES
Table 3.1: An example of coding of a design variable Race coded at three levels
Table 3.2: An example showing coefficients that will be obtained when fitting the model using design variables in Table 3.1
Table 4.1: An example of a contingency table
Table 5.1: Codes of variables used in the data set for study of factors associated with success of first year students at TUT from 1999 to 2002
Table 5.2: Indicator variables for the variable Faculty
Table 5.3: Univariable logistic regression models
Table 5.4: Multivariable model containing variables identified in the univariable analysis
Table 5.5: Results of quartile analysis of the variable Agregate from the multivariable model containing variables shown in Table 5.6
Table 5.6: Preliminary main effects model
Table 5.7: Multivariable model with dichotomous variable Agregate_
Table 5.8: A model containing interactions which were significant when added one by one to the main effects model
Table 5.9: Final model with interactions
Table 5.10: Contingency matrix for a model in Table 5.9
Table 5.11: Odds ratios and association of predicted probabilities and observed responses for the final model
Table 5.12: Contingency matrix for the model in Table 34
Table 5.13: Comparison of the AUC and the Stepwise procedures
Table 14: Univariable Analysis of the variable Age
Table 15: Univariable Analysis of the variable Agregate
Table 16: Analysis of categorical variables
Table 17: The results of Forward selection procedure
Table 18: The results of the Backward selection procedure
Table 19: The results of the Stepwise selection procedure
Table 20: The results of the Stepwise procedure with interactions included
Table 21: The results of Best subset selection procedure using Score criterion
Table 22: The results of Best subset selection procedure using $C_p$ criterion
Table 23: The results of step 1 of the AUC procedure
Table 24: The results of step 2 of the AUC procedure
Table 25: The results of step 3 of the AUC procedure
Table 26: The results of step 4 of the AUC procedure
Table 27: The results of step 5 of the AUC procedure
Table 28: The results of step 6 of the AUC procedure
Table 29: The results of step 7 of the AUC procedure
Table 30: The results of step 8 of the AUC procedure
Table 31: The results of step 9 of the AUC procedure
Table 32: The results of step 10 of the AUC procedure
Table 33: The results of step 11 of the AUC procedure
Table 34: The results of step 12 of the AUC procedure
Table 35: The results of step 13 of the AUC procedure
Table 33: The results of step 14 of the AUC procedure
ABSTRACT
Variable selection refers to the problem of selecting input variables that are most predictive of a
given outcome. Variable selection problems are found in all machine learning tasks, supervised or
unsupervised, classification, regression, time series prediction, two-class or multi-class, posing
various levels of challenges.
Variable selection problems are related to the problems of input dimensionality reduction and of
parameter planning. They have practical and theoretical challenges of their own. From the practical
point of view, eliminating variables may reduce the cost of producing the outcome and increase its
speed, while space dimensionality reduction does not address these problems. Theoretical challenges
include estimating with what confidence one can state that a variable is relevant to the concept when
it is useful to the outcome, and providing a theoretical understanding of the stability of selected
variable subsets. As the probability cut-point increases in value, it becomes more likely that
an observation is classified as a non-event by the selected variables.
The mathematical statement of the problem is not widely agreed upon and may depend on the
application. One typically distinguishes:
i) The problem of discovering all the variables relevant to the outcome variable and
determining how relevant they are and how they are related to each other.
ii) The problem of finding a minimum subset of variables that is useful to the outcome
variable.
Logistic regression is an increasingly popular statistical technique used to model the probability of
a discrete binary outcome. Logistic regression applies maximum likelihood estimation after
transforming the outcome variable into a logit variable. In this way, logistic regression estimates
the probability of a certain event. When properly applied, logistic regression analyses yield
powerful insight into which variables are more or less likely to predict the event outcome in a
population of interest. These models also show the extent to which changes in the values of the
variables may increase or decrease the predicted probability of the event outcome. Variable selection,
in all its facets, is similarly important with logistic regression.
The receiver operating characteristic (ROC) curve is a graphic display that gives a measure of the
predictive accuracy of a logistic regression model. It is a measure of classification performance;
the area under the ROC curve (AUC) is a scalar measure gauging one facet of performance.
Another measure of predictive accuracy of a logistic regression model is a classification table. It
uses the model to classify observations as events if their estimated probability is greater than or
equal to a given probability cut-point; otherwise they are classified as non-events. This
technique, as it appears in the literature, is also studied in this thesis.
In this thesis the issue of variable selection, both for continuous and binary outcome variables, is
investigated as it appears in the statistical literature. It is clear that this topic has been widely
researched and still remains a feature of modern research. The last word certainly hasn’t been
spoken.
CHAPTER 1
ORIENTATION
1.1 INTRODUCTION
The problem of variable selection is one of the most pervasive problems in statistical modelling. As
stated by Guyon and Elisseeff (2002), variable selection problems are found in all machine
learning tasks (supervised or unsupervised, classification, regression, time series prediction) and
pose challenges of various levels. Owing to the current availability of high-speed computers, this
problem has received enormous attention in the recent statistical literature. Often referred to as the
problem of subset selection, it arises when one wants to model the relationship between a variable of
interest and a subset of potential explanatory variables or predictors, but there is uncertainty about
which subset to use. A common situation is that in which the explanatory or predictor variables, which
will be denoted by $X$ ($n \times p$), measured at one time, can be used to predict a variable of
interest or response variable, denoted by $Y$ ($n \times 1$), at some future time. Unless the true form
of the relationship between the $X$ and $Y$ variables is known, it will be necessary to use the data to
select the variables and to calibrate the relationship so that it is representative of the conditions in
which the relationship will be used for prediction.
In prediction we are usually looking for a small subset of variables which gives adequate
prediction accuracy for a reasonable cost of measurement. On the other hand, in trying to
understand the effect of one variable on another, particularly when the only data available are
observational or survey data rather than experimental data, it may be desirable to include many
potential variables which are either known or believed to have an effect (Miller (1990)).
The problem of selecting a subset of predictor variables is usually described in an idealised setting.
That is, it assumes that (a) all predictors are available for inclusion in or exclusion from the model,
though this is not always the situation in practice (in many cases, the original set of measured
variables will be augmented with other variables constructed from them, such as the product of two
variables), and (b) a ‘good’ data set is available on which to base the conclusions. If these
assumptions do not hold, a detailed subset selection analysis may be a futile exercise.
The rationale for minimizing the number of variables in the model is that the resultant model is
more likely to be numerically stable, and is more easily generalised. The more variables included
in a model, the greater the estimated standard errors become, and the more dependent the model
becomes on the observed data (Hosmer and Lemeshow (1989)).
1.2 VARIABLE SELECTION
It will be assumed that there are $n \geq p + 1$ observations on a matrix of predictor variables,
$X = (x_1, \ldots, x_p)$, and a scalar response, $y$, such that the $j$th response, $j = 1, \ldots, n$,
is determined by
$$y_j = \beta_0 + \sum_{i=1}^{p} \beta_i x_{ij} + \xi_j \qquad (1.1)$$
The residuals, $\xi_j$, are assumed identically and independently distributed, usually normal, with mean
zero and unknown variance, $\sigma^2$. (The predictors, $x_{ij}$, are frequently taken to be specified
design variables, but in many cases it is more appropriate to consider them as random variables and
assume a joint distribution on $y$ and $x$, say, multivariate normal.) Implicit in these assumptions is
the assumption that the variables $x_1, \ldots, x_p$ include all relevant variables, though extraneous
variables may be included.
The model (1.1) is frequently expressed in matrix notation as
$$Y = X\beta + \varepsilon \qquad (1.2)$$
where $Y$ is the $n$-vector of observed responses, $X$ is the design matrix of dimension
$n \times (p + 1)$ as defined by (1.2), assumed to have rank $p + 1$, and $\beta$ is the
$(p + 1)$-vector of unknown regression coefficients.
The variable selection problem is most familiar in the linear regression context, where attention is
restricted to normal linear models. Letting $\gamma$ index the subsets of $x_1, \ldots, x_p$ and letting
$q_\gamma$ be the size of the $\gamma$th subset, the problem is to select and fit a model of the form
$$Y = X_\gamma \beta_\gamma + \varepsilon \qquad (1.3)$$
where $X_\gamma$ is an $n \times q_\gamma$ matrix whose columns correspond to the $\gamma$th subset,
$\beta_\gamma$ is a $q_\gamma \times 1$ vector of regression coefficients and
$\varepsilon \sim N(0, \sigma^2 I)$. More generally, the variable selection problem is a
special case of the model selection problem, where each model under consideration corresponds to a
distinct subset of $x_1, \ldots, x_p$.
The fundamental developments in variable selection seem to have occurred either directly in the
context of the linear model (1.3) or in the context of general model selection frameworks. Historically,
the focus began with the linear model in the 1960s when the first wave of important developments
occurred and computing was expensive (George (2000)).
1.3 SCOPE OF THIS WORK
This manuscript consists of six chapters. In Chapter 2, methods and procedures for selecting
variables in respect of continuous outcome variables for different regressions are described. In
addition, statistics for comparison of models are discussed. Chapter 3 introduces and defines the
logistic regression model, a model for a binary outcome variable. Various selection procedures for
this model are also discussed. The Receiver Operating Characteristic (ROC) curve, a curve
representing a diagnostic test with binary outcome, is presented in Chapter 4. Chapter 5 covers a
model building exercise. All selection procedures discussed with regard to a binary outcome
variable are applied to an available data set. We also look into the possibility of using the area
under the ROC curve as a variable selection criterion by doing a test with the same data set used
for other procedures. Chapter 6 wraps up this study with Discussions and Conclusions.
CHAPTER 2
SELECTION PROCEDURES FOR CONTINUOUS OUTCOME VARIABLES
This chapter will look at the problem of finding one or more subsets of variables which give
models that fit a set of data fairly well. However, there is no unique statistical procedure or
technique for selecting the best regression equation. If there are $p$ potential independent variables,
there are $2^p$ possible equations to be considered.
According to Miller (1984), reasons for using only some of the variables or possible predictor
variables include:
I. to estimate or predict at lower cost by reducing the number of variables on which data are to be collected.
II. to predict accurately by eliminating uninformative variables.
III. to describe a multivariate data set parsimoniously.
IV. to estimate regression coefficients with small standard errors (particularly when some of
the predictors are highly correlated with others).
2.1 VARIABLE SELECTION IN LINEAR REGRESSION
In linear regression an F-test is used since errors are assumed to be normally distributed (Hosmer
and Lemeshow (1989)).
2.1.1 Forward Selection
Hocking (1976) suggests a technique that starts with no variables in the equation and adds one
variable at a time until either all variables are in or a stopping criterion is satisfied. The
variable considered for inclusion at any step is the one yielding the largest single degree of
freedom (d.f.) F-ratio among those eligible for inclusion. That is, variable $i$ is added to the
$r$-term equation if
$$F_i = \max_i \left( \frac{RSS_r - RSS_{r+i}}{\hat{\sigma}^2_{r+i}} \right) > F_{in},$$
where $RSS_r$ and $RSS_{r+i}$ are the residual sums of squares for the $r$-term and $(r + i)$-term
models, and $r$ is the number of terms which are retained in the final equation. The subscript
$(r + i)$ refers to quantities computed when the variable $i$ is adjoined to the current $r$-term
equation.
Beale (1970) describes a method that requires the least amount of computation. In this method, all
results are obtained as a by-product of solving the problem with all variables selected: if there are
$p$ regression variables, the covariance matrix of these variables is inverted by pivoting on each of
the $p$ diagonal elements in turn, and after each pivot step the results for the regression on those
variables for which the corresponding diagonal elements have already been chosen as pivots can
be read off. This method assumes that there are no linear dependencies among the independent
variables: if a diagonal element is less than some tolerance times its original value, pivoting on it
is not done, where the tolerance is normally $10^{-3}$ in single-precision code or $10^{-7}$ in
double-precision code.
Theoretically this approach has a weakness when independent variables are correlated. Two (or
more) variables may be individually useless but may together give a very good fit.
Draper and Smith (1981) use the partial correlation coefficient as a measure of the importance of
variables not yet in the equation. Assume that $Z_1, Z_2, \ldots, Z_k$, all functions of one or more of
the X’s, represent the complete set of variables from which the equation is to be chosen, and that this
set includes any functions, such as squares, cross products, logarithms, inverses and powers,
thought to be desirable and necessary. The procedure starts by selecting the $Z$ most correlated
with $Y$. Suppose this $Z$ is $Z_1$; the first-order linear regression equation is found to be
$\hat{Y} = f(Z_1)$. We check the significance of the variable. If it is not significant, we quit and the
model $Y = \bar{Y}$ is adopted as best; otherwise we search for the second predictor variable to enter
the regression. The partial correlation coefficients with $Y$ of all predictors not in the regression
at this stage, namely $Z_j$, $j \neq 1$, are examined. In other words, $Y$ and $Z_j$ are both adjusted
for their straight line relationships with $Z_1$, and the correlation between these adjusted values is
calculated for all $j \neq 1$. The $Z_j$ with the highest partial correlation coefficient with $Y$ is
now selected, say it is $Z_2$. So the second regression equation $\hat{Y} = f(Z_1, Z_2)$ is fitted.
The overall regression is checked for significance, the improvement in the $R^2$ value is noted, and
the partial F-values for both variables now in the equation are examined. The smaller of these two
partial F’s is then compared with an appropriate F percentage point, and the corresponding predictor
variable is retained in the equation or rejected according to whether the test is significant or not.
The testing of the “least useful predictor currently in the equation” is done at every stage of the
procedure. Thus a predictor that may have been the best entry candidate at an earlier stage may, at a
later stage, become redundant as a result of the relationship between it and other variables now in
the regression. Such a variable will be removed from the model upon testing non-significant, and the
appropriate fitted regression equation is then computed for all the remaining variables still in the
model. Eventually, when no variables in the current equation can be removed and the next best
candidate variable cannot hold its place in the equation, the process stops. As each variable is
entered into the regression, its effect on $R^2$ is noted. However, the correct choice of the
α-levels is necessary to avoid a cycling effect.
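The adjustment step described above can be illustrated with a minimal Python sketch. It assumes NumPy is available; the function name `partial_correlation` and its arguments are hypothetical and are not taken from Draper and Smith. Both $Y$ and a candidate $Z_j$ are regressed on the variables already in the equation, and the ordinary correlation of the two residual vectors is returned.

```python
import numpy as np

def partial_correlation(y, z, given):
    """Correlation between y and z after both are adjusted for `given`.

    `given` holds the variable(s) already in the equation (1-D or 2-D array).
    Hypothetical helper for illustration only.
    """
    # Design matrix: intercept plus the variables already entered.
    A = np.column_stack([np.ones(len(y)), given])
    # Residuals of y and of z from their least squares fits on A.
    res_y = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    res_z = z - A @ np.linalg.lstsq(A, z, rcond=None)[0]
    # Ordinary correlation of the adjusted values (residuals have zero mean).
    return float(res_y @ res_z / np.sqrt((res_y @ res_y) * (res_z @ res_z)))
```

At each stage the candidate with the largest absolute value of this quantity would be the next one considered for entry.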
Miller (1990) suggests a method that finds a subset of $r < p$ variables
$X_{(1)}, X_{(2)}, \ldots, X_{(r)}$ from a set of variables $X_1, X_2, \ldots, X_p$ which minimises, or
gives a suitably small value for, the residual sum of squares. For a single variable $X_j$ this is
$$S = \sum_{i=1}^{n} (y_i - b_j x_{ij})^2 .$$
Since the value of $b_j$ is given by
$$b_j = \frac{\sum_{i=1}^{n} x_{ij} y_i}{\sum_{i=1}^{n} x_{ij}^2},$$
it follows that
$$S = \sum_{i=1}^{n} y_i^2 - \frac{\left( \sum_{i=1}^{n} x_{ij} y_i \right)^2}{\sum_{i=1}^{n} x_{ij}^2}. \qquad (2.1.1)$$
If we let the first variable be denoted by $X_{(1)}$, this variable is then forced into all further
subsets. The residuals $Y - X_{(1)} b_{(1)}$ are orthogonal to $X_{(1)}$, and to reduce the sum of
squares by adding further variables, the space orthogonal to $X_{(1)}$ must be searched. From each
variable $X_j$, other than the one already selected, we form
$X_{j.(1)} = X_j - b_{j.(1)} X_{(1)}$, where $b_{j.(1)}$ is the least squares regression coefficient of
$X_j$ upon $X_{(1)}$. The next variable selected is the one which minimises $S$ in (2.1.1), or
equivalently maximises its second term, when $Y$ is replaced with $Y - X_{(1)} b_{(1)}$ and $X_j$ is
replaced with $X_{j.(1)}$.
The variables $X_{(1)}, X_{(2)}, \ldots, X_{(r)}$ are progressively added to the prediction equation,
each variable being chosen because it minimises the residual sum of squares when added to those already
selected.
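As an illustration of forward selection, the following Python sketch adds, at each step, the variable that gives the smallest residual sum of squares and stops when the F-to-enter ratio no longer exceeds a threshold. It is a sketch under stated assumptions rather than a definitive implementation: NumPy is assumed, an intercept is always fitted, the names `rss` and `forward_selection` are hypothetical, and the default threshold `f_in=4.0` is an illustrative value only.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of y on an intercept plus the columns `cols` of X."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def forward_selection(X, y, f_in=4.0):
    """Return the column indices in order of entry (hypothetical helper)."""
    n, p = X.shape
    selected = []
    current = rss(X, y, selected)
    while len(selected) < p:
        candidates = [j for j in range(p) if j not in selected]
        # RSS obtained by adjoining each eligible variable to the current set.
        trial = {j: rss(X, y, selected + [j]) for j in candidates}
        best = min(trial, key=trial.get)
        df = n - len(selected) - 2  # residual d.f. once `best` and the constant are in
        if df <= 0:
            break
        # Assumes the fit is not perfect, so trial[best] > 0.
        f_ratio = (current - trial[best]) / (trial[best] / df)
        if f_ratio <= f_in:
            break  # stopping criterion: best candidate fails the F-to-enter test
        selected.append(best)
        current = trial[best]
    return selected
```

Because selection is greedy, two variables that are only useful together may never be entered, which is the weakness noted earlier for correlated predictors.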
2.1.2 Backward elimination
The backward elimination method is more economical than the “all regressions” method in the
sense that it tries to examine only the “best” regression containing a certain number of variables
(Draper and Smith (1981)).
We start with all $p$ variables, including a constant if there is one, in the selected set. Thus, a
regression equation containing all variables is computed and variables are eliminated one at a
time.
At any step, the variable with the smallest F-ratio as computed from the current regression is
eliminated if this F-ratio does not exceed a specified value. That is, variable $i$ is deleted from the
$p$-term equation if
$$F_i = \min_i \left( \frac{RSS_{p-i} - RSS_p}{\hat{\sigma}_p^2} \right) < F_{out}.$$
Here $RSS_{p-i}$ denotes the residual sum of squares obtained when variable $i$ is deleted from the
current $p$-term equation, and $RSS_p$ is the residual sum of squares for the $p$-term equation.
Draper and Smith (1981) proposed a method with the following steps applied to the regression
equation with all variables:
1. The partial F-value, which is associated with the test $H_0: \beta = 0$ versus $H_1: \beta \neq 0$
for any particular regression coefficient, is calculated for every predictor variable treated as
though it were the last to enter the regression equation.
2. The lowest partial F-value, $F_L$ say, is compared with a pre-selected significance level,
$F_0$ say. If $F_L < F_0$, the variable which gave rise to $F_L$ is removed, the regression
equation is recalculated with the remaining variables and step 1 is performed again. If $F_L > F_0$,
the regression equation is adopted as calculated.
A rather simpler approach by Miller (1990) uses the residual sum of squares. If $RSS_p$ is the
residual sum of squares for the regression with all $p$ variables, a variable is chosen for deletion if
it yields the smallest value of $RSS_{p-1}$ after deletion. Then that variable from the remaining
$p - 1$ variables which yields the smallest $RSS_{p-2}$ is deleted. The process continues until one
variable is left or a stopping criterion is satisfied.
According to Mantel (1970), the advantageous property of the backward elimination regression
procedure is that it drops regressor variables, or sets of regressor variables, only when one can
afford to discard them without seriously impairing the goodness of fit. Thus many variables can be
discarded without abruptly worsening the regression.
On the other hand, backward elimination is usually not feasible when there are more variables than
observations. It also requires far more computation than forward selection.
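Backward elimination can be sketched in the same way. The Python code below is a minimal illustration that assumes NumPy, always fits an intercept, and uses hypothetical names (`rss`, `backward_elimination`) and an illustrative threshold `f_out=4.0`. At each step it deletes the variable whose removal increases the residual sum of squares least, provided the corresponding F-ratio is below the threshold.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of y on an intercept plus the columns `cols` of X."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def backward_elimination(X, y, f_out=4.0):
    """Return the column indices that survive elimination (hypothetical helper)."""
    n, p = X.shape
    if n <= p + 1:
        # Mirrors the remark above: not feasible with more variables than observations.
        raise ValueError("need more observations than parameters")
    selected = list(range(p))
    while selected:
        full = rss(X, y, selected)
        sigma2 = full / (n - len(selected) - 1)  # residual mean square of current model
        # RSS after deleting each variable in turn from the current equation.
        trial = {j: rss(X, y, [k for k in selected if k != j]) for j in selected}
        weakest = min(trial, key=trial.get)
        f_ratio = (trial[weakest] - full) / sigma2
        if f_ratio >= f_out:
            break  # even the least useful variable is worth keeping
        selected.remove(weakest)
    return selected
```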
2.1.3 Conventional Stepwise or Efroymson’s Algorithm
This is a variation on forward selection. After each variable (except the first one) is added to the
set of selected variables, a test is made to ascertain if any of the previously selected variables can
be deleted without appreciably increasing the residual sum of squares. This algorithm incorporates
criteria for the addition and deletion of variables.
If $RSS_r$ denotes the residual sum of squares with $r$ variables and a constant in the model, and the
smallest RSS which can be obtained by adding another variable to the present set is $RSS_{r+1}$, the
ratio
$$R = \frac{RSS_r - RSS_{r+1}}{RSS_{r+1} / (n - r - 2)} \qquad (2.1.3.1)$$
is calculated and its value is compared with an ‘F-to-enter’ value, say $F_e$. If $R$ is greater than
$F_e$, the variable is added to the selected set.
2.1.3.2 Criterion for deletion
If $RSS_{r-1}$ is the smallest RSS which can be obtained after deleting any variable from the
previously selected variables, the ratio
$$R = \frac{RSS_{r-1} - RSS_r}{RSS_r / (n - r - 1)} \qquad (2.1.3.2)$$
is calculated and its value compared with an ‘F-to-delete (or drop)’ value, say $F_d$. If $R$ is less
than $F_d$, the variable is deleted from the selected set.
2.1.3.3 Convergence of the Algorithm
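From (2.1.3.1) it follows that when the criterion for adding a variable is satisfied we have
$$RSS_{r+1} \leq \frac{RSS_r}{1 + F_e / (n - r - 2)},$$
and from (2.1.3.2), when the criterion for deletion of a variable is satisfied, we have
$$RSS_r \leq RSS_{r+1} \left\{ 1 + \frac{F_d}{n - r - 2} \right\}.$$
Consequently, when an addition is followed by a deletion, the new residual sum of squares,
$RSS_r^*$ say, satisfies
$$RSS_r^* \leq RSS_r \cdot \frac{1 + F_d / (n - r - 2)}{1 + F_e / (n - r - 2)}. \qquad (2.1.3.3)$$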
The procedure stops when no further additions and deletions satisfying the criteria are possible.
Since each RSS r is bounded below by the smallest RSS for any subset of r variables, by ensuring
that the RSS is reduced each time that a new subset of r variables is found, convergence is
guaranteed. From (2.1.3.3) it follows that a sufficient condition for convergence is that Fd < Fe.
However, there is no guarantee that this algorithm will find the best fitting subsets, though it often
performs better than forward selection when some of the predictors are highly correlated.
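The addition and deletion criteria (2.1.3.1) and (2.1.3.2) can be combined in a short Python sketch. It is illustrative only: NumPy is assumed, the names `rss` and `efroymson` are hypothetical, and the default thresholds are arbitrary values chosen so that $F_d < F_e$, the sufficient condition for convergence given above.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of y on an intercept plus the columns `cols` of X."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def efroymson(X, y, f_enter=4.0, f_delete=3.9):
    """Stepwise selection with addition and deletion tests (hypothetical helper)."""
    if not f_delete < f_enter:
        raise ValueError("convergence requires f_delete < f_enter")
    n, p = X.shape
    selected = []
    while True:
        r = len(selected)
        candidates = [j for j in range(p) if j not in selected]
        if not candidates or n - r - 2 <= 0:
            break
        current = rss(X, y, selected)
        trial = {j: rss(X, y, selected + [j]) for j in candidates}
        best = min(trial, key=trial.get)
        # Addition test, ratio (2.1.3.1); assumes trial[best] > 0 (no perfect fit).
        if (current - trial[best]) / (trial[best] / (n - r - 2)) <= f_enter:
            break
        selected.append(best)
        # Deletion test, ratio (2.1.3.2); skipped after the first variable.
        while len(selected) > 1:
            r = len(selected)
            current = rss(X, y, selected)
            drop = {j: rss(X, y, [k for k in selected if k != j]) for j in selected}
            weakest = min(drop, key=drop.get)
            if (drop[weakest] - current) / (current / (n - r - 1)) >= f_delete:
                break
            selected.remove(weakest)
    return selected
```

With `f_delete < f_enter`, a variable that has just been entered cannot be deleted immediately, since its deletion ratio equals the ratio with which it entered.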
2.1.4 Press
According to Draper and Smith (1981), the Press selection procedure, proposed by D.M. Allen in
Technical Report No 23, Dept of Statistics, University of Kentucky, 1971, is a
combination of all possible regressions, residual analysis and validation techniques.
If $r$ is the number of parameters, including $\beta_0$, in a regression equation and there are $n$
observations in all, the basic calculations entail:
1. Deleting the first observation on the response and predictor variables.
2. Fitting all possible regressions to the remaining $n - 1$ data points.
3. Using each fitted model to predict $Y_1$ by $\hat{Y}_{1r}$ (say) and so obtaining a predictive
discrepancy $(Y_1 - \hat{Y}_{1r})$ for all the possible regression models.
4. Repeating steps 1, 2 and 3, but deleting the second observation to give $(Y_2 - \hat{Y}_{2r})$
values, the third to give $(Y_3 - \hat{Y}_{3r})$ values, and so on, to $n$ deletions.
5. Calculating the predictive discrepancy sum of squares
$\sum_{i=1}^{n} (Y_i - \hat{Y}_{ir})^2$ for each subset regression model.
6. Choosing the “best” subset regression. This will have a comparatively small predictive
sum of squares but not involve many predictors.
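The six steps above can be sketched directly in Python. The code below assumes NumPy, always fits an intercept, and uses hypothetical function names (`press`, `press_all_subsets`); it refits each subset model $n$ times, which is only practical for small $n$ and $p$.

```python
import itertools
import numpy as np

def press(X, y, cols):
    """Predictive discrepancy sum of squares for the subset `cols` (steps 1 to 5)."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i                      # delete observation i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        total += float((y[i] - A[i] @ coef) ** 2)     # squared predictive discrepancy
    return total

def press_all_subsets(X, y):
    """PRESS value for every non-empty subset of the columns of X."""
    p = X.shape[1]
    return {cols: press(X, y, cols)
            for size in range(1, p + 1)
            for cols in itertools.combinations(range(p), size)}
```

Step 6 is left to the analyst: among subsets with comparably small values in the returned dictionary, one with few predictors would be preferred.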
2.1.5 Principal Component Regression
This is a procedure which analyses the correlation structure in some detail and was first
proposed by Harold Hotelling (Draper and Smith (1981)).
Let $Z$ represent the appropriately centred and scaled $X$ matrix. Then $Z'Z$ is the correlation
matrix, and the eigenvalues of this correlation matrix are the $k$ solutions
$\lambda_1, \lambda_2, \ldots, \lambda_k$ of the determinantal equation
$$|Z'Z - \lambda I| = 0 \qquad (2.1.5.1)$$
for the model with all possible predictors $Z_1, Z_2, \ldots, Z_k$. By making each new variable column
$$z_{ji} = \frac{Z_{ji} - \bar{Z}_j}{S_{jj}^{1/2}}, \qquad (2.1.5.2)$$
where
$$n \bar{Z}_j = \sum_{i=1}^{n} Z_{ji}, \qquad S_{jj} = \sum_{i=1}^{n} (Z_{ji} - \bar{Z}_j)^2, \qquad (2.1.5.3)$$
with zero mean and unit sum of squares, we have orthogonalised out a new $\beta_0'$ term and cast the
predictors into ‘correlation form’. The rank of the non-singular correlation matrix is $k = r - 1$. The
total of all sums of squares of the $z_j$ is clearly $k$ (Draper & Smith (1981)). We call this the
total variance of the Z’s.
For each eigenvalue, $\lambda_j$, there is an eigenvector $\gamma_j$ satisfying
$$(Z'Z - \lambda_j I)\gamma_j = 0 \qquad (2.1.5.4)$$
with $\gamma_j' \gamma_j = 1$. The vectors $\gamma_j$ are used to re-express the Z’s in terms of
principal components W’s, in the form
$$W_j = \gamma_{1j} z_1 + \gamma_{2j} z_2 + \cdots + \gamma_{kj} z_k \qquad (2.1.5.5)$$
11
and the sum of the squares of the new Wj column with elements Wji, i=1, 2,………,n, is λ j i.e. Wj
picks up an amount of λ j of the total variance. We note that
∑λ
j
j
= k and ∑∑i W ji2 = k .
j
The $W_j$ corresponding to the largest $\lambda_j$ value is called the first principal component and accounts for
the largest proportion of the variation in the standardised data set. Successive $W_j$’s explain smaller and
smaller proportions until all the variation is explained, i.e. $\sum_j \lambda_j = k$. The $W_j$’s are not all used;
a selection procedure of some sort is applied, although there is no universally agreed upon procedure.
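As a small worked illustration (not taken from the source), consider k = 2 standardised predictors whose correlation is ρ, with 0 < ρ < 1. The symbols follow (2.1.5.1) to (2.1.5.5); the two-predictor setting and the value of ρ used afterwards are assumptions made purely for illustration.

```latex
Z'Z = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \qquad
|Z'Z - \lambda I| = (1-\lambda)^2 - \rho^2 = 0
\;\Rightarrow\; \lambda_1 = 1 + \rho, \quad \lambda_2 = 1 - \rho, \qquad
\lambda_1 + \lambda_2 = 2 = k.

\gamma_1 = \tfrac{1}{\sqrt{2}}(1, 1)', \quad
\gamma_2 = \tfrac{1}{\sqrt{2}}(1, -1)', \qquad
W_1 = \tfrac{1}{\sqrt{2}}(z_1 + z_2), \quad
W_2 = \tfrac{1}{\sqrt{2}}(z_1 - z_2).
```

With ρ = 0.9, for example, $W_1$ accounts for $\lambda_1/k$ = 1.9/2 = 95% of the total variance, so retaining only $W_1$ would discard little of the variation in the predictors.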
2.1.6 Latent Root Regression
This is an extension of principal component regression, developed by Webster and his co-workers
(Draper and Smith (1981)), for examining alternative predictive equations and eliminating predictor
variables. The data matrix of the centred and scaled predictor variables is augmented with the
centred and scaled response variable to provide Z* = (y, Z), where Z is the centred and scaled
‘X matrix’,
$$ y = \frac{Y - \mathbf{1}\bar{Y}}{S_{YY}^{1/2}}, $$
$\mathbf{1}$ is an n×1 vector of 1’s and $S_{YY} = \sum_i (Y_i - \bar{Y})^2$. It follows that
Z*'Z* is the augmented correlation matrix. The eigenvalues and their corresponding eigenvectors
are calculated, and the first element of each eigenvector is used as a measure of the
predictability of the response by that eigenvector. The larger the size of the first element of the
eigenvector, the more useful that eigenvector is in predicting the response variable, and vice versa.
The presence of small eigenvalues indicates potential linear dependence among the predictor
variables. Eigenvectors whose eigenvalues and corresponding first elements are both small are
dropped, and a modified least squares estimation equation is obtained. The backward
elimination procedure is then employed to remove predictor variables from the equation.
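The vector of modified least squares (MLS) equation coefficients is given by:
$$ b^* = \begin{bmatrix} b_1^* \\ b_2^* \\ \vdots \\ b_k^* \end{bmatrix}
= c \sum_j{}^{*}\, \gamma_{0j} \lambda_j^{-1} \begin{bmatrix} \gamma_{1j} \\ \gamma_{2j} \\ \vdots \\ \gamma_{kj} \end{bmatrix} \tag{2.1.6.1} $$
where
$$ c = -\Big\{ \sum_j{}^{*}\, \gamma_{0j}^2 \lambda_j^{-1} \Big\}^{-1} \Big\{ \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \Big\}^{1/2} \tag{2.1.6.2} $$
and $\sum^{*}$ denotes a summation over only those values of j whose vectors have been retained. Also
$b_0^* = \bar{Y}$ for the model. The residual sum of squares for any modified least squares (MLS)
equation can be written as
$$ RSS = \Big\{ \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \Big\} \Big\{ \sum_j{}^{*}\, \gamma_{0j}^2 \lambda_j^{-1} \Big\}^{-1}
= -c \Big\{ \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \Big\}^{1/2} \tag{2.1.6.3} $$
The residual sum of squares that results from deletion of $X_l$, l = 1, 2, …, k, from the MLS equation can
be evaluated as
$$ \Big\{ \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \Big\} \Big\{ t_{00} - \frac{t_{l0}^2}{t_{ll}} \Big\}^{-1} \tag{2.1.6.4} $$
where
$$ t_{rq} = \sum_j{}^{*}\, \frac{\gamma_{rj}\gamma_{qj}}{\lambda_j} \tag{2.1.6.5} $$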
The main advantage of this method is that by removing the effect of the non-predictive near
singularities, the true influences of the independent variables on the dependent variable are more
clearly represented.
2.1.7 Branch-and-Bound Techniques
Suppose that we are looking for the subset of r variables out of p variables which yields the
smallest RSS. We begin by dividing all possible subsets into two branches, those which contain
X1 and those which do not. Within each branch we can have sub-branches including and
excluding variable X2, etc. Suppose that at some stage we have found a subset of r variables containing X1
or X2 or both, giving RSS = 100, say. Suppose we are about to start examining the sub-branch which
excludes both X1 and X2. A lower bound on the smallest RSS which can be obtained from this
sub-branch is the RSS for all of the remaining p-2 variables. If this is, say, 108, then no subset of r
variables from this sub-branch can do better than 108, and since we have already found a smaller RSS,
this whole sub-branch can be skipped.
The technique is useful when there are ‘dominant’ variables which good-fitting subsets must
include. It is of no value when there are more variables than observations, as the lower bounds are
nearly always zero.
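The pruning rule of this section can be sketched as follows. This is an illustrative implementation under stated assumptions, not the algorithm of any particular package: every model includes a constant, the functions `rss` and `best_subset` are hypothetical names, and the data are made up so that columns 0 and 2 are ‘dominant’.

```python
from itertools import combinations


def rss(X, y, cols):
    """Residual sum of squares of y on a constant plus the columns in `cols`."""
    A = [[1.0] + [row[j] for j in cols] for row in X]
    m = len(A[0])
    M = [[sum(r[a] * r[b] for r in A) for b in range(m)]
         + [sum(r[a] * yi for r, yi in zip(A, y))] for a in range(m)]
    for c in range(m):  # Gauss-Jordan elimination on the normal equations
        piv = max(range(c, m), key=lambda k: abs(M[k][c]))
        M[c], M[piv] = M[piv], M[c]
        for k in range(m):
            if k != c:
                f = M[k][c] / M[c][c]
                M[k] = [u - f * v for u, v in zip(M[k], M[c])]
    b = [M[a][m] / M[a][a] for a in range(m)]
    return sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2
               for r, yi in zip(A, y))


def best_subset(X, y, r):
    """Smallest-RSS subset of exactly r variables, skipping hopeless branches."""
    p = len(X[0])
    best = {"rss": float("inf"), "cols": None, "fits": 0}

    def search(chosen, j):
        # `chosen` holds variables already included; j..p-1 are still undecided
        available = chosen + list(range(j, p))
        if len(available) < r:
            return
        best["fits"] += 1  # each visited node costs exactly one regression fit
        if len(chosen) == r:
            value = rss(X, y, chosen)
            if value < best["rss"]:
                best["rss"], best["cols"] = value, tuple(chosen)
            return
        # Lower bound: RSS using every variable still available in this branch
        if rss(X, y, available) >= best["rss"]:
            return
        search(chosen + [j], j + 1)  # sub-branch containing variable j
        search(chosen, j + 1)        # sub-branch excluding variable j

    search([], 0)
    return best["cols"], best["rss"], best["fits"]


# Hypothetical data: y is driven by columns 0 and 2
X = [[1, 5, 2, 7], [2, 3, 9, 1], [3, 8, 4, 6], [4, 1, 7, 3],
     [5, 9, 1, 8], [6, 2, 8, 2], [7, 7, 3, 9], [8, 4, 6, 4]]
noise = [0.1, -0.2, 0.05, 0.0, -0.1, 0.15, -0.05, 0.05]
y = [2 * r[0] + 3 * r[2] + e for r, e in zip(X, noise)]

cols, value, fits = best_subset(X, y, 2)
exhaustive = min((rss(X, y, list(c)), c) for c in combinations(range(4), 2))
print(cols, round(value, 4), "fits used:", fits)
```

The comparison with the exhaustive search is included only as a check that pruning does not change the answer.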
2.1.8 Variable Selection via the Elastic net
According to Zou and Hastie (2005), the elastic net encourages a grouping effect where strongly
correlated predictors tend to be in or out of the model together. It is particularly useful when the
number of predictors (p) is much bigger than the number of observations (n).
2.1.8.1 Naive Elastic net
Let $y = (y_1, \ldots, y_n)'$ be the response and $X = (x_1 | \ldots | x_p)$ the model matrix,
where $x_j = (x_{1j}, \ldots, x_{nj})'$, j = 1,…,p, are the predictors. After a location and scale transformation,
we can assume that the response is centred and the predictors are standardised, and hence
$$ \sum_{i=1}^{n} y_i = 0, \qquad \sum_{i=1}^{n} x_{ij} = 0 \quad \text{and} \quad \sum_{i=1}^{n} x_{ij}^2 = 1, \quad \text{for } j = 1, \ldots, p. \tag{2.1.8.1.1} $$
For any fixed non-negative $\lambda_1$ and $\lambda_2$, we define the naïve elastic net criterion as:
$$ L(\lambda_1, \lambda_2, \beta) = |y - X\beta|^2 + \lambda_2 |\beta|^2 + \lambda_1 |\beta|_1 \tag{2.1.8.1.2} $$
where
$$ |\beta|^2 = \sum_{j=1}^{p} \beta_j^2, \qquad |\beta|_1 = \sum_{j=1}^{p} |\beta_j|. $$
The naïve elastic net estimator $\hat\beta$ is the minimiser of (2.1.8.1.2):
$$ \hat\beta = \arg\min_{\beta} \{ L(\lambda_1, \lambda_2, \beta) \}. $$
Let
$$ \alpha = \frac{\lambda_2}{\lambda_1 + \lambda_2}; \tag{2.1.8.1.3} $$
then solving for $\hat\beta$ in (2.1.8.1.2) is equivalent to the optimisation problem
$$ \hat\beta = \arg\min_{\beta} |y - X\beta|^2 \quad \text{subject to } (1-\alpha)|\beta|_1 + \alpha|\beta|^2 \le t \text{ for some } t. $$
The function $(1-\alpha)|\beta|_1 + \alpha|\beta|^2$ is the elastic net penalty. In this discussion we consider the case
where α < 1. For all α ∈ [0,1), the elastic net penalty function is singular (without first derivative)
at 0, and it is strictly convex for all α > 0.
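Lemma 1. Given data set (y, X) and $(\lambda_1, \lambda_2)$, define an artificial data set $(y^*, X^*)$ by
$$ X^*_{(n+p) \times p} = (1 + \lambda_2)^{-1/2} \begin{pmatrix} X \\ \sqrt{\lambda_2}\, I \end{pmatrix}, \qquad
y^*_{(n+p)} = \begin{pmatrix} y \\ 0 \end{pmatrix}. $$
Let $\gamma = \lambda_1 / \sqrt{1 + \lambda_2}$ and $\beta^* = \sqrt{1 + \lambda_2}\, \beta$. Then the naïve elastic net criterion can be written as
$$ L(\gamma, \beta) = L(\gamma, \beta^*) = |y^* - X^*\beta^*|^2 + \gamma |\beta^*|_1. $$
Let $\hat\beta^* = \arg\min_{\beta^*} L(\gamma, \beta^*)$; then
$$ \hat\beta = \frac{1}{\sqrt{1 + \lambda_2}}\, \hat\beta^*. $$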
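The identity behind Lemma 1 can be checked numerically. The sketch below assumes the square-root scaling of Zou and Hastie (2005), that is $X^* = (1+\lambda_2)^{-1/2}(X;\ \sqrt{\lambda_2} I)$, $\gamma = \lambda_1/\sqrt{1+\lambda_2}$ and $\beta^* = \sqrt{1+\lambda_2}\,\beta$; the function names and the numbers are hypothetical. It evaluates both forms of the criterion at an arbitrary β and does not perform the minimisation itself.

```python
import math


def naive_elastic_net(y, X, beta, lam1, lam2):
    """Naive elastic net criterion: |y - X b|^2 + lam2 |b|^2 + lam1 |b|_1."""
    resid = [yi - sum(x * b for x, b in zip(row, beta)) for yi, row in zip(y, X)]
    return (sum(e * e for e in resid)
            + lam2 * sum(b * b for b in beta)
            + lam1 * sum(abs(b) for b in beta))


def augmented_lasso(y, X, beta, lam1, lam2):
    """Lasso-type criterion |y* - X* b*|^2 + gamma |b*|_1 on the artificial data."""
    p = len(beta)
    scale = math.sqrt(1.0 + lam2)
    # X* stacks X on top of sqrt(lam2) I and divides by sqrt(1 + lam2)
    X_star = [[x / scale for x in row] for row in X]
    X_star += [[math.sqrt(lam2) / scale if i == j else 0.0 for j in range(p)]
               for i in range(p)]
    y_star = list(y) + [0.0] * p           # y padded with p zeros
    beta_star = [scale * b for b in beta]  # beta* = sqrt(1 + lam2) beta
    gamma = lam1 / scale
    resid = [yi - sum(x * b for x, b in zip(row, beta_star))
             for yi, row in zip(y_star, X_star)]
    return sum(e * e for e in resid) + gamma * sum(abs(b) for b in beta_star)


# Hypothetical numbers, chosen only to exercise the identity
y = [1.0, -0.5, 2.0, 0.3]
X = [[0.5, -1.0], [1.5, 0.2], [-0.7, 0.9], [0.1, 0.4]]
beta = [0.8, -1.3]
direct = naive_elastic_net(y, X, beta, lam1=0.7, lam2=1.5)
via_lasso = augmented_lasso(y, X, beta, lam1=0.7, lam2=1.5)
print(direct, via_lasso)
```

Because the two criteria agree for every β, any lasso solver applied to $(y^*, X^*)$ with penalty γ yields the naïve elastic net solution after rescaling.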
2.1.8.2 Elastic net
Zou and Hastie (2005) point out that empirical evidence shows that the naïve elastic net does not
perform satisfactorily unless it is very close to the lasso method discussed in section (2.2.2). This
is why it is called naïve. The elastic net improves the prediction performance of the naïve elastic
net.
Given data (y, X), penalty parameters $(\lambda_1, \lambda_2)$ and the artificial data $(y^*, X^*)$, the naïve elastic net
solves a lasso-type problem
$$ \hat\beta^* = \arg\min_{\beta^*} |y^* - X^*\beta^*|^2 + \frac{\lambda_1}{\sqrt{1 + \lambda_2}} |\beta^*|_1 \tag{2.1.8.2.1} $$
The elastic net (corrected) estimates $\hat\beta$ are defined by
$$ \hat\beta(\text{elastic net}) = \sqrt{1 + \lambda_2}\, \hat\beta^* \tag{2.1.8.2.2} $$
We recall that $\hat\beta(\text{naïve elastic net}) = \{1/\sqrt{1 + \lambda_2}\}\, \hat\beta^*$; thus
$$ \hat\beta(\text{elastic net}) = (1 + \lambda_2)\, \hat\beta(\text{naïve elastic net}). \tag{2.1.8.2.3} $$
Hence the elastic net coefficient is a rescaled naïve elastic net coefficient.
An algorithm called LARS-EN (Zou and Hastie (2005)) is recommended to solve the elastic net
efficiently. Algorithm LARS-EN sequentially updates the elastic net fits. In the p > n case, such as
with microarray data, it is not necessary to run the algorithm to the end. Real data and simulated
computational experiments show that the optimal results are achieved at an early stage of
algorithm LARS-EN. If we stop the algorithm after m steps, then it requires $O(m^3 + pm^2)$
operations.
2.1.9 Generating all Subsets
It is feasible to generate all subsets of variables if the number of predictor variables is not too
large, say less than 20 and if only the RSS is calculated for each set. When the complete search has
been carried out, a small number of the more promising subsets can be examined in more detail.
The disadvantage of generating all subsets is cost. The computational cost roughly doubles with
each additional variable. Hence the availability of high-speed computing becomes imperative for
this rather cumbersome procedure.
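The doubling of the cost can be seen by simply counting the subsets. The sketch below uses a hypothetical helper `all_subsets`; in a real search the RSS of each subset would be computed inside the loop.

```python
from itertools import combinations


def all_subsets(p):
    """Yield every non-empty subset of the predictor indices 0..p-1."""
    for size in range(1, p + 1):
        for cols in combinations(range(p), size):
            yield cols


# Number of candidate regressions for a few values of p
counts = {p: sum(1 for _ in all_subsets(p)) for p in (5, 10, 15, 20)}
for p, count in counts.items():
    print(p, count)
```

Each additional predictor doubles the number of subsets (2^p - 1 non-empty subsets for p predictors), which is why the approach becomes impractical much beyond 20 predictors.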
2.2 VARIABLE SELECTION IN THE COX REGRESSION MODEL
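The Cox regression model or proportional hazards model for survival data assumes that
$$ h(t, x, \beta) = h_0(t) \exp\Big( \sum_j x_j \beta_j \Big) \tag{2.2.1} $$
where $h(t, x, \beta)$ is the hazard at time t given predictor values $x = (x_1, \ldots, x_p)$ and $h_0(t)$ is an arbitrary
baseline hazard function. We usually estimate the parameter $\beta = (\beta_1, \ldots, \beta_p)'$ in the proportional
hazards model without specifying $h_0(t)$, through maximisation of the partial likelihood:
$$ L(\beta) = \prod_{r \in D} \frac{\exp(\beta' x_{j_r})}{\sum_{j \in R_r} \exp(\beta' x_j)} \tag{2.2.2} $$
where D is the set of indices of the observed failure times, $R_r$ is the set of individuals at risk at
the rth failure time and $j_r$ is the individual failing at that time.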
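A direct evaluation of the logarithm of (2.2.2) may help to fix the notation. The sketch below assumes there are no tied failure times and takes the risk set at a failure time to be everyone whose observed time is at least that time; the function name and the four observations are hypothetical.

```python
import math


def log_partial_likelihood(times, events, X, beta):
    """Log of the partial likelihood (2.2.2), assuming no tied failure times.

    times[i]  : observed time of individual i
    events[i] : 1 if a failure was observed, 0 if censored
    X[i]      : covariate values of individual i
    """
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    total = 0.0
    for r, (t_r, failed) in enumerate(zip(times, events)):
        if not failed:
            continue  # censored observations enter only through the risk sets
        # Risk set: everyone still under observation at time t_r
        risk = [j for j, t in enumerate(times) if t >= t_r]
        total += eta[r] - math.log(sum(math.exp(eta[j]) for j in risk))
    return total


# Hypothetical data: four individuals, one covariate, one censored time
times = [2.0, 5.0, 3.0, 8.0]
events = [1, 0, 1, 1]
X = [[0.5], [1.2], [-0.3], [0.0]]
print(log_partial_likelihood(times, events, X, beta=[0.4]))
```

Note that the baseline hazard $h_0(t)$ never appears in the computation, which is what allows β to be estimated without specifying it.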
Performing a proportional hazards regression analysis requires a number of critical decisions.
When selecting a subset of covariates, we must consider issues such as clinical importance and
adjustment for confounding, as well as statistical significance. Once a subset is selected, we must
determine whether the model is ‘linear’ in continuous covariates and, if not, what transformations
are suggested by data and clinical considerations. Another important decision is the question of
interactions, if any, to be included in the model.
Regardless of which method is used for covariate selection, any survival analysis should begin
with a thorough bivariate analysis of the association between survival time and all important
covariates. For categorical covariates the logrank test is employed, whilst continuous covariates
are first grouped by quartiles so that the logrank test can be applied to the resulting categories.
Stepwise methods for the Cox regression are similar to those that will be discussed in Logistic
regression in Chapter 3 and hence will not be considered in this section.
2.2.1 Purposeful Selection of variables
This is a method that is completely controlled by the data analyst. It begins with a multivariable
model containing all variables significant in the bivariate analysis at the 20-25 percent level, as
well as any other variables not selected with this criterion, but which are judged to be of clinical
importance. The use of the above level of significance should lead to the inclusion of any variable
that has the potential to be either an important confounder, or statistically significant in the
preliminary multivariable model.
Following the fit of the initial multivariable model, we use the P-values from the Wald tests of the
individual coefficients to identify covariates that might be deleted from the model. The P-value of
the partial likelihood ratio test should confirm that the deleted covariate is not significant.
After fitting the reduced model, we assess whether or not removal of the covariate has produced an
“important” change in the coefficients of the variables remaining in the model. We use a value of
about 20 percent as an indicator of an important change in the coefficients. If the variable excluded
is an important confounder, it should be added back into the model. It is also recommended that any
variable excluded from the initial multivariable model be added back into the model to confirm that
it is neither statistically significant nor an important confounder.
The next step is to examine the scale of continuous covariates in the preliminary main effects
model. There are methods that can be employed to assess whether the effect of the covariate is
linear in the log hazard and if not, which transformation is linear in the log hazard.
One of the methods involves replacing the continuous covariate with design variables such as
quartiles or other purposeful cut-points that may have been used in the bivariate analysis. The
estimated coefficients for the design variables are plotted against the midpoints of the groups and,
at the midpoint of the first group, a point is plotted at zero. If the correct scale is linear in the log
hazard, then the polygon connecting the points should be nearly a straight line. If there is a
substantial departure from the linear trend, its form may be used to suggest a transformation of the
covariate. The quartile method does not require any special software. However, it is not powerful
enough to detect subtle, but often important, deviations from a linear trend.
Another approach is the method of fractional polynomials which we shall not discuss in this study.
The only software that has fully implemented this method is STATA (Hosmer & Lemeshow
(1998)).
In the final step we determine whether interactions are needed in the model. Special considerations
may dictate the inclusion of certain interaction terms irrespective of whether the coefficients are
statistically significant or not. In most settings there will be insufficient clinical theory to justify
automatic inclusion of interactions.
Biologically plausible interactions are formed and those that are individually significant at the 5
percent level are included simultaneously in the main effects model. The inclusion of
non-significant interactions will increase standard error estimates, resulting in wide confidence
intervals. The inclusion of an interaction term will change the coefficients of the relevant main
effects. When there is statistically significant interaction, we include the corresponding main effect
terms in the model regardless of their statistical significance.
2.2.2 The Lasso Method (Tibshirani (1997))
We denote the log partial likelihood by $\ell(\beta) = \log L(\beta)$, and assume that the $x_{ij}$ are standardised so
that
$$ \sum_i x_{ij} / N = 0, \qquad \sum_i x_{ij}^2 / N = 1. $$
We estimate β via the criterion
$$ \hat\beta = \arg\max\, \ell(\beta), \quad \text{subject to } \sum_j |\beta_j| \le s \tag{2.2.2.1} $$
where s > 0 is a user-specified parameter. Suppose $\hat\beta^0$ are the maximisers of the partial likelihood
(2.2.2). Then if $s \ge \sum_j |\hat\beta_j^0|$, the solutions to (2.2.2.1) are the usual partial likelihood estimates. If
$s < \sum_j |\hat\beta_j^0|$, the solutions to (2.2.2.1) are shrunken towards zero. An attractive feature of the
particular constraint $\sum_j |\beta_j| \le s$ is that quite often some of the solution coefficients are exactly
zero, which makes for a more interpretable final model.
The strategy for solving (2.2.2.1) is to express the usual Newton-Raphson update as an iteratively
reweighted least squares (IRLS) step, and then replace the weighted least squares step by a
constrained weighted least squares procedure. If X denotes the design matrix of regressor variables
and η = Xβ, define $u = \partial\ell/\partial\eta$, $A = -\partial^2\ell/\partial\eta\,\partial\eta'$ and $z = \eta + A^{-1}u$. Then a one-term Taylor series
expansion for ℓ(β) has the form
$$ (z - \eta)' A (z - \eta) \tag{2.2.2.2} $$
Hence to solve the original problem (2.2.2.1), we use the following procedure:
i) Fix s and initialise $\hat\beta = 0$.
ii) Compute η, u, A and z based on the current value of $\hat\beta$.
iii) Minimise $(z - X\beta)' A (z - X\beta)$ subject to $\sum_j |\beta_j| \le s$.
iv) Repeat steps ii) and iii) until $\hat\beta$ does not change.
Since A is a full matrix, it requires computation of $O(N^2)$ elements. However, this difficulty can be
avoided by replacing A with a diagonal matrix D that has the same diagonal elements as A.
If the log partial likelihood is bounded in β for the given data set, then for fixed s a solution to
(2.2.2.1) exists since the region Σ | β j |≤ s is compact. But the solution may not be unique.
In some situations it is desirable to have an automatic method for choosing the parameter s based
on the data. Tibshirani’s proposal is to minimise an approximate Generalised Cross Validation
(GCV) statistic. We write the constraint $\sum_j |\beta_j| \le s$ as $\sum_j \beta_j^2 / |\beta_j| \le s$. This latter constraint is
equivalent to adding a Lagrangian penalty $\lambda \sum_j \beta_j^2 / |\beta_j|$ to the log partial likelihood, with λ ≥ 0
depending on s. We may write the constrained solution $\tilde\beta$ of step iii) in the form
$$ \tilde\beta = (X'DX + \lambda W)^{-1} X'Dz \tag{2.2.2.3} $$
where $W = \mathrm{diag}(W_j)$, with $W_j = 1/|\tilde\beta_j|$ if $|\tilde\beta_j| > 0$ and 0 otherwise. Therefore we may approximate the
number of effective parameters in the constrained fit $\tilde\beta$ by
$$ p(s) = \mathrm{tr}[X(X'DX + \lambda W)^{-1} X'D]. $$
Letting $\ell_s$ be the log partial likelihood for the constrained fit with constraint s, we construct the
GCV-style statistic
$$ GCV(s) = \frac{1}{N}\, \frac{-\ell_s}{N[1 - p(s)/N]^2}. $$
The GCV criterion inflates the negative log partial likelihood by a factor that involves p(s), the
effective number of parameters; larger values of p(s) cause more inflation of the negative log
partial likelihood.
The simulation study by Tibshirani revealed that the lasso clearly outperformed stepwise selection
and picked the correct number of zero coefficients. It was less variable than the stepwise approach
and still yielded interpretable models.
2.3 VARIABLE SELECTION FOR TIME SERIES DATA
Marriot and Pettitt (1997) proposed a model that takes the form:
Yt= Filter + Covariates + noise
where the filter is a “time series filter” designed to capture stochastic and deterministic
trends and seasonality, and also to correct for possible autocorrelated noise terms. We simply seek
to remove the “time series behaviour” from the dependent variable to prevent it from hiding the
effects that any exogenous explanatory variable or covariate might have.
The trend components take a lagged dependent variable and linear time trend, and the seasonal
component is also a lagged dependent variable.
The proposed time series filter is given by
$$ \text{Filter} = \alpha + \beta \frac{t}{T} + \nu Y_{t-1} + \delta Y_{t-s} + \sum_{i=1}^{p} \phi_i \Delta Y_{t-i} $$
where T observations are available, Δ is the difference operator, $\Delta Y_t = Y_t - Y_{t-1}$, and s is the
period of the seasonality. The exogenous explanatory variables or covariates are given as
$$ \text{Covariates} = \sum_{i=1}^{k} \psi_i X_{t,i} $$
where $X_{t,1}, \ldots, X_{t,k}$ are observations on the covariates, and the complete model for the observed data is
$$ Y_t = \alpha + \beta \frac{t}{T} + \nu Y_{t-1} + \delta Y_{t-s} + \sum_{i=1}^{p} \phi_i \Delta Y_{t-i} + \sum_{i=1}^{k} \psi_i X_{t,i} + \varepsilon_t, \qquad t = 1, 2, \ldots, T, $$
where $\varepsilon_t \sim \text{iid } N(0, \sigma^2)$.
The model is given in vector form as
$$ Y = Z\theta + \varepsilon \tag{2.3.2} $$
where Z = (F, X), the columns of F and X being sample values of the filter and covariates
respectively, and $\theta' = (\alpha, \beta, \nu, \delta, \phi_1, \ldots, \phi_p, \psi_1, \ldots, \psi_k)$.
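From (2.3.2), Marriot and Pettitt (1997) point out that Zellner (1971) shows that, using a
non-informative joint prior for the parameters, and writing D to represent the past history of both $Y_T$ and
$X_{T,i}$, the marginal posterior density for θ is:
$$ f(\theta \mid D) \propto \{ \nu s^2 + (\theta - \hat\theta)' Z'Z (\theta - \hat\theta) \}^{-T/2} $$
where ν = T − p − k − 4,
$$ s^2 = \frac{(Y - Z\hat\theta)'(Y - Z\hat\theta)}{\nu} \quad \text{and} \quad \hat\theta = (Z'Z)^{-1} Z'Y. $$
This is a multivariate t-density. The marginal posterior density for σ is
$$ f(\sigma \mid D) \propto \frac{1}{\sigma^{\nu+1}} \exp\Big( -\frac{\nu s^2}{2\sigma^2} \Big) $$
which is an inverse gamma type distribution.
The predictive density of a future observation $Y_F$ with regressor row $\tilde z$ is
$$ f(Y_F \mid D, \tilde z) = \int f(Y_F \mid \theta, \sigma, \tilde z)\, f(\theta, \sigma \mid D)\, d\theta\, d\sigma
\propto \{ \nu + (Y_F - \tilde z \hat\theta)' H (Y_F - \tilde z \hat\theta) \}^{-(\nu+1)/2} $$
where
$$ H = \frac{1}{s^2} \{ 1 - \tilde z (Z'Z + \tilde z' \tilde z)^{-1} \tilde z' \} $$
which is a t-density. The mean and variance of $Y_F$ are
$$ E[Y_F \mid D, \tilde z] = \tilde z (Z'Z)^{-1} Z'Y $$
and
$$ E[(Y_F - E[Y_F \mid D, \tilde z])^2 \mid D, \tilde z] = \frac{\nu}{\nu - 2}\, s^2 \{ 1 + \tilde z (Z'Z)^{-1} \tilde z' \}. $$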
If we delete the ith row of the Z-matrix to get $Z_{-i}$, the complete Bayesian analysis using $Z_{-i}$ in
place of Z is undertaken to obtain the posterior densities. The deleted row $z_i$ is used to obtain the
predictive density for the observed Y value, $Y_i$. The predictive mean $E[Y_i \mid D, z_i]$ and standard
deviation $S[Y_i \mid D, z_i]$ are then used in the construction of diagnostic plots.
The plots are designed to help answer the question of whether or not an exogenous explanatory
variable makes a significant additional contribution to the model, where we consider any
additional contribution to be significant if it appears to improve the predictive power of the model.
The order of including explanatory variables is given by backward elimination, the variable
corresponding to the smallest value of
$$ \left[ \frac{E[\psi_i \mid D]}{S[\psi_i \mid D]} \right] $$
at each step being removed.
We plot the absolute value of the deviation (AD) of the observation from the predictive mean,
$|Y_i - E[Y_i \mid D, z_i]|$, against the predictive standard deviation (SD),
$\sqrt{\mathrm{var}[Y_i \mid D, z_i]}$, for each model.
We then plot the convex hull of the scatter. For a clearer picture of the data, all points on the
convex hull are ‘peeled’ away and the set of points that form the convex hull of the remaining
scatter is identified. The process is repeated until the central 50% of the scatter is reached, and the
convex hull of the central 50% is then superimposed on the picture. Plots arising from different
models are superimposed, suppressing the original scatter, and the resulting pictures make the
relative performance of competing models easy to assess. The better model is the model that
combines low predictive dispersion with few extreme values; graphically, its convex
hull is the one closest to the origin.
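The peeling of convex hulls described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the hull is found with Andrew's monotone chain, only hull vertices are peeled at each pass (points lying exactly on a hull edge are left for the next pass), and the (SD, AD) pairs are made up.

```python
def convex_hull(points):
    """Hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def central_hull(points, fraction=0.5):
    """Peel hulls until at most `fraction` of the points remain.

    Returns the convex hull of the points that are left.
    """
    remaining = list(set(points))
    target = fraction * len(remaining)
    while len(remaining) > target:
        hull = set(convex_hull(remaining))
        inner = [p for p in remaining if p not in hull]
        if not inner:  # nothing left inside the hull, stop peeling
            break
        remaining = inner
    return convex_hull(remaining)


# Hypothetical (SD, AD) pairs arranged in four nested layers around (5, 5)
points = []
for k in range(1, 5):
    points += [(5 + k, 5), (5, 5 + k), (5 - k, 5), (5, 5 - k)]

print(central_hull(points))
```

Applying `central_hull` to the (SD, AD) scatter of each competing model and overlaying the results gives the kind of picture described in the text.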
If a graphical choice of model is not clear-cut, the sample means of the absolute deviations,
MAD, and of the standard deviations, MSD, are used to select the optimal model. The use of the sum of
these two, MAD + MSD, provides a simple but useful numerical summary of the absolute
2.4 HYPOTHESIS TESTING
Suppose that by some method we have already selected r variables, where r may be zero, out of p
variables available to include in our predictor subset. If the remaining variables contain no further
information which is useful for predicting the response variable then we should certainly not make
any further selection. But we need to know whether the remaining variables contain further
information or not. The following hypothesis can be tested:
$H_0: \beta_{r+1} = \beta_{r+2} = \cdots = \beta_p = 0$, where these β’s are the regression coefficients of the variables which
have not been selected.
2.4.1 The lack-of-fit Test
If we have n observations and have fitted a linear model containing r out of p variables plus a
constant, then the difference in RSS between fitting the r variables and fitting all the p variables,
$RSS_r - RSS_p$, can be used to test for lack of fit:
$$ \text{Lack of fit } F = \frac{(RSS_r - RSS_p)/(p - r)}{RSS_p/(n - p - 1)} \tag{2.4.1.1} $$
If the usual conditions of independence, constant variance and normality are satisfied, then the
lack-of-fit statistic is sampled from an F-distribution with (p-r) and (n-p-1) degrees of freedom.
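As a small numerical sketch, the statistic can be computed as below. It assumes the usual form of the lack-of-fit ratio, namely (RSS_r − RSS_p)/(p − r) divided by RSS_p/(n − p − 1); the function name and the input values are hypothetical.

```python
def lack_of_fit_f(rss_r, rss_p, n, r, p):
    """Lack-of-fit F for r selected variables against all p (plus a constant)."""
    numerator = (rss_r - rss_p) / (p - r)   # extra RSS per omitted variable
    denominator = rss_p / (n - p - 1)       # residual mean square, full model
    return numerator / denominator


# Hypothetical values: 30 observations, 2 of 5 variables selected
f_value = lack_of_fit_f(rss_r=120.0, rss_p=100.0, n=30, r=2, p=5)
print(f_value, "on", 5 - 2, "and", 30 - 5 - 1, "degrees of freedom")
```

The value would then be referred to an F-distribution with (p − r) and (n − p − 1) degrees of freedom.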
2.4.2 The Coefficient of Determination, R2
According to Miller (1990), the distribution of R2 for a random subset of r variables, when the
Y-variable is uncorrelated with the X-variables, is a beta distribution with
$$ \text{prob}(R^2 < z) = \frac{1}{B(a, b)} \int_0^z t^{a-1} (1 - t)^{b-1}\, dt $$
where a = r/2 and b = (n − r − 1)/2 if a constant has been included in the model but not counted in the
r variables. Using the beta distribution and fitting constants to their tables, as Miller (1990) points
out, Rencher and Pun obtained the following formula for the upper 100(1-γ)% point of the
distribution of the maximum R2 using Efroymson’s algorithm:
$$ R_\gamma^2 = \Big[ 1 + \frac{\log_e \gamma}{(\log_e N)^{1.8 N^{0.4}}} \Big] F^{-1}(\gamma) \tag{2.4.2.1} $$
where $N = {}^{p}C_r$ and $F^{-1}(\gamma)$ is the value of z such that prob(R2 < z) = γ.
Values of $F^{-1}(\gamma)$ can be obtained from tables of the incomplete beta function or from tables of
the F-distribution. Writing $Reg_r$ to denote the regression sum of squares on r variables, we have
$$ R^2 = \frac{Reg_r}{Reg_r + RSS_r}. $$
Write
$$ F = \frac{Reg_r / r}{RSS_r / (n - r - 1)} $$
as the usual variance ratio for testing the significance of the subset of r variables, if it had been
chosen a priori; then
$$ R^2 = \frac{r}{r + (n - r - 1)/F}. \tag{2.4.2.2} $$
Thus the value of R2 such that prob(R2 < z) = γ corresponds to a value of F with r and (n-r-1)
degrees of freedom for the numerator and denominator respectively. The reciprocal of a variance
ratio also has an F-distribution but with the degrees of freedom interchanged, so we use the tables
with (n-r-1) and r degrees of freedom for the numerator and denominator respectively, read off the
value whose upper tail area is γ, and then take the reciprocal of the
F-value read from the tables. The upper limit of R2 is then obtained by substitution in (2.4.2.2) and
finally into (2.4.2.1).
Miller (1990) points out that Aitkin advances the following argument:
If we decide a priori on the comparison of subset X2 with the full model, containing all the
variables in X, then we should use the likelihood-ratio test, which gives the variance ratio statistic:
$$ F = \frac{(RSS_r - RSS_p)/(p - r)}{RSS_p/(n - p)} \tag{2.4.3.1} $$
where the counts of variables (r and p) include one degree of freedom for a constant if it is
included in the models. Under the null hypothesis that none of the (p-r) variables excluded from
X2 is in the ‘true’ model, this quantity is distributed as F(p-r, n-p), subject to the assumptions of
independence, normality and homoscedasticity of the residuals from the model. Aitkin then considers
the statistic:
$$ U(X_2) = (p - r) F \tag{2.4.3.2} $$
The maximum value of U for all possible subsets including a constant is then
$$ U_{max} = \frac{RSS_1 - RSS_p}{RSS_p/(n - p)} $$
where RSS1 is the sum of squares of Y about the mean.
A simultaneous 100α% test for all the hypotheses β2 = 0 for all subsets X2 is obtained by testing
that:
$$ U(X_2) \le (p - 1) F(\alpha, p - 1, n - p). \tag{2.4.3.3} $$
Subsets which satisfy (2.4.3.3), and are such that if any variable is removed from the subset the
reduced subset fails to satisfy the condition, are referred to as ‘minimal adequate sets’.
2.5 COMPARISON OF MODELS: SOLUTION CRITERIA
Once a manageable set of models is reached, criteria are needed to select or decide on an appropriate
subset among the contending subsets. The accuracy of any model is measured by a discrepancy, a
measure of lack of fit of the model at hand. The model which minimises the expected discrepancy
is selected as the ‘best’ model. The overall discrepancy consists of two components: discrepancy due
to approximation (bias) and discrepancy due to estimation (variance). The discrepancy due to
approximation decreases as the number of parameters increases; the discrepancy due to estimation
increases as the number of parameters increases.
A consistent estimator of the expected discrepancy is called a criterion and is used for model
selection.
2.5.1 Akaike’s Information Criterion (AIC) and the Bayes Information Criterion (BIC).
According to George (2000) these two criteria are among the most popular criteria, motivated from
very different view points.
Letting $\hat{l}_\gamma$ denote the maximum log likelihood of the γth model, AIC selects the model which
maximises
$$ A = \hat{l}_\gamma - q_\gamma \tag{2.5.1.1} $$
where $q_\gamma$ is defined in paragraph (1.2) of Chapter 1. Miller (1990) points out that the AIC has often
been used as the stopping rule for selecting ARIMA (auto-regressive, integrated, moving average)
models where selection is not only between models with different numbers of parameters but also
between many models of the same size. He further suggests that the AIC, with various
modifications of it, can be applied in situations in which normality is not assumed.
The BIC selects the model which maximises
$$ B = \hat{l}_\gamma - \frac{1}{2} (\log n)\, q_\gamma. $$
George (2000) mentions Haughton as saying that BIC is consistent when the true model is fixed and
Shibata as saying that AIC is consistent if the dimensionality of the true model increases with n, the
number of observations (at an appropriate rate).
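The two criteria can be compared on a toy example. The sketch below assumes that $q_\gamma$ is the number of parameters of the γth model; the function names, the three candidate models and their log likelihoods are hypothetical.

```python
import math


def aic_score(loglik, q):
    """A = l - q, as in (2.5.1.1); larger is better."""
    return loglik - q


def bic_score(loglik, q, n):
    """B = l - (1/2)(log n) q; larger is better."""
    return loglik - 0.5 * math.log(n) * q


# Hypothetical candidates: name -> (maximised log likelihood, number of parameters)
models = {"small": (-105.0, 2), "medium": (-100.0, 4), "large": (-99.5, 7)}
n = 1000  # assumed number of observations

best_aic = max(models, key=lambda m: aic_score(*models[m]))
best_bic = max(models, key=lambda m: bic_score(*models[m], n))
print("AIC selects:", best_aic, "| BIC selects:", best_bic)
```

Because the BIC penalty per parameter, (1/2) log n, exceeds the AIC penalty of 1 once n is larger than about 7, BIC tends to favour smaller models as the sample size grows.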
2.5.2 Cp – Statistics (Cr – Criterion)
According to Hocking & Leslie (1967), C L Mallows suggests that the standardised total squared
error be used as a criterion and he developed an estimate Cp of this quantity given by:
$$ C_p = \frac{RSS_r}{\hat\sigma^2} - (n - 2r), \tag{2.5.2.1} $$
where r is the number of variables in the regression, RSSr is as defined in (2.1.3.1) and $\hat\sigma^2$ is an
estimate of $\sigma^2$.
Now, if an equation with r parameters is adequate, that is, does not suffer from lack of fit, then
E(RSSr) = (n-r)$\sigma^2$ so that
$$ E(C_p) \approx \frac{(n - r)\sigma^2}{\sigma^2} - (n - 2r) = r \tag{2.5.2.2} $$
for an adequate model. It follows that a plot of Cp versus r will show up the ‘adequate models’ as
points fairly close to the line Cp = r. Thus subsets with small Cp and Cp close to r will be
considered to be good.
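A short numerical sketch of (2.5.2.1) follows. The function name and the numbers are hypothetical; the first call uses RSS_r = (n − r)σ², the value expected for an adequate model, and the second uses a larger RSS to mimic lack of fit.

```python
def mallows_cp(rss_r, sigma2_hat, n, r):
    """C_p = RSS_r / sigma2_hat - (n - 2r), as in (2.5.2.1)."""
    return rss_r / sigma2_hat - (n - 2 * r)


n, sigma2_hat = 30, 2.5  # assumed sample size and estimate of sigma^2

# Adequate model: RSS_r equals its expectation (n - r) * sigma^2 with r = 4
adequate = mallows_cp(rss_r=(n - 4) * sigma2_hat, sigma2_hat=sigma2_hat, n=n, r=4)

# Model with lack of fit: RSS_r is well above (n - r) * sigma^2 with r = 3
biased = mallows_cp(rss_r=90.0, sigma2_hat=sigma2_hat, n=n, r=3)

print(adequate, biased)
```

In a plot of Cp against r, the first model would fall on the line Cp = r, while the second would lie well above it.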
Certainly, of the $\binom{p}{r}$ possible regressions of size r, only a few will be considered to be good. We are
interested in that subset of size r for which the residual sum of squares, and thus the Cp, is minimal.
Hocking & Leslie (1967) further describe a method that allows the subset of size r to be identified
after having compared the residual sum of squares for only a small fraction of the possible $\binom{p}{r}$
subsets. This computation will mostly yield those regressions with small Cp. Reference is made to
the k = p – r variables which are to be removed from the regression rather than the variables which
are to be retained. Reference shall also be made to the “reduction in regression sum of squares”
due to removing a set of k variables. Now the set of k variables for which this reduction is
minimum determines that set of r variables to be retained for which the residual sum of squares is
minimum.
If $\sigma^2$ is estimated by the residual mean square for the complete regression, and $Red_r$ denotes the
reduction, the Cp statistic can also be computed from this reduction:
$$ C_p = \frac{Red_r}{\hat\sigma^2} + (2r - p) \tag{2.5.2.3} $$
If a single variable, say the ith, is removed from the regression, the reduction is given by $\hat\sigma^2 t_i^2$,
where
$$ t_i^2 = \frac{b_i^2}{\hat\sigma_{b_i}^2} \tag{2.5.2.4} $$
is the square of the usual t-statistic associated with the ith regression coefficient. The $b_i$ are defined
by $b_i = (D_i' X'X D_i)^{-1} D_i' X'Y$. Let
$$ \theta_i = \hat\sigma^2 t_i^2 \tag{2.5.2.5} $$
denote the reduction due to eliminating the ith variable, where i = 1,…,p.
First, we compute the full regression by solving the normal equations
$$ X'X\beta = X'Y \tag{2.5.2.6} $$
and then evaluate the p univariate reductions $\theta_i$. We assume that the variables are labelled
according to the magnitude of the $\theta_i$, that is,
$$ \theta_1 \le \theta_2 \le \cdots \le \theta_p. \tag{2.5.2.7} $$
With this labelling, the subset of size p-1 with minimum residual sum of squares is obtained by
deleting the first variable.
This approach is based on a fundamental property of quadratic forms, which states that if the
reduction in the regression sum of squares due to eliminating any set of variables for which the
maximum subscript is j is not greater than $\theta_{j+1}$, then no subset including any variable with subscript
greater than j can result in a smaller reduction.
We now describe a sequential method consisting of at most r+1 stages for each value of
r = 1, 2, …, p-2. The first stage consists of computing the reduction due to eliminating variables
1, 2, …, k for k = p-r under the labelling indicated in expression (2.5.2.7). If this reduction does not
exceed $\theta_{k+1}$, then, according to the above property, the process is terminated and the regression
consisting of the r variables k+1,…,p is the ‘best’ subset of size r in the sense of minimum
residual sum of squares.
If the reduction computed in the first stage exceeds $\theta_{k+1}$, then no decision can be made and we
proceed to the second stage, in which variable k+1 is included among the candidates for elimination.
The $\binom{k}{1}$ reductions due to eliminating any set of k variables selected from the first k+1 but
containing the (k+1)st variable are then computed. If the smallest of the $1 + \binom{k}{1}$ reductions
computed to this point does not exceed $\theta_{k+2}$, the process terminates and the corresponding subset
is ‘best’. If not, no decision is taken at this second stage and we proceed to the third stage.
In the third stage the reductions are computed for all subsets of size k selected from the first
k+2 variates which contain variable k+2, a total of $\binom{k+1}{2}$ computations. The minimum of the
$1 + \binom{k}{1} + \binom{k+1}{2}$ reductions from the first three stages is now compared with $\theta_{k+3}$ and the iteration
either terminates or continues to the next stage.
In general, at any stage, say the qth, a total of $\binom{k+q-2}{q-1}$ reductions must be computed and checked to
see if the ‘best’ subset can be identified. At this stage the largest subscript on any variable being
considered is k+q-1, and hence the search can be terminated if the minimum of the
$\sum_{j=1}^{q} \binom{k+j-2}{j-1}$ reductions computed in the first q stages does not exceed $\theta_{k+q}$, in which case the
corresponding subset is ‘best’. If not, we proceed to stage q+1, where subsets of size k containing
variable k+q are considered. However, it has been observed that it rarely happens that all r+1 stages
are completed, except for very small values of r.
2.5.3 The Sp – Statistics ( Sr – Statistics)
According to Thomson (1978) this method is regarded as being amongst the most suitable for
variable selection in multivariable regression analysis where dependent variable y and the p
independent variables have a (p+1)-dimensional normal distribution. The criterion used minimises
the expected squared distribution between the true and predictable values of the dependent variable
y.
The value of y, conditional on some predictor set $xD_r$, r ≤ p, may be expressed as follows:
$$y = \beta_0 + (xD_r - \bar{X}_r)\beta_r + \varepsilon_r \qquad (2.5.3.1)$$
where $\bar{X}_r$ is the (1×r) vector of means obtained from a regression sample for the r variables being used and $\varepsilon_r \sim N(0; \sigma_r^2)$. For some particular predictor set x, a future value of y is predicted by $\hat{y}_r$:
$$\hat{y}_r = b_0 + (xD_r - \bar{X}_r)b_r \qquad (2.5.3.2)$$
where $b_r = [D_r'(X - 1_n\bar{X})'(X - 1_n\bar{X})D_r]^{-1} D_r'(X - 1_n\bar{X})'Y$ and n is the regression sample size.
The method involves calculating the statistic
$$S_p = \frac{MSE_r}{n-r-2} \qquad (2.5.3.3)$$
or
$$S_p = \frac{RED_r + SSE_p}{(n-r)(n-r-2)} \qquad (2.5.3.4)$$
for subsets of the independent variables, where $RED_r$ is the reduction in the regression sum of squares between the full p-variable regression and the r-variable regression, r = 1, 2, …, p, and $SSE_p$ is the error sum of squares of the full regression. Equation (2.5.3.4), as opposed to (2.5.3.3), provides an efficient computational procedure for the use of this statistic.
The subset of variables chosen is the one which yields the smallest value of Sp. However, if the
independent variables cannot be regarded as randomly and normally distributed, the use of Cp is
suggested.
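The selection rule above can be sketched in a few lines of Python. This is only an illustration: the function names and the error sums of squares are hypothetical, and $MSE_r$ is taken as $SSE_r/(n-r)$, which is the form implied by comparing (2.5.3.3) with (2.5.3.4).

```python
def sp_statistic(sse_r, n, r):
    """S_p for an r-variable subset, as in (2.5.3.3), assuming MSE_r = SSE_r / (n - r)."""
    if n - r - 2 <= 0:
        raise ValueError("need n > r + 2")
    mse_r = sse_r / (n - r)
    return mse_r / (n - r - 2)


def sp_from_reduction(red_r, sse_p, n, r):
    """The same statistic via (2.5.3.4), using SSE_r = RED_r + SSE_p."""
    return (red_r + sse_p) / ((n - r) * (n - r - 2))


# Hypothetical error sums of squares for subsets of size r (illustrative numbers only)
n = 30
sse_full = 40.0                                   # SSE_p of the full 4-variable model
sse_by_size = {1: 95.0, 2: 52.0, 3: 41.5, 4: 40.0}

sp_values = {r: sp_statistic(sse, n, r) for r, sse in sse_by_size.items()}
best_r = min(sp_values, key=sp_values.get)        # subset size with the smallest S_p
```

With these made-up numbers the three-variable subset is preferred to the full model, because the small drop in the error sum of squares does not compensate for the lost degrees of freedom.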
2.5.4 RMS, R2 and Adjusted R2 Statistics
These are common criterion functions which are simple functions of the residual sum of squares for the r-term equation, denoted by $RSS_r$.
2.5.4.1 The Residual Mean Square
The residual mean square is given by:
$$RMS_r = \frac{RSS_r}{n-r} \qquad (2.5.4.1)$$
Hocking (1976) points out that many statisticians prefer the residual mean square, $RMS_r$, as a criterion function. $RMS_r$ is plotted against r and the choice of r is based on
I. The minimum $RMS_r$,
II. The value of r such that $RMS_r$ = RMS for the full equation, or
III. The value of r such that the locus of the smallest $RMS_r$ turns sharply upwards.
2.5.4.2 The Squared Multiple Correlation Coefficients (SMCC)
The SMCC is given by:
$$R_r^2 = 1 - \frac{RSS_r}{TSS} \qquad (2.5.4.2.1)$$
The plot of $R_r^2$ versus r may yield a locus of the maximum $R_r^2$ which remains quite flat as r is decreased and then turns sharply down. The value of r at which this ‘knee’ in the $R_r^2$ plot occurs is frequently used to indicate the number of terms in the model. $R^2$ measures the residual sum of squares relative to the total sum of squares and, hence, would appear to be a reasonable measure of model adequacy. The relation of $R^2$ to $C_p$ is given by
$$C_p = \frac{(n-t-1)(1-R_r^2)}{1-R^2} + 2p - n \qquad (2.5.4.2.2)$$
It follows from this relation that, while the $R_r^2$ plot may be quite flat for a given range of r, the coefficient (n-t-1) can magnify small differences, causing $C_p$ to increase dramatically as r is decreased. Therefore, the $R_r^2$ criterion may suggest the deletion of more variables than the minimum $C_p$ criterion. Simulation studies by some authors, as described by Hocking (1976), indicate that essential variables may be deleted using the $R_r^2$ criterion. Also, lacking a precise definition of the knee, the qualitative inspection of the $R_r^2$ plot is dependent on the scale.
2.5.4.3 The Adjusted R2 or Fisher’s A-statistics
The adjusted $R^2$-statistic (adjusted for degrees of freedom) is usually defined as:
$$\bar{R}_r^2 = 1 - (1 - R_r^2)\,\frac{n-1}{n-r} \qquad (2.5.4.3)$$
as an alternative to $R^2$. Some users recommend the adjusted squared multiple correlation coefficient $\bar{R}^2$ and suggest using the value of r for which $\bar{R}_r^2$ is maximum. Since $1 - R_r^2 = RSS_r/TSS$ and $RMS_r = RSS_r/(n-r)$, the adjusted $R^2$-statistic can be written as:
$$\bar{R}_r^2 = 1 - (n-1)\,\frac{RMS_r}{TSS}.$$
Because n-1 and TSS do not depend on the subset, the $\bar{R}_r^2$ procedure is exactly equivalent to minimising $RMS_r$. There appears to be no advantage in using $\bar{R}_r^2$ over $RMS_r$ in view of the above relation.
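The equivalence between maximising the adjusted $R^2$ and minimising the residual mean square can be checked numerically. The sketch below uses hypothetical residual sums of squares; the function names are illustrative only.

```python
def rms(rss_r, n, r):
    """Residual mean square, RMS_r = RSS_r / (n - r)."""
    return rss_r / (n - r)


def r_squared(rss_r, tss):
    """Squared multiple correlation coefficient, R_r^2 = 1 - RSS_r / TSS."""
    return 1.0 - rss_r / tss


def adjusted_r_squared(rss_r, tss, n, r):
    """Adjusted R^2 as defined in (2.5.4.3)."""
    return 1.0 - (1.0 - r_squared(rss_r, tss)) * (n - 1) / (n - r)


# Hypothetical values: n observations, total sum of squares, and RSS_r by subset size r
n, tss = 25, 200.0
rss_by_r = {2: 80.0, 3: 60.0, 4: 57.0, 5: 56.5}

rms_values = {r: rms(rss, n, r) for r, rss in rss_by_r.items()}
adj_values = {r: adjusted_r_squared(rss, tss, n, r) for r, rss in rss_by_r.items()}

r_min_rms = min(rms_values, key=rms_values.get)   # minimises RMS_r
r_max_adj = max(adj_values, key=adj_values.get)   # maximises adjusted R^2
```

Both criteria pick the same subset size, as the relation between the two statistics requires.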
Chapter 3
THE LOGISTIC MODEL AND VARIABLE SELECTION FOR A
BINARY OUTCOME VARIABLE
Having discussed variable selection procedures for continuous outcome variables in Chapter 2, we now consider situations where the response variable is a categorical random variable with only two possible outcomes. First, a model and the estimation of its parameters are discussed in detail. Then variable selection for this model is presented.
The discussion that follows draws on the following references: Czepiel, S; Guyon, I and Elisseeff, A (2002); Joubert, G (1994); Hosmer, D W and Lemeshow, S (1989); Larson, P V (2001); Menard, S (2001).
3.1 BINARY DATA
When the response variable is dichotomous, it is convenient to denote one of the outcomes as
‘success’ and the other as ‘failure’. For example, if a patient is cured of a disease, the response is
‘success’, if not, then the response is ‘failure’. If a mouse dies from toxic exposure, the response is
‘success’, if not (i.e. if it survives) the response is ‘failure’. It is standard to let the response
variable Z be the binary variable, which attains the value 1, if the outcome is ‘success’, and 0 if
the outcome is ‘failure’.
Let π = P(Z=1) so that P(Z=0) = 1 – π, then Z ~ B(1, π). Suppose that data on p predictor variables, $x_1, \ldots, x_p$, are available for each patient or mouse. The objective is to investigate the relationship between π and the predictor variables. In a regression situation, each response variable is associated with a given set of values of the explanatory variables $x_1, \ldots, x_p$. For example
whether or not a patient is cured of a disease may depend on the particular medical treatment the
patient is given, the patient’s general state of health, age, gender, etc.; whether or not an item in a
manufacturing process passes the quality control may depend on various conditions regarding the
production process, such as temperature, quality of raw material, time since last service of the
machinery, etc. It is often possible to group the observations in such a way that all observations
within a group have the same values of predictor variables. For instance, we may group the
patients in the disease example according to type of medical treatment, gender and age group, etc
such that there are several patients in each grouping. When the data can be grouped it is easier to
record the number of successes and failures for each group, rather than recording a long series of
0s and 1s.
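The grouping described above can be sketched as follows. The records, the covariate patterns and the function name are hypothetical and serve only to show how a series of 0s and 1s collapses into success counts and group sizes.

```python
from collections import defaultdict


def group_binary_records(records):
    """Collapse (covariate_pattern, z) records into {pattern: (successes, n)}."""
    counts = defaultdict(lambda: [0, 0])
    for pattern, z in records:
        counts[pattern][0] += z      # number of successes (z = 1)
        counts[pattern][1] += 1      # number of observations in the group
    return {pattern: (y, n) for pattern, (y, n) in counts.items()}


# Hypothetical individual records: (treatment, gender) pattern and 0/1 outcome
records = [
    (("A", "F"), 1), (("A", "F"), 0), (("A", "M"), 1),
    (("B", "F"), 0), (("B", "F"), 0), (("A", "F"), 1),
]
grouped = group_binary_records(records)
```

Each key of `grouped` is one distinct combination of predictor values, and each value holds the number of successes and the number of observations for that combination.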
Example 3.1 (Larsen 2005)
The link between the use of oral contraceptives and the incidence of myocardial infarction was
investigated. The table below gives the number of women in the study, using the contraceptive pill,
who suffered a myocardial infarction, and the number using the pill who did not suffer a
myocardial infarction. The corresponding numbers for women not using the pill are also given.
                    Infarction
                    Yes     No
Pill      Yes        23     34
          No         35    132
Example 3.1
3.2 LOGISTIC REGRESSION
Binomial logistic regression is a form of regression which is used when the response variable is a
dichotomy and the predictor variable(s) is/are of any type (i.e. discrete or continuous). It can be
used to predict a response variable on the basis of values of predictors and to determine the
percentage of variance in the response variable explained by the predictors; to rank the relative
importance of predictors; to assess interaction effects; and to understand the impact of covariate
control variables. Logistic regression has proven to be one of the most versatile techniques in the
class of generalised linear models (Czepiel, S).
Whereas linear regression models equate the expected value of the dependent variable to a linear combination of predictor variables and their corresponding parameters, generalised linear models equate this linear combination to some function of the probability of a given outcome on the dependent variable. In logistic regression, that function is the logit transform: the natural logarithm of the odds that some event will occur. In linear regression, parameters are estimated using the method of least squares by minimising the sum of squared deviations of predicted values from observed values. For logistic regression, however, least squares estimation is not capable of producing minimum variance unbiased (minvu) estimators of the actual parameters. In its place, maximum likelihood estimation is used to solve for the parameters.
3.2.1 Assumptions
Logistic regression is popular in part because it enables the researcher to overcome many of the restrictive assumptions of ordinary least squares (OLS) regression:
i) Logistic regression does not require a linear relationship between the predictors and the response variable, but it does assume a linear relationship between the predictors and the logit of the response variable.
ii) The response need not be normally distributed (its distribution is assumed to belong to the exponential family of distributions, such as the normal, Poisson, binomial and gamma).
iii) The response variable need not be homoscedastic for each combination of levels of the predictors; that is, there is no homogeneity of variance assumption.
iv) Normally distributed errors are not assumed. However, errors are assumed to be independent.
v) Logistic regression does not require that the predictors be measured on an interval scale.
vi) Logistic regression does not require the response variable to be unbounded.
3.2.2 The Multiple linear Logistic Regression Model
Let Z be a dichotomous random variable (with outcomes termed ‘success’ and ‘failure’) denoting the outcome of some experiment and let X = ($x_1, \ldots, x_p$) be a collection of predictor variables. Given a data set with a total sample size of M, where each observation is independent of all the others, Z can be considered as a column vector of M binomial random variables $Z_i$. The data are aggregated such that each row represents one distinct combination of values of the predictor variables. The rows are often referred to as ‘populations’. Let N represent the total number of populations and let n be a column vector with elements $n_i$ representing the number of observations in population i, for i = 1 to N, where $\sum_{i=1}^{N} n_i = M$, the total sample size.
Let Y be a column vector of length N where each element $Y_i$ is a random variable representing the number of successes of Z for population i. Let the column vector y contain elements $y_i$ representing the observed counts of the number of successes for each population. Let π be a column vector also of length N with elements $\pi_i = P(Z_i = 1 \mid i)$, i.e., the probability of success for any given observation in the ith population.
The linear component of the model contains the design matrix and the vector of parameters to be estimated. The design matrix of predictor variables, X, is composed of N rows and p+1 columns, where p is the number of predictor variables specified in the model. In each row of the design matrix, the first element $x_{i0} = 1$ for all i. This is the intercept. The parameter vector, β, is a column vector of length p+1. There is one parameter corresponding to each of the p columns of predictor variable settings in X, plus one, $\beta_0$, for the intercept.
The logistic regression model equates the logit transform, the log-odds of the probability of a success, to the linear component:
$$\mathrm{logit}(\pi_i) = \log\left(\frac{\pi_i}{1-\pi_i}\right) = \sum_{k=0}^{p} x_{ik}\beta_k = \beta_0 x_{i0} + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}, \qquad i = 1, 2, \ldots, N \qquad (3.2.2.1)$$
If some of the independent variables are discrete (nominal scaled variables such as race, sex, treatment group, and so forth), it is inappropriate to include them in the model as if they were interval scaled. The numbers used to represent the various levels are simply identifiers and have no numeric significance. The method of choice is to use a collection of design variables (or dummy variables). For example, if one of the predictor variables is race, coded as “white”, “black” or “other”, then two design variables are necessary. Table 3.1 illustrates the coding of the design variables, D1 and D2.
              Design variable
RACE          D1      D2
White          0       0
Black          1       0
Other          0       1
Table 3.1. An example of the coding of the design variables for race, coded at three levels.
(In general, if a nominal scaled variable has k possible values, then k-1 design variables are needed).
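The coding in Table 3.1 can be generated mechanically. The following minimal sketch assumes reference-cell coding with the first listed level as the reference; the function name is illustrative and not taken from any package.

```python
def design_variables(value, levels):
    """Reference-cell coding: the first level is the reference and is coded all zeros."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    # One design variable per non-reference level, so k levels give k - 1 variables
    return [1 if value == level else 0 for level in levels[1:]]


race_levels = ["White", "Black", "Other"]
coded = {race: design_variables(race, race_levels) for race in race_levels}
```

The dictionary `coded` reproduces the rows of Table 3.1, with each list holding the values of D1 and D2.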
3.3 PARAMETER ESTIMATION
The goal of logistic regression is to estimate the p+1 unknown parameters in equation (3.2.2.1). This is done with maximum likelihood estimation, which entails finding the set of parameters for which the probability of the observed data is greatest.
3.3.1 Maximum likelihood Estimation
The maximum likelihood estimation equation is derived from the probability distribution of the dependent variable. Since each $y_i$ represents a binomial count in the ith population, the joint density function of Y is:
$$f(y|\beta) = \prod_{i=1}^{N} \frac{n_i!}{y_i!\,(n_i-y_i)!}\,\pi_i^{y_i}(1-\pi_i)^{n_i-y_i} \qquad (3.3.1.1)$$
For each population, there are $\binom{n_i}{y_i}$ different ways to arrange $y_i$ successes from $n_i$ trials. Since the probability of a success for any one of the $n_i$ trials is $\pi_i$, the probability of $y_i$ successes is $\pi_i^{y_i}$. Likewise, the probability of $n_i - y_i$ failures is $(1-\pi_i)^{n_i-y_i}$.
The joint probability function in equation (3.3.1.1) expresses the values of y as a function of known, fixed values for β. The likelihood function has the same form as the probability function, except that the roles of the arguments are reversed: the likelihood function expresses the values of β in terms of the known values for y. Thus,
$$L(\beta|y) = \prod_{i=1}^{N} \frac{n_i!}{y_i!\,(n_i-y_i)!}\,\pi_i^{y_i}(1-\pi_i)^{n_i-y_i} \qquad (3.3.1.2)$$
The maximum likelihood estimates are the values for β that maximise the likelihood function in equation (3.3.1.2). The critical points of a function (maxima and minima) occur when the first derivative equals 0. Attempting to take the derivative of equation (3.3.1.2) with respect to β is a difficult task due to the complexity of the multiplicative terms. However, the likelihood equation can be considerably simplified. We ignore the factorial terms since they do not contain $\pi_i$ and their exclusion does not change the values of β at which the maximum occurs. After rearranging equation (3.3.1.2) we obtain:
$$L(\beta|y) = \prod_{i=1}^{N} \left(\frac{\pi_i}{1-\pi_i}\right)^{y_i}(1-\pi_i)^{n_i} \qquad (3.3.1.3)$$
Exponentiating both sides of (3.2.2.1) gives
$$\frac{\pi_i}{1-\pi_i} = e^{\sum_{k=0}^{p} x_{ik}\beta_k} \qquad (3.3.1.4)$$
which after solving for $\pi_i$ becomes
$$\pi_i = \frac{e^{\sum_{k=0}^{p} x_{ik}\beta_k}}{1 + e^{\sum_{k=0}^{p} x_{ik}\beta_k}} \qquad (3.3.1.5)$$
Substituting equation (3.3.1.4) for the first term and equation (3.3.1.5) for the second term, equation (3.3.1.3) becomes:
$$L(\beta|y) = \prod_{i=1}^{N} \left(e^{\sum_{k=0}^{p} x_{ik}\beta_k}\right)^{y_i}\left(1 - \frac{e^{\sum_{k=0}^{p} x_{ik}\beta_k}}{1 + e^{\sum_{k=0}^{p} x_{ik}\beta_k}}\right)^{n_i} \qquad (3.3.1.6)$$
which can be written as:
$$L(\beta|y) = \prod_{i=1}^{N} \left(e^{y_i\sum_{k=0}^{p} x_{ik}\beta_k}\right)\left(1 + e^{\sum_{k=0}^{p} x_{ik}\beta_k}\right)^{-n_i} \qquad (3.3.1.7)$$
This is the kernel of the likelihood function to be maximised. We simplify by taking its logarithm, and equation (3.3.1.7) becomes:
$$\lambda(\beta) = \sum_{i=1}^{N}\left[\, y_i\left(\sum_{k=0}^{p} x_{ik}\beta_k\right) - n_i\log\left(1 + e^{\sum_{k=0}^{p} x_{ik}\beta_k}\right)\right] \qquad (3.3.1.8)$$
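We now find the critical points of the log likelihood function by differentiating it with respect to each $\beta_k$ and obtain:
$$\frac{\partial\lambda(\beta)}{\partial\beta_k} = \sum_{i=1}^{N}\left(y_i x_{ik} - n_i\pi_i x_{ik}\right) \qquad (3.3.1.9)$$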
The critical point will be a maximum if the matrix of second partial derivatives is negative definite. This matrix is formed by differentiating each of the p+1 equations in equation (3.3.1.9) a second time with respect to each element of β, denoted here by $\beta_{k'}$. The general form of the matrix of second partial derivatives is
$$\begin{aligned}
\frac{\partial^2\lambda(\beta)}{\partial\beta_k\,\partial\beta_{k'}} &= \frac{\partial}{\partial\beta_{k'}}\sum_{i=1}^{N}\left(y_i x_{ik} - n_i x_{ik}\pi_i\right) \\
&= \frac{\partial}{\partial\beta_{k'}}\sum_{i=1}^{N} -n_i x_{ik}\pi_i \\
&= -\sum_{i=1}^{N} n_i x_{ik}\,\frac{\partial}{\partial\beta_{k'}}\left(\frac{e^{\sum_{k=0}^{p} x_{ik}\beta_k}}{1 + e^{\sum_{k=0}^{p} x_{ik}\beta_k}}\right) \\
&= -\sum_{i=1}^{N} n_i x_{ik}\,\pi_i(1-\pi_i)\,x_{ik'} \qquad (3.3.1.10)
\end{aligned}$$
Every diagonal element (k = k') of this matrix is minus a sum of the non-negative terms $n_i x_{ik}^2\pi_i(1-\pi_i)$, and the matrix as a whole is negative definite provided the design matrix has full column rank. Thus the critical point will be a maximum, following the result obtained in equation (3.3.1.10).
3.3.2 The Newton-Raphson Method
Setting the equations in equation (3.3.1.9) equal to zero results in a system of p+1 nonlinear equations, each with p+1 unknown variables. The solution to the system is the vector $\hat{\beta}$ with elements $\hat{\beta}_k$. However, solving a system of nonlinear equations is not easy since the solution cannot be derived algebraically as it can in the case of linear equations. The solution must be found using an iterative process. The most popular method for solving systems of nonlinear equations is Newton’s method, also known as the Newton-Raphson method.
It is more convenient to use matrix notation to express each step of the Newton-Raphson method. We can write equation (3.3.1.10) as $\lambda''(\beta) = -\sum_{i=1}^{N} n_i x_{ik}\,\pi_i(1-\pi_i)\,x_{ik'}$, where k and k' index the rows and columns of the matrix.
Let $\beta^{(0)}$ represent the vector of initial approximations for each $\beta_k$, then the first step of Newton-Raphson can be expressed as:
$$\beta^{(1)} = \beta^{(0)} + [-\lambda''(\beta^{(0)})]^{-1}\,\lambda'(\beta^{(0)}) \qquad (3.3.2.1)$$
Let μ be a column vector of length N with elements $\mu_i = n_i\pi_i$. Each element of μ can be expressed as $\mu_i = E(y_i)$, the expected value of $y_i$. Using matrix multiplication, we can show that:
$$\lambda'(\beta) = X'(y-\mu) \qquad (3.3.2.2)$$
is a column vector of length p+1 whose elements are $\partial\lambda(\beta)/\partial\beta_k$, as derived in equation (3.3.1.9). Let W be a square matrix of order N, with elements $n_i\pi_i(1-\pi_i)$ on the diagonal and zeros everywhere else. Again, using matrix multiplication, we can verify that
$$\lambda''(\beta) = -X'WX \qquad (3.3.2.3)$$
is a (p+1) × (p+1) square matrix with elements $\partial^2\lambda(\beta)/\partial\beta_k\,\partial\beta_{k'}$. Now equation (3.3.2.1) can be written as
$$\beta^{(1)} = \beta^{(0)} + [X'WX]^{-1}X'(y-\mu) \qquad (3.3.2.4)$$
We continue to apply equation (3.3.2.4) until there is essentially no change between the elements of β from one iteration to the next. At this point, the maximum likelihood estimates are said to have converged, and the inverse of the matrix X'WX in equation (3.3.2.3), evaluated at the final estimates, holds the estimated variance-covariance matrix of the estimates.
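The iteration in equation (3.3.2.4) can be sketched in plain Python for grouped binomial data. This is a minimal illustration rather than a production routine: the helper names `solve` and `fit_logistic`, the zero starting values and the convergence tolerance are all assumptions, and the data are the counts of Example 3.1 with a single pill indicator.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, m))
        x[r] = (aug[r][m] - s) / aug[r][r]
    return x


def fit_logistic(X, y, n, tol=1e-10, max_iter=25):
    """Newton-Raphson for grouped binomial data, iterating equation (3.3.2.4)."""
    N, q = len(X), len(X[0])           # q = p + 1 columns, including the intercept
    beta = [0.0] * q                   # beta^(0): assumed starting values of zero
    for _ in range(max_iter):
        eta = [sum(X[i][k] * beta[k] for k in range(q)) for i in range(N)]
        pi = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        mu = [n[i] * pi[i] for i in range(N)]                    # mu_i = n_i * pi_i
        w = [n[i] * pi[i] * (1.0 - pi[i]) for i in range(N)]     # diagonal of W
        score = [sum(X[i][k] * (y[i] - mu[i]) for i in range(N)) for k in range(q)]
        info = [[sum(X[i][k] * w[i] * X[i][j] for i in range(N)) for j in range(q)]
                for k in range(q)]                               # X'WX
        step = solve(info, score)                                # [X'WX]^{-1} X'(y - mu)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


# Grouped data from Example 3.1: rows are pill users and non-users
X = [[1, 1], [1, 0]]      # intercept column x_i0 = 1, then the pill indicator
y = [23, 35]              # women with a myocardial infarction in each group
n = [57, 167]             # number of women in each group
beta_hat = fit_logistic(X, y, n)
```

Because this model has as many parameters as groups, the fitted intercept equals the log odds for non-users and the exponential of the slope equals the cross-product ratio of the 2×2 table, which gives a simple check on the iteration.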
3.4 ODDS AND ODDS RATIO
The odds of some event happening (e.g. the event Y = 1) is defined as the ratio of the probability that the event will occur to the probability that the event will not occur. That is, the odds of the event E is given by
$$\mathrm{Odds}(E) = \frac{P(E)}{P(\text{not } E)} = \frac{P(E)}{1 - P(E)}$$
Example 3.1 (continued)
An estimate of the probability of having a myocardial infarction for women in the study using the pill is given by $P(E_{pill})$ = 23/57 = 0.4035. Hence, the odds, amongst these women, of having a myocardial infarction when using the pill, is given by
$$\mathrm{Odds}(E_{pill}) = \frac{0.4035}{1 - 0.4035} = 0.6764.$$
That is, the probability of having a myocardial infarction is around two-thirds of the probability of not having a myocardial infarction, for women using the pill.
Similarly, for women who are not using the pill, an estimate of the probability of having a myocardial infarction is given by $P(E_{no\text{-}pill})$ = 35/167 = 0.2096. The odds of having a myocardial infarction, when not using the pill, is given by
$$\mathrm{Odds}(E_{no\text{-}pill}) = \frac{0.2096}{1 - 0.2096} = 0.2652.$$
Thus the odds are around 1 to 4 that a woman in the study not using the pill will have a myocardial infarction.
The odds ratio $R_{A,B}$ that compares the odds of events $E_A$ and $E_B$ (that is, event E occurring in group A and B, respectively), is defined as the ratio between the two odds; that is
$$R_{A,B} = \frac{\mathrm{odds}(E_A)}{\mathrm{odds}(E_B)} = \frac{P(E_A)/[1 - P(E_A)]}{P(E_B)/[1 - P(E_B)]}.$$
Example 3.1 (continued)
The odds ratio comparing the odds of having a myocardial infarction for women using the pill with the odds of having a myocardial infarction for women not using the pill, is given by
$$R_{pill,\,no\text{-}pill} = \frac{\mathrm{odds}(E_{pill})}{\mathrm{odds}(E_{no\text{-}pill})} = 0.6764/0.2652 = 2.5505.$$
That is, the odds of having a myocardial infarction are 2.55 times as high for women using the pill as for women not using the pill. In particular, if an odds ratio is equal to one, the odds are the same for the two groups.
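The calculations of Example 3.1 can be reproduced directly from the cell counts. The sketch below is illustrative; the function names are not from any package.

```python
def odds(events, total):
    """Odds of the event: P(E) / (1 - P(E)), with P(E) estimated by events / total."""
    p = events / total
    return p / (1.0 - p)


def odds_ratio(events_a, total_a, events_b, total_b):
    """Ratio of the odds in group A to the odds in group B."""
    return odds(events_a, total_a) / odds(events_b, total_b)


# Counts from Example 3.1: 23 of 57 pill users and 35 of 167 non-users had an infarction
odds_pill = odds(23, 57)
odds_no_pill = odds(35, 167)
r_pill_no_pill = odds_ratio(23, 57, 35, 167)
```

Working from the unrounded counts gives an odds ratio of about 2.551, which differs marginally from the value 2.5505 obtained above from the rounded odds.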
3.5 INTERPRETATION OF COEFFICIENTS
The interpretation of any fitted model requires that we be able to draw practical inferences from the estimated coefficients in the model. The estimated coefficients must be able to answer the questions that motivated the study. Interpretation involves determining the functional relationship between the response variable and the predictor variable, and appropriately defining the unit of change for the predictor variable.
3.5.1 Dichotomous Predictor Variables
The link function is the logit transformation $g(x) = \ln\{\pi(x)/[1-\pi(x)]\} = \beta_0 + \beta_1 x$ for one predictor variable x. We assume that x is coded either as 1 or 0. The log odds ratio (that is, the logarithm of the odds ratio) comparing the odds of success when the predictor variable has the value x = 1 with the odds of success when the predictor variable has the value x = 0 is given by
$$\ln(\psi) = \ln\{\pi(1)/[1-\pi(1)]\} - \ln\{\pi(0)/[1-\pi(0)]\}$$
where
$$\psi = \frac{\pi(1)/(1-\pi(1))}{\pi(0)/(1-\pi(0))} = \frac{e^{g(1)}}{e^{g(0)}}.$$
Now
$$\ln(\psi) = g(1) - g(0) = (\beta_0 + \beta_1 \cdot 1) - (\beta_0 + \beta_1 \cdot 0) = \beta_1.$$
It follows that the odds ratio is given by $\psi = e^{\beta_1}$.
In general, the estimate of the log odds ratio for any predictor variable at two different levels, say x = a versus x = b, is given by
$$\ln[\hat{\psi}(a,b)] = \hat{g}(x=a) - \hat{g}(x=b) = (\hat{\beta}_0 + \hat{\beta}_1 a) - (\hat{\beta}_0 + \hat{\beta}_1 b) = \hat{\beta}_1 (a-b) \qquad (3.4.1.1)$$
and the estimated odds ratio is
$$\hat{\psi}(a,b) = \exp[\hat{\beta}_1 (a-b)] \qquad (3.4.1.2)$$
where
$$\hat{\psi}(a,b) = \frac{\hat{\pi}(x=a)/(1-\hat{\pi}(x=a))}{\hat{\pi}(x=b)/(1-\hat{\pi}(x=b))}$$
is used to represent the odds ratio in equations (3.4.1.1) and (3.4.1.2).
The end points of the 100(1-α) percent confidence interval for the odds ratio given in equation (3.4.1.2) are
$$\exp[\hat{\beta}_1(a-b) \pm z_{1-\alpha/2}\,|a-b|\,\widehat{SE}(\hat{\beta}_1)]$$
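As an illustration of the interval above, the following sketch computes the estimated odds ratio and its confidence limits for the data of Example 3.1. Two standard results that are not stated in the text are assumed here: for a single dichotomous predictor the maximum likelihood estimate of $\beta_1$ equals the log of the cross-product ratio of the 2×2 table, and its large-sample standard error is the square root of the sum of the reciprocal cell counts. The function name is illustrative.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(beta1, se_beta1, a=1.0, b=0.0, alpha=0.05):
    """Estimated odds ratio for x = a versus x = b with its 100(1 - alpha)% limits."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    log_or = beta1 * (a - b)
    half_width = z * abs(a - b) * se_beta1
    return (math.exp(log_or),
            math.exp(log_or - half_width),
            math.exp(log_or + half_width))


# Example 3.1 counts; the two formulas below are the assumed 2x2-table results
beta1_hat = math.log((23 * 132) / (34 * 35))          # log cross-product ratio
se_hat = math.sqrt(1 / 23 + 1 / 34 + 1 / 35 + 1 / 132)
or_hat, lower, upper = odds_ratio_ci(beta1_hat, se_hat)
```

Under these assumptions the 95 percent interval runs from roughly 1.3 to 4.9 and therefore excludes an odds ratio of one.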
3.5.2 Polytomous Predictor Variables
In paragraph 3.2.2 we mentioned that if a nominal scaled variable has more than two levels, say k levels, we must model the variable using a collection of k-1 design variables, as illustrated in Table 3.1. With this method, we choose one level of the variable to be the reference level, usually the 0 level, against which all other levels are compared. We fit the model using design variables to obtain coefficients equal in number to the number of design variables.
Fitting the model using Table 3.1 will give the following results with regard to coefficients (here the category ‘white’ is used as the reference category):
Variable      Estimated coefficient
Black         $\hat{\beta}_{11}$
Other         $\hat{\beta}_{12}$
Table 3.2 An example showing the coefficients that will be obtained when fitting the model using the design variables in Table 3.1
Comparing Blacks with Whites we obtain
$$\begin{aligned}
\ln[\hat{\psi}(black, white)] &= \hat{g}(black) - \hat{g}(white) \\
&= [\hat{\beta}_0 + \hat{\beta}_{11}(D_1 = 1) + \hat{\beta}_{12}(D_2 = 0)] - [\hat{\beta}_0 + \hat{\beta}_{11}(D_1 = 0) + \hat{\beta}_{12}(D_2 = 0)] \\
&= \hat{\beta}_{11}
\end{aligned}$$
Similarly, comparing Others with Whites we obtain:
$$\ln[\hat{\psi}(other, white)] = \hat{\beta}_{12}$$
Thus the odds ratio of any level with the reference level will be the exponential of the coefficient of that level. If comparison is not with a reference level, the odds ratio will be the exponential of the difference between the coefficients in question.
The limits for a 100(1-α) percent CI for the coefficient are
$$\hat{\beta}_{ij} \pm z_{1-\alpha/2}\,\widehat{SE}(\hat{\beta}_{ij})$$
and the corresponding limits for the odds ratio are
$$\exp[\hat{\beta}_{ij} \pm z_{1-\alpha/2}\,\widehat{SE}(\hat{\beta}_{ij})].$$
3.5.3 One Continuous Predictor Variable
We assume that the logit is linear in the continuous predictor, x, so that the equation of the logit is $g(x) = \beta_0 + \beta_1 x$. The log odds ratio for a change of c units in x is obtained from the logit difference $g(x+c) - g(x) = c\beta_1$ and the associated odds ratio is obtained by exponentiating this logit difference, $\psi(c) = \psi(x+c, x) = \exp(c\beta_1)$. An estimate may be obtained by replacing $\beta_1$ with its maximum likelihood estimate $\hat{\beta}_1$. The end points of the 100(1-α) percent CI estimate of $\psi(c)$ are
$$\exp[c\hat{\beta}_1 \pm z_{1-\alpha/2}\,c\,\widehat{SE}(\hat{\beta}_1)]$$
3.5.4 Multivariable Case
We now face the situation in which the model contains two predictor variables, one dichotomous, say $x_1$ coded 0 and 1, and one continuous, $x_2$, with primary interest focused on the effect of the dichotomous variable. The equation of the logit will then be $g(x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$. If $x_1$ changes from 0 to 1 with $x_2 = a$ held constant, then the log odds ratio is:
$$\begin{aligned}
\ln(\psi) &= g(x_1 = 1, x_2 = a) - g(x_1 = 0, x_2 = a) \\
&= (\beta_0 + \beta_1 \cdot 1 + \beta_2 a) - (\beta_0 + \beta_1 \cdot 0 + \beta_2 a) \\
&= \beta_1
\end{aligned}$$
and the odds ratio is $\psi = e^{\beta_1}$.
Similarly, holding $x_1$ constant when $x_2$ changes from x to x + c, the odds ratio is $\psi = e^{c\beta_2}$. Confidence intervals are calculated as before.
3.5.5 One Dichotomous and one Continuous and their Interaction
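If the primary interest is focused on the effect of the dichotomous variable $x_1$, coded 0 and 1, and $x_2$ is the continuous covariate, then the equation of the logit with interaction is
$$g(x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2.$$
If $x_1$ changes from 0 to 1 and $x_2 = a$, the log odds ratio is
$$\begin{aligned}
\ln(\psi) &= g(x_1 = 1, x_2 = a) - g(x_1 = 0, x_2 = a) \\
&= (\beta_0 + \beta_1 \cdot 1 + \beta_2 a + \beta_3 a) - (\beta_0 + \beta_1 \cdot 0 + \beta_2 a + \beta_3 \cdot 0 \cdot a) \\
&= \beta_1 + \beta_3 a
\end{aligned}$$
The odds ratio is thus $\psi = e^{\beta_1 + a\beta_3}$, which depends not only on the coefficient of the variable of interest but also on the value a of the covariate. The 100(1-α) percent CI for the odds ratio is
$$\exp[\hat{\beta}_1 + \hat{\beta}_3 a \pm z_{1-\alpha/2}\,\widehat{SE}(\hat{\beta}_1 + \hat{\beta}_3 a)]$$
where
$$\widehat{SE}(\hat{\beta}_1 + \hat{\beta}_3 a) = \sqrt{\widehat{var}(\hat{\beta}_1) + a^2\,\widehat{var}(\hat{\beta}_3) + 2a\,\widehat{Cov}(\hat{\beta}_1, \hat{\beta}_3)}$$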
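When an interaction term is present, the odds ratio for the dichotomous variable must be evaluated at a chosen value a of the covariate. The sketch below shows the computation; the coefficient estimates, variances and covariance are hypothetical numbers chosen for illustration, and the function name is not from any package.

```python
import math
from statistics import NormalDist


def interaction_odds_ratio(b1, b3, var_b1, var_b3, cov_b1_b3, a, alpha=0.05):
    """Odds ratio for x1 = 1 versus x1 = 0 at covariate value x2 = a, with CI limits."""
    log_or = b1 + b3 * a
    # Standard error of b1 + a * b3 from the variances and covariance of the estimates
    se = math.sqrt(var_b1 + a ** 2 * var_b3 + 2.0 * a * cov_b1_b3)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical estimates (illustrative only)
b1, b3 = 0.8, -0.01
var_b1, var_b3, cov_b1_b3 = 0.25, 0.0001, -0.004

# The odds ratio changes with the covariate value, here evaluated at three ages
results = {age: interaction_odds_ratio(b1, b3, var_b1, var_b3, cov_b1_b3, age)
           for age in (30, 50, 70)}
or_50, lower_50, upper_50 = results[50]
```

The estimated odds ratio differs across the three covariate values, which is exactly the effect modification that the interaction term is meant to capture.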
3.6 TESTING FOR THE SIGNIFICANCE OF THE MODEL
3.6.1 The Likelihood Ratio Test
After fitting a particular multiple logistic regression model, we assess the model. We begin by assessing the significance of the p regression coefficients in the model. A likelihood ratio test for the overall significance of the p coefficients for the predictor variables in the model is performed. This test is based on the statistic
$$G = 2[L_p(\hat{\beta}) - L_p(0)]$$
where $L_p(\hat{\beta})$ denotes the maximised log-likelihood of the fitted model and $L_p(0)$ the log-likelihood with the p predictor coefficients set to zero. Under the null hypothesis that the coefficients for the predictors in the model are all equal to zero, the distribution of G will be chi-square with p degrees of freedom. The exceedance probability (P-value) for the test is $P = \Pr[\chi^2(p) > G]$. Rejection of the null hypothesis leads to the conclusion that at least one and perhaps all p coefficients are significantly different from zero.
3.6.2 Wald Test Statistics
Before we conclude that all of the coefficients are nonzero, we may wish to look at the univariate Wald test statistics:
$$W_j = \frac{\hat{\beta}_j}{\widehat{SE}(\hat{\beta}_j)}.$$
This test is commonly used to test the significance of the individual logistic regression coefficients for each predictor variable (that is, to test the null hypothesis that a particular logit (effect) coefficient is zero). The statistic is the ratio of the logit coefficient to its standard error and approximately follows the standard normal distribution under the said null hypothesis.
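A Wald test for a single coefficient can be sketched as follows, using the data of Example 3.1. As assumptions not stated in the text, the estimate of the pill coefficient is taken as the log cross-product ratio of the 2×2 table and its standard error as the square root of the sum of reciprocal cell counts; the two-sided P-value is obtained from the standard normal distribution.

```python
import math
from statistics import NormalDist


def wald_test(beta_hat, se_hat):
    """Wald statistic W = beta_hat / SE and its two-sided standard normal P-value."""
    w = beta_hat / se_hat
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(w)))
    return w, p_value


# Example 3.1 counts; both formulas are the assumed large-sample 2x2-table results
beta1_hat = math.log((23 * 132) / (34 * 35))
se_hat = math.sqrt(1 / 23 + 1 / 34 + 1 / 35 + 1 / 132)
w, p_value = wald_test(beta1_hat, se_hat)
```

Under these assumptions the statistic is a little above 2.8, so the hypothesis of a zero coefficient for pill use would be rejected at the 1 percent level.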
3.6.3 Using Deviances to Compare Likelihoods
Suppose that model one has t parameters while model two is a submodel of model one with only r of the t parameters, so that r < t. Model one will have a log-likelihood at least as large as that of model two. For large sample sizes, the difference between these two log-likelihoods, when multiplied by two, will behave like the chi-square distribution with t-r degrees of freedom. This fact can be used to test the null hypothesis that the t-r parameters that are not in model two are zero. The difference, denoted by D, is calculated using results from statistical packages, as follows:
D = -2[log L(model 2) – log L(model 1)]
= -2 log L(model 2) – (-2 log L(model 1)),
and D ~ $\chi^2(t-r)$ when the sample size is large.
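The deviance comparison can be illustrated with the data of Example 3.1, comparing the model containing the pill indicator (model one, t = 2) with the intercept-only model (model two, r = 1). In this sketch the fitted probabilities are written down directly, since with one dichotomous predictor they equal the observed proportions. The chi-square tail probability is computed with a small stdlib-only helper; the function names are illustrative.

```python
import math


def chi2_sf(x, df):
    """Upper tail P(chi-square with df degrees of freedom > x), for positive integer df."""
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)                    # df = 2 starting value
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))         # df = 1 starting value
    while a < df / 2.0 - 1e-9:                      # recurrence adds two df per step
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q


def binomial_loglik(groups, probs):
    """Kernel of the grouped binomial log-likelihood (factorial terms omitted)."""
    return sum(y * math.log(p) + (n - y) * math.log(1.0 - p)
               for (y, n), p in zip(groups, probs))


groups = [(23, 57), (35, 167)]                      # (infarctions, women) per group

# Model one: separate probability for pill users and non-users (t = 2 parameters)
loglik_1 = binomial_loglik(groups, [23 / 57, 35 / 167])

# Model two: one common probability, i.e. intercept only (r = 1 parameter)
pooled = (23 + 35) / (57 + 167)
loglik_2 = binomial_loglik(groups, [pooled, pooled])

D = -2.0 * (loglik_2 - loglik_1)
p_value = chi2_sf(D, df=1)                          # t - r = 1 degree of freedom
```

The omitted factorial terms are the same in both models, so they cancel in the difference and D is unaffected.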
3.7 INTERACTION AND CONFOUNDING
The term confounding is used by epidemiologists to describe a covariate that is associated with
both the outcome variable of interest AND a primary predictor variable or risk factor. When both
associations are present then the relationship between the risk factor and the outcome variable is
said to be confounded.
Consider a model containing a dichotomous risk factor variable and a continuous covariate. If the
association between the covariate and the outcome variable is the same within each level of risk
factor, there is no interaction between the covariate and the risk factor. Graphically the absence of
interaction yields a model with two parallel lines of outcome variable on covariate, one for each
level of risk factor variable. In general, the absence of interaction is characterised by a model that
contains no product terms involving two or more variables.
When interaction is present, the association between the risk factor and the outcome variable
differs or depends in some way on the level of the covariate. That is, the covariate modifies the
effect of the risk factor. The term ‘effect modifier’ is used by epidemiologists to describe a
variable that interacts with a risk factor.
Determining if a covariate is an effect modifier and/or a confounder involves several issues.
Determining effect modification status involves the parametric structure of the logit, while
determination of confounder status involves two things. First, the covariate must be associated
with the outcome variable. This implies the logit must have a nonzero slope in the covariate.
Second, the covariate must be associated with the risk factor.
In practice, the confounder status of a covariate is ascertained by comparing the estimated
coefficient for the risk factor variable from models containing and not containing the covariate.
Any “biologically important” change in the estimated coefficient for the risk factor would dictate
that the covariate is a confounder and should be included in the model, regardless of the statistical
significance of the estimated coefficient for the covariate. On the other hand, we believe that a
covariate is an effect modifier only when the interaction term added to the model is biologically
meaningful and statistically significant. When a covariate is an effect modifier, its status as a
confounder is of secondary importance and the estimate of the effect of the risk factor depends on
the specific value of the covariate.
3.8 VARIABLE SELECTION FOR LOGISTIC REGRESSION
According to Hosmer and Lemeshow (1989), in logistic regression the errors are assumed to
follow a binomial distribution and the significance of a variable is assessed via the likelihood ratio
chi-square. At any step in the procedure the most important variable in statistical terms will be the
one that produces the greatest change in the log-likelihood relative to the model not containing the
variable.
3.8.1 Purposeful Selection of Variables
3.8.1.1 Screening of Variables
This method is similar to the one discussed in section (2.2.1) under the proportional hazards regression model. This method is also analyst driven.
Hosmer and Lemeshow (1989) suggest that the selection process should begin with a careful univariate analysis of each variable. For nominal, ordinal, and continuous predictor variables with few integer values, it is suggested this be done with a contingency table of outcome (y = 0, 1) versus the k levels of the predictor variable. The likelihood ratio chi-square test with k-1 degrees of freedom is exactly equal to the value of the likelihood ratio test for the significance of the coefficients for the k-1 design variables in a univariate logistic regression model that contains that single predictor variable.
Particular attention should be paid to any contingency table with a zero cell. Strategies for handling zero cells include: collapsing the categories of the predictor variable in some sensible way to eliminate the zero cells; eliminating the categories completely; or, if the variable is ordinally scaled, modelling the variable as if it were continuous.
For continuous predictor variables the most desirable univariate analysis involves fitting a
univariate logistic regression with each predictor to obtain the estimated coefficient, the estimated
standard error, the likelihood ratio test for the significance of the coefficient, and the univariate
Wald statistic.
The completion of univariate analyses is followed by selection of variables for multivariate
analysis. Any variable whose univariate test has a P-value<0.25 should be considered as a
candidate for a multivariable model along with all other variables of known biologic importance.
The univariate approach has the disadvantage of excluding predictor variables which can
collectively be important predictors of outcome, whilst individually weakly linked with the
outcome. This problem can be overcome by choosing a significance level large enough to allow
the suspect variables to be included.
After fitting the multivariable model, the importance of each variable included in the model should
be verified. This should include (a) an examination of the Wald statistic for each variable and (b) a
comparison of each estimated regression coefficient with the coefficient from the univariate model
containing only that specific variable. Variables that do not contribute to the model based on these
criteria should be eliminated and a new model should be fitted. Comparison of models is done
through the likelihood ratio test. Also, estimated coefficients for any remaining variables should be
compared to those of the full model. Marked change in magnitude would imply that one or more of
the excluded variables were important in the sense of providing a necessary adjustment of the
effect of variables that remained in the model. This process is done repeatedly until it appears that
all of the important variables are included in the model and those excluded are either biologically
or statistically unimportant.
3.8.1.2 Scale of Continuous Predictors
For continuous scaled predictor variables we must check the assumption of linearity in the logit.
Since the concept of scale selection is the same for the multivariable models, we describe this
approach using the univariable model. One method to ascertain linearity is to plot the fitted line on
the scatter-plot of the logit versus the predictor variable and look for any obvious systematic
deviations from the line. A modification of this approach is to break the range of the predictor
variable into groups and, for each group, plot the average value of the logit versus the group
midpoint. This approach in logistic regression requires that we transform the vertical axis to the
logit. Thus we would plot, for each group, the logit of the group mean versus the midpoint of the
group. The plot is examined with respect to the shape of the resulting “curve”.
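The grouped check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in this study: the function name `grouped_logits`, the equal-sized groups and the small data set are hypothetical, and groups in which every outcome is 0 or every outcome is 1 are returned without a logit because the logit is undefined there.

```python
import math

def grouped_logits(x, y, n_groups=4):
    """Split the sorted predictor into equal-sized groups and return
    (group midpoint, logit of the group mean of y) for each group."""
    pairs = sorted(zip(x, y))
    size = len(pairs) // n_groups
    result = []
    for g in range(n_groups):
        start = g * size
        end = len(pairs) if g == n_groups - 1 else (g + 1) * size
        chunk = pairs[start:end]
        midpoint = (chunk[0][0] + chunk[-1][0]) / 2
        mean_y = sum(outcome for _, outcome in chunk) / len(chunk)
        # The logit is undefined when the group mean is exactly 0 or 1.
        logit = math.log(mean_y / (1 - mean_y)) if 0 < mean_y < 1 else None
        result.append((midpoint, logit))
    return result

# Hypothetical data: 12 subjects, predictor values 1..12, binary outcome y.
x = list(range(1, 13))
y = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0]
for midpoint, logit in grouped_logits(x, y):
    print(midpoint, logit)
```

Plotting the returned logits against the midpoints gives the "curve" whose shape is examined; an approximately straight line is consistent with linearity in the logit.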
An alternative approach to scale identification in logistic regression is to adapt the Box-Tidwell transformation used in
linear regression. According to Hosmer and Lemeshow (1989), the use of this transformation
in logistic regression has been examined by Guerrero and Johnson (1982). This approach adds a
term of the form x ln(x) to the model. If the coefficient for this variable is significant, we have
evidence of non-linearity in the logit. This procedure, however, has low power in detecting small
departures from linearity.
3.8.1.3 Inclusion of Interactions
Once continuous variables are on the correct scale, we begin to check for interactions in the model.
An interaction between two variables implies that the effect of one of the variables is not constant
over levels of the other. For example, an interaction between sex and age would imply that the
regression coefficient for age is different for males and females. The need to include interaction
terms in a model is assessed by first creating the appropriate product of the variables in question
and then using a likelihood ratio test to assess their significance (that is their contributions to the
model). (See paragraph (3.5.3)). In general, for an interaction term to alter both the point and
interval estimates, the estimated coefficient must attain at least a moderate level of statistical
significance. The final decision as to whether an interaction term should be included in a model
should be based on statistical as well as practical considerations.
3.8.2 Stepwise Forward Selection
This procedure starts by fitting only the intercept term. Then, for each of the possible predictor
variables, a univariate logistic regression containing the intercept and that predictor (say xj) is
fitted. The log-likelihood of the intercept-only model ($L_0$) is compared with the log-likelihood of each
of the univariate models ($L_j$) by means of the likelihood ratio test statistic:
$G_j = 2(L_j - L_0)$.
Its P-value is determined by $P = \Pr(\chi^2(\nu) > G_j)$, where ν = 1 if xj is continuous and ν = k-1 if xj
has k categories. The most important predictor variable is the one with the minimum P-value and this
variable, denoted by xe, is entered into the model. The subscript “e” indicates that the variable is a
candidate for entry. The choice of the significance level (“alpha” level) used to judge the
importance of variables is a crucial aspect. Let $\alpha_E$ denote our choice, where the “E” stands for
entry; this choice of $\alpha_E$ will determine how many variables are eventually included in the
model. Choosing a value for $\alpha_E$ in the range 0.15 to 0.2 is highly recommended. Moreover, using
$\alpha_E$ in this range will provide assurance that the procedure selects variables whose coefficients are
different from zero (Hosmer and Lemeshow (1989)).
After the variable xe has been entered, the next step is to determine whether any of the remaining
p-1 variables are important once xe is in the model by fitting the p-1 logistic regression models
containing xe and xj, j = 1, 2, 3, …, p and j ≠ e. The log-likelihoods of these models are compared
with that of the model containing the intercept and xe. The variable with the smallest P-value at
this step is entered, and the algorithm continues provided its P-value < $\alpha_E$; otherwise it stops.
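A single entry step of this procedure can be sketched as follows. The sketch assumes that the log-likelihoods of the candidate models have already been obtained from fitted logistic regressions; the function names `chi2_sf` and `forward_step`, the candidate names and all numerical values are hypothetical. The chi-square tail probability is built up from the closed forms for 1 and 2 degrees of freedom, so only integer degrees of freedom are handled.

```python
import math

def chi2_sf(x, df):
    """Pr(chi-square with integer df > x), built up from the df=1 or df=2 closed form."""
    if x <= 0:
        return 1.0
    if df % 2 == 1:
        q, nu = math.erfc(math.sqrt(x / 2)), 1
    else:
        q, nu = math.exp(-x / 2), 2
    while nu < df:
        # Recurrence: Q(nu + 2) = Q(nu) + (x/2)^(nu/2) * exp(-x/2) / Gamma(nu/2 + 1)
        q += (x / 2) ** (nu / 2) * math.exp(-x / 2) / math.gamma(nu / 2 + 1)
        nu += 2
    return q

def forward_step(loglik_current, candidates, alpha_e=0.15):
    """candidates maps a variable name to (log-likelihood with it added, df).
    Returns the variable with the smallest P-value if it is below alpha_e."""
    best_name, best_p = None, 1.0
    for name, (loglik, df) in candidates.items():
        g = 2 * (loglik - loglik_current)   # G_j = 2(L_j - L_0)
        p = chi2_sf(g, df)
        if p < best_p:
            best_name, best_p = name, p
    return (best_name, best_p) if best_p < alpha_e else (None, best_p)

# Hypothetical log-likelihoods: x3 has three design variables (k = 4 categories).
candidates = {"x1": (-98.0, 1), "x2": (-99.5, 1), "x3": (-97.5, 3)}
entered, p_value = forward_step(-100.0, candidates)
print(entered, round(p_value, 4))
```

In this hypothetical example x3 has the largest G (5.0 on 3 degrees of freedom) but x1 (G = 4.0 on 1 degree of freedom) has the smaller P-value and is the variable entered, which is why candidates are ranked by P-value rather than by G.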
3.8.3 Stepwise Backward Selection
The process starts with a full model containing all variables. In the first step the log-likelihood of
the model containing all p variables ($L_f$) is compared to that of each model with p-1 variables in which xj is removed,
denoted by $L_{-j}$, using the likelihood ratio test statistic
$G_{-j} = 2(L_f - L_{-j})$.
To ascertain which variable should be deleted from the model, we select that variable which, when
removed, gives the maximum P-value. We denote the minimal level of continued contribution to
the model by $\alpha_R$, where “R” stands for remove. The value we choose for $\alpha_R$ must exceed the
value for $\alpha_E$, to avoid the possibility of having to enter and remove the same variable at successive
steps.
In the next step the log-likelihood of the model excluding the variable removed at the previous step is
compared to those of all p-1 models with one of the remaining variables removed. If the largest P-value exceeds $\alpha_R$,
that variable is removed. Generally the choice of $\alpha_R$ is 0.2 or 0.25. However, important variables can
be forced to remain in the model.
The algorithm stops when all variables have entered the model or when all variables in the model
have P-values that are less than $\alpha_R$.
3.8.4 Stepwise Selection (Forward and backward)
This is a combination of forward and backward selection procedures discussed above. It is based
on a statistical algorithm that allows moves in either direction, dropping or adding variables at
various steps based on the ‘importance’ of variables. The ‘importance’ of a variable refers to the
statistical significance of its coefficient. Since, in logistic regression the errors are assumed to
follow a binomial distribution, the significance is assessed via the likelihood ratio chi-square test.
Thus at any step in the procedure the most important variable will be the one that results in the
largest likelihood ratio statistic, G.
Since the magnitude of G depends on its degrees of freedom, any procedure based on the
likelihood ratio test statistic G must account for possible differences in the degrees of freedom of the
variables. This is achieved by assessing significance through the P-value for G.
3.8.5 Best Subset Selection
This is an alternative to stepwise selection. This model building approach has been available in
linear regression. Typical software implementing this method for linear regression will identify a
specified number of ‘best’ models containing one, two, three variables, and so on, up to the single
model containing all p variables. According to Hosmer and Lemeshow (1989), we may use any
best subsets linear regression program to execute the computations for best subsets logistic
regression.
The subsets of variables selected for ‘best’ models depend on the criterion for ‘best’. In logistic
regression the Score and the $C_p$ criteria are preferred. A model with a high score value will be
preferred to a model with a smaller score value, whereas a model with a small $C_p$ value, or with $C_p \approx r$,
will be preferred, where r is the number of predictor variables in the model. It is important to note
that variables suggested by best subset strategy should not be accepted without considerable
critical evaluation.
Although several selection procedures were discussed in Chapter 2, only a few of them have been
discussed in this chapter. The others were left out because they do not apply to
logistic regression.
3.8.6 General
From the information in this chapter, it is clear that selection methods for binary outcome variables
are lacking. For this reason, we will be evaluating a new method, based on the ROC curve, briefly
in Chapter 5. We will first discuss the concept of a ROC curve in Chapter 4.
Chapter 4
THE RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE
4.1 BACKGROUND
We discuss ROC curves in a separate chapter because we will endeavour (Chapter 5) to
utilise these curves as an additional model (or variable) selection method. Specifically, the area under
the curve (AUC) will be evaluated as a selection criterion. The AUC is discussed in section
4.5.
Researchers and analysts allocate a great deal of effort to the development of prediction models to
support decision making. However, too often insufficient attention is allocated to the tool(s) used
to evaluate the model(s) in question. The issue is that accurate prediction models may be measured
inappropriately based upon the information available regarding classification error rate and the
context of application. In the end, poor decisions are made because of selecting wrong models,
using an inappropriate evaluation method.
In the context of consumer risk prediction, understanding how to evaluate models which predict
potential customers to be ‘good’ or ‘bad’ credit risks is critical to Customer Relationship
Management (CRM). Since the dependent variable of concern is categorical, the issue is one of
binary classification. For a binary classification problem (i.e. prediction of ‘good’ versus ‘bad’),
logit analysis utilises a linear combination of the predictor variables and transforms the result to lie
between 0 and 1, to equate to a probability.
One method of evaluation, which enables a comprehensive analysis of all possible error severities,
is the Receiver Operating Characteristic (ROC) curve. According to Morrison & Michelle (2005),
ROC curves were developed in the field of statistical decision theory, and later used in the field of
signal detection during WW II. ROC curves enabled radar operators to distinguish between an
enemy target, a friendly ship, or noise. They further point out that ROC curves assess the value of
diagnostic tests by providing a standard measure of the ability of the test to correctly classify
subjects. They also cite Metz (1978), who states that the biomedical field uses ROC curves
extensively to assess the efficacy of diagnostic tests in discriminating between healthy and
diseased individuals. ROC curves have since been used in fields ranging from electrical
engineering and weather prediction to psychology and are used almost everywhere in the literature
on medical testing to determine the effectiveness of medications (Nargundkar and Priestly (2003)).
4.2 DEFINITION OF AN ROC CURVE
Consider diagnostic tests with dichotomous outcomes, with positive outcomes suggesting presence
of disease. For dichotomous tests, there are two potential types of error. A false-positive error
happens when a non-diseased individual has a positive test result. On the other hand, a false-negative
error happens when a diseased individual has a negative test result. The rates of
occurrence of these errors, termed false-positive and false-negative rates, together constitute the
operating characteristics of the dichotomous diagnostic test. These notions can be generalised to
non-binary tests in this way: Let D be a binary (0/1) indicator of the disease status with D = 1 for
diseased subjects. Let Y denote the test result with the convention that larger values of Y are more
indicative of disease, so that the test is called positive when Y ≥ C for some threshold value C. Now 1 minus the
false-negative rate (the true positive rate) and 1 minus the true negative rate (the false-positive rate) associated with this decision criterion
can be written as $\Pr(Y \ge C \mid D = 1)$ and $\Pr(Y \ge C \mid D = 0)$, respectively. An ROC curve is a plot of the
true positive rate versus 1 minus the true negative rate across all possible threshold values, C. When Y
is continuous, a clear and brief way of writing the ROC curve is
$\mathrm{ROC}(t) = F_D\{F_{\bar{D}}^{-1}(t)\}$, $t \in (0, 1)$,
where $F_D$ and $F_{\bar{D}}$ are the survivor functions of Y in the diseased and non-diseased populations,
respectively, and where t is the false positive rate, which varies from 0 to 1 as the corresponding
implicit threshold value, C, varies from ∞ to -∞. When Y is discrete the ROC curve can also be
written in the form $F_D\{F_{\bar{D}}^{-1}(t)\}$, but the domain for ROC(t) is restricted to the range of $F_{\bar{D}}(\cdot)$, that
is, the set of all possible false positive rates associated with the test. By definition, the ROC curve
is a monotone increasing function from (0, 0) to (1, 1).
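The construction of the empirical ROC curve from this definition can be sketched as follows. This is an illustrative sketch only: the function name `roc_points` and the six scores and disease labels are hypothetical, and a subject is called positive when the score is at least the threshold C, in line with the convention that larger values are more indicative of disease.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold,
    as the threshold C is lowered from above the largest score to the smallest."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is called positive
    for c in sorted(set(scores), reverse=True):
        tp = sum(1 for s, d in zip(scores, labels) if s >= c and d == 1)
        fp = sum(1 for s, d in zip(scores, labels) if s >= c and d == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical test results Y and disease indicators D (1 = diseased).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
for fpr, tpr in roc_points(scores, labels):
    print(round(fpr, 3), round(tpr, 3))
```

Because lowering the threshold can only add positives, both coordinates are non-decreasing, which is the monotonicity property stated above.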
4.3 DIAGNOSTIC TEST INTERPRETATION
The basic idea of diagnostic test interpretation is to calculate, for example, the probability that a
patient has the disease under consideration given a certain test result. A 2 by 2 table is employed in this
regard (see Table 4.1).
4.3.1 2 X 2 Table or Contingency Matrix
| | Disease Present | Disease Absent | Total |
|---|---|---|---|
| Test Positive | True Positives (TP) | False Positives (FP) | Total Positive |
| Test Negative | False Negatives (FN) | True Negatives (TN) | Total Negative |
| Total | Total with Disease | Total without Disease | Grand Total |

Table 4.1 An example of a Contingency Table
4.3.2 Basic Concepts
In this discussion we refer back to Table 4.1.
4.3.2.1 Sensitivity
Sensitivity is the proportion of patients with disease whose tests are positive.
P(T+|D+) = TP/(TP+FN)
High sensitivity is important when:
• The disease is serious and should not be missed.
• The disease is treatable.
• FP results do not lead to serious physical, psychological or economic trauma to the patient.
4.3.2.2 Specificity
Specificity is the proportion of patients without disease whose tests are negative.
P(T-|D-) = TN/(TN+FP)
High specificity is needed when:
• The disease is serious.
• The disease is not treatable or curable.
• FP results can lead to serious physical, psychological or economic trauma to the patient.
4.3.2.3 Pre-test Probability
Pre-test probability is the prevalence of the disease in the population.
P(D+) = (TP+FN)/(TP+FP+TN+FN)
A related summary is the efficiency of the test, the proportion of all test results that are correct,
(TP+TN)/(TP+FP+TN+FN).
Higher efficiency is needed when:
• The disease is serious.
• The disease is curable.
• FP and FN results are essentially equally damaging.
4.3.2.4 Predictive Value of a Positive Test
The predictive value of a positive test is the proportion of patients with positive tests who
do have disease.
P(D+|T+) = TP/(TP+FP)
This value:
• Is the same as the post-test probability of disease given a positive test.
• Measures how well the test rules in disease.
4.3.2.5 Predictive Value of a Negative Test
The predictive value of a negative test is the proportion of patients with negative tests who
do not have disease.
P(D-|T-) = TN/(TN+FN)
This value measures how well the test rules out the disease.
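The quantities in this section can be computed directly from the four cells of Table 4.1, as in the following minimal sketch. The function name `diagnostic_measures` and the cell counts are hypothetical and serve only as an illustration; efficiency is taken here, as an assumption, to be the usual proportion of correct results, (TP+TN) divided by the grand total.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Summary measures of a dichotomous test from the 2 x 2 table counts."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),    # P(T+|D+)
        "specificity": tn / (tn + fp),    # P(T-|D-)
        "prevalence": (tp + fn) / total,  # pre-test probability P(D+)
        "efficiency": (tp + tn) / total,  # proportion of correct results (assumed definition)
        "ppv": tp / (tp + fp),            # P(D+|T+)
        "npv": tn / (tn + fn),            # P(D-|T-)
    }

# Hypothetical counts for 1000 patients, 100 of whom have the disease.
measures = diagnostic_measures(tp=80, fp=30, fn=20, tn=870)
for name, value in measures.items():
    print(name, round(value, 3))
```

With these hypothetical counts the test is quite specific, yet the predictive value of a positive test is noticeably lower than the sensitivity because the disease is present in only 10% of the patients.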
4.4 ROC REGRESSION MODEL
Let the false positive rate be denoted by t and let τ denote the set of possible values for t, namely
the range of $F_{\bar{D}}$, which is a subset of [0, 1]. Let Z denote some factors which potentially influence
test accuracy and let X be a corresponding vector of covariates. For example, if Z is a categorical
variable, X might be the associated vector of dummy variables. The covariate vector X is a
function of the factors Z. We write the ROC curve associated with Z as $\mathrm{ROC}_Z(t)$ and model it as
$\mathrm{ROC}_Z(t) = g\{\alpha_0(t), \beta X\}$, $t \in \tau_Z$,
where $\alpha_0(t)$ is a univariate baseline function of t, βX is a linear predictor which characterises the
effect of the covariates X on the ROC curve, g is a known function and $\tau_Z$ denotes the domain of
the ROC function associated with Z. In general the covariate vector X may include interactions
between factors in Z and t, in which case we write the covariate vector as X(t). Since the ROC curve
is a monotone increasing function by definition, g and $\alpha_0$ must be chosen such that monotonicity of
$\mathrm{ROC}_Z$ is ensured.
4.5 AREA UNDER THE ROC CURVE (AUC)
4.5.1 Interpretation of the Area
The area under the ROC curve is commonly used as a summary measure of diagnostic accuracy. For a test
that performs no worse than chance it takes values from 0.5 to 1.0. The AUC statistic can be interpreted as the probability that the test
result from a randomly chosen diseased individual is more indicative of disease than that from a
randomly chosen non-diseased individual, or as a measure of a model’s ability to discriminate
between those who experience the outcome of interest and those who do not:
$\mathrm{AUC} = P(X_i \ge X_j \mid D_i = 1, D_j = 0)$.
An ROC curve summarises the possible set of 2 X 2 matrices
that results when the cut-off value is varied continuously from its highest possible value down to
its smallest possible value. An area of 1 represents perfect discrimination. The closer the curve
follows the left-hand border and then the top border of the ROC space, the more accurate the
discrimination. On the other hand, an area of 0.5 represents worthless discrimination. The closer
the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
An area of
• 0.90 – 1.00 = excellent discrimination
• 0.80 – 0.90 = good discrimination
• 0.70 – 0.80 = fair discrimination
• 0.60 – 0.70 = poor discrimination
• 0.50 – 0.60 = fail, i.e. no discrimination
However, in practice it is extremely unusual to observe areas under the curve greater than 0.9.
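The probabilistic interpretation of the AUC can be illustrated by comparing every diseased subject with every non-diseased subject, as in the sketch below. The function name `auc_by_pairs` and the data are hypothetical, and tied pairs are counted as one half, which is a common convention and an assumption here rather than something stated in the text above.

```python
def auc_by_pairs(scores, labels):
    """Proportion of (diseased, non-diseased) pairs in which the diseased
    subject has the larger test result; ties count as one half."""
    pos = [s for s, d in zip(scores, labels) if d == 1]
    neg = [s for s, d in zip(scores, labels) if d == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test results and disease indicators (1 = diseased).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
print(auc_by_pairs(scores, labels))
```

In this hypothetical example 8 of the 9 pairs are ordered correctly, so the area is 8/9, which would fall in the "good discrimination" band of the scale above.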
4.5.2 Comparison of Tests
When results from multiple tests have been obtained, the ROC plots can be graphed together. The
relative positions of the plots indicate the relative accuracies of the tests. A plot lying above and to
the left of another plot indicates greater observed accuracy. If the curves for two tests cross, a
meaningful difference between the tests over the range of interest might not be picked up by the
AUCs.
If we have two curves of similar area and we wish to decide whether the two curves differ
significantly, we can use bivariate statistical analysis.
Where we have different areas derived from two tests applied to different sets of cases, it is
appropriate to calculate the standard error of the difference between the two areas, thus:
$SE(A_1 - A_2) = \sqrt{SE^2(A_1) + SE^2(A_2)}$
This approach is not appropriate where the two tests are applied to the same set of patients. Hanley and
McNeil (1982) show that in these circumstances, the correct formula is:
$SE(A_1 - A_2) = \sqrt{SE^2(A_1) + SE^2(A_2) - 2r\,SE(A_1)\,SE(A_2)}$
where r is the quantity that represents the correlation induced between the two areas by the study
of the same set of cases.
Once we have the standard error of the difference in areas, we can then calculate the statistic:
$Z = (A_1 - A_2) / SE(A_1 - A_2)$
If the absolute value of Z is above a critical level, then we accept that the two areas are different. Commonly this critical
value is set at 1.96, and we then have a 0.05 chance of making a type I error in rejecting the
hypothesis that the two curves are similar.
Assuming we have two tests T1 and T2 that classify our cases into either normal (n) or abnormal
(a), and we have already calculated the AUCs for each test, r is calculated as follows:
1. Look at (n), the non-diseased patients. We find how the two tests correlate for these
patients and obtain a value $r_n$ for this correlation.
2. Similarly we derive $r_a$, the correlation between the two tests for the abnormal patients (a).
3. Average $r_n$ and $r_a$.
4. Average the areas $A_1$ and $A_2$ by calculating $(A_1 + A_2)/2$.
5. Look up the value of r in Hanley and McNeil’s Table I (Hanley and McNeil (1982)) given the
average of $r_n$ and $r_a$ and the average area.
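Once r is available, the comparison of two areas reduces to a few lines of arithmetic, sketched below. The function name `compare_auc` and all numerical values (areas, standard errors and r) are hypothetical; in particular r is supplied as an input because it is read from the published table rather than computed here. Setting r = 0 gives the formula for two tests applied to different sets of cases.

```python
import math

def compare_auc(a1, se1, a2, se2, r=0.0):
    """Standard error of the difference between two areas and the Z statistic.
    r is the correlation between the areas (0 for independent samples)."""
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2 - 2 * r * se1 * se2)
    z = (a1 - a2) / se_diff
    return se_diff, z

# Hypothetical areas and standard errors for two tests.
se_indep, z_indep = compare_auc(0.85, 0.03, 0.78, 0.04)          # different cases
se_same, z_same = compare_auc(0.85, 0.03, 0.78, 0.04, r=0.5)     # same cases, r assumed
print(round(se_indep, 4), round(z_indep, 2))
print(round(se_same, 4), round(z_same, 2))
```

A positive correlation reduces the standard error of the difference, so the same pair of areas gives a larger Z when both tests are applied to the same cases.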
The ROC plot is simple, graphical and easily appreciated visually. It is a comprehensive
representation of pure accuracy, i.e. discriminating ability, over the entire range of a test. It
provides a direct visual comparison between tests on a common scale and it requires no grouping
or binning of data. With appropriate software, ROC plotting is readily done.
The plot also has limitations. Actual decision thresholds are usually not displayed in the plot. The number of subjects is also not
shown on the display and, as the sample size decreases, the ROC plot tends to become increasingly
jagged and bumpy. However, even with a large number of subjects, the plot may be bumpy.
CHAPTER 5
MODEL BUILDING USING REAL DATA
In this chapter we will look at the application of the procedures and methods outlined in chapters 3
and 4 with regard to selection of variables. Some of the criteria, discussed in Chapter 2, such as the
Akaike Information Criterion may come into play since they also are applicable to logistic
regression and needless to say, Cox regression as well.
The data set to be used was developed for a study of factors associated with success of first year
students at the Tshwane University of Technology (TUT) from the year 1999 to 2002. Information
on 18047 students was obtained.
Table 5.1 describes the response, predictor variables and their codes.
| Variable | Description and code |
|---|---|
| Pass | pass=1, fail=0 |
| Campuss | main campus=1, satellite campus=2 |
| Genderr | female=1, male=2 |
| Agregate | aggregate mark for all subjects in matric exam for an individual student |
| Maritall | marital status (single=1, married=2) |
| Finaidd | Financial aid (aided=1, not aided) |
| Age | student age at first registration |
| English | Performance in English in matric exam (good=1, not good=2) |
| Race | (white=1, coloured=2, Asian=3 and black=4) |
| Faculty | (Engineering=1, Commerce=2, Social Science=3, Arts=4, Natural Science=5, Agricultural Science=6 and Health=7) |

Table 5.1 Code Sheet of the Variables used in the Data set for the Study of Factors
Associated with Success of First Year Students at TUT from 1999 to 2002
5.1 PURPOSEFUL SELECTION OF VARIABLES
We begin with a univariable description of all predictors; both categorical and continuous
variables are shown in Tables 14 and 15 of the appendix respectively.
The univariable analysis does not reveal any variable for which there are illegal values. All binary
variables are coded as 1; 2. Race and Faculty are the only non-binary categorical variables. We
create indicator variables for the Faculty variable as shown in Table 5.2:
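| Faculty | Label | faculty_2 | faculty_3 | faculty_4 | faculty_5 | Faculty_6 | Faculty_7 |
|---|---|---|---|---|---|---|---|
| 1 | Engineering | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Commerce | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | Social Sci | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | Arts | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | Natural Sci | 0 | 0 | 0 | 1 | 0 | 0 |
| 6 | Agric Sci | 0 | 0 | 0 | 0 | 1 | 0 |
| 7 | Health | 0 | 0 | 0 | 0 | 0 | 1 |

Table 5.2 Indicator Variables for the Variable Faculty.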
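The reference cell coding of Table 5.2, with Engineering (code 1) as the reference group, can be sketched as follows. The function name `faculty_indicators` is hypothetical and the sketch is only an illustration of the coding scheme, not the software used in the study.

```python
def faculty_indicators(faculty_code, n_levels=7):
    """Reference cell coding with level 1 as the reference group:
    one 0/1 indicator for each of the levels 2..n_levels."""
    return {f"faculty_{k}": int(faculty_code == k) for k in range(2, n_levels + 1)}

print(faculty_indicators(1))  # Engineering: every indicator is 0
print(faculty_indicators(4))  # Arts: only faculty_4 is 1
```

A variable with k levels therefore contributes k-1 design variables to the model, and each coefficient is interpreted relative to the reference group.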
Since the numbers of Indians and Coloureds were quite small, each less than 2% of the total, a
dichotomous variable Brace (black race) was created. Brace takes the value 1 if race is
black and the value 0 for other races (White, Coloured and Indian). The dependent variable was
the logit, logit(π) = log(π/(1-π)), where π is the probability that a student passed.
Univariable logistic regressions were fitted to the data and the results are given in Table 5.3.
| Predictor Variable | Estimated Coefficient | Estimated Standard Error | Estimated Odds ratio | 95% CI | Wald Test P-value |
|---|---|---|---|---|---|
| Age | -0.0552 | 0.00726 | 0.759 | (0.707, 0.815) | <0.0001 |
| Agregate | 0.00287 | 0.000078 | 1.267 | (0.673, 1.287) | <0.0001 |
| Campuss | -0.1645 | 0.0172 | 0.720 | (0.673, 0.770) | <0.0001 |
| Maritall | 0.2043 | 0.0762 | 1.505 | (1.116, 2.028) | 0.0073 |
| Finaidd | 0.3483 | 0.0264 | 2.007 | (1.810, 2.225) | <0.0001 |
| Genderr | 0.1662 | 0.0167 | 1.394 | (1.306, 1.489) | <0.0001 |
| English | 0.3890 | 0.0199 | 2.177 | (2.014, 2.353) | <0.0001 |
| Faculty_2 | 0.2447 | 0.0586 | 1.277 | (1.139, 1.433) | <0.0001 |
| Faculty_3 | 0.5835 | 0.0620 | 1.792 | (1.587, 2.024) | <0.0001 |
| Faculty_4 | 1.8045 | 0.0757 | 6.077 | (5.239, 7.048) | <0.0001 |
| Faculty_5 | 0.7191 | 0.0744 | 2.053 | (1.774, 2.375) | <0.0001 |
| Faculty_6 | -0.0894 | 0.0866 | 0.914 | (0.772, 1.084) | 0.3020 |
| Faculty_7 | 1.2743 | 0.0288 | 3.576 | (3.040, 4.207) | <0.0001 |
| Brace | -0.9388 | 0.0344 | 0.391 | (0.366, 0.418) | <0.0001 |

Table 5.3 Univariable Logistic Regression Models
For the variables Age and Agregate in Table 5.3 odds ratios are for an increase of 5 years and 100
marks respectively. A change of 1 mark or 1 year would not be meaningful.
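The odds ratio for a change of c units in a continuous predictor follows from the estimated coefficient and its standard error, as the sketch below illustrates for Age with c = 5 years. The function name `odds_ratio_for_change` is hypothetical and the multiplier 1.96 for a 95% Wald interval is an assumption about how the interval in Table 5.3 was obtained; the coefficient and standard error are those reported for Age.

```python
import math

def odds_ratio_for_change(beta, se, c, z=1.96):
    """Odds ratio and Wald confidence limits for a change of c units:
    exp(c*beta) and exp(c*beta -/+ z*|c|*se)."""
    point = math.exp(c * beta)
    lower = math.exp(c * beta - z * abs(c) * se)
    upper = math.exp(c * beta + z * abs(c) * se)
    return point, lower, upper

# Age in Table 5.3: coefficient -0.0552, standard error 0.00726, change of 5 years.
point, lower, upper = odds_ratio_for_change(-0.0552, 0.00726, 5)
print(round(point, 3), round(lower, 3), round(upper, 3))
```

Under these assumptions the calculation agrees with the Age entry of Table 5.3 to three decimal places.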
With the exception of the variables Faculty_6 and Agregate, there is evidence that each of the
variables has some association with the outcome variable, pass. This is based on the observation
that the confidence interval estimates do not contain 1. Furthermore, all variables except Faculty_6 are significant
with P-value ≤ 0.25 for the Wald test. Based on the univariable results, we now fit the
multivariable model including all variables except Faculty_6, which is not significant. The model
is shown in Table 5.4.
The Wald statistic is now used to delete, one by one, variables that do not appear to be significant
at the P-value ≤ 0.05 level, starting with the least significant one.
| Criterion | Intercept Only | Intercept and Covariates |
|---|---|---|
| AIC | 21460.178 | 19424.526 |
| SC | 21467.979 | 19533.737 |
| -2 Log L | 21458.178 | 19396.526 |

Testing Global Null Hypothesis: BETA=0
| Test | Chi-Square | DF | Pr > ChiSq |
|---|---|---|---|
| Likelihood Ratio | 2061.6516 | 13 | <.0001 |
| Score | 2085.0164 | 13 | <.0001 |
| Wald | 1798.0246 | 13 | <.0001 |

Analysis of Maximum Likelihood Estimates
| Parameter | DF | Estimate | Std Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|
| Intercept | 1 | -2.5845 | 0.2700 | 91.6291 | <.0001 |
| age | 1 | -0.0101 | 0.00855 | 1.3903 | 0.2384 |
| agregate | 1 | 0.00162 | 0.000094 | 298.6322 | <.0001 |
| Campuss | 1 | 0.0662 | 0.0255 | 6.7383 | 0.0094 |
| maritall | 1 | 0.0637 | 0.0946 | 0.4535 | 0.5007 |
| finaidd | 1 | 0.3634 | 0.0282 | 165.6272 | <.0001 |
| genderr | 1 | 0.1001 | 0.0186 | 29.0649 | <.0001 |
| english | 1 | 0.0778 | 0.0237 | 10.7918 | 0.0010 |
| faculty_2 | 1 | 0.6209 | 0.0564 | 121.0806 | <.0001 |
| faculty_3 | 1 | 0.5730 | 0.0581 | 97.1171 | <.0001 |
| faculty_4 | 1 | 1.6185 | 0.0878 | 340.1683 | <.0001 |
| faculty_5 | 1 | 0.7232 | 0.0862 | 70.4490 | <.0001 |
| faculty_7 | 1 | 1.1871 | 0.0815 | 211.9119 | <.0001 |
| Brace | 1 | -0.5585 | 0.0437 | 163.3471 | <.0001 |

Odds Ratio Estimates
| Effect | Point Estimate | 95% Wald Confidence Limits (Lower) | 95% Wald Confidence Limits (Upper) |
|---|---|---|---|
| age | 0.990 | 0.974 | 1.007 |
| agregate | 1.267 | 0.673 | 1.287 |
| Campuss 1 vs 2 | 1.142 | 1.033 | 1.262 |
| maritall 1 vs 2 | 1.136 | 0.784 | 1.646 |
| finaidd 1 vs 2 | 2.068 | 1.852 | 2.311 |
| genderr 1 vs 2 | 1.222 | 1.136 | 1.314 |
| english 1 vs 2 | 1.168 | 1.065 | 1.282 |
| faculty_2 | 1.861 | 1.666 | 2.078 |
| faculty_3 | 1.774 | 1.583 | 1.988 |
| faculty_4 | 5.046 | 4.248 | 5.992 |
| faculty_5 | 2.061 | 1.741 | 2.440 |
| faculty_7 | 3.278 | 2.793 | 3.846 |
| Brace | 0.572 | 0.525 | 0.623 |

Table 5.4 Multivariable Model Containing Variables Identified in the Univariable Analysis.
The model at the end of the process of removing non-significant variables is shown in Table 5.6.
At this point, we allow each of the variables not in the model, the opportunity to re-enter the model
one by one. As each variable enters the model, we evaluate its statistical significance using the
Wald test and also ascertain whether the variable is a confounder or not of other variables in the
model by calculating the extent of change of coefficients of variables in the model.
There is no significant change in the coefficients of the other variables when Faculty_6 re-enters the
model and, according to the Wald test, the variable is not statistically significant. The
same holds for the variables Maritall and Age when they re-enter the model. Therefore,
the preliminary main-effects model is as given in Table 5.6.
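As a supplementary check on the joint removal of Age and Maritall, the -2 Log L values reported in Tables 5.4 and 5.6 can be compared by a likelihood ratio test. The sketch below is illustrative arithmetic only and is not part of the original analysis; it assumes that the two models are nested and differ only in these two variables, so that the statistic has 2 degrees of freedom, for which the chi-square upper tail probability is exactly exp(-G/2).

```python
import math

# -2 Log L values as reported in Table 5.4 (13 covariates) and Table 5.6 (11 covariates).
neg2logl_full = 19396.526
neg2logl_reduced = 19399.992

g = neg2logl_reduced - neg2logl_full  # likelihood ratio statistic G
df = 13 - 11                          # number of variables removed
p_value = math.exp(-g / 2)            # exact chi-square tail probability for 2 df

print(round(g, 3), df, round(p_value, 4))
```

Under these assumptions G is about 3.47 on 2 degrees of freedom with a P-value of roughly 0.18, which is consistent with the Wald tests in suggesting that the two variables can be dropped together.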
Before proceeding to determine interactions we need to examine the variables that have been
modelled as continuous to obtain the correct scale in the logit. In this case the variable we need to
check is Agregate.
We start by determining the quartiles of the distribution of Agregate from appendix 1 Table 14
and create three design variables using the lowest quartile as the reference group. The results of the
quartile analysis are shown in Table 5.5.
| Quartile | Midpoint | Coefficient | 95% CI for Odds Ratios |
|---|---|---|---|
| 1 | 775 | 0 | |
| 2 | 955 | 0.2898 | (1.208, 1.478) |
| 3 | 1137 | 1.0672 | (2.516, 3.359) |
| 4 | 1680 | 0.9989 | (2.407, 3.063) |

Table 5.5 Results of Quartile Analyses of the Variable Agregate from the Multivariable
Model Containing Variables shown in Table 5.6
| Criterion | Intercept Only | Intercept and Covariates |
|---|---|---|
| AIC | 21460.178 | 19423.992 |
| SC | 21467.979 | 19517.601 |
| -2 Log L | 21458.178 | 19399.992 |

Testing Global Null Hypothesis: BETA=0
| Test | Chi-Square | DF | Pr > ChiSq |
|---|---|---|---|
| Likelihood Ratio | 2058.1862 | 11 | <.0001 |
| Score | 2083.3017 | 11 | <.0001 |
| Wald | 1796.8154 | 11 | <.0001 |

Analysis of Maximum Likelihood Estimates
| Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|
| Intercept | 1 | -2.7409 | 0.1250 | 480.9214 | <.0001 |
| agregate | 1 | 0.00163 | 0.000093 | 304.3700 | <.0001 |
| Campuss | 1 | 0.0730 | 0.0252 | 8.3651 | 0.0038 |
| finaidd | 1 | 0.3637 | 0.0282 | 165.9746 | <.0001 |
| genderr | 1 | 0.1029 | 0.0184 | 31.1306 | <.0001 |
| english | 1 | 0.0837 | 0.0234 | 12.8165 | 0.0003 |
| faculty_2 | 1 | 0.6197 | 0.0564 | 120.6196 | <.0001 |
| faculty_3 | 1 | 0.5729 | 0.0582 | 97.0702 | <.0001 |
| faculty_4 | 1 | 1.6244 | 0.0877 | 342.9895 | <.0001 |
| faculty_5 | 1 | 0.7316 | 0.0861 | 72.2871 | <.0001 |
| faculty_7 | 1 | 1.1867 | 0.0815 | 211.7846 | <.0001 |
| Brace | 1 | -0.5567 | 0.0437 | 162.5179 | <.0001 |

Odds Ratio Estimates
| Effect | Point Estimate | 95% Wald Confidence Limits (Lower) | 95% Wald Confidence Limits (Upper) |
|---|---|---|---|
| agregate | 1.267 | 0.673 | 1.287 |
| Campuss 1 vs 2 | 1.157 | 1.048 | 1.277 |
| finaidd 1 vs 2 | 2.070 | 1.853 | 2.312 |
| genderr 1 vs 2 | 1.229 | 1.143 | 1.321 |
| english 1 vs 2 | 1.182 | 1.079 | 1.296 |
| faculty_2 | 1.858 | 1.664 | 2.076 |
| faculty_3 | 1.773 | 1.582 | 1.988 |
| faculty_4 | 5.075 | 4.274 | 6.027 |
| faculty_5 | 2.078 | 1.756 | 2.460 |
| faculty_7 | 3.276 | 2.792 | 3.844 |
| Brace | 0.573 | 0.526 | 0.624 |

Association of Predicted Probabilities and Observed Responses
| Measure | Value | Measure | Value |
|---|---|---|---|
| Percent Concordant | 69.8 | Somers' D | 0.405 |
| Percent Discordant | 29.3 | Gamma | 0.409 |
| Percent Tied | 0.9 | Tau-a | 0.164 |
| Pairs | 65896012 | c | 0.703 |

| Effect | Unit | Estimate |
|---|---|---|
| agregate | 100.0 | 1.177 |
| agregate | -100.0 | 0.850 |

Table 5.6 Preliminary Main Effects Model
[Figure: estimated coefficients (vertical axis, 0.0 to 1.4) plotted against the quartile midpoints of Agregate (horizontal axis "agreg", 700 to 1700).]
Figure 5.1 Plot of quartile midpoints against coefficients.
The results of plotting quartile midpoints against the coefficients are shown in Figure 5.1. The plot
of the coefficients supports an assumption of non-linearity in the logit. Addition of the variable
[Agregate*ln(Agregate)] to the model containing Agregate as a continuous variable yields a
significant coefficient for the variable [Agregate*ln(Agregate)]. This confirms that Agregate is not
linear in the logit.
From Table 5.5 the coefficients for the third and fourth quartiles are similar in
magnitude and their confidence intervals overlap to a great extent. These observations, which are also supported by Figure 5.1, suggest
the creation of a dichotomous variable taking the value 1 if Agregate is in the third or fourth
quartile and the value zero otherwise.
The results of including a dichotomous variable Agregate_ in the multivariable model are shown in
Table 5.7.
| Criterion | Intercept Only | Intercept and Covariates |
|---|---|---|
| AIC | 21460.178 | 19676.792 |
| SC | 21467.979 | 19770.401 |
| -2 Log L | 21458.178 | 19652.792 |

Testing Global Null Hypothesis: BETA=0
| Test | Chi-Square | DF | Pr > ChiSq |
|---|---|---|---|
| Likelihood Ratio | 1805.3857 | 11 | <.0001 |
| Score | 1831.4242 | 11 | <.0001 |
| Wald | 1618.6281 | 11 | <.0001 |

Analysis of Maximum Likelihood Estimates
| Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|
| Intercept | 1 | -1.0521 | 0.0663 | 251.6263 | <.0001 |
| english | 1 | 0.1664 | 0.0229 | 52.6993 | <.0001 |
| finaidd | 1 | 0.3990 | 0.0279 | 204.5030 | <.0001 |
| Campuss | 1 | 0.0783 | 0.0251 | 9.7167 | 0.0018 |
| genderr | 1 | 0.1162 | 0.0183 | 40.3096 | <.0001 |
| faculty_2 | 1 | 0.5567 | 0.0558 | 99.5148 | <.0001 |
| faculty_3 | 1 | 0.5393 | 0.0576 | 87.7013 | <.0001 |
| faculty_4 | 1 | 1.6151 | 0.0868 | 345.9076 | <.0001 |
| faculty_5 | 1 | 0.7533 | 0.0855 | 77.5690 | <.0001 |
| faculty_7 | 1 | 1.1212 | 0.0806 | 193.6399 | <.0001 |
| Brace | 1 | -0.7023 | 0.0423 | 275.3671 | <.0001 |
| agregate_ | 1 | 0.2966 | 0.0393 | 57.0404 | <.0001 |

Odds Ratio Estimates
| Effect | Point Estimate | 95% Wald Confidence Limits (Lower) | 95% Wald Confidence Limits (Upper) |
|---|---|---|---|
| english 1 vs 2 | 1.395 | 1.275 | 1.526 |
| finaidd 1 vs 2 | 2.221 | 1.991 | 2.478 |
| Campuss 1 vs 2 | 1.169 | 1.060 | 1.290 |
| genderr 1 vs 2 | 1.262 | 1.174 | 1.355 |
| faculty_2 | 1.745 | 1.564 | 1.947 |
| faculty_3 | 1.715 | 1.532 | 1.920 |
| faculty_4 | 5.028 | 4.241 | 5.961 |
| faculty_5 | 2.124 | 1.796 | 2.512 |
| faculty_7 | 3.069 | 2.620 | 3.593 |
| Brace | 0.495 | 0.456 | 0.538 |
| agregate_ | 1.345 | 1.246 | 1.453 |

Table 5.7 Multivariable Model With Dichotomous Variable Agregate_.
We now form the following two-way interactions among the variables in Table 5.7.
engagr=english*agregate_
engfin=english*finaidd
amfin=campuss*finaidd
finfac2=finaidd*faculty_2
finfac3=finaidd*faculty_3
finfac4=finaidd*faculty_4
finfac5=finaidd*faculty_5
finfac7=finaidd*faculty_7
racfin=brace*finaidd
agrbrac=agregate*brace
engbrac=english*brace
The interaction terms are added one at a time to the model containing the main effects. Table 5.8
shows the interactions that were significant when added in this way. Interactions which are not
significant are excluded from the model. A model with the significant interactions is shown in
Table 5.9. It should be noted, however, that when an interaction is statistically significant, we
keep the corresponding main effects in the model regardless of their own statistical significance.
Table 5.9 gives the final model containing main effects and interactions. From Table 5.10, we see
that (12308+1046)=13354 or 74% of the 18047 observations in our data are correctly classified by
the logistic regression model in Table 5.9. Of the 5083 observed passes, 1046 or 20.6% are
correctly classified as predicted passes. The remaining 4037 observed passes are incorrectly
classified as predicted fails; these are called false negatives. Only 656 of the 12964 observed
fails are incorrectly classified as predicted passes; these are called false positives.
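The classification measures quoted above follow directly from the four cells of the contingency matrix in Table 5.10. The following minimal sketch recomputes them; the function name `classification_summary` and the cell labels (tn, fp, fn, tp) are illustrative choices, with "pass" treated as the positive outcome.

```python
def classification_summary(tn, fp, fn, tp):
    """Summarise a 2x2 classification table (positive outcome = pass)."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,         # share correctly classified
        "sensitivity": tp / (tp + fn),         # observed passes predicted as passes
        "specificity": tn / (tn + fp),         # observed fails predicted as fails
        "false_positive_rate": fp / (tn + fp)  # observed fails predicted as passes
    }

# Cell counts taken from Table 5.10.
summary = classification_summary(tn=12308, fp=656, fn=4037, tp=1046)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The low sensitivity next to a high specificity reflects the fact that the table is built with a single cut-off on the predicted probability; the ROC curve summarises the same trade-off over all cut-offs.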
The c statistic in Table 5.11 gives the area under the ROC curve (the AUC) in figure 5.2. This
c-value is 0.694 and indicates that the model has low predictive accuracy. Low predictive
accuracy does not, however, imply that the model does not fit.
Criterion    Intercept Only    Intercept and Covariates
AIC          21460.178         19635.958
SC           21467.979         19791.972
-2 Log L     21458.178         19595.958

Analysis of Maximum Likelihood Estimates
Parameter     DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept     1     -0.8944     0.3361            7.0820             0.0078
Campuss 1     1     0.0806      0.0252            10.2242            0.0014
genderr 1     1     0.1099      0.0184            35.7304            <.0001
finaidd 1     1     0.2188      0.0922            5.6253             0.0177
english 1     1     0.1100      0.1428            0.5936             0.4410
faculty_2     1     0.8804      0.2504            12.3660            0.0004
faculty_3     1     0.5433      0.0577            88.6276            <.0001
faculty_4     1     0.4241      0.4249            0.9962             0.3182
faculty_5     1     -0.0852     0.3634            0.0550             0.8146
faculty_7     1     1.1313      0.0808            195.8363           <.0001
Brace         1     -0.5948     0.3306            3.2371             0.0720
agregate_     1     0.6119      0.1228            24.8345            <.0001
engagr        1     -0.1501     0.0936            2.5717             0.1088
engfin        1     -0.1932     0.1293            2.2326             0.1351
finfac4       1     0.6224      0.2182            8.1332             0.0043
finfac2       1     -0.1775     0.1301            1.8627             0.1723
finfac5       1     0.4584      0.1878            5.9573             0.0147
racfin        1     -0.1765     0.1473            1.4345             0.2310
agrbrac       1     -0.2025     0.0875            5.3520             0.0207
engbrac       1     0.3499      0.1325            6.9747             0.0083
Table 5.8 A model containing Interactions which were Significant when Added One by One
to the Main Effects Model.
Criterion    Intercept Only    Intercept and Covariates
AIC          21460.178         19636.828
SC           21467.979         19769.441
-2 Log L     21458.178         19602.828

Analysis of Maximum Likelihood Estimates
Parameter     DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept     1     -1.3260     0.0944            197.1637           <.0001
Campuss 1     1     0.0802      0.0252            10.1358            0.0015
genderr 1     1     0.1097      0.0184            35.6720            <.0001
finaidd 1     1     0.3905      0.0441            78.5287            <.0001
english 1     1     0.3305      0.0609            29.4292            <.0001
faculty_2     1     1.0687      0.2342            20.8300            <.0001
faculty_3     1     0.5447      0.0577            89.1103            <.0001
faculty_4     1     0.3639      0.4263            0.7287             0.3933
faculty_5     1     -0.0899     0.3664            0.0602             0.8061
faculty_7     1     1.1313      0.0808            196.0118           <.0001
Brace         1     -0.9373     0.1737            29.1054            <.0001
agregate_     1     0.4453      0.0688            41.8797            <.0001
finfac4       1     0.6557      0.2187            8.9884             0.0027
finfac2       1     -0.2770     0.1212            5.2212             0.0223
finfac5       1     0.4603      0.1892            5.9179             0.0150
agrbrac       1     -0.2363     0.0838            7.9462             0.0048
engbrac       1     0.3694      0.1315            7.8870             0.0050
Table 5.9 Final Model with Interactions
Actual            Predicted by Model
Classification    0        1       Total
0                 12308    656     12964
1                 4037     1046    5083
Total             16345    1702    18047
Table 5.10 Contingency Matrix for model in Table 5.9
Odds Ratio Estimates
Effect            Point Estimate    95% Wald Confidence Limits
Campuss 1 vs 2    1.174             1.064    1.296
genderr 1 vs 2    1.245             1.159    1.338
finaidd 1 vs 2    2.183             1.837    2.595
english 1 vs 2    1.937             1.525    2.459
faculty_2         2.912             1.840    4.607
faculty_3         1.724             1.540    1.930
faculty_4         1.439             0.624    3.318
faculty_5         0.914             0.446    1.874
faculty_7         3.100             2.646    3.632
Brace             0.392             0.279    0.551
agregate_         1.561             1.364    1.786
finfac4           1.926             1.255    2.958
finfac2           0.758             0.598    0.961
finfac5           1.585             1.094    2.296
agrbrac           0.790             0.670    0.931
engbrac           1.447             1.118    1.872

Association of Predicted Probabilities and Observed Responses
Percent Concordant    68.6        Somers' D    0.388
Percent Discordant    29.8        Gamma        0.394
Percent Tied          1.7         Tau-a        0.157
Pairs                 65896012    c            0.694
Table 5.11 Odds Ratios and Association of Predicted Probabilities and Observed Responses
for the Final Model in Table 5.9
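The association measures in Table 5.11 are all functions of the concordant, discordant and tied percentages. The sketch below assumes the definitions commonly documented for this kind of output: Somers' D = (nc - nd)/t, Gamma = (nc - nd)/(nc + nd), Tau-a = (nc - nd)/(N(N - 1)/2) and c = (nc + 0.5 tied)/t, where t is the number of pass/fail pairs and N the sample size. The function name `association_measures` is hypothetical, and because the percentages are rounded the results only match the table approximately.

```python
def association_measures(pct_concordant, pct_discordant, pct_tied, pairs, n_obs):
    """Rank-association measures from concordance percentages (assumed definitions)."""
    nc = pct_concordant / 100 * pairs   # concordant pairs
    nd = pct_discordant / 100 * pairs   # discordant pairs
    nt = pct_tied / 100 * pairs         # tied pairs
    somers_d = (nc - nd) / pairs
    gamma = (nc - nd) / (nc + nd)
    tau_a = (nc - nd) / (n_obs * (n_obs - 1) / 2)
    c = (nc + 0.5 * nt) / pairs
    return somers_d, gamma, tau_a, c

# Pairs = (observed fails) x (observed passes) = 12964 x 5083, as in Table 5.10.
pairs = 12964 * 5083
somers_d, gamma, tau_a, c = association_measures(68.6, 29.8, 1.7, pairs, 18047)
print(f"pairs={pairs}, D={somers_d:.3f}, Gamma={gamma:.3f}, Tau-a={tau_a:.3f}, c={c:.3f}")
```

The c statistic is the quantity the text interprets as the AUC, and under these definitions Somers' D and c are related by D = 2c - 1 whenever the three percentages sum to exactly 100.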
5.2 OTHER LOGISTIC REGRESSION SELECTION PROCEDURES.
The results of applying the Forward, Backward, Stepwise and Best-Subset selection procedures are
given in appendices 2, 3, 4, and 5 respectively.
All the stepwise procedures except Forward Selection produced eleven-variable models. Forward
Selection included two additional variables, Age and Faculty_6, which are not significant at the
5% significance level according to the Wald test. These variables satisfied the entry level of
P=0.25 but could not leave the model, since the Forward procedure makes no provision for
non-significant variables to leave the model.
The Best Subset procedure using the Cp criterion pointed to a model with twelve variables from
the two ‘best’ models requested in the procedure. For the Best Subset procedure using the Score
criterion we also requested the two ‘best’ models of each size (i.e. from a model containing one
variable to a model with 13 variables). From the two ‘best’ models with twelve variables, the
Score criterion selected the same model as the Cp criterion.
The Purposeful Selection procedure, like the Backward and Stepwise procedures, produced a model
with eleven variables. However, Purposeful Selection called for the variable Agregate to enter
the model as a binary variable, following the analysis of the scale of this continuous variable.
5.3 INVESTIGATION OF THE AUC AS A SELECTION TOOL
An attempt is now made to establish whether the area under the ROC curve (AUC) can be used as a
tool for the selection of variables, in other words, whether a model can be built by including
variables that increase the AUC as they enter the model. A variable stays in the model provided it
is significant according to the Wald test. As in Forward stepwise selection, variables enter the
model one at a time.
The process starts by building one-variable models and recording the AUC and the P-values, as
shown in Table 23. The one-variable model with the highest AUC provides the first variable to
enter the model. In the next step each of the other variables is added in turn, and only the
two-variable models with an AUC greater than the highest AUC obtained in the first step are
considered. In the third step, the two-variable model with the highest AUC becomes the basis for a
three-variable model, and again only models with an AUC higher than the largest obtained in the
previous step are considered. In any step, if more than one model attains the same maximum AUC, the
model that forms the basis for the next step is selected using the AIC. The process continues
in this way until the AUC does not increase further even when the number of variables in the model
increases. However, only variables that are significant according to the Wald test are allowed
to stay in the model.
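The procedure described above can be sketched as a greedy loop. This is a simplified illustration rather than the code used in the study: `evaluate` is a hypothetical callback that is assumed to fit a logistic regression on the given variables and return the AUC, the AIC and a dictionary of Wald p-values, and the 5% level `alpha` is an assumed cut-off. Variables that lose significance after a new variable enters are dropped and not reconsidered.

```python
def auc_forward_selection(candidates, evaluate, alpha=0.05):
    """Greedy AUC-driven forward selection (illustrative sketch only)."""
    selected, dropped, best_auc = [], set(), 0.0
    while True:
        trials = []
        for var in candidates:
            if var in selected or var in dropped:
                continue
            auc, aic, p_values = evaluate(selected + [var])
            # Consider only models that raise the AUC with a significant new variable.
            if auc > best_auc and p_values[var] <= alpha:
                trials.append((auc, -aic, var))
        if not trials:
            return selected, best_auc
        best_auc, _, var = max(trials)  # highest AUC, ties broken by lowest AIC
        selected.append(var)
        # Remove variables that are no longer significant by the Wald test.
        p_values = evaluate(selected)[2]
        for old in [v for v in selected if p_values[v] > alpha]:
            selected.remove(old)
            dropped.add(old)
            best_auc = evaluate(selected)[0]

# Toy stand-in for a model fit: invented AUC gains and p-values, not thesis data.
GAINS = {"a": 0.10, "b": 0.05, "c": 0.0}

def toy_evaluate(variables):
    auc = 0.5 + sum(GAINS[v] for v in variables)
    aic = 100.0 - 10.0 * len(variables)
    p_values = {v: (0.5 if v == "c" else 0.001) for v in variables}
    return auc, aic, p_values

chosen, final_auc = auc_forward_selection(["a", "b", "c"], toy_evaluate)
print(chosen, round(final_auc, 3))
```

In practice `evaluate` would wrap whatever logistic regression routine is available; keeping it as a callback leaves the selection logic independent of the fitting software.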
Tables 23 to 36 give the results of applying the above procedure to our data set. We note that in the
last two steps (Tables 35 and 36) there are non-significant variables. The final model, with eleven
variables, is given in Table 34. It is the same as the eleven-variable model obtained
previously using the Purposeful, Backward and Stepwise selection procedures.
The ROC curve for the model in Table 34 is given in figure 5.2. The area under this curve is 0.703,
as shown in the table in question. This value indicates fair discrimination (predictive
accuracy) by the model.
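For readers who wish to compute an AUC directly from predicted probabilities, the following minimal sketch uses the pairwise interpretation of the area: the proportion of (pass, fail) pairs in which the pass has the higher predicted probability, with ties counted as one half. The function name `auc_from_scores` and the five observations are invented for illustration and are not taken from the data set.

```python
def auc_from_scores(scores, labels):
    """AUC as the share of concordant (positive, negative) pairs; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))

# Invented predicted probabilities and outcomes (1 = pass, 0 = fail).
scores = [0.9, 0.8, 0.4, 0.35, 0.2]
labels = [1, 0, 1, 0, 0]
auc = auc_from_scores(scores, labels)
print(f"AUC = {auc:.3f}")
```

The double loop is quadratic in the sample size, so for a data set of the size used here a rank-based formula would be preferable; the pairwise form is shown because it mirrors the concordance percentages reported with each model.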
From Table 5.12 we see that (12240+1191)=13431 or 74% of the observations in our data are
correctly classified by the logistic regression model in Table 34. Out of 5083 observed passes, 1191
or 23% are correctly classified as predicted passes. The remaining 3892 or 77% of these
observations are incorrectly classified as predicted fails (false negatives). Only 724 or 5.6% of
the observed fails are incorrectly classified as predicted passes (false positives).
[Figure 5.2: ROC curve. Vertical axis: Sensitivity (0.0 to 1.0); horizontal axis: 1 - Specificity (0.0 to 1.0).]
Figure 5.2 ROC curve for the model obtained using AUC procedure.
Actual            Predicted by Model
Classification    0        1       Total
0                 12240    724     12964
1                 3892     1191    5083
Total             16132    1915    18047
Table 5.12 Contingency Matrix for the Model in Table 34
5.4 THE AUC AND THE STEPWISE SELECTION PROCEDURES
These two selection procedures produced similar models. Both procedures involve ‘picking’ and
‘dropping’ variables, and we now investigate the order in which variables enter and leave the
models. The comparison is shown in Table 5.13.
Stepwise Procedure                                     AUC Procedure
Step    Variable Entered/Removed    Wald P-value       Step    Variable Entered/Removed       Wald P-value    AUC
1       Agregate                    0.0001             1       Agregate                       0.0001          0.637
2       Faculty_4                   0.0001             2       Brace                          0.0001          0.656
3       Faculty_7                   0.0001             3       Finaidd                        0.0001          0.671
4       Finaidd                     0.0001             4       Faculty_4                      0.0001          0.681
5       Brace                       0.0001             5       Faculty_7                      0.0001          0.687
6       Genderr                     0.0001             6       Genderr                        0.0001          0.690
7       Faculty_6                   0.0001             7       Faculty_6                      0.0001          0.694
8       Faculty_2                   0.0001             8       Faculty_2                      0.0001          0.695
9       Faculty_3                   0.0001             9       Faculty_3                      0.0058          0.697
10      Faculty_5                   0.0001             10      Faculty_5                      0.0001          0.701
11      Faculty_6 Removed           0.6888             10      Faculty_6 Removed              0.6888          0.702
12      English                     0.0002             11      English                        0.0002          0.703
13      Campuss                     0.0038             12      Campuss                        0.0038          0.703
14      Age                         0.862              13      Age Entered and Removed        0.0864          0.703
15      Age Removed                 0.864              14      Maritall Entered & Removed     0.1584          0.703
Table 5.13 Comparison of the Stepwise and the AUC procedures
From Table 5.13, both procedures have Agregate as the first variable to enter the model. In steps 2
to 5 the same variables entered the model, though not in the same sequence. From step 6 up
to the end, the two procedures yielded almost the same results. But the Stepwise procedure did not
consider the variable Maritall for entry into the model.
The example used is perhaps not ideal for investigating the ROC curve as a variable selection
technique. Here we have many potential variables to select from, each of which makes only a small
contribution to the predicted probabilities. However, almost all of these contributions are
statistically significant because of the huge sample size! Judging by the AUCs, the
increase in AUC from Table 32 to Table 36 (Appendix 6) is only 0.2%, and from Table 29 to Table
36 only 0.8%. These are small increases, and one may as well decide to use the model of Table 34
as the final model. It is clear that much more research on the use of the AUC is needed.
CHAPTER 6
DISCUSSION AND CONCLUSION
The purpose of this study was to explore methods and procedures used to select predictor variables
for binary response variables. As a point of departure, selection procedures for a
continuous response variable were also discussed in order to illuminate the whole question of
variable selection.
We have seen that selection procedures for binary and continuous response variables
are basically the same; for example, the methods used in Logistic regression are very similar to
those used for the Cox regression model. For both regressions, the ‘Purposeful Selection of
Variables’ emerges as the most interesting and recommended procedure for selecting variables,
since the method is completely controlled by the analyst. The stepwise and the best subset
procedures are statistical algorithms which, to some extent, do the selection automatically. In
situations where the number of variables is not large, Purposeful selection is recommended as the
sole tool for selection. When the number of variables is too large, it can be coupled with Stepwise
selection, which first reduces the number of predictor variables to a reasonable number before
Purposeful selection is used. Another advantage of Purposeful selection
is that variables that are scientifically relevant, or known to interact with other variables, can
be included regardless of their statistical significance. Thus the analyst, not the computer,
becomes responsible for the review and evaluation of the model.
The results of a fitted logistic regression model can intuitively be summarised via classification
tables. In this regard, the logistic regression model acts as a diagnostic test and the
classification table measures its predictive accuracy. However, this measure is statistically
insensitive. On the other hand, the area under the ROC curve, another measure of predictive
accuracy, is not an extremely sensitive measure for comparing two models. It is important to note
that a model with high predictive accuracy does not necessarily provide evidence that the model
fits well. Conversely, we may have a situation where the logistic regression model is in fact the
correct model, and thus fits the data, but classification or discrimination is poor. These measures
should, therefore, supplement more rigorous methods of assessment of fit.
The results in Tables 5.7, 5.13 and 34 suggest that, to some extent, the AUC can be used as a
criterion for variable selection, with the P-value of the Wald test used to remove non-significant
variables, perhaps even as an alternative to the Purposeful and Stepwise selection procedures.
However, further research is required to investigate this approach, especially for highly correlated
variables.
It is further recommended that the data set used to fit the model should not also be used to test
its predictive accuracy; otherwise the results become biased. A new set of observations should be
used to avoid this bias, and the method called jack-knifing should be applied.
The following are some of the major challenges for evaluating diagnostic tests and for applying
ROC methodology in particular:
(1) Status, for example disease status, is often not a fixed entity, but rather may evolve over time.
How can the time aspect be incorporated sensibly into ROC analysis?
(2) The statistical literature on diagnostic testing assumes that the test result is a simple numeric
value. However, test results may be much more complicated, involving several components. Do
ROC curves and the AUC have a role to play in determining how to combine different sources of
information to optimise diagnostic accuracy?
The very brief investigation into the use of ROC curves and the AUC in this thesis by no means
yields definitive answers to the question: How effective is the ROC curve as a tool for subset
selection? Much more research is needed.
Finally, as the information revolution brings us larger data sets, with more and more variables, the
demand for variable selection will strengthen and continue to be a basic strategy for data analysis.
New problems will also appear as demand increases for data mining of massive data sets.
APPENDIX 1A
The UNIVARIATE Procedure
Variable: age (age)

Moments
N                  18047         Sum Weights         18047
Mean               20.0791821    Sum Observations    362369
Std Deviation      2.72214158    Variance            7.41005479
Skewness           4.36262118    Kurtosis            28.9348552
Uncorrected SS     7409795       Corrected SS        133721.849
Coeff Variation    13.5570342    Std Error Mean      0.02026321

Basic Statistical Measures
Location              Variability
Mean      20.07918    Std Deviation          2.72214
Median    19.00000    Variance               7.41005
Mode      19.00000    Range                  38.00000
                      Interquartile Range    2.00000

Tests for Location: Mu0=0
Test           Statistic        p Value
Student's t    t  990.9182      Pr > |t|     <.0001
Sign           M  9023.5        Pr >= |M|    <.0001
Signed Rank    S  81428064      Pr >= |S|    <.0001

Quantiles (Definition 5)
Quantile      Estimate
100% Max      54
99%           33
95%           24
90%           22
75% Q3        21
50% Median    19
25% Q1        19
10%           18
5%            18
1%            17
0% Min        16

Extreme Observations
Lowest            Highest
Value    Obs      Value    Obs
16       17517    51       2298
16       17497    51       11190
16       11294    52       7516
16       10238    52       13182
16       9455     54       3372
Table 14 Univariate Analysis of the Variable Age
The UNIVARIATE Procedure
Variable: agregate (agregate)

Moments
N                  18047         Sum Weights         18047
Mean               1056.67779    Sum Observations    19069864
Std Deviation      218.254459    Variance            47635.0089
Skewness           0.5336167     Kurtosis            -0.1263512
Uncorrected SS     2.10103E10    Corrected SS        859621371
Coeff Variation    20.6547788    Std Error Mean      1.624653

Basic Statistical Measures
Location              Variability
Mean      1056.678    Std Deviation          218.25446
Median    1075.000    Variance               47635
Mode      1075.000    Range                  1440
                      Interquartile Range    365.00000

Tests for Location: Mu0=0
Test           Statistic        p Value
Student's t    t  650.4021      Pr > |t|     <.0001
Sign           M  9023.5        Pr >= |M|    <.0001
Signed Rank    S  81428064      Pr >= |S|    <.0001

Quantiles (Definition 5)
Quantile      Estimate
100% Max      2160
99%           1612
95%           1440
90%           1320
75% Q3        1200
50% Median    1075
25% Q1        835
10%           835
5%            720
1%            720
0% Min        720

Extreme Observations
Lowest            Highest
Value    Obs      Value    Obs
720      18044    1705     12262
720      18041    1715     6901
720      18038    1750     2105
720      18032    1750     8313
720      18022    2160     9123
Table 15 Univariate Analysis of the Variable Agregate
APPENDIX 1B
Faculty     Frequency    Percent    Cumulative Frequency    Cumulative Percent
2           6586         36.49      6586                    36.49
3           3771         20.90      10357                   57.39
1           2506         13.89      12863                   71.28
5           1540         8.53       14403                   79.81
6           1390         7.70       15793                   87.51
4           1313         7.28       17106                   94.79
7           941          5.21       18047                   100.00

Race        Frequency    Percent    Cumulative Frequency    Cumulative Percent
4           12105        67.07      12105                   67.07
1           5334         29.56      17439                   96.63
3           341          1.89       17780                   98.52
2           267          1.48       18047                   100.00

Campuss     Frequency    Percent    Cumulative Frequency    Cumulative Percent
1           12004        66.52      12004                   66.52
2           6043         33.48      18047                   100.00

english     Frequency    Percent    Cumulative Frequency    Cumulative Percent
1           12520        69.37      12520                   69.37
2           5527         30.63      18047                   100.00

genderr     Frequency    Percent    Cumulative Frequency    Cumulative Percent
1           9207         51.02      9207                    51.02
2           8840         48.98      18047                   100.00

maritall    Frequency    Percent    Cumulative Frequency    Cumulative Percent
1           17782        98.53      17782                   98.53
2           265          1.47       18047                   100.00

finaidd     Frequency    Percent    Cumulative Frequency    Cumulative Percent
2           16391        90.82      16391                   90.82
1           1656         9.18       18047                   100.00
Table 16 Analysis of Categorical Variables
APPENDIX 2
Criterion    Intercept Only    Intercept and Covariates
AIC          21460.178         19424.820
SC           21467.979         19534.030
-2 Log L     21458.178         19396.820

Testing Global Null Hypothesis: BETA=0
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio    2061.3579     13    <.0001
Score               2085.9704     13    <.0001
Wald                1798.6426     13    <.0001

Analysis of Maximum Likelihood Estimates
Parameter     DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept     1     -2.4774     0.2060            144.5645           <.0001
faculty_2     1     0.6342      0.0653            94.3829            <.0001
faculty_3     1     0.5862      0.0660            78.8456            <.0001
faculty_4     1     1.6319      0.0922            312.9702           <.0001
faculty_5     1     0.7349      0.0908            65.4564            <.0001
faculty_6     1     0.0370      0.0909            0.1654             0.6842
faculty_7     1     1.2008      0.0877            187.6919           <.0001
age           1     -0.0130     0.00750           2.9863             0.0840
agregate      1     0.00162     0.000094          298.0360           <.0001
Campuss 1     1     0.0657      0.0256            6.5952             0.0102
genderr 1     1     0.0987      0.0187            27.8821            <.0001
finaidd 1     1     0.3636      0.0282            165.7910           <.0001
english 1     1     0.0779      0.0237            10.8080            0.0010
Brace         1     -0.5574     0.0437            162.8453           <.0001

Odds Ratio Estimates
Effect            Point Estimate    95% Wald Confidence Limits
faculty_2         1.886             1.659    2.143
faculty_3         1.797             1.579    2.045
faculty_4         5.113             4.268    6.127
faculty_5         2.085             1.745    2.492
faculty_6         1.038             0.868    1.240
faculty_7         3.323             2.798    3.946
age               0.987             0.973    1.002
agregate          1.002             1.001    1.002
Campuss 1 vs 2    1.140             1.032    1.261
genderr 1 vs 2    1.218             1.132    1.311
finaidd 1 vs 2    2.069             1.852    2.311
english 1 vs 2    1.169             1.065    1.282
Brace             0.573             0.526    0.624

Association of Predicted Probabilities and Observed Responses
Percent Concordant    70.0        Somers' D    0.405
Percent Discordant    29.5        Gamma        0.408
Percent Tied          0.5         Tau-a        0.164
Pairs                 65896012    c            0.703

Effect      Unit       Estimate
age         5.0000     0.937
age         -5.0000    1.067
agregate    100.0      1.176
agregate    -100.0     0.851
Table 17 The Results of Forward Selection Procedure
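The odds ratios in Table 17 can be recovered from the coefficient estimates. The sketch below assumes the usual relations: the odds ratio for a change of `units` in a continuous covariate is exp(units × estimate), and the 95% Wald limits are exp(estimate ± 1.96 × standard error). The function names are hypothetical, and because the tabulated estimates are rounded the results agree with the table only to about the third decimal.

```python
import math

def odds_ratio(estimate, units=1.0):
    """Odds ratio for a change of `units` in a covariate with the given coefficient."""
    return math.exp(units * estimate)

def wald_limits(estimate, std_error, z=1.96):
    """Approximate 95% Wald confidence limits for the odds ratio."""
    return math.exp(estimate - z * std_error), math.exp(estimate + z * std_error)

# Coefficients read from Table 17.
or_age_5 = odds_ratio(-0.0130, units=5)       # five additional years of age
or_agregate_100 = odds_ratio(0.00162, units=100)  # 100 additional Agregate points
lower, upper = wald_limits(0.6342, 0.0653)    # faculty_2

print(f"age (+5): {or_age_5:.3f}")
print(f"agregate (+100): {or_agregate_100:.3f}")
print(f"faculty_2: {odds_ratio(0.6342):.3f} ({lower:.3f}, {upper:.3f})")
```

Reporting the odds ratio for 100 Agregate points rather than for a single point is what makes the effect of this variable visible, since the per-point odds ratio of 1.002 is very close to one.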
APPENDIX 3
Criterion    Intercept Only    Intercept and Covariates
AIC          21460.178         19423.992
SC           21467.979         19517.601
-2 Log L     21458.178         19399.992

Testing Global Null Hypothesis: BETA=0
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio    2058.1862     11    <.0001
Score               2083.3017     11    <.0001
Wald                1796.8154     11    <.0001

Analysis of Maximum Likelihood Estimates
Parameter     DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept     1     -2.7409     0.1250            480.9214           <.0001
faculty_2     1     0.6197      0.0564            120.6196           <.0001
faculty_3     1     0.5729      0.0582            97.0702            <.0001
faculty_4     1     1.6244      0.0877            342.9895           <.0001
faculty_5     1     0.7316      0.0861            72.2871            <.0001
faculty_7     1     1.1867      0.0815            211.7846           <.0001
agregate      1     0.00163     0.000093          304.3700           <.0001
Campuss 1     1     0.0730      0.0252            8.3651             0.0038
genderr 1     1     0.1029      0.0184            31.1306            <.0001
finaidd 1     1     0.3637      0.0282            165.9746           <.0001
english 1     1     0.0837      0.0234            12.8165            0.0003
brace         1     -0.5567     0.0437            162.5179           <.0001

Odds Ratio Estimates
Effect            Point Estimate    95% Wald Confidence Limits
faculty_2         1.858             1.664    2.076
faculty_3         1.773             1.582    1.988
faculty_4         5.075             4.274    6.027
faculty_5         2.078             1.756    2.460
faculty_7         3.276             2.792    3.844
agregate          1.002             1.001    1.002
Campuss 1 vs 2    1.157             1.048    1.277
genderr 1 vs 2    1.229             1.143    1.321
finaidd 1 vs 2    2.070             1.853    2.312
english 1 vs 2    1.182             1.079    1.296
Brace             0.573             0.526    0.624

Association of Predicted Probabilities and Observed Responses
Percent Concordant    69.8        Somers' D    0.405
Percent Discordant    29.3        Gamma        0.409
Percent Tied          0.9         Tau-a        0.164
Pairs                 65896012    c            0.703

Effect      Unit      Estimate
agregate    100.0     1.177
agregate    -100.0    0.850
Table 18 The Results of The Backward Selection Procedure
APPENDIX 4A
Criterion    Intercept Only    Intercept and Covariates
AIC          21460.178         19423.992
SC           21467.979         19517.601
-2 Log L     21458.178         19399.992

Testing Global Null Hypothesis: BETA=0
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio    2058.1862     11    <.0001
Score               2083.3017     11    <.0001
Wald                1796.8154     11    <.0001

Analysis of Maximum Likelihood Estimates
Parameter     DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept     1     -2.7409     0.1250            480.9214           <.0001
faculty_2     1     0.6197      0.0564            120.6196           <.0001
faculty_3     1     0.5729      0.0582            97.0702            <.0001
faculty_4     1     1.6244      0.0877            342.9895           <.0001
faculty_5     1     0.7316      0.0861            72.2871            <.0001
faculty_7     1     1.1867      0.0815            211.7846           <.0001
agregate      1     0.00163     0.000093          304.3700           <.0001
Campuss 1     1     0.0730      0.0252            8.3651             0.0038
genderr 1     1     0.1029      0.0184            31.1306            <.0001
finaidd 1     1     0.3637      0.0282            165.9746           <.0001
english 1     1     0.0837      0.0234            12.8165            0.0003
Brace         1     -0.5567     0.0437            162.5179           <.0001

Odds Ratio Estimates
Effect            Point Estimate    95% Wald Confidence Limits
faculty_2         1.858             1.664    2.076
faculty_3         1.773             1.582    1.988
faculty_4         5.075             4.274    6.027
faculty_5         2.078             1.756    2.460
faculty_7         3.276             2.792    3.844
agregate          1.002             1.001    1.002
Campuss 1 vs 2    1.157             1.048    1.277
genderr 1 vs 2    1.229             1.143    1.321
finaidd 1 vs 2    2.070             1.853    2.312
english 1 vs 2    1.182             1.079    1.296
Brace             0.573             0.526    0.624

Association of Predicted Probabilities and Observed Responses
Percent Concordant    69.8        Somers' D    0.405
Percent Discordant    29.3        Gamma        0.409
Percent Tied          0.9         Tau-a        0.164
Pairs                 65896012    c            0.703

Effect      Unit      Estimate
agregate    100.0     1.177
agregate    -100.0    0.850
Table 19 Results of The Stepwise Selection Procedure
APPENDIX 4B
Criterion    Intercept Only    Intercept and Covariates
AIC          21460.178         19604.472
SC           21467.979         19744.885
-2 Log L     21458.178         19568.472

Testing Global Null Hypothesis: BETA=0
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio    1889.7059     17    <.0001
Score               1928.0421     17    <.0001
Wald                1700.7783     17    <.0001

Analysis of Maximum Likelihood Estimates
Parameter     DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept     1     0.8217      0.3777            4.7341             0.0296
faculty_2     1     1.0533      0.2352            20.0490            <.0001
faculty_3     1     0.5549      0.0578            92.2633            <.0001
faculty_4     1     -1.0342     0.4876            4.4991             0.0339
faculty_5     1     -1.4882     0.4363            11.6360            0.0006
faculty_7     1     1.1323      0.0809            196.1369           <.0001
agregate_     1     0.4487      0.0688            42.5332            <.0001
Campuss 1     1     -0.7741     0.1477            27.4734            <.0001
genderr 1     1     0.1077      0.0184            34.2445            <.0001
finaidd 1     1     -0.1440     0.1016            2.0090             0.1564
english 1     1     0.3295      0.0610            29.1972            <.0001
Brace         1     -0.9424     0.1739            29.3771            <.0001
camfin        1     -0.9071     0.1547            34.3832            <.0001
finfac2       1     -0.2644     0.1217            4.7179             0.0298
finfac4       1     1.4005      0.2526            30.7308            <.0001
finfac5       1     1.2059      0.2277            28.0501            <.0001
agrbrac       1     -0.2351     0.0839            7.8607             0.0051
engbrac       1     0.3704      0.1317            7.9155             0.0049

Odds Ratio Estimates
Effect            Point Estimate    95% Wald Confidence Limits
faculty_2         2.867             1.808    4.547
faculty_3         1.742             1.555    1.951
faculty_4         0.356             0.137    0.924
faculty_5         0.226             0.096    0.531
faculty_7         3.103             2.648    3.636
agregate_         1.566             1.369    1.792
Campuss 1 vs 2    0.213             0.119    0.379
genderr 1 vs 2    1.240             1.154    1.333
finaidd 1 vs 2    0.750             0.503    1.117
english 1 vs 2    1.933             1.522    2.455
Brace             0.390             0.277    0.548
camfin            0.404             0.298    0.547
finfac2           0.768             0.605    0.975
finfac4           4.057             2.473    6.657
finfac5           3.340             2.137    5.218
agrbrac           0.790             0.671    0.932
engbrac           1.448             1.119    1.875

Association of Predicted Probabilities and Observed Responses
Percent Concordant    68.7        Somers' D    0.392
Percent Discordant    29.5        Gamma        0.399
Percent Tied          1.9         Tau-a        0.159
Pairs                 65896012    c            0.696
Table 20 The Results of The Stepwise Procedure with Interactions included.
APPENDIX 5
Number of    Score
Variables    Chi-Square    Variables included in the model
1            976.4420      agregate
1            768.8932      Brace
2            1431.5177     faculty_4 agregate
2            1256.2088     agregate Brace
3            1597.4285     faculty_4 faculty_7 agregate
3            1580.0377     faculty_4 agregate Brace
4            1749.6162     faculty_4 agregate finaidd1 Brace
4            1734.2736     faculty_4 faculty_7 agregate finaidd1
5            1869.5660     faculty_4 faculty_7 agregate finaidd1 Brace
5            1838.9862     faculty_4 agregate genderr1 finaidd1 Brace
6            1950.3123     faculty_4 faculty_7 agregate genderr1 finaidd1 Brace
6            1905.0736     faculty_2 faculty_4 faculty_7 agregate finaidd1 Brace
7            1976.8571     faculty_4 faculty_6 faculty_7 agregate genderr1 finaidd1 Brace
7            1976.3923     faculty_2 faculty_4 faculty_7 agregate genderr1 finaidd1 Brace
8            2036.9823     faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 agregate finaidd1 Brace
8            2017.8803     faculty_2 faculty_3 faculty_4 faculty_7 agregate genderr1 finaidd1 Brace
9            2071.1645     faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 agregate genderr1 finaidd1 Brace
9            2044.7677     faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 agregate finaidd1 english1 Brace
10           2077.5397     faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 agregate genderr1 finaidd1 English Brace
10           2077.2280     faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 agregate Campuss1 genderr1 finaidd1 Brace
11           2083.3017     faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 agregate Campuss1 genderr1 finaidd1 english1 Brace
11           2080.0984     faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 age agregate genderr1 finaidd1 english1 Brace
12           2084.8214     faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 age agregate Campuss1 genderr1 finaidd1 english1 Brace
12           2084.3485     faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 agregate Campuss1 genderr1 marital1 finaidd1 english1 Brace
13           2085.9704     faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 faculty_7 age agregate campuss1 genderr1 finaidd1 english1 Brace
13           2085.4222     faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 faculty_7 agregate Campuss1 genderr1 maritall finaidd1 english1 Brace
14           2086.1659     faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 faculty_7 age agregate campuss1 genderr1 maritall finaidd1 english1 Brace
Table 21 The Results of Best Subset Selection Procedure using Score Criterion.
C(p) Selection Method
Number of Observations Used
18047
18047
Weight: v
Number in
Model        C(p)       R-Square    Variables in Model
12           11.6129    0.0902      faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 age agregate Campuss genderr finaidd English Brace
11           12.4942    0.0900      faculty_2 faculty_3 faculty_4 faculty_5 faculty_7 Campuss agregate gender finaidd English Brace
Table 22 The Results of Best Subset Selection Procedure using Cp Criterion.
APPENDIX 6
Variable     p-value    AUC
age          <0.0001    0.532
agregate     <0.0001    0.637 *
campuss      <0.0001    0.537
genderr      <0.0001    0.541
marital      0.0073     0.503
finaidd      <0.0001    0.532
english      <0.0001    0.576
brace        <0.0001    0.608
faculty_2    <0.0001    0.545
faculty_3    0.0148     0.508
faculty_4    <0.0001    0.555
faculty_5    <0.0001    0.508
faculty_6    <0.0001    0.520
faculty_7    <0.0001    0.523
Table 23 Step1 of the AUC procedure
Variables already in the model: agregate (p-value <0.0001 in every model)
Variable added    p-value    AUC
marital           <0.0001    0.637
faculty_5         0.4389     0.637
faculty_3         0.5985     0.638
gender            <0.0001    0.642
campuss           <0.0001    0.643
faculty_6         <0.0001    0.643
english           <0.0001    0.647
finaidd           <0.0001    0.648
faculty_4         <0.0001    0.655
brace             <0.0001    0.656 *
Table 24 Step 2 of the AUC procedure
Variables already in the model: agregate brace (each p-value <0.0001 in every model)
Variable added    p-value    AUC
faculty_2         <0.0132    0.658
english           <0.0001    0.658
campuss           <0.0001    0.660
faculty_7         <0.0001    0.660
genderr           <0.0001    0.662
faculty_6         <0.0001    0.663
faculty_4         <0.0001    0.666
finaidd           <0.0001    0.671 *
Table 25 Step 3 of the AUC procedure
Variables already in the model: agregate brace finaidd (each p-value <0.0001 in every model)
Variable added    p-value    AUC
faculty_2         <0.0305    0.673
campuss           <0.0001    0.674
english           <0.0001    0.672
genderr           <0.0001    0.677
faculty_6         <0.0001    0.677
faculty_4         <0.0001    0.681 *
Table 26 Step 4 of the AUC Procedure
Variables already in the model: agregate brace finaidd faculty_4 (each p-value <0.0001 in every model)
Variable added    p-value    AUC
faculty_3         0.0226     0.682
english           <0.0001    0.683
faculty_2         <0.0001    0.683
faculty_6         <0.0001    0.685
genderr           <0.0001    0.686
faculty_7         <0.0001    0.687 *
Table 27 Step 5 of the AUC procedure
Variables already in the model: agregate brace finaidd faculty_4 faculty_7 (each p-value <0.0001 in every model)
Variable added    p-value    AUC
faculty_5         <0.0001    0.688
faculty_3         <0.0001    0.688
english           <0.0001    0.688
faculty_2         <0.0001    0.690 *
genderr           <0.0001    0.690 *
Table 28 Step 6 of the AUC procedure
Variables already in the model: agregate brace finaidd faculty_4 faculty_7 genderr (each p-value <0.0001 in every model)
Variable added    p-value    AUC
faculty_5         0.0065     0.691
faculty_3         0.0002     0.692
english           0.0003     0.692
faculty_2         <0.0001    0.693
faculty_6         <0.0001    0.694 *
Table 29 Step 7 of the AUC procedure
Variables: agregate brace finaidd faculty_4 faculty_7 genderr faculty_6 english
p-value : <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 0.0010
AUC : 0.695*
Variables: agregate brace finaidd faculty_4 faculty_7 genderr faculty_6 faculty_2
p-value : <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001
AUC : 0.695*
Table 30 Step 8 of the AUC procedure
Variables: agregate brace finaidd faculty_4 faculty_7 genderr faculty_6 faculty_2 english
p-value : <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001
AUC : 0.696
Variables: agregate brace finaidd faculty_4 faculty_7 genderr faculty_6 faculty_2 faculty_3
p-value : <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 0.0058
AUC : 0.697*
Table 31 Step 9 of the AUC procedure
Variables: agregate brace finaidd faculty_4 faculty_7 genderr faculty_6 faculty_2 faculty_3 english
p-value : <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 0.0091 <0.0001 <0.0001 0.0002
AUC : 0.698
Variables: agregate brace finaidd faculty_4 faculty_7 genderr faculty_6 faculty_2 faculty_3 faculty_5
p-value : <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 0.6888 <0.0001 <0.0001 0.0001
AUC : 0.701*
Variables: agregate brace finaidd faculty_4 faculty_7 genderr faculty_2 faculty_3 faculty_5
p-value : <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 0.0001
AUC : 0.701*
Table 32 Step 10 of the AUC procedure
Variables: agregate brace finaidd faculty_4 faculty_7 genderr faculty_2 faculty_3 faculty_5 english
p-value : <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 0.0002
AUC : 0.702*
Table 33 Step 11 of the AUC procedure
Variables: agregate brace finaidd faculty_4 faculty_7 genderr faculty_2 faculty_3 faculty_5 english campuss
p-value : <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 0.0038
AUC : 0.703*
Table 34 Step 12 of the AUC procedure
Variables: agregate brace finaidd faculty_4 faculty_7 genderr faculty_2 faculty_3 faculty_5 english campuss age
p-value : <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 0.0090 <0.0001 0.0864
AUC : 0.703*
Table 35 Step 13 of the AUC procedure
Variables: agregate brace finaidd faculty_4 faculty_7 genderr faculty_2 faculty_3 faculty_5 english campuss maritall
p-value : <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 0.0005 <0.0060 0.1584
AUC : 0.703*
Table 36 Step 14 of the AUC procedure
APPENDIX 7
SAS PROGRAMME
data jimmy;
set sasuser.osiame;
proc freq order=freq;
tables faculty race campuss english genderr maritall finaidd;
run;
data jimmy2;
set sasuser.osiame;
proc univariate;
var age agregate;
title;
run;
data joseph1;
set sasuser.osiame;
if score ='1' then pass = 1;
else if score ='2' then pass=0;
if faculty = '2' then faculty_2=1;
else faculty_2=0;
if faculty = '3' then faculty_3 =1;
else faculty_3 =0;
if faculty ='4' then faculty_4 =1;
else faculty_4 = 0;
if faculty ='5' then faculty_5 =1;
else faculty_5=0;
if faculty = '6' then faculty_6 =1;
else faculty_6 = 0;
if faculty ='7' then faculty_7=1;
else faculty_7 =0;
if race >3 then Brace=1;
else Brace=0;
Keep faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 faculty_7
pass age agregate campuss genderr maritall finaidd english brace;
run;
proc logistic descending;
class campuss maritall finaidd genderr english;
model pass =age;
units age=5 -5;
run;
proc logistic descending;
class campuss maritall finaidd genderr english;
model pass=agregate;
units agregate=100 -100;
run;
proc logistic descending;
class campuss maritall finaidd genderr english;
model pass=campuss;
run;
proc logistic descending;
class campuss maritall finaidd genderr english;
model pass=maritall;
run;
proc logistic descending;
class campuss maritall finaidd genderr english;
model pass=finaidd;
run;
proc logistic descending;
class campuss maritall finaidd genderr english;
model pass=genderr;
run;
proc logistic descending;
class campuss maritall finaidd genderr english;
model pass=english;
run;
proc logistic descending;
class campuss maritall finaidd genderr english;
model pass=faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 faculty_7;
run;
/* The model without the variable faculty_6, which was insignificant in the
univariate logistic regression */
proc logistic descending;
class campuss maritall finaidd genderr english;
model pass=age agregate campuss maritall finaidd genderr english faculty_2
faculty_3 faculty_4 faculty_5 faculty_7 brace;
units age=5 -5 agregate= 100 -100;
run;
/* The model without the variables faculty_6 and maritall */
proc logistic descending;
class campuss maritall finaidd genderr english;
model pass=age agregate campuss finaidd genderr english faculty_2
faculty_3 faculty_4 faculty_5 faculty_7 brace ;
units age=5 -5 agregate= 100 -100;
run;
/* The model without the variables faculty_6, maritall and age */
proc logistic descending;
class campuss maritall finaidd genderr english;
model pass= agregate campuss finaidd genderr english faculty_2
faculty_3 faculty_4 faculty_5 faculty_7 brace;
units agregate=100 -100;
run;
/* Variable faculty_6 re-enters the model */
proc logistic descending;
class campuss maritall finaidd genderr english;
model pass= agregate campuss finaidd genderr english faculty_2
faculty_3 faculty_4 faculty_5 faculty_7 brace faculty_6;
units agregate=100 -100;
run;
/* Variable maritall re-enters the model */
proc logistic descending;
class campuss maritall finaidd genderr english;
model pass= agregate campuss finaidd genderr english faculty_2
faculty_3 faculty_4 faculty_5 faculty_7 brace maritall;
units agregate=100 -100;
run;
/* The variable age re-enters the model */
proc logistic descending;
class campuss maritall finaidd genderr english;
model pass= agregate campuss finaidd genderr english faculty_2
faculty_3 faculty_4 faculty_5 faculty_7 brace faculty_6 age ;
units age=5 -5
agregate= 100 -100;
run;
/* Variables that give the preliminary main effects model */
proc logistic descending;
class campuss maritall finaidd genderr english;
model pass= agregate campuss finaidd genderr english faculty_2
faculty_3 faculty_4 faculty_5 faculty_7 brace;
units agregate=100 -100;
run;
/* Examining the scale of the continuous covariate */
/* The variable agregate is analysed using quartiles */
data joseph3;
set joseph1;
if 720 <= agregate <= 835 then agregroup =1;
else if 835 < agregate <=1075 then agregroup=2;
else if 1075 < agregate <=1200 then agregroup=3;
else if agregate > 1200 then agregroup=4;
if agregroup=2 then agre_2=1;
else agre_2=0;
if agregroup=3 then agre_3=1;
else agre_3=0;
if agregroup=4 then agre_4=1;
else agre_4=0;
run;
proc logistic descending;
class campuss maritall finaidd genderr english;
model pass= agre_2 agre_3 agre_4 campuss finaidd genderr english faculty_2
faculty_3 faculty_4 faculty_5 faculty_7 brace;
run;
data midpoints;
input agreg coef;
cards;
775.5 0
955 .2898
1137.5 1.0672
1680 .9989
;
run;
goptions reset =all;
symbol c=blue v=dot h=.8 i=j;
axis order=(0 to 1.5 by .2) label=(a=90 'logit');
proc gplot data=midpoints;
plot coef*agreg / vaxis=axis;
run;
quit;
data joseph6;
set joseph3;
proc chart;
vbar agregate / midpoints=100 to 2200 by 100
GROUP=pass;
run;
data scale;
set joseph3;
exlinex=agregate*log(agregate);
run;
proc logistic descending;
class campuss maritall finaidd genderr english;
model pass=agregate exlinex campuss finaidd genderr english faculty_2
faculty_3 faculty_4 faculty_5 faculty_7 brace;
run;
/* The variable agregate is dichotomised */
data joseph4;
set joseph3;
if agregate >= 1075 then agregate_=1;
else agregate_=0;
run;
/* Fitting a dichotomous variable agregate_ */
proc logistic descending;
class english finaidd campuss genderr;
model pass = english finaidd campuss genderr faculty_2 faculty_3
faculty_4 faculty_5 faculty_7 brace agregate_;
run;
data interaction;
set joseph4;
engagr=english*agregate_;
engfin=english*finaidd;
camfin=campuss*finaidd;
finfac2=finaidd*faculty_2;
finfac3=finaidd*faculty_3;
finfac4=finaidd*faculty_4;
finfac5=finaidd*faculty_5;
finfac7=finaidd*faculty_7;
racfin=brace*finaidd;
agrbrac=agregate_*brace;
engbrac=english*brace;
run;
proc logistic data=interaction;
class campuss genderr finaidd english ;
model pass= campuss genderr finaidd english faculty_2 faculty_3
faculty_4 faculty_5 faculty_7 brace
agregate_ engagr;
run;
proc logistic data=interaction;
class campuss genderr finaidd english ;
model pass= campuss genderr finaidd english faculty_2 faculty_3
faculty_4 faculty_5 faculty_7 brace
agregate_ engfin;
run;
proc logistic data=interaction;
class campuss genderr finaidd english ;
model pass= campuss genderr finaidd english faculty_2 faculty_3
faculty_4 faculty_5 faculty_7 brace
agregate_ camfin;
run;
proc logistic data=interaction;
class campuss genderr finaidd english ;
model pass= campuss genderr finaidd english faculty_2 faculty_3
faculty_4 faculty_5 faculty_7 brace
agregate_ finfac2;
run;
proc logistic data=interaction;
class campuss genderr finaidd english ;
model pass= campuss genderr finaidd english faculty_2 faculty_3
faculty_4 faculty_5 faculty_7 brace
agregate_ finfac3;
run;
proc logistic data=interaction;
class campuss genderr finaidd english ;
model pass= campuss genderr finaidd english faculty_2 faculty_3
faculty_4 faculty_5 faculty_7 brace
agregate_ finfac4;
run;
proc logistic data=interaction;
class campuss genderr finaidd english ;
model pass= campuss genderr finaidd english faculty_2 faculty_3
faculty_4 faculty_5 faculty_7 brace
agregate_ finfac7;
run;
proc logistic data=interaction;
class campuss genderr finaidd english ;
model pass= campuss genderr finaidd english faculty_2 faculty_3
faculty_4 faculty_5 faculty_7 brace
agregate_ finfac5;
run;
proc logistic data=interaction;
class campuss genderr finaidd english ;
model pass= campuss genderr finaidd english faculty_2 faculty_3
faculty_4 faculty_5 faculty_7 brace
agregate_ racfin;
run;
proc logistic data=interaction;
class campuss genderr finaidd english ;
model pass= campuss genderr finaidd english faculty_2 faculty_3
faculty_4 faculty_5 faculty_7 brace
agregate_ agrbrac;
run;
proc logistic data=interaction;
class campuss genderr finaidd english ;
model pass= campuss genderr finaidd english faculty_2 faculty_3
faculty_4 faculty_5 faculty_7 brace
agregate_ engbrac;
run;
proc logistic data=interaction descending;
class campuss genderr finaidd english ;
model pass= campuss genderr finaidd english faculty_2 faculty_3
faculty_4 faculty_5 faculty_7 brace
agregate_ engagr engfin finfac4 finfac2
finfac5 racfin agrbrac engbrac ;
run;
proc logistic data=interaction;
class campuss genderr finaidd english ;
model pass= campuss genderr finaidd english faculty_2 faculty_3
faculty_4 faculty_5 faculty_7 brace
agregate_ engagr engfin finfac4 finfac2
finfac5 racfin agrbrac engbrac ;
run;
proc logistic data=interaction;
class campuss genderr finaidd english ;
model pass= campuss genderr finaidd english faculty_2 faculty_3
faculty_4 faculty_5 faculty_7 brace
agregate_ engagr engfin finfac4 finfac2
finfac5 agrbrac engbrac ;
run;
proc logistic data=interaction;
class campuss genderr finaidd english ;
model pass= campuss genderr finaidd english faculty_2 faculty_3
faculty_4 faculty_5 faculty_7 brace
agregate_
engfin finfac4 finfac2
finfac5 agrbrac engbrac ;
run;
proc logistic data=interaction descending noprint;
class campuss genderr finaidd english ;
model pass= campuss genderr finaidd english faculty_2 faculty_3
faculty_4 faculty_5 faculty_7 brace
agregate_
finfac4 finfac2
finfac5 agrbrac engbrac ;
output out=probability predicted=phat;
run;
data probability1;
set probability;
predicts=(phat>=.5);
run;
proc freq data=probability1;
tables pass*predicts / norow nocol nopercent;
run;
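/* Editorial sketch, not part of the original programme: the sensitivity
   and specificity implied by the classification table above, at the 0.5
   cut-off. It assumes the data set probability1 with the 0/1 variables
   pass and predicts created above; the data set name classrate and its
   two column names are illustrative. */
proc sql;
create table classrate as
select sum(pass=1 and predicts=1) / sum(pass=1) as sensitivity,
       sum(pass=0 and predicts=0) / sum(pass=0) as specificity
from probability1;
quit;
proc print data=classrate;
run;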
proc logistic data=interaction descending;
class campuss genderr finaidd english;
model pass=campuss genderr finaidd english faculty_2 faculty_3 faculty_4
faculty_5 faculty_7 brace agregate_ finfac4 finfac2 finfac5 agrbrac
engbrac / outroc=rocl;
goptions cback=white
colors=(blue)
border;
axis1 length=2.5in;
axis2 order =(0 to 1 by .1) length=2.5in;
proc gplot data=rocl;
symbol1 i=join v=none;
title 'First Year TUT Students Success ROC Curve';
plot _sensit_*_1mspec_ / haxis=axis1 vaxis=axis2;
run;
quit;
/* Forward selection procedure */
data foward;
set joseph1;
proc logistic descending;
class campuss genderr finaidd english;
model pass=faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 faculty_7
age agregate campuss genderr maritall finaidd english brace
/ selection=forward slentry=.25 details ;
units age=5 -5 agregate=100 -100;
run;
/* Backward Selection procedure */
proc logistic descending;
class campuss genderr finaidd english;
model pass=faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 faculty_7
age agregate campuss genderr maritall finaidd english brace
/ selection=backward details slstay=.05;
units age=5 -5 agregate=100 -100;
run;
/* Stepwise Selection procedure */
proc logistic descending;
class campuss genderr finaidd english;
model pass=faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 faculty_7
agregate campuss genderr maritall age finaidd english brace
/ selection=stepwise slentry=.25;
units agregate=100 -100;
run;
/* Stepwise Selection procedure used to select Interactions */
/* data=interaction is named explicitly: the interaction terms exist only in that data set */
proc logistic data=interaction descending;
class campuss genderr finaidd english;
model pass=faculty_2 faculty_3 faculty_4 faculty_5 faculty_7
agregate_ campuss genderr finaidd english brace engagr
engfin camfin finfac2 finfac3 finfac4 finfac5 finfac7 racfin
agrbrac engbrac / selection=stepwise slentry=.25 include=11;
run;
/* Best Subset Selection procedure using Score criterion */
proc logistic descending;
class campuss genderr finaidd english;
model pass=faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 faculty_7
age agregate campuss genderr maritall finaidd english brace
/ selection=score best=2;
units age=5 -5 agregate=100 -100;
run;
/* Best Subset procedure using Cp criterion */
proc logistic descending;
class campuss genderr finaidd english;
model pass=faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 faculty_7
age agregate campuss genderr maritall finaidd english brace;
output out=best2 prob=pihat;
run;
data best3;
set best2 ;
z=log(pihat/(1-pihat))+((pass-pihat)/(pihat*(1-pihat)));
v=pihat*(1-pihat);
run;
proc reg;
model z=faculty_2 faculty_3 faculty_4 faculty_5 faculty_6 faculty_7
age agregate campuss genderr maritall finaidd
english brace/selection=cp best=3;
weight v;
run;
quit;
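/* Editorial sketch, not part of the original programme: one step of the
   AUC procedure tabulated in Appendix 6. Each candidate variable is added
   in turn to the current model and the candidates are ranked by AUC.
   Assumptions: the data set joseph1 created above; the AUC is taken to be
   the c statistic that PROC LOGISTIC reports in its ODS table Association;
   the macro name auc_step and the data sets auc_one and auc_all are
   illustrative names only. */
%macro auc_step(base=, candidates=);
%local i cand;
/* Start from an empty results table (NOWARN: no complaint if it is absent) */
proc datasets lib=work nolist nowarn;
delete auc_all;
quit;
%let i=1;
%do %while(%scan(&candidates, &i) ne );
%let cand=%scan(&candidates, &i);
/* Capture the association table, which holds the c statistic */
ods output Association=auc_one;
proc logistic data=joseph1 descending;
class campuss maritall finaidd genderr english;
model pass=&base &cand;
run;
/* Keep only the row that carries c, and label it with the candidate */
data auc_one;
length candidate $32;
set auc_one;
if Label2='c';
candidate="&cand";
auc=nValue2;
keep candidate auc;
run;
proc append base=auc_all data=auc_one force;
run;
%let i=%eval(&i+1);
%end;
/* The candidate with the largest AUC is listed first */
proc sort data=auc_all;
by descending auc;
run;
proc print data=auc_all;
run;
%mend auc_step;
/* Example call using the candidate set of Step 4 */
%auc_step(base=agregate brace finaidd,
candidates=faculty_2 campuss english genderr faculty_6 faculty_4);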
```