DEGREE PROJECT, IN MACHINE LEARNING, SECOND LEVEL
STOCKHOLM, SWEDEN 2015
Two-Stage Logistic Regression
Models for Improved Credit Scoring
ANTON LUND
KTH ROYAL INSTITUTE OF TECHNOLOGY
COMPUTER SCIENCE AND COMMUNICATION (CSC)
Two-Stage Logistic Regression Models for
Improved Credit Scoring
February 24, 2015
ANTON LUND
[email protected]
Master's Thesis in Computer Science
School of Computer Science and Communication
Royal Institute of Technology, Stockholm
Supervisor at KTH: Olov Engwall
Examiner: Olle Bälter
Project commissioned by: Klarna AB
Supervisor at the company: Hans Hjelm
Abstract
This thesis has investigated two-stage regularized logistic regressions applied to the credit scoring problem. Credit scoring refers to the practice of estimating the probability that a customer will default if given credit. The data was supplied by Klarna AB and contains a larger number of observations than many other research papers on credit scoring. In this thesis, a two-stage regression refers to two staged regressions where some kind of information from the first regression is used in the second regression to improve the overall performance. In the best performing models, the first stage was trained on alternative labels, payment status at earlier dates than the conventional one. The predictions were then used as input to, or to segment, the second stage. This gave a gini increase of approximately 0.01. Using conventional score cutoffs or distance to a decision boundary to segment the population did not improve performance.
Referat
This thesis has investigated two-stage regularized logistic regressions for estimating the credit scores of consumers. A credit score is a measure of creditworthiness and measures the probability that a person will not pay back his or her credit. The data comes from Klarna AB and contains more observations than much other research on credit scoring. In this thesis, a two-stage regression refers to a regression model consisting of two stages where information from the first stage is used in the second stage to improve the overall performance. The best performing models use, in the first stage, an alternative response variable, payment status at an earlier point in time than the conventional one, to segment the population or as a variable in the second stage. This gave a gini increase of approximately 0.01. Using simpler segmentation methods such as score cutoffs or distance to a decision boundary did not improve performance.
Contents

1 Introduction
  1.1 Background
  1.2 Thesis Objective
  1.3 Ethical Concerns
  1.4 Delimitations
  1.5 Choice of Methodology

2 Theory Review
  2.1 Machine Learning
    2.1.1 Sampling Bias
    2.1.2 Overfitting
  2.2 Training, Validation & Testing
    2.2.1 Validation
    2.2.2 Testing
  2.3 Logistic Regressions
    2.3.1 Basics
    2.3.2 Regularized Logistic Regressions
    2.3.3 Estimation

3 Related Works
  3.1 Segmentation Models
  3.2 Ensemble Models
  3.3 Two-stage Models

4 Method
  4.1 Synthesis
  4.2 Methodology
    4.2.1 Proposed Model Classes
  4.3 Practical Details
  4.4 Data

5 Results
  5.1 Choosing the Mixing Parameter, α
  5.2 First-stage Model
  5.3 Finding Candidate Models Using the Validation Sample
  5.4 Evaluating Candidate Models Using the Test Sample

6 Discussion

7 Conclusions
  7.1 Future Research

Bibliography

Appendix
Chapter 1
Introduction
This thesis aims to investigate two-stage logistic regression models for use in retail
underwriting. Underwriting refers to the process of assessing a customer’s eligibility
to receive credit. Underwriting models are used in a wide array of industries such
as banks, credit card companies, mobile phone companies and insurance companies.
When this process has been automated, it is usually instead referred to as credit
scoring. Credit scoring models use information such as income data and payment
history to predict the probability that the customer will default if given credit.
The degree project is carried out at Klarna, a company based in Stockholm,
Sweden that provides payment services to online merchants. One of the main components of this service is constructing and applying credit scoring models for retail
customers. The accuracy of the credit scoring models is a key driver of Klarna’s
profitability but the models also need to fulfil various obligations.
Some of the current credit scoring models at Klarna involve logistic regressions,
a type of probabilistic classification model used to predict a binary response. Credit
scoring models can often with a high accuracy classify customers that are clearly
likely to or not likely to pay on time, but may not do as well in more grey areas.
Klarna therefore wants to investigate whether the accuracy can be improved by implementing and evaluating two-stage logistic regression models. A two-stage model
should in this context be seen as a model where the first-stage model is trained using
the available data and the second-stage model uses information from the first-stage
model in some way to increase performance. The problem can be described by the
research question:
Do two-stage logistic regression models, while retaining simplicity, improve the performance of credit scoring models when compared to the
conventional logistic regression?
This paper starts with chapter 1, which gives an introduction to credit scoring
and a motivation for the research question in this thesis along with a quick discussion
of ethical concerns, the delimitations and the choice of methodology.
The remainder of the thesis will be outlined as follows. In chapter 2, basics
of machine learning will be covered and some important concepts when training
machine learning models will be described. This will be followed by chapter 3, a
summary of previous works in topics related to two-stage credit scoring models.
Next, chapter 4 will cover the specifics of the method and some practical details
with regards to the implementation. Chapter 5 will present the results and will be
followed by chapter 6 that will discuss the results and put them in a general context.
Lastly, chapter 7 will give a brief summary of the thesis and the conclusions.
1.1 Background
The applicant in a credit scoring process can for example be a consumer applying
for a credit card or a mortgage but also an organization, e.g. a company trying to
secure a loan for their business. An important difference between scoring consumers
and organizations is the available data (Thomas 2010). This paper will focus on
credit scoring of consumers and examples related to credit to retail customers.
The credit score is often designed to predict the probability that the credit
applicant will default but can also be related to more vague concepts such as the
probability that the credit applicant will be profitable (Hand and Henley 1997). In
practice, this is usually turned into a problem of classifying a customer as either
a “good” or a “bad”. The terms “goods” and “bads” will be used throughout the
paper to distinguish between customers that have good and bad outcomes. How
this is defined differs from company to company but simply put, a bad transaction
is when the customer has not fulfilled his or her payment obligations after some
duration of time after the credit due date.
Credit scoring is in a generalized perspective an application of a classification
problem. In a credit scoring model, each credit applicant is attributed a score,
based on available data. A good model should assign a high score to a credit applicant that is unlikely to default (or the equivalent positive outcome) (Mester 1997).
The industry standard in credit scoring is for the credit score to be a logarithmic transformation of the probability of default (PD) so that every 20-point decrease in score doubles the odds of defaulting. The precise definition varies from company to company. At Klarna, the credit score is defined as follows.
\text{Score} = -\log\left(\frac{\text{PD}}{1 - \text{PD}}\right) \cdot \frac{20}{\log(2)} + 600 - \log(50) \cdot \frac{20}{\log(2)} \qquad (1.1)
As an example, PDs of 1 % and 99 % roughly correspond to scores of 620
and 355, respectively. Figure 1.1 shows the relationship between probability of
default and credit score. An important characteristic of the score is the increased
granularity as probability approaches 0 and 1. For the reasoning behind the form
of the relationship with the PD and the score, see section 2.3.1.
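As an illustration (not taken from Klarna's implementation), equation (1.1) can be written as a small R function; the two calls reproduce the example values above.

    # Sketch of the score definition in equation (1.1); `pd` is assumed to be
    # a default probability strictly between 0 and 1.
    score <- function(pd) {
      -log(pd / (1 - pd)) * 20 / log(2) + 600 - log(50) * 20 / log(2)
    }
    score(0.01)  # approximately 620
    score(0.99)  # approximately 355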
There is a wide range of methods that can be used to create credit scoring
models. Some examples are (Lessmann, Seow, and Baesens 2013):
1. Statistical models such as linear probability models and discriminant analysis.
2. Machine learning models such as neural networks, support vector machines
and decision trees.
3. Methods derived from financial theory such as option pricing models.
4. Combination of different methods using for example ensemble learning methods.
Figure 1.1. Plot of the relationship between probability of default (x-axis) and credit score (y-axis).
Credit scoring is a highly active field and as in most fields of applied classification, there is a tradition of searching broadly when testing out new models and
methods.
A problem with many of the academic papers is that they often use data sets
that, in comparison to the data owned by Klarna, are very small. For example, one
source of data commonly used in machine learning is the UCI Machine Learning
Repository. It holds 4 data sets with credit application data, which all contain 1000 observations or less.¹ Klarna will usually have data sets on the order of 100,000
observations when training their models as well as more variables. A higher quality
data set should improve prediction accuracy and increase the potential of models
that train on subsets of the data.
Though dependent on the specific application, there are generally three sources
of data available when assessing credit eligibility (Thomas 2000):
1. Application data - E.g. type of purchase, requested credit.
2. External data - E.g. payment history from credit agencies or address lookup
from an address bureau.
3. Internal data - Data from previously accepted and rejected applications.
The application data is collected at each transaction. External data needs to be
purchased and a company might develop models to predict when buying external
data for a specific customer is profitable. Internal data exists for customers that
have used the company’s service earlier and increases in size over time. Internal
data might therefore not be available for all customers and might lose its relevance
over time as the population changes.
¹ See UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets.html for more information.
1.2 Thesis Objective
There is an extensive amount of literature on credit scoring and an abundance of
papers that explain methods for improving the accuracy of credit scoring models.
Many of these papers describe models that are intuitively hard to interpret such as
artificial neural networks (ANN) or complicated composite methods using ensemble
learning. Despite this fact, the model of choice for many companies developing
scoring models is the relatively simple logistic regression.
One possible reason for this is the sampling bias inherent in the credit scoring
problem (Greene 1998). As the data has usually passed through some earlier version
of a model, data for training and evaluation is restricted to data from previously
accepted customers. It is not ex-ante clear how a model trained on data with this
bias will perform on unfiltered data. This implies that there is a need to interpret the
model and assess whether it is reasonable or not. Another argument for interpretive
simplicity comes from the need to explain reasoning behind accept/reject decisions
to different entities within and outside the company. In some countries, it is required by law to be able to explain to a customer the specific reason or reasons he or she was rejected credit.
For these reasons, it is clear that the credit scoring industry faces constraints that make complicated classification algorithms and ensemble learning models unsuitable. Logistic regression models are, on the other hand, easy to understand and interpret. Their simplicity also makes them easy to implement and, if needed, modify post-training.
This comes with a caveat. Logistic regressions are linear models. It is therefore not as simple to produce complex decision surfaces as it is for, for example, an artificial neural network model. If the underlying data is not perfectly linear, simply training a logistic regression on the data cannot be expected to give the best possible prediction.
There is therefore a need to explore how credit scoring models can be improved,
while retaining simplicity, within the context of logistic regressions.
This paper will investigate if the performance of logistic regressions can be improved by introducing one or several logistic regressions in a second step to form a
two-stage model. The second step will in some way take into consideration information from the first model to create a full scoring model that performs better relative
to a simple logistic regression. A short list of different methods will be implemented
and compared to a single-stage model to find the best performing model.
1.3 Ethical Concerns
It has been argued that a weakness of statistical credit scoring models is their lack
of an explanatory model (Capon 1982). This makes the specific reasoning behind
why a customer was rejected or accepted opaque. From this it is arguable that
there is a certain arbitrariness to the whole system. On the other hand, automated
scoring models could remove prejudice that might influence human decisions.
Others have also argued that some customer characteristics might act as proxies
for characteristics that, if used for differentiating between customers, would be
illegal discrimination. Examples of this would be that differentiating on postal
codes could be a proxy for ethnicity while distinguishing between certain types of
incomes (e.g. full-time or part-time) could discriminate on gender (Capon 1982). In
general this is a problem of statistical models which by nature will tend to predict
the behavior of an individual from the behavior of the group he or she belongs to.
There is an additional problem of credit giving that is not isolated to the credit
scoring practice: false positives and false negatives. It is not realistic to assume that
a model will accurately predict all credit applications. A fraction of people that
should not receive credit will be accepted and some who should, will be rejected.
This is however a problem that can be alleviated with more accurate credit scoring
models and is arguably an aspect where credit scoring models perform better than
non-statistical underwriting models.
The results from this thesis are not likely to have much effect on the ethical
aspects of credit scoring except for the chance that it might increase the accuracy
of predictions. There is perhaps a need to further discuss the ethics of credit scoring
as the scope and prevalence increases, but that discussion deserves its own paper.
1.4 Delimitations
There is a large number of models that can be applied to the credit scoring problem. This thesis can be seen as an in-depth study and has therefore only covered the logistic regression. For all implemented models, both stages therefore consist of logistic regression models.
There have also been a few restrictions on the data. Klarna uses different models for different geographic regions. This paper has only tested the implementation on one of these regions. A reason for this is the lack of readily available data. Additionally, the data used is from a specific time period. This, together with the geographic restriction, needs to be taken into account when attempting to generalize the results from this paper.
The Related Works chapter mentions a number of machine learning algorithms.
It is outside the scope of this paper to cover them all in detail. For more information,
please consult a machine learning textbook, e.g. Hastie et al. (2009), or some other
introductory material.
1.5 Choice of Methodology
The method implementation can be divided into five separate tasks:
1. Acquire a data set and split it into training, validation and test sets.
2. Identify methods for constructing the two-stage models.
3. Train the models using the training set.
4. Evaluate results on the validation set to select a number of candidate models.
5. Evaluate results of candidate models on the test set.
The data set used in the implementation will be delivered by Klarna. To minimize the time spent on assembling and cleaning the data, a data set that was
previously used to train models at Klarna will be used. Previous research as well as
discussions with Klarna experts will form the basis for selecting methods for constructing the two-stage models. The number of actually implemented models will
depend on the time available.
The models will be trained using R, a free software programming language for statistical computing.² To make sure that the training is automated and reproducible, a package including all the necessary functions and code will be developed.
The aim of this package is to have the ability to input a data set and receive the
trained model along with relevant statistics and graphics.
² See http://www.r-project.org/ for more information.
Chapter 2
Theory Review
The theory review will start by giving an introduction to the concept and practice
of machine learning. It will focus on practical concepts like overfitting, sampling
bias, training and validation. The next section will define the logistic regression,
give an introduction to regularization and finally describe the estimation method
used in this paper. Lastly, there will be a section on evaluating credit models that
describes some of the common metrics.
2.1 Machine Learning
Machine learning is a subset of computer science that studies systems that can automatically learn from data (Domingos 2012). The field is closely linked to statistics,
optimization and artificial intelligence. The most widely used application of machine
learning is in classification (Domingos 2012).
The typical setup for a classifier is a vector of p attributes x = (x_1, x_2, \ldots, x_p) and a set of classes y = (y_1, y_2, \ldots, y_d). The x vector is known for n examples and
each example is assigned to one of the classes. The classification problem can then
be described as the problem of using historical data to identify to which class a new
data point, xi , belongs.
An example of this could be an algorithm that recognizes faces in images. A
data point, x_i, could then consist of the individual pixels in an image and the y-vector would be a vector with a binary response depending on whether a face is
actually in the image or not.
2.1.1 Sampling Bias
A problematic issue when training credit scoring models is sampling bias. Sampling
bias arises when the sample the model was trained on does not represent the population from which data points are drawn for prediction. Most credit scoring
models require some type of label. Labeled data is typically collected by giving
customers credit and waiting to see if they default or not. It will therefore only be
possible to train the model on transactions that were previously accepted (Thomas
2000). Greene (1998); Banasik, Crook, and Thomas (2003) find that sampling bias
may decrease the accuracy of credit scoring models.
This phenomenon can lead to some unexpected results. An example of this
could be a historical model with a highly predictive variable, e.g. the amount of
debt. That model would penalize transactions with high debt and only let high debt
transactions through when the other variables give very strong positive predictions.
If at some later time, a new model is trained on the data that was filtered by the
historical model, high-debt transactions are likely to have a much lower default rate than in the data used for the first model. This could make the debt variable have a
positive effect in the new model, something that is not likely to reflect the nature
of future, unfiltered, requests.
There is unfortunately no universal solution to this problem. A few methods have been suggested to alleviate it, for example reject inference (e.g. Crook and Banasik 2004) and virgin sampling, i.e. letting through a fraction of requests regardless of score to obtain a small but representative sample.
Having the ability to spot these problems is a contributing reason for prioritizing
interpretive simplicity in credit scoring models.
2.1.2 Overfitting
Overfitting refers to the problem where a fitted model describes the random error
in the sample rather than the underlying characteristics. An overfitted model is
not suitable for prediction as it fails to generalize over data other than from the
training set (Hawkins 2004). One cause for overfitting is using too complex models,
e.g. using a quadratic model to predict a linear trend. Two common ways to alleviate overfitting are cross-validation and methods that reduce complexity and induce parsimony, such as regularization and feature selection (Hastie et al. 2009).
Additionally, adhering to the principle of parsimony is also important from a
more practical perspective. Decreasing the number of variables means that fewer resources have to be spent on developing and maintaining variables and decreases
the risk of various errors in the database (Hawkins 2004).
2.2 Training, Validation & Testing
To make sure that the trained models are generalizable, it is important to test the
models on different data than what was used for the training. This is usually done
in steps with two separate goals (Hastie et al. 2009):
1. Validation (Model Selection) - Comparing the performance of different models
to find the best one.
2. Testing (Model Assessment) - Given that the best fitting model has been
found, estimating its generalization error on new data.
The data is therefore typically divided into three sets, a training set, a validation
set and a test set. The training set is used to estimate the models, the validation set
for model selection and lastly the test set to evaluate the generalizability (Hastie et al. 2009). It is important that the training, validation and testing phases use different data to prevent an underestimation of the true error. A typical split could be 1/2 for training, 1/4 for validation and the last 1/4 for testing (Hastie et al. 2009).
2.2.1 Validation
A common problem in credit scoring is that, while the total number of data points
might be sufficiently large, only a small fraction of those represent customers who
defaulted or were designated as "bad" in some other way. A method used to alleviate this problem is cross-validation, one of the most common implementations
being K-fold cross-validation. Cross-validation involves separating the training set
into K folds. For each of the K folds, a model is trained on the other folds and
validated with the k-th fold. A typical value of K could be 10. The prediction error
for the k-th fold is defined as
E_k(\hat{f}) = \frac{1}{N} \sum_{i=1}^{N} L\left(y_i, \hat{f}^{-k(i)}(x_i)\right) \qquad (2.1)

where \hat{f}^{-k(i)} refers to the model trained on the sample that is not included in the k-th fold and L is a function that measures the error depending on the estimate
and the true value for observation i (Hastie et al. 2009).
The cross-validation estimate of the prediction error is then (Hastie et al. 2009)
CV(\hat{f}) = \frac{1}{K} \sum_{k=1}^{K} E_k(\hat{f}) \qquad (2.2)
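A minimal R sketch of equations (2.1) and (2.2) is shown below; fit_model and predict_model are hypothetical placeholders for whatever training and prediction routines are being cross-validated.

    # K-fold cross-validation sketch. `d` is a data frame with a 0/1 response
    # column `y`; `fit_model` and `predict_model` are placeholders supplied by
    # the user; `loss` corresponds to L in equation (2.1).
    kfold_cv <- function(d, K = 10, loss = function(y, p) (y - p)^2) {
      folds <- sample(rep(1:K, length.out = nrow(d)))   # random fold assignment
      fold_errors <- sapply(1:K, function(k) {
        fit <- fit_model(d[folds != k, ])               # train on the other folds
        p   <- predict_model(fit, d[folds == k, ])      # predict the held-out fold
        mean(loss(d$y[folds == k], p))                  # E_k, equation (2.1)
      })
      mean(fold_errors)                                 # CV estimate, equation (2.2)
    }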
2.2.2 Testing
It is in most cases not possible to create a model that can perfectly separate the
classes. This implies that there will, for every score threshold, be some fraction of
data points that become misclassified. This also implies a need to compare different
models in order to pick the model that has the best performance. It is not entirely
clear how to define the performance of a scorecard. A guiding principle is to measure
how well the scoring model can distinguish between goods and bads. This section
will explain the measures used in this paper.
ROC-curve, AUC and Gini
A commonly used method of evaluation is the AUC-measure and a linear transformation of the AUC, the Gini-coefficient (Hand and Anagnostopoulos 2013; Krzanowski
and Hand 2009). The AUC (Area Under the Curve) is derived from the ROC (Receiver Operating Characteristic) curve which depicts the true positive rate against
the false positive rate depending on some threshold (Davis and Goadrich 2006). The
true positive rate (TPR) is defined as the ratio of true positives over all positives
(T P/P ) and the false positive rate (FPR) as the ratio of false positives over all
negatives (F P/N ). An example could be a data set of 1000 observations with a
bad-rate of 10%, i.e. P = 100 and N = 900.
In the credit scoring problem, a bad is defined as a positive response and vice versa. If we, with a certain model, manage to correctly classify 80 bads (TP = 80) and mistakenly classify 100 goods as bads (FP = 100), we get a true positive rate of TPR = 0.8 and a false positive rate of FPR ≈ 0.11.
The threshold can in credit scoring be a score cutoff and the ROC curve is then
created by calculating the TPR and FPR for a sufficient amount of score-cutoffs. If
the cutoff is set so that no applications are accepted, then TPR = 1 and FPR = 1.
If instead all applications are accepted then TPR = 0 and FPR = 0. The area
under the ROC curve can be interpreted as the probability of a randomly selected
positive data point being ranked higher than a randomly selected negative data
point. The Gini-coefficient is a linear transformation of the AUC defined as
\text{Gini} = 2 \cdot \text{AUC} - 1 \qquad (2.3)

The Gini can take on values in the range [-1, 1] and a value of 0 represents the
random baseline. A higher AUC or Gini implies an overall higher discriminatory
power of the model. Figure 2.1 shows some schematic ROC curves along with the
resulting Gini.
Figure 2.1. Schematic plot of the relationship between the ROC-curve (true positive rate against false positive rate) and gini for different classifiers (gini = 0, 0.75, 0.90 and 0.99).
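As a sketch of how the AUC and Gini can be computed in practice, assuming the pROC package is available and that the vectors scores and labels (1 = bad) are illustrative names rather than ones from the thesis code:

    # Compute AUC and Gini for a scored sample.
    library(pROC)
    roc_obj <- roc(response = labels, predictor = scores)
    auc_val <- as.numeric(auc(roc_obj))
    gini    <- 2 * auc_val - 1   # equation (2.3)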
Kolmogorov-Smirnov statistic
Another evaluation measure is the Kolmogorov-Smirnov (KS) statistic. It measures
the distance between the distributions of the goods and the bads (Thomas, Edelman,
and Crook 2002). Let P_G(s) and P_B(s) be the cumulative distribution functions of goods and bads. The KS statistic is then defined as

KS = \max_s |P_G(s) - P_B(s)| \qquad (2.4)
When calculating the KS statistic, P_G(s) and P_B(s) can be plotted. Figure 2.2 shows a schematic plot of the KS statistic for two fictive cumulative distributions of goods and bads.
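A minimal sketch of equation (2.4) in R, assuming illustrative vectors scores and labels (1 = bad):

    # KS statistic: maximum distance between the empirical CDFs of goods and bads.
    ks_statistic <- function(scores, labels) {
      grid <- sort(unique(scores))
      P_G  <- ecdf(scores[labels == 0])   # cumulative distribution of goods
      P_B  <- ecdf(scores[labels == 1])   # cumulative distribution of bads
      max(abs(P_G(grid) - P_B(grid)))
    }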
Figure 2.2. Schematic plot of P_B(s), P_G(s) and the KS statistic for two cumulative distributions of goods and bads (KS statistic = 0.53).
Mays (1995) gives a general guide to interpreting the KS statistic that can be
seen in table 2.1. This table should not be taken too literally.
Table 2.1. Guideline of the quality of a scorecard based on the KS statistic (Mays
1995). Values have been transformed from percentages to decimals.
  KS statistic    Evaluation
  < 0.20          Scorecard probably not worth using
  0.20 - 0.40     Fair
  0.41 - 0.50     Good
  0.51 - 0.60     Very good
  0.61 - 0.75     Awesome
  > 0.75          Probably too good to be true
Brier score
A third measure is the Brier score, originally proposed in a 1950 paper on weather
forecasting (Brier 1950). The Brier score is the mean squared error of the probability
estimates produced by the model. Let y_i be the outcome for a certain observation, taking the value 0 for a good and 1 for a bad. Let also f_i be the estimated probability of y_i. The Brier score is then defined as
BS = \frac{1}{N} \sum_{i=1}^{N} (f_i - y_i)^2 \qquad (2.5)
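Equation (2.5) is straightforward to compute; a minimal sketch, assuming predicted probabilities f and 0/1 outcomes y:

    # Brier score: mean squared error of the probability estimates.
    brier_score <- function(f, y) mean((f - y)^2)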
Score-to-log-odds plot
The last measure that will be used is the Score-to-log-odds. When plotted, it shows
how well actual outcomes in different score segments correspond to the outcomes
estimated by the model. If the actual log-odds for a segment is higher than what is
expected from the model, the risk is under-estimated in that segment. The score-to-log-odds plot therefore shows how the model performs over the score band.
A benefit of using the measures described above is that they do not take into account the specific threshold when evaluating the model (Hand and Anagnostopoulos 2013). The drawback is that when specific thresholds have been decided, it is
more interesting to measure the performance of the model around these thresholds
(Thomas, Edelman, and Crook 2002).
2.3 Logistic Regressions
Logistic regression emerged in the late 1960s and early 1970s as an alternative to
OLS regressions and linear discriminant analysis (Peng, Lee, and Ingersoll 2002).
In the normal case, the logistic regression is used to predict the response of a binary
variable. There is also a generalization called multinomial logistic regressions that
can be used to predict multi-class problems. The logistic regression is widely used
in empirical applications and has emerged as the industry standard credit scoring
model (Bellotti and Crook 2009).
Lessmann, Seow, and Baesens (2013) run a benchmark of a wide array of classification algorithms on a credit scoring data set and find that on average, the logistic regression performs well and even outperforms many state-of-the-art machine learning algorithms. They also find that models like random forests and neural networks can give better predictions than logistic regressions.
One reason that logistic regressions are still so widely used is that it is relatively easy to perform reality checks on them. It is, for example, possible to see, just by looking at the sign of a variable's coefficient, whether the results intuitively make sense. Such checks can easily find problems such as the example of the debt variable described in section 2.1.1.
2.3.1 Basics
The logistic regression model comes from the need to construct a linearly additive
model that can predict a probability, i.e. a number between 0 and 1. This was solved
by a neat trick using the concept of the logit. Let Y_i be a binary-valued outcome variable with an associated probability p_i that is related to a number of explanatory variables X. For simplicity, let the explanatory variables be normalized, as this makes the specification of the regularization simpler. The logit is then the logarithm of the odds, \log \frac{p}{1-p}. This means that the probability can be transformed to a number that ranges from -\infty to \infty, which can be used in a linear model.
With m explanatory variables, the probability can be modeled using a linear
prediction function, f, that for a particular data point i takes the form:
f(x_i) = \text{logit}(p_i) = \log \frac{p_i}{1 - p_i} = \beta_0 + \beta x_i = \beta_0 + \beta_1 x_{1,i} + \ldots + \beta_m x_{m,i} \qquad (2.6)
For a trained model, with estimates \hat{\beta}_0 and \hat{\beta}, we can calculate the estimated logit and probability, \hat{p}_i, using

\text{logit}(\hat{p}_i) = \log \frac{\hat{p}_i}{1 - \hat{p}_i} = \hat{\beta}_0 + \hat{\beta} x_i \quad \text{and} \quad E[Y|X_i] = \hat{p}_i = \frac{1}{1 + \exp\left(-(\hat{\beta}_0 + \hat{\beta} x_i)\right)} \qquad (2.7)
2.3.2 Regularized Logistic Regressions
Two widely recognized problems of logistic regressions are overfitting and performing
feature selection. Feature selection refers to the problem of selecting the correct
variables (features) to be included in the model. A common method to alleviate
those problems is regularization. The idea behind regularization is to put a penalty
on the sum of the regression coefficients which can be done in a number of ways. A
selection of common regularization methods are:
• Lasso - a type of \ell_1-regularization.
• Ridge regression - a type of \ell_2-regularization.
• Elastic net regression - a linear combination of the lasso and ridge regression.
The Lasso (least absolute shrinkage and selection operator) estimator introduces an \ell_1-restriction of the form \sum_m |\beta_m| \leq t for a constant t. This restriction has the effect that it tends to decrease the size of the coefficients and, for sufficiently small values of t, also sets some coefficients to zero (Tibshirani 1996). It has been shown
that this implies that under some conditions, the lasso can be used as a method
of automated feature selection while still maintaining an efficient estimation (Zhao
and Yu 2006). It however has some non-trivial problems. For example, it can only
select a maximum of n variables, where n is the number of observations. Also, in
case of groups of highly correlated variables, it tends to somewhat arbitrarily choose
only one of those variables to include into the regression (Zou and Hastie 2005).
The ridge regression method, also known as Tikhonov regularization, instead uses an \ell_2-restriction of the form \sum_m \beta_m^2 \leq c for a constant c (Hoerl and Kennard
1970). The ridge regression reduces the variance of coefficient estimates at the
price of a bias. Combined, these effects can decrease the mean square error of the
regression and lead to a better prediction (Hoerl and Kennard 1970). It has also
been shown that for situations with n > m and highly correlated predictors, the ridge regression performs better than the lasso method (Tibshirani 1996). It
does however lack the ability to produce a sparse model as it keeps all the predictors
in the model (Zou and Hastie 2005).
Zou and Hastie (2005) suggested a combination of the lasso and the ridge regression called the elastic net. It has the benefit of both automated feature selection
and a lower variance of the prediction. Additionally it can also select groups of
variables if one of the variables in the group is found significant. Using the method
of Lagrange multipliers, the lasso and ridge constraints can be rewritten as penalties
in an optimization problem. Combined together, the elastic net penalty function
has the form

P(\beta, \lambda_t, \lambda_c) = \lambda_t \sum_m |\beta_m| + \lambda_c \sum_m \beta_m^2 \qquad (2.8)

where the \lambda parameters are called shrinkage parameters. When applied naively, this specification causes an excessive shrinkage due to shrinkage effects from both the lasso and the ridge component. It can be shown that this can be mitigated by multiplying the estimates from the naive specification with a scaling factor, (1 + \lambda_c) (Zou and Hastie 2005).
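A sketch of the naive penalty in equation (2.8), with illustrative argument names:

    # Naive elastic net penalty: lasso term plus ridge term.
    elastic_net_penalty <- function(beta, lambda_t, lambda_c) {
      lambda_t * sum(abs(beta)) + lambda_c * sum(beta^2)
    }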
2.3.3 Estimation
The logistic regression is usually estimated using a maximum likelihood estimation
(MLE) approach (Peng, Lee, and Ingersoll 2002). The MLE approach is a widely
used estimation method that, given a set of n data points x_1, x_2, \ldots, x_n, selects the model that is most likely to have generated said data. If we define a specific model as \theta, and the objective likelihood function as L, the maximum likelihood approach
selects a model that satisfies
\hat{\theta} = \arg\max_{\theta \in \Theta} L(\theta; x_1, x_2, \ldots, x_n) \qquad (2.9)
For both logistic and regularized logistic regressions, there are no closed-form
solutions to the MLE problem which implies that an iterative and numerical process
needs to be used.
Chapter 3
Related Works
The classification approach to consumer credit scoring has been thoroughly researched in academia. Lessmann, Seow, and Baesens (2013) perform a meta-study
on recently published papers on credit scoring. They find that, with regard to prediction, ANNs are very strong when applied individually. Ensemble classifiers can outperform individual classifiers, with random forests giving the most accurate predictions. Another important finding is that complicated state-of-the-art algorithms
do not necessarily outperform simpler versions. An example of this is random
forests (see Breiman (2001)) outperforming the more advanced rotation forests (see
Rodriguez, Kuncheva, and Alonso (2006)). An ensemble selection approach combining bootstrap sampling (see section 3.2) and a greedy hill-climbing algorithm (a
type of optimization method) for base model selection gave the overall best results.
They also find, by comparing 45 credit scoring papers from the period 2003 to 2013,
that the mean and median number of observations in data sets are 6167 and 959
respectively (Lessmann, Seow, and Baesens 2013). In machine learning contexts, these can be considered relatively small data sets.
There is an extensive amount of academic papers attempting to increase the
performance by training more than one model. These can somewhat coarsely be
put into two categories: segmentation and ensemble models. Segmentation models
attempt to divide the population into sub-populations and develop scorecards for
each sub-population (Thomas, Edelman, and Crook 2002). Ensemble learning on
the other hand, aims to build a model by combining the results from a collection of
(simple) models (Hastie et al. 2009).
3.1 Segmentation Models
Segmentation models can be constructed on either an experience-based (heuristic)
or a statistical basis (Siddiqi 2005). The heuristic strategy uses some characteristics,
such as age or requested credit, to segment the population. The statistical basis uses
statistical methods or machine learning models such as cluster analysis to segment
the population. Each credit application is therefore first segmented by some means
and then scored by a single scorecard trained specifically for that segment. This
simplifies performance monitoring and model retraining.
Segmentation has proven to be successful in some cases. So et al. (2014) build a
classification model that segments credit card customers between transactors (those
who pay off their balance every month and are thus by definition good) and revolvers (those who sometimes only pay part of their monthly balance and incur interest charges). They find that this segmentation gives a more accurate profitability estimate than a single scorecard. Banasik, Crook, and Thomas (2001) build a
two-stage model to distinguish between high-usage and low-usage customers. They
find that, when taking into account that the usage of credit depends on the amount of credit actually awarded to a customer, the two-stage model gives a better
prediction accuracy. Hand, Sohn, and Kim (2005) implement a bipartite model by
splitting on variable characteristics, training two separate scorecards and selecting
the split that maximizes the combined likelihood. They show that this procedure
can increase the performance compared to a single logistic regression.
Other segmentation methods have not been as successful. Bijak and Thomas (2012) use a statistical machine learning approach. They distinguish between a two-step and a simultaneous method. In the two-step method, the segmentation and the scoring models are built independently. In the simultaneous approach, the segmentation and scoring models are optimized simultaneously. The first step implements statistically based segmentation methods, CART models, CHAID trees and LOTUS models, to separate the data set into several groups, and the second step builds scorecards for each group. They find that neither the two-step nor the simultaneous segmentation methods significantly increase the prediction power.
3.2 Ensemble Models
Ensemble learning classifiers come in many different flavors. Bootstrap aggregating
(bagging) involves drawing a number of random samples with replacement (Bootstrap sampling) of the training data and training a classifier on each sample. Boosting creates an ensemble by iteratively creating new models and increasing weights
for misclassified data points in each step. The final classification model is then a
weighting of all iterations. A widely used boosting algorithm is the AdaBoost algorithm (Marsland 2011). Stacked generalization (Stacking) is a method whereby a
number of classifiers are trained and then aggregated using a second step to combine
the scores (Wolpert 1992).
Wang et al. (2011) compare bagging, boosting and stacking on three real world
credit data sets. As base learners, they use logistic regressions, decision trees,
artificial neural networks and support vector machines. They find in general, that
the ensemble methods increase accuracy on all types of base learners. While bagging
decision trees showed the best performance improvement, results seem to somewhat
differ between data sets. Marqués, García, and Sánchez (2012) examine composite
ensemble learning using random subspace, rotation forest and convolutional neural
networks to construct two-level classifier ensembles. They run tests on six different
data sets and find that two-level ensembles can increase prediction accuracy quite
significantly. Yao (2009) similarly uses CART, bagging and boosting methods and
also finds performance increases.
3.3 Two-stage Models
The category "Two-stage models" is not clearly defined in literature but in this thesis refers to a model that is constructed, using similar methods, in two stages, where
the second stage uses some kind of information from the first stage. This definition somewhat overlaps with the definition of segmentation models and ensemble
learning models but the exact distinctions are not so important.
Finlay (2011) performs a broad study of multi-classifier systems on a number
of baseline models on two data sets of approximately 100,000 observations each.
He finds some evidence that CART and a combination of neural networks and
logistic regressions show potential for increasing performance. The highest increase
in performance comes from the ET boost, significantly outperforming the more
commonly used AdaBoost algorithm. These methods are not two-stage models in
the strict sense but have some similar characteristics.
He also creates two-stage models where for example the result from a baseline
logistic regression was used to segment the population. The finding was that this
type of segmentation performed poorly for all baseline models. His interpretation
of these results is that the data is only very weakly non-linear but that it could also
be an effect of overfitting. For some segments, the number of bads was as low as under 1,000 cases. A problem with the paper is that, due to the large number of
evaluated models, the exact methodology is not clearly explained.
Similar attempts to base segmentation on the score from a first-stage model were first made by Myers and Forgy (1963). Using discriminant analysis, they train second-stage models with different score cutoffs from the first stage. They find no positive effects, but the results are limited by the small sample of 300 observations.
Chapter 4
Method
Chapter 4 starts with a synthesis of the literature review in the form of a number
of stylized facts. This leads up to a presentation of the methods investigated in this
thesis. Next, some practical details of the implementation are discussed, followed lastly by a presentation of the data.
4.1 Synthesis
The findings in earlier chapters can be summarized in a few stylized facts.
1. There is a wide selection of methods that seem to increase the predictive power compared to the industry standard logistic regression.
From the discussion in earlier chapters it seems that there is no single way of increasing the performance of credit scoring models. Both segmentation methods and ensemble methods seem in many cases to be effective. As many others have echoed (e.g. Thomas (2000)), there does not yet seem to exist a "holy grail" for the credit scoring problem. The optimal model for any implementation will likely depend on the characteristics of that specific instance of the problem and the needs and capabilities of the organization. Additionally, the relationship between model performance and model complexity is not always positive, and simpler models sometimes outperform more complex derivatives.
2. Many empirical evaluations use data with small sample sizes.
Introducing new data is problematic in the sense that it complicates comparison
with other studies. Data sets will have different size, different variables and be
drawn from populations with dissimilar characteristics. In such a comparison, it
is not clear what part of the difference comes from the data and what part comes
from the implementation. This is especially a problem when data is proprietary and
not available for replication or further studies. A clear and exhaustive description
of the methodology is therefore important to ensure reproducibility so that the
methodology can be tested on other data.
With this in mind, there is still a need for using proprietary data if the size or
quality of the data exceeds that of the publicly available data sets. This is especially
true when implementing models such as segmentation models, that can significantly
benefit from larger samples.
3. Broad comparative studies are hard to interpret.
Some recent studies such as Finlay (2011) and Lessmann, Seow, and Baesens
(2013) have identified that there is a need for broad studies that compare a large
number of classifiers using similar data and methodology. This approach has a
number of problems. When the number of classifier models grows large enough,
it will not be possible to explain the practical implementation with enough detail.
Data preprocessing (e.g. data cleaning, feature selection) and model specifics (e.g.
regularization, estimation) have a big influence on results and specific methods
employed might be better suited for certain classifiers. There might also be a bias in
terms of the researchers’ experience with different classifiers causing some classifiers
to under-perform. The non-exhaustive descriptions of methodologies make this
problem difficult to tackle.
Conclusions
The conclusion of these stylized facts is that there is a need to look broadly when
investigating classifier models so that credit scorers can find the model that suits
their specific capabilities and problems. Additionally, with new and better data,
there is a need of re-evaluating models that earlier had not performed as well as
expected. Finally, broad comparative studies need to be complemented by narrower
studies that in greater detail cover the implementational details of a model. These
studies, while interesting on their own, will also be a help for researchers attempting
comparative studies.
4.2 Methodology
Given the conclusions above and the imperative of retaining simplicity, there is a motivation for investigating, using Klarna's high quality data, the potential low-hanging fruit that can increase the performance of the industry standard scoring model. While there have been some attempts at constructing simple two-stage logistic regression models, there has yet to be an exhaustive study with a clearly explained methodology. Another important addition of this study is the use of regularization methods, which allows a separate and automated feature selection for both stages.
In discussions and inspired by previous work, a number of methods to construct
the two-stage model have been identified. The basic idea is to train a logistic
regression on all observations in the training set. Information from the first stage will then be included in, or used to segment the data for, the second-stage model. The
results from both models will then be combined in some way to form a conventional
logistic regression for all observations.
4.2.1 Proposed Model Classes
Model class 1: Cutoff on predicted score from first-stage regression.
Defaulting customers can naively be put into three categories. The first category
is those that do not have the ability to pay. The second category is those that for
some reason have no intention to pay and the third is fraud. It is likely that the
characteristics predicting these categories are quite different.
A hypothesis is that the prevalence of the first category relative to the second
category is larger at lower scores and vice versa. An option is then to train a
model on all data and in a second stage train separate models on low- and high-scoring customers. An alternative would be to assume that the second category
might not easily be predicted by variables used for credit scoring. Defaults in the
second category would add noise when attempting to predict the first category. The
alternative suggestion is therefore to make a cutoff on the score from the first stage
and only train the second stage on the low-scoring customers, letting the remaining customers be scored by the model trained on the full data. While it would be interesting to also train
models on the subsample above the cutoff, the lack of bads in the high-scoring
segment makes this difficult.
Model class 2: Segmenting on observations close to the decision boundary.
After a credit scoring model has been trained, some kind of decision model decides
whether to accept or reject a new application. A type of naive decision model is a
constant risk model. It defines a constant risk which is the maximum expected loss
that can be accepted.
\text{PD} \cdot \text{credit} = \text{constant} \qquad (4.1)

The line formed by plotting the maximum allowed credit for each score is referred to as the decision boundary. Given such a model, the performance is highly dependent on the accuracy of the scoring model around the decision boundary. Given a hypothesized non-linearity in the data, the first stage can be used to identify observations in proximity to the decision boundary, and the second stage will be trained on that subset, which might give a better accuracy in the region. The condition to include an observation in the second stage is then

\left| \frac{\text{credit}}{\text{constant}/\text{PD}} - 1 \right| < x \qquad (4.2)

where the constant has been calculated from the constant risk decision model in equation 4.1 and x is the percentage window boundary.
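A sketch of this segmentation rule, assuming illustrative vectors pd (first-stage predicted PDs) and credit, a calibrated constant from equation (4.1), a window x and a data frame training_data; none of these names are from the original code base.

    # Keep only observations whose requested credit lies within x of the
    # decision boundary implied by the constant risk model (equation 4.2).
    near_boundary     <- abs(credit / (constant / pd) - 1) < x
    second_stage_data <- training_data[near_boundary, ]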
Model class 3: Cutoffs on predicted score from alternative labels.
As the payment status of a transaction evolves over time, there is not a clear definition of what point of time should be used to classify a transaction. Using the
payment status at the credit due date might make us mistakenly classify cases where
the customer intended to, but forgot to, pay in time or where the customer was
temporarily unable to pay, as bad. The industry standard for bank cards is to
use 90 days after credit due date (Mays 1995). There might however be valuable
information in the payment status at a shorter time after credit due date. Perhaps
people that are likely to forget to pay on time have other characteristics than those
that never fail to pay on time. This model will therefore investigate the effect of
training on earlier payment statuses, 5 and 30 days after credit due date.
These labels will in the rest of the paper be referred to as dpoint5, dpoint30
and dpoint90 with dpoint90 being the final definition of goods and bads. Using the
idea from above, this model class will first train first-stage models using the two
alternative default labels dpoint5 and dpoint30. The scores from these first-stage
regressions will then be used to segment the population similarly to model class 1 so
that a new dpoint90 regression can be trained for that subsample. The remaining
sample will be scored by a simple logistic regression on the full sample. As the
alternative labels have a higher amount of bads compared to dpoint90, it is possible
to also train models on the subsamples above the cutoffs.
Model class 4: Predicted score from alternative labels as variables.
An alternative to the segmentation models proposed above is to train scoring
models on the alternative default labels and use the predicted scores as variables
in a final scoring model on the dpoint90 label. For the previous model classes, the
first stage aims to segment out a part of the sample to increase the accuracy on
that subsample. For this model however, both the first and the second stages are
trained on the full sample which means that the full amount of bads can be used
for training. The scores from the first stage will naturally have a high correlation
with the remaining variables in the final scoring model. While correlation between
explanatory variables is usually a problem for logistic regressions, regularization
should help account for this.
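A sketch of this construction, assuming fitted first-stage cv.glmnet objects fit5 and fit30, a plain numeric training matrix x_train and the dpoint90 labels y90 (all names illustrative):

    # Append first-stage scores on the alternative labels as extra features
    # for the final dpoint90 model.
    library(glmnet)
    s5  <- as.numeric(predict(fit5,  newx = x_train, s = "lambda.min", type = "link"))
    s30 <- as.numeric(predict(fit30, newx = x_train, s = "lambda.min", type = "link"))
    x_aug     <- cbind(x_train, dpoint5_score = s5, dpoint30_score = s30)
    fit_final <- cv.glmnet(x_aug, y90, family = "binomial", type.measure = "auc")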
4.3 Practical Details
For estimating logistic regressions in this thesis, an R package ‘glmnet’ has been
used.¹ The glmnet package is a highly efficient implementation of the elastic net regression for generalized linear models using cyclical coordinate descent. It includes a
function for estimating logistic regressions (Friedman, Hastie, and Tibshirani 2010).
The regularization penalty in the glmnet package is defined as
\lambda \left[ \frac{1 - \alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right] \qquad (4.3)

where \alpha is a mixing parameter so that 0 \leq \alpha \leq 1. The package also includes functionality for k-fold cross-validation to choose an optimal value of the shrinkage parameter, \lambda. When the mixing parameter is set to 1 - \epsilon for some small \epsilon > 0, the elastic net retains much of the characteristics of the lasso but with some increased numerical stability (Hastie and Qian 2014).
The glmnet estimation procedures need, apart from the variable coefficients, a shrinkage parameter, \lambda, and a mixing parameter, \alpha. The package includes a cross-validation function for finding the optimal \lambda for a given \alpha. The optimal \alpha can also be found by nesting a two-stage cross-validation method (Hastie and Qian 2014). Because of the considerable time it takes to train an individual model, it is not reasonable to perform a two-stage cross-validation for each model. A compromise is to find an optimal \alpha once for the full training set and then only use the included cross-validation technique for \lambda for each subsequently trained model.
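A sketch of this setup using the real glmnet interface, with illustrative data names (x_train, y_train, x_new) and an illustrative fixed mixing parameter; the actual value used in the thesis is chosen by cross-validation.

    # Fit a regularized logistic regression and score new observations.
    library(glmnet)
    cvfit <- cv.glmnet(x_train, y_train, family = "binomial",
                       alpha = 0.95,          # mixing parameter, close to the lasso
                       type.measure = "auc",  # select lambda by AUC
                       nfolds = 10)
    scores <- predict(cvfit, newx = x_new, s = "lambda.min", type = "link")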
¹ More information about the glmnet package can be found on http://cran.r-project.org/web/packages/glmnet/index.html.
The glmnet package has a number of statistics that can be used as measures of fit for the model. This paper will use the AUC measure, since the Gini derived from it is one of the main statistics we use to evaluate models.
It is important to use separate sets for training, validation and testing to avoid underestimating the true error. The testing set was therefore an out-of-time set consisting of approximately 20 % of the total sample. The remaining part was split on a customer identifier so that 25 % of the remaining customers (approximately 20 % of the whole sample) end up in the validation set and 75 % (approximately 60 % of the whole sample) in the training set.
A practical concern is how to compare results between first- and second-stage models. It is important that the underlying data is the same to make comparisons
useful. A solution would be to evaluate both stages on the subset of observations
that are to be included in the second stage. This makes different second-stage
models difficult to compare as the subset changes. The alternative approach used
in this paper is to first predict scores using the first-stage model, and then replace
the scores of the observations that should be included in the second-stage model
with the score from the second-stage model.
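A minimal sketch of this combination, assuming two cross-validated glmnet fits and a logical vector in_segment marking the observations that belong to the second-stage segment (all names are illustrative):

```r
# First-stage predictions for all observations in the evaluation sample
combined_score <- as.numeric(predict(first_stage_fit, newx = x_eval,
                                     s = "lambda.min", type = "response"))

# Second-stage predictions only for the segmented observations
second_score <- as.numeric(predict(second_stage_fit,
                                   newx = x_eval[in_segment, , drop = FALSE],
                                   s = "lambda.min", type = "response"))

# Replace the first-stage scores on the segment; all models are then
# evaluated on the same full set of observations
combined_score[in_segment] <- second_score
```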
4.4 Data
The data used in this thesis consists of Dutch invoices that were initiated during the
period 2013-06-01 to 2014-03-31. The data set contains approximately 350,000 observations.
Each observation is a previously accepted purchase through Klarna, and the data set can
contain several observations for one individual. There are 136 variables that can be used
for modelling purposes, some being internal data and others being data acquired from
external credit bureaus. Note that not all variables are populated for every observation,
so some fields are missing.
Some variables are binary while others are numeric or categorical. The non-binary
variables were put into bins or categories and transformed into binary variables by
turning each bin into a binary feature. Binning transforms the contribution of a variable
from linear to piece-wise linear, which makes the model more flexible (Hastie et al. 2009).
When applicable, cumulative binning was used: for a variable such as income, an
observation in the €100,000–€200,000 bin would also be flagged in the €50,000–€100,000
bin (a sketch of this encoding is shown below). After binning and transforming the
variables, the final data set contains 770 binary variables that form the final
feature vector x.
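A minimal sketch of cumulative binning for a numeric variable; the income bin edges are illustrative and not the actual bins used on the data set.

```r
cumulative_bins <- function(x, edges) {
  # one 0/1 indicator per bin edge: a value in a high bin also sets the
  # indicators of all lower bins, giving the piece-wise linear contribution
  sapply(edges, function(e) as.integer(x >= e))
}

income <- c(30e3, 75e3, 150e3, 250e3)
cumulative_bins(income, edges = c(50e3, 100e3, 200e3))
#      [,1] [,2] [,3]
# [1,]    0    0    0
# [2,]    1    0    0
# [3,]    1    1    0
# [4,]    1    1    1
```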
The data was divided into three sets: a training set (60 %), a validation set (20 %)
and a test set (20 %). Table 4.1 shows the number of observations, the number of
bads and the bad-rate for each set. The training and validation sets, as expected,
show similar characteristics. For the out-of-time test set the bad-rate has decreased,
implying that there is some change in the population over time.
The original data set only contained labels for payment status 90 days after
credit due date. The payment statuses at earlier times were thus extracted and
Table 4.1. Distribution of goods and bads for payment status at 90 days after credit
due date.

                      Good     Bad   Bad-rate
  Training set      208222    8842      0.041
  Validation set     68756    3014      0.042
  Test set           67860    2503      0.036
matched separately afterwards. Unfortunately, matches could not be found for
around 20 % of the observations. On the other hand, the number of bads when looking
at the payment status 5 days after credit due date is larger than for the 90-day label
by a factor of 10. For the 30 days after due date label, the increase is approximately a
factor of 3. Given that bads are so scarce for the dpoint90 label, this is a major
increase.
It is important to note that the definition of a bad in this data set is very
restrictive. Klarna has sorted out a large number of transactions that were deemed
indeterminate. The reasoning for this comes from earlier experience, where a
stricter definition of bads increased the prediction accuracy. Additionally, some
non-bads have been filtered out for other reasons. The bad-rate should therefore
not be taken as a measure of Klarna's actual loss rate.
Chapter 5
Results
Chapter 5 starts by describing the results from implementing the cross-validation
technique to find the mixing parameter. Next, the performance of the first-stage
model is described. The chapter continues by presenting results from training the
models proposed in section 4.2.1 and evaluating them using the validation
sample. Lastly, the chapter describes the results from evaluating a number of candidate
models using the test sample. This chapter refers to a large number of graphs
and tables; graphs and tables with the prefix A can be found in the appendix.
5.1 Choosing the Mixing Parameter, α
A two-stage cross-validation method was first set up to find the optimal α on the
full training set. Table A.1 shows the result of testing a sequence of values for the
mixing parameter, α. Apart from the minimum value, obtained for α = 0, the
differences in AUC between the remaining α values are on the order of the fifth decimal.
Without any confidence intervals it is difficult to say whether any of the AUC values is
statistically significantly higher than the others.
Given that the default value in the glmnet package is α = 1, i.e. a lasso, and
using the idea that α = 1 − ε for some small positive ε works like the lasso but with
increased numerical stability, it was decided to use α = 0.999 (Hastie and Qian
2014). This choice of α implies that the major part of the regularization comes
from the lasso penalty. This value of α has been used for all subsequently
trained models.
5.2 First-stage Model
The first stage was trained on the training set using the binary label dpoint90, i.e.
payment status 90 days after due date, as response variable. Figure 5.1 shows a
number of plots based on the predictions on the validation sample. From the distributions
of goods and bads, it is clear that there is an overlap over the larger part of the
score band. The score-to-log-odds plot shows that the predicted score lies reasonably
close to the actual line, but that there is an underestimation of the log-odds at
low scores and an overestimation at high scores. The KS of 0.56 and the reasonably
high Gini suggest that the first-stage model on its own performs relatively well.
[Figure 5.1 panels: distribution of goods and bads over the score band; actual and predicted log-odds; Kolmogorov-Smirnov plot (KS statistic = 0.560); ROC curve (Gini = 0.730).]
Figure 5.1. Combination of plots when evaluating the first-stage model on the
validation sample.
5.3 Finding Candidate Models Using the Validation Sample
This section describes the results from the full set of trained models. The metrics
used to evaluate the models turned out to be quite highly correlated: a model
with a high Gini in general also had a high KS and a low Brier score. Therefore, to
keep the presentation readable, this section only contains figures showing how the Gini
of the different models compares with that of the first-stage model. The full result
tables can be found in the appendix, tables A.2 to A.5.
Model class 1
Table A.2 and figure 5.2 show the results from retraining on the subset of observations
with predicted score less than different score cutoffs. It is easily noticeable that the
results are not very sensitive to this method: the differences in any of the metrics are
at most of a magnitude of around 0.1 percentage points. A partial explanation for
the edge cases might be that for the very low scorecutoffs almost no observations
are included in the second stage, while for the very high scorecutoffs almost all
observations are. As the deviations are
so small, it is difficult to say if any variation is statistically significant. However,
the 550, 630 and 670 models, being the three models with the highest Gini, were
selected for further analysis.
Figure 5.2. Gini of different models from Model class 1 compared to the first-stage
model (dotted line) when evaluated on the validation sample.
Model class 2
The decision boundary was defined using the constant risk concept described
in section 4.2.1. The constant maximum allowed expected risk was found using
numerical optimization so that the acceptance rate on the full training sample would
be 95 %. This decision boundary was then used for all subsequent training and
analysis. The rule for including observations in the second-stage subset was being
inside an x % window around the boundary. Table A.3 and figure 5.3 show the results
for a number of different window percentages on the validation sample. Apart from
the 90 % window, which will be evaluated on the test sample, none of these models
seem to give better results on any metric, and even for the 90 % model the difference
is at best marginal. A sketch of how the boundary constant can be chosen is given below.
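Since the acceptance rate is monotone in the risk constant, the numerical search can be reduced to taking a percentile of the training distribution of expected risks. A minimal sketch, assuming expected_risk is a vector of expected risks per application computed as defined in section 4.2.1 (the definition itself is not repeated here):

```r
target_acceptance <- 0.95

# Accepting applications with expected risk below the 95th percentile of the
# training distribution gives the target acceptance rate by construction
risk_constant <- quantile(expected_risk, probs = target_acceptance)

accepted <- expected_risk < risk_constant
mean(accepted)  # approximately 0.95
```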
Model class 3
Table A.4 and figures 5.4 and 5.5 similarly show the results of retraining the second-stage
model on the subsample of observations with a score under or over different
score cutoffs. For this model class, the first-stage scores were predicted using models
trained on all observations with dpoint5 and dpoint30 as labels. The second-stage
models were then trained on these subsamples using the dpoint90 label, and the results
were combined with the dpoint90 first-stage model for the remaining sample.
This model class seems in general to give more interesting results than the two
previous ones.
Figure 5.3. Gini of different models from Model class 2 compared to the first-stage
model (dotted line) when evaluated on the validation sample.
The score(dpoint5) < 480 model, as well as the models where the predicted score of
dpoint30 > 490, > 500 and > 510, all have Gini values larger than the first-stage model
and than those in model classes 1 and 2. They were therefore selected as candidate
models. The score(dpoint5) > 490 model stands out as particularly bad. This is probably
due to the model not converging properly, which might be caused by the random
allocation to the cross-validation folds. In figure 5.4, this model was therefore not plotted.
Figure 5.4. Gini of different models with segmentation based on the dpoint5 score
from Model class 3 compared to the first-stage model (dotted line) when evaluated
on the validation sample.
Table 5.1 shows a two-way table of combinations of actual outcomes for the observations
where the predicted score from the dpoint5 regression was < 480.
Figure 5.5. Gini of different models with segmentation based on the dpoint30 score
from Model class 3 compared to the first-stage model (dotted line) when evaluated
on the validation sample.
Even though the sample has decreased significantly, a large portion of the bads with
regards to dpoint90 remains. Additionally, of the non-missing observations, around 20 %
of the observations that were bad with regards to dpoint5 were also bad with regards to
dpoint90, compared with approximately 10 % on the full sample.
Table 5.1. Combinations of actual outcome labels on dpoint5 and dpoint90 for the test
sample, on the subsample where the predicted score from the dpoint5 regression was < 480.

                      dpoint90
  dpoint5          Good     Bad
  Missing          3723     123
  Good             3708       0
  Bad              6003    1566
Model class 4
In order to implement the fourth model class, two models with the respective labels
dpoint5 and dpoint30 were first trained on the full training set. The predicted scores
were then binned and transformed into binary, non-cumulative variables. The two groups
of binary variables were then included, first separately and then in combination, when
training with the label dpoint90 (a sketch of this construction is given below).
Table A.5 and figure 5.6 show the results of applying the trained models to the
validation sample. The increase over the first-stage model is notably higher for this
model class than for the other models, for all three variable combinations. All
three will therefore be evaluated on the test sample.
Figure 5.6. Gini of different models from Model class 4 compared to the first-stage
model (dotted line) when evaluated on the validation sample.
Table A.6 shows the estimated coefficients from the model where only the binned score
from the dpoint5 model was included. As expected, a higher predicted score on the
dpoint5 label contributes negatively to the predicted log-odds of default on the dpoint90
label. Similar results can be seen for the regression using the dpoint30 label.
5.4 Evaluating Candidate Models Using the Test Sample
A total of 11 models were, along with the first-stage model, selected to be
evaluated on the test sample. The results can be seen in figure 5.7 and
in the appendix in table A.7. Table 5.2 shows a condensed version with the best
performing model of each model class along with the first-stage model for comparison.
A first interesting result to notice is that the first-stage model performs
significantly better on the test data than on the validation data. For example, the
simple one-stage logistic regression has a Gini of 0.7298 on the validation sample and
0.7956 on the test sample.
Table 5.2. Results from the first-stage model and the four best performing candidate
models on the test sample.

  Model                      No. of Obs.  No. of Bads  Bad-rate    Gini      KS  Brier Score
  1st-stage                        70363         2503     0.036  0.7956  0.6404       0.0276
  Scorecutoffs Score < 550          6455         1589     0.246  0.7962  0.6386       0.0271
  Decision boundary 90%            21918         1514     0.069  0.7971  0.6383       0.0276
  score(dpoint5) < 480             15123         1689     0.112  0.8011  0.6416       0.0272
  dpoint5 as variable              70363         2503     0.036  0.8037  0.6465       0.0273
Using simple cutoffs on the predicted score from the first-stage model seems to
reproduce a similar performance when retraining on the lower score spectrum (Score
< 550), with a marginal increase in Gini and a lower Brier score. Any gain found in
the validation set has disappeared when increasing the scorecutoff to 630 and 670.
Looking at the subset of observations within a 90 % distance to the decision boundary,
the table shows a marginal increase in Gini coupled with a marginal decrease in
KS. It is therefore difficult to say whether any of these four models performs better
than the first-stage model.
Figure 5.7. Gini of candidate models compared to the first-stage model (dotted
line) when evaluated on the test sample.
When looking at the models using information from the dpoint5 and dpoint30
labels, it is clear that they, with the exception of the score(dpoint30) > 510 model,
outperform the first-stage model on the test set as well. Using the predicted score as
a variable seems, consistent with the results from the validation set, to perform better
than using it to segment the sample. The largest increase, as in the validation set, is
seen when using only the predicted score from the model using dpoint5 as a label.
This model showed an increase in Gini of 0.0081 points and an increase in KS of 0.0061.
Figures A.2 to A.6 show the score-to-log-odds plots for the five boldfaced models
from table A.7. These graphs are displayed more condensely in figure 5.8, where
the difference between predicted and actual log-odds is plotted against the score.
Consistent with earlier observations, all models tend to overestimate the risk of
low-scoring applications and underestimate that of high-scoring applications. The
model where the subset with a predicted score of dpoint5 < 480 has been retrained
stands out and does not show as strong tendencies as the other models.
[Figure 5.8: difference between predicted and actual log-odds over the score band for the models 1st-stage, predicted score of dpoint90 < 550, 90 % distance to decision boundary, predicted score of dpoint5 < 480, and predicted score of dpoint5 as variable.]
Figure 5.8. Difference between predicted and actual log-odds over the score band for
different models when implemented on the test sample.
Assessing the statistical significance of these results is not trivial, as the glmnet
method for various reasons does not have a clearly defined methodology for estimating
standard errors. Some attempts have recently been made, notably by Lockhart
et al. (2014), by the creators themselves, but they have not been tested enough in
applications. In short, the thesis does not delve too deeply into statistical significance.
Instead, it treats results as "significant" if they are approximately reproducible on
different data, i.e. when the results are similar on the validation set and the test
set.
Chapter 6
Discussion
When comparing the results from the validation and the test sample it is clear
that the performance is higher on the test sample, which stresses the importance of
measuring relative performance within samples. There are two plausible reasons for
this. Firstly, the training and validation samples were built by separating individuals,
so that for each unique individual all observations end up in either the
training or the validation sample. The test sample, on the other hand, is an out-of-time
set: 20 % of the observations in the test sample can be attributed to an individual
that also exists in the training sample. Given that some variables are more or less
constant for an individual, the training sample should be more correlated with the
test sample than with the validation sample. It is worth noting that an actual business
implementation will resemble the test sample, with data from a later time
period and partially overlapping individuals. The fact that the models also perform
well on the validation sample suggests that performance is consistent over
moderate changes in the population.
Secondly, the lower bad-rate of the test sample shows that the population
changes over time. It is not clear how this would affect the prediction accuracy,
but it is possible that this could partially account for the relatively higher performance.
Another general observation, when looking at performance relative to the first-stage
model on the two samples, is that the relative performance seems to be stable
both over time and on a new population, which supports the general methodology.
For the segmentation models using scorecutoffs and distance to the decision boundary,
there were no significant positive results. Contrary to what we expected, for most
proposed models the evaluation metrics hardly changed at all. This could be interpreted
as the underlying relations in fact being sufficiently linear. Another possible explanation
is that there is in fact non-linearity, but that the two implemented models do not
segment the population in a way that captures it. This result is in line with previous
research on two-stage credit scoring models.
Working with this thesis has made it clear that all forms of segmentation come with
a caveat: they decrease the number of observations available for training. This
particular application is especially vulnerable as bads are rare. When the second-stage
models leave out a large number of bads, performance on that segment clearly
deteriorates. Given the highly skewed distribution towards goods, simple segmentation
methods will probably only be viable strategies for data sets of a much larger size
than what was used in this thesis. This might also explain why the decision boundary
model class performed particularly badly. By letting the simple scorecutoff
models retrain on the lower score segment, they could capture a relatively large
share of the bads. For the decision boundary model, one had to go quite far from
the decision boundary in order to get a sufficient number of bads.
Dissecting the results of model classes 3 and 4, where alternative default labels are
introduced, is not fully straightforward. A major difference from model classes 1 and 2 is
that new information has been introduced into the modelling. The difference between
model class 3 and model class 4 is the way this information affects the model. When
the predicted score is used to segment the population, as in model class 3, the new
information has no direct influence on the training itself. This strengthens the claim that
there is in fact non-linearity in the data. The large number of bads, especially for
the dpoint5 label in relation to the dpoint30 label, is interesting.
Understanding why transactions change from bads at dpoint5 to goods on the
dpoint90 label is crucial. Connecting back to the discussion in section 4.2.1, two
types of customers explain this behavior:
1. The customer has a temporary financial constraint that hinders him or her
from paying in time.
2. The customer is somewhat sloppy and forgets to pay or delays payment for
some other reason.
When training a model on the earlier default labels, we are therefore trying to
find the two categories of people mentioned above, as well as those who never fulfill
their credit obligations. The results from model class 3 suggest there is something
about this group of people that differs from the remaining population.
In model class 4, the predicted scores from the first-stage model(s) are instead used
as variables in the dpoint90 model. The results seem to favor this approach. It works
similarly to model class 3 by giving a lower score to those who are likely not to have
paid at 5 or 30 days past the due date. An obvious advantage of this model is that
both the first and the second stage can be trained on the full training set, meaning
that none of the scarce bads have to be forsaken. Comparing the results of the
validation and test sets, model class 4 seems the most robust.
Given that we have found an improvement in the model, the big question
is what this means for the profitability of a company such as Klarna. While an
increase in the Gini of 0.01 might seem economically insignificant, it is important to
remember that the credit industry is mostly a high-volume, low-margin industry.
Using the decision boundary and the definition of expected loss discussed earlier in the
thesis, it is possible to construct a simple model that estimates the expected loss
from applying a certain model to a data set. When comparing the dpoint5-as-variable
model with the simple one-stage model, the expected losses decreased
by approximately 1 % on the test set. For a company with a monthly number of
transactions on the order of 10,000,000, using the dpoint5-as-variable model would
decrease the expected monthly losses by a magnitude of SEK 100,000. While these
are not game-changing results, this is certainly evidence that the methodology holds
promise. A rough sketch of such an expected-loss comparison is given below.
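The sketch below illustrates the comparison only in outline: it assumes expected loss is approximated by the sum of predicted default probabilities times order amounts over the accepted applications, which may differ from the exact definition used in section 4.2.1, and all variable names are illustrative.

```r
expected_loss <- function(p_default, amount, risk_constant) {
  accepted <- p_default * amount < risk_constant  # accept below the boundary
  sum(p_default[accepted] * amount[accepted])
}

loss_one_stage <- expected_loss(p_first_stage, amount, risk_constant)
loss_two_stage <- expected_loss(p_dpoint5_var, amount, risk_constant)

# Relative reduction in expected loss from switching to the two-stage model
(loss_one_stage - loss_two_stage) / loss_one_stage
```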
Before implementing this in a live system, it is important to consider potential
weaknesses in the model. One question is whether these results would generalize
to other time periods or other geographic regions. The scarcity of bads is a universal
property of a well-functioning consumer credit system, so it is plausible that there is
a similar gain from using the relatively abundant earlier default labels in other markets.
That being said, model class 3 might not be a valid option for smaller data sets as it
segments the data; for such data sets, the methodology from model class 4 will probably
be a better approach. Another question is how this method is affected by sampling
bias. The bias will of course still be present in the regressions with the dpoint90 label,
but the other labels should have much less correlation with reject decisions. There is of
course still the risk that the total vulnerability to sampling bias increases when using
these kinds of models.
Chapter 7
Conclusions
In this thesis, two-stage logistic regression models have been estimated and evaluated
on credit scoring data. Credit scoring models use information such as payment
history to predict the probability of default. The data has been provided by Klarna
and consists of roughly 350,000 credit applications from Dutch retail customers.
The two-stage models have been constructed by first training a regularized logistic
regression on the full sample and then using information from the first model to re-score
part or all of the observations. Specifically, four classes of models have been evaluated:
1. Segmenting on predicted score from the first-stage model.
2. Segmenting on distance to the decision boundary from predicted probability of default.
3. Segmenting on predicted score from regressions using alternative default labels.
4. Including predicted score from regressions using alternative default labels.
The models were compared on validation and test samples using KS, Gini,
Brier score and score-to-log-odds plots. Implementing the different classes of models
showed that the simpler models (classes 1 and 2) gave no significant increase in
performance. The other two model classes (3 and 4), which segment on or include
predictions of early payment status, increased performance on both the validation and
the test set. A conclusion of this is that two-stage models do improve the performance
of credit scoring, but not when using the more conventional methods.
A key to understanding why using the earlier default labels improves the predictions
probably lies in the scarcity of bads in the original credit scoring problem. Using
the large number of bads at earlier dates, and their correlation with the dpoint90 label,
it is possible to extract more information from the data and improve the model
performance.
7.1 Future Research
Overall, this paper shows that by taking into account the fact that a transaction
is a process that changes over time, the prediction accuracy can be increased. The
interesting question is then how this information best can be put to use. While
there is arguably much room for improvement of the methodology of this thesis,
there are probably other methods that can use this information better.
An approach that has been tested previously in credit scoring is survival analysis.
In survival analysis, the credit scoring problem is changed from estimating
the probability of default to estimating the time of default. For more information,
see for example the paper by Stepanova and Thomas (2002).
Another way could be to view the problem as a directed graphical model, as
depicted in figure 7.1. In this model, the x-vector of conventional variables would predict
dpoint at t − n for some starting point in time, and the probabilities would then be fed
forward in time until the probability of dpoint at t has been calculated. A paper on how
to implement such a model could possibly find even further improvements in prediction
accuracy.
Figure 7.1. Schematic specification of a directed graphical model that takes into
account the time aspect of a transaction.
Bibliography
Banasik, J, J Crook, and L Thomas (2001). “Scoring by usage”. In: Journal of the
Operational Research Society 52.9, pp. 997–999.
Banasik, Jonathan, John Crook, and Lyn Thomas (2003). “Sample selection bias
in credit scoring models”. In: Journal of the Operational Research Society 54.8,
pp. 822–832.
Bellotti, Tony and Jonathan Crook (2009). “Credit scoring with macroeconomic
variables using survival analysis”. In: Journal of the Operational Research Society 60.12, pp. 1699–1707.
Bijak, Katarzyna and Lyn C Thomas (2012). “Does segmentation always improve
model performance in credit scoring?” In: Expert Systems with Applications 39.3,
pp. 2433–2442.
Breiman, Leo (2001). “Random forests”. In: Machine learning 45.1, pp. 5–32.
Brier, Glenn W (1950). “Verification of forecasts expressed in terms of probability”.
In: Monthly weather review 78.1, pp. 1–3.
Capon, Noel (1982). “Credit scoring systems: A critical analysis”. In: The Journal
of Marketing, pp. 82–91.
Crook, Jonathan and John Banasik (2004). “Does reject inference really improve the
performance of application scoring models?” In: Journal of Banking & Finance
28.4, pp. 857–874.
Davis, Jesse and Mark Goadrich (2006). “The relationship between Precision-Recall
and ROC curves”. In: Proceedings of the 23rd international conference on Machine learning. ACM, pp. 233–240.
Domingos, Pedro (2012). “A few useful things to know about machine learning”.
In: Communications of the ACM 55.10, pp. 78–87.
Finlay, Steven (2011). “Multiple classifier architectures and their application to
credit risk assessment”. In: European Journal of Operational Research 210.2,
pp. 368–378.
Friedman, Jerome, Trevor Hastie, and Rob Tibshirani (2010). “Regularization paths
for generalized linear models via coordinate descent”. In: Journal of statistical
software 33.1, p. 1.
Greene, William (1998). “Sample selection in credit-scoring models”. In: Japan and
the world Economy 10.3, pp. 299–316.
Hand, David J and Christoforos Anagnostopoulos (2013). “When is the area under
the receiver operating characteristic curve an appropriate measure of classifier
performance?” In: Pattern Recognition Letters 34.5, pp. 492–495.
Hand, David J and William E Henley (1997). “Statistical classification methods in
consumer credit scoring: a review”. In: Journal of the Royal Statistical Society:
Series A (Statistics in Society) 160.3, pp. 523–541.
Hand, David J, So Young Sohn, and Yoonseong Kim (2005). “Optimal bipartite
scorecards”. In: Expert Systems with Applications 29.3, pp. 684–690.
Hastie, Trevor and Junyang Qian (2014). Glmnet Vignette. http://web.stanford.
edu/~hastie/glmnet/glmnet_alpha.html. Accessed: 2014-12-29.
Hastie, Trevor et al. (2009). The elements of statistical learning. Vol. 2. 1. Springer.
Hawkins, Douglas M (2004). “The problem of overfitting”. In: Journal of chemical
information and computer sciences 44.1, pp. 1–12.
Hoerl, Arthur E and Robert W Kennard (1970). “Ridge regression: Biased estimation for nonorthogonal problems”. In: Technometrics 12.1, pp. 55–67.
Krzanowski, Wojtek J and David J Hand (2009). ROC curves for continuous data.
CRC Press.
Lessmann, Stefan, Hsin-Vonn Seow, and Bart Baesens (2013). “Benchmarking state-of-the-art classification algorithms for credit scoring: A ten-year update”. In:
Proceedings of Credit Scoring and Credit Control XIII Conference, Edinburgh,
United Kingdom, 28-30th August.
Lockhart, Richard et al. (2014). “A significance test for the lasso”. In: The Annals
of Statistics 42.2, pp. 413–468.
Marqués, AI, Vicente García, and Javier Salvador Sánchez (2012). “Two-level classifier ensembles for credit risk assessment”. In: Expert Systems with Applications
39.12, pp. 10916–10922.
Marsland, Stephen (2011). Machine learning: an algorithmic perspective. CRC Press.
Mays, Elizabeth (1995). Handbook of credit scoring. Global Professional Publishing.
Mester, Loretta J (1997). “What’s the point of credit scoring?” In: Business review
3, pp. 3–16.
Myers, James H and Edward W Forgy (1963). “The development of numerical credit
evaluation systems”. In: Journal of the American Statistical Association 58.303,
pp. 799–806.
Peng, Chao-Ying Joanne, Kuk Lida Lee, and Gary M Ingersoll (2002). “An introduction to logistic regression analysis and reporting”. In: The Journal of Educational
Research 96.1, pp. 3–14.
Rodriguez, Juan José, Ludmila I Kuncheva, and Carlos J Alonso (2006). “Rotation
forest: A new classifier ensemble method”. In: Pattern Analysis and Machine
Intelligence, IEEE Transactions on 28.10, pp. 1619–1630.
Siddiqi, Naeem (2005). Credit risk scorecards: developing and implementing intelligent credit scoring. Vol. 3. John Wiley & Sons.
So, Mee Chi et al. (2014). “Using a transactor/revolver scorecard to make credit
and pricing decisions”. In: Decision Support Systems 59, pp. 143–151.
Stepanova, Maria and Lyn Thomas (2002). “Survival analysis methods for personal
loan data”. In: Operations Research 50.2, pp. 277–289.
Thomas, Lyn C (2000). “A survey of credit and behavioural scoring: forecasting
financial risk of lending to consumers”. In: International Journal of Forecasting
16.2, pp. 149–172.
— (2010). “Consumer finance: Challenges for operational research”. In: Journal of
the Operational Research Society 61.1, pp. 41–52.
Thomas, Lyn C, David B Edelman, and Jonathan N Crook (2002). Credit scoring
and its applications. Siam.
Tibshirani, Robert (1996). “Regression shrinkage and selection via the lasso”. In:
Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288.
Wang, Gang et al. (2011). “A comparative assessment of ensemble learning for credit
scoring”. In: Expert systems with applications 38.1, pp. 223–230.
Wolpert, David H (1992). “Stacked generalization”. In: Neural networks 5.2, pp. 241–
259.
Yao, Ping (2009). “Credit scoring using ensemble machine learning”. In: Hybrid
Intelligent Systems, 2009. HIS’09. Ninth International Conference on. Vol. 3.
IEEE, pp. 244–246.
Zhao, Peng and Bin Yu (2006). “On model selection consistency of Lasso”. In: The
Journal of Machine Learning Research 7, pp. 2541–2563.
Zou, Hui and Trevor Hastie (2005). “Regularization and variable selection via the
elastic net”. In: Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 67.2, pp. 301–320.
Appendix
Table A.1. Table of alphas with the corresponding lambdas and average AUC of the trained models.

  Alpha    Lambda       AUC
  0        0.002352     0.8680419
  0.111    0.0005493    0.8685152
  0.222    0.0003014    0.8685560
  0.333    0.0002206    0.8685746
  0.444    0.0001815    0.8685794
  0.556    0.0001594    0.8685755
  0.667    0.0001103    0.8685673
  0.778    0.0000715    0.8685685
  0.889    0.0000996    0.8685652
  0.950    0.0000585    0.8685669
  0.990    0.0000617    0.8685660
  0.999    0.0000973    0.8685698
  1        0.0000735    0.8685704
Table A.2. Model Class 1: Results on the validation sample from re-training on
the subsample with score under different score-cutoffs.

  Cutoff       No. of Obs.  No. of Bads  Bad-rate    Gini      KS  Brier Score
  1st-stage          70917         2962     0.042  0.7298  0.5604       0.0335
  500                  777          432     0.556  0.7292  0.5605       0.0336
  510                 1151          595     0.517  0.7300  0.5604       0.0333
  520                 1753          775     0.442  0.7296  0.5601       0.0334
  530                 2649         1005     0.379  0.7293  0.5595       0.0334
  540                 4214         1296     0.308  0.7296  0.5601       0.0333
  550                 6616         1617     0.244  0.7301  0.5601       0.0334
  560                 9982         1922     0.193  0.7293  0.5607       0.0334
  570                14488         2178     0.150  0.7285  0.5594       0.0334
  580                20055         2414     0.120  0.7288  0.5566       0.0334
  590                25943         2598     0.100  0.7293  0.5557       0.0334
  600                31763         2718     0.086  0.7293  0.5581       0.0335
  610                37181         2799     0.075  0.7296  0.5592       0.0335
  620                42426         2846     0.067  0.7292  0.5594       0.0336
  630                47486         2890     0.061  0.7301  0.5624       0.0335
  640                51910         2918     0.056  0.7307  0.5631       0.0335
  650                55866         2935     0.053  0.7296  0.5605       0.0336
  660                59274         2947     0.050  0.7295  0.5601       0.0336
  670                62169         2950     0.047  0.7306  0.5634       0.0335
  680                64532         2955     0.046  0.7295  0.5609       0.0336
  690                66345         2958     0.045  0.7298  0.5603       0.0336
  700                67681         2959     0.044  0.7286  0.5588       0.0336
Table A.3. Model Class 2: Results on the validation sample from re-training on
the subsample with scores within x percentage points of the decision boundary.

  Window (%)   No. of Obs.  No. of Bads  Bad-rate    Gini      KS  Brier Score
  1st-stage          70917         2962     0.042  0.7298  0.5604       0.0335
  20                  2059          275     0.134  0.7285  0.5600       0.0336
  30                  3253          423     0.130  0.7277  0.5591       0.0336
  40                  4701          590     0.126  0.7264  0.5570       0.0337
  50                  6557          772     0.118  0.7269  0.5588       0.0337
  60                  9166          990     0.108  0.7262  0.5578       0.0336
  70                 12961         1238     0.096  0.7276  0.5595       0.0336
  80                 18835         1543     0.082  0.7273  0.5594       0.0336
  90                 30213         1934     0.064  0.7300  0.5619       0.0336
Table A.4. Model Class 3: Results on the validation sample from re-training
using predicted scores on alternative response labels and scorecutoffs to define the
sample to be included in the 2nd stage.

  Model                     Cutoff  No. of Obs.  No. of Bads  Bad-rate    Gini      KS  Brier Score
  1st-stage                      -        71770         3014     0.042  0.7285  0.5585       0.0337
  score(dpoint5) < Cutoff      470         6847          867     0.127  0.7281  0.5569       0.0337
  score(dpoint5) < Cutoff      480        12332         1400     0.114  0.7309  0.5601       0.0336
  score(dpoint5) < Cutoff      490        23874         2133     0.089  0.7302  0.5627       0.0336
  score(dpoint5) < Cutoff      500        41452         2690     0.065  0.7288  0.5583       0.0337
  score(dpoint5) < Cutoff      510        54723         2920     0.053  0.7283  0.5588       0.0338
  score(dpoint5) < Cutoff      520        63467         2988     0.047  0.7278  0.5594       0.0338
  score(dpoint5) < Cutoff      530        68398         3005     0.044  0.7278  0.5600       0.0338
  score(dpoint5) > Cutoff      470        64923         2147     0.033  0.7306  0.5645       0.0337
  score(dpoint5) > Cutoff      480        59438         1614     0.027  0.7298  0.5653       0.0337
  score(dpoint5) > Cutoff      490        47896          881     0.018  0.5805  0.5045       0.0341
  score(dpoint5) > Cutoff      500        30318          324     0.011  0.7266  0.5572       0.0337
  score(dpoint5) > Cutoff      510        17047           94     0.006  0.7270  0.5600       0.0337
  score(dpoint5) > Cutoff      520         8303           26     0.003  0.7263  0.5587       0.0337
  score(dpoint5) > Cutoff      530         3372            9     0.003  0.7275  0.5585       0.0337
  score(dpoint30) < Cutoff     470          755          293     0.388  0.7296  0.5599       0.0336
  score(dpoint30) < Cutoff     480         1220          430     0.352  0.7292  0.5592       0.0336
  score(dpoint30) < Cutoff     490         2099          627     0.299  0.7283  0.5591       0.0337
  score(dpoint30) < Cutoff     500         3643          944     0.259  0.7291  0.5570       0.0336
  score(dpoint30) < Cutoff     510         6508         1381     0.212  0.7304  0.5597       0.0335
  score(dpoint30) < Cutoff     520        11799         1891     0.160  0.7304  0.5618       0.0336
  score(dpoint30) < Cutoff     530        21266         2371     0.111  0.7295  0.5603       0.0337
  score(dpoint30) > Cutoff     470        71015         2721     0.038  0.7299  0.5657       0.0337
  score(dpoint30) > Cutoff     480        70550         2584     0.037  0.7309  0.5642       0.0336
  score(dpoint30) > Cutoff     490        69671         2387     0.034  0.7317  0.5652       0.0337
  score(dpoint30) > Cutoff     500        68127         2070     0.030  0.7338  0.5690       0.0337
  score(dpoint30) > Cutoff     510        65262         1633     0.025  0.7319  0.5658       0.0337
  score(dpoint30) > Cutoff     520        59971         1123     0.019  0.7296  0.5639       0.0337
  score(dpoint30) > Cutoff     530        50504          643     0.013  0.7277  0.5599       0.0337
Table A.5. Model Class 4: Results on the validation sample from re-training
using predicted scores on alternative response labels.

  Model                 No. of Obs.  No. of Bads  Bad-rate    Gini      KS  Brier Score
  1st-stage                   71770         3014     0.042  0.7285  0.5585       0.0337
  dpoint5                     71770         3014     0.042  0.7325  0.5647       0.0336
  dpoint30                    71770         3014     0.042  0.7309  0.5636       0.0336
  dpoint5 & dpoint30          71770         3014     0.042  0.7314  0.5649       0.0337
Table A.6. Coefficient values of dpoint5 variables when training on dpoint90 and
including the predicted score from the dpoint5 label as a variable.

  Variable                     Coefficient
  dpoint5_binned(471,482]          -0.0741
  dpoint5_binned(482,488]          -0.2522
  dpoint5_binned(488,493]          -0.4297
  dpoint5_binned(493,497]          -0.6205
  dpoint5_binned(497,501]          -0.7592
  dpoint5_binned(501,506]          -0.9728
  dpoint5_binned(506,513]          -1.3032
  dpoint5_binned(513,522]          -1.4635
  dpoint5_binned(522,595]          -1.8813
Table A.7. Results from evaluating the candidate models on the test sample.

  Model                              No. of Obs.  No. of Bads  Bad-rate    Gini      KS  Brier Score
  1st-stage                                70363         2503     0.036  0.7956  0.6404       0.0276
  Scorecutoffs Score < 550                  6455         1589     0.246  0.7962  0.6386       0.0271
  Scorecutoffs Score < 630                 37190         2408     0.065  0.7705  0.6005       0.0285
  Scorecutoffs Score < 670                 55263         2494     0.045  0.7535  0.5985       0.0287
  Decision boundary 90%                    21918         1514     0.069  0.7971  0.6383       0.0276
  score(dpoint5) < 480                     15123         1689     0.112  0.8011  0.6416       0.0272
  score(dpoint30) > 490                    66964         1521     0.023  0.8003  0.6450       0.0276
  score(dpoint30) > 500                    65016         1219     0.019  0.8008  0.6442       0.0276
  score(dpoint30) > 510                    62014          878     0.014  0.7958  0.6359       0.0276
  dpoint5 as variable                      70363         2503     0.036  0.8037  0.6465       0.0273
  dpoint30 as variable                     70363         2503     0.036  0.8028  0.6464       0.0273
  dpoint5 & dpoint30 as variables          70363         2503     0.036  0.8031  0.6474       0.0273
Figure A.2. Actual and predicted score to log odds when applying the first-stage
model on the test sample.
Figure A.3. Actual and predicted score to log odds when applying the scorecutoff
model on the subset of observations with predicted score < 550 on the test sample.
Figure A.4. Actual and predicted score to log odds when applying the scorecutoff
model on the subset of observations with less than a 90 % distance from the decision
boundary on the test sample.
Figure A.5. Actual and predicted score to log odds when training on the subset
of observations with a predicted score from the dpoint5 label less than 480, on the
test sample.
Figure A.6. Actual and predicted score to log odds when training on dpoint90 and
including the predicted score from the dpoint5 label as a variable on the test sample.