DEGREE PROJECT IN MACHINE LEARNING, SECOND LEVEL
STOCKHOLM, SWEDEN 2015

Two-Stage Logistic Regression Models for Improved Credit Scoring

February 24, 2015

ANTON LUND
[email protected]

Master’s Thesis in Computer Science
School of Computer Science and Communication
KTH Royal Institute of Technology, Stockholm

Supervisor at KTH: Olov Engwall
Examiner: Olle Bälter
Project commissioned by: Klarna AB
Supervisor at the company: Hans Hjelm

Abstract

This thesis has investigated two-stage regularized logistic regressions applied to the credit scoring problem. Credit scoring refers to the practice of estimating the probability that a customer will default if given credit. The data was supplied by Klarna AB and contains a larger number of observations than is used in many other research papers on credit scoring. In this thesis, a two-stage regression refers to a model of two staged regressions where some kind of information from the first regression is used in the second regression to improve the overall performance. In the best performing models, the first stage was trained on alternative labels: payment status at earlier dates than the conventional one. The predictions were then used as input to, or to segment, the second stage. This gave a Gini increase of approximately 0.01. Using conventional score cutoffs or the distance to a decision boundary to segment the population did not improve performance.

Referat (Swedish abstract, translated)

This thesis has investigated two-stage regularized logistic regressions for estimating the credit score of consumers. The credit score is a measure of creditworthiness and measures the probability that a person will not pay back his or her credit. The data comes from Klarna AB and contains more observations than much other research on creditworthiness. In this thesis, a two-stage regression refers to a regression model consisting of two stages, where information from the first stage is used in the second stage to improve the overall performance. In the first stage, the best performing models use an alternative response variable, payment status at an earlier point in time than the conventional one, either to segment the population or as a variable in the second stage. This gave a Gini increase of approximately 0.01. Using simpler segmentation methods, such as score cutoffs or the distance to a decision boundary, turned out not to improve performance.

Contents

1 Introduction
  1.1 Background
  1.2 Thesis Objective
  1.3 Ethical Concerns
  1.4 Delimitations
  1.5 Choice of Methodology

2 Theory Review
  2.1 Machine Learning
    2.1.1 Sampling Bias
    2.1.2 Overfitting
  2.2 Training, Validation & Testing
    2.2.1 Validation
    2.2.2 Testing
  2.3 Logistic Regressions
    2.3.1 Basics
    2.3.2 Regularized Logistic Regressions
    2.3.3 Estimation

3 Related Works
  3.1 Segmentation Models
  3.2 Ensemble Models
  3.3 Two-stage Models
4 Method
  4.1 Synthesis
  4.2 Methodology
    4.2.1 Proposed Model Classes
  4.3 Practical Details
  4.4 Data

5 Results
  5.1 Choosing the Mixing Parameter, α
  5.2 First-stage Model
  5.3 Finding Candidate Models Using the Validation Sample
  5.4 Evaluating Candidate Models Using the Test Sample

6 Discussion

7 Conclusions
  7.1 Future Research

Bibliography

Appendix

Chapter 1

Introduction

This thesis aims to investigate two-stage logistic regression models for use in retail underwriting. Underwriting refers to the process of assessing a customer’s eligibility to receive credit. Underwriting models are used in a wide array of industries, for example by banks, credit card companies, mobile phone companies and insurance companies. When this process has been automated, it is usually instead referred to as credit scoring. Credit scoring models use information such as income data and payment history to predict the probability that the customer will default if given credit.

The degree project is carried out at Klarna, a company based in Stockholm, Sweden that provides payment services to online merchants. One of the main components of this service is constructing and applying credit scoring models for retail customers. The accuracy of the credit scoring models is a key driver of Klarna’s profitability, but the models also need to fulfil various obligations.
Some of the current credit scoring models at Klarna involve logistic regressions, a type of probabilistic classification model used to predict a binary response. Credit scoring models can often classify with high accuracy the customers that are clearly likely, or clearly unlikely, to pay on time, but may not do as well in greyer areas. Klarna therefore wants to investigate whether the accuracy can be improved by implementing and evaluating two-stage logistic regression models. A two-stage model should in this context be seen as a model where the first-stage model is trained using the available data and the second-stage model uses information from the first-stage model in some way to increase performance. The problem can be described by the research question: Do two-stage logistic regression models, while retaining simplicity, improve the performance of credit scoring models when compared to the conventional logistic regression?

This paper starts with chapter 1, which gives an introduction to credit scoring and a motivation for the research question in this thesis, along with a quick discussion of ethical concerns, the delimitations and the choice of methodology. The remainder of the thesis is outlined as follows. In chapter 2, the basics of machine learning will be covered and some important concepts in training machine learning models will be described. This will be followed by chapter 3, a summary of previous works on topics related to two-stage credit scoring models. Next, chapter 4 will cover the specifics of the method and some practical details with regard to the implementation. Chapter 5 will present the results and will be followed by chapter 6, which discusses the results and puts them in a general context. Lastly, chapter 7 will give a brief summary of the thesis and the conclusions.
1.1 Background

The applicant in a credit scoring process can for example be a consumer applying for a credit card or a mortgage, but also an organization, e.g. a company trying to secure a loan for its business. An important difference between scoring consumers and organizations is the available data (Thomas 2010). This paper will focus on credit scoring of consumers and examples related to credit for retail customers.

The credit score is often designed to predict the probability that the credit applicant will default, but can also be related to more vague concepts such as the probability that the credit applicant will be profitable (Hand and Henley 1997). In practice, this is usually turned into a problem of classifying a customer as either a “good” or a “bad”. The terms “goods” and “bads” will be used throughout the paper to distinguish between customers that have good and bad outcomes. How this is defined differs from company to company but, simply put, a transaction is bad when the customer has not fulfilled his or her payment obligations some time after the credit due date.

Credit scoring is, in a generalized perspective, an application of a classification problem. In a credit scoring model, each credit applicant is attributed a score based on available data. A good model should assign a high score to a credit applicant that is unlikely to default (or the equivalent positive outcome) (Mester 1997). The industry standard in credit scoring is for the credit score to be a logarithmic transformation of the probability of default (PD), so that every 20-point decrease in score doubles the odds of defaulting. The precise definition varies from company to company. At Klarna, the credit score is defined as follows:

    Score = −log( PD / (1 − PD) ) · 20/log(2) + 600 − log(50) · 20/log(2)    (1.1)

As an example, PDs of 1 % and 99 % roughly correspond to scores of 620 and 355, respectively.
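Equation (1.1) can be sanity-checked in a few lines (a minimal sketch; the function name is ours, and log denotes the natural logarithm):

```python
import math

def credit_score(pd: float) -> float:
    """Credit score as in eq. (1.1): the score is 600 at defaulting odds
    of 1:50, and every 20-point decrease in score doubles the odds."""
    odds = pd / (1.0 - pd)                       # odds of defaulting
    return 600.0 - 20.0 / math.log(2) * math.log(50.0 * odds)

print(round(credit_score(0.01)))   # → 620
print(round(credit_score(0.99)))   # → 355
```

The rewritten form 600 − (20/log 2)·log(50 · PD/(1 − PD)) is algebraically identical to eq. (1.1).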
Figure 1.1 shows the relationship between probability of default and credit score. An important characteristic of the score is the increased granularity as the probability approaches 0 and 1. For the reasoning behind the form of the relationship between the PD and the score, see section 2.3.1.

[Figure 1.1. Plot of the relationship between probability of default and credit score.]

There is a wide range of methods that can be used to create credit scoring models. Some examples are (Lessmann, Seow, and Baesens 2013):

1. Statistical models such as linear probability models and discriminant analysis.
2. Machine learning models such as neural networks, support vector machines and decision trees.
3. Methods derived from financial theory, such as option pricing models.
4. Combinations of different methods using, for example, ensemble learning methods.

Credit scoring is a highly active field and, as in most fields of applied classification, there is a tradition of searching broadly when testing out new models and methods. A problem with many of the academic papers is that they often use data sets that, in comparison to the data owned by Klarna, are very small. For example, one source of data commonly used in machine learning is the UCI Machine Learning Repository. It holds 4 data sets with credit application data, which all have 1,000 observations or fewer.¹ Klarna will usually have data sets on the order of 100,000 observations when training its models, as well as more variables. A higher quality data set should improve prediction accuracy and increase the potential of models that train on subsets of the data.

Though dependent on the specific application, there are generally three sources of data available when assessing credit eligibility (Thomas 2000):

1. Application data - E.g. type of purchase, requested credit.
2. External data - E.g.
payment history from credit agencies or address lookups from an address bureau.
3. Internal data - Data from previously accepted and rejected applications.

The application data is collected at each transaction. External data needs to be purchased, and a company might develop models to predict when buying external data for a specific customer is profitable. Internal data exists for customers that have used the company’s service earlier and increases in size over time. Internal data might therefore not be available for all customers and might lose its relevance over time as the population changes.

¹ See the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets.html for more information.

1.2 Thesis Objective

There is an extensive amount of literature on credit scoring and an abundance of papers that explain methods for improving the accuracy of credit scoring models. Many of these papers describe models that are intuitively hard to interpret, such as artificial neural networks (ANN), or complicated composite methods using ensemble learning. Despite this fact, the model of choice for many companies developing scoring models is the relatively simple logistic regression. One possible reason for this is the sampling bias inherent in the credit scoring problem (Greene 1998). As the data has usually passed through some earlier version of a model, data for training and evaluation is restricted to data from previously accepted customers. It is not ex ante clear how a model trained on data with this bias will perform on unfiltered data. This implies that there is a need to interpret the model and assess whether it is reasonable or not. Another argument for interpretive simplicity comes from the need to explain the reasoning behind accept/reject decisions to different entities within and outside the company. In some countries, it is required by law to be able to explain to a customer the specific reason or reasons he or she was rejected credit.
For these reasons, it is clear that the credit scoring industry faces constraints that make complicated classification algorithms and ensemble learning models unsuitable. Logistic regression models are, on the other hand, easy to understand and interpret. Their simplicity also makes them easy to implement and, if needed, to modify post-training. This comes with a caveat. Logistic regressions are linear models. It is therefore not as simple to produce a more complex decision surface as it is for, for example, an artificial neural network model. If the underlying data is not perfectly linear, simply training a logistic regression on the data cannot be expected to give the best possible prediction. There is therefore a need to explore how credit scoring models can be improved, while retaining simplicity, within the context of logistic regressions.

This paper will investigate whether the performance of logistic regressions can be improved by introducing one or several logistic regressions in a second step to form a two-stage model. The second step will in some way take into consideration information from the first model to create a full scoring model that performs better relative to a simple logistic regression. A short list of different methods will be implemented and compared to a single-stage model to find the best performing model.

1.3 Ethical Concerns

It has been argued that a weakness of statistical credit scoring models is their lack of an explanatory model (Capon 1982). This makes the specific reasoning behind why a customer was rejected or accepted opaque. From this it is arguable that there is a certain arbitrariness to the whole system. On the other hand, automated scoring models could remove prejudice that might influence human decisions.

Others have also argued that some customer characteristics might act as proxies for characteristics that, if used for differentiating between customers, would be illegal discrimination.
Examples of this would be that differentiating on postal codes could be a proxy for ethnicity, while distinguishing between certain types of income (e.g. full-time or part-time) could discriminate on gender (Capon 1982). In general, this is a problem of statistical models, which by nature will tend to predict the behavior of an individual from the behavior of the group he or she belongs to.

There is an additional problem of credit giving that is not isolated to the credit scoring practice: false positives and false negatives. It is not realistic to assume that a model will accurately predict all credit applications. A fraction of the people that should not receive credit will be accepted, and some who should, will be rejected. This is however a problem that can be alleviated with more accurate credit scoring models, and it is arguably an aspect where credit scoring models perform better than non-statistical underwriting models.

The results from this thesis are not likely to have much effect on the ethical aspects of credit scoring, except for the chance that they might increase the accuracy of predictions. There is perhaps a need to further discuss the ethics of credit scoring as its scope and prevalence increase, but that discussion deserves its own paper.

1.4 Delimitations

There is a large number of models that can be applied to the credit scoring problem. This thesis can be seen as an in-depth study and has therefore only covered the logistic regression. Hence, for all implemented models, both stages consist of logistic regression models. There have also been a few restrictions on the data. Klarna uses different models for different geographic regions. This paper has only tested the implementation on one of these regions. A reason for this is the lack of readily available data. Additionally, the data used is from a specific time period. This, together with the geographic restriction, needs to be taken into account when attempting to generalize the results from this paper.
The Related Works chapter mentions a number of machine learning algorithms. It is outside the scope of this paper to cover them all in detail. For more information, please consult a machine learning textbook, e.g. Hastie et al. (2009), or some other introductory material.

1.5 Choice of Methodology

The method implementation can be divided into five separate tasks:

1. Acquire a data set and split it into training, validation and test sets.
2. Identify methods for constructing the two-stage models.
3. Train the models using the training set.
4. Evaluate results on the validation set to select a number of candidate models.
5. Evaluate results of the candidate models on the test set.

The data set used in the implementation will be delivered by Klarna. To minimize the time spent on assembling and cleaning the data, a data set that was previously used to train models at Klarna will be used. Previous research, as well as discussions with Klarna experts, will form the basis for selecting methods for constructing the two-stage models. The number of models actually implemented will depend on the time available. The models will be trained using R, a free software programming language for statistical computing.² To make sure that the training is automated and reproducible, a package including all the necessary functions and code will be developed. The aim of this package is to be able to input a data set and receive the trained model along with relevant statistics and graphics.

² See http://www.r-project.org/ for more information.

Chapter 2

Theory Review

The theory review will start by giving an introduction to the concept and practice of machine learning. It will focus on practical concepts like overfitting, sampling bias, training and validation. The next section will define the logistic regression, give an introduction to regularization and finally describe the estimation method used in this paper.
Lastly, there will be a section on evaluating credit models that describes some of the common metrics.

2.1 Machine Learning

Machine learning is a subset of computer science that studies systems that can automatically learn from data (Domingos 2012). The field is closely linked to statistics, optimization and artificial intelligence. The most widely used application of machine learning is classification (Domingos 2012). The typical setup for a classifier is a vector of p attributes x = (x_1, x_2, …, x_p) and a set of classes y = (y_1, y_2, …, y_d). The x vector is known for n examples and each example is assigned to one of the classes. The classification problem can then be described as the problem of using historical data to identify to which class a new data point, x_i, belongs. An example of this could be an algorithm that recognizes faces in images. A data point, x_i, could then consist of the individual pixels of an image and the y-vector would hold a binary response depending on whether a face is actually in the image or not.

2.1.1 Sampling Bias

A problematic issue when training credit scoring models is sampling bias. Sampling bias arises when the sample the model was trained on does not represent the population from which data points are drawn at prediction time. Most credit scoring models require some type of label. Labeled data is typically collected by giving customers credit and waiting to see whether they default or not. It will therefore only be possible to train the model on transactions that were previously accepted (Thomas 2000). Greene (1998) and Banasik, Crook, and Thomas (2003) find that sampling bias may decrease the accuracy of credit scoring models. This phenomenon can lead to some unexpected results. An example of this could be a historical model with a highly predictive variable, e.g. the amount of debt.
That model would penalize transactions with high debt and only let high-debt transactions through when the other variables give very strong positive predictions. If, at some later time, a new model is trained on the data that was filtered by the historical model, high-debt transactions are likely to have a much lower default rate than in the data the first model was trained on. This could give the debt variable a positive effect in the new model, something that is not likely to reflect the nature of future, unfiltered, requests. There is unfortunately no ready-made solution to this problem. A few methods have been suggested to alleviate it, for example reject inference (e.g. Crook and Banasik 2004) and virgin sampling, i.e. letting through a fraction of requests regardless of score in order to have a small but representative sample. Having the ability to spot these problems is a contributing reason for prioritizing interpretive simplicity in credit scoring models.

2.1.2 Overfitting

Overfitting refers to the problem of a fitted model describing the random error in the sample rather than the underlying characteristics. An overfitted model is not suitable for prediction, as it fails to generalize to data other than the training set (Hawkins 2004). One cause of overfitting is using too complex a model, e.g. using a quadratic model to predict a linear trend. Two common ways to alleviate overfitting are cross-validation and methods that reduce complexity and induce parsimony, such as regularization and feature selection (Hastie et al. 2009). Additionally, adhering to the principle of parsimony is important from a more practical perspective. Decreasing the number of variables means that fewer resources have to be spent on developing and maintaining variables, and it decreases the risk of various errors in the database (Hawkins 2004).
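The complexity point can be made concrete with a small self-contained illustration (our construction, not data from the thesis): a flexible polynomial chases the noise in a linear trend, beating the straight line on the training data but losing on held-out data.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 20)
x_test = np.linspace(0.0, 1.0, 200)
y_train = 2 * x_train + 1 + rng.normal(scale=0.3, size=x_train.size)
y_test = 2 * x_test + 1                  # the true, noise-free trend

errors = {}
for degree in (1, 12):
    fit = Polynomial.fit(x_train, y_train, degree)
    errors[degree] = (
        float(np.mean((fit(x_train) - y_train) ** 2)),   # training error
        float(np.mean((fit(x_test) - y_test) ** 2)),     # held-out error
    )
print(errors)
```

With this seed, the degree-12 fit attains the lower training error but the higher held-out error, i.e. it has fitted the noise rather than the underlying trend.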
2.2 Training, Validation & Testing

To make sure that the trained models are generalizable, it is important to test the models on different data than what was used for training. This is usually done in steps with two separate goals (Hastie et al. 2009):

1. Validation (Model Selection) - Comparing the performance of different models to find the best one.
2. Testing (Model Assessment) - Given that the best fitting model has been found, estimating its generalization error on new data.

The data is therefore typically divided into three sets: a training set, a validation set and a test set. The training set is used to estimate the models, the validation set for model selection and lastly the test set to evaluate the generalizability (Hastie et al. 2009). It is important that the training, validation and testing phases use different data, to prevent an underestimation of the true error. A typical split could be 1/2 for training, 1/4 for validation and the last 1/4 for testing (Hastie et al. 2009).

2.2.1 Validation

A common problem in credit scoring is that, while the total number of data points might be sufficiently large, only a small fraction of those represent customers who defaulted or were designated as "bad" in some other way. A method used to alleviate this problem is cross-validation, one of the most common implementations being K-fold cross-validation. Cross-validation involves separating the training set into K folds. For each of the K folds, a model is trained on the other folds and validated on the k-th fold. A typical value of K could be 10. The prediction error for the k-th fold is defined as

    E_k(f̂) = (1/N) Σ_{i=1}^{N} L(y_i, f̂^{−k(i)}(x_i)),    (2.1)

where f̂^{−k(i)} refers to the model trained on the sample not included in the k-th fold, and L is a function that measures the error given the estimate and the true value for observation i (Hastie et al. 2009).
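The fold machinery behind the per-fold error E_k can be sketched as follows (a minimal illustration, not the thesis code: the "model" fitted on the training folds is deliberately trivial — it predicts the training folds' mean bad rate — and L is squared error):

```python
import numpy as np

def kfold_cv_error(y, K=10, seed=0):
    """K-fold cross-validation estimate of the prediction error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    fold_errors = []
    for k in range(K):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        prediction = y[train_idx].mean()          # model fitted without fold k
        fold_errors.append(np.mean((y[folds[k]] - prediction) ** 2))
    return float(np.mean(fold_errors))            # average over the K folds

y = np.array([0] * 90 + [1] * 10, dtype=float)    # 10% bad rate
print(round(kfold_cv_error(y), 3))
```

With a real scoring model, `prediction` would be the model's estimated probability for each held-out observation, and L could instead be, for example, the log loss.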
The cross-validation estimate of the prediction error is then (Hastie et al. 2009)

    CV(f̂) = (1/K) Σ_{k=1}^{K} E_k(f̂)    (2.2)

2.2.2 Testing

It is in most cases not possible to create a model that can perfectly separate the classes. This implies that, for every score threshold, some fraction of data points will be misclassified. It also implies a need to compare different models in order to pick the model that has the best performance. It is not entirely clear how to define the performance of a scorecard. A guiding principle is to measure how well the scoring model can distinguish between goods and bads. This section will explain the measures used in this paper.

ROC curve, AUC and Gini

A commonly used method of evaluation is the AUC measure and a linear transformation of the AUC, the Gini coefficient (Hand and Anagnostopoulos 2013; Krzanowski and Hand 2009). The AUC (Area Under the Curve) is derived from the ROC (Receiver Operating Characteristic) curve, which depicts the true positive rate against the false positive rate depending on some threshold (Davis and Goadrich 2006). The true positive rate (TPR) is defined as the ratio of true positives over all positives (TP/P) and the false positive rate (FPR) as the ratio of false positives over all negatives (FP/N). An example could be a data set of 1,000 observations with a bad rate of 10%, i.e. P = 100 and N = 900. In the credit scoring problem, a bad is defined as a positive response and vice versa. If we, with a certain model, manage to correctly classify 80 bads (TP = 80) and mistakenly classify 100 goods as bads (FP = 100), we get a true positive rate of TPR = 0.8 and a false positive rate of FPR ≈ 0.11.

The threshold in credit scoring can be a score cutoff, and the ROC curve is then created by calculating the TPR and FPR for a sufficient number of score cutoffs. If the cutoff is set so that no applications are accepted, then TPR = 1 and FPR = 1.
If instead all applications are accepted, then TPR = 0 and FPR = 0. The area under the ROC curve can be interpreted as the probability of a randomly selected positive data point being ranked higher than a randomly selected negative data point. The Gini coefficient is a linear transformation of the AUC, defined as

    Gini = 2·AUC − 1    (2.3)

The Gini can take on values in the range [−1, 1], and a value of 0 represents the random baseline. A higher AUC or Gini implies an overall higher discriminatory power of the model. Figure 2.1 shows some schematic ROC curves along with the resulting Gini.

[Figure 2.1. Schematic plot of the relationship between the ROC curve and Gini (0, 0.75, 0.90 and 0.99) for different classifiers.]

Kolmogorov-Smirnov statistic

Another evaluation measure is the Kolmogorov-Smirnov (KS) statistic. It measures the distance between the distributions of the goods and the bads (Thomas, Edelman, and Crook 2002). Let P_G(s) and P_B(s) be the cumulative distribution functions of the goods and the bads. The KS statistic is then defined as

    KS = max_s |P_G(s) − P_B(s)|    (2.4)

When calculating the KS statistic, P_G(s) and P_B(s) can be plotted. Figure 2.2 shows a schematic plot of the KS statistic for two fictive cumulative distributions of goods and bads.

[Figure 2.2. Schematic plot of P_B(s), P_G(s) and the KS statistic for two distributions.]

Mays (1995) gives a general guide to interpreting the KS statistic, which can be seen in table 2.1. This table should not be taken too literally.

Table 2.1. Guideline of the quality of a scorecard based on the KS statistic (Mays 1995).
Values have been transformed from percentages to decimals.

    KS statistic   Evaluation
    < 0.20         Scorecard probably not worth using
    0.20–0.40      Fair
    0.41–0.50      Good
    0.51–0.60      Very good
    0.61–0.75      Awesome
    > 0.75         Probably too good to be true

Brier score

A third measure is the Brier score, originally proposed in a 1950 paper on weather forecasting (Brier 1950). The Brier score is the mean squared error of the probability estimates produced by the model. Let y_i be the outcome for a certain observation, taking the value 0 for a good and 1 for a bad, and let f_i be the estimated probability that y_i = 1. The Brier score is then defined as

    BS = (1/N) Σ_{i=1}^{N} (f_i − y_i)²    (2.5)

Score-to-log-odds plot

The last measure that will be used is the score-to-log-odds plot. When plotted, it shows how well actual outcomes in different score segments correspond to the outcomes estimated by the model. If the actual log-odds for a segment are higher than expected from the model, the risk is under-estimated in that segment. The score-to-log-odds plot therefore shows how the model performs across the score band.

A benefit of the measures described above is that they do not take a specific threshold into account when evaluating the model (Hand and Anagnostopoulos 2013). The downside is that when specific thresholds have been decided, it is more interesting to measure the performance of the model around those thresholds (Thomas, Edelman, and Crook 2002).

2.3 Logistic Regressions

Logistic regression emerged in the late 1960s and early 1970s as an alternative to OLS regression and linear discriminant analysis (Peng, Lee, and Ingersoll 2002). In the normal case, the logistic regression is used to predict the response of a binary variable. There is also a generalization, called multinomial logistic regression, that can be used for multi-class problems.
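The threshold-free measures of section 2.2.2 can all be computed directly from labels and probability estimates. A compact sketch (bads coded as the positive class, as in the text; the four observations at the bottom are made-up illustration values):

```python
import numpy as np

def auc(y, f):
    """P(random bad scored above random good); ties count half."""
    bads, goods = f[y == 1], f[y == 0]
    wins = (bads[:, None] > goods[None, :]).sum()
    ties = (bads[:, None] == goods[None, :]).sum()
    return (wins + 0.5 * ties) / (len(bads) * len(goods))

def gini(y, f):
    return 2 * auc(y, f) - 1                      # eq (2.3)

def ks(y, f):
    """Max distance between the goods' and bads' score CDFs, eq (2.4)."""
    thresholds = np.unique(f)
    p_g = np.array([(f[y == 0] <= t).mean() for t in thresholds])
    p_b = np.array([(f[y == 1] <= t).mean() for t in thresholds])
    return float(np.max(np.abs(p_g - p_b)))

def brier(y, f):
    return float(np.mean((f - y) ** 2))           # eq (2.5)

y = np.array([1, 1, 0, 0])                 # two bads, two goods
f = np.array([0.9, 0.4, 0.5, 0.1])         # estimated probabilities of bad
print(auc(y, f), gini(y, f), ks(y, f), round(brier(y, f), 4))
# → 0.75 0.5 0.5 0.1575
```

Here 3 of the 4 bad-good pairs are ranked correctly, giving AUC = 0.75 and hence Gini = 0.5.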
The logistic regression is widely used in empirical applications and has emerged as the industry standard credit scoring model (Bellotti and Crook 2009). Lessmann, Seow, and Baesens (2013) run a benchmark of a wide array of classification algorithms on a credit scoring data set and find that, on average, the logistic regression performs well and even outperforms many state-of-the-art machine learning algorithms. They also found that models such as random forests and neural networks can give better predictions than logistic regressions. One reason that logistic regressions are still so widely used is that it is relatively easy to perform reality checks on them. For example, just looking at the sign of a variable's coefficient shows whether the results make intuitive sense. Such checks can easily find problems such as the example of the debt variable described in section 2.1.1.

2.3.1 Basics

The logistic regression model comes from the need to construct a linearly additive model that can predict a probability, i.e. a number between 0 and 1. This was solved by a neat trick using the concept of the logit. Let Y_i be a binary-valued outcome variable with an associated probability p_i that is related to a number of explanatory variables X. For simplicity, let the explanatory variables be normalized, as this makes the specification of the regularization simpler. The logit is the logarithm of the odds of the probability, \log \frac{p}{1-p}. This means that the probability can be transformed to a number that ranges from -\infty to \infty, which can be used in a linear model. With m explanatory variables, the probability can be modeled using a linear prediction function, f, that for a particular data point i takes the form

    f(x) = \mathrm{logit}(p_i) = \log \frac{p_i}{1 - p_i} = \beta_0 + \beta x_i = \beta_0 + \beta_1 x_{1,i} + \ldots + \beta_m x_{m,i}        (2.6)

For a trained model, with estimates \hat{\beta}_0 and \hat{\beta}, we can calculate the estimated logit and probability, \hat{p}_i, using

    \mathrm{logit}(\hat{p}_i) = \log \frac{\hat{p}_i}{1 - \hat{p}_i} = \hat{\beta}_0 + \hat{\beta} x_i \quad \text{and} \quad E[Y|X_i] = \hat{p}_i = \frac{1}{1 + \exp(-(\hat{\beta}_0 + \hat{\beta} x_i))}        (2.7)

2.3.2 Regularized Logistic Regressions

Two widely recognized problems of logistic regressions are overfitting and feature selection. Feature selection refers to the problem of selecting the correct variables (features) to include in the model. A common method to alleviate these problems is regularization. The idea behind regularization is to put a penalty on the size of the regression coefficients, which can be done in a number of ways. A selection of common regularization methods:

• Lasso - a type of \ell_1-regularization.
• Ridge regression - a type of \ell_2-regularization.
• Elastic net regression - a linear combination of the lasso and ridge regression.

The lasso (least absolute shrinkage and selection operator) estimator introduces an \ell_1-restriction of the form \sum_m |\beta_m| \le t for a constant t. This restriction tends to decrease the size of the coefficients and, for sufficiently small values of t, also sets some coefficients to zero (Tibshirani 1996). It has been shown that, under some conditions, the lasso can therefore be used as a method of automated feature selection while still maintaining an efficient estimation (Zhao and Yu 2006). It does, however, have some non-trivial problems. For example, it can select at most n variables, where n is the number of observations. Also, in the case of groups of highly correlated variables, it tends to somewhat arbitrarily include only one of those variables in the regression (Zou and Hastie 2005). The ridge regression method, also known as Tikhonov regularization, instead uses an \ell_2-restriction of the form \sum_m |\beta_m|^2 \le c for a constant c (Hoerl and Kennard 1970).
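The two restricted quantities are easy to state in code. A minimal sketch (illustrative only; the function names are ours, not from any package) of the quantities constrained by the lasso and ridge restrictions:

```python
def l1_norm(beta):
    """Lasso restriction quantity: sum of |beta_m|, constrained to be <= t."""
    return sum(abs(b) for b in beta)

def l2_norm_squared(beta):
    """Ridge restriction quantity: sum of beta_m^2, constrained to be <= c."""
    return sum(b * b for b in beta)
```

For a coefficient vector (1, -2), the lasso quantity is 3 while the ridge quantity is 5; the square in the ridge restriction is why it shrinks large coefficients harder but never sets them exactly to zero.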
The ridge regression reduces the variance of coefficient estimates at the price of a bias. Combined, these effects can decrease the mean squared error of the regression and lead to better predictions (Hoerl and Kennard 1970). It has also been shown that, for situations with n > m and highly correlated predictors, the ridge regression performs better than the lasso (Tibshirani 1996). It does, however, lack the ability to produce a sparse model, as it keeps all the predictors in the model (Zou and Hastie 2005).

Zou and Hastie (2005) suggested a combination of the lasso and the ridge regression called the elastic net. It has the benefit of both automated feature selection and a lower variance of the prediction. Additionally, it can select groups of variables if one of the variables in the group is found significant. Using the method of Lagrange multipliers, the lasso and ridge constraints can be rewritten as penalties in an optimization problem. Combined, the elastic net penalty function has the form

    P(\beta, \lambda_t, \lambda_c) = \lambda_t \sum_m |\beta_m| + \lambda_c \sum_m |\beta_m|^2        (2.8)

where the \lambda parameters are called shrinkage parameters. When applied naively, this specification causes excessive shrinkage due to shrinkage effects from both the lasso and the ridge component. It can be shown that this can be mitigated by multiplying the estimates from the naive specification with a scaling factor, (1 + \lambda_2) (Zou and Hastie 2005).

2.3.3 Estimation

The logistic regression is usually estimated using a maximum likelihood estimation (MLE) approach (Peng, Lee, and Ingersoll 2002). MLE is a widely used estimation method that, given a set of n data points x_1, x_2, \ldots, x_n, selects the model that is most likely to have generated the data. If we define a specific model as \theta and the likelihood function as L, the maximum likelihood approach selects a model that satisfies

    \hat{\theta} = \arg\max_{\theta \in \Theta} L(\theta; x_1, x_2, \ldots, x_n)        (2.9)

For both logistic and regularized logistic regressions, there is no closed-form solution to the MLE problem, which implies that an iterative, numerical procedure must be used.

Chapter 3

Related Works

The classification approach to consumer credit scoring has been thoroughly researched in academia. Lessmann, Seow, and Baesens (2013) perform a meta-study of recently published papers on credit scoring. They find that, in terms of prediction, artificial neural networks (ANNs) are very strong when applied individually. Ensemble classifiers can outperform individual classifiers, with random forests giving the most accurate predictions. Another important finding is that complicated state-of-the-art algorithms do not necessarily outperform simpler versions. An example of this is random forests (see Breiman (2001)) outperforming the more advanced rotation forests (see Rodriguez, Kuncheva, and Alonso (2006)). An ensemble selection approach combining bootstrap sampling (see section 3.2) and a greedy hill-climbing algorithm (a type of optimization method) for base model selection gave the overall best results. They also find, by comparing 45 credit scoring papers from the period 2003 to 2013, that the mean and median numbers of observations in the data sets are 6167 and 959 respectively (Lessmann, Seow, and Baesens 2013). In machine learning contexts, these can be considered relatively small data sets.

There is an extensive amount of academic work attempting to increase performance by training more than one model. These attempts can somewhat coarsely be put into two categories: segmentation and ensemble models. Segmentation models attempt to divide the population into sub-populations and develop scorecards for each sub-population (Thomas, Edelman, and Crook 2002). Ensemble learning, on the other hand, aims to build a model by combining the results from a collection of (simple) models (Hastie et al. 2009).
3.1 Segmentation Models

Segmentation models can be constructed on either an experience-based (heuristic) or a statistical basis (Siddiqi 2005). The heuristic strategy uses some characteristic, such as age or requested credit, to segment the population. The statistical strategy uses statistical methods or machine learning models, such as cluster analysis, to segment the population. Each credit application is thus first assigned to a segment by some means and then scored by a single scorecard trained specifically for that segment. This simplifies performance monitoring and model retraining.

Segmentation has proven successful in some cases. So et al. (2014) build a classification model that segments credit card customers into transactors (those who pay off their balance every month and are thus by definition good) and revolvers (those who sometimes pay only part of their monthly balance and incur interest charges). They find that this segmentation gives a more accurate profitability estimate than a single scorecard. Banasik, Crook, and Thomas (2001) build a two-stage model to distinguish between high-usage and low-usage customers. They find that, when taking into account that the usage of credit depends on the amount of credit actually awarded to a customer, the two-stage model gives better prediction accuracy. Hand, Sohn, and Kim (2005) implement a bipartite model by splitting on variable characteristics, training two separate scorecards and selecting the split that maximizes the combined likelihood. They show that this procedure can increase performance compared to a single logistic regression.

Other segmentation methods have not been as successful. Bijak and Thomas (2012) use a statistical machine learning approach. They distinguish between a two-step and a simultaneous method. In the two-step method, the segmentation and scoring models are built independently.
In the simultaneous approach, the segmentation and scoring models are optimized together. The first step implements statistically based segmentation methods (CART models, CHAID trees and LOTUS models) to separate the data set into several groups, and the second step builds scorecards for each group. They find that neither the two-step nor the simultaneous segmentation method significantly increases the prediction power.

3.2 Ensemble Models

Ensemble learning classifiers come in many different flavors. Bootstrap aggregating (bagging) involves drawing a number of random samples with replacement (bootstrap sampling) from the training data and training a classifier on each sample. Boosting creates an ensemble by iteratively creating new models and increasing the weights of misclassified data points in each step. The final classification model is then a weighting of all iterations. A widely used boosting algorithm is AdaBoost (Marsland 2011). Stacked generalization (stacking) is a method whereby a number of classifiers are trained and then aggregated using a second step that combines the scores (Wolpert 1992).

Wang et al. (2011) compare bagging, boosting and stacking on three real-world credit data sets. As base learners, they use logistic regressions, decision trees, artificial neural networks and support vector machines. They find that, in general, the ensemble methods increase accuracy for all types of base learners. While bagging decision trees showed the best performance improvement, results seem to differ somewhat between data sets. Marqués, García, and Sánchez (2012) examine composite ensemble learning using random subspace, rotation forest and convolutional neural networks to construct two-level classifier ensembles. They run tests on six different data sets and find that two-level ensembles can increase prediction accuracy quite significantly. Yao (2009) similarly uses CART, bagging and boosting methods and also finds performance increases.
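The bootstrap aggregating described above can be sketched generically. This is a minimal illustration, not any of the cited implementations; the `fit` callable, which maps a training sample to a scoring function, is a hypothetical stand-in for an arbitrary base learner:

```python
import random

def bagging_predict(train, test_points, fit, n_models=10, seed=0):
    """Bootstrap aggregating: fit one base learner per bootstrap sample of the
    training data, then average the predicted scores over the ensemble."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Draw a sample of the same size as the training set, with replacement.
        sample = [train[rng.randrange(len(train))] for _ in train]
        models.append(fit(sample))
    # The ensemble prediction is the mean of the base model predictions.
    return [sum(model(x) for model in models) / n_models for x in test_points]
```

Averaging over resampled models reduces the variance of an unstable base learner, which is why bagging helps decision trees in particular.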
3.3 Two-stage Models

The category "two-stage models" is not clearly defined in the literature, but in this thesis it refers to a model constructed, using similar methods, in two stages, where the second stage uses some kind of information from the first stage. This definition overlaps somewhat with the definitions of segmentation models and ensemble learning models, but the exact distinctions are not so important.

Finlay (2011) performs a broad study of multi-classifier systems on a number of baseline models on two data sets of approximately 100,000 observations each. He finds some evidence that CART and a combination of neural networks and logistic regressions show potential for increasing performance. The highest increase in performance comes from the ET boost, significantly out-performing the more commonly used AdaBoost algorithm. These methods are not two-stage models in the strict sense but have some similar characteristics. He also creates two-stage models where, for example, the result from a baseline logistic regression is used to segment the population. The finding is that this type of segmentation performs poorly for all baseline models. His interpretation of these results is that the data is only very weakly non-linear, but the result could also be an effect of overfitting; for some segments, the number of bads was below 1,000 cases. A problem with the paper is that, due to the large number of evaluated models, the exact methodology is not clearly explained.

Segmentation based on the score from a first-stage model was first attempted by Myers and Forgy (1963). Using discriminant analysis, they train second-stage models with different score cutoffs from the first stage. They find no positive effects, but the results are limited by the small sample of 300 observations.

Chapter 4

Method

Chapter 4 starts with a synthesis of the literature review in the form of a number of stylized facts.
This leads up to a presentation of the methods investigated in this thesis. Next, some practical details of the implementation are discussed, followed lastly by a presentation of the data.

4.1 Synthesis

The findings in earlier chapters can be summarized in a few stylized facts.

1. There is a wide selection of methods that seem to increase the predictive power compared to the industry standard logistic regression.

From the discussion in earlier chapters, it seems that there is no single way of increasing the performance of credit scoring models. Both segmentation methods and ensemble methods seem to be effective in many cases. As many others have echoed (e.g. Thomas (2000)), there does not yet seem to exist a "holy grail" for the credit scoring problem. The optimal model for any implementation will likely depend on the characteristics of that specific instance of the problem and on the needs and capabilities of the organization. Additionally, the relationship between model performance and model complexity is not always positive, and simpler models sometimes outperform more complex derivatives.

2. Many empirical evaluations use data with small sample sizes.

Introducing new data is problematic in the sense that it complicates comparison with other studies. Data sets will have different sizes and variables and be drawn from populations with dissimilar characteristics. In such a comparison, it is not clear what part of the difference comes from the data and what part comes from the implementation. This is especially a problem when data is proprietary and not available for replication or further studies. A clear and exhaustive description of the methodology is therefore important to ensure reproducibility, so that the methodology can be tested on other data. With this in mind, there is still a need for proprietary data if its size or quality exceeds that of the publicly available data sets.
This is especially true when implementing models, such as segmentation models, that can benefit significantly from larger samples.

3. Broad comparative studies are hard to interpret.

Some recent studies, such as Finlay (2011) and Lessmann, Seow, and Baesens (2013), have identified a need for broad studies that compare a large number of classifiers using similar data and methodology. This approach has a number of problems. When the number of classifier models grows large enough, it is not possible to explain the practical implementation in sufficient detail. Data preprocessing (e.g. data cleaning, feature selection) and model specifics (e.g. regularization, estimation) have a big influence on results, and the specific methods employed might be better suited to certain classifiers. There might also be a bias in terms of the researchers' experience with different classifiers, causing some classifiers to under-perform. The non-exhaustive descriptions of methodologies make this problem difficult to tackle.

Conclusions

The conclusion of these stylized facts is that there is a need to look broadly when investigating classifier models, so that credit scorers can find the model that suits their specific capabilities and problems. Additionally, with new and better data, there is a need to re-evaluate models that earlier did not perform as well as expected. Finally, broad comparative studies need to be complemented by narrower studies that cover the implementation details of a model in greater depth. These studies, while interesting on their own, will also help researchers attempting comparative studies.

4.2 Methodology

Given the conclusions above and the imperative of retaining simplicity, there is a motivation for using Klarna's high quality data to investigate potential low-hanging fruit that can increase the performance of the industry standard scoring model.
While there have been some attempts at constructing simple two-stage logistic regression models, there has yet to be an exhaustive study with a clearly explained methodology. Another important addition of this study is the use of regularization methods, which enables separate, automated feature selection for both stages.

In discussions, and inspired by previous work, a number of methods for constructing the two-stage model have been identified. The basic idea is to train a logistic regression on all observations in the training set. Information from the first stage is then included in, or used to segment the data for, the second-stage model. The results from both models are then combined in some way to produce predictions for all observations.

4.2.1 Proposed Model Classes

Model class 1: Cutoff on predicted score from first-stage regression.

Defaulting customers can naively be put into three categories. The first category is those who do not have the ability to pay, the second is those who for some reason have no intention to pay, and the third is fraud. It is likely that the characteristics predicting these categories are quite different. A hypothesis is that the prevalence of the first category relative to the second is larger at lower scores, and vice versa. One option is then to train a model on all data and, in a second stage, train separate models on low- and high-scoring customers. An alternative is to assume that the second category might not easily be predicted by variables used for credit scoring; defaults in the second category would then add noise when attempting to predict the first category. The alternative suggestion is therefore to make a cutoff on the score from the first stage, train the second stage only on the low-scoring customers, and score the remaining customers with the model trained on the full data.
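The cutoff idea of model class 1 can be sketched as follows. This is a simplified illustration of the described procedure, not the exact implementation; `fit` is a hypothetical stand-in for training a regularized logistic regression:

```python
def two_stage_scores(train, test_points, fit, cutoff):
    """Model class 1 sketch: a second stage is trained only on the observations
    that the first stage scores below the cutoff; all other observations keep
    their first-stage score."""
    stage_one = fit(train)
    low = [(x, y) for x, y in train if stage_one(x) < cutoff]
    stage_two = fit(low) if low else stage_one  # fall back if the segment is empty
    return [stage_two(x) if stage_one(x) < cutoff else stage_one(x)
            for x in test_points]
```

In the actual models, the cutoff is varied over the score band and the combined predictions are then evaluated against the single-stage model.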
While it would be interesting to also train models on the subsample above the cutoff, the lack of bads in the high-scoring segment makes this difficult.

Model class 2: Segmenting on observations close to the decision boundary.

After a credit scoring model has been trained, some kind of decision model decides whether to accept or reject a new application. A naive type of decision model is a constant risk model. It defines a constant risk, which is the maximum expected loss that can be accepted:

    PD \cdot credit = constant        (4.1)

The line formed by plotting the maximum allowed credit for each score is referred to as the decision boundary. Given such a model, the performance is highly dependent on the accuracy of the scoring model around the decision boundary. Given a hypothesized non-linearity in the data, the first stage can be used to identify observations in proximity to the decision boundary, and the second stage is then trained on that subset, which might give better accuracy in that region. The condition for including an observation in the second stage is then

    \left| \frac{PD \cdot credit}{constant} - 1 \right| < x        (4.2)

where the constant has been calculated from the constant risk decision model in equation 4.1 and x is the percentage window boundary.

Model class 3: Cutoffs on predicted score from alternative labels.

As the payment status of a transaction evolves over time, there is no clear definition of what point in time should be used to classify a transaction. Using the payment status at the credit due date might mistakenly classify as bad those cases where the customer intended to pay but forgot, or where the customer was temporarily unable to pay. The industry standard for bank cards is to use 90 days after the credit due date (Mays 1995). There might, however, be valuable information in the payment status at a shorter time after the credit due date. Perhaps people who are likely to forget to pay on time have other characteristics than those who never fail to pay on time.
This model class will therefore investigate the effect of training on earlier payment statuses, 5 and 30 days after the credit due date. These labels will in the rest of the paper be referred to as dpoint5, dpoint30 and dpoint90, with dpoint90 being the final definition of goods and bads. Using the idea from above, this model class will first train first-stage models using the two alternative default labels dpoint5 and dpoint30. The scores from these first-stage regressions will then be used to segment the population, similarly to model class 1, so that a new dpoint90 regression can be trained for that subsample. The remaining sample will be scored by a simple logistic regression on the full sample. As the alternative labels have a larger number of bads than dpoint90, it is also possible to train models on the subsamples above the cutoffs.

Model class 4: Predicted score from alternative labels as variables.

An alternative to the segmentation models proposed above is to train scoring models on the alternative default labels and use the predicted scores as variables in a final scoring model on the dpoint90 label. For the previous model classes, the first stage aims to segment out a part of the sample to increase the accuracy on that subsample. For this model, however, both the first and the second stages are trained on the full sample, which means that the full number of bads can be used for training. The scores from the first stage will naturally have a high correlation with the remaining variables in the final scoring model. While correlation between explanatory variables is usually a problem for logistic regressions, regularization should help account for this.

4.3 Practical Details

For estimating logistic regressions in this thesis, the R package 'glmnet' has been used.¹ The glmnet package is a highly efficient implementation of the elastic net regression for generalized linear models using cyclical coordinate descent.
It includes a function for estimating logistic regressions (Friedman, Hastie, and Tibshirani 2010). The regularization penalty in the glmnet package is defined as

    \left\{ \frac{1 - \alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right\}        (4.3)

where \alpha is a mixing parameter with 0 \le \alpha \le 1. The package also includes functionality for k-fold cross-validation to choose an optimal value of the shrinkage parameter, \lambda. When the mixing parameter is set to 1 - \epsilon for some small \epsilon > 0, the estimator retains much of the characteristics of the lasso but with some increased numerical stability (Hastie and Qian 2014).

The glmnet estimation procedure needs, apart from the variable coefficients, a shrinkage parameter, \lambda, and a mixing parameter, \alpha. The package includes a cross-validation function for finding the optimal \lambda for a given \alpha. The optimal \alpha can also be found by nesting a two-stage cross-validation method (Hastie and Qian 2014). Because of the considerable time it takes to train an individual model, it is not reasonable to perform a two-stage cross-validation for each model. A compromise is to find an optimal \alpha once for the full training set and then only use the included cross-validation technique for \lambda for each subsequently trained model.

¹More information about the glmnet package can be found at http://cran.r-project.org/web/packages/glmnet/index.html.

The glmnet package provides a number of statistics that can be used as a measure of fit. This paper uses the AUC measure, since the Gini coefficient, which is derived from the AUC, is one of the main statistics used to evaluate the models.

It is important to use separate sets for training, validation and testing to avoid overly optimistic performance estimates. The testing set was therefore an out-of-time set consisting of approximately 20 % of the total sample. The remaining part was split on a customer identifier so that 25 % of the remaining customers (approximately 20 % of the whole sample) ended up in the validation set and 75 % (approximately 60 % of the whole sample) in the training set.
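The split on a customer identifier can be sketched with a stable hash, so that all observations belonging to one customer land in the same set. This is an illustrative sketch, not necessarily how the split was actually implemented:

```python
import hashlib

def split_on_customer(rows, validation_share=0.25):
    """Assign each customer to the validation or training set via a stable hash
    of the customer identifier; every purchase by one customer therefore lands
    in the same set, preventing leakage between sets."""
    train, validation = [], []
    for row in rows:
        digest = hashlib.md5(str(row["customer_id"]).encode()).hexdigest()
        if int(digest, 16) % 100 < validation_share * 100:
            validation.append(row)
        else:
            train.append(row)
    return train, validation
```

Hashing (rather than random assignment per observation) guarantees that repeat purchases by the same customer never straddle the training and validation sets.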
A practical concern is how to compare results between first- and second-stage models. It is important that the underlying data is the same for comparisons to be useful. One solution would be to evaluate both stages on the subset of observations that are included in the second stage, but this makes different second-stage models difficult to compare, as the subset changes. The alternative approach used in this paper is to first predict scores using the first-stage model and then replace the scores of the observations included in the second-stage model with the scores from the second-stage model.

4.4 Data

The data used in this paper consists of Dutch invoices initiated during the period 2013-06-01 to 2014-03-31. The data set contains a total of approximately 350,000 observations. Each observation is a previously accepted purchase through Klarna, and the data set can contain several observations for one individual. There are 136 variables that can be used for modelling purposes, some being internal data and others data acquired from external credit bureaus. Note that not all variables are valid for all observations, so some fields are missing. Some variables are binary and others numeric or categorical. The non-binary variables were put into bins or categories and transformed into binary variables by making each bin a binary feature. Binning transforms the contribution of a variable from linear to piece-wise linear, which makes the model more flexible (Hastie et al. 2009). When applicable, cumulative binning was used: for a variable such as income, an observation in the €100,000-€200,000 bin would also be in the €50,000-€100,000 bin. After binning and transforming the variables, the final data set contains 770 binary variables that form the final feature vector x. The data was divided into three sets: a training set (60 %), a validation set (20 %) and a test set (20 %).
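The cumulative binning described above can be sketched as follows. This is a minimal illustration; the thresholds and the treatment of missing values are assumptions made for the example:

```python
def cumulative_bins(value, thresholds):
    """Cumulative binary features: a value above a threshold activates that bin
    and, by construction, every lower bin as well. A missing value (None)
    activates no bins at all."""
    if value is None:
        return [0] * len(thresholds)
    return [1 if value >= t else 0 for t in thresholds]
```

An income of 150,000 against thresholds (50,000, 100,000, 200,000) yields the features (1, 1, 0): the observation sits in the second bin and therefore also counts toward the first.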
Table 4.1 shows the number of observations, the number of bads and the bad-rate for each set. As expected, the training and validation sets show similar characteristics. For the out-of-time test set, the bad-rate has decreased, implying some change in the population over time.

Table 4.1. Distribution of goods and bads for payment status at 90 days after credit due date.

                    Good      Bad     Bad-rate
Training set        208222    8842    0.041
Validation set      68756     3014    0.042
Test set            67860     2503    0.036

The original data set only contained labels for payment status 90 days after the credit due date. The payment statuses at earlier times were thus extracted and matched separately afterwards. Unfortunately, matches could not be found for around 20 % of observations. On the other hand, the number of bads when looking at the payment status 5 days after the credit due date is larger than for the 90-day label by a factor of 10. For 30 days after the due date, the increase is approximately a factor of 3. Given that bads are so scarce for the dpoint90 label, this is a major increase.

An important thing to note is that the definition of a bad in this data set is very restrictive. Klarna has sorted out a large number of transactions that were deemed indeterminate. The reasoning comes from an earlier experience where a stricter definition of bads increased the prediction accuracy. Additionally, some non-bads have been filtered out for other reasons. The bad-rate should therefore not be taken as a measure of Klarna's actual loss rate.

Chapter 5

Results

Chapter 5 starts by describing the results from implementing the cross-validation technique to find the mixing parameter. Next, the performance of the first-stage model is described. The chapter continues by presenting results from training the models proposed in section 4.2.1 and evaluating them on the validation sample.
Lastly, the chapter describes the results from evaluating a number of candidate models on the test sample. This chapter refers to a large number of graphs and tables. Graphs and tables with the prefix A can be found in the appendix.

5.1 Choosing the Mixing Parameter, α

A two-stage cross-validation method was first set up to find the optimal α on the full training set. Table A.1 shows the result of testing a sequence of values for the mixing parameter, α. Apart from the minimum value at α = 0, the differences in AUC are in the fifth decimal for the remaining α values. Without confidence intervals, it is difficult to say whether any of the AUC values are statistically significantly higher than the others. Given that the default value in the glmnet package is α = 1, i.e. a lasso, and using the idea that α = 1 − ε for some small positive ε works like the lasso but with increased numerical stability, it was decided to use α = 0.999 (Hastie and Qian 2014). This choice of α implies that the major part of the regularization comes from the lasso component. This value of α has been used for all subsequently trained models.

5.2 First-stage Model

The first stage was trained on the training set using the binary label dpoint90, i.e. payment status 90 days after the due date, as response variable. Figure 5.1 shows a number of plots based on the predictions on the validation sample. From the distributions of goods and bads, it is clear that there is an overlap over the larger part of the score band. The score-to-log-odds plot shows that the predicted score lies reasonably close to the actual line, but that there is an underestimation of the log-odds at low scores and an overestimation at high scores. The KS of 0.56 and the reasonably high Gini suggest that the first-stage model on its own performs relatively well.
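The KS statistic quoted above measures the maximum distance between the cumulative score distributions of goods and bads. A minimal sketch of that computation (illustrative, not the evaluation code used in the thesis):

```python
def ks_statistic(good_scores, bad_scores):
    """Kolmogorov-Smirnov statistic: the largest vertical gap between the
    empirical cumulative score distributions of the goods and the bads."""
    thresholds = sorted(set(good_scores) | set(bad_scores))
    gaps = []
    for t in thresholds:
        cdf_good = sum(s <= t for s in good_scores) / len(good_scores)
        cdf_bad = sum(s <= t for s in bad_scores) / len(bad_scores)
        gaps.append(abs(cdf_good - cdf_bad))
    return max(gaps)
```

Perfectly separated score distributions give a KS of 1.0; identical distributions give 0.0, which frames the 0.56 observed for the first-stage model.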
[Figure 5.1 here: four panels on the validation sample: the distribution of goods and bads over the score band, actual and predicted log-odds, the Kolmogorov-Smirnov plot (KS statistic = 0.560) and the ROC curve (Gini = 0.730).]

Figure 5.1. Combination of plots when evaluating the first-stage model on the validation sample.

5.3 Finding Candidate Models Using the Validation Sample

This section describes the results from the full set of trained models. The metrics used to evaluate the models seemed to be quite highly correlated; that is, a model with a high Gini in general also had a high KS and a low Brier score. Therefore, to ensure readability, this section only contains figures showing how the Gini for different models compares with the first-stage model. The full result tables can be found in the appendix, tables A.2 to A.5.

Model class 1

Table A.2 and figure 5.2 show the results from retraining on the subsets of observations with predicted scores below different score cutoffs. It is easily noticeable that the results are not very sensitive to this method: the difference in any of the metrics is at most on a magnitude of around 0.1 percentage points. A partial explanation for the edge cases might be that for the very low score cutoffs, almost no observations are included in the second stage, while for the very high score cutoffs, almost all observations are. As the deviations are so small, it is difficult to say whether any variation is statistically significant. However, the 550, 630 and 670 models, being the three models with the highest Gini, were selected for further analysis.
[Figure 5.2 here: Gini on the validation sample for score cutoffs from 500 to 700.]

Figure 5.2. Gini of different models from Model class 1 compared to the first-stage model (dotted line) when evaluated on the validation sample.

Model class 2

The decision boundary was derived using the constant risk concept described in section 4.2.1. The constant maximum allowed expected risk was determined by numerical optimization as the value giving an acceptance rate of 95 % on the full training sample. This decision boundary was then used for subsequent training and analysis. The rule for including observations in the second-stage subset was being inside an x % boundary window. Table A.3 and figure 5.3 show the results for a number of different percentages on the validation sample. Apart from the 90 % window, which will be evaluated on the test sample, none of these models give better results on any metric, and even for the 90 % model the difference is at best marginal.

Model class 3

Table A.4 and figures 5.4 and 5.5 similarly show the results of retraining the second-stage model on subsamples of observations with scores under or over different score cutoffs. For this model class, the first-stage scores were predicted using models trained on all observations with dpoint5 and dpoint30 as labels. The second-stage models were then trained on these subsamples using the dpoint90 label, and the results were combined with the dpoint90 first-stage model for the remaining sample. This model class seems in general to give more interesting results than the two previous ones. The model with score(dpoint5) < 480, as well as the models with score(dpoint30) > 490, > 500 and > 510, all have Gini values
RESULTS Gini of Model Class 2 on validation sample 0.740 Gini 0.735 0.730 0.725 90 80 70 60 50 40 30 20 0.720 Percentage distance to decision boundary cutoff Figure 5.3. Gini of different models from Model class 2 compared to the first-stage model (dotted line) when evaluated on the validation sample. larger than the first-stage model and those in model classes 1 and 2. They were therefore selected as candidate models. The score(dpoint5 ) > 490 model stands out as particularly bad. This is probably due to the model not converging properly which might be because of the random allocations to the cross-validation folds. In figure 5.4, this model was therefore not plotted. Gini of dpoint5 models from Model Class 3 on validation sample 0.740 Gini 0.735 0.730 0.725 score(dpoint5) > 530 score(dpoint5) > 520 score(dpoint5) > 510 score(dpoint5) > 500 score(dpoint5) > 490 score(dpoint5) > 480 score(dpoint5) > 470 score(dpoint5) < 530 score(dpoint5) < 520 score(dpoint5) < 510 score(dpoint5) < 500 score(dpoint5) < 490 score(dpoint5) < 480 score(dpoint5) < 470 0.720 Figure 5.4. Gini of different models with segmentation based on the dpoint5 score from Model class 3 compared to the first-stage model (dotted line) when evaluated on the validation sample. Table 5.1 shows a two way table of combination of actual outcomes of observations where the predicted score from the dpoint5 regression was < 480. Even though the sample has decreased significantly, a large portion of the bads with regards to 27 CHAPTER 5. RESULTS Gini of dpoint30 models from Model Class 3 on validation sample 0.740 Gini 0.735 0.730 0.725 score(dpoint30) > 530 score(dpoint30) > 520 score(dpoint30) > 510 score(dpoint30) > 500 score(dpoint30) > 490 score(dpoint30) > 480 score(dpoint30) > 470 score(dpoint30) < 530 score(dpoint30) < 520 score(dpoint30) < 510 score(dpoint30) < 500 score(dpoint30) < 490 score(dpoint30) < 480 score(dpoint30) < 470 0.720 Figure 5.5. 
Gini of different models with segmentation based on the dpoint30 score from Model class 3 compared to the first-stage model (dotted line) when evaluated on the validation sample. dpoint90 remains. Additionally, of the non-missing observations, around 20 % of observations that were bad with regards to dpoint5 were also bad with regards to dpoint90, compared with approximately 10 % on the full sample. Table 5.1. Table of combinations of actual outcome labels on dpoint5 and dpoint90 for the test sample on the subsample where the predicted score from the dpoint5 regression was < 480. dpoint90 Good Bad dpoint5 Missing Good Bad 3723 123 3708 0 6003 1566 Model class 4 In order to implement the fourth model class, two models with the respective labels dpoint5 and dpoint30 were first trained on the full training set. The predicted scores were then binned and transformed to binary non-cumulative variables. The two groups of binary variables were then included first separately and then in combination when training with the label dpoint90 Table A.5 and figure 5.6 show the result of applying the trained model to the validation sample. The difference when compared to the first-stage model is with this model notably higher than for the other models for all 3 combinations. All three will therefore be evaluated on the test sample. 28 CHAPTER 5. RESULTS Gini of Model Class 4 on validation sample 0.740 Gini 0.735 0.730 0.725 dpoint5 & dpoint30 dpoint5 dpoint30 0.720 Figure 5.6. Gini of different models from Model class 4 compared to the first-stage model (dotted line) when evaluated on the validation sample. Table A.6 shows the estimated coefficients from the model where only the score from the dpoint5 bins were included. As expected, a higher expected score on the dpoint5 label is correlated with a lower score on the dpoint90 variable. Similar results can be seen for the regression using the dpoint30 label. 
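The construction of model class 4, with first-stage scores binned into dummy variables and fed into the final regression, can be sketched as follows. This is a minimal illustration on synthetic data, using scikit-learn's LogisticRegression in place of the glmnet setup of the thesis; y_early and y_final are stand-ins for the dpoint5 and dpoint90 labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 5000, 10
X = rng.normal(size=(n, d))
latent = X @ rng.normal(size=d)                              # shared latent credit risk
y_early = (latent + rng.normal(size=n) > 1.0).astype(int)    # stand-in for dpoint5
y_final = (latent + rng.normal(size=n) > 2.0).astype(int)    # stand-in for dpoint90

# Stage 1: regularized logistic regression on the early default label
stage1 = LogisticRegression(C=1.0, max_iter=1000).fit(X, y_early)
score1 = stage1.predict_proba(X)[:, 1]

# Bin the stage-1 score into deciles and expand to non-cumulative dummy variables
edges = np.quantile(score1, np.linspace(0.1, 0.9, 9))
dummies = np.eye(10)[np.digitize(score1, edges)]

# Stage 2: final model on the conventional label, with the score bins as extra inputs
stage2 = LogisticRegression(C=1.0, max_iter=1000).fit(np.hstack([X, dummies]), y_final)
p_final = stage2.predict_proba(np.hstack([X, dummies]))[:, 1]
```

In a real pipeline the stage-1 scores used when fitting stage 2 should be produced out-of-fold, so that stage 2 does not learn from stage-1 predictions that have already seen its own training rows.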
5.4 Evaluating Candidate Models Using the Test Sample

A total of 11 models were, along with the first-stage model, selected to be evaluated on the test sample. The results can be seen in figure 5.7 and in the appendix in table A.7. Table 5.2 shows a condensed version with the best performing model of each model class, along with the first-stage model for comparison. A first interesting result to notice is that the first-stage model performs significantly better on the test data than on the validation data: the simple one-stage logistic regression has a Gini of 0.7298 on the validation sample and 0.7956 on the test sample.

Table 5.2. Results from the first-stage model and the four best performing candidate models on the test sample.

                           No. of Obs.   No. of Bads   Bad-rate   Gini     KS       Brier Score
1st-stage                        70363          2503      0.036   0.7956   0.6404   0.0276
Scorecutoffs Score < 550          6455          1589      0.246   0.7962   0.6386   0.0271
Decision boundary 90%            21918          1514      0.069   0.7971   0.6383   0.0276
score(dpoint5) < 480             15123          1689      0.112   0.8011   0.6416   0.0272
dpoint5 as variable              70363          2503      0.036   0.8037   0.6465   0.0273

Using simple cutoffs on the predicted score from the first-stage model seems to reproduce a similar performance when retraining on the lower score spectrum (Score < 550), with a marginal increase in Gini and a lower Brier score. Any gain found in the validation set has disappeared when increasing the score cutoff to 630 or 670. Looking at the subset of observations within a 90 % distance of the decision boundary, the table shows a marginal increase in Gini coupled with a marginal decrease in KS. It is therefore difficult to say whether any of these four models performs better than the first-stage model.
[Figure 5.7. Gini of candidate models compared to the first-stage model (dotted line) when evaluated on the test sample.]

When looking at the models using information from the dpoint5 and dpoint30 labels, it is clear that they, except for the model with predicted score of dpoint30 > 510, outperform the first-stage model on the test set as well. Using the predicted score as a variable seems, consistent with the results from the validation set, to perform better than using it to segment the sample. As in the validation set, the largest increase is seen when using only the predicted score from the model with dpoint5 as a label. This model showed an increase in Gini of 0.0081 points and an increase in KS of 0.0061.

Figures A.2 to A.6 show the score-to-log-odds plots for the five boldfaced models in table A.7. These graphs are displayed more condensely in figure 5.8, where the difference between predicted and actual log odds is plotted against the score. Consistent with earlier observations, all models tend to overestimate the risk of low-scoring applications and underestimate that of high-scoring applications. The model retrained on the subset with predicted score of dpoint5 < 480 stands out and does not show these tendencies as strongly as the other models.

[Figure 5.8. Difference between predicted and actual log odds over the scoreband for different models when implemented on the test sample.]

Assessing the statistical significance of these results is not trivial, as the glmnet method for various reasons does not have a clearly defined methodology for estimating standard errors. Some attempts have recently been made, notably by Lockhart et al. (2014), partly by the creators of the method themselves, but they have not been tested enough in applications. In short, this thesis does not dwell much on statistical significance. Instead, it takes results as "significant" if they are approximately reproducible on different data, i.e. when the results are similar on the validation set and the test set.

Chapter 6

Discussion

When comparing the results from the validation and the test sample, it is clear that the performance is higher on the test sample, which stresses the importance of measuring relative performance within samples. There are two plausible reasons for this. Firstly, the training and validation samples were built by separating individuals, so that for each unique individual, all observations end up in either the training or the validation sample. The test sample, on the other hand, is an out-of-time set: 20 % of the observations in the test sample can be attributed to an individual that also exists in the training sample. Given that some variables are more or less constant for an individual, the training sample should therefore be more correlated with the test sample than with the validation sample. It is worth noting that an actual business implementation will resemble the test sample, with data from a later time period and partially overlapping individuals. The fact that the models also perform well on the validation sample suggests that performance is consistent over moderate changes in the population. Secondly, the lower bad-rate of the test sample shows that the population changes over time.
It is not clear how this would affect the prediction accuracy, but it is possible that it could partially account for the relatively higher performance. Another general observation, when looking at performance relative to the first-stage model on the two samples, is that the relative performance seems to be stable both over time and on a new population, which supports the general methodology.

For the segmentation models using score cutoffs and distance to the decision boundary, there were no significant positive results. Contrary to what we expected, for most proposed models the evaluation metrics hardly changed at all. This could be interpreted as the underlying relations in fact being sufficiently linear. Another possible explanation is that there is non-linearity, but that the two implemented models do not segment the population in a way that captures it. This result is in line with previous research on two-stage credit scoring models.

While working on this thesis, it has become clear that all kinds of segmentation come with a caveat: they decrease the number of observations. This particular application is especially vulnerable, as bads are rare. When the second-stage models leave out a large number of bads, performance on that segment clearly deteriorates. Given the highly skewed distribution towards goods, simple segmentation methods will probably only be viable strategies for data sets much larger than the one used in this thesis. This might also explain why the decision boundary model class performed particularly badly: by retraining on the lower score segment, the simple score-cutoff models could capture a relatively large share of the bads, whereas for the decision boundary model one had to go quite far from the boundary in order to get a sufficient number of bads.
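The bads-scarcity argument can be made concrete with a small calculation using the figures from table A.2 (first-stage sample and the Score < 550 segment on the validation sample):

```python
# Whole validation sample (table A.2, first-stage row)
n_total, n_bads = 70_917, 2_962

# Low-score segment: score < 550 (table A.2). It keeps under a tenth of the
# observations but concentrates more than half of all bads.
seg_obs, seg_bads = 6_616, 1_617

print(f"segment share of observations: {seg_obs / n_total:.1%}")
print(f"segment share of bads:         {seg_bads / n_bads:.1%}")
```

A decision-boundary window of comparable size (table A.3) captures far fewer bads, which is consistent with the explanation above for why that model class performed worst.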
Dissecting the results of model classes 3 and 4, where alternative default labels are introduced, is not fully straightforward. A major difference from model classes 1 and 2 is that new information has been introduced into the modelling. The difference between model class 3 and model class 4 is the way this information affects the model. When the predicted score is used to segment the population, as in model class 3, the new information has no actual influence on the training. This strengthens the claim that there is in fact non-linearity in the data.

The large number of bads for especially the dpoint5 label, in relation to the dpoint30 label, is interesting. Understanding why transactions change from bads at dpoint5 to goods at dpoint90 is crucial. Connecting back to the discussion in section 4.2.1, two types of customers explain this behavior:

1. The customer has a temporary financial constraint that hinders him or her from paying in time.

2. The customer is somewhat sloppy and forgets to pay, or delays payment for some other reason.

When training a model on the earlier default labels, we are therefore trying to find the above categories of people, as well as those who never fulfill their credit obligations. The results from model class 3 suggest that there is something about this group of people that differs from the remaining population.

In model class 4, the predicted scores from the first-stage model(s) are instead used as variables in the dpoint90 model. The results seem to favor this approach. It works similarly to model class 3 by giving a lower score to those who are likely not to have paid at 5 or 30 days past the due date. An obvious advantage of this model is that both the first and the second stage can be trained on the full training set, meaning that none of the scarce bads have to be forsaken. Comparing the results on the validation and test sets, model class 4 seems the most robust.
Given that we have found an improvement in the model, the big question is what this means for the profitability of a company such as Klarna. While an increase in Gini of 0.01 might seem economically insignificant, it is important to remember that the credit industry is largely a high-volume, low-margin industry. Using the decision boundary and the definition of expected loss discussed earlier in the thesis, it is possible to construct a simple model that estimates the expected loss from applying a certain model to a data set. When comparing the dpoint5-as-variable model with the simple one-stage model, the expected losses decreased by approximately 1 % on the test set. For a company with a monthly number of transactions on the order of 10,000,000, using the dpoint5-as-variable model would decrease the expected monthly losses by a magnitude of SEK 100,000. While these are not game-changing results, this is certainly evidence that the methodology holds promise.

Before implementing this in a live system, it is important to consider potential weaknesses of the model. One question is whether these results generalize to other time periods or other geographic regions. The scarcity of bads is a universal feature of a well-functioning consumer credit system, so it is plausible that there is a similar gain from using the relatively abundant earlier default labels in other markets. That being said, model class 3 might not be a valid option for smaller data sets, as it segments the data; for such data sets, the methodology of model class 4 will probably be a better approach. Another question is how this method is affected by sampling bias. The bias will of course still be present in the regressions with the dpoint90 label, but the other labels should have much less correlation with reject decisions. There is still the risk that the total vulnerability to sampling bias increases when using these kinds of models.
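The order of magnitude quoted above can be reproduced with a back-of-the-envelope calculation. The average order value and the baseline loss rate below are illustrative assumptions, not figures from the thesis; only the transaction volume and the ~1 % relative improvement come from the text:

```python
# Expected-loss saving, order-of-magnitude sketch (assumed inputs marked below)
monthly_transactions = 10_000_000
avg_order_value_sek = 500        # assumption: illustrative average order value
baseline_loss_rate = 0.002      # assumption: expected loss per SEK of volume
relative_improvement = 0.01     # ~1 % lower expected loss, as estimated on the test set

baseline_loss_sek = monthly_transactions * avg_order_value_sek * baseline_loss_rate
saving_sek = baseline_loss_sek * relative_improvement
print(f"expected monthly saving: SEK {saving_sek:,.0f}")
```

With these assumptions the saving lands at SEK 100,000 per month, the magnitude stated in the text; the point is that even a small relative improvement scales linearly with volume.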
Chapter 7

Conclusions

In this thesis, two-stage logistic regression models have been estimated and evaluated on credit scoring data. Credit scoring models use information such as payment history to predict the probability of default. The data has been provided by Klarna and consists of roughly 350,000 credit applications from Dutch retail customers. The two-stage models have been constructed by first training a regularized logistic regression on the full sample and then using information from the first model to re-score part or all of the observations. Specifically, four classes of models have been evaluated:

1. Segmenting on the predicted score from the first-stage model.

2. Segmenting on the distance to the decision boundary derived from the predicted probability of default.

3. Segmenting on the predicted score from regressions using alternative default labels.

4. Including the predicted score from regressions using alternative default labels as a variable.

The models were compared on validation and test samples using KS, Gini, Brier score and score-to-log-odds plots. Implementing the different classes of models showed that the simpler models (classes 1 and 2) gave no significant increase in performance, while the other two classes (3 and 4) increased performance on both the validation and the test set by segmenting on, or taking into account, predictions of early payment status. A conclusion from this is that two-stage models do improve the performance of credit scoring models, but not when using the more conventional methods. A key to understanding why the earlier default labels improve the predictions probably lies in the scarcity of bads in the original credit scoring problem. By using the large number of bads at earlier dates and their correlation with the dpoint90 label, it is possible to extract more information from the data and improve model performance.

7.1 Future Research

Overall, this thesis shows that by taking into account the fact that a transaction is a process that changes over time, the prediction accuracy can be increased. The interesting question is then how this information can best be put to use. While there is arguably much room for improvement in the methodology of this thesis, there are probably other methods that can use this information better. An approach that has been tested previously in credit scoring is survival analysis, where the credit scoring problem is changed from estimating the probability of default to estimating the time of default. For more information, see for example the paper by Stepanova and Thomas (2002). Another way could be to view the problem as a directed graphical model, as depicted in figure 7.1. In such a model, the vector of conventional variables would predict dpoint at time t − n for some starting point in time, and the probabilities would then be fed forward in time until the probability of dpoint at time t has been calculated. A paper on how to implement such a model could possibly find further improvements in prediction accuracy.

[Figure 7.1. Schematic specification of a directed graphical model that takes into account the time aspect of a transaction.]

Bibliography

Banasik, J, J Crook, and L Thomas (2001). “Scoring by usage”. In: Journal of the Operational Research Society 52.9, pp. 997–999.

Banasik, Jonathan, John Crook, and Lyn Thomas (2003). “Sample selection bias in credit scoring models”. In: Journal of the Operational Research Society 54.8, pp. 822–832.

Bellotti, Tony and Jonathan Crook (2009). “Credit scoring with macroeconomic variables using survival analysis”. In: Journal of the Operational Research Society 60.12, pp. 1699–1707.

Bijak, Katarzyna and Lyn C Thomas (2012). “Does segmentation always improve model performance in credit scoring?” In: Expert Systems with Applications 39.3, pp. 2433–2442.

Breiman, Leo (2001). “Random forests”.
In: Machine learning 45.1, pp. 5–32.

Brier, Glenn W (1950). “Verification of forecasts expressed in terms of probability”. In: Monthly weather review 78.1, pp. 1–3.

Capon, Noel (1982). “Credit scoring systems: A critical analysis”. In: The Journal of Marketing, pp. 82–91.

Crook, Jonathan and John Banasik (2004). “Does reject inference really improve the performance of application scoring models?” In: Journal of Banking & Finance 28.4, pp. 857–874.

Davis, Jesse and Mark Goadrich (2006). “The relationship between Precision-Recall and ROC curves”. In: Proceedings of the 23rd international conference on Machine learning. ACM, pp. 233–240.

Domingos, Pedro (2012). “A few useful things to know about machine learning”. In: Communications of the ACM 55.10, pp. 78–87.

Finlay, Steven (2011). “Multiple classifier architectures and their application to credit risk assessment”. In: European Journal of Operational Research 210.2, pp. 368–378.

Friedman, Jerome, Trevor Hastie, and Rob Tibshirani (2010). “Regularization paths for generalized linear models via coordinate descent”. In: Journal of statistical software 33.1, p. 1.

Greene, William (1998). “Sample selection in credit-scoring models”. In: Japan and the world Economy 10.3, pp. 299–316.

Hand, David J and Christoforos Anagnostopoulos (2013). “When is the area under the receiver operating characteristic curve an appropriate measure of classifier performance?” In: Pattern Recognition Letters 34.5, pp. 492–495.

Hand, David J and William E Henley (1997). “Statistical classification methods in consumer credit scoring: a review”. In: Journal of the Royal Statistical Society: Series A (Statistics in Society) 160.3, pp. 523–541.

Hand, David J, So Young Sohn, and Yoonseong Kim (2005). “Optimal bipartite scorecards”. In: Expert Systems with Applications 29.3, pp. 684–690.

Hastie, Trevor and Junyang Qian (2014). Glmnet Vignette. http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html. Accessed: 2014-12-29.

Hastie, Trevor et al. (2009). The elements of statistical learning. Vol. 2. 1. Springer.

Hawkins, Douglas M (2004). “The problem of overfitting”. In: Journal of chemical information and computer sciences 44.1, pp. 1–12.

Hoerl, Arthur E and Robert W Kennard (1970). “Ridge regression: Biased estimation for nonorthogonal problems”. In: Technometrics 12.1, pp. 55–67.

Krzanowski, Wojtek J and David J Hand (2009). ROC curves for continuous data. CRC Press.

Lessmann, Stefan, Hsin-Vonn Seow, and Bart Baesens (2013). “Benchmarking state-of-the-art classification algorithms for credit scoring: A ten-year update”. In: Proceedings of Credit Scoring and Credit Control XIII Conference, Edinburgh, United Kingdom, 28-30th August.

Lockhart, Richard et al. (2014). “A significance test for the lasso”. In: The Annals of Statistics 42.2, pp. 413–468.

Marqués, AI, Vicente García, and Javier Salvador Sánchez (2012). “Two-level classifier ensembles for credit risk assessment”. In: Expert Systems with Applications 39.12, pp. 10916–10922.

Marsland, Stephen (2011). Machine learning: an algorithmic perspective. CRC Press.

Mays, Elizabeth (1995). Handbook of credit scoring. Global Professional Publishing.

Mester, Loretta J (1997). “What’s the point of credit scoring?” In: Business review 3, pp. 3–16.

Myers, James H and Edward W Forgy (1963). “The development of numerical credit evaluation systems”. In: Journal of the American Statistical Association 58.303, pp. 799–806.

Peng, Chao-Ying Joanne, Kuk Lida Lee, and Gary M Ingersoll (2002). “An introduction to logistic regression analysis and reporting”. In: The Journal of Educational Research 96.1, pp. 3–14.

Rodriguez, Juan José, Ludmila I Kuncheva, and Carlos J Alonso (2006). “Rotation forest: A new classifier ensemble method”. In: Pattern Analysis and Machine Intelligence, IEEE Transactions on 28.10, pp. 1619–1630.

Siddiqi, Naeem (2005). Credit risk scorecards: developing and implementing intelligent credit scoring. Vol. 3. John Wiley & Sons.

So, Mee Chi et al. (2014). “Using a transactor/revolver scorecard to make credit and pricing decisions”. In: Decision Support Systems 59, pp. 143–151.

Stepanova, Maria and Lyn Thomas (2002). “Survival analysis methods for personal loan data”. In: Operations Research 50.2, pp. 277–289.

Thomas, Lyn C (2000). “A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers”. In: International Journal of Forecasting 16.2, pp. 149–172.

— (2010). “Consumer finance: Challenges for operational research”. In: Journal of the Operational Research Society 61.1, pp. 41–52.

Thomas, Lyn C, David B Edelman, and Jonathan N Crook (2002). Credit scoring and its applications. Siam.

Tibshirani, Robert (1996). “Regression shrinkage and selection via the lasso”. In: Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288.

Wang, Gang et al. (2011). “A comparative assessment of ensemble learning for credit scoring”. In: Expert systems with applications 38.1, pp. 223–230.

Wolpert, David H (1992). “Stacked generalization”. In: Neural networks 5.2, pp. 241–259.

Yao, Ping (2009). “Credit scoring using ensemble machine learning”. In: Hybrid Intelligent Systems, 2009. HIS’09. Ninth International Conference on. Vol. 3. IEEE, pp. 244–246.

Zhao, Peng and Bin Yu (2006). “On model selection consistency of Lasso”. In: The Journal of Machine Learning Research 7, pp. 2541–2563.

Zou, Hui and Trevor Hastie (2005). “Regularization and variable selection via the elastic net”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.2, pp. 301–320.

Appendix

Table A.1.
Table of alphas and corresponding average AUC of the trained models.

Alpha    Lambda      AUC
0        0.002352    0.8680419
0.111    0.0005493   0.8685152
0.222    0.0003014   0.8685560
0.333    0.0002206   0.8685746
0.444    0.0001815   0.8685794
0.556    0.0001594   0.8685755
0.667    0.0001103   0.8685673
0.778    0.0000715   0.8685685
0.889    0.0000996   0.8685652
0.950    0.0000585   0.8685669
0.990    0.0000617   0.8685660
0.999    0.0000973   0.8685698
1        0.0000735   0.8685704

Table A.2. Model Class 1: Results on the validation sample from re-training on the subsample with score under different score cutoffs.

Scorecutoff   No. of Obs.   No. of Bads   Bad-rate   Gini     KS       Brier Score
1st-stage           70917          2962      0.042   0.7298   0.5604   0.0335
500                   777           432      0.556   0.7292   0.5605   0.0336
510                  1151           595      0.517   0.7300   0.5604   0.0333
520                  1753           775      0.442   0.7296   0.5601   0.0334
530                  2649          1005      0.379   0.7293   0.5595   0.0334
540                  4214          1296      0.308   0.7296   0.5601   0.0333
550                  6616          1617      0.244   0.7301   0.5601   0.0334
560                  9982          1922      0.193   0.7293   0.5607   0.0334
570                 14488          2178      0.150   0.7285   0.5594   0.0334
580                 20055          2414      0.120   0.7288   0.5566   0.0334
590                 25943          2598      0.100   0.7293   0.5557   0.0334
600                 31763          2718      0.086   0.7293   0.5581   0.0335
610                 37181          2799      0.075   0.7296   0.5592   0.0335
620                 42426          2846      0.067   0.7292   0.5594   0.0336
630                 47486          2890      0.061   0.7301   0.5624   0.0335
640                 51910          2918      0.056   0.7307   0.5631   0.0335
650                 55866          2935      0.053   0.7296   0.5605   0.0336
660                 59274          2947      0.050   0.7295   0.5601   0.0336
670                 62169          2950      0.047   0.7306   0.5634   0.0335
680                 64532          2955      0.046   0.7295   0.5609   0.0336
690                 66345          2958      0.045   0.7298   0.5603   0.0336
700                 67681          2959      0.044   0.7286   0.5588   0.0336

Table A.3. Model Class 2: Results on the validation sample from re-training on the subsample with score within x percentage points of the decision boundary.

Window (%)   No. of Obs.   No. of Bads   Bad-rate   Gini     KS       Brier Score
1st-stage          70917          2962      0.042   0.7298   0.5604   0.0335
20                  2059           275      0.134   0.7285   0.5600   0.0336
30                  3253           423      0.130   0.7277   0.5591   0.0336
40                  4701           590      0.126   0.7264   0.5570   0.0337
50                  6557           772      0.118   0.7269   0.5588   0.0337
60                  9166           990      0.108   0.7262   0.5578   0.0336
70                 12961          1238      0.096   0.7276   0.5595   0.0336
80                 18835          1543      0.082   0.7273   0.5594   0.0336
90                 30213          1934      0.064   0.7300   0.5619   0.0336

Table A.4. Model Class 3: Results on the validation sample from re-training using predicted scores on alternative response labels and score cutoffs to define the sample included in the second stage.

Model                     Cutoff   No. of Obs.   No. of Bads   Bad-rate   Gini     KS       Brier Score
1st-stage                      -         71770          3014      0.042   0.7285   0.5585   0.0337
score(dpoint5) < Cutoff      470          6847           867      0.127   0.7281   0.5569   0.0337
score(dpoint5) < Cutoff      480         12332          1400      0.114   0.7309   0.5601   0.0336
score(dpoint5) < Cutoff      490         23874          2133      0.089   0.7302   0.5627   0.0336
score(dpoint5) < Cutoff      500         41452          2690      0.065   0.7288   0.5583   0.0337
score(dpoint5) < Cutoff      510         54723          2920      0.053   0.7283   0.5588   0.0338
score(dpoint5) < Cutoff      520         63467          2988      0.047   0.7278   0.5594   0.0338
score(dpoint5) < Cutoff      530         68398          3005      0.044   0.7278   0.5600   0.0338
score(dpoint5) > Cutoff      470         64923          2147      0.033   0.7306   0.5645   0.0337
score(dpoint5) > Cutoff      480         59438          1614      0.027   0.7298   0.5653   0.0337
score(dpoint5) > Cutoff      490         47896           881      0.018   0.5805   0.5045   0.0341
score(dpoint5) > Cutoff      500         30318           324      0.011   0.7266   0.5572   0.0337
score(dpoint5) > Cutoff      510         17047            94      0.006   0.7270   0.5600   0.0337
score(dpoint5) > Cutoff      520          8303            26      0.003   0.7263   0.5587   0.0337
score(dpoint5) > Cutoff      530          3372             9      0.003   0.7275   0.5585   0.0337
score(dpoint30) < Cutoff     470           755           293      0.388   0.7296   0.5599   0.0336
score(dpoint30) < Cutoff     480          1220           430      0.352   0.7292   0.5592   0.0336
score(dpoint30) < Cutoff     490          2099           627      0.299   0.7283   0.5591   0.0337
score(dpoint30) < Cutoff     500          3643           944      0.259   0.7291   0.5570   0.0336
score(dpoint30) < Cutoff     510          6508          1381      0.212   0.7304   0.5597   0.0335
score(dpoint30) < Cutoff     520         11799          1891      0.160   0.7304   0.5618   0.0336
score(dpoint30) < Cutoff     530         21266          2371      0.111   0.7295   0.5603   0.0337
score(dpoint30) > Cutoff     470         71015          2721      0.038   0.7299   0.5657   0.0337
score(dpoint30) > Cutoff     480         70550          2584      0.037   0.7309   0.5642   0.0336
score(dpoint30) > Cutoff     490         69671          2387      0.034   0.7317   0.5652   0.0337
score(dpoint30) > Cutoff     500         68127          2070      0.030   0.7338   0.5690   0.0337
score(dpoint30) > Cutoff     510         65262          1633      0.025   0.7319   0.5658   0.0337
score(dpoint30) > Cutoff     520         59971          1123      0.019   0.7296   0.5639   0.0337
score(dpoint30) > Cutoff     530         50504           643      0.013   0.7277   0.5599   0.0337

Table A.5. Model Class 4: Results on the validation sample from re-training using predicted scores on alternative response labels.

                      No. of Obs.   No. of Bads   Bad-rate   Gini     KS       Brier Score
1st-stage                   71770          3014      0.042   0.7285   0.5585   0.0337
dpoint5                     71770          3014      0.042   0.7325   0.5647   0.0336
dpoint30                    71770          3014      0.042   0.7309   0.5636   0.0336
dpoint5 & dpoint30          71770          3014      0.042   0.7314   0.5649   0.0337

Table A.6. Coefficient values of the dpoint5 variables when training on dpoint90 and including the predicted score from the dpoint5 label as a variable.

Variable                     Coefficient
dpoint5_binned(471,482]          -0.0741
dpoint5_binned(482,488]          -0.2522
dpoint5_binned(488,493]          -0.4297
dpoint5_binned(493,497]          -0.6205
dpoint5_binned(497,501]          -0.7592
dpoint5_binned(501,506]          -0.9728
dpoint5_binned(506,513]          -1.3032
dpoint5_binned(513,522]          -1.4635
dpoint5_binned(522,595]          -1.8813

Table A.7. Results from evaluating the candidate models on the test sample.

Model                             No. of Obs.   No. of Bads   Bad-rate   Gini     KS       Brier Score
1st-stage                               70363          2503      0.036   0.7956   0.6404   0.0276
Scorecutoffs Score < 550                 6455          1589      0.246   0.7962   0.6386   0.0271
Scorecutoffs Score < 630                37190          2408      0.065   0.7705   0.6005   0.0285
Scorecutoffs Score < 670                55263          2494      0.045   0.7535   0.5985   0.0287
Decision boundary 90%                   21918          1514      0.069   0.7971   0.6383   0.0276
score(dpoint5) < 480                    15123          1689      0.112   0.8011   0.6416   0.0272
score(dpoint30) > 490                   66964          1521      0.023   0.8003   0.6450   0.0276
score(dpoint30) > 500                   65016          1219      0.019   0.8008   0.6442   0.0276
score(dpoint30) > 510                   62014           878      0.014   0.7958   0.6359   0.0276
dpoint5 as variable                     70363          2503      0.036   0.8037   0.6465   0.0273
dpoint30 as variable                    70363          2503      0.036   0.8028   0.6464   0.0273
dpoint5 & dpoint30 as variables         70363          2503      0.036   0.8031   0.6474   0.0273

[Figure A.2. Actual and predicted score to log odds when applying the first-stage model on the test sample.]
[Figure A.3. Actual and predicted score to log odds when applying the scorecutoff model on the subset of observations with predicted score < 550 on the test sample.]

[Figure A.4. Actual and predicted score to log odds when applying the decision boundary model on the subset of observations within a 90 % distance of the decision boundary on the test sample.]

[Figure A.5. Actual and predicted score to log odds when training on the subset of observations with a predicted score from the dpoint5 label less than 480, on the test sample.]

[Figure A.6. Actual and predicted score to log odds when training on dpoint90 and including the predicted score from the dpoint5 label as a variable, on the test sample.]
