Description of the task
Data Mining
Lab Exercises
Predictive Modelling
Purpose
The purpose of this study is to learn how data mining methods / tools (SAS System/ SAS Enterprise
Miner) can be used to solve predictive modeling tasks, in particular building classification models based
on real life data.
The study will illustrate stages of data mining process, as guided by the SEMMA (or CRISP-DM)
methodology, required to build a successful DM solution:
o Sample data
o Explore data in order to understand characteristics of data,
o Modify data in order to prepare them for modeling,
o Model, i.e., fit a predictive model to data,
o Assess predictive performance of the model.
Input data
Data set: spam
This data set contains attributes of email messages and a target variable set to ‘yes’ for spam messages and
‘no’ otherwise. The task is to build a model that classifies email messages as spam or non-spam.
The attributes of email messages used here to train the classifier are based on the SpamAssassin
(http://spamassassin.apache.org) project.
Data set: copper_wire
This data set is related to the quality of the manufacturing process of copper wire. The variable quality
represents the quality of a section (roll) of copper wire produced. Other variables represent data from the
production process monitoring system (such as temperatures, levels of various impurities in copper etc.).
Values of quality ≤ 6 denote a section of good quality; quality levels above 6 denote poor quality.
Tasks – overview
The task involves
1. building a model for prediction of the target (i.e., quality (good / poor) of sections of copper wire
based on parameters of production process, or spam / no-spam classification of email),
2. estimation of predictive performance of the model for new data,
3. fine-tuning the classifier.
Predictive models will be built in SAS Enterprise Miner. A sample Enterprise Miner diagram is shown in
the following figure:
Output (deliverables):
Spam data
- Classification result for every email message in the spam data set,
- Classification error rates:
  o “no→yes” (good mail classified as spam) and
  o “yes→no” (spam classified as no spam).
The purpose of fine-tuning the classifier is to:
- minimize the “yes→no” error rate,
- while ensuring that the “no→yes” error rate is < 1%.
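The tuning objective above can be sketched as a search over the classification cut-off. The following Python sketch is an illustration only, not the Enterprise Miner procedure: the function name tune_cutoff, the scores and the labels are hypothetical. It assumes that the classifier outputs a spam probability for each message, that both classes are present, and that each error rate is measured relative to the size of the actual class.

```python
def tune_cutoff(scores, labels, max_no_yes=0.01):
    """Pick the cut-off that minimises the yes->no rate subject to no->yes < max_no_yes.

    scores: predicted probability of spam; labels: 'yes' (spam) or 'no'.
    Returns (cutoff, no_yes_rate, yes_no_rate), or None if no cut-off qualifies.
    """
    n_no = sum(1 for y in labels if y == "no")
    n_yes = len(labels) - n_no
    best = None
    # Candidate cut-offs: every distinct score, plus one above the maximum.
    for cutoff in sorted(set(scores)) + [float("inf")]:
        # A message is classified as spam when its score reaches the cut-off.
        no_yes = sum(1 for s, y in zip(scores, labels) if y == "no" and s >= cutoff)
        yes_no = sum(1 for s, y in zip(scores, labels) if y == "yes" and s < cutoff)
        no_yes_rate = no_yes / n_no
        yes_no_rate = yes_no / n_yes
        if no_yes_rate < max_no_yes and (best is None or yes_no_rate < best[2]):
            best = (cutoff, no_yes_rate, yes_no_rate)
    return best


# Hypothetical scores and labels, for illustration only.
scores = [0.1, 0.2, 0.8, 0.9, 0.95, 0.3]
labels = ["no", "no", "no", "yes", "yes", "yes"]
print(tune_cutoff(scores, labels))
```

In practice the cut-off would be chosen on the validation partition and the resulting error rates reported on the test partition.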
Copper wire data
- Classification result for every manufactured item,
- Classification error rates:
  o “good→poor” (good quality misclassified as poor quality) and
  o “poor→good” (poor quality misclassified as good quality).
The purpose of fine-tuning the classifier is to:
- Maximize the total profit related to classification decisions calculated per production batch.
We assume the following cost/profit model associated with classifier’s decisions:

Decision (actual → classified)   Profit incurred
good → good                      +10
good → poor                      +5
poor → good                      -100
poor → poor                      +5
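The cost/profit model above can be applied to a set of classification decisions as in the following Python sketch. The batch of five items is hypothetical and serves only to show how the total profit is accumulated.

```python
# Profit per (actual, classified) decision, taken from the cost/profit model above.
PROFIT = {
    ("good", "good"): 10,
    ("good", "poor"): 5,
    ("poor", "good"): -100,
    ("poor", "poor"): 5,
}


def total_profit(actual, predicted):
    """Sum the profit over all (actual, classified) pairs."""
    return sum(PROFIT[(a, p)] for a, p in zip(actual, predicted))


# Hypothetical batch of five items, for illustration only.
actual = ["good", "good", "good", "poor", "poor"]
predicted = ["good", "good", "poor", "good", "poor"]
print(total_profit(actual, predicted))  # 10 + 10 + 5 - 100 + 5 = -70
```

A single poor item accepted as good wipes out the profit of ten correctly accepted good items, which is why the fine-tuning objective is total profit rather than the plain error rate.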
Report
Summarize predictive performance of your models as a function of:
- Model type
o logistic regression,
o tree,
o neural network,
o k-nn classifier,
o model boosting, etc.
- Feature selection method
o none,
o Variable selection node,
o classifier specific feature selection (e.g. Forward selection in Regression model).
- Cost of misclassification events (see Target profiling in Tutorial below).
- Specific settings of classification models (e.g., settings related to simplicity / complexity of
models, such as the number of neurons in the hidden layer).
- Metamodels / metalearning (you can fine-tune e.g., number of individual trees in the bagging or
boosting collection of models (use Start Groups-End Groups node [Utility] with Mode Bagging or
Boosting); or settings of the Ensemble aggregate model).
Hints
A detailed procedure for building a SEMMA diagram in Enterprise Miner is given in the Script DATA
MINING AND DATA WAREHOUSING – PRACTICAL GUIDE, part I.
The most important parts of this document are also given in the next section, Detailed description of
SEMMA steps in Enterprise Miner.
(The following part of this document is related to assessment of copper wire quality)
Detailed description of SEMMA steps in Enterprise Miner:
In this tutorial we describe the consecutive steps you need to follow to build a SEMMA diagram, such as:
- Preparatory tasks,
- Sample data,
- Explore data,
- Modify data,
- Model,
- Assessment of model performance,
- Scoring new data.
Preparatory tasks
Enterprise Miner is a project-oriented tool, so the first task is to create a new project for this study and
to connect the source data to the project. Since all datasets are referenced in SAS through the means of
libraries (i.e., LIBREFs which are references to folders / directories in the host operating system), a new
library has to be created in the project to point to the location of the copper_bin source file.
These tasks are realized in the following steps:
Task 1: Create the new Project
The project is created using the Project Wizard launched by the File-New-Project
menu. In the wizard, the following three elements are specified:
- SAS server for the project,
- Project name and SAS server directory,
- SAS Folder location where project metadata will be stored.
For the first and the third element, the default values can be kept. The project name and
server directory in this example are copper and C:\robocza, which gives the final
setting for the new project as shown below.
Task 2: Connect the source data to the project
1. Create the new Library
The source data used in this example is the copper_bin dataset. First, a new library
has to be created to point to the location (OS folder) of this dataset – the library is
created using the Library Wizard launched by the File-New-Library menu.
2. Create the new Data Source
The data source is created using the Data Source Wizard (File-New-Data Source menu).
In the wizard, the following decisions are made:
- A SAS table is selected that becomes the data source (in this example the dataset
copper_bin is used, found in the library created in the previous step).
- The column metadata are defined (such as the roles in the model and
measurement levels of variables). In this step the default metadata values proposed by the
Metadata Advisor can be kept. Although the metadata could be
corrected here, this will be done and explained in detail in the Sample data step.
- All other settings should be accepted as proposed by the wizard.
Task 3: Create the new Diagram
Using the File-New-Diagram menu, the new diagram called copper is created. The diagram
will be populated with Enterprise Miner nodes to form the process flow.
Sample data
In this step, we start building the process flow by connecting to and possibly sampling the data source,
providing proper metadata for the variables and partitioning the data into training, validation and
test partitions. These tasks are done in the following steps.
Task 1: Add the Input Data node to the diagram
1. Drag and drop onto the diagram pane the Input Data node found on the Sample
toolbar.
2. Next, select the Input Data node and specify the data source that the node connects
to (using the Data Source editor in the node’s Property pane). Select the copper_bin
data source.
Task 2: Set up metadata for the data source
Prior to building a predictive model, variables must be properly described by metadata to
indicate (a) the intended role of a variable in the process (target or predictor variable), and
(b) the measurement level of a variable (as quantitative or class variable).
Metadata pertaining to variables are set up using the Variables editor (available in the Input
Data property pane), as shown below.
The role and measurement level of the copper_bin variables should be specified as
follows. The qbin variable is used as the binary target, the part number and original
value of quality variables are rejected from model building and all other variables are
used as predictors with the interval measurement level (which indicates a
quantitative variable).
Task 3: Partition data
Prior to training predictive models, input data must be divided into training,
validation and test partitions. The training partition is used to fit a predictive model
to the data, the validation partition is used to fine-tune parameters of the model in
order to avoid model overfitting. Model fine-tuning consists in modifying such
parameters of models as the depth of a decision tree or the number of neurons in
the hidden layer of a neural net, etc. The test partition can then be used to estimate
the expected prediction error for new data.
The data is split by connecting the Data Partition node (available in the Sample
toolbox) as shown below.
Data Set Allocations property of the Data Partition node provides the size of
partitions (as percentages of original data); default size can be left as is. The
Exported Data property provides names of the training, validation and test datasets
produced by the node.
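The idea behind the partitioning step can be sketched in Python as follows. This is a simplified illustration, not the algorithm of the Data Partition node: the function name partition, the 40/30/30 split and the seed are assumptions made for this example, and the sketch uses plain random sampling without stratification on the target.

```python
import random


def partition(rows, train=0.4, validate=0.3, seed=12345):
    """Randomly split rows into training, validation and test lists.

    The test partition receives whatever remains after the first two.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * validate)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


# Hypothetical data: 100 row identifiers.
train_rows, valid_rows, test_rows = partition(range(100))
print(len(train_rows), len(valid_rows), len(test_rows))  # 40 30 30
```

Every observation lands in exactly one partition, so the test partition is never seen during fitting or fine-tuning.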
Explore data
The purpose of this step is to discover important characteristics of distributions of variables as well as
inconsistencies, errors and outliers in data. Based on these findings, required data modifications can be
designed in order to clean the data, impute missing values or transform variables to make some
distributions less skewed or more normal-like (e.g., by the log transformation). These steps are
mandatory to ensure robust predictive models.
In this guide we focus on some Enterprise Miner tools for graphical exploration and for statistical
analysis of data, such as:
- Graph Explore,
- MultiPlot,
- StatExplore.
These tools can be used as shown and explained in the following steps.
Task 1: Perform graphical exploration of data
Using the Graph Explore tool
1. Drag the Graph Explore icon into the diagram and connect to the Input Data node
(the tool can be found in the Explore toolbar).
2. Run the tool (right click – Run menu).
3. Open the Results window. Using the View-Plot menu of the Results window, open the Chart
wizard to configure a relevant graphical summary of data.
The first observation we make is that the data are heavily unbalanced in terms of
distribution of the target variable. The proportion of observations is about 95% vs 5%
of good quality (qbin=1) and poor quality (qbin=0) items, respectively.
Building predictive models based on such data is generally difficult, as machine
learning algorithms tend to bias the models towards accurate prediction of the
majority class (qbin=1), with poor predictive performance of the rare class (qbin=0).
To avoid this, several techniques will be discussed later in this guide, including
oversampling representatives of the rare class.
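The oversampling idea mentioned above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Enterprise Miner implementation: the function name oversample_rare and the data are hypothetical, rows are represented as dictionaries, and at least one rare-class row is assumed to exist.

```python
import random


def oversample_rare(rows, target_key="qbin", rare_value=0, seed=1):
    """Duplicate randomly chosen rare-class rows until both classes have equal size."""
    rare = [r for r in rows if r[target_key] == rare_value]
    frequent = [r for r in rows if r[target_key] != rare_value]
    rng = random.Random(seed)
    # Draw (with replacement) as many extra rare rows as are missing.
    extra = [rng.choice(rare) for _ in range(len(frequent) - len(rare))]
    return frequent + rare + extra


# Hypothetical data with the 95% / 5% class proportions observed above.
rows = [{"qbin": 1}] * 95 + [{"qbin": 0}] * 5
balanced = oversample_rare(rows)
print(sum(r["qbin"] == 1 for r in balanced), sum(r["qbin"] == 0 for r in balanced))  # 95 95
```

Oversampling is usually applied to the training partition only, so that the validation and test partitions keep the class proportions expected in new data.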
The Graph Explore tool provides features for graphical analysis of distributions of predictors
and mutual relationships of predictors. These include several types of statistical plots
(density plot, boxplots etc.) as well as many types of interactive explanatory plots (such as
2D and 3D scatter plots, contour plots etc.).
An example of a Scatter Chart is shown below, illustrating the relationship between the
variables level_o2 and temp, grouped by the target variable (qbin).
It can be observed from this plot that the upper tails of distributions of these two predictors
correspond to poor quality items (qbin=0). This is an important finding, also confirmed for
other predictors, as outlier-removal techniques usually tend to remove these tails. In our
study, we will use these techniques cautiously so as not to lose some significant portion of
the rare class representatives from the training data.
It should be noted that all charts created in the Graph Explore tool are interactive, i.e., by
selecting an element on the plot (such as a point or group of points in a scatter plot, a bin in
the histogram etc.), the corresponding observations in the tabular view of raw data are also
selected. This allows for easy and efficient analysis of observations which contribute to
atypical values in distributions of some variables.
Graphical exploration using the Multiplot tool
To use this tool, connect its icon as shown in the diagram and Run the analysis (right click
menu).
In the Results window, maximize the Train Graphs window to start quick, interactive
inspection of distributions of subsequent variables. In this way we can efficiently
scan through large data volumes to reveal illegal values, errors and outliers. An
example of such analysis is shown below, where very small values of the size_min
variable are immediately detected in the histogram. Such erroneous
observations will have to be filtered out in the Modify step.
We also observe that some variables have very skewed distributions (such as sum_ferro_f1,
shown below). These variables can be later transformed to make the distribution more
symmetric, which improves performance of regression or discriminant analysis methods.
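The effect of a transformation on skewness can be checked numerically, as in the following Python sketch. The values are hypothetical (they are not taken from sum_ferro_f1), and the moment-based skewness formula used here is one common definition, which may differ in detail from the statistic reported by Enterprise Miner.

```python
import math


def skewness(values):
    """Moment-based skewness: the mean of cubed standardised deviations."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n


# Hypothetical right-skewed values (all positive, so the log is defined).
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
logged = [math.log(v) for v in raw]

print("skewness before log:", round(skewness(raw), 2))
print("skewness after log: ", round(skewness(logged), 2))
```

For these values the log transformation reduces the skewness but does not remove it entirely, which is typical: the transformation makes the distribution more symmetric rather than exactly normal.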
Another interesting summary obtained from the Multiplot results window is the set of descriptive
statistics calculated for all interval variables (shown below). This result provides a number of
hints related to required transformations of data.
The following observations can be made:
- Some variables have very small MIN or very large MAX values as compared to
the mean value (e.g., level_o2 with an erroneous MIN value, or vel_max and vel_mean
with erroneous MAX values),
- Some variables have a very large standard deviation as compared with the mean,
- All variables contain missing values (the number of missing values is given in the
NMISS column).
Based on these observations, required data filtering and transformation rules will be
implemented in the next SEMMA step.
Task 2: Perform statistical analysis of data using the StatExplore tool
The StatExplore tool makes it possible to examine distributions of variables and analyze the association of
inputs with the (interval or class) target. Based on these analyses, variable selection can be
realized in order to reduce dimensionality of predictive (or clustering) models.
The StatExplore tool should be connected as shown in the diagram and executed (right click
Run menu).
In the Results window, the Variable Worth and Interval Variables : qbin charts are available
(through the View-Plot menu of the Results window).
The Variable Worth chart (shown below) ranks variables by their importance for target
prediction, as calculated by the decision tree algorithm.
Association of variables with the target is also shown in the following Scaled Mean Deviation
plot. Here the (scaled) means in the “target=0” and “target=1” groups are compared with
the overall mean of the variable (represented by the level of zero). This plot corresponds
with the per-group descriptive statistics shown next. For instance, the variable size_max
has roughly the same mean value in the groups “qbin=1”, “qbin=0” and overall, hence its
position on the right of the plot. On the other hand, the variable sum_eddy_f1 has a
similar mean value in the group “qbin=1” and overall, with a much bigger mean in the
“qbin=0” group, which is clearly seen in the Mean Deviation chart.
Summarizing, the following conclusions can be drawn from the Explore step:
- The data has a heavily imbalanced distribution of the target, with roughly 5% of the rare class (poor quality
items).
- Some predictors have clearly erroneous observations (such as the negative value of level_o2), and most
predictors have skewed distributions.
- Most predictors include outlying observations; these, however, should be removed with caution (i.e.,
only when the value of the predictor is beyond the physical range of the variable), as outliers generally
correspond to the rare class (poor quality) observations.
These issues will be tackled in the following Modify and Model steps.
Modify data
The purpose of this step is to:
- Remove observations with wrong or outlying values of input variables,
- Transform variables to reduce skewness in distribution,
- Impute missing values.
Transformations of data to reduce skew in distribution bring variables closer to the normal distribution,
which improves performance of predictive models based on the assumption of normality of features
(such as LDA). Filling in missing data (e.g., based on values in similar observations or using a more
sophisticated approach, such as prediction-based methods) may improve performance of regression
methods or neural classifiers, which otherwise ignore observations in which missing values occur. (Note
that decision tree algorithms accept missing values as legitimate values of predictors.)
In this guide, we demonstrate several features of Enterprise Miner used to modify data, such as features
implemented in the following nodes:
- Filter,
- Transform Variables,
- Impute.
The nodes should be connected to the Input Data as shown in the diagram below.
Data transformations implemented with these nodes are outlined in the following procedure.
Task 1: Filter observations containing erroneous inputs
1. Select the Filter node, which activates the node’s Property window as shown below.
2. In the Property window, setup conditions for filtering out observations, based on
values of class variables and interval variables. In this example, we change the
Default Filtering Method for both class and interval variables to None (as these apply
to all variables), and define specific filtering conditions for individual variables.
The tool offers several filtering methods, e.g., Standard Deviations from the Mean for
interval variables. This method removes outliers, where outliers are defined by the 3
standard deviations from the mean condition.
In this example, we switch the Default Filtering Method to None. The default Standard
Deviations from the Mean method would remove too many rare case (qbin=0) observations,
as these observations are generally overrepresented in the far ends of distributions of
predictors, as shown in the Explore stage. Thus we manually set up filtering conditions that
remove erroneous values and preserve feasible, albeit outlying, data.
We do this using the editor of filtering conditions for interval inputs (the editor is invoked as
indicated by the arrow below).
Next, we specify lower and upper limits for the inputs: level_o2 and vel_max as shown below
(this removes observations with erroneous values in these variables).
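The difference between the two filtering strategies discussed above can be sketched in Python as follows. The function names, the data values and the limits 0 and 50 are hypothetical, and the sketch does not reproduce the exact computation of the Filter node (for example, the sample standard deviation is assumed here).

```python
import statistics


def sd_filter(values, k=3.0):
    """Keep values within k sample standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) <= k * sd]


def limit_filter(values, lower, upper):
    """Keep values inside user-specified (e.g., physically feasible) limits."""
    return [v for v in values if lower <= v <= upper]


# Hypothetical input: mostly typical values, two feasible but extreme values
# (30 and 32) and one erroneous negative value (-5).
values = [10] * 50 + [11] * 40 + [30, 32] + [-5]

kept_sd = sd_filter(values)
kept_limits = limit_filter(values, lower=0, upper=50)
print(len(values), len(kept_sd), len(kept_limits))  # 93 90 92
```

The standard deviation rule discards the feasible extreme values together with the erroneous one, while the manual limits remove only the erroneous value. Since the extreme values tend to belong to the rare class, this is the reason for setting the limits manually.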
Task 2: Transform input variables to reduce skewness in distribution
1. Select the Transform Variables node, this activates the node’s Property window as
shown below.
2. Open the transformations editor (as indicated by the arrow).
In this example, we will apply the log transformation to the level_o2 input, as shown in the
Method column of the transformations editor.
Task 3: Impute missing values
1. Select the Impute node, which activates the node’s Property window as shown
below.
2. Change the imputation method for class variables to None and for interval variables
to Tree Surrogate, as shown below.
There are several methods of missing value imputation, such as simple methods which fill in
the mean or median of the variable’s distribution, or more sophisticated methods such as
the ones based on robust M-estimators of distribution location parameter, etc.
In this example, we use the tree based imputation method which calculates the missing
value as predicted by the remaining inputs (for the purpose of this analysis, the variable with
missing values is regarded as the target).
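For comparison, the simplest of the imputation methods mentioned above (filling in the median) can be sketched in Python as follows. Note that this is deliberately not the tree-based method used in this example. The function name impute_median and the data are hypothetical, missing values are represented by None, and the IMP_ prefix follows the naming convention that this guide describes for imputed variables.

```python
import statistics


def impute_median(rows, var):
    """Return copies of rows with an added IMP_<var> column.

    Missing values (None) are replaced by the median of the observed values;
    the original column is left unchanged.
    """
    observed = [r[var] for r in rows if r[var] is not None]
    median = statistics.median(observed)
    return [
        dict(r, **{"IMP_" + var: median if r[var] is None else r[var]})
        for r in rows
    ]


# Hypothetical observations of a single input.
rows = [{"temp": 1.0}, {"temp": None}, {"temp": 3.0}, {"temp": 10.0}]
for row in impute_median(rows, "temp"):
    print(row)
```

The median is preferred over the mean here because it is less sensitive to the outlying values found in the Explore step.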
We note that data transformation and missing value imputation nodes in Enterprise Miner do not
override the original variables. Instead they add new variables to the dataset, with names based on the
type of transformation, as shown in the metadata listing below.
E.g., the level_o2 variable was first logged to create the new LOG_level_o2 variable, which was later
transformed with the Impute node to create the IMP_LOG_level_o2 variable. The ‘old’ variables are
rejected from analysis and the modified variables labeled as inputs, i.e., will be used as predictors (this is
done by the Role metadata column).
The metadata can be inspected and modified if necessary using the Metadata node as shown in the
diagram below. This node could be used for manual feature selection, i.e., to include or reject variables
based on association of inputs with the target (see results of the StatExplore node).
In this example we use the Metadata node only for illustration on how transformation nodes work and
hence this node will not be shown in the following diagrams.
Model
In this step, we will build different predictive models to estimate the target, i.e., to classify the
manufactured items as good or poor quality. We will try:
- Decision trees,
- Neural networks,
- Logistic regression,
- Memory based reasoning method (i.e., the nonparametric k nearest neighbours classifier).
We will also demonstrate how feature selection can be realized in Enterprise Miner. Strictly speaking,
this step is not crucial in our study, since the number of variables is relatively small. However in many
real life problems with hundreds or more features, feature selection reducing dimensionality of data is
mandatory, since many noisy features lead to deterioration in model performance and increase
processing time and memory requirements.
Another important issue to consider prior to fitting a predictive model is definition of the criterion for
model selection / comparison. By default, predictive models attempt to minimize the overall
misclassification rate. This does not necessarily guarantee the optimal performance, especially if the
consequences (costs) of the 0→1 and 1→0 misclassification decisions are different. In such studies,
minimization of misclassification costs (or alternatively maximization of profits) might be the right
criterion for model selection. This issue is discussed in the next section on Target profiling.
In this study, we also have to consider the problem of highly imbalanced data (as the poor quality class
is represented by only ca 5% of observations). Predictive models usually demonstrate poor performance
for the rare class. The reasons for this and the methods to tackle the problem are discussed in the
Working with imbalanced data section.
Target profiling
The target profile is used to specify costs of 0→1 and 1→0 misclassifications. Target profiles are also
applied in non-binary classification problems, when costs of ci → cj decisions are provided in the form of
the cost matrix, with ci, cj denoting the class labels.
Once defined for the target variable, the target profile is used by the model fitting algorithms to attempt
to minimize the misclassification costs or maximize the overall profit.
The target profile is set up using the procedure outlined below.
Task 1: Associate the Target profile with the qbin variable
1. Select the Input Data node, which activates the node’s Property window.
2. In the Property window, open the Decisions editor as shown below. In the Decision
Processing window, click the Build button, which creates the default target profile
for the qbin variable.
3. The target event level is selected based on the target level order. The target event
level is later used to define the meaning of sensitivity of the classifier; also the (logit
of) probability modelled by the logistic regression is related to the target event. In
our case, we accept the event level of 1 (which translates into the decision of the
classifier that an item is of good quality; also sensitivity of the model, reported later
in the “Assessment of performance of the models” section will denote
probability of correct recognition of the good quality item).
(If event level of 0 makes interpretation of classifier’s decisions easier, the event
level can be changed by setting the Target level order to Ascending in the metadata
associated with the target variable. Generally, the event level is selected as the first
value in the list of sorted values of the target).
Task 2: Define decision weights
In the Decision Weights editor (shown in the following picture), we specify costs (or profits)
associated with particular decisions by the classifier, where:
- DECISION1 means classify as good quality (qbin=1),
- DECISION2 means classify as poor quality (qbin=0),
as indicated by the Decision tab (see the second picture).
In the weights (or profits) matrix shown below, we reflect the following scenario (this
scenario is based on the actual business perspective as seen by the copper company, albeit
the values of profits/costs are fictitious):
- If an actually good quality item (level=1) is classified as good quality (DECISION1),
then the company makes the profit of 10.
- If an actually poor quality item (level=0) is classified as good quality (DECISION1),
then the company makes the profit of -100 (i.e., makes a loss, due to having to pay
high warranty costs to its customer, exceeding prior profits).
- If an item is classified as poor quality (DECISION2), then the company sells the
product as second quality, thus cheaper, and makes the profit of 5, irrespective of
the actual quality.
We select the maximize decision function which is consistent with our interpretation of the
values entered as profits (if costs were entered, then the minimize decision function would
have to be selected).
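One common way to use such a profit matrix is to choose, for each item, the decision with the highest expected profit given the predicted probability of good quality. The Python sketch below illustrates this rule with the profits from the scenario above. The function name best_decision is hypothetical, and whether each Enterprise Miner node applies exactly this rule is an assumption not confirmed by this guide.

```python
# Profit matrix from the scenario above: PROFIT[actual level][decision].
PROFIT = {
    1: {"DECISION1": 10, "DECISION2": 5},     # actually good quality
    0: {"DECISION1": -100, "DECISION2": 5},   # actually poor quality
}


def best_decision(p_good):
    """Return the decision with the highest expected profit, and both expected profits."""
    expected = {
        d: p_good * PROFIT[1][d] + (1 - p_good) * PROFIT[0][d]
        for d in ("DECISION1", "DECISION2")
    }
    return max(expected, key=expected.get), expected


for p in (0.99, 0.90):
    decision, expected = best_decision(p)
    print(p, decision, expected)

# Break-even probability: 10*p - 100*(1 - p) = 5, i.e. p = 105/110.
print(round(105 / 110, 4))  # 0.9545
```

Under this rule an item is accepted as good quality only when the predicted probability of good quality exceeds roughly 0.95, which is far stricter than the 0.5 threshold implied by minimizing the misclassification rate.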
Once the target profile and the weights matrix are associated with the target, subsequent builds of
predictive models will attempt to maximize the profit (or minimize costs) as the criterion for classifier
selection.
Using predictive modelling nodes
We will build and compare five classifiers:
- Decision tree,
- Neural network (multilayer perceptron),
- Neural network preceded by a variable selection node,
- Logistic regression using forward feature selection method,
- Memory Based Reasoning (MBR), which is a simple nonparametric nearest neighbours classifier.
To build these classifiers, predictive modelling nodes should be added to the diagram as shown below.
Once the classifiers are fitted to data, the modelling nodes provide the following technical details to the
user:
- detailed information about performance of the models measured for the training, validation and test
partitions in terms of misclassification rate, total profit etc.,
- detailed information about the fitting process (such as details on subsequent iterations of the forward
feature selection process in logistic regression, error rates for subsequent iterations of a neural network,
etc.),
- access to the model code in the form of a standalone SAS 4GL program.
Systematic analysis and comparison of performance of the fitted models follows in the section devoted
to the Assessment step.
Based on the decision tree model, we will now explain how a model can be analyzed using its Results
screen.
The fitted tree is presented graphically, as schematically shown below. Based on performance on
validation data, the algorithm fine-tuned the tree to have the depth of 5.
We can examine the tree in the equivalent form of English language rules, where subsequent nodes
correspond to the leaf nodes in the graphical representation of the tree. The nodes are related to
classification decisions of the tree model, with the majority class in a particular node indicating the
tree’s answer.
IF Imputed vel_min < 10.050000191
THEN NODE: 2    N: 79    1: 11.4%    0: 88.6%

IF 22.5 <= Imputed sum_eddy_f1
AND 10.050000191 <= Imputed vel_min
THEN NODE: 7    N: 26    1: 0.0%    0: 100.0%

IF 4.5 <= Imputed sum_ferro_f1
AND Imputed sum_eddy_f1 < 22.5
AND 10.050000191 <= Imputed vel_min
THEN NODE: 11    N: 10    1: 0.0%    0: 100.0%

IF Imputed vel_max < 12.349999905
AND Imputed: Transformed level_o2 < 9.3181616203
AND Imputed sum_ferro_f1 < 4.5
AND Imputed sum_eddy_f1 < 22.5
AND 10.050000191 <= Imputed vel_min
THEN NODE: 14    N: 3874    1: 98.3%    0: 1.7%

IF 12.349999905 <= Imputed vel_max
AND Imputed: Transformed level_o2 < 9.3181616203
AND Imputed sum_ferro_f1 < 4.5
AND Imputed sum_eddy_f1 < 22.5
AND 10.050000191 <= Imputed vel_min
THEN NODE: 15    N: 5    1: 20.0%    0: 80.0%

IF 9.3181616203 <= Imputed: Transformed level_o2
AND Imputed sum_ferro_f1 < 4.5
AND Imputed sum_eddy_f1 < 22.5
AND 10.050000191 <= Imputed vel_min
THEN NODE: 13    N: 5    1: 20.0%    0: 80.0%
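The rules above can be read as a chain of tests, as in the following Python sketch. The function name classify and the dictionary keys are hypothetical short names for the imputed (and, for level_o2, log-transformed) inputs. The predicted class of each leaf is taken to be its majority class, as described in the text, and missing inputs are not handled.

```python
def classify(item):
    """Return (leaf node number, predicted qbin) for a dict of model inputs."""
    if item["vel_min"] < 10.050000191:
        return 2, 0
    if item["sum_eddy_f1"] >= 22.5:
        return 7, 0
    if item["sum_ferro_f1"] >= 4.5:
        return 11, 0
    if item["log_level_o2"] >= 9.3181616203:
        return 13, 0
    if item["vel_max"] < 12.349999905:
        return 14, 1      # the only leaf in which good quality (qbin=1) dominates
    return 15, 0


# Hypothetical item, for illustration only.
item = {"vel_min": 11.0, "sum_eddy_f1": 10.0, "sum_ferro_f1": 1.0,
        "log_level_o2": 5.0, "vel_max": 12.0}
print(classify(item))  # (14, 1)
```

Written this way, the tree shows that an item is classified as good quality only if it passes all five tests. Each failed test sends the item to a leaf dominated by poor quality items.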
Analyzing the tree model, we can also observe various fit statistics calculated for the training, validation
and test partitions:
We observe that the misclassification rate for the test data (i.e., expected error rate for new data)
equals about 1.5%, and the total profit expected from 3002 items in the test partition equals 25015,
which translates into average profit per item of 8.33. Note that if all the items were actually good quality
and all classification decisions were correct, then the total profit would amount to about 30 thousand.
The difference (of ca 5 thousand) is due to:
- some poor quality items in the batch (this accounts for ca 0.75K difference),
- the classifier’s errors (these account for the majority of the difference, i.e., over 4K).
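The breakdown above can be reproduced with simple arithmetic, as in the following Python sketch. The figures 3002 and 25015 come from the text. The 5% share of poor quality items is the approximate value found in the Explore step, and its use for the test partition is an assumption of this sketch.

```python
n_items = 3002          # items in the test partition (from the text)
total_profit = 25015    # total profit reported for the tree model (from the text)
poor_share = 0.05       # approximate share of poor quality items (assumed for this partition)

ideal = n_items * 10                      # all items good and all decisions correct
gap = ideal - total_profit                # shortfall to be explained
n_poor = round(n_items * poor_share)      # estimated number of poor quality items
# A correctly detected poor item earns 5 instead of 10, so each one costs 5.
gap_poor_items = n_poor * (10 - 5)
gap_errors = gap - gap_poor_items         # remainder attributed to classification errors

print(ideal, gap, n_poor, gap_poor_items, gap_errors)  # 30020 5005 150 750 4255
print(round(total_profit / n_items, 2))                # 8.33
```

The result agrees with the figures quoted above: about 0.75K of the shortfall is unavoidable because poor quality items exist, while over 4K is due to classification errors.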
Another interesting perspective in analysis of the tree model is based on the Classification chart, where
we observe that the rare class (qbin=0) is indeed much more difficult to properly classify, while items of
the frequent class are classified almost perfectly.
Analysis of the tree model also provides information about importance of inputs for prediction of target,
as shown below. This information may be useful for implementation of feature selection rules. It can be
observed that the tree selects only the first five variables on top of the list below as predictors for
estimation of the target.
The tree node offers several other interesting methods for analysis of the model, such as e.g., lift
analysis. These methods are however more appropriate for problems where the task consists in
selection of items with the highest probability of event. In our case, the problem consists in quality
prediction of all the items with the criterion to minimize the cost of wrong decisions (or maximize
overall profit).
Similar in-depth analysis of the fitted models is available with other nodes (Neural Network, Regression,
MBR). However, in the next section we will concentrate on comparison of the models in terms of some
simple practical criteria.
Assessment of performance of the models
The purpose of this SEMMA step is to evaluate and compare models in terms of their practical
usefulness. The models can be compared using several criteria such as the profit/loss, misclassification
rate, or using ROC curves or cut-off analysis.
Overall assessment of fitted models is realized with the Model Comparison node, added to the diagram
after the modelling nodes (see diagram below).
The Model Comparison node provides several tools for analysis of the models, such as the ROC curves as
shown below. The ROC analysis indicates that the models demonstrate similar performance, where
sensitivity of ca 100% inevitably translates into about 20% (=1-specificity) error rate in recognition of the
other (poor quality) class. Note that sensitivity is related to the target event of 1, i.e., recognition of
good quality (frequent) class. The ROC analysis is consistent with the rare class recognition problem,
described previously.
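The two quantities plotted on the ROC chart can be computed from classification results as in the following Python sketch. The function name and the data are hypothetical. The event level is 1 (good quality), in line with the target profile used in this guide.

```python
def sensitivity_specificity(actual, predicted, event=1):
    """Return (sensitivity, specificity) with respect to the given event level.

    Sensitivity: share of actual event cases classified as the event.
    Specificity: share of actual non-event cases classified as non-event.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == event and p == event)
    fn = sum(1 for a, p in pairs if a == event and p != event)
    tn = sum(1 for a, p in pairs if a != event and p != event)
    fp = sum(1 for a, p in pairs if a != event and p == event)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical results: all 8 good items recognised, 1 of 5 poor items missed.
actual = [1] * 8 + [0] * 5
predicted = [1] * 8 + [0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(actual, predicted)
print(sens, round(1 - spec, 2))  # 1.0 0.2
```

In this toy example, as in the ROC analysis above, perfect recognition of the frequent class coexists with a 20% error rate for the rare class.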
In terms of selection of the winning model, no firm conclusion can be made based on the ROC curves.
The tree, neural network models and regression perform similarly (with slightly better results of the tree
expected for the test data), while the MBR (nearest neighbours) classifier is remarkably weaker.
The qualitative conclusions from the ROC analysis can be quantitatively confirmed through a number of
criteria summarized in the table below. These measures are calculated for the test partition, i.e., similar
performance can be expected for new data. Whereas all the models misclassify roughly 2% of cases, the
tree model slightly outperforms other models in terms of the total (and average per decision) profit, as
well as in terms of the total number of wrong classifications.
Fit statistics for the test partition

Statistic                         Tree       Reg    Neural   Neural2       MBR
------------------------------------------------------------------------------
KS Statistic                      0.76      0.72      0.72      0.70      0.62
Average Profit                    8.66      8.41      8.38      8.34      7.68
Average Squared Error             0.01      0.02      0.02      0.02      0.02
Roc Index                         0.89      0.89      0.91      0.86      0.86
Average Error Function            0.07      0.08      0.08      0.08      0.26
Bin 2Way KS Prob Cutoff           0.99      0.94      0.94      0.97      0.91
Cum % Captured Response          10.39     10.43     10.46     10.32     10.37
Percent Captured Response         5.18      5.16      5.20      5.13      5.17
Freq of Classified Cases       3002.00   3002.00   3002.00   3002.00   3002.00
Divisor for ASE                6004.00   6004.00   6004.00   6004.00   6004.00
Error Function                  423.79    464.20    454.30    483.11   1552.93
Gain                              3.61      4.01      4.36      2.97      3.43
Gini Coefficient                  0.77      0.78      0.82      0.73      0.71
Bin-Based 2Way KS Statistic       0.76      0.72      0.72      0.69      0.62
KS Probability Cutoff             0.80      0.94      0.89      0.92      0.88
Cumulative Lift                   1.04      1.04      1.04      1.03      1.03
Lift                              1.04      1.03      1.04      1.03      1.03
Maximum Absolute Error            0.99      0.99      1.00      0.99      1.00
Misclassification Rate            0.02      0.02      0.02      0.02      0.03
Sum of Frequencies             3002.00   3002.00   3002.00   3002.00   3002.00
Total Profit                  25985.00  25240.00  25155.00  25035.00  23060.00
Root Average Squared Error        0.12      0.13      0.13      0.13      0.15
Cumulative Percent Response      98.95     99.34     99.67     98.34     98.77
Percent Response                 98.95     98.67     99.33     98.00     98.77
Sum of Squared Errors            85.85     95.11     95.20    100.87    133.87
# of Wrong Classifications       46.00     51.00     54.00     57.00     79.00
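Some of these measures are simple functions of others, which gives a quick way to cross-check the
table: the average profit is the total profit divided by the number of classified cases (3002), and the
misclassification rate is the number of wrong classifications divided by the same count. The following
is a minimal sketch for a plain SAS session, not output of the Model Comparison node; the dataset name
work.model_check and the variable names are chosen here for illustration only, and the input values are
copied from the table.

```sas
/* Illustrative cross-check; dataset and variable names are hypothetical. */
/* Input values are the Total Profit and # of Wrong Classifications rows. */
data work.model_check;
   input model $ total_profit wrong;
   n = 3002;                           /* Freq of Classified Cases        */
   avg_profit    = total_profit / n;   /* compare: Average Profit         */
   misclass_rate = wrong / n;          /* compare: Misclassification Rate */
   format avg_profit 8.2 misclass_rate 8.4;
   datalines;
Tree    25985 46
Reg     25240 51
Neural  25155 54
Neural2 25035 57
MBR     23060 79
;
run;

proc print data=work.model_check noobs;
run;
```

The unrounded misclassification rates make the ranking of the models easier to see than the two-decimal
values reported in the table.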
The Model Comparison node provides several other tools for assessment of models, such as lift
analysis; these, however, offer little useful interpretation for the problem of quality prediction and
are therefore omitted from this guide.
A more in-depth analysis of the winning model is presented in the next section, Scoring new data.
Scoring new data
The purpose of this section is to explain:
o how the predictive model (e.g., the winning model selected by the Assessment step) can be used to
  classify (score) new data, and
o how the scoring code can be maintained.
We will also examine the contents of the dataset produced by the scoring code and explain how these
scored datasets can serve as the basis for more in-depth analyses using custom SAS coding.
The tool used for scoring new data and for management of scoring codes is the Score node. The node
can be connected to:
o any node that produces the scoring code (such as the predictive modelling nodes used in this study),
o the Model Comparison node; in this case the Score node obtains the winning model from the
  preceding node (the winning model is selected based on a single criterion specified in the properties
  of the Model Comparison node).
In this example, we connect the Score node directly to the Decision Tree node, as shown below.
The Score node makes it possible to:
o Obtain the scoring code from the preceding node. The scoring code can then be managed outside the
  Enterprise Miner environment. The scoring code can be exported in the following languages: SAS 4GL,
  C, Java, PMML.
o Execute the scoring code against a dataset connected to the Score node. Normally, this dataset is
  connected to the Score node using an Input Data node with the metadata role of Score. Alternatively,
  the scoring code can be applied to the train, validate and test partitions (these data sets are passed
  through to the Score node).
In this example, we will use the Score node to classify data in the test partition, since predictive
performance for this partition is a reliable measure of expected performance for new data.
The Exported Data property of the Score node indicates where the results of scoring are placed by the
node. In this example the scored test data is found in the EMWS9.Score_TEST dataset.
The variables in this dataset involved in the classification process are listed by the node in its Results
screen as shown below. These include the predictors used by the tree model as well as variables
produced by the tree node or the score node to provide detailed technical information pertaining to the
classification process.
The following variables provide interesting technical information about the process of classification:
D_QBIN               Contains the predicted class level (quality of an item). The prediction is
                     made by the classifier fine-tuned to maximize profit (i.e., built using
                     the target profile Decision Weights matrix).
I_QBIN               Contains the predicted class label, produced by the classifier fine-tuned
                     to minimize the overall number of misclassified items (i.e., built without
                     using the target profile Decision Weights matrix). The name given to this
                     variable by the scoring node is EM_CLASSIFICATION.
EM_EVENTPROBABILITY  The probability associated with the classifier’s decision that an item is
                     of good quality.
EM_PROBABILITY       The probability of the decision finally made by the classifier. This
                     probability is estimated as:
                     EM_PROBABILITY = max(EM_EVENTPROBABILITY, 1 - EM_EVENTPROBABILITY)
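The relationship between the two probability variables can be checked directly on the scored data. The
following is a minimal sketch, assuming the scored dataset EMWS9.Score_TEST named above and a binary
target; the dataset work.prob_check and the variables p_check and abs_diff are names introduced here
for illustration.

```sas
/* Illustrative check; work.prob_check, p_check and abs_diff are hypothetical names. */
data work.prob_check;
   set emws9.score_test;
   /* probability of whichever class is the more likely one */
   p_check  = max(em_eventprobability, 1 - em_eventprobability);
   /* absolute difference from the value written by the Score node */
   abs_diff = abs(p_check - em_probability);
run;

/* If the formula holds for every row, the maximum difference is zero */
/* (up to floating-point rounding).                                   */
proc means data=work.prob_check n max;
   var abs_diff;
run;
```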
Given this technical output appended to the scored dataset (i.e., EMWS9.Score_TEST), further in-depth
analysis of the model itself or of the scored data is possible using custom SAS coding.
To illustrate this, we will post-process the results of scoring to calculate the coincidence matrix and
the sensitivity and specificity of the model.
To do this, a SAS Code node is connected to the Score node, as shown below.
In order to compute the coincidence matrix, the following SAS code is placed in the SAS Code node (the
Code Editor is available through the properties of this node):
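/* Cross-tabulate the actual quality against each predicted decision */
proc freq data=emws9.score_test;
   tables qbin*d_qbin;   /* actual vs. profit-maximizing decision            */
   tables qbin*i_qbin;   /* actual vs. misclassification-minimizing decision */
run;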
PROC FREQ is the SAS procedure used to produce frequency or contingency tables, here to examine the
relationship between two classification variables.
In this example, we use the FREQ procedure to compare:
o the actual quality of copper (qbin) with the quality predicted using the profit maximization rule
  (this decision is coded in the d_qbin classifier’s output variable), and
o the actual quality of copper (qbin) with the quality predicted using the misclassification rate
  minimization rule (this decision is coded in the i_qbin classifier’s output variable).
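If the cell counts are needed for further calculations rather than as printed tables, the same two
cross-tabulations can be written to datasets with the OUT= option of the TABLES statement. This is a
minimal sketch, assuming the scored dataset named above; the output dataset names work.cm_profit and
work.cm_misclass are hypothetical.

```sas
/* Illustrative variant; the OUT= dataset names are hypothetical. */
proc freq data=emws9.score_test noprint;
   tables qbin*d_qbin / out=work.cm_profit;     /* counts for the profit rule            */
   tables qbin*i_qbin / out=work.cm_misclass;   /* counts for the misclassification rule */
run;
```

Each output dataset holds one row per non-empty cell, with the cell frequency in the variable COUNT and
its share of the total in PERCENT.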
The coincidence matrices summarizing the performance of these two classifiers are given below. We also
calculate the total profit and the total number of misclassifications.
The conclusions can be summarized as follows:
o The model based on decision weights indeed realizes a higher total profit than the original
  classifier (25985 vs. 25015), although its total number of misclassifications is higher (72 vs. 46).
o The improvement in the total profit is achieved by reducing the number of the costly 0→1
  classification errors, i.e., poor items classified as good (from 41 to 30), at the expense of an
  increased number of 1→0 errors (from 5 to 42).
Classifier fine-tuned to maximize the       Classifier fine-tuned to minimize the
total profit                                misclassification rate

Table of qbin by D_QBIN                     Table of qbin by I_qbin

qbin      D_QBIN(Decision: qbin)            qbin      I_qbin(Into: qbin)

Frequency|                                  Frequency|
Percent  |                                  Percent  |
Row Pct  |                                  Row Pct  |
Col Pct  |0       |1       |  Total         Col Pct  |0       |1       |  Total
---------+--------+--------+                ---------+--------+--------+
0        |    105 |     30 |    135         0        |     94 |     41 |    135
         |   3.50 |   1.00 |   4.50                  |   3.13 |   1.37 |   4.50
         |  77.78 |  22.22 |                         |  69.63 |  30.37 |
         |  71.43 |   1.05 |                         |  94.95 |   1.41 |
---------+--------+--------+                ---------+--------+--------+
1        |     42 |   2825 |   2867         1        |      5 |   2862 |   2867
         |   1.40 |  94.10 |  95.50                  |   0.17 |  95.34 |  95.50
         |   1.46 |  98.54 |                         |   0.17 |  99.83 |
         |  28.57 |  98.95 |                         |   5.05 |  98.59 |
---------+--------+--------+                ---------+--------+--------+
Total         147     2855     3002         Total          99     2903     3002
             4.90    95.10   100.00                      3.30    96.70   100.00

TOTAL PROFIT 25985                          TOTAL PROFIT 25015
TOTAL # OF MISCLASSIFICATIONS 72            TOTAL # OF MISCLASSIFICATIONS 46
These models can also be compared in terms of sensitivity and specificity. These parameters compare as
follows:
o model on the left (D_QBIN, profit maximization):             sensitivity=98.54%, specificity=77.78%
o model on the right (I_qbin, misclassification minimization): sensitivity=99.83%, specificity=69.63%
Observe that (1-specificity) is the misclassification rate for the rare class (poor quality items): this
parameter was reduced from ca 30% to 22%. This analysis confirms that the decision weights matrix
leads to improvement in recognition of the rare class.
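The sensitivity and specificity values quoted above follow directly from the cell counts of the two
coincidence matrices, with class 1 (good quality) as the target event. The following is a minimal
sketch for a plain SAS session; the dataset work.sens_spec and the variable names tn, fp, fn, tp and
rare_error are introduced here for illustration, and the counts are copied from the tables.

```sas
/* Illustrative calculation; dataset and variable names are hypothetical.      */
/* tn = actual 0, predicted 0    fp = actual 0, predicted 1                    */
/* fn = actual 1, predicted 0    tp = actual 1, predicted 1                    */
data work.sens_spec;
   input classifier :$8. tn fp fn tp;
   sensitivity   = 100 * tp / (tp + fn);   /* good items recognised as good    */
   specificity   = 100 * tn / (tn + fp);   /* poor items recognised as poor    */
   rare_error    = 100 - specificity;      /* poor items accepted as good      */
   misclassified = fp + fn;                /* total number of wrong decisions  */
   format sensitivity specificity rare_error 6.2;
   datalines;
D_QBIN 105 30 42 2825
I_QBIN  94 41  5 2862
;
run;

proc print data=work.sens_spec noobs;
run;
```

The printed rows can be compared with the Row Pct entries of the two tables and with the misclassification
totals reported beneath them.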